CN114339197A - Video playing test method, device and equipment

Video playing test method, device and equipment

Info

Publication number
CN114339197A
Authority
CN
China
Prior art keywords
information
video
quality
playing
audio
Prior art date
Legal status
Pending
Application number
CN202111275743.6A
Other languages
Chinese (zh)
Inventor
Xia Shuang (夏爽)
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202111275743.6A
Publication of CN114339197A

Landscapes

  • Testing, Inspecting, Measuring Of Stereoscopic Televisions And Televisions (AREA)

Abstract

The application discloses a video playing test method, device, and equipment, belonging to the technical field of artificial intelligence. The method includes: acquiring video playing information corresponding to a target video; determining, according to visual information, auditory information, and data stream information, the playing quality characteristic information corresponding to the target video on at least two information modalities; and generating, based on the playing quality characteristic information corresponding to the at least two information modalities, an audio-visual quality test result corresponding to the target video played by a target player. According to this technical scheme, the playing quality characteristics of the video on at least two information modalities are determined from multi-dimensional video playing information, so that a test result capable of representing the audio-visual information propagation quality is generated. This avoids the inaccuracy caused by single-dimension evaluation, improves the accuracy of the audio/video quality test, and reduces the labor cost the test requires.

Description

Video playing test method, device and equipment
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, and a device for testing video playing.
Background
In recent years, Internet audio and video data has been growing rapidly, and audio and video quality evaluation has remained an active research field.
In the related art, audio and video quality evaluation usually relies on subjective evaluation, i.e., the audio and video quality is scored through human visual observation. Schemes that evaluate the quality from an objective angle often assess only the image dimension of the video and score the audio and video quality according to image quality alone.
Audio and video quality evaluation in the related art therefore suffers from single dimensionality, low accuracy, and high labor cost.
Disclosure of Invention
The embodiments of the application provide a video playing test method, device, and equipment, which can test and evaluate audio/video quality using media content of multiple information modalities, improving the accuracy of the audio/video quality test and reducing the labor cost of testing.
According to an aspect of an embodiment of the present application, there is provided a method for testing video playback, the method including:
acquiring video playing information corresponding to a target video, wherein the video playing information includes visual information and auditory information corresponding to the target video and data stream information corresponding to the target video played by a target player, and the data stream information is used for representing the processing quality of the data streams in the video playing process;
determining, according to the visual information, the auditory information, and the data stream information, the playing quality characteristic information corresponding to the target video on at least two information modalities;
and generating, based on the playing quality characteristic information corresponding to the at least two information modalities, an audio-visual quality test result corresponding to the target video played by the target player, wherein the audio-visual quality test result is used for representing the audio-visual information transmission quality of the target video on the target player.
According to an aspect of an embodiment of the present application, there is provided a video playback testing apparatus, including:
a playing information acquisition module, configured to acquire video playing information corresponding to a target video, wherein the video playing information includes visual information and auditory information corresponding to the target video and data stream information corresponding to the target video played by a target player, and the data stream information is used for representing the processing quality of the data streams in the video playing process;
a quality characteristic determining module, configured to determine, according to the visual information, the auditory information, and the data stream information, the playing quality characteristic information corresponding to the target video on at least two information modalities;
and a test result generation module, configured to generate, based on the playing quality characteristic information corresponding to the at least two information modalities, an audio-visual quality test result corresponding to the target video played by the target player, wherein the audio-visual quality test result is used for representing the audio-visual information transmission quality of the target video on the target player.
According to an aspect of the embodiments of the present application, there is provided a computer device including a processor and a memory, the memory storing at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by the processor to implement the above video playing test method.
According to an aspect of the embodiments of the present application, there is provided a computer-readable storage medium storing at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by a processor to implement the above video playing test method.
According to an aspect of the embodiments of the present application, there is provided a computer program product including computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device performs the above video playing test method.
The technical scheme provided by the embodiment of the application can bring the following beneficial effects:
the method comprises the steps of obtaining visual information and auditory information which can be perceived by a user in the process of playing a video by a player and data stream information which reflects the data processing quality of a target player, determining the playing quality characteristics of the video in at least two information modes, generating an audio-visual quality test result which can represent the transmission quality of the audio-visual information by utilizing the playing quality characteristics in the at least two information modes, fully mining the playing quality information of media contents in different information modes, ensuring that the test result meets the subjective evaluation of the user, avoiding the problem of inaccurate test result caused by single information dimension, improving the accuracy of audio-video quality test, determining the video playing quality without the help of manual observation, and reducing the labor cost required by the audio-video quality test.
Drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of an application execution environment provided by one embodiment of the present application;
FIG. 2 is a first flowchart of a method for testing video playback according to an embodiment of the present application;
FIG. 3 is a second flowchart of a method for testing video playback according to an embodiment of the present application;
FIG. 4 is a schematic diagram of determining video playback quality indicator data;
FIG. 5 is a schematic diagram of determining audio playback quality indicator data;
FIG. 6 is a schematic diagram of determining image quality indicator data;
FIG. 7 is a schematic diagram of determining text quality indicator data;
FIG. 8 is a schematic diagram of a multi-modal quality testing network;
FIG. 9 is a schematic diagram of an audio and video quality testing flow in the process of a terminal playing a video;
FIG. 10 is a block diagram of a video playback testing apparatus according to an embodiment of the present application;
fig. 11 is a block diagram of a computer device according to an embodiment of the present application.
Detailed Description
Before the method embodiments of the present application are described, the application scenario and the related terms that may be involved are explained to facilitate understanding by those skilled in the art.
Human experience of the world is multi-modal: we see objects, hear sounds, feel textures, smell odors, and taste flavors. A modality is the manner in which something occurs or is experienced, and a research problem that involves several such manners is multi-modal in character. For artificial intelligence to make progress in understanding the world around humans, it needs to be able to interpret these multi-modal signals together.
In recent years, Internet audio and video data has grown rapidly, yet beyond plain text, richer data such as speech, images, and video has not been fully utilized and learned from; in the audio/video field, learning and application are largely confined to image information, and other information is underexploited. When an ordinary user watches a video, the evaluation of its quality is comprehensive and multi-faceted. A multi-modal neural network can learn concepts across different modalities and learn jointly from multi-modal content such as text, speech, images, and video, thereby integrating the information of the different modalities.
In a scene where a terminal plays a video, the terminal must finally present the video through decoding, rendering, post-processing, and other steps; given the differing hardware performance of different devices, the quality the user watches at the terminal differs from that of the video source. The quality of the finally presented video therefore needs to be evaluated and tested at the terminal, but an effective evaluation tool for video playing quality is lacking. On one hand, evaluating newly launched videos, different video formats, video terminal post-processing algorithms, and the content quality of videos depends on subjective human evaluation, which requires a large amount of manpower, wastes labor, has poor timeliness, and achieves low coverage. Even where an evaluation tool exists, it performs single-dimension evaluation such as image-only evaluation and cannot be fully applied to video scenes. On the other hand, quality problems in video playing such as definition, fluency, and screen corruption lack a monitoring mechanism.
Therefore, mining and evaluating the video in a multi-modal manner and introducing a multi-modal quality test model to effectively quantify the video playing quality allows the video quality to be evaluated more comprehensively and convincingly, and has high application value in many playing scenes. First, evaluating video viewing quality helps assess the effect of player algorithms and strategies; second, evaluating video content helps screen out better videos for prioritized recommendation, which is particularly valuable for scenes with high timeliness requirements such as short videos and live broadcasts.
The video playing test method provided by the embodiment of the application relates to an artificial intelligence technology, and the following brief description is provided to facilitate understanding by those skilled in the art.
Artificial Intelligence (AI) is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines can perceive, reason, and make decisions.
Artificial intelligence is a comprehensive discipline covering a wide range of fields, at both the hardware level and the software level. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, machine learning/deep learning, automatic driving, and intelligent transportation.
Computer Vision (CV) is a science that studies how to make machines "see": using cameras and computers in place of human eyes to recognize, track, and measure targets, and further performing image processing so that the computer produces images more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can capture information from images or multi-dimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping, automatic driving, and intelligent transportation, as well as common biometric technologies such as face recognition and fingerprint recognition.
Machine Learning (ML) is a multi-disciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specializes in studying how a computer can simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
With the research and progress of artificial intelligence technology, it has been studied and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicles, robots, smart medical care, smart customer service, Internet of Vehicles, and intelligent transportation. In the method embodiments of the present application, artificial intelligence technology is used to test and evaluate video playing quality based on media content of multiple information modalities, realizing multi-modal audio/video quality evaluation.
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Referring to fig. 1, a schematic diagram of an application execution environment according to an embodiment of the present application is shown. The application execution environment may include: a terminal 10 and a server 20.
The terminal 10 includes, but is not limited to, a mobile phone, a computer, an intelligent voice interaction device, an intelligent appliance, a vehicle-mounted terminal, a game console, an electronic book reader, a multimedia playing device, a wearable device, and other electronic devices. A client of the application may be installed in the terminal 10.
In the embodiment of the present application, the application may be any application capable of providing a video playing service. Typically, the application is a video-type application. Of course, besides video applications, other types of applications may provide video playing services. For example, the application may be a news application, a social interaction application, an interactive entertainment application, a browser application, a shopping application, a content sharing application, a Virtual Reality (VR) application, an Augmented Reality (AR) application, and the like, which is not limited in this embodiment. In addition, for different applications, the video playing services related to the applications may also be different, and the corresponding functions may also be different, which may be configured in advance according to actual requirements, and this is not limited in this embodiment of the application. Optionally, a client of the above application program runs in the terminal 10.
The server 20 is used to provide background services for clients of applications in the terminal 10. For example, the server 20 may be a backend server for the application described above. The server 20 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a web service, cloud communication, a middleware service, a domain name service, a security service, a CDN (Content Delivery Network), a big data and artificial intelligence platform. Optionally, the server 20 provides background services for applications in multiple terminals 10 simultaneously.
Alternatively, the terminal 10 and the server 20 may communicate with each other through the network 30. The terminal 10 and the server 20 may be directly or indirectly connected through wired or wireless communication, and the present application is not limited thereto.
Please refer to FIG. 2, which shows a first flowchart of a video playing test method according to an embodiment of the present application. The method can be applied to a computer device, i.e., an electronic device with data computation and processing capabilities; for example, the execution subject of each step may be the terminal 10 in the application execution environment shown in FIG. 1. The method may include the following steps (210-230).
Step 210, video playing information corresponding to the target video is obtained.
The target video includes, but is not limited to, offline video, online video, live video, short video, and the like, which is not limited in this embodiment of the application.
The video playing information comprises visual information and auditory information corresponding to the target video and data stream information corresponding to the target video played by the target player, and the data stream information is used for representing the processing quality of the data stream in the video playing process.
The video playing information refers to data associated with the playing quality of the target video. The video generation process mainly includes audio/video acquisition, audio/video encoding and compression, and audio/video encapsulation, yielding a video file in some video format. The embodiment of the present application does not limit the video format either.
In an exemplary embodiment, the playing process of a video mainly includes audio/video decapsulation, audio/video decoding, audio/video synchronization, and rendering. A video player performs these operations when playing the video, and the video playing information is generated in the process. Correspondingly, as shown in FIG. 3, which is a second flowchart of a video playing test method provided in an embodiment of the present application, the implementation of step 210 includes the following steps (211-213).
Step 211, in response to the video playing instruction, obtaining at least two data streams corresponding to the target video, where the at least two data streams include an original video frame data stream and an original audio data stream.
A video player mainly processes the three data streams of a target video: video, audio, and subtitles. As for subtitles, some videos have subtitles embedded in the images and need no additional external subtitles, so the target video corresponds to at least two data streams, namely the original video frame data stream and the original audio data stream; the original video frame data stream is used for rendering and displaying the video frame sequence, and the original audio data stream is used for rendering and playing the audio signal.
Step 212, performing image display processing on the original video frame data stream to obtain a video frame sequence displayed on the target page and data stream information corresponding to the video frame sequence.
The image display processing includes operations of decapsulating, decoding, audio and video synchronization, rendering and the like on the original video frame data stream, so that image information in the original video frame data stream is displayed in the target page.
The video frame sequence is used to characterize the visual information on the image modality. It is an image sequence formed by consecutive video frames obtained by the player decoding the original video frame data of the target video. A video frame, once decoded and displayed on the target page, is the visual information of the image modality that the user directly sees.
The data stream information corresponding to the video frame sequence includes at least one item of video playing quality index data capable of representing the playing quality of the target video in the video modality. The video playing quality index data in the data stream information is index data related to the data processing performance of the player, and includes, but is not limited to, the bitrate, frame rate, frame loss rate, fluency, decoding accuracy, and decoding time corresponding to the video frame data stream. The calculation of each index, and its role in measuring the playing quality of the video modality, is as follows (a small code sketch follows the list):
Bitrate: bitrate (kbps) = file size (KB) × 8 / duration (s).
Frame rate (fps): the number of frames displayed per second.
Frame loss rate: the terminal receives the original video data packets; assuming that the total number of frames is f and the number of video frames actually rendered on screen after decoding is r, the frame loss rate is (f - r)/f. The lower the frame loss rate, the fewer errors the player makes in decoding and rendering.
Fluency: determined from the frame rate and the frame loss rate; the more frames actually displayed per second while the target video plays, the higher the fluency.
Decoding time: the time taken to decode one video frame. For example, an HEVC source takes longer to decode than an ordinary source, and hardware decoding on typical devices takes less time than software decoding; decoding-time statistics help deploy decoder strategies across terminal device types.
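As an illustration of how such index data could be computed, the following is a minimal sketch in Python; the PlayStats structure and its field names are assumptions made for illustration and are not part of this application.

```python
from dataclasses import dataclass

@dataclass
class PlayStats:
    # Hypothetical playback statistics collected from the player.
    file_size_kb: float   # media file size in KB
    duration_s: float     # playing duration in seconds
    total_frames: int     # total frames f in the received stream
    rendered_frames: int  # frames r actually rendered on screen

def bitrate_kbps(s: PlayStats) -> float:
    # bitrate (kbps) = file size (KB) * 8 / duration (s)
    return s.file_size_kb * 8 / s.duration_s

def frame_rate_fps(s: PlayStats) -> float:
    # average number of frames displayed per second
    return s.rendered_frames / s.duration_s

def frame_loss_rate(s: PlayStats) -> float:
    # (f - r) / f: lower means fewer decode/render errors
    return (s.total_frames - s.rendered_frames) / s.total_frames

def fluency(s: PlayStats) -> float:
    # one simple proxy: the fraction of frames that survived
    # decoding and rendering
    return 1.0 - frame_loss_rate(s)
```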
In one example, FIG. 4 shows a schematic diagram of determining video playback quality indicator data. FIG. 4 depicts a film with file name H264_864_486.mp4, size 18.5 MB (megabytes), type MPEG-4 (Moving Picture Experts Group-4). When the film is played, the bitrate, frame loss rate, and decoding time of the client terminal actually playing it can be determined.
Step 213, performing audio playing processing on the original audio data stream to generate an audio signal and data stream information corresponding to the audio signal.
The audio playing processing comprises the operations of decapsulation, decoding, audio and video synchronization, rendering and the like on the original audio data stream, so that the audio information in the original audio data stream is transmitted to the user.
The audio signal is used to characterize auditory information. The audio signal is played after being decoded and rendered, and the audio signal is auditory information of an audio modality which can be intuitively heard by a user.
The data stream information corresponding to the audio signal includes at least one item of audio playing quality index data corresponding to the audio modality. The audio playing quality index data in the data stream information is index data associated with the data processing performance of the player, and includes, but is not limited to, the signal-to-noise ratio, fluency, short-time objective intelligibility, and volume corresponding to the audio data stream. Each index is calculated as follows (a small code sketch follows the list):
Signal-to-Noise Ratio (SNR): a conventional objective measure for speech enhancement under wideband noise distortion. Because both the clean speech signal and the noise signal must be known, SNR is mainly used in simulations of algorithms. It is the ratio of the average power of the speech signal to that of the noise signal over the entire time axis.
Fluency: analogous to the frame loss rate in video quality; assuming the total audio playing duration is t and the duration lost to dropped frames and stalls in the process is b, the fluency of a piece of audio is measured by (t - b)/t.
Short-Time Objective Intelligibility (STOI): ranges from 0 to 1; the larger the value, the higher the intelligibility.
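As an illustration, a minimal sketch of the SNR and audio fluency computations follows; STOI is in practice usually computed with an off-the-shelf implementation (for example the pystoi package), so it is omitted here. Function names and inputs are assumptions made for illustration.

```python
import numpy as np

def snr_db(clean: np.ndarray, noise: np.ndarray) -> float:
    # Ratio of the average power of the speech signal to that of the
    # noise signal over the entire time axis, expressed in decibels.
    # Requires both the clean speech and the noise signal to be known.
    return 10.0 * np.log10(np.mean(clean ** 2) / np.mean(noise ** 2))

def audio_fluency(total_s: float, stalled_s: float) -> float:
    # (t - b) / t, where t is the total playing duration and b is the
    # duration lost to dropped frames and stalls.
    return (total_s - stalled_s) / total_s
```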
In one example, as shown in fig. 5, a schematic diagram of determining audio playback quality indicator data is illustrated. For the audio signal in the target video, the signal-to-noise ratio, the fluency, the short-time objective understandable index and the volume corresponding to the audio signal can be determined.
In an exemplary embodiment, the at least two data streams further include an original text data stream. Accordingly, in order to process the original text data stream, as shown in fig. 3, the implementation process of step 210 further includes the following step 214.
Step 214, performing text display processing on the original text data stream to obtain text content displayed in the target page and data stream information corresponding to the text content.
The text content is used to characterize the visual information corresponding to the text modality. Text content in video includes, but is not limited to, text inside video frame images and the subtitles of the video. Subtitles display video content in text form, including dialogue between characters as well as descriptive language about the picture.
When the subtitles of the video are external subtitles, or video comment text exists, the at least two data streams further include an original text data stream in addition to the original video frame data stream and the original audio data stream; the original text data stream is used for displaying the text information of the target video.
The data stream information corresponding to the text content includes at least one item of text playing quality index data corresponding to the text modality; this index data may be data associated with the data processing performance of the player, including but not limited to the bitrate, fluency, and decoding time corresponding to the text data stream.
As described above, the audio/video content of the target video, which comprises information of multiple modalities, and the data stream processing information inside the player are both fully utilized, which ensures the accuracy of the audio/video quality test result.
In addition, the video playing information is determined by a separate process while the user watches the video, so the playing quality can be evaluated during viewing itself: multi-dimensional information such as video fluency, picture quality, sound quality, and video content receives the most realistic quality evaluation, which greatly helps developers judge the quality of decision algorithms and discover and solve existing problems.
Step 220, determining, according to the visual information, the auditory information, and the data stream information, the playing quality characteristic information corresponding to the target video on at least two information modalities.
The at least two information modalities include, but are not limited to, a video modality, an audio modality, an image modality, a text modality, and an image-text modality corresponding to the image modality and the text modality. The image-text modality is the fused information modality corresponding to the image modality and the text modality.
When a user actually watches a video, its quality is judged from multiple dimensions. For example, a video that plays smoothly and clearly but has no sound or subtitles at all is essentially unwatchable; yet video quality evaluation today is still largely single-angle, mostly implemented from the image perspective, and an effective quality evaluation system is lacking. By acquiring the playing quality indexes corresponding to each of several information modalities, video quality can be determined from multiple dimensions.
For any one of the at least two information modalities, the quality index data corresponding to that modality can be obtained from the video playing information, and the quality characteristic data of the target video on that modality is determined from it.
A video player mainly processes the three data streams of video, audio, and subtitles, so the information modalities associated with playing quality include the video, audio, image, and text modalities. The process can therefore be understood simply as multi-modal feature extraction: feature extraction for each modality is performed on the visual information, the auditory information, and the data stream information, and the quality feature data corresponding to these information modalities is determined.
In an exemplary embodiment, the visual information includes a video frame sequence corresponding to the target video, the auditory information includes an audio signal corresponding to the target video, and the data stream information includes the data stream information corresponding to the video frame sequence and the data stream information corresponding to the audio signal. Correspondingly, as shown in FIG. 3, the implementation of step 220 includes the following steps (221-223).
Step 221, determining the playing quality characteristic information corresponding to the target video in the video modality based on the data stream information corresponding to the video frame sequence.
The data stream information includes video playing quality index data associated with player processing performance, so that the video playing quality feature vector can be determined based on at least one video playing quality index data in the data stream information.
In a possible implementation manner, the at least one video playing quality index data may be spliced to obtain the video playing quality feature vector.
In another possible implementation manner, feature extraction processing may be performed on the at least one video playing quality indicator to obtain the video playing quality feature vector.
The video playing quality feature vector can be used as corresponding playing quality feature information of the target video in a video mode.
The playing quality characteristic information corresponding to the target video in the video modality can also be determined from the data stream information corresponding to the video frame sequence, the audio signal, and the text content together: all three kinds of data stream information represent the data processing performance of the player, and cross-modal feature fusion over them can characterize the quality feature information of the video modality.
Optionally, the playing quality index data in the data stream information corresponding to the video frame sequence, the audio signal, and the text content is spliced to obtain the video playing quality feature vector; or feature extraction is performed on that index data to obtain the video playing quality feature vector. A minimal sketch of the splicing variant follows.
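The following is a minimal sketch, in Python, of assembling index data into a feature vector by splicing; the dictionary key names are assumptions made for illustration.

```python
import numpy as np

def playback_quality_vector(stream_info: dict) -> np.ndarray:
    # One possible realization of "splicing" index data from the
    # data stream information into a fixed-order feature vector.
    keys = ["bitrate_kbps", "frame_rate_fps", "frame_loss_rate",
            "fluency", "decode_accuracy", "decode_time_ms"]
    return np.array([stream_info[k] for k in keys], dtype=np.float32)

def fused_stream_vector(video_v, audio_v, text_v):
    # Cross-stream variant: concatenate the index vectors of the
    # video, audio and text data streams.
    return np.concatenate([video_v, audio_v, text_v])
```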
Step 222, determining the corresponding playing quality characteristic information of the target video on the image modality based on the video frame sequence.
At least one item of image quality index data, used to characterize the propagation quality of image modality information in the video content, can be determined from the image data of the video frames in the video frame sequence.
Image quality is measured along different dimensions. For an artistic painting, aesthetics matter more than resolution or noise; for a medical image, the amount of image information and whether the patient's symptoms are revealed matter most; for a video cover image, what matters is whether it reflects the video's information to the greatest extent and whether it highlights important characters or plot scenes.
During video playing, the image content in the video frame sequence is varied, so the image quality index data that chiefly represents image quality includes, but is not limited to, relatively objective indexes such as definition, resolution, color saturation, texture blurriness, picture integrity, and aesthetic measure. This index data can be obtained by correspondingly analyzing the image data of each video frame in the video frame sequence.
In one example, as shown in fig. 6, a schematic diagram of determining image quality indicator data is illustrated. In fig. 6, 4 images with the same picture content but different tones are shown, and the 4 images may be 4 video frames in a video. When the terminal plays the 4 video frames, image quality index data such as definition, resolution, chroma, picture integrity and the like can be determined.
Based on the at least one image quality indicator data, an image quality feature vector may be determined. In a possible implementation manner, the at least one image quality index data may be spliced to obtain the image quality feature vector. In another possible implementation, the at least one image quality index data may be subjected to feature extraction processing to obtain the image quality feature vector.
The image quality characteristic vector is used for representing the corresponding playing quality characteristic information of the target video on the image modality.
Step 223, determining the playing quality characteristic information corresponding to the target video in the audio modality based on the audio signal and the data stream information corresponding to the audio signal.
In the video modality, the audio signal corresponding to each frame of image is not fixed, so the playing quality characteristic information of the video modality serves as an independent evaluation dimension; correspondingly, the playing quality characteristic information corresponding to the audio modality can also serve as an independent evaluation dimension.
At least one item of audio quality index data associated with the audio content can be determined from the content of the audio signal. Index data determined from the content itself is only weakly correlated with player performance and can be obtained by analyzing the audio signal, for example the type of the audio content or its audibility.
The data stream information corresponding to the audio signal includes audio quality index data associated with the performance of the player, such as the signal-to-noise ratio and fluency.
The audio quality index data may thus include both the index data determined from the content of the audio signal and the index data contained in the data stream information corresponding to the audio signal; the audio playing quality feature vector is determined from these.
In a possible implementation manner, the at least one audio content quality indicator data and the audio quality indicator data included in the data stream information corresponding to the audio signal may be spliced to obtain the audio playing quality feature vector.
In another possible implementation manner, the at least one audio content quality indicator data and the audio quality indicator data included in the data stream information corresponding to the audio signal may be subjected to feature extraction processing, so as to obtain the audio playing quality feature vector.
The audio playing quality feature vector can be used as playing quality feature information of the target video in the audio mode.
In an exemplary embodiment, the visual information further includes text content corresponding to the target video, and the data stream information further includes data stream information corresponding to the text content. Accordingly, as shown in FIG. 3, the implementation of step 220 further includes the following steps (224-225).
Step 224, determining the playing quality characteristic information corresponding to the target video in the text modality based on the text content and the data stream information corresponding to the text content.
Based on the text content and the data stream information corresponding to the text content, at least one item of text quality index data can be determined to characterize the propagation quality of text modality information. The text quality index data comprises quality index data that measures the text content itself and, from the data stream information corresponding to the text content, index data related to the data processing performance of the player.
For text in video such as subtitles, the indexes that can measure subtitle quality include completeness, definition, and text quality. Completeness represents whether the video contains subtitles and, for foreign-language episodes, whether multi-language subtitles exist. Definition includes subtitle definition and can represent the rendering effect of embedded and external subtitles; for example, a 1080P video rendering 270P subtitles produces a visibly inconsistent picture. Text quality can be determined from the actual text content or the way the text was generated; for example, subtitles produced by a subtitle team generally differ in quality from machine-translated subtitles.
In one example, as shown in FIG. 7, a diagram illustrating the determination of text quality indicator data is shown. Fig. 7 shows that the video frame image has the subtitle content "do you remember," and the terminal can detect the subtitle content in the video frame image and determine the integrity, definition and text quality of the subtitle.
Based on the at least one item of text quality index data, a text quality feature vector can be determined. In one possible implementation, the at least one item of text quality index data is spliced to obtain the text quality feature vector; in another, feature extraction is performed on the at least one item of text quality index data to obtain the text quality feature vector.
And step 225, performing characteristic information fusion processing on the playing quality characteristic information corresponding to the image modality and the playing quality characteristic information corresponding to the text modality to obtain the playing quality characteristic information corresponding to the target video in the image-text modality.
The image-text mode refers to a fusion information mode corresponding to an image mode and a text mode.
And carrying out feature fusion processing on the image quality feature vector and the text quality feature vector to obtain the image-text playing quality feature vector. The image-text playing quality characteristic vector is used for representing the playing quality characteristic information of the target video on the image-text mode.
A frame of image in a video often has associated text information, such as subtitles or bullet comments, so the features corresponding to the image modality and the text modality are strongly related; the embedded subtitle features and the image features can therefore be spliced as input to the subsequent playing quality test model. The feature fusion of the image quality feature vector and the text quality feature vector can use bilinear pooling: the outer product of the two vectors is computed, and the matrix generated by the outer product is linearized into a vector, which represents their fused feature, i.e., the image-text playing quality feature vector. A minimal sketch follows.
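The following is a minimal sketch, in Python, of the bilinear pooling fusion described above.

```python
import numpy as np

def bilinear_pool(image_feat: np.ndarray, text_feat: np.ndarray) -> np.ndarray:
    # Compute the outer product of the image quality feature vector and
    # the text quality feature vector, then linearize (flatten) the
    # resulting matrix into one vector: the fused image-text playing
    # quality feature vector.
    return np.outer(image_feat, text_feat).flatten()

# For example, a 64-dim image vector and a 16-dim text vector fuse
# into a 1024-dim image-text feature vector.
```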
Through the above process, multi-modal playing quality characteristic information is extracted, so that the quality characteristic information of the various modalities can be used to produce the audio-visual quality test result.
And step 230, generating an audio-visual quality test result corresponding to the target video played by the target player based on the playing quality characteristic information corresponding to the at least two information modalities.
The audiovisual quality test result is used for representing the audiovisual information transmission quality of the target video on the target player.
In an exemplary embodiment, the audio-visual quality test result includes a playing quality attribution score corresponding to the target video on each of the at least two information modalities; a playing quality attribution score represents the propagation quality of single-modality information in the target video.
Accordingly, as shown in fig. 3, the implementation of step 230 includes the following step 231.
Step 231, determining the playing quality attribution score corresponding to the target video on each information modality based on the playing quality characteristic information corresponding to each of the at least two information modalities.
After the playing quality characteristic information on the various information modalities is determined, multi-modal playing quality test pre-training over video, audio, text, image, image-text, and other modalities can be carried out, so that a quality evaluation model for each information modality is obtained.
In an exemplary embodiment, for the video modality, the playing quality characteristic information of the target video in the video modality, i.e., the video playing quality feature vector, is input to the video modality quality evaluation model, which outputs a video modality quality score. The video modality quality score is the playing quality attribution score corresponding to the target video on the video modality and represents the transmission quality of the dynamic information of the video modality in the target video.
The video modality quality evaluation model is a machine learning model trained with the playing quality characteristic information of sample videos on the video modality, i.e., video playing quality feature vectors, as training features and annotated scores as label information; it performs the playing quality test on video modality information and outputs the quantized test result corresponding to the video modality, i.e., the video modality quality score.
In an exemplary embodiment, for the audio modality, the playing quality characteristic information of the target video on the audio modality, i.e., the audio playing quality feature vector, is input to the audio modality quality evaluation model, which outputs an audio modality quality score. The audio modality quality score is the playing quality attribution score corresponding to the target video on the audio modality and represents the transmission quality of audio modality information in the target video.
The audio modality quality evaluation model is a machine learning model trained with the playing quality characteristic information of sample videos on the audio modality, i.e., audio playing quality feature vectors, as training features and annotated scores as label information; it performs the playing quality test on audio modality information and outputs the quantized test result corresponding to the audio modality, i.e., the audio modality quality score.
In an exemplary embodiment, for the image modality, the playing quality characteristic information of the target video on the image modality, i.e., the image quality feature vector, is input to the image modality quality evaluation model, which outputs an image modality quality score. The image modality quality score is the playing quality attribution score corresponding to the target video on the image modality and represents the transmission quality of image modality information in the target video.
The image modality quality evaluation model is a machine learning model trained with the playing quality characteristic information of sample videos on the image modality, i.e., image quality feature vectors, as training features and annotated scores as label information; it performs the playing quality test on image modality information and outputs the quantized test result corresponding to the image modality, i.e., the image modality quality score.
In an exemplary embodiment, for the text modality, the playing quality characteristic information of the target video in the text modality, i.e., the text quality feature vector, is input to the text modality quality evaluation model, which outputs a text modality quality score. The text modality quality score is the playing quality attribution score corresponding to the target video on the text modality and represents the transmission quality of text modality information in the target video.
The text modality quality evaluation model is a machine learning model trained with the playing quality characteristic information of sample videos on the text modality, i.e., text quality feature vectors, as training features and annotated scores as label information; it performs the playing quality test on text modality information and outputs the quantized test result corresponding to the text modality, i.e., the text modality quality score.
In an exemplary embodiment, for the image-text modality, the playing quality characteristic information of the target video in the image-text modality, i.e., the image-text playing quality feature vector, is input to the image-text modality quality evaluation model, which outputs an image-text modality quality score. The image-text modality quality score is the playing quality attribution score corresponding to the target video on the image-text modality and represents the transmission quality of the combined image-text information in the target video.
The image-text modality quality evaluation model is a machine learning model trained with the playing quality characteristic information of sample videos on the image-text modality, i.e., image-text playing quality feature vectors, as training features and annotated scores as label information; it performs the playing quality test on combined image-text information and outputs the quantized test result corresponding to the image-text modality, i.e., the image-text modality quality score.
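The description above specifies only the interface of these models: a quality feature vector in, a quantized quality score out, supervised by annotated scores. As one possible realization, the following is a minimal sketch assuming a small PyTorch MLP regressor trained with MSE loss; the architecture, layer sizes, and hyperparameters are assumptions, not part of this application.

```python
import torch
import torch.nn as nn

class ModalityQualityScorer(nn.Module):
    # One per-modality quality evaluation model: maps a playing
    # quality feature vector to a scalar quality score.
    def __init__(self, feat_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

# Supervised training: feature vectors of sample videos as input,
# human-annotated quality scores as labels.
model = ModalityQualityScorer(feat_dim=6)
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```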
In an exemplary embodiment, the audio-visual quality test result further includes an overall audio-visual quality score corresponding to the target video, which represents, as a whole, the audio-visual information propagation quality of the target video on the target player.
Accordingly, as shown in fig. 3, after the step 231, the method further includes the following step 232.
Step 232, fusing the playing quality attribution scores corresponding to the information modalities to obtain the overall audio-visual quality score.
In an exemplary embodiment, the image-text modality quality score, the audio modality quality score, and the video modality quality score are weighted and averaged to obtain the overall audio-visual quality score. Optionally, the weighting coefficient of each quality attribution score can be adjusted according to actual conditions; for example, the weighting coefficients of the three playing quality attribution scores may each be 1/3. Computing the overall score while also retaining the per-modality attribution scores provides a reference for video developers and makes it easy to optimize on whichever dimension scores poorly.
In an exemplary embodiment, when the image modality and the text modality are not fused, the audio modality quality score, the video modality quality score, the image modality quality score, and the text modality quality score may be weighted and averaged to obtain the overall audio-visual quality score. A minimal sketch of the weighted average follows.
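A minimal sketch of this weighted averaging, assuming scores on a common scale; the modality names and values are illustrative.

```python
def overall_av_score(scores, weights=None):
    # Weighted average of the per-modality playing quality attribution
    # scores; with no weights given, every modality counts equally
    # (e.g. 1/3 each for image-text, audio and video).
    if weights is None:
        weights = {k: 1.0 / len(scores) for k in scores}
    return sum(scores[k] * weights[k] for k in scores)

overall = overall_av_score({"image_text": 78.0, "audio": 85.0, "video": 90.0})
```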
In one example, FIG. 8 shows a schematic diagram of a multi-modal quality testing network. The playing data corresponding to the four modalities of speech, image, subtitle, and video in the playing process is input respectively to a speech network, an image network, a text network, and a video network; each network extracts the playing quality features of its modality's media content, i.e., the speech, image, text, and video features shown in the figure. Multi-modal weighted feature fusion is then performed on these features, so the video playing quality can be evaluated quantitatively from the feature scores. With this network, a terminal video playing quality test can integrate multi-dimensional features such as video, picture, audio, and text: the encoder feature network of each modality maps it into a unified semantic space, and a stable multi-modal representation is synthesized. A deployed multi-modal quality testing network can give quality evaluations for the different information modalities such as video, audio, picture, and subtitle, and also a comprehensive quality score, coming closer to the subjective quality evaluation of a user watching the video.
In another embodiment, cross-modal feature fusion can be performed directly on the playing quality characteristic information of the at least two information modalities to obtain modality-fused feature information.
Cross-modal feature fusion refers to fusing the playing quality characteristic information corresponding to different modalities and mapping it into the same feature space.
Given the semantic differences across modalities, label sharing can be applied to the playing quality features of the various information modalities. For example, if the audio modality and the text modality correspond to features xa and xs of different categories, the subtitle feature xs can be forced to share the same label k as the audio feature xa. Learning a joint embedding of cross-modal features reduces possible semantic differences between information modalities, captures task-related semantics, and promotes more knowledge transfer. Optionally, the label sharing rule shares labels according to the video playing time information corresponding to each modality's data; a label may be an annotated score, such as a user's playing quality score for the video content at a certain moment.
In a possible implementation manner, the video play quality feature vector, the audio play feature vector, the image quality feature vector and the text quality feature vector are subjected to multi-modal fusion to obtain the modal fusion feature data. The vector fusion modes include, but are not limited to, superposition, splicing (concatenation), combined embedding, and the like.
Specifically, the four feature vectors may be spliced to obtain the modal fusion feature vector; or superposed to obtain the modal fusion feature vector; or combined and embedded to obtain the modal fusion feature vector. The combined-embedding process may be implemented by a multi-modal knowledge transfer learning model.
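The difference between these fusion modes can be illustrated with NumPy; the 128-dimensional feature size is an arbitrary assumption:

```python
import numpy as np

video_f, audio_f, image_f, text_f = (np.random.rand(128) for _ in range(4))

# Splicing (concatenation): keeps every modality's own dimensions.
fused_concat = np.concatenate([video_f, audio_f, image_f, text_f])  # (512,)

# Superposition (element-wise addition): requires a common dimension.
fused_sum = video_f + audio_f + image_f + text_f                    # (128,)

# Combined embedding: each vector would first pass through a learned
# encoder before merging (see the network sketch after FIG. 8 above).
```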
The overall audio-visual quality score corresponding to the target video is then determined based on the modal fusion feature information.
In one possible embodiment, the modal fusion feature information, i.e., the modal fusion feature vector, is input to a multi-modal quality test model, which outputs the overall audio-visual quality score corresponding to the target video. Optionally, the multi-modal quality test model is a machine learning model obtained by distillation training of a pre-trained model on the sample features of each modality's quality test model and the shared labels, the shared labels being the labels forced to be shared as described above.
By semantically fusing the extracted multi-dimensional play quality features of the different modalities, the knowledge each modality has learned into the modal fusion features can be distilled together into one overall system, and a comprehensive quality evaluation result for the video playing can then be obtained from this fused feature knowledge.
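As a hedged sketch of such distillation training — the MSE loss form and the weighting coefficient alpha are assumptions, since the application does not fix the loss — the multi-modal student model can be fitted jointly to the shared labels and to the per-modality teacher models:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_pred, teacher_preds, shared_labels, alpha=0.5):
    """student_pred:  (B,) scores from the multi-modal quality test model.
    teacher_preds:    list of (B,) scores from the per-modality models.
    shared_labels:    (B,) labels forced to be shared across modalities."""
    hard = F.mse_loss(student_pred, shared_labels)           # fit shared labels
    soft = torch.stack([F.mse_loss(student_pred, t.detach())
                        for t in teacher_preds]).mean()      # mimic teachers
    return alpha * hard + (1.0 - alpha) * soft
```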
In one example, FIG. 9 exemplarily shows a schematic diagram of an audio and video quality testing flow while a terminal plays a video. A user plays a video with player software installed on the terminal; once playback starts, the terminal runs the multi-modal quality testing process in the background. The multi-modal quality testing system can evaluate video playing along multiple dimensions, including different types of film sources, newly launched terminal-side video processing algorithms, playing quality, video content quality, and so on. During the test, quality-feature knowledge is extracted for each of the video, image, audio and subtitle modalities from its corresponding network; the features are then synthesized, embedded and input to the multi-modal network to obtain a video score representing the audio-visual knowledge, i.e., the overall playing quality score. This video score can be understood as a quantitative quality test result for the player's strategy. Optionally, the multi-modal network is a neural network model obtained by machine learning training based on a multi-modal pre-trained model. For the video playing quality test, the multi-modal pre-trained model evaluates the quality of the multi-dimensional information content contained in the video, such as video, image, text and speech, and fuses the audio-visual knowledge of the multi-modal information, so that the test score comes closer to the subjective quality evaluation of users watching the video.
In an exemplary embodiment, as shown in fig. 3, after step 230 described above, the method further includes step 240 described below.
Step 240: generating, based on the audio-visual quality test result, policy adjustment information for the data processing policy information configured for the target player.
The data processing policy information includes at least one data processing policy for the data streams, and the policy adjustment information is used to adjust the data processing policy.
The player is provided with a data processing policy library for processing each data stream. The data processing policy library includes data processing policies for the information of each modality.
For the video modality, decoders for different encoding formats are configured in the player, such as decoders corresponding to the H.265 and H.264 encoding modes. Data processing policies for different decoding modes can also be configured, such as the hard-decoding policies MediaCodec and VideoToolbox and the soft-decoding policy FFmpeg. MediaCodec is the hardware codec framework of the Android operating system; VideoToolbox is the hardware-accelerated video processing framework of the iOS platform; FFmpeg is an open-source suite of programs for recording, converting and streaming digital audio and video.
For the image modality, different image processing policies are configured in the player, such as super-resolution, super frame rate, color-blind mode, HDR (High Dynamic Range) and VR image processing policies.
For the audio modality, different audio processing policies are configured in the player, such as denoising, echo cancellation, sound-effect processing, amplification/enhancement, and mixing/separation.
For the text modality, different text processing policies are configured in the player, such as font policies and embedded/external subtitle policies.
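For illustration, such a policy library might be laid out as follows; the dictionary shape and key names are assumptions, while the policy names come from the preceding paragraphs:

```python
# Illustrative layout of the per-modality data processing policy library;
# structure and keys are assumed, not specified by this application.
POLICY_LIBRARY = {
    "video": {
        "decoders": ["H.265", "H.264"],
        "decode_modes": ["MediaCodec", "VideoToolbox", "FFmpeg"],
    },
    "image": ["super_resolution", "super_frame_rate",
              "color_blind_mode", "HDR", "VR"],
    "audio": ["denoise", "echo_cancellation", "sound_effects",
              "amplify_enhance", "mix_separate"],
    "text": ["font", "embedded_subtitles", "external_subtitles"],
}
```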
The data processing policy for the data streams of the currently played video may be a policy set by the user on the player's settings page, such as an operation of selecting an image special effect, or an operation performed by the user in the control bar of the video playing page, such as setting a playback speed or a definition level.
The audio-visual quality test result reflects the quality of the data processing policies the player uses to play the video, so policy adjustment information can be generated from the test result to adjust those policies. For example, after the user enables high-speed playback, the heavy data computation load may lower the video playing quality; policy adjustment information can then be generated from the audio-visual quality test result to prompt the user to restore normal-speed playback.
The player can also automatically adjust the currently used data processing policy according to the policy adjustment information.
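A hypothetical sketch of turning a low audio-visual quality test result into policy adjustment information; the 5-point scale, the threshold and the suggested adjustments are illustrative assumptions:

```python
# Hypothetical sketch: derive user prompts and automatic adjustments
# from a low overall audio-visual quality score.
def make_policy_adjustment(av_score, active_policies, threshold=3.0):
    if av_score >= threshold:
        return None  # playing quality acceptable: keep current policies
    adjustment = {"prompt_user": [], "auto_adjust": []}
    if active_policies.get("playback_speed", 1.0) > 1.0:
        # High-speed playback raises computing pressure; suggest normal speed.
        adjustment["prompt_user"].append("restore normal playback speed")
    if "super_resolution" in active_policies.get("image", []):
        adjustment["auto_adjust"].append(("image", "disable super_resolution"))
    return adjustment

# Example: a 2.4 score while playing at 2x speed with super-resolution on.
print(make_policy_adjustment(2.4, {"playback_speed": 2.0,
                                   "image": ["super_resolution"]}))
```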
In an online video service scenario, the testing process can be executed synchronously while the video plays; based on the multi-modal quality test model, the playing quality of online video can be evaluated in real time so that problems are found promptly. After obtaining the video playing test result, the terminal can upload it to the server to help developers analyze it. Developers can then optimize accordingly, for example the technical architecture design, codec selection, streaming media protocol, adaptive algorithms, connection and stall logic, and client software design.
In summary, in the technical solution provided by the embodiments of the present application, the visual information and auditory information perceivable by the user while the player plays the video, together with the data stream information reflecting the data processing quality of the target player, are obtained; from these, the play quality features of the video in at least two information modalities are determined, and those features are used to generate an audio-visual quality test result capable of representing the propagation quality of the audio-visual information. The play quality information of the media content in the different information modalities is thus fully mined, ensuring that the test result accords with users' subjective evaluation and avoiding the inaccurate results caused by a single information dimension, which improves the accuracy of the audio and video quality test. Moreover, since the video playing quality can be determined without manual observation, the labor cost of audio and video quality testing is reduced.
The following are embodiments of the apparatus of the present application that may be used to perform embodiments of the method of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the embodiments of the method of the present application.
Referring to fig. 10, a block diagram of a video playing test apparatus according to an embodiment of the present application is shown. The apparatus has the function of implementing the video playing test method; the function may be implemented by hardware, or by hardware executing corresponding software. The apparatus may be a computer device, or may be provided in a computer device. The apparatus 1000 may include: a playing information obtaining module 1010, a quality characteristic determining module 1020 and a test result generating module 1030.
The playing information acquiring module 1010 is configured to acquire video playing information corresponding to a target video, where the video playing information includes visual information and auditory information corresponding to the target video, and data stream information corresponding to the target video played by a target player, and the data stream information is used to represent processing quality of a data stream in a video playing process.
A quality characteristic determining module 1020, configured to determine, according to the visual information, the auditory information, and the data stream information, play quality characteristic information corresponding to the target video in at least two information modalities.
A test result generating module 1030, configured to generate, based on the play quality feature information corresponding to the at least two information modalities, an audiovisual quality test result corresponding to the target video played by the target player, where the audiovisual quality test result is used to characterize audiovisual information propagation quality of the target video corresponding to the target player.
In an exemplary embodiment, the playing information obtaining module 1010 includes: a data stream acquiring unit, a video frame stream processing unit and an audio stream processing unit.
The data stream acquiring unit is configured to acquire, in response to a video playing instruction, at least two data streams corresponding to the target video, the at least two data streams including an original video frame data stream and an original audio data stream.
The video frame stream processing unit is configured to perform image display processing on the original video frame data stream to obtain a video frame sequence displayed on a target page and data stream information corresponding to the video frame sequence, the video frame sequence being used to represent visual information in the image modality.
The audio stream processing unit is configured to perform audio playing processing on the original audio data stream to generate an audio signal and data stream information corresponding to the audio signal, the audio signal being used to represent the auditory information.
In an exemplary embodiment, the at least two data streams further include an original text data stream, and the playing information obtaining module 1010 further includes: a text stream processing unit.
The text stream processing unit is configured to perform text display processing on the original text data stream to obtain text content displayed in the target page and data stream information corresponding to the text content, the text content being used to represent visual information corresponding to the text modality.
In an exemplary embodiment, the audio-visual quality test result includes a play quality attribution score of the target video corresponding to each of the at least two information modalities, the play quality attribution score being used to characterize the propagation quality of single-modality information in the target video; the test result generating module 1030 includes: an attribution score determining unit.
The attribution score determining unit is configured to determine the play quality attribution score of the target video corresponding to each information modality based on the play quality characteristic information corresponding to each of the at least two information modalities.
In an exemplary embodiment, the audio-visual quality test result further includes an overall audio-visual quality score corresponding to the target video, the overall audio-visual quality score being used to represent, as a whole, the audio-visual information propagation quality of the target video on the target player; the test result generating module 1030 further includes: an overall score determining unit.
The overall score determining unit is configured to fuse the play quality attribution scores corresponding to the respective information modalities to obtain the overall audio-visual quality score.
In an exemplary embodiment, the visual information includes a sequence of video frames corresponding to the target video, the auditory information includes an audio signal corresponding to the target video, and the data stream information includes data stream information corresponding to the sequence of video frames and data stream information corresponding to the audio signal;
the quality characteristic determining module 1020 includes: a video modality characteristic determining unit, an image modality characteristic determining unit and an audio modality characteristic determining unit.
The video modality characteristic determining unit is configured to determine the play quality characteristic information of the target video corresponding to the video modality based on the data stream information corresponding to the video frame sequence.
The image modality characteristic determining unit is configured to determine the play quality characteristic information of the target video corresponding to the image modality based on the video frame sequence.
The audio modality characteristic determining unit is configured to determine the play quality characteristic information of the target video corresponding to the audio modality based on the audio signal and the data stream information corresponding to the audio signal.
In an exemplary embodiment, the visual information further includes text content corresponding to the target video, the data stream information further includes data stream information corresponding to the text content, and the quality characteristic determining module 1020 further includes: a text modality characteristic determining unit and an image-text modality characteristic determining unit.
The text modality characteristic determining unit is configured to determine the play quality characteristic information of the target video corresponding to the text modality based on the text content and the data stream information corresponding to the text content.
The image-text modality characteristic determining unit is configured to perform characteristic information fusion processing on the play quality characteristic information corresponding to the image modality and the play quality characteristic information corresponding to the text modality to obtain the play quality characteristic information of the target video corresponding to the image-text modality, the image-text modality being the fused information modality corresponding to the image modality and the text modality.
In an exemplary embodiment, the apparatus 1000 further includes: a playing policy adjusting module.
The playing policy adjusting module is configured to generate, based on the audio-visual quality test result, policy adjustment information for the data processing policy information configured for the target player;
wherein the data processing policy information includes at least one data processing policy for the data streams, and the policy adjustment information is used to adjust the data processing policy.
In summary, in the technical solution provided by the embodiments of the present application, the visual information and auditory information perceivable by the user while the player plays the video, together with the data stream information reflecting the data processing quality of the target player, are obtained; from these, the play quality features of the video in at least two information modalities are determined, and those features are used to generate an audio-visual quality test result capable of representing the propagation quality of the audio-visual information. The play quality information of the media content in the different information modalities is thus fully mined, ensuring that the test result accords with users' subjective evaluation and avoiding the inaccurate results caused by a single information dimension, which improves the accuracy of the audio and video quality test. Moreover, since the video playing quality can be determined without manual observation, the labor cost of audio and video quality testing is reduced.
It should be noted that, when the apparatus provided in the foregoing embodiments implements its functions, the division into the functional modules described above is merely illustrative; in practical applications, the functions may be assigned to different functional modules as needed, i.e., the internal structure of the apparatus may be divided into different functional modules to implement all or part of the functions described above. In addition, the apparatus embodiments and the method embodiments provided above belong to the same concept; for details of their specific implementation, refer to the method embodiments, which are not repeated here.
Referring to fig. 11, a block diagram of a computer device according to an embodiment of the present application is shown. The computer device may be a terminal, and is configured to implement the video playing test method provided in the above embodiments. Specifically:
generally, the computer device 1100 includes: a processor 1101 and a memory 1102.
Processor 1101 may include one or more processing cores, such as a 4-core processor, an 8-core processor, or the like. The processor 1101 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 1101 may also include a main processor and a coprocessor, where the main processor is a processor for Processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 1101 may be integrated with a GPU (Graphics Processing Unit) that is responsible for rendering and drawing the content that the display screen needs to display. In some embodiments, the processor 1101 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
Memory 1102 may include one or more computer-readable storage media, which may be non-transitory. Memory 1102 can also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 1102 is used to store at least one instruction, at least one program, set of codes, or set of instructions configured to be executed by one or more processors to implement the above-described method of testing for video playback.
In some embodiments, the computer device 1100 may also optionally include: a peripheral interface 1103 and at least one peripheral. The processor 1101, memory 1102 and peripheral interface 1103 may be connected by a bus or signal lines. Various peripheral devices may be connected to the peripheral interface 1103 by buses, signal lines, or circuit boards. Specifically, the peripheral device includes: at least one of radio frequency circuitry 1104, touch screen display 1105, camera assembly 1106, audio circuitry 1107, positioning assembly 1108, and power supply 1109.
Those skilled in the art will appreciate that the configuration illustrated in FIG. 11 does not constitute a limitation of the computer device 1100, and may include more or fewer components than those illustrated, or may combine certain components, or may employ a different arrangement of components.
In an exemplary embodiment, there is also provided a computer readable storage medium having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions which, when executed by a processor, implement the above-described method of testing video playback.
Optionally, the computer-readable storage medium may include: ROM (Read Only Memory), RAM (Random Access Memory), SSD (Solid State drive), or optical disc. The Random Access Memory may include a ReRAM (resistive Random Access Memory) and a DRAM (Dynamic Random Access Memory).
In an exemplary embodiment, a computer program product or computer program is also provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions to enable the computer device to execute the video playing test method.
It should be understood that reference to "a plurality" herein means two or more. "and/or" describes the association relationship of the associated objects, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. In addition, the step numbers described herein only exemplarily show one possible execution sequence among the steps, and in some other embodiments, the steps may also be executed out of the numbering sequence, for example, two steps with different numbers are executed simultaneously, or two steps with different numbers are executed in a reverse order to the order shown in the figure, which is not limited by the embodiment of the present application. The above description is only exemplary of the present application and should not be taken as limiting the present application, and any modifications, equivalents, improvements and the like that are made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (10)

1. A method for testing video playing, the method comprising:
acquiring video playing information corresponding to a target video, wherein the video playing information comprises visual information and auditory information corresponding to the target video and data stream information corresponding to the target video played by a target player, and the data stream information is used for representing the processing quality of a data stream in the video playing process;
according to the visual information, the auditory information and the data stream information, determining corresponding playing quality characteristic information of the target video on at least two information modalities;
and generating an audio-visual quality test result corresponding to the target video played by the target player based on the playing quality characteristic information corresponding to the at least two information modes, wherein the audio-visual quality test result is used for representing the audio-visual information transmission quality of the target video corresponding to the target player.
2. The method according to claim 1, wherein the obtaining video playing information corresponding to the target video includes:
responding to a video playing instruction, and acquiring at least two data streams corresponding to the target video, wherein the at least two data streams comprise an original video frame data stream and an original audio data stream;
performing image display processing on the original video frame data stream to obtain a video frame sequence displayed on a target page and data stream information corresponding to the video frame sequence, wherein the video frame sequence is used for representing visual information on an image modality;
and carrying out audio playing processing on the original audio data stream to generate an audio signal and data stream information corresponding to the audio signal, wherein the audio signal is used for representing the auditory information.
3. The method according to claim 2, wherein the at least two data streams further include an original text data stream, and after the at least two data streams corresponding to the target video are obtained in response to the video playing instruction, the method further includes:
and performing text display processing on the original text data stream to obtain text content displayed in the target page and data stream information corresponding to the text content, wherein the text content is used for representing visual information corresponding to a text mode.
4. The method according to claim 1, wherein the audiovisual quality test result comprises a playback quality attribution score of the target video corresponding to each of the at least two information modalities, and the playback quality attribution score is used for characterizing the propagation quality of single-modality information in the target video;
generating an audio-visual quality test result corresponding to the target video played by the target player based on the playing quality characteristic information corresponding to the at least two information modalities, including:
and determining the corresponding play quality attribution of the target video on each information modality based on the corresponding play quality characteristic information on each information modality of the at least two information modalities.
5. The method according to claim 4, wherein the audiovisual quality test result further includes an audiovisual quality overall score corresponding to the target video, and the audiovisual quality overall score is used to represent audiovisual information propagation quality corresponding to the target video on the target player as a whole;
after the determining the playback quality attribution score of the target video corresponding to each information modality based on the play quality characteristic information corresponding to each information modality of the at least two information modalities, the method further comprises:
fusing the playback quality attribution scores corresponding to the respective information modalities to obtain the audiovisual quality overall score.
6. The method of claim 1, wherein the visual information comprises a sequence of video frames corresponding to the target video, wherein the auditory information comprises an audio signal corresponding to the target video, and wherein the data stream information comprises data stream information corresponding to the sequence of video frames and data stream information corresponding to the audio signal;
the determining, according to the visual information, the auditory information, and the data stream information, the playing quality feature information corresponding to the target video in at least two information modalities includes:
determining corresponding playing quality characteristic information of the target video on a video modality based on data stream information corresponding to the video frame sequence;
determining corresponding playing quality characteristic information of the target video on an image modality based on the video frame sequence;
and determining corresponding playing quality characteristic information of the target video on an audio modality based on the audio signal and the data stream information corresponding to the audio signal.
7. The method according to claim 6, wherein the visual information further includes text content corresponding to the target video, the data stream information further includes data stream information corresponding to the text content, and the determining, according to the visual information, the auditory information and the data stream information, the play quality characteristic information of the target video corresponding to at least two information modalities further comprises:
determining playing quality characteristic information corresponding to the target video on a text mode based on the text content and the data stream information corresponding to the text content;
and performing feature information fusion processing on the playing quality feature information corresponding to the image modality and the playing quality feature information corresponding to the text modality to obtain the playing quality feature information corresponding to the target video in the image-text modality, wherein the image-text modality refers to a fusion information modality corresponding to the image modality and the text modality.
8. The method according to any one of claims 1 to 7, further comprising:
generating policy adjustment information for the data processing policy information configured for the target player based on the audiovisual quality test result;
wherein the data processing policy information includes at least one data processing policy for the data flow, and the policy adjustment information is used to adjust the data processing policy.
9. A device for testing video playback, the device comprising:
the playing information acquisition module is used for acquiring video playing information corresponding to a target video, wherein the video playing information comprises visual information and auditory information corresponding to the target video and data stream information corresponding to the target video played by a target player, and the data stream information is used for representing the processing quality of a data stream in the video playing process;
the quality characteristic determining module is used for determining corresponding playing quality characteristic information of the target video on at least two information modalities according to the visual information, the auditory information and the data stream information;
and the test result generation module is used for generating an audio-visual quality test result corresponding to the target video played by the target player based on the playing quality characteristic information corresponding to the at least two information modalities, wherein the audio-visual quality test result is used for representing the audio-visual information transmission quality corresponding to the target video on the target player.
10. A computer device comprising a processor and a memory, the memory having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by the processor to implement a method of testing video playback as claimed in any of claims 1 to 8.
CN202111275743.6A 2021-10-29 2021-10-29 Video playing test method, device and equipment Pending CN114339197A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111275743.6A CN114339197A (en) 2021-10-29 2021-10-29 Video playing test method, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111275743.6A CN114339197A (en) 2021-10-29 2021-10-29 Video playing test method, device and equipment

Publications (1)

Publication Number Publication Date
CN114339197A true CN114339197A (en) 2022-04-12

Family

ID=81045551

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111275743.6A Pending CN114339197A (en) 2021-10-29 2021-10-29 Video playing test method, device and equipment

Country Status (1)

Country Link
CN (1) CN114339197A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114979737A (en) * 2022-05-17 2022-08-30 西安超涌现科技有限公司 Multimode collaborative display video playing system
CN114979737B (en) * 2022-05-17 2024-03-19 西安超涌现科技有限公司 Video playing system with multi-mode collaborative display

Similar Documents

Publication Publication Date Title
CN107979763B (en) Virtual reality equipment video generation and playing method, device and system
CN111901598B (en) Video decoding and encoding method, device, medium and electronic equipment
US20220392224A1 (en) Data processing method and apparatus, device, and readable storage medium
US20230041730A1 (en) Sound effect adjustment
Chao et al. Audio-visual perception of omnidirectional video for virtual reality applications
CN109074678A (en) A kind of processing method and processing device of information
CN113569892A (en) Image description information generation method and device, computer equipment and storage medium
CN111615002B (en) Video background playing control method, device and system and electronic equipment
CN112272327B (en) Data processing method, device, storage medium and equipment
CN111372141B (en) Expression image generation method and device and electronic equipment
CN112785669B (en) Virtual image synthesis method, device, equipment and storage medium
WO2023045635A1 (en) Multimedia file subtitle processing method and apparatus, electronic device, computer-readable storage medium, and computer program product
Sassatelli et al. New interactive strategies for virtual reality streaming in degraded context of use
WO2023197749A1 (en) Background music insertion time point determining method and apparatus, device, and storage medium
CN114339197A (en) Video playing test method, device and equipment
Zhu et al. Audio-visual saliency for omnidirectional videos
CN114445755A (en) Video quality evaluation method, device, equipment and storage medium
CN113762056A (en) Singing video recognition method, device, equipment and storage medium
CN115619882B (en) Video compression method
CN116962741A (en) Sound and picture synchronization detection method and device, computer equipment and storage medium
CN111757173B (en) Commentary generation method and device, intelligent sound box and storage medium
CN116561294A (en) Sign language video generation method and device, computer equipment and storage medium
CN114697741A (en) Multimedia information playing control method and related equipment
Costa et al. Deep Learning Approach for Seamless Navigation in Multi-View Streaming Applications
CN112672151B (en) Video processing method, device, server and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination