CN114339197A - Video playing test method, device and equipment

Video playing test method, device and equipment

Info

Publication number
CN114339197A
Authority
CN
China
Prior art keywords
information
video
quality
playing
audio
Prior art date
Legal status
Pending
Application number
CN202111275743.6A
Other languages
Chinese (zh)
Inventor
Xia Shuang (夏爽)
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202111275743.6A
Publication of CN114339197A

Landscapes

  • Testing, Inspecting, Measuring Of Stereoscopic Televisions And Televisions (AREA)

Abstract

The application discloses a video playing test method, device, and equipment, belonging to the technical field of artificial intelligence. The method includes: acquiring video playing information corresponding to a target video; determining, according to visual information, auditory information, and data stream information, the playing quality characteristic information corresponding to the target video on at least two information modalities; and generating, based on the playing quality characteristic information corresponding to the at least two information modalities, an audio-visual quality test result corresponding to the target video played by a target player. According to this technical scheme, the playing quality characteristics of the video on at least two information modalities are determined from multi-dimensional video playing information, so that a test result capable of representing the audio-visual information propagation quality is generated. This avoids the inaccuracy caused by single-dimension evaluation, improves the accuracy of the audio/video quality test, and reduces the labor cost the test requires.

Description

Video playing test method, device and equipment
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, and a device for testing video playing.
Background
In recent years, Internet audio and video data has been growing rapidly, and audio and video quality evaluation has remained an active research field.
In the related art, audio and video quality evaluation usually relies on subjective evaluation, i.e., the audio and video quality is scored through human visual observation. Schemes that evaluate the quality from an objective angle often assess only the image dimension of the video and score the audio and video quality according to image quality alone.
Audio and video quality evaluation in the related art therefore suffers from single dimensionality, low accuracy, and high labor cost.
Disclosure of Invention
The embodiments of the application provide a video playing test method, device, and equipment, which can test and evaluate audio/video quality using media content of multiple information modalities, improving the accuracy of the audio/video quality test and reducing the labor cost of testing.
According to an aspect of an embodiment of the present application, there is provided a method for testing video playback, the method including:
acquiring video playing information corresponding to a target video, wherein the video playing information includes visual information and auditory information corresponding to the target video and data stream information corresponding to the target video played by a target player, and the data stream information is used for representing the processing quality of the data streams in the video playing process;
determining, according to the visual information, the auditory information, and the data stream information, the playing quality characteristic information corresponding to the target video on at least two information modalities;
and generating, based on the playing quality characteristic information corresponding to the at least two information modalities, an audio-visual quality test result corresponding to the target video played by the target player, wherein the audio-visual quality test result is used for representing the audio-visual information transmission quality of the target video on the target player.
According to an aspect of an embodiment of the present application, there is provided a video playback testing apparatus, including:
a playing information acquisition module, configured to acquire video playing information corresponding to a target video, wherein the video playing information includes visual information and auditory information corresponding to the target video and data stream information corresponding to the target video played by a target player, and the data stream information is used for representing the processing quality of the data streams in the video playing process;
a quality characteristic determining module, configured to determine, according to the visual information, the auditory information, and the data stream information, the playing quality characteristic information corresponding to the target video on at least two information modalities;
and a test result generation module, configured to generate, based on the playing quality characteristic information corresponding to the at least two information modalities, an audio-visual quality test result corresponding to the target video played by the target player, wherein the audio-visual quality test result is used for representing the audio-visual information transmission quality of the target video on the target player.
According to an aspect of the embodiments of the present application, there is provided a computer device including a processor and a memory, the memory storing at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by the processor to implement the above video playing test method.
According to an aspect of the embodiments of the present application, there is provided a computer-readable storage medium storing at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by a processor to implement the above video playing test method.
According to an aspect of the embodiments of the present application, there is provided a computer program product including computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device performs the above video playing test method.
The technical scheme provided by the embodiment of the application can bring the following beneficial effects:
the method comprises the steps of obtaining visual information and auditory information which can be perceived by a user in the process of playing a video by a player and data stream information which reflects the data processing quality of a target player, determining the playing quality characteristics of the video in at least two information modes, generating an audio-visual quality test result which can represent the transmission quality of the audio-visual information by utilizing the playing quality characteristics in the at least two information modes, fully mining the playing quality information of media contents in different information modes, ensuring that the test result meets the subjective evaluation of the user, avoiding the problem of inaccurate test result caused by single information dimension, improving the accuracy of audio-video quality test, determining the video playing quality without the help of manual observation, and reducing the labor cost required by the audio-video quality test.
Drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of an application execution environment provided by one embodiment of the present application;
FIG. 2 is a first flowchart of a method for testing video playback according to an embodiment of the present application;
FIG. 3 is a second flowchart of a method for testing video playback according to an embodiment of the present application;
FIG. 4 is a schematic diagram of determining video playback quality indicator data;
FIG. 5 is a schematic diagram of determining audio playback quality indicator data;
FIG. 6 is a schematic diagram of determining image quality indicator data;
FIG. 7 is a schematic diagram of determining text quality indicator data;
FIG. 8 is a schematic diagram of a multi-modal quality testing network;
FIG. 9 is a schematic diagram of an audio and video quality testing flow in the process of a terminal playing a video;
FIG. 10 is a block diagram of a video playback testing apparatus according to an embodiment of the present application;
fig. 11 is a block diagram of a computer device according to an embodiment of the present application.
Detailed Description
Before the method embodiments of the present application are described, the application scenario and the related terms that may be involved are explained to facilitate understanding by those skilled in the art.
Human experience of the world is multi-modal: we see objects, hear sounds, feel textures, smell odors, and taste flavors. A modality is the manner in which something occurs or is experienced, and a research problem that involves several such manners is multi-modal in character. For artificial intelligence to make progress in understanding the world around humans, it needs to be able to interpret these multi-modal signals together.
In recent years, Internet audio and video data has grown rapidly, yet beyond plain text, richer data such as speech, images, and video has not been fully utilized and learned from; in the audio/video field, learning and application are largely confined to image information, and other information is underexploited. When an ordinary user watches a video, the evaluation of its quality is comprehensive and multi-faceted. A multi-modal neural network can learn concepts across different modalities and learn jointly from multi-modal content such as text, speech, images, and video, thereby integrating the information of the different modalities.
In a scene where a terminal plays a video, the terminal must finally present the video through decoding, rendering, post-processing, and other steps; given the differing hardware performance of different devices, the quality the user watches at the terminal differs from that of the video source. The quality of the finally presented video therefore needs to be evaluated and tested at the terminal, but an effective evaluation tool for video playing quality is lacking. On one hand, evaluating newly launched videos, different video formats, video terminal post-processing algorithms, and the content quality of videos depends on subjective human evaluation, which requires a large amount of manpower, wastes labor, has poor timeliness, and achieves low coverage. Even where an evaluation tool exists, it performs single-dimension evaluation such as image-only evaluation and cannot be fully applied to video scenes. On the other hand, quality problems in video playing such as definition, fluency, and screen corruption lack a monitoring mechanism.
Therefore, mining and evaluating the video in a multi-modal manner and introducing a multi-modal quality test model to effectively quantify the video playing quality allows the video quality to be evaluated more comprehensively and convincingly, and has high application value in many playing scenes. First, evaluating video viewing quality helps assess the effect of player algorithms and strategies; second, evaluating video content helps screen out better videos for prioritized recommendation, which is particularly valuable for scenes with high timeliness requirements such as short videos and live broadcasts.
The video playing test method provided by the embodiment of the application relates to an artificial intelligence technology, and the following brief description is provided to facilitate understanding by those skilled in the art.
Artificial Intelligence (AI) is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines can perceive, reason, and make decisions.
Artificial intelligence is a comprehensive discipline covering a wide range of fields, at both the hardware level and the software level. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, machine learning/deep learning, automatic driving, and intelligent transportation.
Computer Vision (CV) is a science that studies how to make machines "see": using cameras and computers in place of human eyes to recognize, track, and measure targets, and further performing image processing so that the computer produces images more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can capture information from images or multi-dimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping, automatic driving, and intelligent transportation, as well as common biometric technologies such as face recognition and fingerprint recognition.
Machine Learning (ML) is a multi-disciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specializes in studying how a computer can simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
With the research and progress of artificial intelligence technology, it has been studied and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicles, robots, smart medical care, smart customer service, Internet of Vehicles, and intelligent transportation. In the method embodiments of the present application, artificial intelligence technology is used to test and evaluate video playing quality based on media content of multiple information modalities, realizing multi-modal audio/video quality evaluation.
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Referring to fig. 1, a schematic diagram of an application execution environment according to an embodiment of the present application is shown. The application execution environment may include: a terminal 10 and a server 20.
The terminal 10 includes, but is not limited to, a mobile phone, a computer, an intelligent voice interaction device, an intelligent appliance, a vehicle-mounted terminal, a game console, an electronic book reader, a multimedia playing device, a wearable device, and other electronic devices. A client of the application may be installed in the terminal 10.
In the embodiment of the present application, the application may be any application capable of providing a video playing service. Typically, the application is a video-type application. Of course, besides video applications, other types of applications may provide video playing services. For example, the application may be a news application, a social interaction application, an interactive entertainment application, a browser application, a shopping application, a content sharing application, a Virtual Reality (VR) application, an Augmented Reality (AR) application, and the like, which is not limited in this embodiment. In addition, for different applications, the video playing services related to the applications may also be different, and the corresponding functions may also be different, which may be configured in advance according to actual requirements, and this is not limited in this embodiment of the application. Optionally, a client of the above application program runs in the terminal 10.
The server 20 is used to provide background services for clients of applications in the terminal 10. For example, the server 20 may be a backend server for the application described above. The server 20 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a web service, cloud communication, a middleware service, a domain name service, a security service, a CDN (Content Delivery Network), a big data and artificial intelligence platform. Optionally, the server 20 provides background services for applications in multiple terminals 10 simultaneously.
Alternatively, the terminal 10 and the server 20 may communicate with each other through the network 30. The terminal 10 and the server 20 may be directly or indirectly connected through wired or wireless communication, and the present application is not limited thereto.
Please refer to FIG. 2, which shows a first flowchart of a video playing test method according to an embodiment of the present application. The method can be applied to a computer device, i.e., an electronic device with data computation and processing capabilities; for example, the execution subject of each step may be the terminal 10 in the application execution environment shown in FIG. 1. The method may include the following steps (210-230).
Step 210, video playing information corresponding to the target video is obtained.
The target video includes, but is not limited to, offline video, online video, live video, short video, and the like, which is not limited in this embodiment of the application.
The video playing information comprises visual information and auditory information corresponding to the target video and data stream information corresponding to the target video played by the target player, and the data stream information is used for representing the processing quality of the data stream in the video playing process.
The video playing information refers to data associated with the playing quality of the target video. The video generation process mainly includes audio/video acquisition, audio/video encoding and compression, and audio/video encapsulation, yielding a video file in some video format. The embodiment of the present application does not limit the video format either.
In an exemplary embodiment, the playing process of a video mainly includes audio/video decapsulation, audio/video decoding, audio/video synchronization, and rendering. A video player performs these operations when playing the video, and the video playing information is generated in the process. Correspondingly, as shown in FIG. 3, which is a second flowchart of a video playing test method provided in an embodiment of the present application, the implementation of step 210 includes the following steps (211-213).
Step 211, in response to the video playing instruction, obtaining at least two data streams corresponding to the target video, where the at least two data streams include an original video frame data stream and an original audio data stream.
A video player mainly processes the three data streams of a target video: video, audio, and subtitles. As for subtitles, some videos have subtitles embedded in the images and need no additional external subtitles, so the target video corresponds to at least two data streams, namely the original video frame data stream and the original audio data stream; the original video frame data stream is used for rendering and displaying the video frame sequence, and the original audio data stream is used for rendering and playing the audio signal.
Step 212, performing image display processing on the original video frame data stream to obtain a video frame sequence displayed on the target page and data stream information corresponding to the video frame sequence.
The image display processing includes operations of decapsulating, decoding, audio and video synchronization, rendering and the like on the original video frame data stream, so that image information in the original video frame data stream is displayed in the target page.
The video frame sequence is used to characterize the visual information on the image modality. It is an image sequence formed by consecutive video frames obtained by the player decoding the original video frame data of the target video. A video frame, once decoded and displayed on the target page, is the visual information of the image modality that the user directly sees.
The data stream information corresponding to the video frame sequence includes at least one item of video playing quality index data capable of representing the playing quality of the target video in the video modality. The video playing quality index data in the data stream information is index data related to the data processing performance of the player, and includes, but is not limited to, the bitrate, frame rate, frame loss rate, fluency, decoding accuracy, and decoding time corresponding to the video frame data stream. The calculation of each index, and its role in measuring the playing quality of the video modality, is as follows (a small code sketch follows the list):
Bitrate: bitrate (kbps) = file size (KB) × 8 / duration (s).
Frame rate (fps): the number of frames displayed per second.
Frame loss rate: the terminal receives the original video data packets; assuming that the total number of frames is f and the number of video frames actually rendered on screen after decoding is r, the frame loss rate is (f - r)/f. The lower the frame loss rate, the fewer errors the player makes in decoding and rendering.
Fluency: determined from the frame rate and the frame loss rate; the more frames actually displayed per second while the target video plays, the higher the fluency.
Decoding time: the time taken to decode one video frame. For example, an HEVC source takes longer to decode than an ordinary source, and hardware decoding on typical devices takes less time than software decoding; decoding-time statistics help deploy decoder strategies across terminal device types.
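As an illustration of how such index data could be computed, the following is a minimal sketch in Python; the PlayStats structure and its field names are assumptions made for illustration and are not part of this application.

```python
from dataclasses import dataclass

@dataclass
class PlayStats:
    # Hypothetical playback statistics collected from the player.
    file_size_kb: float   # media file size in KB
    duration_s: float     # playing duration in seconds
    total_frames: int     # total frames f in the received stream
    rendered_frames: int  # frames r actually rendered on screen

def bitrate_kbps(s: PlayStats) -> float:
    # bitrate (kbps) = file size (KB) * 8 / duration (s)
    return s.file_size_kb * 8 / s.duration_s

def frame_rate_fps(s: PlayStats) -> float:
    # average number of frames displayed per second
    return s.rendered_frames / s.duration_s

def frame_loss_rate(s: PlayStats) -> float:
    # (f - r) / f: lower means fewer decode/render errors
    return (s.total_frames - s.rendered_frames) / s.total_frames

def fluency(s: PlayStats) -> float:
    # one simple proxy: the fraction of frames that survived
    # decoding and rendering
    return 1.0 - frame_loss_rate(s)
```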
In one example, FIG. 4 shows a schematic diagram of determining video playback quality indicator data. FIG. 4 depicts a film with file name H264_864_486.mp4, size 18.5 MB (megabytes), type MPEG-4 (Moving Picture Experts Group-4). When the film is played, the bitrate, frame loss rate, and decoding time of the client terminal actually playing it can be determined.
Step 213, performing audio playing processing on the original audio data stream to generate an audio signal and data stream information corresponding to the audio signal.
The audio playing processing comprises the operations of decapsulation, decoding, audio and video synchronization, rendering and the like on the original audio data stream, so that the audio information in the original audio data stream is transmitted to the user.
The audio signal is used to characterize auditory information. The audio signal is played after being decoded and rendered, and the audio signal is auditory information of an audio modality which can be intuitively heard by a user.
The data stream information corresponding to the audio signal includes at least one item of audio playing quality index data corresponding to the audio modality. The audio playing quality index data in the data stream information is index data associated with the data processing performance of the player, and includes, but is not limited to, the signal-to-noise ratio, fluency, short-time objective intelligibility, and volume corresponding to the audio data stream. Each index is calculated as follows (a small code sketch follows the list):
Signal-to-Noise Ratio (SNR): a conventional objective measure for speech enhancement under wideband noise distortion. Because both the clean speech signal and the noise signal must be known, SNR is mainly used in simulations of algorithms. It is the ratio of the average power of the speech signal to that of the noise signal over the entire time axis.
Fluency: analogous to the frame loss rate in video quality; assuming the total audio playing duration is t and the duration lost to dropped frames and stalls in the process is b, the fluency of a piece of audio is measured by (t - b)/t.
Short-Time Objective Intelligibility (STOI): ranges from 0 to 1; the larger the value, the higher the intelligibility.
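As an illustration, a minimal sketch of the SNR and audio fluency computations follows; STOI is in practice usually computed with an off-the-shelf implementation (for example the pystoi package), so it is omitted here. Function names and inputs are assumptions made for illustration.

```python
import numpy as np

def snr_db(clean: np.ndarray, noise: np.ndarray) -> float:
    # Ratio of the average power of the speech signal to that of the
    # noise signal over the entire time axis, expressed in decibels.
    # Requires both the clean speech and the noise signal to be known.
    return 10.0 * np.log10(np.mean(clean ** 2) / np.mean(noise ** 2))

def audio_fluency(total_s: float, stalled_s: float) -> float:
    # (t - b) / t, where t is the total playing duration and b is the
    # duration lost to dropped frames and stalls.
    return (total_s - stalled_s) / total_s
```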
In one example, as shown in fig. 5, a schematic diagram of determining audio playback quality indicator data is illustrated. For the audio signal in the target video, the signal-to-noise ratio, the fluency, the short-time objective understandable index and the volume corresponding to the audio signal can be determined.
In an exemplary embodiment, the at least two data streams further include an original text data stream. Accordingly, in order to process the original text data stream, as shown in fig. 3, the implementation process of step 210 further includes the following step 214.
Step 214, performing text display processing on the original text data stream to obtain text content displayed in the target page and data stream information corresponding to the text content.
The text content is used to characterize the visual information corresponding to the text modality. Text content in video includes, but is not limited to, text inside video frame images and the subtitles of the video. Subtitles display video content in text form, including dialogue between characters as well as descriptive language about the picture.
When the subtitles of the video are external subtitles, or video comment text exists, the at least two data streams further include an original text data stream in addition to the original video frame data stream and the original audio data stream; the original text data stream is used for displaying the text information of the target video.
The data stream information corresponding to the text content includes at least one item of text playing quality index data corresponding to the text modality; this index data may be data associated with the data processing performance of the player, including but not limited to the bitrate, fluency, and decoding time corresponding to the text data stream.
As described above, the audio/video content of the target video, which comprises information of multiple modalities, and the data stream processing information inside the player are both fully utilized, which ensures the accuracy of the audio/video quality test result.
In addition, the video playing information is determined by a separate process while the user watches the video, so the playing quality can be evaluated during viewing itself: multi-dimensional information such as video fluency, picture quality, sound quality, and video content receives the most realistic quality evaluation, which greatly helps developers judge the quality of decision algorithms and discover and solve existing problems.
Step 220, determining, according to the visual information, the auditory information, and the data stream information, the playing quality characteristic information corresponding to the target video on at least two information modalities.
The at least two information modalities include, but are not limited to, a video modality, an audio modality, an image modality, a text modality, and an image-text modality corresponding to the image modality and the text modality. The image-text modality is the fused information modality corresponding to the image modality and the text modality.
When a user actually watches a video, its quality is judged from multiple dimensions. For example, a video that plays smoothly and clearly but has no sound or subtitles at all is essentially unwatchable; yet video quality evaluation today is still largely single-angle, mostly implemented from the image perspective, and an effective quality evaluation system is lacking. By acquiring the playing quality indexes corresponding to each of several information modalities, video quality can be determined from multiple dimensions.
For any one of the at least two information modalities, the quality index data corresponding to that modality can be obtained from the video playing information, and the quality characteristic data of the target video on that modality is determined from it.
A video player mainly processes the three data streams of video, audio, and subtitles, so the information modalities associated with playing quality include the video, audio, image, and text modalities. The process can therefore be understood simply as multi-modal feature extraction: feature extraction for each modality is performed on the visual information, the auditory information, and the data stream information, and the quality feature data corresponding to these information modalities is determined.
In an exemplary embodiment, the visual information includes a video frame sequence corresponding to the target video, the auditory information includes an audio signal corresponding to the target video, and the data stream information includes the data stream information corresponding to the video frame sequence and the data stream information corresponding to the audio signal. Correspondingly, as shown in FIG. 3, the implementation of step 220 includes the following steps (221-223).
Step 221, determining the playing quality characteristic information corresponding to the target video in the video modality based on the data stream information corresponding to the video frame sequence.
The data stream information includes video playing quality index data associated with player processing performance, so that the video playing quality feature vector can be determined based on at least one video playing quality index data in the data stream information.
In a possible implementation manner, the at least one video playing quality index data may be spliced to obtain the video playing quality feature vector.
In another possible implementation manner, feature extraction processing may be performed on the at least one video playing quality indicator to obtain the video playing quality feature vector.
The video playing quality feature vector can be used as corresponding playing quality feature information of the target video in a video mode.
The playing quality characteristic information corresponding to the target video in the video modality can also be determined from the data stream information corresponding to the video frame sequence, the audio signal, and the text content together: all three kinds of data stream information represent the data processing performance of the player, and cross-modal feature fusion over them can characterize the quality feature information of the video modality.
Optionally, the playing quality index data in the data stream information corresponding to the video frame sequence, the audio signal, and the text content is spliced to obtain the video playing quality feature vector; or feature extraction is performed on that index data to obtain the video playing quality feature vector. A minimal sketch of the splicing variant follows.
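The following is a minimal sketch, in Python, of assembling index data into a feature vector by splicing; the dictionary key names are assumptions made for illustration.

```python
import numpy as np

def playback_quality_vector(stream_info: dict) -> np.ndarray:
    # One possible realization of "splicing" index data from the
    # data stream information into a fixed-order feature vector.
    keys = ["bitrate_kbps", "frame_rate_fps", "frame_loss_rate",
            "fluency", "decode_accuracy", "decode_time_ms"]
    return np.array([stream_info[k] for k in keys], dtype=np.float32)

def fused_stream_vector(video_v, audio_v, text_v):
    # Cross-stream variant: concatenate the index vectors of the
    # video, audio and text data streams.
    return np.concatenate([video_v, audio_v, text_v])
```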
Step 222, determining the corresponding playing quality characteristic information of the target video on the image modality based on the video frame sequence.
At least one item of image quality index data, used to characterize the propagation quality of image modality information in the video content, can be determined from the image data of the video frames in the video frame sequence.
Image quality is measured along different dimensions. For an artistic painting, aesthetics matter more than resolution or noise; for a medical image, the amount of image information and whether the patient's symptoms are revealed matter most; for a video cover image, what matters is whether it reflects the video's information to the greatest extent and whether it highlights important characters or plot scenes.
During video playing, the image content in the video frame sequence is varied, so the image quality index data that chiefly represents image quality includes, but is not limited to, relatively objective indexes such as definition, resolution, color saturation, texture blurriness, picture integrity, and aesthetic measure. This index data can be obtained by correspondingly analyzing the image data of each video frame in the video frame sequence.
In one example, as shown in fig. 6, a schematic diagram of determining image quality indicator data is illustrated. In fig. 6, 4 images with the same picture content but different tones are shown, and the 4 images may be 4 video frames in a video. When the terminal plays the 4 video frames, image quality index data such as definition, resolution, chroma, picture integrity and the like can be determined.
Based on the at least one image quality indicator data, an image quality feature vector may be determined. In a possible implementation manner, the at least one image quality index data may be spliced to obtain the image quality feature vector. In another possible implementation, the at least one image quality index data may be subjected to feature extraction processing to obtain the image quality feature vector.
The image quality characteristic vector is used for representing the corresponding playing quality characteristic information of the target video on the image modality.
Step 223, determining the playing quality characteristic information corresponding to the target video in the audio modality based on the audio signal and the data stream information corresponding to the audio signal.
In the video modality, the audio signal corresponding to each frame of image is not fixed, so the playing quality characteristic information of the video modality serves as an independent evaluation dimension; correspondingly, the playing quality characteristic information corresponding to the audio modality can also serve as an independent evaluation dimension.
At least one item of audio quality index data associated with the audio content can be determined from the content of the audio signal. Index data determined from the content itself is only weakly correlated with player performance and can be obtained by analyzing the audio signal, for example the type of the audio content or its audibility.
The data stream information corresponding to the audio signal includes audio quality index data associated with the performance of the player, such as the signal-to-noise ratio and fluency.
The audio quality index data may thus include both the index data determined from the content of the audio signal and the index data contained in the data stream information corresponding to the audio signal; the audio playing quality feature vector is determined from these.
In a possible implementation manner, the at least one audio content quality indicator data and the audio quality indicator data included in the data stream information corresponding to the audio signal may be spliced to obtain the audio playing quality feature vector.
In another possible implementation manner, the at least one audio content quality indicator data and the audio quality indicator data included in the data stream information corresponding to the audio signal may be subjected to feature extraction processing, so as to obtain the audio playing quality feature vector.
The audio playing quality feature vector can be used as playing quality feature information of the target video in the audio mode.
In an exemplary embodiment, the visual information further includes text content corresponding to the target video, and the data stream information further includes data stream information corresponding to the text content. Accordingly, as shown in FIG. 3, the implementation of step 220 further includes the following steps (224-225).
Step 224, determining the playing quality characteristic information corresponding to the target video in the text modality based on the text content and the data stream information corresponding to the text content.
Based on the text content and the data stream information corresponding to the text content, at least one item of text quality index data can be determined to characterize the propagation quality of text modality information. The text quality index data comprises quality index data that measures the text content itself and, from the data stream information corresponding to the text content, index data related to the data processing performance of the player.
For text in video such as subtitles, the indexes that can measure subtitle quality include completeness, definition, and text quality. Completeness represents whether the video contains subtitles and, for foreign-language episodes, whether multi-language subtitles exist. Definition includes subtitle definition and can represent the rendering effect of embedded and external subtitles; for example, a 1080P video rendering 270P subtitles produces a visibly inconsistent picture. Text quality can be determined from the actual text content or the way the text was generated; for example, subtitles produced by a subtitle team generally differ in quality from machine-translated subtitles.
In one example, as shown in FIG. 7, a diagram illustrating the determination of text quality indicator data is shown. Fig. 7 shows that the video frame image has the subtitle content "do you remember," and the terminal can detect the subtitle content in the video frame image and determine the integrity, definition and text quality of the subtitle.
Based on the at least one item of text quality index data, a text quality feature vector can be determined. In one possible implementation, the at least one item of text quality index data is spliced to obtain the text quality feature vector; in another, feature extraction is performed on the at least one item of text quality index data to obtain the text quality feature vector.
And step 225, performing characteristic information fusion processing on the playing quality characteristic information corresponding to the image modality and the playing quality characteristic information corresponding to the text modality to obtain the playing quality characteristic information corresponding to the target video in the image-text modality.
The image-text mode refers to a fusion information mode corresponding to an image mode and a text mode.
And carrying out feature fusion processing on the image quality feature vector and the text quality feature vector to obtain the image-text playing quality feature vector. The image-text playing quality characteristic vector is used for representing the playing quality characteristic information of the target video on the image-text mode.
A frame of image in a video often has associated text information, such as subtitles or bullet comments, so the features corresponding to the image modality and the text modality are strongly related; the embedded subtitle features and the image features can therefore be spliced as input to the subsequent playing quality test model. The feature fusion of the image quality feature vector and the text quality feature vector can use bilinear pooling: the outer product of the two vectors is computed, and the matrix generated by the outer product is linearized into a vector, which represents their fused feature, i.e., the image-text playing quality feature vector. A minimal sketch follows.
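The following is a minimal sketch, in Python, of the bilinear pooling fusion described above.

```python
import numpy as np

def bilinear_pool(image_feat: np.ndarray, text_feat: np.ndarray) -> np.ndarray:
    # Compute the outer product of the image quality feature vector and
    # the text quality feature vector, then linearize (flatten) the
    # resulting matrix into one vector: the fused image-text playing
    # quality feature vector.
    return np.outer(image_feat, text_feat).flatten()

# For example, a 64-dim image vector and a 16-dim text vector fuse
# into a 1024-dim image-text feature vector.
```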
Through the above process, multi-modal playing quality characteristic information is extracted, so that the quality characteristic information of the various modalities can be used to produce the audio-visual quality test result.
And step 230, generating an audio-visual quality test result corresponding to the target video played by the target player based on the playing quality characteristic information corresponding to the at least two information modalities.
The audiovisual quality test result is used for representing the audiovisual information transmission quality of the target video on the target player.
In an exemplary embodiment, the audio-visual quality test result includes a playing quality attribution score corresponding to the target video on each of the at least two information modalities; a playing quality attribution score represents the propagation quality of single-modality information in the target video.
Accordingly, as shown in fig. 3, the implementation of step 230 includes the following step 231.
Step 231, determining the playing quality attribution score corresponding to the target video on each information modality based on the playing quality characteristic information corresponding to each of the at least two information modalities.
After the playing quality characteristic information on the various information modalities is determined, multi-modal playing quality test pre-training over video, audio, text, image, image-text, and other modalities can be carried out, so that a quality evaluation model for each information modality is obtained.
In an exemplary embodiment, for the video modality, the playing quality characteristic information of the target video in the video modality, i.e., the video playing quality feature vector, is input to the video modality quality evaluation model, which outputs a video modality quality score. The video modality quality score is the playing quality attribution score corresponding to the target video on the video modality and represents the transmission quality of the dynamic information of the video modality in the target video.
The video modality quality evaluation model is a machine learning model trained with the playing quality characteristic information of sample videos on the video modality, i.e., video playing quality feature vectors, as training features and annotated scores as label information; it performs the playing quality test on video modality information and outputs the quantized test result corresponding to the video modality, i.e., the video modality quality score.
In an exemplary embodiment, for the audio modality, the playing quality characteristic information of the target video on the audio modality, i.e., the audio playing quality feature vector, is input to the audio modality quality evaluation model, which outputs an audio modality quality score. The audio modality quality score is the playing quality attribution score corresponding to the target video on the audio modality and represents the transmission quality of audio modality information in the target video.
The audio modality quality evaluation model is a machine learning model trained with the playing quality characteristic information of sample videos on the audio modality, i.e., audio playing quality feature vectors, as training features and annotated scores as label information; it performs the playing quality test on audio modality information and outputs the quantized test result corresponding to the audio modality, i.e., the audio modality quality score.
In an exemplary embodiment, for the image modality, the playing quality characteristic information of the target video on the image modality, i.e., the image quality feature vector, is input to the image modality quality evaluation model, which outputs an image modality quality score. The image modality quality score is the playing quality attribution score corresponding to the target video on the image modality and represents the transmission quality of image modality information in the target video.
The image modality quality evaluation model is a machine learning model trained with the playing quality characteristic information of sample videos on the image modality, i.e., image quality feature vectors, as training features and annotated scores as label information; it performs the playing quality test on image modality information and outputs the quantized test result corresponding to the image modality, i.e., the image modality quality score.
In an exemplary embodiment, for the text modality, the playing quality characteristic information of the target video in the text modality, i.e., the text quality feature vector, is input to the text modality quality evaluation model, which outputs a text modality quality score. The text modality quality score is the playing quality attribution score corresponding to the target video on the text modality and represents the transmission quality of text modality information in the target video.
The text modality quality evaluation model is a machine learning model trained with the playing quality characteristic information of sample videos on the text modality, i.e., text quality feature vectors, as training features and annotated scores as label information; it performs the playing quality test on text modality information and outputs the quantized test result corresponding to the text modality, i.e., the text modality quality score.
In an exemplary embodiment, for the image-text modality, the playing quality characteristic information of the target video in the image-text modality, i.e., the image-text playing quality feature vector, is input to the image-text modality quality evaluation model, which outputs an image-text modality quality score. The image-text modality quality score is the playing quality attribution score corresponding to the target video on the image-text modality and represents the transmission quality of the combined image-text information in the target video.
The image-text modality quality evaluation model is a machine learning model trained with the playing quality characteristic information of sample videos on the image-text modality, i.e., image-text playing quality feature vectors, as training features and annotated scores as label information; it performs the playing quality test on combined image-text information and outputs the quantized test result corresponding to the image-text modality, i.e., the image-text modality quality score.
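The description above specifies only the interface of these models: a quality feature vector in, a quantized quality score out, supervised by annotated scores. As one possible realization, the following is a minimal sketch assuming a small PyTorch MLP regressor trained with MSE loss; the architecture, layer sizes, and hyperparameters are assumptions, not part of this application.

```python
import torch
import torch.nn as nn

class ModalityQualityScorer(nn.Module):
    # One per-modality quality evaluation model: maps a playing
    # quality feature vector to a scalar quality score.
    def __init__(self, feat_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

# Supervised training: feature vectors of sample videos as input,
# human-annotated quality scores as labels.
model = ModalityQualityScorer(feat_dim=6)
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```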
In an exemplary embodiment, the audio-visual quality test result further includes an overall audio-visual quality score corresponding to the target video, which represents, as a whole, the audio-visual information propagation quality of the target video on the target player.
Accordingly, as shown in fig. 3, after the step 231, the method further includes the following step 232.
Step 232, fusing the playing quality attribution scores corresponding to the information modalities to obtain the overall audio-visual quality score.
In an exemplary embodiment, the image-text modality quality score, the audio modality quality score, and the video modality quality score are weighted and averaged to obtain the overall audio-visual quality score. Optionally, the weighting coefficient of each quality attribution score can be adjusted according to actual conditions; for example, the weighting coefficients of the three playing quality attribution scores may each be 1/3. Computing the overall score while also retaining the per-modality attribution scores provides a reference for video developers and makes it easy to optimize on whichever dimension scores poorly.
In an exemplary embodiment, when the image modality and the text modality are not fused, the audio modality quality score, the video modality quality score, the image modality quality score, and the text modality quality score may be weighted and averaged to obtain the overall audio-visual quality score. A minimal sketch of the weighted average follows.
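A minimal sketch of this weighted averaging, assuming scores on a common scale; the modality names and values are illustrative.

```python
def overall_av_score(scores, weights=None):
    # Weighted average of the per-modality playing quality attribution
    # scores; with no weights given, every modality counts equally
    # (e.g. 1/3 each for image-text, audio and video).
    if weights is None:
        weights = {k: 1.0 / len(scores) for k in scores}
    return sum(scores[k] * weights[k] for k in scores)

overall = overall_av_score({"image_text": 78.0, "audio": 85.0, "video": 90.0})
```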
In one example, FIG. 8 shows a schematic diagram of a multi-modal quality testing network. The playing data corresponding to the four modalities of speech, image, subtitle, and video in the playing process is input respectively to a speech network, an image network, a text network, and a video network; each network extracts the playing quality features of its modality's media content, i.e., the speech, image, text, and video features shown in the figure. Multi-modal weighted feature fusion is then performed on these features, so the video playing quality can be evaluated quantitatively from the feature scores. With this network, a terminal video playing quality test can integrate multi-dimensional features such as video, picture, audio, and text: the encoder feature network of each modality maps it into a unified semantic space, and a stable multi-modal representation is synthesized. A deployed multi-modal quality testing network can give quality evaluations for the different information modalities such as video, audio, picture, and subtitle, and also a comprehensive quality score, coming closer to the subjective quality evaluation of a user watching the video.
In another embodiment, cross-modal feature fusion can be performed directly on the playing quality characteristic information of the at least two information modalities to obtain modality-fused feature information.
Cross-modal feature fusion refers to fusing the playing quality characteristic information corresponding to different modalities and mapping it into the same feature space.
Given the semantic differences across modalities, label sharing can be applied to the playing quality features of the various information modalities. For example, if the audio modality and the text modality correspond to features xa and xs of different categories, the subtitle feature xs can be forced to share the same label k as the audio feature xa. Learning a joint embedding of cross-modal features reduces possible semantic differences between information modalities, captures task-related semantics, and promotes more knowledge transfer. Optionally, the label sharing rule shares labels according to the video playing time information corresponding to each modality's data; a label may be an annotated score, such as a user's playing quality score for the video content at a certain moment.
In a possible implementation manner, the video play quality feature vector, the audio play feature vector, the image quality feature vector and the text quality feature vector are subjected to multi-modal fusion to obtain the modal fusion feature data. The vector fusion modes include, but are not limited to, superposition, splicing (concatenation), combined embedding, and the like.
Specifically, the four feature vectors may be spliced to obtain the modal fusion feature vector; or superposed to obtain the modal fusion feature vector; or combined and embedded to obtain the modal fusion feature vector. The combined-embedding process may be implemented by a multi-modal knowledge transfer learning model.
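The difference between these fusion modes can be illustrated with NumPy; the 128-dimensional feature size is an arbitrary assumption:

```python
import numpy as np

video_f, audio_f, image_f, text_f = (np.random.rand(128) for _ in range(4))

# Splicing (concatenation): keeps every modality's own dimensions.
fused_concat = np.concatenate([video_f, audio_f, image_f, text_f])  # (512,)

# Superposition (element-wise addition): requires a common dimension.
fused_sum = video_f + audio_f + image_f + text_f                    # (128,)

# Combined embedding: each vector would first pass through a learned
# encoder before merging (see the network sketch after FIG. 8 above).
```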
The overall audio-visual quality score corresponding to the target video is then determined based on the modal fusion feature information.
In one possible embodiment, the modal fusion feature information, i.e., the modal fusion feature vector, is input to a multi-modal quality test model, which outputs the overall audio-visual quality score corresponding to the target video. Optionally, the multi-modal quality test model is a machine learning model obtained by distillation training of a pre-trained model on the sample features of each modality's quality test model and the shared labels, the shared labels being the labels forced to be shared as described above.
By semantically fusing the extracted multi-dimensional play quality features of the different modalities, the knowledge each modality has learned into the modal fusion features can be distilled together into one overall system, and a comprehensive quality evaluation result for the video playing can then be obtained from this fused feature knowledge.
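As a hedged sketch of such distillation training — the MSE loss form and the weighting coefficient alpha are assumptions, since the application does not fix the loss — the multi-modal student model can be fitted jointly to the shared labels and to the per-modality teacher models:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_pred, teacher_preds, shared_labels, alpha=0.5):
    """student_pred:  (B,) scores from the multi-modal quality test model.
    teacher_preds:    list of (B,) scores from the per-modality models.
    shared_labels:    (B,) labels forced to be shared across modalities."""
    hard = F.mse_loss(student_pred, shared_labels)           # fit shared labels
    soft = torch.stack([F.mse_loss(student_pred, t.detach())
                        for t in teacher_preds]).mean()      # mimic teachers
    return alpha * hard + (1.0 - alpha) * soft
```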
In one example, FIG. 9 exemplarily shows a schematic diagram of an audio and video quality testing flow while a terminal plays a video. A user plays a video with player software installed on the terminal; once playback starts, the terminal runs the multi-modal quality testing process in the background. The multi-modal quality testing system can evaluate video playing along multiple dimensions, including different types of film sources, newly launched terminal-side video processing algorithms, playing quality, video content quality, and so on. During the test, quality-feature knowledge is extracted for each of the video, image, audio and subtitle modalities from its corresponding network; the features are then synthesized, embedded and input to the multi-modal network to obtain a video score representing the audio-visual knowledge, i.e., the overall playing quality score. This video score can be understood as a quantitative quality test result for the player's strategy. Optionally, the multi-modal network is a neural network model obtained by machine learning training based on a multi-modal pre-trained model. For the video playing quality test, the multi-modal pre-trained model evaluates the quality of the multi-dimensional information content contained in the video, such as video, image, text and speech, and fuses the audio-visual knowledge of the multi-modal information, so that the test score comes closer to the subjective quality evaluation of users watching the video.
In an exemplary embodiment, as shown in fig. 3, after step 230 described above, the method further includes step 240 described below.
Step 240: generating, based on the audio-visual quality test result, policy adjustment information for the data processing policy information configured for the target player.
The data processing policy information includes at least one data processing policy for the data streams, and the policy adjustment information is used to adjust the data processing policy.
The player is provided with a data processing policy library for processing each data stream. The data processing policy library includes data processing policies for the information of each modality.
For the video modality, decoders for different encoding formats are configured in the player, such as decoders corresponding to the H.265 and H.264 encoding modes. Data processing policies for different decoding modes can also be configured, such as the hard-decoding policies MediaCodec and VideoToolbox and the soft-decoding policy FFmpeg. MediaCodec is the hardware codec framework of the Android operating system; VideoToolbox is the hardware-accelerated video processing framework of the iOS platform; FFmpeg is an open-source suite of programs for recording, converting and streaming digital audio and video.
For the image modality, different image processing policies are configured in the player, such as super-resolution, super frame rate, color-blind mode, HDR (High Dynamic Range) and VR image processing policies.
For the audio modality, different audio processing policies are configured in the player, such as denoising, echo cancellation, sound-effect processing, amplification/enhancement, and mixing/separation.
For the text modality, different text processing policies are configured in the player, such as font policies and embedded/external subtitle policies.
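For illustration, such a policy library might be laid out as follows; the dictionary shape and key names are assumptions, while the policy names come from the preceding paragraphs:

```python
# Illustrative layout of the per-modality data processing policy library;
# structure and keys are assumed, not specified by this application.
POLICY_LIBRARY = {
    "video": {
        "decoders": ["H.265", "H.264"],
        "decode_modes": ["MediaCodec", "VideoToolbox", "FFmpeg"],
    },
    "image": ["super_resolution", "super_frame_rate",
              "color_blind_mode", "HDR", "VR"],
    "audio": ["denoise", "echo_cancellation", "sound_effects",
              "amplify_enhance", "mix_separate"],
    "text": ["font", "embedded_subtitles", "external_subtitles"],
}
```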
The data processing policy for the data streams of the currently played video may be a policy set by the user on the player's settings page, such as an operation of selecting an image special effect, or an operation performed by the user in the control bar of the video playing page, such as setting a playback speed or a definition level.
The audio-visual quality test result reflects the quality of the data processing policies the player uses to play the video, so policy adjustment information can be generated from the test result to adjust those policies. For example, after the user enables high-speed playback, the heavy data computation load may lower the video playing quality; policy adjustment information can then be generated from the audio-visual quality test result to prompt the user to restore normal-speed playback.
The player can also automatically adjust the currently used data processing policy according to the policy adjustment information.
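A hypothetical sketch of turning a low audio-visual quality test result into policy adjustment information; the 5-point scale, the threshold and the suggested adjustments are illustrative assumptions:

```python
# Hypothetical sketch: derive user prompts and automatic adjustments
# from a low overall audio-visual quality score.
def make_policy_adjustment(av_score, active_policies, threshold=3.0):
    if av_score >= threshold:
        return None  # playing quality acceptable: keep current policies
    adjustment = {"prompt_user": [], "auto_adjust": []}
    if active_policies.get("playback_speed", 1.0) > 1.0:
        # High-speed playback raises computing pressure; suggest normal speed.
        adjustment["prompt_user"].append("restore normal playback speed")
    if "super_resolution" in active_policies.get("image", []):
        adjustment["auto_adjust"].append(("image", "disable super_resolution"))
    return adjustment

# Example: a 2.4 score while playing at 2x speed with super-resolution on.
print(make_policy_adjustment(2.4, {"playback_speed": 2.0,
                                   "image": ["super_resolution"]}))
```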
In an online video service scenario, the testing process can be executed synchronously while the video plays; based on the multi-modal quality test model, the playing quality of online video can be evaluated in real time so that problems are found promptly. After obtaining the video playing test result, the terminal can upload it to the server to help developers analyze it. Developers can then optimize accordingly, for example the technical architecture design, codec selection, streaming media protocol, adaptive algorithms, connection and stall logic, and client software design.
In summary, in the technical solution provided by the embodiments of the present application, the visual information and auditory information perceivable by the user while the player plays the video, together with the data stream information reflecting the data processing quality of the target player, are obtained; from these, the play quality features of the video in at least two information modalities are determined, and those features are used to generate an audio-visual quality test result capable of representing the propagation quality of the audio-visual information. The play quality information of the media content in the different information modalities is thus fully mined, ensuring that the test result accords with users' subjective evaluation and avoiding the inaccurate results caused by a single information dimension, which improves the accuracy of the audio and video quality test. Moreover, since the video playing quality can be determined without manual observation, the labor cost of audio and video quality testing is reduced.
The following are embodiments of the apparatus of the present application that may be used to perform embodiments of the method of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the embodiments of the method of the present application.
Referring to fig. 10, a block diagram of a video playing test apparatus according to an embodiment of the present application is shown. The apparatus has the function of implementing the video playing test method; the function may be implemented by hardware, or by hardware executing corresponding software. The apparatus may be a computer device, or may be provided in a computer device. The apparatus 1000 may include: a playing information obtaining module 1010, a quality characteristic determining module 1020 and a test result generating module 1030.
The playing information acquiring module 1010 is configured to acquire video playing information corresponding to a target video, where the video playing information includes visual information and auditory information corresponding to the target video, and data stream information corresponding to the target video played by a target player, and the data stream information is used to represent processing quality of a data stream in a video playing process.
A quality characteristic determining module 1020, configured to determine, according to the visual information, the auditory information, and the data stream information, play quality characteristic information corresponding to the target video in at least two information modalities.
A test result generating module 1030, configured to generate, based on the play quality feature information corresponding to the at least two information modalities, an audiovisual quality test result corresponding to the target video played by the target player, where the audiovisual quality test result is used to characterize audiovisual information propagation quality of the target video corresponding to the target player.
In an exemplary embodiment, the playing information obtaining module 1010 includes: a data stream acquiring unit, a video frame stream processing unit and an audio stream processing unit.
The data stream acquiring unit is configured to acquire, in response to a video playing instruction, at least two data streams corresponding to the target video, the at least two data streams including an original video frame data stream and an original audio data stream.
The video frame stream processing unit is configured to perform image display processing on the original video frame data stream to obtain a video frame sequence displayed on a target page and data stream information corresponding to the video frame sequence, the video frame sequence being used to represent visual information in the image modality.
The audio stream processing unit is configured to perform audio playing processing on the original audio data stream to generate an audio signal and data stream information corresponding to the audio signal, the audio signal being used to represent the auditory information.
In an exemplary embodiment, the at least two data streams further include an original text data stream, and the playing information obtaining module 1010 further includes: a text stream processing unit.
The text stream processing unit is configured to perform text display processing on the original text data stream to obtain text content displayed in the target page and data stream information corresponding to the text content, the text content being used to represent visual information corresponding to the text modality.
In an exemplary embodiment, the audio-visual quality test result includes a play quality attribution score of the target video corresponding to each of the at least two information modalities, the play quality attribution score being used to characterize the propagation quality of single-modality information in the target video; the test result generating module 1030 includes: an attribution score determining unit.
The attribution score determining unit is configured to determine the play quality attribution score of the target video corresponding to each information modality based on the play quality characteristic information corresponding to each of the at least two information modalities.
In an exemplary embodiment, the audio-visual quality test result further includes an overall audio-visual quality score corresponding to the target video, the overall audio-visual quality score being used to represent, as a whole, the audio-visual information propagation quality of the target video on the target player; the test result generating module 1030 further includes: an overall score determining unit.
The overall score determining unit is configured to fuse the play quality attribution scores corresponding to the respective information modalities to obtain the overall audio-visual quality score.
In an exemplary embodiment, the visual information includes a sequence of video frames corresponding to the target video, the auditory information includes an audio signal corresponding to the target video, and the data stream information includes data stream information corresponding to the sequence of video frames and data stream information corresponding to the audio signal;
the quality characteristic determining module 1020 includes: a video modality characteristic determining unit, an image modality characteristic determining unit and an audio modality characteristic determining unit.
The video modality characteristic determining unit is configured to determine the play quality characteristic information of the target video corresponding to the video modality based on the data stream information corresponding to the video frame sequence.
The image modality characteristic determining unit is configured to determine the play quality characteristic information of the target video corresponding to the image modality based on the video frame sequence.
The audio modality characteristic determining unit is configured to determine the play quality characteristic information of the target video corresponding to the audio modality based on the audio signal and the data stream information corresponding to the audio signal.
In an exemplary embodiment, the visual information further includes text content corresponding to the target video, the data stream information further includes data stream information corresponding to the text content, and the quality characteristic determining module 1020 further includes: a text modality characteristic determining unit and an image-text modality characteristic determining unit.
The text modality characteristic determining unit is configured to determine the play quality characteristic information of the target video corresponding to the text modality based on the text content and the data stream information corresponding to the text content.
The image-text modality characteristic determining unit is configured to perform characteristic information fusion processing on the play quality characteristic information corresponding to the image modality and the play quality characteristic information corresponding to the text modality to obtain the play quality characteristic information of the target video corresponding to the image-text modality, the image-text modality being the fused information modality corresponding to the image modality and the text modality.
In an exemplary embodiment, the apparatus 1000 further includes: a playing policy adjusting module.
The playing policy adjusting module is configured to generate, based on the audio-visual quality test result, policy adjustment information for the data processing policy information configured for the target player;
wherein the data processing policy information includes at least one data processing policy for the data streams, and the policy adjustment information is used to adjust the data processing policy.
In summary, in the technical solution provided by the embodiments of the present application, the visual information and auditory information perceivable by the user while the player plays the video, together with the data stream information reflecting the data processing quality of the target player, are obtained; from these, the play quality features of the video in at least two information modalities are determined, and those features are used to generate an audio-visual quality test result capable of representing the propagation quality of the audio-visual information. The play quality information of the media content in the different information modalities is thus fully mined, ensuring that the test result accords with users' subjective evaluation and avoiding the inaccurate results caused by a single information dimension, which improves the accuracy of the audio and video quality test. Moreover, since the video playing quality can be determined without manual observation, the labor cost of audio and video quality testing is reduced.
It should be noted that, when the apparatus provided in the foregoing embodiments implements its functions, the division into the functional modules described above is merely illustrative; in practical applications, the functions may be assigned to different functional modules as needed, i.e., the internal structure of the apparatus may be divided into different functional modules to implement all or part of the functions described above. In addition, the apparatus embodiments and the method embodiments provided above belong to the same concept; for details of their specific implementation, refer to the method embodiments, which are not repeated here.
Referring to fig. 11, a block diagram of a computer device according to an embodiment of the present application is shown. The computer device may be a terminal, and is configured to implement the video playing test method provided in the above embodiments. Specifically:
generally, the computer device 1100 includes: a processor 1101 and a memory 1102.
Processor 1101 may include one or more processing cores, such as a 4-core processor, an 8-core processor, or the like. The processor 1101 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 1101 may also include a main processor and a coprocessor, where the main processor is a processor for Processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 1101 may be integrated with a GPU (Graphics Processing Unit) that is responsible for rendering and drawing the content that the display screen needs to display. In some embodiments, the processor 1101 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
Memory 1102 may include one or more computer-readable storage media, which may be non-transitory. Memory 1102 can also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 1102 is used to store at least one instruction, at least one program, set of codes, or set of instructions configured to be executed by one or more processors to implement the above-described method of testing for video playback.
In some embodiments, the computer device 1100 may also optionally include: a peripheral interface 1103 and at least one peripheral. The processor 1101, memory 1102 and peripheral interface 1103 may be connected by a bus or signal lines. Various peripheral devices may be connected to the peripheral interface 1103 by buses, signal lines, or circuit boards. Specifically, the peripheral device includes: at least one of radio frequency circuitry 1104, touch screen display 1105, camera assembly 1106, audio circuitry 1107, positioning assembly 1108, and power supply 1109.
Those skilled in the art will appreciate that the configuration illustrated in FIG. 11 does not constitute a limitation of the computer device 1100, and may include more or fewer components than those illustrated, or may combine certain components, or may employ a different arrangement of components.
In an exemplary embodiment, there is also provided a computer readable storage medium having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions which, when executed by a processor, implement the above-described method of testing video playback.
Optionally, the computer-readable storage medium may include: ROM (Read Only Memory), RAM (Random Access Memory), SSD (Solid State drive), or optical disc. The Random Access Memory may include a ReRAM (resistive Random Access Memory) and a DRAM (Dynamic Random Access Memory).
In an exemplary embodiment, a computer program product or computer program is also provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions to enable the computer device to execute the video playing test method.
It should be understood that reference to "a plurality" herein means two or more. "and/or" describes the association relationship of the associated objects, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. In addition, the step numbers described herein only exemplarily show one possible execution sequence among the steps, and in some other embodiments, the steps may also be executed out of the numbering sequence, for example, two steps with different numbers are executed simultaneously, or two steps with different numbers are executed in a reverse order to the order shown in the figure, which is not limited by the embodiment of the present application. The above description is only exemplary of the present application and should not be taken as limiting the present application, and any modifications, equivalents, improvements and the like that are made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (10)

1. A method for testing video playing, the method comprising:
acquiring video playing information corresponding to a target video, wherein the video playing information comprises visual information and auditory information corresponding to the target video and data stream information corresponding to the target video played by a target player, and the data stream information is used for representing the processing quality of a data stream in the video playing process;
according to the visual information, the auditory information and the data stream information, determining corresponding playing quality characteristic information of the target video on at least two information modalities;
and generating an audio-visual quality test result corresponding to the target video played by the target player based on the playing quality characteristic information corresponding to the at least two information modes, wherein the audio-visual quality test result is used for representing the audio-visual information transmission quality of the target video corresponding to the target player.
2. The method according to claim 1, wherein the obtaining video playing information corresponding to the target video includes:
responding to a video playing instruction, and acquiring at least two data streams corresponding to the target video, wherein the at least two data streams comprise an original video frame data stream and an original audio data stream;
performing image display processing on the original video frame data stream to obtain a video frame sequence displayed on a target page and data stream information corresponding to the video frame sequence, wherein the video frame sequence is used for representing visual information on an image modality;
and carrying out audio playing processing on the original audio data stream to generate an audio signal and data stream information corresponding to the audio signal, wherein the audio signal is used for representing the auditory information.
3. The method according to claim 2, wherein the at least two data streams further include an original text data stream, and after the at least two data streams corresponding to the target video are obtained in response to the video playing instruction, the method further includes:
and performing text display processing on the original text data stream to obtain text content displayed in the target page and data stream information corresponding to the text content, wherein the text content is used for representing visual information corresponding to a text mode.
4. The method according to claim 1, wherein the audiovisual quality test result comprises a playback quality attribution score of the target video corresponding to each of the at least two information modalities, and the playback quality attribution score is used for characterizing the propagation quality of single-modality information in the target video;
generating an audio-visual quality test result corresponding to the target video played by the target player based on the playing quality characteristic information corresponding to the at least two information modalities, including:
and determining the corresponding play quality attribution of the target video on each information modality based on the corresponding play quality characteristic information on each information modality of the at least two information modalities.
5. The method according to claim 4, wherein the audiovisual quality test result further includes an audiovisual quality overall score corresponding to the target video, and the audiovisual quality overall score is used to represent audiovisual information propagation quality corresponding to the target video on the target player as a whole;
after the determining the playback quality attribution score of the target video corresponding to each information modality based on the play quality characteristic information corresponding to each information modality of the at least two information modalities, the method further comprises:
fusing the playback quality attribution scores corresponding to the respective information modalities to obtain the audiovisual quality overall score.
6. The method of claim 1, wherein the visual information comprises a sequence of video frames corresponding to the target video, wherein the auditory information comprises an audio signal corresponding to the target video, and wherein the data stream information comprises data stream information corresponding to the sequence of video frames and data stream information corresponding to the audio signal;
the determining, according to the visual information, the auditory information, and the data stream information, the playing quality feature information corresponding to the target video in at least two information modalities includes:
determining corresponding playing quality characteristic information of the target video on a video modality based on data stream information corresponding to the video frame sequence;
determining corresponding playing quality characteristic information of the target video on an image modality based on the video frame sequence;
and determining corresponding playing quality characteristic information of the target video on an audio modality based on the audio signal and the data stream information corresponding to the audio signal.
7. The method according to claim 6, wherein the visual information further includes text content corresponding to the target video, the data stream information further includes data stream information corresponding to the text content, and the determining, according to the visual information, the auditory information and the data stream information, the play quality characteristic information of the target video corresponding to at least two information modalities further comprises:
determining playing quality characteristic information corresponding to the target video on a text mode based on the text content and the data stream information corresponding to the text content;
and performing feature information fusion processing on the playing quality feature information corresponding to the image modality and the playing quality feature information corresponding to the text modality to obtain the playing quality feature information corresponding to the target video in the image-text modality, wherein the image-text modality refers to a fusion information modality corresponding to the image modality and the text modality.
8. The method according to any one of claims 1 to 7, further comprising:
generating policy adjustment information for the data processing policy information configured for the target player based on the audiovisual quality test result;
wherein the data processing policy information includes at least one data processing policy for the data flow, and the policy adjustment information is used to adjust the data processing policy.
9. A device for testing video playback, the device comprising:
the playing information acquisition module is used for acquiring video playing information corresponding to a target video, wherein the video playing information comprises visual information and auditory information corresponding to the target video and data stream information corresponding to the target video played by a target player, and the data stream information is used for representing the processing quality of a data stream in the video playing process;
the quality characteristic determining module is used for determining corresponding playing quality characteristic information of the target video on at least two information modalities according to the visual information, the auditory information and the data stream information;
and the test result generation module is used for generating an audio-visual quality test result corresponding to the target video played by the target player based on the playing quality characteristic information corresponding to the at least two information modalities, wherein the audio-visual quality test result is used for representing the audio-visual information transmission quality corresponding to the target video on the target player.
10. A computer device comprising a processor and a memory, the memory having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by the processor to implement a method of testing video playback as claimed in any of claims 1 to 8.
CN202111275743.6A 2021-10-29 2021-10-29 Video playing test method, device and equipment Pending CN114339197A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111275743.6A CN114339197A (en) 2021-10-29 2021-10-29 Video playing test method, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111275743.6A CN114339197A (en) 2021-10-29 2021-10-29 Video playing test method, device and equipment

Publications (1)

Publication Number Publication Date
CN114339197A true CN114339197A (en) 2022-04-12

Family

ID=81045551

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111275743.6A Pending CN114339197A (en) 2021-10-29 2021-10-29 Video playing test method, device and equipment

Country Status (1)

Country Link
CN (1) CN114339197A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114979737A (en) * 2022-05-17 2022-08-30 西安超涌现科技有限公司 Multimode collaborative display video playing system
CN114979737B (en) * 2022-05-17 2024-03-19 西安超涌现科技有限公司 Video playing system with multi-mode collaborative display

Similar Documents

Publication Publication Date Title
CN107979763B (en) Virtual reality equipment video generation and playing method, device and system
CN111901598B (en) Video decoding and encoding method, device, medium and electronic equipment
US20220392224A1 (en) Data processing method and apparatus, device, and readable storage medium
US20230041730A1 (en) Sound effect adjustment
Chao et al. Audio-visual perception of omnidirectional video for virtual reality applications
CN109074678A (en) A kind of processing method and processing device of information
CN113569892A (en) Image description information generation method and device, computer equipment and storage medium
CN111615002B (en) Video background playing control method, device and system and electronic equipment
CN112272327B (en) Data processing method, device, storage medium and equipment
CN111372141B (en) Expression image generation method and device and electronic equipment
CN112785669B (en) Virtual image synthesis method, device, equipment and storage medium
WO2023045635A1 (en) Multimedia file subtitle processing method and apparatus, electronic device, computer-readable storage medium, and computer program product
Sassatelli et al. New interactive strategies for virtual reality streaming in degraded context of use
WO2023197749A1 (en) Background music insertion time point determining method and apparatus, device, and storage medium
CN114339197A (en) Video playing test method, device and equipment
Zhu et al. Audio-visual saliency for omnidirectional videos
CN114445755A (en) Video quality evaluation method, device, equipment and storage medium
CN113762056A (en) Singing video recognition method, device, equipment and storage medium
CN115619882B (en) Video compression method
CN116962741A (en) Sound and picture synchronization detection method and device, computer equipment and storage medium
CN111757173B (en) Commentary generation method and device, intelligent sound box and storage medium
CN116561294A (en) Sign language video generation method and device, computer equipment and storage medium
CN114697741A (en) Multimedia information playing control method and related equipment
Costa et al. Deep Learning Approach for Seamless Navigation in Multi-View Streaming Applications
CN112672151B (en) Video processing method, device, server and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination