CN114422743A

CN114422743A - Video stream display method, device, computer equipment and storage medium

Info

Publication number: CN114422743A
Application number: CN202111583153.XA
Authority: CN
Inventors: 余力丛; 于勇
Original assignee: Huizhou Shiwei New Technology Co Ltd
Current assignee: Huizhou Shiwei New Technology Co Ltd
Priority date: 2021-12-22
Filing date: 2021-12-22
Publication date: 2022-04-29
Anticipated expiration: 2041-12-22
Also published as: CN114422743B

Abstract

The embodiment of the present application discloses a video stream display method, device, computer equipment and storage medium; the embodiment of the present application acquires multiple video streams and sound source positions of the current scene, and an image acquisition area corresponding to each of the video streams; The position of the sound source and the image acquisition area are determined from the multi-channel video streams to determine a target video stream; according to the target video stream, a target object is identified, and the target object is an object with lip movements; According to the identification result of the target object, the video stream to be displayed is determined from the multi-channel video streams, and the picture corresponding to the video stream to be displayed is displayed. In the embodiment of the present application, the target video stream used for identifying the speaker is determined by the position of the sound source, which can improve the efficiency of identifying the speaker, and at the same time, the video stream to be displayed can be determined according to the identification result, so that the displayed picture can be focused on the speaker. to present a better meeting picture.

Description

Video stream display method, apparatus, computer equipment and storage medium

技术领域technical field

本申请涉及计算机技术领域，具体涉及视频流显示方法、装置、计算机设备和存储介质。The present application relates to the field of computer technology, and in particular, to a video stream display method, apparatus, computer device, and storage medium.

背景技术Background technique

随着视频技术的发展，越来越多的场合会通过摄像头实时采集现场画面并进行播放。但是在多人参与的场景中，摄像头捕捉的画面通常都无法凸显当前场景中的重点。With the development of video technology, more and more occasions will capture and play live pictures in real time through cameras. However, in scenes with multiple people participating, the images captured by the cameras usually fail to highlight the key points in the current scene.

尤其是在多人会议的场景，会议进程中常会有多个不同的发言人，如何使显示的画面聚焦在发言人身上，呈现更好的会议画面，是当前亟需解决的问题。Especially in the scenario of a multi-person conference, there are often multiple different speakers during the conference process. How to focus the displayed image on the speaker and present a better conference image is a problem that needs to be solved urgently.

发明内容SUMMARY OF THE INVENTION

本申请实施例提供视频流显示方法、装置、计算机设备和存储介质，可以使显示的画面聚焦在发言人身上，呈现更好的会议画面。Embodiments of the present application provide a video stream display method, apparatus, computer device, and storage medium, which can make the displayed picture focus on the speaker and present a better conference picture.

本申请实施例提供一种视频流显示方法，包括：获取当前场景的多路视频流以及声源位置，每个所述视频流对应的一个图像采集区域；根据所述声源位置以及所述图像采集区域，从所述多路视频流中确定目标视频流；根据所述目标视频流，识别目标对象，所述目标对象为有嘴唇动作的对象；根据对所述目标对象的识别结果，从所述多路视频流中确定待显示的视频流；显示所述待显示的视频流对应的画面。An embodiment of the present application provides a video stream display method, including: acquiring multiple video streams and sound source positions of a current scene, and an image acquisition area corresponding to each of the video streams; according to the sound source position and the image The acquisition area is to determine the target video stream from the multi-channel video streams; according to the target video stream, identify the target object, and the target object is an object with lip movements; according to the recognition result of the target object, from the Determining a video stream to be displayed among the multiple video streams; and displaying a picture corresponding to the video stream to be displayed.

本申请实施例还提供一种视频流显示装置，包括：获取单元，用于获取当前场景的多路视频流以及声源位置，每个所述视频流对应的一个图像采集区域；第一确定单元，用于根据所述声源位置以及所述图像采集区域，从所述多路视频流中确定目标视频流；识别单元，用于根据所述目标视频流，识别目标对象，所述目标对象为有嘴唇动作的对象；第二确定单元，用于根据对所述目标对象的识别结果，从所述多路视频流中确定待显示的视频流；显示单元，用于显示所述待显示的视频流对应的画面。An embodiment of the present application further provides a video stream display device, including: an acquisition unit for acquiring multiple video streams and sound source positions of a current scene, and an image acquisition area corresponding to each of the video streams; a first determination unit , used to determine the target video stream from the multi-channel video stream according to the sound source position and the image acquisition area; the identification unit is used to identify the target object according to the target video stream, and the target object is an object with lip movements; a second determination unit for determining a video stream to be displayed from the multi-channel video streams according to the recognition result of the target object; a display unit for displaying the video to be displayed The corresponding screen of the stream.

本申请实施例还提供一种计算机设备，包括存储器存储有多条指令；所述处理器从所述存储器中加载指令，以执行本申请实施例所提供的任一种视频流显示方法中的步骤。The embodiments of the present application further provide a computer device, including a memory storing a plurality of instructions; the processor loads the instructions from the memory to execute the steps in any of the video stream display methods provided by the embodiments of the present application .

本申请实施例还提供一种计算机可读存储介质，所述计算机可读存储介质存储有多条指令，所述指令适于处理器进行加载，以执行本申请实施例所提供的任一种视频流显示方法中的步骤。Embodiments of the present application further provide a computer-readable storage medium, where the computer-readable storage medium stores a plurality of instructions, and the instructions are adapted to be loaded by a processor to execute any video provided by the embodiments of the present application A flow shows the steps in a method.

本申请实施例可以获取当前场景的多路视频流以及声源位置，每个所述视频流对应的一个图像采集区域；根据所述声源位置以及所述图像采集区域，从所述多路视频流中确定目标视频流；根据所述目标视频流，识别目标对象，所述目标对象为有嘴唇动作的对象；根据对所述目标对象的识别结果，从所述多路视频流中确定待显示的视频流；显示所述待显示的视频流对应的画面。在本申请中通过声源位置确定用来识别发言人的目标视频流，可以提高识别发言人的效率，同时根据识别结果确定待显示的视频流，可以使显示的画面聚焦在发言人身上，呈现更好的会议画面。This embodiment of the present application can acquire multiple video streams and sound source positions of the current scene, and an image capture area corresponding to each of the video streams; Determine a target video stream in the stream; identify a target object according to the target video stream, and the target object is an object with lip movements; according to the recognition result of the target object, determine from the multi-channel video streams to be displayed the video stream; display the picture corresponding to the to-be-displayed video stream. In this application, the target video stream used to identify the speaker is determined by the position of the sound source, which can improve the efficiency of identifying the speaker. Better meeting images.

附图说明Description of drawings

为了更清楚地说明本申请实施例中的技术方案，下面将对实施例描述中所需要使用的附图作简单地介绍，显而易见地，下面描述中的附图仅仅是本申请的一些实施例，对于本领域技术人员来讲，在不付出创造性劳动的前提下，还可以根据这些附图获得其他的附图。In order to illustrate the technical solutions in the embodiments of the present application more clearly, the following briefly introduces the drawings that are used in the description of the embodiments. Obviously, the drawings in the following description are only some embodiments of the present application. For those skilled in the art, other drawings can also be obtained from these drawings without creative effort.

图1是本申请实施例提供的视频流显示系统的场景示意图；1 is a schematic diagram of a scene of a video stream display system provided by an embodiment of the present application;

图2是本申请实施例提供的视频流显示方法的流程示意图；2 is a schematic flowchart of a method for displaying a video stream provided by an embodiment of the present application;

图3是本申请实施例提供的视频流显示系统的结构示意图；3 is a schematic structural diagram of a video stream display system provided by an embodiment of the present application;

图4是本申请实施例提供的数据处理模块的流程示意图；4 is a schematic flowchart of a data processing module provided by an embodiment of the present application;

图5是本申请另一个实施例提供的视频流显示方法的流程示意图；5 is a schematic flowchart of a video stream display method provided by another embodiment of the present application;

图6是本申请实施例提供的视频流显示装置的结构示意图；6 is a schematic structural diagram of a video stream display device provided by an embodiment of the present application;

图7是本申请实施例提供的计算机设备的结构示意图。FIG. 7 is a schematic structural diagram of a computer device provided by an embodiment of the present application.

具体实施方式Detailed ways

下面将结合本申请实施例中的附图，对本申请实施例中的技术方案进行清楚、完整地描述，显然，所描述的实施例仅仅是本申请一部分实施例，而不是全部的实施例。基于本申请中的实施例，本领域技术人员在没有作出创造性劳动前提下所获得的所有其他实施例，都属于本申请保护的范围。The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only a part of the embodiments of the present application, but not all of the embodiments. Based on the embodiments in the present application, all other embodiments obtained by those skilled in the art without creative work fall within the protection scope of the present application.

本申请实施例提供视频流显示方法、装置、计算机设备和存储介质。Embodiments of the present application provide a video stream display method, apparatus, computer device, and storage medium.

其中，该视频流显示装置具体可以集成在电子设备中，该电子设备可以为终端、服务器等设备。其中，终端可以为手机、平板电脑、智能蓝牙设备、笔记本电脑、或者个人电脑(Personal Computer，PC)等设备；服务器可以是单一服务器，也可以是由多个服务器组成的服务器集群。Specifically, the video stream display apparatus may be integrated in an electronic device, and the electronic device may be a terminal, a server, or the like. The terminal may be a mobile phone, a tablet computer, an intelligent Bluetooth device, a notebook computer, or a personal computer (Personal Computer, PC) and other devices; the server may be a single server, or a server cluster composed of multiple servers.

在一些实施例中，该视频流显示装置还可以集成在多个电子设备中，比如，视频流显示装置可以集成在多个服务器中，由多个服务器来实现本申请的视频流显示方法。In some embodiments, the video stream display apparatus may also be integrated into multiple electronic devices, for example, the video stream display apparatus may be integrated into multiple servers, and the video stream display method of the present application may be implemented by multiple servers.

在一些实施例中，服务器也可以以终端的形式来实现。In some embodiments, the server may also be implemented in the form of a terminal.

例如，参考图1，在一些实施方式中，提供了一种视频流显示系统的场景示意图，该图像渲染系统可以包括显示数据采集模块1000、服务器2000以及终端3000。For example, referring to FIG. 1 , in some embodiments, a scene schematic diagram of a video stream display system is provided, and the image rendering system may include a display data acquisition module 1000 , a server 2000 and a terminal 3000 .

其中，数据采集模块可以获取当前场景的多路视频流以及声源位置，每个视频流对应的一个图像采集区域。The data acquisition module can acquire multiple video streams and sound source positions of the current scene, and an image acquisition area corresponding to each video stream.

其中，服务器可以根据声源位置以及图像采集区域，从多路视频流中确定目标视频流；根据目标视频流，识别目标对象，目标对象为有嘴唇动作的对象；根据对目标对象的识别结果，从多路视频流中确定待显示的视频流。Among them, the server can determine the target video stream from the multiple video streams according to the position of the sound source and the image acquisition area; according to the target video stream, identify the target object, and the target object is an object with lip movements; according to the recognition result of the target object, Determine the video stream to be displayed from the multiple video streams.

其中，终端可以显示待显示的视频流对应的画面。The terminal may display a picture corresponding to the video stream to be displayed.

以下分别进行详细说明。需说明的是，以下实施例的序号不作为对实施例优选顺序的限定。Each of them will be described in detail below. It should be noted that the serial numbers of the following embodiments are not intended to limit the preferred order of the embodiments.

在本实施例中，提供了一种基于视频流显示方法，如图2所示，该视频流显示方法的具体流程可以如下：In this embodiment, a video stream-based display method is provided. As shown in FIG. 2 , the specific process of the video stream display method may be as follows:

110、获取当前场景的多路视频流以及声源位置，每个视频流对应的一个图像采集区域。110. Acquire multiple video streams and sound source positions of the current scene, and an image acquisition area corresponding to each video stream.

其中，声源位置是指当前场景中声音发出的位置，例如在会议场景中可以为讲话声发出的位置。可以通过在当前场景中设置麦克风阵列采集声音，并根据声源定位算法计算声源位置。The position of the sound source refers to the position where the sound is emitted in the current scene, for example, the position where the speech sound is emitted in the conference scene. You can collect sound by setting a microphone array in the current scene, and calculate the sound source position according to the sound source localization algorithm.

在一些实施方式中，多路视频流包括一路全景视频流以及至少一路近景视频流。图像采集区域是指视频流对应的图像采集装置能采集的图像对应当前场景的区域范围。其中，全景视频流是指包含当前场景全景画面的视频流，其对应的图像采集区域即为当前场景全景，可以由具有广角镜头的摄像头采集，近景视频流是指包含当前场景局部场景的视频流，其对应的图像采集区域即为当前场景的局部场景，可以由具有长焦镜头的摄像头采集。In some embodiments, the multiple video streams include one panoramic video stream and at least one close-up video stream. The image acquisition area refers to the area of the current scene corresponding to the image that can be acquired by the image acquisition device corresponding to the video stream. Among them, the panoramic video stream refers to the video stream containing the panoramic picture of the current scene, and the corresponding image acquisition area is the current scene panorama, which can be collected by a camera with a wide-angle lens, and the close-up video stream refers to the current scene. The corresponding image collection area is the local scene of the current scene, which can be collected by a camera with a telephoto lens.

在一些实施方式中，声源位置的获取方法，可以包括步骤1.1～1.2，如下：In some embodiments, the method for acquiring the sound source position may include steps 1.1 to 1.2, as follows:

1.1、采集当前场景的声音信息；1.1. Collect the sound information of the current scene;

1.2、通过声源定位算法处理采集到的声音信息，得到声源位置。1.2. Process the collected sound information through the sound source localization algorithm to obtain the sound source position.

其中，声源定位算法可以采用TDOA(Time Difference of Arrival，到达时间差)、GCC-PHAT(Generalized Cross Correlation PHAse Transformation，广义互相关-相位变换方法)，等等。Among them, the sound source localization algorithm may adopt TDOA (Time Difference of Arrival, time difference of arrival), GCC-PHAT (Generalized Cross Correlation PHAse Transformation, generalized cross correlation-phase transformation method), and so on.

120、根据声源位置以及图像采集区域，从多路视频流中确定目标视频流。120. Determine the target video stream from the multiple video streams according to the sound source position and the image acquisition area.

其中，目标视频流是根据声源位置以及图像采集区域关联关系确定的视频流。关联关系可以为声源位置位于图像采集区域内，也可以为声源位置与图像采集区域中心距离小于预设的距离。The target video stream is a video stream determined according to the position of the sound source and the relationship between the image acquisition areas. The association relationship may be that the sound source position is located in the image collection area, or the distance between the sound source position and the center of the image collection area is less than a preset distance.

在一些实施方式中，步骤120可以包括步骤：根据声源位置以及图像采集区域关联关系，从多个多路视频流对应的采集区域中确定目标图像采集区域，将目标图像采集区域对应的视频流确定为目标视频流。In some embodiments, step 120 may include the step of: determining a target image capture area from the capture areas corresponding to the multiple multi-channel video streams according to the position of the sound source and the relationship between the image capture areas, and selecting the video stream corresponding to the target image capture area Determined as the target video stream.

在一些实施方式中，由于反射、噪声等干扰，通过声源定位确定的声源位置可能会有误差，因此通过声源位置确定声源存在的可能区域，以此来确定与该区域对应的目标视频流，增加获取的图像信息的准确性，具体地，步骤120可以包括步骤2.1～2.4，如下：In some embodiments, due to interference such as reflection, noise, etc., there may be errors in the sound source location determined by sound source localization. Therefore, the possible area where the sound source exists is determined by the sound source location, so as to determine the target corresponding to the area. The video stream increases the accuracy of the acquired image information. Specifically, step 120 may include steps 2.1 to 2.4, as follows:

2.1、根据声源位置确定声源区域；2.1. Determine the sound source area according to the sound source position;

2.2、针对每个视频流，确定声源区域以及图像采集区域的重叠区域；2.2. For each video stream, determine the overlapping area of the sound source area and the image acquisition area;

2.3、将满足预设的第一区域大小的重叠区域确定为目标区域；2.3. Determine the overlapping area that meets the preset first area size as the target area;

2.4、将目标区域对应的视频流确定为目标视频流。2.4. Determine the video stream corresponding to the target area as the target video stream.

其中，声源区域是指声源位置所在的区域，该区域与图像采集区域位于同一平面。声源区域可以根据声源位置以及预设的区域参数值确定，预设的区域参数值可以根据当前场景或经验设置，例如将以声源位置为圆心，预设的半径值为半径的圆形区域作为声源区域，等等。The sound source area refers to the area where the sound source is located, and the area and the image acquisition area are located on the same plane. The sound source area can be determined according to the sound source position and the preset area parameter value. The preset area parameter value can be set according to the current scene or experience. For example, a circle with the sound source position as the center and the preset radius value as the radius will be used. zones as sound source zones, and so on.

在一些实施方式中，步骤2.1可以包括步骤：获取基准点；将基准点作为顶点，基准点以及声源位置的连线作为角平分线，确定一满足预设第一角度的夹角；将该夹角在当前场景对应的区域确定为声源区域。其中，基准点可以为当前场景中的任意一个边界点。在一些实施方式中，基准点可以为采集当前场景的声音信息的位置点，例如，根据用于测量声源位置的麦克风阵列确定的点，该点可以为麦克风阵列上的任意一个位置点，也可以为中点。需说明的是，基准点、声源位置、图像采集区域、声源区域以及目标区域都处在同一平面上，可以为水平面，例如声源位置为通过声源定位算法计算得到的真实声源位置投影到水平面上的位置点。In some embodiments, step 2.1 may include the steps of: acquiring a reference point; taking the reference point as a vertex, and a line connecting the reference point and the sound source position as an angle bisector, and determining an included angle that satisfies the preset first angle; The included angle is determined as the sound source area in the area corresponding to the current scene. The reference point may be any boundary point in the current scene. In some embodiments, the reference point may be a location point where sound information of the current scene is collected, for example, a point determined according to a microphone array used to measure the location of the sound source, the point may be any location point on the microphone array, or Can be the midpoint. It should be noted that the reference point, the sound source position, the image acquisition area, the sound source area and the target area are all on the same plane, which can be a horizontal plane. For example, the sound source position is the real sound source position calculated by the sound source localization algorithm. The position point projected onto the horizontal plane.

其中，预设的第一区域大小是根据当前场景或经验设置的区域的尺寸条件。可以为具体的值，例如为大于等于任意一个视频流对应的图像采集区域大小的三分之一，也可以为大于等于根据声源区域确定的区域尺寸，例如大于等于声源区域尺寸的二分之一。The preset first area size is a size condition of the area set according to the current scene or experience. It can be a specific value, such as greater than or equal to one-third of the image capture area corresponding to any video stream, or greater than or equal to the area size determined according to the sound source area, such as greater than or equal to half the size of the sound source area one.

在一些实施方式中，步骤2.3可以包括步骤：将声源区域大小相同的重叠区域确定为目标区域。In some embodiments, step 2.3 may include the step of: determining an overlapping area with the same size of the sound source area as the target area.

130、根据目标视频流，识别目标对象，目标对象为有嘴唇动作的对象。130. Identify a target object according to the target video stream, where the target object is an object with lip movements.

其中，目标对象是指从目标视频流的图像信息中识别出的有嘴唇动作的对象。一般而言，有嘴唇动作的人是正在讲话，因此可以作为当前场景的发言人。嘴唇动作可以为根据现有技术中确定的人说话时的嘴唇动作。The target object refers to an object with lip movements identified from the image information of the target video stream. Generally speaking, the person with the lip movement is speaking and thus can act as the speaker for the current scene. The lip movement may be the lip movement of a person when speaking determined according to the prior art.

由于不同视频流对应的图像采集区域不同，通过声源位置确定用来识别发言人的目标视频流，可以减少识别数据量，提高识别发言人的效率。Since the image acquisition areas corresponding to different video streams are different, the target video stream used to identify the speaker can be determined by the position of the sound source, which can reduce the amount of identification data and improve the efficiency of identifying the speaker.

在一些实施方式中，由于反射、噪声等干扰，通过声源定位确定的声源位置可能会有误差，因此通过声源位置确定用来识别发言人的区域，以此来确定与该区域对应的目标视频流，增加获取的图像信息的准确性，步骤130可以包括步骤3.1～3.3，如下：In some embodiments, due to interference such as reflection, noise, etc., there may be errors in the position of the sound source determined by the sound source localization. Therefore, the area used to identify the speaker is determined by the sound source position, so as to determine the area corresponding to the area. The target video stream increases the accuracy of the acquired image information. Step 130 may include steps 3.1 to 3.3, as follows:

3.1、根据声源位置确定识别区域；3.1. Determine the recognition area according to the location of the sound source;

3.2、根据识别区域，从目标视频流获取目标图像信息，目标图像信息为识别区域对应的图像信息；3.2. According to the recognition area, obtain target image information from the target video stream, and the target image information is the image information corresponding to the recognition area;

3.3、根据目标图像信息，识别目标对象。3.3. Identify the target object according to the target image information.

其中，识别区域是指根据声源位置确定的用来识别目标对象的区域，该区域与图像采集区域位于同一平面。声源位置位于识别区域内，识别区域可以根据声源位置以及预设的区域参数值确定，预设的区域参数值可以根据当前场景或经验设置，例如将以声源位置为圆心，预设的半径值为半径的圆形区域作为识别区域，等等。识别区域也可以为声源区域。Wherein, the identification area refers to the area determined according to the position of the sound source and used to identify the target object, and the area and the image acquisition area are located on the same plane. The position of the sound source is located in the identification area. The identification area can be determined according to the position of the sound source and the preset area parameter values. The preset area parameter values can be set according to the current scene or experience. A circular area with a radius value as a recognition area, and so on. The identification area may also be the sound source area.

在一些实施方式中，步骤3.1可以包括步骤：获取基准点；将基准点作为顶点，基准点以及声源位置的连线作为角平分线，确定一满足预设第二角度的夹角；将该夹角在当前场景对应的区域确定为识别区域。In some embodiments, step 3.1 may include the steps of: acquiring a reference point; taking the reference point as a vertex, and a line connecting the reference point and the sound source position as an angle bisector, and determining an included angle that satisfies the preset second angle; The area corresponding to the included angle in the current scene is determined as the recognition area.

其中，目标图像信息是指识别区域投影到目标视频流采集的画面的区域内的图像信息。具体地，可以通过获取识别区域的坐标位置，将该坐标位置投影到目标视频流采集的画面所在的坐标系中，以得到投影后的区域，将该区域内的图像信息作为目标图像信息。The target image information refers to the image information of the recognition area projected into the area of the image captured by the target video stream. Specifically, the coordinate position of the recognition area can be obtained, and the coordinate position can be projected into the coordinate system where the image captured by the target video stream is located to obtain the projected area, and the image information in the area is used as the target image information.

通过声源位置确定可能包含目标对象的识别区域，并通过识别区域从目标视频流中获取识别区域对应的图像信息，再从该图像信息中识别是否有目标对象。The recognition area that may contain the target object is determined by the position of the sound source, the image information corresponding to the recognition area is obtained from the target video stream through the recognition area, and whether there is a target object is identified from the image information.

在一些实施方式中，为了提高识别效率，步骤3.3可以包括步骤3.3.1～3.3.2，如下：In some embodiments, in order to improve the recognition efficiency, step 3.3 may include steps 3.3.1 to 3.3.2, as follows:

3.3.1、当从目标图像信息中识别到有嘴唇动作的对象时，将有嘴唇动作的对象作为目标对象；3.3.1. When an object with lip movements is identified from the target image information, the object with lip movements is used as the target object;

3.3.2、当从目标图像信息中未识别到有嘴唇动作的对象时，扩大识别区域至预设的第二区域大小，以识别目标对象。3.3.2. When the object with lip motion is not recognized from the target image information, expand the recognition area to the preset second area size to recognize the target object.

例如，先设置识别区域为夹角为30°的扇形区域，当在该区域内没有识别到目标对象时，将识别区域扩大为夹角为40°的扇形区域，再次进行识别，当在该区域内没有识别到目标对象时，将识别区域扩大为夹角为50°的扇形区域，以此类推，直至识别到目标对象或者识别区域扩大到上限值。For example, first set the recognition area as a fan-shaped area with an included angle of 30°, when the target object is not recognized in this area, expand the recognition area to a fan-shaped area with an included angle of 40°, and perform the recognition again. When no target object is recognized, the recognition area is expanded to a fan-shaped area with an included angle of 50°, and so on, until the target object is recognized or the recognition area is expanded to the upper limit.

由于通过声源定位确定的声源位置可能会有误差，在进行嘴唇动作识别的时，预先设置的识别区域可能无法识别到目标对象，此时通过逐步扩大识别区域大小，能够扩大识别范围，以此修正识别结果，同时逐步扩大识别区域大小也能使每次要识别的区域相比下一次识别要小，以此尽可能以最小区域获得识别结果，提高识别效率。Since the position of the sound source determined by sound source localization may have errors, the preset recognition area may not recognize the target object when performing lip motion recognition. At this time, by gradually expanding the recognition area, the recognition range can be expanded to This correction of the recognition result and the gradual expansion of the size of the recognition area can also make the area to be recognized each time smaller than the next recognition, so as to obtain the recognition result with the smallest area as much as possible, and improve the recognition efficiency.

在一些实施方式中，为了进一步提高识别效率，步骤3.3.2可以包括步骤：当从目标图像信息中未识别到有嘴唇动作的对象时，扩大识别区域至预设的第二区域大小，得到扩大后的区域；将识别区域与扩大后的区域的不重叠区域作为目标识别区域；根据目标识别区域，从目标视频流获取目标图像信息，目标图像信息为识别区域对应的图像信息；根据目标图像信息，识别目标对象。In some embodiments, in order to further improve the recognition efficiency, step 3.3.2 may include the step of: when an object with lip movements is not recognized from the target image information, expanding the recognition area to a preset second area size to obtain the enlarged The non-overlapping area between the recognition area and the enlarged area is used as the target recognition area; according to the target recognition area, the target image information is obtained from the target video stream, and the target image information is the image information corresponding to the recognition area; according to the target image information , identify the target object.

140、根据对目标对象的识别结果，从多路视频流中确定待显示的视频流。140. Determine the video stream to be displayed from the multiple video streams according to the identification result of the target object.

其中，待显示视频流是指用来显示当前场景的视频流。可以通过待显示视频流聚焦显示目标对象。The video stream to be displayed refers to a video stream used to display the current scene. The target object can be displayed in focus through the video stream to be displayed.

在一些实施方式中，为了提供更好的当前场景显示效果，提供了根据目标对象识别结果确定的显示策略，步骤140可以包括步骤4.1～4.4，如下：In some embodiments, in order to provide a better display effect of the current scene, a display strategy determined according to the target object recognition result is provided, and step 140 may include steps 4.1 to 4.4, as follows:

4.1、当识别到目标对象时，根据目标对象确定待显示区域；4.1. When the target object is identified, determine the area to be displayed according to the target object;

4.2、当未识别到目标对象时，根据目标图像信息中的所有对象确定待显示区域；4.2. When the target object is not recognized, determine the area to be displayed according to all objects in the target image information;

4.3、获取每个视频流对应的图像采集区域；4.3. Obtain the image capture area corresponding to each video stream;

4.4、根据待显示区域以及图像采集区域，确定待显示视频流。4.4. Determine the video stream to be displayed according to the area to be displayed and the image acquisition area.

其中，待显示区域是指要通过待显示视频流显示的区域。当识别到目标对象时，可以将目标对象所在的区域例如声源区域或识别区域作为待显示区域，当未识别到目标对象时，将目标图像信息中的所有对象所在的区域作为待显示区域。待显示区域可以与图像采集区域处于同一平面，也可以与目标视频流对应的图像处于同一平面，在将待显示区域与图像采集区域等不同平面区域进行比较时，可以将待显示区域投影到图像采集区域等所在平面后再进行比较。The to-be-displayed area refers to an area to be displayed through the to-be-displayed video stream. When the target object is recognized, the area where the target object is located, such as the sound source area or the recognition area, can be used as the area to be displayed. When the target object is not recognized, the area where all objects in the target image information are located can be used as the area to be displayed. The to-be-displayed area can be on the same plane as the image acquisition area or the image corresponding to the target video stream. When comparing the to-be-displayed area with the image acquisition area and other different plane areas, the to-be-displayed area can be projected onto the image. Compare the plane where the collection area is located.

在一些实施方式中，当识别到多个目标对象时，根据多个目标对象确定待显示区域。此时，待显示区域为多个目标对象所在的区域。In some embodiments, when multiple target objects are identified, the area to be displayed is determined according to the multiple target objects. At this time, the to-be-displayed area is an area where multiple target objects are located.

通过对目标对象的识别结果确定待显示区域，并将待显示区域与图像采集区域进行比较以确定待显示视频流。例如，可以通过确定待显示区域与每个图像采集区域的重复区域，将重复区域最大的图像采集区域对应的视频流作为待显示视频流。The to-be-displayed area is determined by the recognition result of the target object, and the to-be-displayed area is compared with the image acquisition area to determine the to-be-displayed video stream. For example, by determining the overlapping area between the area to be displayed and each image capturing area, the video stream corresponding to the image capturing area with the largest overlapping area can be used as the video stream to be displayed.

在一些实施方式中，为了能够聚焦发言人，提供更好的当前场景显示效果，步骤4.4可以包括步骤：确定待显示区域与每个图像采集区域的区域大小比值，将待显示区域大小/图像采集区域大小的比值最高的图像采集区域对应的视频流作为待显示视频流。在一些实施方式中，为了避免显示的发言人画面不完整，待显示区域大小/图像采集区域大小的比值小于预设值，该值可以为1。In some embodiments, in order to be able to focus on the speaker and provide a better display effect of the current scene, step 4.4 may include the step of: determining the area size ratio between the area to be displayed and each image acquisition area, and dividing the size of the area to be displayed/image acquisition The video stream corresponding to the image acquisition area with the highest area size ratio is used as the video stream to be displayed. In some implementation manners, in order to prevent the displayed speaker picture from being incomplete, the ratio of the size of the area to be displayed/the size of the image acquisition area is smaller than a preset value, and the value may be 1.

150、显示待显示的视频流对应的画面。150. Display a picture corresponding to the video stream to be displayed.

在一些实施方式中，通过裁剪显示画面，聚焦发言人，提供更好的当前场景显示效果，步骤150可以包括步骤5.1～5.3，如下：In some embodiments, by cropping the display screen and focusing on the speaker, a better display effect of the current scene is provided. Step 150 may include steps 5.1 to 5.3, as follows:

5.1、获取待显示视频流的显示画面；5.1. Obtain the display screen of the video stream to be displayed;

5.2、根据待显示区域，裁剪待显示视频流的显示画面，得到裁剪后的显示画面；5.2. According to the area to be displayed, trim the display image of the video stream to be displayed, and obtain the trimmed display image;

5.3、显示裁剪后的显示画面。5.3. Display the cropped display screen.

其中，裁剪后的显示画面是待显示视频流的显示画面中与待显示区域对应的画面。The cropped display picture is a picture corresponding to the to-be-displayed area in the display picture of the video stream to be displayed.

通过将待显示视频流的显示画面裁剪为待显示区域对应的画面，能够进一步聚焦发言人，以提供更好的当前场景显示效果。By trimming the display image of the video stream to be displayed to the image corresponding to the area to be displayed, the speaker can be further focused, so as to provide a better display effect of the current scene.

本申请实施例提供的视频流显示方法可以应用在各种多人参与的场景中。比如，以多人会议为例，获取当前场景的多路视频流以及声源位置，每个视频流对应的一个图像采集区域；根据声源位置以及图像采集区域，从多路视频流中确定目标视频流；根据目标视频流，识别目标对象，目标对象为有嘴唇动作的对象；根据对目标对象的识别结果，从多路视频流中确定待显示的视频流；显示待显示的视频流对应的画面。采用本申请实施例提供的方案通过声源位置确定用来识别发言人的目标视频流，可以提高识别发言人的效率，同时根据识别结果确定待显示的视频流，可以使显示的画面聚焦在发言人身上，呈现更好的会议画面。The video stream display method provided by the embodiment of the present application can be applied in various scenarios where multiple people participate. For example, taking a multi-person conference as an example, obtain the multi-channel video streams and sound source positions of the current scene, and an image capture area corresponding to each video stream; determine the target from the multi-channel video streams according to the sound source position and image capture area Video stream; identify the target object according to the target video stream, and the target object is an object with lip movements; according to the recognition result of the target object, determine the video stream to be displayed from the multiple video streams; display the corresponding video stream to be displayed. screen. Using the solution provided by the embodiment of the present application to determine the target video stream used to identify the speaker through the position of the sound source can improve the efficiency of identifying the speaker. On the person, present a better picture of the meeting.

根据上述实施例所描述的方法，以下将作进一步详细说明。According to the methods described in the above embodiments, further detailed description will be given below.

在本实施例中，将以多人会议场景为例，对本申请实施例的方法进行详细说明。In this embodiment, the method of the embodiment of the present application will be described in detail by taking a multi-person conference scenario as an example.

如图3所示，提供了视频流显示系统的结构示意图，该系统包括数据采集模块、数据处理模块以及终端。As shown in FIG. 3 , a schematic structural diagram of a video stream display system is provided, and the system includes a data acquisition module, a data processing module and a terminal.

其中，数据采集模块由红外热像仪、超声波模块、双摄像头模块和阵列麦克风组成，摄像头模块采集信息后发送给数据处理模块。具体如下：Among them, the data acquisition module is composed of an infrared thermal imager, an ultrasonic module, a dual camera module and an array microphone. The camera module collects information and sends it to the data processing module. details as follows:

数据采集模块，包括两个摄像头，两个摄像头分别为广角镜头以及长焦镜头。广角镜头视场角大，可视范围广，但远景模糊。长焦镜头视场角小，可视范围窄，但能看到的远景清晰。当视场角出现重叠切至长焦镜头的摄像头，当视场角在长焦镜头范围外时，就切换至广角镜头的摄像头。双摄像头切换方法包括：1、双摄像头模块包括广角镜头的摄像头和长焦镜头的摄像头两种，广角镜头的摄像头焦距短视野广，拍摄的画面多，画面的物体占比更小，相反长焦镜头的摄像头焦距长视野窄，拍摄的画面少，画面的物体占比更大。广角镜头的摄像头和长焦镜头的摄像头都可以分别出两路视频流，一路视频流用于实际画面呈现，可以称为预览流，另一路视频流用于给到AI做唇动检测和人脸识别，可以称为AI图像流。2、终端的画面呈现只能是源自两个摄像头的其中一路预览流，但可以同时将两个摄像头的AI图像流提供给图像AI线程做唇动检测和人脸识别。3、图像AI线程根据声源定位的角度信息，决策对两路AI图像流的其中一路做唇动识别和人脸识别，然后会输出给到UVC线程决策要切到哪一路预览流并做裁剪，最后呈现人脸聚焦效果。The data acquisition module includes two cameras, and the two cameras are respectively a wide-angle lens and a telephoto lens. The wide-angle lens has a large field of view and a wide viewing range, but the distant view is blurred. The telephoto lens has a small field of view and a narrow viewing range, but it can see a clear distant view. When the field of view overlaps and cuts to the camera with the telephoto lens, when the field of view is outside the range of the telephoto lens, it switches to the camera with the wide-angle lens. The dual-camera switching method includes: 1. The dual-camera module includes two types of cameras with a wide-angle lens and a camera with a telephoto lens. The camera with a wide-angle lens has a short focal length and a wide field of view. The camera has a long focal length, a narrow field of view, and fewer pictures, and a larger proportion of objects in the picture. Both the wide-angle lens camera and the telephoto lens camera can output two video streams, one video stream is used for the actual picture presentation, which can be called the preview stream, and the other video stream is used for lip motion detection and face recognition for AI, which can be Called AI Image Streaming. 2. The screen presentation of the terminal can only be one of the preview streams from the two cameras, but the AI image streams of the two cameras can be provided to the image AI thread for lip motion detection and face recognition at the same time. 3. The image AI thread decides to perform lip motion recognition and face recognition on one of the two AI image streams according to the angle information of sound source localization, and then outputs it to the UVC thread to decide which preview stream to cut and crop. , finally showing the face focus effect.

红外热像仪，用于对目标物体进行测温。Infrared thermal imager, used to measure the temperature of the target object.

超声波模块，用于结合红外热像仪检测目标物体的距离，由于红外热像仪本质上也是一个摄像头，其镜头的最小成像距离是有要求的，例如被测物和镜头之间的距离要大于25cm，才能保证热像图效果清晰。因此利用超声波模组，可以起到检测目标距离，提示目标的距离要求。The ultrasonic module is used to detect the distance of the target object in combination with the infrared thermal imager. Since the infrared thermal imager is essentially a camera, the minimum imaging distance of its lens is required. For example, the distance between the measured object and the lens should be greater than 25cm to ensure a clear thermal image. Therefore, the ultrasonic module can be used to detect the distance of the target and prompt the distance requirements of the target.

矩阵麦克风模组，用于声源定位，确定发言人所在方位。Matrix microphone module, used for sound source localization to determine the position of the speaker.

其中，数据处理模块包括UVC线程、UAC线程、图像AI线程以及音频AI线程，数据处理模块获取摄像头模块采集的信息，并进行数据处理。如图4所示，数据处理模块中线程的工作流程如下：Among them, the data processing module includes UVC thread, UAC thread, image AI thread and audio AI thread, and the data processing module obtains the information collected by the camera module and performs data processing. As shown in Figure 4, the workflow of the thread in the data processing module is as follows:

UVC线程，用于收集双摄像头的视频流信息，每一路摄像头都会输出两路视频流，一路用作输出给到终端呈现实时画面，另一路用作给到图像AI线程分析嘴唇动作分析以及人脸识别。UVC thread is used to collect video stream information of dual cameras. Each camera will output two video streams, one is used for outputting real-time images to the terminal, and the other is used for image AI thread analysis of lip motion analysis and face identify.

UAC线程，用于收集阵列麦克风的音频流信息，分为两类音频信息输出，一类音频信息是将其中一路麦克风的PCM格式的音频流数据直接输出给到终端做音频播放，另一类是将所有麦克风采集到的PCM格式的音频流数据组合后给到音频AI线程进行声源定位。The UAC thread is used to collect the audio stream information of the array microphones. It is divided into two types of audio information output. One type of audio information is to directly output the PCM format audio stream data of one of the microphones to the terminal for audio playback. The audio stream data in PCM format collected by all microphones is combined and sent to the audio AI thread for sound source localization.

图像AI线程，用于分析处理UVC线程输出的两个摄像头的图像信息，并输出决策给到UVC线程，决策包括反馈显示哪一摄像头的视频流，以及放大裁剪该路视频流的图像信息，以聚焦发言人。具体地，图像AI线程获取两种信息，一种为UVC线程提供的两个摄像头的视频流信息，另一种为音频AI线程提供的声源角度信息。图像AI线程获取到声源角度信息后，确定当前发言人的声源角度，根据两个摄像头的视场角范围确定获取哪个摄像头的视频流来分析嘴唇动作，确定嘴唇动作对应的识别区域，并识别出人脸信息。最后反馈UVC线程切换摄像头进行显示，以及放大裁剪以聚焦发言人。The image AI thread is used to analyze and process the image information of the two cameras output by the UVC thread, and output the decision to the UVC thread. Focus on speakers. Specifically, the image AI thread obtains two kinds of information, one is the video stream information of the two cameras provided by the UVC thread, and the other is the sound source angle information provided by the audio AI thread. After the image AI thread obtains the sound source angle information, it determines the sound source angle of the current speaker, determines which camera's video stream is obtained according to the field of view of the two cameras to analyze the lip movement, determines the recognition area corresponding to the lip movement, and Face information is recognized. The final feedback UVC thread switches the camera for display, and zooms in and crops to focus on the speaker.

音频AI线程，用于分析处理UAC给到的阵列MIC输出的PCM格式的音频流数据，进行声源定位，并将输出的声源角度信息至图像AI线程进行决策。The audio AI thread is used to analyze and process the audio stream data in the PCM format output by the array MIC given by the UAC, perform sound source localization, and send the output sound source angle information to the image AI thread for decision-making.

其中，数据处理模块还包括策略管理模块，策略管理模块用于获取数据处理模块处理后的数据，进行场景决策，实现发言人追踪、发言字幕显示和参会人员签到。Among them, the data processing module further includes a policy management module, which is used to obtain the data processed by the data processing module, make scene decisions, and realize speaker tracking, speech subtitle display, and participant sign-in.

其中，终端用来显示画面，终端可以为TV(电视)。The terminal is used for displaying pictures, and the terminal may be a TV (television).

如图5所示，一种视频流显示方法具体流程如下：As shown in Figure 5, the specific process of a video stream display method is as follows:

210、阵列麦克风实时采集环境声音。210. The array microphone collects ambient sound in real time.

220、音频AI线程通过声源定位算法，根据采集的环境声音确定并输出声源角度信息。220. The audio AI thread determines and outputs the sound source angle information according to the collected ambient sound through the sound source localization algorithm.

在通过陈列麦克风采集环境声音之前还可以包括步骤：策略管理模块控制红外热像仪以及超声波模块，检测参会人员的体温。超声波模组启动距离检测功能，当目标参会人员的距离达到红外热像仪成像的要求时，红外热像仪开始检测目标参会人员的体温，当温度超过要求时，不得参会。Before collecting the ambient sound through the display microphone, the following steps may be included: the strategy management module controls the infrared thermal imager and the ultrasonic module to detect the body temperature of the participants. The ultrasonic module activates the distance detection function. When the distance of the target participants reaches the imaging requirements of the infrared thermal imager, the infrared thermal imager starts to detect the body temperature of the target participants. When the temperature exceeds the requirements, they are not allowed to participate in the conference.

声源角度是指声源位置与阵列麦克风的夹角，可以以阵列麦克风组成的线段的中点为顶点，以声源位置、阵列麦克风组成的线段的中点以及阵列麦克风组成的线段的任意一个顶点形成的夹角作为声源角度。The sound source angle refers to the angle between the sound source position and the array microphone. The midpoint of the line segment formed by the array microphone can be used as the vertex, and any one of the sound source position, the midpoint of the line segment formed by the array microphone, and the line segment formed by the array microphone can be used. The angle formed by the vertices is used as the sound source angle.

阵列麦克风实时采集环境声音后发送给UAC线程，经UAC线程处理后一路发送给终端进行播放，一路发送给图像AI线程进行声源定位。The array microphone collects the ambient sound in real time and sends it to the UAC thread. After being processed by the UAC thread, it is sent all the way to the terminal for playback, and all the way to the image AI thread for sound source localization.

231、当音频AI线程未输出声源角度信息时，图像AI线程控制终端显示广角摄像头的拍摄的画面。231. When the audio AI thread does not output the sound source angle information, the image AI thread controls the terminal to display the image captured by the wide-angle camera.

没有声源角度输出时，会进入到聆听模式，在该模式下，UVC线程默认会输出广角摄像头的图像呈现，当图像AI线程分析两个摄像头的其中一路AI图像流有人脸识别的情况，就会告知UVC线程切换到对应摄像头的图像呈现，他，同时执行步骤210实时采集环境声音。如果两路AI图像流都有人脸识别到的情况，则优先输出广角摄像头的图像画面。如果两路AI图像流都没有人脸识别的情况，则不聚焦，也优先输出广角摄像头的图像画面。When there is no sound source angle output, it will enter the listening mode. In this mode, the UVC thread will output the image presentation of the wide-angle camera by default. When the image AI thread analyzes one of the AI image streams of the two cameras, there is face recognition. The UVC thread will be instructed to switch to the image presentation of the corresponding camera, and he will simultaneously perform step 210 to collect ambient sounds in real time. If both AI image streams have face recognition, the image of the wide-angle camera will be output preferentially. If there is no face recognition in the two AI image streams, it will not focus, and the image of the wide-angle camera will be output preferentially.

232、当音频AI线程输出声源角度信息时，图像AI线程根据声源角度信息确定声源区域。232. When the audio AI thread outputs the sound source angle information, the image AI thread determines the sound source area according to the sound source angle information.

在有声源角度输出时，图像AI线程会通过声源角度的±15°～±30°范围划分扇区，该扇区为声源区域。When there is a sound source angle output, the image AI thread will divide the sector by the range of ±15°～±30° of the sound source angle, and this sector is the sound source area.

240、图像AI线程根据声源区域，从两路视频流中确定目标视频流。240. The image AI thread determines the target video stream from the two video streams according to the sound source area.

图像AI线程根据声源区域判断根据哪个摄像头采集的图像进行识别，如果声源区域均完整被两个摄像头覆盖，图像AI线程则优先处理长焦摄像头的图像信息，如果声源区域范围在广角摄像头，图像AI线程则处理广角摄像头的图像信息。The image AI thread determines which camera captures the image based on the sound source area. If the sound source area is completely covered by two cameras, the image AI thread prioritizes the image information of the telephoto camera. If the sound source area is within the wide-angle camera , the image AI thread processes the image information of the wide-angle camera.

250、图像AI线程根据目标视频流，识别目标对象，目标对象为有嘴唇动作的对象。250. The image AI thread identifies the target object according to the target video stream, and the target object is an object with lip movements.

图像AI线程根据步骤340中确定的用来识别的摄像头，根据摄像头拍摄的图像信息进行嘴唇动作分析，以识别目标对象。The image AI thread performs lip motion analysis according to the camera used for identification determined in step 340 and according to the image information captured by the camera, so as to identify the target object.

261、当识别到目标对象时，图像AI线程根据目标对象确定待显示区域。261. When the target object is recognized, the image AI thread determines the area to be displayed according to the target object.

开始时，将声源角度的±15°对应的区域作为识别人脸的扇区，若未识别到有嘴唇动作的人，则将声源角度的±20°对应的区域作为识别人脸的扇区，依此以每次5°进行递增增加扇区大小，直至识别到有嘴唇动作的人，并将此时的扇区作为待显示区域。若有多个发言人，则待显示区域要覆盖所有发言人。At the beginning, the area corresponding to ±15° of the sound source angle is used as the sector for recognizing faces. If no person with lip movements is recognized, the area corresponding to ±20° of the sound source angle is used as the sector for recognizing faces. In this way, the size of the sector is increased in increments of 5° each time until a person with lip movements is identified, and the sector at this time is used as the area to be displayed. If there are multiple speakers, the area to be displayed should cover all speakers.

最后，UVC线程控制输出摄像头的图像信息，并对输出图像信息进行裁剪，从而能让用户看到最终的人脸聚焦效果。Finally, the UVC thread controls the image information of the output camera, and crops the output image information, so that the user can see the final face focus effect.

262、当未识别到目标对象时，图像AI线程根据目标图像信息中的所有对象确定待显示区域。262. When the target object is not recognized, the image AI thread determines the area to be displayed according to all objects in the target image information.

270、图像AI线程根据待显示区域确定待显示视频流。270. The image AI thread determines the video stream to be displayed according to the area to be displayed.

280、图像AI线程根据待显示区域裁剪待显示视频流的显示画面，得到裁剪后的显示画面。280. The image AI thread trims the display image of the video stream to be displayed according to the area to be displayed, and obtains the trimmed display image.

290、终端显示裁剪后的显示画面。290. The terminal displays the cropped display screen.

当未识别到有嘴唇动作的人时，将声源角度的±15°对应的区域作为识别人脸的扇区，若识别到人脸，则将声源角度的±20°对应的区域作为识别人脸的扇区，依此以每次5°进行递增增加扇区大小，直至识别到人脸，并将此时的扇区作为待显示区域。若待显示区域有多个人，则待显示区域要覆盖所有人。When no person with lip movements is recognized, the area corresponding to ±15° of the sound source angle is used as the sector for recognizing faces. If a face is recognized, the area corresponding to ±20° of the sound source angle is used as the recognition area. For the sector of the face, the sector size is increased in increments of 5° each time until the face is recognized, and the sector at this time is used as the area to be displayed. If there are many people in the area to be displayed, the area to be displayed must cover everyone.

当识别不到人脸，进入聆听模式，执行步骤210实时采集环境声音。When no face is recognized, the listening mode is entered, and step 210 is performed to collect ambient sound in real time.

由上可知，本申请实施例通过获取声源角度，进行双摄切换，从而实现对发言人聚焦，可以使显示的画面聚焦在发言人身上，呈现更好的会议画面。It can be seen from the above that the embodiment of the present application performs dual camera switching by acquiring the angle of the sound source, thereby realizing focusing on the speaker, so that the displayed image can be focused on the speaker, presenting a better conference image.

为了更好地实施以上方法，本申请实施例还提供一种视频流显示装置，该视频流显示装置具体可以集成在电子设备中，该电子设备可以为终端、服务器等设备。其中，终端可以为手机、平板电脑、智能蓝牙设备、笔记本电脑、个人电脑等设备；服务器可以是单一服务器，也可以是由多个服务器组成的服务器集群。In order to better implement the above method, an embodiment of the present application further provides a video stream display apparatus, which may be integrated in an electronic device, and the electronic device may be a terminal, a server, or the like. The terminal may be a mobile phone, a tablet computer, an intelligent Bluetooth device, a notebook computer, a personal computer, etc.; the server may be a single server, or a server cluster composed of multiple servers.

比如，在本实施例中，将以视频流显示装置具体集成在终端为例，对本申请实施例的方法进行详细说明。For example, in this embodiment, the method of the embodiment of the present application will be described in detail by taking the example that the video stream display device is specifically integrated in the terminal.

例如，如图6所示，该视频流显示装置可以包括获取单元310、第一确定单元320、识别单元330、第二确定单元340以及显示单元350，如下：For example, as shown in FIG. 6 , the video stream display apparatus may include an acquisition unit 310, a first determination unit 320, an identification unit 330, a second determination unit 340, and a display unit 350, as follows:

(一)获取单元310(1) Acquisition unit 310

用于获取当前场景的多路视频流以及声源位置，每个视频流对应的一个图像采集区域。It is used to obtain multiple video streams and sound source positions of the current scene, and each video stream corresponds to an image acquisition area.

在一些实施方式中，声源位置的获取方法，可以包括步骤6.1～6.2，如下：In some embodiments, the method for acquiring the sound source position may include steps 6.1 to 6.2, as follows:

6.1、采集当前场景的声音信息；6.1. Collect the sound information of the current scene;

6.2、通过声源定位算法处理采集到的声音信息，得到声源位置。6.2. Process the collected sound information through the sound source localization algorithm to obtain the sound source position.

(二)第一确定单元320(2) The first determination unit 320

用于根据声源位置以及图像采集区域，从多路视频流中确定目标视频流。It is used to determine the target video stream from multiple video streams according to the position of the sound source and the image capture area.

在一些实施方式中，第一确定单元320可以具体用于步骤7.1～7.4，如下：In some embodiments, the first determining unit 320 may be specifically used in steps 7.1 to 7.4, as follows:

7.1、根据声源位置确定声源区域；7.1. Determine the sound source area according to the sound source position;

7.2、针对每个视频流，确定声源区域以及图像采集区域的重叠区域；7.2. For each video stream, determine the overlapping area of the sound source area and the image acquisition area;

7.3、将满足预设的第一区域大小的重叠区域确定为目标区域；7.3. Determine the overlapping area that meets the preset first area size as the target area;

7.4、将目标区域对应的视频流确定为目标视频流。7.4. Determine the video stream corresponding to the target area as the target video stream.

(三)识别单元330(3) Identification unit 330

用于根据目标视频流，识别目标对象，目标对象为有嘴唇动作的对象。It is used to identify the target object according to the target video stream, and the target object is an object with lip movements.

在一些实施方式中，识别单元330可以具有用于包括步骤8.1～8.3，如下：In some embodiments, the identification unit 330 may have steps 8.1 to 8.3, as follows:

8.1、根据声源位置确定识别区域；8.1. Determine the recognition area according to the position of the sound source;

8.2、根据识别区域，从目标视频流获取目标图像信息，目标图像信息为识别区域对应的图像信息；8.2. According to the recognition area, obtain the target image information from the target video stream, and the target image information is the image information corresponding to the recognition area;

8.3、根据目标图像信息，识别目标对象。8.3. Identify the target object according to the target image information.

在一些实施方式中，步骤8.3可以包括步骤8.3.1～8.3.2，如下：In some embodiments, step 8.3 may include steps 8.3.1 to 8.3.2, as follows:

8.3.1、当从目标图像信息中识别到有嘴唇动作的对象时，将有嘴唇动作的对象作为目标对象；8.3.1. When an object with lip movements is identified from the target image information, the object with lip movements is used as the target object;

8.3.2、当从目标图像信息中未识别到有嘴唇动作的对象时，扩大识别区域至预设的第二区域大小，以识别目标对象。8.3.2. When the object with lip motion is not recognized from the target image information, expand the recognition area to the preset second area size to recognize the target object.

(四)第二确定单元340(4) The second determination unit 340

用于根据对目标对象的识别结果，从多路视频流中确定待显示的视频流。It is used to determine the video stream to be displayed from the multiple video streams according to the recognition result of the target object.

在一些实施方式中，第二确定单元340可以具体用于步骤9.1～9.4，如下：In some embodiments, the second determining unit 340 may be specifically used in steps 9.1 to 9.4, as follows:

9.1、当识别到目标对象时，根据目标对象确定待显示区域；9.1. When the target object is identified, determine the area to be displayed according to the target object;

9.2、当未识别到目标对象时，根据目标图像信息中的所有对象确定待显示区域；9.2. When the target object is not recognized, determine the area to be displayed according to all objects in the target image information;

9.3、获取每个视频流对应的图像采集区域；9.3. Obtain the image capture area corresponding to each video stream;

9.4、根据待显示区域以及图像采集区域，确定待显示视频流。9.4. Determine the video stream to be displayed according to the area to be displayed and the image acquisition area.

(五)显示单元350(5) Display unit 350

用于显示待显示的视频流对应的画面。It is used to display the picture corresponding to the video stream to be displayed.

在一些实施方式中，显示单元350可以具体用于步骤10.1～10.3，如下：In some embodiments, the display unit 350 may be specifically used in steps 10.1 to 10.3, as follows:

10.1、获取待显示视频流的显示画面；10.1. Obtain the display screen of the video stream to be displayed;

10.2、根据待显示区域，裁剪待显示视频流的显示画面，得到裁剪后的显示画面；10.2. According to the area to be displayed, trim the display image of the video stream to be displayed, and obtain the trimmed display image;

10.3、显示裁剪后的显示画面。10.3. Display the cropped display screen.

具体实施时，以上各个单元可以作为独立的实体来实现，也可以进行任意组合，作为同一或若干个实体来实现，以上各个单元的具体实施可参见前面的方法实施例，在此不再赘述。During specific implementation, the above units can be implemented as independent entities, or can be arbitrarily combined to be implemented as the same or several entities. The specific implementation of the above units can refer to the previous method embodiments, which will not be repeated here.

由此，本申请实施例可以通过声源位置确定用来识别发言人的目标视频流，可以提高识别发言人的效率，同时根据识别结果确定待显示的视频流，可以使显示的画面聚焦在发言人身上，呈现更好的会议画面。Therefore, the embodiment of the present application can determine the target video stream used to identify the speaker through the sound source position, which can improve the efficiency of identifying the speaker. On the person, present a better picture of the meeting.

相应的，本申请实施例还提供一种计算机设备，该计算机设备可以为终端或服务器，该终端可以为智能手机、平板电脑、笔记本电脑、触控屏幕、游戏机、个人计算机、个人数字助理(Personal Digital Assistant，PDA)等终端设备。Correspondingly, the embodiment of the present application also provides a computer device, the computer device may be a terminal or a server, and the terminal may be a smart phone, a tablet computer, a notebook computer, a touch screen, a game console, a personal computer, a personal digital assistant ( Personal Digital Assistant, PDA) and other terminal equipment.

如图7所示，图7为本申请实施例提供的计算机设备的结构示意图，该计算机设备400包括有一个或者一个以上处理核心的处理器410、有一个或一个以上计算机可读存储介质的存储器420及存储在存储器420上并可在处理器上运行的计算机程序。其中，处理器410与存储器420电性连接。本领域技术人员可以理解，图中示出的计算机设备结构并不构成对计算机设备的限定，可以包括比图示更多或更少的部件，或者组合某些部件，或者不同的部件布置。As shown in FIG. 7 , which is a schematic structural diagram of a computer device provided by an embodiment of the application, the computer device 400 includes a processor 410 having one or more processing cores, and a memory having one or more computer-readable storage media 420 and a computer program stored on memory 420 and executable on the processor. The processor 410 is electrically connected to the memory 420 . Those skilled in the art can understand that the computer device structure shown in the figures does not constitute a limitation on the computer device, and may include more or less components than the one shown, or combine some components, or arrange different components.

处理器410是计算机设备400的控制中心，利用各种接口和线路连接整个计算机设备400的各个部分，通过运行或加载存储在存储器420内的软件程序和/或模块，以及调用存储在存储器420内的数据，执行计算机设备400的各种功能和处理数据，从而对计算机设备400进行整体监控。The processor 410 is the control center of the computer device 400, and uses various interfaces and lines to connect various parts of the entire computer device 400, by running or loading software programs and/or modules stored in the memory 420, and calling the software programs and/or modules stored in the memory 420. to perform various functions of the computer device 400 and process data, so as to monitor the computer device 400 as a whole.

在本申请实施例中，计算机设备400中的处理器410会按照如下的步骤，将一个或一个以上的应用程序的进程对应的指令加载到存储器420中，并由处理器410来运行存储在存储器420中的应用程序，从而实现各种功能：In the embodiment of the present application, the processor 410 in the computer device 400 loads the instructions corresponding to the processes of one or more application programs into the memory 420 according to the following steps, and the processor 410 executes the instructions stored in the memory. 420 applications to achieve various functions:

获取当前场景的多路视频流以及声源位置，每个视频流对应的一个图像采集区域；根据声源位置以及图像采集区域，从多路视频流中确定目标视频流；根据目标视频流，识别目标对象，目标对象为有嘴唇动作的对象；根据对目标对象的识别结果，从多路视频流中确定待显示的视频流；显示待显示的视频流对应的画面。Obtain the multi-channel video streams and sound source positions of the current scene, and an image capture area corresponding to each video stream; determine the target video stream from the multi-channel video streams according to the sound source position and image capture area; identify the target video stream according to the target video stream The target object is an object with lip movements; according to the recognition result of the target object, the video stream to be displayed is determined from the multi-channel video streams; the picture corresponding to the video stream to be displayed is displayed.

以上各个操作的具体实施可参见前面的实施例，在此不再赘述。For the specific implementation of the above operations, reference may be made to the foregoing embodiments, and details are not described herein again.

可选的，如图7所示，计算机设备400还包括：触控显示屏430、射频电路440、音频电路450、输入单元460以及电源470。其中，处理器410分别与触控显示屏430、射频电路440、音频电路450、输入单元460以及电源470电性连接。本领域技术人员可以理解，图7中示出的计算机设备结构并不构成对计算机设备的限定，可以包括比图示更多或更少的部件，或者组合某些部件，或者不同的部件布置。Optionally, as shown in FIG. 7 , the computer device 400 further includes: a touch display screen 430 , a radio frequency circuit 440 , an audio circuit 450 , an input unit 460 and a power supply 470 . The processor 410 is electrically connected to the touch display screen 430 , the radio frequency circuit 440 , the audio circuit 450 , the input unit 460 and the power supply 470 , respectively. Those skilled in the art can understand that the computer device structure shown in FIG. 7 does not constitute a limitation on the computer device, and may include more or less components than the one shown, or combine some components, or arrange different components.

触控显示屏430可用于显示图形用户界面以及接收用户作用于图形用户界面产生的操作指令。触控显示屏430可以包括显示面板和触控面板。其中，显示面板可用于显示由用户输入的信息或提供给用户的信息以及计算机设备的各种图形用户接口，这些图形用户接口可以由图形、文本、图标、视频和其任意组合来构成。可选的，可以采用液晶显示器(LCD，Liquid Crystal Display)、有机发光二极管(OLED，Organic Light-EmittingDiode)等形式来配置显示面板。触控面板可用于收集用户在其上或附近的触摸操作(比如用户使用手指、触笔等任何适合的物体或附件在触控面板上或在触控面板附近的操作)，并生成相应的操作指令，且操作指令执行对应程序。可选的，触控面板可包括触摸检测装置和触摸控制器两个部分。其中，触摸检测装置检测用户的触摸方位，并检测触摸操作带来的信号，将信号传送给触摸控制器；触摸控制器从触摸检测装置上接收触摸信息，并将它转换成触点坐标，再送给处理器410，并能接收处理器410发来的命令并加以执行。触控面板可覆盖显示面板，当触控面板检测到在其上或附近的触摸操作后，传送给处理器410以确定触摸事件的类型，随后处理器410根据触摸事件的类型在显示面板上提供相应的视觉输出。在本申请实施例中，可以将触控面板与显示面板集成到触控显示屏430而实现输入和输出功能。但是在某些实施例中，触控面板与显示面板可以作为两个独立的部件来实现输入和输出功能。即触控显示屏430也可以作为输入单元460的一部分实现输入功能。The touch screen 430 can be used to display a graphical user interface and receive operation instructions generated by a user acting on the graphical user interface. The touch display 430 may include a display panel and a touch panel. Among them, the display panel can be used to display the information input by the user or the information provided to the user and various graphical user interfaces of the computer equipment, and these graphical user interfaces can be composed of graphics, text, icons, videos and any combination thereof. Optionally, the display panel may be configured in the form of a liquid crystal display (LCD, Liquid Crystal Display), an organic light-emitting diode (OLED, Organic Light-Emitting Diode), and the like. The touch panel can be used to collect the user's touch operations on or near it (such as the user's operations on or near the touch panel using a finger, stylus, etc., any suitable object or accessory), and generate corresponding operations instruction, and the operation instruction executes the corresponding program. Optionally, the touch panel may include two parts, a touch detection device and a touch controller. Among them, the touch detection device detects the user's touch orientation, detects the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into contact coordinates, and then sends it to the touch controller. To the processor 410, and can receive the command sent by the processor 410 and execute it. The touch panel may cover the display panel, and when the touch panel detects a touch operation on or near it, it is transmitted to the processor 410 to determine the type of the touch event, and then the processor 410 provides the display panel according to the type of the touch event. Corresponding visual output. In this embodiment of the present application, the touch panel and the display panel may be integrated into the touch display screen 430 to implement input and output functions. However, in some embodiments, the touch panel and the display panel can be used as two independent components to realize the input and output functions. That is, the touch display screen 430 can also be used as a part of the input unit 460 to realize the input function.

射频电路440可用于收发射频信号，以通过无线通信与网络设备或其他计算机设备建立无线通讯，与网络设备或其他计算机设备之间收发信号。The radio frequency circuit 440 can be used for transmitting and receiving radio frequency signals, so as to establish wireless communication with network equipment or other computer equipment through wireless communication, and send and receive signals with the network equipment or other computer equipment.

音频电路450可以用于通过扬声器、传声器提供用户与计算机设备之间的音频接口。音频电路450可将接收到的音频数据转换后的电信号，传输到扬声器，由扬声器转换为声音信号输出；另一方面，传声器将收集的声音信号转换为电信号，由音频电路450接收后转换为音频数据，再将音频数据输出处理器410处理后，经射频电路440以发送给比如另一计算机设备，或者将音频数据输出至存储器420以便进一步处理。音频电路450还可能包括耳塞插孔，以提供外设耳机与计算机设备的通信。The audio circuit 450 may be used to provide an audio interface between the user and the computer equipment through speakers, microphones. The audio circuit 450 can convert the received audio data into an electrical signal and transmit it to the speaker, and the speaker converts it into a sound signal for output; on the other hand, the microphone converts the collected sound signal into an electrical signal, which is converted after being received by the audio circuit 450. In the form of audio data, the audio data is output to the processor 410 for processing, and then sent to, for example, another computer device via the radio frequency circuit 440, or the audio data is output to the memory 420 for further processing. Audio circuitry 450 may also include an ear jack to provide for communication of peripheral headphones with the computer device.

输入单元460可用于接收输入的数字、字符信息或用户特征信息(例如指纹、虹膜、面部信息等)，以及产生与用户设置以及功能控制有关的键盘、鼠标、操作杆、光学或者轨迹球信号输入。The input unit 460 can be used to receive input numbers, character information or user characteristic information (such as fingerprint, iris, face information, etc.), and generate keyboard, mouse, joystick, optical or trackball signal input related to user settings and function control .

电源470用于给计算机设备400的各个部件供电。可选的，电源470可以通过电源管理系统与处理器410逻辑相连，从而通过电源管理系统实现管理充电、放电、以及功耗管理等功能。电源470还可以包括一个或一个以上的直流或交流电源、再充电系统、电源故障检测电路、电源转换器或者逆变器、电源状态指示器等任意组件。Power supply 470 is used to power various components of computer device 400 . Optionally, the power supply 470 may be logically connected to the processor 410 through a power management system, so that functions such as managing charging, discharging, and power consumption are implemented through the power management system. Power supply 470 may also include one or more DC or AC power sources, recharging systems, power failure detection circuits, power converters or inverters, power status indicators, and any other components.

尽管图7中未示出，计算机设备400还可以包括摄像头、传感器、无线保真模块、蓝牙模块等，在此不再赘述。Although not shown in FIG. 7 , the computer device 400 may further include a camera, a sensor, a Wi-Fi module, a Bluetooth module, and the like, which will not be repeated here.

在上述实施例中，对各个实施例的描述都各有侧重，某个实施例中没有详述的部分，可以参见其他实施例的相关描述。In the above-mentioned embodiments, the description of each embodiment has its own emphasis. For parts that are not described in detail in a certain embodiment, reference may be made to the relevant descriptions of other embodiments.

由上可知，本实施例提供的计算机设备可以通过声源位置确定用来识别发言人的目标视频流，可以提高识别发言人的效率，同时根据识别结果确定待显示的视频流，可以使显示的画面聚焦在发言人身上，呈现更好的会议画面。It can be seen from the above that the computer device provided in this embodiment can determine the target video stream used to identify the speaker through the position of the sound source, which can improve the efficiency of identifying the speaker. The picture focuses on the speaker, presenting a better picture of the meeting.

本领域普通技术人员可以理解，上述实施例的各种方法中的全部或部分步骤可以通过指令来完成，或通过指令控制相关的硬件来完成，该指令可以存储于一计算机可读存储介质中，并由处理器进行加载和执行。Those of ordinary skill in the art can understand that all or part of the steps in the various methods of the above-mentioned embodiments can be completed by instructions, or by instructions that control relevant hardware, and the instructions can be stored in a computer-readable storage medium, and loaded and executed by the processor.

为此，本申请实施例提供一种计算机可读存储介质，其中存储有多条计算机程序，该计算机程序能够被处理器进行加载，以执行本申请实施例所提供的任一种视频流显示方法中的步骤。例如，该计算机程序可以执行如下步骤：To this end, the embodiments of the present application provide a computer-readable storage medium, in which a plurality of computer programs are stored, and the computer programs can be loaded by a processor to execute any of the video stream display methods provided by the embodiments of the present application. steps in . For example, the computer program may perform the following steps:

其中，该存储介质可以包括：只读存储器(ROM，Read Only Memory)、随机存取记忆体(RAM，Random Access Memory)、磁盘或光盘等。Wherein, the storage medium may include: a read only memory (ROM, Read Only Memory), a random access memory (RAM, Random Access Memory), a magnetic disk or an optical disk, and the like.

由于该存储介质中所存储的计算机程序，可以执行本申请实施例所提供的任一种视频流显示方法中的步骤，因此，可以实现本申请实施例所提供的任一种视频流显示方法所能实现的有益效果，详见前面的实施例，在此不再赘述。Since the computer program stored in the storage medium can execute the steps in any video stream display method provided by the embodiments of the present application, it is possible to realize the steps of any video stream display method provided by the embodiments of the present application. For the beneficial effects that can be achieved, see the foregoing embodiments for details, which will not be repeated here.

以上对本申请实施例所提供的一种视频流显示方法、装置、存储介质及计算机设备进行了详细介绍，本文中应用了具体个例对本申请的原理及实施方式进行了阐述，以上实施例的说明只是用于帮助理解本申请的方法及其核心思想；同时，对于本领域的技术人员，依据本申请的思想，在具体实施方式及应用范围上均会有改变之处，综上所述，本说明书内容不应理解为对本申请的限制。A video stream display method, device, storage medium and computer equipment provided by the embodiments of the present application have been described in detail above. The principles and implementations of the present application are described with specific examples in this article. It is only used to help understand the method of the present application and its core idea; at the same time, for those skilled in the art, according to the idea of the present application, there will be changes in the specific implementation and application scope. The contents of the description should not be construed as limiting the application.

Claims

1. A method for displaying a video stream, comprising:

acquiring a plurality of paths of video streams and a sound source position of a current scene, wherein each video stream corresponds to an image acquisition area;

determining a target video stream from the multiple paths of video streams according to the sound source position and the image acquisition area;

identifying a target object according to the target video stream, wherein the target object is an object with lip action;

determining a video stream to be displayed from the multi-path video stream according to the identification result of the target object;

and displaying the picture corresponding to the video stream to be displayed.

2. The method for displaying a video stream according to claim 1, wherein said determining a target video stream from the plurality of video streams based on the sound source position and the image capturing area comprises:

determining a sound source area according to the sound source position;

for each of the video streams, determining an overlapping area of the sound source area and the image acquisition area;

determining the overlapping area meeting the preset first area size as a target area;

and determining the video stream corresponding to the target area as a target video stream.

3. The method for displaying a video stream according to claim 1, wherein said identifying a target object from said target video stream comprises:

determining an identification area according to the sound source position;

acquiring target image information from the target video stream according to the identification area, wherein the target image information is image information corresponding to the identification area;

and identifying the target object according to the target image information.

4. The method for displaying a video stream according to claim 3, wherein said identifying a target object based on said target image information comprises:

when the object with the lip action is identified from the target image information, taking the object with the lip action as a target object;

and when the object with the lip action is not identified from the target image information, expanding the identification area to a preset second area size so as to identify the target object.

5. The method for displaying a video stream according to claim 1, wherein the determining a video stream to be displayed from the plurality of video streams based on the recognition result of the target object comprises:

when the target object is identified, determining a region to be displayed according to the target object;

when the target object is not identified, determining a region to be displayed according to all objects in the target image information;

acquiring an image acquisition area corresponding to each video stream;

and determining the video stream to be displayed according to the area to be displayed and the image acquisition area.

6. The method for displaying a video stream according to claim 5, wherein the displaying a picture corresponding to the video stream to be displayed comprises:

acquiring a display picture of the video stream to be displayed;

cutting the display picture of the video stream to be displayed according to the area to be displayed to obtain the cut display picture;

and displaying the cut display picture.

7. The video stream display method according to claim 1, wherein the method of acquiring the sound source position includes:

collecting sound information of a current scene;

and processing the collected sound information through a sound source positioning algorithm to obtain the sound source position.

8. A video stream display apparatus, comprising:

the system comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring a plurality of paths of video streams and a sound source position of a current scene, and each video stream corresponds to an image acquisition area;

a first determining unit, configured to determine a target video stream from the multiple video streams according to the sound source position and the image acquisition area;

the identification unit is used for identifying a target object according to the target video stream, wherein the target object is an object with lip action;

the second determining unit is used for determining the video stream to be displayed from the multi-path video stream according to the identification result of the target object;

and the display unit is used for displaying the picture corresponding to the video stream to be displayed.

9. A computer device comprising a processor and a memory, the memory storing a plurality of instructions; the processor loads instructions from the memory to perform the steps of the video stream display method according to any one of claims 1 to 7.

10. A computer readable storage medium storing instructions adapted to be loaded by a processor to perform the steps of the method of displaying a video stream according to any one of claims 1 to 7.