CN114422743A - Video stream display method, device, computer equipment and storage medium

Info

Publication number: CN114422743A
Application number: CN202111583153.XA
Authority: CN
Inventors: 余力丛; 于勇
Original assignee: Huizhou Shiwei New Technology Co Ltd
Current assignee: Huizhou Shiwei New Technology Co Ltd
Priority date: 2021-12-22
Filing date: 2021-12-22
Publication date: 2022-04-29

Abstract

The embodiment of the application discloses a video stream display method, a video stream display device, computer equipment and a storage medium; the method comprises the steps of obtaining multiple paths of video streams and sound source positions of a current scene, wherein each video stream corresponds to an image acquisition area; determining a target video stream from the multiple paths of video streams according to the sound source position and the image acquisition area; identifying a target object according to the target video stream, wherein the target object is an object with lip action; determining a video stream to be displayed from the multi-path video stream according to the identification result of the target object; and displaying the picture corresponding to the video stream to be displayed. In the embodiment of the application, the target video stream for identifying the speaker is determined according to the sound source position, so that the efficiency of identifying the speaker can be improved, and meanwhile, the video stream to be displayed is determined according to the identification result, so that the displayed picture can be focused on the speaker, and a better conference picture can be presented.

Description

Video stream display method, device, computer equipment and storage medium

Technical Field

The present application relates to the field of computer technologies, and in particular, to a video stream display method and apparatus, a computer device, and a storage medium.

Background

With the development of video technology, scene pictures can be captured by cameras and played in real time on more and more occasions. However, in a scene with multiple participants, the pictures captured by the camera cannot highlight the key points of the current scene.

Especially in a multi-person conference scene, there are often a plurality of different speakers in the conference process, and how to focus the displayed picture on the speakers to present a better conference picture is a problem that needs to be solved at present.

Disclosure of Invention

The embodiment of the application provides a video stream display method, a video stream display device, computer equipment and a storage medium, which can enable a displayed picture to be focused on a speaker and present a better conference picture.

An embodiment of the present application provides a video stream display method, including: acquiring a plurality of paths of video streams and a sound source position of a current scene, wherein each video stream corresponds to an image acquisition area; determining a target video stream from the multiple paths of video streams according to the sound source position and the image acquisition area; identifying a target object according to the target video stream, wherein the target object is an object with lip action; determining a video stream to be displayed from the multi-path video stream according to the identification result of the target object; and displaying the picture corresponding to the video stream to be displayed.

An embodiment of the present application further provides a video stream display device, including: an acquisition unit, configured to acquire multiple video streams and a sound source position of a current scene, wherein each video stream corresponds to an image acquisition area; a first determining unit, configured to determine a target video stream from the multiple video streams according to the sound source position and the image acquisition area; an identification unit, configured to identify a target object according to the target video stream, wherein the target object is an object with lip action; a second determining unit, configured to determine a video stream to be displayed from the multiple video streams according to the identification result of the target object; and a display unit, configured to display the picture corresponding to the video stream to be displayed.

The embodiment of the application also provides a computer device, which comprises a memory and a processor, wherein the memory stores a plurality of instructions; the processor loads instructions from the memory to execute the steps in any one of the video stream display methods provided by the embodiments of the present application.

The embodiment of the present application further provides a computer-readable storage medium, where multiple instructions are stored in the computer-readable storage medium, and the instructions are suitable for being loaded by a processor to perform the steps in any one of the video stream display methods provided in the embodiment of the present application.

The method and the device can acquire a plurality of paths of video streams and sound source positions of a current scene, wherein each video stream corresponds to one image acquisition area; determining a target video stream from the multiple paths of video streams according to the sound source position and the image acquisition area; identifying a target object according to the target video stream, wherein the target object is an object with lip action; determining a video stream to be displayed from the multi-path video stream according to the identification result of the target object; and displaying the picture corresponding to the video stream to be displayed. According to the method and the device, the target video stream for identifying the speaker is determined through the sound source position, so that the efficiency of identifying the speaker can be improved, meanwhile, the video stream to be displayed is determined according to the identification result, the displayed picture can be focused on the speaker, and a better conference picture can be presented.

Drawings

In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic view of a scene of a video stream display system provided in an embodiment of the present application;

Fig. 2 is a schematic flowchart of a video stream display method provided in an embodiment of the present application;

Fig. 3 is a schematic structural diagram of a video stream display system provided in an embodiment of the present application;

Fig. 4 is a schematic flowchart of a data processing module provided in an embodiment of the present application;

Fig. 5 is a schematic flowchart of a video stream display method according to another embodiment of the present application;

Fig. 6 is a schematic structural diagram of a video stream display apparatus provided in an embodiment of the present application;

Fig. 7 is a schematic structural diagram of a computer device provided in an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

The embodiment of the application provides a video stream display method, a video stream display device, computer equipment and a storage medium.

The video stream display apparatus may be integrated in an electronic device, and the electronic device may be a terminal, a server, or the like. The terminal may be a mobile phone, a tablet computer, a smart Bluetooth device, a notebook computer, a personal computer (PC), or the like; the server may be a single server or a server cluster composed of a plurality of servers.

In some embodiments, the video stream display apparatus may also be integrated in a plurality of electronic devices, for example, the video stream display apparatus may be integrated in a plurality of servers, and the video stream display method of the present application is implemented by the plurality of servers.

In some embodiments, the server may also be implemented in the form of a terminal.

For example, referring to Fig. 1, in some embodiments, a scene schematic of a video stream display system is provided, which may include a data acquisition module 1000, a server 2000, and a terminal 3000.

The data acquisition module can acquire a plurality of paths of video streams and a sound source position of a current scene, and each video stream corresponds to one image acquisition area.

The server can determine a target video stream from the multiple paths of video streams according to the sound source position and the image acquisition area; identifying a target object according to the target video stream, wherein the target object is an object with lip motion; and determining the video stream to be displayed from the multiple video streams according to the identification result of the target object.

The terminal can display the picture corresponding to the video stream to be displayed.

Each of these is described in detail below. The numbering of the following embodiments is not intended to limit their order of preference.

In this embodiment, a video stream display method is provided, and as shown in fig. 2, a specific flow of the video stream display method may be as follows:

110. Acquiring a plurality of paths of video streams and a sound source position of a current scene, wherein each video stream corresponds to one image acquisition area.

The sound source position refers to a position from which sound is emitted in a current scene, for example, a position from which speech can be emitted in a conference scene. Sound can be collected by placing an array of microphones in the current scene and the location of the sound source is calculated according to a sound source localization algorithm.

In some embodiments, the multiple video streams include one panoramic video stream and at least one close-range video stream. The image acquisition area is the range of the current scene covered by the images that the image acquisition device corresponding to the video stream can capture. The panoramic video stream is a video stream containing a panoramic picture of the current scene; its image acquisition area is the whole of the current scene, and it can be captured by a camera with a wide-angle lens. The close-range video stream is a video stream containing a local part of the current scene; its image acquisition area is that local part, and it can be captured by a camera with a telephoto lens.

In some embodiments, the method for acquiring the sound source position may include steps 1.1 to 1.2 as follows:

1.1, collecting sound information of a current scene;

1.2, processing the collected sound information through a sound source positioning algorithm to obtain the sound source position.

The sound source localization algorithm may be, for example, TDOA (Time Difference of Arrival) or GCC-PHAT (Generalized Cross-Correlation with Phase Transform).
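As an illustration of the time-difference-of-arrival idea only, the following minimal sketch estimates the delay between two microphone signals and converts it to an arrival angle. It uses a plain time-domain cross-correlation rather than GCC-PHAT, and it assumes a two-microphone, far-field setup; the function names, the speed-of-sound constant, and the parameters are hypothetical and are not taken from the application.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def estimate_delay_samples(sig_a, sig_b, max_lag):
    """Return the lag (in samples) at which sig_b best matches sig_a.

    A positive result means sig_b is a delayed copy of sig_a.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, value in enumerate(sig_a):
            j = i + lag
            if 0 <= j < len(sig_b):
                score += value * sig_b[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def arrival_angle_deg(delay_samples, sample_rate, mic_spacing_m):
    """Convert a two-microphone delay into an angle from broadside (far field)."""
    tau = delay_samples / sample_rate
    # Clamp so that measurement noise cannot push asin() out of its domain.
    ratio = max(-1.0, min(1.0, SPEED_OF_SOUND * tau / mic_spacing_m))
    return math.degrees(math.asin(ratio))
```

A real array would typically use more than two microphones and a frequency-domain method; the brute-force loop here is chosen only because it is easy to follow.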

120. Determining a target video stream from the multiple video streams according to the sound source position and the image acquisition area.

The target video stream is determined according to the association relationship between the sound source position and the image acquisition area. The association relationship may be that the sound source position is located within the image acquisition area, or that the distance between the sound source position and the center of the image acquisition area is smaller than a preset distance.

In some embodiments, step 120 may include the following steps: determining a target image acquisition area from the image acquisition areas corresponding to the multiple video streams according to the association relationship between the sound source position and the image acquisition area, and determining the video stream corresponding to the target image acquisition area as the target video stream.
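The two association relationships described above can be sketched as follows. This is a minimal illustration that assumes each image acquisition area is an axis-aligned rectangle on the horizontal plane; the `Region` class, the stream identifiers, and the `max_center_distance` parameter are hypothetical names introduced here.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class Region:
    """Axis-aligned rectangle on the horizontal plane (assumed shape)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def contains(self, x, y):
        return self.x_min <= x <= self.x_max and self.y_min <= y <= self.y_max

    def center(self):
        return ((self.x_min + self.x_max) / 2, (self.y_min + self.y_max) / 2)


def select_target_streams(source_xy, regions, max_center_distance):
    """Return ids of streams whose acquisition area is associated with the source.

    A stream qualifies if the source lies inside its area, or if the source is
    closer to the area's center than max_center_distance.
    """
    sx, sy = source_xy
    selected = []
    for stream_id, region in regions.items():
        cx, cy = region.center()
        near_center = math.hypot(sx - cx, sy - cy) < max_center_distance
        if region.contains(sx, sy) or near_center:
            selected.append(stream_id)
    return selected
```

The sketch returns every stream that satisfies either relationship; how to choose among several candidates is left open here.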

In some embodiments, the sound source position determined by sound source positioning may contain errors due to interference such as reflection and noise. Therefore, an area in which the sound source may be located is determined from the sound source position, and the target video stream corresponding to that area is then determined, which increases the accuracy of the acquired image information. Specifically, step 120 may include steps 2.1 to 2.4, as follows:

2.1, determining a sound source area according to the sound source position;

2.2, determining an overlapping area of a sound source area and an image acquisition area for each video stream;

2.3, determining an overlapping area meeting the preset first area size as a target area;

2.4, determining the video stream corresponding to the target area as the target video stream.

The sound source area refers to an area where a sound source is located, and the area and the image acquisition area are located on the same plane. The sound source area may be determined according to a sound source position and a preset area parameter value, and the preset area parameter value may be set according to a current scene or experience, for example, a circular area with the sound source position as a center of a circle and a preset radius value as a radius is used as the sound source area, and the like.

In some embodiments, step 2.1 may comprise the steps of: acquiring a reference point; determining an included angle meeting a preset first angle by taking the reference point as a vertex and a connecting line of the reference point and the sound source position as an angular bisector; and determining the area of the included angle corresponding to the current scene as a sound source area. The reference point may be any boundary point in the current scene. In some embodiments, the reference point may be a location point for collecting sound information of the current scene, for example, a point determined by a microphone array for measuring a sound source location, which may be any location point on the microphone array or a midpoint. It should be noted that the reference point, the sound source position, the image acquisition area, the sound source area, and the target area are all located on the same plane, and may be a horizontal plane, for example, the sound source position is a position point where a real sound source position calculated by a sound source localization algorithm is projected onto the horizontal plane.

The preset first area size is a size condition for the area, set according to the current scene or experience. For example, it may be at least one third of the size of the image acquisition area corresponding to any one of the video streams, or it may be determined from the sound source area, for example at least one half of the size of the sound source area.

In some embodiments, step 2.3 may include the following step: determining an overlapping area that has the same size as the sound source area as the target area.
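Steps 2.1 to 2.4 can be sketched for the sector-shaped sound source area described above. This minimal example assumes that both the sound source area and each image acquisition area are expressed as angular intervals in degrees measured at the same reference point, and that no interval wraps around 360°; the function names and the `min_fraction` parameter are hypothetical.

```python
def sector(bisector_deg, opening_deg):
    """Angular interval centered on the line from the reference point to the source."""
    half = opening_deg / 2
    return (bisector_deg - half, bisector_deg + half)


def overlap_deg(a, b):
    """Size of the overlap between two angular intervals (no wrap-around)."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def target_streams(source_angle_deg, first_angle_deg, fields_of_view, min_fraction=0.5):
    """Return ids of streams whose overlap with the sound source area is large enough.

    min_fraction expresses the preset first area size as a fraction of the
    sound source area; 1.0 requires the overlap to equal the whole area.
    """
    source_area = sector(source_angle_deg, first_angle_deg)       # step 2.1
    size = source_area[1] - source_area[0]
    selected = []
    for stream_id, fov in fields_of_view.items():
        overlap = overlap_deg(source_area, fov)                    # step 2.2
        if overlap >= min_fraction * size:                         # step 2.3
            selected.append(stream_id)                             # step 2.4
    return selected
```

With `min_fraction=1.0` the sketch corresponds to the variant in which the overlapping area must have the same size as the sound source area.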

130. Identifying a target object according to the target video stream, wherein the target object is an object with lip motion.

The target object is an object having a lip motion recognized from image information of the target video stream. In general, the person with lip motion is speaking and therefore can be the speaker of the current scene. The lip motion may be the motion of the lips of a person speaking as determined in the prior art.

Due to the fact that image acquisition regions corresponding to different video streams are different, the target video stream used for identifying the speaker is determined through the sound source position, the data amount can be reduced, and the efficiency of identifying the speaker is improved.

In some embodiments, the sound source position determined by sound source positioning may contain errors due to interference such as reflection and noise. Therefore, an area for identifying the speaker is determined from the sound source position, and the image information corresponding to that area is acquired from the target video stream, which increases the accuracy of the acquired image information. Step 130 may include steps 3.1 to 3.3, as follows:

3.1, determining an identification area according to the sound source position;

3.2, acquiring target image information from the target video stream according to the identification area, wherein the target image information is image information corresponding to the identification area;

3.3, identifying the target object according to the target image information.

The identification area is an area, determined according to the sound source position, in which the target object is to be identified; it lies on the same plane as the image acquisition area. The sound source position is located within the identification area. The identification area can be determined according to the sound source position and a preset area parameter value, which can be set according to the current scene or experience; for example, a circular area centered on the sound source position with a preset radius may be used as the identification area. The identification area may also be the sound source area.

In some embodiments, step 3.1 may comprise the steps of: acquiring a reference point; determining an included angle meeting a preset second angle by taking the reference point as a vertex and a connecting line of the reference point and the sound source position as an angular bisector; and determining the area of the included angle corresponding to the current scene as an identification area.

The target image information is the image information of the area obtained by projecting the identification area onto the picture captured by the target video stream. Specifically, the coordinate position of the identification area may be acquired and projected into the coordinate system of the picture captured by the target video stream to obtain a projected area, and the image information within that projected area is used as the target image information.
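One possible way to perform such a projection is sketched below for an angular identification area. It assumes an ideal pinhole camera located at the reference point, a known optical-axis direction and horizontal field of view, and a convention in which larger angles map to larger pixel columns; none of these assumptions, nor the function names, come from the application.

```python
import math


def angle_to_column(angle_deg, axis_deg, hfov_deg, width_px):
    """Map a direction (degrees) to a pixel column under a pinhole model.

    Only meaningful for directions within 90 degrees of the optical axis.
    """
    focal_px = (width_px / 2) / math.tan(math.radians(hfov_deg / 2))
    offset = focal_px * math.tan(math.radians(angle_deg - axis_deg))
    column = round(width_px / 2 + offset)
    return max(0, min(width_px, column))  # clamp to the picture


def identification_columns(area_deg, axis_deg, hfov_deg, width_px):
    """Project an angular identification area onto a range of pixel columns."""
    left = angle_to_column(area_deg[0], axis_deg, hfov_deg, width_px)
    right = angle_to_column(area_deg[1], axis_deg, hfov_deg, width_px)
    return (min(left, right), max(left, right))
```

The returned column range delimits the part of each frame that would be passed to lip motion recognition.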

In this way, an identification area that may contain the target object is determined from the sound source position, the image information corresponding to the identification area is acquired from the target video stream, and that image information is checked for the presence of the target object.

In some embodiments, to improve the recognition efficiency, step 3.3 may include steps 3.3.1-3.3.2 as follows:

3.3.1, when the object with the lip action is identified from the target image information, taking the object with the lip action as a target object;

3.3.2, when the object with the lip action is not recognized from the target image information, expanding the recognition area to a preset second area size so as to recognize the target object.

For example, the recognition area is initially a sector area with an angle of 30°. When no target object is recognized in that area, the recognition area is expanded to a sector area with an angle of 40° and recognition is performed again; when no target object is recognized in that area either, the recognition area is expanded to a sector area with an angle of 50°, and so on, until a target object is recognized or the recognition area reaches its upper limit.

Because the sound source position determined by sound source positioning may contain errors, the target object may not be found in the preset recognition area during lip motion recognition. In that case, the recognition range can be widened by gradually enlarging the recognition area, which corrects the recognition result. Because the area is enlarged step by step, each area to be searched is smaller than the next, so the recognition result is obtained within the smallest possible area, which improves recognition efficiency.

In some embodiments, to further improve the recognition efficiency, step 3.3.2 may include the following steps: when no object with lip action is identified from the target image information, expanding the identification area to a preset second area size to obtain an expanded area; taking the part of the expanded area that does not overlap the identification area as a target identification area; acquiring target image information from the target video stream according to the target identification area, wherein the target image information is the image information corresponding to the target identification area; and identifying the target object according to the target image information.
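The progressive expansion, including the variant that searches only the newly added part of the area, can be sketched as follows. The example uses the 30°, 40°, 50° sector sizes mentioned above as defaults; `detect_lip_motion` is a hypothetical callback standing in for whatever lip motion detector is used, and the 90° upper limit is an assumed value.

```python
def find_speaker(source_angle_deg, detect_lip_motion,
                 start_deg=30.0, step_deg=10.0, limit_deg=90.0):
    """Search a widening sector around the sound source angle.

    detect_lip_motion(lo_deg, hi_deg) is a hypothetical detector that returns
    a detected object, or None, for the angular strip [lo_deg, hi_deg].
    Returns (object_or_None, final_opening_deg).
    """
    opening = start_deg
    half = opening / 2
    strips = [(source_angle_deg - half, source_angle_deg + half)]
    while True:
        for lo, hi in strips:
            found = detect_lip_motion(lo, hi)
            if found is not None:
                return found, opening
        if opening >= limit_deg:
            return None, opening  # upper limit reached without a detection
        previous_half = opening / 2
        opening = min(opening + step_deg, limit_deg)
        half = opening / 2
        # Search only the two strips added by the expansion, not the part
        # of the sector that has already been examined.
        strips = [
            (source_angle_deg - half, source_angle_deg - previous_half),
            (source_angle_deg + previous_half, source_angle_deg + half),
        ]
```

Because each round examines only the new strips, no part of the picture is analyzed twice, which is the efficiency gain the variant aims for.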

140. Determining the video stream to be displayed from the multiple video streams according to the identification result of the target object.

The video stream to be displayed is a video stream used for displaying a current scene. The target object may be displayed in focus by the video stream to be displayed.

In some embodiments, in order to provide a better display effect of the current scene, a display strategy determined according to the target object recognition result is provided, and step 140 may include steps 4.1 to 4.4 as follows:

4.1, when the target object is identified, determining a region to be displayed according to the target object;

4.2, when the target object is not identified, determining a region to be displayed according to all objects in the target image information;

4.3, acquiring an image acquisition area corresponding to each video stream;

4.4, determining the video stream to be displayed according to the area to be displayed and the image acquisition area.

The area to be displayed refers to an area to be displayed by the video stream to be displayed. When the target object is recognized, the area where the target object is located, for example the sound source area or the recognition area, may be set as the area to be displayed; when the target object is not recognized, the area where all objects in the target image information are located may be set as the area to be displayed. The area to be displayed may lie on the same plane as the image acquisition area, or on the same plane as the image corresponding to the target video stream. When the area to be displayed is compared with an area on a different plane, such as the image acquisition area, it can first be projected onto the plane of that area and then compared.

In some embodiments, when multiple target objects are identified, the area to be displayed is determined from the multiple target objects. At this time, the area to be displayed is an area where a plurality of target objects are located.

The area to be displayed is determined according to the identification result of the target object and is compared with the image acquisition areas to determine the video stream to be displayed. For example, the overlapping area between the area to be displayed and each image acquisition area may be determined, and the video stream corresponding to the image acquisition area with the largest overlapping area may be used as the video stream to be displayed.

In some embodiments, in order to focus on the speaker and provide a better display of the current scene, step 4.4 may include the following steps: determining the ratio of the size of the area to be displayed to the size of each image acquisition area, and taking the video stream corresponding to the image acquisition area with the highest ratio as the video stream to be displayed. In some embodiments, to avoid displaying an incomplete picture of the speaker, the ratio (size of the area to be displayed / size of the image acquisition area) must be smaller than a preset value, which may be 1.
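The ratio-based selection can be sketched as follows, assuming that the area to be displayed and the image acquisition areas are axis-aligned rectangles given as `(x_min, y_min, x_max, y_max)` on a common plane. The function names and the `max_ratio` parameter are hypothetical; the sketch compares sizes only and does not check that the area to be displayed actually lies inside the chosen acquisition area.

```python
def area(rect):
    """Area of an axis-aligned rectangle (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = rect
    return max(0.0, x_max - x_min) * max(0.0, y_max - y_min)


def choose_display_stream(display_area, acquisition_areas, max_ratio=1.0):
    """Pick the stream with the highest display-to-acquisition area ratio.

    Streams whose ratio is not below max_ratio are skipped, since their
    acquisition area would be too small to show the whole display area.
    Returns None if no stream qualifies.
    """
    best_id, best_ratio = None, 0.0
    display_size = area(display_area)
    for stream_id, rect in acquisition_areas.items():
        rect_size = area(rect)
        if rect_size == 0:
            continue
        ratio = display_size / rect_size
        if ratio < max_ratio and ratio > best_ratio:
            best_id, best_ratio = stream_id, ratio
    return best_id
```

A higher ratio means the speaker fills more of the picture, so a small area to be displayed favors the close-range stream and a large one falls back to the panoramic stream.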

150. Displaying the picture corresponding to the video stream to be displayed.

In some embodiments, in order to focus on the speaker by cropping the display picture and thereby provide a better display of the current scene, step 150 may include steps 5.1 to 5.3, as follows:

5.1, acquiring a display picture of a video stream to be displayed;

5.2, cropping the display picture of the video stream to be displayed according to the area to be displayed to obtain a cropped display picture;

5.3, displaying the cropped display picture.

The cropped display picture is the part of the display picture of the video stream to be displayed that corresponds to the area to be displayed.

By cropping the display picture of the video stream to be displayed to the picture corresponding to the area to be displayed, the speaker can be brought further into focus, providing a better display of the current scene.
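Steps 5.1 to 5.3 amount to taking a rectangular sub-picture of each frame. A minimal sketch, assuming the frame is a list of pixel rows and the area to be displayed has already been converted to a pixel box `(left, top, right, bottom)`; both representations are assumptions made for illustration.

```python
def crop_frame(frame, box):
    """Return the part of a frame (list of pixel rows) inside a pixel box.

    box is (left, top, right, bottom) with right and bottom exclusive;
    it is clamped to the frame so an oversized box cannot raise an error.
    """
    left, top, right, bottom = box
    height = len(frame)
    width = len(frame[0]) if height else 0
    left, right = max(0, left), min(width, right)
    top, bottom = max(0, top), min(height, bottom)
    return [row[left:right] for row in frame[top:bottom]]
```

In practice the cropped picture would usually also be scaled up to the output resolution before display; that step is omitted here.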

The video stream display method provided by the embodiment of the application can be applied to various scenes in which multiple persons participate. For example, taking a multi-person conference as an example, acquiring multiple video streams and a sound source position of a current scene, wherein each video stream corresponds to one image acquisition area; determining a target video stream from the multiple paths of video streams according to the sound source position and the image acquisition area; identifying a target object according to the target video stream, wherein the target object is an object with lip motion; determining a video stream to be displayed from the multiple video streams according to the identification result of the target object; and displaying the picture corresponding to the video stream to be displayed. By adopting the scheme provided by the embodiment of the application, the target video stream for identifying the speaker is determined according to the sound source position, the efficiency of identifying the speaker can be improved, meanwhile, the video stream to be displayed is determined according to the identification result, the displayed picture can be focused on the speaker, and a better conference picture can be presented.

The method described in the above embodiments is further described in detail below.

In this embodiment, a method in the embodiment of the present application will be described in detail by taking a multi-person conference scenario as an example.

As shown in fig. 3, a schematic structural diagram of a video stream display system is provided, and the system includes a data acquisition module, a data processing module, and a terminal.

The data acquisition module consists of a thermal infrared imager, an ultrasonic module, a dual-camera module, and an array microphone; the camera module acquires information and then sends it to the data processing module. The details are as follows:

The data acquisition module comprises two cameras, one with a wide-angle lens and one with a telephoto lens. The wide-angle lens has a large field angle and a wide visible range, but distant objects appear blurred. The telephoto lens has a small field angle and a narrow visible range, but captures distant objects clearly. Where the field angles overlap, the camera is switched to the telephoto lens; when the required field angle lies outside the range of the telephoto lens, the camera is switched to the wide-angle lens. The dual-camera switching method is as follows:

1. The dual-camera module comprises a wide-angle camera and a telephoto camera. The wide-angle camera has a short focal length and a wide field of view, so it captures more of the scene and each object occupies a smaller proportion of the image; conversely, the telephoto camera has a long focal length and a narrow field of view, so it captures less of the scene and each object occupies a larger proportion of the image. The wide-angle camera and the telephoto camera each output two video streams: one is used for actual picture presentation and may be called the preview stream, and the other is used for AI lip motion detection and face recognition and may be called the AI image stream.

2. The terminal can present the preview stream of only one of the two cameras, but the AI image streams of both cameras can be provided simultaneously to the image AI thread for lip motion detection and face recognition.

3. The image AI thread decides, according to the angle information from sound source positioning, which of the two AI image streams to use for lip motion recognition and face recognition. It then outputs the decision to the UVC thread, which determines which preview stream to switch to and crops it, finally presenting a face-focusing effect.
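The switching rule can be sketched as a small decision function. It assumes each camera's field angle is given as an interval of sound source angles in degrees, and it prefers the telephoto camera wherever the intervals overlap; the function names and the stream labels in the returned dictionary are hypothetical.

```python
def in_range(angle_deg, fov_deg):
    """True if the angle lies within the (low, high) field-angle interval."""
    low, high = fov_deg
    return low <= angle_deg <= high


def switching_decision(source_angle_deg, tele_fov_deg, wide_fov_deg):
    """Decide which camera's AI image stream and preview stream to use.

    The telephoto camera is preferred wherever the fields of view overlap;
    the wide-angle camera is used once the angle leaves the telephoto range.
    Returns None if the angle is outside both fields of view.
    """
    if in_range(source_angle_deg, tele_fov_deg):
        camera = "telephoto"
    elif in_range(source_angle_deg, wide_fov_deg):
        camera = "wide"
    else:
        return None
    return {"ai_stream": camera + "_ai", "preview_stream": camera + "_preview"}
```

In the described system this decision would be produced by the image AI thread and consumed by the UVC thread.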

The thermal infrared imager is used for measuring the temperature of the target object.

The ultrasonic module is used, in combination with the thermal infrared imager, to detect the distance of the target object. Because the thermal infrared imager is essentially also a camera, its lens has a minimum imaging distance; for example, the measured object must be more than 25 cm from the lens for the thermal image to be clear. The ultrasonic module can therefore detect the distance of the target and prompt the target to meet the distance requirement.

The array microphone module is used for positioning the sound source and determining the position of the speaker.

The data processing module comprises a UVC thread, a UAC thread, an image AI thread, and an audio AI thread; it acquires the information collected by the camera module and performs data processing. As shown in Fig. 4, the workflow of the threads in the data processing module is as follows:

The UVC thread is used for collecting the video stream information of the two cameras. Each camera outputs two video streams: one is output to the terminal to present a real-time picture, and the other is used by the image AI thread for lip motion analysis and face recognition.

The UAC thread is used for collecting audio stream information from the array microphone and producing two outputs: one directly outputs the PCM-format audio stream data of a single microphone to the terminal for audio playback, and the other combines the PCM-format audio stream data collected by all the microphones and sends it to the audio AI thread for sound source positioning.
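The two outputs of the UAC thread can be sketched as follows. The application does not say how the microphone data is combined, so this example assumes each microphone delivers mono 16-bit little-endian PCM and that combining means interleaving the channels sample by sample; the function names and the `playback_index` parameter are hypothetical.

```python
import struct


def interleave_pcm16(channels):
    """Interleave mono 16-bit little-endian PCM buffers into one multi-channel buffer."""
    if not channels:
        return b""
    count = min(len(buf) for buf in channels) // 2  # samples per channel
    samples = [struct.unpack("<%dh" % count, buf[:count * 2]) for buf in channels]
    frames = [value for frame in zip(*samples) for value in frame]
    return struct.pack("<%dh" % len(frames), *frames)


def route_audio(channels, playback_index=0):
    """Return (playback_pcm, localization_pcm).

    playback_pcm is one microphone's data, intended for the terminal;
    localization_pcm combines all microphones for the audio AI thread.
    """
    return channels[playback_index], interleave_pcm16(channels)
```

Keeping the per-microphone samples aligned in the combined buffer matters because sound source positioning relies on the time differences between channels.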

The image AI thread is used for analyzing and processing the image information of the two cameras output by the UVC thread and outputting a decision to the UVC thread. The decision includes which camera's video stream to display and how to enlarge and crop the image of that video stream to focus on the speaker. Specifically, the image AI thread acquires two kinds of information: the video stream information of the two cameras provided by the UVC thread, and the sound source angle information provided by the audio AI thread. After acquiring the sound source angle information, the image AI thread determines the sound source angle of the current speaker, determines from the field angle ranges of the two cameras which camera's video stream to use for lip motion analysis, determines the identification area for lip motion, and identifies face information. Finally, it feeds the decision back to the UVC thread, which switches the displayed camera and enlarges and crops the picture to focus on the speaker.

The audio AI thread analyzes and processes the PCM-format audio stream data of the array microphone supplied by the UAC thread, locates the sound source, and sends the resulting sound source angle information to the image AI thread for decision making.

The data processing module further comprises a strategy management module, which acquires the data processed by the data processing module and makes scene decisions to realize speaker tracking, speaker subtitle display and participant sign-in.

The terminal is used for displaying a picture, and the terminal may be a TV (television).

As shown in fig. 5, a specific flow of a video stream display method is as follows:

210. The array microphone collects environmental sounds in real time.

220. The audio AI thread determines and outputs sound source angle information from the collected environmental sound through a sound source positioning algorithm.

Before the array microphone collects the environmental sound, the method may further comprise the following step: the strategy management module controls the thermal infrared imager and the ultrasonic module to detect the body temperature of the participants. The ultrasonic module starts its distance detection function; when the distance of a target participant meets the imaging requirement of the thermal infrared imager, the thermal infrared imager measures the body temperature of the target participant, and when the temperature exceeds the permitted value, the target participant is not allowed to participate.

The sound source angle refers to the angle between the sound source position and the array microphone. For example, taking the midpoint of the line segment formed by the array microphone as the vertex, the sound source angle may be the angle formed by the sound source position, that midpoint, and either endpoint of the line segment.
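As a worked illustration of this definition, the sketch below computes the angle at the midpoint of the microphone line between the direction to the sound source and the direction to one endpoint. The 2-D coordinates and the function name `sound_source_angle` are assumptions made for the example.

```python
import math


def sound_source_angle(source, mic_a, mic_b):
    """Angle in degrees at the midpoint of the microphone line.

    The angle is measured between the ray from the midpoint to the sound
    source and the ray from the midpoint to the endpoint mic_b.
    All points are (x, y) tuples in an assumed common plane.
    """
    mid = ((mic_a[0] + mic_b[0]) / 2.0, (mic_a[1] + mic_b[1]) / 2.0)
    to_source = (source[0] - mid[0], source[1] - mid[1])
    to_end = (mic_b[0] - mid[0], mic_b[1] - mid[1])
    norm = math.hypot(*to_source) * math.hypot(*to_end)
    if norm == 0.0:
        raise ValueError("source or endpoint coincides with the midpoint")
    dot = to_source[0] * to_end[0] + to_source[1] * to_end[1]
    # Clamp to guard against rounding slightly outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cosine))
```

With the microphones at (-1, 0) and (1, 0), a source directly in front of the array gives an angle of about 90 degrees, and a source on the extension of the line toward `mic_b` gives an angle of about 0 degrees.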

The array microphone collects environmental sounds in real time and sends them to the UAC thread. After processing by the UAC thread, one path is sent to the terminal for playback, and the other path is sent to the audio AI thread for sound source positioning.

231. When the audio AI thread does not output sound source angle information, the image AI thread controls the terminal to display the picture captured by the wide-angle camera.

When no sound source angle is output, the device may enter a listening mode. In this mode, the UVC thread outputs the picture of the wide-angle camera by default. When the image AI thread finds that a face is recognized in only one of the two cameras' AI image streams, it notifies the UVC thread to switch to the picture of the corresponding camera, and step 210 is executed to collect the environmental sound in real time. If faces are recognized in both AI image streams, the picture of the wide-angle camera is preferentially output. If no face is recognized in either AI image stream, no focusing is performed and the picture of the wide-angle camera is preferentially output.
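The listening-mode rule reduces to a small decision function. The sketch below is illustrative only; the camera labels "wide" and "tele" and the function name are hypothetical.

```python
def listening_mode_camera(face_in_wide, face_in_tele):
    """Pick the camera to display while no sound source angle is available.

    Only when a face is recognized in exactly one stream is that camera
    chosen; in every other case the wide-angle camera is displayed.
    """
    if face_in_tele and not face_in_wide:
        return "tele"
    # Face in the wide-angle stream only, in both streams, or in neither.
    return "wide"
```

Under this rule the telephoto picture is shown only when it is the sole stream containing a recognized face.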

232. When the audio AI thread outputs sound source angle information, the image AI thread determines a sound source area according to the sound source angle information.

When the sound source angle is output, the image AI thread marks out a sector spanning between ±15° and ±30° around the sound source angle; this sector is the sound source area.
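A minimal sketch of the sector computation, assuming the sector is represented as a (start, end) pair of angles in degrees. The function name and the choice to reject half-widths outside the 15° to 30° range are assumptions of this example.

```python
def sound_source_area(angle_deg, half_width_deg=15.0):
    """Return the sector (start, end) in degrees around the sound source angle.

    The half-width is limited to the 15 to 30 degree range mentioned above.
    """
    if not 15.0 <= half_width_deg <= 30.0:
        raise ValueError("half-width must lie between 15 and 30 degrees")
    return (angle_deg - half_width_deg, angle_deg + half_width_deg)
```

For a sound source angle of 90° and the narrowest half-width, the sector runs from 75° to 105°.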

240. The image AI thread determines a target video stream from the two paths of video streams according to the sound source area.

The image AI thread decides, according to the sound source area, which camera's captured images are to be used for identification. If the sound source area is completely covered by both cameras, the image AI thread preferentially processes the image information of the telephoto camera; if the sound source area lies within the field of view of the wide-angle camera, the image AI thread processes the image information of the wide-angle camera.
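The camera choice can be sketched as a coverage test on angular intervals. The field-of-view values used in the usage note, the labels "tele" and "wide", and the `None` result for a sector that neither rule covers are assumptions of this example; the text does not specify that last case.

```python
def covers(fov, sector):
    """True if the field of view (start, end) fully contains the sector."""
    return fov[0] <= sector[0] and sector[1] <= fov[1]


def select_target_camera(sector, wide_fov, tele_fov):
    """Choose which camera's stream to analyze for a given sound source area."""
    in_wide = covers(wide_fov, sector)
    in_tele = covers(tele_fov, sector)
    if in_wide and in_tele:
        return "tele"  # both cameras cover the sector: prefer the telephoto
    if in_wide:
        return "wide"
    return None  # case not described in the text (assumption of this sketch)
```

With a hypothetical wide-angle range of (30, 150) and a telephoto range of (60, 120), the sector (75, 105) selects the telephoto camera and the sector (35, 65) selects the wide-angle camera.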

250. The image AI thread identifies a target object according to the target video stream, wherein the target object is an object with lip action.

The image AI thread performs lip-action analysis on the image information captured by the camera selected for identification in step 240, so as to identify the target object.

261. When the target object is identified, the image AI thread determines the area to be displayed according to the target object.

Initially, the area within ±15° of the sound source angle is taken as the sector for face recognition. If no person with lip action is recognized, the area within ±20° of the sound source angle is taken as the sector, and the sector is widened by a further 5° each time until a person with lip action is recognized; the sector at that moment is taken as the area to be displayed. If there are multiple speakers, the area to be displayed should cover all of them.
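The widening search can be sketched as a loop over half-widths. The lip-action detector is passed in as a callable because the text does not describe it; the 30° upper limit, the function name and the return convention are assumptions of this sketch.

```python
def find_speaker_sector(angle_deg, detect_lip_action, start=15, step=5, limit=30):
    """Widen the sector around the sound source angle until a speaker is found.

    detect_lip_action(sector) is a hypothetical callable that returns the
    list of people with lip action inside the sector (start, end).
    Returns (sector, speakers), or (None, []) if the limit is reached.
    """
    half_width = start
    while half_width <= limit:
        sector = (angle_deg - half_width, angle_deg + half_width)
        speakers = detect_lip_action(sector)
        if speakers:
            return sector, speakers
        half_width += step  # no speaker yet: widen by 5 degrees per side
    return None, []


def fake_detector(sector):
    """Stand-in detector: one speaker located at 112 degrees."""
    return [a for a in (112,) if sector[0] <= a <= sector[1]]


sector, speakers = find_speaker_sector(90, fake_detector)
```

In this usage example the speaker at 112° falls outside the ±15° and ±20° sectors and is first found in the ±25° sector.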

Finally, the UVC thread controls which camera's image information is output and crops the output image, so that the user sees the final face-focusing effect.

262. When the target object is not identified, the image AI thread determines the area to be displayed according to all objects in the target image information.

270. The image AI thread determines the video stream to be displayed according to the area to be displayed.

280. The image AI thread crops the display picture of the video stream to be displayed according to the area to be displayed to obtain the cropped display picture.

290. The terminal displays the cropped display picture.

When no person with lip action is recognized, the area within ±15° of the sound source angle is taken as the sector for face recognition. If no face is recognized, the area within ±20° of the sound source angle is taken as the sector, and the sector is widened by a further 5° each time until a face is recognized; the sector at that moment is taken as the area to be displayed. If there are several people in the area to be displayed, the area to be displayed covers all of them.

When no face is recognized, the method enters the listening mode and returns to step 210 to collect the environmental sound in real time.

Therefore, the embodiment of the application switches between the two cameras according to the acquired sound source angle, so that the displayed picture is focused on the speaker and a better conference picture can be presented.

In order to better implement the above method, embodiments of the present application further provide a video stream display apparatus, where the video stream display apparatus may be specifically integrated in an electronic device, and the electronic device may be a terminal, a server, or the like. The terminal can be a mobile phone, a tablet computer, an intelligent Bluetooth device, a notebook computer, a personal computer and other devices; the server may be a single server or a server cluster composed of a plurality of servers.

In this embodiment, the case in which the video stream display apparatus is integrated in a terminal is taken as an example to describe the method in detail.

For example, as shown in fig. 6, the video stream display apparatus may include an acquisition unit 310, a first determination unit 320, a recognition unit 330, a second determination unit 340, and a display unit 350, as follows:

(I) acquisition unit 310

The acquisition unit 310 is used for acquiring multiple paths of video streams and the sound source position of the current scene, where each video stream corresponds to one image acquisition area.

In some embodiments, the method for acquiring the sound source position may include steps 6.1 to 6.2 as follows:

6.1, collecting sound information of a current scene;

6.2, processing the collected sound information through a sound source positioning algorithm to obtain the sound source position.
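Step 6.2 does not name a particular sound source positioning algorithm. As one common possibility only, the sketch below uses the far-field time-difference-of-arrival relation for two microphones, cos(θ) = c·τ/d, where τ is the arrival delay between the microphones, d their spacing and c the speed of sound. The function name, the sign convention and the 343 m/s value are assumptions of this example, not details from the application.

```python
import math

SPEED_OF_SOUND = 343.0  # metres per second in air at about 20 degrees C (assumed)


def angle_from_delay(delay_s, mic_spacing_m, speed=SPEED_OF_SOUND):
    """Estimate the arrival angle in degrees from a two-microphone delay.

    delay_s is the arrival time at microphone B minus the arrival time at
    microphone A. The angle is measured from the axis pointing from B to A,
    so a source broadside to the array (zero delay) gives 90 degrees.
    """
    if mic_spacing_m <= 0:
        raise ValueError("microphone spacing must be positive")
    ratio = speed * delay_s / mic_spacing_m
    # Measurement noise can push the ratio slightly outside [-1, 1].
    ratio = max(-1.0, min(1.0, ratio))
    return math.degrees(math.acos(ratio))
```

A practical system would first have to estimate the delay itself, for example by cross-correlating the microphone signals, which this sketch leaves out.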

(II) first determination unit 320

The first determination unit 320 is used for determining a target video stream from the multiple video streams according to the sound source position and the image acquisition area.

In some embodiments, the first determination unit 320 may be specifically used to perform steps 7.1 to 7.4, as follows:

7.1, determining a sound source area according to the sound source position;

7.2, determining an overlapping area of a sound source area and an image acquisition area for each video stream;

7.3, determining an overlapping area meeting the preset first area size as a target area;

7.4, determining the video stream corresponding to the target area as the target video stream.
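Steps 7.1 to 7.4 can be sketched with areas modelled as angular intervals. Reading "meeting the preset first area size" as "overlap at least `min_overlap` degrees", and breaking ties in favour of the stream listed first, are assumptions of this example; the names are hypothetical.

```python
def overlap_size(area_a, area_b):
    """Length in degrees of the overlap of two (start, end) intervals."""
    low = max(area_a[0], area_b[0])
    high = min(area_a[1], area_b[1])
    return max(0.0, high - low)


def select_target_stream(sound_area, capture_areas, min_overlap):
    """Return the id of the stream whose overlap with the sound area is largest.

    capture_areas maps a stream id to its image acquisition area. Streams
    whose overlap is below min_overlap are ignored; None is returned if no
    stream qualifies. On a tie the stream listed first wins.
    """
    best_id, best_size = None, 0.0
    for stream_id, area in capture_areas.items():
        size = overlap_size(sound_area, area)
        if size >= min_overlap and size > best_size:
            best_id, best_size = stream_id, size
    return best_id
```

Listing the telephoto stream first in `capture_areas` reproduces the preference for the telephoto camera when both cameras cover the sound source area equally.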

(III) recognition unit 330

The recognition unit 330 is used for identifying a target object according to the target video stream, wherein the target object is an object with lip action.

In some embodiments, the recognition unit 330 may be specifically used to perform steps 8.1 to 8.3, as follows:

8.1, determining an identification area according to the sound source position;

8.2, acquiring target image information from the target video stream according to the identification area, wherein the target image information is image information corresponding to the identification area;

8.3, identifying the target object according to the target image information.

In some embodiments, step 8.3 may include steps 8.3.1 to 8.3.2, as follows:

8.3.1, when the object with the lip action is identified from the target image information, taking the object with the lip action as a target object;

8.3.2, when the object with the lip action is not identified from the target image information, expanding the identification area to a preset second area size so as to identify the target object.

(IV) second determination unit 340

The second determination unit 340 is used for determining the video stream to be displayed from the multiple video streams according to the identification result of the target object.

In some embodiments, the second determination unit 340 may be specifically used to perform steps 9.1 to 9.4, as follows:

9.1, when the target object is identified, determining a region to be displayed according to the target object;

9.2, when the target object is not identified, determining a region to be displayed according to all objects in the target image information;

9.3, acquiring an image acquisition area corresponding to each video stream;

9.4, determining the video stream to be displayed according to the area to be displayed and the image acquisition area.
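Steps 9.1 to 9.4 can be sketched as follows, with objects represented by their angular positions. The 5° margin, the rule of preferring the narrowest field of view that contains the area, and all names are assumptions of this example.

```python
def region_to_display(speakers, all_objects, margin=5.0):
    """Area (start, end) covering the speakers, or all objects if none speak.

    speakers and all_objects are lists of angular positions in degrees.
    Returns None when there is nothing to frame.
    """
    subjects = speakers if speakers else all_objects
    if not subjects:
        return None
    return (min(subjects) - margin, max(subjects) + margin)


def stream_to_display(region, capture_areas):
    """Pick the stream whose acquisition area fully contains the region.

    Among the qualifying streams, the one with the narrowest field of view
    is chosen (an assumption of this sketch).
    """
    candidates = [
        (high - low, stream_id)
        for stream_id, (low, high) in capture_areas.items()
        if low <= region[0] and region[1] <= high
    ]
    return min(candidates)[1] if candidates else None
```

When two people at 88° and 95° are speaking, the area to be displayed is (83, 100) and, with hypothetical acquisition areas of (30, 150) and (60, 120), the narrower telephoto stream is selected.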

(V) display unit 350

The display unit 350 is used for displaying the picture corresponding to the video stream to be displayed.

In some embodiments, the display unit 350 may be specifically used to perform steps 10.1 to 10.3, as follows:

10.1, acquiring a display picture of a video stream to be displayed;

10.2, cropping the display picture of the video stream to be displayed according to the area to be displayed to obtain the cropped display picture;

10.3, displaying the cropped display picture.
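The cropping in step 10.2 can be illustrated on a frame stored as a list of pixel rows. The pixel box format (left, top, right, bottom) and the clamping to the frame boundaries are assumptions of this sketch; the mapping from the angular area to be displayed to pixel coordinates is not described in the text and is left out.

```python
def crop_frame(frame, box):
    """Crop a frame (list of rows) to the box (left, top, right, bottom).

    right and bottom are exclusive. The box is clamped to the frame size so
    that an area extending past the picture edge still yields a valid crop.
    """
    height = len(frame)
    width = len(frame[0]) if frame else 0
    left, top, right, bottom = box
    left, right = max(0, left), min(width, right)
    top, bottom = max(0, top), min(height, bottom)
    return [row[left:right] for row in frame[top:bottom]]


# A 4x4 test frame whose pixel value encodes its row and column.
frame = [[row * 10 + col for col in range(4)] for row in range(4)]
cropped = crop_frame(frame, (1, 1, 3, 3))
```

The cropped picture would then be scaled to the terminal's resolution before display.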

In a specific implementation, the above units may be implemented as independent entities, or may be combined arbitrarily to be implemented as the same or several entities, and the specific implementation of the above units may refer to the foregoing method embodiments, which are not described herein again.

Therefore, the video stream display apparatus of this embodiment can determine the target video stream for identifying the speaker according to the sound source position, which improves the efficiency of identifying the speaker; meanwhile, the video stream to be displayed is determined according to the identification result, so that the displayed picture is focused on the speaker and a better conference picture can be presented.

Correspondingly, the embodiment of the present application further provides a computer device, where the computer device may be a terminal or a server, and the terminal may be a terminal device such as a smart phone, a tablet computer, a notebook computer, a touch screen, a game machine, a personal computer, or a Personal Digital Assistant (PDA).

Fig. 7 is a schematic structural diagram of a computer device 400 according to an embodiment of the present application. As shown in fig. 7, the computer device 400 includes a processor 410 having one or more processing cores, a memory 420 having one or more computer-readable storage media, and a computer program stored in the memory 420 and runnable on the processor 410. The processor 410 is electrically connected to the memory 420. Those skilled in the art will appreciate that the computer device configurations illustrated in the figures are not meant to be limiting of computer devices and may include more or fewer components than those illustrated, or some components may be combined, or a different arrangement of components.

The processor 410 is a control center of the computer device 400, connects various parts of the entire computer device 400 using various interfaces and lines, performs various functions of the computer device 400 and processes data by running or loading software programs and/or modules stored in the memory 420 and calling data stored in the memory 420, thereby monitoring the computer device 400 as a whole.

In the embodiment of the present application, the processor 410 in the computer device 400 loads instructions corresponding to processes of one or more applications into the memory 420, and the processor 410 executes the applications stored in the memory 420 according to the following steps, so as to implement various functions:

acquiring a plurality of paths of video streams and a sound source position of a current scene, wherein each video stream corresponds to one image acquisition area; determining a target video stream from the multiple paths of video streams according to the sound source position and the image acquisition area; identifying a target object according to the target video stream, wherein the target object is an object with lip motion; determining a video stream to be displayed from the multiple video streams according to the identification result of the target object; and displaying the picture corresponding to the video stream to be displayed.

For the specific implementation of the above operations, reference may be made to the foregoing embodiments; details are not repeated here.

Optionally, as shown in fig. 7, the computer device 400 further includes: touch display 430, radio frequency circuit 440, audio circuit 450, input unit 460 and power supply 470. The processor 410 is electrically connected to the touch display 430, the radio frequency circuit 440, the audio circuit 450, the input unit 460 and the power supply 470, respectively. Those skilled in the art will appreciate that the computer device configuration illustrated in FIG. 7 does not constitute a limitation of computer devices, and may include more or fewer components than those illustrated, or some components may be combined, or a different arrangement of components.

The touch display screen 430 can be used for displaying a graphical user interface and receiving operation instructions generated by a user acting on the graphical user interface. The touch display screen 430 may include a display panel and a touch panel. The display panel may be used to display information entered by or provided to a user and the various graphical user interfaces of the computer device, which may be made up of graphics, text, icons, video, and any combination thereof. Optionally, the display panel may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like. The touch panel may be used to collect touch operations of a user on or near the touch panel (for example, operations performed by the user on or near the touch panel with any suitable object or accessory such as a finger or a stylus) and to generate corresponding operation instructions, which cause the corresponding programs to be executed. Optionally, the touch panel may include two parts, a touch detection device and a touch controller. The touch detection device detects the touch position of the user, detects the signal produced by the touch operation and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into touch point coordinates, sends the coordinates to the processor 410, and can receive and execute commands sent by the processor 410. The touch panel may overlay the display panel; when the touch panel detects a touch operation on or near it, the touch panel transmits the touch operation to the processor 410 to determine the type of the touch event, and the processor 410 then provides a corresponding visual output on the display panel according to the type of the touch event.
In the embodiment of the present application, the touch panel and the display panel may be integrated into the touch display screen 430 to implement input and output functions. However, in some embodiments, the touch panel and the display panel may be implemented as two separate components to perform the input and output functions. That is, the touch display screen 430 can also be used as a part of the input unit 460 to implement an input function.

The radio frequency circuit 440 may be used for transmitting and receiving radio frequency signals, so as to establish wireless communication with a network device or another computer device and to exchange signals with that device.

The audio circuit 450 may be used to provide an audio interface between a user and the computer device through a speaker and a microphone. On the one hand, the audio circuit 450 may convert received audio data into an electrical signal and transmit it to the speaker, which converts it into a sound signal for output; on the other hand, the microphone converts a collected sound signal into an electrical signal, which is received by the audio circuit 450 and converted into audio data. The audio data is then output to the processor 410 for processing and sent through the radio frequency circuit 440 to, for example, another computer device, or output to the memory 420 for further processing. The audio circuit 450 may also include an earphone jack to allow a peripheral headset to communicate with the computer device.

The input unit 460 may be used to receive input numbers, character information, or user characteristic information (e.g., fingerprint, iris, facial information, etc.), and generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control.

The power supply 470 is used to power the various components of the computer device 400. Optionally, the power supply 470 may be logically connected to the processor 410 through a power management system, so that functions such as charging management, discharging management and power consumption management are implemented through the power management system. The power supply 470 may also include one or more DC or AC power sources, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and any other such components.

Although not shown in fig. 7, the computer device 400 may further include a camera, a sensor, a wireless fidelity module, a bluetooth module, etc., which are not described in detail herein.

In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

As can be seen from the above, the computer device provided in this embodiment can determine the target video stream for identifying the speaker through the sound source position, so as to improve the efficiency of identifying the speaker, and at the same time, determine the video stream to be displayed according to the identification result, so as to focus the displayed picture on the speaker, thereby presenting a better conference picture.

It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions or by associated hardware controlled by the instructions, which may be stored in a computer readable storage medium and loaded and executed by a processor.

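To this end, embodiments of the present application provide a computer-readable storage medium, in which a plurality of computer programs are stored, where the computer programs can be loaded by a processor to execute the steps in any one of the video stream display methods provided by the embodiments of the present application. For example, the computer program may perform the steps of:

acquiring a plurality of paths of video streams and a sound source position of a current scene, wherein each video stream corresponds to one image acquisition area; determining a target video stream from the multiple paths of video streams according to the sound source position and the image acquisition area; identifying a target object according to the target video stream, wherein the target object is an object with lip action; determining a video stream to be displayed from the multiple video streams according to the identification result of the target object; and displaying the picture corresponding to the video stream to be displayed.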

The storage medium may include a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and the like.

Since the computer program stored in the storage medium can execute the steps in any video stream display method provided in the embodiments of the present application, the beneficial effects that can be achieved by any video stream display method provided in the embodiments of the present application can be achieved, which are detailed in the foregoing embodiments and will not be described again here.

The foregoing detailed description has provided a video stream display method, apparatus, storage medium and computer device according to embodiments of the present application, and specific examples have been applied in the present application to explain the principles and implementations of the present application, and the descriptions of the foregoing embodiments are only used to help understand the method and core ideas of the present application; meanwhile, for those skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims

1. A method for displaying a video stream, comprising:

acquiring a plurality of paths of video streams and a sound source position of a current scene, wherein each video stream corresponds to an image acquisition area;

determining a target video stream from the multiple paths of video streams according to the sound source position and the image acquisition area;

identifying a target object according to the target video stream, wherein the target object is an object with lip action;

determining a video stream to be displayed from the multiple paths of video streams according to the identification result of the target object;

and displaying the picture corresponding to the video stream to be displayed.

2. The method for displaying a video stream according to claim 1, wherein said determining a target video stream from the plurality of video streams based on the sound source position and the image capturing area comprises:

determining a sound source area according to the sound source position;

for each of the video streams, determining an overlapping area of the sound source area and the image acquisition area;

determining the overlapping area meeting the preset first area size as a target area;

and determining the video stream corresponding to the target area as a target video stream.

3. The method for displaying a video stream according to claim 1, wherein said identifying a target object from said target video stream comprises:

determining an identification area according to the sound source position;

acquiring target image information from the target video stream according to the identification area, wherein the target image information is image information corresponding to the identification area;

and identifying the target object according to the target image information.

4. The method for displaying a video stream according to claim 3, wherein said identifying a target object based on said target image information comprises:

when the object with the lip action is identified from the target image information, taking the object with the lip action as a target object;

and when the object with the lip action is not identified from the target image information, expanding the identification area to a preset second area size so as to identify the target object.

5. The method for displaying a video stream according to claim 1, wherein the determining a video stream to be displayed from the plurality of video streams based on the recognition result of the target object comprises:

when the target object is identified, determining a region to be displayed according to the target object;

when the target object is not identified, determining a region to be displayed according to all objects in the target image information;

acquiring an image acquisition area corresponding to each video stream;

and determining the video stream to be displayed according to the area to be displayed and the image acquisition area.

6. The method for displaying a video stream according to claim 5, wherein the displaying a picture corresponding to the video stream to be displayed comprises:

acquiring a display picture of the video stream to be displayed;

cropping the display picture of the video stream to be displayed according to the area to be displayed to obtain the cropped display picture;

and displaying the cropped display picture.

7. The video stream display method according to claim 1, wherein the method of acquiring the sound source position includes:

collecting sound information of a current scene;

and processing the collected sound information through a sound source positioning algorithm to obtain the sound source position.

8. A video stream display apparatus, comprising:

an acquisition unit, configured to acquire a plurality of paths of video streams and a sound source position of a current scene, wherein each video stream corresponds to an image acquisition area;

a first determining unit, configured to determine a target video stream from the multiple video streams according to the sound source position and the image acquisition area;

an identification unit, configured to identify a target object according to the target video stream, wherein the target object is an object with lip action;

a second determining unit, configured to determine a video stream to be displayed from the multiple paths of video streams according to the identification result of the target object;

and a display unit, configured to display the picture corresponding to the video stream to be displayed.

9. A computer device comprising a processor and a memory, the memory storing a plurality of instructions; the processor loads instructions from the memory to perform the steps of the video stream display method according to any one of claims 1 to 7.

10. A computer readable storage medium storing instructions adapted to be loaded by a processor to perform the steps of the method of displaying a video stream according to any one of claims 1 to 7.