CN111541859A - Video conference processing method and device, electronic equipment and storage medium


Info

Publication number: CN111541859A
Authority: CN (China)
Prior art keywords: video, conference, playing, file, playing time
Legal status: Pending
Application number: CN202010256843.3A
Other languages: Chinese (zh)
Inventors: 刘苹苹, 牛永会, 李玉城, 杨春晖
Current Assignee: Visionvera Information Technology Co Ltd
Original Assignee: Visionvera Information Technology Co Ltd
Application filed by Visionvera Information Technology Co Ltd
Priority to CN202010256843.3A
Publication of CN111541859A


Classifications

    • H04N 7/15 Conference systems
    • H04N 5/76 Television signal recording
    • H04N 7/155 Conference systems involving storage of or access to video conference sessions

Abstract

Embodiments of the invention provide a video conference processing method and device, electronic equipment and a storage medium. The method is applied to a storage service system in a video network and comprises: acquiring a video file recording the video conference and a text file corresponding to the audio streams in the video conference within the recording time period, the text file comprising multiple sections of text information, each section carrying the starting playing time, in the video file, of the audio stream to which the text information corresponds; playing the video file when a playing request of a user for the video file is received; in the process of playing the video file, extracting at least one section of text information corresponding to the current playing time from the text file according to the current playing time and each starting playing time; and displaying the at least one section of text information. The technical scheme of the invention improves the efficiency of acquiring the conference content of a video conference.

Description

Video conference processing method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a video conference processing method and apparatus, an electronic device, and a storage medium.
Background
With the development of the video network, many users hold large-scale high-definition video conferences in the video network. During a video conference, the conference content often needs to be recorded. In the related art, the video conference is generally recorded as video, and a user acquires the conference content by watching the recorded video file.
However, with this approach, when the recording quality is poor or a speaker's audio is unclear, the user cannot accurately understand the conference content while watching the video file. Moreover, the user has to manually take notes on the speech content while watching the recorded video file, so the efficiency of recording the speech content of the video conference is low.
Disclosure of Invention
In view of the above problems, embodiments of the present invention are proposed to provide a video conference processing method, apparatus, electronic device and storage medium that overcome or at least partially solve the above problems.
In a first aspect of the embodiments of the present invention, a method for processing a video conference is disclosed, where the method is applied to a storage service system in a video network, and includes:
acquiring a video file for recording the video conference and a text file corresponding to an audio stream in the video conference within a recording time period; the text file comprises a plurality of sections of text information, wherein each section of text information comprises the initial playing time of an audio stream corresponding to the text information in the video file;
when a playing request of a user for the video file is received, playing the video file;
in the process of playing the video file, extracting at least one section of text information corresponding to the current playing time from the text file according to the current playing time and each initial playing time;
and displaying the at least one piece of text information.
Optionally, the video file carries a plurality of marks, and the playing the video file includes:
displaying the marks on a playing progress bar for playing the video file, wherein each mark on the playing progress bar is used for indicating the playing time of the mark in the video file;
extracting at least one section of text information corresponding to the current playing time from the text file according to the current playing time and each starting playing time, wherein the extracting comprises the following steps:
when the triggering operation of the plurality of marks is detected, determining a first playing time of the triggered marks in the video file, and switching the current playing time to the first playing time so as to play the video file from the first playing time;
and extracting, from the text file, at least one section of first text information whose starting playing time is within a first preset time range from the first playing time.
Optionally, the video file carries a plurality of marks, and the playing the video file includes:
displaying the marks on a playing progress bar for playing the video file, wherein each mark on the playing progress bar is used for indicating the playing time of the mark in the video file;
extracting at least one section of text information corresponding to the current playing time from the text file according to the current playing time and each starting playing time, wherein the extracting comprises the following steps:
determining a progress position corresponding to the current playing time, and determining a target mark within a preset distance from the progress position in the plurality of marks;
determining a second playing time in the video file from the target mark;
and extracting, from the text file, at least one section of second text information whose starting playing time is within a preset time range from the second playing time.
Optionally, extracting at least one piece of text information corresponding to the current playing time from the text file according to the current playing time and each of the start playing times, including:
determining at least one starting playing time within a second preset time range with the current playing time from the starting playing times;
and acquiring at least one piece of third text information corresponding to the at least one starting playing time from the text file.
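The time-window matching above can be made concrete with a short sketch. The following Python is illustrative only: the segment structure, the field names and the 10-second window are assumptions, not details taken from the patent.

```python
# Minimal sketch of extracting text information whose starting playing
# time falls within a preset range of the current playing time.
# All names and the 10-second window are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class TextSegment:
    start_play_time: float  # seconds from the start of the video file
    speaker: str
    text: str

def extract_near(segments: list[TextSegment],
                 current_play_time: float,
                 window: float = 10.0) -> list[TextSegment]:
    """Return every segment whose starting playing time lies within
    `window` seconds of the current playing time (the 'second preset
    time range' of the method)."""
    return [s for s in segments
            if abs(s.start_play_time - current_play_time) <= window]
```

During playback, calling such a function on each player time update would yield the third text information to display alongside the video.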
Optionally, a conference control terminal and an autonomous server are deployed in the video network, and before obtaining a video file for recording the video conference, the method further includes:
receiving a conference recording instruction sent by the autonomous server, wherein the conference recording instruction is generated by the conference control terminal and is sent to the autonomous server;
responding to the conference recording instruction, recording the audio stream and the video stream which are currently generated in the video conference, and caching the audio stream; the audio stream is an audio stream collected by a terminal which is speaking currently in the video conference;
when detecting that the total playing duration of a cached section of the audio stream reaches a preset duration, recognizing the cached section of the audio stream to obtain the text information corresponding to that section, and taking the start time of that section in the recording process as the starting playing time;
and when the recording of the current video conference is finished, storing the multiple sections of text information corresponding to the multiple cached sections of the audio stream as the text file.
Optionally, after storing the multiple pieces of text information corresponding to the multiple pieces of cached audio streams as a text file when recording of the current video conference is finished, the method further includes:
when the video conference is finished, obtaining conference data of the video conference;
generating a conference summary based on the conference data and the text file, and publishing the conference summary;
and when a request of a user for downloading the published conference summary is received, sending the conference summary to the user.
In a second aspect of the embodiments of the present invention, there is provided a video conference processing apparatus, where the apparatus is applied to a storage service system in a video network, and the apparatus includes:
the file acquisition module is used for acquiring a video file for recording the video conference and a text file corresponding to an audio stream in the video conference within a recording time period; the text file comprises a plurality of sections of text information, wherein each section of text information comprises the initial playing time of an audio stream corresponding to the text information in the video file;
the video playing module is used for playing the video file when receiving a playing request of a user for the video file;
the information extraction module is used for extracting at least one section of text information corresponding to the current playing time from the text file according to the current playing time and each initial playing time in the process of playing the video file;
and the information display module is used for displaying the at least one section of text information.
Optionally, a conference control terminal and an autonomous server are deployed in the video network, and the apparatus further includes:
the instruction receiving module is used for receiving a conference recording instruction sent by the autonomous server, wherein the conference recording instruction is generated by the conference control terminal and is sent to the autonomous server;
the video recording module is used for responding to the conference recording instruction, recording the audio stream and the video stream which are currently generated in the video conference, and caching the audio stream; the audio stream is an audio stream collected by a terminal which is speaking currently in the video conference;
the audio identification module is used for identifying the cached section of audio stream to obtain character information corresponding to the section of audio stream when the total playing duration of the cached section of audio stream reaches the preset duration, and taking the initial time of the section of audio stream in the recording process as the initial playing time;
and the file obtaining module is used for storing the multiple sections of character information corresponding to the multiple sections of cached audio streams as text files when the recording of the current video conference is finished.
Optionally, the video file carries a plurality of marks, and the file playing module is specifically configured to display the plurality of marks on a playing progress bar for playing the video file, where each mark on the playing progress bar is used to indicate a playing time of the mark in the video file;
the information extraction module comprises:
a first determination unit configured to determine, when a triggering operation on the plurality of marks is detected, a first playing time of the triggered mark in the video file;
the progress adjusting unit is used for switching the current playing time to the first playing time so as to play the video file from the first playing time;
and the first extraction unit is used for extracting, from the text file, at least one section of first text information whose starting playing time is within a first preset time range from the first playing time.
Optionally, the video file carries a plurality of marks, and the file playing module is specifically configured to display the plurality of marks on a playing progress bar for playing the video file, where each mark on the playing progress bar is used to indicate a playing time of the mark in the video file;
the information extraction module comprises:
the second determining unit is used for determining a progress position corresponding to the current playing time and determining a target mark within a preset distance from the progress position in the plurality of marks;
a third determining unit, configured to determine a second playing time in the video file from the target mark;
and the second extraction unit is used for extracting, from the text file, at least one piece of second text information whose starting playing time is within a preset time range from the second playing time.
Optionally, the information extraction module includes:
a fourth determining unit, configured to determine, from each of the start playing times, at least one start playing time that is within a second preset time range from the current playing time;
and the third extraction unit is used for acquiring at least one piece of third text information corresponding to the at least one starting playing time from the text file.
Optionally, the apparatus further comprises:
a conference data obtaining module, configured to obtain conference data of the video conference when the video conference is finished;
the conference summary generation module is used for generating a conference summary based on the conference data and the text file and publishing the conference summary;
and the conference summary sending module is used for sending the conference summary to the user when receiving a downloading request of the user to the published conference summary.
In a third aspect of the embodiments of the present invention, there is provided an electronic device, including:
one or more processors; and
one or more machine readable media having instructions stored thereon that, when executed by the one or more processors, cause the electronic device to perform the video conference processing method according to the embodiments of the invention.
In a fourth aspect of the embodiments of the present invention, a computer-readable storage medium is further provided, which stores a computer program for causing a processor to execute the video conference processing method according to the embodiments of the present invention.
The embodiment of the invention has the following advantages:
in the embodiment of the invention, the storage service system can obtain the video file recorded in the video conference and the text file corresponding to the audio stream in the video conference, and in the process of playing the video file, the text information corresponding to the playing time can be displayed according to the current playing time and the initial playing time corresponding to each piece of text information in the text file. On one hand, when watching the recorded video conference, the user can synchronously obtain the content of the video conference corresponding to the current playing time period through the displayed text information, so that the accuracy of understanding the conference content of the video conference by the user is improved, and the efficiency of obtaining the conference content by the user is improved. On the other hand, the storage service system can acquire the text file corresponding to the audio in the video conference, so that a user does not need to manually record the content of the video conference, and the efficiency of recording the video conference is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the description of the embodiments of the present application will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive exercise.
FIG. 1 is a schematic networking diagram of a video network of the present invention;
FIG. 2 is a schematic diagram of a hardware architecture of a node server according to the present invention;
fig. 3 is a schematic diagram of a hardware structure of an access switch of the present invention;
fig. 4 is a schematic diagram of a hardware structure of an ethernet protocol conversion gateway according to the present invention;
fig. 5 is an implementation environment diagram of a video conference processing method according to an embodiment of the present invention
FIG. 6 is a flow chart of the steps of a method of video conference processing according to an embodiment of the present invention;
fig. 7 is a flowchart of a step of recording a video conference in a video conference processing method according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of an interface for displaying text information in a playing video file in accordance with an embodiment of the present invention;
fig. 9 is a schematic structural diagram of a video conference processing apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
The video network is an important milestone in network development. It is a real-time network that can realize real-time transmission of high-definition video, pushing many Internet applications toward high-definition, face-to-face interaction.
The video network adopts real-time high-definition video switching technology and can integrate dozens of required services, such as video, voice, pictures, text, communication and data, on one system platform over the network, for example high-definition video conferencing, video monitoring, intelligent monitoring analysis, emergency command, digital broadcast television, delayed television, network teaching, live broadcast, video on demand (VOD), television mail, Personal Video Recorder (PVR), intranet (self-office) channels, intelligent video broadcast control and information distribution, realizing high-definition-quality video broadcast through a television or a computer.
To better understand the embodiments of the present invention, the video network is described below:
some of the technologies applied in the video networking are as follows:
network Technology (Network Technology)
Network technology innovation in the video network improves the traditional Ethernet to cope with the potentially huge video traffic on the network. Unlike pure network Packet Switching or network Circuit Switching, the video network adopts Packet Switching to meet the demands of streaming media. The video network technology has the flexibility, simplicity and low cost of packet switching as well as the quality and security guarantees of circuit switching, realizing seamless connection of switched virtual circuits across the whole network together with a unified data format.
Switching Technology (Switching Technology)
The video network takes the two advantages of Ethernet, asynchrony and packet switching, and eliminates Ethernet's defects on the premise of full compatibility. It provides end-to-end seamless connection across the whole network, communicates directly with user terminals and directly carries IP data packets, and user data requires no format conversion anywhere in the network. The video network is a higher-level form of Ethernet and a real-time exchange platform; it can realize large-scale, whole-network real-time transmission of high-definition video that the existing Internet cannot, pushing many network video applications toward high definition and unification.
Server Technology (Server Technology)
The server technology of the video network and the unified video platform differs from that of traditional servers: streaming media transmission is built on a connection-oriented basis, data processing capacity is independent of traffic and communication time, and a single network layer can contain both signaling and data transmission. For voice and video services, streaming media processing on the video network and the unified video platform is much simpler than data processing, and efficiency is improved more than a hundredfold over a traditional server.
Storage Technology (Storage Technology)
To adapt to media content of ultra-large capacity and ultra-large traffic, the ultra-high-speed storage technology of the unified video platform adopts the most advanced real-time operating system. The program information in a server instruction is mapped to specific hard disk space, and the media content no longer passes through the server but is sent directly and instantly to the user terminal, with a typical user waiting time of less than 0.2 second. The optimized sector distribution greatly reduces the mechanical head-seeking motion of the hard disk; resource consumption is only 20% of that of an IP Internet system of the same grade, yet it generates concurrent traffic 3 times that of a traditional hard disk array, improving overall efficiency by more than 10 times.
Network Security Technology (Network Security Technology)
The structural design of the video network structurally eliminates the network security problems that trouble the Internet, through mechanisms such as independent permission control for each service and complete isolation of equipment and user data. It generally needs no antivirus programs or firewalls, avoids attacks by hackers and viruses, and provides users with a structurally worry-free secure network.
Service Innovation Technology (Service Innovation Technology)
The unified video platform integrates services and transmission: whether for a single user, a private-network user or a network aggregate, only one automatic connection is needed. The user terminal, set-top box or PC connects directly to the unified video platform to obtain multimedia video services in various forms. The unified video platform adopts a menu-style configuration table instead of traditional complex application programming, so complex applications can be realized with very little code, enabling endless new service innovation.
Networking of the video network is as follows:
the video network is a centralized control network structure, and the network can be a tree network, a star network, a ring network and the like, but on the basis of the centralized control node, the whole network is controlled by the centralized control node in the network.
As shown in fig. 1, the video network is divided into an access network and a metropolitan network.
The devices of the access network part can be mainly classified into 3 types: node server, access switch, terminal (including various set-top boxes, coding boards, memories, etc.). The node server is connected to an access switch, which may be connected to a plurality of terminals and may be connected to an ethernet network.
The node server is a node which plays a centralized control function in the access network and can control the access switch and the terminal. The node server can be directly connected with the access switch or directly connected with the terminal.
Similarly, devices of the metropolitan network portion may also be classified into 3 types: a metropolitan area server, a node switch and a node server. The metro server is connected to a node switch, which may be connected to a plurality of node servers.
The node server here is the node server of the access network part; that is, the node server belongs to both the access network part and the metropolitan area network part.
The metropolitan area server is a node which plays a centralized control function in the metropolitan area network and can control a node switch and a node server. The metropolitan area server can be directly connected with the node switch or directly connected with the node server.
Therefore, the whole video network is a network structure with layered centralized control, and the network controlled by the node server and the metropolitan area server can be in various structures such as tree, star and ring.
The access network part can form a unified video platform (the part in the dotted circle), and a plurality of unified video platforms can form a video network; each unified video platform may be interconnected via metropolitan area and wide area video networking.
1. Video networking device classification
1.1 devices in the video network of the embodiment of the present invention can be mainly classified into 3 types: server, exchanger (including Ethernet protocol conversion gateway), terminal (including various set-top boxes, code board, memory, etc.). The video network as a whole can be divided into a metropolitan area network (or national network, global network, etc.) and an access network.
1.2 wherein the devices of the access network part can be mainly classified into 3 types: node server, access exchanger (including Ethernet protocol conversion gateway), terminal (including various set-top boxes, coding board, memory, etc.).
The specific hardware structure of each access network device is as follows:
a node server:
as shown in fig. 2, the system mainly includes a network interface module 201, a switching engine module 202, a CPU module 203, and a disk array module 204;
the network interface module 201, the CPU module 203, and the disk array module 204 all enter the switching engine module 202; the switching engine module 202 performs an operation of looking up the address table 205 on the incoming packet, thereby obtaining the direction information of the packet; and stores the packet in a queue of the corresponding packet buffer 206 based on the packet's steering information; if the queue of the packet buffer 206 is nearly full, it is discarded; the switching engine module 202 polls all packet buffer queues for forwarding if the following conditions are met: 1) the port send buffer is not full; 2) the queue packet counter is greater than zero. The disk array module 204 mainly implements control over the hard disk, including initialization, read-write, and other operations on the hard disk; the CPU module 203 is mainly responsible for protocol processing with an access switch and a terminal (not shown in the figure), configuring an address table 205 (including a downlink protocol packet address table, an uplink protocol packet address table, and a data packet address table), and configuring the disk array module 204.
The access switch:
as shown in fig. 3, the network interface module mainly includes a network interface module (a downlink network interface module 301 and an uplink network interface module 302), a switching engine module 303 and a CPU module 304;
wherein, a packet (upstream data) coming from the downlink network interface module 301 enters the packet detection module 305; the packet detection module 305 detects whether the Destination Address (DA), Source Address (SA), packet type and packet length of the packet meet the requirements, and if so, allocates a corresponding stream identifier (stream-id) and passes the packet to the switching engine module 303; otherwise, the packet is discarded. A packet (downstream data) coming from the uplink network interface module 302 enters the switching engine module 303, as does a data packet coming from the CPU module 304. The switching engine module 303 looks up the address table 306 for each incoming packet to obtain its direction information; if a packet entering the switching engine module 303 is going from the downlink network interface to the uplink network interface, it is stored in the queue of the corresponding packet buffer 307 in association with its stream-id; otherwise, it is stored in the queue of the corresponding packet buffer 307 according to its direction information. In either case, if the queue of the packet buffer 307 is nearly full, the packet is discarded.
The switching engine module 303 polls all packet buffer queues and may include two cases:
if the queue is from the downlink network interface to the uplink network interface, the following conditions are met for forwarding: 1) the port send buffer is not full; 2) the queued packet counter is greater than zero; 3) obtaining a token generated by a code rate control module;
if the queue is not from the downlink network interface to the uplink network interface, the following conditions are met for forwarding: 1) the port send buffer is not full; 2) the queue packet counter is greater than zero.
The rate control module 308 is configured by the CPU module 304, and generates tokens for packet buffer queues from all downstream network interfaces to upstream network interfaces at programmable intervals to control the rate of upstream forwarding.
The CPU module 304 is mainly responsible for protocol processing with the node server, configuration of the address table 306, and configuration of the code rate control module 308.
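The rate control described above behaves like a token scheme: queues bound from the downlink to the uplink interface may only forward while they hold a token generated at a CPU-programmed interval. Below is a hedged sketch; the interval value and all names are assumptions for illustration, not the patent's implementation.

```python
# Sketch of token generation for downlink-to-uplink queues: one token
# per programmable interval, one token consumed per forwarded packet.
# The 0.001 s interval is an assumed example value.
import time

class RateController:
    def __init__(self, token_interval_s: float = 0.001):
        self.token_interval_s = token_interval_s  # configured by the CPU module
        self.tokens = 0
        self._last = time.monotonic()

    def _refresh(self) -> None:
        # Generate one token for each interval that has elapsed.
        elapsed = time.monotonic() - self._last
        new_tokens = int(elapsed / self.token_interval_s)
        if new_tokens:
            self.tokens += new_tokens
            self._last += new_tokens * self.token_interval_s

    def try_consume(self) -> bool:
        self._refresh()
        if self.tokens > 0:
            self.tokens -= 1
            return True
        return False

# An uplink-bound queue is forwarded only when, in addition to the two
# conditions shared with other queues, try_consume() returns True.
```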
Ethernet protocol conversion gateway
As shown in fig. 4, the apparatus mainly includes a network interface module (a downlink network interface module 401 and an uplink network interface module 402), a switching engine module 403, a CPU module 404, a packet detection module 405, a rate control module 408, an address table 406, a packet buffer 407, a MAC adding module 409, and a MAC deleting module 410.
Wherein, a data packet coming from the downlink network interface module 401 enters the packet detection module 405; the packet detection module 405 detects whether the Ethernet MAC DA, Ethernet MAC SA, Ethernet length or frame type, video network destination address DA, video network source address SA, video network packet type and packet length of the packet meet the requirements, and if so, allocates a corresponding stream identifier (stream-id); the MAC deletion module 410 then strips the MAC DA, MAC SA and length or frame type (2 bytes) and the packet enters the corresponding receiving buffer; otherwise, the packet is discarded;
the downlink network interface module 401 detects the sending buffer of the port, and if there is a packet, obtains the ethernet MAC DA of the corresponding terminal according to the destination address DA of the packet, adds the ethernet MAC DA of the terminal, the MACSA of the ethernet coordination gateway, and the ethernet length or frame type, and sends the packet.
The other modules in the Ethernet protocol conversion gateway function similarly to those of the access switch.
A terminal:
the system mainly comprises a network interface module, a service processing module and a CPU module; for example, the set-top box mainly comprises a network interface module, a video and audio coding and decoding engine module and a CPU module; the coding board mainly comprises a network interface module, a video and audio coding engine module and a CPU module; the memory mainly comprises a network interface module, a CPU module and a disk array module.
1.3 The devices of the metropolitan area network part can be mainly classified into 3 types: node server, node switch and metropolitan area server. The node switch mainly comprises a network interface module, a switching engine module and a CPU module; the metropolitan area server mainly comprises a network interface module, a switching engine module and a CPU module.
2. Video networking packet definition
2.1 Access network packet definition
The data packet of the access network mainly comprises the following parts: Destination Address (DA), Source Address (SA), reserved bytes, payload (PDU) and CRC, laid out in this order:
DA | SA | Reserved | Payload | CRC
wherein:
the Destination Address (DA) is composed of 8 bytes (byte), the first byte represents the type of the data packet (such as various protocol packets, multicast data packets, unicast data packets, etc.), there are 256 possibilities at most, the second byte to the sixth byte are metropolitan area network addresses, and the seventh byte and the eighth byte are access network addresses;
the Source Address (SA) is also composed of 8 bytes (byte), defined as the same as the Destination Address (DA);
the reserved byte consists of 2 bytes;
the payload part has different lengths according to different types of datagrams, and is 64 bytes if the datagram is various types of protocol packets, and is 32+1024 or 1056 bytes if the datagram is a unicast packet, of course, the length is not limited to the above 2 types;
the CRC consists of 4 bytes and is calculated in accordance with the standard ethernet CRC algorithm.
2.2 metropolitan area network packet definition
The topology of the metropolitan area network is a graph, and there may be 2 or even more connections between two devices; that is, there may be more than 2 connections between a node switch and a node server, between a node switch and a node switch, or between a node server and a node server. However, the metropolitan area network address of a metropolitan area network device is unique; in order to accurately describe the connection relationship between metropolitan area network devices, the embodiment of the present invention introduces a parameter, the label, to uniquely describe a metropolitan area network device.
In this specification, the definition of the label is similar to that of an MPLS (Multi-Protocol Label Switching) label. Assuming there are two connections between device A and device B, a packet from device A to device B has 2 labels available, and a packet from device B to device A also has 2 labels. Labels are divided into incoming labels and outgoing labels: assuming the label of a packet entering device A (its incoming label) is 0x0000, the label of the packet when it leaves device A (its outgoing label) may become 0x0001. The network access process of the metropolitan area network is a process under centralized control; that is, both address allocation and label allocation of the metropolitan area network are dominated by the metropolitan area server, and the node switch and node server execute passively. This differs from MPLS, whose label allocation is the result of mutual negotiation between switch and server.
The data packet of the metropolitan area network mainly includes the following parts, laid out in this order:
DA | SA | Reserved | Label | Payload | CRC
Namely Destination Address (DA), Source Address (SA), reserved bytes (Reserved), label, payload (PDU) and CRC. The format of the label may be defined as follows: the label is 32 bits, with the upper 16 bits reserved and only the lower 16 bits used; it sits between the reserved bytes and the payload of the packet.
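A small sketch of the label field may help: 32 bits between the reserved bytes and the payload, upper 16 bits reserved, lower 16 bits carrying the label, with label values assigned centrally by the metropolitan area server rather than negotiated. The function names and the example table below are hypothetical.

```python
# Sketch of inserting and swapping the metro-network label.
# Names and the example mapping are illustrative assumptions.
import struct

def insert_label(header_18b: bytes, label: int) -> bytes:
    """Place a 32-bit label after the 18-byte DA+SA+Reserved header;
    the upper 16 bits stay zero (reserved)."""
    assert len(header_18b) == 18
    assert 0 <= label <= 0xFFFF, "only the lower 16 bits are used"
    return header_18b + struct.pack(">I", label)

def swap_label(in_label: int, label_table: dict[int, int]) -> int:
    """Per-device label swap, e.g. incoming 0x0000 may leave as 0x0001.
    label_table is populated by the metropolitan area server."""
    return label_table[in_label]

# swap_label(0x0000, {0x0000: 0x0001}) -> 0x0001
```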
Based on the above characteristics of the video network, one concept of the invention is proposed: the storage server can obtain a video file recording the video conference and a text file corresponding to the audio in the video conference; when a user plays the video file, several sections of text information near the current playing time can be extracted from the text file and displayed, making it convenient for the user to quickly obtain the conference content.
Referring to fig. 5, an implementation environment diagram of a video conference processing method according to an embodiment of the present invention is shown. As shown in fig. 5, the environment includes a conference control terminal, a conference management server, an autonomous server, a storage service system and a plurality of terminals. The autonomous server, the storage service system and the terminals are deployed in the video network, while the conference management server and the conference control terminal are deployed in the Internet; the autonomous server communicates with the conference management server using the Internet protocol, and with the storage service system and the terminals using the video networking protocol.
In this embodiment, the terminal may be, but is not limited to, the following devices: a mobile phone, a computer, a set-top box or a video networking terminal. In a video conference held in the video network, a plurality of terminals act as participants. The conference control terminal is used for controlling the terminals participating in the video conference, sending conference control instructions and the like. The autonomous server is used for transmitting the audio/video streams, conference text data and the like in the video network, and the storage service system is used for recording the received video streams and audio streams so as to store the conference content of the video conference.
As shown in fig. 5, in a specific implementation, the storage service system may include a storage service system web front end and a storage service system back end, where the storage service system back end is a storage server, and the storage service system web front end may support communication of an internet protocol, so as to facilitate users in the internet to download files related to a video conference on a web page.
In the following, the form of a video conference in a video network is described as an example:
as shown in fig. 5, in the video conference, the scheduling of the audio/video stream may be triggered by the conference control terminal, and the autonomous server processes the actual scheduling of the audio/video stream in the video conference according to the scheduling triggered by the conference control terminal. For example, at a certain time, the terminal 1 is set as a speaking party by the conference control terminal, and the other terminals 2, 3 and 4 are set as common participating parties, then the conference control terminal triggers an audio/video stream scheduling at this moment, and the autonomous server sends the audio stream and the video stream acquired by the terminal 1 to the terminal 2, the terminal 3 and the terminal 4 according to the scheduling. For another example, at another time, the terminal 2 is set as a speaking party by the conference control terminal, and the other terminals 1, 3 and 4 are set as common participating parties, so that at this moment, the conference control terminal triggers an audio/video stream scheduling, and the autonomous server sends the audio stream and the video stream acquired by the terminal 2 to the terminal 1, the terminal 3 and the terminal 4 according to the scheduling.
In practice, the conference control terminal can continuously change speakers in the conference process according to the conference requirements, that is, continuously change the audio and video stream scheduling, and the autonomous server can forward the audio streams collected by the corresponding terminals to other terminals participating in the conference according to the audio and video stream scheduling determined by the conference control terminal.
Next, the video conference processing method according to the present application will be described with reference to the implementation environment shown in fig. 5, in which the storage service system is used as an execution subject.
In an embodiment, while a video conference is in progress, the storage service system may record the video conference to obtain a video file and a text file corresponding to the video conference. Referring to fig. 6, a flowchart of the steps by which the storage service system records a video conference to obtain a video file and a text file in an embodiment is shown; the process may specifically include the following steps:
step S601: and receiving a conference recording instruction sent by the autonomous server.
The conference recording instruction is generated by the conference control terminal and sent to the autonomous server.
In this embodiment, the storage service system may record the video conference based on a conference recording instruction sent by the conference control terminal. Specifically, the conference recording instruction may be generated by the conference control terminal, sent to the autonomous server by the conference control terminal, and sent to the back end of the storage service system by the autonomous server.
In practice, as shown in fig. 5, the conference control terminal is located in the Internet and is communicatively connected with the conference management server in the Internet. The conference control terminal therefore generates a conference recording instruction conforming to the Internet protocol and sends it to the conference management server through the Internet; the conference management server converts it into a conference recording instruction conforming to the video networking protocol and sends it to the autonomous server; the autonomous server then sends it to the back end of the storage service system in the video network. The conference recording instruction received by the storage server is thus an instruction conforming to the video networking protocol.
Step S602: responding to the conference recording instruction, recording the audio stream and the video stream which are currently generated in the video conference, and caching the audio stream; and the audio stream is the audio stream collected by the terminal which is speaking currently in the video conference.
In this embodiment, the back end of the storage service system may record the audio stream and the video stream generated in the video conference in response to the conference recording instruction, where the audio stream and the video stream generated in the video conference may be sent to the back end of the storage service system by the autonomous server after responding to the conference recording instruction. Namely, when the autonomous server forwards the conference recording instruction to the storage service system, an audio/video stream transmission channel with the storage service system is also established, so that the video stream and the audio stream sent by each terminal in the video conference are sent to the storage service system.
In a specific implementation, as shown in fig. 5, the storage service system stores the audio stream by using the voice analysis system, so that the voice analysis system can recognize the stored audio stream.
Step S603: when the fact that the total playing time of a section of cached audio stream reaches the preset time is detected, the section of cached audio stream is identified to obtain character information corresponding to the section of audio stream, and the initial time of the section of audio stream in the recording process is used as the initial playing time.
In this embodiment, when the total playing duration of the stored audio stream reaches the preset duration, the storage service system may identify the segment of audio stream by using the voice analysis system, so as to obtain the text information corresponding to the segment of audio stream. Meanwhile, the cached audio stream can be deleted, the next audio stream can be continuously cached, and when the total playing time of the cached next audio stream reaches the preset time, the text information corresponding to the next audio stream can be obtained, so that a plurality of sections of text information can be obtained.
The start time of a section of the audio stream in the recording process is the offset from the start of recording to the moment that section begins to be cached; this offset serves as the starting playing time of the text information corresponding to that section.
For example, suppose that a section of the audio stream sent by terminal 2 begins to be cached 20 minutes 15 seconds after recording of the video conference starts. When the playing duration of the cached audio reaches 5 seconds, text information A is generated for it, and the starting playing time of text information A is determined to be 20 minutes 15 seconds; this indicates that, in the recorded video file, terminal 2 starts speaking at 20 minutes 15 seconds. Terminal 2 keeps sending its audio stream, and caching starts again at 20 minutes 21 seconds; when the playing duration of this newly cached audio reaches 5 seconds, text information B is generated, and the starting playing time of text information B is determined to be 20 minutes 21 seconds.
In one embodiment, the storage service system may further determine a terminal that collects the currently stored audio stream, and record the name of the terminal or the name of the user using the terminal into the text message, so that each piece of text message may carry the speaker information.
Step S604: and when the recording of the current video conference is finished, storing the multiple sections of character information corresponding to the multiple sections of cached audio streams as text files.
In this embodiment, the storage service system may end the recording of the video conference according to an end-recording instruction sent by the conference control terminal; the process by which the storage service system receives the end-recording instruction is the same as the process, described above, by which it receives the conference recording instruction, and is not repeated here.
In practice, the storage service system may also end recording the video conference when the time length after the video conference starts to be recorded reaches the preset time length threshold. Of course, the recording of the video conference may also be ended when the end of the video conference is detected.
In a specific implementation, since the audio is identified in the recording process to obtain multiple sections of text information, when the recording is finished, the storage service system can combine the multiple sections of text information together to form a text file, and the text information of each section in the text file has an initial playing time.
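Steps S601 to S604 can be summarized in a short sketch. Everything below is illustrative: the 5-second section length matches the example above, recognize_speech is a hypothetical stand-in for the voice analysis system, and the JSON layout of the text file is an assumption, not the patent's format.

```python
# Sketch of recording: cache audio in fixed-duration sections,
# recognize each section, stamp it with its offset from the start of
# recording (its starting playing time), and merge all sections into
# one text file when recording ends. Names are illustrative.
import json

SEGMENT_SECONDS = 5.0  # the 'preset duration' of step S603 (assumed)

def recognize_speech(audio: bytes) -> str:
    # Hypothetical stand-in for the voice analysis system.
    return "<recognized text>"

class ConferenceRecorder:
    def __init__(self):
        self.segments = []           # accumulated text information
        self.buffer = bytearray()    # currently cached audio section
        self.buffered_seconds = 0.0
        self.elapsed_seconds = 0.0   # time since recording started
        self.section_start = 0.0

    def on_audio(self, chunk: bytes, seconds: float, speaker: str):
        if not self.buffer:
            self.section_start = self.elapsed_seconds
        self.buffer.extend(chunk)
        self.buffered_seconds += seconds
        self.elapsed_seconds += seconds
        if self.buffered_seconds >= SEGMENT_SECONDS:     # step S603
            self.segments.append({
                "start_play_time": self.section_start,
                "speaker": speaker,                      # speaker info
                "text": recognize_speech(bytes(self.buffer)),
            })
            self.buffer.clear()                          # delete cached audio
            self.buffered_seconds = 0.0

    def finish(self, path: str):
        """Step S604: persist all text sections as one text file."""
        with open(path, "w", encoding="utf-8") as f:
            json.dump(self.segments, f, ensure_ascii=False, indent=2)
```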
When the recording is finished, the storage service system can issue the recorded video file and text file to a web end of the storage service system, and specifically, the storage service system can issue the name of the video file and the name of the text file to the web end, so that a user can download the video file and the text file, and thus, the conference content of the video conference is acquired, and the application of the video file and the text file is realized.
In this embodiment, the storage service system may show the text information in the text file to the user in the process that the user plays the recorded video file, so that the user can obtain the conference content of the video conference in a video clip.
Specifically, referring to fig. 7, a flowchart of the steps by which the storage service system presents the text information in the text file to the user in the video conference processing method of an embodiment is shown, which specifically includes the following steps:
step S701: and acquiring a video file for recording the video conference and a text file corresponding to the audio stream in the video conference within the recording time period.
The text file comprises a plurality of sections of text information, and each section of text information comprises the initial playing time of the audio stream corresponding to the text information in the video file.
In this embodiment, the storage service system may obtain the video file and text file recorded for the video conference when the conference ends, or may obtain the video file and text file recorded so far while the conference is still in progress, so as to support recorded playback of the video conference. The video file can restore the scene of the video conference; it is obtained by mixing and recording the video streams and audio streams collected by the terminals in the video conference.
The text file is obtained by identifying an audio stream generated in a video conference in the process of recording a video file. Taking fig. 5 as an example, in the process of recording a video file, the terminal 1 and the terminal 2 speak successively, so that the text file may include text information corresponding to the audio collected by the terminal 1 and text information corresponding to the audio collected by the terminal 2.
In practice, each piece of text information may have a starting playing time, which may be the time difference between the moment the video file starts to be recorded and the moment the piece of text information is recognized; this difference is the relative playing time, in the video file, of the audio stream corresponding to the text information. The process of recording the video file and obtaining the text file may be as described in steps S601 to S604.
Step S702: and when a playing request of the user for the video file is received, playing the video file.
In this embodiment, the storage service system may publish the obtained name of the video file to the web front end, so that the user may play the published video file. Specifically, when the web front end of the storage service system detects a click operation of a user on the name of the video file, a play request may be sent to the back end of the storage service system, and the back end of the storage service system pushes the video file to the web front end, and the web front end may load and play the video file.
In a specific implementation, when the back end of the storage service system pushes the video file to the web front end, the back end can also push the text file to the web front end.
Step S703: and in the process of playing the video file, extracting at least one section of text information corresponding to the current playing time from the text file according to the current playing time and each initial playing time.
Step S704: and displaying the at least one piece of text information.
In this embodiment, in the process of playing the video conference, the web front end of the storage service system may detect a relationship between the current playing time and the starting playing time carried by each piece of text information in the text file, so as to determine the starting playing time located near the current playing time, and extract at least one piece of text information corresponding to the starting playing time near the current playing time from the text file.
Because the initial playing time of each segment of text information is the relative playing time of the audio stream corresponding to the segment of text information in the video file, in practice, when the video file is played to a video segment where a certain speaker plays a speech, the text information corresponding to the speech of the speaker can be synchronously displayed, so that a user can conveniently know the speech content of the speaker in a text form, and the user can conveniently and accurately know the content of the video conference.
In a specific implementation, the web front end of the storage service system may display the extracted at least one piece of text information on the video playing window of the video file. In practice, when the text information is displayed, the user can save the speaker's speech content in the video by capturing the video playing window. Specifically, when detecting a screen-capture operation on the video playing window, the web front end may capture the current window to obtain a picture of the displayed page, which includes the at least one piece of text information shown on it. The saved picture is then sent to the user currently logged in to the storage service system, so the user can keep important conference content as pictures without manual note-taking, improving the efficiency of conference recording.
In an embodiment, because the text information can carry speaker information, the speaker information can be displayed together when the text information is displayed, so that a user can conveniently know the content of the speaker to which the currently displayed text information belongs, and the user experience is optimized.
In the embodiment of the invention, the storage service system can display the text information corresponding to the playing time according to the current playing time and the initial playing time corresponding to each section of text information in the text file in the process of playing the video file, so that a user can conveniently and synchronously obtain the conference content of the current video segment through the displayed text information while watching the recorded video conference, the user can accurately obtain the conference content, and the efficiency of the user for obtaining the conference content is improved. On the other hand, the storage service system can acquire the text file corresponding to the audio in the video conference, so that a user does not need to manually record the content of the video conference, and the efficiency of recording the video conference is improved.
With reference to the foregoing embodiment, an implementation of displaying text information while playing a video file is provided. In this implementation, the video file may carry a plurality of marks, and playing the video file may proceed as follows:
Step S702': display the plurality of marks on a playing progress bar of the video file.
Each mark on the playing progress bar indicates a playing time in the video file.
In this embodiment, the plurality of marks carried in the video file mark a plurality of playing time points of the video file. When the video file is played, the marks can be displayed at different positions on the playing progress bar according to the playing time points they mark.
The time interval between the playing time points of every two adjacent marks may be the same. For example, if the total playing time of the video file is 20 minutes, 5 marks can be set at equal intervals, so that one mark appears on the playing progress bar every 4 minutes.
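Placing such equally spaced marks amounts to a one-line computation; the helper below is an assumed illustration of the 20-minute, 5-mark example above.

def mark_times(total_seconds: float, mark_count: int) -> list:
    """Playing time points for marks set at equal time intervals."""
    interval = total_seconds / mark_count
    return [interval * (i + 1) for i in range(mark_count)]

print(mark_times(20 * 60, 5))  # [240.0, 480.0, 720.0, 960.0, 1200.0], one mark every 4 minutes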
Accordingly, displaying at least one piece of text information may specifically include the steps of:
Step S7031: when a trigger operation on one of the plurality of marks is detected, determine the first playing time of the triggered mark in the video file, and switch the current playing time to the first playing time, so as to play the video file from the first playing time.
In this embodiment, the trigger operation on a mark may be a click operation or a touch operation performed on the mark by the user. When such an operation is detected, the playing time point marked by that mark can be obtained; this playing time point is the first playing time. Meanwhile, the current playing time is adjusted to the first playing time, that is, the current position on the playing progress bar is moved to the position of the mark, so that the video file plays from the first playing time.
Step S7032: extract, from the text file, at least one piece of first text information whose starting playing time is within a first preset time range of the first playing time.
In this embodiment, once the first playing time marked by the triggered mark is obtained, the starting playing times within the first preset time range of the first playing time can be selected from the plurality of starting playing times, and the corresponding text information can then be extracted from the text file; this text information is the first text information.
The first preset time range can be set according to requirements.
Illustratively, referring to FIG. 8, a schematic diagram of an interface for displaying text information while playing a video file is shown. Take a video file with a total playing time of 20 minutes and 4 marks set at equal time intervals. If the user clicks mark 1, the video file plays from mark 1; with the playing time of mark 1 being 4 minutes 5 seconds, the text information whose starting playing time falls between 3 minutes 55 seconds and 4 minutes 15 seconds can be extracted from the text file and displayed on the playing interface. This text information may be displayed on the progress bar, or on the left or right side of the video picture; the display position can be chosen as required.
Through this implementation, the user can trigger a mark on the playing progress bar to acquire the conference content of the video conference in the nearby time period.
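A minimal sketch of steps S7031 and S7032 follows, using the 4-minute-5-second example above. The player callbacks and the 10-second first preset time range are assumptions for illustration.

FIRST_PRESET_RANGE = 10.0  # seconds; the range "can be set according to requirements"

def on_mark_clicked(mark_time, segments, seek, show):
    """seek/show stand in for callbacks supplied by the playing front end."""
    seek(mark_time)  # switch the current playing time to the first playing time
    first_texts = [(t, text) for t, text in segments
                   if abs(t - mark_time) <= FIRST_PRESET_RANGE]
    show(first_texts)  # display the first text information

segments = [(235.0, "3m55s speech"), (245.0, "4m05s speech"), (300.0, "5m00s speech")]
on_mark_clicked(245.0, segments,
                seek=lambda t: print(f"seek to {t}s"),
                show=print)  # shows the 235.0 and 245.0 segments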
With reference to the foregoing embodiment, a further implementation manner of displaying text information while playing a video file is provided. In this implementation, the video file may also carry a plurality of marks, and the video file may be played as in step S702'.
In this embodiment, the displaying at least one text message may specifically include the following steps:
Step S7031': determine the progress position corresponding to the current playing time, and determine, among the plurality of marks, a target mark within a preset distance of that position.
In this embodiment, the progress position corresponding to the current playing time is the position of the current playing time point on the playing progress bar. Since the plurality of marks are displayed on the playing progress bar, each mark has its own position there. The distance between the position of the current playing time point and the position of each mark can be computed, and a mark whose distance falls within the preset distance is determined as the target mark.
In practice, the target mark may be the mark closest to the current playing time point.
Step S7032': determine the second playing time marked by the target mark in the video file.
In this embodiment, the playing time marked by the target mark can be obtained, and the playing time is the second playing time.
Step S7033': extract, from the text file, at least one piece of second text information whose starting playing time is within a preset time range of the second playing time.
The process of step S7033' is similar to that of step S7032; for the relevant details, refer to the description of step S7032. The preset time range here may be consistent with the first preset time range.
With this embodiment, the storage service system can automatically extract and display text information according to the difference between the current playing time and the playing time point marked by each mark, so the normal playing of the video file is not interrupted, which improves the user experience.
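The nearest-mark selection of steps S7031' to S7033' can be sketched as below. Expressing the preset distance in seconds rather than in on-screen pixels is a simplifying assumption, as are the two range values.

PRESET_DISTANCE = 30.0  # seconds between the progress position and a mark (assumed)
PRESET_RANGE = 10.0     # consistent with the first preset time range

def texts_for_nearest_mark(current_time, mark_times, segments):
    """Pick the target mark nearest the current playing time, then return
    the second text information around that mark's playing time."""
    candidates = [m for m in mark_times
                  if abs(m - current_time) <= PRESET_DISTANCE]
    if not candidates:
        return []
    target = min(candidates, key=lambda m: abs(m - current_time))  # target mark
    second_time = target  # the second playing time marked by the target mark
    return [(t, text) for t, text in segments
            if abs(t - second_time) <= PRESET_RANGE]

marks = [240.0, 480.0, 720.0, 960.0, 1200.0]
segs = [(470.0, "discussion of the budget"), (900.0, "closing remarks")]
print(texts_for_nearest_mark(465.0, marks, segs))  # matches text near the 480 s mark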
With reference to the above embodiment, a further implementation manner of displaying text information while playing a video file is provided. In this implementation, at least one piece of text information near the current playing time may be displayed, specifically through the following steps:
Step S7031": determine, from the starting playing times, at least one starting playing time within a second preset time range of the current playing time.
In this embodiment, the starting playing time of each piece of text information in the text file may be traversed to screen out at least one starting playing time within the second preset time range of the current playing time.
The second preset time range can be set according to actual requirements and may differ from the first preset time range.
Step S7032": acquire, from the text file, at least one piece of third text information corresponding to the at least one starting playing time.
In this embodiment, the text information corresponding to the screened-out starting playing times is extracted; this text information is the third text information, and it is displayed.
With this embodiment, the storage service system displays, in real time during playback, text information synchronized with the playing progress, so the user can read conference content synchronized with the video file while watching it, further improving the user experience.
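A sketch of this real-time variant: on every playback tick, the front end shows the third text information whose starting playing time falls within the second preset time range of the current playing time. The 5-second range is an assumed value.

SECOND_PRESET_RANGE = 5.0  # may differ from the first preset time range (assumed)

def on_tick(current_time, segments):
    """Called on every playback progress update."""
    return [(t, text) for t, text in segments
            if abs(t - current_time) <= SECOND_PRESET_RANGE]

print(on_tick(250.0, [(247.0, "in range"), (260.0, "out of range")]))  # only 247.0 matches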
In one embodiment, after the video conference ends, the storage service system may further generate a complete conference summary, specifically through the following steps:
step S605: and when the video conference is finished, obtaining conference data of the video conference.
In this embodiment, when the video conference ends, the storage service system may obtain the conference data from the conference control terminal. Specifically, the storage service system may generate a conference data request instruction and send it to the autonomous server, which forwards it to the conference control terminal. In response to the conference data request instruction, the conference control terminal returns the conference data of the video conference to the storage service system along the reverse of the above path, that is, from the conference control terminal through the autonomous server.
The conference data may include information such as the name of the video conference, the number of participants, and the subject of the conference.
Step S606: generating a conference summary based on the conference data and the text file, and publishing the conference summary.
After obtaining the conference data, the storage service system may add the conference data to the text file to obtain the conference summary. In practice, the storage service system may store the conference summary at its back end.
Step S607: when a request from a user to download the published conference summary is received, send the conference summary to the user.
In this embodiment, the storage service system may publish the name of the conference summary on its web side for users to view and download; the name of the conference summary may coincide with the name of the video conference. Specifically, when the web side of the storage service system detects a download operation on the conference summary, it may generate a download request and send it to the back end of the storage service system, and the back end sends the conference summary to the user currently logged in to the storage service system based on that request.
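The assembly of the summary can be sketched as below. The field names and the plain-text layout are assumptions; this embodiment does not fix a summary format.

def build_summary(conference: dict, segments) -> str:
    """Merge the conference data with the transcript held in the text file."""
    lines = [
        f"Conference: {conference.get('name', '')}",
        f"Subject:    {conference.get('subject', '')}",
        f"Attendees:  {conference.get('participants', '')}",
        "",
        "Transcript:",
    ]
    for start, speaker, text in segments:
        minutes, seconds = divmod(int(start), 60)
        lines.append(f"[{minutes:02d}:{seconds:02d}] {speaker}: {text}")
    return "\n".join(lines)

summary = build_summary(
    {"name": "Weekly sync", "subject": "Roadmap", "participants": 8},
    [(245, "Speaker A", "Quarterly results overview")],
)
print(summary)  # the back end would store this and serve it on download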
It should be noted that the above example is one specific embodiment of the present invention. In practice, the recording of the video file and the text file is not limited to the above manner. For example, the video file and the text file may be recorded by the autonomous server, in which case the storage service system obtains them from the autonomous server; or they may be recorded by the conference control terminal, in which case the storage service system obtains them from the conference control terminal.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.
Referring to FIG. 9, a schematic structural diagram of a video conference processing apparatus according to this embodiment is shown. The apparatus may be applied to a storage service system in a video network and may specifically include the following modules:
a file obtaining module 901, configured to obtain a video file for recording the video conference and a text file corresponding to an audio stream in the video conference within a recording time period; the text file comprises a plurality of sections of text information, wherein each section of text information comprises the initial playing time of an audio stream corresponding to the text information in the video file;
the video playing module 902 may be configured to play the video file when receiving a play request of the user for the video file;
an information extraction module 903, configured to extract at least one piece of text information corresponding to the current playing time from the text file according to the current playing time and each starting playing time in a process of playing the video file;
an information display module 904 may be configured to display the at least one piece of text information.
Optionally, a conference control terminal and an autonomous server may be deployed in the video network, and the apparatus may further include the following modules:
the instruction receiving module can be used for receiving a conference recording instruction sent by the autonomous server, wherein the conference recording instruction is generated by the conference control terminal and sent to the autonomous server;
the video recording module may be configured to record an audio stream and a video stream currently generated in the video conference in response to the conference recording instruction, and cache the audio stream; the audio stream is an audio stream collected by a terminal which is speaking currently in the video conference;
the audio identification module can be used for identifying the cached section of audio stream to obtain character information corresponding to the section of audio stream when the total playing duration of the cached section of audio stream reaches the preset duration is detected, and taking the initial time of the section of audio stream in the recording process as the initial playing time;
the file obtaining module may be configured to store, as a text file, multiple pieces of text information corresponding to the multiple pieces of cached audio streams when recording of the current video conference is finished.
Optionally, the video file may carry a plurality of marks, and the video playing module 902 may be specifically configured to display the plurality of marks on a playing progress bar of the video file, where each mark on the playing progress bar indicates a playing time in the video file;
the information extraction module 903 may specifically include the following units:
a first determining unit, configured to determine, when a trigger operation on one of the plurality of marks is detected, the first playing time of the triggered mark in the video file;
the progress adjusting unit may be configured to switch a current playing time to the first playing time, so as to play the video file from the first playing time;
the first extracting unit may be configured to extract, from the text file, at least one piece of first text information whose corresponding start playing time is within the first preset time range from the first playing time.
Optionally, the video file may carry a plurality of marks, and the video playing module 902 may be specifically configured to display the plurality of marks on a playing progress bar of the video file, where each mark on the playing progress bar indicates a playing time in the video file;
the information extraction module 903 may include the following units:
a second determining unit, configured to determine the progress position corresponding to the current playing time, and to determine, among the plurality of marks, a target mark within a preset distance of that position;
a third determining unit, operable to determine a second playing time in the video file from the target mark;
the second extracting unit may be configured to extract at least one piece of second text information from the text file, where a starting playing time is within a preset time range from the second playing time.
Optionally, the information extraction module 903 may include the following units:
a fourth determining unit, configured to determine, from each of the start playing times, at least one start playing time within a second preset time range from the current playing time;
the third extracting unit may be configured to acquire at least one piece of third text information corresponding to the at least one starting playing time from the text file.
Optionally, the apparatus may further include the following modules:
the conference data obtaining module can be used for obtaining conference data of the video conference when the video conference is finished;
a conference summary generation module, configured to generate a conference summary based on the conference data and the text file, and publish the conference summary;
the conference summary sending module may be configured to send the conference summary to the user when receiving a request of the user for downloading the published conference summary. :
for the embodiment of the video conference processing apparatus, since it is basically similar to the embodiment of the video conference processing method, the description is relatively simple, and for relevant points, reference may be made to part of the description of the embodiment of the video conference processing method.
An embodiment of the present invention further provides an electronic device, including:
one or more processors; and
one or more machine readable media having instructions stored thereon that, when executed by the one or more processors, cause the apparatus to perform one or more video conference processing methods as described in embodiments of the invention.
Embodiments of the present invention further provide a computer-readable storage medium, where a stored computer program causes a processor to execute the video conference processing method according to the embodiments of the present invention.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.
The video conference processing method, the video conference processing device, the electronic device and the storage medium provided by the invention are introduced in detail, a specific example is applied in the text to explain the principle and the implementation of the invention, and the description of the above embodiment is only used for helping to understand the method and the core idea of the invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (10)

1. A video conference processing method is applied to a storage service system in a video network, and comprises the following steps:
acquiring a video file for recording the video conference and a text file corresponding to an audio stream in the video conference within a recording time period; the text file comprises a plurality of sections of text information, wherein each section of text information comprises the initial playing time of an audio stream corresponding to the text information in the video file;
when a playing request of a user for the video file is received, playing the video file;
in the process of playing the video file, extracting at least one section of text information corresponding to the current playing time from the text file according to the current playing time and each initial playing time;
and displaying the at least one piece of text information.
2. The method of claim 1, wherein the video file carries a plurality of marks, and playing the video file comprises:
displaying the marks on a playing progress bar for playing the video file, wherein each mark on the playing progress bar is used for indicating the playing time of the mark in the video file;
extracting at least one section of text information corresponding to the current playing time from the text file according to the current playing time and each starting playing time, wherein the extracting comprises the following steps:
when the triggering operation of the plurality of marks is detected, determining first playing time of the triggered marks in the video file, and switching the current playing time to the first playing time so as to play the video file from the first playing time;
and extracting at least one section of first text information of which the corresponding starting playing time is within a first preset time range from the first playing time from the text file.
3. The method of claim 1, wherein the video file carries a plurality of marks, and playing the video file comprises:
displaying the marks on a playing progress bar for playing the video file, wherein each mark on the playing progress bar is used for indicating the playing time of the mark in the video file;
extracting at least one section of text information corresponding to the current playing time from the text file according to the current playing time and each starting playing time, wherein the extracting comprises the following steps:
determining a progress position corresponding to the current playing time, and determining a target mark within a preset distance from the progress position in the plurality of marks;
determining a second playing time in the video file from the target mark;
and extracting at least one section of second text information with the starting playing time and the second playing time within a preset time range from the text file.
4. The method of claim 1, wherein extracting at least one piece of text information corresponding to a current playing time from the text file according to the current playing time and each of the starting playing times comprises:
determining at least one starting playing time within a second preset time range with the current playing time from the starting playing times;
and acquiring at least one piece of third text information corresponding to the at least one starting playing time from the text file.
5. The method according to claim 1, wherein a conference control terminal and an autonomous server are deployed in the video network, and before obtaining a video file for recording the video conference, the method further comprises:
receiving a conference recording instruction sent by the autonomous server, wherein the conference recording instruction is generated by the conference control terminal and is sent to the autonomous server;
responding to the conference recording instruction, recording the audio stream and the video stream which are currently generated in the video conference, and caching the audio stream; the audio stream is an audio stream collected by a terminal which is speaking currently in the video conference;
when detecting that the total playing time of a section of cached audio stream reaches a preset time, identifying the section of cached audio stream to obtain character information corresponding to the section of audio stream, and taking the initial time of the section of audio stream in the recording process as the initial playing time;
and when the recording of the current video conference is finished, storing the multiple sections of character information corresponding to the multiple sections of cached audio streams as text files.
6. The method of claim 5, wherein after storing the plurality of text information corresponding to the plurality of buffered audio streams as a text file when the recording of the current video conference is finished, the method further comprises:
when the video conference is finished, obtaining conference data of the video conference;
generating a conference summary based on the conference data and the text file, and publishing the conference summary;
and when a request of a user for downloading the published conference summary is received, sending the conference summary to the user.
7. A video conference processing device, which is applied to a storage service system in a video network, comprises:
the file acquisition module is used for acquiring a video file for recording the video conference and a text file corresponding to an audio stream in the video conference within a recording time period; the text file comprises a plurality of sections of text information, wherein each section of text information comprises the initial playing time of an audio stream corresponding to the text information in the video file;
the video playing module is used for playing the video file when receiving a playing request of a user for the video file;
the information extraction module is used for extracting at least one section of text information corresponding to the current playing time from the text file according to the current playing time and each initial playing time in the process of playing the video file;
and the information display module is used for displaying the at least one section of text information.
8. The apparatus of claim 7, wherein a conference control terminal and an autonomous server are deployed in the video network, the apparatus further comprising:
the instruction receiving module is used for receiving a conference recording instruction sent by the autonomous server, wherein the conference recording instruction is generated by the conference control terminal and is sent to the autonomous server;
the video recording module is used for responding to the conference recording instruction, recording the audio stream and the video stream which are currently generated in the video conference, and caching the audio stream; the audio stream is an audio stream collected by a terminal which is speaking currently in the video conference;
the audio identification module is used for identifying the cached section of audio stream to obtain character information corresponding to the section of audio stream when the total playing duration of the cached section of audio stream reaches the preset duration, and taking the initial time of the section of audio stream in the recording process as the initial playing time;
and the file obtaining module is used for storing the multiple sections of character information corresponding to the multiple sections of cached audio streams as text files when the recording of the current video conference is finished.
9. An electronic device, comprising:
one or more processors; and
one or more machine readable media having instructions stored thereon that, when executed by the one or more processors, cause the device to perform the video conference processing method of any of claims 1-6.
10. A computer-readable storage medium storing a computer program for causing a processor to execute the video conference processing method according to any one of claims 1 to 6.
CN202010256843.3A 2020-04-02 2020-04-02 Video conference processing method and device, electronic equipment and storage medium Pending CN111541859A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010256843.3A CN111541859A (en) 2020-04-02 2020-04-02 Video conference processing method and device, electronic equipment and storage medium


Publications (1)

Publication Number Publication Date
CN111541859A true CN111541859A (en) 2020-08-14

Family

ID=71971397




Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113010704A (en) * 2020-11-18 2021-06-22 北京字跳网络技术有限公司 Interaction method, device, equipment and medium for conference summary
WO2022105710A1 (en) * 2020-11-18 2022-05-27 北京字跳网络技术有限公司 Meeting minutes interaction method and apparatus, device, and medium
EP4202727A4 (en) * 2020-11-18 2024-03-06 Beijing Zitiao Network Technology Co Ltd Meeting minutes interaction method and apparatus, device, and medium
CN112541495A (en) * 2020-12-22 2021-03-23 厦门亿联网络技术股份有限公司 Conference message detection method, device, server and storage medium
CN113411532A (en) * 2021-06-24 2021-09-17 Oppo广东移动通信有限公司 Method, device, terminal and storage medium for recording content
CN113411532B (en) * 2021-06-24 2023-08-08 Oppo广东移动通信有限公司 Method, device, terminal and storage medium for recording content
CN116266858A (en) * 2021-12-17 2023-06-20 广州迈聆信息科技有限公司 Audio and video recording method, device, equipment and medium
WO2024051612A1 (en) * 2022-09-09 2024-03-14 抖音视界有限公司 Video content preview interaction method and apparatus, and electronic device and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination