CN110636245A - Audio processing method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN110636245A
Authority
CN
China
Prior art keywords
video
audio
data
server
voice recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910804699.XA
Other languages
Chinese (zh)
Other versions
CN110636245B (en)
Inventor
胡贵超
安君超
韩杰
王艳辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Visionvera Information Technology Co Ltd
Original Assignee
Visionvera Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Visionvera Information Technology Co Ltd filed Critical Visionvera Information Technology Co Ltd
Priority to CN201910804699.XA
Publication of CN110636245A
Application granted
Publication of CN110636245B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 7/00 Television systems
    • H04N 7/14 Systems for two-way working
    • H04N 7/141 Systems for two-way working between two video terminals, e.g. videophone
    • H04N 7/147 Communication arrangements, e.g. identifying the communication as a video-communication, intermediate storage of the signals
    • H04N 7/15 Conference systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Telephonic Communication Services (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The invention provides an audio processing method and apparatus, an electronic device, and a storage medium. The method comprises the following steps: a first video network terminal collects the audio and video of a speaker and encodes them into audio and video data; the first video network terminal sends the audio and video data to a first video network server; the first video network server sends the audio data within the audio and video data to a first voice recognition server; and the first voice recognition server recognizes the audio data to obtain the text data corresponding to the audio data and stores the text data. With the scheme of the invention, the conference content that needs to be recorded can be recorded automatically and intelligently as text during a video networking conference, without manual note-taking, which saves human resources and ensures the completeness and correctness of the conference record.

Description

Audio processing method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of communications technologies, and in particular, to an audio processing method and apparatus, an electronic device, and a storage medium.
Background
The video network is a real-time switching platform and a higher-level form of the internet; it pushes many internet applications toward unified, high-definition, face-to-face video. It ultimately aims to realize a world without distance, where the distance between any two people is only that of a screen. At the same time, the video network has the flexibility, simplicity, and low cost of packet switching together with the quality and security guarantees of circuit switching, realizing for the first time in the history of communications a seamless combination of whole-network switched virtual circuits and data formats.
At present, when a conference is established over the video networking network and equipment, every participating terminal can receive the speaker's audio and video in real time and in high definition; the effect is far better than that of a conference built on internet networks and equipment in the current internet field.
However, in existing conferences built on the video network, if some important conference content needs to be preserved, a dedicated person must record it manually. This not only wastes manpower; the manual recording process may also introduce errors and omissions, or fail to preserve the conference content completely.
Disclosure of Invention
The invention provides an audio processing method and apparatus, an electronic device, and a storage medium, to solve the above problems.
In order to solve the above technical problem, an embodiment of the present invention provides an audio processing method applied to a video networking conference system, where the video networking conference system includes a first video network terminal, a second video network terminal, a first video network server, and a first voice recognition server, and the method comprises the following steps:
the first video network terminal collects the audio and video of a speaker and encodes the audio and video into audio and video data;
the first video network terminal sends the audio and video data to a first video network server;
the first video network server sends the audio data in the audio and video data to the first voice recognition server and sends the audio and video data to a second video network terminal that is in the same video networking conference as the first video network terminal, wherein the first voice recognition server is located in a first geographic area;
the second video network terminal receives the audio and video data, decodes the audio and video data and plays a corresponding audio and video;
and the first voice recognition server recognizes the audio data to obtain text data corresponding to the audio data, and stores the text data.
Optionally, the second video network terminal is not in the same geographical area as the first video network server, and the video network conference system further includes a second video network server and a second voice recognition server; the first video network server sends the audio and video data to a second video network terminal which is in the same video network conference with the first video network terminal, and the method comprises the following steps:
the first video network server sends the audio and video data to the second video network server, and the second video network server and the second video network terminal are both located in a second geographic area;
the second video network server sends the audio data in the audio and video data to the second voice recognition server and sends the audio and video data to the second video network terminal, wherein the second voice recognition server is located in the second geographic area;
the method further comprises the following steps:
and the second voice recognition server recognizes the audio data to obtain text data corresponding to the audio data, and stores the text data.
Optionally, the recognizing, by the first speech recognition server, the audio data to obtain text data corresponding to the audio data includes:
the first voice recognition server decodes the audio data to obtain corresponding audio;
the first voice recognition server filters the audio to remove the silent parts and noise parts;
and after the accumulated filtered audio reaches a preset length, the first voice recognition server recognizes the audio of the preset length to obtain the corresponding text data.
Optionally, the recognizing, by the first speech recognition server, the audio with the preset length to obtain corresponding text data includes:
the first voice recognition server extracts the characteristics of the audio with the preset length;
the first voice recognition server inputs the extracted features into a pre-trained voice recognition model to obtain corresponding text data;
the pre-trained speech recognition model is obtained by training a Gaussian mixture model by taking a plurality of audio samples as training samples.
Optionally, the method further comprises:
the first voice recognition server obtains conference information of a video networking conference where the first video networking terminal is located, wherein the conference information comprises at least one of the following: a conference identifier, information of the speaker, and a conference time;
the first speech recognition server stores the text data, including:
and the first voice recognition server takes the conference information as a file name and stores the text data serving as file content into a corresponding file.
The embodiment of the invention also provides an audio processing apparatus, which is applied to a video networking conference system, where the video networking conference system comprises a first video network terminal, a second video network terminal, a first video network server, and a first voice recognition server. The apparatus includes:
the data acquisition and coding module is used for acquiring the audio and video of a speaker by the first video network terminal and coding the audio and video into audio and video data;
the terminal scheduling module is used for the first video network terminal to send the audio and video data to the first video network server;
the server scheduling module is used for the first video network server to send the audio data in the audio and video data to the first voice recognition server and send the audio and video data to the second video network terminal which is in the same video network conference with the first video network terminal, wherein the first voice recognition server is positioned in the first geographic area;
the terminal data processing module is used for receiving the audio and video data and playing the corresponding audio and video after decoding by the second video network terminal;
and the voice data processing module is used for identifying the audio data by the first voice identification server to obtain text data corresponding to the audio data and storing the text data.
Optionally, the second video network terminal is not in the same geographical area as the first video network server, and the video network conference system further includes a second video network server and a second voice recognition server; the server scheduling module comprises:
the cross-server scheduling submodule is used for the first video network server to send the audio and video data to the second video network server, and the second video network server and the second video network terminal are both located in a second geographic area;
the co-domain scheduling submodule is used for the second video network server to send the audio data in the audio and video data to the second voice recognition server and send the audio and video data to the second video network terminal, wherein the second voice recognition server is positioned in the second geographic area;
the voice data processing module is further configured to identify the audio data by the second voice identification server, obtain text data corresponding to the audio data, and store the text data.
Optionally, the voice data processing module includes:
the decoding submodule is used for decoding the audio data by the first voice recognition server to obtain corresponding audio;
the filtering submodule is used for the first voice recognition server to filter the audio and remove the silent and noise parts;
and the recognition submodule is used for recognizing the audio with the preset length after the length of the accumulated and filtered audio reaches the preset length by the first voice recognition server to obtain corresponding text data.
Optionally, the recognition submodule comprises:
the feature extraction subunit is used for the first voice recognition server to extract the features of the audio of the preset length;
the text data subunit is used for the first voice recognition server to input the extracted features into a pre-trained voice recognition model to obtain corresponding text data;
the pre-trained speech recognition model is obtained by taking a plurality of audio samples as training samples and training on the basis of a Gaussian mixture model.
Optionally, the apparatus further comprises:
a conference information obtaining module, configured to obtain, by the first voice recognition server, conference information of a video networking conference in which the first video networking terminal is located, where the conference information includes at least one of: a conference identifier, information of the speaker, and a conference time;
the voice data processing module is further configured to store the text data as file content in a corresponding file by using the conference information as a file name through the first voice recognition server.
With the audio processing method provided by the invention, a video network terminal collects the speaker's audio and video, encodes them into audio and video data, and, through the video network server, sends the audio data to the voice recognition server while sending the audio and video data to the other video network terminals that need to receive it. While the other video network terminals play the corresponding audio and video, the voice recognition server recognizes the received audio data to obtain the corresponding text data and stores it. The audio processing method of the invention can thus automatically and intelligently record the conference content that needs to be recorded during a video networking conference, without manual note-taking, which both saves human resources and ensures the completeness and correctness of the conference record.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic networking diagram of a video network of the present invention;
Fig. 2 is a schematic diagram of a hardware structure of a node server of the present invention;
Fig. 3 is a schematic diagram of a hardware structure of an access switch of the present invention;
Fig. 4 is a schematic diagram of a hardware structure of an Ethernet protocol conversion gateway of the present invention;
Fig. 5 is a flowchart of an audio processing method according to an embodiment of the present invention;
Fig. 6 is a schematic diagram of a video networking conference system according to an embodiment of the present invention;
Fig. 7 is a block diagram of an audio processing apparatus according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The video network is an important milestone in network development. It is a real-time network that can realize real-time transmission of high-definition video and pushes many internet applications toward high-definition, face-to-face communication.
Using real-time high-definition video switching technology, the video network can integrate dozens of required services (video, voice, pictures, text, communications, data, and so on) on one network platform, including high-definition video conferencing, video surveillance, intelligent surveillance analysis, emergency command, digital broadcast television, time-shifted television, network teaching, live broadcast, VOD on demand, television mail, Personal Video Recorder (PVR), intranet (self-run) channels, intelligent video broadcast control, and information distribution, realizing high-definition video broadcasting through a television or a computer.
To better understand the embodiments of the present invention, the video network is introduced below:
some of the technologies applied in the video networking are as follows:
network Technology (Network Technology)
The network technology innovation of the video network improves on traditional Ethernet to face the potentially enormous video traffic on the network. Unlike pure network Packet Switching or network Circuit Switching, the video network technology adopts packet switching to meet the demands of streaming. The video network technology has the flexibility, simplicity, and low cost of packet switching together with the quality and security guarantees of circuit switching, realizing a seamless combination of whole-network switched virtual circuits and data formats.
Switching Technology (Switching Technology)
The video network adopts the two advantages of Ethernet, asynchrony and packet switching, and eliminates Ethernet's defects on the premise of full compatibility. It has end-to-end seamless connectivity across the whole network, connects directly to user terminals, and directly carries IP data packets. User data requires no format conversion anywhere across the network. As a higher-level form of Ethernet and a real-time switching platform, the video network can realize the whole-network, large-scale, real-time transmission of high-definition video that the current internet cannot, pushing many network video applications toward high definition and unification.
Server Technology (Server Technology)
The server technology of the video network and the unified video platform is different from that of traditional servers. Its streaming media transmission is built on a connection-oriented basis, its data processing capability is independent of traffic and communication time, and a single network layer can carry both signaling and data transmission. For voice and video services, streaming media processing on the video network and unified video platform is much simpler than general data processing, so efficiency is improved more than a hundredfold over a traditional server.
Storage Technology (Storage Technology)
To handle media content of very large capacity and very large traffic, the ultra-high-speed storage technology of the unified video platform adopts the most advanced real-time operating system. The program information in a server instruction is mapped to a specific hard disk space, and the media content no longer passes through the server but is sent directly and instantly to the user terminal, so the user's typical waiting time is under 0.2 second. The optimized sector layout greatly reduces the mechanical seek movement of the hard disk head: resource consumption is only 20% of that of an IP internet system of the same grade, yet concurrent throughput is 3 times that of a traditional hard disk array, for an overall efficiency improvement of more than 10 times.
Network Security Technology (Network Security Technology)
The structural design of the video network completely eliminates, by structure, the network security problems that trouble the internet, through mechanisms such as independent permission control for each service and complete isolation of equipment and user data. It generally needs no antivirus programs or firewalls, is immune to hacker and virus attacks, and provides users with a structurally worry-free secure network.
Service Innovation Technology (Service Innovation Technology)
The unified video platform integrates services with transmission: whether for a single user, a private-network user, or a network aggregate, only one automatic connection is needed. The user terminal, set-top box, or PC connects directly to the unified video platform to obtain a rich variety of multimedia video services. The unified video platform uses a menu-style configuration table instead of traditional complex application programming, so complex applications can be realized with very little code, enabling unlimited new service innovation.
Networking of the video network is as follows:
the video network is a centralized control network structure, and the network can be a tree network, a star network, a ring network and the like, but on the basis of the centralized control node, the whole network is controlled by the centralized control node in the network.
As shown in fig. 1, the video network is divided into an access network and a metropolitan network.
The devices of the access network part can be mainly classified into 3 types: node server, access switch, terminal (including various set-top boxes, coding boards, memories, etc.). The node server is connected to an access switch, which may be connected to a plurality of terminals and may be connected to an ethernet network.
The node server is a node which plays a centralized control function in the access network and can control the access switch and the terminal. The node server can be directly connected with the access switch or directly connected with the terminal.
Similarly, devices of the metropolitan network portion may also be classified into 3 types: a metropolitan area server, a node switch and a node server. The metro server is connected to a node switch, which may be connected to a plurality of node servers.
The node server here is the node server of the access network part; that is, the node server belongs to both the access network part and the metropolitan area network part.
The metropolitan area server is a node which plays a centralized control function in the metropolitan area network and can control a node switch and a node server. The metropolitan area server can be directly connected with the node switch or directly connected with the node server.
Therefore, the whole video network is a network structure with layered centralized control, and the network controlled by the node server and the metropolitan area server can be in various structures such as tree, star and ring.
The access network part can form a unified video platform (the part in the dotted circle), and a plurality of unified video platforms can form a video network; each unified video platform may be interconnected via metropolitan area and wide area video networking.
1. Video networking device classification
1.1 devices in the video network of the embodiment of the present invention can be mainly classified into 3 types: servers, switches (including ethernet gateways), terminals (including various set-top boxes, code boards, memories, etc.). The video network as a whole can be divided into a metropolitan area network (or national network, global network, etc.) and an access network.
1.2 wherein the devices of the access network part can be mainly classified into 3 types: node servers, access switches (including ethernet gateways), terminals (including various set-top boxes, code boards, memories, etc.).
The specific hardware structure of each access network device is as follows:
a node server:
as shown in fig. 2, the system mainly includes a network interface module 201, a switching engine module 202, a CPU module 203, and a disk array module 204;
the network interface module 201, the CPU module 203, and the disk array module 204 all enter the switching engine module 202; the switching engine module 202 performs an operation of looking up the address table 205 on the incoming packet, thereby obtaining the direction information of the packet; and stores the packet in a queue of the corresponding packet buffer 206 based on the packet's steering information; if the queue of the packet buffer 206 is nearly full, it is discarded; the switching engine module 202 polls all packet buffer queues for forwarding if the following conditions are met: 1) the port send buffer is not full; 2) the queue packet counter is greater than zero. The disk array module 204 mainly implements control over the hard disk, including initialization, read-write, and other operations on the hard disk; the CPU module 203 is mainly responsible for protocol processing with an access switch and a terminal (not shown in the figure), configuring an address table 205 (including a downlink protocol packet address table, an uplink protocol packet address table, and a data packet address table), and configuring the disk array module 204.
The access switch:
as shown in fig. 3, the network interface module mainly includes a network interface module (a downlink network interface module 301 and an uplink network interface module 302), a switching engine module 303 and a CPU module 304;
A packet (uplink data) arriving from the downlink network interface module 301 enters the packet detection module 305. The packet detection module 305 checks whether the Destination Address (DA), Source Address (SA), packet type, and packet length of the packet meet the requirements; if so, it allocates a corresponding stream identifier (stream-id) and passes the packet to the switching engine module 303, otherwise it discards the packet. A packet (downlink data) arriving from the uplink network interface module 302 enters the switching engine module 303, as does a packet arriving from the CPU module 304. The switching engine module 303 looks up the address table 306 for each incoming packet to obtain its direction information. If a packet entering the switching engine module 303 is going from a downlink network interface to an uplink network interface, it is stored in the queue of the corresponding packet buffer 307 in association with its stream-id; if that queue is nearly full, it is discarded. If a packet entering the switching engine module 303 is not going from a downlink network interface to an uplink network interface, it is stored in the queue of the corresponding packet buffer 307 according to its direction information; if that queue is nearly full, it is discarded.
The switching engine module 303 polls all packet buffer queues, which in this embodiment of the present invention is divided into two cases:
If the queue is going from a downlink network interface to an uplink network interface, it is forwarded when the following conditions are met: 1) the port send buffer is not full; 2) the queue packet counter is greater than zero; 3) a token generated by the code rate control module is obtained.
If the queue is not going from a downlink network interface to an uplink network interface, it is forwarded when the following conditions are met: 1) the port send buffer is not full; 2) the queue packet counter is greater than zero.
The rate control module 308 is configured by the CPU module 304 and, at programmable intervals, generates tokens for all packet buffer queues going from downlink network interfaces to uplink network interfaces, to control the rate of upstream forwarding.
The CPU module 304 is mainly responsible for protocol processing with the node server, configuration of the address table 306, and configuration of the code rate control module 308.
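The upstream case adds the token condition. Below is a sketch under the same illustrative assumptions, with the rate control module modeled as a token source refilled at a programmable interval; none of these names come from the patent:

```python
import time
from collections import deque

class RateControl:
    """Token source for downlink-to-uplink queues (illustrative model of module 308)."""
    def __init__(self, interval_s):
        self.interval_s = interval_s   # programmable token-generation interval
        self.tokens = 0
        self._last = time.monotonic()

    def refill(self):
        now = time.monotonic()
        while now - self._last >= self.interval_s:
            self.tokens += 1           # one token per elapsed interval
            self._last += self.interval_s

    def take(self):
        if self.tokens > 0:
            self.tokens -= 1
            return True
        return False

def poll_upstream(packets: deque, send_buffer_free, send, rc: RateControl):
    """Forward one upstream packet only when all three conditions above hold."""
    rc.refill()
    if send_buffer_free() and len(packets) > 0 and rc.take():
        send(packets.popleft())
```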
Ethernet protocol conversion gateway
As shown in fig. 4, the apparatus mainly includes a network interface module (a downlink network interface module 401 and an uplink network interface module 402), a switching engine module 403, a CPU module 404, a packet detection module 405, a rate control module 408, an address table 406, a packet buffer 407, a MAC adding module 409, and a MAC deleting module 410.
A data packet arriving from the downlink network interface module 401 enters the packet detection module 405. The packet detection module 405 checks whether the Ethernet MAC DA, Ethernet MAC SA, Ethernet length or frame type, video network destination address DA, video network source address SA, video network packet type, and packet length of the packet meet the requirements; if so, it allocates a corresponding stream identifier (stream-id), and the MAC deletion module 410 strips the MAC DA, MAC SA, and length or frame type (2 bytes) and passes the packet into the corresponding receive buffer; otherwise, the packet is discarded.
The downlink network interface module 401 monitors the send buffer of the port; if there is a packet, it obtains the Ethernet MAC DA of the corresponding terminal according to the video network destination address DA of the packet, prepends the terminal's Ethernet MAC DA, the Ethernet protocol conversion gateway's MAC SA, and the Ethernet length or frame type, and sends the packet.
The other modules of the Ethernet protocol conversion gateway function similarly to those of the access switch.
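For illustration, a sketch of the MAC deletion and MAC adding steps, assuming a standard 14-byte Ethernet II header; the function names are assumptions, not from the patent:

```python
import struct

ETH_HEADER = struct.Struct("!6s6sH")  # MAC DA (6 B), MAC SA (6 B), length/frame type (2 B)

def strip_mac(frame: bytes) -> bytes:
    """MAC deletion step: drop MAC DA, MAC SA, and length/frame type."""
    return frame[ETH_HEADER.size:]

def add_mac(packet: bytes, terminal_mac: bytes, gateway_mac: bytes, frame_type: int) -> bytes:
    """MAC adding step: prepend the terminal's MAC DA and the gateway's MAC SA."""
    return ETH_HEADER.pack(terminal_mac, gateway_mac, frame_type) + packet
```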
A terminal:
the system mainly comprises a network interface module, a service processing module and a CPU module; for example, the set-top box mainly comprises a network interface module, a video and audio coding and decoding engine module and a CPU module; the coding board mainly comprises a network interface module, a video and audio coding engine module and a CPU module; the memory mainly comprises a network interface module, a CPU module and a disk array module.
1.3 The devices of the metropolitan area network part can be mainly classified into 3 types: node servers, node switches, and metropolitan area servers. The node switch mainly comprises a network interface module, a switching engine module, and a CPU module; the metropolitan area server mainly comprises a network interface module, a switching engine module, and a CPU module.
2. Video networking packet definition
2.1 Access network packet definition
The data packet of the access network mainly comprises the following parts: destination Address (DA), Source Address (SA), reserved bytes, payload (pdu), CRC.
As shown in the following table, the data packet of the access network mainly includes the following parts:
DA | SA | Reserved | Payload | CRC
wherein:
the Destination Address (DA) consists of 8 bytes. The first byte represents the type of the data packet (for example, the various protocol packets, multicast data packets, unicast data packets, and so on), with at most 256 possibilities; the second to sixth bytes form the metropolitan area network address; and the seventh and eighth bytes form the access network address;
the Source Address (SA) also consists of 8 bytes, with the same definition as the Destination Address (DA);
the reserved byte consists of 2 bytes;
the payload part has different lengths according to the type of the datagram: 64 bytes for the various protocol packets, and 32+1024 = 1056 bytes for unicast data packets; of course, the length is not limited to these 2 cases;
the CRC consists of 4 bytes and is calculated in accordance with the standard ethernet CRC algorithm.
2.2 metropolitan area network packet definition
The topology of the metropolitan area network is a graph, and there may be 2 or even more connections between two devices; that is, there may be more than 2 connections between a node switch and a node server, between two node switches, or between two node servers. Since the metropolitan area network address of a device is unique, in order to describe the connection relationships between devices accurately, a parameter is introduced in the embodiment of the present invention: a label, to uniquely describe a metropolitan area network device.
In this specification, the definition of a label is similar to that of an MPLS (Multi-Protocol Label Switching) label: assuming there are two connections between device A and device B, a packet going from device A to device B has 2 labels, and a packet going from device B to device A has 2 labels. Labels are divided into incoming labels and outgoing labels: assuming a packet's label on entering device A (the incoming label) is 0x0000, its label on leaving device A (the outgoing label) may become 0x0001. The network-entry process of the metropolitan area network is under centralized control; that is, both address allocation and label allocation of the metropolitan area network are dominated by the metropolitan area server, while the node switches and node servers execute passively. This differs from MPLS label allocation, which is the result of mutual negotiation between switch and server.
As shown in the following table, the data packet of the metro network mainly includes the following parts:
DA | SA | Reserved | Label | Payload | CRC
That is: Destination Address (DA), Source Address (SA), Reserved bytes, Label, Payload (PDU), and CRC. The format of the label may be defined by reference to the following: the label is 32 bits, with the upper 16 bits reserved and only the lower 16 bits used; its position is between the reserved bytes and the payload of the packet.
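A sketch of the corresponding metropolitan-area packet, inserting the 32-bit label (low 16 bits used) between the reserved bytes and the payload; again illustrative, not a normative encoder:

```python
import struct
import zlib

def pack_metro_packet(da: bytes, sa: bytes, label: int, payload: bytes) -> bytes:
    assert len(da) == 8 and len(sa) == 8
    assert 0 <= label <= 0xFFFF   # upper 16 bits of the 32-bit label are reserved
    body = da + sa + b"\x00\x00" + struct.pack("!I", label) + payload
    return body + struct.pack("!I", zlib.crc32(body))

# e.g. a packet leaving device A with outgoing label 0x0001
pkt = pack_metro_packet(b"\x01" + b"\x00" * 7, b"\x02" + b"\x00" * 7, 0x0001, b"data")
```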
Based on the characteristics of the video network, one of the core concepts of the embodiments of the invention is proposed: following the video network protocols, the video network server, after receiving the audio and video data sent by a video network terminal, sends the audio data to the voice recognition server; meanwhile, the other participating video network terminals, after receiving the audio and video data sent by the video network server, decode and play the corresponding audio and video.
The inventor has found that, in a conference established over the current video network, if some important conference content needs to be preserved, a dedicated person is generally assigned to record it manually, which not only wastes manpower but also risks errors and omissions during manual recording, or incomplete preservation of the conference content.
To solve the above problems, the inventor of the present invention has made extensive studies and creatively proposes that, while the audio and video are being played on the corresponding video network terminals, a voice recognition server processes the audio data therein to obtain the corresponding text data. The invention can thus record the conference content automatically and intelligently, without manual recording. The invention is explained and illustrated in detail below.
Fig. 5 is a flowchart of an audio processing method according to an embodiment of the present invention. The method is applied to a video networking conference system, where the video networking conference system includes a first video network terminal, a second video network terminal, a first video network server, and a first voice recognition server. The audio processing method comprises the following steps:
step 101: the first video network terminal collects the audio and video of the speaker and encodes the audio and video into audio and video data.
In the embodiment of the invention, a video networking conference is generally initiated by a chairman terminal, which is a video network terminal on which a video networking conference control program is installed. After the chairman terminal initiates the conference, at least two video network terminals participate: one of them is the chairman terminal, and the others are participant terminals. Of course, a speaker may speak at the chairman terminal or at any video network terminal participating in the conference.
In the meeting process, a first video network terminal where a speaker is located collects the audio and video of the speaker, and encodes the collected audio and video into corresponding audio and video data.
As an example: a certain video networking conference has three participating video network terminals: a 1# terminal, a 2# terminal, and a 3# terminal. The 1# terminal and the 2# terminal are located in the same geographic area and are both connected to the 1# video network server, to which the 1# voice recognition server is also connected. The 3# terminal is not in the same geographic area as the 1# and 2# terminals; it is connected not to the 1# video network server but to the 2# video network server, to which the 2# voice recognition server is connected. The 1# terminal is the chairman terminal.
After the conference is established, a speaker speaks through the 1# terminal, the 1# terminal collects the audio and video of the speaker and encodes the audio and video into corresponding audio and video data.
Step 102: and the first video network terminal sends the audio and video data to a first video network server.
In the embodiment of the invention, according to the characteristics of the video network, the other participating video network terminals receive the speaker's speech only through the first video network server, which forwards the audio and video data of the first video network terminal; therefore the first video network terminal needs to send the encoded audio and video data to the first video network server.
Following the above example: the 1# terminal sends the encoded audio and video data to the 1# video network server.
Step 103: the first video network server sends audio data in the audio and video data to the first voice recognition server and sends the audio and video data to a second video network terminal which is in the same video network conference with the first video network terminal, wherein the first voice recognition server is located in a first geographical area.
In the embodiment of the invention, after the first video network server receives the audio and video data, it extracts the audio data from the audio and video data and sends the audio data to the first voice recognition server; meanwhile, the first video network server sends the audio and video data to all participating video network terminals (including the second video network terminal). The first voice recognition server, the first video network server, and the first video network terminal are all located in the first geographic area, and the first voice recognition server and the first video network terminal are both connected to the first video network server. Of course, it can be understood that, if some participating video network terminal is not in the first geographic area, the audio and video data it receives needs to be forwarded by the first video network server and by a video network server connected to that terminal outside the first geographic area.
Following the above example: the 1# terminal sends the encoded audio and video data to the 1# video network server; the 1# video network server extracts the audio data from the audio and video data and sends it to the 1# voice recognition server, while sending the audio and video data to the 2# terminal and the 2# video network server; the 2# video network server can then forward the received audio and video data to the 3# terminal.
Optionally, if the second video network terminal is not located in the same geographic area as the first video network server, the video network conference system further includes a second video network server and a second voice recognition server, and step 103 specifically includes:
step 103 a: the first video network server sends the audio and video data to the second video network server, and the second video network server and the second video network terminal are both located in a second geographic area.
Step 103 b: and the second video network server sends the audio data in the audio and video data to a second voice recognition server and sends the audio and video data to a second video network terminal, wherein the second voice recognition server is positioned in a second geographic area.
In the embodiment of the invention, if the participating second video network terminal is not in the first geographic area, and the second video network terminal, the second video network server, and the second voice recognition server are all in a second geographic area, the first video network server sends the audio and video data to the second video network server.
And after receiving the audio and video data, the second video network server extracts the audio data in the audio and video data and sends the audio data to the second voice recognition server, and meanwhile, sends the audio and video data to the second video network terminal. And after receiving the audio data, the second voice recognition server recognizes the audio data to obtain text data corresponding to the audio data, and stores the text data.
Following the above example: the 1# video network server sends the audio and video data to the 2# video network server; the 2# video network server extracts the audio data from the audio and video data and sends it to the 2# voice recognition server, while sending the audio and video data to the 3# terminal, so that the 3# terminal can decode the audio and video data and play the corresponding audio and video.
And after receiving the audio data, the 2# voice recognition server recognizes the audio data to obtain text data corresponding to the audio data and stores the text data.
Step 104: and the second video network terminal receives the audio and video data, decodes the audio and video data and plays the corresponding audio and video.
In the embodiment of the invention, after the second video network terminal participating in the conference receives the audio and video data, it first decodes the audio and video data and then plays the corresponding audio and video, so that the other conference participants can watch and listen to the speaker through the second video network terminal.
Following the above example: after receiving the audio and video data sent by their video network servers, the 2# terminal and the 3# terminal decode the data and play the corresponding audio and video.
Step 105: and the first voice recognition server recognizes the audio data to obtain text data corresponding to the audio data, and stores the text data.
In the embodiment of the invention, after the first voice recognition server receives the audio data, the audio data is recognized to obtain the text data corresponding to the audio data, and the text data is stored.
Following the above example: and after receiving the audio data, the 1# voice recognition server recognizes the audio data to obtain corresponding text data and stores the text data.
Optionally, step 105 specifically includes:
step 105 a: the first voice recognition server decodes the audio data to obtain corresponding audio.
Step 105 b: the first voice recognition server filters the audio to remove the silent and noise parts.
Step 105 c: after the accumulated filtered audio reaches a preset length, the first voice recognition server recognizes the audio of the preset length to obtain the corresponding text data.
In the above step, the first speech recognition server recognizes the audio with the preset length to obtain the corresponding text data, including:
the method comprises the steps that a first voice recognition server extracts the characteristics of audio with preset length;
the first voice recognition server inputs the extracted features into a pre-trained voice recognition model to obtain corresponding text data;
the pre-trained speech recognition model is obtained by taking a plurality of audio samples as training samples and training on the basis of a Gaussian mixture model.
In the embodiment of the invention, the voice recognition server processes the audio data into corresponding text data. The precondition for the voice recognition server being able to recognize audio is that a voice recognition model has been built. Building the model naturally requires collecting voice material from a variety of speakers; finally, the model can be combined with an acoustic model, a language model, a dictionary, and so on for offline learning, so that the recognition of audio data by the voice recognition server's model reaches sufficiently high accuracy.
When building and training the speech recognition model on the various samples, the sound must first be filtered to remove the silent parts before samples are built and training is performed; otherwise there will be serious deviations. The filtering that excludes silence uses VAD (Voice Activity Detection), which distinguishes the speech parts from the non-speech parts in a complex environment. A simple energy-based approach to silence removal removes the frames whose average energy is less than 0.01 times the average energy of the entire utterance; LTSD (Long-Term Spectral Divergence) can also accomplish this. Likewise, when the speech recognition model is later used to recognize audio, the audio must first be filtered, using the same method.
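A minimal sketch of the energy-based silence removal just described (the frame size and the use of NumPy are assumptions; the patent fixes only the 0.01x threshold):

```python
import numpy as np

def remove_silence(samples: np.ndarray, frame_len: int = 512) -> np.ndarray:
    """Drop frames whose average energy is below 0.01x the utterance's average energy."""
    n_frames = len(samples) // frame_len
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)
    frame_energy = (frames.astype(np.float64) ** 2).mean(axis=1)
    threshold = 0.01 * frame_energy.mean()   # 0.01x the average energy of the whole speech
    return frames[frame_energy >= threshold].reshape(-1)
```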
After the voice recognition server receives the audio data, it decodes the audio data to obtain the corresponding audio, and the data processing module in the voice recognition server filters the audio. Once the length of the accumulated filtered audio reaches a preset length, for example 1 second, the server recognizes that preset-length audio to obtain text data.
The specific process by which the voice recognition server recognizes the preset-length audio is as follows: first the data processing module extracts the features of the audio, then the extracted features are input into the pre-trained speech recognition model. Feature extraction means extracting the discriminative components from the audio according to a relevant algorithm and removing the other components, such as background noise and emotional information; the aim is to convert each frame of waveform into a multi-dimensional vector containing the sound information. The main relevant algorithms include Linear Prediction Cepstrum Coefficients (LPCC) and Mel-Frequency Cepstrum Coefficients (MFCC), among others. The mel frequency is derived from the auditory characteristics of the human ear and has a nonlinear correspondence with frequency in Hz; MFCC features are computed from the Hz spectrum using this relationship. MFCC is mainly used for feature extraction from speech data and reduces the computational dimensionality. For example, for a frame of 512-dimensional (sampling point) data, the 40 most important dimensions (typically) can be extracted after MFCC, achieving dimensionality reduction.
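For illustration, the MFCC extraction step could look like the following; the choice of librosa and the 16 kHz sample rate are assumptions, not part of the patent:

```python
import librosa

def extract_features(path: str, n_mfcc: int = 40):
    """Return one n_mfcc-dimensional vector per audio frame."""
    audio, sr = librosa.load(path, sr=16000)   # load as 16 kHz mono speech
    # librosa returns shape (n_mfcc, n_frames); transpose to one vector per frame,
    # matching the 512-sample -> 40-dimension example above
    return librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc).T
```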
After feature extraction is complete, the features are input into the pre-trained speech recognition model, and the corresponding text data is obtained through the model. The pre-trained speech recognition model is obtained by training on the basis of a Gaussian mixture model with a plurality of audio samples as training samples. A Gaussian Mixture Model (GMM) fits the probability density of a spatial distribution with a weighted sum of multiple Gaussian probability density functions; it can smoothly approximate a density function of arbitrary shape, is an easy-to-handle parametric model, and has very strong representation power for real data. Conversely, the larger the GMM, the stronger the representation power, and the more obvious the negative effects: the parameter scale also expands proportionally, and more data is needed to drive the GMM's parameter training toward a more general (generalized) GMM model. Of course, the speech recognition model may also be combined with an acoustic model, a language model, a dictionary, and so on to recognize the audio and obtain the corresponding text data.
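A sketch of fitting a GMM on pooled MFCC frames with scikit-learn. Reducing the recognizer to a single GMM is a deliberate simplification of the training described above; the component count and covariance type are assumptions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmm(feature_arrays, n_components: int = 32) -> GaussianMixture:
    """Fit a GMM on frames pooled from a plurality of audio samples."""
    X = np.vstack(feature_arrays)   # stack (n_frames, n_mfcc) arrays from many samples
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
    return gmm.fit(X)

# A larger n_components gives stronger representation power but, as noted above,
# needs proportionally more training data to generalize.
```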
Optionally, after the text data is obtained, the audio processing method further includes:
Step S: the first voice recognition server obtains conference information of the video networking conference where the first video networking terminal is located, wherein the conference information comprises at least one of the following: a conference identification, information of the speaker, and a conference time.
In the embodiment of the present invention, when the first voice recognition server stores the obtained text data, the conference information of the video networking conference in which the first video network terminal is located is used as the file name, and the text data is stored as the file content in the corresponding file, where the conference information includes at least one of the following: a conference identifier, information of the speaker, and the conference time. The conference identifier is the name of the conference, determined by the chairman terminal that initiated the conference; the information of the speaker is the speaker's name or other information that can identify the speaker; and the conference time includes the time the conference started and the time it ended.
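A sketch of this storage step; the field order follows the example later in this description, while the separator, extension, and function name are illustrative assumptions:

```python
def store_transcript(conference_id: str, speaker: str, start: str, end: str, text: str) -> str:
    # the conference information forms the file name; the recognized text is the file content
    file_name = f"{conference_id}+{speaker}+{start}-{end}.txt"
    with open(file_name, "w", encoding="utf-8") as f:
        f.write(text)
    return file_name

# e.g. store_transcript(conference_name, "code number zero seven",
#                       "2019.8.8.8:08", "2019.8.8.18:08", text_data)
# note: characters such as ':' may need sanitizing on some filesystems
```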
It should be noted that the voice recognition server's function of processing audio data into text data can be switched on and off: the chairman terminal sends signaling to the video network server to control whether the video network server sends audio data to the voice recognition server. If the conference content does not need to be recorded, the video network server does not send the audio data to the voice recognition server; when the conference content needs to be recorded, it does.
In conclusion, the scheme of the invention not only lets the participating video network terminals watch the speaker's conference audio and video normally, but also intelligently records the conference content to be recorded as text data, without manual recording.
As shown in fig. 6, a schematic diagram of a video networking conference system according to an embodiment of the present invention is shown. The system includes a 1# terminal, a 2# terminal, a 3# terminal, a 1# video network server, a 2# video network server, a 1# voice recognition server, and a 2# voice recognition server. The 1# terminal, the 2# terminal, the 1# video network server, and the 1# voice recognition server belong to the same geographic area; the 3# terminal, the 2# video network server, and the 2# voice recognition server belong to another geographic area. According to the characteristics of the video network, the data interaction between all these devices goes through their protocol modules, so data is transmitted and exchanged based on the video network protocol. The 1# terminal is the chairman terminal that initiates the conference, and the speaker speaks through the 1# terminal. The conference name is determined as "Analysis of the impact of indian acarlla on bara iron on economy"; the speaker information is code number zero seven; the conference start time is 2019-08-08 08:08 and the end time is 2019-08-08 18:08.
After the video networking conference is successfully established, the data module of the 1# terminal collects the speaker's audio and video and encodes them into corresponding audio and video data, and the 1# terminal sends the resulting audio and video data to the 1# video network server through its own scheduling module. The data module of the 1# video network server extracts the audio data from the audio and video data and, through the server's scheduling module, sends the audio data to the 1# voice recognition server while sending the audio and video data to the 2# terminal and the 2# video network server. After receiving the audio and video data, the 2# terminal decodes them through its own data module and, once decoding is complete, plays the corresponding audio and video.
After the 2# video network server receives the audio and video data, it extracts the audio data through its own data module, then sends the audio data to the 2# voice recognition server through its own scheduling module while sending the audio and video data to the 3# terminal. After the 3# terminal receives the audio and video data, it decodes them through its own data module and plays the corresponding audio and video.
After the 1# voice recognition server and the 2# voice recognition server each receive the audio data, each decodes it through its own data processing module to obtain the corresponding audio, and filters the audio. Once the length of the accumulated filtered audio reaches the preset length, the data processing module extracts the features of the audio and inputs the extracted features into the pre-trained speech recognition model, obtaining the corresponding text data through the model. The text data is stored in a file named "Analysis of the impact of indian acarlla on bara iron on economy + code number zero seven + 2019.8.8.8:08-2019.8.8.18:08".
Referring to fig. 7, a block diagram of an audio processing apparatus according to an embodiment of the present invention is shown. The audio processing apparatus is applied to a video networking conference system, where the video networking conference system includes a first video network terminal, a second video network terminal, a first video network server, and a first voice recognition server. The audio processing apparatus includes:
the data acquisition and coding module 310 is used for the first video networking terminal to collect the audio and video of a speaker and encode them into audio and video data;
the terminal scheduling module 320 is used for the first video networking terminal to send the audio and video data to the first video networking server;
the server scheduling module 330 is used for the first video networking server to send the audio data in the audio and video data to the first voice recognition server, and to send the audio and video data to the second video networking terminal in the same video networking conference as the first video networking terminal, where the first voice recognition server is located in a first geographic area;
the terminal data processing module 340 is used for the second video networking terminal to receive the audio and video data, decode them, and play the corresponding audio and video;
and the voice data processing module 350 is used for the first voice recognition server to recognize the audio data, obtain the text data corresponding to the audio data, and store the text data.
Optionally, the second video networking terminal is not in the same geographic area as the first video networking server, and the video networking conference system further includes a second video networking server and a second voice recognition server; the server scheduling module comprises:
the cross-server scheduling submodule is used for the first video network server to send the audio and video data to the second video network server, and the second video network server and the second video network terminal are both located in a second geographic area;
the co-domain scheduling submodule is used for the second video network server to send the audio data in the audio and video data to the second voice recognition server and send the audio and video data to the second video network terminal, wherein the second voice recognition server is positioned in a second geographic area;
the voice data processing module is also used for the second voice recognition server to recognize the audio data, obtain text data corresponding to the audio data and store the text data.
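As a rough illustration of the two-hop scheduling just described, the sketch below assumes a simple dict-based representation of the audio and video data and hypothetical receive/dispatch interfaces; it is an illustrative model, not the patented mechanism.

```python
class Endpoint:
    """Stub terminal or recognition server that records what it receives."""
    def __init__(self, name: str):
        self.name, self.inbox = name, []
    def receive(self, data):
        self.inbox.append(data)

def extract_audio(av_data: dict) -> bytes:
    # Stand-in for the data module extracting audio from the A/V data.
    return av_data["audio"]

class VideoNetworkServer:
    def __init__(self, recognizer, terminals, remote_servers=()):
        self.recognizer = recognizer
        self.terminals = terminals
        self.remote_servers = remote_servers

    def dispatch(self, av_data: dict) -> None:
        # Same-area delivery: audio only to the recognition server,
        # the full audio/video data to each local terminal.
        self.recognizer.receive(extract_audio(av_data))
        for terminal in self.terminals:
            terminal.receive(av_data)
        # Cross-area relay: forward the full A/V data; the remote
        # server repeats the audio extraction for its own area.
        for server in self.remote_servers:
            server.dispatch(av_data)

# Hypothetical usage mirroring the embodiment's two areas:
server2 = VideoNetworkServer(Endpoint("2# recognizer"), [Endpoint("3# terminal")])
server1 = VideoNetworkServer(Endpoint("1# recognizer"),
                             [Endpoint("2# terminal")], [server2])
server1.dispatch({"audio": b"...", "video": b"..."})
```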
Optionally, the voice data processing module includes:
the decoding submodule is used for decoding the audio data by the first voice recognition server to obtain corresponding audio;
the filtering submodule is used for the first voice recognition server to filter the audio, filtering out the silent parts and the noise parts;
and the recognition submodule is used for the first voice recognition server to recognize, after the accumulated length of the filtered audio reaches a preset length, the audio of the preset length to obtain the corresponding text data.
Optionally, the recognition submodule comprises:
the feature extraction subordinate submodule, which is used for the first voice recognition server to extract features from the audio of the preset length;
the text data subordinate submodule, which is used for the first voice recognition server to input the extracted features into a pre-trained voice recognition model to obtain the corresponding text data;
the pre-trained voice recognition model is obtained by taking a plurality of audio samples as training samples and training on the basis of a Gaussian mixture model.
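The patent does not disclose the training procedure beyond naming the Gaussian mixture model, but one conventional way to realize such a model, sketched here with scikit-learn as an assumed dependency, is to fit one mixture per word (or phone) class on labelled feature frames and pick the best-scoring class at recognition time.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_word_models(samples_by_word: dict, n_components: int = 8) -> dict:
    """Fit one Gaussian mixture per word class. `samples_by_word` maps a
    word label to a list of 2-D feature arrays (frames x coefficients),
    e.g. MFCCs computed from labelled audio samples."""
    models = {}
    for word, feature_arrays in samples_by_word.items():
        X = np.vstack(feature_arrays)
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="diag", max_iter=200)
        gmm.fit(X)
        models[word] = gmm
    return models

def recognize(features: np.ndarray, models: dict) -> str:
    """Score a feature array against every word model and return the
    label with the highest average log-likelihood."""
    return max(models, key=lambda w: models[w].score(features))
```

Diagonal covariances are a common choice for speech features of this kind, since they keep the per-class parameter count manageable on modest amounts of training audio.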
Optionally, the audio processing apparatus further comprises:
the conference information obtaining module is used for the first voice recognition server to obtain conference information of the video networking conference where the first video networking terminal is located, and the conference information comprises at least one of the following: meeting identification, information of speakers and meeting time;
the voice data processing module is also used for the first voice recognition server to store the text data as the file content into the corresponding file by taking the conference information as the file name.
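A small sketch of this naming step follows; the dictionary keys and the ".txt" extension are assumptions, while the "+" separator and the ordering of conference name, speaker code, and conference time are taken from the embodiment's example.

```python
def build_record_filename(conference_info: dict) -> str:
    # Conference information -> file name, following the embodiment's
    # scheme: conference name + speaker code + conference time.
    parts = [conference_info.get("name", ""),
             conference_info.get("speaker", ""),
             conference_info.get("time", "")]
    return "+".join(p for p in parts if p) + ".txt"

# Example, with values echoing the embodiment:
# build_record_filename({
#     "name": "Analysis of the impact of Indian Acarlla on Bara Iron on the economy",
#     "speaker": "code number zero seven",
#     "time": "2019.8.8 8:08-2019.8.8 18:08"})
```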
In the audio processing method and apparatus, electronic device, and storage medium described above, a video networking terminal collects the audio and video of a speaker and encodes them into audio and video data; the video networking server sends the audio data within the audio and video data to a voice recognition server and sends the audio and video data to the other video networking terminals that need to receive them. While the other video networking terminals play the corresponding audio and video, the voice recognition server recognizes the received audio data, obtains the corresponding text data, and stores it. In this way, the conference content that needs to be recorded is captured automatically and intelligently in text form during a video networking conference, without manual note-taking, which both saves human resources and ensures the completeness and correctness of the conference record.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, herein, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (10)

1. An audio processing method, applied to a video networking conference system, the video networking conference system comprising: a first video networking terminal, a second video networking terminal, a first video networking server, and a first voice recognition server, the method comprising the following steps:
the first video network terminal collects the audio and video of a speaker and encodes the audio and video into audio and video data;
the first video network terminal sends the audio and video data to a first video network server;
the first video networking server sends the audio data in the audio and video data to the first voice recognition server and sends the audio and video data to the second video networking terminal which is in the same video networking conference as the first video networking terminal, wherein the first voice recognition server is located in a first geographic area;
the second video networking terminal receives the audio and video data, decodes them, and plays the corresponding audio and video;
and the first voice recognition server recognizes the audio data to obtain text data corresponding to the audio data, and stores the text data.
2. The method of claim 1, wherein the second video networking terminal is not in the same geographic area as the first video networking server, and the video networking conference system further comprises a second video networking server and a second voice recognition server; the sending, by the first video networking server, of the audio and video data to the second video networking terminal which is in the same video networking conference as the first video networking terminal comprises the following steps:
the first video network server sends the audio and video data to the second video network server, and the second video network server and the second video network terminal are both located in a second geographic area;
the second video network server sends the audio data in the audio and video data to the second voice recognition server and sends the audio and video data to the second video network terminal, wherein the second voice recognition server is located in the second geographic area;
the method further comprises the following steps:
and the second voice recognition server recognizes the audio data to obtain text data corresponding to the audio data, and stores the text data.
3. The method of claim 1, wherein the recognizing the audio data by the first speech recognition server to obtain text data corresponding to the audio data comprises:
the first voice recognition server decodes the audio data to obtain corresponding audio;
the first voice recognition server filters the audio, filtering out the silent parts and the noise parts;
and after the accumulated length of the filtered audio reaches a preset length, the first voice recognition server recognizes the audio of the preset length to obtain the corresponding text data.
4. The method of claim 3, wherein the recognizing, by the first speech recognition server, the audio with the preset length to obtain the corresponding text data comprises:
the first voice recognition server extracts the characteristics of the audio with the preset length;
the first voice recognition server inputs the extracted features into a pre-trained voice recognition model to obtain corresponding text data;
the pre-trained speech recognition model is obtained by taking a plurality of audio samples as training samples and training on the basis of a Gaussian mixture model.
5. The method of claim 1, further comprising:
the first voice recognition server obtains conference information of a video networking conference where the first video networking terminal is located, wherein the conference information comprises at least one of the following: a conference identifier, information of the speaker, and a conference time;
the first speech recognition server stores the text data, including:
and the first voice recognition server takes the conference information as a file name and stores the text data serving as file content into a corresponding file.
6. An audio processing apparatus, applied to a video networking conference system, the video networking conference system comprising: a first video networking terminal, a second video networking terminal, a first video networking server, and a first voice recognition server, the apparatus comprising:
the data acquisition and coding module, used for the first video networking terminal to collect the audio and video of a speaker and encode them into audio and video data;
the terminal scheduling module, used for the first video networking terminal to send the audio and video data to the first video networking server;
the server scheduling module, used for the first video networking server to send the audio data in the audio and video data to the first voice recognition server and to send the audio and video data to the second video networking terminal which is in the same video networking conference as the first video networking terminal, wherein the first voice recognition server is located in the first geographic area;
the terminal data processing module, used for the second video networking terminal to receive the audio and video data, decode them, and play the corresponding audio and video;
and the voice data processing module, used for the first voice recognition server to recognize the audio data, obtain the text data corresponding to the audio data, and store the text data.
7. The apparatus of claim 6, wherein the second video networking terminal is not in the same geographic area as the first video networking server, and the video networking conference system further comprises a second video networking server and a second voice recognition server; the server scheduling module comprises:
the cross-server scheduling submodule is used for the first video network server to send the audio and video data to the second video network server, and the second video network server and the second video network terminal are both located in a second geographic area;
the co-domain scheduling submodule is used for the second video network server to send the audio data in the audio and video data to the second voice recognition server and send the audio and video data to the second video network terminal, wherein the second voice recognition server is positioned in the second geographic area;
the voice data processing module is further used for the second voice recognition server to recognize the audio data, obtain the text data corresponding to the audio data, and store the text data.
8. The apparatus of claim 6, wherein the voice data processing module comprises:
the decoding submodule is used for decoding the audio data by the first voice recognition server to obtain corresponding audio;
the filtering submodule, used for the first voice recognition server to filter the audio, filtering out the silent parts and the noise parts;
and the recognition submodule, used for the first voice recognition server to recognize, after the accumulated length of the filtered audio reaches a preset length, the audio of the preset length to obtain the corresponding text data.
9. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 5.
10. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method according to any one of claims 1-5.
CN201910804699.XA 2019-08-28 2019-08-28 Audio processing method and device, electronic equipment and storage medium Active CN110636245B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910804699.XA CN110636245B (en) 2019-08-28 2019-08-28 Audio processing method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110636245A true CN110636245A (en) 2019-12-31
CN110636245B CN110636245B (en) 2021-09-10

Family

ID=68969384

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910804699.XA Active CN110636245B (en) 2019-08-28 2019-08-28 Audio processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110636245B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100241845A1 (en) * 2009-03-18 2010-09-23 Daniel Cuende Alonso Method and system for the confidential recording, management and distribution of meetings by means of multiple electronic devices with remote storage
US20150154183A1 (en) * 2011-12-12 2015-06-04 Google Inc. Auto-translation for multi user audio and video
CN107094139A (en) * 2017-04-12 2017-08-25 黄晓咏 A kind of videoconference communication system
CN109302576A (en) * 2018-09-05 2019-02-01 视联动力信息技术股份有限公司 Meeting treating method and apparatus
JP2019061594A (en) * 2017-09-28 2019-04-18 株式会社野村総合研究所 Conference support system and conference support program
CN109788235A (en) * 2019-02-26 2019-05-21 视联动力信息技术股份有限公司 A kind of processing method and system of the minutes information based on view networking
CN109788232A (en) * 2018-12-18 2019-05-21 视联动力信息技术股份有限公司 A kind of summary of meeting recording method of video conference, device and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant