CN113450797A - Audio processing method, device, storage medium and system based on online conference - Google Patents

Audio processing method, device, storage medium and system based on online conference

Info

Publication number
CN113450797A
Authority
CN
China
Prior art keywords
audio data, conference, voice, paths, data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110729223.1A
Other languages
Chinese (zh)
Inventor
韦国华
顾振华
张祖良
王超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Keda Technology Co Ltd
Original Assignee
Suzhou Keda Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Keda Technology Co Ltd filed Critical Suzhou Keda Technology Co Ltd
Priority to CN202110729223.1A priority Critical patent/CN113450797A/en
Publication of CN113450797A publication Critical patent/CN113450797A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/04 Segmentation; Word boundary detection
    • G10L15/05 Word boundary detection
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command
    • G10L15/26 Speech to text systems

Abstract

The application relates to an audio processing method, device, storage medium and system based on an online conference, belonging to the field of computer technology. The method comprises the following steps: acquiring audio data of at least two participating terminals during an online conference; determining, from the at least two paths of audio data, the one path of target audio data corresponding to the main speaker; and acquiring first text data corresponding to the target audio data. This solves the problem that converting every path of audio data into text data consumes a large amount of transcription resources: by predicting the current main speaker and processing only the one path of audio data corresponding to that speaker, an online conference needs only one path of voice transcription resources, so the transcription resources occupied by the online conference can be reduced. Moreover, because the audio actually used for transcription is the original audio, there is no signal loss caused by intermediate processing, and the accuracy of voice transcription can be improved.

Description

Audio processing method, device, storage medium and system based on online conference
[ technical field ]
The application relates to an audio processing method, equipment, a storage medium and a system based on an online conference, and belongs to the technical field of computers.
[ background of the invention ]
In the process of performing an online conference, it is necessary to process the voice of the online conference to perform subtitle display or generate a conference summary. The processing of the audio includes a process of converting voice data in the audio data into text data, i.e., a voice transcription process.
The traditional voice transcription method in a video conference works as follows: voice transcription is performed separately on the audio of each participating terminal to obtain text data; the text data of each path are then merged in time order to obtain the text file corresponding to the online conference.
However, transcribing multiple paths of audio separately requires a dedicated voice transcription module for each path, which consumes a large amount of transcription resources.
[ summary of the invention ]
The application provides an audio processing method, equipment, a storage medium and a system based on an online conference, which can solve the problem that a large amount of transcription resources are consumed when each path of audio data is converted into text data. The technical scheme provided by the application is as follows:
in a first aspect, an audio processing method based on an online conference is provided, and is used in a conference intelligent server, and the method includes:
acquiring audio data of at least two participating terminals in the process of an online conference; the at least two participating terminals are accessed into the same online conference, and each participating terminal corresponds to one path of audio data;
determining one path of target audio data corresponding to a main speaker from at least two paths of audio data;
and acquiring first text data corresponding to the target audio data.
Optionally, the determining one path of target audio data corresponding to the main speaker from the at least two paths of audio data includes:
determining whether voice data is included in each path of audio data;
when at least two paths of audio data comprise voice data, acquiring voice characteristics of the voice data;
and determining the target audio data from at least two paths of audio data comprising the voice data according to the voice characteristics.
Optionally, the determining, according to the voice feature, the target audio data from at least two paths of audio data including the voice data includes:
the voice features comprise voice energy, and one path of audio data with the maximum voice energy is determined as the target audio data;
or,
the voice features comprise voice energy and voice duration, and one path of audio data with the voice energy exceeding a preset threshold and the voice duration being the largest is determined as the target audio data;
or,
and the voice characteristics comprise voice duration, and one path of audio data with the voice duration exceeding a preset duration threshold and the voice duration being the maximum is determined as the target audio data.
Optionally, the determining one path of target audio data corresponding to the main speaker from the at least two paths of audio data further includes:
acquiring a main speaker appointed in the online conference;
and determining one path of audio data corresponding to the appointed main speaker from at least two paths of audio data to obtain the target audio data.
Optionally, the obtaining of the first text data corresponding to the target audio data includes:
processing the target audio data by using a voice transcription algorithm to obtain first text data;
or,
sending the target audio data to a designated device, so that the designated device can process the target audio data by using a voice transcription algorithm to obtain first text data corresponding to the target audio data; and receiving first text data corresponding to the target audio data sent by the designated equipment.
Optionally, the acquiring the audio data of the at least two participating terminals includes:
when a conference control platform starts a sound mixing function, obtaining a sound mixing list sent by the conference control platform, wherein the sound mixing list comprises data identifications of the at least two paths of audio data; the conference control platform is used for providing online conference service for the at least two conference-participating terminals, and the number of the data identifiers in the sound mixing list is determined by the conference control platform according to a preset sound mixing depth;
receiving N paths of audio data sent by the conference control platform, wherein the N paths of audio data refer to audio data corresponding to all conference participating terminals accessed to the online conference, and N is greater than or equal to 2;
and acquiring the at least two paths of audio data indicated by the data identification from the N paths of audio data.
Optionally, after the obtaining of the first text data corresponding to the target audio data, the method further includes:
after the online conference is finished, acquiring other audio data of the online conference, wherein the other audio data refers to audio data except the target audio data in the at least two paths of audio data;
acquiring second text data corresponding to the other audio data;
and combining the first text data and the second text data according to a time sequence to obtain a text file corresponding to the online conference.
In a second aspect, an audio processing apparatus based on an online conference is provided, the apparatus comprising:
the audio acquisition module is used for acquiring audio data of at least two participating terminals in the process of an online conference; the at least two participating terminals are accessed into the same online conference, and each participating terminal corresponds to one path of audio data;
the voice frequency determining module is used for determining one path of target voice frequency data corresponding to the main speaker from at least two paths of voice frequency data;
and the text acquisition module is used for acquiring first text data corresponding to the target audio data.
In a third aspect, an electronic device is provided, the device comprising a processor and a memory; the memory stores a program that is loaded and executed by the processor to implement the online conference based audio processing method provided by the first aspect.
In a fourth aspect, a computer-readable storage medium is provided, in which a program is stored, which when executed by a processor is configured to implement the online conference-based audio processing method provided in the first aspect.
In a fifth aspect, an audio processing system based on an online conference is provided, where the system includes N conference-participating terminals, a conference control platform communicatively connected to each conference-participating terminal, and a conference intelligent server connected to the conference control platform; the N participating terminals are accessed into the same online conference, and N is an integer greater than 1;
each of the N participating terminals is used for acquiring audio data in the process of the online conference to obtain N paths of audio data; sending the N paths of audio data to the conference control platform;
the conference control platform is used for receiving the N paths of audio data and forwarding the N paths of audio data to the conference intelligent server;
the intelligent conference server is used for acquiring audio data of at least two participating terminals in the process of an online conference; determining one path of target audio data corresponding to a main speaker from at least two paths of audio data; and acquiring first text data corresponding to the target audio data.
The beneficial effects of this application include at least: acquiring audio data of at least two participating terminals in the process of an online conference; determining one path of target audio data corresponding to a main speaker from at least two paths of audio data; acquiring first text data corresponding to target audio data; the problem that a large amount of transcription resources are consumed when each path of audio data is converted into text data can be solved; only one path of audio data corresponding to the main speaker is processed by predicting the current main speaker, and at the moment, only one path of voice transcription resource is needed for one online conference, so that the transcription resources occupied by the online conference can be reduced.
In addition, the audio actually used for transcription is the original audio instead of the audio after mixing, so that signal loss caused by intermediate processing does not exist, and the accuracy of voice transcription can be improved.
In addition, by determining the main speaker in conjunction with voice detection and voice features, rather than determining the speaker only through voice detection, the accuracy of determining the main speaker may be improved.
In addition, the main speaker appointed by the online conference is obtained, and the main speaker does not need to be determined by a decision, so that the efficiency of determining the main speaker can be improved.
In addition, when the conference control platform starts the audio mixing function, the audio data for entering the audio mixing cannot be heard by the participants, so that the number of paths of the audio data to be judged can be reduced by determining the main speaker from the at least two paths of audio data corresponding to the audio mixing list, and equipment resources are saved.
In addition, when the conference is finished, voice transcription is carried out on other audio data of each path to obtain second text data, and the first text data and the second text data are combined, so that the user can be ensured to obtain complete conference records.
The foregoing is only an overview of the technical solutions of the present application. To make the technical solutions of the present application clearer, and to enable their implementation according to the content of the description, detailed descriptions are given below with reference to the preferred embodiments of the present application and the accompanying drawings.
[ description of the drawings ]
FIG. 1 is a schematic block diagram of an online conference-based audio processing system according to an embodiment of the present application;
FIG. 2 is a flow diagram of a method for online conference based audio processing provided by an embodiment of the present application;
FIG. 3 is a flow chart of a method for online conference based audio processing according to another embodiment of the present application;
fig. 4 is a flowchart of an audio processing method after an online conference is ended according to an embodiment of the present application;
FIG. 5 is a block diagram of an online conference based audio processing device provided by an embodiment of the present application;
fig. 6 is a block diagram of an electronic device provided by an embodiment of the application.
[ detailed description ]
The following detailed description of embodiments of the present application will be made with reference to the accompanying drawings and examples. The following examples are intended to illustrate the present application but are not intended to limit the scope of the present application.
First, several terms involved in the present application are introduced.
Online conference: also called a network conference or remote collaborative office; users use the internet to share data among multiple users at different locations.
Voice Activity Detection (VAD), also called voice endpoint detection or voice boundary detection: used to detect whether the current audio signal contains a voice signal, i.e., to judge the input audio signal and distinguish the voice signal from various background noise signals.
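As a minimal illustration of the VAD idea (not the algorithm used in the application, whose internals are not disclosed here), a voice/noise decision can be sketched as a short-time energy threshold over frames; the frame length, threshold, and frame count below are assumed values:

```python
def frame_energies(samples, frame_len=160):
    """Mean absolute amplitude per frame (assumed 10 ms frames at 16 kHz)."""
    return [
        sum(abs(s) for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]

def contains_speech(samples, frame_len=160, threshold=500, min_voiced_frames=3):
    """Crude energy-based VAD: speech is present if enough frames exceed the threshold."""
    energies = frame_energies(samples, frame_len)
    voiced = sum(1 for e in energies if e > threshold)
    return voiced >= min_voiced_frames

# Example: low-amplitude background noise vs. a louder voiced burst
silence = [10] * 1600
speech = [10] * 480 + [3000] * 640 + [10] * 480
print(contains_speech(silence))  # False
print(contains_speech(speech))   # True
```

Production VAD algorithms additionally use spectral features and hangover smoothing; the sketch only shows why per-path detection can run in parallel, one detector per path.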
Voice transcription technology: a technique for converting audio data into text data. In one example, voice transcription techniques include, but are not limited to, Automatic Speech Recognition (ASR), a technique for converting human speech into text.
Fig. 1 is a schematic structural diagram of an audio processing system based on an online conference according to an embodiment of the present application. In this application, the online conference may be a multipoint conference or a point-to-point conference, and the embodiment does not limit the type of the online conference. As shown in fig. 1, the system includes at least a conferencing terminal 110, a conference control platform 120, and a conference intelligence server 130.
The number of conferee terminals 110 is N, the number of N being greater than or equal to 2. The N participating terminals 110 join the same online conference through the conference control platform 120.
Optionally, the conferencing terminal 110 may be a mobile phone, a tablet computer, a computer, or a like device with an online conferencing function, and the embodiment does not limit the type of the conferencing terminal 110.
During the online conference, the conferencing terminal 110 has the function of collecting the audio data of a participant and sending the audio data to the conference control platform 120. At this time, each participating terminal 110 sends one path of audio data to the conference control platform 120. In other words, each participating terminal 110 corresponds to one path of audio data, and different participating terminals 110 correspond to different paths of audio data.
In addition, the conferencing terminal 110 may also have other functions required when the online conference is in progress, such as: an image acquisition function, a communication function, and the like, which are not listed in this embodiment.
The conferencing terminal 110 is communicatively coupled to the conference control platform 120.
Optionally, the conference control platform 120 may be a Multipoint Control Unit (MCU), or a terminal or a server installed with an MCU, and this embodiment does not limit the implementation manner of the conference control platform 120.
The conference control platform 120 is configured to provide an online conference service for the conferencing terminal 110. In the process of the online conference, the conference control platform 120 is configured to receive the audio data sent by the N conferencing terminals 110, and obtain N channels of audio data.
Optionally, the conference control platform 120 has a sound mixing function, and when the conference control platform 120 starts the sound mixing function, the M channels of audio data in the N channels of audio data may be mixed based on the preset sound mixing depth; and transmits the mixed audio data to each of the participating terminals 110. N is more than or equal to M, and M is more than or equal to 2. The value of M is less than or equal to the preset mixing depth.
Accordingly, the conference terminal 110 receives and plays the audio data mixed by the conference control platform 120, and at this time, the participant corresponding to each conference terminal 110 can only hear the M channels of audio data added to the mixed audio.
The preset mixing depth refers to the maximum number of mixing paths of the conference control platform 120, for example: the preset mixing depth is 8 or 5, and the like, and the value of the preset mixing depth is not limited in this embodiment.
Since there may be many participating terminals 110 collecting audio data at the same time, audio data of all participating terminals 110 need not be mixed. Based on this, in this embodiment, by setting the preset mixing depth, when the number of the participating terminals 110 that simultaneously send audio data exceeds the preset mixing depth, M (equal to the preset mixing depth) channels of audio data are selected from each channel of audio data for mixing, so that transmission resources can be saved, and the conference effect can be improved.
The method for selecting M channels of audio data according to the mixing depth by the conference control platform 120 includes: respectively carrying out voice detection on the received N paths of audio data; determining at least two paths of audio data comprising voice data from the N paths of audio data; determining whether the number of at least two paths of audio data including voice data is greater than the sound mixing depth; if so, acquiring the voice energy of at least two paths of audio data including the voice data, and selecting the M paths of audio data with the highest voice energy as the audio data to be subjected to sound mixing.
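The selection procedure above can be sketched as follows; the data structure (a dict of path identifier to samples) and the energy measure (mean absolute amplitude, one of the measures mentioned below) are illustrative assumptions:

```python
def speech_energy(samples):
    """Voice energy taken as the mean absolute amplitude of the path's samples."""
    return sum(abs(s) for s in samples) / len(samples)

def select_mix_paths(voiced_paths, mix_depth):
    """Pick the audio paths to mix.

    voiced_paths: dict mapping path id -> samples, for paths already found
    (by VAD) to contain voice data. If the number of voiced paths exceeds
    the preset mixing depth, keep the mix_depth paths with the highest
    voice energy; otherwise keep them all.
    """
    if len(voiced_paths) <= mix_depth:
        return sorted(voiced_paths)
    ranked = sorted(voiced_paths,
                    key=lambda pid: speech_energy(voiced_paths[pid]),
                    reverse=True)
    return sorted(ranked[:mix_depth])

paths = {
    "t1": [100, -100, 100],
    "t2": [2000, -2000, 2000],
    "t3": [500, 500, -500],
}
print(select_mix_paths(paths, mix_depth=2))  # ['t2', 't3']
```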
Alternatively, in the present application, the voice energy may be a maximum value of a volume of audio data including the voice data; or a volume average value of audio data including voice data, and the present application does not limit the setting manner of voice energy.
The voice detection of the audio data may be VAD detection, and at this time, the conference control platform 120 further has a VAD detection function.
It should be added that when the number of the participating terminals 110 that simultaneously transmit audio data is less than or equal to the preset mixing depth, the conference control platform 120 may mix all the received audio data without selecting the audio data.
Optionally, the conference control platform 120 does not have a sound mixing function, or when the conference control platform 120 has the sound mixing function but does not turn on the sound mixing function, the conference control platform 120 forwards the audio data to each of the other participating terminals 110 accessing the same online conference every time the conference control platform 120 receives the audio data sent by one participating terminal 110.
In this embodiment, the conference control platform 120 is further communicatively connected to the conference intelligent server 130.
Optionally, the conference intelligent server 130 may be a computer, a server cluster, or the like, and the conference intelligent server 130 may be implemented in the same device as the conference control platform 120, or may be implemented in a different device from the conference control platform 120, and the implementation manner of the conference intelligent server 130 is not limited in this embodiment.
In other embodiments, the conference intelligent server 130 may also be referred to as an audio processing device, an intelligent conference management server, and the like, and the name of the conference intelligent server 130 is not limited in this embodiment.
In the process of the online conference, after receiving the N channels of audio data sent by the N conferencing terminals 110, the conference control platform 120 sends the N channels of audio data to the conference intelligent server 130.
Correspondingly, after receiving the N channels of audio data, the conference intelligent server 130 obtains the audio data of at least two participating terminals; determining one path of target audio data corresponding to a main speaker from at least two paths of audio data; and acquiring first text data corresponding to the target audio data.
The audio data of the at least two conference terminals acquired by the conference intelligent server 130 refers to audio data to be subjected to audio processing. The number of paths of audio data to be subjected to audio processing is less than or equal to N.
The main speaker refers to the participant predicted by the conference intelligent server 130 to be the one mainly speaking in the current time period.
In some conference scenes, the conference mainly takes the form of a single speaker addressing the conference from the rostrum; conversation with the audience is rarely involved, and even when it is needed, the terminals speak in turn. Therefore, in a conference scene dominated by a single speaker, the first text data is obtained by predicting the current main speaker and then processing only the one path of audio data corresponding to that speaker. In this case, one online conference needs only one path of voice transcription resources; and because the audio actually used for transcription is the original audio (not the audio after mixing), there is no signal loss caused by intermediate processing, and the accuracy of voice transcription can be improved.
The following describes an audio processing method for an online conference according to the present application in detail.
Fig. 2 is a flowchart of an audio processing method for an online conference according to an embodiment of the present application, which is described in this embodiment by taking the method as an example for being used in the intelligent conference server 130 of the system shown in fig. 1, and the method at least includes the following steps:
step 201, in the process of the online conference, acquiring audio data of at least two participating terminals.
The at least two conferencing terminals are terminals that access the same online conference, and each conferencing terminal corresponds to one path of audio data.
In this embodiment, the audio data acquired by the conference intelligent server is the audio data forwarded by the conference control platform, that is, after receiving the audio data sent by each conference terminal, the conference control platform forwards the original audio data to the conference intelligent server.
In this embodiment, the audio data of the at least two conferencing terminals acquired by the conference intelligent server is the audio data to be subjected to audio processing. In other words, the conference intelligence server may receive N channels of audio data, but only audio process M channels of audio data.
Such as: under the condition that the conference control platform starts the sound mixing function, if the conference control platform receives N paths of audio data sent by N conference-participating terminals, the N paths of audio data are mixed according to the preset sound mixing depth, and a sound mixing list and the N paths of audio data are sent to the conference intelligent server. The mixing list comprises data identifications of at least two paths of audio data; the number of the data identifications in the mixing list is determined by the conference control platform according to the preset mixing depth.
Optionally, the data identifier is used to uniquely identify one path of audio data, and may be a number, a device number, a conference name, an IP address, or the like of a conference terminal that sends the audio data.
Correspondingly, the conference intelligent server obtains a mixing list sent by the conference control platform, receives N paths of audio data sent by the conference control platform, and obtains at least two paths of audio data indicated by the data identification from the N paths of audio data, namely M paths of audio data to be subjected to audio processing are obtained.
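A sketch of how the conference intelligent server might pick out, from the N received paths, the M paths listed in the mixing list; the identifier strings and the container types are assumptions, not part of the application:

```python
def paths_to_process(all_paths, mix_list):
    """Keep only the audio paths whose data identifier appears in the mixing list.

    all_paths: dict mapping data identifier -> audio data (the N paths).
    mix_list: iterable of data identifiers sent by the conference control platform.
    Identifiers with no matching path are ignored rather than raising, since the
    list and the streams may arrive asynchronously.
    """
    wanted = set(mix_list)
    return {pid: data for pid, data in all_paths.items() if pid in wanted}

n_paths = {"term-1": b"...", "term-2": b"...", "term-3": b"...", "term-4": b"..."}
mix_list = ["term-2", "term-4"]
print(sorted(paths_to_process(n_paths, mix_list)))  # ['term-2', 'term-4']
```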
Under the condition that the conference control platform does not start the audio mixing function or does not have the audio mixing function, the conference intelligent server can take all the received audio data as at least two paths of audio data to be processed, namely, M is equal to N.
Step 202, determining one path of target audio data corresponding to the main speaker from the at least two paths of audio data.
In this embodiment, the number of the main speakers determined by the intelligent conference server is one, so that one path of target audio data corresponding to the main speakers can be obtained.
The method for determining one path of target audio data corresponding to the main speaker from the at least two paths of audio data includes, but is not limited to, one of the following:
the first method comprises the following steps: acquiring a main speaker appointed in an online conference; and determining one path of audio data corresponding to the appointed main speaker from the at least two paths of audio data to obtain target audio data.
The main speaker specified in the online conference is designated by the administrator of the online conference or voted for by the participants; this embodiment does not limit the way in which the main speaker is designated. After the conference control platform obtains the designated main speaker, it sends the speaker information of the designated speaker to the conference intelligent server. Correspondingly, the conference intelligent server obtains the speaker information sent by the conference control platform and determines the main speaker corresponding to the speaker information.
Wherein the speaker information may be the same as the data identification or different from the data identification. The speaker information may be a participant name, an IP address of a participant terminal, a device number, and the like, and the implementation manner of the speaker information is not limited in this embodiment.
And the second method comprises the following steps: determining whether voice data is included in each path of audio data; when at least two paths of audio data comprise voice data, acquiring voice characteristics of the voice data; target audio data is determined from at least two paths of audio data including voice data according to voice characteristics.
In one example, determining whether voice data is included in each audio data includes: and respectively detecting each path of audio data by using a VAD algorithm to obtain a detection result, wherein the detection result is used for indicating whether the audio data comprises voice data or not.
At this time, each path of audio data corresponds to one VAD detection algorithm, in other words, each path of audio data in the same time period is subjected to VAD detection in parallel to obtain a detection result.
In another example, the conference control platform performs VAD detection on each path of audio data during audio mixing to obtain a detection result; and sending the detection result to the intelligent conference server. Accordingly, the conference intelligent server determines whether each path of audio data comprises voice data, and comprises the following steps: receiving a detection result sent by a conference control platform; and determining whether each path of audio data comprises voice data according to the detection result.
Optionally, the voice features comprise voice energy and/or voice duration. The manner in which the target audio data is determined varies with the voice features. Each way of determining the target audio data is described below.
Case 1: when the voice features include voice energy, the path of audio data with the maximum voice energy is determined as the target audio data.
Case 2: when the voice features include voice energy and voice duration, the path of audio data whose voice energy exceeds a preset threshold and whose voice duration is the longest is determined as the target audio data.
A certain path of audio data may exhibit a sudden change in voice energy; for example, a participant sneezes, causing the voice energy to spike and then drop. In this situation, relying on voice energy alone may not determine the target audio data accurately enough. Therefore, in case 2, combining voice energy with voice duration improves the accuracy of determining the target audio data.
Case 3: when the voice features include voice duration, the path of audio data whose voice duration exceeds a preset duration threshold and is the longest is determined as the target audio data.
It should be added that, in practical implementations, the voice features may also include other features, such as voice frequency; this embodiment does not limit the implementation of the voice features or the manner of determining the target audio data from them.
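The three selection cases above can be sketched as follows. The per-channel feature values (energy and duration) are assumed to have been accumulated elsewhere, e.g. from the VAD results, and the field names are illustrative, not taken from the patent.

```python
# Sketch of selecting the target audio channel from speech features,
# covering cases 1-3 above. `features` maps a channel id to assumed
# per-channel values: {"energy": ..., "duration": ...}.

def pick_by_energy(features: dict):
    """Case 1: the channel with the maximum voice energy."""
    return max(features, key=lambda ch: features[ch]["energy"])

def pick_by_energy_and_duration(features: dict, energy_threshold: float):
    """Case 2: among channels whose energy exceeds the preset threshold,
    the one with the longest voice duration (None if no channel qualifies)."""
    eligible = {ch: f for ch, f in features.items() if f["energy"] > energy_threshold}
    if not eligible:
        return None
    return max(eligible, key=lambda ch: eligible[ch]["duration"])

def pick_by_duration(features: dict, duration_threshold: float):
    """Case 3: among channels whose voice duration exceeds the preset
    duration threshold, the one with the longest duration."""
    eligible = {ch: f for ch, f in features.items() if f["duration"] > duration_threshold}
    if not eligible:
        return None
    return max(eligible, key=lambda ch: eligible[ch]["duration"])
```

Case 2's thresholding is what filters out short energy spikes such as the sneeze example above.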
Step 203, acquiring first text data corresponding to the target audio data.
In this embodiment, acquiring the first text data corresponding to the target audio data includes, but is not limited to, the following implementation manners:
In the first manner: the target audio data is processed with a voice transcription algorithm to obtain the first text data.
For example, the target audio data is transcribed using an ASR (Automatic Speech Recognition) algorithm.
In the second manner: the target audio data is sent to a designated device, so that the designated device processes it with a voice transcription algorithm to obtain the first text data corresponding to the target audio data; the first text data sent back by the designated device is then received.
In this case, the conference intelligent server does not itself have a voice transcription function, so it sends the target audio data to a designated device that is communicatively connected to the conference intelligent server and has the voice transcription function.
Optionally, after acquiring the first text data, the conference intelligent server may send it to each participating terminal through the conference control platform, so that the participating terminals display the first text data in real time during the conference. Each participant can then see the text corresponding to the main speaker's speech content.
In the above embodiment, only the path of target audio data corresponding to the main speaker is transcribed, so the speech of the other speakers is not yet converted into text data. To ensure that users can obtain the text data corresponding to every speaker's speech content, in this embodiment, after step 203 the method further includes: when the online conference ends, acquiring the other audio data of the online conference, i.e., the audio data other than the target audio data among the at least two paths of audio data; acquiring second text data corresponding to the other audio data; and combining the first text data and the second text data in time order to obtain a text file corresponding to the online conference.
For details of acquiring the second text data corresponding to the other audio data, refer to the two implementation manners described in step 203; they are not repeated here.
Combining the first text data and the second text data in time order to obtain the text file corresponding to the online conference includes: when the same time period contains both first text data and second text data, establishing a correspondence between that time period and each of the first text data and the second text data, and combining the first text data and the second text data according to the correspondence.
In other words, when speakers overlap in time, the conference intelligent server processes each speaker's original audio data to obtain the text data corresponding to that speaker, and then stores each speaker's text data in the text file.
Optionally, the text file may be sent to each participating terminal as the final conference summary.
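The time-ordered combination described above can be sketched as follows. The entry fields (start time in seconds, speaker label, text) are assumptions for illustration, not a format specified by the patent.

```python
# Sketch of merging the real-time main-speaker transcript (first text data)
# with the post-conference transcripts of the other channels (second text
# data) into one time-ordered text file.

def merge_transcripts(first, second):
    """Merge two transcript entry lists by start time. Entries that share a
    time period keep their own speaker attribution, so overlapping speech
    from different speakers is preserved side by side."""
    merged = sorted(first + second, key=lambda e: e["start"])
    return "\n".join("[%.1fs] %s: %s" % (e["start"], e["speaker"], e["text"])
                     for e in merged)
```

A production version would merge per time period (as the correspondence relation above describes) rather than per entry, but the ordering principle is the same.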
In summary, in the audio processing method based on an online conference provided by this embodiment, audio data of at least two participating terminals is acquired during the online conference; one path of target audio data corresponding to the main speaker is determined from the at least two paths of audio data; and first text data corresponding to the target audio data is acquired. This solves the problem that converting every path of audio data into text data consumes a large amount of transcription resources: in a conference scenario dominated by a single speaker, only the path of audio data corresponding to the predicted current main speaker is processed, so one online conference needs only one path of voice transcription resources, reducing the transcription resources occupied by the online conference.
In addition, the audio actually used for transcription is the original audio rather than the mixed audio, so there is no signal loss from intermediate processing, and the accuracy of voice transcription can be improved.
In addition, determining the main speaker by combining voice detection with voice features, rather than through voice detection alone, can improve the accuracy of determining the main speaker.
In addition, when the conference control platform enables the audio mixing function, audio data that does not enter the audio mixing cannot be heard by the participants. Determining the main speaker only from the at least two paths of audio data corresponding to the audio mixing list therefore reduces the number of paths to be judged and saves device resources.
In addition, when the conference ends, voice transcription is performed on each of the other paths of audio data to obtain second text data, and the first and second text data are combined, ensuring that users obtain a complete conference record.
To make the audio processing method based on an online conference provided by the present application clearer, the method is described below with an example. Referring to fig. 3, this embodiment takes three participating terminals accessing the online conference (participating terminal 1, participating terminal 2, and participating terminal 3) as an example. The method includes at least the following steps:
Step 31, participating terminal 1 joins the conference and sends one path of audio data to the conference control platform; participating terminal 2 joins the conference and sends one path of audio data to the conference control platform; participating terminal 3 joins the conference and sends one path of audio data to the conference control platform.
This embodiment does not limit the order in which participating terminals 1, 2, and 3 join the same online conference.
Step 32, the conference control platform forwards the 3 paths of audio data to the conference intelligent server.
The conference control platform forwards all audio data from the participating terminals to the conference intelligent server. When the audio mixing function is enabled, the conference control platform also sends the list of all audio entering the audio mixing to the conference intelligent server according to the preset audio mixing depth.
Step 33, when a specified main speaker exists in the current online conference, the conference control platform forwards the speaker information of the specified main speaker to the conference intelligent server.
Optionally, the specified main speaker may be forcibly designated at the beginning of the online conference, or may be elected by the conference control platform according to its own decision policy; this embodiment does not limit how the specified main speaker is determined.
Step 33 is not performed when the current online conference does not have a designated primary speaker.
Step 33 may be executed after, before, or simultaneously with step 32; this embodiment does not limit the execution order of steps 32 and 33.
Step 34, the conference intelligent server determines whether the specified main speaker exists in the online conference; if yes, go to step 37; if not, go to step 35.
Specifically, if the conference intelligent server receives the speaker information of a specified main speaker, it determines that a specified main speaker exists in the online conference; if no such speaker information is received, it determines that no specified main speaker exists.
Step 35, the conference intelligent server performs VAD detection on the 3 received paths of audio data.
When the conference control platform has enabled the audio mixing function, this embodiment takes the case where all 3 paths of audio data received by the conference intelligent server enter the audio mixing as an example. In practical implementations, if one path of audio data does not enter the audio mixing, two paths may be selected from the 3 paths according to the audio mixing list for VAD detection; see the description of fig. 2 for details.
Step 36, according to the VAD detection results, the conference intelligent server determines the path of audio data that currently contains voice data, whose voice duration exceeds 3 seconds, and whose voice duration is the longest, obtaining the path of target audio data corresponding to the main speaker.
This embodiment is described with the preset duration threshold taken as 3 seconds; other values may be used in practice, and the present application does not limit the value of the preset duration threshold.
Steps 35 and 36 form a process that loops continuously over time: VAD detection is performed on the 3 paths of audio data for each current time period and the main speaker is determined, until the online conference ends.
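The loop of steps 35 and 36 can be sketched as follows, assuming one boolean VAD result per channel per detection period. The period length and the way a channel's accumulated duration resets on silence are illustrative assumptions, not details fixed by the patent.

```python
# Sketch of the per-period loop in steps 35-36: accumulate each channel's
# continuous voiced duration from the VAD results and pick the channel whose
# voice duration exceeds the preset 3-second threshold and is the longest.

PERIOD = 1.0          # assumed length of one detection period, in seconds
MIN_DURATION = 3.0    # preset duration threshold from step 36

def update_speaker(durations: dict, vad_results: dict):
    """Update per-channel voiced durations with this period's VAD results
    and return the current main speaker's channel, or None if no channel
    has yet spoken for longer than MIN_DURATION."""
    for ch, voiced in vad_results.items():
        # Extend the streak if the channel is voiced; reset it on silence.
        durations[ch] = (durations.get(ch, 0.0) + PERIOD) if voiced else 0.0
    candidates = {ch: d for ch, d in durations.items() if d > MIN_DURATION}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)
```

Calling `update_speaker` once per period until the conference ends mirrors the continuous iteration described above.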
Step 37, the conference intelligent server sends the path of target audio data corresponding to the main speaker to the ASR transcription server for transcription.
This embodiment is described with transcription performed by the ASR transcription server; in practice, the transcription process may also be completed within the conference intelligent server, and this embodiment does not limit how the transcription process is implemented.
Step 38, the ASR transcription server transcribes the main speaker's audio to obtain the first text data and returns it to the conference intelligent server.
Step 39, after receiving the first text data returned by the ASR transcription server, the conference intelligent server takes it as the real-time text corresponding to the online conference and forwards it to the conference control platform.
Step 310, after receiving the first text data, the conference control platform forwards it to participating terminals 1, 2, and 3, so that they display it as real-time captions.
When the online conference ends, referring to fig. 4, the method further includes at least the following steps:
Step 311, the conference intelligent server sends the other paths of audio data (the voice segments detected by VAD) to the ASR transcription server.
Step 312, the ASR transcription server transcribes the other audio data to obtain second text data and returns it to the conference intelligent server.
Step 313, the conference intelligent server receives the second text data and combines it with the first text data in time order to obtain a text file.
Optionally, the conference intelligent server may further mix each path of audio data to obtain a mixed audio file.
Through the above steps, the audio processing method provided by this embodiment can achieve the following:
1. In a real-time scenario (live captions accompanying the speech), only one path is transcribed and output to the screen: the path corresponding to the specified main speaker, or to the main speaker determined by the decision process.
The audio data of non-main speakers is not transcribed in real time; its transcription is deferred until the text file (or conference summary) is generated, which saves transcription resources, since one path of real-time transcription requires far more resources than one path of non-real-time transcription.
2. In a non-real-time scenario, the complete conference speech is recorded, and each speaker's speech is stored in a text file.
Fig. 5 is a block diagram of an audio processing apparatus based on an online conference according to an embodiment of the present application. This embodiment is described with the apparatus used in the conference intelligent server 130 of the system described in fig. 1. The apparatus includes at least the following modules: an audio acquisition module 510, an audio determination module 520, and a text acquisition module 530.
The audio acquisition module 510 is configured to acquire audio data of at least two participating terminals during an online conference; the at least two participating terminals access the same online conference, and each participating terminal corresponds to one path of audio data.
The audio determination module 520 is configured to determine one path of target audio data corresponding to the main speaker from the at least two paths of audio data.
The text acquisition module 530 is configured to acquire first text data corresponding to the target audio data.
For relevant details reference is made to the above-described method embodiments.
It should be noted that the above embodiment illustrates the audio processing apparatus based on an online conference using only the division of functional modules described. In practical applications, the functions may be allocated to different functional modules as needed; that is, the internal structure of the apparatus may be divided into different functional modules to complete all or part of the functions described above. In addition, the apparatus and the audio processing method provided by the above embodiments belong to the same concept; for the specific implementation process of the apparatus, see the method embodiments, which are not repeated here.
Fig. 6 is a block diagram of an electronic device provided by an embodiment of the application. The device may be the conference intelligence server 130 of the system described in fig. 1. The device comprises at least a processor 601 and a memory 602.
Processor 601 may include one or more processing cores such as: 4 core processors, 8 core processors, etc. The processor 601 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 601 may also include a main processor and a coprocessor, where the main processor is a processor for Processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 601 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content required to be displayed on the display screen. In some embodiments, processor 601 may also include an AI (Artificial Intelligence) processor for processing computational operations related to machine learning.
The memory 602 may include one or more computer-readable storage media, which may be non-transitory. The memory 602 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 602 is used to store at least one instruction for execution by processor 601 to implement the online conference based audio processing method provided by the method embodiments herein.
In some embodiments, the electronic device may further include: a peripheral interface and at least one peripheral. The processor 601, memory 602 and peripheral interface may be connected by a bus or signal lines. Each peripheral may be connected to the peripheral interface via a bus, signal line, or circuit board. Illustratively, peripheral devices include, but are not limited to: radio frequency circuit, touch display screen, audio circuit, power supply, etc.
Of course, the electronic device may include fewer or more components, which is not limited by the embodiment.
Optionally, the present application further provides a computer-readable storage medium, in which a program is stored, and the program is loaded and executed by a processor to implement the online conference-based audio processing method according to the above method embodiment.
Optionally, the present application further provides a computer product, which includes a computer-readable storage medium, in which a program is stored, and the program is loaded and executed by a processor to implement the online conference-based audio processing method of the above-mentioned method embodiment.
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. An audio processing method based on an online conference, which is used in a conference intelligent server, and comprises the following steps:
acquiring audio data of at least two participating terminals in the process of an online conference; the at least two participating terminals are accessed into the same online conference, and each participating terminal corresponds to one path of audio data;
determining one path of target audio data corresponding to a main speaker from at least two paths of audio data;
and acquiring first text data corresponding to the target audio data.
2. The method of claim 1, wherein the determining one path of target audio data corresponding to the main speaker from the at least two paths of audio data comprises:
determining whether voice data is included in each path of audio data;
when at least two paths of audio data comprise voice data, acquiring voice characteristics of the voice data;
and determining the target audio data from at least two paths of audio data comprising the voice data according to the voice characteristics.
3. The method of claim 2, wherein determining the target audio data from at least two audio data including the speech data according to the speech characteristics comprises:
the voice features comprise voice energy, and one path of audio data with the maximum voice energy is determined as the target audio data;
or,
the voice features comprise voice energy and voice duration, and one path of audio data with the voice energy exceeding a preset threshold and the voice duration being the largest is determined as the target audio data;
or,
and the voice characteristics comprise voice duration, and one path of audio data with the voice duration exceeding a preset duration threshold and the voice duration being the maximum is determined as the target audio data.
4. The method of claim 1, wherein the determining one path of target audio data corresponding to the main speaker from the at least two paths of audio data further comprises:
acquiring a main speaker appointed in the online conference;
and determining one path of audio data corresponding to the appointed main speaker from at least two paths of audio data to obtain the target audio data.
5. The method according to claim 1, wherein the obtaining of the first text data corresponding to the target audio data comprises:
processing the target audio data by using a voice transcription algorithm to obtain first text data;
or,
sending the target audio data to a designated device, so that the designated device can process the target audio data by using a voice transcription algorithm to obtain first text data corresponding to the target audio data; and receiving first text data corresponding to the target audio data sent by the designated equipment.
6. The method of claim 1, wherein the obtaining audio data of the at least two participating terminals comprises:
when a conference control platform starts a sound mixing function, obtaining a sound mixing list sent by the conference control platform, wherein the sound mixing list comprises data identifications of the at least two paths of audio data; the conference control platform is used for providing online conference service for the at least two conference-participating terminals, and the number of the data identifiers in the sound mixing list is determined by the conference control platform according to a preset sound mixing depth;
receiving N paths of audio data sent by the conference control platform, wherein the N paths of audio data refer to audio data corresponding to all conference participating terminals accessed to the online conference, and N is greater than or equal to 2;
and acquiring the at least two paths of audio data indicated by the data identification from the N paths of audio data.
7. The method according to claim 1, wherein after the obtaining the first text data corresponding to the target audio data, further comprising:
after the online conference is finished, acquiring other audio data of the online conference, wherein the other audio data refers to audio data except the target audio data in the at least two paths of audio data;
acquiring second text data corresponding to the other audio data;
and combining the first text data and the second text data according to a time sequence to obtain a text file corresponding to the online conference.
8. An online conference based audio processing device, characterized in that the device comprises a processor and a memory; the memory has stored therein a program that is loaded and executed by the processor to implement the online conference based audio processing method according to any one of claims 1 to 7.
9. A computer-readable storage medium, characterized in that the storage medium has stored therein a program which, when being executed by a processor, is adapted to carry out the method of audio processing based on an online conference according to any one of claims 1 to 7.
10. An audio processing system based on an online conference is characterized by comprising N conference-participating terminals, a conference control platform in communication connection with each conference-participating terminal, and a conference intelligent server connected with the conference control platform; the N participating terminals are accessed into the same online conference, and N is an integer greater than 1;
each of the N participating terminals is used for acquiring audio data in the process of the online conference to obtain N paths of audio data; sending the N paths of audio data to the conference control platform;
the conference control platform is used for receiving the N paths of audio data and forwarding the N paths of audio data to the conference intelligent server;
the intelligent conference server is used for acquiring audio data of at least two participating terminals in the process of an online conference; determining one path of target audio data corresponding to a main speaker from at least two paths of audio data; and acquiring first text data corresponding to the target audio data.
CN202110729223.1A 2021-06-29 2021-06-29 Audio processing method, device, storage medium and system based on online conference Pending CN113450797A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110729223.1A CN113450797A (en) 2021-06-29 2021-06-29 Audio processing method, device, storage medium and system based on online conference

Publications (1)

Publication Number Publication Date
CN113450797A 2021-09-28

Family

ID=77814121

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110729223.1A Pending CN113450797A (en) 2021-06-29 2021-06-29 Audio processing method, device, storage medium and system based on online conference

Country Status (1)

Country Link
CN (1) CN113450797A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115662437A (en) * 2022-12-28 2023-01-31 广州市保伦电子有限公司 Voice transcription method under scene of simultaneous use of multiple microphones

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110310657A (en) * 2019-07-10 2019-10-08 北京猎户星空科技有限公司 A kind of audio data processing method and device
CN110992930A (en) * 2019-12-06 2020-04-10 广州国音智能科技有限公司 Voiceprint feature extraction method and device, terminal and readable storage medium
CN111049792A (en) * 2019-10-08 2020-04-21 广州视源电子科技股份有限公司 Audio transmission method and device, terminal equipment and storage medium
CN111883123A (en) * 2020-07-23 2020-11-03 平安科技(深圳)有限公司 AI identification-based conference summary generation method, device, equipment and medium
CN112053691A (en) * 2020-09-21 2020-12-08 广东迷听科技有限公司 Conference assisting method and device, electronic equipment and storage medium
CN112203039A (en) * 2020-10-12 2021-01-08 北京字节跳动网络技术有限公司 Processing method and device for online conference, electronic equipment and computer storage medium


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115662437A (en) * 2022-12-28 2023-01-31 广州市保伦电子有限公司 Voice transcription method under scene of simultaneous use of multiple microphones
CN115662437B (en) * 2022-12-28 2023-04-18 广东保伦电子股份有限公司 Voice transcription method under scene of simultaneous use of multiple microphones

Similar Documents

Publication Publication Date Title
US8971511B2 (en) Method and apparatus for enhancing speaker selection
US9311920B2 (en) Voice processing method, apparatus, and system
CN106301811A (en) Realize the method and device of multimedia conferencing
WO2019071808A1 (en) Video image display method, apparatus and system, terminal device, and storage medium
US20180048683A1 (en) Private communications in virtual meetings
CN101631225A (en) Conference voting method, conference voting device and conference voting system
CN111314780B (en) Method and device for testing echo cancellation function and storage medium
CN113450797A (en) Audio processing method, device, storage medium and system based on online conference
CN112862461A (en) Conference process control method, device, server and storage medium
CN112702468A (en) Call control method and device
CN112216306A (en) Voiceprint-based call management method and device, electronic equipment and storage medium
CN106664432A (en) Multimedia information play methods and systems, acquisition equipment, standardized server
CN103905483A (en) Audio and video sharing method, equipment and system
CN111951821B (en) Communication method and device
US20200184973A1 (en) Transcription of communications
CN113992882A (en) Packet processing method and device for multi-person conversation, electronic device and storage medium
CN108766448B (en) Mixing testing system, method, device and storage medium
US10867609B2 (en) Transcription generation technique selection
CN111405122B (en) Audio call testing method, device and storage medium
US11477326B2 (en) Audio processing method, device, and apparatus for multi-party call
WO2024032111A1 (en) Data processing method and apparatus for online conference, and device, medium and product
US20230421620A1 (en) Method and system for handling a teleconference
US11632404B2 (en) Data stream prioritization for communication session
CN108924465A (en) Determination method, apparatus, equipment and the storage medium of video conference spokesman's terminal
CN113098931B (en) Information sharing method and multimedia session terminal

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination