WO2020232865A1 - Conference role-based speech synthesis method, apparatus, computer device and storage medium - Google Patents

Conference role-based speech synthesis method, apparatus, computer device and storage medium

Info

Publication number
WO2020232865A1
WO2020232865A1 (PCT/CN2019/102448)
Authority
WO
WIPO (PCT)
Prior art keywords
audio
information
speech
voice
voice stream
Prior art date
Application number
PCT/CN2019/102448
Other languages
English (en)
French (fr)
Inventor
岳鹏昱
闫冬
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2020232865A1

Links

Images

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 - Speech synthesis; Text to speech systems
    • G10L 13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/78 - Detection of presence or absence of voice signals
    • G10L 25/87 - Detection of discrete points within a voice signal
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 30/00 - Reducing energy consumption in communication networks
    • Y02D 30/70 - Reducing energy consumption in communication networks in wireless communication networks

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to a conference role-based speech synthesis method, apparatus, computer device, and storage medium.
  • As a cost-effective conferencing solution, multimedia conferences are being adopted by more and more enterprises, greatly improving the efficiency of communication and collaboration. Since a meeting is a means of multi-person communication, meeting records are often necessary, and for multimedia conferences the recording itself is one form of meeting record. For example, a user may have to leave a meeting temporarily for other matters without wanting to miss the important speeches of certain participants, or may want to record the speeches of certain participants; in either case the meeting recording must be started.
  • However, current conference recording generally covers the entire conference: once recording is started, every speech in the conference is recorded, and it is impossible to record only specified participants.
  • A conference role-based speech synthesis method includes:
  • obtaining participant information entered by a user and its association with microphones, each participant being associated with one microphone;
  • receiving a start-recording signal, turning on multiple microphones, receiving multiple voice streams through the microphones, performing breakpoint detection on each voice stream, intercepting multiple valid voice streams, and saving the valid voice streams together with their corresponding audio start times, audio lengths, and associated participant information, until an end-recording signal is received, whereupon the microphones are turned off;
  • in order of audio start time, beginning with the earliest, synthesizing the valid voice streams into a piece of audio information; in the same order, merging the audio start times, the audio lengths, and the corresponding participant information into a piece of role information; and, after mapping each valid voice stream in the audio information to the corresponding audio start time in the role information, defining the audio information and the role information together as conference audio and saving it.
  • A conference role-based speech synthesis apparatus includes:
  • an information obtaining module, configured to obtain participant information entered by a user and its association with microphones, each participant being associated with one microphone;
  • a voice-stream receiving and saving module, configured to receive a start-recording signal, turn on multiple microphones, receive multiple voice streams through the microphones, perform breakpoint detection on each voice stream, intercept multiple valid voice streams, and save the valid voice streams together with their corresponding audio start times, audio lengths, and associated participant information, until an end-recording signal is received, whereupon the microphones are turned off;
  • a conference audio generating module, configured to synthesize, in order of audio start time and beginning with the earliest, the valid voice streams into a piece of audio information; merge, in the same order, the audio start times, the audio lengths, and the corresponding participant information into a piece of role information; and, after mapping each valid voice stream in the audio information to the corresponding audio start time in the role information, define the audio information and the role information together as conference audio and save it.
  • A computer device includes a memory and a processor, the memory storing computer-readable instructions which, when executed by the processor, cause the processor to perform the following steps:
  • obtaining participant information entered by a user and its association with microphones, each participant being associated with one microphone;
  • receiving a start-recording signal, turning on multiple microphones, receiving multiple voice streams through the microphones, performing breakpoint detection on each voice stream, intercepting multiple valid voice streams, and saving the valid voice streams together with their corresponding audio start times, audio lengths, and associated participant information, until an end-recording signal is received, whereupon the microphones are turned off;
  • in order of audio start time, beginning with the earliest, synthesizing the valid voice streams into a piece of audio information; in the same order, merging the audio start times, the audio lengths, and the corresponding participant information into a piece of role information; and, after mapping each valid voice stream in the audio information to the corresponding audio start time in the role information, defining the audio information and the role information together as conference audio and saving it.
  • A storage medium stores computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:
  • obtaining participant information entered by a user and its association with microphones, each participant being associated with one microphone;
  • receiving a start-recording signal, turning on multiple microphones, receiving multiple voice streams through the microphones, performing breakpoint detection on each voice stream, intercepting multiple valid voice streams, and saving the valid voice streams together with their corresponding audio start times, audio lengths, and associated participant information, until an end-recording signal is received, whereupon the microphones are turned off;
  • in order of audio start time, beginning with the earliest, synthesizing the valid voice streams into a piece of audio information; in the same order, merging the audio start times, the audio lengths, and the corresponding participant information into a piece of role information; and, after mapping each valid voice stream in the audio information to the corresponding audio start time in the role information, defining the audio information and the role information together as conference audio and saving it.
  • The above conference role-based speech synthesis method, apparatus, computer device, and storage medium obtain participant information entered by a user and its association with microphones, each participant being associated with one microphone; receive a start-recording signal, turn on multiple microphones, receive multiple voice streams through the microphones, perform breakpoint detection on each voice stream, intercept multiple valid voice streams, and save the valid voice streams together with their corresponding audio start times, audio lengths, and associated participant information, until an end-recording signal is received, whereupon the microphones are turned off; in order of audio start time, beginning with the earliest, they synthesize the valid voice streams into a piece of audio information, merge the audio start times, audio lengths, and corresponding participant information into a piece of role information, and define the audio information and the role information together as conference audio for storage.
  • Because participant information is associated with the conference-room microphones and the audio is intercepted in segments through silence detection, each audio segment synthesized into the conference audio in chronological order has known role information, so the speech content of all speakers during the meeting can easily be determined.
  • FIG. 1 is a flowchart of the conference role-based speech synthesis method in an embodiment of this application;
  • FIG. 2 is a flowchart of step S2 in an embodiment of this application;
  • FIG. 3 is a structural diagram of the conference role-based speech synthesis apparatus in an embodiment of this application.
  • FIG. 1 is a flowchart of the conference role-based speech synthesis method according to an embodiment of this application. As shown in FIG. 1, the method includes the following steps:
  • Step S1, obtaining information: obtain the participant information entered by the user and its association with the microphones, each participant being associated with one microphone.
  • In this step, the participant information entered by the user and the associations between all participants and the microphones can be received through a preset management interface in the conference system.
  • The management interface presents a schematic diagram of the conference-room seating, on which the location of each microphone in the room is marked; by clicking on a microphone, the user opens an input interface and enters the corresponding participant information, completing the participant-microphone association at the system level.
  • The participant information can be the participant's name, employee number, or other unique identifier within the company, used to distinguish the participants.
  • The multiple microphones in this step are connected to the conference system through Raspberry Pi-based audio-capture devices; the MAC address of each capture device serves as its unique identifier, and the microphone name is matched with the corresponding MAC address, completing the physical association between participants and microphones.
  • Step S2, receiving and saving voice streams: receive the start-recording signal, turn on the microphones, receive multiple voice streams through the microphones, perform breakpoint detection on each voice stream, intercept multiple valid voice streams, and save the valid voice streams together with their corresponding audio start times, audio lengths, and associated participant information, until the end-recording signal is received, whereupon the microphones are turned off.
  • In this step, after the voice streams sent by the microphones are received independently, multiple independent threads can be started so that breakpoint detection and interception of valid voice streams run concurrently on each voice stream.
  • When a valid voice stream is saved, its corresponding participant information is saved with it, so that it can be determined which participant is speaking in which valid voice stream.
  • In one embodiment, step S2 includes:
  • Step S201, starting recording: receive the start-recording signal, enable the recording function of the associated microphones, and receive the voice stream transmitted by each microphone.
  • In this step, the start-recording signal can be received through the management interface of the conference system; the recording function of the associated microphones is then enabled automatically, and the voice streams transmitted by the microphones are received separately.
  • Step S202, breakpoint detection and interception of valid voice streams: perform breakpoint detection on each voice stream separately; if a breakpoint exists, intercept a valid voice stream, save the intercepted valid voice stream together with its corresponding audio start time, audio length, and associated participant information to a storage medium, and continue breakpoint detection on the current voice stream.
  • Breakpoint detection is used to detect valid voice streams, segment by segment, within a continuous voice stream. It includes detecting the starting point of a valid voice stream, the pre-breakpoint, and its end point, the post-breakpoint. Separating the valid voice streams from the continuous stream reduces the amount of stored data, and breakpoint detection can also simplify human-computer interaction: if necessary, the end-recording signal of step S203 can be dispensed with, and the end of recording can be determined directly through real-time breakpoint detection on the received voice stream.
  • Step S20201, segmenting the voice stream: divide the voice stream into segments of fixed duration, define each segment as one frame of speech, and collect the same number N of sampling points for each frame.
  • The fixed duration in this step can be 20 ms, 30 ms, etc.; the voice stream is divided into a number of frames at this fixed duration.
  • Because even the same participant may utter the same word at different volumes, the voice stream can also be normalized before segmentation: take the point with the largest amplitude in each voice stream, scale its amplitude up to close to 1, record the scaling ratio, and stretch all other points by the same ratio.
  • Step S20202, calculating the energy value: calculate the energy value of each frame of speech with the formula
  • $E = \sum_{k=1}^{N} f_k^2$
  • where E is the energy value of a frame of speech, f_k is the peak (sample) value of the k-th sampling point, and N is the total number of sampling points in a frame of speech.
  • The energy value of a frame of speech is related both to the magnitude of its sample values and to the number of sampling points it contains. The sample values, i.e. the peak values above, generally include both positive and negative values, and the sign is irrelevant when computing energy, so this step defines the energy of a frame as the sum of squares of its sample values.
  • Step S20203, determining the pre- and post-breakpoints: if the energy values of M consecutive frames are above a preset threshold, define the first of those frames whose energy exceeds the threshold as the pre-breakpoint of an audio segment; if the energy value falls below the preset threshold from frame M+1 onward and stays below it for a preset duration, define frame M+1 as the post-breakpoint of the segment, and intercept the audio between the pre- and post-breakpoints as one valid voice stream.
  • If the energy values of the first few frames of a voice stream are below the preset threshold while the energy values of M consecutive frames are all above it, the first frame whose energy just exceeds the threshold is defined as the pre-breakpoint; if the energy values of M consecutive frames are all high and the energy of a subsequent frame becomes small and stays small for a preset duration, the place where the energy decreases is taken as the post-breakpoint. The audio between the pre-breakpoint and the post-breakpoint is intercepted and saved as a valid voice stream.
  • Ideally, the energy of silence is 0, so the preset threshold would be 0 in the ideal case; but collected voice streams usually contain background sound of some intensity, which also counts as silence yet clearly has energy above 0, so the preset threshold is usually set to a non-zero value.
  • The preset threshold in this step can be a dynamic threshold: when performing breakpoint detection on a voice stream, first collect the average energy over an initial duration of the stream, for example the average energy E0 of the first 100 ms to 1000 ms, or the average energy E0 of the first 100 frames, then add a coefficient to E0 or multiply it by a coefficient greater than 1 to obtain the preset threshold.
  • In this embodiment, a single voice stream is divided into multiple frames, the energy of each frame is computed, and the presence of breakpoints is judged from the energy; the single voice stream is thereby cut into multiple valid voice streams, the silent parts are discarded, and only the intercepted valid voice streams are saved, reducing storage pressure.
  • Step S203, ending recording: receive the end-recording signal and disable the recording function of the associated microphones.
  • Step S204, saving the valid voice stream: after the end-recording signal is received, if no breakpoint yet exists, intercept the voice stream from the start of breakpoint detection to the end of the audio signal as a valid voice stream, and save it together with its corresponding audio start time, audio length, and associated participant information to the storage medium.
  • Each voice stream goes through step S202 for real-time breakpoint detection and interception of valid voice streams; even after the end-recording signal is received, the detection of step S202 continues until the audio signal ends.
  • During this process, if pre- and post-breakpoints exist, the valid voice stream is intercepted as in step S202; if no breakpoint exists, the audio from the start of breakpoint detection to the end of the audio signal is considered one valid voice stream and is intercepted and saved.
  • In this embodiment, breakpoint detection and interception of valid voice streams are performed on every voice stream transmitted by the microphones until the end-recording signal is received and reception stops; every valid voice stream is saved together with its corresponding audio start time, audio length, and associated participant information, providing accurate data for the subsequent role-distinguished audio information.
  • Step S3, generating conference audio: in order of audio start time, beginning with the earliest, synthesize the valid voice streams into a piece of audio information; in the same order, merge the audio start times, audio lengths, and corresponding participant information into a piece of role information; then, after mapping each valid voice stream in the audio information to the corresponding audio start time in the role information, define the audio information and the role information together as conference audio and save it.
  • When saving the conference audio, the audio name entered by the user can be obtained through the management interface: after the user triggers the end-recording signal, an input interface is displayed through which the user enters the audio name. If no input is received within a set time, for example 5 minutes, the conference audio is saved under a default name.
  • In one embodiment, the method further includes step S4, audio display:
  • Step S401, receiving a request and displaying: receive an audio playback request sent by the user and display the file names of the conference audio.
  • The user can issue an audio playback request through an API connected to the conference system, or send one to the conference system through an HTTP request.
  • After receiving the audio playback request, the conference system displays all stored conference audios, sorted by file name, for example in order of storage time or in descending alphabetical order of the first letter.
  • Step S402, playing audio information and synchronously displaying role information: when the user selects any file name, the audio information corresponding to that file name is played and the corresponding role information is displayed.
  • Because every valid voice stream in the audio information is mapped to its corresponding role information, the corresponding role information can be displayed synchronously while the selected audio information is played, telling the user who is speaking in the conference audio.
  • This embodiment provides the user with an audio playback channel; since role information is displayed synchronously during playback, the user no longer needs to organize the meeting content and can see at a glance which speaker corresponds to the recorded content.
  • In one embodiment, after step S2 the method further includes:
  • converting each valid voice stream in the audio information into translated text through preset speech recognition software; when the audio start times, audio lengths, and corresponding participant information are merged into a piece of role information in order of audio start time, merging the translated text into the role information as well, and also mapping the voice streams in the audio information to the translated texts; and, when the user selects any file name, playing the corresponding audio information and, while displaying the corresponding role information, displaying the translated text synchronously as well.
  • After step S2 has intercepted multiple valid voice streams from each voice stream, this embodiment also converts each valid voice stream into translated text through the preset speech recognition software.
  • The speech recognition software decodes the valid voice stream with an acoustic model and runs a search over the decoded speech with a language model to obtain the translated text. The acoustic model can be a neural network model, the language model can be an N-gram statistical model, and the search can use the Viterbi algorithm.
  • When merging into a piece of role information in step S3, the audio start time, audio length, corresponding participant information, and translated text are merged together into the role information.
  • The audio display of step S4 then also includes displaying the translated text. Since the valid voice streams and the translated texts are mapped to each other, clicking on a translated text can also jump playback to the corresponding valid voice stream while displaying the translated text and role information synchronously.
  • This embodiment supplies the translated text corresponding to each valid voice stream and displays it during audio display, helping the user grasp the specific meeting content at a glance.
  • In one embodiment, the conference audio can also be searched:
  • receive a search request sent by the user, obtain the keyword, and search the saved conference audios for the keyword; if it exists, display the file names of the conference audios corresponding to the keyword; when the user selects any file name, play the corresponding audio information and display the corresponding role information and translated text.
  • In this embodiment, the search request and keyword can be received through the management interface of the conference system; the user can also issue a search request through an API connected to the conference system, or send one to the conference system through an HTTP request.
  • The keyword may be an audio name, an audio start time, participant information, or a general term; the saved conference audios are searched for the keyword and, if it is found, the file names of all conference audios whose audio information or role information contains it are displayed.
  • For example, suppose the keyword is "blockchain", a general term, and the search finds it mentioned in a translated text of participant Zhang San in the role information of one conference audio and in a translated text of participant Li Si in another; then the file names of both conference audios are displayed together.
  • This embodiment provides the user with a search channel and more extended functionality.
  • The conference role-based speech synthesis method of this embodiment associates role relationships with the conference-room microphones and intercepts valid voice streams in segments through breakpoint detection; after the conference ends, the valid voice streams are synthesized into conference audio in chronological order, and for every valid voice stream the corresponding participant information and translated text are known, giving the user an intuitive view of the meeting content.
  • In one embodiment, a conference role-based speech synthesis apparatus is proposed, as shown in FIG. 3, including the following modules:
  • an information obtaining module, configured to obtain participant information entered by a user and its association with microphones, each participant being associated with one microphone;
  • a voice-stream receiving and saving module, configured to receive the start-recording signal, turn on multiple microphones, receive multiple voice streams through the microphones, perform breakpoint detection on each voice stream, intercept multiple valid voice streams, and save the valid voice streams together with their corresponding audio start times, audio lengths, and associated participant information, until the end-recording signal is received, whereupon the microphones are turned off;
  • a conference audio generating module, configured to synthesize, in order of audio start time and beginning with the earliest, the valid voice streams into a piece of audio information; merge, in the same order, the audio start times, audio lengths, and corresponding participant information into a piece of role information; and define the audio information and the role information together as conference audio for storage.
  • In one embodiment, the voice-stream receiving and saving module includes: a recording unit, configured to receive the start-recording signal, enable the recording function of the associated microphones, and receive the voice stream transmitted by each microphone; a breakpoint detection unit, configured to perform breakpoint detection on each voice stream separately and, if a breakpoint exists, intercept a valid voice stream, save the intercepted valid voice stream together with its corresponding audio start time, audio length, and associated participant information to the storage medium, and continue breakpoint detection on the current voice stream; a recording disabling unit, configured to receive the end-recording signal and disable the recording function of the associated microphones; and a saving unit, configured to, after the end-recording signal is received and if no breakpoint yet exists, intercept the voice stream from the start of breakpoint detection to the end of the audio signal as a valid voice stream and save it together with its corresponding audio start time, audio length, and associated participant information to the storage medium.
  • In one embodiment, the breakpoint detection unit is further configured to divide the voice stream into segments of fixed duration, define each segment as one frame of speech, and collect the same number N of sampling points for each frame; to calculate the energy value of each frame as $E = \sum_{k=1}^{N} f_k^2$, where E is the energy value of a frame of speech, f_k is the peak (sample) value of the k-th sampling point, and N is the total number of sampling points in a frame of speech; and, if the energy values of M consecutive frames are above a preset threshold, to define the first of those frames whose energy exceeds the threshold as the pre-breakpoint of an audio segment and, if the energy value falls below the preset threshold from frame M+1 onward for a preset duration, to define frame M+1 as the post-breakpoint and intercept the audio between the two breakpoints as a valid voice stream.
  • In one embodiment, the conference audio generating module is further configured to obtain the audio name entered by the user and save the conference audio file under that name; if no audio name is obtained within the set time, the conference audio file is renamed to the earliest audio start time and saved.
  • In one embodiment, the apparatus further includes: a display module, configured to receive audio playback requests sent by users and display the file names of the conference audio; and a playback module, configured to play the audio information corresponding to a file name when the user selects it, and display the corresponding role information.
  • In one embodiment, the apparatus further includes: a conversion module, configured to convert each valid voice stream in the audio information into translated text through preset speech recognition software; a merging module, configured to merge the translated text into the role information when the audio start times, audio lengths, and corresponding participant information are merged into a piece of role information in order of audio start time, and also to map the voice streams in the audio information to the translated texts; and a synchronous display module, configured to play the audio information corresponding to a file name when the user selects it and to display the translated text synchronously while displaying the corresponding role information.
  • In one embodiment, the apparatus further includes: a retrieval module, configured to receive a search request sent by a user, obtain the keyword, search the saved conference audios for the keyword and, if it exists, display the file names of the conference audios corresponding to the keyword; when the user selects any file name, the corresponding audio information is played and the corresponding role information and translated text are displayed.
  • In one embodiment, a computer device is proposed, including a memory and a processor, the memory storing computer-readable instructions which, when executed by the processor, cause the processor to implement the steps of the conference role-based speech synthesis method of the above embodiments.
  • In one embodiment, a storage medium storing computer-readable instructions is proposed; when the computer-readable instructions are executed by one or more processors, the processors perform the conference role-based speech synthesis method of the foregoing embodiments. The storage medium may be a non-volatile storage medium.
  • A person of ordinary skill in the art can understand that all or part of the steps in the methods of the above embodiments can be completed by instructing the relevant hardware through a program; the program can be stored in a computer-readable storage medium, which may include read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disk, and the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

This application relates to the field of artificial intelligence technology, and in particular to a conference role-based speech synthesis method, apparatus, computer device, and storage medium. The method includes: obtaining participant information entered by a user and its association with microphones; receiving multiple voice streams through multiple microphones, performing breakpoint detection on each voice stream, intercepting multiple valid voice streams, and saving the valid voice streams together with their audio start times, audio lengths, and associated participant information; synthesizing the valid voice streams into a piece of audio information, merging the audio start times, audio lengths, and corresponding participant information into a piece of role information, and defining the audio information and the role information together as conference audio for storage. By associating participant information with the conference-room microphones, each audio segment corresponds to participant information, so the speech content of every speaker during the meeting can easily be determined.

Description

Conference role-based speech synthesis method, apparatus, computer device and storage medium
This application claims priority to the Chinese patent application filed with the China Patent Office on May 21, 2019, with application number 201910424720.3 and invention title "Conference role-based speech synthesis method, apparatus, computer device and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of artificial intelligence technology, and in particular to a conference role-based speech synthesis method, apparatus, computer device, and storage medium.
Background
As a cost-effective conferencing solution, multimedia conferences are being adopted by more and more enterprises, greatly improving the efficiency of communication and collaboration. Since a meeting is a means of multi-person communication, meeting records are often necessary, and for multimedia conferences the recording itself is one form of meeting record. For example, when a user has to leave a meeting temporarily for other matters but does not want to miss the important speeches of certain participants, or wants to record the speeches of certain participants, the meeting recording must be started. However, current conference recording generally covers the entire conference: once recording is started, the speeches of everyone in the conference are recorded; it is impossible to record only specified participants or to distinguish participant roles. When a user wants to keep the speeches of the main speakers, the only option is to sift the recording of everyone's speeches for the needed content, which forces the user to spend excessive time organizing the recording afterwards and causes inconvenience.
Summary
In view of this, it is necessary to provide a conference role-based speech synthesis method, apparatus, computer device, and storage medium to address the problem that conference recordings cannot be saved with roles distinguished.
A conference role-based speech synthesis method includes:
obtaining participant information entered by a user and its association with microphones, each participant being associated with one microphone;
receiving a start-recording signal, turning on multiple microphones, receiving multiple voice streams through the microphones, performing breakpoint detection on each voice stream, intercepting multiple valid voice streams, and saving the valid voice streams together with their corresponding audio start times, audio lengths, and associated participant information, until an end-recording signal is received, whereupon the microphones are turned off;
in order of audio start time, beginning with the earliest, synthesizing the valid voice streams into a piece of audio information; in the same order, merging the audio start times, the audio lengths, and the corresponding participant information into a piece of role information; and, after mapping each valid voice stream in the audio information to the corresponding audio start time in the role information, defining the audio information and the role information together as conference audio and saving it.
A conference role-based speech synthesis apparatus includes:
an information obtaining module, configured to obtain participant information entered by a user and its association with microphones, each participant being associated with one microphone;
a voice-stream receiving and saving module, configured to receive a start-recording signal, turn on multiple microphones, receive multiple voice streams through the microphones, perform breakpoint detection on each voice stream, intercept multiple valid voice streams, and save the valid voice streams together with their corresponding audio start times, audio lengths, and associated participant information, until an end-recording signal is received, whereupon the microphones are turned off;
a conference audio generating module, configured to synthesize, in order of audio start time and beginning with the earliest, the valid voice streams into a piece of audio information; merge, in the same order, the audio start times, the audio lengths, and the corresponding participant information into a piece of role information; and, after mapping each valid voice stream in the audio information to the corresponding audio start time in the role information, define the audio information and the role information together as conference audio and save it.
A computer device includes a memory and a processor, the memory storing computer-readable instructions which, when executed by the processor, cause the processor to perform the following steps:
obtaining participant information entered by a user and its association with microphones, each participant being associated with one microphone;
receiving a start-recording signal, turning on multiple microphones, receiving multiple voice streams through the microphones, performing breakpoint detection on each voice stream, intercepting multiple valid voice streams, and saving the valid voice streams together with their corresponding audio start times, audio lengths, and associated participant information, until an end-recording signal is received, whereupon the microphones are turned off;
in order of audio start time, beginning with the earliest, synthesizing the valid voice streams into a piece of audio information; in the same order, merging the audio start times, the audio lengths, and the corresponding participant information into a piece of role information; and, after mapping each valid voice stream in the audio information to the corresponding audio start time in the role information, defining the audio information and the role information together as conference audio and saving it.
A storage medium stores computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:
obtaining participant information entered by a user and its association with microphones, each participant being associated with one microphone;
receiving a start-recording signal, turning on multiple microphones, receiving multiple voice streams through the microphones, performing breakpoint detection on each voice stream, intercepting multiple valid voice streams, and saving the valid voice streams together with their corresponding audio start times, audio lengths, and associated participant information, until an end-recording signal is received, whereupon the microphones are turned off;
in order of audio start time, beginning with the earliest, synthesizing the valid voice streams into a piece of audio information; in the same order, merging the audio start times, the audio lengths, and the corresponding participant information into a piece of role information; and, after mapping each valid voice stream in the audio information to the corresponding audio start time in the role information, defining the audio information and the role information together as conference audio and saving it.
The above conference role-based speech synthesis method, apparatus, computer device, and storage medium obtain participant information entered by a user and its association with microphones, each participant being associated with one microphone; receive a start-recording signal, turn on multiple microphones, receive multiple voice streams through the microphones, perform breakpoint detection on each voice stream, intercept multiple valid voice streams, and save the valid voice streams together with their corresponding audio start times, audio lengths, and associated participant information, until an end-recording signal is received, whereupon the microphones are turned off; in order of audio start time, beginning with the earliest, they synthesize the valid voice streams into a piece of audio information, merge the audio start times, audio lengths, and corresponding participant information into a piece of role information, map each valid voice stream in the audio information to the corresponding audio start time in the role information, and define the audio information and the role information together as conference audio for storage. In this application, participant information is associated with the conference-room microphones and the audio is intercepted in segments through silence detection; after the conference ends, the audio segments are synthesized into conference audio in chronological order, the corresponding role information is known for every segment, and the speech content of all speakers during the meeting can easily be determined.
Brief Description of the Drawings
By reading the following detailed description of the preferred embodiments, various other advantages and benefits will become clear to a person of ordinary skill in the art. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered a limitation of this application.
FIG. 1 is a flowchart of the conference role-based speech synthesis method in an embodiment of this application;
FIG. 2 is a flowchart of step S2 in an embodiment of this application;
FIG. 3 is a structural diagram of the conference role-based speech synthesis apparatus in an embodiment of this application.
Detailed Description
In order to make the objectives, technical solutions, and advantages of this application clearer, this application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain this application and are not intended to limit it.
Those skilled in the art will understand that, unless specifically stated, the singular forms "a", "an", "the", and "said" used here may also include plural forms. It should be further understood that the word "comprising" used in the specification of this application refers to the presence of the stated features, integers, steps, operations, elements, and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
FIG. 1 is a flowchart of the conference role-based speech synthesis method according to an embodiment of this application. As shown in FIG. 1, the method includes the following steps:
Step S1, obtaining information: obtain the participant information entered by the user and its association with the microphones, each participant being associated with one microphone.
In this step, the participant information entered by the user and the associations between all participants and the microphones can be received through a preset management interface in the conference system. The management interface presents a schematic diagram of the conference-room seating, on which the location of each microphone in the room is marked. By clicking on a microphone, the user opens an input interface and enters the corresponding participant information, completing the participant-microphone association at the system level. The participant information can be the participant's name, employee number, or other unique identifier within the company, used to distinguish the participants.
The multiple microphones in this step are connected to the conference system through Raspberry Pi-based audio-capture devices; the MAC address of each capture device serves as its unique identifier, and the microphone name is matched with the corresponding MAC address, completing the physical association between participants and microphones.
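As a concrete illustration of this association step, the following Python sketch keeps a registry keyed by MAC address; the class and field names (`MicRegistry`, `Participant`) are illustrative assumptions, not part of the application:

```python
# Hypothetical sketch of the participant-microphone association: each
# Raspberry Pi capture device is identified by its MAC address, and a
# microphone name plus participant record is bound to that address.
from dataclasses import dataclass

@dataclass
class Participant:
    name: str         # participant's name
    employee_id: str  # or any other unique identifier within the company

class MicRegistry:
    def __init__(self):
        self._by_mac = {}  # MAC address -> (microphone name, Participant)

    def bind(self, mac: str, mic_name: str, participant: Participant) -> None:
        """Associate one participant with one microphone (a 1:1 relation)."""
        self._by_mac[mac.lower()] = (mic_name, participant)

    def lookup(self, mac: str):
        """Resolve the speaker for a voice stream arriving from this device."""
        return self._by_mac.get(mac.lower())

registry = MicRegistry()
registry.bind("b8:27:eb:12:34:56", "mic-01", Participant("张三", "E1001"))
print(registry.lookup("B8:27:EB:12:34:56"))  # matching is case-insensitive
```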
Step S2, receiving and saving voice streams: receive the start-recording signal, turn on the microphones, receive multiple voice streams through the microphones, perform breakpoint detection on each voice stream, intercept multiple valid voice streams, and save the valid voice streams together with their corresponding audio start times, audio lengths, and associated participant information, until the end-recording signal is received, whereupon the microphones are turned off.
In this step, after the voice streams sent by the microphones are received independently, multiple independent threads can be started so that breakpoint detection and interception of valid voice streams run concurrently on each voice stream. When a valid voice stream is saved, its corresponding participant information is saved with it, so that it can be determined which participant is speaking in which valid voice stream.
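A minimal sketch of this per-stream concurrency, with one worker thread per microphone and a placeholder for the breakpoint detection developed in steps S20201 to S20203 below (all names are assumptions for illustration):

```python
# Minimal sketch: one worker thread per microphone stream; each worker runs
# breakpoint detection independently and queues the valid segments it finds.
import queue
import threading

def detect_breakpoints(stream):
    """Placeholder for steps S20201-S20203 (see the sketches below); a real
    implementation would yield only the valid (non-silent) segments."""
    yield from stream

def process_stream(mac, stream, out: queue.Queue):
    for segment in detect_breakpoints(stream):
        out.put((mac, segment))  # saved later together with participant info

segments: queue.Queue = queue.Queue()
streams = {"b8:27:eb:12:34:56": [b"chunk-1", b"chunk-2"]}  # dummy input
workers = [threading.Thread(target=process_stream, args=(m, s, segments))
           for m, s in streams.items()]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(segments.qsize())  # -> 2
```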
In one embodiment, step S2, as shown in FIG. 2, includes:
Step S201, starting recording: receive the start-recording signal, enable the recording function of the associated microphones, and receive the voice stream transmitted by each microphone.
In this step, the start-recording signal can be received through the management interface of the conference system; the recording function of the associated microphones is then enabled automatically, and the voice streams transmitted by the microphones are received separately.
Step S202, breakpoint detection and interception of valid voice streams: perform breakpoint detection on each voice stream separately; if a breakpoint exists, intercept a valid voice stream, save the intercepted valid voice stream together with its corresponding audio start time, audio length, and associated participant information to a storage medium, and continue breakpoint detection on the current voice stream.
Breakpoint detection is used to detect valid voice streams, segment by segment, within a continuous voice stream. It includes detecting the starting point of a valid voice stream, the pre-breakpoint, and its end point, the post-breakpoint. Separating the valid voice streams from the continuous stream reduces the amount of stored data, and breakpoint detection can also simplify human-computer interaction: if necessary, the end-recording signal of step S203 can be dispensed with, and the end of recording can be determined directly through real-time breakpoint detection on the received voice stream.
In one embodiment, breakpoint detection on a voice stream proceeds as follows:
Step S20201, segmenting the voice stream: divide the voice stream into segments of fixed duration, define each segment as one frame of speech, and collect the same number N of sampling points for each frame.
The fixed duration in this step can be 20 ms, 30 ms, etc.; the voice stream is divided into a number of frames at this fixed duration. Because even the same participant may utter the same word at different volumes, the voice stream can also be normalized before segmentation: take the point with the largest amplitude in each voice stream, scale its amplitude up to close to 1, record the scaling ratio, and stretch all other points by the same ratio.
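A sketch of the normalization and fixed-duration framing just described, using NumPy as an implementation convenience (the application does not prescribe a library):

```python
# Normalize so the largest-amplitude point is close to 1, then split the
# stream into fixed-duration frames with the same number N of samples each.
import numpy as np

def normalize(samples: np.ndarray) -> np.ndarray:
    """Scale the largest-amplitude point up to 1 and stretch all other
    points by the same ratio."""
    peak = np.max(np.abs(samples))
    return samples / peak if peak > 0 else samples

def frame(samples: np.ndarray, sample_rate: int, frame_ms: int = 20) -> np.ndarray:
    """Cut the stream into frames of fixed duration (e.g. 20 ms); each frame
    holds N = sample_rate * frame_ms / 1000 sampling points."""
    n = sample_rate * frame_ms // 1000
    usable = len(samples) - len(samples) % n  # drop the trailing partial frame
    return samples[:usable].reshape(-1, n)

signal = normalize(np.random.randn(16000))   # 1 s of dummy audio at 16 kHz
frames = frame(signal, sample_rate=16000)    # shape (50, 320) for 20 ms frames
print(frames.shape)
```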
Step S20202, calculating the energy value: calculate the energy value of each frame of speech with the following formula:
$E = \sum_{k=1}^{N} f_k^2$
where E is the energy value of a frame of speech, f_k is the peak (sample) value of the k-th sampling point, and N is the total number of sampling points in a frame of speech.
The energy value of a frame of speech is related both to the magnitude of its sample values and to the number of sampling points it contains. The sample values, i.e. the peak values above, generally include both positive and negative values, and the sign is irrelevant when computing energy, so this step defines the energy of a frame as the sum of squares of its sample values.
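In code, the formula above is a one-line reduction over each frame (continuing the earlier sketch):

```python
# E = sum_{k=1}^{N} f_k^2 per frame; squaring makes the sign of the
# sample values irrelevant, exactly as the text argues.
import numpy as np

def frame_energy(frames: np.ndarray) -> np.ndarray:
    """Energy of every frame in a (num_frames, N) array."""
    return np.sum(frames.astype(np.float64) ** 2, axis=1)
```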
Step S20203, determining the pre- and post-breakpoints: if the energy values of M consecutive frames are above a preset threshold, define the first of those frames whose energy exceeds the threshold as the pre-breakpoint of an audio segment; if the energy value falls below the preset threshold from frame M+1 onward and stays below it for a preset duration, define frame M+1 as the post-breakpoint of the segment, and intercept the audio between the pre- and post-breakpoints as one valid voice stream.
If the energy values of the first few frames of a voice stream are below the preset threshold while the energy values of M consecutive frames are all above it, the first frame whose energy just exceeds the threshold is defined as the pre-breakpoint. If the energy values of M consecutive frames are all high and the energy of a subsequent frame becomes small and stays small for a preset duration, the place where the energy decreases can be considered the post-breakpoint. The audio between the pre-breakpoint and the post-breakpoint is intercepted and saved as a valid voice stream.
The smaller the audio duration corresponding to the M consecutive frames, the higher the sensitivity of breakpoint detection. Since large stretches of speech are received during conference recording and long pauses may occur in the middle, the sensitivity should be lowered; the value of M in this step can therefore be set relatively large, corresponding to an audio duration of 2000 ms to 2500 ms.
Ideally, the energy of silence is 0, so the preset threshold would be 0 in the ideal case; but collected voice streams usually contain background sound of a certain intensity, which also counts as silence yet clearly has energy above 0, so the preset threshold is usually set to a non-zero value. The preset threshold in this step can be a dynamic threshold: when performing breakpoint detection on each voice stream, first collect the average energy over an initial duration of the stream, for example the average energy E0 of the first 100 ms to 1000 ms, or the average energy E0 of the first 100 frames, then add a coefficient to E0 or multiply it by a coefficient greater than 1 to obtain the preset threshold of this step.
In this embodiment, a single voice stream is divided into multiple frames, the energy of each frame is computed, and the presence of breakpoints is judged from the energy; the single voice stream is thereby cut into multiple valid voice streams, the silent parts are discarded, and only the intercepted valid voice streams are saved, reducing storage pressure.
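Putting the pieces together, the following sketch computes a dynamic threshold from the initial frames and scans the per-frame energies for pre- and post-breakpoints. The function shapes and parameter choices are assumptions for illustration (M = 100 frames of 20 ms approximates the 2000 ms figure in the text):

```python
# Sketch of steps S20202-S20203: dynamic threshold from the stream's initial
# average energy, then pre-/post-breakpoint detection over frame energies.
import numpy as np

def dynamic_threshold(energies: np.ndarray, n_init: int = 100,
                      factor: float = 1.5) -> float:
    """Average energy E0 of the first frames, scaled by a factor > 1."""
    return float(np.mean(energies[:n_init])) * factor

def detect_segments(energies: np.ndarray, thr: float, m: int, hold: int):
    """Yield (pre, post) frame indices: pre = first of M consecutive frames
    above thr; post = first frame of a below-thr run lasting `hold` frames.
    If the stream ends first, the segment runs to the end (as in step S204)."""
    above = energies > thr
    i, n = 0, len(energies)
    while i + m <= n:
        if above[i:i + m].all():              # M consecutive voiced frames
            pre = i                           # pre-breakpoint
            j = i + m
            while j < n and above[j:j + hold].any():
                j += 1                        # voice persists, keep scanning
            yield pre, j                      # j is the post-breakpoint
            i = j + hold
        else:
            i += 1

# usage: 150 quiet frames, 120 voiced frames, 60 quiet frames
energies = np.concatenate([np.full(150, 0.01), np.full(120, 1.0),
                           np.full(60, 0.01)])
thr = dynamic_threshold(energies)
print(list(detect_segments(energies, thr, m=100, hold=25)))  # [(150, 270)]
```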
Step S203, ending recording: receive the end-recording signal and disable the recording function of the associated microphones.
This step can likewise receive the end-recording signal through the management interface of the conference system, automatically disabling the recording function of the associated microphones and stopping reception of the voice streams.
Step S204, saving the valid voice stream: after the end-recording signal is received, if no breakpoint yet exists, intercept the voice stream from the start of breakpoint detection to the end of the audio signal as a valid voice stream, and save it together with its corresponding audio start time, audio length, and associated participant information to the storage medium.
Each voice stream goes through step S202 for real-time breakpoint detection and interception of valid voice streams; even after the end-recording signal is received, the detection of step S202 continues until the audio signal ends. During this process, if pre- and post-breakpoints exist, the valid voice stream is intercepted as in step S202; if no breakpoint exists, the audio from the start of breakpoint detection to the end of the audio signal is considered one valid voice stream and is intercepted and saved.
In this embodiment, breakpoint detection and interception of valid voice streams are performed separately on every voice stream transmitted by the microphones until the end-recording signal is received and reception stops; every valid voice stream is saved together with its corresponding audio start time, audio length, and associated participant information, providing accurate data for the subsequent role-distinguished audio information.
Step S3, generating conference audio: in order of audio start time, beginning with the earliest, synthesize the valid voice streams into a piece of audio information; in the same order, merge the audio start times, audio lengths, and corresponding participant information into a piece of role information; then, after mapping each valid voice stream in the audio information to the corresponding audio start time in the role information, define the audio information and the role information together as conference audio and save it.
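A sketch of this synthesis step; the record layout (`start`, `length`, `participant`, `audio`) is an assumption used for illustration:

```python
# Step S3 in miniature: order the valid voice streams by audio start time,
# concatenate them into one piece of audio, and build the parallel role
# information whose start times map back into the audio.
import numpy as np

def build_conference_audio(segments):
    """segments: iterable of dicts with 'start', 'length', 'participant',
    and 'audio' (np.ndarray) keys. Returns (audio, role_info)."""
    ordered = sorted(segments, key=lambda s: s["start"])       # earliest first
    audio = np.concatenate([s["audio"] for s in ordered])      # one audio piece
    role_info = [{"start": s["start"], "length": s["length"],
                  "participant": s["participant"]} for s in ordered]
    return audio, role_info
```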
When saving the conference audio, this step obtains the audio name entered by the user and saves the conference audio file under that name; if no audio name is obtained within the set time, the conference audio file is renamed to the earliest audio start time and saved.
The audio name can be obtained through the management interface: after the user triggers the end-recording signal through the management interface, an input interface is displayed, through which the user enters the audio name. If no user input is obtained within the set time, for example 5 minutes, the default name is used.
In one embodiment, the method further includes step S4, audio display:
Step S401, receiving a request and displaying: receive an audio playback request sent by the user and display the file names of the conference audio.
The user can issue an audio playback request through an API connected to the conference system, or send one to the conference system through an HTTP request. After receiving the audio playback request, the conference system displays all stored conference audios, sorted by file name, for example in order of storage time or in descending alphabetical order of the first letter.
Step S402, playing audio information and synchronously displaying role information: when the user selects any file name, the audio information corresponding to that file name is played and the corresponding role information is displayed.
Because every valid voice stream in the audio information is mapped to its corresponding role information, the corresponding role information can be displayed synchronously while the selected audio information is played, telling the user who is speaking in the conference audio.
This embodiment provides the user with an audio playback channel; since role information is displayed synchronously during playback, the user no longer needs to organize the meeting content and can see at a glance which speaker corresponds to the recorded content.
In one embodiment, after step S2 the method further includes:
converting each valid voice stream in the audio information into translated text through preset speech recognition software; when the audio start times, audio lengths, and corresponding participant information are merged into a piece of role information in order of audio start time, merging the translated text into the role information as well, and also mapping the voice streams in the audio information to the translated texts; and, when the user selects any file name, playing the corresponding audio information and, while displaying the corresponding role information, displaying the translated text synchronously as well.
After step S2 has intercepted multiple valid voice streams from each voice stream, this embodiment converts each valid voice stream into translated text through the preset speech recognition software. The speech recognition software decodes the valid voice stream with an acoustic model and runs a search over the decoded speech with a language model to obtain the translated text. The acoustic model can be a neural network model, the language model can be an N-gram statistical model, and the search can use the Viterbi algorithm.
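The application only names the standard components (a neural acoustic model, an N-gram language model, Viterbi search) without prescribing an implementation. As a toy illustration of the Viterbi search step alone, over assumed per-frame state scores rather than a real acoustic model:

```python
# Toy Viterbi decoder: given per-frame log-probabilities of each state and
# log transition probabilities, recover the most likely state sequence.
# Real ASR decoders combine acoustic and N-gram language-model scores over
# a far larger search space; this only illustrates the dynamic program.
import numpy as np

def viterbi(emissions: np.ndarray, transitions: np.ndarray,
            initial: np.ndarray) -> list:
    """emissions: (T, S) log P(frame_t | state); transitions: (S, S) log
    P(next | prev); initial: (S,) log P(state at t=0)."""
    T, S = emissions.shape
    score = initial + emissions[0]
    back = np.zeros((T, S), dtype=int)           # best predecessor per state
    for t in range(1, T):
        cand = score[:, None] + transitions      # (prev, next) path scores
        back[t] = np.argmax(cand, axis=0)
        score = cand[back[t], np.arange(S)] + emissions[t]
    path = [int(np.argmax(score))]
    for t in range(T - 1, 0, -1):                # trace the best path back
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```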
When merging into a piece of role information in step S3, the audio start time, audio length, corresponding participant information, and translated text are merged together into the role information.
The audio display of step S4 then also includes displaying the translated text. Since the valid voice streams and the translated texts are mapped to each other, clicking on a translated text can also jump playback to the corresponding valid voice stream while displaying the translated text and role information synchronously.
This embodiment supplies the translated text corresponding to each valid voice stream and displays it during audio display, helping the user further grasp the specific meeting content at a glance.
In one embodiment, the conference audio can also be searched:
receive a search request sent by the user, obtain the keyword, and search the multiple saved conference audios for the keyword; if it exists, display the file names of the conference audios corresponding to the keyword; when the user selects any file name, play the corresponding audio information and display the corresponding role information and translated text.
In this embodiment, the search request and keyword can be received through the management interface of the conference system; the user can also issue a search request through an API connected to the conference system, or send one to the conference system through an HTTP request.
The keyword may be an audio name, an audio start time, participant information, or a general term; the saved conference audios are searched for the keyword and, if it is found, the file names of all conference audios whose audio information or role information contains it are displayed. For example, suppose the keyword is "blockchain", a general term, and the search finds it mentioned in a translated text of participant Zhang San in the role information of one conference audio and in a translated text of participant Li Si in another; then the file names of both conference audios are displayed together. This embodiment provides the user with a search channel and more extended functionality.
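A sketch of the retrieval step, reusing the role-information layout assumed in the earlier sketches with an added `transcript` field for the translated text:

```python
# Search every saved conference audio's role information for a keyword and
# return the file names of the conferences that mention it.
def search_conferences(conferences: dict, keyword: str) -> list:
    """conferences: {file_name: role_info_list}; each role-info entry is a
    dict with 'participant' and 'transcript' (translated text) fields."""
    hits = []
    for file_name, role_info in conferences.items():
        for entry in role_info:
            if (keyword in entry.get("transcript", "")
                    or keyword in entry.get("participant", "")
                    or keyword in file_name):
                hits.append(file_name)
                break  # one match suffices to list this conference audio
    return hits

# e.g. both meetings mentioning "区块链" (blockchain) are returned together
meetings = {
    "2019-05-21-standup": [{"participant": "张三", "transcript": "我们讨论区块链"}],
    "2019-05-22-review":  [{"participant": "李四", "transcript": "区块链进展"}],
}
print(search_conferences(meetings, "区块链"))
```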
The conference role-based speech synthesis method of this embodiment associates role relationships with the conference-room microphones and intercepts valid voice streams in segments through breakpoint detection; after the conference ends, the valid voice streams are synthesized into conference audio in chronological order, and for every valid voice stream the corresponding participant information and translated text are known, giving the user an intuitive view of the meeting content.
In one embodiment, a conference role-based speech synthesis apparatus is proposed, as shown in FIG. 3, including the following modules:
an information obtaining module, configured to obtain participant information entered by a user and its association with microphones, each participant being associated with one microphone;
a voice-stream receiving and saving module, configured to receive the start-recording signal, turn on multiple microphones, receive multiple voice streams through the microphones, perform breakpoint detection on each voice stream, intercept multiple valid voice streams, and save the valid voice streams together with their corresponding audio start times, audio lengths, and associated participant information, until the end-recording signal is received, whereupon the microphones are turned off;
a conference audio generating module, configured to synthesize, in order of audio start time and beginning with the earliest, the valid voice streams into a piece of audio information; merge, in the same order, the audio start times, audio lengths, and corresponding participant information into a piece of role information; and, after mapping each valid voice stream in the audio information to the corresponding audio start time in the role information, define the audio information and the role information together as conference audio and save it.
In one embodiment, the voice-stream receiving and saving module includes: a recording unit, configured to receive the start-recording signal, enable the recording function of the associated microphones, and receive the voice stream transmitted by each microphone; a breakpoint detection unit, configured to perform breakpoint detection on each voice stream separately and, if a breakpoint exists, intercept a valid voice stream, save the intercepted valid voice stream together with its corresponding audio start time, audio length, and associated participant information to the storage medium, and continue breakpoint detection on the current voice stream; a recording disabling unit, configured to receive the end-recording signal and disable the recording function of the associated microphones; and a saving unit, configured to, after the end-recording signal is received and if no breakpoint yet exists, intercept the voice stream from the start of breakpoint detection to the end of the audio signal as a valid voice stream and save it together with its corresponding audio start time, audio length, and associated participant information to the storage medium.
In one embodiment, the breakpoint detection unit is further configured to divide the voice stream into segments of fixed duration, define each segment as one frame of speech, and collect the same number N of sampling points for each frame;
calculate the energy value of each frame of speech with the following formula:
$E = \sum_{k=1}^{N} f_k^2$
where E is the energy value of a frame of speech, f_k is the peak (sample) value of the k-th sampling point, and N is the total number of sampling points in a frame of speech; and, if the energy values of M consecutive frames are above a preset threshold, define the first of those frames whose energy exceeds the threshold as the pre-breakpoint of an audio segment; if the energy value falls below the preset threshold from frame M+1 onward for a preset duration, define frame M+1 as the post-breakpoint and intercept the audio between the pre- and post-breakpoints as a valid voice stream.
In one embodiment, the conference audio generating module is further configured to obtain the audio name entered by the user and save the conference audio file under that name; if no audio name is obtained within the set time, the conference audio file is renamed to the earliest audio start time and saved.
In one embodiment, the apparatus further includes: a display module, configured to receive audio playback requests sent by users and display the file names of the conference audio; and a playback module, configured to play the audio information corresponding to a file name when the user selects it, and display the corresponding role information.
In one embodiment, the apparatus further includes: a conversion module, configured to convert each valid voice stream in the audio information into translated text through preset speech recognition software; a merging module, configured to merge the translated text into the role information when the audio start times, audio lengths, and corresponding participant information are merged into a piece of role information in order of audio start time, and also to map the voice streams in the audio information to the translated texts; and a synchronous display module, configured to play the audio information corresponding to a file name when the user selects it and to display the translated text synchronously while displaying the corresponding role information.
In one embodiment, the apparatus further includes: a retrieval module, configured to receive a search request sent by a user, obtain the keyword, search the multiple saved conference audios for the keyword and, if it exists, display the file names of the conference audios corresponding to the keyword; when the user selects any file name, the corresponding audio information is played and the corresponding role information and translated text are displayed.
In one embodiment, a computer device is proposed, including a memory and a processor, the memory storing computer-readable instructions which, when executed by the processor, cause the processor to implement the steps of the conference role-based speech synthesis method of the above embodiments.
In one embodiment, a storage medium storing computer-readable instructions is proposed; when the computer-readable instructions are executed by one or more processors, the processors perform the steps of the conference role-based speech synthesis method of the above embodiments. The storage medium may be a non-volatile storage medium.
A person of ordinary skill in the art can understand that all or part of the steps in the methods of the above embodiments can be completed by instructing the relevant hardware through a program; the program can be stored in a computer-readable storage medium, which may include read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disk, and the like.
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features of the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification.
The above embodiments express only some exemplary implementations of this application, and their description is relatively specific and detailed, but they should not be construed as limiting the patent scope of this application. It should be noted that a person of ordinary skill in the art can make several variations and improvements without departing from the concept of this application, all of which fall within the protection scope of this application. The protection scope of this patent shall therefore be subject to the appended claims.

Claims (20)

  1. A conference role-based speech synthesis method, comprising:
    obtaining participant information entered by a user and its association with microphones, wherein each participant is associated with one microphone;
    receiving a start-recording signal, turning on multiple microphones, receiving multiple voice streams through the multiple microphones, performing breakpoint detection on each voice stream, intercepting multiple valid voice streams, and saving the multiple valid voice streams together with the audio start times, audio lengths, and associated participant information corresponding to the valid voice streams, until an end-recording signal is received, whereupon the multiple microphones are turned off;
    in order of audio start time, beginning with the earliest, synthesizing the multiple valid voice streams into a piece of audio information; in the order of audio start time, merging the audio start times, the audio lengths, and the corresponding participant information into a piece of role information; and, after mapping the valid voice streams in the audio information to the corresponding audio start times in the role information, defining the audio information and the role information together as conference audio and saving it.
  2. The conference role-based speech synthesis method according to claim 1, wherein said receiving a start-recording signal, turning on multiple microphones, receiving multiple voice streams through the multiple microphones, performing breakpoint detection on each voice stream, intercepting multiple valid voice streams, and saving the multiple valid voice streams together with the corresponding audio start times, audio lengths, and associated participant information until an end-recording signal is received and the multiple microphones are turned off comprises:
    receiving the start-recording signal, enabling the recording function of the multiple associated microphones, and receiving the voice stream transmitted by each microphone;
    performing breakpoint detection on each voice stream separately; if a breakpoint exists, intercepting a valid voice stream, saving the intercepted valid voice stream together with the corresponding audio start time, audio length, and associated participant information to a storage medium, and continuing breakpoint detection on the current voice stream;
    receiving the end-recording signal and disabling the recording function of the multiple associated microphones;
    after receiving the end-recording signal, if no breakpoint yet exists, intercepting the voice stream from the start of breakpoint detection to the end of the audio signal as a valid voice stream, and saving the valid voice stream together with the corresponding audio start time, audio length, and associated participant information to the storage medium.
  3. The conference role-based speech synthesis method according to claim 2, wherein said performing breakpoint detection on each voice stream separately and, if a breakpoint exists, intercepting a valid voice stream comprises:
    dividing the voice stream into segments of fixed duration, defining each segment as one frame of speech, and collecting the same number N of sampling points for each frame of speech;
    calculating the energy value of each frame of speech with the following formula:
    $E = \sum_{k=1}^{N} f_k^2$
    where E is the energy value of a frame of speech, f_k is the peak value of the k-th sampling point, and N is the total number of sampling points of a frame of speech;
    if the energy values of M consecutive frames of speech are above a preset threshold, defining the first of those frames whose energy exceeds the threshold as the pre-breakpoint of an audio segment; if the energy value falls below the preset threshold from frame M+1 onward and stays below it for a preset duration, defining frame M+1 as the post-breakpoint of the audio segment, and intercepting the audio between the pre-breakpoint and the post-breakpoint as one valid voice stream.
  4. The conference role-based speech synthesis method according to claim 1, wherein said defining the audio information and the role information together as conference audio and saving it comprises:
    obtaining an audio name entered by the user and saving the conference audio file under that name; if the audio name is not obtained within a set time, renaming the conference audio file to the earliest audio start time and saving it.
  5. The conference role-based speech synthesis method according to claim 1, further comprising:
    receiving an audio playback request sent by the user and displaying the file names of the conference audio;
    when the user selects any of the file names, playing the audio information corresponding to the file name and displaying the role information corresponding to the file name.
  6. The conference role-based speech synthesis method according to claim 5, further comprising:
    converting each valid voice stream in the audio information into translated text through preset speech recognition software;
    when merging the audio start times, the audio lengths, and the corresponding participant information into a piece of role information in the order of audio start time, also merging the translated text into the role information, and also mapping the voice streams in the audio information to the translated texts;
    when the user selects any of the file names, playing the audio information corresponding to the file name and, while displaying the role information corresponding to the file name, displaying the translated text synchronously as well.
  7. The conference role-based speech synthesis method according to claim 6, further comprising:
    receiving a search request sent by the user, obtaining a keyword, and searching the multiple saved conference audios for the keyword; if it exists, displaying the file names of the conference audios corresponding to the keyword;
    when the user selects any of the file names, playing the audio information corresponding to the file name and displaying the role information and translated text corresponding to the file name.
  8. A conference role-based speech synthesis apparatus, comprising:
    an information obtaining module, configured to obtain participant information entered by a user and its association with microphones, wherein each participant is associated with one microphone;
    a voice-stream receiving and saving module, configured to receive a start-recording signal, turn on multiple microphones, receive multiple voice streams through the multiple microphones, perform breakpoint detection on each voice stream, intercept multiple valid voice streams, and save the multiple valid voice streams together with the audio start times, audio lengths, and associated participant information corresponding to the valid voice streams, until an end-recording signal is received, whereupon the multiple microphones are turned off;
    a conference audio generating module, configured to synthesize, in order of audio start time and beginning with the earliest, the multiple valid voice streams into a piece of audio information; merge, in the order of audio start time, the audio start times, audio lengths, and corresponding participant information into a piece of role information; and, after mapping the valid voice streams in the audio information to the corresponding audio start times in the role information, define the audio information and the role information together as conference audio and save it.
  9. The conference role-based speech synthesis apparatus according to claim 8, wherein the voice-stream receiving and saving module comprises:
    a recording unit, configured to receive the start-recording signal, enable the recording function of the multiple associated microphones, and receive the voice stream transmitted by each microphone;
    a breakpoint detection unit, configured to perform breakpoint detection on each voice stream separately and, if a breakpoint exists, intercept a valid voice stream, save the intercepted valid voice stream together with the corresponding audio start time, audio length, and associated participant information to a storage medium, and continue breakpoint detection on the current voice stream;
    a recording disabling unit, configured to receive the end-recording signal and disable the recording function of the multiple associated microphones;
    a saving unit, configured to, after the end-recording signal is received and if no breakpoint yet exists, intercept the voice stream from the start of breakpoint detection to the end of the audio signal as a valid voice stream, and save the valid voice stream together with the corresponding audio start time, audio length, and associated participant information to the storage medium.
  10. The conference role-based speech synthesis apparatus according to claim 9, wherein the breakpoint detection unit is further configured to divide the voice stream into segments of fixed duration, define each segment as one frame of speech, and collect the same number N of sampling points for each frame of speech;
    calculate the energy value of each frame of speech with the following formula:
    $E = \sum_{k=1}^{N} f_k^2$
    where E is the energy value of a frame of speech, f_k is the peak value of the k-th sampling point, and N is the total number of sampling points of a frame of speech;
    if the energy values of M consecutive frames of speech are above a preset threshold, define the first of those frames whose energy exceeds the threshold as the pre-breakpoint of an audio segment; if the energy value falls below the preset threshold from frame M+1 onward and stays below it for a preset duration, define frame M+1 as the post-breakpoint of the audio segment, and intercept the audio between the pre-breakpoint and the post-breakpoint as one valid voice stream.
  11. The conference role-based speech synthesis apparatus according to claim 8, wherein the conference audio generating module is further configured to obtain an audio name entered by the user and save the conference audio file under that name; if the audio name is not obtained within a set time, the conference audio file is renamed to the earliest audio start time and saved.
  12. The conference role-based speech synthesis apparatus according to claim 8, further comprising:
    a display module, configured to receive an audio playback request sent by the user and display the file names of the conference audio;
    a playback module, configured to, when the user selects any of the file names, play the audio information corresponding to the file name and display the role information corresponding to the file name.
  13. The conference role-based speech synthesis apparatus according to claim 12, further comprising:
    a conversion module, configured to convert each valid voice stream in the audio information into translated text through preset speech recognition software;
    a merging module, configured to, when merging the audio start times, the audio lengths, and the corresponding participant information into a piece of role information in the order of audio start time, also merge the translated text into the role information and also map the voice streams in the audio information to the translated texts;
    a synchronous display module, configured to, when the user selects any of the file names, play the audio information corresponding to the file name and, while displaying the role information corresponding to the file name, display the translated text synchronously as well.
  14. The conference role-based speech synthesis apparatus according to claim 13, further comprising:
    a retrieval module, configured to receive a search request sent by the user, obtain a keyword, search the multiple saved conference audios for the keyword and, if it exists, display the file names of the conference audios corresponding to the keyword; when the user selects any of the file names, the audio information corresponding to the file name is played, and the role information and translated text corresponding to the file name are displayed.
  15. A computer device, comprising a memory and a processor, the memory storing computer-readable instructions which, when executed by the processor, cause the processor to perform the following steps:
    obtaining participant information entered by a user and its association with microphones, wherein each participant is associated with one microphone;
    receiving a start-recording signal, turning on multiple microphones, receiving multiple voice streams through the multiple microphones, performing breakpoint detection on each voice stream, intercepting multiple valid voice streams, and saving the multiple valid voice streams together with the audio start times, audio lengths, and associated participant information corresponding to the valid voice streams, until an end-recording signal is received, whereupon the multiple microphones are turned off;
    in order of audio start time, beginning with the earliest, synthesizing the multiple valid voice streams into a piece of audio information; in the order of audio start time, merging the audio start times, the audio lengths, and the corresponding participant information into a piece of role information; and, after mapping the valid voice streams in the audio information to the corresponding audio start times in the role information, defining the audio information and the role information together as conference audio and saving it.
  16. The computer device according to claim 15, wherein, in said receiving a start-recording signal, turning on multiple microphones, receiving multiple voice streams through the multiple microphones, performing breakpoint detection on each voice stream, intercepting multiple valid voice streams, and saving the multiple valid voice streams together with the corresponding audio start times, audio lengths, and associated participant information until the end-recording signal is received and the multiple microphones are turned off, the processor is caused to perform the following steps:
    receiving the start-recording signal, enabling the recording function of the multiple associated microphones, and receiving the voice stream transmitted by each microphone;
    performing breakpoint detection on each voice stream separately; if a breakpoint exists, intercepting a valid voice stream, saving the intercepted valid voice stream together with the corresponding audio start time, audio length, and associated participant information to a storage medium, and continuing breakpoint detection on the current voice stream;
    receiving the end-recording signal and disabling the recording function of the multiple associated microphones;
    after receiving the end-recording signal, if no breakpoint yet exists, intercepting the voice stream from the start of breakpoint detection to the end of the audio signal as a valid voice stream, and saving the valid voice stream together with the corresponding audio start time, audio length, and associated participant information to the storage medium.
  17. The computer device according to claim 16, wherein said performing breakpoint detection on each voice stream separately and, if a breakpoint exists, intercepting a valid voice stream causes the processor to perform the following steps:
    dividing the voice stream into segments of fixed duration, defining each segment as one frame of speech, and collecting the same number N of sampling points for each frame of speech;
    calculating the energy value of each frame of speech with the following formula:
    $E = \sum_{k=1}^{N} f_k^2$
    where E is the energy value of a frame of speech, f_k is the peak value of the k-th sampling point, and N is the total number of sampling points of a frame of speech;
    if the energy values of M consecutive frames of speech are above a preset threshold, defining the first of those frames whose energy exceeds the threshold as the pre-breakpoint of an audio segment; if the energy value falls below the preset threshold from frame M+1 onward and stays below it for a preset duration, defining frame M+1 as the post-breakpoint of the audio segment, and intercepting the audio between the pre-breakpoint and the post-breakpoint as one valid voice stream.
  18. A storage medium storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps: obtaining participant information entered by a user and its association with microphones, wherein each participant is associated with one microphone;
    receiving a start-recording signal, turning on multiple microphones, receiving multiple voice streams through the multiple microphones, performing breakpoint detection on each voice stream, intercepting multiple valid voice streams, and saving the multiple valid voice streams together with the audio start times, audio lengths, and associated participant information corresponding to the valid voice streams, until an end-recording signal is received, whereupon the multiple microphones are turned off;
    in order of audio start time, beginning with the earliest, synthesizing the multiple valid voice streams into a piece of audio information; in the order of audio start time, merging the audio start times, the audio lengths, and the corresponding participant information into a piece of role information; and, after mapping the valid voice streams in the audio information to the corresponding audio start times in the role information, defining the audio information and the role information together as conference audio and saving it.
  19. The storage medium according to claim 18, wherein, in said receiving a start-recording signal, turning on multiple microphones, receiving multiple voice streams through the multiple microphones, performing breakpoint detection on each voice stream, intercepting multiple valid voice streams, and saving the multiple valid voice streams together with the corresponding audio start times, audio lengths, and associated participant information until the end-recording signal is received and the multiple microphones are turned off, the one or more processors are caused to perform the following steps: receiving the start-recording signal, enabling the recording function of the multiple associated microphones, and receiving the voice stream transmitted by each microphone;
    performing breakpoint detection on each voice stream separately; if a breakpoint exists, intercepting a valid voice stream, saving the intercepted valid voice stream together with the corresponding audio start time, audio length, and associated participant information to a storage medium, and continuing breakpoint detection on the current voice stream;
    receiving the end-recording signal and disabling the recording function of the multiple associated microphones;
    after receiving the end-recording signal, if no breakpoint yet exists, intercepting the voice stream from the start of breakpoint detection to the end of the audio signal as a valid voice stream, and saving the valid voice stream together with the corresponding audio start time, audio length, and associated participant information to the storage medium.
  20. The storage medium according to claim 19, wherein said performing breakpoint detection on each voice stream separately and, if a breakpoint exists, intercepting a valid voice stream causes the one or more processors to perform the following steps: dividing the voice stream into segments of fixed duration, defining each segment as one frame of speech, and collecting the same number N of sampling points for each frame of speech; calculating the energy value of each frame of speech with the following formula:
    $E = \sum_{k=1}^{N} f_k^2$
    where E is the energy value of a frame of speech, f_k is the peak value of the k-th sampling point, and N is the total number of sampling points of a frame of speech;
    if the energy values of M consecutive frames of speech are above a preset threshold, defining the first of those frames whose energy exceeds the threshold as the pre-breakpoint of an audio segment; if the energy value falls below the preset threshold from frame M+1 onward and stays below it for a preset duration, defining frame M+1 as the post-breakpoint of the audio segment, and intercepting the audio between the pre-breakpoint and the post-breakpoint as one valid voice stream.
PCT/CN2019/102448 2019-05-21 2019-08-26 Conference role-based speech synthesis method, apparatus, computer device and storage medium WO2020232865A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910424720.3A CN110322869B (zh) 2019-05-21 2019-05-21 Conference role-based speech synthesis method, apparatus, computer device and storage medium
CN201910424720.3 2019-05-21

Publications (1)

Publication Number Publication Date
WO2020232865A1 (zh)

Family

ID=68113334

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/102448 WO2020232865A1 (zh) 2019-05-21 2019-08-26 Conference role-based speech synthesis method, apparatus, computer device and storage medium

Country Status (2)

Country Link
CN (1) CN110322869B (zh)
WO (1) WO2020232865A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113362869A (zh) * 2021-05-19 2021-09-07 北京明略软件系统有限公司 A recording device
CN113708868A (zh) * 2021-08-27 2021-11-26 国网安徽省电力有限公司池州供电公司 Scheduling system and scheduling method for multiple sound-pickup devices
CN113723086A (zh) * 2021-08-31 2021-11-30 平安科技(深圳)有限公司 Text processing method, system, device and medium

Families Citing this family (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110808062B (zh) * 2019-11-26 2022-12-13 秒针信息技术有限公司 Mixed speech separation method and device
WO2021109000A1 (zh) * 2019-12-03 2021-06-10 深圳市欢太科技有限公司 Data processing method and apparatus, electronic device, and storage medium
CN111128132A (zh) * 2019-12-19 2020-05-08 秒针信息技术有限公司 Speech separation method, apparatus, system, and storage medium
CN111445920B (zh) * 2020-03-19 2023-05-16 西安声联科技有限公司 Real-time separation method and apparatus for multi-source speech signals, and sound pickup
CN111429914B (zh) * 2020-03-30 2023-04-18 招商局金融科技有限公司 Microphone control method, electronic device, and computer-readable storage medium
CN113704312A (zh) * 2020-05-21 2021-11-26 北京声智科技有限公司 Information processing method, apparatus, medium, and device
CN113782026A (zh) * 2020-06-09 2021-12-10 北京声智科技有限公司 Information processing method, apparatus, medium, and device
CN113963452A (zh) * 2020-07-02 2022-01-21 Oppo广东移动通信有限公司 Conference check-in method, apparatus, and computer-readable storage medium
CN111883168B (zh) * 2020-08-04 2023-12-22 上海明略人工智能(集团)有限公司 Speech processing method and device
CN111968686B (zh) * 2020-08-06 2022-09-30 维沃移动通信有限公司 Recording method, apparatus, and electronic device
CN112185424A (zh) * 2020-09-29 2021-01-05 国家计算机网络与信息安全管理中心 Voice file clipping and restoration method, apparatus, device, and storage medium
CN112270918A (zh) * 2020-10-22 2021-01-26 北京百度网讯科技有限公司 Information processing method, apparatus, system, electronic device, and storage medium
CN112804401A (zh) * 2020-12-31 2021-05-14 中国人民解放军战略支援部队信息工程大学 Method and apparatus for determining conference roles and controlling voice collection
CN112908336A (zh) * 2021-01-29 2021-06-04 深圳壹秘科技有限公司 Role separation method for a speech processing apparatus, and the speech processing apparatus
CN113055529B (zh) * 2021-03-29 2022-12-13 深圳市艾酷通信软件有限公司 Recording control method and recording control apparatus
CN113422865A (zh) * 2021-06-01 2021-09-21 维沃移动通信有限公司 Directional recording method and apparatus
CN113539269A (zh) * 2021-07-20 2021-10-22 上海明略人工智能(集团)有限公司 Audio information processing method, system, and computer-readable storage medium
CN113542661A (zh) * 2021-09-09 2021-10-22 北京鼎天宏盛科技有限公司 Video conference speech recognition method and system
US11838340B2 (en) 2021-09-20 2023-12-05 International Business Machines Corporation Dynamic mute control for web conferencing
CN115242747A (zh) * 2022-07-21 2022-10-25 维沃移动通信有限公司 Voice message processing method and apparatus, electronic device, and readable storage medium
CN116015996B (zh) * 2023-03-28 2023-06-02 南昌航天广信科技有限责任公司 Digital conference audio processing method and system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2922237A1 (de) * 2014-03-20 2015-09-23 Unify GmbH & Co. KG Method, device and system for controlling a conference
EP3197139A1 (en) * 2016-01-20 2017-07-26 Ricoh Company, Ltd. Information processing system, information processing device, and information processing method
US20180033332A1 (en) * 2016-07-27 2018-02-01 David Nelson System and Method for Recording, Documenting and Visualizing Group Conversations
CN108346034A (zh) * 2018-02-02 2018-07-31 深圳市鹰硕技术有限公司 Intelligent conference management method and system

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107564531A (zh) * 2017-08-25 2018-01-09 百度在线网络技术(北京)有限公司 Voiceprint-feature-based meeting record method, device and computer equipment
CN108305632B (zh) * 2018-02-02 2020-03-27 深圳市鹰硕技术有限公司 Method and system for forming a speech summary of a conference
CN109388701A (zh) * 2018-08-17 2019-02-26 深圳壹账通智能科技有限公司 Meeting record generation method, apparatus, device and computer storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2922237A1 (de) * 2014-03-20 2015-09-23 Unify GmbH & Co. KG Method, device and system for controlling a conference
EP3197139A1 (en) * 2016-01-20 2017-07-26 Ricoh Company, Ltd. Information processing system, information processing device, and information processing method
US20180033332A1 (en) * 2016-07-27 2018-02-01 David Nelson System and Method for Recording, Documenting and Visualizing Group Conversations
CN108346034A (zh) * 2018-02-02 2018-07-31 深圳市鹰硕技术有限公司 Intelligent conference management method and system

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113362869A (zh) * 2021-05-19 2021-09-07 北京明略软件系统有限公司 A recording device
CN113708868A (zh) * 2021-08-27 2021-11-26 国网安徽省电力有限公司池州供电公司 Scheduling system and scheduling method for multiple sound-pickup devices
CN113708868B (zh) * 2021-08-27 2023-06-27 国网安徽省电力有限公司池州供电公司 Scheduling system and scheduling method for multiple sound-pickup devices
CN113723086A (zh) * 2021-08-31 2021-11-30 平安科技(深圳)有限公司 Text processing method, system, device and medium
CN113723086B (zh) * 2021-08-31 2023-09-05 平安科技(深圳)有限公司 Text processing method, system, device and medium

Also Published As

Publication number Publication date
CN110322869B (zh) 2023-06-16
CN110322869A (zh) 2019-10-11

Similar Documents

Publication Publication Date Title
WO2020232865A1 (zh) Conference role-based speech synthesis method, apparatus, computer device and storage medium
CN108305632B (zh) Method and system for forming a speech summary of a conference
US11115541B2 (en) Post-teleconference playback using non-destructive audio transport
US11557280B2 (en) Background audio identification for speech disambiguation
WO2020233068A1 (zh) Conference audio control method, system, device and computer-readable storage medium
US11699456B2 (en) Automated transcript generation from multi-channel audio
CN107211061B (zh) 用于空间会议回放的优化虚拟场景布局
CN107211062B (zh) 虚拟声学空间中的音频回放调度
EP3254454B1 (en) Conference searching and playback of search results
US11076052B2 (en) Selective conference digest
Chaudhuri et al. Ava-speech: A densely labeled dataset of speech activity in movies
CN107210036B (zh) 会议词语云
US20180006837A1 (en) Post-conference playback system having higher perceived quality than originally heard in the conference
US20110004473A1 (en) Apparatus and method for enhanced speech recognition
US20130211826A1 (en) Audio Signals as Buffered Streams of Audio Signals and Metadata
WO2020057102A1 (zh) Speech translation method and translation apparatus
JPWO2020222925A5 (zh)
US11334618B1 (en) Device, system, and method of capturing the moment in audio discussions and recordings
JP5030868B2 (ja) Conference audio recording system
WO2017020011A1 (en) Searching the results of an automatic speech recognition process
EP2763136B1 (en) Method and system for obtaining relevant information from a voice communication
WO2014085985A1 (zh) Call transcription system and method
US12002452B2 (en) Background audio identification for speech disambiguation
JP2021131524A (ja) Online sequential speaker diarization method, apparatus and system
CN111415128A (zh) Method, system, apparatus, device and medium for controlling a conference

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19929758

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19929758

Country of ref document: EP

Kind code of ref document: A1