CN110322869B - Conference character-division speech synthesis method, device, computer equipment and storage medium - Google Patents

Conference character-division speech synthesis method, device, computer equipment and storage medium

Info

Publication number
CN110322869B
CN110322869B (application CN201910424720.3A)
Authority
CN
China
Prior art keywords
audio
information
conference
voice
microphones
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910424720.3A
Other languages
Chinese (zh)
Other versions
CN110322869A (en)
Inventor
岳鹏昱
闫冬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201910424720.3A
Priority to PCT/CN2019/102448
Publication of CN110322869A
Application granted
Publication of CN110322869B
Legal status: Active

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 - Detection of presence or absence of voice signals
    • G10L25/87 - Detection of discrete points within a voice signal
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 - Reducing energy consumption in communication networks
    • Y02D30/70 - Reducing energy consumption in communication networks in wireless communication networks

Abstract

The present invention relates to the field of artificial intelligence technologies, and in particular to a conference character-division speech synthesis method, apparatus, computer device, and storage medium. The method comprises the following steps: acquiring participant information input by a user and its association with microphones; receiving a plurality of voice streams through a plurality of microphones, performing breakpoint detection on each voice stream, intercepting a plurality of valid voice streams, and storing the valid voice streams together with their audio start times, audio lengths, and associated participant information; and synthesizing the valid voice streams into one piece of audio information, combining the audio start times, audio lengths, and corresponding participant information into role information, and defining the audio information and the role information together as conference audio for storage. According to the invention, participant information is assigned to each conference-room microphone and each audio segment corresponds to a participant, so that the speech content of every speaker during the conference can be easily determined.

Description

Conference character-division speech synthesis method, device, computer equipment and storage medium
Technical Field
The present invention relates to the field of artificial intelligence technologies, and in particular, to a conference character-division speech synthesis method, apparatus, computer device, and storage medium.
Background
As an economical and efficient conferencing solution, multimedia conferences are increasingly used in enterprises and have greatly improved the efficiency of communication and collaboration. As a means of multi-person communication, conference recording is often necessary, and recording a multimedia conference is one form of conference record. For example, a user may need to leave a conference temporarily without missing the important remarks of certain participants, or may simply want to keep a record of what certain participants said; in either case, the user starts recording the conference. However, current conference recording generally covers the whole conference process: once recording is started, all conference speech is recorded, recording cannot be restricted to designated participants, and recordings cannot be separated by participant role.
Disclosure of Invention
In view of this, to address the problem that conference recordings cannot be stored by speaker role, it is necessary to provide a conference character-division speech synthesis method, apparatus, computer device, and storage medium.
A conference character-division speech synthesis method, comprising:
acquiring participant information input by the user and its association with microphones, wherein each participant is associated with one microphone;
receiving a recording start signal, turning on a plurality of microphones, receiving a plurality of voice streams through the microphones, performing breakpoint detection on each voice stream, intercepting a plurality of valid voice streams, and storing the valid voice streams together with their corresponding audio start times, audio lengths, and associated participant information, until an end-recording signal is received, and then turning off the microphones;
and synthesizing the plurality of valid voice streams, starting from the earliest, into one piece of audio information in order of audio start time; combining the audio start times, audio lengths, and corresponding participant information into role information in the same order; mapping each valid voice stream in the audio information to its audio start time in the role information; and defining the audio information and the role information together as conference audio for storage.
In one possible design, the receiving a recording start signal, turning on a plurality of microphones, receiving a plurality of voice streams through the microphones, performing breakpoint detection on each voice stream, intercepting a plurality of valid voice streams, storing the valid voice streams together with their corresponding audio start times, audio lengths, and associated participant information until an end-recording signal is received, and turning off the microphones, includes:
receiving a recording start signal, starting the recording function for the plurality of associated microphones, and separately receiving the voice stream transmitted by each microphone;
performing breakpoint detection on each voice stream; if a breakpoint exists, intercepting a segment of valid voice stream, storing the intercepted valid voice stream together with its corresponding audio start time, audio length, and associated participant information in a storage medium, and continuing breakpoint detection on the current voice stream;
receiving an end-recording signal, and turning off the recording function for the plurality of associated microphones;
after the end-recording signal is received, if no breakpoint exists, intercepting the voice stream from the start of breakpoint detection to the end of the audio signal as a valid voice stream, and storing the valid voice stream together with its corresponding audio start time, audio length, and associated participant information in a storage medium.
In one possible design, the performing breakpoint detection on each voice stream and intercepting a segment of valid voice stream if a breakpoint exists includes:
dividing the voice stream into units of fixed duration, defining each unit as one frame of voice, and collecting the same number N of sampling points for each frame of voice;
calculating the energy value of each frame of voice according to the following formula:
E = \sum_{k=1}^{N} f_k^2
where E is the energy value of one frame of voice, f_k is the sample (peak) value of the k-th sampling point, and N is the total number of sampling points in one frame of voice;
if the energy values of M consecutive frames of voice are all higher than a preset threshold, defining the first frame of that run as the front breakpoint of a segment of audio; if the energy values then fall below the preset threshold and remain below it for a preset duration, defining the (M+1)-th frame, i.e. the first low-energy frame after the run, as the rear breakpoint of the segment; and intercepting the audio between the front breakpoint and the rear breakpoint as one segment of valid voice stream.
In one possible design, the defining the audio information and the role information together as conference audio for saving includes:
acquiring an audio name input by the user and renaming the conference audio file to that audio name before saving; if no audio name is acquired within a set time, renaming the conference audio file to the earliest audio start time before saving.
In one possible design, the method further comprises:
receiving an audio playback request sent by the user, and displaying the file names of the conference audio;
and after the user triggers any file name, playing the audio information corresponding to that file name and displaying the corresponding role information.
In one possible design, the method further comprises:
converting each valid voice stream in the audio information into a translation text through preset speech recognition software;
when the audio start times, audio lengths, and corresponding participant information are combined into role information in order of audio start time, also merging the translation text into the role information, and also mapping each valid voice stream in the audio information to its translation text;
and after the user triggers any file name, playing the audio information corresponding to that file name and displaying the translation text in sync with the corresponding role information.
In one possible design, the method further comprises:
receiving a search request sent by the user, acquiring a keyword, and searching the stored conference audio for the keyword; if it exists, displaying the file names of the conference audio corresponding to the keyword;
and after the user triggers any file name, playing the audio information corresponding to that file name and displaying the corresponding role information and translation text.
A conference character-division speech synthesis apparatus, comprising:
an information acquisition module, used to acquire the participant information input by the user and its association with microphones, each participant being associated with one microphone;
a voice stream receiving and storing module, used to receive a recording start signal, turn on a plurality of microphones, receive a plurality of voice streams through the microphones, perform breakpoint detection on each voice stream, intercept a plurality of valid voice streams, and store the valid voice streams together with their corresponding audio start times, audio lengths, and associated participant information, until an end-recording signal is received, and then turn off the microphones;
and a conference audio generation module, used to synthesize the plurality of valid voice streams, starting from the earliest, into one piece of audio information in order of audio start time, combine the audio start times, audio lengths, and corresponding participant information into role information in the same order, map each valid voice stream in the audio information to its audio start time in the role information, and define the audio information and the role information together as conference audio for storage.
A computer device comprising a memory and a processor, the memory storing computer readable instructions which, when executed by the processor, cause the processor to perform the steps of the conference character-division speech synthesis method described above.
A storage medium storing computer readable instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of the conference character-division speech synthesis method described above.
The conference character-division speech synthesis method, apparatus, computer device, and storage medium comprise: acquiring participant information input by the user and its association with microphones, each participant being associated with one microphone; receiving a recording start signal, turning on a plurality of microphones, receiving a plurality of voice streams through the microphones, performing breakpoint detection on each voice stream, intercepting a plurality of valid voice streams, and storing the valid voice streams together with their corresponding audio start times, audio lengths, and associated participant information, until an end-recording signal is received, and then turning off the microphones; and, in order of audio start time, combining the audio start times, audio lengths, and corresponding participant information into role information, mapping each valid voice stream in the audio information to its audio start time in the role information, and defining the audio information and the role information together as conference audio for storage. According to the invention, participant information is assigned to each conference-room microphone, audio is intercepted segment by segment through breakpoint (silence) detection, and after the conference ends the segments are synthesized into conference audio in chronological order with the role information of each segment known, so that the speech content of every speaker during the conference can be easily determined.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention.
FIG. 1 is a flow chart of a conference character-by-character speech synthesis method in one embodiment of the invention;
FIG. 2 is a flowchart of step S2 in an embodiment of the present invention;
fig. 3 is a block diagram of a conference character-by-character speech synthesis apparatus according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
Fig. 1 is a flowchart of a conference character-division speech synthesis method according to an embodiment of the present invention, as shown in fig. 1, and the conference character-division speech synthesis method includes the following steps:
step S1, information is acquired: and acquiring the information of the participants input by the user and the association relation with the microphones, wherein each participant is associated with one microphone.
In this step, the participant information and the associations between all participants and microphones input by the user can be received through a preset management interface of the conference system. The management interface presents a seating diagram of the conference room, on which the position of each microphone is marked. The user clicks the corresponding microphone to trigger an input interface and enters the corresponding participant's information through it, completing the system-level association between participant and microphone. The participant information may be the participant's name, job number, or other unique identifier within the company, used to distinguish the participants.
In this step, the plurality of microphones are connected to the conference system through Raspberry Pi-based sound-receiving devices. The MAC address of each sound-receiving device serves as its unique identifier, and the microphone name is mapped to the corresponding MAC address, completing the physical association between participants and microphones.
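To make the association concrete, the following minimal sketch shows one way the MAC-keyed registry described above could be structured; the class and method names (MicrophoneRegistry, associate, participant_for) and the sample MAC address and participant are illustrative assumptions, not the patent's actual implementation.

```python
# A minimal sketch, assuming MAC-keyed storage; names are illustrative.
from dataclasses import dataclass

@dataclass
class Participant:
    name: str        # the participant's name
    job_number: str  # unique identifier within the company

class MicrophoneRegistry:
    def __init__(self) -> None:
        # MAC address of the sound-receiving device -> (microphone name, participant)
        self._by_mac: dict[str, tuple[str, Participant]] = {}

    def associate(self, mac: str, mic_name: str, participant: Participant) -> None:
        """Bind one participant to one microphone, keyed by the device MAC."""
        self._by_mac[mac] = (mic_name, participant)

    def participant_for(self, mac: str) -> Participant:
        """Look up which participant speaks on the device with this MAC."""
        return self._by_mac[mac][1]

# The management interface would call associate() when the user clicks a
# microphone on the seating diagram and enters the participant's information.
registry = MicrophoneRegistry()
registry.associate("b8:27:eb:12:34:56", "mic-01", Participant("Zhang San", "E1001"))
```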
Step S2, receiving and storing voice streams: receiving a recording start signal, turning on a plurality of microphones, receiving a plurality of voice streams through the microphones, performing breakpoint detection on each voice stream, intercepting a plurality of valid voice streams, and storing the valid voice streams together with their corresponding audio start times, audio lengths, and associated participant information, until an end-recording signal is received, and then turning off the microphones.
In this step, a plurality of independent threads can be started so that the voice stream sent by each microphone is received independently, breakpoint detection is performed on each stream, and the valid voice streams are intercepted. When a valid voice stream is stored, the corresponding participant information is stored with it, which makes it easy to determine which participant uttered which valid voice stream.
In one embodiment, step S2, as shown in fig. 2, includes:
step S201, starting recording: a recording starting signal is received, a recording function is started for a plurality of associated microphones, and voice streams transmitted by each microphone are respectively received.
The step can receive the recording starting signal through the management interface of the conference system, automatically start the recording function for the associated microphone, and respectively receive the voice streams transmitted by a plurality of microphones.
Step S202, breakpoint detection and interception of valid voice streams: performing breakpoint detection on each voice stream; if a breakpoint exists, intercepting a segment of valid voice stream, storing the intercepted valid voice stream together with its corresponding audio start time, audio length, and associated participant information in a storage medium, and continuing breakpoint detection on the current voice stream.
Breakpoint detection is used to extract segments of valid voice stream from a continuous voice stream. It comprises detecting the starting point of a valid voice stream, the front breakpoint, and its ending point, the rear breakpoint. Separating valid voice streams from the continuous stream reduces the amount of stored data, and breakpoint detection can also simplify human-machine interaction: if desired, the end of recording can be determined directly from real-time breakpoint detection on the received voice stream, without receiving the end-recording signal of step S203.
In this step, when breakpoint detection is performed on the voice stream, in one embodiment, the following manner is adopted:
step S20201, dividing the voice stream: dividing the voice stream according to fixed duration, defining each dividing unit as one frame of voice, and collecting N sampling points with the same number for each frame of voice.
The fixed duration in this step may be 20 ms, 30 ms, and so on; the voice stream is divided into frames of voice accordingly. Since even the same participant may speak the same words at different volumes, the voice stream may be normalized before it is divided: take the sample with the largest amplitude in the stream, raise its amplitude to close to 1, record the scaling ratio, and stretch all other samples by the same ratio.
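As a concrete illustration, the sketch below implements the normalization and fixed-duration framing just described; it assumes the voice stream is already available as a NumPy array of samples, and the function names are illustrative assumptions.

```python
# A minimal sketch, assuming the stream is a NumPy float array; names are illustrative.
import numpy as np

def normalize(stream: np.ndarray) -> np.ndarray:
    """Scale the stream so its largest-magnitude sample is close to 1."""
    peak = np.max(np.abs(stream))
    if peak == 0.0:
        return stream            # pure silence: nothing to scale
    ratio = 0.99 / peak          # record the scaling ratio...
    return stream * ratio        # ...and stretch every sample by it

def split_frames(stream: np.ndarray, sample_rate: int, frame_ms: int = 20) -> list:
    """Divide the stream into fixed-duration frames (e.g. 20 ms or 30 ms),
    each containing the same number N of sampling points."""
    n = sample_rate * frame_ms // 1000   # N sampling points per frame
    return [stream[i:i + n] for i in range(0, len(stream) - n + 1, n)]
```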
Step S20202, calculating the energy value: the energy value of each frame of voice is calculated according to the following formula:
E = \sum_{k=1}^{N} f_k^2
where E is the energy value of one frame of voice, f_k is the sample (peak) value of the k-th sampling point, and N is the total number of sampling points in one frame of voice.
the energy value of a frame of speech is related to the size of the sampling value and the number of sampling points contained in the frame of speech, and the sampling value, namely the peak value, generally contains positive values and negative values, and the positive and negative values are not considered when the energy value is calculated, so that the square sum of the sampling values is used for defining the energy value of a frame of speech in the step.
Step S20203, determining the front and rear breakpoints: if the energy values of M consecutive frames of voice are all higher than a preset threshold, the first frame of that run is defined as the front breakpoint of a segment of audio; if the energy values then fall below the preset threshold and remain below it for a preset duration, the (M+1)-th frame, i.e. the first low-energy frame after the run, is defined as the rear breakpoint of the segment, and the audio between the front breakpoint and the rear breakpoint is intercepted as one segment of valid voice stream.
That is, if the energy of the first few frames of a voice stream is below the preset threshold and the energy of the next M consecutive frames is above it, the first frame whose energy rises above the threshold is the front breakpoint. If, after the M consecutive high-energy frames, the energy of the following frames drops below the threshold and stays low for the preset duration, the rear breakpoint is placed where the energy drops. The audio between the front and rear breakpoints is intercepted and stored as one segment of valid voice stream.
The shorter the audio duration covered by the M consecutive frames, the higher the sensitivity of breakpoint detection. Because conference recording often involves long stretches of speech that may contain fairly long pauses, the sensitivity should be reduced here: M can be set to a larger value, corresponding to an audio duration between 2000 ms and 2500 ms.
Ideal silence has an energy value of 0, so in the ideal case the preset threshold would be 0. In a real collected voice stream, however, there is always background sound of some intensity; it still counts as silence, but its energy value is clearly above 0, so in practice the preset threshold is not set to 0. The preset threshold may be dynamic: when breakpoint detection starts on a voice stream, the average energy value E0 of the opening portion of the stream is collected, for example the first 100 ms to 1000 ms, or the first 100 frames, and the preset threshold is obtained by adding an offset to E0 or multiplying E0 by a coefficient greater than 1.
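The following sketch ties these pieces together under the assumptions stated above: a dynamic threshold derived from the opening frames, a front breakpoint after M consecutive frames above the threshold, and a rear breakpoint after a sustained drop below it. Parameter names and default values are illustrative, not the patent's.

```python
# A minimal sketch of front/rear breakpoint detection; assumptions as stated above.
import numpy as np

def detect_segments(energies, m, hang_frames, warmup_frames=100, factor=1.5):
    """Return (front, rear) frame indices of each valid voice segment."""
    e0 = float(np.mean(energies[:warmup_frames]))  # background ("silence") energy
    threshold = e0 * factor                        # coefficient > 1 (or e0 + offset)
    segments, front, run, quiet = [], None, 0, 0
    for i, e in enumerate(energies):
        if e > threshold:
            run, quiet = run + 1, 0
            if front is None and run >= m:
                front = i - m + 1                  # first frame of the loud run
        else:
            run = 0
            if front is not None:
                quiet += 1
                if quiet >= hang_frames:           # low energy for the preset duration
                    segments.append((front, i - quiet + 1))  # rear breakpoint frame
                    front, quiet = None, 0
    if front is not None:                          # stream ended without a rear breakpoint
        segments.append((front, len(energies)))    # cf. step S204: keep audio to the end
    return segments
```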
In this embodiment, a single voice stream is divided into frames of voice, an energy value is calculated for each frame, breakpoints are determined from the energy values, and the single voice stream is cut into multiple valid voice streams; the silent portions are discarded and only the intercepted valid voice streams are stored, reducing storage pressure.
Step S203, ending the recording: receiving an end-recording signal, and turning off the recording function for the plurality of associated microphones.
This step can likewise receive the end-recording signal through the management interface of the conference system, automatically turn off the recording function of the associated microphones, and stop receiving voice streams.
Step S204, saving the valid voice stream: after the end-recording signal is received, if no breakpoint exists, intercepting the voice stream from the start of breakpoint detection to the end of the audio signal as a valid voice stream, and storing the valid voice stream together with its corresponding audio start time, audio length, and associated participant information in a storage medium.
Each voice stream goes through step S202, with breakpoint detection performed in real time and valid voice streams intercepted; after the end-recording signal is received, the detection of step S202 continues until the audio signal ends. During this process, if front and rear breakpoints exist, step S202 intercepts the valid voice stream. If no breakpoint exists, the audio from the start of breakpoint detection to the end of the audio signal is treated as one valid voice stream and is intercepted and stored.
In this embodiment, breakpoint detection and interception of valid voice streams are performed separately on each voice stream transmitted by the microphones until the end-recording signal is received and reception stops; each valid voice stream is stored together with its corresponding audio start time, audio length, and associated participant information, providing accurate data for the subsequent generation of role-distinguished conference audio.
Step S3, generating the conference audio: in order of audio start time, synthesizing the plurality of valid voice streams, starting from the earliest, into one piece of audio information; in the same order, combining the audio start times, audio lengths, and corresponding participant information into role information; mapping each valid voice stream in the audio information to its corresponding audio start time in the role information; and defining the audio information and the role information together as conference audio for storage.
When the conference audio is saved, the audio name input by the user is acquired and the conference audio file is renamed to that audio name before saving; if no audio name is acquired within a set time, the conference audio file is renamed to the earliest audio start time before saving.
This step can acquire the audio name input by the user through the management interface: after the user triggers the end-recording signal through the management interface, an input interface is displayed, and the user enters the audio name through it. If no input from the user is acquired within a set time, for example 5 minutes, the default name is used.
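A sketch of step S3 under assumed data structures: each stored record holds a valid voice stream's samples, start time (assumed to be a datetime), length, and participant; the records are sorted by start time, concatenated into one piece of audio information, and paired with role-information entries keyed by the same start times. The record layout and default-name format are assumptions for illustration.

```python
# A minimal sketch of conference-audio generation; the record layout is assumed.
import numpy as np

def build_conference_audio(records, audio_name=None):
    records = sorted(records, key=lambda r: r["start_time"])   # earliest first
    audio = np.concatenate([r["samples"] for r in records])    # one piece of audio information
    roles = [{"start_time": r["start_time"],                   # maps each role entry
              "length": r["length"],                           # back to its voice stream
              "participant": r["participant"]}
             for r in records]
    # Default file name: the earliest audio start time, used when the user
    # supplies no audio name within the set time (e.g. 5 minutes).
    name = audio_name or records[0]["start_time"].strftime("%Y%m%d-%H%M%S")
    return {"file_name": name, "audio": audio, "roles": roles}
```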
In one embodiment, the method further comprises step S4 of audio presentation:
step S401, receiving a request and displaying: and receiving an audio playback request sent by a user, and displaying the file name of the conference audio.
The user can make an audio playback request through an API connected to the conference system, or send the audio playback request to the conference system as an http request. After receiving the audio playback request, the conference system displays all stored conference audio, sorted by file name, for example in order of storage time or in descending alphabetical order of initial letters.
Step S402, playing the audio information and displaying role information in sync: after the user triggers any file name, playing the audio information corresponding to that file name and displaying the corresponding role information.
Because each valid voice stream in the audio information is mapped to its role information, this step can display the corresponding role information in sync while the audio information triggered by the user plays, showing the user which conference speaker each piece of audio belongs to.
This embodiment provides the user with an audio playback channel that displays role information in sync with playback; the user does not need to organize the conference content and can see at a glance which speaker each recorded passage belongs to.
In one embodiment, after step S2, the method further includes:
converting each valid voice stream in the audio information into a translation text by preset speech recognition software; when the audio start times, audio lengths, and corresponding participant information are combined into role information in order of audio start time, also merging the translation text into the role information, and also mapping each valid voice stream in the audio information to its translation text; and when the user triggers any file name, playing the audio information corresponding to that file name and displaying the translation text in sync with the corresponding role information.
In this embodiment, after the plurality of valid voice streams are intercepted from each voice stream in step S2, each valid voice stream is also converted into a translation text by preset speech recognition software. The speech recognition software decodes the valid voice stream with an acoustic model and applies a search algorithm with a language model to the decoded output to obtain the translation text. The acoustic model may be a neural network model, the language model may be an N-gram statistical model, and the search algorithm may be the Viterbi algorithm.
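The sketch below shows where the translation text attaches to the role information; recognize() is a placeholder for whatever preset speech recognition software (acoustic model, N-gram language model, Viterbi search) is actually deployed, not a real library call.

```python
# A minimal sketch; recognize() is a stand-in for the preset ASR engine.
def recognize(samples) -> str:
    """Placeholder: acoustic-model decoding plus a language-model search
    (e.g. N-gram with Viterbi) would run here."""
    raise NotImplementedError("plug in the speech recognition software here")

def add_translation_texts(roles, segments):
    """Attach a translation text to each role-information entry; the shared
    list positions preserve the voice stream <-> text mapping."""
    for role, samples in zip(roles, segments):
        role["translation_text"] = recognize(samples)
    return roles
```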
When step S3 merges the role information, the audio start time, audio length, corresponding participant information, and translation text are combined into one piece of role information.
The audio display of step S4 then also includes displaying the translation text. Because a mapping exists between each valid voice stream and its translation text, after the user clicks a passage of translation text, playback jumps to the corresponding valid voice stream, and the translation text and role information are displayed in sync.
This embodiment provides the translation text corresponding to each valid voice stream and displays it during audio display, so that the user can see the specific conference content at a glance.
In one embodiment, conference audio may also be retrieved:
receiving a search request sent by the user, acquiring a keyword, and searching the stored conference audio for the keyword; if it exists, displaying the file names of the conference audio corresponding to the keyword; and after the user triggers any file name, playing the audio information corresponding to that file name and displaying the corresponding role information and translation text.
In this embodiment, the search request and keyword can be received through the management interface of the conference system; the search request can also be made through an API connected to the conference system, or sent to the conference system as an http request.
The keyword may be an audio name, an audio start time, participant information, or a general term. The stored conference audio is searched for the keyword, and if it is found, the file names of all conference audio whose audio information or role information contains the keyword are displayed. For example, if the keyword is the general term "blockchain" and it is mentioned in a translation text of one participant and in another translation text of a second participant, the file names of the conference audio corresponding to both translation texts are displayed together. This embodiment provides the user with a search channel and further extended functionality.
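A sketch of the keyword search over stored conference audio, following the dict layout assumed in the earlier sketches: the file name, participant fields, and translation texts of every stored conference audio are scanned, and the matching file names are collected for display.

```python
# A minimal sketch of keyword retrieval; the storage layout is assumed.
def search_conferences(conferences, keyword: str) -> list:
    """Return the file names of all conference audio containing the keyword."""
    hits = []
    for conf in conferences:
        in_roles = any(
            keyword in str(role.get("participant", "")) or
            keyword in role.get("translation_text", "")
            for role in conf["roles"])
        if keyword in conf["file_name"] or in_roles:
            hits.append(conf["file_name"])
    return hits

# Example: search_conferences(stored, "blockchain") would list every saved
# conference whose name, role information, or translation text mentions it.
```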
In the conference character-division speech synthesis method of this embodiment, role relationships are assigned to the conference-room microphones, valid voice streams are intercepted segment by segment through breakpoint detection, and after the conference ends the valid voice streams are synthesized into conference audio in chronological order; the corresponding participant information and translation text are known for each segment, giving the user an intuitive view of the conference content.
In one embodiment, a conference character-division speech synthesis device is provided, as shown in fig. 3, comprising the following modules:
the information acquisition module, used to acquire the participant information input by the user and its association with microphones, each participant being associated with one microphone;
the voice stream receiving and storing module, used to receive a recording start signal, turn on a plurality of microphones, receive a plurality of voice streams through the microphones, perform breakpoint detection on each voice stream, intercept a plurality of valid voice streams, and store the valid voice streams together with their corresponding audio start times, audio lengths, and associated participant information, until an end-recording signal is received, and then turn off the microphones;
the conference audio generation module, used to synthesize the plurality of valid voice streams, starting from the earliest, into one piece of audio information in order of audio start time, combine the audio start times, audio lengths, and corresponding participant information into role information in the same order, map each valid voice stream in the audio information to its corresponding audio start time in the role information, and define the audio information and the role information together as conference audio for storage.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory storing computer readable instructions which, when executed by the processor, cause the processor to perform the steps of the conference character-division speech synthesis method of the embodiments above.
In one embodiment, a storage medium storing computer readable instructions is provided; when the instructions are executed by one or more processors, the one or more processors perform the steps of the conference character-division speech synthesis method of the embodiments above. The storage medium may be a non-volatile storage medium.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be implemented by a program to instruct related hardware, the program may be stored in a computer readable storage medium, and the storage medium may include: read Only Memory (ROM), random access Memory (RAM, random Access Memory), magnetic or optical disk, and the like.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features contains no contradiction, it should be considered within the scope of this description.
The above embodiments represent only some exemplary embodiments of the invention, described in greater detail, and are not to be construed as limiting the scope of the invention. It should be noted that those skilled in the art can make several variations and modifications without departing from the spirit of the invention, all of which fall within the scope of the invention. Accordingly, the scope of protection of the present invention is determined by the appended claims.

Claims (10)

1. A conference character-division speech synthesis method, comprising:
acquiring participant information input by the user and its association with microphones, wherein each participant is associated with one microphone; the participant information includes the participant's name, job number, or other unique identifier within the company, used to distinguish the participants;
receiving a recording start signal, turning on a plurality of microphones, receiving a plurality of voice streams through the microphones, performing breakpoint detection on each voice stream, intercepting a plurality of valid voice streams, and storing the valid voice streams together with their corresponding audio start times, audio lengths, and associated participant information, until an end-recording signal is received, and then turning off the microphones;
synthesizing the plurality of valid voice streams, starting from the earliest, into one piece of audio information in order of audio start time; combining the audio start times, audio lengths, and corresponding participant information into role information in the same order; mapping each valid voice stream in the audio information to its audio start time in the role information; and defining the audio information and the role information together as conference audio for storage;
the method for acquiring the information of the participants input by the user and the association relation with the microphones, wherein each participant is associated with one microphone, further comprises the following steps: receiving the information of the participants and the association relation between all the participants and the microphone, which are input by the user, through a preset management interface in the conference system; the user clicks the corresponding microphone to trigger an input interface, and inputs the information of the corresponding consultant through the input interface to complete the association relationship between the consultant and the microphone at the system level; and the microphones are connected with the conference system based on the sound receiving equipment, the MAC address of the sound receiving equipment is used as a unique identifier, the microphone names are corresponding to the corresponding MAC addresses, and then the physical association relation between the participants and the microphones is completed.
2. The conference character-division speech synthesis method according to claim 1, wherein the receiving a recording start signal, turning on a plurality of said microphones, receiving a plurality of voice streams through the microphones, performing breakpoint detection on each voice stream, intercepting a plurality of valid voice streams, storing the valid voice streams together with their audio start times, audio lengths, and associated participant information until an end-recording signal is received, and turning off the microphones, comprises:
receiving a recording start signal, starting the recording function for the plurality of associated microphones, and separately receiving the voice stream transmitted by each microphone;
performing breakpoint detection on each voice stream; if a breakpoint exists, intercepting a segment of valid voice stream, storing the intercepted valid voice stream together with its corresponding audio start time, audio length, and associated participant information in a storage medium, and continuing breakpoint detection on the current voice stream;
receiving an end-recording signal, and turning off the recording function for the plurality of associated microphones;
after the end-recording signal is received, if no breakpoint exists, intercepting the voice stream from the start of breakpoint detection to the end of the audio signal as a valid voice stream, and storing the valid voice stream together with its corresponding audio start time, audio length, and associated participant information in a storage medium.
3. The conference character-division speech synthesis method according to claim 2, wherein the performing breakpoint detection on each voice stream and intercepting a segment of valid voice stream if a breakpoint exists comprises:
dividing the voice stream into units of fixed duration, defining each unit as one frame of voice, and collecting the same number N of sampling points for each frame of voice;
calculating the energy value of each frame of voice according to the following formula:
E = \sum_{k=1}^{N} f_k^2
where E is the energy value of one frame of voice, f_k is the sample (peak) value of the k-th sampling point, and N is the total number of sampling points in one frame of voice;
if the energy values of M consecutive frames of voice are all higher than a preset threshold, defining the first frame of that run as the front breakpoint of a segment of audio; if the energy values then fall below the preset threshold and remain below it for a preset duration, defining the (M+1)-th frame, i.e. the first low-energy frame after the run, as the rear breakpoint of the segment; and intercepting the audio between the front breakpoint and the rear breakpoint as one segment of valid voice stream.
4. The conference character-division speech synthesis method according to claim 1, wherein the defining the audio information and the role information together as conference audio for saving comprises:
acquiring an audio name input by the user and renaming the conference audio file to that audio name before saving; if no audio name is acquired within a set time, renaming the conference audio file to the earliest audio start time before saving.
5. The conference character-division speech synthesis method according to claim 1, further comprising:
receiving an audio playback request sent by the user, and displaying the file names of the conference audio;
and after the user triggers any file name, playing the audio information corresponding to that file name and displaying the corresponding role information.
6. The conference character-division speech synthesis method according to claim 5, further comprising:
converting each valid voice stream in the audio information into a translation text through preset speech recognition software;
when the audio start times, audio lengths, and corresponding participant information are combined into role information in order of audio start time, also merging the translation text into the role information, and also mapping each valid voice stream in the audio information to its translation text;
and after the user triggers any file name, playing the audio information corresponding to that file name and displaying the translation text in sync with the corresponding role information.
7. The conference character-division speech synthesis method according to claim 6, further comprising:
receiving a search request sent by the user, acquiring a keyword, and searching the stored conference audio for the keyword; if it exists, displaying the file names of the conference audio corresponding to the keyword;
and after the user triggers any file name, playing the audio information corresponding to that file name and displaying the corresponding role information and translation text.
8. A conference character-division speech synthesis apparatus, comprising:
an information acquisition module, used to acquire the participant information input by the user and its association with microphones, each participant being associated with one microphone; the participant information includes the participant's name, job number, or other unique identifier within the company, used to distinguish the participants;
a voice stream receiving and storing module, used to receive a recording start signal, turn on a plurality of microphones, receive a plurality of voice streams through the microphones, perform breakpoint detection on each voice stream, intercept a plurality of valid voice streams, and store the valid voice streams together with their corresponding audio start times, audio lengths, and associated participant information, until an end-recording signal is received, and then turn off the microphones;
a conference audio generation module, used to synthesize the plurality of valid voice streams, starting from the earliest, into one piece of audio information in order of audio start time, combine the audio start times, audio lengths, and corresponding participant information into role information in the same order, map each valid voice stream in the audio information to its audio start time in the role information, and define the audio information and the role information together as conference audio for storage;
the apparatus further comprising an association information module, used to receive, through a preset management interface of the conference system, the participant information and the associations between all participants and microphones input by the user; the user clicks the corresponding microphone to trigger an input interface and enters the corresponding participant's information through it, completing the system-level association between participant and microphone; and the plurality of microphones are connected to the conference system through sound-receiving devices, the MAC address of each sound-receiving device serving as its unique identifier, and the microphone name being mapped to the corresponding MAC address, completing the physical association between participants and microphones.
9. A computer device comprising a memory and a processor, the memory storing computer readable instructions which, when executed by the processor, cause the processor to perform the steps of the conference character-division speech synthesis method of any one of claims 1 to 7.
10. A storage medium storing computer readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the conference character-division speech synthesis method of any one of claims 1 to 7.
CN201910424720.3A 2019-05-21 2019-05-21 Conference character-division speech synthesis method, device, computer equipment and storage medium Active CN110322869B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910424720.3A CN110322869B (en) 2019-05-21 2019-05-21 Conference character-division speech synthesis method, device, computer equipment and storage medium
PCT/CN2019/102448 WO2020232865A1 (en) 2019-05-21 2019-08-26 Meeting role-based speech synthesis method, apparatus, computer device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910424720.3A CN110322869B (en) 2019-05-21 2019-05-21 Conference character-division speech synthesis method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110322869A CN110322869A (en) 2019-10-11
CN110322869B (en) 2023-06-16

Family

ID=68113334

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910424720.3A Active CN110322869B (en) 2019-05-21 2019-05-21 Conference character-division speech synthesis method, device, computer equipment and storage medium

Country Status (2)

Country Link
CN (1) CN110322869B (en)
WO (1) WO2020232865A1 (en)

Families Citing this family (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110808062B (en) * 2019-11-26 2022-12-13 秒针信息技术有限公司 Mixed voice separation method and device
WO2021109000A1 (en) * 2019-12-03 2021-06-10 深圳市欢太科技有限公司 Data processing method and apparatus, electronic device, and storage medium
CN111128132A (en) * 2019-12-19 2020-05-08 秒针信息技术有限公司 Voice separation method, device and system and storage medium
CN111445920B (en) * 2020-03-19 2023-05-16 西安声联科技有限公司 Multi-sound source voice signal real-time separation method, device and pickup
CN111429914B (en) * 2020-03-30 2023-04-18 招商局金融科技有限公司 Microphone control method, electronic device and computer readable storage medium
CN113704312A (en) * 2020-05-21 2021-11-26 北京声智科技有限公司 Information processing method, device, medium and equipment
CN113782026A (en) * 2020-06-09 2021-12-10 北京声智科技有限公司 Information processing method, device, medium and equipment
CN113963452A (en) * 2020-07-02 2022-01-21 Oppo广东移动通信有限公司 Conference sign-in method and device and computer readable storage medium
CN111883168B (en) * 2020-08-04 2023-12-22 上海明略人工智能(集团)有限公司 Voice processing method and device
CN111968686B (en) * 2020-08-06 2022-09-30 维沃移动通信有限公司 Recording method and device and electronic equipment
CN111986715A (en) * 2020-08-19 2020-11-24 科大讯飞股份有限公司 Recording system and recording method
CN112185424A (en) * 2020-09-29 2021-01-05 国家计算机网络与信息安全管理中心 Voice file cutting and restoring method, device, equipment and storage medium
CN112270918A (en) * 2020-10-22 2021-01-26 北京百度网讯科技有限公司 Information processing method, device, system, electronic equipment and storage medium
CN112804401A (en) * 2020-12-31 2021-05-14 中国人民解放军战略支援部队信息工程大学 Conference role determination and voice acquisition control method and device
CN112908336A (en) * 2021-01-29 2021-06-04 深圳壹秘科技有限公司 Role separation method for voice processing device and voice processing device thereof
CN113055529B (en) * 2021-03-29 2022-12-13 深圳市艾酷通信软件有限公司 Recording control method and recording control device
CN113422865A (en) * 2021-06-01 2021-09-21 维沃移动通信有限公司 Directional recording method and device
CN113539269A (en) * 2021-07-20 2021-10-22 上海明略人工智能(集团)有限公司 Audio information processing method, system and computer readable storage medium
CN113708868B (en) * 2021-08-27 2023-06-27 国网安徽省电力有限公司池州供电公司 Dispatching system and dispatching method for multiple pickup devices
CN113723086B (en) * 2021-08-31 2023-09-05 平安科技(深圳)有限公司 Text processing method, system, equipment and medium
CN113542661A (en) * 2021-09-09 2021-10-22 北京鼎天宏盛科技有限公司 Video conference voice recognition method and system
US11838340B2 (en) * 2021-09-20 2023-12-05 International Business Machines Corporation Dynamic mute control for web conferencing
CN115242747A (en) * 2022-07-21 2022-10-25 维沃移动通信有限公司 Voice message processing method and device, electronic equipment and readable storage medium
CN116015996B (en) * 2023-03-28 2023-06-02 南昌航天广信科技有限责任公司 Digital conference audio processing method and system

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE102014004071A1 (en) * 2014-03-20 2015-09-24 Unify Gmbh & Co. Kg Method, device and system for controlling a conference
JP6746923B2 (en) * 2016-01-20 2020-08-26 株式会社リコー Information processing system, information processing apparatus, information processing method, and information processing program
US10553129B2 (en) * 2016-07-27 2020-02-04 David Nelson System and method for recording, documenting and visualizing group conversations
CN107564531A (en) * 2017-08-25 2018-01-09 百度在线网络技术(北京)有限公司 Minutes method, apparatus and computer equipment based on vocal print feature
CN108346034B (en) * 2018-02-02 2021-10-15 深圳市鹰硕技术有限公司 Intelligent conference management method and system
CN108305632B (en) * 2018-02-02 2020-03-27 深圳市鹰硕技术有限公司 Method and system for forming voice abstract of conference
CN109388701A (en) * 2018-08-17 2019-02-26 深圳壹账通智能科技有限公司 Minutes generation method, device, equipment and computer storage medium

Also Published As

Publication number Publication date
CN110322869A (en) 2019-10-11
WO2020232865A1 (en) 2020-11-26

Similar Documents

Publication Publication Date Title
CN110322869B (en) Conference character-division speech synthesis method, device, computer equipment and storage medium
CN108305632B (en) Method and system for forming voice abstract of conference
US11669683B2 (en) Speech recognition and summarization
CN110517689B (en) Voice data processing method, device and storage medium
CN108346034B (en) Intelligent conference management method and system
CN109493850B (en) Growing type dialogue device
WO2020233068A1 (en) Conference audio control method, system, device and computer readable storage medium
CN107274916B (en) Method and device for operating audio/video file based on voiceprint information
US9070369B2 (en) Real time generation of audio content summaries
US6434520B1 (en) System and method for indexing and querying audio archives
CN107562760B (en) Voice data processing method and device
Chaudhuri et al. Ava-speech: A densely labeled dataset of speech activity in movies
WO2019148586A1 (en) Method and device for speaker recognition during multi-person speech
US20110004473A1 (en) Apparatus and method for enhanced speech recognition
WO2005069171A1 (en) Document correlation device and document correlation method
US8719032B1 (en) Methods for presenting speech blocks from a plurality of audio input data streams to a user in an interface
US20130253932A1 (en) Conversation supporting device, conversation supporting method and conversation supporting program
US11823685B2 (en) Speech recognition
CN111415128A (en) Method, system, apparatus, device and medium for controlling conference
US20140114656A1 (en) Electronic device capable of generating tag file for media file based on speaker recognition
JP3437617B2 (en) Time-series data recording / reproducing device
CN109635151A (en) Establish the method, apparatus and computer equipment of audio retrieval index
CN113744742B (en) Role identification method, device and system under dialogue scene
JPH08249343A (en) Device and method for speech information acquisition
CN113889081A (en) Speech recognition method, medium, device and computing equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant