CN110322869B - Conference character-division speech synthesis method, device, computer equipment and storage medium - Google Patents

Conference character-division speech synthesis method, device, computer equipment and storage medium

Info

Publication number
CN110322869B
CN110322869B (application CN201910424720.3A)
Authority
CN
China
Prior art keywords
audio
information
conference
voice
microphones
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910424720.3A
Other languages
Chinese (zh)
Other versions
CN110322869A (en)
Inventor
岳鹏昱
闫冬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201910424720.3A
Priority to PCT/CN2019/102448
Publication of CN110322869A
Application granted
Publication of CN110322869B
Legal status: Active

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 - Detection of presence or absence of voice signals
    • G10L25/87 - Detection of discrete points within a voice signal
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 - Reducing energy consumption in communication networks
    • Y02D30/70 - Reducing energy consumption in communication networks in wireless communication networks

Abstract

The present invention relates to the field of artificial intelligence technologies, and in particular to a conference character-division speech synthesis method, apparatus, computer device, and storage medium. The method comprises the following steps: acquiring participant information input by a user and its association with microphones; receiving a plurality of voice streams through a plurality of microphones, performing breakpoint detection on each voice stream, intercepting a plurality of valid voice streams, and storing the valid voice streams together with their audio start times, audio lengths, and associated participant information; and synthesizing the valid voice streams into one piece of audio information, combining the audio start times, audio lengths, and corresponding participant information into role information, and defining the audio information and the role information together as conference audio for storage. According to the invention, participant information is assigned to each conference-room microphone and each audio segment corresponds to a participant, so that the speech content of every speaker during the conference can be easily determined.

Description

Conference character-division speech synthesis method, device, computer equipment and storage medium
Technical Field
The present invention relates to the field of artificial intelligence technologies, and in particular, to a conference character-division speech synthesis method, apparatus, computer device, and storage medium.
Background
As an economical and efficient conferencing solution, multimedia conferences are increasingly used in enterprises and have greatly improved the efficiency of communication and collaboration. As a means of multi-person communication, conference recording is often necessary, and recording a multimedia conference is one form of conference record. For example, a user may need to leave a conference temporarily without missing the important remarks of certain participants, or may simply want to keep a record of what certain participants said; in either case, the user starts recording the conference. However, current conference recording generally covers the whole conference process: once recording is started, all conference speech is recorded, recording cannot be restricted to designated participants, and recordings cannot be separated by participant role.
Disclosure of Invention
In view of this, to address the problem that conference recordings cannot be stored by speaker role, it is necessary to provide a conference character-division speech synthesis method, apparatus, computer device, and storage medium.
A conference character-division speech synthesis method, comprising:
acquiring participant information input by the user and its association with microphones, wherein each participant is associated with one microphone;
receiving a recording start signal, turning on a plurality of microphones, receiving a plurality of voice streams through the microphones, performing breakpoint detection on each voice stream, intercepting a plurality of valid voice streams, and storing the valid voice streams together with their corresponding audio start times, audio lengths, and associated participant information, until an end-recording signal is received, and then turning off the microphones;
and synthesizing the plurality of valid voice streams, starting from the earliest, into one piece of audio information in order of audio start time; combining the audio start times, audio lengths, and corresponding participant information into role information in the same order; mapping each valid voice stream in the audio information to its audio start time in the role information; and defining the audio information and the role information together as conference audio for storage.
In one possible design, the receiving a recording start signal, turning on a plurality of microphones, receiving a plurality of voice streams through the microphones, performing breakpoint detection on each voice stream, intercepting a plurality of valid voice streams, storing the valid voice streams together with their corresponding audio start times, audio lengths, and associated participant information until an end-recording signal is received, and turning off the microphones, includes:
receiving a recording start signal, starting the recording function for the plurality of associated microphones, and separately receiving the voice stream transmitted by each microphone;
performing breakpoint detection on each voice stream; if a breakpoint exists, intercepting a segment of valid voice stream, storing the intercepted valid voice stream together with its corresponding audio start time, audio length, and associated participant information in a storage medium, and continuing breakpoint detection on the current voice stream;
receiving an end-recording signal, and turning off the recording function for the plurality of associated microphones;
after the end-recording signal is received, if no breakpoint exists, intercepting the voice stream from the start of breakpoint detection to the end of the audio signal as a valid voice stream, and storing the valid voice stream together with its corresponding audio start time, audio length, and associated participant information in a storage medium.
In one possible design, the performing breakpoint detection on each voice stream and intercepting a segment of valid voice stream if a breakpoint exists includes:
dividing the voice stream into units of fixed duration, defining each unit as one frame of voice, and collecting the same number N of sampling points for each frame of voice;
calculating the energy value of each frame of voice according to the following formula:
E = \sum_{k=1}^{N} f_k^2
where E is the energy value of one frame of voice, f_k is the sample (peak) value of the k-th sampling point, and N is the total number of sampling points in one frame of voice;
if the energy values of M consecutive frames of voice are all higher than a preset threshold, defining the first frame of that run as the front breakpoint of a segment of audio; if the energy values then fall below the preset threshold and remain below it for a preset duration, defining the (M+1)-th frame, i.e. the first low-energy frame after the run, as the rear breakpoint of the segment; and intercepting the audio between the front breakpoint and the rear breakpoint as one segment of valid voice stream.
In one possible design, the defining the audio information and the role information together as conference audio for saving includes:
acquiring an audio name input by the user and renaming the conference audio file to that audio name before saving; if no audio name is acquired within a set time, renaming the conference audio file to the earliest audio start time before saving.
In one possible design, the method further comprises:
receiving an audio playback request sent by the user, and displaying the file names of the conference audio;
and after the user triggers any file name, playing the audio information corresponding to that file name and displaying the corresponding role information.
In one possible design, the method further comprises:
converting each valid voice stream in the audio information into a translation text through preset speech recognition software;
when the audio start times, audio lengths, and corresponding participant information are combined into role information in order of audio start time, also merging the translation text into the role information, and also mapping each valid voice stream in the audio information to its translation text;
and after the user triggers any file name, playing the audio information corresponding to that file name and displaying the translation text in sync with the corresponding role information.
In one possible design, the method further comprises:
receiving a search request sent by the user, acquiring a keyword, and searching the stored conference audio for the keyword; if it exists, displaying the file names of the conference audio corresponding to the keyword;
and after the user triggers any file name, playing the audio information corresponding to that file name and displaying the corresponding role information and translation text.
A conference character-division speech synthesis apparatus, comprising:
an information acquisition module, used to acquire the participant information input by the user and its association with microphones, each participant being associated with one microphone;
a voice stream receiving and storing module, used to receive a recording start signal, turn on a plurality of microphones, receive a plurality of voice streams through the microphones, perform breakpoint detection on each voice stream, intercept a plurality of valid voice streams, and store the valid voice streams together with their corresponding audio start times, audio lengths, and associated participant information, until an end-recording signal is received, and then turn off the microphones;
and a conference audio generation module, used to synthesize the plurality of valid voice streams, starting from the earliest, into one piece of audio information in order of audio start time, combine the audio start times, audio lengths, and corresponding participant information into role information in the same order, map each valid voice stream in the audio information to its audio start time in the role information, and define the audio information and the role information together as conference audio for storage.
A computer device comprising a memory and a processor, the memory storing computer readable instructions which, when executed by the processor, cause the processor to perform the steps of the conference character-division speech synthesis method described above.
A storage medium storing computer readable instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of the conference character-division speech synthesis method described above.
The conference character-division speech synthesis method, apparatus, computer device, and storage medium comprise: acquiring participant information input by the user and its association with microphones, each participant being associated with one microphone; receiving a recording start signal, turning on a plurality of microphones, receiving a plurality of voice streams through the microphones, performing breakpoint detection on each voice stream, intercepting a plurality of valid voice streams, and storing the valid voice streams together with their corresponding audio start times, audio lengths, and associated participant information, until an end-recording signal is received, and then turning off the microphones; and, in order of audio start time, combining the audio start times, audio lengths, and corresponding participant information into role information, mapping each valid voice stream in the audio information to its audio start time in the role information, and defining the audio information and the role information together as conference audio for storage. According to the invention, participant information is assigned to each conference-room microphone, audio is intercepted segment by segment through breakpoint (silence) detection, and after the conference ends the segments are synthesized into conference audio in chronological order with the role information of each segment known, so that the speech content of every speaker during the conference can be easily determined.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention.
FIG. 1 is a flow chart of a conference character-by-character speech synthesis method in one embodiment of the invention;
FIG. 2 is a flowchart of step S2 in an embodiment of the present invention;
fig. 3 is a block diagram of a conference character-by-character speech synthesis apparatus according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
Fig. 1 is a flowchart of a conference character-division speech synthesis method according to an embodiment of the present invention, as shown in fig. 1, and the conference character-division speech synthesis method includes the following steps:
step S1, information is acquired: and acquiring the information of the participants input by the user and the association relation with the microphones, wherein each participant is associated with one microphone.
In this step, the participant information and the associations between all participants and microphones input by the user can be received through a preset management interface of the conference system. The management interface presents a seating diagram of the conference room, on which the position of each microphone is marked. The user clicks the corresponding microphone to trigger an input interface and enters the corresponding participant's information through it, completing the system-level association between participant and microphone. The participant information may be the participant's name, job number, or other unique identifier within the company, used to distinguish the participants.
In this step, the plurality of microphones are connected to the conference system through Raspberry Pi-based sound-receiving devices. The MAC address of each sound-receiving device serves as its unique identifier, and the microphone name is mapped to the corresponding MAC address, completing the physical association between participants and microphones.
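To make the association concrete, the following minimal sketch shows one way the MAC-keyed registry described above could be structured; the class and method names (MicrophoneRegistry, associate, participant_for) and the sample MAC address and participant are illustrative assumptions, not the patent's actual implementation.

```python
# A minimal sketch, assuming MAC-keyed storage; names are illustrative.
from dataclasses import dataclass

@dataclass
class Participant:
    name: str        # the participant's name
    job_number: str  # unique identifier within the company

class MicrophoneRegistry:
    def __init__(self) -> None:
        # MAC address of the sound-receiving device -> (microphone name, participant)
        self._by_mac: dict[str, tuple[str, Participant]] = {}

    def associate(self, mac: str, mic_name: str, participant: Participant) -> None:
        """Bind one participant to one microphone, keyed by the device MAC."""
        self._by_mac[mac] = (mic_name, participant)

    def participant_for(self, mac: str) -> Participant:
        """Look up which participant speaks on the device with this MAC."""
        return self._by_mac[mac][1]

# The management interface would call associate() when the user clicks a
# microphone on the seating diagram and enters the participant's information.
registry = MicrophoneRegistry()
registry.associate("b8:27:eb:12:34:56", "mic-01", Participant("Zhang San", "E1001"))
```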
Step S2, receiving and storing voice streams: receiving a recording start signal, turning on a plurality of microphones, receiving a plurality of voice streams through the microphones, performing breakpoint detection on each voice stream, intercepting a plurality of valid voice streams, and storing the valid voice streams together with their corresponding audio start times, audio lengths, and associated participant information, until an end-recording signal is received, and then turning off the microphones.
In this step, a plurality of independent threads can be started so that the voice stream sent by each microphone is received independently, breakpoint detection is performed on each stream, and the valid voice streams are intercepted. When a valid voice stream is stored, the corresponding participant information is stored with it, which makes it easy to determine which participant uttered which valid voice stream.
In one embodiment, step S2, as shown in fig. 2, includes:
step S201, starting recording: a recording starting signal is received, a recording function is started for a plurality of associated microphones, and voice streams transmitted by each microphone are respectively received.
The step can receive the recording starting signal through the management interface of the conference system, automatically start the recording function for the associated microphone, and respectively receive the voice streams transmitted by a plurality of microphones.
Step S202, breakpoint detection and interception of valid voice streams: performing breakpoint detection on each voice stream; if a breakpoint exists, intercepting a segment of valid voice stream, storing the intercepted valid voice stream together with its corresponding audio start time, audio length, and associated participant information in a storage medium, and continuing breakpoint detection on the current voice stream.
Breakpoint detection is used to extract segments of valid voice stream from a continuous voice stream. It comprises detecting the starting point of a valid voice stream, the front breakpoint, and its ending point, the rear breakpoint. Separating valid voice streams from the continuous stream reduces the amount of stored data, and breakpoint detection can also simplify human-machine interaction: if desired, the end of recording can be determined directly from real-time breakpoint detection on the received voice stream, without receiving the end-recording signal of step S203.
In this step, when breakpoint detection is performed on the voice stream, in one embodiment, the following manner is adopted:
step S20201, dividing the voice stream: dividing the voice stream according to fixed duration, defining each dividing unit as one frame of voice, and collecting N sampling points with the same number for each frame of voice.
The fixed duration in this step may be 20 ms, 30 ms, and so on; the voice stream is divided into frames of voice accordingly. Since even the same participant may speak the same words at different volumes, the voice stream may be normalized before it is divided: take the sample with the largest amplitude in the stream, raise its amplitude to close to 1, record the scaling ratio, and stretch all other samples by the same ratio.
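As a concrete illustration, the sketch below implements the normalization and fixed-duration framing just described; it assumes the voice stream is already available as a NumPy array of samples, and the function names are illustrative assumptions.

```python
# A minimal sketch, assuming the stream is a NumPy float array; names are illustrative.
import numpy as np

def normalize(stream: np.ndarray) -> np.ndarray:
    """Scale the stream so its largest-magnitude sample is close to 1."""
    peak = np.max(np.abs(stream))
    if peak == 0.0:
        return stream            # pure silence: nothing to scale
    ratio = 0.99 / peak          # record the scaling ratio...
    return stream * ratio        # ...and stretch every sample by it

def split_frames(stream: np.ndarray, sample_rate: int, frame_ms: int = 20) -> list:
    """Divide the stream into fixed-duration frames (e.g. 20 ms or 30 ms),
    each containing the same number N of sampling points."""
    n = sample_rate * frame_ms // 1000   # N sampling points per frame
    return [stream[i:i + n] for i in range(0, len(stream) - n + 1, n)]
```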
Step S20202, calculating the energy value: the energy value of each frame of voice is calculated according to the following formula:
E = \sum_{k=1}^{N} f_k^2
where E is the energy value of one frame of voice, f_k is the sample (peak) value of the k-th sampling point, and N is the total number of sampling points in one frame of voice.
the energy value of a frame of speech is related to the size of the sampling value and the number of sampling points contained in the frame of speech, and the sampling value, namely the peak value, generally contains positive values and negative values, and the positive and negative values are not considered when the energy value is calculated, so that the square sum of the sampling values is used for defining the energy value of a frame of speech in the step.
Step S20203, determining the front and rear breakpoints: if the energy values of M consecutive frames of voice are all higher than a preset threshold, the first frame of that run is defined as the front breakpoint of a segment of audio; if the energy values then fall below the preset threshold and remain below it for a preset duration, the (M+1)-th frame, i.e. the first low-energy frame after the run, is defined as the rear breakpoint of the segment, and the audio between the front breakpoint and the rear breakpoint is intercepted as one segment of valid voice stream.
That is, if the energy of the first few frames of a voice stream is below the preset threshold and the energy of the next M consecutive frames is above it, the first frame whose energy rises above the threshold is the front breakpoint. If, after the M consecutive high-energy frames, the energy of the following frames drops below the threshold and stays low for the preset duration, the rear breakpoint is placed where the energy drops. The audio between the front and rear breakpoints is intercepted and stored as one segment of valid voice stream.
The shorter the audio duration covered by the M consecutive frames, the higher the sensitivity of breakpoint detection. Because conference recording often involves long stretches of speech that may contain fairly long pauses, the sensitivity should be reduced here: M can be set to a larger value, corresponding to an audio duration between 2000 ms and 2500 ms.
Ideal silence has an energy value of 0, so in the ideal case the preset threshold would be 0. In a real collected voice stream, however, there is always background sound of some intensity; it still counts as silence, but its energy value is clearly above 0, so in practice the preset threshold is not set to 0. The preset threshold may be dynamic: when breakpoint detection starts on a voice stream, the average energy value E0 of the opening portion of the stream is collected, for example the first 100 ms to 1000 ms, or the first 100 frames, and the preset threshold is obtained by adding an offset to E0 or multiplying E0 by a coefficient greater than 1.
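The following sketch ties these pieces together under the assumptions stated above: a dynamic threshold derived from the opening frames, a front breakpoint after M consecutive frames above the threshold, and a rear breakpoint after a sustained drop below it. Parameter names and default values are illustrative, not the patent's.

```python
# A minimal sketch of front/rear breakpoint detection; assumptions as stated above.
import numpy as np

def detect_segments(energies, m, hang_frames, warmup_frames=100, factor=1.5):
    """Return (front, rear) frame indices of each valid voice segment."""
    e0 = float(np.mean(energies[:warmup_frames]))  # background ("silence") energy
    threshold = e0 * factor                        # coefficient > 1 (or e0 + offset)
    segments, front, run, quiet = [], None, 0, 0
    for i, e in enumerate(energies):
        if e > threshold:
            run, quiet = run + 1, 0
            if front is None and run >= m:
                front = i - m + 1                  # first frame of the loud run
        else:
            run = 0
            if front is not None:
                quiet += 1
                if quiet >= hang_frames:           # low energy for the preset duration
                    segments.append((front, i - quiet + 1))  # rear breakpoint frame
                    front, quiet = None, 0
    if front is not None:                          # stream ended without a rear breakpoint
        segments.append((front, len(energies)))    # cf. step S204: keep audio to the end
    return segments
```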
In this embodiment, a single voice stream is divided into frames of voice, an energy value is calculated for each frame, breakpoints are determined from the energy values, and the single voice stream is cut into multiple valid voice streams; the silent portions are discarded and only the intercepted valid voice streams are stored, reducing storage pressure.
Step S203, ending the recording: receiving an end-recording signal, and turning off the recording function for the plurality of associated microphones.
This step can likewise receive the end-recording signal through the management interface of the conference system, automatically turn off the recording function of the associated microphones, and stop receiving voice streams.
Step S204, saving the valid voice stream: after the end-recording signal is received, if no breakpoint exists, intercepting the voice stream from the start of breakpoint detection to the end of the audio signal as a valid voice stream, and storing the valid voice stream together with its corresponding audio start time, audio length, and associated participant information in a storage medium.
Each voice stream goes through step S202, with breakpoint detection performed in real time and valid voice streams intercepted; after the end-recording signal is received, the detection of step S202 continues until the audio signal ends. During this process, if front and rear breakpoints exist, step S202 intercepts the valid voice stream. If no breakpoint exists, the audio from the start of breakpoint detection to the end of the audio signal is treated as one valid voice stream and is intercepted and stored.
In this embodiment, breakpoint detection and interception of valid voice streams are performed separately on each voice stream transmitted by the microphones until the end-recording signal is received and reception stops; each valid voice stream is stored together with its corresponding audio start time, audio length, and associated participant information, providing accurate data for the subsequent generation of role-distinguished conference audio.
Step S3, generating the conference audio: in order of audio start time, synthesizing the plurality of valid voice streams, starting from the earliest, into one piece of audio information; in the same order, combining the audio start times, audio lengths, and corresponding participant information into role information; mapping each valid voice stream in the audio information to its corresponding audio start time in the role information; and defining the audio information and the role information together as conference audio for storage.
When the conference audio is saved, the audio name input by the user is acquired and the conference audio file is renamed to that audio name before saving; if no audio name is acquired within a set time, the conference audio file is renamed to the earliest audio start time before saving.
This step can acquire the audio name input by the user through the management interface: after the user triggers the end-recording signal through the management interface, an input interface is displayed, and the user enters the audio name through it. If no input from the user is acquired within a set time, for example 5 minutes, the default name is used.
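A sketch of step S3 under assumed data structures: each stored record holds a valid voice stream's samples, start time (assumed to be a datetime), length, and participant; the records are sorted by start time, concatenated into one piece of audio information, and paired with role-information entries keyed by the same start times. The record layout and default-name format are assumptions for illustration.

```python
# A minimal sketch of conference-audio generation; the record layout is assumed.
import numpy as np

def build_conference_audio(records, audio_name=None):
    records = sorted(records, key=lambda r: r["start_time"])   # earliest first
    audio = np.concatenate([r["samples"] for r in records])    # one piece of audio information
    roles = [{"start_time": r["start_time"],                   # maps each role entry
              "length": r["length"],                           # back to its voice stream
              "participant": r["participant"]}
             for r in records]
    # Default file name: the earliest audio start time, used when the user
    # supplies no audio name within the set time (e.g. 5 minutes).
    name = audio_name or records[0]["start_time"].strftime("%Y%m%d-%H%M%S")
    return {"file_name": name, "audio": audio, "roles": roles}
```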
In one embodiment, the method further comprises step S4 of audio presentation:
step S401, receiving a request and displaying: and receiving an audio playback request sent by a user, and displaying the file name of the conference audio.
The user can make an audio playback request through an API connected to the conference system, or send the audio playback request to the conference system as an http request. After receiving the audio playback request, the conference system displays all stored conference audio, sorted by file name, for example in order of storage time or in descending alphabetical order of initial letters.
Step S402, playing the audio information and displaying role information in sync: after the user triggers any file name, playing the audio information corresponding to that file name and displaying the corresponding role information.
Because each valid voice stream in the audio information is mapped to its role information, this step can display the corresponding role information in sync while the audio information triggered by the user plays, showing the user which conference speaker each piece of audio belongs to.
This embodiment provides the user with an audio playback channel that displays role information in sync with playback; the user does not need to organize the conference content and can see at a glance which speaker each recorded passage belongs to.
In one embodiment, after step S2, the method further includes:
converting each valid voice stream in the audio information into a translation text by preset speech recognition software; when the audio start times, audio lengths, and corresponding participant information are combined into role information in order of audio start time, also merging the translation text into the role information, and also mapping each valid voice stream in the audio information to its translation text; and when the user triggers any file name, playing the audio information corresponding to that file name and displaying the translation text in sync with the corresponding role information.
In this embodiment, after the plurality of valid voice streams are intercepted from each voice stream in step S2, each valid voice stream is also converted into a translation text by preset speech recognition software. The speech recognition software decodes the valid voice stream with an acoustic model and applies a search algorithm with a language model to the decoded output to obtain the translation text. The acoustic model may be a neural network model, the language model may be an N-gram statistical model, and the search algorithm may be the Viterbi algorithm.
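The sketch below shows where the translation text attaches to the role information; recognize() is a placeholder for whatever preset speech recognition software (acoustic model, N-gram language model, Viterbi search) is actually deployed, not a real library call.

```python
# A minimal sketch; recognize() is a stand-in for the preset ASR engine.
def recognize(samples) -> str:
    """Placeholder: acoustic-model decoding plus a language-model search
    (e.g. N-gram with Viterbi) would run here."""
    raise NotImplementedError("plug in the speech recognition software here")

def add_translation_texts(roles, segments):
    """Attach a translation text to each role-information entry; the shared
    list positions preserve the voice stream <-> text mapping."""
    for role, samples in zip(roles, segments):
        role["translation_text"] = recognize(samples)
    return roles
```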
When step S3 merges the role information, the audio start time, audio length, corresponding participant information, and translation text are combined into one piece of role information.
The audio display of step S4 then also includes displaying the translation text. Because a mapping exists between each valid voice stream and its translation text, after the user clicks a passage of translation text, playback jumps to the corresponding valid voice stream, and the translation text and role information are displayed in sync.
This embodiment provides the translation text corresponding to each valid voice stream and displays it during audio display, so that the user can see the specific conference content at a glance.
In one embodiment, conference audio may also be retrieved:
receiving a search request sent by the user, acquiring a keyword, and searching the stored conference audio for the keyword; if it exists, displaying the file names of the conference audio corresponding to the keyword; and after the user triggers any file name, playing the audio information corresponding to that file name and displaying the corresponding role information and translation text.
In this embodiment, the search request and keyword can be received through the management interface of the conference system; the search request can also be made through an API connected to the conference system, or sent to the conference system as an http request.
The keyword may be an audio name, an audio start time, participant information, or a general term. The stored conference audio is searched for the keyword, and if it is found, the file names of all conference audio whose audio information or role information contains the keyword are displayed. For example, if the keyword is the general term "blockchain" and it is mentioned in a translation text of one participant and in another translation text of a second participant, the file names of the conference audio corresponding to both translation texts are displayed together. This embodiment provides the user with a search channel and further extended functionality.
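A sketch of the keyword search over stored conference audio, following the dict layout assumed in the earlier sketches: the file name, participant fields, and translation texts of every stored conference audio are scanned, and the matching file names are collected for display.

```python
# A minimal sketch of keyword retrieval; the storage layout is assumed.
def search_conferences(conferences, keyword: str) -> list:
    """Return the file names of all conference audio containing the keyword."""
    hits = []
    for conf in conferences:
        in_roles = any(
            keyword in str(role.get("participant", "")) or
            keyword in role.get("translation_text", "")
            for role in conf["roles"])
        if keyword in conf["file_name"] or in_roles:
            hits.append(conf["file_name"])
    return hits

# Example: search_conferences(stored, "blockchain") would list every saved
# conference whose name, role information, or translation text mentions it.
```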
In the conference character-division speech synthesis method of this embodiment, role relationships are assigned to the conference-room microphones, valid voice streams are intercepted segment by segment through breakpoint detection, and after the conference ends the valid voice streams are synthesized into conference audio in chronological order; the corresponding participant information and translation text are known for each segment, giving the user an intuitive view of the conference content.
In one embodiment, a conference character-division speech synthesis device is provided, as shown in fig. 3, comprising the following modules:
the information acquisition module, used to acquire the participant information input by the user and its association with microphones, each participant being associated with one microphone;
the voice stream receiving and storing module, used to receive a recording start signal, turn on a plurality of microphones, receive a plurality of voice streams through the microphones, perform breakpoint detection on each voice stream, intercept a plurality of valid voice streams, and store the valid voice streams together with their corresponding audio start times, audio lengths, and associated participant information, until an end-recording signal is received, and then turn off the microphones;
the conference audio generation module, used to synthesize the plurality of valid voice streams, starting from the earliest, into one piece of audio information in order of audio start time, combine the audio start times, audio lengths, and corresponding participant information into role information in the same order, map each valid voice stream in the audio information to its corresponding audio start time in the role information, and define the audio information and the role information together as conference audio for storage.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory storing computer readable instructions which, when executed by the processor, cause the processor to perform the steps of the conference character-division speech synthesis method of the embodiments above.
In one embodiment, a storage medium storing computer readable instructions is provided; when the instructions are executed by one or more processors, the one or more processors perform the steps of the conference character-division speech synthesis method of the embodiments above. The storage medium may be a non-volatile storage medium.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be implemented by a program to instruct related hardware, the program may be stored in a computer readable storage medium, and the storage medium may include: read Only Memory (ROM), random access Memory (RAM, random Access Memory), magnetic or optical disk, and the like.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features contains no contradiction, it should be considered within the scope of this description.
The above embodiments represent only some exemplary embodiments of the invention, described in greater detail, and are not to be construed as limiting the scope of the invention. It should be noted that those skilled in the art can make several variations and modifications without departing from the spirit of the invention, all of which fall within the scope of the invention. Accordingly, the scope of protection of the present invention is determined by the appended claims.

Claims (10)

1. A conference character-division speech synthesis method, comprising:
acquiring participant information input by the user and its association with microphones, wherein each participant is associated with one microphone; the participant information includes the participant's name, job number, or other unique identifier within the company, used to distinguish the participants;
receiving a recording start signal, turning on a plurality of microphones, receiving a plurality of voice streams through the microphones, performing breakpoint detection on each voice stream, intercepting a plurality of valid voice streams, and storing the valid voice streams together with their corresponding audio start times, audio lengths, and associated participant information, until an end-recording signal is received, and then turning off the microphones;
synthesizing the plurality of valid voice streams, starting from the earliest, into one piece of audio information in order of audio start time; combining the audio start times, audio lengths, and corresponding participant information into role information in the same order; mapping each valid voice stream in the audio information to its audio start time in the role information; and defining the audio information and the role information together as conference audio for storage;
the method for acquiring the information of the participants input by the user and the association relation with the microphones, wherein each participant is associated with one microphone, further comprises the following steps: receiving the information of the participants and the association relation between all the participants and the microphone, which are input by the user, through a preset management interface in the conference system; the user clicks the corresponding microphone to trigger an input interface, and inputs the information of the corresponding consultant through the input interface to complete the association relationship between the consultant and the microphone at the system level; and the microphones are connected with the conference system based on the sound receiving equipment, the MAC address of the sound receiving equipment is used as a unique identifier, the microphone names are corresponding to the corresponding MAC addresses, and then the physical association relation between the participants and the microphones is completed.
2. The conference character-division speech synthesis method according to claim 1, wherein the receiving a recording start signal, turning on a plurality of said microphones, receiving a plurality of voice streams through the microphones, performing breakpoint detection on each voice stream, intercepting a plurality of valid voice streams, storing the valid voice streams together with their audio start times, audio lengths, and associated participant information until an end-recording signal is received, and turning off the microphones, comprises:
receiving a recording start signal, starting the recording function for the plurality of associated microphones, and separately receiving the voice stream transmitted by each microphone;
performing breakpoint detection on each voice stream; if a breakpoint exists, intercepting a segment of valid voice stream, storing the intercepted valid voice stream together with its corresponding audio start time, audio length, and associated participant information in a storage medium, and continuing breakpoint detection on the current voice stream;
receiving an end-recording signal, and turning off the recording function for the plurality of associated microphones;
after the end-recording signal is received, if no breakpoint exists, intercepting the voice stream from the start of breakpoint detection to the end of the audio signal as a valid voice stream, and storing the valid voice stream together with its corresponding audio start time, audio length, and associated participant information in a storage medium.
3. The conference character-division speech synthesis method according to claim 2, wherein the performing breakpoint detection on each voice stream and intercepting a segment of valid voice stream if a breakpoint exists comprises:
dividing the voice stream into units of fixed duration, defining each unit as one frame of voice, and collecting the same number N of sampling points for each frame of voice;
calculating the energy value of each frame of voice according to the following formula:
E = \sum_{k=1}^{N} f_k^2
where E is the energy value of one frame of voice, f_k is the sample (peak) value of the k-th sampling point, and N is the total number of sampling points in one frame of voice;
if the energy values of M consecutive frames of voice are all higher than a preset threshold, defining the first frame of that run as the front breakpoint of a segment of audio; if the energy values then fall below the preset threshold and remain below it for a preset duration, defining the (M+1)-th frame, i.e. the first low-energy frame after the run, as the rear breakpoint of the segment; and intercepting the audio between the front breakpoint and the rear breakpoint as one segment of valid voice stream.
4. The conference character-division speech synthesis method according to claim 1, wherein the defining the audio information and the role information together as conference audio for saving comprises:
acquiring an audio name input by the user and renaming the conference audio file to that audio name before saving; if no audio name is acquired within a set time, renaming the conference audio file to the earliest audio start time before saving.
5. The conference character-division speech synthesis method according to claim 1, further comprising:
receiving an audio playback request sent by the user, and displaying the file names of the conference audio;
and after the user triggers any file name, playing the audio information corresponding to that file name and displaying the corresponding role information.
6. The conference character-division speech synthesis method according to claim 5, further comprising:
converting each valid voice stream in the audio information into a translation text through preset speech recognition software;
when the audio start times, audio lengths, and corresponding participant information are combined into role information in order of audio start time, also merging the translation text into the role information, and also mapping each valid voice stream in the audio information to its translation text;
and after the user triggers any file name, playing the audio information corresponding to that file name and displaying the translation text in sync with the corresponding role information.
7. The conference character-division speech synthesis method according to claim 6, further comprising:
receiving a search request sent by the user, acquiring a keyword, and searching the stored conference audio for the keyword; if it exists, displaying the file names of the conference audio corresponding to the keyword;
and after the user triggers any file name, playing the audio information corresponding to that file name and displaying the corresponding role information and translation text.
8. A conference character-division speech synthesis apparatus, comprising:
an information acquisition module, used to acquire the participant information input by the user and its association with microphones, each participant being associated with one microphone; the participant information includes the participant's name, job number, or other unique identifier within the company, used to distinguish the participants;
a voice stream receiving and storing module, used to receive a recording start signal, turn on a plurality of microphones, receive a plurality of voice streams through the microphones, perform breakpoint detection on each voice stream, intercept a plurality of valid voice streams, and store the valid voice streams together with their corresponding audio start times, audio lengths, and associated participant information, until an end-recording signal is received, and then turn off the microphones;
a conference audio generation module, used to synthesize the plurality of valid voice streams, starting from the earliest, into one piece of audio information in order of audio start time, combine the audio start times, audio lengths, and corresponding participant information into role information in the same order, map each valid voice stream in the audio information to its audio start time in the role information, and define the audio information and the role information together as conference audio for storage;
the apparatus further comprising an association information module, used to receive, through a preset management interface of the conference system, the participant information and the associations between all participants and microphones input by the user; the user clicks the corresponding microphone to trigger an input interface and enters the corresponding participant's information through it, completing the system-level association between participant and microphone; and the plurality of microphones are connected to the conference system through sound-receiving devices, the MAC address of each sound-receiving device serving as its unique identifier, and the microphone name being mapped to the corresponding MAC address, completing the physical association between participants and microphones.
9. A computer device comprising a memory and a processor, the memory storing computer readable instructions which, when executed by the processor, cause the processor to perform the steps of the conference character-division speech synthesis method of any one of claims 1 to 7.
10. A storage medium storing computer readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the conference character-division speech synthesis method of any one of claims 1 to 7.
CN201910424720.3A 2019-05-21 2019-05-21 Conference character-division speech synthesis method, device, computer equipment and storage medium Active CN110322869B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910424720.3A CN110322869B (en) 2019-05-21 2019-05-21 Conference character-division speech synthesis method, device, computer equipment and storage medium
PCT/CN2019/102448 WO2020232865A1 (en) 2019-05-21 2019-08-26 Meeting role-based speech synthesis method, apparatus, computer device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910424720.3A CN110322869B (en) 2019-05-21 2019-05-21 Conference character-division speech synthesis method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110322869A CN110322869A (en) 2019-10-11
CN110322869B (en) 2023-06-16

Family

ID=68113334

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910424720.3A Active CN110322869B (en) 2019-05-21 2019-05-21 Conference character-division speech synthesis method, device, computer equipment and storage medium

Country Status (2)

Country Link
CN (1) CN110322869B (en)
WO (1) WO2020232865A1 (en)

Families Citing this family (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110808062B (en) * 2019-11-26 2022-12-13 秒针信息技术有限公司 Mixed voice separation method and device
WO2021109000A1 (en) * 2019-12-03 2021-06-10 深圳市欢太科技有限公司 Data processing method and apparatus, electronic device, and storage medium
CN111128132A (en) * 2019-12-19 2020-05-08 秒针信息技术有限公司 Voice separation method, device and system and storage medium
CN111445920B (en) * 2020-03-19 2023-05-16 西安声联科技有限公司 Multi-sound source voice signal real-time separation method, device and pickup
CN111429914B (en) * 2020-03-30 2023-04-18 招商局金融科技有限公司 Microphone control method, electronic device and computer readable storage medium
CN113704312A (en) * 2020-05-21 2021-11-26 北京声智科技有限公司 Information processing method, device, medium and equipment
CN113782026A (en) * 2020-06-09 2021-12-10 北京声智科技有限公司 Information processing method, device, medium and equipment
CN113963452A (en) * 2020-07-02 2022-01-21 Oppo广东移动通信有限公司 Conference sign-in method and device and computer readable storage medium
CN111883168B (en) * 2020-08-04 2023-12-22 上海明略人工智能(集团)有限公司 Voice processing method and device
CN111968686B (en) * 2020-08-06 2022-09-30 维沃移动通信有限公司 Recording method and device and electronic equipment
CN111986715A (en) * 2020-08-19 2020-11-24 科大讯飞股份有限公司 Recording system and recording method
CN112185424A (en) * 2020-09-29 2021-01-05 国家计算机网络与信息安全管理中心 Voice file cutting and restoring method, device, equipment and storage medium
CN112270918A (en) * 2020-10-22 2021-01-26 北京百度网讯科技有限公司 Information processing method, device, system, electronic equipment and storage medium
CN112804401A (en) * 2020-12-31 2021-05-14 中国人民解放军战略支援部队信息工程大学 Conference role determination and voice acquisition control method and device
CN112908336A (en) * 2021-01-29 2021-06-04 深圳壹秘科技有限公司 Role separation method for voice processing device and voice processing device thereof
CN113055529B (en) * 2021-03-29 2022-12-13 深圳市艾酷通信软件有限公司 Recording control method and recording control device
CN113422865A (en) * 2021-06-01 2021-09-21 维沃移动通信有限公司 Directional recording method and device
CN113539269A (en) * 2021-07-20 2021-10-22 上海明略人工智能(集团)有限公司 Audio information processing method, system and computer readable storage medium
CN113708868B (en) * 2021-08-27 2023-06-27 国网安徽省电力有限公司池州供电公司 Dispatching system and dispatching method for multiple pickup devices
CN113723086B (en) * 2021-08-31 2023-09-05 平安科技(深圳)有限公司 Text processing method, system, equipment and medium
CN113542661A (en) * 2021-09-09 2021-10-22 北京鼎天宏盛科技有限公司 Video conference voice recognition method and system
US11838340B2 (en) * 2021-09-20 2023-12-05 International Business Machines Corporation Dynamic mute control for web conferencing
CN115242747A (en) * 2022-07-21 2022-10-25 维沃移动通信有限公司 Voice message processing method and device, electronic equipment and readable storage medium
CN116015996B (en) * 2023-03-28 2023-06-02 南昌航天广信科技有限责任公司 Digital conference audio processing method and system

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE102014004071A1 (en) * 2014-03-20 2015-09-24 Unify Gmbh & Co. Kg Method, device and system for controlling a conference
JP6746923B2 (en) * 2016-01-20 2020-08-26 株式会社リコー Information processing system, information processing apparatus, information processing method, and information processing program
US10553129B2 (en) * 2016-07-27 2020-02-04 David Nelson System and method for recording, documenting and visualizing group conversations
CN107564531A (en) * 2017-08-25 2018-01-09 百度在线网络技术(北京)有限公司 Minutes method, apparatus and computer equipment based on vocal print feature
CN108346034B (en) * 2018-02-02 2021-10-15 深圳市鹰硕技术有限公司 Intelligent conference management method and system
CN108305632B (en) * 2018-02-02 2020-03-27 深圳市鹰硕技术有限公司 Method and system for forming voice abstract of conference
CN109388701A (en) * 2018-08-17 2019-02-26 深圳壹账通智能科技有限公司 Minutes generation method, device, equipment and computer storage medium

Also Published As

Publication number Publication date
CN110322869A (en) 2019-10-11
WO2020232865A1 (en) 2020-11-26

Similar Documents

Publication Publication Date Title
CN110322869B (en) Conference character-division speech synthesis method, device, computer equipment and storage medium
CN108305632B (en) Method and system for forming voice abstract of conference
US11669683B2 (en) Speech recognition and summarization
CN110517689B (en) Voice data processing method, device and storage medium
CN108346034B (en) Intelligent conference management method and system
CN109493850B (en) Growing type dialogue device
WO2020233068A1 (en) Conference audio control method, system, device and computer readable storage medium
CN107274916B (en) Method and device for operating audio/video file based on voiceprint information
US9070369B2 (en) Real time generation of audio content summaries
US6434520B1 (en) System and method for indexing and querying audio archives
CN107562760B (en) Voice data processing method and device
Chaudhuri et al. Ava-speech: A densely labeled dataset of speech activity in movies
WO2019148586A1 (en) Method and device for speaker recognition during multi-person speech
US20110004473A1 (en) Apparatus and method for enhanced speech recognition
WO2005069171A1 (en) Document correlation device and document correlation method
US8719032B1 (en) Methods for presenting speech blocks from a plurality of audio input data streams to a user in an interface
US20130253932A1 (en) Conversation supporting device, conversation supporting method and conversation supporting program
US11823685B2 (en) Speech recognition
CN111415128A (en) Method, system, apparatus, device and medium for controlling conference
US20140114656A1 (en) Electronic device capable of generating tag file for media file based on speaker recognition
JP3437617B2 (en) Time-series data recording / reproducing device
CN109635151A (en) Establish the method, apparatus and computer equipment of audio retrieval index
CN113744742B (en) Role identification method, device and system under dialogue scene
JPH08249343A (en) Device and method for speech information acquisition
CN113889081A (en) Speech recognition method, medium, device and computing equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant