CN111739553A - Conference sound acquisition method, conference recording method, conference record presentation method and device - Google Patents


Info

Publication number
CN111739553A
CN111739553A (application CN202010497438.0A)
Authority
CN
China
Prior art keywords
conference
record
sound data
voice
identifier
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010497438.0A
Other languages
Chinese (zh)
Other versions
CN111739553B
Inventor
张铖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Weiai Intelligent Co ltd
Original Assignee
Shenzhen Weiai Intelligent Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Weiai Intelligent Co ltd filed Critical Shenzhen Weiai Intelligent Co ltd
Priority to CN202010497438.0A
Publication of CN111739553A
Application granted
Publication of CN111739553B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0272: Voice signal separating
    • G10L 15/00: Speech recognition
    • G10L 15/08: Speech classification or search
    • G10L 15/18: Speech classification or search using natural language modelling
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 17/00: Speaker identification or verification techniques
    • G10L 17/04: Training, enrolment or model building
    • G10L 17/22: Interactive procedures; Man-machine interfaces
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04R: LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R 1/00: Details of transducers, loudspeakers or microphones
    • H04R 1/08: Mouthpieces; Microphones; Attachments therefor
    • H04R 2410/00: Microphones

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The disclosure provides conference sound collection, conference recording, and conference record presentation methods and devices. One embodiment of the conference recording method includes: receiving sound data sent by a conference sound collection terminal; performing voice separation on the sound data; generating a conference record for each piece of separated sound data, where each such conference record includes the separated sound data together with the corresponding speaking content text and speaker identity information; and sending each generated conference record to the conference record presenting terminals corresponding to a target conference identifier, where the target conference identifier is the current conference identifier of the conference sound collection terminal that sent the sound data, and each conference record triggers the receiving conference record presenting terminal to present it. This embodiment enables the contributions of multiple people speaking simultaneously in a conference to be recorded separately.

Description

Conference sound acquisition method, conference recording method, conference record presentation method and device
Technical Field
The disclosure relates to the field of computer technology, and in particular to a conference sound acquisition method, a conference recording method, a conference record presentation method, and corresponding devices.
Background
Speech recognition is now widely used in fields such as in-vehicle systems, smart speakers, and smart homes. Triggering machine-executable instructions through speech recognition can greatly improve efficiency, free users' hands, and enhance the product experience. As speech recognition accuracy has improved, converting speech into text has become an increasingly common requirement in everyday conference systems. For example, conference recording products such as personal smart recording pens can send recorded speech to a server for conversion into text, making it convenient to retrieve and consult.
Disclosure of Invention
The disclosure provides conference sound collection, conference recording and conference recording presentation methods and devices.
In a first aspect, the present disclosure provides a conference sound collection method applied to a conference sound collection terminal provided with a microphone array, the method including: acquiring, in real time, sound data collected by the microphone array; and sending the sound data to a conference recording server. The sound data triggers the conference recording server to perform voice separation on it, to generate, for each piece of separated sound data, a conference record that includes the separated sound data together with the corresponding speaking content text and speaker identity information, and to send each generated conference record to the conference record presenting terminals corresponding to the current conference identifier of the conference sound collection terminal; each conference record triggers the receiving conference record presenting terminal to present it.
In some optional embodiments, the conference sound collection terminal is further provided with at least one speaker direction indicator light, and the conference sound collection method further includes: estimating the arrival angle of the sound data; and, for each estimated arrival angle, determining the speaker direction indicator light corresponding to that angle according to a preset correspondence between arrival angles and indicator light identifiers, and turning on the determined indicator light for a first preset duration.
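The angle-to-light correspondence above can be sketched as a simple lookup. This is a minimal illustration only: the number of lights, the equal-arc layout, and the function name are assumptions, not details given in the disclosure.

```python
# Sketch of mapping an estimated direction-of-arrival angle (degrees) to a
# speaker direction indicator light index. Assumes the lights are arranged
# in a ring, each covering an equal, non-overlapping arc of the circle.

LIGHT_COUNT = 8  # assumed number of indicator lights on the terminal

def light_for_angle(angle_deg: float, light_count: int = LIGHT_COUNT) -> int:
    """Return the index of the indicator light covering the arrival angle.

    Light i covers the arc [i * 360/light_count, (i + 1) * 360/light_count).
    """
    arc = 360.0 / light_count
    return int((angle_deg % 360.0) // arc)
```

With eight lights each arc spans 45 degrees, so an arrival angle of 100 degrees maps to light 2; negative angles wrap around the circle.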
In some optional embodiments, sending the sound data to a conference recording server includes: compressing the sound data and sending the compressed sound data to the conference recording server.
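The compression step can be sketched as follows. zlib is used here purely as an illustrative codec; the disclosure does not name one, and a real terminal would more likely use a speech-oriented codec such as Opus.

```python
# Minimal sketch of compressing captured audio before upload and
# decompressing it server-side. zlib is an assumption for illustration.
import zlib

def compress_frames(pcm_bytes: bytes, level: int = 6) -> bytes:
    """Compress raw PCM audio bytes prior to sending them to the server."""
    return zlib.compress(pcm_bytes, level)

def decompress_frames(payload: bytes) -> bytes:
    """Server-side inverse of compress_frames."""
    return zlib.decompress(payload)
```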
In a second aspect, the present disclosure provides a conference sound collection device applied to a conference sound collection terminal provided with a microphone array, the conference sound collection device including: a sound data acquisition unit configured to acquire sound data acquired by the microphone array in real time; and the sound data sending unit is configured to send the sound data to a conference record server, the sound data is used for triggering the conference record server to perform voice separation on the sound data, generating conference records corresponding to each separated sound data and including the separated sound data and a speaking content text and speaker identity information corresponding to the separated sound data, and sending each generated conference record to each conference record presenting terminal corresponding to a current conference identifier corresponding to the conference sound collecting terminal, wherein each conference record is used for triggering each conference record presenting terminal receiving each conference record to present each conference record.
In some optional embodiments, the conference sound collection terminal is further provided with at least one speaker direction indicator light; and the conference sound collection device further comprises: an arrival angle estimation unit configured to estimate an arrival angle of the sound data; and the indicator lamp turning-on unit is configured to determine the speaker direction indicator lamp corresponding to the arrival angle according to the corresponding relation between the preset arrival angle and the speaker direction indicator lamp identifier for each estimated arrival angle, and turn on the determined speaker direction indicator lamp for a first preset time.
In some optional embodiments, the sound data transmitting unit is further configured to: and compressing the sound data and sending the compressed sound data to the conference recording server.
It should be noted that, for details of implementation and technical effects of each unit in the conference sound acquisition apparatus provided by the present disclosure, reference may be made to relevant descriptions of other embodiments in the present disclosure, and details are not described herein again.
In a third aspect, the present disclosure provides a conference recording method applied to a conference recording server, the method including: receiving sound data sent by a conference sound collection terminal; performing voice separation on the sound data; generating a conference record for each piece of separated sound data, where each such conference record includes the separated sound data together with the corresponding speaking content text and speaker identity information; and sending each generated conference record to the conference record presenting terminals corresponding to a target conference identifier, where the target conference identifier is the current conference identifier of the conference sound collection terminal that sent the sound data, and each conference record triggers the receiving conference record presenting terminal to present it.
In some optional embodiments, performing voice separation on the sound data includes: performing voice separation on the received sound data to generate a preset number of pieces of separated sound data, where the generated pieces of separated sound data correspond one-to-one to the sound source direction ranges in a preset set of sound source direction ranges, and the sound source direction ranges in the preset set do not overlap one another.
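The routing of separated audio to a fixed number of direction-keyed streams can be sketched as below. The four quadrant ranges and the frame format are illustrative assumptions; the actual separation algorithm (e.g. microphone-array beamforming) is outside the scope of this sketch.

```python
# Sketch of assigning audio frames to one of a preset number of output
# streams, one per non-overlapping sound source direction range.

# Four non-overlapping direction ranges (degrees) covering the full circle.
DIRECTION_RANGES = [(0, 90), (90, 180), (180, 270), (270, 360)]

def route_frame(frame: bytes, doa_deg: float,
                streams: list[list[bytes]]) -> None:
    """Append a frame to the stream whose range contains its direction."""
    for i, (lo, hi) in enumerate(DIRECTION_RANGES):
        if lo <= doa_deg % 360 < hi:
            streams[i].append(frame)
            return
```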
In some optional embodiments, the generating a conference record corresponding to each separated sound data includes: for each of the generated separated sound data, in response to determining that valid speech is present in the separated sound data, performing the following conference recording generation operations: respectively carrying out voice recognition and voiceprint recognition on the separated voice data to obtain a recognition text and speaker identity information; in response to determining that the separated voice data is a voice starting point, establishing a current voice and a current speaking text corresponding to the target conference identifier and the obtained speaker identity information; splicing the obtained identification text to the tail part of the current speaking text corresponding to the target conference identifier and the obtained speaker identity information, and splicing the separated voice data to the tail part of the current voice corresponding to the target conference identifier and the obtained speaker identity information; and generating a conference record corresponding to the separated voice data by using the current voice and the current speaking text corresponding to the target conference identifier and the obtained speaker identity information and the determined speaker identity information.
In some optional embodiments, the conference record generation operation further includes: in response to determining that the separated sound data is a speech end point, generating a historical conference record using the current speech and current speaking text corresponding to the target conference identifier and the obtained speaker identity information, and storing the generated record as a historical conference record corresponding to the target conference identifier.
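The generation operation above behaves like a small state machine: a speech start point opens a "current" buffer keyed by conference and speaker, each recognized segment is spliced onto its tail, and a speech end point flushes the buffer into the history. The following is a sketch under assumed data shapes; the dictionary stores and field names are illustrative, not part of the disclosure.

```python
# Sketch of the conference-record generation operation: open a buffer at a
# speech start point, splice audio and recognized text onto its tail, and
# flush it to history at a speech end point.

current: dict[tuple[str, str], dict] = {}  # (conference id, speaker) -> buffers
history: dict[str, list[dict]] = {}        # conference id -> stored records

def on_segment(conf_id: str, speaker: str, audio: bytes, text: str,
               is_start: bool, is_end: bool) -> dict:
    key = (conf_id, speaker)
    if is_start:
        current[key] = {"audio": b"", "text": ""}
    buf = current.setdefault(key, {"audio": b"", "text": ""})
    buf["audio"] += audio   # splice separated audio onto the tail
    buf["text"] += text     # splice recognized text onto the tail
    record = {"speaker": speaker, "audio": buf["audio"], "text": buf["text"]}
    if is_end:
        # Speech end point: persist the buffer as a historical record.
        history.setdefault(conf_id, []).append(current.pop(key))
    return record
```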
In some optional embodiments, the conference recording method further includes: in response to receiving a speaking content text update request sent by a conference record presenting terminal, updating the speaking content text in the historical conference record identified by the conference record identifier in the request to the modified speaking content text carried in the request. The speaking content text update request is sent by the conference record presenting terminal to the conference recording server upon detecting a modification operation on the speaking content text of a presented historical conference record, and includes the modified speaking content text corresponding to that operation and the conference record identifier of the historical conference record to which it applies.
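Server-side, the update amounts to a lookup by record identifier followed by an overwrite. The in-memory store and function name below are assumptions for illustration; a real server would back this with a database.

```python
# Sketch of handling a speaking content text update request: find the
# historical record by its identifier and overwrite its text field.

records: dict[str, dict] = {}  # record id -> historical conference record

def update_text(record_id: str, new_text: str) -> bool:
    """Apply the edited text; return False if the record is unknown."""
    rec = records.get(record_id)
    if rec is None:
        return False
    rec["text"] = new_text
    return True
```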
In some optional embodiments, performing speech recognition on the separated sound data includes: performing speech recognition on the separated sound data based on a speech recognition model; and the conference recording method further includes: in response to determining that a preset speech recognition model update condition is met, updating the speech recognition model based on the sound data and the corresponding speaking content text in the stored historical conference records whose speaking content text has been modified.
In some optional embodiments, the conference record generating operation further includes: and in response to determining that the separated voice data is a voice starting point, determining the current time as a speaking starting time corresponding to the target conference identifier and the obtained speaker identity information.
In some optional embodiments, the generating a conference record corresponding to the separated sound data using the current speech and the current utterance text corresponding to the target conference identifier and the obtained speaker identification information and the determined speaker identification information includes: generating a conference record corresponding to the separated voice data by using the speaking start time, the current voice and the current speaking text corresponding to the target conference identifier and the obtained speaker identity information and the determined speaker identity information; and/or generating a historical conference record by using the current voice and the current speaking text corresponding to the target conference identifier and the obtained speaker identity information and the determined speaker identity information, and storing the generated historical conference record as the historical conference record corresponding to the target conference identifier, comprising: and generating a historical conference record by using the speaking starting time, the current voice and the current speaking text corresponding to the target conference identifier and the obtained speaker identity information and the determined speaker identity information, and storing the generated historical conference record as the historical conference record corresponding to the target conference identifier.
In some optional embodiments, performing speech recognition and voiceprint recognition on the separated sound data to obtain the recognition text and the speaker identity information includes: sending the separated sound data to a speech recognition server and to a voiceprint recognition server, where the separated sound data triggers the speech recognition server to perform speech recognition on the received data and return a recognition result, and triggers the voiceprint recognition server to perform voiceprint recognition on the received data and return a recognition result; and taking the result received from the speech recognition server as the recognition text and the result received from the voiceprint recognition server as the speaker identity information.
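Since the two servers work independently, the recording server can query them in parallel and pair the results. The sketch below uses stand-in functions in place of the network calls; their names and return shapes are assumptions.

```python
# Sketch of fanning the separated audio out to a speech recognition server
# and a voiceprint recognition server concurrently, then pairing results.
from concurrent.futures import ThreadPoolExecutor

def asr_server(audio: bytes) -> str:         # stand-in for the ASR service
    return "recognized text"

def voiceprint_server(audio: bytes) -> str:  # stand-in for the voiceprint service
    return "speaker-42"

def recognize(audio: bytes) -> tuple[str, str]:
    """Return (recognition text, speaker identity) from both servers."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        text = pool.submit(asr_server, audio)
        who = pool.submit(voiceprint_server, audio)
        return text.result(), who.result()
```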
In some alternative embodiments, at least one of the conference recording server, the voice recognition server, and the voiceprint recognition server is configured as a private deployment server based on security and/or privacy requirements.
In some optional embodiments, the conference recording method further includes: in response to receiving a conference record consultation request, sent by a conference record presenting terminal, that includes a conference identifier to be consulted and a consultant identifier, determining whether the consultant identifier belongs to the participant identifier set corresponding to the conference identifier to be consulted; and, in response to determining that it does, acquiring the historical conference records corresponding to the conference identifier to be consulted and sending them to the conference record presenting terminal that sent the request.
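The permission check can be sketched as a membership test against the stored participant set. The in-memory stores below are illustrative assumptions.

```python
# Sketch of the consultation check: serve the history only if the
# requester belongs to the meeting's participant identifier set.

participants: dict[str, set[str]] = {}  # conference id -> participant ids
archives: dict[str, list[str]] = {}     # conference id -> historical records

def consult(conf_id: str, requester_id: str):
    """Return the conference history, or None if the requester lacks access."""
    if requester_id not in participants.get(conf_id, set()):
        return None
    return archives.get(conf_id, [])
```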
In some optional embodiments, the conference recording method further includes: the method comprises the steps of responding to a received conference reservation request which comprises a participant identification set and is sent by a conference record presenting terminal, generating a conference identification, storing the participant identification set in the conference reservation request as the participant identification set corresponding to the generated conference identification, and returning the generated conference identification to the conference record presenting terminal which sends the conference reservation request.
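The reservation flow above reduces to minting a fresh conference identifier, storing the participant set under it, and returning the identifier. `uuid4` is one reasonable way to generate the identifier; it and the store below are assumptions for illustration.

```python
# Sketch of handling a conference reservation request: generate a
# conference identifier, remember the participant set, return the id.
import uuid

reservations: dict[str, set[str]] = {}  # conference id -> participant ids

def reserve(participant_ids: set[str]) -> str:
    conf_id = uuid.uuid4().hex
    reservations[conf_id] = set(participant_ids)
    return conf_id
```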
In a fourth aspect, the present disclosure provides a conference recording apparatus applied to a conference recording server, where the conference recording apparatus includes: a sound data receiving unit configured to receive sound data transmitted by the conference sound collecting terminal; a voice separating unit configured to separate voice of the voice data; the conference record generating unit is configured to generate a conference record corresponding to each separated sound data, wherein the conference record corresponding to each separated sound data comprises the separated sound data, and a speaking content text and speaker identity information corresponding to the separated sound data; and a conference record sending unit configured to send each generated conference record to a conference record presenting terminal corresponding to a target conference identifier, where the target conference identifier is a current conference identifier corresponding to a conference sound collecting terminal that sends the sound data, and each conference record is used to trigger a conference record presenting terminal that receives each conference record to present each conference record.
In some optional embodiments, the voice separation unit is further configured to: perform voice separation on the received sound data to generate a preset number of pieces of separated sound data, where the generated pieces of separated sound data correspond one-to-one to the sound source direction ranges in a preset set of sound source direction ranges, and the sound source direction ranges in the preset set do not overlap one another.
In some optional embodiments, the conference record generating unit is further configured to: for each of the generated separated sound data, in response to determining that valid speech is present in the separated sound data, performing the following conference recording generation operations: respectively carrying out voice recognition and voiceprint recognition on the separated voice data to obtain a recognition text and speaker identity information; in response to determining that the separated voice data is a voice starting point, establishing a current voice and a current speaking text corresponding to the target conference identifier and the obtained speaker identity information; splicing the obtained identification text to the tail part of the current speaking text corresponding to the target conference identifier and the obtained speaker identity information, and splicing the separated voice data to the tail part of the current voice corresponding to the target conference identifier and the obtained speaker identity information; and generating a conference record corresponding to the separated voice data by using the current voice and the current speaking text corresponding to the target conference identifier and the obtained speaker identity information and the determined speaker identity information.
In some optional embodiments, the conference record generating operation further includes: in response to determining that the isolated acoustic data is a speech endpoint, generating a historical meeting record using current speech and current spoken text corresponding to the target meeting identification and the obtained speaker identification information and the determined speaker identification information, and storing the generated historical meeting record as a historical meeting record corresponding to the target meeting identification.
In some optional embodiments, the conference recording apparatus further includes: and the speaking content text updating unit is configured to respond to a received speaking content text updating request sent by a conference record presenting terminal, wherein the speaking content text updating request is sent by the conference record presenting terminal to the conference record server in response to the detection of the modification operation on the speaking content text in the presented historical conference record, the speaking content text updating request comprises the modified speaking content text corresponding to the modification operation and the conference record identification of the historical conference record corresponding to the modification operation, and the speaking content text in the historical conference record corresponding to the conference record identification in the speaking content text updating request is updated to the speaking content text in the speaking content text updating request.
In some optional embodiments, the performing voice recognition on the separated sound data includes: performing voice recognition on the separated sound data based on a voice recognition model; and the conference recording apparatus further comprises: and a speech recognition model updating unit configured to update the speech recognition model based on the sound data in the stored historic conference records in which the spoken content text is modified and the corresponding spoken content text in response to determining that a preset speech recognition model updating condition is satisfied.
In some optional embodiments, the conference record generating operation further includes: and in response to determining that the separated voice data is a voice starting point, determining the current time as a speaking starting time corresponding to the target conference identifier and the obtained speaker identity information.
In some optional embodiments, the generating a conference record corresponding to the separated sound data using the current speech and the current utterance text corresponding to the target conference identifier and the obtained speaker identification information and the determined speaker identification information includes: generating a conference record corresponding to the separated voice data by using the speaking start time, the current voice and the current speaking text corresponding to the target conference identifier and the obtained speaker identity information and the determined speaker identity information; and/or generating a historical conference record by using the current voice and the current speaking text corresponding to the target conference identifier and the obtained speaker identity information and the determined speaker identity information, and storing the generated historical conference record as the historical conference record corresponding to the target conference identifier, comprising: and generating a historical conference record by using the speaking starting time, the current voice and the current speaking text corresponding to the target conference identifier and the obtained speaker identity information and the determined speaker identity information, and storing the generated historical conference record as the historical conference record corresponding to the target conference identifier.
In some optional embodiments, the performing speech recognition and voiceprint recognition on the separated voice data to obtain the recognition text and the speaker identity information respectively includes: respectively sending the separated voice data to a voice recognition server and a voiceprint recognition server, wherein the separated voice data is used for triggering the voice recognition server to perform voice recognition on the received voice data and return a recognition result, and is used for triggering the voiceprint recognition server to perform voiceprint recognition on the received voice data and return a recognition result; and determining the recognition result received from the voice recognition server and the recognition result received from the voiceprint recognition server as a recognition text and speaker identity information obtained by performing voice recognition and voiceprint recognition on the separated voice data, respectively.
In some alternative embodiments, at least one of the conference recording server, the voice recognition server, and the voiceprint recognition server is configured as a private deployment server based on security and/or privacy requirements.
In some optional embodiments, the conference recording apparatus further includes: the system comprises a consulting person identification determining unit, a meeting record consulting device and a meeting person identification determining unit, wherein the consulting person identification determining unit is configured to respond to a meeting record consulting request which is sent by a meeting record presenting terminal and comprises a meeting identification to be consulted and a consulting person identification, and determine whether the consulting person identification belongs to a meeting person identification set corresponding to the meeting identification to be consulted; and the historical conference record acquisition and transmission unit is configured to respond to the determination of belonging, acquire the historical conference record corresponding to the conference identifier to be referred, and transmit the acquired historical conference record to the conference record presentation terminal which transmits the conference record reference request.
In some optional embodiments, the conference recording apparatus further includes: the conference reservation unit is configured to respond to a received conference reservation request which comprises a participant identification set and is sent by a conference record presenting terminal, generate a conference identification, store the participant identification set in the conference reservation request as the participant identification set corresponding to the generated conference identification, and return the generated conference identification to the conference record presenting terminal which sends the conference reservation request.
In a fifth aspect, the present disclosure provides a method for presenting a conference record, which is applied to a terminal for presenting a conference record, and the method for presenting a conference record includes: and presenting the received conference record in response to receiving the conference record sent by the conference record server, wherein the received conference record is a corresponding conference record generated by the conference record server aiming at each separated sound data after the conference record server performs voice separation on the sound data received from the conference sound acquisition terminal, and the conference record corresponding to each separated sound data comprises the separated sound data, and a speaking content text and speaker identity information corresponding to the separated sound data.
In some optional embodiments, the method for presenting a conference record further includes: in response to detecting a meeting record consulting request which is input by a user and comprises a meeting identifier to be consulted and a consultant identifier, sending the meeting record consulting request to a meeting record server, wherein the meeting record consulting request is used for triggering the meeting record server to respond to determining that the consultant identifier belongs to a meeting participant identifier set corresponding to the meeting identifier to be consulted, acquiring a historical meeting record corresponding to the meeting identifier to be consulted, and sending the acquired historical meeting record to a meeting record presenting terminal sending the meeting record consulting request; and presenting the received historical conference record in response to receiving the historical conference record sent by the conference record server in response to the conference record consulting request.
In some optional embodiments, the method for presenting a conference record further includes: and in response to detecting a modification operation on the spoken content text in the presented historical conference record, sending a spoken content text updating request to the conference record server, wherein the spoken content text updating request comprises the modified spoken content text corresponding to the modification operation and the conference record identifier of the historical conference record to which the modification operation is directed, and the spoken content text updating request is used for triggering the conference record server to update the spoken content text in the historical conference record corresponding to the conference record identifier in the spoken content text updating request into the spoken content text in the spoken content text updating request.
In some optional embodiments, the presenting the received meeting record includes: correspondingly presenting at least one of: the method comprises the steps that a speaking content text, speaker identity information and a sound playing icon associated with separated sound data in a received conference record are received in the conference record; in response to detecting a preset operation for the displayed sound playing icon, playing the separated sound data associated with the sound playing icon for which the detected preset operation is directed.
In some optional embodiments, the playing the separated sound data associated with the sound playing icon at which the detected preset operation is directed includes: playing the separated sound data associated with the sound playing icon at which the detected preset operation is directed, and displaying, during playback, playing progress indication information corresponding to the playing process.
In some alternative embodiments, the conference record further includes a speaking start time; and the correspondingly presenting at least one of the speaking content text, the speaker identity information, and the sound playing icon associated with the separated sound data in the received conference record includes: correspondingly presenting at least one of: the speaking start time, the speaking content text, the speaker identity information, and the sound playing icon associated with the separated sound data in the received conference record.
In some optional embodiments, the method for presenting a conference record further includes: the method comprises the steps of responding to a conference reservation request which is input by a user and comprises a conference participant identification set, sending the conference reservation request to a conference recording server, wherein the conference reservation request is used for triggering the conference recording server to generate a conference identification, storing the conference participant identification set in the conference reservation request as the conference participant identification set corresponding to the generated conference identification, and returning the generated conference identification to a conference recording presentation terminal which sends the conference reservation request.
In a sixth aspect, the present disclosure provides a conference record presenting apparatus, applied to a conference record presenting terminal, the apparatus including: a conference record presenting unit configured to present the received conference record in response to receiving a conference record sent by the conference record server, where the received conference record is a conference record generated by the conference record server for each separated sound data after the conference record server performs voice separation on the sound data received from the conference sound collection terminal, and the conference record corresponding to each separated sound data includes the separated sound data and the speaking content text and speaker identity information corresponding to the separated sound data.
In some optional embodiments, the conference record presenting apparatus further includes: a conference record consulting request sending unit configured to, in response to detecting a conference record consulting request input by a user and including a conference identifier to be consulted and a consultant identifier, send the conference record consulting request to the conference record server, where the conference record consulting request is used to trigger the conference record server, in response to determining that the consultant identifier belongs to the participant identifier set corresponding to the conference identifier to be consulted, to acquire a historical conference record corresponding to the conference identifier to be consulted and send the acquired historical conference record to the conference record presenting terminal that sent the conference record consulting request; and a historical conference record receiving and presenting unit configured to present the received historical conference record in response to receiving the historical conference record sent by the conference record server in response to the conference record consulting request.
In some optional embodiments, the conference record presenting apparatus further includes: and the speaking content text updating request sending unit is configured to respond to the detection of a modification operation on the speaking content text in the presented historical conference record, and send a speaking content text updating request to the conference record server, wherein the speaking content text updating request comprises the modified speaking content text corresponding to the modification operation and the conference record identifier of the historical conference record corresponding to the modification operation, and the speaking content text updating request is used for triggering the conference record server to update the speaking content text in the historical conference record corresponding to the conference record identifier in the speaking content text updating request into the speaking content text in the speaking content text updating request.
In some optional embodiments, the presenting the received meeting record includes: correspondingly presenting at least one of: the method comprises the steps that a speaking content text, speaker identity information and a sound playing icon associated with separated sound data in a received conference record are received in the conference record; in response to detecting a preset operation for the displayed sound playing icon, playing the separated sound data associated with the sound playing icon for which the detected preset operation is directed.
In some optional embodiments, the playing the separated sound data associated with the sound playing icon at which the detected preset operation is directed includes: playing the separated sound data associated with the sound playing icon at which the detected preset operation is directed, and displaying, during playback, playing progress indication information corresponding to the playing process.
In some alternative embodiments, the conference record further includes a speaking start time; and the correspondingly presenting at least one of the speaking content text, the speaker identity information, and the sound playing icon associated with the separated sound data in the received conference record includes: correspondingly presenting at least one of: the speaking start time, the speaking content text, the speaker identity information, and the sound playing icon associated with the separated sound data in the received conference record.
In some optional embodiments, the conference record presenting apparatus further includes: the conference reservation request sending unit is configured to respond to a conference reservation request which is input by a user and comprises a conference participant identification set, send the conference reservation request to the conference recording server, wherein the conference reservation request is used for triggering the conference recording server to generate a conference identification, store the conference participant identification set in the conference reservation request as the conference participant identification set corresponding to the generated conference identification, and return the generated conference identification to a conference recording presentation terminal which sends the conference reservation request.
In a seventh aspect, the present disclosure provides a conference sound collecting terminal, including: the microphone array is used for collecting sound data; one or more processors; a storage device, on which one or more programs are stored, which, when executed by the one or more processors, cause the one or more processors to implement the method as described in any one of the embodiments of the first aspect.
In an eighth aspect, the present disclosure provides a conference recording server, comprising: one or more processors; a storage device having one or more programs stored thereon; the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method as described in any of the embodiments of the third aspect.
In some optional embodiments, the conference recording server is configured as a private deployment server according to security and/or privacy requirements.
In a ninth aspect, the present disclosure provides a conference record presenting terminal, including: one or more processors; a storage device having one or more programs stored thereon; when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method described in any of the embodiments of the fifth aspect.
In a tenth aspect, the present disclosure provides a computer readable storage medium having a computer program stored thereon, wherein the computer program when executed by one or more processors implements the method as described in any one of the embodiments of the first aspect, or the method as described in any one of the embodiments of the third aspect, or the method as described in any one of the embodiments of the fifth aspect.
In an eleventh aspect, the present disclosure provides a conference recording system, including the conference recording server as described in any embodiment of the eighth aspect, at least one conference sound collection terminal as described in any embodiment of the seventh aspect, and at least one conference record presentation terminal as described in any embodiment of the ninth aspect.
In some optional embodiments, the conference recording system further comprises a voice recognition server and a voiceprint recognition server, wherein the voice recognition server is configured to perform voice recognition on the separated voice data received from the conference recording server and send a recognized speaking content text to the conference recording server, and the voiceprint recognition server is configured to perform voiceprint recognition on the separated voice data received from the conference recording server and send recognized speaker identification information to the conference recording server.
In some optional embodiments, the voice recognition server and/or the voiceprint recognition server are configured as private deployment servers according to security and/or privacy requirements.
Most existing conference recording products collect voice data in real time, upload the collected voice data to a cloud server, obtain the corresponding speech content through voice recognition on the cloud server, and return the recognized speech content to the product. The applicant has found that existing conference recording products recognize speech well in single-speaker scenarios but often exhibit a low recognition rate when multiple people speak. This is because, when many people talk, the voices of several people may be mixed together in the collected audio, and current conference recording products do not separate the voices and recognize them individually but directly recognize the mixed audio, resulting in a low recognition rate. In addition, current conference recording products do not identify the speaker, so a conference record contains only the speech content and no speaker identity information; that is, the conference record content is relatively limited.
To solve the technical problems discovered by the applicant, the conference sound collection method and apparatus, conference recording method and apparatus, and conference record presentation method and apparatus provided by the present disclosure dispose a microphone array in a conference sound collection terminal; after a conference starts, the conference sound collection terminal collects sound data of the conference site in real time and sends the sound data collected in real time to a conference record server. The conference record server performs voice separation on the received sound data, generates, for each separated sound data, a conference record including the separated sound data, the speaking content text corresponding to the separated sound data, and the speaker identity information, and then sends the generated conference records to the conference record presenting terminals corresponding to the current conference identifier corresponding to the conference sound collection terminal. The conference record presenting terminal may present each received conference record. The technical effects may include, but are not limited to, the following:
First, the conference record server performs voice separation on the received sound data and then performs voice recognition on each separated sound data independently, which can improve the accuracy of the speaking content text obtained by voice recognition.
Second, because the conference record server performs voice separation on the received sound data and generates a corresponding conference record for each separated sound data, each conference record includes the separated sound data. When the conference record is consulted, the speaking content of each person can be consulted independently and each person's sound data can be played back separately, rather than the voices of multiple people being mixed together, which improves the distinguishability of the conference sound content and the identification of speaker identity.
Third, the conference record generated by the conference record server includes not only the separated sound data and the corresponding speaking content text but also the speaker identity information, which enriches the content of the conference record; accordingly, the conference record content that the user can receive at the conference record presenting terminal is also enriched. That is, not only the speaking content of the participants can be recorded, but also the identity information of the participant corresponding to each segment of speaking content.
Drawings
Other features, objects and advantages of the disclosure will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture diagram for one embodiment of a conference recording system according to the present disclosure;
fig. 2A and 2C are timing diagrams of one embodiment of a conference recording system according to the present disclosure;
FIG. 2B is an exploded flow diagram for one embodiment of a meeting record generation operation in accordance with the present disclosure;
FIG. 3 is a flow diagram for one embodiment of a conference sound collection method according to the present disclosure;
FIG. 4 is a flow diagram for one embodiment of a conference recording method according to the present disclosure;
FIG. 5 is a flow diagram for one embodiment of a method of presenting a conference recording according to the present disclosure;
FIG. 6 is a schematic block diagram of one embodiment of a conference sound collection apparatus according to the present disclosure;
FIG. 7 is a schematic block diagram of one embodiment of a conference recording apparatus according to the present disclosure;
FIG. 8 is a schematic block diagram of one embodiment of a conference record presentation apparatus according to the present disclosure;
FIG. 9 is a schematic block diagram of a computer system suitable for use in implementing the conference sound collection terminal of the present disclosure;
fig. 10 is a schematic block diagram of a computer system suitable for implementing a conference recording server or a conference recording presentation terminal of the present disclosure.
Detailed Description
The present disclosure is described in further detail below with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that, in the present disclosure, the embodiments and features of the embodiments may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 illustrates an exemplary system architecture 100 of one embodiment of a conference recording system according to the present disclosure.
As shown in fig. 1, the system architecture 100 may include conference sound collection terminals 1011, 1012, 1013, a network 102, a conference recording server 103, a network 104, and conference recording presentation terminals 1051, 1052, 1053.
The network 102 is a medium to provide a communication link between the conference sound collecting terminals 1011, 1012, 1013 and the conference recording server 103. Network 102 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The network 104 is used to provide a medium of communication links between the conference recording presentation terminals 1051, 1052, 1053 and the conference recording server 103. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The conference sound collection terminals 1011, 1012, 1013 may each be provided with a microphone array. Here, a microphone array refers to a sound collection system composed of a number of acoustic sensors (typically microphones) that collect sound from different spatial directions.
In some alternative embodiments, in order to improve the spatial directional coverage of the sound collected by the microphone array, the number of the acoustic sensors in the microphone array may be greater than or equal to 4.
Various arrangements of the acoustic sensors in the microphone array are possible. In some alternative embodiments, in order to improve the spatial directional coverage of the sound collected by the microphone array, the acoustic sensors in the microphone array may be uniformly distributed on a circumference.
In order to enable the conference sound collection terminals 1011, 1012, and 1013 to send the sound data collected in real time to the conference recording server 103, the conference sound collection terminals 1011, 1012, and 1013 may be further provided with a communication unit in addition to the microphone array, for implementing data interaction with the conference recording server. The communication units in the conference sound collection terminals 1011, 1012, 1013 may include wireless communication devices and/or wired communication devices. The wireless communication device may include various local wireless communication modules such as a Wi-Fi module, a bluetooth module, and a Zigbee module. The local wireless communication module can be connected to other electronic devices (e.g., a remote connection or a local connection) through a relay network device, such as a Wi-Fi router, a bluetooth repeater, a Zigbee base station, etc. The wireless communication device may also include various wide area wireless communication modules based on 2G (EDGE, CDMA 1X), 3G (TD-SCDMA, CDMA EVDO, WCDMA), 4G (LTE, WiMAX), 5G, etc. The wide area wireless communication module is connected to other electronic equipment through a communication network accessed by the wide area wireless communication module. A wired communication device (e.g., a wired network card) may be connected to a router or modem (modem) via a network cable to connect to other electronic devices.
In some optional embodiments, the conference sound collection terminals 1011, 1012, 1013 may each be further provided with at least one speaker direction indicator lamp for indicating the direction of the current speaker. Here, the speaker direction indicator lamp may take various forms, such as various LED (Light Emitting Diode) lamps.
The user can interact with the conference recording server 103 through the network 102 via the conference sound collection terminals 1011, 1012, 1013 to send sound data of the user's conference to the conference recording server 103 for storage and processing.
The user may also use the conference record rendering terminals 1051, 1052, 1053 to interact with the conference record server 103 via the network 104 to receive or send messages or the like. The conference record presenting terminals 1051, 1052, 1053 may have installed thereon various communication client applications, such as a conference record presenting application, a web browser application, a shopping application, a search application, an instant messaging tool, a mailbox client, social platform software, and the like.
The conference record presentation terminals 1051, 1052, 1053 may be hardware or software. When the conference recording presentation terminal 1051, 1052, 1053 is hardware, it may be a variety of electronic devices having a display screen and supporting information input, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like. When the conference record presenting terminals 1051, 1052, 1053 are software, they may be installed in the electronic devices listed above. It may be implemented as a plurality of software or software modules (for example to provide services) or as a single software or software module. And is not particularly limited herein.
The conference recording server 103 may be a server that provides various services. For example, the conference record server 103 may provide voice separation for the sound data received from the conference sound collecting terminals 1011, 1012, 1013, generate a conference record corresponding to each separated sound data, and send each generated conference record to the conference record presenting terminals 1051, 1052, 1053 for corresponding presentation.
The conference recording server 103 may be hardware or software. When the conference recording server 103 is hardware, it may be implemented as a distributed server cluster formed by a plurality of servers, or may be implemented as a single server, or may be implemented as a cloud computing center. When the conference recording server 103 is software, it may be implemented as a plurality of software or software modules (for example, to provide distributed services), or may be implemented as a single software or software module. And is not particularly limited herein.
It should be noted that the conference sound collection method provided by the present disclosure is generally executed by the conference sound collection terminals 1011, 1012, 1013, and accordingly, the conference sound collection device is generally disposed in the conference sound collection terminals 1011, 1012, 1013.
It should be noted that the conference recording method provided by the present disclosure is generally executed by the conference recording server 103, and accordingly, the conference recording apparatus is generally disposed in the conference recording server 103.
It should be noted that the conference record presenting method provided by the present disclosure is generally executed by the conference record presenting terminals 1051, 1052, 1053, and accordingly, the conference record presenting apparatus is generally disposed in the conference record presenting terminals 1051, 1052, 1053.
It should be understood that the number of conference sound collection terminals, networks, conference recording servers, and conference recording presentation terminals in fig. 1 are merely illustrative. There may be any number of conference sound collection terminals, networks, conference recording servers, and conference recording presentation terminals, as desired for implementation.
With continued reference to fig. 2, a timing sequence 200 for one embodiment of a conference recording system according to the present disclosure is shown.
The conference recording system in the embodiment of the present disclosure may include a conference recording server, at least one conference sound collection terminal, and at least one conference recording presentation terminal. Wherein, the conference sound collecting terminal can be provided with a microphone array.
As shown in fig. 2, a time sequence 200 according to one embodiment of the conference recording system of the present disclosure may include the steps of:
step 201, the conference sound collection terminal obtains sound data collected by the microphone array in real time.
In this embodiment, the microphone array disposed in the conference sound collection terminal can, in a working state, collect sound data of the surrounding environment in real time, so the conference sound collection terminal can obtain, in real time, the sound data collected by its microphone array.
In practice, the user may set the conference sound collection terminal to the working state before the conference. For example, this may be achieved by connecting the terminal to power and switching it on. The conference participants can then begin the conference discussion and speak around the conference sound collection terminal, so the terminal can obtain the sound data collected by the microphone array in real time.
Step 202, the conference sound collection terminal sends the sound data to the conference recording server.
In this embodiment, the conference sound collection terminal may transmit sound data collected in real time from the microphone array to the conference recording server.
In some alternative embodiments, the conference sound collection terminal may send the sound data collected in real time from the microphone array directly to the conference recording server.
In some optional embodiments, the conference sound collecting terminal may also compress the sound data collected in real time from the microphone array and send the compressed sound data to the conference recording server. Therefore, the data transmission quantity between the conference sound collection terminal and the conference recording server can be reduced, and the requirement on the network bandwidth between the conference sound collection terminal and the conference recording server can be further reduced. On the other hand, as the data volume to be transmitted is reduced, the transmission speed of the conference sound acquisition terminal for sending the sound data to the conference recording server can be increased, and the real-time processing performance of the conference recording server is improved.
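The compress-before-upload option above can be illustrated with a minimal sketch using Python's standard `zlib`; the chunk size, function names, and use of lossless zlib compression are illustrative assumptions (a real terminal might instead use a dedicated voice codec), but the point is the same: the bytes sent over the network are smaller than the raw PCM.

```python
import zlib

def compress_chunk(pcm_bytes: bytes, level: int = 6) -> bytes:
    """Terminal side: compress one chunk of raw PCM audio before upload."""
    return zlib.compress(pcm_bytes, level)

def decompress_chunk(payload: bytes) -> bytes:
    """Server side: recover the raw PCM chunk from the received payload."""
    return zlib.decompress(payload)

# Hypothetical chunk: 100 ms of silent 16-bit mono audio at 16 kHz.
raw = b"\x00\x00" * 1600
sent = compress_chunk(raw)
assert decompress_chunk(sent) == raw   # lossless round trip
assert len(sent) < len(raw)            # fewer bytes on the wire
```

Lossless compression guarantees the server reconstructs exactly the data the microphone array produced, which matters because the same bytes are later fed to voice separation and recognition.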
In some optional embodiments, the conference sound collecting terminal may further be provided with at least one speaker direction indicator light, and the timing sequence 200 may further include, after step 201 or after step 202, performing step 202A and step 202B:
step 202A, the conference sound collection terminal performs arrival angle estimation on the sound data.
Here, the conference sound collection terminal may perform arrival angle estimation on the sound data acquired in step 201 using various known or future-developed microphone-array-based sound source localization methods. For example, microphone-array-based sound source localization methods may include, but are not limited to: Time Difference of Arrival (TDOA), Generalized Cross-Correlation (GCC), High-Resolution Spectral Estimation (HRSE), and the like.
It should be noted that, in practice, when a microphone-array-based sound source localization method is used to estimate arrival angles from the sound data, at least one candidate arrival angle is generally computed together with related parameters such as a confidence, an energy intensity, or a speech density corresponding to each candidate. The estimated arrival angle(s) may then be selected from the candidates in one of several ways: the candidates whose corresponding confidence, energy intensity, or speech density exceeds a corresponding preset threshold may be retained; or the single candidate with the greatest confidence, energy intensity, or speech density may be retained; or a preset number (a positive integer) of candidates with the greatest confidence, energy intensity, or speech density may be retained.
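The three candidate-selection strategies above can be sketched as follows. This is a minimal illustration only; the candidate list, confidence scores, and function name are hypothetical, and a real direction-of-arrival estimator would produce the (angle, confidence) pairs.

```python
def select_angles(candidates, threshold=None, top_k=None):
    """Select estimated arrival angles from DOA candidates.

    candidates: list of (angle_deg, confidence) pairs produced by a
    microphone-array localization method (e.g. TDOA/GCC).
    threshold: keep only candidates whose confidence exceeds it.
    top_k: keep the k highest-confidence candidates (k=1 gives the
    single best angle, matching the "greatest confidence" strategy).
    """
    # Rank candidates by confidence, highest first.
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    if threshold is not None:
        ranked = [c for c in ranked if c[1] > threshold]
    if top_k is not None:
        ranked = ranked[:top_k]
    return [angle for angle, _ in ranked]

# Hypothetical candidates: three directions with confidence scores.
cands = [(30.0, 0.9), (120.0, 0.4), (210.0, 0.7)]
print(select_angles(cands, threshold=0.5))  # [30.0, 210.0]
print(select_angles(cands, top_k=1))        # [30.0]
```

The same function covers the thresholding and top-k strategies; applying both arguments together first filters by threshold and then truncates to k results.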
Step 202B, the conference sound collection terminal determines the speaker direction indicator corresponding to each estimated arrival angle according to the corresponding relation between the preset arrival angle and the speaker direction indicator mark, and turns on the determined speaker direction indicator for a first preset time.
Here, the correspondence between arrival angles and speaker direction indicator lamps may be set in advance. For example, the conference sound collection terminal may be provided with 12 speaker direction indicator lamps, and the range from 0° to 360° may be evenly divided into 12 angular ranges, the n-th of which covers angles greater than (n-1)×30° and less than or equal to n×30°, where n is a positive integer from 1 to 12. Speaker direction indicator lamp n may then be associated with the n-th angular range; that is, lamp n is turned on if the estimated arrival angle falls within the range ((n-1)×30°, n×30°].
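The angle-to-lamp mapping described above reduces to a simple bucketing computation. The sketch below assumes the 12-lamp layout of the example (the `num_leds` parameter is a hypothetical generalization) and returns the 1-based lamp index n whose range ((n-1)×30°, n×30°] contains the angle.

```python
import math

def indicator_for_angle(angle_deg: float, num_leds: int = 12) -> int:
    """Map an estimated arrival angle to a speaker direction
    indicator lamp index n in 1..num_leds, where lamp n covers the
    half-open range ((n-1)*step, n*step] degrees."""
    step = 360.0 / num_leds
    angle = angle_deg % 360.0
    if angle == 0.0:
        angle = 360.0  # 0 deg (i.e. 360 deg) belongs to the last range
    return math.ceil(angle / step)

print(indicator_for_angle(15.0))   # 1  -> range (0, 30]
print(indicator_for_angle(30.0))   # 1  -> boundary belongs to lamp 1
print(indicator_for_angle(31.0))   # 2
print(indicator_for_angle(360.0))  # 12
```

Using `math.ceil` on `angle / step` implements the "greater than (n-1)×30° and less than or equal to n×30°" boundary convention directly, with the 0°/360° wrap handled as a special case.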
Here, the first preset duration may be set in advance. In practice, it may be determined according to the time interval at which the microphone array in the conference sound collection terminal collects sound in real time.
Step 203, the conference recording server receives the sound data sent by the conference sound collection terminal.
In this embodiment, the conference recording server may receive sound data sent by the conference sound collection terminal.
In some optional embodiments, if the sound data sent by the conference sound collection terminal to the conference recording server is not compressed, the conference recording server may directly receive the sound data sent by the conference sound collection terminal.
In some optional embodiments, if the sound data sent by the conference sound collection terminal to the conference recording server is compressed, the conference recording server may first receive the data sent by the conference sound collection terminal, and then decompress the received data using the decompression method corresponding to the compression method adopted by the conference sound collection terminal, so as to obtain the sound data.
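The application does not specify a compression scheme; as an illustrative sketch only, a lossless round trip using Python's standard zlib could look like the following (the codec actually used by a terminal may differ):

```python
import zlib

def compress_sound_data(pcm_bytes):
    """Terminal side: losslessly compress raw sound bytes before sending."""
    return zlib.compress(pcm_bytes)

def decompress_sound_data(payload):
    """Server side: recover the original sound data using the decompression
    method that matches the terminal's compression method."""
    return zlib.decompress(payload)
```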
Step 204, the conference recording server performs voice separation on the voice data.
In this embodiment, the conference recording server may adopt any of various implementations to perform voice separation (also referred to as speaker separation or blind source separation) on the sound data obtained in step 203, so as to obtain at least one piece of separated sound data; the specific implementation is not limited in this application.
For example, the conference recording server may use Independent Component Analysis (ICA) in combination with beamforming on the microphone array, treating the sound sources corresponding to the received sound data as statistically independent and generating pieces of separated sound data that are as independent of one another as possible. The received sound data may be divided into a plurality of frequency bands, each frequency band processed independently, and the results at the different frequencies recombined to generate at least one piece of separated sound data.
In practice, the conference recording server may also implement voice separation on the sound data obtained in step 203 by using a voice separation service provided by a voice separation service provider (for example, by calling an application program interface for implementing voice separation), so as to obtain at least one separated sound data.
In some alternative embodiments, step 204 may proceed as follows: the conference recording server performs voice separation on the received sound data and generates a preset number of pieces of separated sound data. The generated pieces of separated sound data correspond one-to-one to the sound source direction ranges in a preset sound source direction range set, and the sound source direction ranges in the set do not overlap one another. Here, the preset number may be set in advance by a technician. As an example, since the sound data received by the conference recording server is collected by the microphone array provided in the conference sound collection terminal, the above process can be implemented with digital signal processing and beamforming technology.
In step 205, the conference recording server generates a conference record corresponding to each separated sound data.
In this embodiment, the conference recording server may generate a conference record corresponding to each separated sound data obtained after the voice separation in step 204. The conference record corresponding to each separated voice data may include the separated voice data itself, and the speaking content text and the speaker identity information corresponding to the separated voice data. It is understood that the speaking content text corresponding to the separated voice data can be obtained by performing speech recognition on the separated voice data, and the speaker identity information corresponding to the separated voice data can be obtained by performing voiceprint recognition on the separated voice data. The speech recognition and the voiceprint recognition are the prior art which is widely researched and applied at present, and are not described herein again.
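A conference record as described above bundles three items. A hypothetical in-memory representation (field names are illustrative, not from the original) might be:

```python
from dataclasses import dataclass

@dataclass
class ConferenceRecord:
    """One conference record for a piece of separated sound data."""
    separated_sound: bytes   # the separated sound data itself
    utterance_text: str      # speaking content text from speech recognition
    speaker_identity: str    # speaker identity from voiceprint recognition
```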
And step 206, the conference record server sends the generated conference records to the conference record presenting terminal corresponding to the target conference identifier.
In this embodiment, the conference record server may send each generated conference record to the conference record presenting terminal corresponding to the target conference identifier after generating the conference record corresponding to each separated sound data. Here, the target conference identifier is a current conference identifier corresponding to the conference sound collection terminal that transmits the sound data received by the conference recording server. It can be understood that, if the conference sound collecting terminal is collecting the conference sound, the corresponding current conference identifier of the conference sound collecting terminal is correspondingly stored in the conference recording server. And the conference record server also stores a conference participant identifier set corresponding to each conference identifier (including a historical conference identifier and a current conference identifier), and the conference record presenting terminal corresponding to the conference identifier can be a conference record presenting terminal which logs in the conference record server through the conference participant identifier in the conference participant identifier set corresponding to the conference identifier.
In practice, before a conference starts, a user can reserve the conference in advance with the conference recording server through the conference record presenting terminal, and can input the participant identifier of each participant when making the reservation. The conference recording server may generate a corresponding conference identifier, determine the participant identifier set corresponding to the generated conference identifier as the set of received participant identifiers, and feed the generated conference identifier back to the conference record presenting terminal. Subsequently, the user can obtain the conference identifier through the conference record presenting terminal. When a conference starts, the user can adopt any of various implementations to establish an association between the conference identifier of the upcoming conference and the conference sound collection terminal to be used. For example, the user may first switch on the conference sound collection terminal; if an information input device is provided in the terminal, the conference identifier of the upcoming conference may be entered manually on the terminal, and the terminal may then send the entered conference identifier to the conference recording server as its current conference identifier, so that the conference recording server, upon receiving it, records the received current conference identifier as the current conference identifier for the terminal identifier of the conference sound collection terminal that sent it. Alternatively, the user may input or select the current conference identifier on the conference record presenting terminal and scan the two-dimensional code corresponding to the conference sound collection terminal (for example, the two-dimensional code may be posted on the outer surface of the terminal); the conference record presenting terminal then sends the scanned two-dimensional code and the current conference identifier selected or input by the user to the conference recording server, which converts the received two-dimensional code into the terminal identifier of the conference sound collection terminal and records the received current conference identifier as the current conference identifier corresponding to the converted terminal identifier. While the conference is in progress, in order to obtain the conference record content, each participant can log in to the conference recording server with a pre-registered participant identifier and input or select the conference identifier of the conference currently attended; if the participant identifier used to log in is in the participant identifier set that the conference recording server records for the conference identifier input or selected by the participant, the conference recording server can determine that the conference record presenting terminal used by that participant is a conference record presenting terminal corresponding to that conference identifier.
Step 207, the conference record presenting terminal responds to the received conference record sent by the conference record server and presents the received conference record.
In this embodiment, upon receiving a conference record sent by the conference record server, the conference record presenting terminal may present the received conference record in any of various manners. The received conference record is one that the conference record server generated for a piece of separated sound data after performing voice separation on the sound data received from the conference sound collection terminal, and the conference record corresponding to each piece of separated sound data may include the separated sound data itself, together with the speaking content text and speaker identity information corresponding to the separated sound data.
In some alternative embodiments, step 207 may be performed as follows:
the conference record presenting terminal correspondingly presents at least one of the following: the speaking content text in the received conference record, the speaker identity information, and a sound playing icon associated with the separated sound data in the received conference record. In response to detecting a preset operation on a displayed sound playing icon, the conference record presenting terminal plays the separated sound data associated with the icon targeted by the detected preset operation. The preset operation may be, for example: a single click, a double click, a drag, a slide, and the like.
Through this optional implementation, a participant can obtain, on the conference record presenting terminal, at least one of the speaking voice, the speaking content, and the identity information of each participant who is currently speaking. Real-time conference recording is thereby realized, together with the identification and identity labeling of different speakers and their spoken content.
In some optional embodiments, the playing the separated sound data associated with the sound playing icon for which the detected preset operation is directed may include: and playing the separated sound data associated with the sound playing icon for which the detected preset operation is directed, and displaying playing progress indication information corresponding to the playing process in the playing process. For example, a corresponding play progress bar may be presented during play.
As can be seen from the above description, while a conference is in progress, through the corresponding operations of the conference sound collection terminal, the conference record server, and the conference record presenting terminal in steps 201 to 207, a participant in the ongoing conference can view the conference record in real time, including the voices of the different speakers, the speaking content text, and the speaker identity information.
In some alternative embodiments, step 205 may also proceed as follows: for each piece of separated sound data generated in step 204, a conference record generation operation is performed in response to determining that valid speech is present in the separated sound data. That is, by performing the conference record generation operation only when valid speech is determined to exist in the separated sound data, the number of times the operation is performed is reduced, which in turn reduces the computational burden on the conference recording server. It should be noted that various now-known and future-developed implementations for determining whether valid speech exists in sound data may be used here, and the disclosure is not specifically limited in this respect. For example, it may be determined whether the separated sound data contains a speech frame whose energy is greater than a preset energy threshold, and if so, it is determined that valid speech exists in the separated sound data. As another example, the separated sound data may first be filtered, denoised, and otherwise processed to obtain processed separated sound data, and it may then be determined whether the processed separated sound data contains a speech frame whose energy exceeds the preset energy threshold; if so, valid speech is determined to exist in the separated sound data.
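The first energy-threshold check mentioned above can be sketched as follows. The frame length and threshold are illustrative placeholders; a real system would tune both to its sample rate and microphone gain.

```python
import array

def has_valid_speech(pcm16_bytes, frame_len=160, energy_threshold=1_000_000):
    """Return True if any frame of 16-bit PCM samples (machine byte order)
    has energy (sum of squared samples) above the preset threshold."""
    samples = array.array('h', pcm16_bytes)
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        if sum(s * s for s in frame) > energy_threshold:
            return True
    return False
```

A silent buffer yields False, so the conference record generation operation would be skipped for it, which is exactly the load reduction described above.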
Here, the conference record generation operation may include sub-steps 2051 through 2054 as shown in fig. 2B. Referring to fig. 2B, an exploded flow diagram for one embodiment of a meeting record generation operation in accordance with the present disclosure is shown:
Sub-step 2051: perform speech recognition and voiceprint recognition on the separated sound data, respectively, to obtain a recognition text and speaker identity information.
Here, the conference recording server may perform speech recognition and voiceprint recognition on the separated voice data to obtain the recognition text and the speaker identity information, respectively, in various embodiments. For example, the conference recording server may perform speech recognition and voiceprint recognition on the separated voice data locally, concurrently or in parallel, respectively, to obtain the recognition text and the speaker identity information.
In some alternative embodiments, sub-step 2051 may also be performed as follows: first, the conference recording server transmits the separated sound data to the voice recognition server and the voiceprint recognition server, respectively. The voice recognition server performs voice recognition on the received sound data in real time and returns a recognition result (i.e., a recognition text) to the conference recording server. The voiceprint recognition server performs voiceprint recognition on the received voice data in real time and returns a recognition result (namely the identity information of the speaker) to the conference recording server. Then, the conference recording server may determine the recognition result received from the voice recognition server and the recognition result received from the voiceprint recognition server as a recognition text and speaker identification information obtained by performing voice recognition and voiceprint recognition on the separated voice data, respectively. Through the optional implementation mode, the voice recognition is carried out through the voice recognition server, the voiceprint recognition is carried out through the voiceprint recognition server, the calculation burden of the conference recording server can be reduced, and the speed of generating the conference record by the conference recording server is improved.
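The concurrent dispatch to the two recognition servers can be sketched as follows. Here `asr_call` and `voiceprint_call` are hypothetical stand-ins for the remote speech recognition and voiceprint recognition services; their names and signatures are assumptions, not part of the original.

```python
from concurrent.futures import ThreadPoolExecutor

def recognize_segment(sound_data, asr_call, voiceprint_call):
    """Run speech recognition and voiceprint recognition concurrently.

    asr_call and voiceprint_call each take the separated sound data and
    return their recognition result (recognition text and speaker
    identity information, respectively).
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        text_future = pool.submit(asr_call, sound_data)
        speaker_future = pool.submit(voiceprint_call, sound_data)
        return text_future.result(), speaker_future.result()
```

Dispatching both requests before waiting on either lets the slower of the two services bound the latency, rather than their sum.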
In some alternative embodiments, at least one of the conference recording server, the voice recognition server, and the voiceprint recognition server described above may be configured as a private deployment server in accordance with security and/or privacy requirements. It can be understood that, due to the access privacy feature of the private deployment server, the conference recording server, the voice recognition server or the voiceprint recognition server implemented in the private deployment server manner can ensure the security of the data stored thereon. Compared with the existing widely-adopted method for realizing voice recognition and conference recording on a public server, the method has higher safety.
Sub-step 2052: in response to determining that the separated sound data is a speech start point, create a current speech and a current speech text corresponding to the target conference identifier and the obtained speaker identity information.
Here, after obtaining the recognition text and the speaker identity information corresponding to the separated sound data in sub-step 2051, the conference recording server may first determine whether the separated sound data is a speech start point. If it is determined that the separated sound data is a speech start point, a current speech and a current speech text corresponding to the target conference identifier and the obtained speaker identity information are created, and the procedure then goes to sub-step 2053.
That is, this indicates that, in the conference indicated by the target conference identifier, the participant indicated by the obtained speaker identity information has started a new utterance, so a new conference record corresponding to the target conference identifier and the obtained speaker identity information can be formed; that is, a new current speech and a new current speech text corresponding to the target conference identifier and the obtained speaker identity information are created, both of which may initially be empty.
It should be noted that if it is determined that the separated sound data is not a speech start point, the process goes directly to sub-step 2053 for execution.
Sub-step 2053: splice the obtained recognition text to the tail of the current speech text corresponding to the target conference identifier and the obtained speaker identity information, and splice the separated sound data to the tail of the current speech corresponding to the target conference identifier and the obtained speaker identity information.
If the conference recording server determines in sub-step 2052 that the separated sound data is a speech start point and newly creates an empty current speech and an empty current speech text corresponding to the target conference identifier and the speaker identification information obtained in sub-step 2051, the recognition text obtained in sub-step 2051 may be spliced to a tail of the current speech text corresponding to the target conference identifier and the obtained speaker identification information, and the separated sound data may be spliced to a tail of the current speech corresponding to the target conference identifier and the obtained speaker identification information.
If the conference recording server determines in sub-step 2052 that the separated sound data is not a speech start point, this indicates that a corresponding current speech and current speech text have already been established for the target conference identifier and the speaker identity information obtained in sub-step 2051, and the server may directly splice the recognition text obtained in sub-step 2051 to the tail of the corresponding current speech text and splice the separated sound data to the tail of the corresponding current speech.
Execution may go to sub-step 2054 after sub-step 2053 is performed.
Sub-step 2054: generate a conference record corresponding to the separated sound data using the current speech and current speech text corresponding to the target conference identifier and the obtained speaker identity information, and the determined speaker identity information.
Here, after sub-step 2053 is performed, that is, regardless of whether the separated sound data is a speech start point, the obtained recognition text has been spliced to the tail of the current speech text corresponding to the target conference identifier and the obtained speaker identity information, and the separated sound data has been spliced to the tail of the corresponding current speech. In other words, once the current speech and current speech text corresponding to the target conference identifier and the obtained speaker identity information have been updated, the conference recording server may generate a conference record corresponding to the separated sound data using the updated current speech and current speech text together with the determined speaker identity information.
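Sub-steps 2052 to 2054 can be sketched as a buffer keyed by (conference identifier, speaker identity). All names here are illustrative; the patent does not prescribe a data structure.

```python
class UtteranceBuffer:
    """Accumulate the current speech and current speech text per
    (conference identifier, speaker identity) key."""

    def __init__(self):
        self._current = {}

    def on_segment(self, conf_id, speaker, sound, text, is_start):
        key = (conf_id, speaker)
        if is_start or key not in self._current:
            # Sub-step 2052: create an empty current speech / speech text.
            self._current[key] = [b"", ""]
        entry = self._current[key]
        # Sub-step 2053: splice onto the tails.
        entry[0] += sound
        entry[1] += text
        # Sub-step 2054: emit a conference record from the updated state.
        return {"speaker": speaker, "speech": entry[0], "text": entry[1]}
```

Keying by both identifiers lets one server track several conferences and several interleaved speakers at once, which matches the per-conference, per-speaker splicing described above.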
As can be seen from the description of the above alternative embodiments, the conference records generated in this way achieve advantages including but not limited to the following. First, the computational load of the conference recording server is reduced by generating a conference record only when valid speech is detected. Second, the speech of every person who begins a new utterance in an ongoing conference (including the sound data, the speaking content text, and the speaker identity information) can be updated in real time.
In some optional embodiments, the above-described meeting record generation operation may further include sub-step 2055:
Sub-step 2055: in response to determining that the separated sound data is a speech end point, generate a historical conference record using the current speech and current speech text corresponding to the target conference identifier and the obtained speaker identity information, and the determined speaker identity information, and store the generated record as a historical conference record corresponding to the target conference identifier.
With sub-step 2055, whenever the conference recording server detects that an utterance has been completed in the currently ongoing conference, it can store the spoken content as a historical conference record corresponding to that conference for future reference.
In some optional embodiments, the above-described meeting record generation operation may further include sub-step 2056:
Sub-step 2056: in response to determining that the separated sound data is a speech start point, determine the current time as the speaking start time corresponding to the target conference identifier and the obtained speaker identity information.
That is, if the separated sound data is a speech start point, this indicates that someone has started a new utterance in the conference indicated by the target conference identifier, and that this speaker is the person indicated by the obtained speaker identity information; the speaking start time of the new utterance is therefore recorded here.
Based on the above-described alternative embodiment of recording the speaking start time of the new utterance in the conference record generating operation, the above-described sub-step 2054 may also proceed as follows:
generating a conference record corresponding to the separated voice data using the speaking start time, the current voice and the current speaking text corresponding to the target conference identifier and the obtained speaker identification information, and the determined speaker identification information.
That is, the conference record generated for the separated sound data includes not only the voice, the speaking content text, and the speaker identity information, but also the speaking start time, thereby enriching the content of the conference record.
Based on the above-described optional implementation of sub-step 2054, the presentation in step 207 of at least one of the speaking content text, the speaker identity information, and the sound playing icon associated with the separated sound data in the received conference record may also proceed as follows: correspondingly present at least one of the speaking start time, the speaking content text, the speaker identity information, and the sound playing icon associated with the separated sound data in the received conference record. That is, besides the separated sound data, the speaking content text, and the speaker identity information in the conference record, the corresponding speaking start time can also be presented, further enriching the presented conference record content.
Based on the above-described alternative embodiment of recording the speaking start time of the new utterance in the conference record generating operation, the above-described sub-step 2055 may also proceed as follows:
generating a historical conference record using the speaking start time, the current speech, and the current speech text corresponding to the target conference identifier and the obtained speaker identity information, together with the determined speaker identity information, and storing the generated historical conference record as a historical conference record corresponding to the target conference identifier.
That is, the history conference record generated for the target conference identifier includes the speaking start time in addition to the sound, the speaking content text and the speaker identity information, thereby enriching the recording content of the history conference record.
Continuing with fig. 2C, it should be noted that, due to page display limitations, in addition to the flow shown in fig. 2C, the overall flow may also include the steps shown in fig. 2A.
In some optional embodiments, the timing sequence 200 may further include the following steps 208 to 211:
In step 208, in response to detecting a conference record viewing request input by the user that includes a conference identifier to be viewed and a viewer identifier, the conference record presenting terminal sends the conference record viewing request to the conference record server.
Here, the conference identifier to be viewed input by the user may indicate the conference identifier of a currently ongoing conference, or the conference identifier of a conference that has already ended.
It will be appreciated that the conference identifiers described in this disclosure may be stored locally at the conference recording server or in other electronic devices networked with the conference recording server, and are used to uniquely identify each conference. A conference identifier may include at least one of: numbers, English letters, symbols, Chinese characters, and words in other languages.

It will be appreciated that the viewer and participant identifiers described in this disclosure may likewise be stored locally at the conference recording server or in other electronic devices networked with the conference recording server, and are used to uniquely identify each participant. A viewer or participant identifier may include at least one of: numbers, English letters, symbols, Chinese characters, and words in other languages.
In step 209, in response to receiving a conference record viewing request, sent by the conference record presenting terminal, that includes a conference identifier to be viewed and a viewer identifier, the conference record server determines whether the viewer identifier belongs to the participant identifier set corresponding to the conference identifier to be viewed.
Here, upon receiving a conference record viewing request that includes a conference identifier to be viewed and a viewer identifier sent by the conference record presenting terminal, the conference record server may first acquire, locally or remotely, the participant identifier set corresponding to the received conference identifier to be viewed, and then determine whether the received viewer identifier belongs to the acquired participant identifier set.
It should be noted that the conference recording server may store the participant identifier set corresponding to each conference identifier locally or in other electronic devices connected to the conference recording server through a network. The participant identifier set corresponding to a conference identifier indicates that the conference indicated by that identifier is accessible only to the participants indicated by the identifiers in the set; that is, only those participants can view the historical conference records of the conference indicated by the conference identifier.
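The membership check of steps 209 and 210 amounts to a set lookup. A minimal sketch, assuming the participant identifier sets are held in a dictionary keyed by conference identifier (an illustrative layout, not the server's actual storage):

```python
def may_view_history(participants_by_conference, conference_id, viewer_id):
    """Return True only if the viewer identifier belongs to the participant
    identifier set stored for the given conference identifier."""
    return viewer_id in participants_by_conference.get(conference_id, set())
```

An unknown conference identifier defaults to the empty set, so the request is denied rather than raising an error.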
In step 210, in response to determining that the viewer identifier belongs to the participant identifier set corresponding to the conference identifier to be viewed, the conference record server acquires the historical conference records corresponding to the conference identifier to be viewed and sends the acquired historical conference records to the conference record presenting terminal that sent the conference record viewing request.
Here, in the case where it is determined in step 209 that the received viewer identifier belongs to the acquired participant identifier set, the conference recording server may first acquire the historical conference records corresponding to the received conference identifier to be viewed, and then send each acquired historical conference record to the conference record presenting terminal that sent the conference record viewing request received in step 209.
In step 211, in response to receiving the historical conference record sent by the conference record server in response to the conference record viewing request, the conference record presenting terminal presents the received historical conference record.
Here, the conference record presenting terminal presents the received historical conference record upon receiving the historical conference record that the conference record server sent in response to the conference record viewing request transmitted by the terminal.
Through these optional implementations, a participant can use the conference record presenting terminal to present, in real time, the speaking content and speaker identity of each person in the current conference; to present the sound data of the content discussed in the current conference together with the corresponding speaking content text and speaker identity information; and to obtain the historical conference records of conferences that have ended. Comprehensive conference record viewing is thereby realized for both ongoing and completed conferences.
Based on the above alternative implementation of sub-step 2055, presenting the received historical conference record in step 211 may proceed as follows: correspondingly present at least one of the speaking start time, the speaking content text, the speaker identity information, and the sound playing icon associated with the separated sound data in the received historical conference record. That is, besides the sound playing icon associated with the separated sound data, the speaking content text, and the speaker identity information in the conference record, the corresponding speaking start time can also be presented, further enriching the presented historical conference record content. And in response to detecting a preset operation on a displayed sound playing icon, the separated sound data associated with the icon targeted by the detected preset operation is played.
Based on the above-mentioned optional embodiments regarding reviewing historical meeting records, in some optional embodiments, the above-mentioned time sequence 200 may further include the following steps 212 and 213:
in step 212, the meeting record presenting terminal sends a speaking content text updating request to the meeting record server in response to detecting the modification operation aiming at the speaking content text in the presented historical meeting record.
In order to make the conference record content more accurate, the conference record presenting terminal can provide an interface through which participants can correct incorrectly recognized speech content while reviewing the historical conference record later. That is, a participant can modify the speaking content text in the presented historical conference record using the conference record presenting terminal, and the conference record presenting terminal, in response to detecting a modification operation for the presented speaking content text, sends a speaking content text update request to the conference record server. Here, the speaking content text update request may include the modified speaking content text corresponding to the modification operation and the conference record identifier of the historical conference record targeted by the modification operation.
In step 213, the conference record server updates the speaking content text in the historical conference record corresponding to the conference record identifier in the speaking content text updating request to the speaking content text in the speaking content text updating request in response to receiving the speaking content text updating request sent by the conference record presenting terminal.
Through the steps 212 and 213, the user can correct incorrectly recognized speaking content text in the historical conference record, thereby achieving fine-grained management of the historical conference record.
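The update flow of steps 212 and 213 can be sketched as a minimal in-memory server-side handler. The store layout, field names, and `handle_update_request` are illustrative assumptions, not the patent's actual implementation:

```python
# Minimal in-memory sketch of the speaking-content-text update flow
# (steps 212/213); the store layout and names are illustrative assumptions.
historical_records = {
    "rec-001": {"speaking_text": "recognized txet with an error",
                "speaker": "Zhang", "start_time": "10:02:15"},
}

def handle_update_request(store, request):
    """Apply a speaking-content-text update request on the server side."""
    record_id = request["record_id"]
    if record_id not in store:
        return False  # unknown conference record identifier
    store[record_id]["speaking_text"] = request["modified_text"]
    store[record_id]["modified"] = True  # mark for later model updates
    return True
```

Marking the record as modified is one possible way to let a later model-update pass (step 214) find the corrected records.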
In practice, speech recognition of the separated sound data is generally based on a speech recognition model. In order to improve the recognition rate of the speech recognition, based on the above-mentioned optional embodiment regarding the modification of the historical conference record, in some optional embodiments, the time sequence 200 may further include the following step 214:
in step 214, the conference recording server updates the speech recognition model based on the stored sound data in the historical conference record with the modified spoken content text in the historical conference record and the corresponding spoken content text in response to determining that the preset speech recognition model update condition is satisfied.
Here, the preset speech recognition model update condition may be preset according to practice. For example, the preset speech recognition model update condition may be that a time interval between the current time and the last update time is a preset time interval (e.g., one week). For another example, the preset speech recognition model update condition may be that the number of the historical conference records in which the text of the spoken content is modified between the last update and the current time is greater than or equal to a preset modified conference record number threshold (e.g., one thousand).
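The two example update conditions above can be sketched as a single predicate; the thresholds and the function name are illustrative assumptions:

```python
# Sketch of the two example update conditions in the text; the thresholds
# and the function name are illustrative assumptions.
UPDATE_INTERVAL_SECONDS = 7 * 24 * 3600       # e.g. one week
MODIFIED_RECORD_THRESHOLD = 1000              # e.g. one thousand records

def should_update_model(now, last_update_time, modified_record_count):
    """True if either preset speech-recognition-model update condition holds."""
    if now - last_update_time >= UPDATE_INTERVAL_SECONDS:
        return True
    return modified_record_count >= MODIFIED_RECORD_THRESHOLD
```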
In general, the speech recognition model is trained in a supervised manner. Therefore, the sound data in the historical conference records whose speaking content text has been modified can be input into the speech recognition model to obtain the actually output recognition result, and the model parameters can be adjusted according to the difference between the actually output recognition result and the corresponding modified speaking content text (treated as the expected output), thereby updating the speech recognition model.
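Assembling the supervised pairs described above can be sketched as follows; the record fields (`sound_data`, `speaking_text`, `modified`) are illustrative assumptions:

```python
# Sketch: collect supervised training pairs (sound data as model input,
# corrected speaking content text as expected output) from the historical
# records whose text was modified; field names are illustrative assumptions.
def build_training_pairs(historical_records):
    pairs = []
    for record in historical_records:
        if record.get("modified"):
            pairs.append((record["sound_data"], record["speaking_text"]))
    return pairs
```

The resulting pairs would then be fed to whatever fine-tuning procedure the deployed speech recognition model supports.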
Through the step 214, the speech recognition model can be updated, and the recognition accuracy of the speech recognition model is further improved.
In some optional embodiments, the timing sequence 200 may further include the following steps 215 and 216:
in step 215, the conference record presenting terminal, in response to detecting a conference reservation request, input by the user, that includes a participant identifier set, sends the conference reservation request to the conference record server.
Before holding a conference with the conference sound collection terminal, the participants need to reserve the conference using the conference record presenting terminal. When reserving the conference, a participant needs to specify, on the conference record presenting terminal, each person who will attend the conference, that is, to input the participant identifier set; the conference record presenting terminal can then generate a conference reservation request including the participant identifier set and send the generated request to the conference record server for conference reservation.
In step 216, the conference recording server generates a conference identifier in response to receiving a conference reservation request including a participant identifier set sent by the conference recording presentation terminal, stores the participant identifier set in the conference reservation request as the participant identifier set corresponding to the generated conference identifier, and returns the generated conference identifier to the conference recording presentation terminal sending the conference reservation request.
The conference recording server may first generate a conference identifier when receiving a conference reservation request including a participant identifier set sent by the conference recording presentation terminal, then store the participant identifier set in the conference reservation request as a participant identifier set corresponding to the generated conference identifier, and return the generated conference identifier to the conference recording presentation terminal that sent the conference reservation request.
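The server-side reservation handling in step 216 can be sketched as follows; the storage dict and function name are illustrative assumptions:

```python
import uuid

# Minimal sketch of the server-side reservation handling in step 216;
# the storage dict and function name are illustrative assumptions.
reserved_conferences = {}

def handle_reservation_request(participant_ids):
    """Generate a conference identifier, store the participant set, return the id."""
    conference_id = uuid.uuid4().hex
    reserved_conferences[conference_id] = set(participant_ids)
    return conference_id
```

The stored participant set is what the permission check later consults when a lookup request arrives.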
Optionally, the conference reservation request input by the participant on the conference record presenting terminal in step 215 may further include the expected conference time and the expected conference duration, so that in step 215 the conference record presenting terminal sends a conference reservation request including the expected conference time, the expected conference duration, and the participant identifier set to the conference record server. Accordingly, in step 216, upon receiving the conference reservation request, the conference record server may determine suggested conference information corresponding to the received request according to the current state of each conference sound collection terminal in the conference record system and the conference conditions of each conference that has already been reserved. The suggested conference information may include a terminal identifier of a conference sound collection terminal, a conference start time, and a conference duration. The conference record server then generates a corresponding conference identifier, stores the participant identifier set in the reservation request as the participant identifier set corresponding to the generated conference identifier, stores the determined suggested conference information as the suggested conference information corresponding to the generated conference identifier, and returns the generated conference identifier and the determined suggested conference information to the conference record presenting terminal that sent the reservation request. In this way, the participants can obtain the conference identifier and the corresponding suggested conference information on the conference record presenting terminal and can start the conference according to the suggested conference information.
Through the above optional implementation mode of reserving the conference, the conference reservation of the conference by the conference participants in advance by using the conference record presenting terminal can be realized.
Referring now to fig. 3, fig. 3 illustrates a flow 300 of one embodiment of a conference sound collection method according to the present disclosure. The conference sound collection method can be applied to a conference sound collection terminal provided with a microphone array. The process 300 includes the following steps:
step 301, acquiring sound data collected by a microphone array in real time.
In the present embodiment, the detailed operation of step 301 and the technical effects thereof are substantially the same as the operation and effects of step 201 in the embodiment shown in fig. 2A, and are not repeated herein.
Step 302, sending the sound data to a conference recording server.
In the present embodiment, the detailed operation of step 302 and the technical effects thereof are substantially the same as the operation and effects of step 202 in the embodiment shown in fig. 2A, and are not repeated herein.
In some alternative embodiments, step 302 may also proceed as follows: and compressing the sound data and sending the compressed sound data to a conference recording server.
Here, the sound data may be used to trigger the conference record server to perform voice separation on the sound data, generate conference records corresponding to each separated sound data and including the separated sound data and the speaking content text and the speaker identity information corresponding to the separated sound data, and send each generated conference record to each conference record presenting terminal corresponding to the current conference identifier corresponding to the conference sound acquiring terminal, where each conference record is used to trigger the conference record presenting terminal receiving each conference record to present each conference record.
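The compression in the optional embodiment above could, for example, use a general-purpose codec such as zlib; the patent does not name a codec, so this choice is an assumption:

```python
import zlib

# Sketch: compress sound data before sending it to the conference record
# server, and decompress it on receipt; zlib is an illustrative choice,
# the patent names no specific codec.
def compress_sound_data(raw_bytes):
    return zlib.compress(raw_bytes, level=6)

def decompress_sound_data(compressed_bytes):
    return zlib.decompress(compressed_bytes)
```

In practice a dedicated audio codec would compress speech far better than a general-purpose byte compressor, but the round-trip structure is the same.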
In some optional embodiments, the conference sound collection terminal may further be provided with at least one speaker direction indicator lamp; and the flow 300 may further include step 303:
step 303, estimating the arrival angle of the sound data.
And step 304, for each estimated arrival angle, determining the speaker direction indicator corresponding to the arrival angle according to the corresponding relation between the preset arrival angle and the speaker direction indicator mark, and turning on the determined speaker direction indicator for a first preset time.
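One simple form of the preset correspondence between arrival angles and indicator lamps in step 304 is an even angular split; the lamp count and the mapping below are illustrative assumptions:

```python
# Sketch of the angle-to-indicator mapping in step 304; the lamp count and
# the even angular split are illustrative assumptions.
NUM_INDICATORS = 8   # e.g. eight lamps evenly spaced around the terminal

def indicator_for_angle(angle_deg, num_indicators=NUM_INDICATORS):
    """Return the index of the lamp whose angular sector contains the angle."""
    sector = 360.0 / num_indicators
    return int((angle_deg % 360.0) // sector)
```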
Here, the specific operations of step 303 and step 304 and the technical effects thereof are substantially the same as those of step 202A and step 202B in the embodiment shown in fig. 2A, and are not repeated herein.
According to the conference sound acquisition method provided by the embodiment of the disclosure, the sound data acquired in real time from the microphone array is sent to the conference recording server, so that the conference sound data is acquired in real time and sent to the conference recording server for processing, and the calculation processing burden of the conference sound acquisition terminal is reduced.
With further reference to fig. 4, fig. 4 illustrates a flow 400 of one embodiment of a conference recording method according to the present disclosure. The conference recording method can be applied to a conference recording server. The process 400 includes the following steps:
step 401, receiving sound data sent by a conference sound collection terminal.
Step 402, performing voice separation on the voice data.
Step 403, generating a conference record corresponding to each separated sound data.
The conference record corresponding to each separated voice data may include the separated voice data and the speaking content text and the speaker identity information corresponding to the separated voice data.
And step 404, sending each generated conference record to a conference record presenting terminal corresponding to the target conference identifier.
Here, the target conference identifier is a current conference identifier corresponding to the conference sound collection terminal that sends the sound data. Each conference record may be used to trigger the conference record presenting terminal that receives each conference record to present each conference record.
In this embodiment, the specific operations of step 401, step 402, step 403, and step 404 and the technical effects thereof are substantially the same as the operations and effects of step 203, step 204, step 205, and step 206 in the embodiment shown in fig. 2A, and are not described herein again.
In some alternative embodiments, the step 402 may be performed as follows: and carrying out voice separation on the received voice data to generate a preset number of separated voice data, wherein the generated separated voice data respectively correspond to the voice source direction ranges in a preset voice source direction range set one by one, and the voice source direction ranges in the preset voice source direction range set are not overlapped with each other.
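The non-overlapping source direction ranges above can be sketched as half-open angular intervals, one per separated stream; the four-way split is an illustrative assumption:

```python
# Sketch of a preset set of mutually non-overlapping sound-source direction
# ranges, one per separated sound stream; the four-way split is an assumption.
DIRECTION_RANGES = [(0, 90), (90, 180), (180, 270), (270, 360)]   # [lo, hi)

def range_index_for_angle(angle_deg):
    """Map a source direction to the index of its separated sound stream."""
    angle = angle_deg % 360
    for index, (lo, hi) in enumerate(DIRECTION_RANGES):
        if lo <= angle < hi:
            return index
    raise ValueError("angle not covered by any preset range")
```

Using half-open intervals guarantees that the ranges cover the full circle without overlapping, matching the constraint stated above.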
In some alternative embodiments, step 403 may proceed as follows: for each of the generated separated sound data, a conference recording generation operation is performed in response to determining that valid speech exists in the separated sound data. Wherein the conference record generation operation may include sub-steps 2051 through 2054 as shown in fig. 2B. Please refer to the related description of sub-steps 2051 to 2054 in the embodiment shown in fig. 2B, which is not repeated herein.
In some optional embodiments, the conference recording generation operation may further include sub-step 2055 shown in fig. 2B. Please refer to the related description of sub-step 2055 in the embodiment shown in fig. 2B, which is not repeated herein.
In some optional embodiments, the conference recording generation operation may further include sub-step 2056 shown in fig. 2B. Please specifically refer to the related description of step 2056 in fig. 2B, which is not repeated herein.
Based on the above optional implementation of sub-step 2056, in some optional implementations, the above sub-step 2054 may also be performed as follows: generating a conference record corresponding to the separated sound data using the speaking start time, the current voice, and the current speaking text corresponding to the target conference identifier, together with the determined speaker identity information. Reference may be made specifically to the description relating to an alternative implementation of sub-step 2054 in the example shown in fig. 2B.
Based on the above optional implementation of sub-step 2056, in some optional implementations, the above sub-step 2055 may also be performed as follows: generating a historical conference record using the speaking start time, the current voice, and the current speaking text corresponding to the target conference identifier, together with the determined speaker identity information, and storing the generated historical conference record as the historical conference record corresponding to the target conference identifier. Reference may be made specifically to the description relating to an alternative implementation of sub-step 2055 in the example shown in fig. 2B.
In some alternative embodiments, sub-step 2051 shown in fig. 2B may also be performed as follows: first, the separated voice data is transmitted to a voice recognition server and a voiceprint recognition server, respectively. Then, the recognition result received from the voice recognition server and the recognition result received from the voiceprint recognition server are determined as the recognition text and the speaker identification information obtained by performing the voice recognition and the voiceprint recognition on the separated voice data, respectively. Reference may be made specifically to the description relating to an alternative implementation of sub-step 2051 in the example shown in fig. 2B.
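Since the speech recognition and voiceprint recognition calls above are independent, they can be issued in parallel. The sketch below uses two stub functions in place of the real remote calls; both stubs and the function names are illustrative assumptions:

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch: submit the separated sound data to the speech-recognition and
# voiceprint-recognition servers in parallel; the two stub functions stand
# in for real remote calls and are illustrative assumptions.
def speech_recognition_rpc(sound_data):
    return "text:" + sound_data          # stub for the speech recognition server

def voiceprint_recognition_rpc(sound_data):
    return "speaker:" + sound_data       # stub for the voiceprint server

def recognize(sound_data):
    """Return (recognized text, speaker identity) for one separated stream."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        text_future = pool.submit(speech_recognition_rpc, sound_data)
        speaker_future = pool.submit(voiceprint_recognition_rpc, sound_data)
        return text_future.result(), speaker_future.result()
```

Issuing both requests concurrently means the conference record server waits only for the slower of the two results rather than for their sum.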
In some alternative embodiments, at least one of the conference recording server, the voice recognition server, and the voiceprint recognition server described above may be configured as a private deployment server in accordance with security and/or privacy requirements. Reference is made in particular to the description of the related alternative embodiment in the embodiment shown in fig. 2.
In some optional embodiments, the above flow 400 may further include the following step 405:
step 405, in response to receiving the request for updating the speaking content text sent by the conference record presenting terminal, updating the speaking content text in the historical conference record corresponding to the conference record identifier in the request for updating the speaking content text into the speaking content text in the request for updating the speaking content text.
Here, the utterance content text update request may be sent by the conference record presenting terminal to the conference record server in response to detecting a modification operation for the utterance content text in the presented historical conference record, where the utterance content text update request includes the modified utterance content text corresponding to the modification operation and the conference record identifier of the historical conference record to which the modification operation is directed.
Here, the detailed operation of step 405 and the technical effects thereof are substantially the same as the operation and effects of step 213 in the embodiment shown in fig. 2C, and are not repeated herein.
In practice, speech recognition of the separated sound data is generally based on a speech recognition model. In order to improve the recognition rate of speech recognition, based on the above-mentioned alternative embodiments regarding updating the text of the utterance, in some alternative embodiments, the timing sequence 400 may further include the following steps 406:
and step 406, in response to determining that the preset speech recognition model updating condition is met, updating the speech recognition model based on the sound data in the historical conference record in which the speaking content text is modified in the stored historical conference record and the corresponding speaking content text.
Here, the detailed operation of step 406 and the technical effect thereof are substantially the same as the operation and effect of step 214 in the embodiment shown in fig. 2C, and are not repeated herein.
In some optional embodiments, the above flow 400 may further include the following steps 407 and 408:
step 407, in response to receiving a meeting record consulting request including a meeting identifier to be consulted and a consultant identifier sent by the meeting record presenting terminal, determining whether the consultant identifier belongs to a participant identifier set corresponding to the meeting identifier to be consulted.
And step 408, in response to determining that the consulter identifier belongs to the participant identifier set, acquiring the historical conference record corresponding to the conference identifier to be consulted, and sending the acquired historical conference record to the conference record presenting terminal that sent the conference record lookup request.
Here, the specific operations of step 407 and step 408 and the technical effects thereof are substantially the same as those of step 209 and step 210 in the embodiment shown in fig. 2C, and are not repeated herein.
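The permission check of steps 407 and 408 can be sketched as follows; the in-memory stores and the function name are illustrative assumptions:

```python
# Sketch of the lookup-permission check in steps 407/408; the in-memory
# stores and the function name are illustrative assumptions.
participants_by_conference = {"conf-1": {"alice", "bob"}}
history_by_conference = {"conf-1": ["record-a", "record-b"]}

def handle_lookup_request(conference_id, consulter_id):
    """Return the historical record only if the consulter attended the conference."""
    attendees = participants_by_conference.get(conference_id, set())
    if consulter_id not in attendees:
        return None   # the consulter has no lookup authority
    return history_by_conference.get(conference_id)
```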
In some optional embodiments, the above flow 400 may further include the following step 409:
step 409, responding to a received conference reservation request including the participant identification set sent by the conference record presenting terminal, generating a conference identification, storing the participant identification set in the conference reservation request as the participant identification set corresponding to the generated conference identification, and returning the generated conference identification to the conference record presenting terminal sending the conference reservation request.
Here, the specific operation of step 409 and the technical effect thereof are substantially the same as the operation and effect of step 216 in the embodiment shown in fig. 2C, and are not repeated herein.
The conference recording method provided by the above embodiment of the present disclosure, by performing voice separation on the sound data received from the conference sound collection terminal and generating a corresponding conference record separately for each separated sound data, can achieve beneficial effects including but not limited to the following:
first, the accuracy of the text of the spoken content in the generated meeting record is improved.
Secondly, the conference records comprise speaker identity information besides the speaking content text, and the content of the conference records is enriched.
And thirdly, the accuracy of the speaker identity information in the generated conference record is improved.
In addition, other optional implementations in the above embodiments of the present disclosure may achieve, but are not limited to, the following beneficial effects:
first, by performing the conference record generating operation only in the case where it is determined that there is valid speech in the separated sound data, the number of times of performing the conference record generating operation is reduced, which in turn reduces the calculation load.
Secondly, the separated sound data are respectively sent to a voice recognition server for voice recognition and sent to a voiceprint recognition server for voiceprint recognition, so that the calculation load of a conference recording server is reduced, and the speed of generating the conference recording is improved.
Third, by configuring at least one of the conference recording server, the voice recognition server, and the voiceprint recognition server as a private deployment server according to security and/or privacy requirements, there is a higher security than implementing voice recognition and conference recording on public servers that are currently widely deployed.
Fourth, by the conference record generating operation shown in fig. 2B, real-time update of the content (including voice data, the text of the utterance content, and the speaker identification information) spoken by each person who starts new speech in the ongoing conference is realized.
Fifthly, through the conference record generating operation shown in fig. 2B, for the currently ongoing conference, every time a spoken word is detected, the spoken content is stored as the historical conference record corresponding to the currently ongoing conference, so that the historical conference record of the conference can be referred in the future.
Sixthly, the conference record generated by aiming at the separated sound data comprises the sound, the speaking content text and the identity information of the speaker, and also comprises the speaking starting time, so that the recording content of the conference record is enriched.
Seventh, the recording content of the historical conference record is enriched by including the utterance start time in addition to the voice, the utterance content text, and the speaker identification information in the generated historical conference record.
And eighthly, under the condition of receiving the conference consulting request, confirming whether the consulting person has consulting authority or not, so that the safety of consulting the conference record is improved.
And ninthly, realizing fine management of the historical conference record by updating the wrong speaking content text in the historical conference record provided by the user.
Tenth, the speech recognition model is updated based on the stored sound data in the historical conference record in which the spoken content text is modified and the corresponding spoken content text under the condition that the preset speech recognition model updating condition is met, so that the recognition accuracy of the speech recognition model is improved.
Referring now to fig. 5, fig. 5 illustrates a flow 500 of one embodiment of a meeting record presentation method in accordance with the present disclosure. The conference record presenting method can be applied to a conference record presenting terminal. The process 500 includes the following steps:
step 501, in response to receiving a conference record sent by a conference record server, presenting the received conference record.
The received conference record may be a conference record corresponding to the conference record generated by the conference record server for each separated voice data after voice separation is performed on the voice data received from the conference voice collecting terminal, and the conference record corresponding to each separated voice data may include the separated voice data, and a speech content text and speaker identity information corresponding to the separated voice data.
In the present embodiment, the detailed operation of step 501 and the technical effects thereof are substantially the same as the operation and effects of step 207 in the embodiment shown in fig. 2A, and are not repeated herein.
In some alternative embodiments, step 501 may also be performed as follows: correspondingly presenting at least one of: the content text of the utterance in the received meeting record, the speaker identification information, and the sound playing icon associated with the separated sound data in the received meeting record. And in response to detecting a preset operation for the displayed sound playing icon, playing the separated sound data associated with the sound playing icon for which the detected preset operation is directed. The detailed operation and the technical effect of the above-mentioned optional implementation are substantially the same as those of the corresponding optional implementation in step 207 in the embodiment shown in fig. 2A, and are not repeated herein.
In some optional embodiments, the playing the separated sound data associated with the sound playing icon for which the detected preset operation is directed may include: and playing the separated sound data associated with the sound playing icon for which the detected preset operation is directed, and displaying playing progress indication information corresponding to the playing process in the playing process. For example, a corresponding play progress bar may be presented during play.
In some optional embodiments, the conference record received from the conference record server may further include a speaking start time, so that presenting at least one of the speaking content text and the speaker identity information in the received conference record, together with the sound playing icon associated with the separated sound data, may also be performed as follows: correspondingly present at least one of the speaking start time, the speaking content text, and the speaker identity information in the received conference record, together with the sound playing icon associated with the separated sound data in the received conference record. That is, in addition to the separated sound data, the speaking content text, and the speaker identity information, the corresponding speaking start time can be presented, further enriching the presented conference record content.
In some optional embodiments, the above flow 500 may further include the following steps 502 and 503:
step 502, in response to detecting a conference record lookup request, input by a user, that includes a conference identifier to be consulted and a consulter identifier, sending the conference record lookup request to the conference record server.
Here, the meeting record lookup request may be used to trigger the meeting record server to obtain a historical meeting record corresponding to the meeting identifier to be consulted in response to determining that the consulter identifier belongs to the set of meeting attendee identifiers corresponding to the meeting identifier to be consulted, and to send the obtained historical meeting record to the meeting record presentation terminal that sent the meeting record lookup request.
Step 503, in response to receiving the historical meeting record sent by the meeting record server in response to the meeting record reference request, presenting the received historical meeting record.
Here, the specific operations of step 502 and step 503 and the technical effects thereof are substantially the same as those of step 208 and step 211 in the embodiment shown in fig. 2C, and are not repeated herein.
In some optional embodiments, the historical conference record received from the conference record server may further include a speaking start time, so that presenting the received historical conference record in step 503 may be performed as follows: correspondingly present at least one of the speaking start time, the speaking content text, and the speaker identity information in the received historical conference record, together with the sound playing icon associated with the separated sound data in the record. That is, in addition to the sound playing icon, the speaking content text, and the speaker identity information associated with the separated sound data, the corresponding speaking start time can be presented, further enriching the presented historical conference record content. Then, in response to detecting a preset operation on a displayed sound playing icon, play the separated sound data associated with the sound playing icon targeted by the detected preset operation.
In some optional embodiments, the above flow 500 may further include the following step 504:
step 504, in response to detecting the modification operation for the speaking content text in the presented historical meeting record, sending a speaking content text updating request to the meeting record server.
Here, the request for updating the speaking content text may include the modified speaking content text corresponding to the modification operation and the conference record identifier of the historical conference record for which the modification operation is directed. The spoken content text update request may be used to trigger the conference record server to update the spoken content text in the historical conference record corresponding to the conference record identifier in the spoken content text update request to the spoken content text in the spoken content text update request.
Here, the detailed operation of step 504 and the technical effect thereof are substantially the same as the operation and effect of step 212 in the embodiment shown in fig. 2C, and are not repeated herein.
In some optional embodiments, the above flow 500 may further include the following step 505:
step 505, in response to detecting a conference reservation request, input by the user, that includes a participant identifier set, sending the conference reservation request to the conference record server.
Here, the conference reservation request may be used to trigger the conference recording server to generate a conference identifier, store a participant identifier set in the conference reservation request as a participant identifier set corresponding to the generated conference identifier, and return the generated conference identifier to the conference recording presentation terminal that sent the conference reservation request.
Here, the detailed operation of step 505 and the technical effect thereof are substantially the same as the operation and effect of step 215 in the embodiment shown in fig. 2C, and are not repeated herein.
The conference record presenting method provided by the above embodiment of the present disclosure may implement, by receiving and presenting the conference record from the conference record server, the following beneficial effects including but not limited to:
First, when more than one person speaks simultaneously, the voices of the different speakers are separated and a corresponding conference record is generated from each separated voice, avoiding the problem of mixed speech from multiple simultaneous speakers.
Second, the voice, speaking content, and identity of each speaker are presented in real time.
Third, speaker identity can be presented in addition to the speech audio and speaking content of the conference.
Fourth, because each conference record is generated from separated sound data, the presented speaking content and speaker identity information remain accurate even when more than one person speaks simultaneously.
In addition, other optional implementations in the above embodiments of the present disclosure may achieve, but are not limited to, the following advantageous effects:
First, the speaking content text in a historical conference record can be modified on the conference record presenting terminal, enabling fine-grained management of historical conference records.
Second, the historical record of a conference currently in progress, or of a conference that has ended, can be consulted on the conference record presenting terminal.
Third, the presented conference record or historical conference record may also include a speaking start time, enriching the content of the record.
With further reference to fig. 6, as an implementation of the methods shown in the above-mentioned figures, the present disclosure provides an embodiment of a conference sound collecting apparatus, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 3, and the apparatus may be specifically applied to a conference sound collecting terminal, and a microphone array may be disposed in the conference sound collecting terminal.
As shown in fig. 6, the conference sound collection apparatus 600 of the present embodiment includes a sound data acquisition unit 601 and a sound data sending unit 602. The sound data acquisition unit 601 is configured to acquire, in real time, sound data collected by the microphone array. The sound data sending unit 602 is configured to send the sound data to a conference record server. The sound data triggers the conference record server to perform voice separation on the sound data, to generate, for each separated sound data, a conference record including that separated sound data together with its corresponding speech content text and speaker identity information, and to send each generated conference record to each conference record presenting terminal corresponding to the current conference identifier of the conference sound collecting terminal, whereupon each conference record presenting terminal that receives a conference record presents it.
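As a minimal sketch (not the patented implementation), the cooperation of units 601 and 602 can be pictured as a loop that reads one chunk of microphone-array data and immediately forwards it to the server. Both `read_chunk` and `send_to_server` are hypothetical stand-ins for the array driver and the network client:

```python
def run_collection_loop(read_chunk, send_to_server, max_chunks):
    """Acquire sound data chunks in real time and forward each one to the
    conference record server. Returns the number of chunks sent."""
    sent = 0
    while sent < max_chunks:
        chunk = read_chunk()      # raw multi-channel PCM frame from the array
        if chunk is None:         # driver signals end of capture
            break
        send_to_server(chunk)     # server side performs voice separation
        sent += 1
    return sent
```

In a real terminal the loop would run continuously and the transport would be a persistent network connection; the bounded loop here only makes the sketch testable.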
In this embodiment, specific processing of the sound data acquiring unit 601 and the sound data sending unit 602 of the conference sound collecting device 600 and technical effects brought by the processing can refer to related descriptions of step 301 and step 302 in the corresponding embodiment of fig. 3, which are not described herein again.
In some optional embodiments, the conference sound collection terminal may further include at least one speaker direction indicator lamp; and the conference sound collection apparatus further includes: an arrival angle estimation unit 603 configured to estimate an arrival angle of the sound data; and an indicator lamp turning-on unit configured to, for each estimated arrival angle, determine the speaker direction indicator lamp corresponding to that arrival angle according to a preset correspondence between arrival angles and speaker direction indicator lamp identifiers, and turn on the determined speaker direction indicator lamp for a first preset duration.
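The preset correspondence between arrival angles and lamp identifiers can be sketched as a simple range lookup. The `(low, high, lamp_id)` tuples below are an illustrative assumption about how such a correspondence might be stored, not the patent's data format:

```python
def lamp_for_angle(angle_deg, lamp_ranges):
    """Return the speaker direction lamp identifier whose preset angle range
    contains the estimated arrival angle, or None if no range matches.
    `lamp_ranges` holds (low, high, lamp_id) tuples in degrees,
    low inclusive, high exclusive."""
    angle = angle_deg % 360.0     # normalize into [0, 360)
    for low, high, lamp_id in lamp_ranges:
        if low <= angle < high:
            return lamp_id
    return None
```

A firmware layer would then drive the GPIO pin for the returned identifier and start a timer for the first preset duration.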
In some optional embodiments, the sound data sending unit 602 may be further configured to: compress the sound data and send the compressed sound data to the conference recording server.
It should be noted that, for details of implementation and technical effects of each unit in the conference sound acquisition apparatus provided by the present disclosure, reference may be made to relevant descriptions of other embodiments in the present disclosure, and details are not described herein again.
With further reference to fig. 7, as an implementation of the methods shown in the above-mentioned figures, the present disclosure provides an embodiment of a conference recording apparatus, which corresponds to the method embodiment shown in fig. 4 and may be specifically applied in a conference recording server.
As shown in fig. 7, the conference recording apparatus 700 of the present embodiment includes: a sound data receiving unit 701, a human voice separating unit 702, a conference record generating unit 703, and a conference record transmitting unit 704. The sound data receiving unit 701 is configured to receive sound data sent by a conference sound collecting terminal; a voice separating unit 702 configured to separate voice from the voice data; a conference record generating unit 703 configured to generate a conference record corresponding to each separated sound data, where the conference record corresponding to each separated sound data includes the separated sound data and a speech content text and speaker identity information corresponding to the separated sound data; and the conference record sending unit 704 is configured to send each generated conference record to a conference record presenting terminal corresponding to a target conference identifier, where the target conference identifier is a current conference identifier corresponding to a conference sound collecting terminal that sends the sound data, and each conference record is used to trigger a conference record presenting terminal that receives each conference record to present each conference record.
In this embodiment, specific processing of the sound data receiving unit 701, the voice separating unit 702, the conference record generating unit 703 and the conference record sending unit 704 of the conference recording apparatus 700 and technical effects brought by the processing can refer to the related descriptions of step 401, step 402, step 403 and step 404 in the corresponding embodiment of fig. 4, which are not described herein again.
In some optional embodiments, the human voice separation unit 702 may be further configured to: perform voice separation on the received sound data to generate a preset number of separated sound data, where the generated separated sound data correspond one-to-one to the sound source direction ranges in a preset sound source direction range set, and the sound source direction ranges in the set do not overlap one another.
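The one-to-one mapping from non-overlapping direction ranges to separated streams can be illustrated with a toy binning sketch. A real separation unit would apply beamforming or blind source separation to the multi-channel signal; here each frame is assumed to already carry an estimated direction of arrival, and all names are illustrative:

```python
def split_by_direction(frames_with_doa, direction_ranges):
    """Assign each sound frame to exactly one separated stream according to
    its direction of arrival (DOA). `direction_ranges` is the preset set of
    non-overlapping (low, high) source-direction ranges in degrees; frames
    falling outside every range are dropped."""
    streams = {r: [] for r in direction_ranges}   # one stream per preset range
    for doa, frame in frames_with_doa:
        for low, high in direction_ranges:
            if low <= doa < high:
                streams[(low, high)].append(frame)
                break                              # ranges do not overlap
    return streams
```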
In some optional embodiments, the conference record generating unit 703 may be further configured to: for each of the generated separated sound data, in response to determining that valid speech is present in the separated sound data, performing the following conference recording generation operations: respectively carrying out voice recognition and voiceprint recognition on the separated voice data to obtain a recognition text and speaker identity information; in response to determining that the separated voice data is a voice starting point, establishing a current voice and a current speaking text corresponding to the target conference identifier and the obtained speaker identity information; splicing the obtained identification text to the tail part of the current speaking text corresponding to the target conference identifier and the obtained speaker identity information, and splicing the separated voice data to the tail part of the current voice corresponding to the target conference identifier and the obtained speaker identity information; and generating a conference record corresponding to the separated voice data by using the current voice and the current speaking text corresponding to the target conference identifier and the obtained speaker identity information and the determined speaker identity information.
In some optional embodiments, the conference record generating operation may further include: in response to determining that the separated sound data is a voice end point, generating a historical conference record using the current voice and current speaking text corresponding to the target conference identifier and the obtained speaker identity information, together with the determined speaker identity information, and storing the generated historical conference record as a historical conference record corresponding to the target conference identifier.
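The accumulation flow described above — a voice starting point opens a current-voice/current-text buffer, later segments are spliced onto its tail, and a voice end point archives the buffer as a historical record — can be modeled as a small state machine. The class and field names below are illustrative assumptions, not the patent's data structures:

```python
class SpeechAccumulator:
    """Toy model of per-(conference, speaker) record generation."""

    def __init__(self):
        self.current = {}   # (conf_id, speaker) -> ([audio parts], [text parts])
        self.history = []   # finished historical conference records

    def on_segment(self, conf_id, speaker, audio, text,
                   is_start=False, is_end=False):
        key = (conf_id, speaker)
        if is_start or key not in self.current:
            self.current[key] = ([], [])       # voice starting point: new buffers
        audio_parts, text_parts = self.current[key]
        audio_parts.append(audio)              # splice voice onto current voice tail
        text_parts.append(text)                # splice recognized text onto tail
        record = {"conference": conf_id, "speaker": speaker,
                  "voice": b"".join(audio_parts),
                  "text": " ".join(text_parts)}
        if is_end:                             # voice end point: archive and reset
            self.history.append(record)
            del self.current[key]
        return record                          # per-segment conference record
```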
In some optional embodiments, the conference recording apparatus may further include: a speaking content text updating unit 705 configured to, in response to receiving a speaking content text update request sent by a conference record presenting terminal, update the speaking content text in the historical conference record corresponding to the conference record identifier in the request to the speaking content text carried in the request. Here, the speaking content text update request is sent by the conference record presenting terminal in response to detecting a modification operation on a speaking content text in a presented historical conference record, and includes the modified speaking content text corresponding to the modification operation and the conference record identifier of the historical conference record to which the modification operation is directed.
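The server-side handling of such an update request reduces to locating the stored record by its identifier and overwriting its text. The dictionary keys below are illustrative assumptions about the record layout:

```python
def apply_text_update(history_records, record_id, new_text):
    """Handle a speaking content text update request: find the historical
    record whose conference record identifier matches and replace its
    speaking content text. Returns True on success, False if unknown."""
    for record in history_records:
        if record.get("record_id") == record_id:
            record["text"] = new_text
            return True
    return False
```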
In some optional embodiments, the performing voice recognition on the separated sound data may include: performing voice recognition on the separated sound data based on a speech recognition model; and the conference recording apparatus may further include: a speech recognition model updating unit 706 configured to, in response to determining that a preset speech recognition model update condition is satisfied, update the speech recognition model based on the stored sound data in the historical conference records whose speaking content text has been modified and the corresponding speaking content text in those records.
In some optional embodiments, the conference record generating operation may further include: in response to determining that the separated sound data is a voice starting point, determining the current time as the speaking start time corresponding to the target conference identifier and the obtained speaker identity information.
In some alternative embodiments, the generating a conference record corresponding to the separated sound data using the current voice and current speaking text corresponding to the target conference identifier and the obtained speaker identity information, together with the determined speaker identity information, may include: generating the conference record corresponding to the separated sound data using the speaking start time, the current voice, and the current speaking text corresponding to the target conference identifier and the obtained speaker identity information, together with the determined speaker identity information. Additionally or alternatively, the generating a historical conference record using the current voice and current speaking text corresponding to the target conference identifier and the obtained speaker identity information, together with the determined speaker identity information, and storing the generated historical conference record as the historical conference record corresponding to the target conference identifier, may include: generating the historical conference record using the speaking start time, the current voice, and the current speaking text corresponding to the target conference identifier and the obtained speaker identity information, together with the determined speaker identity information, and storing the generated historical conference record as the historical conference record corresponding to the target conference identifier.
In some optional embodiments, the performing speech recognition and voiceprint recognition on the separated voice data to obtain the recognition text and the speaker identity information respectively may include: respectively sending the separated voice data to a voice recognition server and a voiceprint recognition server, wherein the separated voice data is used for triggering the voice recognition server to perform voice recognition on the received voice data and return a recognition result, and is used for triggering the voiceprint recognition server to perform voiceprint recognition on the received voice data and return a recognition result; and determining the recognition result received from the voice recognition server and the recognition result received from the voiceprint recognition server as a recognition text and speaker identity information obtained by performing voice recognition and voiceprint recognition on the separated voice data, respectively.
In some alternative embodiments, at least one of the conference recording server, the voice recognition server, and the voiceprint recognition server may be configured as a private deployment server based on security and/or privacy requirements.
In some optional embodiments, the conference recording apparatus may further include: a consultant identifier determining unit 707 configured to, in response to receiving a conference record consulting request including a conference identifier to be consulted and a consultant identifier sent by a conference record presenting terminal, determine whether the consultant identifier belongs to the participant identifier set corresponding to the conference identifier to be consulted; and a historical conference record acquiring and sending unit configured to, in response to determining that it belongs, acquire the historical conference record corresponding to the conference identifier to be consulted and send the acquired historical conference record to the conference record presenting terminal that sent the conference record consulting request.
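The permission check performed by unit 707 and the subsequent history lookup can be sketched together in a few lines. The data layouts (a dict of participant identifier sets and a dict of record lists, both keyed by conference identifier) are assumptions for illustration:

```python
def handle_consult_request(conf_id, consultant_id, participant_sets, history_store):
    """Return the historical conference records only when the consultant
    identifier belongs to the participant identifier set stored for the
    conference; otherwise refuse the request by returning None."""
    if consultant_id not in participant_sets.get(conf_id, set()):
        return None                        # consultant is not a participant
    return history_store.get(conf_id, [])  # records for the requested conference
```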
In some optional embodiments, the conference recording apparatus may further include: the conference reservation unit 708 is configured to generate a conference identifier in response to receiving a conference reservation request including a participant identifier set sent by a conference record presenting terminal, store the participant identifier set in the conference reservation request as the participant identifier set corresponding to the generated conference identifier, and return the generated conference identifier to the conference record presenting terminal sending the conference reservation request.
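The reservation flow handled by unit 708 — generate a fresh conference identifier, store the participant identifier set against it, and return the identifier to the requester — can be sketched as follows. The identifier format and closure-based state are arbitrary illustrative choices:

```python
import itertools

def make_reservation_unit():
    """Build a reservation handler: each call to `reserve` yields a fresh
    conference identifier and records the participant identifier set."""
    counter = itertools.count(1)
    participant_sets = {}                  # conference id -> participant id set

    def reserve(participant_ids):
        conf_id = "conf-{}".format(next(counter))   # generated conference identifier
        participant_sets[conf_id] = set(participant_ids)
        return conf_id                               # returned to the requester

    return reserve, participant_sets
```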
It should be noted that, for details of implementation and technical effects of each unit in the conference recording apparatus provided by the present disclosure, reference may be made to descriptions of other embodiments in the present disclosure, and details are not described herein again.
Referring to fig. 8, as an implementation of the method shown in the above-mentioned figures, the present disclosure provides an embodiment of a conference record presenting apparatus, which corresponds to the method embodiment shown in fig. 5, and which is particularly applicable to a conference record presenting terminal.
As shown in fig. 8, the conference record presenting apparatus 800 of the present embodiment includes: the conference record presenting unit 801 is configured to present the received conference record in response to receiving the conference record sent by the conference record server, where the received conference record is a corresponding conference record generated by the conference record server for each separated sound data after performing voice separation on the sound data received from the conference sound collecting terminal, and the conference record corresponding to each separated sound data includes the separated sound data and a speech content text and speaker identity information corresponding to the separated sound data.
In this embodiment, the specific processing of the conference record presenting unit 801 of the conference record presenting apparatus and the technical effects brought by the processing can refer to the related description of step 501 in the corresponding embodiment of fig. 5, which is not described herein again.
In some optional embodiments, the conference record presenting apparatus may further include: a conference record consulting request sending unit 802 configured to, in response to detecting a conference record consulting request including a conference identifier to be consulted and a consultant identifier input by a user, send the conference record consulting request to the conference record server, where the conference record consulting request triggers the conference record server to obtain, in response to determining that the consultant identifier belongs to the participant identifier set corresponding to the conference identifier to be consulted, the historical conference record corresponding to that conference identifier and to send the obtained historical conference record to the conference record presenting terminal that sent the request; and a historical conference record receiving and presenting unit configured to present the received historical conference record in response to receiving the historical conference record sent by the conference record server in response to the conference record consulting request.
In some optional embodiments, the conference record presenting apparatus may further include: a speaking content text update request sending unit 803 configured to, in response to detecting a modification operation on a speaking content text in the presented historical conference record, send a speaking content text update request to the conference record server, where the speaking content text update request includes the modified speaking content text corresponding to the modification operation and the conference record identifier of the historical conference record to which the modification operation is directed, and triggers the conference record server to update the speaking content text in the historical conference record corresponding to that conference record identifier to the speaking content text carried in the request.
In some optional embodiments, the presenting the received conference record may include: correspondingly presenting at least one of: the speaking content text in the received conference record, the speaker identity information, and a sound playing icon associated with the separated sound data in the received conference record; and in response to detecting a preset operation on a displayed sound playing icon, playing the separated sound data associated with the sound playing icon targeted by the detected preset operation.
In some optional embodiments, the playing the separated sound data associated with the sound playing icon targeted by the detected preset operation may include: playing the separated sound data associated with the sound playing icon corresponding to the detected preset operation, and displaying, during playback, playing progress indication information corresponding to the playing process.
In some alternative embodiments, the conference record may also include a speaking start time; and the correspondingly presenting at least one of the speaking content text in the received conference record, the speaker identity information, and the sound playing icon associated with the separated sound data in the received conference record may include: correspondingly presenting at least one of: the speaking start time in the received conference record, the speaking content text, the speaker identity information, and the sound playing icon associated with the separated sound data in the received conference record.
In some optional embodiments, the conference record presenting apparatus may further include: a conference reservation request sending unit 804, configured to, in response to detecting a conference reservation request including a conference participant identification set input by a user, send the conference reservation request to the conference recording server, where the conference reservation request is used to trigger the conference recording server to generate a conference identification, store the conference participant identification set in the conference reservation request as a conference participant identification set corresponding to the generated conference identification, and return the generated conference identification to a conference recording presentation terminal that sent the conference reservation request.
It should be noted that, for details of implementation and technical effects of each unit in the conference record presenting apparatus provided by the present disclosure, reference may be made to descriptions of other embodiments in the present disclosure, and details are not described herein again.
Referring now to fig. 9, there is shown a schematic block diagram of a computer system 900 suitable for use in implementing the conference sound collection terminal of the present disclosure. The conference sound collection terminal shown in fig. 9 is only an example, and should not bring any limitation to the function and the range of use of the present disclosure.
As shown in fig. 9, the computer system 900 includes a central processing unit (CPU) 901, a memory 902, a bus 903, an input/output (I/O) interface 904, an input unit 905, and a communication unit 906. The central processing unit 901, the memory 902, and the I/O interface 904 are connected to one another via the bus 903. The input unit 905 and the communication unit 906 are connected to the bus 903 through the I/O interface 904. The input unit 905 may include a microphone array. The communication unit 906 includes a network interface card such as a LAN (Local Area Network) card, a modem, a WiFi module, or a mobile network module, and performs communication processing via a network such as the Internet.
In some optional embodiments, the input unit 905 may further include, for example, a touch screen, a keyboard, an information input button, and the like.
In some alternative embodiments, computer system 900 may also include an output unit 907, with output unit 907 also connected to I/O interface 904. The output unit 907 may include, for example, a speaker direction indicator lamp or an operation state indicator lamp, etc.
Here, the method according to the present disclosure may be implemented as a computer program and stored in the memory 902. The central processing unit 901 in the conference sound collection terminal 900 implements the conference sound collection function defined in the method of the present disclosure by calling the computer program stored in the memory 902. In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flowchart. In such embodiments, the computer program may be downloaded and installed from a network via the communication unit 906. When the computer program is executed by the central processing unit (CPU) 901, the above-described functions defined in the method of the present disclosure are performed.
Here, the CPU 901 may include at least one processor, and the processor may be, for example, various microprocessors.
In some alternative embodiments, the processor may include specially designed hardware for controlling the operation of the conference sound collection terminal, for example an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like.
In some alternative embodiments, the memory 902 may also be an integral part of the central processing unit 901. Memory 902 may be coupled to computer system 900 in a variety of ways. The memory 902 may be used for various purposes, such as: cache and/or store data, program instructions, and the like.
The term "processor" is not limited herein to just the various integrated circuits referred to in the art as processors, but may broadly refer to microcontrollers, microcomputers, programmable logic controllers, application specific integrated circuits, and any other programmable circuits.
Referring now to fig. 10, there is shown a schematic block diagram of a computer system 1000 suitable for use in implementing a conference record server or a conference record rendering terminal of the present disclosure. The conference recording server or the conference recording presentation terminal shown in fig. 10 is only an example, and should not bring any limitation to the functions and the scope of use of the present disclosure.
As shown in fig. 10, the computer system 1000 includes a Central Processing Unit (CPU) 1001 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 1002 or a program loaded from a storage section 1006 into a Random Access Memory (RAM) 1003. In the RAM 1003, various programs and data necessary for the operation of the system 1000 are also stored. The central processing unit 1001, ROM 1002, and RAM 1003 are connected to each other by a bus 1004. An Input/Output (I/O) interface 1005 is also connected to the bus 1004.
The following components are connected to the I/O interface 1005: a communication section 1007 including a Network interface card such as a LAN (Local Area Network) card, a modem, or the like, and a storage section 1006 including a hard disk or the like. The communication section 1007 performs communication processing via a network such as the internet.
In some alternative embodiments, computer system 1000 may also include an input portion 1008 and/or an output portion 1009. An input section 1008 is connected to the I/O interface 1005, and the input section 1008 may include, for example, a keyboard, a mouse, a touch screen, a stylus pen, a tablet, and the like. The output section 1009 is connected to the I/O interface 1005, and the output section 1009 may include a Display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), a touch panel, and the like, and a speaker and the like.
In some alternative embodiments, the computer system 1000 may also include a driver 1010, the driver 1010 also being connected to the I/O interface 1005 as needed.
In some alternative embodiments, the computer system 1000 may further include a removable medium 1011 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like, mounted on the drive 1010 as necessary, so that a computer program read out therefrom is mounted in the storage section 1006 as necessary.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 1007 and/or installed from the removable medium 1011. The computer program performs the above-described functions defined in the method of the present disclosure when executed by the central processing unit 1001.
It should be noted that the computer readable medium in the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer-readable signal medium may include a propagated data signal with computer-readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including object oriented programming languages such as Java, Smalltalk, C++, or Python, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the latter case, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in this disclosure may be implemented by software or hardware. The described units may also be provided in a processor, and may be described as: a processor includes a sound data receiving unit, a human voice separating unit, a conference record generating unit, and a conference record transmitting unit. The names of these units do not in some cases constitute a limitation on the units themselves, and for example, the sound data receiving unit may also be described as a "unit that receives sound data transmitted from a conference sound collecting terminal". As another example, it can be described as: a processor includes a meeting record rendering unit. Also for example, it can be described as: a processor includes a sound data acquisition unit and a sound data transmission unit.
As another aspect, the present disclosure also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments, or may exist separately without being assembled into the apparatus. The computer-readable medium carries one or more programs which, when executed by the apparatus, cause the apparatus to implement the conference sound collection method shown in the embodiment of fig. 3 and its alternative embodiments, and/or the conference recording method shown in the embodiment of fig. 4 and its alternative embodiments, and/or the conference record presentation method shown in the embodiment of fig. 5 and its alternative embodiments.
The foregoing description presents only preferred embodiments of the present disclosure and illustrates the technical principles employed. Those skilled in the art will appreciate that the scope of the invention in the present disclosure is not limited to technical solutions formed by the specific combination of the above features; it also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the inventive concept, for example, technical solutions formed by replacing the above features with features of similar function disclosed in (but not limited to) the present disclosure.

Claims (30)

1. A conference sound collection method is applied to a conference sound collection terminal provided with a microphone array, and comprises the following steps:
acquiring sound data collected by the microphone array in real time;
and sending the sound data to a conference recording server, wherein the sound data is used for triggering the conference recording server to carry out voice separation on the sound data, generating conference records corresponding to each separated sound data and including the separated sound data, a speaking content text and speaker identity information corresponding to the separated sound data, and sending each generated conference record to each conference record presenting terminal corresponding to a current conference identifier corresponding to the conference sound acquisition terminal, and each conference record is used for triggering the conference record presenting terminal receiving each conference record to present each conference record.
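By way of illustration only (this sketch is not part of the claimed subject matter), the acquire-and-forward flow recited above could be arranged as follows in Python; `MicArray`, `collect_and_send`, and the payload shape are all hypothetical names, and the transport to the conference recording server is abstracted behind a `send` callable:

```python
class MicArray:
    """Hypothetical stand-in for a microphone array driver."""
    def __init__(self, frames):
        self._frames = frames

    def read_frames(self):
        # In a real terminal this would block on the audio hardware.
        yield from self._frames


def collect_and_send(mic_array, send, conference_id):
    """Forward every captured frame to the conference recording server.

    `send` is any callable that delivers a payload to the server; the
    claim leaves the transport (and any compression, cf. claim 3) open.
    """
    sent = 0
    for frame in mic_array.read_frames():
        send({"conference_id": conference_id, "channels": frame})
        sent += 1
    return sent
```

The server side then performs voice separation and record generation on the received frames, as recited in claims 5-7.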
2. The conference sound collection method according to claim 1, wherein the conference sound collection terminal is further provided with at least one speaker direction indicator lamp; and
the conference sound collection method further comprises:
estimating an arrival angle of the sound data;
and for each estimated arrival angle, determining the speaker direction indicator lamp corresponding to the arrival angle according to the corresponding relation between the preset arrival angle and the speaker direction indicator lamp identifier, and turning on the determined speaker direction indicator lamp for a first preset time.
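The angle-to-lamp step above can be sketched as a simple table lookup; this is illustrative only, the sector table and names being hypothetical (the claim only requires a preset correspondence between arrival angles and lamp identifiers):

```python
def lamp_for_angle(angle_deg, sector_to_lamp):
    """Map an estimated arrival angle to a speaker direction indicator
    lamp identifier.

    `sector_to_lamp` plays the role of the preset correspondence between
    arrival-angle ranges and lamp identifiers; the actual table is
    device specific.
    """
    angle = angle_deg % 360                  # normalise into [0, 360)
    for (low, high), lamp_id in sector_to_lamp.items():
        if low <= angle < high:
            return lamp_id
    return None                              # no sector matched
```

The terminal would then turn on the returned lamp for the first preset duration.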
3. The conference sound collection method according to claim 1 or 2, wherein the transmitting the sound data to a conference recording server comprises:
compressing the sound data and sending the compressed sound data to the conference recording server.
4. A conference sound collection apparatus, applied to a conference sound collection terminal provided with a microphone array, the conference sound collection apparatus comprising:
a sound data acquisition unit configured to acquire sound data acquired by the microphone array in real time;
and the sound data sending unit is configured to send the sound data to a conference recording server, the sound data is used for triggering the conference recording server to perform voice separation on the sound data, generating conference records corresponding to each separated sound data and including the separated sound data and a speaking content text and speaker identity information corresponding to the separated sound data, and sending each generated conference record to each conference record presenting terminal corresponding to a current conference identifier corresponding to the conference sound collecting terminal, wherein each conference record is used for triggering each conference record presenting terminal receiving each conference record to present each conference record.
5. A conference recording method is applied to a conference recording server, and comprises the following steps:
receiving sound data sent by a conference sound acquisition terminal;
carrying out human voice separation on the sound data;
generating a conference record corresponding to each separated voice data, wherein the conference record corresponding to each separated voice data comprises the separated voice data, and a speaking content text and speaker identity information corresponding to the separated voice data;
and sending each generated conference record to a conference record presenting terminal corresponding to a target conference identifier, wherein the target conference identifier is a current conference identifier corresponding to a conference sound acquisition terminal which sends the sound data, and each conference record is used for triggering the conference record presenting terminal which receives each conference record to present each conference record.
6. The conference recording method as claimed in claim 5, wherein said carrying out human voice separation on the sound data comprises:
carrying out human voice separation on the received sound data to generate a preset number of pieces of separated sound data, wherein the generated pieces of separated sound data correspond one-to-one to the sound source direction ranges in a preset sound source direction range set, and the sound source direction ranges in the preset sound source direction range set do not overlap one another.
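The one-to-one correspondence between separated streams and non-overlapping direction ranges can be sketched as below. A real implementation would beamform the multi-channel signal; this hypothetical sketch merely bins frames by their estimated arrival angle, which is enough to show the routing the claim describes:

```python
def separate_by_direction(frames_with_angles, direction_ranges):
    """Split a stream into one separated stream per preset sound source
    direction range.

    `direction_ranges` is a list of non-overlapping (low, high) degree
    ranges; each range gets exactly one output stream.
    """
    streams = {r: [] for r in direction_ranges}   # one stream per range
    for frame, angle in frames_with_angles:
        for low, high in direction_ranges:
            if low <= angle < high:               # ranges do not overlap
                streams[(low, high)].append(frame)
                break
    return streams
```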
7. The conference recording method according to claim 5 or 6, wherein the generating of the conference record corresponding to each separated sound data comprises:
for each piece of the generated separated sound data, in response to determining that valid speech is present in the separated sound data, performing the following conference record generation operations: respectively carrying out speech recognition and voiceprint recognition on the separated sound data to obtain a recognition text and speaker identity information; in response to determining that the separated sound data is a speech start point, establishing a current voice and a current speaking text corresponding to the target conference identifier and the obtained speaker identity information; splicing the obtained recognition text to the tail of the current speaking text corresponding to the target conference identifier and the obtained speaker identity information, and splicing the separated sound data to the tail of the current voice corresponding to the target conference identifier and the obtained speaker identity information; and generating a conference record corresponding to the separated sound data by using the current voice and the current speaking text corresponding to the target conference identifier and the obtained speaker identity information, together with the determined speaker identity information.
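The buffering-and-splicing operation above can be sketched as a small state machine keyed by (conference identifier, speaker). This is illustrative only: `asr` and `vpr` stand in for speech recognition and voiceprint recognition, and all names and buffer layouts are hypothetical:

```python
def make_record_generator(asr, vpr):
    """Build the per-segment conference record generation operation.

    `asr(audio)` returns the speaking content text and `vpr(audio)` the
    speaker identity; both are placeholders for real recognizers.
    """
    current = {}   # (conference_id, speaker) -> {"voice": [...], "text": ""}

    def on_segment(conference_id, audio, is_speech_start):
        text = asr(audio)                     # speaking content text
        speaker = vpr(audio)                  # speaker identity information
        key = (conference_id, speaker)
        if is_speech_start:                   # speech start point: new buffers
            current[key] = {"voice": [], "text": ""}
        current[key]["voice"].append(audio)   # splice audio to the tail
        current[key]["text"] += text          # splice text to the tail
        return {"speaker": speaker,
                "text": current[key]["text"],
                "voice": list(current[key]["voice"])}

    return on_segment
```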
8. The conference recording method of claim 7, wherein said conference recording generating operation further comprises:
in response to determining that the separated sound data is a speech end point, generating a historical conference record by using the current voice and the current speaking text corresponding to the target conference identifier and the obtained speaker identity information, together with the determined speaker identity information, and storing the generated historical conference record as a historical conference record corresponding to the target conference identifier.
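The speech-end-point handling above amounts to closing the current buffers and archiving them under the conference identifier; a hypothetical sketch (the dictionary shapes are illustrative, not mandated by the claim):

```python
def archive_on_speech_end(conference_id, speaker, current, history):
    """When a separated segment is a speech end point, freeze the current
    voice and speaking text into a historical conference record stored
    under the target conference identifier.
    """
    buffers = current.pop((conference_id, speaker))   # close the buffers
    record = {"speaker": speaker,
              "text": buffers["text"],
              "voice": buffers["voice"]}
    history.setdefault(conference_id, []).append(record)
    return record
```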
9. The conference recording method as claimed in claim 8, wherein said conference recording method further comprises:
in response to receiving a speaking content text update request sent by a conference record presenting terminal, updating the speaking content text in the historical conference record corresponding to the conference record identifier in the speaking content text update request to the speaking content text in the speaking content text update request, wherein the speaking content text update request is sent by the conference record presenting terminal to the conference recording server in response to detecting a modification operation on a speaking content text in a presented historical conference record, and comprises the modified speaking content text corresponding to the modification operation and the conference record identifier of the historical conference record to which the modification operation is directed.
10. The conference recording method as claimed in claim 9, wherein said performing voice recognition on the separated sound data comprises: performing voice recognition on the separated sound data based on a voice recognition model; and
the conference recording method further comprises:
in response to determining that a preset speech recognition model update condition is met, updating the speech recognition model based on the sound data and the corresponding speaking content texts in the stored historical conference records whose speaking content texts have been modified.
11. The conference recording method of claim 10, wherein said conference recording generating operation further comprises:
in response to determining that the separated sound data is a speech start point, determining the current time as the speaking start time corresponding to the target conference identifier and the obtained speaker identity information.
12. The conference recording method as claimed in claim 11, wherein said generating a conference record corresponding to the separated sound data by using the current voice and the current speaking text corresponding to the target conference identifier and the obtained speaker identity information, together with the determined speaker identity information, comprises:
generating a conference record corresponding to the separated sound data by using the speaking start time, the current voice and the current speaking text corresponding to the target conference identifier and the obtained speaker identity information and the determined speaker identity information; and/or
Generating a historical conference record by using the current voice and the current speaking text corresponding to the target conference identifier and the obtained speaker identity information and the determined speaker identity information, and storing the generated historical conference record as the historical conference record corresponding to the target conference identifier, wherein the method comprises the following steps:
generating a historical conference record by using the speaking starting time, the current voice and the current speaking text corresponding to the target conference identifier and the obtained speaker identity information and the determined speaker identity information, and storing the generated historical conference record as the historical conference record corresponding to the target conference identifier.
13. The conference recording method as claimed in claim 12, wherein said performing speech recognition and voiceprint recognition on the separated voice data to obtain recognized text and speaker identification information respectively comprises:
respectively sending the separated voice data to a voice recognition server and a voiceprint recognition server, wherein the separated voice data is used for triggering the voice recognition server to perform voice recognition on the received voice data and return a recognition result, and is used for triggering the voiceprint recognition server to perform voiceprint recognition on the received voice data and return a recognition result;
and respectively determining the recognition result received from the voice recognition server and the recognition result received from the voiceprint recognition server as a recognition text and speaker identity information obtained by performing voice recognition and voiceprint recognition on the separated voice data.
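The fan-out to the two recognition servers and the pairing of their results can be sketched as a parallel dispatch; the two server callables below are placeholders for whatever RPC mechanism a deployment actually uses, so only the fan-out/fan-in shape follows the claim:

```python
from concurrent.futures import ThreadPoolExecutor

def recognize(separated_audio, asr_server, vpr_server):
    """Send the same separated sound data to a speech recognition server
    and a voiceprint recognition server, then pair their results as the
    (recognition text, speaker identity information) for that segment.
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        text_future = pool.submit(asr_server, separated_audio)
        speaker_future = pool.submit(vpr_server, separated_audio)
        return text_future.result(), speaker_future.result()
```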
14. The conference recording method as claimed in any one of claims 7-13, wherein said conference recording method further comprises:
in response to receiving a conference record consulting request which is sent by a conference record presenting terminal and comprises a conference identifier to be consulted and a consultant identifier, determining whether the consultant identifier belongs to the participant identifier set corresponding to the conference identifier to be consulted;
and in response to determining that the consultant identifier belongs to the participant identifier set corresponding to the conference identifier to be consulted, acquiring the historical conference record corresponding to the conference identifier to be consulted, and sending the acquired historical conference record to the conference record presenting terminal that sent the conference record consulting request.
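The permission-gated lookup above is a membership check followed by a fetch; a hypothetical sketch with illustrative data shapes:

```python
def consult_history(meeting_id, consultant_id, participants, history):
    """Serve a conference record consulting request: the history for a
    meeting is returned only when the consultant belongs to that
    meeting's participant identifier set.
    """
    if consultant_id not in participants.get(meeting_id, set()):
        return None                    # not a participant: refuse access
    return history.get(meeting_id, [])
```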
15. The conference recording method as claimed in any one of claims 7-13, wherein said conference recording method further comprises:
the method comprises the steps of responding to a received conference reservation request which comprises a participant identification set and is sent by a conference record presenting terminal, generating a conference identification, storing the participant identification set in the conference reservation request as the participant identification set corresponding to the generated conference identification, and returning the generated conference identification to the conference record presenting terminal which sends the conference reservation request.
16. A conference recording apparatus applied to a conference recording server, the conference recording apparatus comprising:
a sound data receiving unit configured to receive sound data transmitted by the conference sound collecting terminal;
a voice separation unit configured to perform voice separation on the sound data;
the conference record generating unit is configured to generate a conference record corresponding to each separated sound data, wherein the conference record corresponding to each separated sound data comprises the separated sound data, and a speaking content text and speaker identity information corresponding to the separated sound data;
and a conference record sending unit configured to send each generated conference record to a conference record presenting terminal corresponding to a target conference identifier, where the target conference identifier is a current conference identifier corresponding to a conference sound collecting terminal that sends the sound data, and each conference record is used to trigger a conference record presenting terminal that receives each conference record to present each conference record.
17. A conference record presenting method is applied to a conference record presenting terminal, and comprises the following steps:
and presenting the received conference record in response to receiving the conference record sent by the conference record server, wherein the received conference record is a corresponding conference record generated by the conference record server aiming at each separated sound data after the conference record server performs voice separation on the sound data received from the conference sound acquisition terminal, and the conference record corresponding to each separated sound data comprises the separated sound data, and a speaking content text and speaker identity information corresponding to the separated sound data.
18. The conference recording presentation method of claim 17, wherein the conference recording presentation method further comprises:
in response to detecting a conference record consulting request which is input by a user and comprises a conference identifier to be consulted and a consultant identifier, sending the conference record consulting request to the conference record server, wherein the conference record consulting request is used for triggering the conference record server, in response to determining that the consultant identifier belongs to the participant identifier set corresponding to the conference identifier to be consulted, to acquire the historical conference record corresponding to the conference identifier to be consulted and send the acquired historical conference record to the conference record presenting terminal that sent the conference record consulting request;
and presenting the received historical conference record in response to receiving the historical conference record sent by the conference record server in response to the conference record consulting request.
19. The conference recording presentation method of claim 18, wherein the conference recording presentation method further comprises:
and in response to detecting a modification operation on the spoken content text in the presented historical conference record, sending a spoken content text updating request to the conference record server, wherein the spoken content text updating request comprises the modified spoken content text corresponding to the modification operation and the conference record identifier of the historical conference record to which the modification operation is directed, and the spoken content text updating request is used for triggering the conference record server to update the spoken content text in the historical conference record corresponding to the conference record identifier in the spoken content text updating request to the spoken content text in the spoken content text updating request.
20. The conference record presenting method of any one of claims 17-19, wherein said presenting the received conference record comprises:
correspondingly presenting at least one of: the speaking content text, the speaker identity information, and a sound playing icon associated with the separated sound data in the received conference record;
and in response to detecting a preset operation on a displayed sound playing icon, playing the separated sound data associated with the sound playing icon at which the detected preset operation is directed.
21. The conference record presenting method according to claim 20, wherein the playing the separated sound data associated with the sound playing icon for which the detected preset operation is directed includes:
playing the separated sound data associated with the sound playing icon at which the detected preset operation is directed, and displaying, during playing, playing progress indication information corresponding to the playing progress.
22. The conference record presenting method according to claim 21, wherein the conference record further comprises a speaking start time; and
said correspondingly presenting at least one of the speaking content text, the speaker identity information, and the sound playing icon associated with the separated sound data in the received conference record comprises:
correspondingly presenting at least one of: the speaking start time, the speaking content text, the speaker identity information, and the sound playing icon associated with the separated sound data in the received conference record.
23. The conference recording presentation method of claim 22, wherein the conference recording presentation method further comprises:
the method comprises the steps of responding to a conference reservation request which is input by a user and comprises a conference participant identification set, sending the conference reservation request to a conference recording server, wherein the conference reservation request is used for triggering the conference recording server to generate a conference identification, storing the conference participant identification set in the conference reservation request as the conference participant identification set corresponding to the generated conference identification, and returning the generated conference identification to a conference recording presentation terminal sending the conference reservation request.
24. A conference record presenting apparatus applied to a conference record presenting terminal, the conference record presenting apparatus comprising:
and the conference record presenting unit is configured to present the received conference record in response to receiving the conference record sent by the conference record server, wherein the received conference record is a corresponding conference record generated by the conference record server for each separated sound data after the conference record server performs voice separation on the sound data received from the conference sound collecting terminal, and the conference record corresponding to each separated sound data comprises the separated sound data and the speaking content text and the speaker identity information corresponding to the separated sound data.
25. A conference sound collection terminal comprising:
the microphone array is used for collecting sound data;
one or more processors;
a storage device having one or more programs stored thereon,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-3.
26. A conference recording server, comprising:
one or more processors;
a storage device having one or more programs stored thereon;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 5-15.
27. A conference recording presentation terminal comprising:
one or more processors;
a storage device having one or more programs stored thereon;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 17-23.
28. A computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by one or more processors implements a conference sound acquisition method as claimed in any one of claims 1-3, or a conference recording method as claimed in any one of claims 5-15, or a conference recording presentation method as claimed in any one of claims 17-23.
29. A conference recording system, comprising a conference recording server according to claim 26, at least one conference sound collection terminal according to claim 25, and at least one conference recording presentation terminal according to claim 27.
30. The conference recording system of claim 29, wherein said conference recording system further comprises a voice recognition server and a voiceprint recognition server, wherein said voice recognition server is configured to perform voice recognition on the separated voice data received from said conference recording server and to transmit a recognized utterance content text to said conference recording server, and wherein said voiceprint recognition server is configured to perform voiceprint recognition on the separated voice data received from said conference recording server and to transmit recognized speaker identification information to said conference recording server.
CN202010497438.0A 2020-06-02 2020-06-02 Conference sound collection, conference record and conference record presentation method and device Active CN111739553B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010497438.0A CN111739553B (en) 2020-06-02 2020-06-02 Conference sound collection, conference record and conference record presentation method and device

Publications (2)

Publication Number Publication Date
CN111739553A true CN111739553A (en) 2020-10-02
CN111739553B CN111739553B (en) 2024-04-05

Family

ID=72649927

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010497438.0A Active CN111739553B (en) 2020-06-02 2020-06-02 Conference sound collection, conference record and conference record presentation method and device

Country Status (1)

Country Link
CN (1) CN111739553B (en)


Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070133437A1 (en) * 2005-12-13 2007-06-14 Wengrovitz Michael S System and methods for enabling applications of who-is-speaking (WIS) signals
JP2007233239A (en) * 2006-03-03 2007-09-13 National Institute Of Advanced Industrial & Technology Method, system, and program for utterance event separation
JP2009288215A (en) * 2008-06-02 2009-12-10 Toshiba Corp Acoustic processing device and method therefor
CN107171816A (en) * 2017-06-21 2017-09-15 歌尔科技有限公司 Data processing method and device in videoconference
CN107888790A (en) * 2016-09-29 2018-04-06 中兴通讯股份有限公司 The way of recording and device of videoconference
CN110335612A (en) * 2019-07-11 2019-10-15 招商局金融科技有限公司 Minutes generation method, device and storage medium based on speech recognition
CN110797043A (en) * 2019-11-13 2020-02-14 苏州思必驰信息科技有限公司 Conference voice real-time transcription method and system


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XUE Hui: "Application of iFlytek speech technology in a conference recording mobile phone APP", 电脑与电信 (Computer and Telecommunications), no. 05 *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114417120A (en) * 2020-10-28 2022-04-29 博泰车联网科技(上海)股份有限公司 Information pushing method and device, electronic equipment and storage medium
CN112581941A (en) * 2020-11-17 2021-03-30 北京百度网讯科技有限公司 Audio recognition method and device, electronic equipment and storage medium
CN112822161A (en) * 2020-12-29 2021-05-18 上海掌门科技有限公司 Method and equipment for realizing conference message synchronization
CN112822161B (en) * 2020-12-29 2022-12-30 上海掌门科技有限公司 Method and equipment for realizing conference message synchronization
WO2022161264A1 (en) * 2021-01-26 2022-08-04 阿里巴巴集团控股有限公司 Audio signal processing method, conference recording and presentation method, device, system, and medium
CN112965389A (en) * 2021-01-27 2021-06-15 国能大渡河大数据服务有限公司 Scene system linked with conference system
CN113259619A (en) * 2021-05-07 2021-08-13 北京字跳网络技术有限公司 Information sending and displaying method, device, storage medium and conference system
CN113225441A (en) * 2021-07-09 2021-08-06 北京中电慧声科技有限公司 Conference telephone system
CN113225441B (en) * 2021-07-09 2021-10-08 北京中电慧声科技有限公司 Conference telephone system
CN113689855A (en) * 2021-08-18 2021-11-23 北京铁道工程机电技术研究所股份有限公司 Conference record generation system, method, device and storage medium
WO2023087287A1 (en) * 2021-11-19 2023-05-25 京东方科技集团股份有限公司 Conference content display method, conference system and conference device
CN116472705A (en) * 2021-11-19 2023-07-21 京东方科技集团股份有限公司 Conference content display method, conference system and conference equipment
US11869478B2 (en) 2022-03-18 2024-01-09 Qualcomm Incorporated Audio processing using sound source representations

Also Published As

Publication number Publication date
CN111739553B (en) 2024-04-05

Similar Documents

Publication Publication Date Title
CN111739553B (en) Conference sound collection, conference record and conference record presentation method and device
CN107995101B (en) Method and equipment for converting voice message into text message
US11151765B2 (en) Method and apparatus for generating information
CN109618181B (en) Live broadcast interaction method and device, electronic equipment and storage medium
CN113163272B (en) Video editing method, computer device and storage medium
KR101528086B1 (en) System and method for providing conference information
CN104641413A (en) Leveraging head mounted displays to enable person-to-person interactions
CN111626452B (en) Intelligent government affair processing method, device, terminal and medium
CN107562723A (en) Meeting processing method, medium, device and computing device
CN106782551A (en) A kind of speech recognition system and method
US20160329050A1 (en) Meeting assistant
CN108847214A (en) Method of speech processing, client, device, terminal, server and storage medium
CN109474843A (en) The method of speech control terminal, client, server
CN112653902A (en) Speaker recognition method and device and electronic equipment
EP3174052A1 (en) Method and device for realizing voice message visualization service
EP2503545A1 (en) Arrangement and method relating to audio recognition
KR101376292B1 (en) Method and apparatus for providing emotion analysis service during telephone conversation
US10002611B1 (en) Asynchronous audio messaging
CN115098633A (en) Intelligent customer service emotion analysis method and system, electronic equipment and storage medium
CN112102836B (en) Voice control screen display method and device, electronic equipment and medium
CN111312243B (en) Equipment interaction method and device
CN117094690A (en) Information processing method, electronic device, and storage medium
CN108766429B (en) Voice interaction method and device
CN107608718B (en) Information processing method and device
CN116016779A (en) Voice call translation assisting method, system, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant