CN113438440A - Video conference voice conversion text summary method and system - Google Patents

Video conference voice conversion text summary method and system

Info

Publication number
CN113438440A
CN113438440A (application CN202110610479.0A)
Authority
CN
China
Prior art keywords
gender
audio information
voice
text
probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110610479.0A
Other languages
Chinese (zh)
Inventor
秦凤枝
王远丰
罗崇立
陈燕
罗一文
潘亮
凌怡珍
陈业钊
徐晓东
彭文昊
翟长华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Electric Power Communication Technology Co Ltd
Original Assignee
Guangdong Electric Power Communication Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Electric Power Communication Technology Co Ltd filed Critical Guangdong Electric Power Communication Technology Co Ltd
Priority to CN202110610479.0A priority Critical patent/CN113438440A/en
Publication of CN113438440A publication Critical patent/CN113438440A/en
Pending legal-status Critical Current


Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/14Systems for two-way working
    • H04N7/15Conference systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • G10L17/04Training, enrolment or model building
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • G10L2021/02082Noise filtering the noise being echo, reverberation of the speech


Abstract

The application discloses a video conference voice conversion text summary method and system, relating to video conference technology. The method comprises the following steps: determining the target sound pickup currently in use; performing echo cancellation on the first audio information collected by the target sound pickup to obtain second audio information; inputting all or part of the second audio information into a gender identification model to determine the gender of the speaker; and selecting a speech-to-text conversion model according to the gender of the speaker to convert the second audio information into the text summary, where each gender has a corresponding speech-to-text conversion model. Because the speaker's gender is identified first and the speech-to-text conversion model matching that gender is selected for transcription, conversion accuracy is improved.

Description

Video conference voice conversion text summary method and system
Technical Field
The application relates to video conference technology, and in particular to a video conference voice conversion text summary method and system.
Background
In a video conference, in order to record the contents of the conference or display subtitles, it is sometimes necessary to convert a speaker's words into text, thereby forming a conference summary.
In the prior art, the same sentence spoken by users of different genders may produce different conversion results, which shows that the prior art is not sufficiently accurate.
Disclosure of Invention
The present invention aims to solve at least one of the problems in the prior art. It therefore provides a video conference voice conversion text summary method and system to overcome the recognition inaccuracy caused by gender differences.
In one aspect, embodiments of the present application provide:
a video conference voice conversion text summary method comprises the following steps:
determining a target sound pickup currently used;
performing echo cancellation on the first audio information collected by the target sound pickup to obtain second audio information;
inputting all or part of the second audio information into a gender determination model to determine the gender of the speaker;
and selecting a voice-text conversion model according to the gender of the speaker to convert the second audio information to obtain the text summary, wherein each gender is provided with a corresponding voice-text conversion model.
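The four steps above can be sketched end to end as follows. This is a minimal illustration only: every function and data layout here (the microphone dictionaries, the pass-through `echo_cancel`, the lambda models) is a hypothetical placeholder for the components this application describes, not an actual implementation of them.

```python
# Hypothetical end-to-end sketch of the four claimed steps.
def make_text_summary(microphones, gender_model, stt_models):
    mic = max((m for m in microphones if m["active"]),
              key=lambda m: m["volume"])          # step 1: target pickup
    first_audio = mic["audio"]
    second_audio = echo_cancel(first_audio)       # step 2: echo cancellation
    gender = gender_model(second_audio)           # step 3: gender identification
    return stt_models[gender](second_audio)       # step 4: gender-specific model

def echo_cancel(audio):
    # Placeholder: a real system would run acoustic echo cancellation here.
    return audio

summary = make_text_summary(
    [{"active": True, "volume": 0.8, "audio": "raw-bytes"}],
    gender_model=lambda a: "female",
    stt_models={"female": lambda a: "meeting summary text",
                "male":   lambda a: "meeting summary text"},
)
print(summary)
```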
In some embodiments, the selectable speech-to-text conversion models comprise a first speech-to-text conversion model and a second speech-to-text conversion model;
the first speech-to-text conversion model is trained on male voice samples;
the second speech-to-text conversion model is trained on female voice samples.
In some embodiments, inputting the second audio information into the gender identification model to determine the gender of the speaker comprises:
inputting all or part of the second audio information into the gender identification model, so that it outputs a first probability that the second audio information belongs to a male speaker and a second probability that it belongs to a female speaker;
and when the absolute value of the difference between the first probability and the second probability is greater than a preset value, taking the gender corresponding to the larger of the two probabilities as the gender of the speaker.
In some embodiments, the method further comprises the step of: when the absolute value of the difference between the first probability and the second probability is less than or equal to the preset value, acquiring a picture of the user of the target sound pickup, locating the speaker's face according to the position of the target sound pickup in the picture, and determining the speaker's gender through face recognition.
In some embodiments, inputting the second audio information into the gender identification model to determine the gender of the speaker comprises:
dividing the second audio information into a plurality of pieces of sub-audio information;
inputting one piece of sub-audio information into the gender identification model, so that it outputs a first probability that the piece belongs to a male speaker and a second probability that it belongs to a female speaker;
and when the absolute value of the difference between the first probability and the second probability is less than or equal to a preset value, using another piece of sub-audio information to determine the gender of the speaker, until the absolute value of the difference between the two probabilities for some piece of sub-audio information is greater than the preset value.
In some embodiments, the gender identification model is obtained by:
acquiring a training set comprising a plurality of voice samples, each labeled with a gender;
and training the gender identification model on the training set until a stopping condition is met.
In some embodiments, determining the target sound pickup currently used specifically comprises:
determining the sound pickups currently in use;
and taking the sound pickup with the maximum volume among them as the target sound pickup.
In another aspect, an embodiment of the present application provides a video conference voice-to-text summary system, comprising:
a determination unit configured to determine a target sound pickup currently used;
the audio processing unit is used for carrying out echo cancellation on the first audio information collected by the target sound pickup to obtain second audio information;
a gender determination unit for inputting all or part of the second audio information to a gender identification model to determine the gender of the speaker;
and the conversion unit is used for selecting a voice character conversion model according to the gender of the speaker to convert the second audio information to obtain the character summary.
In another aspect, an embodiment of the present application provides a video conference voice-to-text summary system, comprising:
a memory for storing a program;
and a processor for loading the program to execute the video conference voice conversion text summary method.
According to this application, the target sound pickup currently used is determined; echo cancellation is performed on the first audio information collected by the target sound pickup to obtain second audio information; all or part of the second audio information is input into the gender identification model to determine the gender of the speaker; and finally a speech-to-text conversion model is selected according to the speaker's gender to convert the second audio information into the text summary. Because the speech-to-text conversion model matching the speaker's gender processes the voice, the accuracy of the conversion is improved.
Drawings
To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a video conference voice-to-text summary method according to an embodiment of the present application;
Fig. 2 is a block diagram of a video conference voice-to-text summary system according to an embodiment of the present application;
Fig. 3 is a block diagram of another video conference voice-to-text summary system according to an embodiment of the present application.
Detailed Description
In order to make the purpose, technical solutions and advantages of the present application clearer, the technical solutions of the present application will be clearly and completely described below through embodiments with reference to the accompanying drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In the description of the present invention, "several" means one or more and "a plurality" means two or more; "above", "below", "exceeding", and the like are understood as excluding the stated number, while "at least", "within", and the like are understood as including it. Where "first" and "second" are used to distinguish technical features, they are not to be understood as indicating or implying relative importance, the number of the features indicated, or their precedence.
In the description of the present invention, unless otherwise explicitly defined, terms such as set, etc. should be broadly construed, and those skilled in the art can reasonably determine the specific meanings of the above terms in the present invention in combination with the detailed contents of the technical solutions.
In the description of the present invention, reference to the description of the terms "one embodiment," "some embodiments," "an illustrative embodiment," "an example," "a specific example," or "some examples," etc., means that a particular feature or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Referring to fig. 1, this embodiment discloses a method for converting video conference voice into a text summary, which comprises the following steps:
and step 110, determining a target sound pickup currently used.
It will be appreciated that the sound pickup may be a device's microphone. In some scenes, all conference participants join through a device such as a mobile phone, which is typically equipped with a microphone, and the server may determine the target pickup from the microphone activation status of each device. For example, a conference may be configured so that only one person can speak at a time, in which case the single currently active microphone is the target pickup. In other scenes, some participants attend from a conference room in which a plurality of microphones are connected to the room controller, and the target pickup can be determined from the activation states of these devices. When it is assumed that only one person speaks, the target pickup may be determined by the volume detected at each usable microphone: the one with the largest detected volume is the target sound pickup.
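The volume-based selection rule can be sketched as follows. The `Microphone` class and its `active`/`volume` fields are hypothetical stand-ins for whatever device handle a conference controller actually exposes; only the selection logic itself comes from the text above.

```python
# Sketch of the target-pickup selection rule: among microphones currently in
# use, choose the one with the largest detected volume.
from dataclasses import dataclass

@dataclass
class Microphone:
    device_id: str
    active: bool      # microphone is currently enabled
    volume: float     # detected input level, e.g. RMS over a short window

def select_target_pickup(mics):
    """Return the active microphone with the highest detected volume."""
    candidates = [m for m in mics if m.active]
    if not candidates:
        return None
    return max(candidates, key=lambda m: m.volume)

mics = [
    Microphone("room-1", active=True, volume=0.12),
    Microphone("room-2", active=True, volume=0.47),
    Microphone("room-3", active=False, volume=0.90),  # muted, so ignored
]
print(select_target_pickup(mics).device_id)  # room-2
```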
And step 120, performing echo cancellation on the first audio information collected by the target sound pickup to obtain second audio information.
Generally, a loudspeaker is arranged in the device or the conference room, and its output can loop back into the microphone and produce echo or noise. This step therefore applies echo cancellation to the first audio information; it also removes noise and eases subsequent recognition.
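The patent does not specify an echo cancellation algorithm; a common choice for acoustic echo cancellation is a normalized LMS (NLMS) adaptive filter, sketched below under that assumption. The filter learns the loudspeaker-to-microphone echo path online and subtracts the estimated echo from the microphone signal.

```python
# NLMS adaptive echo canceller sketch (an assumed algorithm, not the
# patent's own): estimate the far-end (loudspeaker) echo in the microphone
# signal and subtract it.
import numpy as np

def nlms_echo_cancel(mic, far_end, taps=64, mu=0.5, eps=1e-8):
    """Return the microphone signal with the adaptively estimated echo removed."""
    w = np.zeros(taps)                       # adaptive filter coefficients
    out = np.zeros_like(mic)
    for n in range(taps, len(mic)):
        x = far_end[n - taps:n][::-1]        # recent loudspeaker samples
        e = mic[n] - w @ x                   # mic minus estimated echo
        w += mu * e * x / (x @ x + eps)      # normalized LMS update
        out[n] = e
    return out

rng = np.random.default_rng(1)
far_end = rng.standard_normal(5000)          # loudspeaker (far-end) signal
mic = np.zeros(5000)
mic[10:] = 0.6 * far_end[:-10]               # echo: delayed, attenuated copy
clean = nlms_echo_cancel(mic, far_end)

# After adaptation, the residual echo energy is far below the original.
residual = np.sum(clean[-1000:] ** 2) / np.sum(mic[-1000:] ** 2)
print(residual < 0.01)
```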
All or a portion of the second audio information is input to a gender determination model to determine the gender of the speaker, step 130.
It will be appreciated that the second audio information may be trimmed to fit the input length of the gender identification model.
The gender identification model is obtained by the following steps:
acquiring a training set, wherein the training set comprises a plurality of voice samples, and each voice sample is marked with gender;
and training the gender identification model through the training set until the condition of stopping training is met.
The gender identification model can be implemented with an existing audio classification model; it learns gender-specific characteristics from the voice samples and their gender labels. When predicting on a new voice sample, it outputs the probability that the sample belongs to each gender.
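A toy sketch of this training procedure: fit one distribution per labeled gender and turn likelihoods into the two probabilities the method compares. A real system would use a richer audio classifier over spectral features; the single fundamental-frequency (F0) feature, the Gaussian model, and the sample values here are illustrative assumptions only.

```python
# Illustrative "training" of a gender identification model on labeled voice
# samples, reduced here to a single F0 feature and two fitted Gaussians.
import numpy as np

rng = np.random.default_rng(0)
male_f0 = rng.normal(120, 15, 200)       # labeled male voice samples (Hz)
female_f0 = rng.normal(210, 20, 200)     # labeled female voice samples (Hz)

m_mu, m_sd = male_f0.mean(), male_f0.std()       # fit per-gender Gaussians
f_mu, f_sd = female_f0.mean(), female_f0.std()

def gender_probabilities(f0):
    """Return (P(male), P(female)) for an observed fundamental frequency."""
    def pdf(x, mu, sd):
        return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
    lm, lf = pdf(f0, m_mu, m_sd), pdf(f0, f_mu, f_sd)
    return lm / (lm + lf), lf / (lm + lf)

p_male, p_female = gender_probabilities(125.0)
print(p_male > p_female)  # a 125 Hz voice is classified as male
```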
Step 140, selecting a voice-to-text conversion model according to the gender of the speaker to convert the second audio information to obtain a text summary, wherein each gender is configured with a corresponding voice-to-text conversion model. In some embodiments, the method further comprises the following steps: displaying the text summary.
Wherein, the selectable voice-to-text conversion model comprises a first voice-to-text conversion model and a second voice-to-text conversion model;
the first voice-to-text conversion model is obtained by training a voice sample of a male;
the second speech-to-text conversion model is obtained by training a female speech sample.
It can be understood that because the two models are trained on voice samples of different genders and each learns the characteristics of its gender's speech, each achieves better recognition accuracy on same-gender speech than a model trained on mixed data. In effect, this embodiment splits the task of learning gender characteristics out of a single speech conversion model, dividing the pipeline into gender identification followed by gender-specific speech-to-text conversion. The conversion is therefore more accurate.
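The per-gender dispatch of step 140 can be sketched as a lookup from gender to model. The two lambda "models" below are placeholders for the separately trained male and female recognizers described above, not real speech-to-text engines.

```python
# Minimal sketch of selecting the gender-matched speech-to-text model.
from typing import Callable, Dict

SpeechToText = Callable[[bytes], str]

def convert_with_gender(gender: str, audio: bytes,
                        models: Dict[str, SpeechToText]) -> str:
    model = models[gender]        # each gender has its own conversion model
    return model(audio)

# Placeholder models for illustration only.
models = {
    "male":   lambda audio: "[male-model transcript]",
    "female": lambda audio: "[female-model transcript]",
}
print(convert_with_gender("female", b"\x00", models))
```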
In some embodiments, inputting the second audio information to a gender determination model to determine the gender of the speaker comprises:
inputting all or part of the second audio information into the gender identification model, so that it outputs a first probability that the second audio information belongs to a male speaker and a second probability that it belongs to a female speaker;
and when the absolute value of the difference between the first probability and the second probability is greater than a preset value, the gender corresponding to the larger of the two probabilities is taken as the gender of the speaker.
When the absolute value of the difference is less than or equal to the preset value, a picture of the user of the target sound pickup is acquired, the speaker's face is located according to the position of the target sound pickup in the picture, and the speaker's gender is obtained through face recognition.
In the above embodiment, when the gender identification model cannot determine whether the voice is male or female (that is, the difference between the male and female probabilities is less than or equal to the set threshold), the camera may be called to obtain a picture of the user, and a secondary determination is performed on the user's face in the picture to decide the speaker's gender. This approach is particularly effective in scenes where users join the video conference by mobile phone.
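The two-stage decision can be sketched as below: trust the audio model only when its two probabilities are clearly separated, and otherwise fall back to the face-based result. `face_gender` is a hypothetical hook for the camera-and-face-recognition path, not a real API.

```python
# Sketch of the audio-first, face-fallback gender decision.
def decide_gender(p_male, p_female, threshold, face_gender):
    """Use the audio probabilities when conclusive; otherwise use the face."""
    if abs(p_male - p_female) > threshold:
        return "male" if p_male > p_female else "female"
    return face_gender()   # secondary determination from the camera picture

print(decide_gender(0.9, 0.1, 0.3, lambda: "n/a"))      # confident: male
print(decide_gender(0.55, 0.45, 0.3, lambda: "female")) # ambiguous: use face
```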
In some embodiments, inputting the second audio information to a gender determination model to determine the gender of the speaker comprises:
dividing the second audio information into a plurality of pieces of sub-audio information;
inputting one piece of sub-audio information into the gender identification model, so that it outputs a first probability that the piece belongs to a male speaker and a second probability that it belongs to a female speaker;
and when the absolute value of the difference between the first probability and the second probability is less than or equal to a preset value, using another piece of sub-audio information to determine the gender of the speaker, until the absolute value of the difference between the two probabilities for some piece is greater than the preset value.
In the above example, the second audio information is split to fit the input-length restriction of the gender identification model and to shorten recognition time. When the model cannot distinguish the gender on one piece, the next piece of sub-audio information is input for re-identification. This avoids the case where a silent segment is intercepted and cannot be correctly identified, while keeping recognition time low.
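The chunk-by-chunk loop can be sketched as follows. The toy model that treats all-zero (silent) chunks as ambiguous is an illustrative assumption standing in for the trained gender identification model.

```python
# Sketch of splitting audio into model-sized chunks and stopping at the first
# chunk whose male/female probabilities are clearly separated.
def classify_by_chunks(audio, chunk_len, gender_model, threshold):
    """gender_model(chunk) -> (p_male, p_female); return gender or None."""
    for start in range(0, len(audio), chunk_len):
        chunk = audio[start:start + chunk_len]
        p_male, p_female = gender_model(chunk)
        if abs(p_male - p_female) > threshold:
            return "male" if p_male > p_female else "female"
    return None  # no chunk was conclusive

# Toy model: silent chunks (all zeros) are ambiguous, voiced chunks are "male".
def toy_model(chunk):
    return (0.9, 0.1) if any(chunk) else (0.5, 0.5)

audio = [0] * 100 + [1] * 100   # leading silence, then speech
print(classify_by_chunks(audio, 50, toy_model, 0.3))  # male
```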
In some embodiments, determining the target sound pickup currently used specifically comprises:
determining the sound pickups currently in use;
and taking the sound pickup with the maximum volume among them as the target sound pickup.
It can be understood that this embodiment can quickly locate the sound pickup corresponding to the current speaker when multiple microphones are available in the conference.
Referring to fig. 2, the present embodiment discloses a video conference voice conversion text summary system, including:
a determination unit configured to determine a target sound pickup currently used;
the audio processing unit is used for carrying out echo cancellation on the first audio information collected by the target sound pickup to obtain second audio information;
a gender determination unit for inputting all or part of the second audio information to a gender identification model to determine the gender of the speaker;
and the conversion unit is used for selecting a voice character conversion model according to the gender of the speaker to convert the second audio information to obtain the character summary.
Referring to fig. 3, this embodiment discloses a video conference voice conversion text summary system, comprising:
a memory for storing a program;
and a processor for loading the program to execute the video conference voice conversion text summary method.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present application and the technical principles employed. It will be understood by those skilled in the art that the present application is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the application. Therefore, although the present application has been described in more detail with reference to the above embodiments, the present application is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present application, and the scope of the present application is determined by the scope of the appended claims.

Claims (10)

1. A video conference voice conversion text summary method is characterized by comprising the following steps:
determining a target sound pickup currently used;
performing echo cancellation on the first audio information collected by the target sound pickup to obtain second audio information;
inputting all or part of the second audio information into a gender determination model to determine the gender of the speaker;
and selecting a voice-text conversion model according to the gender of the speaker to convert the second audio information to obtain the text summary, wherein each gender is provided with a corresponding voice-text conversion model.
2. The videoconference voice-to-text summary method of claim 1, wherein the selectable speech-to-text conversion models comprise a first speech-to-text conversion model and a second speech-to-text conversion model;
the first voice-to-text conversion model is obtained by training a voice sample of a male;
the second speech-to-text conversion model is obtained by training a female speech sample.
3. The videoconference voice-to-text summary method of claim 1, wherein inputting the second audio information into a gender identification model to determine the gender of a speaker comprises:
inputting all or part of the second audio information into a gender identification model, and enabling the gender identification model to output a first probability that the second audio information belongs to males and a second probability that the second audio information belongs to females;
and when the absolute value of the difference between the first probability and the second probability is larger than a preset value, taking the gender corresponding to the larger value of the first probability and the second probability as the gender of the speaker.
4. The videoconference voice-to-text summary method of claim 3, further comprising the steps of: and when the absolute value of the difference between the first probability and the second probability is smaller than or equal to a preset value, acquiring the picture of a user of the target sound pickup, determining the face of the speaker according to the position of the target sound pickup in the picture, and identifying through the face to obtain the gender of the speaker.
5. The videoconference voice-to-text summary method of claim 1, wherein inputting the second audio information into a gender identification model to determine the gender of a speaker comprises:
dividing the second audio information into a plurality of sub-audio information;
inputting a sub-audio information of the second audio information into a gender identification model, and enabling the gender identification model to output a first probability that the second audio information belongs to males and a second probability that the second audio information belongs to females;
and when the absolute value of the difference between the first probability and the second probability is less than or equal to a preset value, determining the gender of the speaker by using the other sub-audio information of the second audio information until the absolute value of the difference between the first probability and the second probability corresponding to one sub-audio information is greater than the preset value.
6. The videoconference speech-to-text summary method of claim 1, wherein the gender identification model is derived by:
acquiring a training set, wherein the training set comprises a plurality of voice samples, and each voice sample is marked with gender;
training the gender identification model through a training set until a condition for stopping training is met.
7. The videoconference voice-to-text summary method of claim 1, wherein the determining a target microphone currently in use comprises:
determining a sound pick-up currently in a use state;
and taking the sound pickup with the maximum volume as a target sound pickup.
8. The videoconference voice-to-text summary method of claim 7, further comprising the steps of: displaying the text summary.
9. A videoconference voice-to-text summary system, comprising:
a determination unit configured to determine a target sound pickup currently used;
the audio processing unit is used for carrying out echo cancellation on the first audio information collected by the target sound pickup to obtain second audio information;
a gender determination unit for inputting all or part of the second audio information to a gender identification model to determine the gender of the speaker;
and the conversion unit is used for selecting a voice character conversion model according to the gender of the speaker to convert the second audio information to obtain the character summary.
10. A videoconference voice-to-text summary system, comprising:
a memory for storing a program;
and a processor for loading the program to perform the videoconference voice conversion text summary method of any of claims 1-7.
CN202110610479.0A 2021-06-01 2021-06-01 Video conference voice conversion text summary method and system Pending CN113438440A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110610479.0A CN113438440A (en) 2021-06-01 2021-06-01 Video conference voice conversion text summary method and system


Publications (1)

Publication Number Publication Date
CN113438440A true CN113438440A (en) 2021-09-24

Family

ID=77803439

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110610479.0A Pending CN113438440A (en) 2021-06-01 2021-06-01 Video conference voice conversion text summary method and system

Country Status (1)

Country Link
CN (1) CN113438440A (en)


Legal Events

Date Code Title Description
PB01 Publication