CN113438440A - Video conference voice conversion text summary method and system - Google Patents

Video conference voice conversion text summary method and system

Info

Publication number
CN113438440A
CN113438440A (application CN202110610479.0A)
Authority
CN
China
Prior art keywords
gender
audio information
voice
text
probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110610479.0A
Other languages
Chinese (zh)
Inventor
秦凤枝
王远丰
罗崇立
陈燕
罗一文
潘亮
凌怡珍
陈业钊
徐晓东
彭文昊
翟长华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Electric Power Communication Technology Co Ltd
Original Assignee
Guangdong Electric Power Communication Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Electric Power Communication Technology Co Ltd filed Critical Guangdong Electric Power Communication Technology Co Ltd
Priority to CN202110610479.0A priority Critical patent/CN113438440A/en
Publication of CN113438440A publication Critical patent/CN113438440A/en
Pending legal-status Critical Current


Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/14Systems for two-way working
    • H04N7/15Conference systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • G10L17/04Training, enrolment or model building
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • G10L2021/02082Noise filtering the noise being echo, reverberation of the speech


Abstract

The application discloses a video conference voice conversion text summary method and system, relating to video conference technology. The method comprises the following steps: determining the target sound pickup currently in use; performing echo cancellation on the first audio information collected by the target sound pickup to obtain second audio information; inputting all or part of the second audio information into a gender identification model to determine the gender of the speaker; and selecting a speech-to-text conversion model according to the gender of the speaker to convert the second audio information into the text summary, where each gender has a corresponding speech-to-text conversion model. Because the speaker's gender is identified first and the speech-to-text conversion model matching that gender is selected for transcription, conversion accuracy is improved.

Description

Video conference voice conversion text summary method and system
Technical Field
The application relates to video conference technology, and in particular to a video conference voice conversion text summary method and system.
Background
In a video conference, in order to record the contents of the conference or display subtitles, it is sometimes necessary to convert a speaker's words into text, thereby forming a conference summary.
In the prior art, the same sentence spoken by users of different genders may produce different conversion results, which shows that the prior art is not sufficiently accurate.
Disclosure of Invention
The present invention aims to solve at least one of the problems in the prior art. It therefore provides a video conference voice conversion text summary method and system to overcome the recognition inaccuracy caused by gender differences.
In one aspect, embodiments of the present application provide:
a video conference voice conversion text summary method comprises the following steps:
determining a target sound pickup currently used;
performing echo cancellation on the first audio information collected by the target sound pickup to obtain second audio information;
inputting all or part of the second audio information into a gender determination model to determine the gender of the speaker;
and selecting a voice-text conversion model according to the gender of the speaker to convert the second audio information to obtain the text summary, wherein each gender is provided with a corresponding voice-text conversion model.
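The four steps above can be sketched end to end as follows. This is a minimal illustration only: every function and data layout here (the microphone dictionaries, the pass-through `echo_cancel`, the lambda models) is a hypothetical placeholder for the components this application describes, not an actual implementation of them.

```python
# Hypothetical end-to-end sketch of the four claimed steps.
def make_text_summary(microphones, gender_model, stt_models):
    mic = max((m for m in microphones if m["active"]),
              key=lambda m: m["volume"])          # step 1: target pickup
    first_audio = mic["audio"]
    second_audio = echo_cancel(first_audio)       # step 2: echo cancellation
    gender = gender_model(second_audio)           # step 3: gender identification
    return stt_models[gender](second_audio)       # step 4: gender-specific model

def echo_cancel(audio):
    # Placeholder: a real system would run acoustic echo cancellation here.
    return audio

summary = make_text_summary(
    [{"active": True, "volume": 0.8, "audio": "raw-bytes"}],
    gender_model=lambda a: "female",
    stt_models={"female": lambda a: "meeting summary text",
                "male":   lambda a: "meeting summary text"},
)
print(summary)
```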
In some embodiments, the selectable speech-to-text conversion models comprise a first speech-to-text conversion model and a second speech-to-text conversion model;
the first speech-to-text conversion model is trained on male voice samples;
the second speech-to-text conversion model is trained on female voice samples.
In some embodiments, inputting the second audio information into the gender identification model to determine the gender of the speaker comprises:
inputting all or part of the second audio information into the gender identification model, so that it outputs a first probability that the second audio information belongs to a male speaker and a second probability that it belongs to a female speaker;
and when the absolute value of the difference between the first probability and the second probability is greater than a preset value, taking the gender corresponding to the larger of the two probabilities as the gender of the speaker.
In some embodiments, the method further comprises the step of: when the absolute value of the difference between the first probability and the second probability is less than or equal to the preset value, acquiring a picture of the user of the target sound pickup, locating the speaker's face according to the position of the target sound pickup in the picture, and determining the speaker's gender through face recognition.
In some embodiments, inputting the second audio information into the gender identification model to determine the gender of the speaker comprises:
dividing the second audio information into a plurality of pieces of sub-audio information;
inputting one piece of sub-audio information into the gender identification model, so that it outputs a first probability that the piece belongs to a male speaker and a second probability that it belongs to a female speaker;
and when the absolute value of the difference between the first probability and the second probability is less than or equal to a preset value, using another piece of sub-audio information to determine the gender of the speaker, until the absolute value of the difference between the two probabilities for some piece of sub-audio information is greater than the preset value.
In some embodiments, the gender identification model is obtained by:
acquiring a training set comprising a plurality of voice samples, each labeled with a gender;
and training the gender identification model on the training set until a stopping condition is met.
In some embodiments, determining the target sound pickup currently used specifically comprises:
determining the sound pickups currently in use;
and taking the sound pickup with the maximum volume among them as the target sound pickup.
In another aspect, an embodiment of the present application provides a video conference voice-to-text summary system, comprising:
a determination unit configured to determine a target sound pickup currently used;
the audio processing unit is used for carrying out echo cancellation on the first audio information collected by the target sound pickup to obtain second audio information;
a gender determination unit for inputting all or part of the second audio information to a gender identification model to determine the gender of the speaker;
and the conversion unit is used for selecting a voice character conversion model according to the gender of the speaker to convert the second audio information to obtain the character summary.
In another aspect, an embodiment of the present application provides a video conference voice-to-text summary system, comprising:
a memory for storing a program;
and a processor for loading the program to execute the video conference voice conversion text summary method.
According to this application, the target sound pickup currently used is determined; echo cancellation is performed on the first audio information collected by the target sound pickup to obtain second audio information; all or part of the second audio information is input into the gender identification model to determine the gender of the speaker; and finally a speech-to-text conversion model is selected according to the speaker's gender to convert the second audio information into the text summary. Because the speech-to-text conversion model matching the speaker's gender processes the voice, the accuracy of the conversion is improved.
Drawings
To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a video conference voice-to-text summary method according to an embodiment of the present application;
Fig. 2 is a block diagram of a video conference voice-to-text summary system according to an embodiment of the present application;
Fig. 3 is a block diagram of another video conference voice-to-text summary system according to an embodiment of the present application.
Detailed Description
In order to make the purpose, technical solutions and advantages of the present application clearer, the technical solutions of the present application will be clearly and completely described below through embodiments with reference to the accompanying drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In the description of the present invention, "several" means one or more and "a plurality" means two or more; "above", "below", "exceeding", and the like are understood as excluding the stated number, while "at least", "within", and the like are understood as including it. Where "first" and "second" are used to distinguish technical features, they are not to be understood as indicating or implying relative importance, the number of the features indicated, or their precedence.
In the description of the present invention, unless otherwise explicitly defined, terms such as set, etc. should be broadly construed, and those skilled in the art can reasonably determine the specific meanings of the above terms in the present invention in combination with the detailed contents of the technical solutions.
In the description of the present invention, reference to the description of the terms "one embodiment," "some embodiments," "an illustrative embodiment," "an example," "a specific example," or "some examples," etc., means that a particular feature or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Referring to fig. 1, this embodiment discloses a method for converting video conference voice into a text summary, which comprises the following steps:
and step 110, determining a target sound pickup currently used.
It will be appreciated that the sound pickup may be a device's microphone. In some scenes, all conference participants join through a device such as a mobile phone, which is typically equipped with a microphone, and the server may determine the target pickup from the microphone activation status of each device. For example, a conference may be configured so that only one person can speak at a time, in which case the single currently active microphone is the target pickup. In other scenes, some participants attend from a conference room in which a plurality of microphones are connected to the room controller, and the target pickup can be determined from the activation states of these devices. When it is assumed that only one person speaks, the target pickup may be determined by the volume detected at each usable microphone: the one with the largest detected volume is the target sound pickup.
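The volume-based selection rule can be sketched as follows. The `Microphone` class and its `active`/`volume` fields are hypothetical stand-ins for whatever device handle a conference controller actually exposes; only the selection logic itself comes from the text above.

```python
# Sketch of the target-pickup selection rule: among microphones currently in
# use, choose the one with the largest detected volume.
from dataclasses import dataclass

@dataclass
class Microphone:
    device_id: str
    active: bool      # microphone is currently enabled
    volume: float     # detected input level, e.g. RMS over a short window

def select_target_pickup(mics):
    """Return the active microphone with the highest detected volume."""
    candidates = [m for m in mics if m.active]
    if not candidates:
        return None
    return max(candidates, key=lambda m: m.volume)

mics = [
    Microphone("room-1", active=True, volume=0.12),
    Microphone("room-2", active=True, volume=0.47),
    Microphone("room-3", active=False, volume=0.90),  # muted, so ignored
]
print(select_target_pickup(mics).device_id)  # room-2
```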
And step 120, performing echo cancellation on the first audio information collected by the target sound pickup to obtain second audio information.
Generally, a loudspeaker is arranged in the device or the conference room, and its output can loop back into the microphone and produce echo or noise. This step therefore applies echo cancellation to the first audio information; it also removes noise and eases subsequent recognition.
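The patent does not specify an echo cancellation algorithm; a common choice for acoustic echo cancellation is a normalized LMS (NLMS) adaptive filter, sketched below under that assumption. The filter learns the loudspeaker-to-microphone echo path online and subtracts the estimated echo from the microphone signal.

```python
# NLMS adaptive echo canceller sketch (an assumed algorithm, not the
# patent's own): estimate the far-end (loudspeaker) echo in the microphone
# signal and subtract it.
import numpy as np

def nlms_echo_cancel(mic, far_end, taps=64, mu=0.5, eps=1e-8):
    """Return the microphone signal with the adaptively estimated echo removed."""
    w = np.zeros(taps)                       # adaptive filter coefficients
    out = np.zeros_like(mic)
    for n in range(taps, len(mic)):
        x = far_end[n - taps:n][::-1]        # recent loudspeaker samples
        e = mic[n] - w @ x                   # mic minus estimated echo
        w += mu * e * x / (x @ x + eps)      # normalized LMS update
        out[n] = e
    return out

rng = np.random.default_rng(1)
far_end = rng.standard_normal(5000)          # loudspeaker (far-end) signal
mic = np.zeros(5000)
mic[10:] = 0.6 * far_end[:-10]               # echo: delayed, attenuated copy
clean = nlms_echo_cancel(mic, far_end)

# After adaptation, the residual echo energy is far below the original.
residual = np.sum(clean[-1000:] ** 2) / np.sum(mic[-1000:] ** 2)
print(residual < 0.01)
```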
All or a portion of the second audio information is input to a gender determination model to determine the gender of the speaker, step 130.
It will be appreciated that the second audio information may be trimmed to fit the input length of the gender identification model.
The gender identification model is obtained by the following steps:
acquiring a training set, wherein the training set comprises a plurality of voice samples, and each voice sample is marked with gender;
and training the gender identification model through the training set until the condition of stopping training is met.
The gender identification model can be implemented with an existing audio classification model; it learns gender-specific characteristics from the voice samples and their gender labels. When predicting on a new voice sample, it outputs the probability that the sample belongs to each gender.
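A toy sketch of this training procedure: fit one distribution per labeled gender and turn likelihoods into the two probabilities the method compares. A real system would use a richer audio classifier over spectral features; the single fundamental-frequency (F0) feature, the Gaussian model, and the sample values here are illustrative assumptions only.

```python
# Illustrative "training" of a gender identification model on labeled voice
# samples, reduced here to a single F0 feature and two fitted Gaussians.
import numpy as np

rng = np.random.default_rng(0)
male_f0 = rng.normal(120, 15, 200)       # labeled male voice samples (Hz)
female_f0 = rng.normal(210, 20, 200)     # labeled female voice samples (Hz)

m_mu, m_sd = male_f0.mean(), male_f0.std()       # fit per-gender Gaussians
f_mu, f_sd = female_f0.mean(), female_f0.std()

def gender_probabilities(f0):
    """Return (P(male), P(female)) for an observed fundamental frequency."""
    def pdf(x, mu, sd):
        return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
    lm, lf = pdf(f0, m_mu, m_sd), pdf(f0, f_mu, f_sd)
    return lm / (lm + lf), lf / (lm + lf)

p_male, p_female = gender_probabilities(125.0)
print(p_male > p_female)  # a 125 Hz voice is classified as male
```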
Step 140, selecting a voice-to-text conversion model according to the gender of the speaker to convert the second audio information to obtain a text summary, wherein each gender is configured with a corresponding voice-to-text conversion model. In some embodiments, the method further comprises the following steps: displaying the text summary.
Wherein, the selectable voice-to-text conversion model comprises a first voice-to-text conversion model and a second voice-to-text conversion model;
the first voice-to-text conversion model is obtained by training a voice sample of a male;
the second speech-to-text conversion model is obtained by training a female speech sample.
It can be understood that because the two models are trained on voice samples of different genders and each learns the characteristics of its gender's speech, each achieves better recognition accuracy on same-gender speech than a model trained on mixed data. In effect, this embodiment splits the task of learning gender characteristics out of a single speech conversion model, dividing the pipeline into gender identification followed by gender-specific speech-to-text conversion. The conversion is therefore more accurate.
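The per-gender dispatch of step 140 can be sketched as a lookup from gender to model. The two lambda "models" below are placeholders for the separately trained male and female recognizers described above, not real speech-to-text engines.

```python
# Minimal sketch of selecting the gender-matched speech-to-text model.
from typing import Callable, Dict

SpeechToText = Callable[[bytes], str]

def convert_with_gender(gender: str, audio: bytes,
                        models: Dict[str, SpeechToText]) -> str:
    model = models[gender]        # each gender has its own conversion model
    return model(audio)

# Placeholder models for illustration only.
models = {
    "male":   lambda audio: "[male-model transcript]",
    "female": lambda audio: "[female-model transcript]",
}
print(convert_with_gender("female", b"\x00", models))
```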
In some embodiments, inputting the second audio information to a gender determination model to determine the gender of the speaker comprises:
inputting all or part of the second audio information into the gender identification model, so that it outputs a first probability that the second audio information belongs to a male speaker and a second probability that it belongs to a female speaker;
and when the absolute value of the difference between the first probability and the second probability is greater than a preset value, the gender corresponding to the larger of the two probabilities is taken as the gender of the speaker.
When the absolute value of the difference is less than or equal to the preset value, a picture of the user of the target sound pickup is acquired, the speaker's face is located according to the position of the target sound pickup in the picture, and the speaker's gender is obtained through face recognition.
In the above embodiment, when the gender identification model cannot determine whether the voice is male or female (that is, the difference between the male and female probabilities is less than or equal to the set threshold), the camera may be called to obtain a picture of the user, and a secondary determination is performed on the user's face in the picture to decide the speaker's gender. This approach is particularly effective in scenes where users join the video conference by mobile phone.
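The two-stage decision can be sketched as below: trust the audio model only when its two probabilities are clearly separated, and otherwise fall back to the face-based result. `face_gender` is a hypothetical hook for the camera-and-face-recognition path, not a real API.

```python
# Sketch of the audio-first, face-fallback gender decision.
def decide_gender(p_male, p_female, threshold, face_gender):
    """Use the audio probabilities when conclusive; otherwise use the face."""
    if abs(p_male - p_female) > threshold:
        return "male" if p_male > p_female else "female"
    return face_gender()   # secondary determination from the camera picture

print(decide_gender(0.9, 0.1, 0.3, lambda: "n/a"))      # confident: male
print(decide_gender(0.55, 0.45, 0.3, lambda: "female")) # ambiguous: use face
```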
In some embodiments, inputting the second audio information to a gender determination model to determine the gender of the speaker comprises:
dividing the second audio information into a plurality of pieces of sub-audio information;
inputting one piece of sub-audio information into the gender identification model, so that it outputs a first probability that the piece belongs to a male speaker and a second probability that it belongs to a female speaker;
and when the absolute value of the difference between the first probability and the second probability is less than or equal to a preset value, using another piece of sub-audio information to determine the gender of the speaker, until the absolute value of the difference between the two probabilities for some piece is greater than the preset value.
In the above example, the second audio information is split to fit the input-length restriction of the gender identification model and to shorten recognition time. When the model cannot distinguish the gender on one piece, the next piece of sub-audio information is input for re-identification. This avoids the case where a silent segment is intercepted and cannot be correctly identified, while keeping recognition time low.
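The chunk-by-chunk loop can be sketched as follows. The toy model that treats all-zero (silent) chunks as ambiguous is an illustrative assumption standing in for the trained gender identification model.

```python
# Sketch of splitting audio into model-sized chunks and stopping at the first
# chunk whose male/female probabilities are clearly separated.
def classify_by_chunks(audio, chunk_len, gender_model, threshold):
    """gender_model(chunk) -> (p_male, p_female); return gender or None."""
    for start in range(0, len(audio), chunk_len):
        chunk = audio[start:start + chunk_len]
        p_male, p_female = gender_model(chunk)
        if abs(p_male - p_female) > threshold:
            return "male" if p_male > p_female else "female"
    return None  # no chunk was conclusive

# Toy model: silent chunks (all zeros) are ambiguous, voiced chunks are "male".
def toy_model(chunk):
    return (0.9, 0.1) if any(chunk) else (0.5, 0.5)

audio = [0] * 100 + [1] * 100   # leading silence, then speech
print(classify_by_chunks(audio, 50, toy_model, 0.3))  # male
```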
In some embodiments, determining the target sound pickup currently used specifically comprises:
determining the sound pickups currently in use;
and taking the sound pickup with the maximum volume among them as the target sound pickup.
It can be understood that this embodiment can quickly locate the sound pickup corresponding to the current speaker when multiple microphones are available in the conference.
Referring to fig. 2, the present embodiment discloses a video conference voice conversion text summary system, including:
a determination unit configured to determine a target sound pickup currently used;
the audio processing unit is used for carrying out echo cancellation on the first audio information collected by the target sound pickup to obtain second audio information;
a gender determination unit for inputting all or part of the second audio information to a gender identification model to determine the gender of the speaker;
and the conversion unit is used for selecting a voice character conversion model according to the gender of the speaker to convert the second audio information to obtain the character summary.
Referring to fig. 3, this embodiment discloses a video conference voice conversion text summary system, comprising:
a memory for storing a program;
and a processor for loading the program to execute the video conference voice conversion text summary method.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present application and the technical principles employed. It will be understood by those skilled in the art that the present application is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the application. Therefore, although the present application has been described in more detail with reference to the above embodiments, the present application is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present application, and the scope of the present application is determined by the scope of the appended claims.

Claims (10)

1. A video conference voice conversion text summary method is characterized by comprising the following steps:
determining a target sound pickup currently used;
performing echo cancellation on the first audio information collected by the target sound pickup to obtain second audio information;
inputting all or part of the second audio information into a gender determination model to determine the gender of the speaker;
and selecting a voice-text conversion model according to the gender of the speaker to convert the second audio information to obtain the text summary, wherein each gender is provided with a corresponding voice-text conversion model.
2. The videoconference voice-to-text summary method of claim 1, wherein the selectable speech-to-text conversion models comprise a first speech-to-text conversion model and a second speech-to-text conversion model;
the first voice-to-text conversion model is obtained by training a voice sample of a male;
the second speech-to-text conversion model is obtained by training a female speech sample.
3. The videoconference voice-to-text summary method of claim 1, wherein inputting the second audio information into a gender identification model to determine the gender of a speaker comprises:
inputting all or part of the second audio information into a gender identification model, and enabling the gender identification model to output a first probability that the second audio information belongs to males and a second probability that the second audio information belongs to females;
and when the absolute value of the difference between the first probability and the second probability is larger than a preset value, taking the gender corresponding to the larger value of the first probability and the second probability as the gender of the speaker.
4. The videoconference voice-to-text summary method of claim 3, further comprising the steps of: and when the absolute value of the difference between the first probability and the second probability is smaller than or equal to a preset value, acquiring the picture of a user of the target sound pickup, determining the face of the speaker according to the position of the target sound pickup in the picture, and identifying through the face to obtain the gender of the speaker.
5. The videoconference voice-to-text summary method of claim 1, wherein inputting the second audio information into a gender identification model to determine the gender of a speaker comprises:
dividing the second audio information into a plurality of sub-audio information;
inputting a sub-audio information of the second audio information into a gender identification model, and enabling the gender identification model to output a first probability that the second audio information belongs to males and a second probability that the second audio information belongs to females;
and when the absolute value of the difference between the first probability and the second probability is less than or equal to a preset value, determining the gender of the speaker by using the other sub-audio information of the second audio information until the absolute value of the difference between the first probability and the second probability corresponding to one sub-audio information is greater than the preset value.
6. The videoconference speech-to-text summary method of claim 1, wherein the gender identification model is derived by:
acquiring a training set, wherein the training set comprises a plurality of voice samples, and each voice sample is marked with gender;
training the gender identification model through a training set until a condition for stopping training is met.
7. The videoconference voice-to-text summary method of claim 1, wherein the determining a target microphone currently in use comprises:
determining a sound pick-up currently in a use state;
and taking the sound pickup with the maximum volume as a target sound pickup.
8. The videoconference voice-to-text summary method of claim 7, further comprising the steps of: displaying the text summary.
9. A videoconference voice-to-text summary system, comprising:
a determination unit configured to determine a target sound pickup currently used;
the audio processing unit is used for carrying out echo cancellation on the first audio information collected by the target sound pickup to obtain second audio information;
a gender determination unit for inputting all or part of the second audio information to a gender identification model to determine the gender of the speaker;
and the conversion unit is used for selecting a voice character conversion model according to the gender of the speaker to convert the second audio information to obtain the character summary.
10. A videoconference voice-to-text summary system, comprising:
a memory for storing a program;
and a processor for loading the program to perform the videoconference voice conversion text summary method of any of claims 1-7.
CN202110610479.0A 2021-06-01 2021-06-01 Video conference voice conversion text summary method and system Pending CN113438440A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110610479.0A CN113438440A (en) 2021-06-01 2021-06-01 Video conference voice conversion text summary method and system


Publications (1)

Publication Number Publication Date
CN113438440A true CN113438440A (en) 2021-09-24

Family

ID=77803439

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110610479.0A Pending CN113438440A (en) 2021-06-01 2021-06-01 Video conference voice conversion text summary method and system

Country Status (1)

Country Link
CN (1) CN113438440A (en)


Legal Events

Date Code Title Description
PB01 Publication