CN110661923A - Method and device for recording speech information in conference

Info

Publication number: CN110661923A
Application number: CN201810688911.6A
Authority: CN
Inventors: 彭宇龙; 韩杰; 王艳辉; 刘宝臣
Original assignee: Visionvera Information Technology Co Ltd
Current assignee: Visionvera Information Technology Co Ltd
Priority date: 2018-06-28
Filing date: 2018-06-28
Publication date: 2020-01-07

Abstract

An embodiment of the invention provides a method and a device for recording speech information in a conference. The method comprises the following steps: acquiring audio signals in a conference; segmenting one or more voice signals from the audio signals according to preset segmentation parameters; identifying the speaker to which each voice signal belongs; performing voice recognition on the voice signal to obtain text information; and recording the text information into the speech information of the speaker. By automatically identifying the speaker and the text information corresponding to each voice signal, the embodiment automatically generates the speech information of each user, spares the recording person from manually and repeatedly listening to the conference recording to fill in that information, and greatly improves the efficiency of generating the conference record.

Description

Method and device for recording speech information in conference

Technical Field

The present invention relates to the field of computer processing technology, and in particular, to a method for recording speech information in a conference and an apparatus for recording speech information in a conference.

Background

In places such as schools, enterprises, factories, and the like, users often need to meet and discuss various things such as learning and work.

In the process of the conference, the recording personnel record the organization condition and the specific content of the conference to form a conference record, and the conference record is archived and shared to other users.

At present, because users speak quickly, a recording person usually makes a sound recording during the conference, manually records part of each user's speech information, and after the conference repeatedly listens to the recording to fill in the rest, so that the efficiency of generating a conference record is low.

Disclosure of Invention

The embodiment of the invention provides a method for recording speech information in a conference, which aims to solve the problem that generating a conference record is inefficient because the conference recording must be listened to manually and repeatedly in order to record the users' speech information.

According to an aspect of the present invention, there is provided a method of recording speech information in a conference, including:

acquiring audio signals in a conference;

segmenting one or more voice signals from the audio signals according to preset segmentation parameters;

identifying a speaker to which the voice signal belongs;

carrying out voice recognition on the voice signal to obtain text information;

and recording the text information into the speaking information of the speaker.

Optionally, the segmentation parameter includes an interruption time, and segmenting one or more voice signals from the audio signal according to the preset segmentation parameters includes:

judging whether the audio signal is a voice signal or not;

if so, marking the initial position of the voice signal;

calculating the interruption time of the voice signal;

if the interruption time exceeds a preset time threshold, marking the end position of the voice signal;

segmenting a speech signal located between the start position and the end position from the audio signal.

Optionally, the recognizing a speaker to which the voice signal belongs includes:

extracting target voice characteristic information from the voice signal;

matching the target voice characteristic information with reference voice characteristic information of a speaker;

and if the target voice characteristic information is successfully matched with the reference voice characteristic information of a speaker, determining that the voice signal belongs to the speaker.

Optionally, the extracting target speech feature information from the speech signal includes:

if the time length of the voice signal exceeds a preset length threshold, extracting at least two voice fragment signals from the voice signal;

and extracting the voice characteristic information of each voice fragment signal, and combining the voice characteristic information into the target voice characteristic information.

Optionally, the recognizing a speaker to which the voice signal belongs further includes:

and if the target voice characteristic information is successfully matched with the reference voice characteristic information of at least two speakers, adjusting the segmentation parameters, and returning to the step of segmenting one or more voice signals from the audio signals according to the preset segmentation parameters.

Optionally, the recognizing a speaker to which the voice signal belongs further includes:

if the target voice characteristic information fails to be matched with the reference voice characteristic information of all speakers, generating a temporary user as the speaker, and setting the target voice characteristic information as the reference voice characteristic information of the temporary user;

determining that a speaker to which the voice signal belongs is the temporary user.

Optionally, the recording the text information to the speaking information of the speaker includes:

setting the time of the voice signal as the speaking time of the speaker;

searching text information at the same speaking time;

and setting the text information as the speaking information of the speaker in the speaking time.

Optionally, the recording the text information to the speaking information of the speaker further includes:

and if the speaker is a temporary user and receives change information aiming at the temporary user, modifying the information of the temporary user according to the change information.

According to another aspect of the present invention, there is provided an apparatus for recording utterance information in a conference, including:

the audio signal acquisition module is used for acquiring audio signals in a conference;

the voice signal segmentation module is used for segmenting one or more voice signals from the audio signals according to preset segmentation parameters;

the speaker recognition module is used for recognizing a speaker to which the voice signal belongs;

the voice recognition module is used for carrying out voice recognition on the voice signal to obtain text information;

and the speech information recording module is used for recording the text information into the speech information of the speaker.

Optionally, the segmentation parameter includes an interruption time, and the speech signal segmentation module includes:

the voice signal judgment submodule is used for judging whether the audio signal is a voice signal, and if so, calling the start position marking submodule;

a start position marking submodule for marking a start position of the voice signal;

the interruption time calculation submodule is used for calculating the interruption time of the voice signal;

an end position marking submodule, configured to mark an end position of the voice signal if the interruption time exceeds a preset time threshold;

and the position segmentation submodule is used for segmenting the voice signal between the starting position and the ending position from the audio signal.

Optionally, the speaker recognition module comprises:

the target voice characteristic information extraction submodule is used for extracting target voice characteristic information from the voice signal;

the voice characteristic information matching submodule is used for matching the target voice characteristic information with reference voice characteristic information of a speaker;

and the speaker determining submodule is used for determining that the voice signal belongs to a speaker if the target voice characteristic information is successfully matched with the reference voice characteristic information of the speaker.

Optionally, the target speech feature information extraction sub-module includes:

the voice segment signal extraction unit is used for extracting at least two voice segment signals from the voice signals if the time length of the voice signals exceeds a preset length threshold;

and the voice characteristic information combination unit is used for extracting the voice characteristic information of each voice segment signal and combining the voice characteristic information into the target voice characteristic information.

Optionally, the speaker recognition module further comprises:

and the segmentation parameter adjusting submodule is used for adjusting the segmentation parameters and returning the segmentation parameters to the voice signal segmentation module if the target voice characteristic information is successfully matched with the reference voice characteristic information of at least two speakers.

Optionally, the speaker recognition module further comprises:

the temporary user setting submodule is used for generating a temporary user as a speaker if the target voice characteristic information fails to be matched with the reference voice characteristic information of all speakers, and setting the target voice characteristic information as the reference voice characteristic information of the temporary user;

and the temporary user determination submodule is used for determining that the speaker to which the voice signal belongs is the temporary user.

Optionally, the utterance information recording module includes:

the speaking time setting submodule is used for setting the time of the voice signal as the speaking time of the speaker;

the text information searching submodule is used for searching the text information at the same speaking time;

and the text information setting submodule is used for setting the text information as the speaking information of the speaker in the speaking time.

Optionally, the speech information recording module further includes:

and the temporary user changing submodule is used for modifying the information of the temporary user according to the changing information if the speaking person is the temporary user and the changing information aiming at the temporary user is received.

The embodiment of the invention has the following advantages:

In the embodiment of the invention, audio signals are collected in the conference, one or more voice signals are segmented from the audio signals according to preset segmentation parameters, the speaker to which each voice signal belongs is identified, voice recognition is performed on the voice signal to obtain text information, and the text information is recorded into the speech information of the speaker. Because the speaker and the text information corresponding to each voice signal are identified automatically, the speech information of each user is generated automatically, the recording person no longer needs to listen manually and repeatedly to the conference recording, and the efficiency of generating the conference record is greatly improved.

Drawings

Fig. 1 is a flow chart illustrating the steps of a method for recording speech information in a conference according to an embodiment of the present invention;

Fig. 2 is a block diagram of an apparatus for recording speech information in a conference according to an embodiment of the present invention.

Detailed Description

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.

Referring to fig. 1, a flowchart illustrating steps of a method for recording utterance information in a conference according to an embodiment of the present invention is shown, which may specifically include the following steps:

Step 101, collecting audio signals in a conference.

In a specific implementation, one or more sound pick-up devices are deployed in a place where a conference is carried out, and after the conference is started, the sound pick-up devices are started to work and collect audio signals.

The collected audio signals can be amplified and filtered by a band-pass filter (20 Hz to 20 kHz) to remove noise other than the human voice.

Step 102, segmenting one or more voice signals from the audio signal according to preset segmentation parameters.

The sound pickup device generally captures audio signals in the environment continuously. These include voice signals from users (speakers), and possibly sounds that are not speech, such as the sound of a user walking.

Therefore, a segmentation parameter may be set according to the characteristics of a user's speech, and one or more continuous voice signals may be segmented from the audio signal according to the segmentation parameter, each serving as the voice signal of one speech by a certain user.

In one embodiment of the present invention, the segmentation parameter includes an interruption time, and step 102 may include the following sub-steps:

Sub-step S11, determining whether the audio signal is a voice signal; if so, executing sub-step S12.

Sub-step S12, marking the start position of the voice signal.

Sub-step S13, calculating the interruption time of the voice signal.

Sub-step S14, if the interruption time exceeds a preset time threshold, marking the end position of the voice signal.

Sub-step S15, segmenting the voice signal located between the start position and the end position from the audio signal.

In the embodiment of the present invention, Voice Activity Detection (VAD) may be performed on the collected audio signal: the voice part (i.e., the sound spoken by the user (speaker)) is kept, and the non-voice part is discarded.

Further, because the speech of a user (speaker) may be interrupted for a short time, for example while switching between utterances, when a voice signal is detected its start position is marked and detection continues.

If an interruption of the voice signal is detected, the interruption time is recorded.

If the interruption time exceeds a preset time threshold, the speech has probably ended; the end position is marked, and the voice signal between the start position and the end position is segmented out as one speech by a certain user (speaker).

Otherwise, if a voice signal is detected again before the interruption time exceeds the preset time threshold, the speech may not be finished; detection continues, and the interruption time is cleared.
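The sub-steps S11 to S15 can be illustrated with a minimal Python sketch. It assumes that a VAD stage has already produced one boolean decision per fixed-length frame; the names `segment_by_interruption`, `frame_ms`, and `max_gap_ms` are hypothetical, and placing the end position at the last voiced frame (rather than at the moment the threshold is exceeded) is an assumption, since the text does not fix that detail.

```python
from typing import List, Tuple


def segment_by_interruption(
    is_speech: List[bool], frame_ms: int, max_gap_ms: int
) -> List[Tuple[int, int]]:
    """Return (start_frame, end_frame) pairs, end exclusive.

    is_speech  -- per-frame VAD decisions (assumed to be given)
    frame_ms   -- duration of one frame in milliseconds
    max_gap_ms -- preset time threshold for the interruption time
    """
    segments = []
    start = None       # start position of the current voice signal
    last_voice = None  # index of the most recent voiced frame
    gap_ms = 0         # interruption time accumulated so far

    for i, voiced in enumerate(is_speech):
        if voiced:
            if start is None:
                start = i          # S12: mark the start position
            last_voice = i
            gap_ms = 0             # speech resumed: clear the interruption time
        elif start is not None:
            gap_ms += frame_ms     # S13: accumulate the interruption time
            if gap_ms > max_gap_ms:
                # S14/S15: mark the end position and cut the segment out
                segments.append((start, last_voice + 1))
                start = None
                gap_ms = 0

    if start is not None:          # audio ended while someone was speaking
        segments.append((start, last_voice + 1))
    return segments
```

With 10 ms frames and a 20 ms threshold, a one-frame pause stays inside a segment, while a three-frame pause closes it.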

Step 103, identifying the speaker to which the voice signal belongs.

In practical applications, a person's voiceprint, that is, the sound spectrum carrying speech information as displayed by an electroacoustic instrument, is both specific to the person and relatively stable: after a person reaches adulthood, the voice remains relatively unchanged for a long time. Voiceprint recognition can therefore be performed on a voice signal to identify the user (speaker) to which the voice signal belongs.

In one embodiment of the present invention, step 103 may comprise the following sub-steps:

Sub-step S21, extracting target speech feature information from the speech signal.

In the embodiment of the present invention, in order to reduce the amount of computation, the extracted target speech feature information may be MFCC (Mel Frequency Cepstral Coefficients).

The Mel frequency scale is derived from the auditory characteristics of the human ear, and it has a nonlinear correspondence with frequency in Hz.

MFCCs are spectral features calculated from the Hz spectrum by using this relationship.
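The text does not give the Hz-to-Mel mapping itself. As an illustration only, the sketch below uses one widely used closed-form approximation, mel = 2595 · log10(1 + f / 700); the choice of this particular formula and the function names are assumptions, not something stated in the specification.

```python
import math


def hz_to_mel(f_hz: float) -> float:
    """Map a frequency in Hz to the Mel scale (common approximation)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)
```

Under this approximation 1000 Hz maps to roughly 1000 mel, and equal steps in Hz cover fewer and fewer mels as frequency rises, which is the nonlinear correspondence described above.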

Of course, besides MFCC, other features may be extracted as target speech feature information, such as prosodic features, which is not limited in this embodiment of the present invention.

Further, if the time length of the speech signal exceeds a preset length threshold, that is, the signal is long, then in order to reduce the amount of calculation, at least two speech segment signals are extracted from the speech signal, for example by bisection or by random selection, the length of each speech segment signal being smaller than the length of the speech signal.
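A rough sketch of this segment extraction follows. The helper names (`pick_segments`, `combined_features`), the parameters `max_len` and `seg_len`, and in particular the choice to combine per-segment features by averaging are all assumptions made for illustration; the specification only says that the per-segment features are combined into the target feature information.

```python
import random
from typing import Callable, List, Optional, Sequence


def pick_segments(
    signal: Sequence[float],
    max_len: int,
    seg_len: int,
    mode: str = "bisect",
    rng: Optional[random.Random] = None,
) -> List[List[float]]:
    """Return the whole signal if it is short, otherwise two shorter segments."""
    if len(signal) <= max_len:
        return [list(signal)]
    if mode == "bisect":
        mid = len(signal) // 2
        # Take the first seg_len samples of each half.
        return [list(signal[:mid][:seg_len]), list(signal[mid:][:seg_len])]
    # Random selection: two windows of seg_len samples at random offsets.
    rng = rng or random.Random()
    starts = [rng.randrange(0, len(signal) - seg_len + 1) for _ in range(2)]
    return [list(signal[s:s + seg_len]) for s in starts]


def combined_features(
    segments: List[List[float]],
    extract: Callable[[List[float]], List[float]],
) -> List[float]:
    """Extract a feature vector per segment and average them (assumed combiner)."""
    vectors = [extract(seg) for seg in segments]
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]
```

Here `extract` stands in for whatever feature extractor is used, for example an MFCC routine.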

Sub-step S22, matching the target speech feature information with reference speech feature information of a speaker.

Sub-step S23, determining that the speech signal belongs to a speaker if the target speech feature information is successfully matched with the reference speech feature information of the speaker.

By applying the embodiment of the invention, the voice signal of each speaker can be collected in advance, the voice feature information extracted from it can be used as reference speech feature information, and a mapping relation can be established between it and the information of the speaker (such as name, position and the like).

For example, each user records a 3-5 minute voice signal before the conference starts, from which voice feature information is extracted; or the organizer of the conference selects the users who will participate when reserving the conference, and these users log in to their own clients and record a 3-5 minute voice signal; and so on.

After a voice signal is segmented out in the conference, the target speech feature information is matched with the reference speech feature information of each speaker. If the similarity between the two is greater than or equal to a preset similarity threshold, the match is determined to be successful; otherwise, the match is determined to have failed.
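The threshold test can be sketched as follows. The specification does not name a similarity measure, so cosine similarity over fixed-length feature vectors is used here purely as an assumption; `matching_speakers` and the dictionary layout are likewise hypothetical.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def matching_speakers(
    target: List[float],
    references: Dict[str, List[float]],
    threshold: float,
) -> List[str]:
    """Names of all speakers whose reference features match the target."""
    return [
        name
        for name, ref in references.items()
        if cosine_similarity(target, ref) >= threshold  # match succeeds
    ]
```

Returning every matching name, rather than only the best one, makes it possible to tell apart the cases of exactly one match, several matches, and no match.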

In one embodiment of the present invention, step 103 may further include the following sub-steps:

Sub-step S24, if the target speech feature information is successfully matched with the reference speech feature information of at least two speakers, adjusting the segmentation parameters and returning to step 102.

If the target speech feature information is successfully matched with the reference speech feature information of at least two speakers, it can be determined that the voice signal was segmented incorrectly and contains the speech of at least two speakers. The segmentation parameters are then readjusted, for example by reducing the time threshold, and the voice signal is segmented again.

Sub-step S25, if the target speech feature information fails to match the reference speech feature information of all speakers, generating a temporary user as the speaker, and setting the target speech feature information as the reference speech feature information of the temporary user.

Sub-step S26, determining that the speaker to which the speech signal belongs is the temporary user.

In practical applications, if a user's voice signal was not recorded in advance, so that the user's reference speech feature information is missing, or if other participants join temporarily, the target speech feature information will fail to match the reference speech feature information of every speaker.

In this case, information for a temporary user may be generated, for example with the name set to "temporary user 1" and the position set to unknown; the temporary user is set as a new speaker, and the target speech feature information is set as the reference speech feature information of the temporary user for matching against subsequent voice signals.

In addition, the speaker to which the voice signal belongs is confirmed to be the temporary user.
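The three outcomes of matching (one match, several matches, no match) can be sketched as a small decision function. The class `SpeakerRegistry`, the function `decide`, the naming scheme for temporary users, and the choice to halve the time threshold when re-segmenting are hypothetical; the text only says that the threshold may be reduced.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class SpeakerRegistry:
    """Reference feature vectors keyed by speaker name."""
    references: Dict[str, List[float]] = field(default_factory=dict)
    temp_count: int = 0

    def add_temporary(self, target: List[float]) -> str:
        """S25: create a temporary user whose reference is the target features."""
        self.temp_count += 1
        name = f"temporary user {self.temp_count}"
        self.references[name] = list(target)
        return name


def decide(
    matches: List[str],
    target: List[float],
    registry: SpeakerRegistry,
    time_threshold_ms: int,
) -> Tuple[Optional[str], int, bool]:
    """Return (speaker, time_threshold_ms, resegment)."""
    if len(matches) >= 2:
        # S24: the segment probably mixes speakers; tighten the threshold
        # (halving is an assumed policy) and ask the caller to re-segment.
        return None, max(1, time_threshold_ms // 2), True
    if len(matches) == 1:
        # S23: exactly one speaker matched.
        return matches[0], time_threshold_ms, False
    # S25/S26: nobody matched, so register and return a temporary user.
    return registry.add_temporary(target), time_threshold_ms, False
```

Because the temporary user's features are stored in the registry, later voice signals from the same unregistered participant can match that temporary user instead of creating a new one.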

Step 104, performing voice recognition on the voice signal to obtain text information.

In the embodiment of the present invention, speech recognition may be performed on a speech signal to obtain content (i.e., text information) as a speech of a user (speaker).

Speech Recognition, also known as Automatic Speech Recognition (ASR), has the task of converting the vocabulary content of a Speech signal emitted by a user into text information that can be read in by a computer.

In a specific implementation, the voice signal may be locally subjected to voice recognition and text information is obtained, or the voice signal may be sent to the server for voice recognition and text information returned by the server is received.

Further, a speech recognition system for performing speech recognition typically includes the following basic modules:

1. Signal processing and feature extraction module

The signal processing and feature extraction module extracts features from the voice signal for processing by the acoustic model.

Meanwhile, the signal processing and feature extraction module generally performs some signal processing to reduce the influence of environmental noise, channels, speakers and other factors on the voice data as much as possible.

2. Acoustic model

The acoustic model of a speech recognition system is mostly built using a first-order hidden Markov model.

3. Pronunciation dictionary

The pronunciation dictionary contains the vocabulary that the speech recognition system can handle, together with its pronunciations, and provides the mapping between the acoustic model and the language model.

4. Language model

The language model models the language for which the speech recognition system is directed.

In general, various language models including regular language and context-free grammar can be used as the language model, but currently, the statistical-based N-gram and its variants are more adopted.

5. Decoder

The decoder is one of the cores of the speech recognition system: for an input signal, it searches, according to the acoustic model, the language model, and the dictionary, for the word string that outputs the signal with the maximum probability. The relation between the modules can be understood more clearly from a mathematical point of view.
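The specification does not spell the formula out. A commonly used textbook formulation, given here as an assumption about what is meant, writes the decoder's search as a maximum a posteriori problem over word strings $W$ given the acoustic feature sequence $X$:

```latex
W^{*} = \arg\max_{W} P(W \mid X)
      = \arg\max_{W} \frac{P(X \mid W)\, P(W)}{P(X)}
      = \arg\max_{W} P(X \mid W)\, P(W)
```

In this reading, $P(X \mid W)$ is supplied by the acoustic model together with the pronunciation dictionary, $P(W)$ is supplied by the language model, and $P(X)$ can be dropped because it does not depend on $W$.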

Step 105, recording the text information into the speaking information of the speaker.

In the conference record, once the speaker corresponding to a voice signal has been recognized, the text information corresponding to that voice signal may be recorded as the speech information of the speaker in the conference.

In one embodiment of the present invention, step 105 may comprise the sub-steps of:

Sub-step S31, setting the time of the voice signal as the speaking time of the speaker.

Sub-step S32, searching for text information at the same speaking time.

Sub-step S33, setting the text information as the speaking information of the speaker at the speaking time.

In the embodiment of the present invention, the speech sound (i.e., the voice signal) and the speech content (i.e., the text information) of the same user (speaker) coincide on the time axis, so the speaking time of the voice signal can be matched against the time of the text information. If the overlap ratio between the two exceeds a preset ratio threshold, the match is confirmed as successful, and in the conference record the text information is set as the speaking information of the speaker at that speaking time.
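The time matching can be sketched as an interval-overlap test. The specification does not define how the overlap ratio is normalized, so dividing by the shorter of the two intervals is an assumption, as are the names `overlap_ratio` and `attach_text` and the tuple layouts.

```python
from typing import Dict, List, Tuple


def overlap_ratio(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Overlap of two time intervals, relative to the shorter interval."""
    overlap = max(0.0, min(a_end, b_end) - max(a_start, b_start))
    shorter = min(a_end - a_start, b_end - b_start)
    return overlap / shorter if shorter > 0 else 0.0


def attach_text(
    turns: List[Tuple[str, float, float]],   # (speaker, start, end) per voice signal
    texts: List[Tuple[str, float, float]],   # (text, start, end) per recognition result
    ratio_threshold: float,
) -> List[Dict[str, object]]:
    """Build conference-record entries by matching texts to speaking times."""
    record = []
    for speaker, s0, s1 in turns:
        for text, t0, t1 in texts:
            if overlap_ratio(s0, s1, t0, t1) > ratio_threshold:
                record.append(
                    {"speaker": speaker, "start": s0, "end": s1, "text": text}
                )
    return record
```

Each resulting entry pairs a speaker and a speaking time with the recognized text that falls in the same span of the time axis.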

In one embodiment of the present invention, step 105 may further comprise the sub-steps of:

Sub-step S34, if the speaker is a temporary user and change information for the temporary user is received, modifying the information of the temporary user according to the change information.

In the embodiment of the present invention, if the speaker is a temporary user, then after the conference record is generated, a user may generate change information (such as a changed name, a changed position, and the like) for the information (such as name, position, and the like) of the temporary user, and the information of the temporary user is adjusted uniformly throughout the conference record.
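As a small illustration of this uniform adjustment, the sketch below applies change information to every record entry attributed to a given temporary user. The record layout (a list of dictionaries with `name`, `position`, and `text` keys) and the function name `apply_change` are assumptions made for the example.

```python
from typing import Dict, List


def apply_change(
    record: List[Dict[str, str]],
    temp_name: str,
    change: Dict[str, str],
) -> int:
    """Update every entry spoken by temp_name; return how many entries changed."""
    changed = 0
    for entry in record:
        if entry.get("name") == temp_name:
            entry.update(change)  # e.g. new name and position
            changed += 1
    return changed
```

For example, after the conference a user could map "temporary user 1" to a real name and position, and all of that participant's entries would be updated in one pass.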

It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combinations of acts, but those skilled in the art will recognize that the present invention is not limited by the described order of acts, as some steps may be performed in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are preferred embodiments, and that the acts involved are not necessarily all required by the embodiments of the invention.

Referring to fig. 2, a block diagram of a structure of an apparatus for recording utterance information in a conference according to an embodiment of the present invention is shown, which may specifically include the following modules:

an audio signal acquisition module 201, configured to acquire an audio signal in a conference;

a voice signal segmentation module 202, configured to segment one or more voice signals from the audio signal according to a preset segmentation parameter;

a speaker recognition module 203, configured to recognize a speaker to which the voice signal belongs;

the voice recognition module 204 is configured to perform voice recognition on the voice signal to obtain text information;

and an utterance information recording module 205, configured to record the text information in utterance information of the speaker.

In an embodiment of the present invention, the segmentation parameter includes an interruption time, and the voice signal segmentation module 202 includes:

In one embodiment of the present invention, the speaker recognition module 203 includes:

In an embodiment of the present invention, the target speech feature information extraction sub-module includes:

In an embodiment of the present invention, the speaker recognition module 203 further includes:

In an embodiment of the present invention, the speech information recording module 205 includes:

In an embodiment of the present invention, the speech information recording module 205 further includes:

For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.

The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.

Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.

The method for recording speech information in a conference and the device for recording speech information in a conference provided by the invention are described in detail above, and specific examples are applied in the text to explain the principle and the implementation of the invention, and the description of the above examples is only used to help understanding the method and the core idea of the invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims

1. A method of recording speech information in a conference, comprising:

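acquiring audio signals in a conference;

segmenting one or more voice signals from the audio signals according to preset segmentation parameters;

identifying a speaker to which the voice signal belongs;

carrying out voice recognition on the voice signal to obtain text information;

and recording the text information into the speaking information of the speaker.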

2. The method of claim 1, wherein the segmentation parameter comprises an interruption time, and wherein segmenting one or more voice signals from the audio signals according to the preset segmentation parameters comprises:

judging whether the audio signal is a voice signal or not;

if so, marking the initial position of the voice signal;

calculating the interruption time of the voice signal;

3. The method of claim 1 or 2, wherein the identifying the speaker to which the speech signal belongs comprises:

extracting target voice characteristic information from the voice signal;

4. The method of claim 3, wherein the extracting target speech feature information from the speech signal comprises:

5. The method of claim 3, wherein the identifying the speaker to which the speech signal belongs, further comprises:

6. The method of claim 3, wherein the identifying the speaker to which the speech signal belongs, further comprises:

7. The method of claim 1, 2, 4, 5 or 6, wherein the recording the text information into the speaking information of the speaker comprises:

setting the time of the voice signal as the speaking time of the speaker;

searching text information at the same speaking time;

8. The method of claim 7, wherein the recording the text information to the speaker's speech information further comprises:

9. An apparatus for recording speech information in a conference, comprising:

10. The apparatus of claim 9, wherein the segmentation parameter comprises an interruption time, and wherein the voice signal segmentation module comprises: