CN110661923A - Method and device for recording speech information in conference - Google Patents
Method and device for recording speech information in a conference
- Publication number
- CN110661923A CN110661923A CN201810688911.6A CN201810688911A CN110661923A CN 110661923 A CN110661923 A CN 110661923A CN 201810688911 A CN201810688911 A CN 201810688911A CN 110661923 A CN110661923 A CN 110661923A
- Authority
- CN
- China
- Prior art keywords
- voice
- information
- speaker
- signal
- speech
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M3/00—Automatic or semi-automatic exchanges
- H04M3/42—Systems providing special services or facilities to subscribers
- H04M3/42221—Conversation recording systems
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/04—Segmentation; Word boundary detection
- G10L15/26—Speech to text systems
- G10L17/00—Speaker identification or verification techniques
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Telephonic Communication Services (AREA)
Abstract
The embodiment of the invention provides a method and a device for recording speaking information in a conference. The method comprises the following steps: collecting an audio signal in a conference; segmenting one or more voice signals from the audio signal according to preset segmentation parameters; identifying the speaker to which each voice signal belongs; performing voice recognition on the voice signal to obtain text information; and recording the text information into the speaking information of the speaker. By automatically identifying the speaker and the text information corresponding to each voice signal, the embodiment of the invention generates a user's speaking information automatically, spares the record keeper from repeatedly replaying the recording to transcribe users' utterances, and greatly improves the efficiency of generating the conference record.
Description
Technical Field
The present invention relates to the field of computer processing technology, and in particular, to a method for recording speech information in a conference and an apparatus for recording speech information in a conference.
Background
In schools, enterprises, factories and similar settings, users often need to hold conferences to discuss study, work and other matters.
During a conference, a record keeper notes the organization and the specific content of the conference to form a conference record, which is then archived and shared with other users.
At present, because users speak quickly, the record keeper usually records audio during the conference, takes down only part of the users' utterances by hand, and after the conference replays the recording repeatedly to fill in the remaining utterances, so generating a conference record is inefficient.
Disclosure of Invention
The embodiment of the invention provides a method for recording speech information in a conference, which aims to solve the problem that generating a conference record is inefficient because the conference recording must be replayed repeatedly by hand to transcribe users' utterances.
According to an aspect of the present invention, there is provided a method of recording speech information in a conference, including:
acquiring audio signals in a conference;
segmenting one or more voice signals from the audio signals according to preset segmentation parameters;
identifying a speaker to which the voice signal belongs;
carrying out voice recognition on the voice signal to obtain text information;
and recording the text information into the speaking information of the speaker.
Optionally, the segmenting parameter includes an interruption time, and segmenting one or more voice signals from the audio signal according to a preset segmenting parameter includes:
judging whether the audio signal is a voice signal or not;
if so, marking the initial position of the voice signal;
calculating the interruption time of the voice signal;
if the interruption time exceeds a preset time threshold, marking the end position of the voice signal;
segmenting a speech signal located between the start position and the end position from the audio signal.
Optionally, the recognizing a speaker to which the voice signal belongs includes:
extracting target voice characteristic information from the voice signal;
matching the target voice characteristic information with reference voice characteristic information of a speaker;
and if the target voice characteristic information is successfully matched with the reference voice characteristic information of a speaker, determining that the voice signal belongs to the speaker.
Optionally, the extracting target speech feature information from the speech signal includes:
if the time length of the voice signal exceeds a preset length threshold, extracting at least two voice fragment signals from the voice signal;
and extracting voice characteristic information from each voice fragment signal, and combining the extracted voice characteristic information into the target voice characteristic information.
Optionally, the recognizing a speaker to which the voice signal belongs further includes:
and if the target voice characteristic information is successfully matched with the reference voice characteristic information of at least two speakers, adjusting the segmentation parameters, and returning to the step of segmenting one or more voice signals from the audio signals according to the preset segmentation parameters.
Optionally, the recognizing a speaker to which the voice signal belongs further includes:
if the target voice characteristic information fails to be matched with the reference voice characteristic information of all speakers, generating a temporary user as the speaker, and setting the target voice characteristic information as the reference voice characteristic information of the temporary user;
determining that a speaker to which the voice signal belongs is the temporary user.
Optionally, the recording the text information to the speaking information of the speaker includes:
setting the time of the voice signal as the speaking time of the speaker;
searching text information at the same speaking time;
and setting the text information as the speaking information of the speaker in the speaking time.
Optionally, the recording the text information to the speaking information of the speaker further includes:
and if the speaker is a temporary user and receives change information aiming at the temporary user, modifying the information of the temporary user according to the change information.
According to another aspect of the present invention, there is provided an apparatus for recording utterance information in a conference, including:
the audio signal acquisition module is used for acquiring audio signals in a conference;
the voice signal segmentation module is used for segmenting one or more voice signals from the audio signals according to preset segmentation parameters;
the speaker recognition module is used for recognizing a speaker to which the voice signal belongs;
the voice recognition module is used for carrying out voice recognition on the voice signal to obtain text information;
and the speech information recording module is used for recording the text information into the speech information of the speaker.
Optionally, the segmentation parameter includes an interruption time, and the speech signal segmentation module includes:
the voice signal judgment submodule is used for judging whether the audio signal is a voice signal; if yes, calling an initial position marking sub-module;
a start position marking submodule for marking a start position of the voice signal;
the interruption time calculation submodule is used for calculating the interruption time of the voice signal;
an end position marking submodule, configured to mark an end position of the voice signal if the interruption time exceeds a preset time threshold;
and the position segmentation submodule is used for segmenting the voice signal between the starting position and the ending position from the audio signal.
Optionally, the speaker recognition module comprises:
the target voice characteristic information extraction submodule is used for extracting target voice characteristic information from the voice signal;
the voice characteristic information matching submodule is used for matching the target voice characteristic information with reference voice characteristic information of a speaker;
and the speaker determining submodule is used for determining that the voice signal belongs to a speaker if the target voice characteristic information is successfully matched with the reference voice characteristic information of the speaker.
Optionally, the target speech feature information extraction sub-module includes:
the voice segment signal extraction unit is used for extracting at least two voice segment signals from the voice signals if the time length of the voice signals exceeds a preset length threshold;
and the voice characteristic information combination unit is used for extracting voice characteristic information from each voice fragment signal and combining the extracted voice characteristic information into the target voice characteristic information.
Optionally, the speaker recognition module further comprises:
and the segmentation parameter adjusting submodule is used for adjusting the segmentation parameters and returning the segmentation parameters to the voice signal segmentation module if the target voice characteristic information is successfully matched with the reference voice characteristic information of at least two speakers.
Optionally, the speaker recognition module further comprises:
the temporary user setting submodule is used for generating a temporary user as a speaker if the target voice characteristic information fails to be matched with the reference voice characteristic information of all speakers, and setting the target voice characteristic information as the reference voice characteristic information of the temporary user;
and the temporary user determination submodule is used for determining that the speaker to which the voice signal belongs is the temporary user.
Optionally, the utterance information recording module includes:
the speaking time setting submodule is used for setting the time of the voice signal as the speaking time of the speaker;
the text information searching submodule is used for searching the text information at the same speaking time;
and the text information setting submodule is used for setting the text information as the speaking information of the speaker in the speaking time.
Optionally, the speech information recording module further includes:
and the temporary user changing submodule is used for modifying the information of the temporary user according to the changing information if the speaking person is the temporary user and the changing information aiming at the temporary user is received.
The embodiment of the invention has the following advantages:
in the embodiment of the invention, an audio signal is collected during the conference, one or more voice signals are segmented from the audio signal according to preset segmentation parameters, the speaker to which each voice signal belongs is identified, voice recognition is performed on the voice signal to obtain text information, and the text information is recorded into the speaking information of that speaker. By automatically identifying the speaker and the text information corresponding to each voice signal, the speaking information of the user is generated automatically, the user no longer needs to replay the recording repeatedly by hand to transcribe the speaking information into the conference record, and the efficiency of generating the conference record is greatly improved.
Drawings
Fig. 1 is a flow chart illustrating the steps of a method for recording speech information in a conference according to an embodiment of the present invention;
fig. 2 is a block diagram of an apparatus for recording speech information in a conference according to an embodiment of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Referring to fig. 1, a flowchart illustrating the steps of a method for recording utterance information in a conference according to an embodiment of the present invention is shown, which may specifically include the following steps:
Step 101, collecting an audio signal in a conference.
In a specific implementation, one or more sound pick-up devices are deployed at the conference venue; when the conference starts, the sound pick-up devices begin to collect audio signals.
The collected audio signal may be amplified and passed through a band-pass filter (20 Hz-20 kHz) to remove noise outside the range of the human voice.
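The pre-filtering step above can be sketched as follows. This is an illustrative assumption only: the text merely says the captured audio may be amplified and band-pass filtered (about 20 Hz to 20 kHz) to suppress non-voice noise. A real system would use a proper IIR/FIR design (for example a Butterworth band-pass); this crude moving-average version only shows the idea of stripping DC drift and high-frequency noise.

```python
import math

def moving_average(x, n):
    """Sliding mean over the last n samples (a crude low-pass filter)."""
    out, acc = [], 0.0
    for i, v in enumerate(x):
        acc += v
        if i >= n:
            acc -= x[i - n]
        out.append(acc / min(i + 1, n))
    return out

def crude_bandpass(x, slow_win, fast_win):
    """High-pass by subtracting a long-window mean (removes drift/DC),
    then low-pass with a short-window mean: together a crude band-pass."""
    baseline = moving_average(x, slow_win)
    highpassed = [v - b for v, b in zip(x, baseline)]
    return moving_average(highpassed, fast_win)

fs = 8000
# a 200 Hz "voice" tone riding on a constant DC offset of 0.5
sig = [0.5 + math.sin(2 * math.pi * 200 * t / fs) for t in range(fs)]
filtered = crude_bandpass(sig, slow_win=400, fast_win=4)
tail = filtered[1000:]
print(abs(sum(tail) / len(tail)))  # near zero: the DC offset is removed
```

The window sizes are made-up values chosen so that the slow window spans whole periods of the test tone.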
Step 102, segmenting one or more voice signals from the audio signal according to preset segmentation parameters.
The sound pick-up device continuously captures audio from the environment; this audio contains the voice signals of the users (speakers) but may also contain non-speech sounds, such as the noise of a user walking.
Therefore, segmentation parameters can be set according to the characteristics of user speech, and one or more continuous voice signals can be segmented from the audio signal, each taken as one utterance of a certain user.
In one embodiment of the present invention, the segmentation parameter includes an interruption time, and step 102 may include the following sub-steps:
a substep S11 of determining whether the audio signal is a speech signal; if yes, go to substep S12.
And a substep S12 of marking a start position of the speech signal.
And a substep S13 of calculating an interruption time of the speech signal.
And a substep S14, if the interruption time exceeds a preset time threshold, marking the end position of the voice signal.
Sub-step S15, slicing the speech signal located between the start position and the end position from the audio signal.
In the embodiment of the present invention, Voice Activity Detection (VAD) may be performed on the collected audio signal: the voice portions (i.e., sound spoken by a user (speaker)) are kept, and the non-voice portions are discarded.
Because a user's speech naturally contains short pauses, for example when speakers take turns, the start time is marked when a voice signal is first detected, and detection continues.
If an interruption of the voice signal is detected, the interruption time is recorded.
If the interruption time exceeds a preset time threshold, the utterance has probably ended; the end time is marked, and the voice signal between the start position and the end position is segmented out as one utterance of a certain user (speaker).
Otherwise, if the voice signal reappears before the interruption time exceeds the threshold, the utterance has probably not ended; detection continues, and the interruption time is cleared.
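Sub-steps S11 to S15 can be sketched as follows, under stated assumptions: a VAD has already classified the audio frame by frame into speech (True) / non-speech (False) flags, and the frame length and time threshold are illustrative values, not from the patent.

```python
def segment_utterances(vad_flags, frame_len=0.02, time_threshold=0.5):
    """Return (start, end) frame indices of utterances. An utterance ends
    only when the interruption (silence) lasts longer than time_threshold."""
    max_gap = int(time_threshold / frame_len)  # frames of tolerated silence
    segments = []
    start = None       # marked start position of the current voice signal
    gap = 0            # current interruption length in frames
    last_voice = None  # index of the last speech frame seen
    for i, is_speech in enumerate(vad_flags):
        if is_speech:
            if start is None:
                start = i          # sub-step S12: mark the start position
            gap = 0                # speech resumed: clear the interruption time
            last_voice = i
        elif start is not None:
            gap += 1               # sub-step S13: accumulate the interruption
            if gap > max_gap:      # sub-step S14: mark the end position
                segments.append((start, last_voice + 1))  # sub-step S15
                start, gap = None, 0
    if start is not None:          # audio ended while speech was still active
        segments.append((start, last_voice + 1))
    return segments

flags = [True] * 10 + [False] * 5 + [True] * 10 + [False] * 40 + [True] * 8
# a 0.1 s pause is tolerated inside one utterance; a 0.8 s pause splits
print(segment_utterances(flags))  # -> [(0, 25), (65, 73)]
```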
Step 103, identifying the speaker to which the voice signal belongs.
In practical applications, a person's voiceprint, that is, the speech-carrying sound spectrum displayed by electro-acoustic instruments, is both distinctive and relatively stable: once a person reaches adulthood, the voice remains largely unchanged for a long time. Voiceprint recognition can therefore be performed on a voice signal to identify the user (speaker) to whom it belongs.
In one embodiment of the present invention, step 103 may comprise the following sub-steps:
and a substep S21 of extracting target speech feature information from the speech signal.
In the embodiment of the present invention, to reduce the amount of computation, the extracted target voice feature information may be MFCCs (Mel-Frequency Cepstral Coefficients).
The Mel scale is derived from the auditory characteristics of the human ear and has a nonlinear correspondence with frequency in Hz.
MFCCs are cepstral features computed on the Mel scale by exploiting this correspondence.
Of course, besides MFCC, other features may be extracted as target speech feature information, such as prosodic features, which is not limited in this embodiment of the present invention.
Further, if the time length of the voice signal exceeds a preset length threshold, that is, the signal is long, at least two voice fragment signals may be extracted from it (for example by bisection or random sampling) to reduce the amount of computation; each fragment is shorter than the full voice signal.
Voice feature information is then extracted from each voice fragment signal, and the extracted features are combined into the target voice feature information.
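The long-signal shortcut above can be sketched as follows. Note the assumptions: real MFCC extraction needs a DSP library, so two toy frame statistics (mean absolute amplitude and zero-crossing rate) stand in for genuine voice features, and the function names and thresholds are illustrative, not the patent's implementation.

```python
def toy_features(samples):
    """Placeholder for MFCC extraction: mean absolute amplitude + ZCR."""
    energy = sum(abs(s) for s in samples) / len(samples)
    zcr = sum(1 for a, b in zip(samples, samples[1:]) if a * b < 0) / len(samples)
    return [energy, zcr]

def target_features(signal, fs, length_threshold=30.0, n_fragments=2):
    """If the signal is longer than length_threshold seconds, extract
    features from n_fragments evenly spaced fragments (bisection-style)
    and concatenate them; otherwise use the whole signal."""
    if len(signal) / fs <= length_threshold:
        return toy_features(signal)
    frag_len = len(signal) // (2 * n_fragments)  # fragments shorter than signal
    feats = []
    for k in range(n_fragments):
        start = k * len(signal) // n_fragments
        feats.extend(toy_features(signal[start:start + frag_len]))
    return feats

fs = 16000
short_sig = [1.0] * fs                 # 1 s: below threshold, one feature set
long_sig = [1.0, -1.0] * (fs * 20)     # 40 s: two fragments, concatenated
print(len(target_features(short_sig, fs)), len(target_features(long_sig, fs)))
```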
And a substep S22 of matching the target speech characteristic information with reference speech characteristic information of a speaker.
And a substep S23, determining that the speech signal belongs to a speaker if the target speech feature information is successfully matched with the reference speech feature information of the speaker.
By applying the embodiment of the present invention, a speaker's voice signal can be collected in advance, voice feature information can be extracted from it as reference voice feature information, and a mapping can be established to the speaker's information (such as name and position).
For example, each user may record a 3-5 minute voice signal before the conference starts so that voice feature information can be extracted; alternatively, the conference organizer may select the participating users when reserving the conference, and those users record a 3-5 minute voice signal by logging in to their own clients.
After a voice signal is segmented out during the conference, the target voice feature information is matched against the reference voice feature information of each speaker. If the similarity between the two exceeds a preset similarity threshold, the match succeeds; otherwise, it fails.
If the target voice feature information is successfully matched with the reference voice feature information of exactly one speaker, the voice signal is determined to belong to that speaker.
In one embodiment of the present invention, step 103 may further include the following sub-steps:
and a substep S24, if the target voice feature information is successfully matched with the reference voice feature information of at least two speakers, adjusting the segmentation parameters, and returning to the step 102.
And if the target voice characteristic information is successfully matched with the reference voice characteristic information of the at least two speakers, determining that the voice signal is segmented wrongly and contains the speech information of the at least two speakers, readjusting segmentation parameters, and if the time threshold is reduced, segmenting the voice signal again.
In one embodiment of the present invention, step 103 may further include the following sub-steps:
and a substep S25, if the target speech feature information fails to match the reference speech feature information of all speakers, generating a temporary user as the speaker, and setting the target language feature information as the reference speech feature information of the temporary user.
And a substep S26 of determining that the speaker to which the speech signal belongs is the provisional user.
In practical application, if the pre-recorded voice signal of the user is missed to some extent, the reference voice feature information is missed to some extent, or other participants are temporarily added, and the like, the matching between the target voice feature information and the reference voice feature information of any speaker fails.
At this time, information of the provisional user may be generated, such as the name set to "provisional user 1", the position set to unknown, the provisional user set to a new speaker, and the target language feature information set as the reference speech feature information of the provisional user for matching with the subsequent speech signal.
In addition, it is confirmed that the speaker to which the voice signal belongs is the provisional user.
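Sub-steps S22 to S26 can be sketched together as follows, under stated assumptions: feature vectors are plain lists, similarity is cosine similarity, and the 0.85 threshold and naming scheme are invented for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def identify_speaker(target, enrolled, threshold=0.85):
    """enrolled: {name: reference feature vector}. Returns
    ('match', name) on a unique hit (sub-step S23),
    ('ambiguous', names) when two or more speakers match (sub-step S24:
    the caller should shrink the time threshold and segment again), or
    ('temporary', new_name) when nobody matches (sub-steps S25/S26)."""
    hits = [n for n, ref in enrolled.items() if cosine(target, ref) >= threshold]
    if len(hits) == 1:
        return ("match", hits[0])
    if len(hits) >= 2:
        return ("ambiguous", hits)
    # no speaker matched: enroll a temporary user with the target features
    name = "temporary user %d" % (len(enrolled) + 1)
    enrolled[name] = list(target)
    return ("temporary", name)

enrolled = {"zhang san": [1.0, 0.0], "li si": [0.0, 1.0]}
print(identify_speaker([0.95, 0.05], enrolled))  # -> ('match', 'zhang san')
```

A later voice signal from the same unknown person would now match the temporary user's stored features.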
Step 104, performing voice recognition on the voice signal to obtain text information.
In the embodiment of the present invention, speech recognition may be performed on the voice signal to obtain its content (i.e., text information) as the utterance of the user (speaker).
Speech recognition, also known as Automatic Speech Recognition (ASR), converts the lexical content of a user's voice signal into text that a computer can process.
In a specific implementation, the voice signal may be locally subjected to voice recognition and text information is obtained, or the voice signal may be sent to the server for voice recognition and text information returned by the server is received.
Further, a speech recognition system for performing speech recognition typically includes the following basic modules:
1. signal processing and feature extraction module
The signal processing and feature extraction module extracts features from the voice signal for processing by the acoustic model.
Meanwhile, the signal processing and feature extraction module generally performs some signal processing to reduce the influence of environmental noise, channels, speakers and other factors on the voice data as much as possible.
2. Acoustic model
Most speech recognition systems model the acoustics with first-order hidden Markov models.
3. Pronunciation dictionary
The pronunciation dictionary contains the vocabulary set that the speech recognition system can handle and its pronunciation, provides the mapping of acoustic model and language model.
4. Language model
The language model models the language for which the speech recognition system is directed.
In principle, various models, including regular languages and context-free grammars, can serve as the language model, but statistical N-gram models and their variants are currently the most widely used.
5. Decoder
The decoder is one of the cores of a speech recognition system. For an input signal, it searches for the word string that outputs the signal with the maximum probability, according to the acoustic model, the language model, and the pronunciation dictionary.
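As a toy illustration of the "statistical N-gram" language model named above (not the patent's implementation, and with an invented mini-corpus), here is a minimal bigram model that scores candidate word strings the way a decoder prefers the more probable transcript:

```python
from collections import defaultdict

class Bigram:
    """Add-alpha smoothed bigram language model over whitespace tokens."""
    def __init__(self, corpus):
        self.uni = defaultdict(int)   # counts of the left word of each bigram
        self.bi = defaultdict(int)    # counts of (left, right) word pairs
        for sent in corpus:
            words = ["<s>"] + sent.split()
            for a, b in zip(words, words[1:]):
                self.uni[a] += 1
                self.bi[(a, b)] += 1

    def prob(self, sentence, alpha=1.0, vocab=1000):
        words = ["<s>"] + sentence.split()
        p = 1.0
        for a, b in zip(words, words[1:]):
            p *= (self.bi[(a, b)] + alpha) / (self.uni[a] + alpha * vocab)
        return p

lm = Bigram(["record the meeting", "record the minutes", "the meeting starts"])
# a decoder would prefer the candidate word string with the higher probability
print(lm.prob("record the meeting") > lm.prob("meeting the record"))  # True
```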
Step 105, recording the text information into the speaking information of the speaker.
In the conference recording, if the speaker corresponding to the voice signal is recognized, the text information corresponding to the voice signal may be recorded as the utterance information of the speaker in the conference.
In one embodiment of the present invention, step 105 may comprise the sub-steps of:
and a substep S31 of setting the time of the voice message as the speaking time of the speaker.
Sub-step S32, text information at the same speaking time is looked up.
And a substep S33 of setting the text information as speaking information of the speaker at the speaking time.
In the embodiment of the present invention, the speech audio (i.e., the voice signal) and the speech content (i.e., the text information) of the same user (speaker) coincide on the time axis. The speaking time of the voice signal can therefore be matched against the time of the text information: if the ratio of their overlap exceeds a preset ratio threshold, the match is considered successful, and the text information is set, in the conference record, as the speaker's speaking information at that speaking time.
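The time-overlap matching above can be sketched as follows, with assumed data shapes: times are (start, end) tuples in seconds, and the 0.5 overlap-ratio threshold is an illustrative value, not from the patent.

```python
def overlap_ratio(a, b):
    """Overlap length divided by the shorter interval's length."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    return inter / min(a[1] - a[0], b[1] - b[0])

def attach_text(speaking_time, texts, ratio_threshold=0.5):
    """texts: list of (time_interval, text). Returns the text whose time
    best overlaps the speaking time, if the ratio exceeds the threshold."""
    best, best_r = None, ratio_threshold
    for interval, text in texts:
        r = overlap_ratio(speaking_time, interval)
        if r >= best_r:
            best, best_r = text, r
    return best

texts = [((0.0, 4.0), "hello everyone"), ((4.5, 9.0), "next item")]
print(attach_text((4.4, 9.1), texts))  # -> next item
```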
In one embodiment of the present invention, step 105 may further comprise the sub-steps of:
and a substep S34, if the speaker is a temporary user and receives change information for the temporary user, modifying the information of the temporary user according to the change information.
In the embodiment of the present invention, if the speaker is a temporary user, then after the conference record is generated a user may submit corresponding change information (such as a corrected name or position) for the temporary user's information, and the temporary user's information is updated uniformly throughout the conference record.
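The uniform update of a temporary user can be sketched as follows. The data shapes are assumptions for illustration: the conference record references speakers by id, and speaker information lives in one shared table, so one change reaches every utterance entry.

```python
speakers = {
    "u1": {"name": "Zhang San", "position": "engineer"},
    "tmp1": {"name": "temporary user 1", "position": "unknown"},
}
record = [
    {"speaker": "u1", "time": "10:00", "text": "let's begin"},
    {"speaker": "tmp1", "time": "10:01", "text": "one question"},
]

def apply_change(speaker_id, change):
    """Apply change information (e.g. corrected name/position) in one place;
    every entry in the record sees the update via the shared table."""
    speakers[speaker_id].update(change)

apply_change("tmp1", {"name": "Li Si", "position": "manager"})
print(speakers[record[1]["speaker"]]["name"])  # -> Li Si
```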
In the embodiment of the invention, an audio signal is collected during the conference, one or more voice signals are segmented from the audio signal according to preset segmentation parameters, the speaker to which each voice signal belongs is identified, voice recognition is performed on the voice signal to obtain text information, and the text information is recorded into the speaking information of that speaker. By automatically identifying the speaker and the text information corresponding to each voice signal, the speaking information of the user is generated automatically, the user no longer needs to replay the recording repeatedly by hand to transcribe the speaking information into the conference record, and the efficiency of generating the conference record is greatly improved.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.
Referring to fig. 2, a block diagram of a structure of an apparatus for recording utterance information in a conference according to an embodiment of the present invention is shown, which may specifically include the following modules:
an audio signal acquisition module 201, configured to acquire an audio signal in a conference;
a voice signal segmentation module 202, configured to segment one or more voice signals from the audio signal according to a preset segmentation parameter;
a speaker recognition module 203, configured to recognize a speaker to which the voice signal belongs;
the voice recognition module 204 is configured to perform voice recognition on the voice signal to obtain text information;
and an utterance information recording module 205, configured to record the text information in utterance information of the speaker.
In an embodiment of the present invention, the slicing parameter includes an interruption time, and the speech signal slicing module 202 includes:
the voice signal judgment submodule is used for judging whether the audio signal is a voice signal; if yes, calling an initial position marking sub-module;
a start position marking submodule for marking a start position of the voice signal;
the interruption time calculation submodule is used for calculating the interruption time of the voice signal;
an end position marking submodule, configured to mark an end position of the voice signal if the interruption time exceeds a preset time threshold;
and the position segmentation submodule is used for segmenting the voice signal between the starting position and the ending position from the audio signal.
In one embodiment of the present invention, the speaker recognition module 203 includes:
the target voice characteristic information extraction submodule is used for extracting target voice characteristic information from the voice signal;
the voice characteristic information matching submodule is used for matching the target voice characteristic information with reference voice characteristic information of a speaker;
and the speaker determining submodule is used for determining that the voice signal belongs to a speaker if the target voice characteristic information is successfully matched with the reference voice characteristic information of the speaker.
In an embodiment of the present invention, the target speech feature information extraction sub-module includes:
the voice segment signal extraction unit is used for extracting at least two voice segment signals from the voice signal if the time length of the voice signal exceeds a preset length threshold;
and the voice characteristic information combination unit is used for extracting voice characteristic information from each voice segment signal and combining the extracted voice characteristic information into the target voice characteristic information.
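The extraction and combination units can be sketched as follows. The 5 s length threshold, the head/tail split, and the toy two-value feature (RMS energy and zero-crossing rate) are illustrative assumptions; a real system would extract MFCCs or a neural embedding per segment.

```python
import numpy as np

def extract_target_features(signal, sample_rate, length_threshold_s=5.0):
    """Extract target voice characteristic information from a voice signal.

    If the signal's time length exceeds length_threshold_s, features are
    drawn from two sub-segments (head and tail) and concatenated, so the
    combined vector reflects more than one part of a long utterance.
    """
    def segment_features(seg):
        rms = float(np.sqrt(np.mean(seg ** 2)))                 # energy
        zcr = float(np.mean(np.abs(np.diff(np.sign(seg))) > 0)) # zero crossings
        return [rms, zcr]

    if len(signal) / sample_rate > length_threshold_s:
        half = len(signal) // 2
        return segment_features(signal[:half]) + segment_features(signal[half:])
    return segment_features(signal)
```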
In an embodiment of the present invention, the speaker recognition module 203 further includes:
and the segmentation parameter adjusting submodule is used for adjusting the segmentation parameters and returning to the voice signal segmentation module if the target voice characteristic information is successfully matched with the reference voice characteristic information of at least two speakers.
In an embodiment of the present invention, the speaker recognition module 203 further includes:
the temporary user setting submodule is used for generating a temporary user as the speaker if the target voice characteristic information fails to match the reference voice characteristic information of every speaker, and setting the target voice characteristic information as the reference voice characteristic information of the temporary user;
and the temporary user determination submodule is used for determining that the speaker to which the voice signal belongs is the temporary user.
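The temporary-user submodules (and the change-information handling described below) can be sketched as a small registry. The matcher is injected, the "guest-N" naming scheme and the `rename` helper are illustrative assumptions.

```python
import itertools

class SpeakerRegistry:
    """Resolve a matched speaker, or fall back to a temporary user.

    When matching fails for every enrolled speaker, a temporary user is
    generated and the target features become its reference features, so
    later utterances by the same unknown voice map to the same user.
    """
    def __init__(self, matcher):
        self._matcher = matcher          # (feat, refs) -> name or None
        self._refs = {}
        self._counter = itertools.count(1)

    def enroll(self, name, feat):
        self._refs[name] = feat

    def resolve(self, target_feat):
        name = self._matcher(target_feat, self._refs)
        if name is None:                 # matching failed for all speakers
            name = f"guest-{next(self._counter)}"
            self._refs[name] = target_feat
        return name

    def rename(self, old, new):
        """Apply change information received for a temporary user."""
        self._refs[new] = self._refs.pop(old)
```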
In an embodiment of the present invention, the speech information recording module 205 includes:
the speaking time setting submodule is used for setting the time of the voice signal as the speaking time of the speaker;
the text information searching submodule is used for searching the text information at the same speaking time;
and the text information setting submodule is used for setting the text information as the speaking information of the speaker in the speaking time.
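The recording submodules amount to grouping recognized text by speaker, keyed by speaking time. This sketch assumes each segmented voice signal has already been reduced to a (speaking time, speaker, text) tuple; the function name is hypothetical.

```python
from collections import defaultdict

def record_utterances(recognized):
    """Build the per-speaker speech information record.

    recognized is a list of (speaking_time_seconds, speaker, text)
    tuples, one per segmented voice signal. The result maps each
    speaker to their time-ordered utterances.
    """
    record = defaultdict(list)
    for time_s, speaker, text in sorted(recognized):
        record[speaker].append((time_s, text))
    return dict(record)
```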
In an embodiment of the present invention, the speech information recording module 205 further includes:
and the temporary user changing submodule is used for modifying the information of the temporary user according to the changing information if the speaking person is the temporary user and the changing information aiming at the temporary user is received.
Since the device embodiments are substantially similar to the method embodiments, they are described relatively briefly; for relevant details, refer to the corresponding description of the method embodiments.
In the embodiments of the present invention, an audio signal is collected during a conference, one or more voice signals are segmented from the audio signal according to preset segmentation parameters, the speaker to which each voice signal belongs is identified, and speech recognition is performed on the voice signal to obtain text information, which is recorded into the speech information of that speaker. By automatically identifying both the speaker and the text corresponding to each voice signal, a user's speech record is generated automatically, sparing users from repeatedly replaying the conference recording to transcribe their remarks by hand and greatly improving the efficiency of producing conference minutes.
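The overall flow summarized above can be sketched end to end, with the three stages injected as functions, since the patent does not fix particular segmentation, speaker-identification, or speech-recognition algorithms.

```python
def transcribe_meeting(audio, sample_rate, segment, identify, recognize):
    """Sketch of the pipeline: segment the conference audio, identify
    the speaker of each voice signal, run speech recognition on it,
    and record the resulting text under that speaker.
    """
    record = {}
    for start, end in segment(audio, sample_rate):
        voice = audio[start:end]
        speaker = identify(voice)            # speaker recognition
        text = recognize(voice)              # speech-to-text
        record.setdefault(speaker, []).append((start / sample_rate, text))
    return record
```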
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the parts the embodiments have in common, reference may be made to one another.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it should also be noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.
The method and the device for recording speech information in a conference provided by the present invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the invention, and the description of these examples is intended only to aid in understanding the method and its core idea. Meanwhile, a person skilled in the art may, following the idea of the present invention, vary the specific embodiments and the scope of application; in summary, the content of this specification should not be construed as limiting the present invention.
Claims (10)
1. A method of recording speech information in a conference, comprising:
acquiring audio signals in a conference;
segmenting one or more voice signals from the audio signals according to preset segmentation parameters;
identifying a speaker to which the voice signal belongs;
carrying out voice recognition on the voice signal to obtain text information;
and recording the text information into the speaking information of the speaker.
2. The method of claim 1, wherein the segmentation parameter comprises an interruption time, and wherein segmenting one or more voice signals from the audio signal according to the preset segmentation parameter comprises:
judging whether the audio signal is a voice signal or not;
if so, marking the initial position of the voice signal;
calculating the interruption time of the voice signal;
if the interruption time exceeds a preset time threshold, marking the end position of the voice signal;
segmenting a speech signal located between the start position and the end position from the audio signal.
3. The method of claim 1 or 2, wherein the identifying the speaker to which the speech signal belongs comprises:
extracting target voice characteristic information from the voice signal;
matching the target voice characteristic information with reference voice characteristic information of a speaker;
and if the target voice characteristic information is successfully matched with the reference voice characteristic information of a speaker, determining that the voice signal belongs to the speaker.
4. The method of claim 3, wherein the extracting target speech feature information from the speech signal comprises:
if the time length of the voice signal exceeds a preset length threshold, extracting at least two voice segment signals from the voice signal;
and extracting voice characteristic information from each voice segment signal, and combining the extracted voice characteristic information into the target voice characteristic information.
5. The method of claim 3, wherein the identifying the speaker to which the speech signal belongs, further comprises:
and if the target voice characteristic information is successfully matched with the reference voice characteristic information of at least two speakers, adjusting the segmentation parameters, and returning to the step of segmenting one or more voice signals from the audio signals according to the preset segmentation parameters.
6. The method of claim 3, wherein the identifying the speaker to which the speech signal belongs, further comprises:
if the target voice characteristic information fails to match the reference voice characteristic information of every speaker, generating a temporary user as the speaker, and setting the target voice characteristic information as the reference voice characteristic information of the temporary user;
determining that a speaker to which the voice signal belongs is the temporary user.
7. The method of claim 1, 2, 4, 5 or 6, wherein the recording the text information into the speech information of the speaker comprises:
setting the time of the voice signal as the speaking time of the speaker;
searching text information at the same speaking time;
and setting the text information as the speaking information of the speaker in the speaking time.
8. The method of claim 7, wherein the recording the text information into the speech information of the speaker further comprises:
and if the speaker is a temporary user and receives change information aiming at the temporary user, modifying the information of the temporary user according to the change information.
9. An apparatus for recording speech information in a conference, comprising:
the audio signal acquisition module is used for acquiring audio signals in a conference;
the voice signal segmentation module is used for segmenting one or more voice signals from the audio signals according to preset segmentation parameters;
the speaker recognition module is used for recognizing a speaker to which the voice signal belongs;
the voice recognition module is used for carrying out voice recognition on the voice signal to obtain text information;
and the speech information recording module is used for recording the text information into the speech information of the speaker.
10. The apparatus of claim 9, wherein the segmentation parameter comprises an interruption time, and wherein the voice signal segmentation module comprises:
the voice signal judgment submodule is used for judging whether the audio signal is a voice signal; if yes, calling a start position marking submodule;
a start position marking submodule for marking a start position of the voice signal;
the interruption time calculation submodule is used for calculating the interruption time of the voice signal;
an end position marking submodule, configured to mark an end position of the voice signal if the interruption time exceeds a preset time threshold;
and the position segmentation submodule is used for segmenting the voice signal between the starting position and the ending position from the audio signal.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810688911.6A CN110661923A (en) | 2018-06-28 | 2018-06-28 | Method and device for recording speech information in conference |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810688911.6A CN110661923A (en) | 2018-06-28 | 2018-06-28 | Method and device for recording speech information in conference |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110661923A true CN110661923A (en) | 2020-01-07 |
Family
ID=69026415
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810688911.6A Pending CN110661923A (en) | 2018-06-28 | 2018-06-28 | Method and device for recording speech information in conference |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110661923A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113055529A (en) * | 2021-03-29 | 2021-06-29 | 深圳市艾酷通信软件有限公司 | Recording control method and recording control device |
CN115828907A (en) * | 2023-02-16 | 2023-03-21 | 南昌航天广信科技有限责任公司 | Intelligent conference management method, system, readable storage medium and computer equipment |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106657865A (en) * | 2016-12-16 | 2017-05-10 | 联想(北京)有限公司 | Method and device for generating conference summary and video conference system |
CN106782545A (en) * | 2016-12-16 | 2017-05-31 | 广州视源电子科技股份有限公司 | System and method for converting audio and video data into character records |
CN107978317A (en) * | 2017-12-18 | 2018-05-01 | 北京百度网讯科技有限公司 | Meeting summary synthetic method, system and terminal device |
CN108074570A (en) * | 2017-12-26 | 2018-05-25 | 安徽声讯信息技术有限公司 | Surface trimming, transmission, the audio recognition method preserved |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111816218B (en) | Voice endpoint detection method, device, equipment and storage medium | |
JP6171617B2 (en) | Response target speech determination apparatus, response target speech determination method, and response target speech determination program | |
US7881930B2 (en) | ASR-aided transcription with segmented feedback training | |
US10074363B2 (en) | Method and apparatus for keyword speech recognition | |
WO2017084360A1 (en) | Method and system for speech recognition | |
US9911411B2 (en) | Rapid speech recognition adaptation using acoustic input | |
US20070118374A1 (en) | Method for generating closed captions | |
US20070118364A1 (en) | System for generating closed captions | |
US20220238118A1 (en) | Apparatus for processing an audio signal for the generation of a multimedia file with speech transcription | |
US20190180758A1 (en) | Voice processing apparatus, voice processing method, and non-transitory computer-readable storage medium for storing program | |
CN113192535B (en) | Voice keyword retrieval method, system and electronic device | |
US20170270923A1 (en) | Voice processing device and voice processing method | |
CN114385800A (en) | Voice conversation method and device | |
US7689414B2 (en) | Speech recognition device and method | |
KR101122590B1 (en) | Apparatus and method for speech recognition by dividing speech data | |
JP4791857B2 (en) | Utterance section detection device and utterance section detection program | |
CN110661923A (en) | Method and device for recording speech information in conference | |
CN111739536A (en) | Audio processing method and device | |
CN109065026B (en) | Recording control method and device | |
WO2021152566A1 (en) | System and method for shielding speaker voice print in audio signals | |
Georgescu et al. | Rodigits-a romanian connected-digits speech corpus for automatic speech and speaker recognition | |
CN116229987B (en) | Campus voice recognition method, device and storage medium | |
JP6526602B2 (en) | Speech recognition apparatus, method thereof and program | |
CN112820281B (en) | Voice recognition method, device and equipment | |
CN108364654B (en) | Voice processing method, medium, device and computing equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20200107 |