CN111312260A - Human voice separation method, device and equipment - Google Patents


Info

Publication number
CN111312260A
CN111312260A (application number CN202010299798.XA)
Authority
CN
China
Prior art keywords
conference
audio conference
participant
audio
sound data
Prior art date
Legal status
Pending
Application number
CN202010299798.XA
Other languages
Chinese (zh)
Inventor
肖龙源
李稀敏
刘晓葳
谭玉坤
叶志坚
Current Assignee
Xiamen Kuaishangtong Technology Co Ltd
Original Assignee
Xiamen Kuaishangtong Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Xiamen Kuaishangtong Technology Co Ltd
Priority to CN202010299798.XA
Publication of CN111312260A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating

Abstract

The invention discloses a human voice separation method, apparatus, and device. The method comprises the following steps: during a conference recording scene of an audio conference, collecting sound data of each participant in the audio conference; extracting voiceprint features from the collected sound data; performing voice separation on the audio conference content corresponding to the conference recording scene according to the extracted voiceprint features; marking each participant's sound data in the voice-separated content by timestamp; and forming a text conference record of the audio conference content from the respectively marked sound data. In this way, when applied to a conference recording scene of an audio conference, voice separation can be performed on the audio conference content without manual work to form a text conference record, and the accuracy of the text conference record so formed can be improved.

Description

Human voice separation method, device and equipment
Technical Field
The invention relates to the technical field of voice separation, in particular to a voice separation method, a voice separation device and voice separation equipment.
Background
Audio conferencing refers to a conference between two or more individuals or groups in different places, which transmits sounds to each other through a transmission line and multimedia devices, so as to realize instant and interactive communication.
However, when the existing voice separation scheme is applied to a conference recording scene of an audio conference, voice separation is generally performed manually on the audio conference content to form a text conference record. The human ear's ability to resolve the overlapping voices of multiple speakers in the audio conference content is limited, and because of human subjectivity, the accuracy of the text conference record formed by such voice separation is mediocre.
Disclosure of Invention
In view of the above, an object of the present invention is to provide a human voice separation method, apparatus, and device which, when applied to a conference recording scene of an audio conference, can perform voice separation on the audio conference content without manual work to form a text conference record, and can improve the accuracy of the text conference record so formed.
According to an aspect of the present invention, there is provided a human voice separating method including: collecting sound data of each participant participating in the audio conference when the conference recording scene of the audio conference is carried out; performing voiceprint feature extraction on the collected voice data of each participant; according to the extracted voiceprint features, carrying out voice separation on audio conference content corresponding to a conference recording scene of the audio conference; according to a timestamp mode, respectively marking the sound data of each participant in the audio conference content after the voice separation; and forming a text conference record of the audio conference content corresponding to the conference recording scene of the audio conference according to the sound data of each participant after being respectively marked.
Wherein performing voice separation on the audio conference content corresponding to the conference recording scene of the audio conference according to the extracted voiceprint features comprises: obtaining the voice sound data of the audio conference content corresponding to the conference recording scene of the audio conference; extracting voiceprint features from the obtained voice sound data; comparing the voiceprint features extracted from the conference content with the voiceprint features extracted from the participants' collected sound data; and separating out the parts of the audio conference content whose voiceprint features match a participant's voiceprint features, thereby performing voice separation on the audio conference content.
Wherein, according to the timestamp mode, the sound data of each participant in the audio conference content after the voice separation is respectively marked, including: and generating a label associated with the timestamp according to the timestamp corresponding to the sound data of each participant in the audio conference content after the voice separation, and respectively marking the sound data of each participant in the audio conference content after the voice separation according to the generated label.
Wherein, the forming of the text conference record of the audio conference content corresponding to the conference record scene of the audio conference according to the respectively marked sound data of each participant comprises: and forming a text conference record of the audio conference content corresponding to the conference recording scene of the audio conference by adopting a natural language processing mode according to the sound data of each participant after being respectively marked.
Wherein, after the forming of the text conference record of the audio conference content corresponding to the conference recording scene of the audio conference according to the respectively marked sound data of each participant, the method further comprises: and configuring the text conference record of each participant according to the formed text conference record.
According to another aspect of the present invention, there is provided a human voice separating apparatus comprising: the device comprises an acquisition module, an extraction module, a separation module, a marking module and a forming module; the acquisition module is used for acquiring the sound data of each participant participating in the audio conference when a conference recording scene of the audio conference is carried out; the extraction module is used for extracting the voiceprint characteristics of the collected voice data of each participant; the separation module is used for carrying out voice separation on the audio conference content corresponding to the conference recording scene of the audio conference according to the extracted voiceprint characteristics; the marking module is used for respectively marking the sound data of each participant in the audio conference content after the voice separation in a time stamp mode; and the forming module is used for forming the text conference record of the audio conference content corresponding to the conference recording scene of the audio conference according to the sound data of each participant after being respectively marked.
Wherein the separation module is specifically configured to: obtain the voice sound data of the audio conference content corresponding to the conference recording scene of the audio conference; extract voiceprint features from the obtained voice sound data; compare the voiceprint features extracted from the conference content with the voiceprint features extracted from the participants' collected sound data; and separate out the parts of the audio conference content whose voiceprint features match a participant's voiceprint features, thereby performing voice separation on the audio conference content.
Wherein, the marking module is specifically configured to: and generating a label associated with the timestamp according to the timestamp corresponding to the sound data of each participant in the audio conference content after the voice separation, and respectively marking the sound data of each participant in the audio conference content after the voice separation according to the generated label.
Wherein the forming module is specifically configured to: and forming a text conference record of the audio conference content corresponding to the conference recording scene of the audio conference by adopting a natural language processing mode according to the sound data of each participant after being respectively marked.
Wherein, the voice separator still includes: a configuration module; and the configuration module is used for configuring the text conference record of each participant according to the formed text conference record.
According to still another aspect of the present invention, there is provided a human voice separating apparatus including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform any of the person voice separation methods described above.
According to still another aspect of the present invention, there is provided a computer-readable storage medium storing a computer program which, when executed by a processor, implements the human voice separating method of any one of the above.
It can be found that, with the above scheme, during a conference recording scene of an audio conference, the sound data of each participant can be collected; voiceprint features can be extracted from the collected sound data; the audio conference content corresponding to the conference recording scene can be voice-separated according to the extracted voiceprint features; each participant's sound data in the separated content can be marked by timestamp; and a text conference record of the audio conference content can be formed from the respectively marked sound data. In this way, when applied to a conference recording scene of an audio conference, voice separation can be performed on the audio conference content without manual work to form a text conference record, and the accuracy of the text conference record formed by voice separation can be improved.
Furthermore, with the above scheme, the voice sound data of the audio conference content corresponding to the conference recording scene of the audio conference can be obtained, voiceprint features can be extracted from it and compared against the voiceprint features extracted from the participants' collected sound data, and the parts of the content whose voiceprint features match can be separated out. The advantage of this is that, through the uniqueness of voiceprint features, the accuracy of voice separation on the audio conference content corresponding to the conference recording scene can be improved.
Furthermore, with the above scheme, a tag associated with the timestamp can be generated from the timestamp corresponding to each participant's sound data in the voice-separated audio conference content, and each participant's sound data can be marked with the generated tag. The advantage of this is that each participant's sound data can be accurately distinguished, which improves the accuracy of the text conference record formed from it.
Furthermore, with the above scheme, a text conference record of the audio conference content corresponding to the conference recording scene of the audio conference can be formed by natural language processing from the respectively marked sound data of each participant, which has the advantage of improving the accuracy of the text conference record formed by voice separation.
Further, with the above scheme, the text conference record of each participant can be configured according to the formed text conference record, which has the advantage that each participant's text conference record can be conveniently managed.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart of an embodiment of a human voice separation method of the present invention;
FIG. 2 is a schematic flow chart of another embodiment of the human voice separation method of the present invention;
FIG. 3 is a schematic structural diagram of an embodiment of a human voice separating apparatus according to the present invention;
FIG. 4 is a schematic structural diagram of another embodiment of the human voice separating apparatus of the present invention;
fig. 5 is a schematic structural diagram of an embodiment of the human voice separating apparatus of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be noted that the following examples are only illustrative of the present invention, and do not limit the scope of the present invention. Similarly, the following examples are only some but not all examples of the present invention, and all other examples obtained by those skilled in the art without any inventive work are within the scope of the present invention.
The invention provides a voice separation method, which can be used for realizing voice separation according to the content of an audio conference without manual work to form a text conference record when being applied to a conference recording scene of the audio conference, and can improve the accuracy of the text conference record formed by voice separation.
Referring to fig. 1, fig. 1 is a schematic flow chart of an embodiment of a human voice separation method according to the present invention. It should be noted that the method of the present invention is not limited to the flow sequence shown in fig. 1 if the results are substantially the same. As shown in fig. 1, the method comprises the steps of:
S101: During a conference recording scene of an audio conference, collect sound data of each participant participating in the audio conference.
In this embodiment, the sound data of each participant may be collected all at once, in multiple passes, or participant by participant; the invention is not limited in this respect.
In this embodiment, the invention may collect multiple pieces of sound data from the same participant, a single piece of sound data from the same participant, or multiple pieces of sound data from multiple participants; the invention is not limited in this respect.
S102: and carrying out voiceprint feature extraction on the collected voice data of each participant.
In this embodiment, voiceprint feature extraction may be performed on the collected sound data all at once, in multiple passes, or participant by participant; the invention is not limited in this respect.
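The voiceprint feature extraction step can be sketched as follows. This is a minimal, hypothetical illustration in Python that uses an averaged log-magnitude spectrum as a stand-in voiceprint; the patent does not specify a concrete feature, and a production system would more likely use MFCCs or a learned speaker embedding.

```python
import numpy as np

def extract_voiceprint(samples: np.ndarray, frame_len: int = 512) -> np.ndarray:
    """Crude fixed-length 'voiceprint': the averaged log-magnitude
    spectrum over all complete frames of the signal.  A real system
    would use MFCCs or a neural speaker embedding instead."""
    n_frames = len(samples) // frame_len
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)
    spectra = np.abs(np.fft.rfft(frames * np.hanning(frame_len), axis=1))
    return np.log1p(spectra).mean(axis=0)  # one vector per recording
```

The same function can be applied both to each participant's collected sound data and to segments of the conference audio, so the resulting vectors are directly comparable in the separation step.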
S103: and according to the extracted voiceprint characteristics, carrying out voice separation on the audio conference content corresponding to the conference recording scene of the audio conference.
Wherein performing voice separation on the audio conference content corresponding to the conference recording scene of the audio conference according to the extracted voiceprint features may include:
the method comprises the steps of obtaining voice sound data of audio conference content corresponding to a conference recording scene of the audio conference, carrying out voiceprint feature extraction on the obtained voice sound data, comparing the extracted voiceprint features with the extracted voiceprint features, adopting a mode of separating the audio conference content corresponding to the voiceprint features which are the same as the extracted voiceprint features in the extracted voiceprint features, carrying out voice separation on the audio conference content corresponding to the conference recording scene of the audio conference, and improving the accuracy of carrying out voice separation on the audio conference content corresponding to the conference recording scene of the audio conference.
S104: and respectively marking the sound data of each participant in the audio conference content after the voice separation in a time stamp mode.
Wherein, according to the timestamp mode, the sound data of each participant in the audio conference content after separating the voice is respectively marked, which may include:
a tag associated with the timestamp is generated from the timestamp corresponding to each participant's sound data in the voice-separated audio conference content, and each participant's sound data in the voice-separated content is then marked with the generated tag.
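The timestamp-associated tag generation can be illustrated as below; the tag format and the `Segment` structure are hypothetical, since the patent fixes neither.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    speaker: str
    start: float  # seconds from conference start
    end: float

def make_label(seg: Segment) -> str:
    # Tag associated with the segment's timestamp; the zero-padded
    # format is an illustrative choice, not mandated by the patent.
    return f"[{seg.start:08.3f}-{seg.end:08.3f}] {seg.speaker}"
```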
S105: and forming a text conference record of the audio conference content corresponding to the conference record scene of the audio conference according to the sound data of each participant after being respectively marked.
Wherein, the forming of the text conference record of the audio conference content corresponding to the conference record scene of the audio conference according to the sound data of each participant after being respectively marked may include:
according to the sound data of each participant after being marked respectively, a Natural Language Processing (NLP) mode is adopted to form the text conference record of the audio conference content corresponding to the conference recording scene of the audio conference, and the advantage is that the accuracy of the text conference record formed by separating the voices can be improved.
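Assembling the marked sound data into a text conference record might look like the following sketch. Speech recognition itself (part of the natural language processing step) is assumed to have already produced the text for each labeled segment.

```python
def build_transcript(marked_segments):
    """Assemble a text conference record from (label, recognized_text)
    pairs.  Zero-padded timestamp labels sort lexicographically in
    time order, so a plain sort restores the conversation order."""
    return "\n".join(f"{label}: {text}"
                     for label, text in sorted(marked_segments,
                                               key=lambda p: p[0]))
```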
After the text conference record of the audio conference content corresponding to the conference record scene of the audio conference is formed according to the sound data of each participant after being respectively marked, the method may further include:
configuring the text conference record of each participant according to the formed text conference record has the advantage that the text conference record of each participant can be conveniently managed.
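One possible reading of "configuring the text conference record of each participant" is grouping the record by speaker, as in this hypothetical sketch:

```python
from collections import defaultdict

def per_participant_records(segments):
    """Group (speaker, text) pairs into one record per participant,
    so each participant's contributions can be managed separately."""
    records = defaultdict(list)
    for speaker, text in segments:
        records[speaker].append(text)
    return dict(records)
```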
It can be found that, in this embodiment, during a conference recording scene of an audio conference, the sound data of each participant can be collected; voiceprint features can be extracted from the collected sound data; the audio conference content corresponding to the conference recording scene can be voice-separated according to the extracted voiceprint features; each participant's sound data in the separated content can be marked by timestamp; and a text conference record of the audio conference content can be formed from the respectively marked sound data. In this way, when applied to a conference recording scene of an audio conference, voice separation can be performed on the audio conference content without manual work to form a text conference record, and the accuracy of the text conference record formed by voice separation can be improved.
Further, in this embodiment, the voice sound data of the audio conference content corresponding to the conference recording scene of the audio conference can be obtained, voiceprint features can be extracted from it and compared against the voiceprint features extracted from the participants' collected sound data, and the parts of the content whose voiceprint features match can be separated out, thereby performing voice separation on the audio conference content corresponding to the conference recording scene.
Further, in this embodiment, a tag associated with the timestamp may be generated according to a timestamp corresponding to the sound data of each participant in the audio conference content after the voice separation, and the sound data of each participant in the audio conference content after the voice separation may be respectively tagged according to the generated tag.
Further, in this embodiment, a natural language processing method may be used to form a text conference record of the audio conference content corresponding to the conference recording scene of the audio conference according to the sound data of each participant after being respectively marked, which can improve the accuracy of the text conference record formed by separating the voices.
Referring to fig. 2, fig. 2 is a schematic flow chart of another embodiment of the human voice separation method of the present invention. In this embodiment, the method includes the steps of:
S201: During a conference recording scene of an audio conference, collect sound data of each participant participating in the audio conference.
As described above in S101, further description is omitted here.
S202: and carrying out voiceprint feature extraction on the collected voice data of each participant.
As described above in S102, further description is omitted here.
S203: and according to the extracted voiceprint characteristics, carrying out voice separation on the audio conference content corresponding to the conference recording scene of the audio conference.
As described above in S103, which is not described herein.
S204: and respectively marking the sound data of each participant in the audio conference content after the voice separation in a time stamp mode.
As described above in S104, and will not be described herein.
S205: and forming a text conference record of the audio conference content corresponding to the conference record scene of the audio conference according to the sound data of each participant after being respectively marked.
As described above in S105, which is not described herein.
S206: and configuring the text conference record of each participant according to the formed text conference record.
It can be seen that, in this embodiment, the text conference record of each participant can be configured according to the formed text conference record, which has the advantage of being able to conveniently manage the text conference record of each participant.
The invention also provides a voice separation device, which can realize that voice separation can be carried out according to the content of the audio conference without manual work to form text conference records when being applied to a conference recording scene of the audio conference, and can improve the accuracy of the text conference records formed by voice separation.
Referring to fig. 3, fig. 3 is a schematic structural diagram of an embodiment of a human voice separation apparatus according to the present invention. In this embodiment, the human voice separating apparatus 30 includes an acquisition module 31, an extraction module 32, a separation module 33, a marking module 34, and a forming module 35.
The collecting module 31 is configured to collect sound data of each participant participating in the audio conference when a conference recording scene of the audio conference is performed.
The extracting module 32 is configured to perform voiceprint feature extraction on the collected voice data of each participant.
The separation module 33 is configured to perform voice separation on the audio conference content corresponding to the conference recording scene of the audio conference according to the extracted voiceprint feature.
The marking module 34 is configured to mark the sound data of each participant in the audio conference content after the voice separation in a time stamp manner.
The forming module 35 is configured to form a text conference record of the audio conference content corresponding to the conference recording scene of the audio conference according to the sound data of each participant after being respectively marked.
Optionally, the separation module 33 may be specifically configured to:
the method comprises the steps of obtaining voice sound data of audio conference content corresponding to a conference recording scene of the audio conference, carrying out voiceprint feature extraction on the obtained voice sound data, comparing the extracted voiceprint feature with the extracted voiceprint feature, separating the audio conference content corresponding to the voiceprint feature which is the same as the extracted voiceprint feature in the extracted voiceprint feature by adopting a mode of separating the audio conference content, and carrying out voice separation on the audio conference content corresponding to the conference recording scene of the audio conference.
Optionally, the marking module 34 may be specifically configured to:
and generating a label associated with the timestamp according to the timestamp corresponding to the sound data of each participant in the audio conference content after the voice separation, and respectively marking the sound data of each participant in the audio conference content after the voice separation according to the generated label.
Optionally, the forming module 35 may be specifically configured to:
and forming a text conference record of the audio conference content corresponding to the conference recording scene of the audio conference by adopting a natural language processing mode according to the sound data of each participant after being respectively marked.
Referring to fig. 4, fig. 4 is a schematic structural diagram of another embodiment of the human voice separation apparatus of the present invention. Different from the previous embodiment, the human voice separating apparatus 40 of the present embodiment further includes a configuration module 41.
The configuration module 41 is configured to configure the text conference record of each participant according to the formed text conference record.
Each unit module of the voice separating apparatus 30/40 can respectively execute the corresponding steps in the above method embodiments, and therefore, the detailed description of each unit module is omitted here, and please refer to the description of the corresponding steps above.
The present invention also provides a human voice separating apparatus, as shown in fig. 5, comprising: at least one processor 51; and a memory 52 communicatively coupled to the at least one processor 51; the memory 52 stores instructions executable by the at least one processor 51, and the instructions are executed by the at least one processor 51 to enable the at least one processor 51 to execute the voice separating method.
Wherein the memory 52 and the processor 51 are coupled via a bus, which may comprise any number of interconnected buses and bridges linking together various circuits of the processor 51 and the memory 52. The bus may also connect various other circuits such as peripherals, voltage regulators, and power management circuits, which are well known in the art and therefore will not be described further herein. A bus interface provides an interface between the bus and the transceiver. The transceiver may be one element or a plurality of elements, such as a plurality of receivers and transmitters, providing a means for communicating with various other apparatus over a transmission medium. Data processed by the processor 51 is transmitted over a wireless medium via an antenna, which also receives data and forwards it to the processor 51.
The processor 51 is responsible for managing the bus and for general processing, and may also provide various functions including timing, peripheral interfacing, voltage regulation, power management, and other control functions. The memory 52 may be used to store data that the processor 51 uses when performing operations.
The present invention further provides a computer-readable storage medium storing a computer program. When executed by a processor, the computer program implements the method of the embodiments described above.
It can be found that, according to the above scheme, in a conference recording scene of an audio conference, sound data of each participant in the audio conference is collected, and voiceprint features are extracted from the collected sound data of each participant. According to the extracted voiceprint features, human voice separation is performed on the audio conference content corresponding to the conference recording scene of the audio conference; the sound data of each participant in the separated audio conference content is then marked in a timestamp manner, and a text conference record of the audio conference content is formed from the respectively marked sound data of each participant. In this way, the text conference record can be formed through human voice separation of the audio conference content without manual work, and the accuracy of the text conference record formed through human voice separation can be improved.
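The overall flow described above can be sketched as follows. This is a minimal, hypothetical illustration, not the patent's actual implementation: the function names are invented, the "voiceprint" is a crude two-value stand-in for a real speaker embedding, and the recognized text per segment is assumed to be supplied by a separate speech-to-text step.

```python
# Minimal sketch of the described flow; all names are hypothetical and the
# per-step logic is a simplified placeholder for illustration only.

def extract_voiceprint(samples):
    """Stand-in feature extractor: a crude 2-value 'voiceprint' built from
    the mean level and mean absolute change of the waveform samples."""
    mean = sum(samples) / len(samples)
    diffs = [abs(b - a) for a, b in zip(samples, samples[1:])]
    return (mean, sum(diffs) / max(len(diffs), 1))

def transcribe_conference(participant_samples, conference_segments):
    """participant_samples: {name: waveform} collected per participant.
    conference_segments: [(start_sec, waveform, recognized_text), ...].
    Returns timestamp-marked lines attributed by the closest voiceprint."""
    enrolled = {name: extract_voiceprint(w) for name, w in participant_samples.items()}
    record = []
    for start, waveform, text in conference_segments:
        vp = extract_voiceprint(waveform)
        # Attribute the segment to the enrolled participant whose voiceprint
        # is nearest (squared Euclidean distance here for simplicity).
        speaker = min(
            enrolled,
            key=lambda n: sum((a - b) ** 2 for a, b in zip(enrolled[n], vp)),
        )
        record.append(f"[{start:06.1f}s] {speaker}: {text}")
    return record
```

A real system would replace `extract_voiceprint` with a trained speaker-embedding model; only the collect → extract → separate → mark → record ordering mirrors the scheme above.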
Furthermore, according to the above scheme, human voice sound data of the audio conference content corresponding to the conference recording scene of the audio conference can be obtained, voiceprint features can be extracted from the obtained human voice sound data, and these refined voiceprint features can be compared with the previously extracted voiceprint features. Human voice separation is then performed by separating out the audio conference content corresponding to the refined voiceprint features that are the same as the extracted voiceprint features. The advantage of this is that, by exploiting the uniqueness of voiceprint features, the accuracy of human voice separation on the audio conference content corresponding to the conference recording scene of the audio conference can be improved.
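The voiceprint comparison step can be sketched as a similarity match against the enrolled features. This is an assumed illustration: cosine similarity and the 0.8 threshold are common choices in speaker verification, not values stated in the patent, and the function names are hypothetical.

```python
def cosine_similarity(a, b):
    """Cosine similarity between two voiceprint feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

def match_voiceprint(refined, enrolled, threshold=0.8):
    """Return the enrolled participant whose voiceprint best matches the
    refined feature vector, or None if no score reaches the threshold."""
    best_name, best_score = None, threshold
    for name, vector in enrolled.items():
        score = cosine_similarity(refined, vector)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name
</antml>```

Returning `None` below the threshold keeps unenrolled voices (e.g. a visitor) out of every participant's separated track.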
Furthermore, according to the above scheme, a tag associated with a timestamp can be generated from the timestamp corresponding to the sound data of each participant in the separated audio conference content, and the sound data of each participant in the separated audio conference content can be respectively marked according to the generated tags. The advantage of this is that the sound data of each participant can be marked individually and therefore accurately distinguished, which improves the accuracy of the text conference record formed from the sound data of each participant.
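Generating a timestamp-associated tag per segment might look like the following. The `participant#ISO-time` tag format is purely illustrative, an assumption of this sketch rather than anything specified by the scheme.

```python
import time

def make_timestamp_tag(participant, timestamp):
    """Build a tag that associates a segment with its timestamp, e.g.
    'P2#2020-04-16T09:30:00' (the format is illustrative only)."""
    iso = time.strftime("%Y-%m-%dT%H:%M:%S", time.gmtime(timestamp))
    return f"{participant}#{iso}"

def mark_segments(segments):
    """segments: [(participant, unix_timestamp, sound_data), ...].
    Returns the same segments with a generated tag attached to each."""
    return [
        (make_timestamp_tag(p, ts), p, ts, data)
        for p, ts, data in segments
    ]
```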
Furthermore, according to the above scheme, the text conference record of the audio conference content corresponding to the conference recording scene of the audio conference can be formed from the respectively marked sound data of each participant by means of natural language processing. The advantage of this is that the accuracy of the text conference record formed through human voice separation can be improved.
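Assembling the marked, recognized segments into a text conference record can be sketched as below. The natural-language-processing (speech-to-text) step itself is assumed to have already produced `recognized_text`; this sketch only shows the chronological assembly, and the `[MM:SS]` line format is an assumption for illustration.

```python
def form_text_record(marked_segments):
    """marked_segments: [(participant, start_seconds, recognized_text), ...].
    Returns the conference record as chronologically ordered text lines."""
    lines = []
    for participant, start, text in sorted(marked_segments, key=lambda s: s[1]):
        minutes, seconds = divmod(int(start), 60)
        lines.append(f"[{minutes:02d}:{seconds:02d}] {participant}: {text}")
    return "\n".join(lines)
```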
Further, according to the above scheme, the text conference record of each participant can be configured according to the formed text conference record, which makes the text conference record of each participant convenient to manage.
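One plausible reading of "configuring" a per-participant record is splitting the full record by speaker, as in the hypothetical sketch below (the grouping approach is an assumption, not detailed by the scheme).

```python
from collections import defaultdict

def configure_per_participant(record_lines):
    """Split a full text conference record into one record per participant.
    record_lines: [(participant, line_text), ...] from the full record,
    already in chronological order."""
    per_participant = defaultdict(list)
    for participant, line in record_lines:
        per_participant[participant].append(line)
    return dict(per_participant)
```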
In the several embodiments provided in the present invention, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a module or a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied, in whole or in part, in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The above description is only a part of the embodiments of the present invention, and not intended to limit the scope of the present invention, and all equivalent devices or equivalent processes performed by the present invention through the contents of the specification and the drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A human voice separation method, comprising:
collecting sound data of each participant participating in the audio conference when the conference recording scene of the audio conference is carried out;
performing voiceprint feature extraction on the collected voice data of each participant;
according to the extracted voiceprint features, carrying out voice separation on audio conference content corresponding to a conference recording scene of the audio conference;
according to a timestamp mode, respectively marking the sound data of each participant in the audio conference content after the voice separation;
and forming a text conference record of the audio conference content corresponding to the conference recording scene of the audio conference according to the sound data of each participant after being respectively marked.
2. The method for separating human voice according to claim 1, wherein the performing human voice separation on the audio conference content corresponding to the conference recording scene of the audio conference according to the extracted voiceprint features comprises:
the method comprises the steps of obtaining voice sound data of audio conference content corresponding to a conference recording scene of the audio conference, carrying out voiceprint feature extraction on the obtained voice sound data, comparing the extracted voiceprint features with the extracted voiceprint features, and carrying out voice separation on the audio conference content corresponding to the conference recording scene of the audio conference by adopting a mode of separating the audio conference content corresponding to the voiceprint features which are the same as the extracted voiceprint features in the extracted voiceprint features.
3. The method for separating human voice according to claim 1, wherein the step of marking the sound data of each participant in the audio conference content after the human voice separation in a time stamp manner comprises:
and generating a label associated with the timestamp according to the timestamp corresponding to the sound data of each participant in the audio conference content after the voice separation, and respectively marking the sound data of each participant in the audio conference content after the voice separation according to the generated label.
4. The method for separating human voice according to claim 1, wherein the forming a text conference record of the audio conference content corresponding to the conference record scene of the audio conference according to the sound data of each participant after being respectively marked comprises:
and forming a text conference record of the audio conference content corresponding to the conference recording scene of the audio conference by adopting a natural language processing mode according to the sound data of each participant after being respectively marked.
5. The method for separating human voice according to claim 1, wherein after the forming of the text conference record of the audio conference content corresponding to the conference record scene of the audio conference according to the sound data of each participant after being respectively marked, further comprising:
and configuring the text conference record of each participant according to the formed text conference record.
6. A human voice separating apparatus, comprising:
the device comprises an acquisition module, an extraction module, a separation module, a marking module and a forming module;
the acquisition module is used for acquiring the sound data of each participant participating in the audio conference when a conference recording scene of the audio conference is carried out;
the extraction module is used for extracting the voiceprint characteristics of the collected voice data of each participant;
the separation module is used for carrying out voice separation on the audio conference content corresponding to the conference recording scene of the audio conference according to the extracted voiceprint characteristics;
the marking module is used for respectively marking the sound data of each participant in the audio conference content after the voice separation in a time stamp mode;
and the forming module is used for forming the text conference record of the audio conference content corresponding to the conference recording scene of the audio conference according to the sound data of each participant after being respectively marked.
7. The human voice separation device of claim 6, wherein the separation module is specifically configured to:
the method comprises the steps of obtaining voice sound data of audio conference content corresponding to a conference recording scene of the audio conference, carrying out voiceprint feature extraction on the obtained voice sound data, comparing the extracted voiceprint features with the extracted voiceprint features, and carrying out voice separation on the audio conference content corresponding to the conference recording scene of the audio conference by adopting a mode of separating the audio conference content corresponding to the voiceprint features which are the same as the extracted voiceprint features in the extracted voiceprint features.
8. The human voice separation device of claim 6, wherein the tagging module is specifically configured to:
and generating a label associated with the timestamp according to the timestamp corresponding to the sound data of each participant in the audio conference content after the voice separation, and respectively marking the sound data of each participant in the audio conference content after the voice separation according to the generated label.
9. The voice separation device of claim 6, wherein the forming module is specifically configured to:
and forming a text conference record of the audio conference content corresponding to the conference recording scene of the audio conference by adopting a natural language processing mode according to the sound data of each participant after being respectively marked.
10. The human voice separating apparatus as claimed in claim 6, further comprising:
a configuration module;
and the configuration module is used for configuring the text conference record of each participant according to the formed text conference record.
CN202010299798.XA 2020-04-16 2020-04-16 Human voice separation method, device and equipment Pending CN111312260A (en)

Publications (1)

Publication Number Publication Date
CN111312260A true CN111312260A (en) 2020-06-19



Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2784090A1 (en) * 2011-09-01 2013-03-01 Research In Motion Limited Conferenced voice to text transcription
US20150296181A1 (en) * 2013-01-16 2015-10-15 Adobe Systems Incorporated Augmenting web conferences via text extracted from audio content
CN107564531A (en) * 2017-08-25 2018-01-09 百度在线网络技术(北京)有限公司 Minutes method, apparatus and computer equipment based on vocal print feature
CN108922538A (en) * 2018-05-29 2018-11-30 平安科技(深圳)有限公司 Conferencing information recording method, device, computer equipment and storage medium
CN109388701A (en) * 2018-08-17 2019-02-26 深圳壹账通智能科技有限公司 Minutes generation method, device, equipment and computer storage medium
CN109741754A (en) * 2018-12-10 2019-05-10 上海思创华信信息技术有限公司 A kind of conference voice recognition methods and system, storage medium and terminal


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114093383A (en) * 2022-01-17 2022-02-25 北京远鉴信息技术有限公司 Method and device for determining participant voice, electronic equipment and storage medium
CN114093383B (en) * 2022-01-17 2022-04-12 北京远鉴信息技术有限公司 Method and device for determining participant voice, electronic equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200619