CN114464198A - Visual human voice separation system, method and device - Google Patents
Visual human voice separation system, method and device
- Publication number
- CN114464198A (application CN202111437237.2A)
- Authority
- CN
- China
- Prior art keywords
- voice separation
- audio
- sentence
- file
- human voice
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G10L17/22 — Speaker identification or verification techniques; interactive procedures; man-machine interfaces
- G06F16/116 — Details of conversion of file system types or formats
- G06F16/168 — Details of user interfaces specifically adapted to file systems, e.g. browsing and visualisation, 2d or 3d GUIs
- G10L21/10 — Transformation of speech into a non-audible representation; transforming into visible information
- G10L25/24 — Speech or voice analysis techniques in which the extracted parameters are the cepstrum
- Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention belongs to the technical field of artificial-intelligence voice separation, and in particular relates to a visual human voice separation system, method and device. The method comprises: opening the visual voice separation system and importing the audio/video file to be separated; converting the audio/video into an audio format matched with the human voice separation algorithm; logically segmenting the audio file to be processed and dividing it into sentences in time order, finally forming a json file in which each sentence records the speaker name, start time and end time; displaying the separation result on an interface, with the audio file shown as a waveform in the upper half and the parsed json file shown as a list in the lower half; playing and adjusting each sentence on the result display interface to achieve accurate voice separation; and selecting and exporting the separated voice clauses as required. On the basis of an artificial-intelligence voice separation algorithm, the invention applies manual adjustment through interface visualization, thereby achieving an accurate voice separation effect.
Description
Technical Field
The invention belongs to the technical field of artificial intelligence voice separation, and particularly relates to a visual voice separation system, method and device.
Background
Call recordings are generally single-channel files containing the voices of two people. To further determine identity, the two voices in the recording need to be separated so that each speaker corresponds to one audio file, making it convenient to search a voiceprint library for a criminal suspect or to perform a 1:1 voiceprint comparison.
Because suspect identification is involved, a highly accurate human voice separation method is needed. With the development of artificial intelligence, deep-learning multilayer neural networks have greatly improved voice separation accuracy, but 100% accuracy still cannot be guaranteed. How to obtain a more accurate human voice separation method on top of the accuracy of the artificial-intelligence algorithm has therefore become an urgent problem.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a visual voice separation system, a visual voice separation method and a visual voice separation device, which are used for manually adjusting interface visualization on the basis of an artificial intelligent voice separation algorithm so as to achieve an accurate voice separation effect.
In order to achieve the purpose, the invention adopts the following technical scheme:
the invention provides a visual human voice separation system, which comprises:
the audio/video format conversion module is used for converting the file uploaded to the system into an audio format matched with the human voice separation algorithm module;
the voice separation algorithm module is used for logically segmenting the audio file to be processed, and carrying out sentence segmentation according to the time sequence to form a json file with each sentence including the name of a speaker, the starting time and the ending time;
the separation result display module is used for displaying the result segmented by the voice separation algorithm module on an interface, wherein the upper half part of the interface displays waveforms, and the lower half part of the interface displays list information of clauses;
the visual voice separation adjusting module is used for playing each clause independently via the play/pause controls in the list, observing the presence and magnitude of voice energy on the waveform, and repeatedly fine-tuning the start time and end time of each clause to adjust its time boundaries;
and the voice separation task management module is used for managing the uploaded voice separation tasks, where the audio/video uploaded by the user each time is managed as a single task.
Furthermore, the voice separation algorithm module adopts an artificial intelligent processing mode and realizes automatic voice separation through voice segmentation clustering based on mixed characteristics of the Mel frequency cepstrum coefficient and the gamma frequency cepstrum coefficient.
The invention also provides a visual human voice separation method, which comprises the following steps:
opening a visual human voice separation system, and importing the audio/video files to be separated into the system;
converting the audio/video into an audio format matched with a human voice separation algorithm;
carrying out logic segmentation on an audio file to be processed, and carrying out sentence segmentation according to time sequence to finally form a json file with each sentence including a speaker name, start time and end time;
displaying the separated result on an interface, displaying the audio file on the upper half part in a waveform form, and displaying the analyzed json file on the lower half part in a list form;
playing and adjusting each sentence on a result display interface to realize accurate voice separation;
and separating the separated voice clauses, and selecting and exporting the voice clauses according to requirements.
Further, the audio/video file to be separated is stored on a storage medium such as a USB flash drive, removable hard disk, optical drive or computer hard disk.
Further, the converting the audio/video into the audio format matched with the human voice separation algorithm comprises:
the audio format that the human voice separation algorithm recognizes is fixed at training time, so to match the algorithm, the imported audio/video must be converted into the audio format used during training.
Further, the logic segmentation is performed on the audio file to be processed, and the sentence division is performed according to the time sequence, so that a json file with each sentence including the speaker name, the start time and the end time is finally formed, and the method includes the following steps:
and calling a voice separation algorithm in the system to logically divide the converted audio file, and marking the audio file after sentence division, wherein each sentence comprises a speaker name, a start time, an end time and a single sentence duration, the sentences are stored in a text form, the sentences are arranged according to a time sequence, and all the sentences are finally combined to form a json file.
Further, the displaying the separated result on the interface, the audio file being displayed in a waveform form on the upper half, and the analyzed json file being displayed in a list form on the lower half, includes:
each clause in the json file corresponds to a start and end time in the audio file, and the interface displays the audio waveform together with the correspondence between each clause in the json file and the waveform; the display interface is divided into upper and lower halves, with the upper half displaying the audio waveform and the lower half displaying the parsed json file arranged as a list in time order.
Further, the playing and adjusting of each sentence is performed on the result display interface, so that accurate voice separation is realized, and the method comprises the following steps:
a semi-transparent marker layer is overlaid on the waveform according to the start and end times of each sentence parsed from the json file, so that the marker layer identifies each sentence's time boundaries on the waveform interface; by playing each clause independently via the play/pause controls in the list and observing the presence and magnitude of voice energy on the waveform, the start and end times of each sentence are adjusted by dragging the sentence's entire marker layer left or right, or by dragging a single boundary, and the system saves automatically after each time adjustment;
each sentence in the list corresponds to a marker layer on the waveform, and selecting a sentence in the list also selects its marker layer; playback of each row is controlled through the play/pause controls in the list, and the content of each sentence is adjusted through delete, add and modify operations.
Further, the selecting and exporting of the separated voice clauses according to the requirements comprises:
and if the selected clause contains a plurality of speakers, exporting a compressed file containing a plurality of audio file contents, wherein one speaker corresponds to one audio file.
The invention also provides a visual voice separation device which comprises a processor, a memory, a user interface, a network interface and a data bus, wherein the data bus connects the processor, the memory, the user interface and the network interface together, the memory stores an operating system, and the visual voice separation system, a user interface module and a network communication module are installed in the operating system.
Compared with the prior art, the invention has the following advantages:
To solve the problem that existing voice separation is not sufficiently accurate, the invention provides a visual voice separation method. First, an audio file is automatically segmented by an artificial-intelligence voice separation algorithm to form a json file in which each sentence records the speaker name, start time and end time. On top of the already high separation accuracy of the algorithm, accuracy is further improved through manual adjustment on a visual interface: the time boundaries are adjusted by changing each sentence's start and end times, thereby achieving an accurate voice separation effect.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a schematic structural diagram of a visual human voice separation system according to an embodiment of the present invention, in fig. 1, 101 denotes an audio/video format conversion module, 102 denotes a human voice separation algorithm module, 103 denotes a separation result display module, 104 denotes a visual human voice separation adjustment module, and 105 denotes a human voice separation task management module;
fig. 2 is a schematic structural diagram of a visual human voice separation device according to an embodiment of the present invention, in fig. 2, 201 denotes a processor, 202 denotes a data bus, 203 denotes a user interface, 204 denotes a network interface, and 205 denotes a memory;
FIG. 3 is a schematic flow chart of a visual human voice separation method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of visual human voice separation adjustment according to an embodiment of the present invention;
fig. 5 is a visual human voice separation adjustment interface according to an embodiment of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in FIG. 1, realizing visual voice separation requires a supporting software system. The software system may use a B/S or C/S architecture; with a B/S architecture, the system can be opened and used in any browser. The visual human voice separation system of this embodiment may be a web version or an installed version running on a PC. The system comprises an audio/video format conversion module 101, a human voice separation algorithm module 102, a separation result display module 103, a visual voice separation adjustment module 104 and a voice separation task management module 105.
The audio/video format conversion module 101 converts the file uploaded to the system into an audio format matched with the human voice separation algorithm module. Each artificial-intelligence algorithm has a default audio format used during training, and automatic separation and recognition accuracy is maximized only when files are converted into that format.
The human voice separation algorithm module 102 logically segments the audio file to be processed and divides it into sentences in time order, forming a json file in which each sentence records the speaker name, start time and end time. The algorithm adopts an artificial-intelligence approach and realizes automatic voice separation through speech segmentation clustering based on mixed features of Mel-frequency cepstral coefficients (MFCC) and gammatone-frequency cepstral coefficients (GFCC). The recognition rate is improved in two ways: continuously labelling new data for iterative training, and iteratively upgrading the algorithm itself; together these continuously raise the recognition rate of automatic voice segmentation.
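The segmentation-clustering idea behind the module can be illustrated with a minimal, self-contained numpy sketch. This is not the patent's algorithm: crude log band energies stand in for the MFCC/GFCC mixed features, and a bare 2-means loop stands in for the real clustering; all names are illustrative.

```python
import numpy as np

def frame_features(signal, sr=8000, frame_len=0.025, hop=0.010, n_bins=13):
    """Toy per-frame spectral features (a stand-in for the cepstral
    front end; real diarization would use MFCC/GFCC features)."""
    fl, hp = int(sr * frame_len), int(sr * hop)
    feats = []
    for i in range(0, len(signal) - fl, hp):
        frame = signal[i:i + fl]
        spec = np.abs(np.fft.rfft(frame * np.hamming(fl)))
        # log-compressed band energies as a crude cepstrum-like vector
        bands = np.array_split(spec, n_bins)
        feats.append(np.log1p([b.sum() for b in bands]))
    return np.array(feats)

def two_speaker_clustering(feats, n_iter=20):
    """Minimal 2-means over frame features -> 0/1 speaker labels.
    Centers are seeded with the first and last frames for determinism."""
    centers = np.stack([feats[0], feats[-1]])
    for _ in range(n_iter):
        d = np.linalg.norm(feats[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for k in range(2):
            if (labels == k).any():
                centers[k] = feats[labels == k].mean(axis=0)
    return labels
```

On a signal whose two halves have clearly different spectra (two synthetic "speakers"), the frame labels split roughly at the midpoint.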
And the separation result display module 103 is used for displaying the result segmented by the voice separation algorithm module on an interface, wherein the upper half part of the interface displays waveforms, and the lower half part of the interface displays list information of clauses.
The visual voice separation adjustment module 104 plays each clause independently via the play/pause controls in the list, allowing the presence and magnitude of voice energy to be observed on the waveform, and repeatedly fine-tunes the start and end time of each clause to adjust its time boundaries; the content of each sentence can also be adjusted through delete, add and modify operations, achieving the goal of visual adjustment.
The voice separation task management module 105 manages the uploaded voice separation tasks. The audio/video uploaded by a user each time is managed as a single task; the default task name is the name of the uploaded audio/video and can be modified manually. The number of speakers contained in the file can be filled in when the task is uploaded, helping the voice separation algorithm separate more effectively. Separated tasks are bound to the login account in the system, so historical voice separation tasks are visible at each login.
As shown in fig. 3, based on the above visualized human voice separation system, the embodiment further provides a visualized human voice separation method, which includes the following steps:
step S301, opening a visual human voice separation system, and importing an audio/video file to be separated into the system;
step S302, converting the audio/video into an audio format matched with a human voice separation algorithm;
step S303, carrying out logic segmentation on the audio file to be processed, and carrying out sentence segmentation according to the time sequence to finally form a json file with each sentence including the name of the speaker, the starting time and the ending time;
step S304, displaying the separated result on an interface, displaying the audio file on the upper half part in a waveform form, and displaying the analyzed json file on the lower half part in a list form;
step S305, playing and adjusting each sentence on a result display interface so as to achieve an accurate voice separation effect;
and S306, separating the separated voice clauses, and selecting and exporting the voice clauses according to requirements.
In step S301, the audio/video file to be separated is stored on a storage medium such as a USB flash drive, removable hard disk, optical drive or computer hard disk. The visual voice separation system is installed in the computer's operating system and must provide human-computer interaction to be used. This embodiment provides a page-based upload and import function: the user selects any storage medium connected to the PC, selects the audio/video file to be processed, and after its format passes verification the file is imported into the system for processing.
In step S302, the audio/video formats to be processed vary widely; common formats include mp3, ogg, wav, mp4, avi, m4a, etc. These formats cannot be used directly after upload, because the artificial-intelligence voice separation algorithm that performs automatic separation must be trained in advance to reach high accuracy, and it is trained on a fixed audio format with a fixed sampling rate and quantization depth. To match the algorithm, the imported audio/video must be converted into the matching audio format: for example, if the training audio used an 8 kHz sampling rate and 16-bit quantization, audio/video imported into the system is converted via the format conversion function to 8 kHz and 16-bit so that the algorithm recognizes it well. The format conversion function can convert most common audio/video formats into the format required by the algorithm.
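For illustration only, the sampling-rate and quantization conversion described above can be sketched as follows. This is naive linear-interpolation resampling in numpy, assumed here purely to show the idea; an actual system would more likely invoke a converter such as ffmpeg, which also unpacks containers like mp4 and avi.

```python
import numpy as np

def to_algorithm_format(samples, src_rate, dst_rate=8000):
    """Resample a mono float signal in [-1, 1] to the fixed training
    rate (8 kHz in the patent's example) and quantize to 16-bit PCM."""
    n_out = int(len(samples) * dst_rate / src_rate)
    src_pos = np.arange(n_out) * src_rate / dst_rate
    resampled = np.interp(src_pos, np.arange(len(samples)), samples)
    # scale to int16 range, rounding and clipping out-of-range values
    pcm16 = np.clip(np.round(resampled * 32767), -32768, 32767).astype(np.int16)
    return pcm16
```

One second of 44.1 kHz audio comes out as 8000 16-bit samples, ready for the separation algorithm.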
In step S303, a voice separation algorithm in the system is called to divide the converted audio file into clauses. The clauses in this example are logical rather than physical: the audio itself is not cut, and clause labels are merely attached in time order according to the speech characteristics in the audio. Each clause includes the speaker name, start time, end time and single-sentence duration, stored in text form. If the number of speakers filled in at upload is 2, the speaker names default to speaker 1 and speaker 2; the speakers of each sentence are distinguished by name, and the start and end times are relative to the total audio duration. The clauses are arranged in time order and finally combined into a json file, which is the output of the algorithm call. Artificial-intelligence algorithms iterate quickly, and continuous optimization is needed to improve automatic separation accuracy; outputting the algorithm's result as json text means the algorithm can be updated and optimized without affecting other modules, guaranteeing the algorithm's independence while retaining flexibility.
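The patent does not give a concrete schema for the json output, but a hypothetical example (field names are illustrative, not specified by the patent) shows the time-ordered, non-destructive clause list that later steps consume:

```python
import json

# Illustrative clause list: the patent only specifies that each clause
# carries a speaker name, start time, end time and single-sentence duration.
result = json.loads("""
[
  {"speaker": "speaker 1", "start": 0.00, "end": 3.42, "duration": 3.42},
  {"speaker": "speaker 2", "start": 3.42, "end": 7.10, "duration": 3.68},
  {"speaker": "speaker 1", "start": 7.10, "end": 9.85, "duration": 2.75}
]
""")

# Clauses are logical annotations over a single audio file: already sorted
# by time, and the audio itself is never cut at this stage.
assert result == sorted(result, key=lambda c: c["start"])
speakers = {c["speaker"] for c in result}
```

Because this text file is the only interface to the algorithm, the algorithm can be swapped or retrained without touching the display and adjustment modules.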
In step S304, after the voice separation algorithm is called, the separation result and the audio file are displayed on the interface to facilitate subsequent adjustment. Each clause in the json file corresponds to a start and end time in the audio file, so the interface must display the audio waveform together with each clause and its correspondence to the waveform. This example provides a visual adjustment interface divided into upper and lower halves: the upper half displays the audio waveform, and the lower half displays the parsed json file arranged as a list in time order, as shown in figs. 4 and 5. The waveform display area loads the audio file to be processed; the magnitude of voice energy is clearly visible on the waveform, i.e., where there is silence and where there is speech: waveforms with large voice energy show large up/down amplitude, while those with small energy are nearly flat. Displaying the waveform graphically makes it convenient to select and adjust the start and end times of each sentence.
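The "energy visible at a glance" property of the waveform view comes from drawing a downsampled amplitude envelope. A minimal sketch of one assumed rendering approach (not the patent's actual drawing code):

```python
import numpy as np

def peak_envelope(samples, n_buckets=800):
    """Downsample a long signal to per-bucket (min, max) pairs, as a
    waveform view typically draws it: silent stretches collapse to a
    flat line, loud speech shows large up/down amplitude."""
    buckets = np.array_split(np.asarray(samples, dtype=float), n_buckets)
    return [(b.min(), b.max()) for b in buckets if len(b)]
```

A signal that is silent in its first half and loud in its second yields flat buckets followed by wide ones, matching the description above.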
In step S305, a semi-transparent marker layer is overlaid on the waveform according to the start and end times of each sentence parsed from the json file, so that each sentence's time boundaries are clearly marked on the waveform interface while the waveform remains visible. By playing each clause independently via the play/pause controls in the list and observing the presence and magnitude of voice energy on the waveform, the start and end times of each sentence are adjusted by dragging the sentence's entire marker layer left or right, or by dragging a single boundary; the system saves automatically after each adjustment. After the json file is parsed, the clauses are arranged in the clause list area in ascending time order, and each row displays the speaker name, start time, end time, duration and operation controls. Each sentence in the list corresponds to a marker layer on the waveform, and selecting a sentence in the list also selects its marker layer. Clicking play/pause in a row controls playback of that row's voice. A delete operation is also provided: if a sentence is largely blank or mostly noise, it can be deleted, which removes only the clause from the list and does not delete the audio file itself. If the speaker name of a clause is recognized incorrectly, or the default speaker name needs changing, the system provides a function for modifying it. Through these visual interface functions, accurate voice separation can be realized, and two or more people in a single-channel call recording can be distinguished very accurately.
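The two drag operations described above, moving a whole marker layer versus moving one boundary, can be modelled with a small, hypothetical clause object (the clamping rules are assumptions about reasonable UI behaviour, not taken from the patent):

```python
from dataclasses import dataclass

@dataclass
class Clause:
    """One clause from the parsed json list (illustrative model)."""
    speaker: str
    start: float
    end: float

    def drag_whole(self, delta, total):
        """Dragging the whole marker layer shifts both boundaries by the
        same amount, clamped to [0, total audio duration]."""
        delta = max(-self.start, min(delta, total - self.end))
        self.start += delta
        self.end += delta

    def drag_edge(self, which, new_time):
        """Dragging one edge moves a single boundary; the clause may
        never become empty or inverted."""
        if which == "start":
            self.start = min(new_time, self.end)
        else:
            self.end = max(new_time, self.start)
```

After either operation the system would persist the new times automatically, matching the auto-save behaviour described above.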
In step S306, the voices separated by the system exist only as a list within the system, so an export function must be provided for external systems, serving as the output of the whole visual voice separation process. Exported voices are merged in time order: clauses sharing the same speaker name in the list are merged into one audio file, and if the selected clauses contain multiple speakers, a compressed file is exported containing multiple audio files, one per speaker.
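A minimal sketch of this per-speaker export step, using Python's standard wave module on raw 16-bit mono PCM. The archive wrapping and real file I/O are omitted, and names are illustrative; clause times are in seconds against the single source recording.

```python
import io
import wave

def export_by_speaker(pcm_bytes, sr, clauses):
    """Concatenate each speaker's clauses (in time order) into one WAV,
    returning {speaker_name: wav_file_bytes}."""
    width = 2  # 16-bit mono PCM
    merged = {}
    for c in sorted(clauses, key=lambda c: c["start"]):
        a = int(c["start"] * sr) * width
        b = int(c["end"] * sr) * width
        merged.setdefault(c["speaker"], bytearray()).extend(pcm_bytes[a:b])
    files = {}
    for speaker, data in merged.items():
        buf = io.BytesIO()
        with wave.open(buf, "wb") as w:
            w.setnchannels(1)
            w.setsampwidth(width)
            w.setframerate(sr)
            w.writeframes(bytes(data))
        files[speaker] = buf.getvalue()
    return files
```

Each returned value is a complete WAV file; a real exporter would then zip them when more than one speaker is selected.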
As shown in fig. 2, this embodiment further provides a visual human voice separation device comprising a processor 201, a memory 205, a user interface 203, a network interface 204 and a data bus 202, where the data bus 202 connects the processor 201, memory 205, user interface 203 and network interface 204 into a complete set of usable hardware resources. The processor 201 is the computer's CPU, of any current brand, foreign or domestic. The user interface 203 provides interfaces for interaction with the user, including mouse, keyboard and display interfaces. The memory 205 stores an operating system in which the visual human voice separation system, a user interface module and a network communication module are installed.
According to the invention, on the basis of high separation accuracy of the artificial intelligence human voice separation algorithm, the manual adjustment of the interface visualization is carried out, and the expected accurate separation effect can be achieved.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Finally, it is to be noted that: the above description is only a preferred embodiment of the present invention, and is only used to illustrate the technical solutions of the present invention, and not to limit the protection scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.
Claims (10)
1. A visual human voice separation system, the system comprising:
the audio/video format conversion module is used for converting the file uploaded to the system into an audio format matching the human voice separation algorithm module;
the human voice separation algorithm module is used for logically segmenting the audio file to be processed and performing sentence segmentation in chronological order, forming a json file in which each sentence includes the speaker name, start time, and end time;
the separation result display module is used for displaying the result segmented by the human voice separation algorithm module on an interface, where the upper half of the interface displays the waveform and the lower half displays the list information of the clauses;
the visual voice separation adjusting module is used for playing each clause independently through the play/pause control in the list, observing the presence and magnitude of voice energy on the oscillogram, and repeatedly fine-tuning the start time and end time of each clause so as to adjust its time boundary;
and the voice separation task management module is used for managing the uploaded human voice separation tasks, where the audio/video uploaded by the user each time is managed as a single task.
2. The visual human voice separation system according to claim 1, wherein the human voice separation algorithm module adopts an artificial intelligence processing mode to realize automatic human voice separation through voice segmentation clustering based on mixed features of mel frequency cepstrum coefficients and gamma frequency cepstrum coefficients.
3. A visual human voice separation method is characterized by comprising the following steps:
opening a visual human voice separation system, and importing the audio/video files to be separated into the system;
converting the audio/video into an audio format matched with a human voice separation algorithm;
carrying out logical segmentation on the audio file to be processed and performing sentence segmentation in chronological order, finally forming a json file in which each sentence includes the speaker name, start time, and end time;
displaying the separated result on an interface, with the audio file displayed in waveform form in the upper half and the parsed json file displayed in list form in the lower half;
playing and adjusting each sentence on the result display interface to realize accurate human voice separation;
and selecting and exporting the separated human voice clauses according to requirements.
4. The visual human voice separation method according to claim 3, characterized in that the audio/video file to be separated is stored in a storage medium such as a USB flash drive, a removable hard disk, an optical drive, or a computer hard disk.
5. A visual human voice separation method according to claim 3, wherein the converting the audio/video into an audio format matching the human voice separation algorithm comprises:
the audio format that the human voice separation algorithm can recognize is fixed at training time, so to adapt to the human voice separation algorithm, the imported audio/video format must be converted into the audio format used during training.
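A common way to perform this conversion is the ffmpeg command-line tool. The sketch below only builds the command (it does not execute it), and the training format of 16 kHz mono 16-bit WAV is an assumption for illustration; the patent does not specify the actual format.

```python
def ffmpeg_convert_cmd(src: str, dst: str,
                       rate: int = 16000, channels: int = 1) -> list:
    """Build an ffmpeg command converting any audio/video input into
    the (assumed) training format: mono 16-bit PCM WAV at `rate` Hz."""
    return [
        "ffmpeg", "-y",          # overwrite the output without asking
        "-i", src,               # any container/codec ffmpeg can read
        "-vn",                   # drop the video stream, keep audio only
        "-ac", str(channels),    # downmix to mono
        "-ar", str(rate),        # resample to the training sample rate
        "-sample_fmt", "s16",    # 16-bit samples
        dst,
    ]
```

The resulting list can be passed to `subprocess.run` when ffmpeg is installed on the system.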
6. The visual human voice separation method according to claim 3, wherein the logical segmentation of the audio file to be processed, the sentence segmentation in chronological order, and the final formation of a json file in which each sentence contains the speaker name, start time, and end time, comprises:
and calling a voice separation algorithm in the system to logically divide the converted audio file, and marking the audio file after sentence division, wherein each sentence comprises a speaker name, a start time, an end time and a single sentence duration, the sentences are stored in a text form, the sentences are arranged according to a time sequence, and all the sentences are finally combined to form a json file.
7. The visual human voice separation method according to claim 3, wherein displaying the separated result on the interface, with the audio file displayed in waveform form in the upper half and the parsed json file displayed in list form in the lower half, comprises:
each clause in the json file corresponds to a start time and an end time in the audio file, and the interface displays the audio waveform together with the correspondence between each clause in the json file and the waveform; the display interface is divided into an upper part and a lower part, where the upper part displays the audio waveform and the lower part displays the parsed json file arranged in chronological order in list form.
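Mapping a clause's time span onto the displayed waveform reduces to a linear time-to-pixel conversion; a minimal sketch, with `clause_to_pixels` being a hypothetical helper name:

```python
def clause_to_pixels(start_s: float, end_s: float,
                     total_s: float, canvas_width: int):
    """Map a clause's [start_s, end_s] span onto horizontal pixel
    coordinates of a waveform canvas of `canvas_width` pixels, so the
    identification layer can be drawn over the matching region."""
    x0 = int(start_s / total_s * canvas_width)
    x1 = int(end_s / total_s * canvas_width)
    # Clamp to the canvas in case of rounding or slightly out-of-range times.
    return max(0, x0), min(canvas_width, x1)
```

The inverse mapping (pixel back to seconds) is used when the user drags a boundary on the waveform.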
8. The visual human voice separation method according to claim 3, wherein the playing and adjusting of each sentence on the result display interface to realize accurate human voice separation comprises:
a semi-transparent identification layer is overlaid on the oscillogram according to the start time and end time of each sentence parsed from the json file, so that the identification layer marks the time boundary of each sentence on the waveform interface; by playing each clause independently through the play/pause control in the list and observing the presence and magnitude of voice energy on the oscillogram, the start time and end time of each sentence are adjusted by dragging the whole identification layer of the sentence left and right, or by dragging either boundary independently, and the system saves automatically after each time adjustment;
each sentence in the list corresponds to an identification layer on the oscillogram, and after a sentence in the list is selected, its identification layer on the oscillogram is selected as well; the voice playback of each row is controlled through play/pause in the list, and the content of each sentence is adjusted through delete, add, and modify operations.
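The drag operations of claim 8 need clamping so a clause cannot be inverted or dragged outside the audio. A minimal sketch of those rules (the function names, the minimum clause length, and the no-overlap handling being assumptions for illustration):

```python
def adjust_boundary(clause: dict, new_start=None, new_end=None,
                    total_s: float = float("inf"),
                    min_len: float = 0.05) -> dict:
    """Apply a drag of one boundary, clamped so the clause stays inside
    the audio and keeps at least `min_len` seconds of duration."""
    start = clause["start"] if new_start is None else new_start
    end = clause["end"] if new_end is None else new_end
    start = max(0.0, start)
    end = min(total_s, end)
    if end - start < min_len:      # refuse a drag that would invert
        return clause              # or collapse the clause
    return {**clause, "start": start, "end": end}

def drag_whole(clause: dict, delta_s: float, total_s: float) -> dict:
    """Drag the whole identification layer left/right by delta_s,
    keeping the clause length constant and inside [0, total_s]."""
    length = clause["end"] - clause["start"]
    start = min(max(0.0, clause["start"] + delta_s), total_s - length)
    return {**clause, "start": start, "end": start + length}
```

After each accepted adjustment the system would persist the updated clause back into the json list, matching the automatic save described above.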
9. The visual human voice separation method according to claim 3, wherein the selecting and exporting of the separated human voice clauses according to requirements comprises:
if the selected clauses contain multiple speakers, a compressed file containing multiple audio files is exported, where one speaker corresponds to one audio file.
10. A visual human voice separation device is characterized by comprising a processor, a memory, a user interface, a network interface and a data bus, wherein the data bus connects the processor, the memory, the user interface and the network interface together, the memory stores an operating system, and the operating system is internally provided with a visual human voice separation system, a user interface module and a network communication module.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111437237.2A CN114464198B (en) | 2021-11-30 | 2021-11-30 | Visual human voice separation system, method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114464198A true CN114464198A (en) | 2022-05-10 |
CN114464198B CN114464198B (en) | 2023-06-06 |
Family
ID=81406299
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111437237.2A Active CN114464198B (en) | 2021-11-30 | 2021-11-30 | Visual human voice separation system, method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114464198B (en) |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2004173058A (en) * | 2002-11-21 | 2004-06-17 | Nippon Telegr & Teleph Corp <Ntt> | Method and device for visualizing conference information, and program and recording medium with the program recorded |
CN105161094A (en) * | 2015-06-26 | 2015-12-16 | 徐信 | System and method for manually adjusting cutting point in audio cutting of voice |
CN105868400A (en) * | 2016-04-19 | 2016-08-17 | 乐视控股(北京)有限公司 | Recorded sound information processing method and recorded sound information processing device |
CN106024009A (en) * | 2016-04-29 | 2016-10-12 | 北京小米移动软件有限公司 | Audio processing method and device |
CN106448683A (en) * | 2016-09-30 | 2017-02-22 | 珠海市魅族科技有限公司 | Method and device for viewing recording in multimedia files |
CN106548793A (en) * | 2015-09-16 | 2017-03-29 | 中兴通讯股份有限公司 | Storage and the method and apparatus for playing audio file |
WO2019183904A1 (en) * | 2018-03-29 | 2019-10-03 | 华为技术有限公司 | Method for automatically identifying different human voices in audio |
CN110709924A (en) * | 2017-11-22 | 2020-01-17 | 谷歌有限责任公司 | Audio-visual speech separation |
CN112487238A (en) * | 2020-10-27 | 2021-03-12 | 百果园技术(新加坡)有限公司 | Audio processing method, device, terminal and medium |
CN113707173A (en) * | 2021-08-30 | 2021-11-26 | 平安科技(深圳)有限公司 | Voice separation method, device and equipment based on audio segmentation and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||