CN112420069A - Voice processing method, device, machine readable medium and equipment - Google Patents


Info

Publication number: CN112420069A
Application number: CN202011292034.4A
Authority: CN (China)
Prior art keywords: voice, clustering, sub-segments, segment
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Other languages: Chinese (zh)
Inventor: 晏超
Current Assignee: Beijing Yuncong Technology Co ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Original Assignee: Beijing Yuncong Technology Co ltd
Application filed by Beijing Yuncong Technology Co ltd

Classifications

    • G10L 21/0272: Voice signal separating (under G10L 21/02, speech enhancement, e.g. noise reduction or echo cancellation)
    • G06F 18/23213: Clustering techniques; non-hierarchical techniques with a fixed number of clusters, e.g. K-means clustering
    • G06N 3/02, G06N 3/08: Neural networks; learning methods
    • G10L 17/02: Speaker identification or verification; preprocessing operations, pattern representation or modelling, feature selection or extraction
    • G10L 17/06: Speaker identification or verification; decision making techniques, pattern matching strategies
    • G10L 17/18: Speaker identification or verification; artificial neural networks, connectionist approaches


Abstract

The invention discloses a voice processing method comprising the following steps: acquiring a speech segment containing human voice; splitting the speech segment into a plurality of speech sub-segments; extracting voiceprint features from the speech sub-segments to obtain a voiceprint feature for each sub-segment; performing cascade clustering on the voiceprint features of all the sub-segments to obtain a speaker label for each sub-segment; and merging sub-segments with the same speaker label to complete the voice segmentation. Through the cascade-clustering technique, the invention corrects and smooths audio containing many speakers or highly unbalanced speaker content, reduces misassignment of speech to the wrong speaker, and allows the speaker-separation system to meet the requirements of practical application scenarios.

Description

Voice processing method, device, machine readable medium and equipment
Technical Field
The invention relates to the field of voice processing, and in particular to a voice processing method, a voice processing apparatus, a machine-readable medium and a device.
Background
In application scenarios such as bank quality inspection, intelligent customer service and conference recording, it is often necessary to segment and distinguish the different speakers in a single-channel audio recording, and to mark the starting point of each speaker's content.
However, in actual use the separation effect of existing speaker segmentation (speaker separation) algorithms is not ideal. On the one hand, the signal-to-noise ratio of audio recorded in real scenes is difficult to control, and in strongly noisy scenes the noise has a large negative effect on the separation result. On the other hand, when several speakers talk simultaneously, or the amounts of content from two speakers are highly unbalanced, the algorithms frequently assign speech content to the wrong speaker and cannot meet the requirements of practical application scenarios.
Disclosure of Invention
In view of the above-mentioned shortcomings of the prior art, it is an object of the present invention to provide a method, an apparatus, a machine-readable medium and a device for processing speech, which are used to solve the problems of the prior art.
To achieve the above and other related objects, the present invention provides a speech processing method, including:
acquiring a speech segment containing human voice;
splitting the speech segment into a plurality of speech sub-segments;
extracting voiceprint features from the speech sub-segments to obtain a voiceprint feature for each sub-segment;
performing cascade clustering on the voiceprint features of all the speech sub-segments to obtain a speaker label for each sub-segment;
and merging sub-segments with the same speaker label to complete the voice segmentation.
Optionally, the cascade clustering of the voiceprint features of all the speech sub-segments includes:
clustering the voiceprint features of all the speech sub-segments a first time with the AHC (agglomerative hierarchical clustering) algorithm to obtain a first clustering result;
and clustering a second time with K-means, using the first clustering result as the initial value of the K-means clustering, to obtain a second clustering result.
Optionally, an Early Stop mechanism lets the K-means clustering satisfy the iteration stop condition.
Optionally, the iteration stop condition is: the rate of change of the cluster centers of part of the clusters does not exceed a set range.
Optionally, acquiring the speech segment containing human voice includes:
extracting low-level features of the speech audio;
and obtaining the speech segment containing human voice based on a first neural network and the low-level features of the speech audio.
Optionally, the first neural network comprises a deep time-delay neural network.
Optionally, the speech segment containing human voice is split with a sliding window to obtain the plurality of speech sub-segments.
Optionally, a deep neural network is employed to extract the voiceprint features of the speech sub-segments.
To achieve the above and other related objects, the present invention provides a speech processing apparatus, comprising:
a voice acquiring module, configured to acquire speech segments containing human voice;
a voice sub-frame framing module, configured to split the speech segment containing human voice into a plurality of speech sub-segments;
a voiceprint information extraction module, configured to extract voiceprint features from the speech sub-segments to obtain a voiceprint feature for each sub-segment;
a voiceprint feature clustering module, configured to perform cascade clustering on the voiceprint features of all the speech sub-segments to obtain a speaker label for each sub-segment;
and a voice post-processing module, configured to merge sub-segments with the same speaker label to complete the voice segmentation.
Optionally, the voiceprint feature clustering module includes:
a first clustering submodule, configured to cluster the voiceprint features of all the speech sub-segments a first time with the AHC hierarchical clustering algorithm to obtain a first clustering result;
and a second clustering submodule, configured to cluster a second time with K-means, using the first clustering result as the initial value of the K-means clustering, to obtain a second clustering result; an Early Stop mechanism lets the K-means clustering satisfy the iteration stop condition.
Optionally, the iteration stop condition is: the rate of change of the cluster centers of part of the clusters does not exceed a set range.
Optionally, the human voice acquiring module includes:
a feature extraction submodule, configured to extract low-level features of the speech audio;
and a speech segment acquisition submodule, configured to obtain the speech segment containing human voice based on the first neural network and the low-level features of the speech audio.
Optionally, the voice sub-frame framing module splits the speech segment containing human voice with a sliding window to obtain the plurality of speech sub-segments.
To achieve the above and other related objects, the present invention also provides a voice processing apparatus comprising:
one or more processors; and
one or more machine-readable media having instructions stored thereon that, when executed by the one or more processors, cause the apparatus to perform one or more of the methods described previously.
To achieve the above objects and other related objects, the present invention also provides one or more machine-readable media having instructions stored thereon, which when executed by one or more processors, cause an apparatus to perform one or more of the methods described above.
As described above, the speaker separation method, apparatus, machine-readable medium and device based on speaker identity and cascade clustering provided by the present invention have the following advantages:
The invention discloses a voice processing method comprising the following steps: acquiring a speech segment containing human voice; splitting the speech segment into a plurality of speech sub-segments; extracting a voiceprint feature for each sub-segment; performing cascade clustering on the voiceprint features of all the sub-segments to obtain a speaker label for each sub-segment; and merging sub-segments with the same speaker label to complete the voice segmentation. The invention performs speaker separation by combining deep-neural-network voice detection and voiceprint extraction with cascade clustering. The deep-neural-network voice detection and voiceprint extraction greatly reduce the negative influence of strong noise on the separation effect and improve the algorithm's handling of low-signal-to-noise-ratio audio, while the cascade clustering corrects and smooths audio with many speakers or unbalanced speaker content, reduces misassignment of speech to the wrong speaker, and lets the speaker-separation system meet the requirements of practical application scenarios.
Drawings
FIG. 1 is a flow chart of a speech processing method according to an embodiment of the present invention;
FIG. 2 is a flowchart of a method for obtaining a voice segment with a human voice according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating a method for performing cascade clustering on voiceprint features corresponding to all speech sub-segments according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a speech processing apparatus according to an embodiment of the present invention;
fig. 5 is a schematic diagram of a hardware structure of a terminal device according to an embodiment of the present invention;
fig. 6 is a schematic diagram of a hardware structure of a terminal device according to an embodiment of the present invention.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.
It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention, and the components related to the present invention are only shown in the drawings rather than drawn according to the number, shape and size of the components in actual implementation, and the type, quantity and proportion of the components in actual implementation may be changed freely, and the layout of the components may be more complicated.
As shown in fig. 1, a speech processing method includes:
S11, acquiring a speech segment containing human voice;
S12, splitting the speech segment into a plurality of speech sub-segments;
S13, extracting voiceprint features from the speech sub-segments to obtain a voiceprint feature for each sub-segment;
S14, performing cascade clustering on the voiceprint features of all the speech sub-segments to obtain a speaker label for each sub-segment;
and S15, merging sub-segments with the same speaker label to complete the voice segmentation.
The invention performs speaker separation by combining deep-neural-network voice detection and voiceprint extraction with cascade clustering. This greatly reduces the negative influence of strong noise on the separation effect, improves the handling of low-signal-to-noise-ratio audio, corrects and smooths audio with many speakers or unbalanced speaker content, reduces misassignment of speech content, and lets the speaker-separation system meet the requirements of practical application scenarios.
In one embodiment, as shown in fig. 2, acquiring the speech segment containing human voice includes:
S21, extracting low-level features of the speech audio;
S22, obtaining a speech segment containing human voice based on the first neural network and the low-level features of the speech audio.
To extract the low-level features, the raw speech is first transformed from the time domain to the frequency domain, yielding the low-level features of the speech audio. These features are then fed into a pre-trained first neural network, which segments the original audio into human voice and non-human voice, and finally the segments containing human voice are extracted.
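The time-to-frequency transformation described above can be illustrated with a simple framed log-magnitude spectrum. This is a minimal sketch, not the patent's actual front end; the function name and frame/hop/FFT sizes are illustrative assumptions:

```python
import numpy as np

def stft_features(signal, frame_len=400, hop=160, n_fft=512):
    """Frame a time-domain signal and map each frame to a log-magnitude
    spectrum: a basic time-to-frequency transform producing low-level
    features (hypothetical parameter choices for 16 kHz audio)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    spectrum = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))
    return np.log(spectrum + 1e-8)  # log compression; 1e-8 avoids log(0)

# one second of 16 kHz audio -> a (98, 257) frame-by-frequency matrix
feats = stft_features(np.random.randn(16000))
```

With 25 ms frames and a 10 ms hop, each row of the output corresponds to one analysis frame, ready to be fed to the voice-detection network.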
The first neural network may be a deep time-delay neural network. During training, the class proportion weights can be adjusted (for example, the human-voice weight can be increased when more human voice is to be detected, and the non-human-voice weight when more non-human voice is to be detected). This greatly reduces the negative influence of strongly noisy environments on the voice-segmentation result, gives the voice segmentation stronger noise resistance and adaptivity, and allows it to meet the needs of different usage scenarios.
In an embodiment, the speech segment containing human voice is split with a sliding window to obtain a plurality of speech sub-segments. The window size and step length can be set according to actual requirements, and the sub-segments produced by two adjacent window positions may overlap.
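A minimal sketch of such overlapping sliding-window segmentation; the 1.5 s window and 0.75 s step are illustrative values, not prescribed by the patent:

```python
def sliding_windows(start, end, win=1.5, step=0.75):
    """Cut the voiced region [start, end) in seconds into overlapping
    sub-segments; adjacent windows overlap by (win - step) seconds."""
    segments = []
    t = start
    while t + win <= end:
        segments.append((t, t + win))
        t += step
    if not segments or segments[-1][1] < end:
        # keep the trailing remainder as one final window
        segments.append((max(start, end - win), end))
    return segments

# a 4-second voiced segment yields five overlapping sub-segments
subs = sliding_windows(0.0, 4.0)
```

Each sub-segment is then long enough to produce a stable voiceprint, while the overlap smooths the transitions between speakers.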
In one embodiment, a deep neural network is employed to extract the voiceprint features of the speech sub-segments. When extracting voiceprint features, the deep neural network can accumulate long-term information at both the frame level and the utterance level, yielding more robust voiceprint information; compared with traditional x-vector, d-vector and i-vector voiceprint features, it can handle long and short audio simultaneously.
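The idea of accumulating frame-level information into an utterance-level representation can be illustrated with statistics pooling, the aggregation step used in x-vector-style embedding networks. This is a sketch only; the patent does not specify the network internals:

```python
import numpy as np

def stats_pooling(frame_feats):
    """Aggregate a variable number of frame-level feature vectors into
    one fixed-length utterance-level vector by concatenating the
    per-dimension mean and standard deviation."""
    return np.concatenate([frame_feats.mean(axis=0),
                           frame_feats.std(axis=0)])

# 200 frames of 64-dim features -> one 128-dim utterance-level vector,
# regardless of how many frames the sub-segment contains
embedding = stats_pooling(np.random.randn(200, 64))
```

Because the pooled vector has a fixed length independent of the frame count, the same extractor serves both long and short sub-segments.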
In an embodiment, as shown in fig. 3, the cascade clustering of the voiceprint features of all the speech sub-segments includes:
S31, clustering the voiceprint features of all the speech sub-segments a first time with the AHC hierarchical clustering algorithm to obtain a first clustering result, which comprises n categories.
S32, clustering a second time with K-means, using the first clustering result as the initial value of the K-means clustering, to obtain a second clustering result.
K-means clustering normally starts by fixing the number of clusters K and then choosing K initial cluster centers at random. In this embodiment, K is instead the number of categories in the first clustering result, and each initial cluster center is taken from the first clustering result: one K-means initial center is selected from each category of the first result.
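A sketch of this cascade with scikit-learn: AHC with a distance threshold decides the number of speakers and supplies the initial centers, and K-means refines the assignment. The threshold value, embedding dimensions and synthetic data are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

def cascade_cluster(embeddings, distance_threshold=5.0):
    """First pass: AHC stops merging at the distance threshold, which
    determines the number of clusters n. Second pass: K-means is
    initialised with the AHC cluster centroids instead of random
    centers and refines the labels."""
    first = AgglomerativeClustering(
        n_clusters=None,
        distance_threshold=distance_threshold).fit_predict(embeddings)
    n = first.max() + 1
    centers = np.stack([embeddings[first == k].mean(axis=0)
                        for k in range(n)])
    return KMeans(n_clusters=n, init=centers, n_init=1).fit_predict(embeddings)

# two well-separated synthetic "speakers", 20 sub-segments each
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(0, 0.1, (20, 8)), rng.normal(5, 0.1, (20, 8))])
labels = cascade_cluster(x)
```

Initialising K-means from the AHC centroids avoids both the need to know the speaker count in advance and the instability of random initial centers.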
In one embodiment, an Early Stop mechanism lets the K-means clustering satisfy the iteration stop condition: the rate of change of the cluster centers of part of the clusters does not exceed a set range.
The K-means algorithm iterates until a stop condition is reached. The usual stop conditions are convergence, a fixed iteration threshold, or stability of the cluster centers of all categories. In the present application, the iteration stops once the center change rate of only part of the clusters stays within the set range. For example, with 10 clusters, the Early Stop mechanism can halt the iteration as soon as the cluster centers of 3 of the 10 clusters stop moving beyond the set range, completing the second clustering.
The Early Stop mechanism accelerates the convergence of the K-means algorithm, so that the processing speed and performance of the speech processing method can meet the requirements of practical application scenarios.
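A self-contained sketch of K-means with this partial-centroid Early Stop; the stable fraction, tolerance and synthetic data are illustrative assumptions, not values from the patent:

```python
import numpy as np

def kmeans_early_stop(x, centers, stable_fraction=0.3, tol=1e-4,
                      max_iter=100):
    """Standard K-means iteration, but stop as soon as a fraction of
    the centroids (not all of them) moves less than `tol` between
    iterations: the Early Stop relaxation of the usual test."""
    centers = centers.astype(float).copy()
    labels = np.zeros(len(x), dtype=int)
    for _ in range(max_iter):
        # assignment step: nearest centroid for every point
        dists = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: recompute centroids (keep empty clusters fixed)
        new = np.stack([x[labels == k].mean(axis=0) if np.any(labels == k)
                        else centers[k] for k in range(len(centers))])
        shift = np.linalg.norm(new - centers, axis=1)
        centers = new
        if np.mean(shift < tol) >= stable_fraction:
            break  # enough centroids are stable: early stop
    return labels, centers

rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(0, 0.2, (20, 2)), rng.normal(4, 0.2, (20, 2))])
labels, centers = kmeans_early_stop(pts, np.array([[0.5, 0.5], [3.5, 3.5]]))
```

Relaxing the stop test from "all centroids stable" to "a fraction of centroids stable" trades a little refinement for fewer iterations, which is where the speed-up comes from.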
In one embodiment, K-means++, spectral clustering, or other refinement methods may also be used to perform the second clustering on the first clustering result.
The invention uses cascade clustering to correct and smooth the timeline of audio containing multiple speakers or unbalanced speaker content, thereby reducing the speaker assignment error rate and enabling the speaker separation to meet the usage requirements of application scenarios.
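The final merging of sub-segments that received the same speaker label into continuous speaker turns can be sketched as follows; the (start, end, label) tuple layout is an assumption for illustration:

```python
def merge_segments(segments):
    """Merge consecutive, overlapping sub-segments that carry the same
    speaker label into one continuous speaker turn.
    `segments` is a time-ordered list of (start, end, label) tuples."""
    turns = []
    for start, end, label in segments:
        if turns and turns[-1][2] == label and start <= turns[-1][1]:
            # same speaker and contiguous: extend the current turn
            turns[-1] = (turns[-1][0], max(turns[-1][1], end), label)
        else:
            turns.append((start, end, label))
    return turns

# four overlapping windows labelled by speaker collapse to two turns
turns = merge_segments([(0.0, 1.5, 0), (0.75, 2.25, 0),
                        (1.5, 3.0, 1), (2.25, 3.75, 1)])
```

The merged turns give the per-speaker start and end points that the application scenarios (quality inspection, conference recording) actually consume.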
As shown in fig. 4, a speech processing apparatus includes:
a voice acquiring module 41, configured to acquire speech segments containing human voice;
a voice sub-frame framing module 42, configured to split the speech segment containing human voice into a plurality of speech sub-segments;
a voiceprint information extraction module 43, configured to extract voiceprint features from the speech sub-segments to obtain a voiceprint feature for each sub-segment;
a voiceprint feature clustering module 44, configured to perform cascade clustering on the voiceprint features of all the speech sub-segments to obtain a speaker label for each sub-segment;
and a voice post-processing module 45, configured to merge sub-segments with the same speaker label to complete the voice segmentation.
In one embodiment, the voiceprint feature clustering module comprises:
a first clustering submodule, configured to cluster the voiceprint features of all the speech sub-segments a first time with the AHC hierarchical clustering algorithm to obtain a first clustering result;
and a second clustering submodule, configured to cluster a second time with K-means, using the first clustering result as the initial value of the K-means clustering, to obtain a second clustering result; an Early Stop mechanism lets the K-means clustering satisfy the iteration stop condition.
In one embodiment, the iteration stop condition is: the rate of change of the cluster centers of part of the clusters does not exceed a set range.
In one embodiment, the human voice acquiring module comprises:
a feature extraction submodule, configured to extract low-level features of the speech audio;
and a speech segment acquisition submodule, configured to obtain the speech segment containing human voice based on the first neural network and the low-level features of the speech audio.
In an embodiment, the voice sub-frame framing module splits the speech segment containing human voice with a sliding window to obtain a plurality of speech sub-segments.
In this embodiment, the apparatus embodiment corresponds to the method embodiment; for specific functions and technical effects, refer to the method embodiment, which is not repeated here.
An embodiment of the present application further provides an apparatus, which may include: one or more processors; and one or more machine readable media having instructions stored thereon that, when executed by the one or more processors, cause the apparatus to perform the method of fig. 1. In practical applications, the device may be used as a terminal device, and may also be used as a server, where examples of the terminal device may include: the mobile terminal includes a smart phone, a tablet computer, an electronic book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop, a vehicle-mounted computer, a desktop computer, a set-top box, an intelligent television, a wearable device, and the like.
The present application further provides a non-transitory readable storage medium, where one or more modules (programs) are stored in the storage medium, and when the one or more modules are applied to a device, the device may be caused to execute instructions (instructions) of steps included in the method in fig. 1 according to the present application.
Fig. 5 is a schematic diagram of a hardware structure of a terminal device according to an embodiment of the present application. As shown, the terminal device may include: an input device 1100, a first processor 1101, an output device 1102, a first memory 1103, and at least one communication bus 1104. The communication bus 1104 is used to implement communication connections between the elements. The first memory 1103 may include a high-speed RAM memory, and may also include a non-volatile storage NVM, such as at least one disk memory, and the first memory 1103 may store various programs for performing various processing functions and implementing the method steps of the present embodiment.
Alternatively, the first processor 1101 may be, for example, a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a controller, a microcontroller, a microprocessor, or other electronic components, and the first processor 1101 is coupled to the input device 1100 and the output device 1102 through a wired or wireless connection.
Optionally, the input device 1100 may include a variety of input devices, such as at least one of a user-oriented user interface, a device-oriented device interface, a software programmable interface, a camera, and a sensor. Optionally, the device interface facing the device may be a wired interface for data transmission between devices, or may be a hardware plug-in interface (e.g., a USB interface, a serial port, etc.) for data transmission between devices; optionally, the user-facing user interface may be, for example, a user-facing control key, a voice input device for receiving voice input, and a touch sensing device (e.g., a touch screen with a touch sensing function, a touch pad, etc.) for receiving user touch input; optionally, the programmable interface of the software may be, for example, an entry for a user to edit or modify a program, such as an input pin interface or an input interface of a chip; the output devices 1102 may include output devices such as a display, audio, and the like.
In this embodiment, the processor of the terminal device includes a module for executing functions of each module in each device, and specific functions and technical effects may refer to the foregoing embodiments, which are not described herein again.
Fig. 6 is a schematic hardware structure diagram of a terminal device according to an embodiment of the present application. FIG. 6 is a specific embodiment of the implementation of FIG. 5. As shown, the terminal device of the present embodiment may include a second processor 1201 and a second memory 1202.
The second processor 1201 executes the computer program code stored in the second memory 1202 to implement the method described in fig. 1 in the above embodiment.
The second memory 1202 is configured to store various types of data to support operations at the terminal device. Examples of such data include instructions for any application or method operating on the terminal device, such as messages, pictures, videos, and so forth. The second memory 1202 may include a Random Access Memory (RAM) and may also include a non-volatile memory (non-volatile memory), such as at least one disk memory.
Optionally, a second processor 1201 is provided in the processing assembly 1200. The terminal device may further include: communication component 1203, power component 1204, multimedia component 1205, speech component 1206, input/output interfaces 1207, and/or sensor component 1208. The specific components included in the terminal device are set according to actual requirements, which is not limited in this embodiment.
The processing component 1200 generally controls the overall operation of the terminal device. The processing assembly 1200 may include one or more second processors 1201 to execute instructions to perform all or part of the steps of the data processing method described above. Further, the processing component 1200 can include one or more modules that facilitate interaction between the processing component 1200 and other components. For example, the processing component 1200 can include a multimedia module to facilitate interaction between the multimedia component 1205 and the processing component 1200.
The power supply component 1204 provides power to the various components of the terminal device. The power components 1204 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the terminal device.
The multimedia components 1205 include a display screen that provides an output interface between the terminal device and the user. In some embodiments, the display screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the display screen includes a touch panel, the display screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation.
The voice component 1206 is configured to output and/or input voice signals. For example, the voice component 1206 includes a Microphone (MIC) configured to receive external voice signals when the terminal device is in an operational mode, such as a voice recognition mode. The received speech signal may further be stored in the second memory 1202 or transmitted via the communication component 1203. In some embodiments, the speech component 1206 further comprises a speaker for outputting speech signals.
The input/output interface 1207 provides an interface between the processing component 1200 and peripheral interface modules, which may be click wheels, buttons, etc. These buttons may include, but are not limited to: a volume button, a start button, and a lock button.
The sensor component 1208 includes one or more sensors for providing status assessments of various aspects of the terminal device. For example, the sensor component 1208 may detect the open/closed state of the terminal device, the relative positioning of components, and the presence or absence of user contact with the terminal device. The sensor component 1208 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact, including detecting the distance between the user and the terminal device. In some embodiments, the sensor component 1208 may also include a camera or the like.
The communication component 1203 is configured to facilitate communications between the terminal device and other devices in a wired or wireless manner. The terminal device may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In one embodiment, the terminal device may include a SIM card slot therein for inserting a SIM card therein, so that the terminal device may log onto a GPRS network to establish communication with the server via the internet.
As can be seen from the above, the communication component 1203, the voice component 1206, the input/output interface 1207 and the sensor component 1208 referred to in the embodiment of fig. 6 can be implemented as the input device in the embodiment of fig. 5.
The foregoing embodiments merely illustrate the principles and utilities of the present invention and are not intended to limit it. Any person skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the present invention. Accordingly, all equivalent modifications or changes made by those skilled in the art without departing from the spirit and technical ideas disclosed herein are intended to be covered by the claims of the present invention.

Claims (15)

1. A method of speech processing, comprising:
acquiring a voice segment containing human voice;
segmenting the voice segment containing human voice to obtain a plurality of voice sub-segments;
performing voiceprint feature extraction on the plurality of voice sub-segments to obtain a voiceprint feature corresponding to each voice sub-segment;
performing cascade clustering on the voiceprint features corresponding to all the voice sub-segments to obtain a speaker label corresponding to each voice sub-segment;
and merging sub-segments with the same speaker label to complete the voice segmentation.
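The final merging step of claim 1 can be sketched as follows. This is a minimal illustration rather than the patented implementation; the representation of sub-segments as integer speaker tags paired with (start, end) bounds is an assumption:

```python
def merge_same_speaker(labels, bounds):
    """Fuse consecutive sub-segments that carry the same speaker tag.

    labels: one integer speaker tag per sub-segment (hypothetical encoding)
    bounds: matching (start, end) times for each sub-segment
    """
    merged = []
    for lab, (start, end) in zip(labels, bounds):
        if merged and merged[-1][0] == lab:
            # same speaker as the previous turn: extend that turn's end time
            prev_lab, prev_start, _ = merged.pop()
            merged.append((prev_lab, prev_start, end))
        else:
            # a new speaker turn begins
            merged.append((lab, start, end))
    return merged
```

For example, sub-segments tagged `[0, 0, 1]` over `[(0, 1), (1, 2), (2, 3)]` collapse into two speaker turns, `(0, 0, 2)` and `(1, 2, 3)`.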
2. The speech processing method according to claim 1, wherein performing cascade clustering on the voiceprint features corresponding to all the voice sub-segments comprises:
performing first clustering on the voiceprint features corresponding to all the voice sub-segments by using an agglomerative hierarchical clustering (AHC) algorithm to obtain a first clustering result;
and performing second clustering with the first clustering result as the initial value of Kmeans clustering to obtain a second clustering result.
3. The speech processing method according to claim 2, wherein the Kmeans clustering is made to satisfy an iteration stop condition by an Early Stop mechanism.
4. The speech processing method according to claim 3, wherein the iteration stop condition is: the rate of change of the cluster center points of some of the plurality of clusters does not exceed a set range.
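A toy numpy rendition of the cascade described in claims 2-4: average-linkage AHC seeds Kmeans, and Kmeans stops early once the centers of most clusters stop moving. The distance threshold, the `frac` majority ratio, and the linkage choice are illustrative assumptions, not values taken from the patent:

```python
import numpy as np

def ahc(X, thresh):
    """Naive average-linkage AHC: merge the closest pair of clusters until
    the closest pair is farther apart than `thresh`. Returns index lists."""
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > 1:
        best, pair = None, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = np.linalg.norm(X[clusters[a]].mean(0) - X[clusters[b]].mean(0))
                if best is None or d < best:
                    best, pair = d, (a, b)
        if best > thresh:
            break
        a, b = pair
        clusters[a] += clusters.pop(b)
    return clusters

def kmeans_early_stop(X, centers, eps=1e-3, frac=0.8, max_iter=100):
    """Kmeans seeded with the AHC result; iteration stops early once at
    least `frac` of the cluster centers move less than `eps`."""
    for _ in range(max_iter):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        new = np.array([X[labels == k].mean(0) if np.any(labels == k) else centers[k]
                        for k in range(len(centers))])
        moved = np.linalg.norm(new - centers, axis=1)
        centers = new
        if (moved < eps).mean() >= frac:
            break  # the Early Stop condition of claims 3-4
    return labels, centers
```

In this cascade the AHC pass decides how many speakers there are and where their centers roughly sit, so Kmeans only has to refine the assignment rather than guess an initialization.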
5. The speech processing method according to claim 1, wherein acquiring the voice segment containing human voice comprises:
extracting bottom-layer features of the voice audio;
and obtaining the voice segment containing human voice based on a first neural network and the bottom-layer features of the voice audio.
6. The speech processing method according to claim 5, wherein the first neural network comprises a deep time-delay neural network.
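If the first neural network of claim 6 is a time-delay (TDNN-style) network, one layer amounts to a dilated 1-D convolution over frame-level features. The sketch below is a rough numpy illustration under that assumption; the context offsets, the rotational edge handling via `np.roll`, and the ReLU choice are all hypothetical:

```python
import numpy as np

def tdnn_layer(frames, weights, context):
    """One time-delay layer: each output frame mixes input frames at fixed offsets.

    frames:  (T, D) matrix of per-frame bottom-layer features
    weights: (len(context), D, H) tensor, one (D, H) slice per offset
    context: frame offsets, e.g. (-2, 0, 2)
    """
    T = frames.shape[0]
    H = weights.shape[2]
    out = np.zeros((T, H))
    for offset, w in zip(context, weights):
        # crude edge handling: rotate the frame axis instead of padding
        out += np.roll(frames, -offset, axis=0) @ w
    return np.maximum(out, 0.0)  # ReLU nonlinearity
```

Stacking such layers with growing offsets widens the temporal receptive field, which is why this family of networks suits frame-level voice activity decisions.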
7. The speech processing method according to claim 1, wherein the voice segment containing human voice is segmented by using a sliding window to obtain the plurality of voice sub-segments.
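Claim 7's sliding-window segmentation can be sketched as follows; the window and hop lengths and the keep-short-segment behaviour are assumptions for illustration, not parameters from the patent:

```python
def sliding_window(samples, win_len, hop_len):
    """Cut a voiced segment into fixed-length sub-segments with a sliding window.

    Overlap between consecutive sub-segments is win_len - hop_len samples.
    """
    subs = []
    start = 0
    while start + win_len <= len(samples):
        subs.append(samples[start:start + win_len])
        start += hop_len
    if not subs and samples:
        # segment shorter than one window: keep it whole rather than drop it
        subs.append(samples)
    return subs
```

With a hop shorter than the window, each sub-segment shares audio with its neighbours, which gives the later voiceprint extraction more context per sub-segment.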
8. The speech processing method according to claim 1, wherein a deep neural network is used to extract the voiceprint features of the plurality of voice sub-segments.
9. A speech processing apparatus, comprising:
the human voice acquisition module is used for acquiring a voice segment containing human voice;
the voice sub-frame framing module is used for segmenting the voice segment with the voice to obtain a plurality of voice sub-segments;
the voiceprint information extraction module is used for carrying out voiceprint feature extraction on the plurality of voice sub-segments to obtain a voiceprint feature corresponding to each voice sub-segment;
the voiceprint feature clustering module is used for carrying out cascade clustering on the voiceprint features corresponding to all the voice sub-segments to obtain a speaker label corresponding to each voice sub-segment;
and the voice post-processing module is used for merging sub-segments with the same speaker label to complete the voice segmentation.
10. The speech processing apparatus of claim 9, wherein the voiceprint feature clustering module comprises:
the first clustering submodule is used for performing first clustering on the voiceprint features corresponding to all the voice sub-segments by using an agglomerative hierarchical clustering (AHC) algorithm to obtain a first clustering result;
the second clustering submodule is used for performing second clustering with the first clustering result as the initial value of Kmeans clustering to obtain a second clustering result, and for making the Kmeans clustering satisfy an iteration stop condition through an Early Stop mechanism.
11. The speech processing apparatus according to claim 10, wherein the iteration stop condition is: the rate of change of the cluster center points of some of the plurality of clusters does not exceed a set range.
12. The speech processing apparatus of claim 9, wherein the human voice acquisition module comprises:
the feature extraction submodule is used for extracting bottom-layer features of the voice audio;
and the voice segment acquisition submodule is used for obtaining a voice segment containing human voice based on the first neural network and the bottom-layer features of the voice audio.
13. The speech processing apparatus according to claim 9, wherein the voice sub-frame framing module segments the voice segment containing human voice by using a sliding window to obtain the plurality of voice sub-segments.
14. A speech processing device, comprising:
one or more processors; and
one or more machine-readable media having instructions stored thereon that, when executed by the one or more processors, cause the device to perform the method recited in one or more of claims 1-8.
15. One or more machine-readable media having instructions stored thereon that, when executed by one or more processors, cause a device to perform the method recited in one or more of claims 1-8.
CN202011292034.4A 2020-11-18 2020-11-18 Voice processing method, device, machine readable medium and equipment Pending CN112420069A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011292034.4A CN112420069A (en) 2020-11-18 2020-11-18 Voice processing method, device, machine readable medium and equipment


Publications (1)

Publication Number Publication Date
CN112420069A true CN112420069A (en) 2021-02-26

Family ID: 74832562

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011292034.4A Pending CN112420069A (en) 2020-11-18 2020-11-18 Voice processing method, device, machine readable medium and equipment

Country Status (1)

Country Link
CN (1) CN112420069A (en)


Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1556522A (en) * 2004-01-06 2004-12-22 中国人民解放军保密委员会技术安全研 Telephone channel speaker voice print identification system
CN103971690A (en) * 2013-01-28 2014-08-06 腾讯科技(深圳)有限公司 Voiceprint recognition method and device
CN107274892A (A) * 2017-04-24 2017-10-20 乐视控股(北京)有限公司 Method for distinguishing speaker and device
CN107492382A (A) * 2016-06-13 2017-12-19 阿里巴巴集团控股有限公司 Voiceprint extracting method and device based on neural network
CN109256137A (en) * 2018-10-09 2019-01-22 深圳市声扬科技有限公司 Voice acquisition method, device, computer equipment and storage medium
CN110211595A (en) * 2019-06-28 2019-09-06 四川长虹电器股份有限公司 A kind of speaker clustering system based on deep learning
CN110689906A (en) * 2019-11-05 2020-01-14 江苏网进科技股份有限公司 Law enforcement detection method and system based on voice processing technology
CN111243601A (en) * 2019-12-31 2020-06-05 北京捷通华声科技股份有限公司 Voiceprint clustering method and device, electronic equipment and computer-readable storage medium
CN111524527A (en) * 2020-04-30 2020-08-11 合肥讯飞数码科技有限公司 Speaker separation method, device, electronic equipment and storage medium
CN111816218A (en) * 2020-07-31 2020-10-23 平安科技(深圳)有限公司 Voice endpoint detection method, device, equipment and storage medium


Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113113001A (en) * 2021-04-20 2021-07-13 深圳市友杰智新科技有限公司 Human voice activation detection method and device, computer equipment and storage medium
CN113707130A (en) * 2021-08-16 2021-11-26 北京搜狗科技发展有限公司 Voice recognition method and device for voice recognition
CN113674755A (en) * 2021-08-19 2021-11-19 北京百度网讯科技有限公司 Voice processing method, device, electronic equipment and medium
CN113674755B (en) * 2021-08-19 2024-04-02 北京百度网讯科技有限公司 Voice processing method, device, electronic equipment and medium
CN113593597A (en) * 2021-08-27 2021-11-02 中国电信股份有限公司 Voice noise filtering method and device, electronic equipment and medium
CN113593597B (en) * 2021-08-27 2024-03-19 中国电信股份有限公司 Voice noise filtering method, device, electronic equipment and medium
CN113793592A (en) * 2021-10-29 2021-12-14 浙江核新同花顺网络信息股份有限公司 Method and system for distinguishing speakers
CN117594058A (en) * 2024-01-19 2024-02-23 南京龙垣信息科技有限公司 Audio speaker separation method based on deep learning

Similar Documents

Publication Publication Date Title
CN112420069A (en) Voice processing method, device, machine readable medium and equipment
CN112200062B (en) Target detection method and device based on neural network, machine readable medium and equipment
US20150088515A1 (en) Primary speaker identification from audio and video data
EP3164865A1 (en) Replay attack detection in automatic speaker verification systems
CN111598012B (en) Picture clustering management method, system, device and medium
CN105335754A (en) Character recognition method and device
CN110175223A (en) A kind of method and device that problem of implementation generates
CN105512685A (en) Object identification method and apparatus
CN105335713A (en) Fingerprint identification method and device
WO2014176750A1 (en) Reminder setting method, apparatus and system
CN112200318B (en) Target detection method, device, machine readable medium and equipment
CN105354560A (en) Fingerprint identification method and device
CN104361896B (en) Voice quality assessment equipment, method and system
CN111310725A (en) Object identification method, system, machine readable medium and device
CN112529939A (en) Target track matching method and device, machine readable medium and equipment
CN107172258A (en) A kind of method, device, terminal and storage medium for preserving associated person information
CN108268507B (en) Browser-based processing method and device and electronic equipment
CN112966756A (en) Visual access rule generation method and device, machine readable medium and equipment
CN111178455B (en) Image clustering method, system, device and medium
CN112423019A (en) Method and device for adjusting audio playing speed, electronic equipment and storage medium
CN103973870A (en) Information processing device and information processing method
CN115798459A (en) Audio processing method and device, storage medium and electronic equipment
CN112347982A (en) Video-based unsupervised difficult case data mining method, device, medium and equipment
CN114943872A (en) Training method and device of target detection model, target detection method and device, medium and equipment
CN112417197B (en) Sorting method, sorting device, machine readable medium and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210226