CN111613249A - Voice analysis method and equipment

Voice analysis method and equipment

Info

Publication number
CN111613249A
Authority
CN
China
Prior art keywords
voice
segment
segments
speech
speakers
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010444381.8A
Other languages
Chinese (zh)
Inventor
李旭滨
范红亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Original Assignee
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Unisound Intelligent Technology Co Ltd, Xiamen Yunzhixin Intelligent Technology Co Ltd filed Critical Unisound Intelligent Technology Co Ltd
Priority to CN202010444381.8A
Publication of CN111613249A
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L15/00 Speech recognition
    • G10L15/04 Segmentation; Word boundary detection
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Abstract

The invention provides a voice analysis method and equipment applied to single-channel voice analysis. The method comprises the following steps: dividing the voice data to be analyzed into a voice part and a non-voice part, the voice data to be analyzed comprising voice data of a plurality of speakers; segmenting the voice part into a plurality of voice segments; clustering the voice segments once the accumulated speech time exceeds a preset duration, so as to obtain the information of each voice segment; processing each voice segment whose information has been determined, so as to determine the voice characteristics of the plurality of speakers; and comparing the earliest unprocessed voice segment in time order with the voice characteristics of the plurality of speakers already determined, determining the speaker corresponding to the currently compared voice segment, and marking the currently compared voice segment as processed. By performing role separation on single-channel audio, the voices of multiple speakers can be separated, which facilitates subsequent operations such as quality inspection and intention analysis and improves processing efficiency.

Description

Voice analysis method and equipment
Technical Field
The present invention relates to the field of speech processing, and in particular, to a speech analysis method and apparatus.
Background
At present, telephone customer service is found everywhere in daily life. How to improve the quality of telephone service and how to analyze the customer's intention are therefore important topics. Customer service quality inspection and customer intention analysis require speech recognition, but at present they are carried out by manual spot checks, which are inefficient and miss many problems.
Automatic detection and analysis by machine is therefore becoming more and more important. When the customer and the customer service agent are recorded on different telephone channels, analyzing them separately is relatively easy: the speech is converted to text by speech recognition and the text is then analyzed. When the voices of the customer and the agent are stored in a single channel, however, performing customer service quality inspection and customer intention analysis at the same time becomes extremely difficult, and in this case the hardest part of the analysis is role analysis, i.e. deciding who is speaking.
Thus, there is a need for a better solution to this problem.
Disclosure of Invention
Aiming at the above defects in the prior art, the invention provides a voice analysis method and equipment which perform role separation on single-channel audio and can separate the voices of a plurality of speakers, which is beneficial for subsequent operations such as quality inspection and intention analysis; the scheme supports online real-time processing and improves processing efficiency.
Specifically, the present invention proposes the following embodiments:
An embodiment of the invention provides a voice analysis method, applied to single-channel voice analysis, comprising the following steps:
dividing the voice data to be analyzed into a voice part and a non-voice part; the voice data to be analyzed comprises voice data of a plurality of speakers;
segmenting the voice part into a plurality of voice segments;
clustering the voice segments once the accumulated speech time exceeds a preset duration, so as to obtain the information of each voice segment;
processing each voice segment whose information has been determined, so as to determine the voice characteristics of the plurality of speakers;
comparing the earliest unprocessed voice segment in time order with the voice characteristics of the plurality of speakers already determined, determining the speaker corresponding to the currently compared voice segment, and marking the currently compared voice segment as processed.
In a specific embodiment, the segmenting the voice data to be analyzed into a voice portion and a non-voice portion includes:
and segmenting the voice data to be analyzed by a Voice Activity Detection (VAD) method so as to divide the voice data to be analyzed into a voice part and a non-voice part.
In a specific embodiment, the segmenting the voice portion into a plurality of voice segments includes:
dividing the voice part into a plurality of non-overlapping voice segments according to a preset time length;
and if the time length of the last voice segment is less than a preset value, merging the last voice segment with the adjacent voice segment.
In a specific embodiment, the speech segments are extracted with forward and backward frame expansion and/or overlap.
In a specific embodiment, within the time period corresponding to the preset duration, each speaker in the voice data to be analyzed has spoken for a specified duration.
In a specific embodiment, the information includes any combination of one or more of the following: the characteristics of the voice, the speaker of the voice, and the time point of the voice.
In a specific embodiment, the processing each voice segment whose information has been determined, to determine the voice characteristics of the plurality of speakers, comprises:
smoothing each voice segment whose information has been determined, merging adjacent voice segments belonging to the same speaker, and setting the speaker corresponding to a preset voice segment to be the same speaker as its adjacent voice segments, so as to determine the voice characteristics of the plurality of speakers;
wherein the preset voice segment is located between a preceding and a following adjacent voice segment that correspond to the same speaker, and the time length of the preset voice segment is less than a preset length.
The embodiment of the invention also provides a voice analysis device, which is applied to single-channel voice analysis and comprises:
the first segmentation module is used for segmenting the voice data to be analyzed into a voice part and a non-voice part; the voice data to be analyzed comprises voice data of a plurality of speakers;
the second segmentation module is used for segmenting the voice part into a plurality of voice segments;
the clustering module is used for clustering the voice segments once the accumulated speech time exceeds the preset duration, so as to obtain the information of each voice segment;
the determining module is used for processing each voice segment whose information has been determined and determining the voice characteristics of the plurality of speakers;
and the analysis module is used for comparing the earliest unprocessed voice segment in time order with the voice characteristics of the plurality of speakers already determined, determining the speaker corresponding to the currently compared voice segment, and marking the currently compared voice segment as processed.
In a specific embodiment, the first segmentation module is configured to:
and segmenting the voice data to be analyzed by a Voice Activity Detection (VAD) method so as to divide the voice data to be analyzed into a voice part and a non-voice part.
In a specific embodiment, the second segmentation module is configured to:
dividing the voice part into a plurality of non-overlapping voice segments according to a preset time length;
and if the time length of the last voice segment is less than a preset value, merging the last voice segment with the adjacent voice segment.
Therefore, the embodiment of the invention provides a voice analysis method and equipment applied to single-channel voice analysis. The method comprises: dividing the voice data to be analyzed into a voice part and a non-voice part, the voice data to be analyzed comprising voice data of a plurality of speakers; segmenting the voice part into a plurality of voice segments; clustering the voice segments once the accumulated speech time exceeds a preset duration, so as to obtain the information of each voice segment; processing each voice segment whose information has been determined, so as to determine the voice characteristics of the plurality of speakers; and comparing the earliest unprocessed voice segment in time order with the voice characteristics of the plurality of speakers already determined, determining the speaker corresponding to the currently compared voice segment, and marking the currently compared voice segment as processed. By performing role separation on single-channel audio, the voices of multiple speakers can be separated, which facilitates subsequent operations such as quality inspection and intention analysis; the scheme supports online real-time processing and improves processing efficiency.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present invention and therefore should not be considered as limiting its scope; for those skilled in the art, other related drawings can be obtained from these drawings without inventive effort.
Fig. 1 is a schematic flow chart of a speech analysis method according to an embodiment of the present invention;
fig. 2 is a schematic flow chart of a speech analysis method according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a speech analysis device according to an embodiment of the present invention.
Detailed Description
Various embodiments of the present disclosure will be described more fully hereinafter. The present disclosure is capable of various embodiments and of modifications and variations therein. However, it should be understood that: there is no intention to limit the various embodiments of the disclosure to the specific embodiments disclosed herein, but rather, the disclosure is to cover all modifications, equivalents, and/or alternatives falling within the spirit and scope of the various embodiments of the disclosure.
The terminology used in the various embodiments of the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the various embodiments of the present disclosure. As used herein, the singular forms are intended to include the plural forms as well, unless the context clearly indicates otherwise. Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the various embodiments of the present disclosure belong. The terms (such as those defined in commonly used dictionaries) should be interpreted as having a meaning that is consistent with their contextual meaning in the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined in various embodiments of the present disclosure.
Example 1
Embodiment 1 of the present invention discloses a speech analysis method, which is applied to single-channel speech analysis, and as shown in fig. 1-2, includes the following steps:
step 101, dividing voice data to be analyzed into a voice part and a non-voice part; the voice data to be analyzed comprises voice data of a plurality of speakers;
The voice data to be analyzed may specifically be single-channel voice data, for example a single channel in which two persons, a customer service agent and a customer, are mixed, or a single channel in which more speakers are mixed. The step 101 of dividing the voice data to be analyzed into a voice part and a non-voice part includes:
segmenting the voice data to be analyzed by a Voice Activity Detection (VAD) method, so as to divide the voice data to be analyzed into a voice part and a non-voice part.
Specifically, besides the VAD method, other methods may also be used for this segmentation, for example segmentation according to the waveform of the voice; any method that can divide the voice data to be analyzed into a voice part and a non-voice part may be used, and the method is not limited to VAD.
Specifically, this corresponds to the VAD stage shown in fig. 2, where the voice part consists of the regions labeled speech 1 and speech 2.
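As an illustration of this step only, a minimal energy-based VAD sketch in Python is shown below; the frame length, hop and energy threshold are assumptions for illustration and not values prescribed by this scheme, and samples are assumed to be a 1-D numpy array scaled to [-1, 1]:

```python
import numpy as np

def simple_energy_vad(samples, sample_rate, frame_ms=25, hop_ms=10, threshold_db=-40.0):
    """Label each frame as speech (True) or non-speech (False) by short-time energy.

    A minimal sketch of the VAD step: real systems typically add smoothing,
    adaptive thresholds, or a trained model.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    labels = []
    for start in range(0, len(samples) - frame_len + 1, hop_len):
        frame = samples[start:start + frame_len].astype(float)
        energy_db = 10 * np.log10(np.mean(frame ** 2) + 1e-12)
        labels.append(energy_db > threshold_db)
    return np.array(labels)

# Runs of True frames give the voice parts; runs of False frames give the non-voice parts.
```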
Step 102, segmenting the voice part into a plurality of voice segments;
specifically, the segmenting the voice part into a plurality of voice segments in step 102 includes:
dividing the voice part into a plurality of non-overlapping voice segments according to a preset time length;
and if the time length of the last voice segment is less than a preset value, merging the last voice segment with the adjacent voice segment.
Specifically, the preset time length may be set, for example, to 500 ms and the preset value to 300 ms. In that case the voice portion is divided into voice segments that do not overlap with each other, each voice segment being 500 ms long. If the length of the last voice segment is less than 300 ms, it is spliced onto the previous voice segment to form a longer segment; if the last segment is greater than or equal to 300 ms but less than 500 ms, it is kept as a voice segment on its own.
The segmentation principle of the present scheme assumes that each segmented speech segment contains only one speaker, so the length of each speech segment can be neither too long nor too short, generally several hundred milliseconds; experiments show 500 ms to be a preferred value. In addition, depending on the application scenario, the preset time length may also be set to a value between 400 and 600 ms, for example, and the preset value to a value between 250 and 350 ms.
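For illustration only, the 500 ms / 300 ms rule described above could be sketched as follows; the function name and the millisecond-based segment representation are assumptions:

```python
def split_voice_part(start_ms, end_ms, seg_ms=500, min_tail_ms=300):
    """Split one voice part [start_ms, end_ms) into non-overlapping segments of
    seg_ms; if the trailing segment is shorter than min_tail_ms, merge it into
    the previous segment."""
    segments = []
    t = start_ms
    while t < end_ms:
        segments.append([t, min(t + seg_ms, end_ms)])
        t += seg_ms
    if len(segments) > 1 and segments[-1][1] - segments[-1][0] < min_tail_ms:
        tail = segments.pop()
        segments[-1][1] = tail[1]  # splice the short tail onto the previous segment
    return segments

print(split_voice_part(0, 1200))  # [[0, 500], [500, 1200]]: the 200 ms tail is merged
```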
Specifically, in one embodiment, the speech segments are extracted with forward and backward frame expansion and/or overlap.
This corresponds to stage 2 (Segment) in fig. 2: the voice part is divided into small speech segments and the features of each segment are extracted. To ensure a better effect, the voice segments carry frame-expansion and/or overlap information.
Specifically, when a speech segment is processed in this scheme, a forward/backward frame expansion and/or overlap technique is adopted, which can greatly improve the accuracy of the information extracted from the segment and the overall performance of the system. "Forward and backward frame expansion" means that, although information is extracted frame by frame, the current frame is not processed in isolation: the frames before and after it are included in the processing, so the information obtained for the current frame also contains "context information". In this case the expanded frames are the frames immediately before and after the current frame.
Overlap means that, while information is extracted frame by frame, the "current frame" is moved forward in an overlapping manner. For example, if the window length of each frame is 25 ms and the window shift is 10 ms, the current frame and the next frame overlap by 15 ms; the information extracted in this way is more accurate.
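A possible sketch of the overlapping framing with forward/backward frame expansion described above (25 ms window, 10 ms shift); the context width of two frames on each side is an assumption, and samples are assumed to be a 1-D numpy array:

```python
import numpy as np

def frames_with_context(samples, sample_rate, win_ms=25, hop_ms=10, context=2):
    """Cut a segment into overlapping frames (25 ms window, 10 ms shift => 15 ms
    overlap) and attach `context` frames on each side of every frame, so the
    information extracted for the current frame also carries its left/right context."""
    win = int(sample_rate * win_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    frames = [samples[i:i + win] for i in range(0, len(samples) - win + 1, hop)]
    expanded = []
    for idx in range(len(frames)):
        lo, hi = max(0, idx - context), min(len(frames), idx + context + 1)
        expanded.append(np.concatenate(frames[lo:hi]))  # current frame plus surrounding frames
    return expanded
```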
Step 103, clustering the voice segments once the accumulated speech time exceeds a preset duration, so as to obtain the information of each voice segment;
Clustering refers to collecting and analyzing the information of the individual voice segments obtained in the previous step and assigning the segments to specific categories. In this scheme a bottom-up agglomerative hierarchical clustering (AHC) algorithm may be adopted.
In a specific embodiment, in order to ensure that all speakers can be accurately identified subsequently, within the time period corresponding to the preset duration each speaker in the voice data to be analyzed has spoken for a specified duration. In one embodiment there are, for example, 2 speakers, and the preset duration requires that each of the 2 speakers has spoken for 1 minute.
In addition, the specific information includes any combination of one or more of the following: the characteristics of the voice, the speaker of the voice, and the time point of the voice.
Specifically, this corresponds to the Cluster stage shown in fig. 2: after a sufficiently long time (for example, when both the customer and the customer service agent have each spoken for at least one minute), all segments obtained so far are clustered to obtain the information (features, speakers, time points, etc.) of each segment.
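As an illustration of the bottom-up AHC step, a possible sketch using scikit-learn's AgglomerativeClustering on per-segment embeddings is given below; the embeddings themselves, the linkage and the fixed number of clusters are assumptions, as the scheme does not prescribe them:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_segments(segment_embeddings, n_speakers=2):
    """Bottom-up (agglomerative) clustering of per-segment embeddings;
    returns one tentative speaker label per segment.

    Euclidean distance with average linkage is used here for simplicity;
    cosine distance is a common alternative for speaker embeddings.
    """
    X = np.asarray(segment_embeddings)
    ahc = AgglomerativeClustering(n_clusters=n_speakers, linkage="average")
    return ahc.fit_predict(X)
```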
Step 104, processing each voice segment whose information has been determined, to determine the voice characteristics of the plurality of speakers;
In a specific step 104, the processing of each voice segment whose information has been determined, to determine the voice characteristics of the plurality of speakers, includes:
smoothing each voice segment whose information has been determined, merging adjacent voice segments belonging to the same speaker, and setting the speaker corresponding to a preset voice segment to be the same speaker as its adjacent voice segments, so as to determine the voice characteristics of the plurality of speakers;
wherein the preset voice segment is located between a preceding and a following adjacent voice segment that correspond to the same speaker, and the time length of the preset voice segment is less than a preset length.
Specifically, this corresponds to the Smoothing stage shown in fig. 2: adjacent segments belonging to the same speaker are merged, and some segments that are too short and differ from their adjacent segments are "smoothed out". In this way the characteristics of the customer and of the customer service agent, as well as their speaking-segment information, can be obtained.
Specifically, the smoothing process covers two cases: merging and floating. Merging refers to merging adjacent voice segments belonging to the same speaker. Floating means that if a voice segment assigned to another speaker B is sandwiched between two voice segments belonging to the same speaker A, and the length of that B segment is very small (smaller than a preset threshold), its speaker can be changed from B to A; that is, a segment that is too short and whose decision differs from that of its adjacent segments has its decision changed to match theirs.
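The two smoothing cases could be sketched as follows; here each segment is represented as a (start_ms, end_ms, speaker) triple, and the 300 ms threshold is only an illustrative assumption:

```python
def smooth(segments, min_len_ms=300):
    """segments: list of (start_ms, end_ms, speaker), sorted by time.

    Step 1 (floating): a very short segment sandwiched between two segments
    of the same speaker is relabeled to that speaker.
    Step 2 (merging): adjacent segments of the same speaker are merged.
    """
    if not segments:
        return []
    segs = [list(s) for s in segments]
    for i in range(1, len(segs) - 1):
        prev_spk, cur, next_spk = segs[i - 1][2], segs[i], segs[i + 1][2]
        if prev_spk == next_spk and cur[2] != prev_spk and cur[1] - cur[0] < min_len_ms:
            cur[2] = prev_spk  # relabel the too-short segment to the surrounding speaker
    merged = [segs[0]]
    for start, end, spk in segs[1:]:
        if spk == merged[-1][2]:
            merged[-1][1] = end  # extend the previous segment of the same speaker
        else:
            merged.append([start, end, spk])
    return [tuple(s) for s in merged]

print(smooth([(0, 500, "A"), (500, 700, "B"), (700, 1200, "A"), (1200, 1700, "B")]))
# -> [(0, 1200, 'A'), (1200, 1700, 'B')]: the 200 ms B segment is relabeled and merged
```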
Step 105, comparing the earliest unprocessed voice segment in time order with the voice features of the plurality of speakers already determined, determining the speaker corresponding to the currently compared voice segment, and marking the currently compared voice segment as processed.
After the voice features of each speaker are obtained in step 104, subsequent role analysis and recognition can be performed, for example online: each subsequent segment is compared with the two obtained speaker features to determine the speaker to which it belongs, i.e. whose voice the segment is; smoothing is then applied again to merge it with the existing information. This step loops until the conversation ends.
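A possible sketch of this online comparison step; representing the speaker features as mean embeddings and using cosine similarity are assumptions for illustration:

```python
import numpy as np

def assign_speaker(segment_embedding, speaker_features):
    """speaker_features: dict mapping speaker id -> feature vector (e.g. mean embedding).
    Returns the speaker whose feature is most similar (cosine) to the segment, so the
    earliest unprocessed segment can be labeled and then marked as processed."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    return max(speaker_features, key=lambda spk: cosine(segment_embedding, speaker_features[spk]))
```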
Example 2
Embodiment 2 of the present invention further discloses a speech analysis device, which is applied to single-channel speech analysis, and as shown in fig. 3, the speech analysis device includes:
a first segmentation module 201, configured to segment the voice data to be analyzed into a voice part and a non-voice part; the voice data to be analyzed comprises voice data of a plurality of speakers;
a second segmentation module 202, configured to segment the voice portion into a plurality of voice segments;
the clustering module 203 is configured to cluster the voice segments once the accumulated speech time exceeds a preset duration, to obtain the information of each voice segment;
a determining module 204, configured to process each of the speech segments for which information is determined, and determine speech features of a plurality of speakers;
the analysis module 205 is configured to compare the earliest unprocessed voice segment in time order with the voice features of the plurality of speakers already determined, determine the speaker corresponding to the currently compared voice segment, and mark the currently compared voice segment as processed.
In a specific embodiment, the first segmentation module 201 is configured to:
and segmenting the voice data to be analyzed by a Voice Activity Detection (VAD) method so as to divide the voice data to be analyzed into a voice part and a non-voice part.
In a specific embodiment, the second segmentation module 202 is configured to:
dividing the voice part into a plurality of non-overlapping voice segments according to a preset time length;
and if the time length of the last voice segment is less than a preset value, merging the last voice segment with the adjacent voice segment.
In a specific embodiment, the speech segments are extracted with forward and backward frame expansion and/or overlap.
In a specific embodiment, within the time period corresponding to the preset duration, each speaker in the voice data to be analyzed has spoken for a specified duration.
In a specific embodiment, the information includes any combination of one or more of the following: the characteristics of the voice, the speaker of the voice, and the time point of the voice.
In a specific embodiment, the determining module 204 is configured to:
smoothing each voice segment whose information has been determined, merging adjacent voice segments belonging to the same speaker, and setting the speaker corresponding to a preset voice segment to be the same speaker as its adjacent voice segments, so as to determine the voice characteristics of the plurality of speakers;
wherein the preset voice segment is located between a preceding and a following adjacent voice segment that correspond to the same speaker, and the time length of the preset voice segment is less than a preset length.
Therefore, the embodiment of the invention provides a voice analysis method and equipment applied to single-channel voice analysis. The method comprises: dividing the voice data to be analyzed into a voice part and a non-voice part, the voice data to be analyzed comprising voice data of a plurality of speakers; segmenting the voice part into a plurality of voice segments; clustering the voice segments once the accumulated speech time exceeds a preset duration, so as to obtain the information of each voice segment; processing each voice segment whose information has been determined, so as to determine the voice characteristics of the plurality of speakers; and comparing the earliest unprocessed voice segment in time order with the voice characteristics of the plurality of speakers already determined, determining the speaker corresponding to the currently compared voice segment, and marking the currently compared voice segment as processed. By performing role separation on single-channel audio, the voices of multiple speakers can be separated, which facilitates subsequent operations such as quality inspection and intention analysis; the scheme supports online real-time processing and improves processing efficiency.
Those skilled in the art will appreciate that the figures are merely schematic representations of one preferred implementation scenario and that the blocks or flow diagrams in the figures are not necessarily required to practice the present invention.
Those skilled in the art will appreciate that the modules in the devices in the implementation scenario may be distributed in the devices in the implementation scenario according to the description of the implementation scenario, or may be located in one or more devices different from the present implementation scenario with corresponding changes. The modules of the implementation scenario may be combined into one module, or may be further split into a plurality of sub-modules.
The serial numbers of the above embodiments are merely for description and do not represent the merits of the implementation scenarios.
The above disclosure is only a few specific implementation scenarios of the present invention, however, the present invention is not limited thereto, and any variations that can be made by those skilled in the art are intended to fall within the scope of the present invention.

Claims (10)

1. A speech analysis method, applied to single-channel speech analysis, the method comprising:
dividing the voice data to be analyzed into a voice part and a non-voice part; the voice data to be analyzed comprises voice data of a plurality of speakers;
segmenting the voice part into a plurality of voice segments;
clustering the voice segments once the accumulated speech time exceeds a preset duration, so as to obtain the information of each voice segment;
processing each voice segment whose information has been determined, so as to determine the voice characteristics of the plurality of speakers;
comparing the earliest unprocessed voice segment in time order with the voice characteristics of the plurality of speakers already determined, determining the speaker corresponding to the currently compared voice segment, and marking the currently compared voice segment as processed.
2. The speech analysis method of claim 1, wherein said segmenting the speech data to be analyzed into speech portions and non-speech portions comprises:
and segmenting the voice data to be analyzed by a Voice Activity Detection (VAD) method so as to divide the voice data to be analyzed into a voice part and a non-voice part.
3. The speech analysis method of claim 1, wherein said segmenting said speech portion into a plurality of speech segments comprises:
dividing the voice part into a plurality of non-overlapping voice segments according to a preset time length;
and if the time length of the last voice segment is less than a preset value, merging the last voice segment with the adjacent voice segment.
4. The speech analysis method of claim 1, wherein the speech segments are extracted with forward and backward frame expansion and/or overlap.
5. The speech analysis method according to claim 1, wherein each speaker in the speech data to be analyzed has spoken for a specified duration within the time period corresponding to the preset duration.
6. The speech analysis method of claim 1, wherein the information comprises any combination of one or more of the following: the characteristics of the voice, the speaker of the voice, and the time point of the voice.
7. The speech analysis method of claim 1 wherein said processing each of said speech segments for which information is determined to determine speech characteristics of a plurality of said speakers comprises:
smoothing each voice segment whose information has been determined, merging adjacent voice segments belonging to the same speaker, and setting the speaker corresponding to a preset voice segment to be the same speaker as its adjacent voice segments, so as to determine the voice characteristics of the plurality of speakers;
wherein the preset voice segment is located between a preceding and a following adjacent voice segment that correspond to the same speaker, and the time length of the preset voice segment is less than a preset length.
8. A speech analysis apparatus, for single channel speech analysis, comprising:
the first segmentation module is used for segmenting the voice data to be analyzed into a voice part and a non-voice part; the voice data to be analyzed comprises voice data of a plurality of speakers;
the second segmentation module is used for segmenting the voice part into a plurality of voice segments;
the clustering module is used for clustering the voice segments once the accumulated speech time exceeds a preset duration, so as to obtain the information of each voice segment;
the determining module is used for processing each voice segment with determined information and determining the voice characteristics of a plurality of speakers;
and the analysis module is used for comparing the earliest unprocessed voice segment in time order with the voice characteristics of the plurality of speakers already determined, determining the speaker corresponding to the currently compared voice segment, and marking the currently compared voice segment as processed.
9. The speech analysis device of claim 8, wherein the first segmentation module is to:
and segmenting the voice data to be analyzed by a Voice Activity Detection (VAD) method so as to divide the voice data to be analyzed into a voice part and a non-voice part.
10. The speech analysis device of claim 8, wherein the second segmentation module is to:
dividing the voice part into a plurality of non-overlapping voice segments according to a preset time length;
and if the time length of the last voice segment is less than a preset value, merging the last voice segment with the adjacent voice segment.
CN202010444381.8A 2020-05-22 2020-05-22 Voice analysis method and equipment Pending CN111613249A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010444381.8A CN111613249A (en) 2020-05-22 2020-05-22 Voice analysis method and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010444381.8A CN111613249A (en) 2020-05-22 2020-05-22 Voice analysis method and equipment

Publications (1)

Publication Number Publication Date
CN111613249A 2020-09-01

Family

ID=72201644

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010444381.8A Pending CN111613249A (en) 2020-05-22 2020-05-22 Voice analysis method and equipment

Country Status (1)

Country Link
CN (1) CN111613249A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113707130A (en) * 2021-08-16 2021-11-26 北京搜狗科技发展有限公司 Voice recognition method and device for voice recognition
WO2022161264A1 (en) * 2021-01-26 2022-08-04 阿里巴巴集团控股有限公司 Audio signal processing method, conference recording and presentation method, device, system, and medium

Citations (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009109712A (en) * 2007-10-30 2009-05-21 National Institute Of Information & Communication Technology System for sequentially distinguishing online speaker and computer program thereof
CN102682760A (en) * 2011-03-07 2012-09-19 株式会社理光 Overlapped voice detection method and system
CN102831891A (en) * 2011-06-13 2012-12-19 富士通株式会社 Processing method and system for voice data
US20140074467A1 (en) * 2012-09-07 2014-03-13 Verint Systems Ltd. Speaker Separation in Diarization
US20150025887A1 (en) * 2013-07-17 2015-01-22 Verint Systems Ltd. Blind Diarization of Recorded Calls with Arbitrary Number of Speakers
WO2016152132A1 (en) * 2015-03-25 2016-09-29 日本電気株式会社 Speech processing device, speech processing system, speech processing method, and recording medium
US20170323643A1 (en) * 2016-05-03 2017-11-09 SESTEK Ses ve Ìletisim Bilgisayar Tekn. San. Ve Tic. A.S. Method for Speaker Diarization
CN107967912A (en) * 2017-11-28 2018-04-27 广州势必可赢网络科技有限公司 A kind of voice dividing method and device
CN108564968A (en) * 2018-04-26 2018-09-21 广州势必可赢网络科技有限公司 A kind of method and device of evaluation customer service
CN109036386A (en) * 2018-09-14 2018-12-18 北京网众共创科技有限公司 A kind of method of speech processing and device
CN110299150A (en) * 2019-06-24 2019-10-01 中国科学院计算技术研究所 A kind of real-time voice speaker separation method and system
CN110390946A (en) * 2019-07-26 2019-10-29 龙马智芯(珠海横琴)科技有限公司 A kind of audio signal processing method, device, electronic equipment and storage medium
CN110459240A (en) * 2019-08-12 2019-11-15 新疆大学 The more speaker's speech separating methods clustered based on convolutional neural networks and depth
CN110517667A (en) * 2019-09-03 2019-11-29 龙马智芯(珠海横琴)科技有限公司 A kind of method of speech processing, device, electronic equipment and storage medium
CN110827853A (en) * 2019-11-11 2020-02-21 广州国音智能科技有限公司 Voice feature information extraction method, terminal and readable storage medium
CN110853666A (en) * 2019-12-17 2020-02-28 科大讯飞股份有限公司 Speaker separation method, device, equipment and storage medium
CN110930984A (en) * 2019-12-04 2020-03-27 北京搜狗科技发展有限公司 Voice processing method and device and electronic equipment
CN111063341A (en) * 2019-12-31 2020-04-24 苏州思必驰信息科技有限公司 Method and system for segmenting and clustering multi-person voice in complex environment
CN111128223A (en) * 2019-12-30 2020-05-08 科大讯飞股份有限公司 Text information-based auxiliary speaker separation method and related device
CN111145782A (en) * 2019-12-20 2020-05-12 深圳追一科技有限公司 Overlapped speech recognition method, device, computer equipment and storage medium
US20200234717A1 (en) * 2018-05-28 2020-07-23 Ping An Technology (Shenzhen) Co., Ltd. Speaker separation model training method, two-speaker separation method and computing device


Similar Documents

Publication Publication Date Title
US10902856B2 (en) System and method of diarization and labeling of audio data
CN105161093B (en) A kind of method and system judging speaker's number
US8145486B2 (en) Indexing apparatus, indexing method, and computer program product
CN112289323B (en) Voice data processing method and device, computer equipment and storage medium
CN107562760B (en) Voice data processing method and device
CN105632501A (en) Deep-learning-technology-based automatic accent classification method and apparatus
CN106157951B (en) Carry out the automatic method for splitting and system of audio punctuate
CN108257592A (en) A kind of voice dividing method and system based on shot and long term memory models
KR101616112B1 (en) Speaker separation system and method using voice feature vectors
US20180047387A1 (en) System and method for generating accurate speech transcription from natural speech audio signals
CN106847259B (en) Method for screening and optimizing audio keyword template
CN111613249A (en) Voice analysis method and equipment
CN101625860A (en) Method for self-adaptively adjusting background noise in voice endpoint detection
US7689414B2 (en) Speech recognition device and method
CN112802498B (en) Voice detection method, device, computer equipment and storage medium
CN113744742A (en) Role identification method, device and system in conversation scene
CN115063155A (en) Data labeling method and device, computer equipment and storage medium
CN112241467A (en) Audio duplicate checking method and device
CN111613208B (en) Language identification method and equipment
CN115100701A (en) Conference speaker identity identification method based on artificial intelligence technology
US20230215439A1 (en) Training and using a transcript generation model on a multi-speaker audio stream
JPH0683384A (en) Automatic detecting and identifying device for vocalization section of plural speakers in speech
CN114299962A (en) Method, system, device and storage medium for separating conversation role based on audio stream
CN115985315A (en) Speaker labeling method, device, electronic equipment and storage medium
CN114333784A (en) Information processing method, information processing device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200901