CN110728996A - Real-time voice quality inspection method, device, equipment and computer storage medium - Google Patents


Info

Publication number
CN110728996A
CN110728996A (application CN201911018521.9A)
Authority
CN
China
Prior art keywords
audio
quality inspection
analysis
identification information
recording data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911018521.9A
Other languages
Chinese (zh)
Inventor
苑维然
金增笑
李宏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jiuhu Times Intelligent Technology Co Ltd
Original Assignee
Beijing Jiuhu Times Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jiuhu Times Intelligent Technology Co Ltd
Priority: CN201911018521.9A
Publication: CN110728996A
Legal status: Pending

Classifications

    • G10L25/51 — Speech or voice analysis, specially adapted for comparison or discrimination
    • G10L25/63 — Speech or voice analysis, specially adapted for estimating an emotional state
    • G10L25/21 — Speech or voice analysis, the extracted parameters being power information
    • G10L25/30 — Speech or voice analysis using neural networks
    • G10L25/78 — Detection of presence or absence of voice signals
    • H04M3/5175 — Call or contact centers: supervision arrangements

Abstract

The application discloses a real-time voice quality inspection method, device, equipment and computer storage medium. The method comprises: slicing target recording data to obtain at least two audio slices; performing validity analysis on the audio slices according to a preset first analysis rule, so as to screen out valid audio slices based on the validity analysis result; and analyzing the valid audio slices according to a preset second analysis rule to obtain quality inspection analysis information, which comprises at least one of emotion identification information, speech rate identification information, silence identification information, talk-over (call interruption) identification information and voice validity identification information. The method improves the efficiency and coverage rate of quality inspection of customer service audio data and reduces its cost; at the same time, invalid audio data in the customer service audio data is filtered out, which saves server resources.

Description

Real-time voice quality inspection method, device, equipment and computer storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a real-time voice quality inspection method, apparatus, device, and computer storage medium.
Background
In the customer service quality inspection industry, call recordings must be inspected in order to guarantee service quality. Traditionally this inspection is performed manually, which is costly and inefficient.
To address these shortcomings, the prior art connects the call center system to a CRM center. Although this solution achieves a degree of machine quality inspection and improves inspection efficiency, its inspection coverage remains low, and completing the whole flow from answering a call to quality inspection and re-inspection requires integrating several systems.
Disclosure of Invention
The embodiments of the application aim to disclose a real-time voice quality inspection method, device, equipment and computer storage medium, which improve the efficiency and coverage rate of quality inspection of customer service data and reduce the cost of quality inspection. The disclosed method and device also filter out invalid audio data, saving server resources.
The first aspect of the present application discloses a real-time voice quality inspection method, which includes:
slicing the target recording data to obtain at least two audio slices;
performing effectiveness analysis on the at least two audio slices according to a preset first analysis rule so as to screen out effective audio slices from the at least two audio slices based on an effectiveness analysis result;
and analyzing the valid audio slices according to a preset second analysis rule to obtain quality inspection analysis information, wherein the quality inspection analysis information comprises at least one of emotion identification information, speech rate identification information, silence identification information, talk-over identification information and voice validity identification information.
By slicing the target recording data, performing validity analysis and then quality inspection analysis, the real-time voice quality inspection method obtains the analysis information required for quality inspection, which improves inspection efficiency and reduces inspection cost. Moreover, because the recording data undergoes validity analysis, invalid audio is filtered out and the total volume of data to be inspected is reduced, so that quality inspection is carried out only on valid recording data; this further improves inspection efficiency and saves server resources.
As an alternative embodiment, slicing the target recording data and obtaining at least two audio slices includes:
determining a slicing time length value according to the time length value corresponding to the target recording data and the capacity corresponding to the target recording data;
and dividing the recording data according to the slice duration value to obtain at least two audio slices.
This optional implementation determines the slice duration value from the duration value of the recording data and the capacity of the recording file, and then divides the recording into audio slices accordingly, which makes the slicing of the recording data more reasonable and reduces the probability of producing audio slices whose capacity is too large or too small.
As an alternative embodiment, the validity analysis of the at least two audio slices according to the preset first analysis rule includes:
performing fast Fourier transform on each audio slice and obtaining frequency domain energy distribution corresponding to each audio slice;
inputting the frequency domain energy distribution corresponding to each audio slice into a second neural network;
and classifying the audio slice into one of an effective audio slice and an ineffective audio slice according to a preset matching rule based on an output result of the second neural network.
This optional embodiment feeds the frequency-domain energy distribution of each audio slice into the second neural network, so that the audio slice can be classified as either a valid or an invalid audio slice based on the network's output.
As an optional implementation manner, analyzing the valid audio slice according to the preset second analysis rule and obtaining the quality inspection analysis information includes:
counting the emotional factor proportions in the valid audio slice, wherein the emotional factor proportions comprise at least one of an anger proportion, a fear proportion, a happiness proportion, a sadness proportion and a neutral proportion;
and calculating the emotion recognition information from the emotional factor proportions.
According to this optional implementation, the emotion recognition information is calculated from the emotional factor proportions, realizing the emotion recognition function and making it convenient to judge whether the customer service agent's emotion meets requirements during service.
As an optional implementation manner, analyzing the valid audio slice according to the preset second analysis rule and obtaining the quality inspection analysis information includes:
extracting a phoneme sequence from the valid audio slice;
extracting the duration corresponding to each phoneme in the phoneme sequence;
counting the total duration of the phoneme sequence according to the duration corresponding to each phoneme;
and calculating to obtain the speech speed identification information according to the total duration of the phoneme sequence and the number of phonemes in the phoneme sequence.
This optional embodiment generates the speech rate identification information of an audio slice from the durations of its phonemes, so that it can be judged whether the customer service agent speaks too fast or too slowly.
As an optional implementation, the method further comprises:
and judging whether the phoneme sequence contains an interval sequence that satisfies the silence condition, and if so, generating silence identification information. This optional embodiment generates silence identification information from the phoneme sequence, so that it can be judged whether the customer service agent failed to respond to the user's questions in a timely manner.
As an optional implementation, the method further comprises:
extracting first track audio and second track audio in the valid audio slice;
and generating the talk-over identification information based on the overlapping interval of the first track audio and the second track audio. This optional implementation judges whether the customer service agent talks over the customer according to the first and second audio tracks in the valid audio slice.
As an optional implementation manner, before performing the slicing processing on the target recording data and obtaining at least two audio slices, the method further includes:
recognizing scene characteristics in target recording data by utilizing a preset first neural network;
and determining the scene type corresponding to the target recording data according to the scene characteristics.
According to this optional implementation, the preset first neural network determines the scene type of the target recording data, so that during re-inspection only target recording data of a specific scene needs to be re-checked, improving re-inspection efficiency.
The second aspect of the present application discloses a real-time voice quality inspection device, the device includes:
the slicing processing module is used for carrying out slicing processing on the target recording data and obtaining at least two audio slices;
the first analysis module is used for carrying out effectiveness analysis on the at least two audio slices according to a preset first analysis rule so as to screen out effective audio slices from the at least two audio slices based on an effectiveness analysis result;
and the second analysis module is used for analyzing the valid audio slice according to a preset second analysis rule to obtain quality inspection analysis information, wherein the quality inspection analysis information comprises at least one of emotion recognition information, speech rate recognition information, silence recognition information, talk-over recognition information and voice validity recognition information.
By slicing the target recording data, performing validity analysis and then quality inspection analysis, the real-time voice quality inspection device obtains the analysis information required for quality inspection, which improves inspection efficiency and reduces inspection cost. Moreover, because the recording data undergoes validity analysis, invalid audio is filtered out and the total volume of data to be inspected is reduced, so that quality inspection is carried out only on valid recording data; this further improves inspection efficiency and saves server resources.
The third aspect of the present application discloses a real-time voice quality inspection apparatus, which includes:
a processor; and
a memory configured to store machine readable instructions which, when executed by the processor, cause the processor to perform a real-time speech quality inspection method according to the first aspect of the present application.
By slicing the target recording data, performing validity analysis and then quality inspection analysis, the real-time voice quality inspection equipment obtains the analysis information required for quality inspection, which improves inspection efficiency and reduces inspection cost. Moreover, because the recording data undergoes validity analysis, invalid audio is filtered out and the total volume of data to be inspected is reduced, so that quality inspection is carried out only on valid recording data; this further improves inspection efficiency and saves server resources.
A fourth aspect of the present application discloses a computer storage medium storing a computer program, the computer program being executed by a processor to perform a real-time voice quality inspection method according to the present application.
By executing the real-time voice quality inspection method, the computer storage medium enables the target recording data to undergo slicing, validity analysis and quality inspection analysis to obtain the analysis information required for quality inspection, which improves inspection efficiency and reduces inspection cost. Moreover, because the recording data undergoes validity analysis, invalid audio is filtered out and the total volume of data to be inspected is reduced, so that quality inspection is carried out only on valid recording data; this further improves inspection efficiency and saves server resources.
Drawings
To illustrate the technical solutions of the embodiments more clearly, the drawings required by the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the application and therefore should not be considered as limiting its scope; those skilled in the art can obtain other related drawings from them without inventive effort.
Fig. 1 is a schematic flowchart illustrating a real-time voice quality inspection method according to an embodiment of the present application;
FIG. 2 is a flow chart illustrating a sub-step of step 101 disclosed in an embodiment of the present application;
FIG. 3 is a flow chart illustrating a sub-step of step 102 according to an embodiment of the present disclosure;
FIG. 4 is a flow chart illustrating a sub-step of step 103 disclosed in an embodiment of the present application;
FIG. 5 is a schematic flow chart illustrating another sub-step of step 103 disclosed in an embodiment of the present application;
FIG. 6 is a schematic flow chart illustrating still another sub-step of step 103 disclosed in an embodiment of the present application;
fig. 7 is a schematic flowchart illustrating a real-time voice quality inspection method according to a second embodiment of the present application;
fig. 8 is a schematic structural diagram of a real-time voice quality inspection apparatus according to a third embodiment of the present application;
fig. 9 is a schematic structural diagram of a real-time voice quality inspection apparatus according to a fourth embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance.
Example one
Referring to fig. 1, fig. 1 is a schematic flow chart illustrating a real-time voice quality inspection method according to an embodiment of the present application. As shown in fig. 1, the method comprises the steps of:
101. and slicing the target recording data to obtain at least two audio slices.
In the embodiment of the application, the target recording data can be collected by the collecting unit on the customer service robot.
In the embodiment of the application, the customer service robot acquires the target recording data and uploads the target recording data to the cloud server, so that the cloud server slices the target recording data and obtains at least two audio slices.
In the embodiment of the present application, slicing the target recording data may yield two or more audio slices; for example, a 20 MB recording file may be sliced into six audio slices.
In this embodiment of the application, the target recording data may be an audio file in an MP3 format, or an audio file in a WAV format, and the format of the target recording data is not limited in this application.
102. And performing effectiveness analysis on the at least two audio slices according to a preset first analysis rule so as to screen out effective audio slices from the at least two audio slices based on the effectiveness analysis result.
103. And analyzing the valid audio slices according to a preset second analysis rule to obtain quality inspection analysis information, wherein the quality inspection analysis information comprises at least one of emotion identification information, speech rate identification information, silence identification information, talk-over identification information and voice validity identification information.
By slicing the target recording data, performing validity analysis and then quality inspection analysis, the real-time voice quality inspection method obtains the analysis information required for quality inspection, which improves inspection efficiency and reduces inspection cost. Moreover, because the recording data undergoes validity analysis, invalid audio is filtered out and the total volume of data to be inspected is reduced, so that quality inspection is carried out only on valid recording data; this further improves inspection efficiency and saves server resources.
As an alternative implementation, please refer to fig. 2, and fig. 2 is a schematic flow chart of a sub-step of step 101 disclosed in the embodiments of the present application. As shown in fig. 2, step 101 comprises the sub-steps of:
1011. and determining the slice duration value according to the duration value corresponding to the target recording data and the capacity corresponding to the target recording data.
In this alternative embodiment, the capacity corresponding to the target recording data refers to its data size; for example, a 20 MB recording file has a capacity of 20 MB.
1012. And dividing the recording data according to the slice duration value to obtain at least two audio slices.
This optional implementation determines the slice duration value from the duration value of the recording data and the capacity of the recording file, and then divides the recording into audio slices accordingly, which makes the slicing of the recording data more reasonable and reduces the probability of producing audio slices whose capacity is too large or too small.
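Sub-steps 1011-1012 can be sketched as below. This is an illustrative sketch, not the patent's actual implementation: the `MAX_SLICE_BYTES` cap, the 16 kHz sample rate, and the flat sample list are assumptions introduced for the example.

```python
MAX_SLICE_BYTES = 4 * 1024 * 1024  # assumed per-slice capacity cap (4 MB)

def slice_duration_s(total_duration_s: float, total_bytes: int) -> float:
    """Pick a slice length so each slice stays under MAX_SLICE_BYTES."""
    n_slices = max(2, -(-total_bytes // MAX_SLICE_BYTES))  # ceil; at least two slices
    return total_duration_s / n_slices

def split_samples(samples: list, sample_rate: int, slice_s: float) -> list:
    """Split a flat sample list into consecutive audio slices."""
    step = max(1, int(slice_s * sample_rate))
    return [samples[i:i + step] for i in range(0, len(samples), step)]

# A small 2 MB, 60 s recording is still cut into at least two slices.
dur = slice_duration_s(60.0, 2 * 1024 * 1024)       # 30.0 s per slice
slices = split_samples(list(range(16000 * 60)), 16000, dur)
```

With these assumed numbers, a 60-second recording below the capacity cap is still divided into the minimum of two slices, matching the "at least two audio slices" requirement of step 101.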
As an alternative implementation, please refer to fig. 3, in which fig. 3 is a schematic flowchart of a sub-step of step 102 disclosed in the embodiments of the present application. As shown in fig. 3, step 102 comprises the sub-steps of:
1021. performing fast Fourier transform on each audio slice and obtaining frequency domain energy distribution corresponding to each audio slice;
1022. the frequency domain energy distribution corresponding to each audio slice is input into a second neural network.
In particular, the second neural network may be trained on a large number of training samples.
1023. And classifying the audio slice into one of an effective audio slice and an ineffective audio slice according to a preset matching rule based on an output result of the second neural network.
Illustratively, suppose an audio slice has an energy distribution between 10 Hz and 80 Hz. After the audio slice is input into the second neural network, the network converts this energy distribution into a digit string such as "000000012000000", and the audio slice can then be classified as valid or invalid according to the preset matching rule applied to the digit string.
For example, the preset matching rule may judge validity from the value of each digit and how long that value persists: in the string "000000012000000", 0 indicates no sound, 1 indicates sound present, and 2 indicates sound interrupted. If four consecutive "1"s occur in the string, the audio slice is judged valid; otherwise it is invalid.
The optional embodiment can introduce the frequency domain energy distribution of the audio slice into the second neural network, so that the audio slice can be classified as one of a valid audio slice and an invalid audio slice based on the output result of the second neural network.
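The matching rule described above can be sketched as follows. The digit-string encoding and the four-consecutive-"1"s threshold follow the example in the text; the function name is an invention of this sketch, and the second neural network itself is out of scope here.

```python
def is_valid_slice(digit_string: str, min_run: int = 4) -> bool:
    """Preset matching rule sketch: 0 = no sound, 1 = sound present,
    2 = sound interrupted. A slice is valid when the string contains
    a run of at least min_run consecutive '1's."""
    return "1" * min_run in digit_string

print(is_valid_slice("000000012000000"))  # only one '1' -> False (invalid)
print(is_valid_slice("000111120000000"))  # four '1's in a row -> True (valid)
```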
As an alternative implementation, please refer to fig. 4, in which fig. 4 is a schematic flowchart of a sub-step of step 103 disclosed in the embodiments of the present application. As shown in fig. 4, step 103 comprises the sub-steps of:
1031. counting the emotional factor proportions in the valid audio slice, wherein the emotional factor proportions comprise at least one of an anger proportion, a fear proportion, a happiness proportion, a sadness proportion and a neutral proportion;
1032. and calculating the emotion recognition information from the emotional factor proportions.
Specifically, calculating the emotion recognition information from the emotional factor proportions includes:
respectively calculating, over the plurality of valid slices, the average anger proportion, the average fear proportion, the average happiness proportion, the average sadness proportion and the average neutral proportion;
and calculating the emotion recognition information from these five averages.
Illustratively, suppose the average anger, fear, happiness, sadness and neutral proportions are 0.1, 0.2, 0.3, 0.4 and 0.5 respectively. Since the neutral average is the highest, the emotion recognition information of the target audio file is "neutral".
According to this optional implementation, the emotion recognition information is calculated from the emotional factor proportions, realizing the emotion recognition function and making it convenient to judge whether the customer service agent's emotion meets requirements during service.
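Sub-steps 1031-1032 can be sketched as below: average each emotion-factor proportion over the valid slices, then report the emotion with the highest average. The per-slice proportion values are invented illustration data, and the dictionary representation is an assumption of this sketch, not the patent's data format.

```python
def recognize_emotion(slice_ratios: list) -> str:
    """slice_ratios: one dict of emotion -> proportion per valid audio slice.
    Returns the emotion whose average proportion across slices is highest."""
    emotions = slice_ratios[0].keys()
    averages = {e: sum(r[e] for r in slice_ratios) / len(slice_ratios)
                for e in emotions}
    return max(averages, key=averages.get)

# Invented illustration data mirroring the 0.1 .. 0.5 averages in the text.
ratios = [
    {"anger": 0.1, "fear": 0.2, "happiness": 0.3, "sadness": 0.4, "neutral": 0.5},
    {"anger": 0.1, "fear": 0.2, "happiness": 0.3, "sadness": 0.4, "neutral": 0.5},
]
print(recognize_emotion(ratios))  # -> neutral
```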
As an alternative implementation, please refer to fig. 5, which is a schematic flow chart of another sub-step of step 103 disclosed in the embodiments of the present application. As shown in fig. 5, step 103 comprises the sub-steps of:
1033. extracting a phoneme sequence from the valid audio slice;
1034. extracting the duration corresponding to each phoneme in the phoneme sequence;
1035. counting the total duration of the phoneme sequence according to the duration corresponding to each phoneme;
1036. and calculating to obtain the speech speed identification information according to the total duration of the phoneme sequence and the number of phonemes in the phoneme sequence.
Illustratively, assume the content of an audio slice is the word "I" (pinyin "wo"). The corresponding phoneme sequence "wo" contains 2 phonemes ("w", "o"). If the duration of "w" is 0.5 s and the duration of "o" is 0.3 s, the total duration of the phoneme sequence is 0.8 s, and the speech rate information of the audio slice is 0.4 s per phoneme.
This optional embodiment generates the speech rate identification information of an audio slice from the durations of its phonemes, so that it can be judged whether the customer service agent speaks too fast or too slowly.
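The speech rate calculation of sub-steps 1033-1036 can be sketched as follows, reusing the "wo" example above. The list-of-durations representation is an assumption; extracting the phoneme sequence itself (e.g. by forced alignment) is out of scope here.

```python
def speech_rate(phoneme_durations: list) -> float:
    """Seconds per phoneme for one valid audio slice: total duration of the
    phoneme sequence divided by the number of phonemes in it."""
    return sum(phoneme_durations) / len(phoneme_durations)

# "wo" -> phonemes "w" (0.5 s) and "o" (0.3 s): 0.8 s total, 0.4 s/phoneme.
print(speech_rate([0.5, 0.3]))
```

A downstream rule could then flag the agent as speaking too fast or too slow by comparing this value against assumed lower and upper thresholds.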
As an alternative implementation, please refer to fig. 6, in which fig. 6 is a schematic flowchart of another sub-step of step 103 disclosed in the embodiments of the present application. As shown in fig. 6, step 103 comprises the sub-steps of:
1037. judging whether the phoneme sequence has an interval sequence meeting a mute condition, and if so, generating mute identification information;
1038. extracting first track audio and second track audio in the valid audio slice;
1039. and generating the speech robbing identification information based on the overlapping interval of the first track audio and the second track audio.
In this alternative embodiment, the valid audio slice may include a first track audio and a second track audio, where the first track is the customer's audio and the second track is the customer service agent's audio; thus, if the first and second tracks overlap, the agent is talking over the customer.
In this optional embodiment, the talk-over identification information can optionally include time point information and talk-over duration information.
This optional embodiment generates silence identification information from the phoneme sequence, so that it can be judged whether the customer service agent failed to respond to the user's questions in time. It can also judge whether the agent talks over the customer according to the first and second audio tracks in the valid audio slice.
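Sub-steps 1037-1039 can be sketched as below: detect silent gaps in a phoneme sequence and overlapping (talk-over) intervals between the customer track and the agent track. The `(start_s, end_s)` tuple representation and the 2-second silence threshold are assumptions introduced for this example.

```python
def silence_intervals(phoneme_intervals: list, min_gap_s: float = 2.0) -> list:
    """Gaps between consecutive phoneme intervals longer than min_gap_s
    are reported as silence (a candidate for silence identification info)."""
    gaps = []
    for (_, end), (start, _) in zip(phoneme_intervals, phoneme_intervals[1:]):
        if start - end >= min_gap_s:
            gaps.append((end, start))
    return gaps

def overlap_intervals(track_a: list, track_b: list) -> list:
    """Intervals where both tracks carry speech at once (talk-over),
    giving the time point and duration of each occurrence."""
    overlaps = []
    for a_start, a_end in track_a:
        for b_start, b_end in track_b:
            start, end = max(a_start, b_start), min(a_end, b_end)
            if start < end:
                overlaps.append((start, end))
    return overlaps

print(silence_intervals([(0.0, 1.0), (4.0, 5.0)]))    # -> [(1.0, 4.0)]
print(overlap_intervals([(0.0, 3.0)], [(2.0, 6.0)]))  # -> [(2.0, 3.0)]
```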
Example two
Referring to fig. 7, fig. 7 is a schematic flowchart of a real-time voice quality inspection method according to the second embodiment of the present application. As shown in fig. 7, the method includes the steps of:
201. recognizing scene characteristics in target recording data by utilizing a preset first neural network;
202. determining the scene type corresponding to the target recording data according to the scene characteristics;
203. slicing the target recording data to obtain at least two audio slices;
204. performing effectiveness analysis on the at least two audio slices according to a preset first analysis rule so as to screen out effective audio slices from the at least two audio slices based on an effectiveness analysis result;
205. and analyzing the effective audio slice according to a preset second analysis rule to obtain quality inspection analysis information, wherein the quality inspection analysis information includes at least one of emotion identification information, speech rate identification information, silence identification information, call preemption identification information and voice validity identification information.
In addition to the technical effects achieved by the first embodiment, this embodiment can determine the scene type of the target recording data by using the preset first neural network, so that only the target recording data of a specific scene needs to be rechecked during review, thereby improving review efficiency.
EXAMPLE III
Please refer to fig. 8, fig. 8 is a schematic structural diagram of a real-time voice quality inspection apparatus according to an embodiment of the present application. As shown in fig. 8, the apparatus includes:
a slicing processing module 301, configured to slice the target audio record data and obtain at least two audio slices;
a first analysis module 302, configured to perform validity analysis on at least two audio slices according to a preset first analysis rule, so as to screen out valid audio slices from the at least two audio slices based on a result of the validity analysis;
the second analysis module 303 is configured to analyze the valid audio slice according to a preset second analysis rule and obtain quality inspection analysis information, where the quality inspection analysis information includes at least one of emotion identification information, speech rate identification information, silence identification information, call preemption identification information, and voice validity identification information.
In the embodiment of the application, the target recording data can be collected by the collecting unit on the customer service robot.
In the embodiment of the application, the customer service robot acquires the target recording data and uploads the target recording data to the cloud server, so that the cloud server slices the target recording data and obtains at least two audio slices.
In the embodiment of the present application, slicing the target recording data may yield, for example, two audio slices or six audio slices. For example, a 20 MB recording file is sliced into six audio slices.
In this embodiment of the application, the target recording data may be an audio file in an MP3 format, or an audio file in a WAV format, and the format of the target recording data is not limited in this application.
By executing the real-time voice quality inspection method, the real-time voice quality inspection device of the embodiment of the present application can obtain the quality inspection information required for quality inspection through slicing, validity analysis and quality inspection analysis of the target recording data, thereby improving quality inspection efficiency and reducing quality inspection cost. On the other hand, because the recording data is subjected to validity analysis, the total volume of recording data to be inspected is reduced, so that quality inspection is performed only on valid recording data, which further improves quality inspection efficiency and saves server resources.
As an alternative embodiment, the slice processing module 301 may include:
the determining submodule is used for determining a slice duration value according to the duration value corresponding to the target recording data and the capacity corresponding to the target recording data;
and the dividing submodule is used for dividing the recording data according to the slice duration value to obtain at least two audio slices.
This optional implementation can determine the slice duration value from the duration and the capacity of the recording data, and then divide the recording into at least two audio slices according to that value. In this way, the slicing of the recording data is more reasonable, and the probability of problems such as oversized audio slices is reduced.
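A minimal sketch of this determination, under the assumption that the slice duration is chosen so that no slice exceeds a per-slice capacity cap; the cap, the minimum slice count, and the even-split policy are illustrative, not specified by the patent.

```python
import math

def slice_points(total_s, total_mb, max_slice_mb=4.0, min_slices=2):
    """Determine a slice duration from the recording's duration and
    capacity, then return (start_s, end_s) offsets for each slice.
    At least min_slices slices are produced, and the slice count is
    raised until each slice stays under the assumed max_slice_mb cap."""
    n = max(min_slices, math.ceil(total_mb / max_slice_mb))
    slice_s = total_s / n
    return [(i * slice_s, min((i + 1) * slice_s, total_s))
            for i in range(n)]
```

For example, a 20 MB, 600 s recording with a 4 MB cap is split into five 120 s slices.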
As an alternative embodiment, the first analysis module 302 includes:
the transform submodule is used for carrying out fast Fourier transform on each audio slice and obtaining frequency domain energy distribution corresponding to each audio slice;
an input sub-module for inputting the frequency domain energy distribution corresponding to each audio slice into a second neural network;
and the classification submodule is used for classifying the audio slice into one of an effective audio slice and an ineffective audio slice according to a preset matching rule based on the output result of the second neural network.
Illustratively, assume an audio slice with an energy distribution between 10 Hz and 80 Hz. After the audio slice is input into the second neural network, the second neural network converts that energy distribution into a digit string such as "000000012000000", and the audio slice can then be classified as one of a valid audio slice and an invalid audio slice according to the preset matching rule and the digit string.
As another example, the preset matching rule may determine whether the audio slice is invalid according to the value of each digit in the digit string and how long that value persists. For example, for the digit string "000000012000000", where 0 indicates no sound, 1 indicates sound, and 2 indicates sound interruption, if four "1"s occur consecutively in the digit string, the audio slice may be determined to be valid; otherwise, the audio slice is invalid.
The optional embodiment can introduce the frequency domain energy distribution of the audio slice into the second neural network, so that the audio slice can be classified as one of a valid audio slice and an invalid audio slice based on the output result of the second neural network.
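The pipeline above can be sketched as follows, with the second neural network mocked by a simple per-frame energy threshold that emits the kind of digit string described in the example; the frame length, the dB threshold, and the four-in-a-row rule are illustrative assumptions.

```python
import numpy as np

def energy_code(samples, rate, frame_s=0.05, on_db=-30.0):
    """Frame the slice, take an FFT per frame, and emit '1' for frames
    whose frequency-domain energy exceeds the threshold, '0' otherwise.
    (A thresholding stand-in for the second neural network's output.)"""
    frame = int(rate * frame_s)
    code = []
    for i in range(0, len(samples) - frame + 1, frame):
        spectrum = np.fft.rfft(samples[i:i + frame])
        power = float(np.mean(np.abs(spectrum) ** 2))
        db = 10.0 * np.log10(power + 1e-12)  # avoid log of zero
        code.append("1" if db > on_db else "0")
    return "".join(code)

def is_valid(code, run=4):
    """Matching rule from the text: valid if '1' occurs run times in a row."""
    return "1" * run in code
```

A half-second of silence followed by a half-second 440 Hz tone, for instance, produces a run of "0"s followed by a run of "1"s and is classified as valid.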
As an alternative implementation, the second analysis module 303 includes:
the first statistics submodule is used for counting the emotional factor proportions in the valid audio slice, wherein the emotional factor proportions include at least one of an anger proportion, a fear proportion, a happiness proportion, a sadness proportion and a neutral proportion;
and the first calculation submodule is used for calculating the emotion identification information according to the emotional factor proportions.
Specifically, calculating the emotion identification information according to the emotional factor proportions includes:
calculating, over a plurality of valid slices, the average of the anger proportion, the average of the fear proportion, the average of the happiness proportion, the average of the sadness proportion and the average of the neutral proportion respectively;
and calculating the emotion identification information according to these averages.
Illustratively, suppose the averages of the anger proportion, the fear proportion, the happiness proportion, the sadness proportion and the neutral proportion are 0.1, 0.2, 0.3, 0.4 and 0.5 respectively. The average of the neutral proportion is the highest, so the emotion identification information of the target recording file is "neutral".
According to the optional implementation mode, the emotion recognition information can be obtained through calculation according to the proportion of the emotion factors, so that the emotion recognition function can be realized, and whether the emotion meets the requirements or not in the service process of the customer service is convenient to judge.
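A sketch of this calculation, assuming each valid slice yields a mapping from emotion name to proportion; the dict representation and emotion names are illustrative.

```python
def emotion_label(slice_ratios):
    """slice_ratios: one dict of emotion -> proportion per valid slice.
    Averages each emotion across the slices and returns the emotion
    with the highest mean as the emotion identification information."""
    emotions = slice_ratios[0].keys()
    means = {e: sum(r[e] for r in slice_ratios) / len(slice_ratios)
             for e in emotions}
    return max(means, key=means.get)
```

For instance, if the neutral proportion has the highest average across the slices, the label "neutral" is returned, matching the example above.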
As an alternative implementation, the second analysis module 303 includes:
a second extraction submodule for extracting phoneme sequences from the valid audio slice;
the third extraction submodule is used for extracting the duration corresponding to each phoneme in the phoneme sequence;
the second statistic submodule is used for counting the total duration of the phoneme sequence according to the duration corresponding to each phoneme;
and the second calculation submodule is used for calculating the speech speed identification information according to the total duration of the phoneme sequence and the number of phonemes in the phoneme sequence.
Illustratively, assuming that the content of an audio slice is "I" (Chinese "wo"), the phoneme sequence corresponding to the audio slice is "wo", which includes 2 phonemes ("w", "o"). If the duration corresponding to "w" is 0.5 s and the duration corresponding to "o" is 0.3 s, the total duration of the phoneme sequence is 0.8 s, and the speech rate information of the audio slice is "0.4 s/phoneme".
The optional embodiment can generate the speech speed identification information of the audio slice according to the duration of the phoneme of the audio slice, so that whether the customer service talks too fast or too slow can be judged.
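The computation in the "wo" example can be sketched as follows; the (label, duration) pair layout is an illustrative assumption.

```python
def speech_rate_info(phonemes):
    """phonemes: list of (label, duration_s) pairs extracted from the
    valid audio slice. Returns the total duration of the phoneme
    sequence and the average seconds per phoneme."""
    total = sum(d for _, d in phonemes)
    return {"total_s": total, "s_per_phoneme": total / len(phonemes)}
```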
As an alternative implementation, the second analysis module 303 includes:
and the judging submodule is used for judging whether the phoneme sequence has an interval sequence meeting the mute condition, and if so, generating mute identification information.
This optional embodiment can generate the silence identification information according to the phoneme sequence, so that it can be judged whether the customer service agent failed to respond to the user in time.
As an optional implementation, the second analysis module 303 further includes:
the fourth extraction submodule is used for extracting the first track audio and the second track audio in the valid audio slice;
and the generation submodule is used for generating the call preemption identification information based on the overlapping interval of the first track audio and the second track audio.
This optional implementation can judge whether the customer service agent preempted the call according to the first track audio and the second track audio in the valid audio slice.
In this alternative embodiment, the valid audio slice may include a first track of audio and a second track of audio, where the first track of audio is customer audio and the second track of audio is customer service audio, such that an overlap between the first track audio and the second track audio indicates that the customer service agent has preempted the customer's speech.
In this optional embodiment, the call preemption identification information may include time point information and call preemption duration information.
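The overlap test can be sketched as follows, assuming each track has already been reduced to (start, end) speech intervals; that interval representation is an assumption, not something the patent specifies.

```python
def call_preemption(customer, agent):
    """customer, agent: lists of (start_s, end_s) speech intervals for
    the first (customer) and second (customer-service) audio tracks.
    Returns the overlapping intervals as call preemption identification
    information: a time point and a duration for each overlap."""
    hits = []
    for c0, c1 in customer:
        for a0, a1 in agent:
            start, end = max(c0, a0), min(c1, a1)
            if end > start:  # non-empty overlap means preemption
                hits.append({"start_s": start, "duration_s": end - start})
    return hits
```

For example, if the customer speaks from 0.0 s to 2.0 s and the agent starts at 1.5 s, a 0.5 s preemption starting at 1.5 s is reported.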
As an optional implementation manner, the real-time voice quality inspection apparatus further includes:
the recognition module is used for recognizing scene characteristics in the target recording data by utilizing a preset first neural network;
the determining module is used for determining the scene type corresponding to the target recording data according to the scene characteristics.
according to the optional implementation mode, the preset first neural network can be used for determining the scene type of the target recording data, so that the target recording data in a specific scene can be subjected to rechecking during rechecking, and the rechecking efficiency can be improved.
Example four
Referring to fig. 9, fig. 9 is a schematic structural diagram of a real-time voice quality inspection apparatus according to an embodiment of the present application. As shown in fig. 9, the apparatus includes:
a processor 402; and
the memory 401 is configured to store machine readable instructions, which when executed by the processor 402, cause the processor 402 to perform the steps of the real-time voice quality inspection method according to any one of the first to second embodiments of the present application.
By executing the real-time voice quality inspection method, the real-time voice quality inspection equipment of the embodiment of the present application can obtain the quality inspection information required for quality inspection through slicing, validity analysis and quality inspection analysis of the target recording data, thereby improving quality inspection efficiency and reducing quality inspection cost. On the other hand, because the recording data is subjected to validity analysis, the total volume of recording data to be inspected is reduced, so that quality inspection is performed only on valid recording data, which further improves quality inspection efficiency and saves server resources.
EXAMPLE five
The embodiment of the present application discloses a computer-readable storage medium, wherein a computer program is stored in the computer-readable storage medium, and the computer program is executed by a processor to perform the steps in the real-time voice quality inspection method according to any one of the first embodiment to the second embodiment of the present application.
Through execution of the real-time voice quality inspection method, the computer-readable storage medium of the embodiment of the present application enables the quality inspection information required for quality inspection to be obtained through slicing, validity analysis and quality inspection analysis of the target recording data, thereby improving quality inspection efficiency and reducing quality inspection cost. On the other hand, because the recording data is subjected to validity analysis, the total volume of recording data to be inspected is reduced, so that quality inspection is performed only on valid recording data, which further improves quality inspection efficiency and saves server resources.
EXAMPLE six
The embodiment of the application discloses a computer program product, which comprises a non-transitory computer readable storage medium storing a computer program, wherein the computer program is operable to make a computer execute a real-time voice quality inspection method according to any one of the first embodiment to the second embodiment of the application.
Through execution of the real-time voice quality inspection method, the computer program product of the embodiment of the present application enables the quality inspection information required for quality inspection to be obtained through slicing, validity analysis and quality inspection analysis of the target recording data, thereby improving quality inspection efficiency and reducing quality inspection cost. On the other hand, because the recording data is subjected to validity analysis, the total volume of recording data to be inspected is reduced, so that quality inspection is performed only on valid recording data, which further improves quality inspection efficiency and saves server resources.
In the embodiments disclosed in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application. It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Claims (10)

1. A real-time voice quality inspection method is characterized by comprising the following steps:
slicing the target recording data to obtain at least two audio slices;
performing effectiveness analysis on the at least two audio slices according to a preset first analysis rule so as to screen out effective audio slices from the at least two audio slices based on an effectiveness analysis result;
and analyzing the effective audio slice according to a preset second analysis rule to obtain quality inspection analysis information, wherein the quality inspection analysis information includes at least one of emotion identification information, speech rate identification information, silence identification information, call preemption identification information and voice validity identification information.
2. The real-time voice quality inspection method of claim 1, wherein slicing the target recording data to obtain at least two audio slices comprises:
determining a slice duration value according to the duration value corresponding to the target recording data and the capacity corresponding to the target recording data;
and dividing the recording data according to the slice duration value to obtain the at least two audio slices.
3. The real-time voice quality inspection method according to any one of claims 1-2, wherein performing validity analysis on the at least two audio slices according to a preset first analysis rule comprises:
performing fast Fourier transform on each audio slice and obtaining frequency domain energy distribution corresponding to each audio slice;
inputting the frequency domain energy distribution corresponding to each audio slice into a second neural network;
and classifying the audio slice into one of a valid audio slice and an invalid audio slice according to a preset matching rule based on an output result of the second neural network.
4. The method of claim 1, wherein analyzing the valid audio slice according to a preset second analysis rule and obtaining quality control analysis information comprises:
counting the emotional factor proportions in the effective audio slice, wherein the emotional factor proportions include at least one of an anger proportion, a fear proportion, a happiness proportion, a sadness proportion and a neutral proportion;
and calculating the emotion identification information according to the emotional factor proportions.
5. The method of claim 1, wherein analyzing the valid audio slice according to a predetermined second analysis rule and obtaining quality analysis information comprises:
extracting a sequence of phonemes from the valid audio slice;
extracting a duration corresponding to each phoneme in the phoneme sequence;
counting the total duration of the phoneme sequence according to the duration corresponding to each phoneme;
and calculating to obtain the speech rate identification information according to the total duration of the phoneme sequence and the number of phonemes in the phoneme sequence.
6. The method of claim 5, wherein analyzing the valid audio slice according to a predetermined second analysis rule and obtaining quality analysis information comprises:
judging whether the phoneme sequence has an interval sequence meeting a mute condition, and if so, generating mute identification information;
and, the method further comprises:
extracting first and second track audio in the valid audio slice;
and generating the call preemption identification information based on the overlapping interval of the first track audio and the second track audio.
7. The real-time voice quality inspection method of claim 1, wherein before slicing the target recording data and obtaining at least two audio slices, the method further comprises:
recognizing scene features in the target recording data by using a preset first neural network;
and determining the scene type corresponding to the target recording data according to the scene features.
8. A real-time voice quality inspection apparatus, the apparatus comprising:
the slicing processing module is used for carrying out slicing processing on the target recording data and obtaining at least two audio slices;
the first analysis module is used for carrying out effectiveness analysis on the at least two audio slices according to a preset first analysis rule so as to screen out effective audio slices from the at least two audio slices based on an effectiveness analysis result;
and the second analysis module is used for analyzing the effective audio slice according to a preset second analysis rule and obtaining quality inspection analysis information, wherein the quality inspection analysis information includes at least one of emotion identification information, speech rate identification information, silence identification information, call preemption identification information and voice validity identification information.
9. A real-time voice quality inspection apparatus, the apparatus comprising:
a processor; and
a memory configured to store machine readable instructions that, when executed by the processor, cause the processor to perform the real-time voice quality inspection method of any one of claims 1-7.
10. A computer storage medium, characterized in that the computer storage medium stores a computer program, characterized in that the computer program is executed by a processor to perform the real-time voice quality inspection method according to any one of claims 1-7.
CN201911018521.9A 2019-10-24 2019-10-24 Real-time voice quality inspection method, device, equipment and computer storage medium Pending CN110728996A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911018521.9A CN110728996A (en) 2019-10-24 2019-10-24 Real-time voice quality inspection method, device, equipment and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911018521.9A CN110728996A (en) 2019-10-24 2019-10-24 Real-time voice quality inspection method, device, equipment and computer storage medium

Publications (1)

Publication Number Publication Date
CN110728996A true CN110728996A (en) 2020-01-24

Family

ID=69221916

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911018521.9A Pending CN110728996A (en) 2019-10-24 2019-10-24 Real-time voice quality inspection method, device, equipment and computer storage medium

Country Status (1)

Country Link
CN (1) CN110728996A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111368130A (en) * 2020-02-26 2020-07-03 深圳前海微众银行股份有限公司 Quality inspection method, device and equipment for customer service recording and storage medium
CN111627461A (en) * 2020-05-29 2020-09-04 平安医疗健康管理股份有限公司 Voice quality inspection method and device, server and storage medium
CN112037819A (en) * 2020-09-03 2020-12-04 阳光保险集团股份有限公司 Voice quality inspection method and device based on semantics
CN114374924A (en) * 2022-01-07 2022-04-19 上海纽泰仑教育科技有限公司 Recording quality detection method and related device

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103716470A (en) * 2012-09-29 2014-04-09 华为技术有限公司 Method and device for speech quality monitoring
CN106611604A (en) * 2015-10-23 2017-05-03 中国科学院声学研究所 An automatic voice summation tone detection method based on a deep neural network
CN107204195A (en) * 2017-05-19 2017-09-26 四川新网银行股份有限公司 A kind of intelligent quality detecting method analyzed based on mood
US20170310820A1 (en) * 2016-04-26 2017-10-26 Fmr Llc Determining customer service quality through digitized voice characteristic measurement and filtering
CN107452405A (en) * 2017-08-16 2017-12-08 北京易真学思教育科技有限公司 A kind of method and device that data evaluation is carried out according to voice content
CN107464573A (en) * 2017-09-06 2017-12-12 竹间智能科技(上海)有限公司 A kind of new customer service call quality inspection system and method
CN109448730A (en) * 2018-11-27 2019-03-08 广州广电运通金融电子股份有限公司 A kind of automatic speech quality detecting method, system, device and storage medium
CN109600526A (en) * 2019-01-08 2019-04-09 上海上湖信息技术有限公司 Customer service quality determining method and device, readable storage medium storing program for executing
CN109830246A (en) * 2019-01-25 2019-05-31 北京海天瑞声科技股份有限公司 Audio quality appraisal procedure, device, electronic equipment and storage medium

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103716470A (en) * 2012-09-29 2014-04-09 华为技术有限公司 Method and device for speech quality monitoring
CN106611604A (en) * 2015-10-23 2017-05-03 中国科学院声学研究所 An automatic voice summation tone detection method based on a deep neural network
US20170310820A1 (en) * 2016-04-26 2017-10-26 Fmr Llc Determining customer service quality through digitized voice characteristic measurement and filtering
CN107204195A (en) * 2017-05-19 2017-09-26 四川新网银行股份有限公司 A kind of intelligent quality detecting method analyzed based on mood
CN107452405A (en) * 2017-08-16 2017-12-08 北京易真学思教育科技有限公司 A kind of method and device that data evaluation is carried out according to voice content
CN107464573A (en) * 2017-09-06 2017-12-12 竹间智能科技(上海)有限公司 A kind of new customer service call quality inspection system and method
CN109448730A (en) * 2018-11-27 2019-03-08 广州广电运通金融电子股份有限公司 A kind of automatic speech quality detecting method, system, device and storage medium
CN109600526A (en) * 2019-01-08 2019-04-09 上海上湖信息技术有限公司 Customer service quality determining method and device, readable storage medium storing program for executing
CN109830246A (en) * 2019-01-25 2019-05-31 北京海天瑞声科技股份有限公司 Audio quality appraisal procedure, device, electronic equipment and storage medium

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111368130A (en) * 2020-02-26 2020-07-03 深圳前海微众银行股份有限公司 Quality inspection method, device and equipment for customer service recording and storage medium
CN111627461A (en) * 2020-05-29 2020-09-04 平安医疗健康管理股份有限公司 Voice quality inspection method and device, server and storage medium
CN112037819A (en) * 2020-09-03 2020-12-04 阳光保险集团股份有限公司 Voice quality inspection method and device based on semantics
CN114374924A (en) * 2022-01-07 2022-04-19 上海纽泰仑教育科技有限公司 Recording quality detection method and related device
CN114374924B (en) * 2022-01-07 2024-01-19 上海纽泰仑教育科技有限公司 Recording quality detection method and related device

Similar Documents

Publication Publication Date Title
CN110728996A (en) Real-time voice quality inspection method, device, equipment and computer storage medium
US10642889B2 (en) Unsupervised automated topic detection, segmentation and labeling of conversations
CN112804400B (en) Customer service call voice quality inspection method and device, electronic equipment and storage medium
US10878823B2 (en) Voiceprint recognition method, device, terminal apparatus and storage medium
US9875739B2 (en) Speaker separation in diarization
US8412530B2 (en) Method and apparatus for detection of sentiment in automated transcriptions
CN108039181B (en) Method and device for analyzing emotion information of sound signal
CN111128223B (en) Text information-based auxiliary speaker separation method and related device
CN108986830B (en) Audio corpus screening method and device
CN111080109A (en) Customer service quality evaluation method and device and electronic equipment
CN111462758A (en) Method, device and equipment for intelligent conference role classification and storage medium
CN106548786A (en) A kind of detection method and system of voice data
CN107680584B (en) Method and device for segmenting audio
JP2017167726A (en) Conversation analyzer, method and computer program
CN114666618A (en) Audio auditing method, device, equipment and readable storage medium
CN114610840A (en) Sensitive word-based accounting monitoring method, device, equipment and storage medium
JP4201204B2 (en) Audio information classification device
CN111010484A (en) Automatic quality inspection method for call recording
Xie et al. Acoustic feature extraction using perceptual wavelet packet decomposition for frog call classification
CN114822557A (en) Method, device, equipment and storage medium for distinguishing different sounds in classroom
CN114049898A (en) Audio extraction method, device, equipment and storage medium
CN114038487A (en) Audio extraction method, device, equipment and readable storage medium
CN114446284A (en) Speaker log generation method and device, computer equipment and readable storage medium
CN111640450A (en) Multi-person audio processing method, device, equipment and readable storage medium
CN114065742B (en) Text detection method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200124