CN111696527B - Method and device for positioning voice quality inspection area, positioning equipment and storage medium - Google Patents

Method and device for positioning voice quality inspection area, positioning equipment and storage medium

Info

Publication number
CN111696527B
CN111696527B (application CN202010544756.8A)
Authority
CN
China
Prior art keywords
quality inspection
voice
link
text
processed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010544756.8A
Other languages
Chinese (zh)
Other versions
CN111696527A (en)
Inventor
聂镭
邹茂泰
聂颖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Longma Zhixin Zhuhai Hengqin Technology Co ltd
Original Assignee
Longma Zhixin Zhuhai Hengqin Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Longma Zhixin Zhuhai Hengqin Technology Co ltd filed Critical Longma Zhixin Zhuhai Hengqin Technology Co ltd
Priority to CN202010544756.8A priority Critical patent/CN111696527B/en
Publication of CN111696527A publication Critical patent/CN111696527A/en
Application granted granted Critical
Publication of CN111696527B publication Critical patent/CN111696527B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/04 Segmentation; Word boundary detection
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/26 Speech to text systems
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/60 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application is applicable to the technical field of voice processing and provides a method, a device, positioning equipment and a storage medium for positioning a voice quality inspection area. The method comprises the following steps: acquiring a voice text to be processed; inputting the voice text segments of the voice text to be processed into a preset cnn model to obtain the quality inspection link corresponding to each voice text segment; and determining quality inspection links of the same category, and taking the areas where the voice text segments corresponding to the quality inspection links of the same category are located as quality inspection areas. By exploiting the good robustness of the preset cnn model, the application achieves higher robustness and a better recognition effect than the prior-art approach of determining the quality inspection area of a voice text through keyword recognition.

Description

Method and device for positioning voice quality inspection area, positioning equipment and storage medium
Technical Field
The present application belongs to the field of speech processing technologies, and in particular relates to a method and an apparatus for positioning a speech quality inspection area, a positioning device, and a storage medium.
Background
For the purposes of improving customer satisfaction, improving customer service, evaluating the work of customer service staff, and the like, quality inspection of voice calls between the customer service staff and the customer is generally required, for example, the voice calls between the customer service staff and the customer in the insurance industry are inspected to find out violation points.
The traditional quality inspection mode is manual quality inspection. However, because the volume of voice calls between customer service personnel and customers is too large, manual quality inspection is inefficient and consumes considerable manpower and material resources.
The current quality inspection mode is intelligent quality inspection, and its principle is as follows: after the voice call between the customer service personnel and the customer is converted into a voice text, the quality inspection area of the voice text is determined through a keyword technology, and finally the quality inspection area is inspected to find out violation points within it. However, in the existing intelligent quality inspection process, identifying the quality inspection area of the voice text by the keyword technology suffers from low robustness, a poor recognition effect, and similar defects.
Disclosure of Invention
The embodiment of the application provides a method and a device for positioning a voice quality inspection area, which can solve the problems of low robustness and poor recognition effect that arise when the quality inspection area of a voice text is recognized through a keyword technology in the existing intelligent quality inspection process.
In a first aspect, an embodiment of the present application provides a method for positioning a voice quality inspection area, including:
acquiring a voice text to be processed; wherein the voice text to be processed comprises at least one voice text segment;
inputting the voice text segment of the voice text to be processed to a preset cnn model to obtain a quality inspection link corresponding to the voice text segment;
and determining the quality inspection links of the same type, and taking the areas of the voice text segments corresponding to the quality inspection links of the same type as quality inspection areas.
In a possible implementation manner of the first aspect, before the obtaining the to-be-processed speech text, the method further includes:
acquiring voice audio to be processed;
and converting the voice audio to be processed into a voice text to be processed.
In a possible implementation manner of the first aspect, converting the to-be-processed speech audio into to-be-processed speech text includes:
separating out a target voice audio frequency from the voice audio frequency to be processed;
performing role object segmentation on the target voice audio to obtain a voice audio segment; wherein each voice audio clip corresponds to a character object;
and converting the voice audio fragment into a voice text fragment, and forming the voice text to be processed according to the voice text fragment.
In a possible implementation manner of the first aspect, before the step of inputting the speech text segment of the speech text to be processed to a preset cnn model and obtaining a quality inspection link corresponding to the speech text segment, the method further includes:
acquiring a voice text sample; wherein the speech text sample comprises at least one speech text sample segment;
marking a quality inspection link corresponding to the voice text sample fragment;
and inputting the voice text sample fragment and the quality inspection link corresponding to the voice text sample fragment into a cnn model for training to obtain the preset cnn model.
In a possible implementation manner of the first aspect, before determining the quality inspection links of the same category and taking the areas where the speech text segments corresponding to the quality inspection links of the same category are located as a quality inspection area, the method further includes:
acquiring a voice audio clip corresponding to the voice text clip;
and verifying a quality inspection link corresponding to the voice text fragment according to the voice audio fragment.
In a possible implementation manner of the first aspect, verifying, according to the voice audio segment, a quality inspection link corresponding to the voice text segment includes:
determining the sequence number and the total number of the voice audio clips;
calculating the relative position of the quality inspection link corresponding to the voice audio clip according to the following formula:
Percent=A/B,
wherein Percent represents the relative position of the quality inspection link corresponding to the voice audio clip, A represents the serial number of the voice audio clip, and B represents the total number of voice audio clips;
judging whether the relative position of the quality inspection link is within a relative position interval corresponding to the quality inspection link in a preset box diagram;
if not, the quality inspection link is changed into an abnormal link.
In a possible implementation manner of the first aspect, determining quality inspection links of the same category, and taking areas where the voice text segments respectively corresponding to the quality inspection links of the same category are located as quality inspection areas includes:
performing cluster analysis on the quality inspection links of the same category to obtain a quality inspection link set;
determining a starting quality inspection link and an ending quality inspection link in the quality inspection link set;
and taking the areas where the voice text segments respectively corresponding to the initial quality inspection link, the intermediate quality inspection link between the initial quality inspection link and the ending quality inspection link, and the ending quality inspection link are located as quality inspection areas.
In a second aspect, an embodiment of the present application provides a device for locating a voice quality inspection area, including:
the acquisition module is used for acquiring a voice text to be processed; wherein the voice text to be processed comprises at least one voice text segment;
the link division module is used for inputting the voice text segment of the voice text to be processed into a preset cnn model to obtain a quality inspection link corresponding to the voice text segment;
and the positioning module is used for determining quality inspection links of the same type and taking the area of the voice text segment corresponding to the quality inspection links of the same type as a quality inspection area.
In a possible implementation manner of the second aspect, the positioning apparatus further includes:
the acquisition submodule is used for acquiring the voice audio to be processed;
and the conversion submodule is used for converting the voice audio to be processed into a voice text to be processed.
In a possible implementation manner of the second aspect, the conversion sub-module includes:
the separation unit is used for separating target voice audio in the voice audio to be processed;
the cutting unit is used for carrying out role object segmentation on the target voice audio to obtain a voice audio segment; wherein each voice audio clip corresponds to a character object;
and the conversion unit is used for converting the voice audio fragment into a voice text fragment and forming the voice text to be processed according to the voice text fragment.
In a possible implementation manner of the second aspect, the positioning apparatus further includes:
the sample acquisition module is used for acquiring a voice text sample; wherein the speech text sample comprises at least one speech text sample segment;
the marking module is used for marking a quality inspection link corresponding to the voice text sample fragment;
and the training module is used for inputting the voice text sample fragment and the quality inspection link corresponding to the voice text sample fragment into a cnn model for training to obtain the preset cnn model.
In a possible implementation manner of the second aspect, the positioning apparatus further includes:
the audio acquisition module is used for acquiring a voice audio fragment corresponding to the voice text fragment;
and the verification module is used for verifying the quality inspection link corresponding to the voice text fragment according to the voice audio fragment.
In a possible implementation manner of the second aspect, the checking module includes:
the determining submodule is used for determining the serial number and the total number of the voice audio clips;
the calculating submodule is used for calculating the relative position of the quality inspection link corresponding to the voice audio clip according to the following formula:
Percent=A/B,
wherein Percent represents the relative position of the quality inspection link corresponding to the voice audio clip, A represents the serial number of the voice audio clip, and B represents the total number of voice audio clips;
the judging submodule is used for judging whether the relative position of the quality inspection link is within a relative position interval corresponding to the quality inspection link in a preset box diagram;
and the change submodule is used for changing the quality inspection link into an abnormal link if the quality inspection link is not normal.
In one possible implementation, the positioning module includes:
the clustering submodule is used for clustering and analyzing the quality inspection links of the same category to obtain a quality inspection link set;
the determining submodule is used for determining a starting quality inspection link and an ending quality inspection link in the quality inspection link set;
and the positioning sub-module is used for taking the areas where the voice text segments respectively corresponding to the initial quality inspection link, the intermediate quality inspection link between the initial quality inspection link and the ending quality inspection link, and the ending quality inspection link are located as quality inspection areas.
In a third aspect, an embodiment of the present application provides a positioning apparatus, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the method according to the first aspect when executing the computer program.
In a fourth aspect, an embodiment of the present application provides a storage medium storing a computer program which, when executed by a processor, implements the method according to the first aspect.
Compared with the prior art, the embodiment of the application has the advantages that:
the quality inspection link of the voice text segment of the voice text to be processed is determined through the preset cnn model, so that the areas where the voice text segments respectively corresponding to the quality inspection links of the same category are located are used as quality inspection areas, and subsequent quality inspection is conveniently carried out according to the quality inspection areas. Therefore, the embodiment of the application utilizes the characteristics of good robustness and the like of the preset cnn model, and can achieve the effects of high robustness and good recognition effect compared with the method for determining the quality inspection area of the voice text through the keyword recognition technology in the prior art.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a method for locating a voice quality inspection area according to an embodiment of the present application;
fig. 2 is a schematic flowchart of a method for positioning a voice quality inspection area according to an embodiment of the present application, before step S101 in fig. 1;
fig. 3 is a schematic flowchart illustrating a specific process of step S202 in fig. 2 of a method for positioning a voice quality inspection area according to an embodiment of the present application;
fig. 4 is a flowchart illustrating a method for locating a voice quality inspection area according to an embodiment of the present application, before step S102 in fig. 1;
fig. 5 is a schematic flowchart of a method for positioning a voice quality inspection area according to an embodiment of the present application, before step S103 in fig. 1;
fig. 6 is a schematic flowchart illustrating a specific process of step S103 in fig. 1 of a method for positioning a voice quality inspection area according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a positioning device for a voice quality inspection area according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a positioning apparatus provided in an embodiment of the present application;
fig. 9 is a diagram illustrating a first comparison relationship between a voice segment and a quality inspection link in a method for locating a voice quality inspection area according to an embodiment of the present application;
fig. 10 is a preset box diagram representing a relationship between relative positions corresponding to a normal quality inspection link and a quality inspection link in the method for positioning a voice quality inspection area according to the embodiment of the present application;
fig. 11 is an exemplary diagram of a second corresponding relationship between a quality inspection link and a speech text segment in the method for positioning a speech quality inspection region according to the embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when", "upon", "in response to determining", or "in response to detecting". Similarly, the phrase "if it is determined" or "if [a described condition or event] is detected" may be interpreted contextually to mean "upon determining", "in response to determining", "upon detecting [the described condition or event]", or "in response to detecting [the described condition or event]".
Furthermore, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used for distinguishing between descriptions and not necessarily for describing or implying relative importance.
Reference throughout this specification to "one embodiment" or "some embodiments," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," or the like, in various places throughout this specification are not necessarily all referring to the same embodiment, but rather "one or more but not all embodiments" unless specifically stated otherwise. The terms "comprising," "including," "having," and variations thereof mean "including, but not limited to," unless expressly specified otherwise.
The application scenario of the present application may be any scenario that requires quality inspection of voice, for example, a typical application scenario may be a scenario that performs quality inspection of voice calls between customer service personnel and customers in the insurance industry.
The technical solutions provided in the embodiments of the present application will be described below with specific embodiments.
Example one
Referring to fig. 1, a flowchart of a method for positioning a voice quality inspection area provided in an embodiment of the present application is schematically illustrated, by way of example and not limitation, the method may be applied to a positioning device, where the positioning device includes a terminal device or a server, and the method may include the following steps:
and S101, acquiring a voice text to be processed.
The voice text to be processed comprises at least one voice text segment.
In a specific application, the to-be-processed voice text may be a voice text obtained by converting a voice call between a customer service person and a client in an insurance industry.
In a possible manner, referring to fig. 2, a schematic flow chart of the method for positioning a voice quality inspection area provided in the embodiment of the present application before step S101 in fig. 1 is shown, where before obtaining a to-be-processed voice text, the method further includes:
step S201, obtaining the voice audio to be processed.
In a specific application, the embodiment of the present application may obtain the to-be-processed voice audio directly from a call center, or indirectly from a relay server; that is, the source of the to-be-processed voice audio is not limited in this embodiment. The number of voice audios to be processed is not limited either; for example, the voice audio to be processed may be 500 voice calls between customer service personnel and customers.
Step S202, converting the voice audio to be processed into a voice text to be processed.
As an example and not by way of limitation, as shown in fig. 3, a specific flowchart of step S202 in fig. 2 of a method for positioning a voice quality inspection area provided in an embodiment of the present application is a schematic diagram, where converting a to-be-processed voice audio into a to-be-processed voice text includes:
and S301, separating target voice audio in the voice audio to be processed.
It can be understood that the background noise and the target voice audio exist in the voice audio to be processed, and the background noise and the target voice audio need to be separated.
By way of example and not limitation, the specific process of separating out the target speech audio from the to-be-processed speech audio may be:
firstly, framing a voice audio to be processed to obtain an audio frame.
Secondly, calculating the energy of the audio frame according to the following formula:
En = (x1 - m)^2 + (x2 - m)^2 + ... + (xN - m)^2,
wherein En is the energy of the audio frame at time instant n, xi is the i-th sample value in the frame, m is the average sound amplitude of the frame, and N is the window length.
And thirdly, screening out the audio frames with the energy larger than the energy threshold.
And fourthly, forming target voice audio according to the audio frames with the energy larger than the energy threshold value.
It can be understood that the embodiment of the present application separates the target speech audio by exploiting the energy difference between background noise and target speech, as sketched below.
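By way of illustration only, the following is a minimal sketch of this energy-based separation, assuming the to-be-processed voice audio is available as a NumPy array of samples; the frame length and energy threshold are illustrative values rather than parameters specified by this application.

```python
import numpy as np

def separate_target_speech(samples, frame_len=400, energy_threshold=1e-3):
    """Keep only the audio frames whose energy exceeds the threshold and
    concatenate them into the target speech audio."""
    kept = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        m = frame.mean()                       # average sound amplitude of the frame
        energy = np.sum((frame - m) ** 2)      # frame energy En over the window of length N
        if energy > energy_threshold:          # screen out low-energy (background noise) frames
            kept.append(frame)
    return np.concatenate(kept) if kept else np.array([], dtype=float)
```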
And step S302, performing role object segmentation on the target voice audio to obtain voice audio segments.
Wherein each voice audio clip corresponds to a character object.
In a specific application, in an application scene of quality inspection by voice call between a customer service person and a client in the insurance industry, the role object can comprise the customer service person and the client.
By way of example and not limitation, the character object segmentation is performed on the target voice audio, and a specific process of obtaining the voice audio segment may be:
firstly, all role objects corresponding to the target voice audio are determined.
For example, the role object of the embodiment of the present application may be a customer service person and a client in the insurance industry.
And secondly, searching a preset voice characteristic model corresponding to each role object.
The preset voice characteristic model is set in advance according to the voice characteristics of the role object.
For example, the preset speech feature model of the customer service personnel can be obtained by extracting speech feature values of the customer service personnel according to Mel Frequency Cepstral Coefficients (MFCC) and inputting the speech feature values into a speech feature model, such as a Gaussian mixture model, for training. Correspondingly, the preset speech feature model of the customer can be obtained by extracting speech feature values of the customer according to MFCC and inputting them into a speech feature model, such as a Gaussian mixture model, for training.
And thirdly, substituting the preset voice characteristic model corresponding to each role object into a preset function to calculate a jump prediction value.
For example, the predetermined function may be a likelihood function plus a penalty term.
And fourthly, taking the moment of the jump prediction value larger than the jump prediction threshold value as a jump point, and segmenting the target voice audio according to the jump point to obtain a voice audio segment.
It can be understood that the embodiment of the present application segments the speech audio into segments corresponding to different character objects by predicting the transition points of the speech audio, as sketched below.
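As a hedged sketch of this jump-point idea, the example below scores each MFCC frame under a pre-trained Gaussian mixture model per role object and places a jump point where the better-scoring role changes and the score gap exceeds a penalty term; the librosa and scikit-learn calls and the penalty value are illustrative assumptions, not elements prescribed by this application.

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture  # role_gmms values are fitted instances of this class

def segment_by_role(audio, sr, role_gmms, penalty=5.0):
    """role_gmms: dict mapping role name -> GaussianMixture fitted on that role's MFCC frames."""
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13).T           # (frames, 13)
    roles = list(role_gmms)
    scores = np.stack([role_gmms[r].score_samples(mfcc) for r in roles], axis=1)
    best = scores.argmax(axis=1)                                       # best-scoring role per frame
    gap = np.ptp(scores, axis=1)                                       # evidence margin per frame
    jumps = [0]
    for t in range(1, len(best)):
        if best[t] != best[t - 1] and gap[t] > penalty:                # role switches with enough evidence
            jumps.append(t)
    jumps.append(len(best))
    # Each segment is (start_frame, end_frame, role object): one role object per voice audio segment
    return [(jumps[i], jumps[i + 1], roles[best[jumps[i]]]) for i in range(len(jumps) - 1)]
```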
Step S303, converting the voice audio segment into a voice text segment, and forming a voice text to be processed according to the voice text segment.
The voice audio segments correspond one-to-one with the voice text segments.
By way of example and not limitation, converting a speech audio segment to a speech text segment and forming a to-be-processed speech text from the speech text segment may be:
firstly, extracting a characteristic value of a voice audio segment.
And secondly, inputting the characteristic value into a preset acoustic model to obtain a voice characteristic vector sequence.
The preset acoustic model is obtained by training according to acoustic data and a voice feature vector sequence in advance.
And thirdly, inputting the voice feature vector sequence into a preset voice model to obtain a character sequence.
The preset voice model is obtained by training in advance according to the character sequence and the voice feature vector sequence.
And fourthly, forming a voice text to be processed according to the character sequence; the whole chain is sketched below.
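The conversion chain just described can be sketched as follows; acoustic_model and voice_model are hypothetical placeholders for the preset acoustic model and the preset voice model, and their predict and decode interfaces are assumptions made only for illustration.

```python
import librosa

def audio_segment_to_text(audio, sr, acoustic_model, voice_model):
    # First: extract feature values of the voice audio segment
    features = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13).T
    # Second: the preset acoustic model turns the features into a speech feature vector sequence
    feature_vectors = acoustic_model.predict(features)
    # Third: the preset voice model turns the feature vector sequence into a character sequence
    characters = voice_model.decode(feature_vectors)
    # Fourth: the character sequence forms one voice text segment
    return "".join(characters)

def build_text_to_process(audio_segments, sr, acoustic_model, voice_model):
    # The to-be-processed voice text is the ordered list of per-segment text fragments
    return [audio_segment_to_text(a, sr, acoustic_model, voice_model) for a in audio_segments]
```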
And S102, inputting the voice text segment of the voice text to be processed into a preset cnn model to obtain a quality inspection link corresponding to the voice text segment.
The preset cnn model is a neural network model trained in advance on voice text fragments and their corresponding quality inspection links. It should be noted that a quality inspection link refers to the conversation progress between role objects; for example, in the application scenario of performing quality inspection on a voice call between a customer service person and a client in the insurance industry, the quality inspection link represents the conversation progress between the customer service person and the client. The embodiment of the present application determines the quality inspection area based on the quality inspection links, so that the voice call between the customer service person and the client can subsequently be quality-inspected according to that area.
For example, fig. 9 is an exemplary diagram of the first correspondence relationship between voice text segments and quality inspection links in the method for locating a voice quality inspection area provided in the embodiment of the present application. The first column is the sequence number of the voice text segment; since the voice text segments correspond one-to-one with the voice audio segments, the first column also represents the sequence number of the voice audio segment. The second column is the quality inspection link. The third column is the start-stop time of the voice text segment, which likewise represents the start-stop time of the voice audio segment. The fourth column is the voice text segment itself. In this way, the voice text segments can be classified by quality inspection link through the preset cnn model, so as to obtain the quality inspection link corresponding to each voice text segment.
In a possible implementation manner, referring to fig. 4, which is a schematic flow chart of the method for positioning a voice quality inspection region provided in the embodiment of the present application before step S102 in fig. 1, before inputting a voice text segment of a voice text to be processed to a preset cnn model and obtaining a quality inspection link corresponding to the voice text segment, the method further includes:
and step S401, obtaining a voice text sample.
Wherein the speech text sample comprises at least one speech text sample segment.
It should be noted that this is handled in the same way as step S101: before the voice text sample is obtained, a voice audio sample is obtained first and then converted into a voice text sample. The specific process of converting the voice audio sample into the voice text sample is the same as steps S201 to S202 and is not described again here.
And step S402, marking a quality inspection link corresponding to the voice text sample fragment.
Illustratively, in the application scenario of performing quality inspection on voice calls between customer service personnel and clients in the insurance industry, the quality inspection links may include, but are not limited to, a product introduction link, an information verification link, a health notification link, a disclaimer declaration link, an opening white link, an insurance confirmation link, a hesitation period link, an association one link, an association two link, an association three link, and the like.
Generally, the embodiment of the present application relies on manually labeling the quality inspection link corresponding to each voice sample fragment. Specifically, when the embodiment is applied to a terminal device, the user performs the labeling directly through the terminal device; the specific human-computer interaction process is a conventional means well known to those skilled in the art and is not further described here. When the embodiment is applied to a server, the user performs the labeling indirectly through a user terminal; the specific interaction process between the user terminal and the server is likewise a conventional means well known to those skilled in the art and is not further described here.
And S403, inputting the voice text sample fragment and the quality inspection link corresponding to the voice text sample fragment into the cnn model for training to obtain the preset cnn model.
Generally, 70% of the voice text sample fragments and their corresponding quality inspection links are selected as the training set, and the remaining 30% are selected as the test set, as sketched below.
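As a rough illustration of such a text-classification cnn and of the 70%/30% split, the PyTorch sketch below may help; the vocabulary size, embedding width, convolution kernel sizes and the number of quality inspection link categories are illustrative assumptions rather than parameters taken from this application.

```python
import torch
import torch.nn as nn

class LinkCNN(nn.Module):
    """Classifies a tokenized voice text segment into one of the quality inspection links."""
    def __init__(self, vocab_size=5000, embed_dim=128, num_links=11, kernel_sizes=(2, 3, 4)):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.convs = nn.ModuleList([nn.Conv1d(embed_dim, 64, k) for k in kernel_sizes])
        self.fc = nn.Linear(64 * len(kernel_sizes), num_links)

    def forward(self, token_ids):                        # token_ids: (batch, seq_len)
        x = self.embed(token_ids).transpose(1, 2)        # (batch, embed_dim, seq_len)
        pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))         # logits over the quality inspection links

def split_samples(segments, labels, train_ratio=0.7):
    """70% of the labelled sample segments for training, the remaining 30% for testing."""
    n_train = int(len(segments) * train_ratio)
    return (segments[:n_train], labels[:n_train]), (segments[n_train:], labels[n_train:])
```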
And S103, determining quality inspection links of the same type, and taking the areas of the voice text segments corresponding to the quality inspection links of the same type as quality inspection areas.
As listed in step S402, the categories of the quality inspection link include, but are not limited to, product introduction, information verification, health notification, disclaimer declaration, opening white, insurance confirmation, hesitation period, association one, association two, association three, and the like.
It can be understood that the same link can appear more than once in one quality inspection process, that is, in one call between customer service personnel and a customer. The embodiment of the present application therefore takes the areas where the voice text segments corresponding to quality inspection links of the same category are located as a quality inspection area, which positions the quality inspection area accurately and quickly and facilitates subsequent quality inspection according to that area.
In a possible manner, referring to fig. 5, a flowchart of the method for positioning a voice quality inspection area provided in the embodiment of the present application before step S103 in fig. 1 is shown, where quality inspection links of the same category are determined, and before taking areas where voice text segments respectively corresponding to the quality inspection links of the same category as quality inspection areas, the method further includes:
and step S501, acquiring a voice audio clip corresponding to the voice text clip.
And S502, verifying a quality inspection link corresponding to the voice text fragment according to the voice audio fragment.
It can be understood that the quality inspection links obtained for the voice text segments in step S102 may contain errors, which need to be further removed. The embodiment of the present application therefore verifies the quality inspection link corresponding to each voice text segment according to the voice audio segment corresponding to that voice text segment.
Specifically, the verifying the quality inspection link corresponding to the voice text segment according to the voice audio segment includes:
the first step is to determine the sequence number and the total number of the voice audio segments.
And secondly, calculating the relative position of a quality inspection link corresponding to the voice audio clip according to the following formula:
Percent=A/B,
wherein Percent represents the relative position of the quality inspection link corresponding to the voice audio segment, A represents the serial number of the voice audio segment, and B represents the total number of voice audio segments.
For example, if the sequence number of the voice audio segment corresponding to a quality inspection link is 118 and the total number of voice audio segments is 200, the relative position of that link is 0.59.
And thirdly, judging whether the relative position of the quality inspection link is within the relative position interval corresponding to that quality inspection link in the preset box diagram.
The preset box diagram is obtained in advance from statistics of normal quality inspection links and their corresponding relative positions.
Referring to fig. 10, a preset box diagram representing the relationship between normal quality inspection links and their corresponding relative positions is provided for the method for positioning a voice quality inspection area according to the embodiment of the present application. Illustrated in connection with fig. 10: when the quality inspection link is an opening white link, the sequence number of the corresponding voice audio segment is 87 and the total number of voice audio segments is 200, so the calculated relative position of the opening white link is 0.435. In fig. 10, the relative position interval of a normal opening white link in the preset box diagram is 0.82-0.92. Since the calculated relative position 0.435 falls outside the interval 0.82-0.92, this link is not a normal opening white link, and the quality inspection link was predicted incorrectly.
And fourthly, if not, changing the quality inspection link into an abnormal link.
The abnormal link includes an unknown link, that is, the quality inspection link of the embodiment of the present application further includes an unknown link.
It can be understood that, after the voice audio segment is converted into a voice text segment and the quality inspection link corresponding to the voice text segment is predicted by the preset cnn model, the embodiment of the present application can further change quality inspection links that do not meet the requirement into abnormal links, for example unknown links, according to the verification performed on the voice audio segment, thereby filtering out interference data. A sketch of this verification step follows.
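A minimal sketch of this check is given below, assuming the relative-position intervals of the preset box diagram have already been computed; here they are derived from percentiles of historical normal calls, which is an assumption about how those statistics might be gathered.

```python
import numpy as np

def build_intervals(normal_positions_by_link, low_pct=25, high_pct=75):
    """normal_positions_by_link: dict link name -> relative positions observed in normal calls."""
    return {link: (float(np.percentile(p, low_pct)), float(np.percentile(p, high_pct)))
            for link, p in normal_positions_by_link.items()}

def verify_links(predicted_links, intervals):
    """predicted_links: predicted link names ordered by segment sequence number (1-based)."""
    total = len(predicted_links)                         # B: total number of voice audio segments
    checked = []
    for seq_no, link in enumerate(predicted_links, start=1):
        percent = seq_no / total                         # Percent = A / B
        low, high = intervals.get(link, (0.0, 1.0))
        # Outside the link's interval in the preset box diagram -> change to an abnormal (unknown) link
        checked.append(link if low <= percent <= high else "unknown")
    return checked
```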
Exemplarily, referring to fig. 6, a detailed flowchart of step S103 in fig. 1 of the method for positioning a voice quality inspection area provided in the embodiment of the present application is shown, where quality inspection links of the same category are determined, and areas where voice text segments corresponding to the quality inspection links of the same category are located are taken as quality inspection areas, including:
step S601, performing cluster analysis on quality testing links of the same category to obtain a quality testing link set.
The cluster analysis refers to an analysis process of grouping a set of physical or abstract objects into a plurality of classes composed of similar objects, that is, collecting data for classification on the basis of similarity of information.
Specifically, a quality inspection link with the relevance greater than a preset relevance threshold is selected, and a quality inspection link set is constructed according to the quality inspection link with the relevance greater than the preset relevance threshold.
It can be understood that the relevance refers to the similarity between quality inspection links; selecting quality inspection links whose relevance is greater than the preset relevance threshold is equivalent to finding the quality inspection links of the same type.
And step S602, determining a starting quality inspection link and an ending quality inspection link in the quality inspection link set.
It should be noted that, in the prior art, the quality inspection link with the earliest sequence number in a quality inspection link set, that is, the first-appearing quality inspection link, is generally taken as the initial quality inspection link. However, because the preset cnn model can make prediction errors, taking the first-appearing quality inspection link as the initial quality inspection link may introduce a deviation.
exemplarily, the specific process of determining the initial quality inspection link and the ending quality inspection link in the quality inspection link set may be:
the method comprises the following steps of firstly, calculating the sequence number difference value of the sequence numbers of the voice text segments corresponding to the quality inspection links in the quality inspection link set.
Referring to fig. 11, which is an exemplary diagram of the second correspondence relationship between quality inspection links and voice text segments in the method for locating a voice quality inspection area provided in the embodiment of the present application, take the association two quality inspection link set as an example: it includes the association two link with sequence number 131, the association two link with sequence number 136, the association two link with sequence number 137, and the association two link with sequence number 138.
And secondly, of two quality inspection links whose sequence number difference is smaller than the preset sequence number difference threshold, selecting the one with the earlier sequence number as the initial quality inspection link of the quality inspection link set.
Taking fig. 11 as an example, the preset sequence number difference threshold is 2. The sequence number difference between the association two link with sequence number 131 and the association two link with sequence number 136 is 5, which is greater than the preset threshold, so the association two link with sequence number 131 cannot be taken as the initial quality inspection link of the quality inspection link set. The sequence number difference between the association two link with sequence number 136 and the association two link with sequence number 137 is 1, which is less than the preset threshold, so the association two link with sequence number 136 can be taken as the initial quality inspection link of the quality inspection link set.
And thirdly, determining the next initial quality inspection link of the next quality inspection link set according to the first and second steps, and taking the quality inspection link immediately before that next initial quality inspection link as the ending quality inspection link of the current quality inspection link set.
Taking fig. 11 as an example, the association two link with sequence number 138 is the ending quality inspection link.
It is understood that the ending quality inspection link of each quality inspection link set can be determined by the starting quality inspection link of the next quality inspection link set.
Step S603, taking areas where the voice text segments respectively corresponding to the initial quality inspection link, the intermediate quality inspection link between the initial quality inspection link and the ending quality inspection link, and the ending quality inspection link are located as quality inspection areas.
It can be understood that the voice text segments covered by the initial quality inspection link, the intermediate quality inspection links between the initial and ending quality inspection links, and the ending quality inspection link are the voice text segments covered by the quality inspection area.
Taking fig. 11 as an example, if the association two link with sequence number 136 is the initial quality inspection link, the association two link with sequence number 137 is the intermediate quality inspection link, and the association two link with sequence number 138 is the ending quality inspection link, then the voice text segments corresponding to the initial, intermediate and ending quality inspection links are respectively "this indicates that you are going to worship for time and need to sign ……", "the courier will sign the exclusive insurance ……" and "let you sign the insurance policy on three to four working days, and will take effect for liu yuan one number …… two-one-nine years". A sketch of steps S601 to S603 follows.
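The sketch below is one possible reading of steps S601 to S603; the data layout and the preset sequence number difference threshold of 2 follow the example of fig. 11, and the simple grouping by predicted link name is a simplified stand-in for the cluster analysis described above.

```python
def locate_quality_areas(predicted_links, seq_diff_threshold=2):
    """predicted_links: list of (sequence_number, link_name) pairs, sorted by sequence number."""
    # Group segments predicted as the same quality inspection link (abnormal links excluded)
    by_link = {}
    for seq_no, link in predicted_links:
        if link != "unknown":
            by_link.setdefault(link, []).append(seq_no)

    # Initial quality inspection link: first member whose gap to the next member is below the threshold
    starts = {}
    for link, seqs in by_link.items():
        start = seqs[-1]                                   # fall back to the last occurrence
        for a, b in zip(seqs, seqs[1:]):
            if b - a < seq_diff_threshold:
                start = a
                break
        starts[link] = start

    # Ending quality inspection link: the segment just before the next set's initial link
    last_seq = max(seq for seq, _ in predicted_links)
    ordered = sorted(starts.items(), key=lambda kv: kv[1])
    areas = {}
    for i, (link, start) in enumerate(ordered):
        end = ordered[i + 1][1] - 1 if i + 1 < len(ordered) else last_seq
        areas[link] = (start, end)                         # segment range covered by the quality inspection area
    return areas
```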
In the embodiment of the present application, the quality inspection link of each voice text segment of the voice text to be processed is determined by the preset cnn model, so that the areas where the voice text segments corresponding to quality inspection links of the same category are located are taken as quality inspection areas. The embodiment thus exploits the good robustness of the preset cnn model and, compared with determining the quality inspection area of a voice text through keyword recognition in the prior art, achieves higher robustness and a better recognition effect.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
Fig. 7 is a block diagram of a positioning apparatus for a voice quality inspection area according to an embodiment of the present application, which corresponds to the method for positioning a voice quality inspection area according to the foregoing embodiment.
Referring to fig. 7, the positioning apparatus includes:
an obtaining module 71, configured to obtain a to-be-processed speech text; wherein the voice text to be processed comprises at least one voice text segment;
a link division module 72, configured to input the speech text segment of the speech text to be processed to a preset cnn model, so as to obtain a quality inspection link corresponding to the speech text segment;
and the positioning module 73 is configured to determine quality inspection links of the same category, and use an area where the voice text segment corresponding to the quality inspection link of the same category is located as a quality inspection area.
In one possible implementation manner, the positioning apparatus further includes:
the acquisition submodule is used for acquiring the voice audio to be processed;
and the conversion submodule is used for converting the voice audio to be processed into a voice text to be processed.
In one possible implementation, the conversion sub-module includes:
the separation unit is used for separating target voice audio in the voice audio to be processed;
the cutting unit is used for carrying out role object segmentation on the target voice audio to obtain a voice audio segment; wherein each voice audio clip corresponds to a character object;
and the conversion unit is used for converting the voice audio fragment into a voice text fragment and forming the voice text to be processed according to the voice text fragment.
In one possible implementation manner, the positioning apparatus further includes:
the sample acquisition module is used for acquiring a voice text sample; wherein the speech text sample comprises at least one speech text sample segment;
the marking module is used for marking a quality inspection link corresponding to the voice text sample fragment;
and the training module is used for inputting the voice text sample fragment and the quality inspection link corresponding to the voice text sample fragment into a cnn model for training to obtain the preset cnn model.
In one possible implementation manner, the positioning apparatus further includes:
the audio acquisition module is used for acquiring a voice audio fragment corresponding to the voice text fragment;
and the verification module is used for verifying the quality inspection link corresponding to the voice text fragment according to the voice audio fragment.
In one possible implementation, the verification module includes:
the determining submodule is used for determining the serial number and the total number of the voice audio clips;
the calculating submodule is used for calculating the relative position of the quality inspection link corresponding to the voice audio clip according to the following formula:
Percent=A/B,
wherein Percent represents the relative position of the quality inspection link corresponding to the voice audio clip, A represents the serial number of the voice audio clip, and B represents the total number of voice audio clips;
the judging submodule is used for judging whether the relative position of the quality inspection link is within a relative position interval corresponding to the quality inspection link in a preset box diagram;
and the change submodule is used for changing the quality inspection link into an abnormal link if the quality inspection link is not normal.
In one possible implementation, the positioning module includes:
the clustering submodule is used for clustering and analyzing the quality inspection links of the same category to obtain a quality inspection link set;
the determining submodule is used for determining a starting quality inspection link and an ending quality inspection link in the quality inspection link set;
and the positioning sub-module is used for taking the areas where the voice text segments respectively corresponding to the initial quality inspection link, the intermediate quality inspection link between the initial quality inspection link and the ending quality inspection link, and the ending quality inspection link are located as quality inspection areas.
It should be noted that, for the information interaction, execution process, and other contents between the above-mentioned devices/units, the specific functions and technical effects thereof are based on the same concept as those of the embodiment of the method of the present application, and specific reference may be made to the part of the embodiment of the method, which is not described herein again.
Fig. 8 is a schematic structural diagram of a positioning apparatus according to an embodiment of the present application. As shown in fig. 8, the positioning apparatus 8 of this embodiment includes: at least one processor 80, a memory 81 and a computer program 82 stored in the memory 81 and operable on the at least one processor 80, the processor 80 executing the computer program 82 to implement the steps in the above-mentioned embodiment of the method for locating a voice quality inspection area.
The positioning device 8 may be a terminal device or a computing device such as a server.
The Processor 80 may be a Central Processing Unit (CPU), or another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, discrete hardware components, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory 81 may in some embodiments be an internal storage unit of the positioning device 8, such as a hard disk or a memory of the positioning device 8.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (10)

1. The method for positioning the voice quality inspection area is characterized by comprising the following steps:
acquiring a voice text to be processed; wherein the voice text to be processed comprises at least one voice text segment;
inputting the voice text segment of the voice text to be processed to a preset cnn model to obtain a quality inspection link corresponding to the voice text segment, wherein the quality inspection link refers to the conversation progress between role objects;
and determining the quality inspection links of the same type, and taking the areas of the voice text segments corresponding to the quality inspection links of the same type as the quality inspection areas of the voice text to be processed.
2. The method for locating the voice quality inspection area according to claim 1, wherein before the obtaining the voice text to be processed, the method further comprises:
acquiring voice audio to be processed;
and converting the voice audio to be processed into a voice text to be processed.
3. The method for locating the voice quality inspection area according to claim 2, wherein converting the voice audio to be processed into a voice text to be processed comprises:
separating out a target voice audio frequency from the voice audio frequency to be processed;
performing role object segmentation on the target voice audio to obtain a voice audio segment; wherein each voice audio clip corresponds to a character object;
and converting the voice audio fragment into a voice text fragment, and forming the voice text to be processed according to the voice text fragment.
4. The method according to claim 1, wherein before the step of inputting the speech text segment of the speech text to be processed into a preset cnn model and obtaining the quality testing link corresponding to the speech text segment, the method further comprises:
acquiring a voice text sample; wherein the speech text sample comprises at least one speech text sample segment;
marking a quality inspection link corresponding to the voice text sample fragment;
and inputting the voice text sample fragment and the quality inspection link corresponding to the voice text sample fragment into a cnn model for training to obtain the preset cnn model.
5. The method according to any one of claims 1 to 4, wherein the method for locating the voice quality inspection area determines a quality inspection link of the same category, and further comprises, before the area where the voice text segment corresponding to the quality inspection link of the same category is located is used as a quality inspection area:
acquiring a voice audio clip corresponding to the voice text clip;
and verifying a quality inspection link corresponding to the voice text fragment according to the voice audio fragment.
6. The method for positioning the voice quality inspection area according to claim 5, wherein verifying the quality inspection link corresponding to the voice text segment according to the voice audio segment comprises:
determining the sequence number of the voice audio segment and the total number of voice audio segments;
calculating the relative position of the quality inspection link corresponding to the voice audio segment according to the following formula:
Percent = A / B,
wherein Percent represents the relative position of the quality inspection link corresponding to the voice audio segment, A represents the sequence number of the voice audio segment, and B represents the total number of voice audio segments;
judging whether the relative position of the quality inspection link falls within the relative position interval corresponding to the quality inspection link in a preset box plot;
and if not, marking the quality inspection link as an abnormal link.
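The check in claim 6 reduces to computing Percent = A / B and testing it against a preset interval read off a box plot. Below is a small sketch; the interval values are hypothetical examples chosen only for illustration.

```python
# Sketch of claim 6: relative-position check of a quality inspection link.
from typing import Dict, Tuple

def relative_position(sequence_number: int, total_segments: int) -> float:
    """Percent = A / B from the claim."""
    return sequence_number / total_segments

def verify_link(link: str,
                sequence_number: int,
                total_segments: int,
                preset_intervals: Dict[str, Tuple[float, float]]) -> str:
    """Return the link unchanged if its relative position falls inside the
    preset box-plot interval for that link, otherwise mark it abnormal."""
    percent = relative_position(sequence_number, total_segments)
    low, high = preset_intervals[link]
    return link if low <= percent <= high else "abnormal"

if __name__ == "__main__":
    # Hypothetical intervals, e.g. whisker ranges taken from a preset box plot.
    intervals = {"opening": (0.0, 0.2), "service": (0.1, 0.9), "closing": (0.8, 1.0)}
    print(verify_link("opening", 9, 10, intervals))   # -> "abnormal"
```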
7. The method according to claim 1, wherein determining quality inspection links of the same category and taking the area where the voice text segments respectively corresponding to the quality inspection links of the same category are located as the quality inspection area comprises:
performing cluster analysis on the quality inspection links of the same category to obtain a quality inspection link set;
determining a starting quality inspection link and an ending quality inspection link in the quality inspection link set;
and taking the area where the voice text segments respectively corresponding to the starting quality inspection link, the intermediate quality inspection links between the starting quality inspection link and the ending quality inspection link, and the ending quality inspection link are located as the quality inspection area.
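Claim 7 can be read as grouping segments by predicted link category, taking the earliest and latest occurrences as the starting and ending quality inspection links, and spanning everything in between. The sketch below simplifies the "cluster analysis" to plain grouping by category, which is a simplifying assumption; the function name is hypothetical.

```python
# Sketch of claim 7: map each link category to the (start, end) index range
# it covers, so the spanned range includes any intermediate links.
from collections import defaultdict
from typing import Dict, List, Tuple

def quality_inspection_area(predicted_links: List[str]) -> Dict[str, Tuple[int, int]]:
    """For each link category, return the indices of its starting and ending
    quality inspection links among the ordered voice text segments."""
    positions = defaultdict(list)
    for idx, link in enumerate(predicted_links):
        positions[link].append(idx)
    return {link: (idxs[0], idxs[-1]) for link, idxs in positions.items()}

if __name__ == "__main__":
    links = ["opening", "service", "opening", "service", "closing"]
    print(quality_inspection_area(links))
    # -> {'opening': (0, 2), 'service': (1, 3), 'closing': (4, 4)}
```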
8. A device for positioning a voice quality inspection area, comprising:
the acquisition module is used for acquiring a voice text to be processed; wherein the voice text to be processed comprises at least one voice text segment;
a link division module, configured to input the voice text segment of the voice text to be processed into a preset CNN model to obtain a quality inspection link corresponding to the voice text segment, wherein the quality inspection link refers to the conversation progress between role objects;
and a positioning module, configured to determine quality inspection links of the same category and take the area where the voice text segments corresponding to the quality inspection links of the same category are located as the quality inspection area.
9. A positioning device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 7 when executing the computer program.
10. A storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the method according to any one of claims 1 to 7.
CN202010544756.8A 2020-06-15 2020-06-15 Method and device for positioning voice quality inspection area, positioning equipment and storage medium Active CN111696527B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010544756.8A CN111696527B (en) 2020-06-15 2020-06-15 Method and device for positioning voice quality inspection area, positioning equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111696527A (en) 2020-09-22
CN111696527B (en) 2020-12-22

Family

ID=72481278

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010544756.8A Active CN111696527B (en) 2020-06-15 2020-06-15 Method and device for positioning voice quality inspection area, positioning equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111696527B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110390946A (en) * 2019-07-26 2019-10-29 龙马智芯(珠海横琴)科技有限公司 A kind of audio signal processing method, device, electronic equipment and storage medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107093431B (en) * 2016-02-18 2020-07-07 中国移动通信集团辽宁有限公司 Method and device for quality inspection of service quality
US10923141B2 (en) * 2018-08-06 2021-02-16 Spotify Ab Singing voice separation with deep u-net convolutional networks
CN111179935B (en) * 2018-11-12 2022-06-28 中移(杭州)信息技术有限公司 Voice quality inspection method and device
CN110364183A (en) * 2019-07-09 2019-10-22 深圳壹账通智能科技有限公司 Method, apparatus, computer equipment and the storage medium of voice quality inspection
CN110634471B (en) * 2019-09-21 2020-10-02 龙马智芯(珠海横琴)科技有限公司 Voice quality inspection method and device, electronic equipment and storage medium
CN110750229A (en) * 2019-09-30 2020-02-04 北京淇瑀信息科技有限公司 Voice quality inspection display method and device and electronic equipment
CN110909162B (en) * 2019-11-15 2020-10-27 龙马智芯(珠海横琴)科技有限公司 Text quality inspection method, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN111696527A (en) 2020-09-22

Similar Documents

Publication Publication Date Title
CN110147726B (en) Service quality inspection method and device, storage medium and electronic device
CN112804400B (en) Customer service call voice quality inspection method and device, electronic equipment and storage medium
US11276407B2 (en) Metadata-based diarization of teleconferences
CN111128223B (en) Text information-based auxiliary speaker separation method and related device
US8750489B2 (en) System and method for automatic call segmentation at call center
WO2020155750A1 (en) Artificial intelligence-based corpus collecting method, apparatus, device, and storage medium
CN112492343B (en) Video live broadcast monitoring method and related device
CN107578770B (en) Voice recognition method and device for network telephone, computer equipment and storage medium
JP2005530214A (en) Mega speaker identification (ID) system and method corresponding to its purpose
CN111696528B (en) Voice quality inspection method and device, quality inspection equipment and readable storage medium
JP4132589B2 (en) Method and apparatus for tracking speakers in an audio stream
CN113223532B (en) Quality inspection method and device for customer service call, computer equipment and storage medium
Lu et al. Real-time unsupervised speaker change detection
CN111429943B (en) Joint detection method for music and relative loudness of music in audio
CN111639529A (en) Speech technology detection method and device based on multi-level logic and computer equipment
CN112671985A (en) Agent quality inspection method, device, equipment and storage medium based on deep learning
CN111341333B (en) Noise detection method, noise detection device, medium, and electronic apparatus
CN111083469A (en) Video quality determination method and device, electronic equipment and readable storage medium
US20220157322A1 (en) Metadata-based diarization of teleconferences
CN111723204B (en) Method and device for correcting voice quality inspection area, correction equipment and storage medium
CN116996337B (en) Conference data management system and method based on Internet of things and microphone switching technology
CN111696527B (en) Method and device for positioning voice quality inspection area, positioning equipment and storage medium
CN113194332A (en) Multi-policy-based new advertisement discovery method, electronic device and readable storage medium
CN113076932B (en) Method for training audio language identification model, video detection method and device thereof
CN114155845A (en) Service determination method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP02 Change in the address of a patent holder

Address after: 519031 office 1316, No. 1, lianao Road, Hengqin new area, Zhuhai, Guangdong
Patentee after: LONGMA ZHIXIN (ZHUHAI HENGQIN) TECHNOLOGY Co.,Ltd.
Address before: Room 417.418.419, building 20, creative Valley, 1889 Huandao East Road, Hengqin New District, Zhuhai City, Guangdong Province
Patentee before: LONGMA ZHIXIN (ZHUHAI HENGQIN) TECHNOLOGY Co.,Ltd.