CN111681672A - Voice data detection method and device, computer equipment and storage medium - Google Patents


Info

Publication number
CN111681672A
CN111681672A (application number CN202010456652.1A)
Authority
CN
China
Prior art keywords
detection
voice data
monitoring
target voice
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010456652.1A
Other languages
Chinese (zh)
Inventor
张山
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
OneConnect Smart Technology Co Ltd
OneConnect Financial Technology Co Ltd Shanghai
Original Assignee
OneConnect Financial Technology Co Ltd Shanghai
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by OneConnect Financial Technology Co Ltd Shanghai filed Critical OneConnect Financial Technology Co Ltd Shanghai
Priority to CN202010456652.1A priority Critical patent/CN111681672A/en
Publication of CN111681672A publication Critical patent/CN111681672A/en
Pending legal-status Critical Current

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 — Speech recognition
    • G10L15/02 — Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/08 — Speech classification or search
    • G10L15/16 — Speech classification or search using artificial neural networks
    • G10L25/00 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G10L25/48 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Abstract

The invention discloses a voice data detection method and device, computer equipment, and a storage medium. A voice detection trigger instruction is received, the instruction comprising detection type information; if the detection type information is first type information, a first monitoring strategy is adopted to detect the target voice data of the client in real time; when the target voice data triggers a preset early warning condition in a risk monitoring item, prompt information is sent to a monitoring end of the client; after the real-time detection of the target voice data of the client is finished, the detection result information of the quality monitoring item is output; if the detection type information is second type information, a second monitoring strategy is adopted to detect the target voice data of the client offline; after the offline detection of the target voice data of the client is finished, the detection result information of the second monitoring strategy is output. The quality inspection efficiency of the voice data is thereby improved.

Description

Voice data detection method and device, computer equipment and storage medium
Technical Field
The present invention relates to the field of voice semantics, and in particular, to a method and an apparatus for detecting voice data, a computer device, and a storage medium.
Background
Voice detection plays an important role in the quality management and inspection of voice content. At present, most quality inspection systems on the market work manually: dedicated quality inspection personnel inspect the voice data against preset quality inspection specifications. However, when faced with large amounts of voice data, this manual approach not only consumes considerable manpower, resulting in low detection efficiency, but is also heavily affected by human factors, so the detection results are often inaccurate.
Disclosure of Invention
The embodiment of the invention provides a voice data detection method, a voice data detection device, computer equipment and a storage medium, and aims to solve the problem that a voice detection result is inaccurate.
A voice data detection method, comprising:
receiving a voice detection trigger instruction, wherein the voice detection trigger instruction comprises detection type information;
if the detection type information is first type information, a first monitoring strategy is adopted to carry out real-time detection on target voice data of the client, wherein the first type information indicates that the detection type is real-time monitoring, and the first monitoring strategy comprises a risk monitoring item and a quality monitoring item;
when the target voice data triggers an early warning condition in the risk monitoring item, sending prompt information to a monitoring end of the client;
after the real-time detection of the target voice data of the client is finished, outputting the detection result information of the quality detection item;
if the detection type information is second type information, performing off-line detection on the target voice data of the client by adopting a second monitoring strategy, wherein the second type information indicates that the detection type is off-line detection;
and outputting the detection result information of the second monitoring strategy after the off-line detection of the target voice data of the client is finished.
A voice data detecting apparatus comprising:
the voice detection triggering instruction receiving module is used for receiving a voice detection triggering instruction, and the voice detection triggering instruction comprises detection type information;
the real-time detection module is used for detecting the target voice data of the client in real time by adopting a first monitoring strategy when the detection type information is first type information, wherein the first type information indicates that the detection type is real-time monitoring, and the first monitoring strategy comprises a risk monitoring item and a quality monitoring item;
the first sending module is used for sending prompt information to a monitoring end of the client when the target voice data triggers the early warning condition in the risk monitoring item;
the first output module is used for outputting the detection result information of the quality detection item after the real-time detection of the target voice data of the client is finished;
the off-line detection module is used for carrying out off-line detection on the target voice data of the client by adopting a second monitoring strategy when the detection type information is second type information, wherein the second type information indicates that the detection type is off-line detection;
and the second output module is used for outputting the detection result information of the second monitoring strategy after the off-line detection of the target voice data of the client is finished.
A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the above-mentioned voice data detection method when executing the computer program.
A computer-readable storage medium, which stores a computer program that, when executed by a processor, implements the above-described voice data detection method.
According to the voice data detection method, the voice data detection device, the computer equipment and the storage medium, the voice detection trigger instruction is received, and the voice detection trigger instruction comprises the detection type information; if the detection type information is first type information, a first monitoring strategy is adopted to carry out real-time detection on the target voice data of the client, wherein the first type information indicates that the detection type is real-time monitoring, and the first monitoring strategy comprises a risk monitoring item and a quality monitoring item; when the target voice data triggers a preset early warning condition in a risk monitoring item, sending prompt information to a monitoring end of a client; after the real-time detection of the target voice data of the client is finished, outputting the detection result information of the quality detection item; if the detection type information is second type information, performing off-line detection on the target voice data of the client by adopting a second monitoring strategy, wherein the second type information indicates that the detection type is off-line detection; after the off-line detection of the target voice data of the client is finished, outputting the detection result information of the second monitoring strategy; according to the scheme, the target voice data are detected by combining real-time detection and offline detection, so that the time investment of manual quality inspection and the time investment of manual reinspection are obviously reduced, and the quality inspection efficiency of the voice data is improved. 
In addition, in the process of detecting the target voice data in real time, when the target voice data triggers a preset early warning condition in the risk monitoring item, prompt information is sent to the monitoring end of the client in time, and therefore the effectiveness of quality inspection on the voice data is further improved.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can derive other drawings from them without inventive labor.
FIG. 1 is a diagram illustrating an application environment of a voice data detection method according to an embodiment of the present invention;
FIG. 2 is a flow chart of a voice data detection method according to an embodiment of the present invention;
FIG. 3 is another flow chart of a method for detecting voice data according to an embodiment of the present invention;
FIG. 4 is another flow chart of a method for detecting voice data according to an embodiment of the present invention;
FIG. 5 is another flow chart of a method for detecting voice data according to an embodiment of the present invention;
FIG. 6 is another flow chart of a method for detecting voice data according to an embodiment of the present invention;
FIG. 7 is another flow chart of a method for voice data detection according to an embodiment of the present invention;
FIG. 8 is another flow chart of a method for voice data detection according to an embodiment of the present invention;
FIG. 9 is a diagram of a voice data detection apparatus according to an embodiment of the present invention;
FIG. 10 is a schematic diagram of a computer device according to an embodiment of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The voice data detection method provided by the embodiment of the invention can be applied to the application environment shown in Fig. 1. Specifically, the method is applied to a voice data detection system that includes a client and a server, as shown in Fig. 1; the client and the server communicate through a network to solve the problem of inaccurate voice detection results. The client, also called the user terminal, refers to a program that corresponds to the server and provides local services to the user. The client may be installed on, but is not limited to, various personal computers, laptops, smartphones, tablets, and portable wearable devices. The server can be implemented as an independent server or as a server cluster composed of multiple servers.
In an embodiment, as shown in fig. 2, a method for detecting voice data is provided, which is described by taking the method applied to the server in fig. 1 as an example, and includes the following steps:
s10: and receiving a voice detection trigger instruction, wherein the voice detection trigger instruction comprises detection type information.
The voice detection trigger instruction is an instruction for triggering voice data detection. Specifically, it can be generated when a user performs a voice detection operation on a client page; after the client generates the instruction, it sends the instruction to the server, so that the server receives it. In a specific application scenario, after the voice data to be detected is uploaded to the voice quality inspection system of the client, a voice detection trigger button is displayed on the system's interface; when the user clicks the button, the client responds to the click by generating a voice detection trigger instruction and sending it to the server, and the server receives the instruction. The voice detection trigger instruction includes detection type information, which indicates the type of detection to be performed on the voice data, for example real-time detection, timed detection, or offline detection.
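The dispatch on detection type described in this step and the following ones can be sketched as follows; the type codes and function names here are illustrative placeholders, not identifiers from the patent.

```python
# Illustrative sketch of routing a voice detection trigger instruction
# by its detection type information. All names are hypothetical.
FIRST_TYPE = "realtime"   # first type information: real-time monitoring
SECOND_TYPE = "offline"   # second type information: off-line detection

def handle_trigger(instruction: dict) -> str:
    """Route a trigger instruction to the matching monitoring strategy."""
    detection_type = instruction["detection_type"]
    if detection_type == FIRST_TYPE:
        return "first monitoring strategy"   # real-time detection path (S20-S40)
    if detection_type == SECOND_TYPE:
        return "second monitoring strategy"  # off-line detection path (S50-S60)
    raise ValueError(f"unknown detection type: {detection_type}")
```

In a real system the returned value would be replaced by a call into the corresponding detection pipeline; the sketch only shows the branching on the two type codes.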
S20: and if the detection type information is first type information, performing real-time detection on the target voice data of the client by adopting a first monitoring strategy, wherein the first type information indicates that the detection type is real-time monitoring, and the first monitoring strategy comprises a risk monitoring item and a quality monitoring item.
The target voice data refers to the voice data to be detected, and includes customer service voice data and customer voice data. After receiving the detection type information carried by the voice detection trigger instruction, the server identifies it; if the detection type information is the first type information, the first monitoring strategy is adopted to detect the target voice data of the client in real time. In this embodiment, the real-time detection of the target voice data mainly comprises risk detection and quality detection. The first type information indicates that the detection type is real-time monitoring, and the first monitoring strategy comprises a risk monitoring item and a quality monitoring item. Optionally, the risk monitoring item mainly comprises an emotion analysis item and a keyword and sensitive word detection item, while the quality monitoring item mainly comprises an intention recognition item, a speech rate and silence analysis item, a voice cross analysis item, and a quality inspection rule matching item. Optionally, when detecting the target voice data of the client in real time with the first monitoring strategy, the risk monitoring item may be applied first and the quality monitoring item afterwards, or the quality monitoring item first and the risk monitoring item afterwards, so as to obtain, respectively, the risk monitoring result corresponding to the risk monitoring item and the quality detection result corresponding to the quality monitoring item.
Preferably, in this embodiment, in the process of detecting the target voice data of the client in real time with the first monitoring strategy, the risk monitoring item is first used to detect the target voice data in real time to obtain the risk monitoring result corresponding to the risk monitoring item, and the quality monitoring item is then used to detect the target voice data in real time to obtain the quality detection result corresponding to the quality monitoring item.
Specifically, the real-time detection of the target voice data with the risk monitoring item mainly comprises emotion analysis and the detection of keywords and sensitive words. For emotion analysis, the target voice data can be converted into corresponding target character data, the target character data can then be identified with a clustering algorithm (e.g., a K-means clustering algorithm), and whether it leans negative (e.g., anger) or positive (e.g., happiness) can be judged, yielding the detection result for the emotion analysis item. For keyword and sensitive word detection, word segmentation can be performed on the target character data converted from the target voice data to obtain target keywords; the target keywords are then matched against the preset keywords and sensitive words to judge whether any of them is identical to a preset keyword or sensitive word, yielding the detection result for the keyword and sensitive word detection item.
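The keyword and sensitive word matching step described above reduces to an intersection between the segmented tokens and the configured word lists. A minimal sketch (the token list and the word list below are invented for illustration):

```python
def detect_sensitive_words(tokens, sensitive_words):
    """Return, in sorted order, the configured sensitive words that occur
    among the tokens segmented from the converted character data."""
    return sorted(set(tokens) & set(sensitive_words))

# Hypothetical example: tokens from one segmented utterance vs. a configured list.
hits = detect_sensitive_words(["please", "pay", "idiot"], {"idiot", "scam"})
```

A production system would run this per utterance as transcripts arrive, carrying along timestamps so a hit can be reported with its trigger time as described in step S30.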
Specifically, detecting the target voice data with the quality monitoring item mainly comprises intention recognition, speech rate and silence analysis, voice cross analysis, and quality inspection rule matching. For intention recognition, word segmentation can be performed on the target character data converted from the target voice data to obtain target keywords; the target keywords are then identified using a similarity algorithm and a clustering algorithm to obtain intention information; finally, the intention information is classified to judge the intention of the target voice data, for example, that the target voice data intends to "open a membership".
Further, the speech rate and silence analysis of the target voice data detects the speech rate and silence duration of the target voice data. Since the target voice data includes both customer service voice data and customer voice data, this analysis mainly detects the speech rate and silence duration of the customer service voice data. Preferably, a voice detection module can be used to detect the speech rate and silence duration of the customer service voice data and to judge whether the customer service agent speaks too fast or too slow and whether any silence lasts too long.
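Speech rate and silence duration can be approximated from timestamped speech segments. A minimal sketch, under the assumption (not stated in the patent) that the customer service channel is available as a sorted list of (start, end) times in seconds:

```python
def speech_rate(word_count, speaking_seconds):
    """Average speech rate in words per second over the spoken portions."""
    return word_count / speaking_seconds

def longest_silence(segments):
    """Longest gap in seconds between consecutive speech segments,
    where segments is a sorted list of (start, end) pairs."""
    gaps = [nxt[0] - cur[1] for cur, nxt in zip(segments, segments[1:])]
    return max(gaps, default=0.0)
```

The judgments "too fast", "too slow", and "too long" would then be simple threshold checks against values chosen by the quality inspection scheme.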
Further, voice cross analysis is performed on the target voice data to detect whether the customer service voice data and the customer voice data overlap excessively at the same time. Finally, quality inspection rule matching is performed on the time dimension of the target voice data to judge whether the target voice data meets the preset requirements in that dimension.
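Voice cross analysis amounts to measuring how long the two channels speak simultaneously. A sketch, again assuming per-channel (start, end) segments; the interval arithmetic is standard and nothing here is specified by the patent:

```python
def overlap_seconds(agent_segments, customer_segments):
    """Total time in seconds during which customer service and customer
    speech segments overlap, i.e. both parties speak at once."""
    total = 0.0
    for a_start, a_end in agent_segments:
        for c_start, c_end in customer_segments:
            # Overlap of two intervals; negative means no overlap.
            total += max(0.0, min(a_end, c_end) - max(a_start, c_start))
    return total
```

"Too much voice cross" would then be a threshold on this total (or on its ratio to call duration) set by the quality inspection scheme.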
S30: and when the target voice data triggers the early warning condition in the risk monitoring item, sending prompt information to a monitoring end of the client.
The early warning condition refers to a preset condition that triggers an alarm on the target voice data. Specifically, the early warning condition may be that a sensitive word appears in the target voice data, that is, the target voice data contains a negative keyword such as a swear word; or it may be that the detection result of the emotion analysis item for the target voice data is a negative (angry) emotion. After the target voice data is detected in real time with the risk monitoring item as in step S20, a corresponding risk detection result is obtained, and whether that result triggers the early warning condition is judged. If it does, that is, if a sensitive word is detected in the target voice data, or the speaker's emotion in the target voice data is detected to be negative (angry), prompt information is sent to the monitoring end of the client. The prompt information is used to notify the monitoring personnel that a violation has occurred in the target voice data of the client. Specifically, the prompt information records the specific voice data that triggered the early warning condition and the corresponding trigger time, for example, which early warning keyword appeared in the target voice data and when it appeared. The monitoring end is a terminal that has the authority to monitor the target voice data of the client.
In one embodiment, each client establishes an association with a corresponding monitoring terminal in advance. When the target voice data triggers the early warning condition in the risk monitoring item, the server side can directly send prompt information to the monitoring side which is pre-associated with the client side, so that the target voice data of the client side can be monitored and intervened.
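The early warning check in S30 can be sketched as a predicate over the risk detection result. The field names and the prompt strings below are invented for illustration; only the two trigger conditions (sensitive word, negative emotion) come from the text:

```python
def early_warning_prompt(risk_result):
    """Return a prompt message for the monitoring end if the risk result
    triggers an early warning condition, otherwise None."""
    hits = risk_result.get("sensitive_hits", [])
    if hits:
        # The prompt records which word triggered and when, as described above.
        return f"sensitive word '{hits[0]['word']}' at {hits[0]['time']}s"
    if risk_result.get("emotion") == "negative":
        return "negative (angry) emotion detected"
    return None
```

The non-None message would then be pushed to the monitoring end pre-associated with the client, enabling timely intervention.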
S40: and after the real-time detection of the target voice data of the client is finished, outputting the detection result information of the quality detection item.
The detection result information of the quality monitoring item is the result generated after the target voice data is detected with the quality monitoring item, and includes the detection result corresponding to each dimension of the quality monitoring item. Preferably, in this embodiment, the detection result for each dimension may be represented by a specific score value. Specifically, after the real-time detection of the target voice data of the client is finished, the detection result corresponding to each dimension of the quality monitoring item is output.
S50: and if the detection type information is second type information, performing off-line detection on the target voice data of the client by adopting a second monitoring strategy, wherein the second type information indicates that the detection type is off-line detection.
The target voice data may be a complete call recording file, or part of the call recording files screened out according to certain rule requirements, for example, only the call recordings of a certain time period and a certain customer service agent. Optionally, part of the recording files can be screened out as target voice data according to the corresponding rule requirements for offline detection. Specifically, if the detection type information is the second type information, the target voice data of the client is detected offline using the second monitoring strategy, where the second type information indicates that the detection type is off-line detection.
In a specific embodiment, before the target voice data of the client is detected offline with the second monitoring strategy, a time frequency for acquiring target voice data for voice detection may be preset, and the corresponding target voice data is then acquired from the client at that frequency for offline detection. Specifically, the offline detection of the target voice data with the second monitoring strategy also comprises risk detection and quality detection of the acquired target voice data of the client; that is, emotion analysis, keyword and sensitive word detection, intention recognition, speech rate and silence analysis, voice cross analysis, quality inspection rule matching, and the like are likewise performed on the target voice data to obtain the detection result information corresponding to the second monitoring strategy. The specific method and process of this risk detection and quality detection are the same as those of the real-time detection with the first monitoring strategy in step S20, and are not repeated here.
S60: and outputting the detection result information of the second monitoring strategy after the off-line detection of the target voice data of the client is finished.
The detection result information of the second monitoring strategy is the result generated after the target voice data is detected with the second monitoring strategy. Specifically, it includes the detection results of the detection items of each dimension among the risk detection items (the emotion analysis item and the keyword and sensitive word detection item) and among the quality detection items (the intention recognition item, the speech rate and silence analysis item, the voice cross analysis item, and the quality inspection rule matching item). Preferably, in this embodiment, the detection result information of the second monitoring strategy may also be represented by specific score values. Specifically, after the off-line detection of the target voice data of the client is finished, the detection result information of the second monitoring strategy is output.
In the embodiment, a voice detection trigger instruction is received, wherein the voice detection trigger instruction comprises detection type information; if the detection type information is first type information, a first monitoring strategy is adopted to carry out real-time detection on the target voice data of the client, wherein the first type information indicates that the detection type is real-time monitoring, and the first monitoring strategy comprises a risk monitoring item and a quality monitoring item; when the target voice data triggers a preset early warning condition in a risk monitoring item, sending prompt information to a monitoring end of a client; after the real-time detection of the target voice data of the client is finished, outputting the detection result information of the quality detection item; if the detection type information is second type information, performing off-line detection on the target voice data of the client by adopting a second monitoring strategy, wherein the second type information indicates that the detection type is off-line detection; after the off-line detection of the target voice data of the client is finished, outputting the detection result information of the second monitoring strategy; according to the scheme, the target voice data are detected by combining real-time detection and offline detection, so that the time investment of manual quality inspection and the time investment of manual reinspection are obviously reduced, and the quality inspection efficiency of the voice data is improved. In addition, in the process of detecting the target voice data in real time, when the target voice data triggers a preset early warning condition in the risk monitoring item, prompt information is sent to the monitoring end of the client in time, and therefore the effectiveness of quality inspection on the voice data is further improved.
In an embodiment, as shown in fig. 3, the detection result information of the second monitoring policy includes detection items and a detection score corresponding to each detection item; after outputting the detection result information of the second monitoring strategy, the voice data detection method further specifically comprises the following steps:
s61: and adding the target voice data and the detection result information of the second monitoring strategy into a preset detection strategy database, wherein the detection strategy database is used for storing the detection result information after the detection is finished.
The detection strategy database is a preset database for storing detection result information after detection is completed. Specifically, after the detection result information of the second monitoring strategy is output, the target voice data and the corresponding detection result information are associated and stored in the preset detection strategy database, so that the detection result information can later be analyzed and the second monitoring strategy adjusted.
S62: and counting the sample detection data stored in the detection strategy database, and determining the average score ratio of each detection item in the second monitoring strategy, wherein the sample detection data is the data which is stored in the detection strategy database after the detection is finished.
The sample detection data is the data stored in the detection strategy database after detection is finished, and comprises voice data and the corresponding detection result information of the second monitoring strategy. As can be seen from step S50, in the process of detecting the target voice data with the second monitoring strategy, the target voice data is detected according to each detection item in the second monitoring strategy, so the obtained detection result information includes the detection result corresponding to each detection item. In this embodiment, the detection items in the second monitoring strategy include: an emotion analysis item, a keyword and sensitive word detection item, an intention recognition item, a speech speed and silence analysis item, a voice cross analysis item, and a quality inspection rule matching item. In this embodiment, the detection result corresponding to each detection item is preferably a score generated after quality inspection scoring is performed on the target voice data according to a preset scoring strategy, where the preset scoring strategy is a rule for deducting or adding points configured for the quality inspection items according to the quality inspection scheme.
Preferably, in this embodiment, a deduction system is adopted for quality inspection scoring of the target voice data: each detection item corresponds to one initial score, and the proportion of the initial score corresponding to each detection item is set by the quality inspection demander according to the service type and the corresponding quality inspection requirements, so the proportions differ across institutions and quality inspection projects. For example, in the call collection business of a small credit company, the emotional state, keywords, and sensitive words of the collector are strictly inspected, so a higher score proportion can be set for these two dimensions: assuming the total quality inspection score is 100, the initial score corresponding to the speech speed and silence analysis item is 5, the initial score corresponding to the emotion analysis item is 30, the initial score corresponding to the keyword and sensitive word detection item is 50, and the initial score corresponding to the quality inspection rule matching item is 15.
Specifically, counting the sample detection data stored in the detection strategy database includes: collecting, for each detection item, all detection score results in the detection result information corresponding to the target voice data; summing the detection score results of each detection item and averaging them to obtain the average score of each detection item; and finally, determining the average score ratio of each detection item in the second monitoring strategy according to the average score of each detection item.
For example, suppose the second monitoring strategy includes 3 detection items, namely detection item A, detection item B, and detection item C, with initial scores of 15, 30, and 55 respectively, and the sample detection data comprises target voice data a, target voice data b, and target voice data c, each scored under the deduction system on every detection item. Suppose that in detection item A, the detection score results of target voice data a, b, and c are 10, 6, and 8; in detection item B they are 20, 10, and 18; and in detection item C they are 40, 31, and 28. Summing and averaging the detection score results of detection item A gives an average score of 8; for detection item B, an average score of 16; and for detection item C, an average score of 33, so the total of the average scores is 57. The average score ratio of detection item A is therefore 8/57, that of detection item B is 16/57, and that of detection item C is 33/57.
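The averaging and ratio computation of step S62 can be sketched as follows; the function name and the dictionary-based data layout are illustrative assumptions, not part of the claimed scheme:

```python
# Hypothetical sketch of step S62: compute each detection item's average score
# and its ratio of the total average score from stored sample detection data.
from collections import defaultdict

def average_score_ratios(sample_scores):
    """sample_scores: one {detection_item: score} dict per detected voice file."""
    totals = defaultdict(float)
    for record in sample_scores:
        for item, score in record.items():
            totals[item] += score
    averages = {item: t / len(sample_scores) for item, t in totals.items()}
    overall = sum(averages.values())
    return {item: avg / overall for item, avg in averages.items()}

# The worked example above: target voice data a, b, c scored on items A, B, C.
samples = [
    {"A": 10, "B": 20, "C": 40},
    {"A": 6,  "B": 10, "C": 31},
    {"A": 8,  "B": 18, "C": 28},
]
ratios = average_score_ratios(samples)
# averages are A=8, B=16, C=33 (total 57), so the ratios are 8/57, 16/57, 33/57
```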
S63: and sending the average score ratio of each detection item to the client, and receiving the adjustment factor of each detection item returned by the client.
The adjustment factor of a detection item is a factor for adjusting the weight value of that detection item in the second monitoring strategy. Specifically, after the average score ratio of each detection item is determined, it is sent to the client; after receiving the average score ratios, the user readjusts the weight value of each detection item according to a preset strategy, thereby obtaining the adjustment factor of each detection item; finally, the client sends the adjustment factor of each detection item to the server side, and the server side receives the adjustment factors returned by the client. It can be understood that the adjustment factor is the ratio of the current weight value to the original weight value of each detection item. For example, if the original weight value of detection item A is 3 and the adjusted current weight value is 5, the adjustment factor of detection item A is 5/3.
S64: And adjusting the second monitoring strategy according to the adjustment factor.
Specifically, after the adjustment factor of each detection item is determined, the weight value of each detection item in the second monitoring strategy is readjusted according to that adjustment factor, that is, the proportion of the initial score of each detection item in the total quality inspection score is readjusted according to the adjustment factor of each detection item.
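A minimal sketch of steps S63 and S64, assuming (as the example above states) that the adjustment factor is the ratio of the client-adjusted current weight to the original weight; all names and values are illustrative:

```python
# Apply the client's adjustment factors to the original weight values of the
# detection items in the second monitoring strategy (step S64).
def adjust_weights(original_weights, adjustment_factors):
    return {item: original_weights[item] * adjustment_factors[item]
            for item in original_weights}

# Example from the text: original weight 3, adjusted current weight 5,
# so the adjustment factor of detection item A is 5/3.
new_weights = adjust_weights({"A": 3}, {"A": 5 / 3})
# new_weights["A"] is 5.0 (the current weight), up to floating-point error
```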
In this embodiment, the target voice data and the detection result information of the second monitoring strategy are added to a preset detection strategy database, the detection strategy database being used for storing detection result information after detection is finished; the sample detection data stored in the detection strategy database is counted, and the average score ratio of each detection item in the second monitoring strategy is determined, the sample detection data being the data stored in the detection strategy database after detection is finished; the average score ratio of each detection item is sent to the client, and the adjustment factor of each detection item returned by the client is received; and the second monitoring strategy is adjusted according to the adjustment factor. Through this automatic adjustment of the quality inspection items, the manual work of simplifying and adjusting the quality inspection items is reduced, the hit rate of the quality inspection items and their condition items is improved, and useless and repeated adjustment is avoided.
In an embodiment, as shown in fig. 4, the first monitoring policy includes a risk monitoring item and a quality monitoring item, and the real-time detection of the target voice data of the client by using the first monitoring policy specifically includes the following steps:
S201: And processing the target voice data, and converting the target voice data into target text data.
Specifically, since the target voice data includes both customer service voice data and customer voice data, the target voice data is first separated before it is converted into target text data, that is, the customer service voice data is separated and screened out from the customer voice data. Further, the customer service voice data is converted into text, and the generated text file is processed by Chinese word segmentation, part-of-speech tagging, and the like, so that the unstructured target voice data is converted into structured target text data. Specifically, the customer service voice data may be converted into text by a preset voice recognition model: the customer service voice data is input into the model, which performs voice recognition on the voice data and outputs the corresponding text content. The preset voice recognition model may adopt a speech recognition algorithm based on a Hidden Markov Model (HMM), or a GMM-HMM model combining a Gaussian Mixture Model (GMM) with a hidden Markov model, but is not limited thereto; this embodiment does not limit the specific implementation algorithm of the voice recognition model.
S202: and detecting the target voice data and the target text data in real time according to the risk monitoring items to acquire risk result information corresponding to each risk monitoring item, wherein the risk monitoring items comprise emotion analysis items, keywords and sensitive word detection items.
The risk result information corresponding to each risk monitoring item comprises: risk result information corresponding to the emotion analysis item and risk result information corresponding to the keyword and sensitive word detection item. Specifically, 2-dimensional risk monitoring items, emotion analysis and keyword and sensitive word detection, are preset, and a corresponding detection strategy is set for each risk monitoring item; after the target text data is obtained, the target voice data and the target text data are combined, real-time detection is carried out according to each risk monitoring item and its corresponding risk detection strategy, and the risk result information corresponding to each risk monitoring item is obtained.
The detection strategy of the emotion analysis item may be to identify the target text data with a clustering algorithm (such as a k-means clustering algorithm) and judge whether the text is biased toward negative emotion (such as anger) or positive emotion (such as happiness). The detection strategy of the keyword and sensitive word detection item may be to perform word segmentation on the target text data to obtain target keywords, then match the target keywords against preset keywords and sensitive words, and judge whether any keyword identical to a preset keyword or sensitive word exists.
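The keyword and sensitive word matching step above can be sketched as follows; for illustration the target text is assumed to be already word-segmented (a real system would first run a Chinese word segmenter), and the lexicon contents are made-up stand-ins:

```python
# Illustrative sketch of the keyword and sensitive word detection strategy:
# each token of the segmented target text is checked against preset keyword
# and sensitive word lexicons.
def detect_words(tokens, keyword_lexicon, sensitive_lexicon):
    keyword_hits = [t for t in tokens if t in keyword_lexicon]
    sensitive_hits = [t for t in tokens if t in sensitive_lexicon]
    return keyword_hits, sensitive_hits

tokens = ["hello", "please", "refund"]
kw, sw = detect_words(tokens, {"please", "thanks"}, {"refund"})
# kw == ["please"], sw == ["refund"]
```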
S203: and detecting the target voice data and the target text data in real time according to the quality monitoring items to obtain quality result information corresponding to each quality monitoring item, wherein the quality monitoring items comprise intention identification items, speech speed and silence analysis items, speech cross analysis items and quality inspection rule matching items.
The quality result information corresponding to each quality monitoring item comprises: quality result information corresponding to the intention identification item, quality result information corresponding to the speech speed and silence analysis item, quality result information corresponding to the voice cross analysis item, and quality result information corresponding to the quality inspection rule matching item. Specifically, 4-dimensional quality monitoring items (intention recognition, speech speed and silence analysis, voice cross analysis, and quality inspection rule matching) are preset, and a corresponding detection strategy is set for each quality monitoring item; after the target text data is obtained, the target voice data and the target text data are combined, real-time detection is carried out according to each quality monitoring item and its corresponding quality detection strategy, and the quality result information corresponding to each quality monitoring item is obtained.
The detection strategy of the intention recognition item may be: performing word segmentation on the target text data to obtain target keywords; identifying the target keywords with a similarity algorithm and a clustering algorithm to obtain intention information; and finally classifying the intention information so as to judge the intention of the target voice data. The detection strategy of the speech speed and silence analysis item may be: using a speech detection module to measure the speech speed and silence duration of the customer service voice data, and judging whether the customer service speaks too fast or too slow and whether overly long silences occur. The detection strategy of the voice cross analysis item may be: using a voice recognition method to detect whether the customer service voice data and the customer voice data overlap excessively, that is, whether both parties speak at the same time too often. The detection strategy of the quality inspection rule matching item may be: performing quality inspection rule matching on the target voice data along the time dimension so as to judge whether the target voice data meets the requirements.
In this embodiment, the target voice data is processed and converted into target text data; the target voice data and the target text data are detected in real time according to the risk monitoring items to obtain the risk result information corresponding to each risk monitoring item, wherein the risk monitoring items comprise an emotion analysis item and a keyword and sensitive word detection item; the target voice data and the target text data are detected in real time according to the quality monitoring items to obtain the quality result information corresponding to each quality monitoring item, wherein the quality monitoring items comprise an intention identification item, a speech speed and silence analysis item, a voice cross analysis item, and a quality inspection rule matching item. The quality inspection task is therefore not influenced by the human factors of quality inspectors, the effect of factors such as inspectors' mood fluctuations on quality inspection is reduced, and the quality inspection score can more objectively reflect the true level of the agent.
In an embodiment, as shown in fig. 5, the real-time detection of the target voice data and the target text data according to the risk monitoring items to obtain the risk result information corresponding to each risk monitoring item specifically includes the following steps:
S2021: And performing emotion analysis scoring on the target voice data and the target text data according to a preset first scoring strategy to obtain the score corresponding to the emotion analysis item.
The first scoring strategy is a strategy for performing emotion analysis scoring on the target voice data and the target text data. Specifically, the first scoring strategy may be to first obtain the mood information and intonation information of the customer service staff from the target voice data; then detect the mood words and sensitive words in the target text data; judge the emotional state of the customer service staff according to the mood information and intonation information together with the detection result of the mood words and sensitive words; and finally obtain the score corresponding to the emotion analysis item according to the identified emotional state and a preset scoring rule, where the preset scoring rule sets different scores in advance for different emotional states. In a specific embodiment, the mood words and sensitive words are detected in the target text data in combination with the mood information, intonation information, and other information of the customer service staff in the target voice data, so as to identify the emotional state of the customer service staff, and different scores are set for different degrees of emotional state. The mood is the attitude of speaking, such as statement, question, or exclamation; the intonation is its external form in terms of speed, pitch, length, and stress. Mood is expressed through intonation, and different intonations convey different moods; when performing emotion analysis scoring, the emotional state of the customer service staff is judged by recognizing the intonation in combination with the related mood words and sensitive words.
S2022: and performing keyword and sensitive word detection scoring on the target text data according to a preset second scoring strategy to obtain scores corresponding to the keyword and the sensitive word detection items.
The second scoring strategy is a strategy for performing keyword and sensitive word detection scoring on the target text data. Specifically, the second scoring strategy may be to first automatically match the target text data against the characters in a preset keyword lexicon and a preset sensitive word lexicon; the keyword lexicon and the sensitive word lexicon are established in advance according to the actual scenario and conversational requirements. When a required keyword is not matched, deduction processing is performed according to a preset first scoring sub-rule to obtain the corresponding keyword score; when a sensitive word is matched, deduction processing is performed according to the matching result and a preset second scoring sub-rule to obtain the corresponding sensitive word score. The first scoring sub-rule sets scoring rules corresponding to different keyword types; the second scoring sub-rule sets deduction rules corresponding to different sensitive word types. Finally, the score corresponding to the keyword and sensitive word detection item is obtained from the keyword score and the sensitive word score. In a specific embodiment, the keyword lexicon and sensitive word lexicon are pre-established according to the actual scenario and script requirements; the target text data is then automatically matched against the characters in the established keyword lexicon and sensitive word lexicon to check whether the set characters are included, and if a required keyword is missing, points are deducted according to the preset scoring rule to obtain the corresponding keyword score.
For the sensitive words, deduction processing is performed according to the sensitive word types in the matching result and the preset scoring rules, so as to obtain the corresponding sensitive word score; the keyword score and the sensitive word score are then added to obtain the score corresponding to the keyword and sensitive word detection item. Preferably, when matching keywords or sensitive words, the target text data is transcribed into pinyin and matched against a synonym library and a homophone library, which improves the robustness of keyword and sensitive word matching and recognition.
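The deduction logic of the two scoring sub-rules can be sketched as below; the penalty values, initial score, and word lists are illustrative assumptions, and the text is again assumed pre-segmented:

```python
# Hypothetical deduction-based scoring for the keyword and sensitive word
# detection item: deduct a fixed penalty per missing required keyword (first
# scoring sub-rule) and a per-type penalty for each sensitive word hit
# (second scoring sub-rule).
def keyword_sensitive_score(tokens, required_keywords, sensitive_penalties,
                            initial=50, missing_penalty=5):
    score = initial
    present = set(tokens)
    for kw in required_keywords:
        if kw not in present:           # required keyword missing: deduct
            score -= missing_penalty
    for t in tokens:                    # sensitive word hit: deduct by type
        score -= sensitive_penalties.get(t, 0)
    return max(score, 0)

score = keyword_sensitive_score(
    ["hello", "refund"],
    required_keywords={"hello", "goodbye"},   # "goodbye" is missing: -5
    sensitive_penalties={"refund": 10},       # "refund" is hit: -10
)
# score == 35
```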
In this embodiment, performing emotion analysis scoring on the target voice data and the target text data according to a preset first scoring strategy to obtain a score corresponding to an emotion analysis item; detecting and grading the keywords and the sensitive words of the target text data according to a preset second grading strategy to obtain grades corresponding to the keywords and the sensitive word detection items; therefore, the time consumed by quality inspection is greatly reduced, the quality inspection efficiency is improved, the quality inspection result is not influenced by human factors, the service quality of customer service personnel can be objectively reflected, and the quality inspection accuracy is improved.
In an embodiment, as shown in fig. 6, the real-time detection of the target voice data and the target text data according to the quality monitoring items to obtain the quality result information corresponding to each quality monitoring item specifically includes the following steps:
S2031: And performing intention identification scoring on the target text data according to a preset third scoring strategy, and acquiring the score corresponding to the intention identification item.
The third scoring strategy is a preset strategy for performing intention identification scoring on the target text data. Specifically, the third scoring strategy first performs word segmentation on the target text data to obtain target keywords, then identifies the target keywords with a similarity algorithm and a clustering algorithm to obtain intention information, and finally classifies the intention information so as to judge the intention of the target voice data, for example, the intention state "membership can be activated". Finally, the score corresponding to the intention identification item is obtained according to the identified intention state and a preset intention scoring rule, where the preset scoring rule sets different scores in advance for different intention states.
S2032: And performing speech speed and silence analysis scoring on the target voice data according to a preset fourth scoring strategy, and acquiring the score corresponding to the speech speed and silence analysis item.
The fourth scoring strategy is a preset strategy for performing speech speed and silence analysis scoring on the target voice data. Specifically, the fourth scoring strategy may be to first calculate the speech speed of each sentence in the target voice data, and deduct points, according to a preset first scoring sub-rule, for sentences whose speech speed is not within a preset speech speed threshold range, obtaining the corresponding speech speed score; the speech speed threshold range is set as the number of words contained in a preset time period. The silence durations in the target voice data are then counted, and when a silence duration falls within a preset duration threshold range, deduction processing is performed according to a preset second scoring sub-rule to obtain the corresponding silence score. Finally, the score corresponding to the speech speed and silence analysis item is obtained from the speech speed score and the silence score.
In a specific embodiment, speech speed analysis is performed on the target voice data to obtain the speech speed score, silence analysis is performed to obtain the silence score, and the two are added to obtain the score corresponding to the speech speed and silence analysis item. In the speech speed analysis, a threshold range for the customer service staff's speaking speed is preset according to service requirements so that customers feel comfortable: no points are deducted while the response speech speed stays within the threshold range, and points are deducted according to the preset scoring rule when it falls outside. In the silence analysis, information such as the silence duration, effective call duration, and call start and stop times of the customer service staff is counted to analyze and score their service proficiency, response timeliness, service attitude, and so on. For example, assuming the initial score of the speech speed and silence analysis item is 5 and the speech speed threshold range is set to 110 to 120 words/minute, the speech speed is measured per sentence: if the speech speed of a sentence in the voice file under quality inspection is not within 110 to 120 words/minute, 1 point is deducted; for silence, 1 point is deducted when a silence lasts 10 to 30 s and 2 points when it lasts 30 to 60 s, and the deductions are accumulated until the initial score is used up.
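The worked example above can be sketched directly; the function name is illustrative, and the thresholds are the ones stated in the text:

```python
# Sketch of the speech speed and silence scoring example: initial score 5,
# each sentence outside 110-120 words/minute deducts 1 point, a silence of
# 10-30 s deducts 1 point, and a silence of 30-60 s deducts 2 points.
def speed_silence_score(sentence_rates, silence_durations,
                        initial=5, rate_range=(110, 120)):
    lo, hi = rate_range
    score = initial
    for rate in sentence_rates:          # words per minute, one per sentence
        if not lo <= rate <= hi:
            score -= 1
    for d in silence_durations:          # silence lengths in seconds
        if 10 <= d < 30:
            score -= 1
        elif 30 <= d < 60:
            score -= 2
    return max(score, 0)                 # deductions stop at zero

score = speed_silence_score([115, 130], [12, 45])
# 130 wpm → -1; 12 s silence → -1; 45 s silence → -2; score == 1
```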
S2033: and performing voice cross analysis scoring on the target voice data according to a preset fifth scoring strategy, and acquiring a score corresponding to the voice cross analysis item.
The fifth scoring strategy is a preset strategy for performing voice cross analysis scoring on the target voice data. Specifically, the fifth scoring strategy may be to perform recognition analysis on the target voice data, determine from it the number of times the customer service and the customer speak at the same time and the duration of each occurrence, and then, according to a preset fifth scoring sub-rule, deduct points when the number of simultaneous-speech occurrences exceeds a threshold count or the duration of simultaneous speech exceeds a threshold duration, so as to obtain the score corresponding to the voice cross analysis item.
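A minimal sketch of such a fifth scoring sub-rule follows; the thresholds, penalty, and initial score are illustrative assumptions, not values from the specification:

```python
# Hypothetical voice cross analysis scoring: deduct once when the number of
# simultaneous-speech occurrences exceeds a threshold count, and once per
# occurrence whose duration exceeds a threshold duration.
def voice_cross_score(overlap_durations, count_threshold=3,
                      duration_threshold=2.0, penalty=2, initial=10):
    """overlap_durations: seconds of each simultaneous-speech occurrence."""
    score = initial
    if len(overlap_durations) > count_threshold:
        score -= penalty
    score -= penalty * sum(1 for d in overlap_durations if d > duration_threshold)
    return max(score, 0)

score = voice_cross_score([1.0, 3.0, 0.5, 2.5])
# 4 occurrences (> 3) → -2; two overlaps longer than 2.0 s → -4; score == 4
```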
S2034: and performing quality inspection rule matching scoring on the target text data according to a preset sixth scoring strategy, and acquiring scores corresponding to quality inspection rule matching items, wherein the quality inspection rule comprises a text matching rule set which is preset according to quality inspection contents, and the text matching rule set comprises word rules, phrase rules and script rules.
The sixth scoring strategy is a preset strategy for performing quality inspection rule matching scoring on the target text data. The sixth scoring strategy may be to first combine target words in the target text data into regular expressions according to the text matching rule set, where the target words include keywords; then automatically match the regular expressions against the quality inspection rules; and when a regular expression does not match the quality inspection rules, perform deduction processing according to a preset scoring rule, so as to obtain the score corresponding to the quality inspection rule matching item. In a specific embodiment, the rules of manual quality inspection are converted in advance into a language a computer can identify, namely quality inspection rules, which can fully cover and reflect aspects such as service process quality inspection, forbidden-word quality inspection, and standard service term quality inspection. The quality inspection rules comprise a text matching rule set defined at 3 levels: word rules, phrase rules, and script rules. Target words (such as keywords, sensitive words, and taboo words) in the structured text file are combined into regular expressions according to the text matching rule set, and contents related to wording logic, the service process, or the script can be detected from these regular expressions. The regular expressions are automatically matched against the word rules, phrase rules, and script rules in the quality inspection rules: if a regular expression matches at least one word rule, at least one phrase rule, and at least one script rule, it hits the quality inspection rules and the matching succeeds; otherwise it misses and the matching fails. When the matching fails, deduction processing is performed according to a preset scoring rule to obtain the score corresponding to the quality inspection rule matching item.
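The three-level matching just described can be sketched with regular expressions; the rules shown are illustrative English stand-ins for real service scripts, and the function name is an assumption:

```python
import re

# Hedged sketch of quality inspection rule matching: word, phrase, and script
# rules are modeled as regular expressions, and a transcript hits the quality
# inspection rules only if at least one rule at every level matches.
def hits_quality_rules(text, word_rules, phrase_rules, script_rules):
    def level_hit(rules):
        return any(re.search(r, text) for r in rules)
    return all(level_hit(rules) for rules in (word_rules, phrase_rules, script_rules))

hit = hits_quality_rules(
    "hello, may I help you with your account today",
    word_rules=[r"hello"],
    phrase_rules=[r"may I help"],
    script_rules=[r"hello.*today"],
)
# hit == True; a transcript missing any level fails the match and is deducted
```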
In this embodiment, performing intention identification scoring on the target text data according to a preset third scoring strategy, and acquiring a score corresponding to an intention identification item; performing speech speed and mute analysis scoring on the target voice data according to a preset fourth scoring strategy to obtain scores corresponding to the speech speed and mute analysis items; performing voice cross analysis scoring on the target voice data according to a preset fifth scoring strategy, and acquiring a score corresponding to a voice cross analysis item; performing quality inspection rule matching scoring on the target text data according to a preset sixth scoring strategy to obtain a score corresponding to a quality inspection rule matching item; the quality inspection rule comprises a text matching rule set which is preset according to quality inspection contents; the text matching rule set comprises word rules, phrase rules and script rules; therefore, the time consumed by quality inspection is greatly reduced, the quality inspection efficiency is improved, the quality inspection result is not influenced by human factors, the service quality of customer service personnel can be objectively reflected, and the quality inspection accuracy is improved.
In a specific embodiment, as shown in fig. 7, after the target voice data triggers the early warning condition in the risk monitoring item and before the prompt information is sent to the monitoring end of the client, the voice data detection method further includes the following steps:
S31: According to the pre-established neural network model, performing attention-based bad voice recognition on the target voice data triggering the early warning condition in the risk monitoring item to obtain the character sequence number distribution and the voice classification result of the target voice data; the neural network model comprises a preset character library, each character in the character library corresponds to a unique character sequence number, and the character sequence number distribution of the target voice data is composed of a plurality of character sequence numbers.
The neural network model is a model capable of performing bad voice recognition on target voice data. The neural network model comprises a backbone network, a voice recognition network and a bad voice classification network established based on an attention mechanism.
Specifically, the target voice data triggering the early warning condition in the risk monitoring item is obtained, and the voice features of the target voice data are extracted; the target voice data is then recognized according to these voice features to obtain its character sequence number distribution, which is formed by arranging, in time order, the character sequence numbers in the character library of all the characters in the target voice data. For example, if the target voice data consists of the six characters of the phrase "membership can be activated", whose character sequence numbers in the character library are 10, 11, 12, 13, 14, and 15 respectively, the character sequence number distribution of the target voice data is "101112131415". Alternatively, when the character library contains many characters, a separator may be added between characters to distinguish the sequence numbers of adjacent characters; for example, with the separator "*", the character sequence number distribution of the target voice data becomes "10*11*12*13*14*15".
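The construction of the character sequence number distribution can be sketched as follows; the 6-character library is a hypothetical stand-in for the real character library:

```python
# Sketch of the character sequence number distribution: each character is
# mapped to its sequence number in the character library and the numbers are
# joined in time order, optionally with a separator between characters.
def sequence_number_distribution(text, char_library, sep=""):
    return sep.join(str(char_library[c]) for c in text)

lib = {c: i + 10 for i, c in enumerate("abcdef")}        # sequence numbers 10..15
plain = sequence_number_distribution("abcdef", lib)            # "101112131415"
separated = sequence_number_distribution("abcdef", lib, sep="*")  # "10*11*12*13*14*15"
```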
Further, attention-mechanism-based bad voice recognition is performed on the target voice data according to the neural network model to obtain the voice classification result of the target voice data, wherein the voice classification result is obtained from the pronunciation characteristics of the target voice data; these pronunciation characteristics include both the pronunciation of characters and sounds with no specific character correspondence, such as various tones of voice. Specifically, feature extraction is performed on the target voice data triggering the early warning condition in the risk monitoring item to determine the frequency spectrum characteristics of the target voice data; the sequence characteristics of the target voice data are then extracted according to the backbone network of the neural network model and the frequency spectrum characteristics of the target voice data; the sequence characteristics of the target voice data are input into the voice recognition network to obtain the character sequence number distribution of the target voice data; and finally, the sequence characteristics are input into the bad voice classification network to obtain the pronunciation classification result of the target voice data.
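A minimal numerical sketch of the three-part model — a backbone network producing per-frame sequence features, a recognition head producing per-frame character sequence numbers, and an attention-based classification head — is shown below. The dimensions, random weights, and activation choices are illustrative assumptions standing in for a trained network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: T frames, F spectral bins, H hidden units,
# V characters in the library, C pronunciation classes.
T, F, H, V, C = 20, 40, 32, 50, 2

# Random weights standing in for the trained backbone and the two heads.
W_backbone = rng.normal(size=(F, H))
W_asr = rng.normal(size=(H, V))
W_cls = rng.normal(size=(H, C))
w_attn = rng.normal(size=(H,))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def recognize(spectrogram):
    # Backbone: frequency spectrum features -> per-frame sequence features.
    seq = np.tanh(spectrogram @ W_backbone)              # (T, H)
    # Voice recognition head: per-frame character sequence numbers.
    char_ids = softmax(seq @ W_asr).argmax(axis=-1)      # (T,)
    # Attention head: score each time period, pool, then classify.
    attn = softmax(seq @ w_attn)                         # (T,) attention scores
    pooled = (attn[:, None] * seq).sum(axis=0)           # (H,)
    pron_class = int(softmax(pooled @ W_cls).argmax())
    return char_ids, pron_class, attn
```

The attention scores `attn` sum to 1 over the time axis and indicate which time periods the classification head weights most heavily when producing the pronunciation classification result.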
S32, determining the starting position and the ending position of the bad keywords in the target voice data according to the character sequence number distribution and a preset bad keyword dictionary; the bad keyword dictionary stores a plurality of bad keyword samples collected in advance.
Specifically, whether a bad keyword exists in the target voice data is detected according to the character sequence number distribution and a preset bad keyword dictionary, and the starting position and the ending position of the bad keyword in the target voice data are then determined according to the character sequence number distribution; the starting position and the ending position are the position information of the bad keyword in the target voice data. In detail, after the character sequence number distribution is converted into voice characters, the voice characters are matched against the bad keyword samples stored in the bad keyword dictionary; if a bad keyword sample is matched successfully, a bad keyword exists in the target voice data, and its starting position and ending position in the target voice data are then determined according to the character sequence number distribution. For example, if the character sequence number distribution of the bad keyword is "10*11*12*13*14*15", the starting position and the ending position of the bad keyword in the target voice data are 10 and 15, respectively.
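The matching and position determination described above might be sketched as follows; the dictionary contents and the (start, end) convention — the sequence numbers of the keyword's first and last characters — are assumptions for illustration:

```python
# Hypothetical bad keyword dictionary of pre-collected samples.
BAD_KEYWORDS = {"abcdef"}

def locate_bad_keyword(text, seq_numbers, bad_keywords):
    """Match the decoded text against the dictionary; if a bad keyword is
    present, return the sequence numbers of its first and last characters
    as its starting and ending positions, otherwise return None.

    seq_numbers[i] is the character sequence number of text[i]."""
    for keyword in bad_keywords:
        i = text.find(keyword)
        if i != -1:
            return seq_numbers[i], seq_numbers[i + len(keyword) - 1]
    return None
```

For the six-character example whose sequence numbers are 10 through 15, a matching keyword spanning the whole utterance yields the positions (10, 15), as in the paragraph above.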
In the attention-mechanism-based bad voice recognition, the attention mechanism adjusts the attention scores of the speech regions in different time periods of the target voice data, and the speech segment that most needs attention can be obtained from these scores. The attention mechanism has proved effective in the classification of time-series data and can improve the accuracy of the classification result. Moreover, the time periods with large attention score values indicate the regions the model attends to, which can be visualized, so the effectiveness of the model can be judged simply before the model is formally used. For example, if the region the model attends to coincides with the bad speech segment of a bad-voice training utterance, the model is effective and accurate.
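Assuming per-time-period attention logits are available from the model, normalizing them and picking out the model's focus region can be sketched as:

```python
import math

def attention_scores(logits):
    """Normalize per-time-period attention logits into scores summing to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def most_attended_region(scores):
    """Indices of the time periods scoring above the mean — a simple stand-in
    for 'the speech segment that most needs attention'."""
    mean = sum(scores) / len(scores)
    return [i for i, s in enumerate(scores) if s > mean]
```

Plotting `scores` over time visualizes the region the model attends to; checking that `most_attended_region` coincides with the known bad segment of a training utterance is the simple effectiveness check described above.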
In this embodiment, attention-mechanism-based bad voice recognition is performed, according to a pre-established neural network model, on the target voice data that triggers the early warning condition in the risk monitoring item, and the character sequence number distribution and the voice classification result of the target voice data are obtained; the neural network model comprises a preset character library, and each character in the character library corresponds to a unique character sequence number. The starting position and the ending position of any bad keyword present in the target voice data are then detected according to the character sequence number distribution and a preset bad keyword dictionary, in which a plurality of bad keyword samples collected in advance are stored. The voice classification result is obtained from the pronunciation characteristics of the target voice data, and the starting position and the ending position of the voice to be detected are determined by combining the text information of the target voice data with the voice classification result. The method thus grasps both the character information and the pronunciation characteristics of the target voice data: by recognizing the character information and the various tone information of the target voice data, it can detect bad voice whose semantics are bad, as well as bad voice that corresponds to no specific characters or whose characters are innocuous but whose tone is bad, thereby improving the accuracy of bad voice detection. Meanwhile, constraining the attention-mechanism-based bad pronunciation recognition with the voice recognition improves the recognition precision of bad voice and further improves the accuracy of bad voice detection.
In a specific embodiment, as shown in fig. 8, the neural network model includes a backbone network, a voice recognition network, and an attention-mechanism-based bad voice classification network, and performing attention-mechanism-based bad voice recognition, according to the pre-established neural network model, on the target voice data triggering the early warning condition in the risk monitoring item specifically includes the following steps:
S311, performing feature extraction on the target voice data triggering the early warning condition in the risk monitoring item, and determining the frequency spectrum characteristics of the target voice data.
And S312, extracting the sequence characteristics of the target voice data according to the backbone network and the frequency spectrum characteristics of the target voice data.
And S313, inputting the sequence characteristics of the target voice data into the voice recognition network to obtain the character sequence number distribution of the target voice data.
And S314, inputting the sequence characteristics of the target voice data into a bad voice classification network to obtain a pronunciation classification result of the target voice data.
In this embodiment, the frequency spectrum characteristics of the target voice data are determined by performing feature extraction on the target voice data triggering the early warning condition in the risk monitoring item; the sequence characteristics of the target voice data are extracted according to the backbone network and the frequency spectrum characteristics of the target voice data; the sequence characteristics of the target voice data are input into the voice recognition network to obtain the character sequence number distribution of the target voice data; and the sequence characteristics of the target voice data are input into the bad voice classification network to obtain the pronunciation classification result of the target voice data. The accuracy of the obtained character sequence number distribution and voice classification result is thereby further improved.
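Step S311's feature extraction could, for example, produce framed log-magnitude spectra; the frame length, hop size, and Hann windowing here are illustrative assumptions, not a front end prescribed by the method:

```python
import numpy as np

def spectral_features(waveform, frame_len=400, hop=160):
    """Split the waveform into overlapping windowed frames and return the
    log-magnitude spectrum of each frame (one row per time period)."""
    n_frames = 1 + (len(waveform) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([waveform[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.log1p(np.abs(np.fft.rfft(frames, axis=1)))
```

At a 16 kHz sampling rate these defaults correspond to 25 ms frames with a 10 ms hop, a common choice for speech front ends.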
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
In an embodiment, a voice data detection apparatus is provided, and the voice data detection apparatus corresponds to the voice data detection method in the above embodiments one to one. As shown in fig. 9, the voice data detecting apparatus includes a voice detection trigger instruction receiving module 10, a real-time detecting module 20, a first sending module 30, a first output module 40, an offline detecting module 50, and a second output module 60. The functional modules are explained in detail as follows:
a voice detection trigger instruction receiving module 10, configured to receive a voice detection trigger instruction, where the voice detection trigger instruction includes detection type information;
the real-time detection module 20 is configured to perform real-time detection on the target voice data of the client by using a first monitoring strategy when the detection type information is first type information, where the first type information indicates that the detection type is real-time monitoring, and the first monitoring strategy includes a risk monitoring item and a quality monitoring item;
the first sending module 30 is configured to send prompt information to a monitoring end of the client when the target voice data triggers an early warning condition in the risk monitoring item;
the first output module 40 is configured to output detection result information of the quality detection item after the real-time detection of the target voice data of the client is finished;
the offline detection module 50 is configured to perform offline detection on the target voice data of the client by using a second monitoring policy when the detection type information is second type information, where the second type information indicates that the detection type is offline detection;
and a second output module 60, configured to output detection result information of the second monitoring policy after the offline detection of the target voice data of the client is finished.
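The dispatch that the real-time detection module 20 and the offline detection module 50 perform on the detection type information can be sketched as follows; the type codes and the detector callback interface are assumptions for illustration:

```python
# Hypothetical detection type codes carried by the trigger instruction.
REAL_TIME = "1"   # first type information: real-time monitoring
OFFLINE = "2"     # second type information: offline detection

def handle_trigger(instruction, realtime_detector, offline_detector):
    """Route the client's target voice data to the first or second
    monitoring strategy according to the detection type information."""
    dtype = instruction["detection_type"]
    if dtype == REAL_TIME:
        return realtime_detector(instruction["audio"])
    if dtype == OFFLINE:
        return offline_detector(instruction["audio"])
    raise ValueError("unknown detection type: %r" % dtype)
```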
Preferably, the voice data detecting apparatus further includes:
the adding module 61 is configured to add the target voice data and the detection result information of the second monitoring strategy to a preset detection strategy database, where the detection strategy database is used to store the detection result information after the detection is completed;
the statistical module 62 is configured to perform statistics on sample detection data stored in the detection policy database, and determine an average score ratio of each detection item in the second monitoring policy, where the sample detection data is data that is stored in the detection policy database after detection is completed;
a second sending module 63, configured to send the average score ratio of each detection item to the client, and receive the adjustment factor of each detection item returned by the client;
and an adjusting module 64, configured to adjust the second monitoring strategy according to the adjustment factor.
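The statistics and strategy adjustment performed by modules 62-64 might be sketched as follows; representing each sample's detection result as a dict of per-item scores, and the second monitoring strategy as per-item weights, are assumptions for illustration:

```python
def average_score_ratios(samples):
    """Average each detection item's score over the sample detection data,
    then express each average as a ratio of the total."""
    per_item = {}
    for sample in samples:
        for item, score in sample.items():
            per_item.setdefault(item, []).append(score)
    averages = {item: sum(v) / len(v) for item, v in per_item.items()}
    total = sum(averages.values())
    return {item: avg / total for item, avg in averages.items()}

def adjust_strategy(item_weights, adjustment_factors):
    """Scale each detection item's weight by the client's adjustment factor."""
    return {item: w * adjustment_factors.get(item, 1.0)
            for item, w in item_weights.items()}
```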
Preferably, the real-time detection module 20 comprises:
a conversion unit 201, configured to process target voice data and convert the target voice data into target text data;
the first real-time detection unit 202 is configured to perform real-time detection on the target voice data and the target text data according to the risk monitoring items, and acquire risk result information corresponding to each risk monitoring item, where the risk monitoring items include emotion analysis items and keyword and sensitive word detection items;
and the second real-time detection unit 203 is configured to perform real-time detection on the target voice data and the target text data according to the quality monitoring items, and acquire quality result information corresponding to each quality monitoring item, where the quality monitoring items include an intention identification item, a speech rate and silence analysis item, a speech cross analysis item, and a quality inspection rule matching item.
Preferably, the first real-time detection unit 202 includes:
the first scoring subunit is used for performing emotion analysis scoring on the target voice data and the target text data according to a preset first scoring strategy to obtain scores corresponding to emotion analysis items;
and the second scoring subunit is used for performing keyword and sensitive word detection scoring on the target text data according to a preset second scoring strategy to obtain scores corresponding to the keyword and the sensitive word detection items.
Preferably, the second real-time detecting unit 203 includes:
the third scoring subunit is used for performing intention identification scoring on the target text data according to a preset third scoring strategy and acquiring a score corresponding to an intention identification item;
the fourth scoring subunit is used for performing speech speed and mute analysis scoring on the target voice data according to a preset fourth scoring strategy to obtain scores corresponding to the speech speed and mute analysis items;
the fifth scoring subunit is used for performing voice cross analysis scoring on the target voice data according to a preset fifth scoring strategy and acquiring a score corresponding to a voice cross analysis item;
and the sixth scoring subunit is used for performing quality inspection rule matching scoring on the target text data according to a preset sixth scoring strategy to obtain a score corresponding to a quality inspection rule matching item, wherein the quality inspection rule comprises a text matching rule set which is preset according to quality inspection content, and the text matching rule set comprises a word rule, a phrase rule and a script rule.
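The quality inspection rule matching of the sixth scoring subunit might be sketched as follows, treating word, phrase, and script rules uniformly as regular-expression patterns; the rule contents and the fraction-of-rules-matched score are illustrative assumptions:

```python
import re

# Hypothetical text matching rule set, preset according to quality
# inspection content: word rules, phrase rules, and script rules.
RULE_SET = {
    "word": ["hello"],
    "phrase": ["thank you for calling"],
    "script": [r"my name is \w+"],
}

def quality_rule_score(transcript, rule_set):
    """Score the transcript as the fraction of rules it matches."""
    patterns = [p for rules in rule_set.values() for p in rules]
    hits = sum(bool(re.search(p, transcript)) for p in patterns)
    return hits / len(patterns)
```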
For the specific limitation of the voice data detection device, reference may be made to the above limitation of the voice data detection method, and details are not described herein again. The modules in the voice data detection device can be wholly or partially realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 10. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing the data used in the voice data detection method in the above embodiment. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method of voice data detection.
In one embodiment, a computer device is provided, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and when the processor executes the computer program, the voice data detection method in the above embodiments is implemented.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which, when being executed by a processor, implements the voice data detection method in the above-described embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.

Claims (10)

1. A method for detecting voice data, comprising:
receiving a voice detection trigger instruction, wherein the voice detection trigger instruction comprises detection type information;
if the detection type information is first type information, a first monitoring strategy is adopted to carry out real-time detection on target voice data of the client, wherein the first type information indicates that the detection type is real-time monitoring, and the first monitoring strategy comprises a risk monitoring item and a quality monitoring item;
when the target voice data triggers an early warning condition in the risk monitoring item, sending prompt information to a monitoring end of the client;
after the real-time detection of the target voice data of the client is finished, outputting the detection result information of the quality detection item;
if the detection type information is second type information, performing off-line detection on the target voice data of the client by adopting a second monitoring strategy, wherein the second type information indicates that the detection type is off-line detection;
and outputting the detection result information of the second monitoring strategy after the off-line detection of the target voice data of the client is finished.
2. The voice data detecting method according to claim 1, wherein the detection result information of the second monitoring policy includes detection items and a detection score corresponding to each detection item;
after the outputting of the detection result information of the second monitoring policy, the voice data detection method further includes:
adding the target voice data and the detection result information of the second monitoring strategy into a preset detection strategy database, wherein the detection strategy database is used for storing the detection result information after detection is finished;
counting sample detection data stored in the detection strategy database, and determining the average score ratio of each detection item in the second monitoring strategy, wherein the sample detection data is data which is stored in the detection strategy database after detection is completed;
sending the average score ratio of each detection item to the client, and receiving the adjustment factor of each detection item returned by the client;
and adjusting the second monitoring strategy according to the adjusting factor.
3. The voice data detecting method according to claim 1, wherein the first monitoring strategy includes a risk monitoring item and a quality monitoring item, and the detecting the target voice data of the client in real time by using the first monitoring strategy includes:
processing the target voice data, and converting the target voice data into target text data;
detecting the target voice data and the target text data in real time according to risk monitoring items to obtain risk result information corresponding to each risk monitoring item, wherein the risk monitoring items comprise emotion analysis items and keyword and sensitive word detection items;
and detecting the target voice data and the target text data in real time according to the quality monitoring items to acquire quality result information corresponding to each quality monitoring item, wherein the quality monitoring items comprise intention identification items, speech speed and mute analysis items, voice cross analysis items and quality inspection rule matching items.
4. The method according to claim 3, wherein the detecting the target voice data and the target text data in real time according to risk monitoring items to obtain risk result information corresponding to each of the risk monitoring items comprises:
performing emotion analysis scoring on the target voice data and the target text data according to a preset first scoring strategy to obtain scores corresponding to the emotion analysis items;
and performing keyword and sensitive word detection scoring on the target text data according to a preset second scoring strategy to obtain scores corresponding to the keyword and sensitive word detection items.
5. The method of claim 3, wherein the real-time detection of the target voice data and the target text data according to the quality monitoring items to obtain the quality result information corresponding to each quality monitoring item comprises:
performing intention identification scoring on the target text data according to a preset third scoring strategy, and acquiring a score corresponding to the intention identification item;
performing speech speed and mute analysis scoring on the target voice data according to a preset fourth scoring strategy to obtain scores corresponding to the speech speed and mute analysis items;
performing voice cross analysis scoring on the target voice data according to a preset fifth scoring strategy, and acquiring a score corresponding to the voice cross analysis item;
and performing quality inspection rule matching scoring on the target text data according to a preset sixth scoring strategy to obtain a score corresponding to a quality inspection rule matching item, wherein the quality inspection rule comprises a text matching rule set which is preset according to quality inspection content, and the text matching rule set comprises a word rule, a phrase rule and a script rule.
6. The voice data detection method of claim 1, wherein after the target voice data triggers an early warning condition in the risk monitoring item and before sending a prompt to the monitoring end of the client, the voice data detection method further comprises:
according to a pre-established neural network model, performing attention mechanism-based bad voice recognition on target voice data triggering early warning conditions in the risk monitoring items to obtain character sequence number distribution and voice classification results of the target voice data; the neural network model comprises a preset character library, and each character in the character library corresponds to a unique character serial number; the character sequence number distribution of the target voice data is composed of a plurality of character sequence numbers;
determining the starting position and the ending position of bad keywords in the target voice data according to the character sequence number distribution and a preset bad keyword dictionary; the bad keyword dictionary stores a plurality of bad keyword samples collected in advance.
7. The voice data detection method of claim 6, wherein the neural network model comprises a backbone network, a voice recognition network, and a bad voice classification network established based on an attention mechanism;
the poor voice recognition based on an attention mechanism is carried out on target voice data triggering early warning conditions in the risk monitoring items according to a pre-established neural network model, and the poor voice recognition comprises the following steps: performing feature extraction on target voice data triggering early warning conditions in the risk monitoring items, and determining the frequency spectrum features of the target voice data;
extracting sequence characteristics of the target voice data according to the backbone network and the frequency spectrum characteristics of the target voice data;
inputting the sequence characteristics of the target voice data into the voice recognition network to obtain the character sequence number distribution of the target voice data;
and inputting the sequence characteristics of the target voice data into the bad voice classification network to obtain a pronunciation classification result of the target voice data.
8. A voice data detecting apparatus, comprising:
the voice detection triggering instruction receiving module is used for receiving a voice detection triggering instruction, and the voice detection triggering instruction comprises detection type information;
the real-time detection module is used for detecting the target voice data of the client in real time by adopting a first monitoring strategy when the detection type information is first type information, wherein the first type information indicates that the detection type is real-time monitoring, and the first monitoring strategy comprises a risk monitoring item and a quality monitoring item;
the first sending module is used for sending prompt information to a monitoring end of the client when the target voice data triggers the early warning condition in the risk monitoring item;
the first output module is used for outputting the detection result information of the quality detection item after the real-time detection of the target voice data of the client is finished;
the off-line detection module is used for carrying out off-line detection on the target voice data of the client by adopting a second monitoring strategy when the detection type information is second type information, wherein the second type information indicates that the detection type is off-line detection;
and the second output module is used for outputting the detection result information of the second monitoring strategy after the off-line detection of the target voice data of the client is finished.
9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the voice data detection method according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored, which, when being executed by a processor, implements the voice data detection method according to any one of claims 1 to 7.
CN202010456652.1A 2020-05-26 2020-05-26 Voice data detection method and device, computer equipment and storage medium Pending CN111681672A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010456652.1A CN111681672A (en) 2020-05-26 2020-05-26 Voice data detection method and device, computer equipment and storage medium


Publications (1)

Publication Number Publication Date
CN111681672A true CN111681672A (en) 2020-09-18

Family

ID=72434853


Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112311937A (en) * 2020-09-25 2021-02-02 厦门天聪智能软件有限公司 Customer service real-time quality inspection method and system based on SIP protocol packet capture and voice recognition
CN112420078A (en) * 2020-11-18 2021-02-26 青岛海尔科技有限公司 Monitoring method, device, storage medium and electronic equipment
CN112966082A (en) * 2021-03-05 2021-06-15 北京百度网讯科技有限公司 Audio quality inspection method, device, equipment and storage medium
CN112860213A (en) * 2021-03-09 2021-05-28 腾讯科技(深圳)有限公司 Audio processing method, storage medium and electronic equipment
CN112860213B (en) * 2021-03-09 2023-08-25 腾讯科技(深圳)有限公司 Audio processing method and device, storage medium and electronic equipment
CN113506585A (en) * 2021-09-09 2021-10-15 深圳市一号互联科技有限公司 Quality evaluation method and system for voice call
CN115964582A (en) * 2022-11-03 2023-04-14 太平洋电信股份有限公司 Network security risk assessment method and system
CN115964582B (en) * 2022-11-03 2023-09-19 太平洋电信股份有限公司 Network security risk assessment method and system
CN117496977A (en) * 2023-11-02 2024-02-02 北京景安云信科技有限公司 Gateway-based data desensitization method
CN117496977B (en) * 2023-11-02 2024-05-03 北京景安云信科技有限公司 Gateway-based data desensitization method


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination