WO2019227580A1 - Voice recognition method, apparatus, computer device, and storage medium - Google Patents

Voice recognition method, apparatus, computer device, and storage medium Download PDF

Info

Publication number
WO2019227580A1
WO2019227580A1 (PCT/CN2018/094371, priority CN2018094371W)
Authority
WO
WIPO (PCT)
Prior art keywords
text
voice
real
frame
speech
Prior art date
Application number
PCT/CN2018/094371
Other languages
French (fr)
Chinese (zh)
Inventor
黄锦伦
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology (Shenzhen) Co., Ltd. (平安科技(深圳)有限公司)
Publication of WO2019227580A1 publication Critical patent/WO2019227580A1/en

Links

Images

Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04M - TELEPHONIC COMMUNICATION
    • H04M 3/00 - Automatic or semi-automatic exchanges
    • H04M 3/22 - Arrangements for supervision, monitoring or testing
    • H04M 3/2281 - Call monitoring, e.g. for law enforcement purposes; Call tracing; Detection or prevention of malicious calls
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/26 - Speech to text systems
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04M - TELEPHONIC COMMUNICATION
    • H04M 3/00 - Automatic or semi-automatic exchanges
    • H04M 3/42 - Systems providing special services or facilities to subscribers
    • H04M 3/50 - Centralised arrangements for answering calls; Centralised arrangements for recording messages for absent or busy subscribers; Centralised arrangements for recording messages
    • H04M 3/51 - Centralised call answering arrangements requiring operator intervention, e.g. call or contact centers for telemarketing
    • H04M 3/5175 - Call or contact centers supervision arrangements

Definitions

  • the present application relates to the field of computer technology, and in particular, to a speech recognition method, device, computer device, and storage medium.
  • the call center consists of an interactive voice response system and an artificial traffic system.
  • the artificial traffic system consists of a check-in system, a traffic platform, and an interface machine.
  • in order to perform customer service, customer representatives (agents) need to perform a check-in operation in the check-in system. After successfully signing in to the traffic platform, the agent establishes a call with the customer according to the assigned manual-service request; that is, the agent calls out to perform customer service.
  • different business terms are set for different services to provide better service to customers.
  • the current practice is to listen to the recording afterwards and analyze it to find outbound calls that do not meet the specifications, and then deal with them accordingly. On the one hand, because the recording can only be heard after the event, no timely early warning is provided, and the agents' voice calls are not monitored as they happen. On the other hand, because all recordings must be listened to and analyzed manually, a great deal of time is consumed and monitoring efficiency is low.
  • the embodiments of the present application provide a method, a device, a computer device, and a storage medium for speech recognition, so as to solve the problems that the current outbound monitoring of agents is not timely and the monitoring efficiency is low.
  • An embodiment of the present application provides a voice recognition method, including: if an outbound operation of an agent is monitored, acquiring voice data and an equipment identifier of the outbound device used by the agent; determining the business department to which the agent belongs based on the device identifier; obtaining a business text template corresponding to the business department, where the business text template includes terms required for outbound calls and terms prohibited for outbound calls; performing voice recognition on the voice data to obtain real-time voice text, and adding the real-time voice text to the current outgoing text; performing text matching between the real-time voice text and the prohibited terms to obtain a first matching result; and, if the first matching result is that the real-time voice text includes a prohibited term, executing a first warning measure.
  • An embodiment of the present application provides a voice recognition device, including:
  • a data acquisition module configured to acquire voice data and an equipment identifier of an outbound device used by the agent if the outbound operation of the agent is monitored;
  • a department determination module configured to determine a business department to which the agent belongs based on the device identifier
  • a template selection module configured to obtain a business text template corresponding to the business department, wherein the business text template includes a term necessary for outbound calls and a term prohibited for outbound calls;
  • a voice recognition module configured to perform voice recognition on the voice data to obtain real-time voice text, and add the real-time voice text to the current outgoing text;
  • a first matching module configured to perform text matching between the real-time voice text and the outgoing call prohibition term to obtain a first matching result
  • a first warning module is configured to execute a first warning measure if the first matching result is that the real-time voice text includes the outbound call prohibition term.
  • An embodiment of the present application provides a computer device including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor.
  • the processor, when executing the computer-readable instructions, implements the steps of the above speech recognition method.
  • This embodiment of the present application provides one or more non-volatile readable storage media storing computer-readable instructions.
  • when the computer-readable instructions are executed by one or more processors, the one or more processors perform the steps of the speech recognition method described above.
  • FIG. 1 is a schematic diagram of an application environment of a speech recognition method according to an embodiment of the present application
  • FIG. 2 is an implementation flowchart of a speech recognition method provided by an embodiment of the present application
  • FIG. 3 is a flowchart of implementing step S4 in the speech recognition method according to an embodiment of the present application.
  • FIG. 4 is a flowchart of implementing step S41 in the voice recognition method provided by an embodiment of the present application.
  • FIG. 5 is an exemplary diagram of overlapping frames of speech signals in a speech recognition method according to an embodiment of the present application.
  • FIG. 6 is a flowchart of implementing a monitoring and early-warning phrase necessary for outgoing calls in the voice recognition method provided by an embodiment of the present application;
  • FIG. 7 is a schematic diagram of a voice recognition device provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of a computer device according to an embodiment of the present application.
  • FIG. 1 illustrates an application environment of a speech recognition method provided by an embodiment of the present application.
  • the speech recognition method is applied in an outbound agent scenario of a call center.
  • the call center includes a server, a client, and a monitoring terminal.
  • the server and the client are connected, and the server and the monitoring terminal are connected through a network.
  • Agents make outbound calls through the client.
  • the client can specifically, but not limited to, various direct-line telephones, telephone network telephones connected with program-controlled switches, mobile phones, walkie-talkies, or other smart devices used for communication.
  • the server and the monitoring terminal may each be implemented by an independent server or by a server cluster composed of multiple servers.
  • the speech recognition method provided in the embodiment of the present application is applied to a server.
  • FIG. 2 illustrates an implementation process of a speech recognition method provided by an embodiment of the present application.
  • This method is applied to the server in FIG. 1 as an example, and includes the following steps:
  • the server and the client are connected through a network, and the server can monitor the client in real time.
  • if an outbound operation of an agent is monitored, the device identifier of the outbound device used by the agent and the voice data generated during the outgoing call are obtained.
  • the client includes two or more outbound call devices, and each outbound call device is used by an agent for outbound calls.
  • the monitoring of the client by the server can be implemented by using the listening mode of the socket process communication, or it can be controlled by the Transmission Control Protocol (TCP) to control data transmission. It can also be implemented by a third-party tool with a monitoring function.
  • the method preferred in this embodiment is the listening mode of socket process communication; in practice, a suitable monitoring method can be selected according to the specific situation, and no restriction is imposed here.
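As an illustrative sketch only (the host, port, and message format are assumptions, not part of the application), the socket listening mode can be pictured as a server that accepts a connection from an outbound device and receives its event messages:

```python
import socket
import threading

# Sketch of socket-based listening: the server waits for a connection from
# an outbound device and records the event message it pushes.
HOST, PORT = "127.0.0.1", 0  # port 0: let the OS pick a free port

received = []

def serve(sock):
    conn, _ = sock.accept()          # block until a client connects
    with conn:
        data = conn.recv(1024)       # read the pushed event message
        received.append(data.decode("utf-8"))

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind((HOST, PORT))
server.listen(1)
port = server.getsockname()[1]

t = threading.Thread(target=serve, args=(server,))
t.start()

# A client (outbound device) reports that an outbound call has started,
# using a hypothetical "event + device identifier" message format.
with socket.create_connection((HOST, port)) as c:
    c.sendall(b"OUTBOUND_START 89757-KD-EN170-962346")

t.join()
server.close()
print(received[0])
```

In a real deployment the server would keep listening in a loop and dispatch each event to the monitoring logic; this sketch shows only a single accept/receive cycle.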
  • S2 Determine the business department to which the agent belongs based on the device identification.
  • the device identifier records the main information of the device, including but not limited to: the employee number of the agent, the department to which the agent belongs, the device type, or the device number. After the device identifier is obtained, the business department to which the agent belongs can be determined according to the device identifier.
  • the obtained device identifier is: 89757-KD-EN170-962346, and the device identifier contains information: the agent employee number is 89757, the agent's department is KD, and the device type is EN170. The equipment number is 962346.
  • the agent needs to verify the identity before using the outbound device.
  • the verification methods include, but are not limited to: account verification, voiceprint recognition, or fingerprint identification. After passing the verification, the outbound device obtains the corresponding information and records it in the device identifier.
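Following the example identifier format above (employee number, department, device type, and device number joined by hyphens), a hypothetical parser might look like this; the function name and returned field names are illustrative assumptions:

```python
# Hypothetical parser for the example identifier format
# "employee-department-devicetype-devicenumber".
def parse_device_id(device_id: str) -> dict:
    employee, department, dev_type, dev_number = device_id.split("-")
    return {
        "employee": employee,        # agent employee number, e.g. 89757
        "department": department,    # business department code, e.g. KD
        "device_type": dev_type,     # e.g. EN170
        "device_number": dev_number, # e.g. 962346
    }

info = parse_device_id("89757-KD-EN170-962346")
print(info["department"])  # KD
```

The `department` field is what step S2 would use to look up the business text template.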
  • each business department presets its own business text template. According to the business department determined in step S2, the corresponding business text template is obtained; each business text template contains the terms required for outbound calls and the terms prohibited for outbound calls.
  • the business department number is KD
  • the business text template KDYY corresponding to the business department number KD is found in the database.
  • the business text template KDYY is used as the standard business template for the agent's current outbound call; that is, after the voice data of the current outbound call is converted into text, the business text template KDYY is used to check the text and monitor whether the agent's outbound terms are standard.
  • S4 Perform speech recognition on the voice data to obtain real-time voice text, and add the real-time voice text to the current outgoing text.
  • voice recognition is performed on the voice data of the agent's outbound call obtained in step S1 to obtain real-time voice text during the outbound process, so that whether the agent's outbound terms are standard can be monitored by checking the real-time voice text; the real-time voice text is added to the current outgoing text.
  • real-time voice text is obtained by segmenting the outbound voice data according to the pauses and silences during the call, performing speech recognition on each segmented piece of voice data, and taking the recognized text as the speech recognition text.
  • for example, a piece of voice data acquired from 0 to 1.8 seconds is recorded as voice data E; the voice data acquired from 1.8 to 3 seconds is empty; and the voice data acquired from 3 to 8 seconds is recorded as voice data F. Voice recognition on voice data E yields the real-time voice text "Hello", and voice recognition on voice data F yields the real-time voice text "Is this China XX? May I help you?"
  • the voice data may be recognized by a voice recognition algorithm or a third-party tool with a voice recognition function, which is not limited in particular.
  • Speech recognition algorithms include, but are not limited to, speech recognition algorithms based on channel models, speech template matching recognition algorithms, or artificial neural network speech recognition algorithms.
  • the speech recognition algorithm used in the embodiment of the present application is a speech recognition algorithm based on a channel model.
  • the real-time voice text obtained in step S4 is matched against the prohibited terms in the business text template obtained in step S3 to determine whether the real-time voice text contains a prohibited term; this real-time matching ensures timely monitoring.
  • the first matching result includes: the real-time voice text includes an outgoing call prohibition term and the real-time voice text does not include an outgoing call prohibition term.
  • the prohibited terms can be set according to business requirements; there may be one, two, or more such terms.
  • the real-time voice text is one or more, and if there is at least one real-time voice text including an outgoing call prohibition term, it is determined that the first matching result is that the real-time voice text includes an outgoing call prohibition term.
  • in step S6, if the first matching result obtained in step S5 is that the real-time voice text contains a prohibited term, the agent has used at least one prohibited term in this outbound call, and the first warning measure is executed.
  • the first warning measures include, but are not limited to: sending a warning to the monitoring end that this outbound call is irregular, reminding the agent of the irregularity in the outbound call, and/or disconnecting the network connection of the current outbound call device; the measures can be set according to the actual situation and are not specifically limited here.
  • the first warning measure may also be set according to the severity of the prohibited term. For example, suppose the prohibited terms include word A, word B, and word C, where words A and B have severity level 1, word C has severity level 2, and level 1 is lower than level 2. The first warning measure corresponding to level 1 can be set to "send a warning to the monitoring end that this outbound call is irregular", and the measure corresponding to level 2 to "disconnect the network connection of the current outbound call device". When the real-time voice text contains word C, the first warning measure directly disconnects the network connection of the current outbound call device and terminates the agent's outbound call.
  • in this embodiment, the device identifier and voice data of the agent are obtained, the business department to which the agent belongs is determined from the device identifier, and the business text template corresponding to that department is obtained.
  • voice recognition is then performed on the voice data to obtain real-time voice text, the real-time voice text is stored into the current outgoing text, and text matching between the prohibited terms and the real-time voice text yields the first matching result; if the result is that the real-time voice text contains a prohibited term, the first warning measure is executed, realizing real-time monitoring of the agent's voice during the outbound call.
  • if the agent uses a prohibited term during the outbound call, it can be detected and warned in time, ensuring the timeliness of monitoring; and because there is no need to manually listen to and analyze recordings of outbound calls, time is saved and monitoring efficiency is improved.
  • the following uses a specific embodiment to describe in detail the implementation of performing voice recognition on the voice data in step S4 to obtain real-time voice text.
  • FIG. 3 illustrates a specific implementation process of step S4 provided by an embodiment of the present application, which is detailed as follows:
  • S41 Perform speech analysis on the speech data to obtain a frame set including basic speech frames.
  • Speech analysis is performed on the acquired speech data to obtain a frame set including basic speech frames.
  • Speech analysis includes, but is not limited to, speech encoding and pre-processing of speech signals.
  • speech coding encodes the analog speech signal, converting the analog signal into a digital signal, thereby reducing the transmission code rate and enabling digital transmission.
  • the basic methods of speech encoding can be divided into waveform encoding, parametric encoding (sound source encoding) and mixed coding.
  • the voice coding method used in this embodiment is waveform coding.
  • waveform coding samples, quantizes, and encodes the time-domain waveform of the analog voice signal to form a digital voice signal.
  • waveform coding can provide high voice quality.
  • the preprocessing of a voice signal refers to pre-emphasis, framing, windowing and other preprocessing operations on the voice signal before analysis and processing.
  • the purpose of voice signal pre-processing is to eliminate the effects of aliasing, higher-harmonic distortion, high frequency, and other factors, caused by the human vocal organs and by the equipment that collects the voice signal, on the quality of the signal. This ensures, as far as possible, that subsequent speech processing works on a more uniform and smooth signal, provides high-quality parameters for signal parameter extraction, and improves the quality of speech processing.
  • S42 Perform mute detection on the basic voice frame to obtain K consecutive mute frames in the basic voice frame, where K is a natural number.
  • the voice signal in the voice data can be divided into two states: an active period and a silent period. No signal is transmitted during the silent period, and the active and silent periods of the uplink and downlink are independent of each other.
  • the agent will have a pause state before and after each utterance. This state will cause a pause in the voice signal, that is, a silent period.
  • the silent period state needs to be detected. Then, the silent period and the activation period are separated to obtain a continuous activation period, and the voice signal of the remaining continuous activation period is used as a target voice frame.
  • the methods for detecting the state of the silence include, but are not limited to, voice endpoint detection, detection of audio mute algorithms, and voice activity detection (VAD) algorithms.
  • the mute detection on the basic voice frame used in the embodiment of the present application to obtain K consecutive mute frames in the basic voice frame includes steps A to E, which are detailed as follows:
  • Step A Calculate the frame energy of each basic speech frame.
  • the frame energy is the short-term energy of the voice signal, and reflects the data amount of the voice information of the voice frame.
  • the frame energy can be used to determine whether the voice frame is a sentence frame or a mute frame.
  • Step B For each basic speech frame, if the frame energy of the basic speech frame is less than a preset frame energy threshold, mark the basic speech frame as a silent frame.
  • the frame energy threshold is a preset parameter. If the calculated frame energy of a basic voice frame is less than the preset frame energy threshold, the corresponding basic voice frame is marked as a mute frame. The threshold may be set according to actual requirements (for example, 0.5), or set after analyzing the calculated frame energies of the basic voice frames; it is not limited here.
  • for example, with the frame energy threshold set to 0.5, the frame energies calculated for 6 basic speech frames J1, J2, J3, J4, J5, and J6 are: 1.6, 0.2, 0.4, 1.7, 1.1, and 0.8. From this result it is easy to see that basic speech frames J2 and J3 are silent frames.
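Steps A and B can be sketched as follows, reusing the toy energies from the example; the sum-of-squares energy definition is a common choice for short-term energy and an assumption here, since the application does not give an explicit formula:

```python
# Step A: short-term frame energy (sum of squared samples, an assumed
# but common definition).
def frame_energy(frame):
    return sum(s * s for s in frame)

# Step B: mark a frame as silent when its energy is below the threshold.
def mark_silent(energies, threshold=0.5):
    return [e < threshold for e in energies]

# Energies of J1..J6 from the example, threshold 0.5.
energies = [1.6, 0.2, 0.4, 1.7, 1.1, 0.8]
print(mark_silent(energies))  # [False, True, True, False, False, False]
```

Only J2 and J3 fall below the 0.5 threshold, matching the example's conclusion.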
  • Step C: if H consecutive mute frames are detected and H is greater than a preset continuity threshold I, the frame set composed of the H consecutive mute frames is treated as one continuous mute frame.
  • the continuity threshold I can be preset according to actual needs. If the number of consecutive silent frames is H and H is greater than the preset continuity threshold I, all the mute frames in the interval composed of the H consecutive silent frames are merged to obtain one continuous mute frame.
  • the preset continuous threshold I is 5, and at a certain moment, the status of the acquired mute frames is shown in Table 1.
  • Table 1 shows a frame set composed of 50 basic voice frames.
  • the intervals of 5 or more consecutive mute frames are: interval P, composed of the basic speech frames corresponding to frame number 7 to frame number 13, and interval Q, composed of the basic speech frames corresponding to frame number 21 to frame number 29. Therefore, the 7 basic voice frames of interval P are merged into a continuous mute frame P, whose duration is the sum of the durations of the 7 basic voice frames corresponding to frame numbers 7 to 13. In the same way, the 9 basic voice frames of interval Q are merged into another continuous mute frame Q, whose duration is the sum of the durations of the 9 basic speech frames corresponding to frame numbers 21 to 29.
  • Step D According to the method of steps A to C, obtain a total of K consecutive silent frames.
  • the continuous mute frames obtained are continuous mute frame P and continuous mute frame Q, so in the example corresponding to step C, the value of K is 2.
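Steps C and D (merging runs of silent frames whose length H exceeds the continuity threshold I, then counting the runs to obtain K) can be sketched as:

```python
# Merge runs of consecutive silent frames longer than the continuity
# threshold I into "continuous mute frames" (step C); the number of runs
# found is K (step D). Indices are 0-based here.
def continuous_mute_runs(silent_flags, threshold_i=5):
    runs, start = [], None
    for idx, silent in enumerate(silent_flags + [False]):  # sentinel ends a run
        if silent and start is None:
            start = idx
        elif not silent and start is not None:
            if idx - start > threshold_i:          # H = idx - start
                runs.append((start, idx - 1))
            start = None
    return runs

# The example: 50 frames, silent at frame numbers 7-13 and 21-29 (1-based).
flags = [False] * 50
for i in list(range(6, 13)) + list(range(20, 29)):
    flags[i] = True
runs = continuous_mute_runs(flags, threshold_i=5)
print([(a + 1, b + 1) for a, b in runs])  # [(7, 13), (21, 29)], so K = 2
```

Runs of length 7 and 9 both exceed I = 5, so two continuous mute frames (P and Q) result and K = 2, as in the example.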
  • the K continuous silent frames obtained in step S42 are used as dividing points to divide the basic speech frames in the frame set into K + 1 intervals, and all the basic speech frames in each interval are taken together as one target speech frame.
  • for example, the status of the acquired mute frames is shown in Table 1 of step C in S42, which shows two continuous mute frames: the 7 basic voice frames corresponding to frame number 7 to frame number 13, merged into continuous mute frame P, and the 9 basic voice frames corresponding to frame number 21 to frame number 29, merged into continuous mute frame Q. Using these two continuous mute frames as dividing points, the frame set of 50 basic speech frames is divided into three intervals: interval M1, composed of the basic speech frames corresponding to frame number 1 to frame number 6; interval M2, composed of the basic speech frames corresponding to frame number 14 to frame number 20; and interval M3, composed of the basic speech frames corresponding to frame number 30 to frame number 50. All the basic speech frames in interval M1 are combined to obtain a combined speech frame as target speech frame M1, and likewise for M2 and M3.
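A minimal sketch of this dividing step, reproducing the example (50 frames, continuous mute frames at frame numbers 7-13 and 21-29, yielding three target intervals M1, M2, M3):

```python
# The K continuous mute frames act as dividing points, splitting the
# frame set into K + 1 target intervals (frame numbers are 1-based).
def split_by_mute_runs(num_frames, mute_runs):
    intervals, start = [], 1
    for run_start, run_end in mute_runs:
        if run_start > start:
            intervals.append((start, run_start - 1))
        start = run_end + 1
    if start <= num_frames:
        intervals.append((start, num_frames))
    return intervals

# The example from the text: 50 frames, mute runs at frames 7-13 and 21-29.
print(split_by_mute_runs(50, [(7, 13), (21, 29)]))
# [(1, 6), (14, 20), (30, 50)] -> target speech frames M1, M2, M3
```

With K = 2 mute runs the function returns K + 1 = 3 intervals, matching the example.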
  • text conversion is performed on each target voice frame to obtain a real-time voice text corresponding to the target voice frame.
  • the text conversion may use a tool that supports speech conversion text, or a text conversion algorithm, which is not specifically limited here.
  • in this embodiment, the speech data is parsed to obtain a frame set of basic speech frames, silence detection is then performed on the basic speech frames to obtain K continuous silent frames, and the K continuous silent frames divide the basic voice frames in the frame set into K + 1 target voice frames. Each target voice frame is converted into real-time voice text, so that the received voice signal is converted in real time into independent pieces of real-time voice text that can then be matched against the prohibited terms, ensuring the timeliness of monitoring during outbound calls.
  • the following uses a specific embodiment to describe in detail the implementation of step S41, namely performing speech analysis on the voice data to obtain a frame set including basic voice frames.
  • FIG. 4 illustrates a specific implementation process of step S41 provided by an embodiment of the present application, which is detailed as follows:
  • S411 Perform amplitude normalization processing on the voice data to obtain a basic voice signal.
  • the voice data obtained by the device is an analog signal.
  • the voice data must therefore be encoded using Pulse Code Modulation (PCM) so that the analog signal is converted into a digital signal.
  • the analog signal in the voice data is sampled at sampling points spaced a predetermined time apart to discretize it; the sampled signal is then quantized, and the quantized digital signal is output as a binary code group. For example, the sampling rate can be set to 8 kHz and the quantization accuracy to 16 bits.
  • the amplitude normalization processing is performed on the discretized and quantized speech data.
  • the specific amplitude normalization processing method may be dividing the sampling value of each sampling point by the maximum value among the sampling values of the speech data. It is also possible to divide the sampling value of each sampling point by the average value of the corresponding sampling value of the speech data, and converge the data to a specific interval, which is convenient for data processing.
  • the sample value of each sampling point in the audio data is converted into a corresponding standard value, thereby obtaining a basic voice signal corresponding to the voice data.
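A sketch of the first amplitude normalization variant described above (dividing each sample by the maximum sample value); using the maximum absolute value, as below, is a common refinement and an assumption here:

```python
# Peak normalization: divide every sample by the maximum absolute sample
# value so the basic voice signal lies in [-1, 1].
def normalize(samples):
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)  # an all-zero signal stays unchanged
    return [s / peak for s in samples]

print(normalize([0.5, -2.0, 1.0]))  # [0.25, -1.0, 0.5]
```

The mean-based variant mentioned in the text would simply replace `peak` with the average of the sample values.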
  • S412 Perform pre-emphasis processing on the basic voice signal to generate a target voice signal with a flat frequency spectrum.
  • pre-emphasis is performed during pre-processing for this purpose: it boosts the high-frequency part so that the signal spectrum becomes flat and the spectrum can be obtained with the same signal-to-noise ratio over the entire band from low to high frequency, which is convenient for spectrum analysis or channel parameter analysis.
  • Pre-emphasis can be performed before the anti-aliasing filter when the voice signal is digitized. This not only can perform pre-emphasis, but also can compress the dynamic range of the signal and effectively improve the signal-to-noise ratio. Pre-emphasis can be implemented using a first-order digital filter, such as a Finite Impulse Response (FIR) filter.
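A first-order pre-emphasis filter of the kind described can be sketched as y(n) = x(n) - a·x(n-1); the coefficient a = 0.97 is a common choice and an assumption here, since the application does not fix it:

```python
# First-order FIR pre-emphasis: y[n] = x[n] - a * x[n-1].
# The first sample is passed through unchanged.
def pre_emphasis(x, a=0.97):
    return [x[0]] + [x[n] - a * x[n - 1] for n in range(1, len(x))]

# A constant (low-frequency) signal is strongly attenuated after the
# first sample, illustrating the high-frequency boost.
print([round(v, 2) for v in pre_emphasis([1.0, 1.0, 1.0])])  # [1.0, 0.03, 0.03]
```
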
  • S413 Perform frame processing on the target voice signal according to a preset frame length and a preset frame shift to obtain a frame set including a basic voice frame.
  • the speech signal is short-term stationary. After pre-emphasis, the signal is framed and windowed to maintain this short-term stationarity; generally, the frame rate is between 33 and 100 frames per second. To maintain continuity between frames, so that adjacent frames transition smoothly, the overlapping framing method is adopted, as shown in FIG. 5, which gives an example of overlapping framing; the overlap between the k-th frame and the (k+1)-th frame is the frame shift.
  • the range of the ratio of the frame shift to the frame length is (0, 0.5).
  • let the pre-emphasized voice signal be s'(n), the frame length be N sampling points, and the frame shift be M sampling points; then the l-th frame is x_l(n) = s'((l-1)M + n), where 0 ≤ n ≤ N-1.
  • the corresponding window function w(n) is multiplied with the framed speech signal x_l(n) to obtain the windowed speech signal S_w, and the windowed speech frames form the frame set of basic speech frames.
  • the window functions include, but are not limited to, rectangular windows, Hamming windows, and Hanning windows.
  • the rectangular window expression is: w(n) = 1 for 0 ≤ n ≤ N-1, and w(n) = 0 otherwise,
  • where w(n) is the window function,
  • N is the number of sampling points, and
  • n is the n-th sampling point.
  • the Hamming window expression is: w(n) = 0.54 - 0.46·cos(2πn/(N-1)) for 0 ≤ n ≤ N-1, and w(n) = 0 otherwise,
  • where π is the ratio of a circle's circumference to its diameter; preferably, the value of π in this embodiment is 3.1416.
  • the Hanning window expression is: w(n) = 0.5·(1 - cos(2πn/(N-1))) for 0 ≤ n ≤ N-1, and w(n) = 0 otherwise.
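The standard rectangular, Hamming, and Hanning window definitions can be implemented directly (N sampling points, n = 0 to N-1):

```python
import math

# Standard window functions; each returns 0 outside 0 <= n <= N-1.
def rectangular(n, N):
    return 1.0 if 0 <= n <= N - 1 else 0.0

def hamming(n, N):
    if not 0 <= n <= N - 1:
        return 0.0
    return 0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1))

def hanning(n, N):
    if not 0 <= n <= N - 1:
        return 0.0
    return 0.5 * (1 - math.cos(2 * math.pi * n / (N - 1)))

# Windowing a frame multiplies each sample by the window value.
N = 5
print([round(hamming(n, N), 2) for n in range(N)])  # [0.08, 0.54, 1.0, 0.54, 0.08]
```

The tapered Hamming and Hanning windows attenuate frame edges, which reduces the spectral leakage that a rectangular window would introduce.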
  • in this embodiment, amplitude normalization is performed on the speech data to obtain a basic speech signal, the basic speech signal is pre-emphasized to generate a target speech signal with a flat spectrum, and the target speech signal is framed according to the preset frame length and preset frame shift to obtain the frame set of basic speech frames. This improves the robustness of each basic speech frame in the frame set, benefits the subsequent text conversion of the basic speech frames, and thereby improves the accuracy of speech recognition.
  • the following uses a specific embodiment to describe in detail the implementation of step S5, namely performing text matching between the real-time voice text and the prohibited terms to obtain a first matching result. The detailed implementation process of step S5 provided in this embodiment is as follows:
  • a text similarity algorithm is used to calculate the similarity between each prohibited term and the real-time voice text. If the similarity is greater than or equal to a preset similarity threshold, it is determined that the real-time voice text contains the prohibited term, and this is taken as the first matching result.
  • after performing voice recognition in step S4 to obtain a real-time voice text, the similarity between the real-time voice text and each prohibited term is calculated and compared with a preset similarity threshold; if the similarity is greater than or equal to the threshold, it is determined that the real-time voice text includes the prohibited term.
  • the preset similarity threshold may be set to 0.8 or may be set according to actual needs, which is not specifically limited here.
  • a text similarity algorithm determines the similarity of two texts by calculating the ratio of the size of their intersection to the size of their union; the larger the calculated ratio, the more similar the two texts.
  • text similarity algorithms include, but are not limited to: cosine similarity, the k-nearest-neighbor (kNN) classification algorithm, Manhattan distance, the SimHash-based Hamming distance, and the like.
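As an illustrative sketch, the intersection-over-union ratio described above can be computed over character sets; this is one simple choice, and in practice one of the listed algorithms (cosine similarity, SimHash, and so on) would be substituted.

```python
def text_similarity(a, b):
    # Ratio of the intersection size to the union size of the two
    # texts' character sets; the larger the ratio, the more similar.
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)
```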
  • for example, the outgoing call prohibited terms obtained in step S3 include 15 phrases, namely V1, V2, V3, ..., V14, and V15.
  • first, the real-time voice text G is matched with V1. The matching process is: calculate the similarity between the real-time voice text G and V1; if the similarity is greater than or equal to the preset similarity threshold, it is determined that the real-time voice text contains the prohibited term, and this round of matching ends. If the similarity is less than the preset similarity threshold, matching continues with the prohibited term V2 that follows V1, in the same way as the matching of G with V1 above.
  • the same matching method is applied to match the real-time voice text G with the remaining outgoing call prohibited terms in turn. Whenever a similarity reaches the preset threshold during this process, it is determined that the real-time voice text contains an outgoing call prohibited term, and this round of matching ends.
  • in this embodiment, the similarity between the real-time voice text and each outgoing call prohibited term is calculated and compared with the preset similarity threshold to determine whether the real-time voice text contains a prohibited term, which improves the accuracy of the matching and ensures the accuracy of the first matching result.
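The sequential matching of the real-time voice text G against V1, V2, ... with early termination can be sketched as follows; the character-set similarity and the 0.8 threshold mirror the embodiment, and any of the listed text similarity algorithms could be used instead.

```python
def similarity(a, b):
    # Simple character-set intersection-over-union similarity.
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if (sa or sb) else 1.0

def match_prohibited(text, prohibited_terms, threshold=0.8):
    # Compare the real-time voice text with each prohibited term in turn;
    # end this round of matching as soon as one similarity reaches the threshold.
    for term in prohibited_terms:
        if similarity(text, term) >= threshold:
            return True, term      # first matching result: term is contained
    return False, None
```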
  • further, in addition to step S5, in which the real-time voice text is matched with the outgoing call prohibited terms to obtain the first matching result and the first warning measure is executed, the method may also monitor and warn, after the agent's outbound call ends, on whether all necessary outbound call terms were used. As shown in FIG. 6, the voice recognition method further includes:
  • for example, the preset time threshold is 10 seconds; it can be set according to actual needs and is not limited here.
  • for each necessary outbound call term, the similarity between that term and each of the Y real-time voice texts is calculated, yielding Y similarities; if all Y similarities are less than the preset similarity threshold, it is confirmed that the current outbound text does not contain that necessary outbound call term.
  • for example, the necessary outbound call terms include: "Hello", "Can I help you?", "Please wait a moment", "Thank you for your support", and "Goodbye".
  • the current outbound text was matched with the necessary outbound call terms, and it was found that it contained "Can I help you?", "Please wait a moment", "Thank you for your support", and "Goodbye", but did not include "Hello"; it is therefore confirmed that the second matching result is that the current outbound text does not contain the necessary outbound call terms.
  • besides matching the obtained current outbound text against the necessary outbound call terms by similarity, it is also possible to query each necessary outbound call term directly in the current outbound text. If every necessary term can be found, the second matching result is confirmed as "the current outbound text contains the necessary outbound call terms"; otherwise, it is confirmed as "the current outbound text does not contain the necessary outbound call terms".
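The direct-query variant described above can be sketched as follows; the phrase list mirrors the example, and the plain substring check is an illustrative assumption.

```python
def check_required(outbound_text, required_terms):
    # Second matching result: True only if every required outbound
    # call term can be queried in the current outbound text.
    missing = [t for t in required_terms if t not in outbound_text]
    return len(missing) == 0, missing

required = ["Hello", "Can I help you", "Please wait a moment",
            "Thank you for your support", "Goodbye"]
ok, missing = check_required(
    "Can I help you. Please wait a moment. "
    "Thank you for your support. Goodbye.", required)
# ok is False: "Hello" was never used, so the second warning measure fires
```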
  • if the second matching result is that the current outbound text does not contain the necessary outbound call terms, at least one required term was not used in this outbound call; at this time, a second warning measure is executed.
  • the second warning measures include, but are not limited to: sending a warning to the monitoring end that this outbound call is non-compliant, reminding the agent of the irregularities in this outbound call, and/or generating a record of this outbound call, etc.
  • different second warning measures may be set according to the importance of the necessary outbound call terms. For example, suppose the necessary terms include word G, word H, and word I, where words G and H have importance level one, word I has importance level two, and level one is lower than level two. Then the second warning measure corresponding to level one can be set to "remind the agent of the irregularity in this outbound call and generate a record of this outbound call", and the second warning measure corresponding to level two can be set to "send a non-compliance warning for this outbound call to the monitoring end and generate a record of this outbound call". When word I is missing from the current outbound text, the second warning measure executed is to send the non-compliance warning to the monitoring end and generate the record of this outbound call.
  • in this embodiment, the current outbound text is matched with the necessary outbound call terms to obtain a second matching result, and if the second matching result is that the current outbound text does not contain the necessary terms, the second warning measure is executed, so that an automatic warning is issued when the necessary outbound call terms are not used. This avoids monitoring by manually listening to and analyzing recordings, thereby improving monitoring efficiency.
  • it should be understood that the sequence numbers of the steps in the above embodiments do not imply an execution order; the execution order of each process should be determined by its function and internal logic, and does not constitute any limitation on the implementation of the embodiments of this application.
  • FIG. 7 shows a speech recognition device that corresponds one-to-one to the speech recognition method provided by the above method embodiment. For convenience of explanation, only the parts related to the embodiment of the present application are shown.
  • the voice recognition device includes a data acquisition module 10, a department determination module 20, a template selection module 30, a voice recognition module 40, a first matching module 50, and a first warning module 60.
  • the detailed description of each function module is as follows:
  • a data acquisition module 10 is configured to acquire voice data and an equipment identifier of an outbound device used by the agent when the outbound operation of the agent is monitored;
  • the department determination module 20 is configured to determine a business department to which the agent belongs based on the equipment identification;
  • the template selection module 30 is configured to obtain a business text template corresponding to a business department, where the business text template includes a required language for outbound calls and a prohibited language for outbound calls;
  • a voice recognition module 40 configured to perform voice recognition on voice data to obtain real-time voice text, and add the real-time voice text to the current outgoing text;
  • a first matching module 50 configured to perform text matching between the real-time voice text and the outgoing call prohibited words to obtain a first matching result
  • the first early warning module 60 is configured to execute a first early warning measure if the first matching result is that the real-time voice text includes an outbound call prohibition term.
  • the speech recognition module 40 includes:
  • a speech parsing unit 41 configured to perform speech parsing on speech data to obtain a frame set including basic speech frames
  • the silence detection unit 42 is configured to perform silence detection on the basic voice frame to obtain K consecutive silence frames in the basic voice frame, where K is a natural number;
  • a frame set dividing unit 43 configured to divide a basic voice frame included in the frame set into K + 1 target voice frames according to K consecutive mute frames;
  • the text conversion unit 44 is configured to convert each target speech frame into real-time speech text.
  • the speech parsing unit 41 includes:
  • a normalization subunit 411 configured to perform amplitude normalization processing on the voice data to obtain a basic voice signal
  • a pre-emphasis subunit 412 configured to perform pre-emphasis processing on a basic voice signal to generate a target voice signal having a flat frequency spectrum
  • the framing subunit 413 is configured to frame the target voice signal according to a preset frame length and a preset frame shift to obtain a frame set of basic voice frames.
  • the first matching module 50 includes:
  • the first matching unit 51 is configured to use a text similarity algorithm to calculate, for each outgoing call prohibited term, the similarity between that prohibited term and the real-time voice text, and, if the similarity is greater than or equal to the preset similarity threshold, to take "the real-time voice text contains the outgoing call prohibited term" as the first matching result.
  • the voice recognition device further includes:
  • a second matching module 70 configured to perform text matching between the current outgoing call text and the required words of the outgoing call when detecting that the outgoing call operation of the agent is terminated, to obtain a second matching result
  • the second early warning module 80 is configured to execute a second early warning measure if the second matching result is that the current outgoing text does not contain the necessary words for the outgoing call.
  • This embodiment provides one or more nonvolatile readable storage media storing computer readable instructions.
  • the nonvolatile readable storage medium stores computer-readable instructions; when the computer-readable instructions are executed by one or more processors, the one or more processors are caused to execute the speech recognition method in the foregoing method embodiment, or to implement the functions of the modules/units in the foregoing device embodiment. To avoid repetition, details are not repeated here.
  • the non-volatile readable storage medium may include: any entity or device capable of carrying the computer-readable instruction code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electric carrier signal, a telecommunication signal, and the like.
  • FIG. 8 is a schematic diagram of a computer device according to an embodiment of the present application.
  • the computer device 90 of this embodiment includes a processor 91, a memory 92, and computer-readable instructions 93 stored in the memory 92 and executable on the processor 91, such as a voice recognition program.
  • when the processor 91 executes the computer-readable instructions 93, the steps in the foregoing speech recognition method embodiment are implemented, for example, steps S1 to S6 shown in FIG. 2. Alternatively, when the processor 91 executes the computer-readable instructions 93, the functions of the modules/units in the foregoing device embodiment are implemented, for example, the functions of modules 10 to 60 shown in FIG. 7.
  • the computer device 90 may be a desktop computer, a notebook, a palmtop computer, or a cloud server.
  • FIG. 8 is only an example of the computer device in this embodiment; the device may include more or fewer components than shown in FIG. 8, or combine certain components, or have different components.
  • the memory 92 may be an internal storage unit of the computer device, such as a hard disk or a memory, or an external storage unit of the computer device, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card, etc.
  • the computer-readable instructions 93 include program code, which may be in a source code form, an object code form, an executable file, or some intermediate form.

Abstract

Disclosed in the present application are a voice recognition method, an apparatus, a computer device, and a storage medium. The method comprises: if an outbound call operation by an agent is detected, acquiring the device identifier and voice data of the agent, and determining the service department to which the agent belongs, so as to acquire the service text template corresponding to that department; and performing voice recognition on the voice data to obtain real-time voice text. By matching the service text template against the real-time voice text in real time to obtain a matching result, and taking corresponding warning measures according to the matching result, the present invention monitors the agent's voice in real time during outbound calling and can discover non-standard wording in a timely manner and issue a warning, thereby ensuring the timeliness of monitoring. Because outbound calling is monitored without manually listening to and analyzing recordings, time is saved and monitoring efficiency is improved.

Description

Speech recognition method, device, computer equipment and storage medium
This application is based on, and claims priority to, Chinese invention patent application No. 201810529536.0, filed on May 29, 2018 and entitled "A Voice Recognition Method, Device, Terminal Device and Storage Medium".
Technical Field
The present application relates to the field of computer technology, and in particular to a speech recognition method, device, computer device, and storage medium.
Background
The call center consists of an interactive voice response system and a manual call system. The manual call system consists of a check-in system, a call platform, and interface machines. To provide customer service, a customer representative, that is, an agent, needs to check in through the check-in system. After successfully checking in to the call platform, the agent establishes a call with a customer according to the manual service requests assigned by the platform; in other words, the agent makes outbound calls to serve customers. Usually, according to business requirements, different business wordings are set for different services so as to provide customers with better service.
Although every agent is informed of the relevant business wording before making outbound calls, in practice, because of job transfers or unfamiliarity with the business, agents often use inappropriate wording during outbound calls.
To deal with inappropriate outbound wording, the current practice is to listen to the recordings afterwards and analyze them, so as to identify non-compliant outbound calls and handle them accordingly. On the one hand, since the recordings can only be reviewed after the fact, timely warnings are impossible, and the monitoring of agents' outbound voice calls is not timely; on the other hand, manually listening to and analyzing all the recordings takes a great deal of time, resulting in low monitoring efficiency.
Summary
The embodiments of the present application provide a speech recognition method, device, computer device, and storage medium, to solve the problems that the monitoring of agents' outbound voice calls is currently not timely and the monitoring efficiency is low.
An embodiment of the present application provides a speech recognition method, including:
if an outbound call operation by an agent is detected, acquiring the voice data generated during the agent's outbound call and the device identifier of the outbound call device used;
determining the business department to which the agent belongs based on the device identifier;
obtaining the business text template corresponding to the business department, where the business text template includes necessary outbound call terms and outgoing call prohibited terms;
performing speech recognition on the voice data to obtain real-time voice text, and adding the real-time voice text to the current outbound text;
matching the real-time voice text against the outgoing call prohibited terms to obtain a first matching result; and
if the first matching result is that the real-time voice text contains an outgoing call prohibited term, executing a first warning measure.
An embodiment of the present application provides a speech recognition device, including:
a data acquisition module, configured to acquire, if an outbound call operation by an agent is detected, the voice data generated during the agent's outbound call and the device identifier of the outbound call device used;
a department determination module, configured to determine the business department to which the agent belongs based on the device identifier;
a template selection module, configured to obtain the business text template corresponding to the business department, where the business text template includes necessary outbound call terms and outgoing call prohibited terms;
a speech recognition module, configured to perform speech recognition on the voice data to obtain real-time voice text, and to add the real-time voice text to the current outbound text;
a first matching module, configured to match the real-time voice text against the outgoing call prohibited terms to obtain a first matching result; and
a first warning module, configured to execute a first warning measure if the first matching result is that the real-time voice text contains an outgoing call prohibited term.
An embodiment of the present application provides a computer device, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, where the processor implements the steps of the above speech recognition method when executing the computer-readable instructions.
An embodiment of the present application provides one or more non-volatile readable storage media storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of the above speech recognition method.
Details of one or more embodiments of the present application are set forth in the accompanying drawings and the description below; other features and advantages of the present application will become apparent from the description, the drawings, and the claims.
Brief Description of the Drawings
To explain the technical solutions in the embodiments of the present application more clearly, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application; for a person of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a schematic diagram of an application environment of the speech recognition method provided by an embodiment of the present application;
FIG. 2 is an implementation flowchart of the speech recognition method provided by an embodiment of the present application;
FIG. 3 is an implementation flowchart of step S4 in the speech recognition method provided by an embodiment of the present application;
FIG. 4 is an implementation flowchart of step S41 in the speech recognition method provided by an embodiment of the present application;
FIG. 5 is an example diagram of overlapping framing of a speech signal in the speech recognition method provided by an embodiment of the present application;
FIG. 6 is an implementation flowchart of the monitoring and warning on necessary outbound call terms in the speech recognition method provided by an embodiment of the present application;
FIG. 7 is a schematic diagram of the speech recognition device provided by an embodiment of the present application;
FIG. 8 is a schematic diagram of the computer device provided by an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. Based on the embodiments in the present application, all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the protection scope of the present application.
Please refer to FIG. 1, which shows the application environment of the speech recognition method provided by an embodiment of the present application. The speech recognition method is applied to the outbound-call scenario of call center agents. The call center includes a server, clients, and a monitoring end, where the server and the clients, as well as the server and the monitoring end, are connected through a network. Agents make outbound calls through the clients; a client may specifically be, but is not limited to, a direct-line telephone, a telephone-network phone connected through a program-controlled switch, a mobile phone, a walkie-talkie, or another smart device used for communication. The server and the monitoring end may each be implemented by an independent server or by a server cluster composed of multiple servers. The speech recognition method provided in the embodiments of the present application is applied to the server.
Please refer to FIG. 2, which shows the implementation flow of the speech recognition method provided by an embodiment of the present application. Taking the application of this method to the server in FIG. 1 as an example, the method includes the following steps:
S1: if an outbound call operation by an agent is detected, acquire the voice data generated during the agent's outbound call and the device identifier of the outbound call device used by the agent.
Specifically, the server and the client are connected through a network, and the server can monitor the client in real time. When an outbound call operation by an agent is detected at the client, the device identifier of the outbound call device used by the agent and the voice data generated during the outbound call are acquired.
The client includes at least two outbound call devices, each of which is used by one agent for outbound calls.
It should be noted that the server's monitoring of the client may be implemented through the listening mode of socket inter-process communication, through data transmission control based on the Transmission Control Protocol (TCP), or through a third-party tool with a monitoring function. The preferred approach in the embodiments of the present application is the listening mode of socket communication; in practice, a suitable monitoring method can be selected according to the specific situation, which is not limited here.
S2: determine the business department to which the agent belongs based on the device identifier.
Specifically, the device identifier records the main information of the device, including but not limited to: the agent's employee number, the department to which the agent belongs, the device type, and the device number. After the device identifier is obtained, the business department to which the agent belongs can be determined from it.
For example, in one specific implementation, the obtained device identifier is 89757-KD-EN170-962346, which encodes the following information: the agent's employee number is 89757, the agent's department is KD, the device type is EN170, and the device number is 962346.
It is worth noting that an agent needs to verify his or her identity before using an outbound call device; verification methods include but are not limited to account verification, voiceprint recognition, and fingerprint recognition. After verification, the outbound call device records the corresponding information into the device identifier.
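As an illustrative sketch (assuming the hyphen-separated layout of the example identifier 89757-KD-EN170-962346), the business department can be read out of the device identifier like this:

```python
def parse_device_id(device_id):
    # Layout assumed from the example: employee number, department,
    # device type, device number, separated by hyphens.
    employee_no, department, device_type, device_no = device_id.split("-")
    return {"employee_no": employee_no, "department": department,
            "device_type": device_type, "device_no": device_no}

info = parse_device_id("89757-KD-EN170-962346")
# info["department"] is "KD", which is then used to fetch the
# business text template of that department (e.g. template KDYY)
```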
S3: obtain the business text template corresponding to the business department, where the business text template includes necessary outbound call terms and outgoing call prohibited terms.
Specifically, each business department is preset with its own business text template. According to the business department determined in step S2, the business text template corresponding to that department is obtained; each business text template contains the necessary outbound call terms and the outgoing call prohibited terms.
Taking the business department obtained in step S2 as an example, the department code is KD. The business text template KDYY corresponding to department KD is found in the database and used as the compliance template for the current agent's outbound call; that is, after the voice data of the current outbound call is converted into text, the text is checked against the business text template KDYY, so as to monitor whether the agent's outbound wording is compliant.
S4: perform speech recognition on the voice data to obtain real-time voice text, and add the real-time voice text to the current outbound text.
Specifically, speech recognition is performed on the agent's outbound voice data acquired in step S1 to obtain the real-time voice text of the outbound call, so that the agent's outbound wording can be monitored for compliance by checking the real-time voice text; at the same time, the real-time voice text is added to the current outbound text.
Here, real-time voice text refers to the following: according to the pauses and silences during each outbound call, the outbound voice data is segmented into successive pieces, and speech recognition is performed on each piece to obtain the corresponding recognized text, which is the real-time voice text.
For example, in one specific implementation, a piece of voice data is acquired from second 0 to second 1.8 and recorded as voice data E; the voice data from second 1.8 to second 3 is empty; another piece of voice data is acquired from second 3 to second 8 and recorded as voice data F. Speech recognition on voice data E yields the real-time voice text "Hello", and speech recognition on voice data F yields the real-time voice text "This is China XX, how may I help you?".
Speech recognition on the voice data may use a speech recognition algorithm or a third-party tool with a speech recognition function, which is not limited here. Speech recognition algorithms include but are not limited to: vocal tract model based speech recognition algorithms, speech template matching algorithms, and artificial neural network based speech recognition algorithms.
Preferably, the speech recognition algorithm used in the embodiments of the present application is a vocal tract model based speech recognition algorithm.
S5: match the real-time voice text against the outgoing call prohibited terms to obtain a first matching result.
Specifically, the real-time voice text obtained in step S4 is text-matched against the outgoing call prohibited terms in the business text template obtained in step S3, to check whether the real-time voice text contains any prohibited term; this real-time monitoring effectively ensures the timeliness of the monitoring.
The first matching result is either that the real-time voice text contains an outgoing call prohibited term, or that it does not.
It is easy to understand that the outgoing call prohibited terms can be set according to business requirements; there may be one prohibited term, or two or more.
It is worth noting that there may be one or more real-time voice texts; if at least one real-time voice text contains an outgoing call prohibited term, the first matching result is determined to be that the real-time voice text contains an outgoing call prohibited term.
S6:若第一匹配结果为实时语音文本包含外呼禁止用语,则执行第一预警措施。S6: If the first matching result is that the real-time voice text contains outbound call prohibition words, a first warning measure is executed.
具体地,若步骤S5得到的第一匹配结果为实时语音文本包含外呼禁止用语,则说明坐席员在本次外呼中使用了至少一个外呼禁止用语,此时,将执行第一预警措施。Specifically, if the first matching result obtained in step S5 is that the real-time voice text contains a prohibited outbound-call term, the agent has used at least one prohibited term in this outbound call, and the first warning measure is executed.
其中,第一预警措施包括但不限于:向监控端发送本次外呼不规范的预警提示、提醒本次外呼的坐席员本次外呼中出现的不规范事项和/或断开当前外呼设备的网络连接等,其具体可根据实际情况设定,此处不作具体限制。The first warning measure includes, but is not limited to: sending the monitoring end a warning that this outbound call is non-compliant, reminding the agent of the irregularities that occurred in this call, and/or disconnecting the network connection of the current outbound-call device; it can be set according to the actual situation and is not specifically limited here.
进一步地,可以根据外呼禁止用语的严重程度,设置不同的第一预警措施。例如,若外呼禁止用语包括词语A、词语B和词语C,其中,词语A和词语B的严重程度为一级,词语C的严重程度为二级,并且一级低于二级,则可以设置一级对应的第一预警措施为“向监控端发送本次外呼不规范的预警提示”,同时设置二级对应的第一预警措施为“断开当前外呼设备的网络连接”。当实时语音文本包含词语C时,执行第一预警措施,直接断开当前外呼设备的网络连接,终止坐席员的外呼过程。Further, different first warning measures may be set according to the severity of the prohibited terms. For example, if the prohibited terms include word A, word B, and word C, where words A and B have severity level one, word C has severity level two, and level one is lower than level two, the level-one measure can be set to "send the monitoring end a warning that this outbound call is non-compliant" and the level-two measure to "disconnect the network connection of the current outbound-call device". When the real-time voice text contains word C, the first warning measure is executed: the network connection of the current outbound-call device is disconnected directly, terminating the agent's outbound call.
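The severity-tiered dispatch described above can be sketched as follows. This is a minimal, hypothetical Python illustration, not the application's implementation; the term list, the level assignments, and the measure names are assumed examples.

```python
from typing import Optional

# Hypothetical severity table: words A/B are level 1, word C is level 2.
BANNED_TERM_SEVERITY = {"词语A": 1, "词语B": 1, "词语C": 2}

MEASURES = {
    1: "send_alert_to_monitor",       # level 1: warn the monitoring end
    2: "disconnect_outbound_device",  # level 2: cut the device's network link
}

def first_warning_measure(recognized_text: str) -> Optional[str]:
    """Return the measure for the most severe banned term found, else None."""
    hits = [sev for term, sev in BANNED_TERM_SEVERITY.items()
            if term in recognized_text]
    return MEASURES[max(hits)] if hits else None
```

When several terms of different levels occur in one text, the sketch applies the measure of the highest severity level found.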
在图2对应的实施例中,若监测到有坐席员的外呼操作,则获取该坐席员的设备标识和语音数据,通过该设备标识,确定坐席员所属的业务部门,进而获取该业务部门对应的业务文本模板,并对语音数据进行语音识别,得到实时语音文本,将实时语音文本存入当前外呼文本,通过实时对外呼禁止用语和实时语音文本进行文本匹配,得到第一匹配结果,若第一匹配结果为实时语音文本包含外呼禁止用语,则执行第一预警措施,实现了对坐席员外呼过程中的语音进行实时监控,当坐席员在外呼过程中使用了外呼禁止用语时,能够及时发现并预警,从而确保了监控的及时性,并且,由于无需通过人工听取并分析录音来对外呼进行监控,从而节约了时间,提高了监控效率。In the embodiment corresponding to FIG. 2, if an agent's outbound-call operation is detected, the agent's device identifier and voice data are acquired; the business department the agent belongs to is determined from the device identifier, and the corresponding business text template is obtained; speech recognition is performed on the voice data to obtain real-time voice text, which is stored in the current outbound-call text; the prohibited outbound-call terms are text-matched against the real-time voice text in real time to obtain the first matching result; and if the first matching result is that the real-time voice text contains a prohibited term, the first warning measure is executed. This realizes real-time monitoring of the agent's speech during the outbound call: when the agent uses a prohibited term, it can be detected and warned of in time, ensuring the timeliness of monitoring; and since there is no need to manually listen to and analyze recordings, time is saved and monitoring efficiency is improved.
接下来,在图2对应的实施例的基础之上,下面通过一个具体的实施例来对步骤S4中所提及的对语音数据进行语音识别,得到实时语音文本的具体实现方法进行详细说明。Next, on the basis of the embodiment corresponding to FIG. 2, a specific embodiment is used to perform a detailed description on the specific implementation method of performing voice recognition on the voice data mentioned in step S4 to obtain real-time voice text.
请参阅图3,图3示出了本申请实施例提供的步骤S4的具体实现流程,详述如下:Please refer to FIG. 3, which illustrates a specific implementation process of step S4 provided by an embodiment of the present application, which is detailed as follows:
S41:对语音数据进行语音解析,得到包含基础语音帧的帧集合。S41: Perform speech analysis on the speech data to obtain a frame set including basic speech frames.
具体地,对获取到的语音数据进行语音解析,得到包含基础语音帧的帧集合,语音解析包括但不限于:语音编码和语音信号的预处理等。Specifically, speech analysis is performed on the acquired speech data to obtain a frame set including basic speech frames. Speech analysis includes, but is not limited to, speech encoding and pre-processing of speech signals.
其中,语音编码就是对模拟的语音信号进行编码,将模拟信号转化成数字信号,从而降低传输码率并进行数字传输,语音编码的基本方法可分为波形编码、参量编码(音源编码)和混合编码。Among them, speech coding is to encode analog speech signals and convert the analog signals into digital signals, thereby reducing the transmission code rate and digital transmission. The basic methods of speech encoding can be divided into waveform encoding, parametric encoding (sound source encoding) and mixed coding.
优选地,本申请使用的语音编码方式为波形编码,波形编码是将时域的模拟话音的波形信号经过取样、量化、编码而形成数字话音信号,波形编码可提供高话音质量。Preferably, the voice coding method used in the present application is waveform coding, in which the waveform signal of the analog voice in the time domain is sampled, quantized, and encoded to form a digital voice signal; waveform coding can provide high voice quality.
其中,语音信号的预处理是指在对语音信号进行分析和处理之前,对其进行预加重、分帧、加窗等预处理操作。语音信号的预处理的目的是消除因为人类发声器官本身和由于采集语音信号的设备所带来的混叠、高次谐波失真、高频等等因素,对语音信号质量的影响。尽可能保证后续语音处理得到的信号更均匀、平滑,为信号参数提取提供优质的参数,提高语音处理质量。Among them, the preprocessing of a voice signal refers to pre-emphasis, framing, windowing and other preprocessing operations on the voice signal before analysis and processing. The purpose of voice signal pre-processing is to eliminate the effects of aliasing, higher harmonic distortion, high frequency and other factors on the quality of the voice signal caused by the human vocal organ itself and the equipment that collects the voice signal. As far as possible, ensure that the signals obtained by subsequent speech processing are more uniform and smooth, provide high-quality parameters for signal parameter extraction, and improve the quality of speech processing.
S42:对基础语音帧进行静音检测,得到基础语音帧中的K个连续静音帧,其中,K 为自然数。S42: Perform mute detection on the basic voice frame to obtain K consecutive mute frames in the basic voice frame, where K is a natural number.
具体地,在外呼通话持续期间,语音数据中的语音信号可分为激活期和静默期两个状态,静默期不传送任何语音信号,上、下行链路的激活期和静默期相互独立。坐席员在外呼过程中,在每次发音前后,均会有停顿的状态,这个状态会带来语音信号的停顿,即静默期,在进行语音识别并转换文本的时候,需要检测出静默期状态,进而将静默期与激活期进行分离,以得到持续的激活期,将保留下来的持续的激活期的语音信号作为目标语音帧。Specifically, during the outbound call, the voice signal in the voice data can be divided into two states, an active period and a silent period; no voice signal is transmitted during the silent period, and the uplink and downlink active and silent periods are independent of each other. During the outbound call, the agent pauses before and after each utterance; such a pause in the voice signal is a silent period. When performing speech recognition and text conversion, the silent periods need to be detected and separated from the active periods to obtain continuous active periods, and the retained voice signal of each continuous active period is used as a target voice frame.
其中,检测静默音状态的方式包括但不限于:语音端点检测、探测音频静音算法和语音活动检测(Voice Activity Detection,VAD)算法等。The methods for detecting the state of the silence include, but are not limited to, voice endpoint detection, detection of audio mute algorithms, and voice activity detection (VAD) algorithms.
优选地,本申请实施例使用的对基础语音帧进行静音检测,得到基础语音帧中的K个连续静音帧的具体实现流程包括步骤A至步骤D,详述如下:Preferably, the specific implementation of performing silence detection on the basic voice frames to obtain K continuous silence frames, as used in the embodiment of the present application, includes steps A to D, detailed as follows:
步骤A:计算每帧基础语音帧的帧能量。Step A: Calculate the frame energy of each basic speech frame.
具体地,帧能量是语音信号的短时能量,反映了语音帧的语音信息的数据量,通过帧能量可以判断该语音帧是语句帧还是静音帧。Specifically, the frame energy is the short-time energy of the voice signal and reflects the amount of voice information in the voice frame; the frame energy can be used to determine whether a voice frame is a speech frame or a silence frame.
步骤B:针对每帧基础语音帧,若该基础语音帧的帧能量小于预设的帧能量阈值,则标记该基础语音帧为静音帧。Step B: For each basic speech frame, if the frame energy of the basic speech frame is less than a preset frame energy threshold, mark the basic speech frame as a silent frame.
具体地,帧能量阈值为预先设定的参数,若计算得到的基础语音帧的帧能量小于预设的帧能量阈值,则将对应的基础语音帧标记为静音帧,该帧能量阈值具体可以根据实际需求进行设置,如帧能量阈值设置为0.5,也可以根据计算得到的各个基础语音帧的帧能量进行具体分析设置,此处不做限制。Specifically, the frame energy threshold is a preset parameter. If the calculated frame energy of a basic voice frame is less than the preset threshold, the frame is marked as a silence frame. The threshold can be set according to actual requirements, for example 0.5, or set based on analysis of the calculated frame energies of the basic voice frames, which is not limited here.
例如,在一具体实施方式中,帧能量阈值设置为0.5,对6个基础语音帧:J1、J2、J3、J4、J5和J6计算帧能量,得到结果分别为:1.6、0.2、0.4、1.7、1.1和0.8,由此结果容易理解,基础语音帧J2和基础语音帧J3为静音帧。For example, in a specific embodiment, the frame energy threshold is set to 0.5, and the frame energies of 6 basic voice frames J1, J2, J3, J4, J5, and J6 are calculated as 1.6, 0.2, 0.4, 1.7, 1.1, and 0.8, respectively; it follows that basic voice frames J2 and J3 are silence frames.
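Steps A and B can be sketched as follows. The short-time energy is taken here as the sum of squared sample amplitudes, which is one common definition; the text does not fix a formula, so that choice is an assumption. The energy values and threshold reproduce the J1-J6 example.

```python
def frame_energy(frame):
    """Short-time energy: sum of squared sample amplitudes (assumed definition)."""
    return sum(x * x for x in frame)

def mark_silence(energies, threshold=0.5):
    """True where a basic voice frame's energy falls below the threshold."""
    return [e < threshold for e in energies]

# Energies of J1..J6 from the example; threshold 0.5 marks J2 and J3 as silence.
flags = mark_silence([1.6, 0.2, 0.4, 1.7, 1.1, 0.8])
```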
步骤C:若检测到H个连续的静音帧,且H大于预设的连续阈值I,则将该H个连续的静音帧组成的帧集合作为连续静音帧。Step C: If H consecutive silence frames are detected and H is greater than the preset continuity threshold I, the frame set composed of the H consecutive silence frames is taken as one continuous silence frame.
具体地,连续阈值I可以根据实际需要进行预先设置,若存在连续的静音帧且其数量H大于预设的连续阈值I,则将该H个连续的静音帧组成的区间中的所有静音帧进行合并,得到一个连续静音帧。Specifically, the continuity threshold I can be preset according to actual needs. If there are H consecutive silence frames and H is greater than the preset threshold I, all the silence frames in the interval formed by these H frames are merged into one continuous silence frame.
例如,在一具体实施方式中,预设的连续阈值I为5,在某一时刻,获取到的静音帧状态如表一所示,表一示出了50个基础语音帧组成的帧集合,由表一可知,包含多于5个连续静音帧的区间为:帧序号7至帧序号13对应的基础语音帧组成的区间P,以及帧序号21至帧序号29对应的基础语音帧组成的区间Q,因而,将区间P中包含的帧序号7至帧序号13对应的7个基础语音帧进行组合,得到一个连续静音帧P,该连续静音帧P的时长为帧序号7至帧序号13对应的7个基础语音帧的时长之和,按此方法,将区间Q中包含的帧序号21至帧序号29对应的基础语音帧进行组合,作为另一个连续静音帧Q,连续静音帧Q的时长为帧序号21至帧序号29对应的9个基础语音帧的时长之和。For example, in a specific embodiment, the preset continuity threshold I is 5. At a certain moment, the states of the acquired silence frames are as shown in Table 1, which shows a frame set of 50 basic voice frames. As can be seen from Table 1, the intervals containing more than 5 consecutive silence frames are: interval P, composed of the basic voice frames with frame numbers 7 to 13, and interval Q, composed of the basic voice frames with frame numbers 21 to 29. Therefore, the 7 basic voice frames with frame numbers 7 to 13 in interval P are combined into one continuous silence frame P, whose duration is the sum of the durations of those 7 frames; in the same way, the basic voice frames with frame numbers 21 to 29 in interval Q are combined into another continuous silence frame Q, whose duration is the sum of the durations of those 9 frames.
表一Table 1
帧序号Frame number: 1 2 3 4 5 6 7 8 9 10
是否静音帧Silence frame: no no yes no no no yes yes yes yes
帧序号Frame number: 11 12 13 14 15 16 17 18 19 20
是否静音帧Silence frame: yes yes yes no no no no no no no
帧序号Frame number: 21 22 23 24 25 26 27 28 29 30
是否静音帧Silence frame: yes yes yes yes yes yes yes yes yes no
帧序号Frame number: 31 32 33 34 35 36 37 38 39 40
是否静音帧Silence frame: yes yes no no no no no no yes yes
帧序号Frame number: 41 42 43 44 45 46 47 48 49 50
是否静音帧Silence frame: no yes yes no no yes no no no no
步骤D:按照步骤A至步骤C的方法,获取连续静音帧的总数K。Step D: Following the method of steps A to C, obtain the total number K of continuous silence frames.
以步骤C中列举的表一为例,获取的连续静音帧为连续静音帧P和连续静音帧Q,因此,在步骤C对应的举例中,K的值为2。Taking Table 1 in step C as an example, the acquired continuous silence frames are continuous silence frames P and Q; therefore, in that example, the value of K is 2.
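Steps C and D, which group runs of more than I consecutive silence frames and count them, can be sketched as follows. The silence pattern reproduces Table 1 (runs at frames 7-13 and 21-29), so the count K is 2; the run-detection loop itself is an illustrative implementation, not the application's.

```python
def continuous_silence_runs(is_silent, threshold=5):
    """Return 1-based (start, end) frame numbers of each run of H consecutive
    silence frames with H greater than the continuity threshold I."""
    runs, start = [], None
    for i, silent in enumerate(is_silent, start=1):
        if silent and start is None:
            start = i
        elif not silent and start is not None:
            if i - start > threshold:      # run length H = i - start
                runs.append((start, i - 1))
            start = None
    if start is not None and len(is_silent) - start + 1 > threshold:
        runs.append((start, len(is_silent)))
    return runs

# Silence frame numbers of Table 1 (1-based).
SILENT = {3, 7, 8, 9, 10, 11, 12, 13, 21, 22, 23, 24, 25, 26, 27, 28, 29,
          31, 32, 39, 40, 42, 43, 46}
flags = [i in SILENT for i in range(1, 51)]
runs = continuous_silence_runs(flags)   # K = len(runs)
```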
S43:根据K个连续静音帧,将帧集合中包含的基础语音帧划分成K+1个目标语音帧。S43: Divide the basic voice frames contained in the frame set into K + 1 target voice frames according to K consecutive mute frames.
具体地,将步骤S42中得到的K个连续静音帧作为分界点,将帧集合中包含的基础语音帧划分开来,得到K+1个基础语音帧的集合区间,将每个集合区间中包含的所有基础语音帧,作为一个目标语音帧。Specifically, using the K continuous silence frames obtained in step S42 as dividing points, the basic voice frames in the frame set are partitioned into K+1 interval sets, and all the basic voice frames in each interval set are taken as one target voice frame.
例如,在一具体实施方式中,获取到的静音帧的状态如S42中步骤C的表一所示,该表示出了两个连续静音帧,分别为帧序号7至帧序号13对应的7个基础语音帧进行组合得到的一个连续静音帧P,以及帧序号21至帧序号29对应的9个基础语音帧进行组合得到的一个连续静音帧Q,将这两个连续静音帧作为分界点,将这个包含50个基础语音帧的帧集合划分成了三个区间,分别为:帧序号1至帧序号6对应的基础语音帧组成的区间M1,帧序号14至帧序号20对应的基础语音帧组成的区间M2,以及帧序号30至帧序号50对应的基础语音帧组成的区间M3,将区间M1中所有的基础语音帧进行组合,得到一个组合后的语音帧,作为目标语音帧M1。For example, in a specific embodiment, the states of the acquired silence frames are as shown in Table 1 of step C in S42, which shows two continuous silence frames: continuous silence frame P, obtained by combining the 7 basic voice frames with frame numbers 7 to 13, and continuous silence frame Q, obtained by combining the 9 basic voice frames with frame numbers 21 to 29. Using these two continuous silence frames as dividing points, the frame set of 50 basic voice frames is divided into three intervals: interval M1, composed of the basic voice frames with frame numbers 1 to 6; interval M2, with frame numbers 14 to 20; and interval M3, with frame numbers 30 to 50. All the basic voice frames in interval M1 are combined into one voice frame, which serves as target voice frame M1.
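Step S43 can be sketched by using the K continuous silence runs as boundaries. With the Table 1 runs (7-13) and (21-29) over 50 frames, this yields the three target intervals M1, M2, and M3 described above; the splitting function is an illustrative sketch.

```python
def split_into_targets(total_frames, silence_runs):
    """Split frames 1..total_frames into K+1 target intervals, using the K
    continuous-silence runs (1-based, inclusive) as dividing points."""
    segments, cursor = [], 1
    for start, end in silence_runs:
        if start > cursor:
            segments.append((cursor, start - 1))
        cursor = end + 1
    if cursor <= total_frames:
        segments.append((cursor, total_frames))
    return segments

# M1 = frames 1-6, M2 = frames 14-20, M3 = frames 30-50
targets = split_into_targets(50, [(7, 13), (21, 29)])
```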
S44:将每个目标语音帧转换为实时语音文本。S44: Convert each target voice frame into real-time voice text.
具体地,对每个目标语音帧进行文本转换,得到该目标语音帧对应的实时语音文本。Specifically, text conversion is performed on each target voice frame to obtain a real-time voice text corresponding to the target voice frame.
其中,文本转换可使用支持语音转文本的工具,也可以使用文本转换算法,此处不作具体限制。The text conversion may use a speech-to-text tool or a text conversion algorithm, which is not specifically limited here.
在图3对应的实施例中,对语音数据进行语音解析,得到包含基础语音帧的帧集合,进而对基础语音帧进行静音检测,得到基础语音帧中的K个连续静音帧,根据这K个连续静音帧,将帧集合中包含的基础语音帧划分成K+1个目标语音帧,将每个目标语音帧均转换为一个实时语音文本,使得接收到的语音信号被实时转换成一个个独立的实时语音文本,以便于使用该实时语音文本来对外呼禁止用语进行匹配,保证了外呼过程中监控的及时性。In the embodiment corresponding to FIG. 3, speech parsing is performed on the voice data to obtain a frame set containing basic voice frames; silence detection is then performed to obtain K continuous silence frames; according to these K continuous silence frames, the basic voice frames in the frame set are divided into K+1 target voice frames, and each target voice frame is converted into a real-time voice text. The received voice signal is thus converted in real time into independent real-time voice texts, so that they can be matched against the prohibited outbound-call terms, ensuring the timeliness of monitoring during the outbound call.
接下来,在图3对应的实施例的基础之上,下面通过一个具体的实施例来对步骤S41中所提及的对语音数据进行语音解析,得到包含基础语音帧的帧集合的具体实现方法进行详细说明。Next, on the basis of the embodiment corresponding to FIG. 3, the specific implementation of performing speech parsing on the voice data mentioned in step S41 to obtain a frame set containing basic voice frames is described in detail through a specific embodiment.
请参阅图4,图4示出了本申请实施例提供的步骤S41的具体实现流程,详述如下:Please refer to FIG. 4, which illustrates a specific implementation process of step S41 provided by an embodiment of the present application, which is detailed as follows:
S411:对语音数据进行幅值归一化处理,得到基础语音信号。S411: Perform amplitude normalization processing on the voice data to obtain a basic voice signal.
具体地,利用设备获取的语音数据都是模拟信号,在获取到语音数据后,要对语音数据采用脉冲编码调制技术(Pulse Code Modulation,PCM)进行编码,使这些模拟信号转化为数字信号,并将语音数据中的模拟信号每隔预设的时间对一个采样点进行采样,使其离散化,进而对采样信号量化,以二进制码组的方式输出量化后的数字信号,根据语音的频谱范围200-3400Hz,采样率可设置为8KHz,量化精度为16bit。Specifically, the voice data acquired by the device is an analog signal. After the voice data is acquired, it is encoded using Pulse Code Modulation (PCM) to convert the analog signal into a digital signal: the analog signal is sampled at one sampling point per preset interval to discretize it, the sampled signal is then quantized, and the quantized digital signal is output as binary code groups. Given the speech spectrum range of 200-3400 Hz, the sampling rate can be set to 8 kHz with a quantization precision of 16 bits.
应理解,此处采样率和量化精度的数值范围,为本申请优选范围,但可以根据实际应用的需要进行设置,此处不做限制。It should be understood that the numerical ranges of the sampling rate and quantization accuracy herein are the preferred ranges of the present application, but can be set according to the needs of practical applications, and are not limited here.
进一步地,对经过离散化和量化的语音数据进行幅值归一化处理,具体幅值归一化处理方式可以是将每个采样点的采样值除以语音数据的采样值中的最大值,也可以将每个采样点的采样值除以对应语音数据的采样值的均值,将数据收敛到特定区间,方便进行数据处理。Further, the amplitude normalization processing is performed on the discretized and quantized speech data. The specific amplitude normalization processing method may be dividing the sampling value of each sampling point by the maximum value among the sampling values of the speech data. It is also possible to divide the sampling value of each sampling point by the average value of the corresponding sampling value of the speech data, and converge the data to a specific interval, which is convenient for data processing.
值得说明的是,在幅值归一化处理之后,将音频数据中每个采样点的采样值转换为对应的标准值,从而得到与语音数据对应的基础语音信号。It is worth noting that after the amplitude normalization process, the sample value of each sampling point in the audio data is converted into a corresponding standard value, thereby obtaining a basic voice signal corresponding to the voice data.
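The peak-based variant of the amplitude normalization in S411, dividing each sample by the maximum absolute sample value (one of the two options the text gives), can be sketched as follows; the 16-bit sample values are illustrative.

```python
def normalize_amplitude(samples):
    """Map quantized samples into [-1, 1] by the maximum absolute value."""
    peak = max(abs(x) for x in samples)
    if peak == 0:
        return list(samples)          # all-zero signal: nothing to scale
    return [x / peak for x in samples]

# Illustrative 16-bit sample values scaled to standard values.
normalized = normalize_amplitude([0, 16384, -32768])
```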
S412:对基础语音信号进行预加重处理,生成具有平坦频谱的目标语音信号。S412: Perform pre-emphasis processing on the basic voice signal to generate a target voice signal with a flat frequency spectrum.
具体地,由于声门激励和口鼻辐射会对基础语音信号的平均功率谱产生影响,导致高频在超过800Hz时会按6dB/倍频跌落,所以在计算基础语音信号频谱时,频率越高相应的成分越小,为此要在预处理中进行预加重(Pre-emphasis)处理,预加重的目的是提高高频部分,使信号的频谱变得平坦,保持在低频到高频的整个频带中,能用同样的信噪比求频谱,以便于频谱分析或者声道参数分析。预加重可在语音信号数字化时在反混叠滤波器之前进行,这样不仅可以进行预加重,而且可以压缩信号的动态范围,有效地提高信噪比。预加重可使用一阶的数字滤波器来实现,例如:有限脉冲响应(Finite Impulse Response,FIR)滤波器。Specifically, since glottal excitation and mouth-nose radiation affect the average power spectrum of the basic voice signal, high frequencies above 800 Hz fall off at 6 dB per octave; thus, when computing the spectrum of the basic voice signal, the higher the frequency, the smaller the corresponding component. Pre-emphasis is therefore applied during pre-processing: its purpose is to boost the high-frequency part and flatten the signal spectrum, so that the spectrum can be computed with the same signal-to-noise ratio over the whole band from low to high frequencies, facilitating spectrum analysis or vocal tract parameter analysis. Pre-emphasis can be performed before the anti-aliasing filter when the voice signal is digitized, which not only performs pre-emphasis but also compresses the dynamic range of the signal and effectively improves the signal-to-noise ratio. Pre-emphasis can be implemented with a first-order digital filter, for example a Finite Impulse Response (FIR) filter.
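A first-order FIR pre-emphasis filter of the kind mentioned above can be sketched as y[n] = x[n] - a·x[n-1]. The coefficient a = 0.97 is a common choice and is an assumption here, not a value given in the text.

```python
def pre_emphasis(samples, a=0.97):
    """First-order FIR high-frequency boost: y[n] = x[n] - a * x[n-1]."""
    if not samples:
        return []
    out = [samples[0]]                # first sample passes through unchanged
    for n in range(1, len(samples)):
        out.append(samples[n] - a * samples[n - 1])
    return out
```

A constant (purely low-frequency) signal is strongly attenuated after the first sample, which is exactly the high-pass behavior pre-emphasis is meant to provide.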
S413:按照预设的帧长和预设的帧移,对目标语音信号进行分帧处理,得到包含基础语音帧的帧集合。S413: Perform frame processing on the target voice signal according to a preset frame length and a preset frame shift to obtain a frame set including a basic voice frame.
具体地,语音信号具有短时平稳的性质,语音信号在经过预加重处理后,需要对其进行分帧和加窗处理,来保持信号的短时平稳性,通常情况下,每秒钟包含的帧数在33~100帧之间。为了保持帧与帧之间的连续性,使得相邻两帧都能平滑过渡,采用交叠分帧的方式,如图5所示,图5示出了交叠分帧的样例,图5中第k帧和第k+1帧之间的交叠部分即为帧移。Specifically, a voice signal is short-time stationary. After pre-emphasis, it needs to be framed and windowed to preserve this short-time stationarity; typically there are 33 to 100 frames per second. To maintain continuity between frames so that adjacent frames transition smoothly, overlapping framing is used, as shown in FIG. 5, which shows an example of overlapping framing; the overlapping part between the k-th frame and the (k+1)-th frame is the frame shift.
优选地,帧移与帧长的比值的取值范围为(0,0.5)。Preferably, the range of the ratio of the frame shift to the frame length is (0, 0.5).
例如,在一具体实施方式中,预加重后的语音信号为s'(n),帧长为N个采样点,帧移为M个采样点。当第l帧对应的采样点为第n个时,第l帧的语音信号x_l(n)与各参数之间的对应关系为:For example, in a specific embodiment, the pre-emphasized voice signal is s'(n), the frame length is N sampling points, and the frame shift is M sampling points. When the sampling point corresponding to the l-th frame is the n-th, the relationship between the l-th frame signal x_l(n) and the parameters is:
x_l(n) = x[(l-1)·M + n]
其中,n=0,1,...,N-1,N=256。where n = 0, 1, ..., N-1, and N = 256.
进一步地,目标语音信号经过分帧之后,使用相应的窗函数w(n)与分帧后的语音信号x_l(n)相乘,即得到加窗后的语音信号S_w,将该语音信号作为基础语音帧的帧集合。Further, after the target voice signal is framed, each framed signal x_l(n) is multiplied by the corresponding window function w(n) to obtain the windowed voice signal S_w, which serves as the frame set of basic voice frames.
其中,窗函数包括但不限于:矩形窗(Rectangular)、汉明窗(Hamming)和汉宁窗(Hanning)等。The window functions include, but are not limited to, rectangular windows, Hamming windows, and Hanning windows.
矩形窗表达式为:The rectangular window expression is:
w(n) = 1, 0 ≤ n ≤ N-1;  w(n) = 0, otherwise
其中,w(n)为窗函数,N为采样点的个数,n为第n个采样点。Among them, w (n) is a window function, N is the number of sampling points, and n is the nth sampling point.
汉明窗表达式为:The Hamming window expression is:
w(n) = 0.54 - 0.46·cos(2·pi·n/(N-1)), 0 ≤ n ≤ N-1;  w(n) = 0, otherwise
其中,pi为圆周率,优选地,本申请实施例中pi的取值为3.1416。Here, pi is the circle constant; preferably, the value of pi in the embodiment of the present application is 3.1416.
汉宁窗表达式为:The Hanning window expression is:
w(n) = 0.5·(1 - cos(2·pi·n/(N-1))), 0 ≤ n ≤ N-1;  w(n) = 0, otherwise
对经过预加重处理的语音信号进行分帧和加窗处理,使得语音信号保持帧与帧之间的连续性,并剔除掉一些异常的信号点,得到基础语音帧的帧集合,提高了语音信号的鲁棒性。Framing and windowing the pre-emphasized voice signal maintains continuity between frames and removes some abnormal signal points, yielding the frame set of basic voice frames and improving the robustness of the voice signal.
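The framing formula x_l(n) = x[(l-1)·M + n] and the Hamming window above can be sketched together as follows. N = 8 and M = 3 are toy values chosen so that the shift/length ratio lies in (0, 0.5); they stand in for the N = 256 of the example.

```python
import math

def split_frames(signal, frame_len, frame_shift):
    """Overlapping framing: frame l starts at sample (l-1) * frame_shift."""
    frames, start = [], 0
    while start + frame_len <= len(signal):
        frames.append(signal[start:start + frame_len])
        start += frame_shift
    return frames

def hamming(n_samples):
    """Hamming window: w(n) = 0.54 - 0.46 * cos(2*pi*n / (N-1)), n = 0..N-1."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]

def window_frames(frames):
    """Multiply each frame sample-by-sample with the window function."""
    w = hamming(len(frames[0]))
    return [[s * wn for s, wn in zip(frame, w)] for frame in frames]

frames = split_frames(list(range(20)), frame_len=8, frame_shift=3)
windowed = window_frames(frames)
```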
在图4对应的实施例中,通过对语音数据进行幅值归一化处理,得到基础语音信号,进而对基础语音信号进行预加重处理,生成具有平坦频谱的目标语音信号,按照预设的帧长和预设的帧移,对目标语音信号进行分帧处理,得到基础语音帧的帧集合,提升了帧集合中每个基础语音帧的鲁棒性,有利于在后续利用基础语音帧的帧集合来进行语音转换文本时,提升转换的准确性,从而有利于提高语音识别的准确率。In the embodiment corresponding to FIG. 4, amplitude normalization is performed on the voice data to obtain a basic voice signal; pre-emphasis is then applied to generate a target voice signal with a flat spectrum; and the target voice signal is framed according to the preset frame length and frame shift to obtain the frame set of basic voice frames. This improves the robustness of each basic voice frame in the frame set, which benefits the accuracy of subsequent speech-to-text conversion using the frame set and thus helps improve speech recognition accuracy.
在图2至图4对应的实施例的基础之上,下面通过一个具体的实施例来对步骤S5中所提及的将实时语音文本与外呼禁止用语进行文本匹配,得到第一匹配结果的具体实现方法进行详细说明。Based on the embodiments corresponding to FIG. 2 to FIG. 4, the following uses a specific embodiment to perform text matching on the real-time voice text and the outgoing call prohibition term mentioned in step S5 to obtain a first matching result. The specific implementation method will be described in detail.
本申请实施例提供的步骤S5的具体实现流程,详述如下:The detailed implementation process of step S5 provided in the embodiment of the present application is detailed as follows:
针对每个外呼禁止用语,采用文本相似度算法,计算该外呼禁止用语与实时语音文本之间的相似度,若相似度大于或等于预设的相似度阈值,则将实时语音文本包含该外呼禁止用语作为第一匹配结果。For each prohibited outbound-call term, a text similarity algorithm is used to calculate the similarity between that term and the real-time voice text; if the similarity is greater than or equal to the preset similarity threshold, the first matching result is that the real-time voice text contains that prohibited term.
具体地,经过步骤S4进行语音识别,得到实时语音文本之后,计算该实时语音文本与每个外呼禁止用语之间的相似度,并将该相似度与预设的相似度阈值进行比较,若该相似度大于或等于预设的相似度阈值,则确定实时语音文本包含该外呼禁止用语,预设的相似度阈值可以设置为0.8,也可以根据实际需要进行设置,此处不作具体限制。Specifically, after performing voice recognition in step S4 to obtain a real-time voice text, calculate the similarity between the real-time voice text and each of the outgoing call prohibited words, and compare the similarity with a preset similarity threshold. If the similarity is greater than or equal to a preset similarity threshold, it is determined that the real-time voice text includes the outbound prohibition term. The preset similarity threshold may be set to 0.8 or may be set according to actual needs, which is not specifically limited here.
其中,文本相似度算法是通过计算两个文本之间的交集和并集大小的比例来判断这两个文本的相似度的算法,计算出的比例越大,表示两个文本越相似。Among them, the text similarity algorithm is an algorithm that determines the similarity of two texts by calculating the ratio of the intersection and union sizes between two texts. The larger the calculated ratio, the more similar the two texts are.
文本相似度算法包括但不限于:余弦相似性、最近邻(k-NearestNeighbor,kNN)分类算法、曼哈顿距离(Manhattan Distance)、基于SimHash算法的汉明距离等。Text similarity algorithms include, but are not limited to: cosine similarity, k-NearestNeighbor (kNN) classification algorithm, Manhattan Distance, Hamming distance based on SimHash algorithm, and the like.
值得说明的是,在匹配过程中,若一外呼禁止用语与实时语音文本的相似度大于或等于预设的相似度阈值,则可确定匹配结果为实时语音文本包含该外呼禁止用语,并结束本次匹配,而无需继续与剩余的外呼禁止用语进行匹配。It is worth noting that, during matching, if the similarity between a prohibited term and the real-time voice text is greater than or equal to the preset similarity threshold, the matching result can be determined as the real-time voice text containing that prohibited term, and this round of matching ends without continuing to match the remaining prohibited terms.
例如,在一具体实施方式中,在步骤S3中获取到的外呼禁止用语包括15个短语,分别为V1,V2,V3,...,V14,V15,在获取到实时语音文本G后,将实时语音文本G与V1进行匹配,其匹配过程为:计算实时语音文本G与V1的相似度,若相似度大于或等于预设的相似度阈值,则确定实时语音文本包含禁用词汇,结束本次匹配,若相似度小于预设的相似度阈值,则继续将实时语音文本G与V1后面一个外呼禁止用语V2进行匹配,按照上述实时语音文本G与V1进行匹配的方法,来对实时语音文本G与剩余外呼禁止用语进行匹配,若匹配过程中出现相似度大于或等于预设的阈值时,则确定该实时语音文本包含外呼禁止用语,并结束本次匹配。For example, in a specific embodiment, the prohibited outbound-call terms obtained in step S3 include 15 phrases V1, V2, V3, ..., V14, V15. After real-time voice text G is obtained, it is matched against V1: the similarity between G and V1 is calculated; if it is greater than or equal to the preset similarity threshold, the real-time voice text is determined to contain a prohibited term and this round of matching ends; if it is less than the threshold, G is then matched against the next prohibited term V2. Following the same method, G is matched against the remaining prohibited terms; once a similarity greater than or equal to the preset threshold occurs, the real-time voice text is determined to contain a prohibited term and this round of matching ends.
在本实施例中,通过将实时语音文本与每个外呼禁止用语计算相似度,并通过比较相似度与预设的相似度阈值的大小来判断该实时语音文本是否包含外呼禁止用语,从而提高了匹配的准确度,确保第一匹配结果的正确率。In this embodiment, the similarity is calculated by comparing the real-time voice text with each of the outgoing call prohibitions, and comparing the similarity with a preset similarity threshold value to determine whether the real-time voice text contains an outgoing call prohibition, thereby The accuracy of the matching is improved, and the accuracy of the first matching result is ensured.
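The intersection-over-union similarity described above (Jaccard similarity) with the 0.8 threshold and early termination can be sketched as follows. Computing it over character sets is an assumed granularity, since the text does not fix the unit of comparison, and the banned term "保证收益率" is a hypothetical example.

```python
def jaccard(a, b):
    """Ratio of intersection to union of the two texts' character sets."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0

def contains_banned(text, banned_terms, threshold=0.8):
    """First matching result: any() stops at the first term that matches."""
    return any(jaccard(text, term) >= threshold for term in banned_terms)
```

With the hypothetical banned term "保证收益率", the text "保证收益" shares 4 of 5 distinct characters, giving a similarity of exactly 0.8 and a positive match.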
在图2至图4对应的实施例的基础之上,在步骤S5提及的将实时语音文本与外呼禁止用语进行文本匹配,得到第一匹配结果的步骤之后且在执行第一预警措施的步骤之前,还可以在该坐席员外呼结束后,对外呼过程中是否使用了所有外呼必需用语进行监控预警,如图6所示,该语音识别方法还包括:On the basis of the embodiments corresponding to FIG. 2 to FIG. 4, after the step of text-matching the real-time voice text against the prohibited outbound-call terms mentioned in step S5 to obtain the first matching result, and before the step of executing the first warning measure, monitoring and warning may also be performed, after the agent's outbound call ends, on whether all the required outbound-call terms were used during the call. As shown in FIG. 6, the voice recognition method further includes:
S7:在检测到坐席员的外呼操作终止时,将当前外呼文本与外呼必需用语进行文本匹配,得到第二匹配结果。S7: When it is detected that the agent's outbound call operation is terminated, the current outbound text is matched with the necessary terms of the outbound text to obtain a second matching result.
具体地,若监测到在预设的时间阈值范围内未产生语音数据,则确定本次外呼操作终止,进而将得到的当前外呼文本与外呼必需用语进行匹配,得到第二匹配结果,在本发明实施例中,预设的时间阈值为10秒钟,具体可以根据实际需求进行设置,此处不作限制。Specifically, if it is detected that no voice data has been produced within a preset time threshold, it is determined that this outbound-call operation has terminated, and the obtained current outbound-call text is matched against the required outbound-call terms to obtain the second matching result. In the embodiment of the present invention, the preset time threshold is 10 seconds, which can be set according to actual needs and is not limited here.
其中,将得到的当前外呼文本与外呼必需用语进行匹配的具体过程如下:The specific process of matching the current outbound text with the necessary terms of the outbound call is as follows:
通过获取当前外呼文本中存储的Y个实时语音文本,进而针对每个外呼必需用语,将该外呼必需用语与Y个实时语音文本进行相似度匹配,得到Y个相似度,若Y个相似度均小于预设的相似度阈值,则确认当前外呼文本中不包含该外呼必需用语。The Y real-time voice texts stored in the current outbound-call text are obtained; then, for each required term, similarity matching is performed between that term and the Y real-time voice texts to obtain Y similarities. If all Y similarities are less than the preset similarity threshold, it is confirmed that the current outbound-call text does not contain that required term.
值得说明的是,若存在至少一个外呼必需用语不被当前外呼文本所包含,则确认第二匹配结果为当前外呼文本不包含外呼必需用语。It is worth noting that if there is at least one required language for outbound calls that is not included in the current outbound text, it is confirmed that the second matching result is that the current outbound text does not contain necessary terms for outbound calls.
例如,在一具体实施方式中,外呼必需用语包括:“您好”、“请问有什么可以帮助的吗”、“请稍等”、“感谢您的支持”和“再见”,经过对当前外呼文本与外呼必需用语进行匹配,发现当前外呼文本中包含:“请问有什么可以帮助的吗”、“请稍等”、“感谢您的支持”和“再见”,但不包含“您好”,则确认第二匹配结果为当前外呼文本不包含外呼必需用语。For example, in a specific embodiment, the required outbound-call terms include "Hello", "How may I help you", "Please wait a moment", "Thank you for your support", and "Goodbye". After matching the current outbound-call text against these terms, it is found that the text contains "How may I help you", "Please wait a moment", "Thank you for your support", and "Goodbye", but not "Hello"; the second matching result is thus confirmed as the current outbound-call text not containing all required terms.
可选地，将得到的当前外呼文本与外呼必需用语进行匹配时，还可以通过在当前外呼文本中对每个外呼必需用语进行查询，若每个外呼必需用语均能查询到，则确认第二匹配结果为当前外呼文本包含外呼必需用语，反之，则确认第二匹配结果为当前外呼文本不包含外呼必需用语。Optionally, when matching the obtained current outbound call text against the required outbound call terms, each required term may instead be looked up in the current outbound call text. If every required term can be found, the second matching result is that the current outbound call text contains the required outbound call terms; otherwise, the second matching result is that the current outbound call text does not contain the required outbound call terms.
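As an illustrative sketch (not the patent's own implementation), the similarity-based matching described above can be expressed as follows. The similarity measure (`difflib`'s `SequenceMatcher.ratio`) and the 0.8 threshold are assumptions; any text similarity algorithm and preset threshold may be substituted.

```python
# Hypothetical sketch of matching required outbound-call terms against the
# Y real-time voice texts stored in the current outbound-call text.
# difflib's ratio and the 0.8 threshold are illustrative assumptions.
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.8  # preset similarity threshold (assumed value)

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

def missing_required_terms(call_texts, required_terms):
    """Return every required term whose similarity to all Y stored texts is
    below the threshold, i.e. the term was never used in this call."""
    missing = []
    for term in required_terms:
        if all(similarity(term, text) < SIMILARITY_THRESHOLD for text in call_texts):
            missing.append(term)
    return missing

texts = ["请问有什么可以帮助的吗", "请稍等", "感谢您的支持", "再见"]
required = ["您好", "请问有什么可以帮助的吗", "再见"]
print(missing_required_terms(texts, required))  # → ['您好']
```

A non-empty result corresponds to the second matching result that the current outbound call text does not contain the required outbound call terms; the optional lookup-based variant simply tests substring containment instead of similarity.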
S8:若第二匹配结果为当前外呼文本不包含外呼必需用语,则执行第二预警措施。S8: If the second matching result is that the current outbound text does not contain the necessary words for the outbound call, a second warning measure is performed.
具体地，若第二匹配结果为当前外呼文本中不包含外呼必需用语，则说明本次外呼中存在至少一个外呼必需用语没有被使用，此时，将执行第二预警措施。Specifically, if the second matching result is that the current outbound call text does not contain the required outbound call terms, at least one required term was not used during this outbound call; in this case, the second warning measure is executed.
其中，第二预警措施包括但不限于：向监控端发送本次外呼不规范的预警提示、提醒本次外呼的坐席员本次外呼中出现的不规范事项和生成本次外呼记录等。The second warning measures include, but are not limited to: sending a warning to the monitoring end that this outbound call was non-compliant, reminding the agent of this outbound call of the non-compliant items that occurred, and generating a record of this outbound call.
进一步地，可以根据外呼必需用语的重要程度，设置不同的第二预警措施。例如，若外呼必需用语包括词语G、词语H和词语I，其中，词语G和词语H的重要程度为一级，词语I的重要程度为二级，并且一级低于二级，则可以设置一级对应的第二预警措施为“提醒本次外呼的坐席员本次外呼中出现的不规范事项和生成本次外呼记录”，同时设置二级对应的第二预警措施为“向监控端发送本次外呼不规范的预警提示和生成本次外呼记录”。当当前外呼文本不包含词语I时，执行二级对应的第二预警措施，向监控端发送本次外呼不规范的预警提示和生成本次外呼记录。Further, different second warning measures may be set according to the importance of the required outbound call terms. For example, suppose the required terms include word G, word H, and word I, where words G and H have importance level one, word I has importance level two, and level one is lower than level two. The second warning measure for level one may then be set to "remind the agent of the non-compliant items in this outbound call and generate a record of this outbound call", while the second warning measure for level two is set to "send a warning to the monitoring end that this outbound call was non-compliant and generate a record of this outbound call". When the current outbound call text does not contain word I, the level-two second warning measure is executed: a warning that this outbound call was non-compliant is sent to the monitoring end and a record of this outbound call is generated.
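A minimal sketch of the level-dependent second warning measures described above; the level assignments and measure identifiers are hypothetical, chosen for demonstration only and not part of the claimed method.

```python
# Illustrative mapping from importance level to second warning measures.
# Level numbers and measure identifiers are assumed names.
WARNING_MEASURES = {
    1: {"remind_agent", "generate_record"},    # level one (lower importance)
    2: {"notify_monitor", "generate_record"},  # level two (higher importance)
}
TERM_LEVELS = {"G": 1, "H": 1, "I": 2}  # importance level of each required term

def second_warning(missing_terms):
    """Union of the warning measures triggered by the missing required terms."""
    measures = set()
    for term in missing_terms:
        measures |= WARNING_MEASURES[TERM_LEVELS[term]]
    return measures

# Word I (level two) missing -> notify the monitoring end and generate a record.
print(sorted(second_warning(["I"])))  # → ['generate_record', 'notify_monitor']
```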
在图6对应的实施例中，在检测到坐席员的外呼操作终止时，将当前外呼文本与外呼必需用语进行文本匹配，得到第二匹配结果，若第二匹配结果为当前外呼文本不包含外呼必需用语，执行第二预警措施，实现对外呼必需用语未被使用的情况进行自动预警，避免通过人工去听取录音并分析来进行监控，从而提升了监控的效率。应理解，上述实施例中各步骤的序号的大小并不意味着执行顺序的先后，各过程的执行顺序应以其功能和内在逻辑确定，而不应对本申请实施例的实施过程构成任何限定。In the embodiment corresponding to FIG. 6, when it is detected that the agent's outbound call operation has terminated, the current outbound call text is text-matched against the required outbound call terms to obtain a second matching result; if the second matching result is that the current outbound call text does not contain the required terms, the second warning measure is executed. This automatically warns when required outbound call terms have gone unused, avoiding monitoring by manually listening to and analyzing recordings, and thus improves monitoring efficiency. It should be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution; the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation of the embodiments of this application.
对应于上述方法实施例中的语音识别方法，图7示出了与上述方法实施例提供的语音识别方法一一对应的语音识别装置，为了便于说明，仅示出了与本申请实施例相关的部分。Corresponding to the voice recognition method in the above method embodiments, FIG. 7 shows a voice recognition apparatus in one-to-one correspondence with that method. For convenience of description, only the parts related to the embodiments of the present application are shown.
如图7所示,该语音识别装置包括:数据获取模块10、部门确定模块20、模板选取模块30、语音识别模块40、第一匹配模块50和第一预警模块60。各功能模块详细说明如下:As shown in FIG. 7, the voice recognition device includes a data acquisition module 10, a department determination module 20, a template selection module 30, a voice recognition module 40, a first matching module 50, and a first warning module 60. The detailed description of each function module is as follows:
数据获取模块10,用于若监测到坐席员的外呼操作,则获取该坐席员外呼过程中的语音数据和使用的外呼设备的设备标识;A data acquisition module 10 is configured to acquire voice data and an equipment identifier of an outbound device used by the agent when the outbound operation of the agent is monitored;
部门确定模块20,用于基于设备标识,确定坐席员所属的业务部门;The department determination module 20 is configured to determine a business department to which the agent belongs based on the equipment identification;
模板选取模块30,用于获取业务部门对应的业务文本模板,其中,业务文本模板包括外呼必需用语和外呼禁止用语;The template selection module 30 is configured to obtain a business text template corresponding to a business department, where the business text template includes a required language for outbound calls and a prohibited language for outbound calls;
语音识别模块40,用于对语音数据进行语音识别,得到实时语音文本,并将该实时语音文本添加到当前外呼文本;A voice recognition module 40, configured to perform voice recognition on voice data to obtain real-time voice text, and add the real-time voice text to the current outgoing text;
第一匹配模块50,用于将实时语音文本与外呼禁止用语进行文本匹配,得到第一匹配结果;A first matching module 50, configured to perform text matching between the real-time voice text and the outgoing call prohibited words to obtain a first matching result;
第一预警模块60,用于若第一匹配结果为实时语音文本包含外呼禁止用语,则执行第一预警措施。The first early warning module 60 is configured to execute a first early warning measure if the first matching result is that the real-time voice text includes an outbound call prohibition term.
进一步地，语音识别模块40包括：Further, the voice recognition module 40 includes:
语音解析单元41,用于对语音数据进行语音解析,得到包含基础语音帧的帧集合;A speech parsing unit 41, configured to perform speech parsing on speech data to obtain a frame set including basic speech frames;
静音检测单元42,用于对基础语音帧进行静音检测,得到基础语音帧中的K个连续静音帧,其中,K为自然数;The silence detection unit 42 is configured to perform silence detection on the basic voice frame to obtain K consecutive silence frames in the basic voice frame, where K is a natural number;
帧集划分单元43,用于根据K个连续静音帧,将帧集合中包含的基础语音帧划分成K+1个目标语音帧;A frame set dividing unit 43 configured to divide a basic voice frame included in the frame set into K + 1 target voice frames according to K consecutive mute frames;
文本转换单元44,用于将每个目标语音帧转换为实时语音文本。The text conversion unit 44 is configured to convert each target speech frame into real-time speech text.
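The division performed by the frame-set dividing unit 43 can be sketched as follows, assuming a simple per-frame silence predicate (real silence detection, e.g. energy-based or VAD-based, is more involved): K runs of consecutive silent frames split the basic voice frames into K+1 target speech segments.

```python
# Sketch: K runs of consecutive silent frames divide the basic voice frames
# into K+1 target voice frames (segments). The silence predicate is assumed.
def split_on_silence(frames, is_silent):
    """frames: sequence of per-frame values; is_silent: predicate per frame."""
    segments, current = [], []
    for frame in frames:
        if is_silent(frame):
            if current:            # close the current speech segment
                segments.append(current)
                current = []
        else:
            current.append(frame)
    if current:
        segments.append(current)
    return segments

# Two silent runs (the zeros) -> three target segments (K = 2, K + 1 = 3).
frames = [1, 1, 0, 0, 2, 0, 3, 3]
print(split_on_silence(frames, lambda f: f == 0))  # → [[1, 1], [2], [3, 3]]
```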
进一步地,语音解析单元41包括:Further, the speech parsing unit 41 includes:
归一化子单元411,用于对语音数据进行幅值归一化处理,得到基础语音信号;A normalization subunit 411, configured to perform amplitude normalization processing on the voice data to obtain a basic voice signal;
预加重子单元412,用于对基础语音信号进行预加重处理,生成具有平坦频谱的目标语音信号;A pre-emphasis subunit 412, configured to perform pre-emphasis processing on a basic voice signal to generate a target voice signal having a flat frequency spectrum;
分帧子单元413,用于按照预设的帧长和预设的帧移,对目标语音信号进行分帧处理,得到基础语音帧的帧集合。The frame sub-unit 413 is configured to perform frame processing on a target voice signal according to a preset frame length and a preset frame shift to obtain a frame set of a basic voice frame.
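The three sub-units above (amplitude normalization 411, pre-emphasis 412, framing 413) can be sketched with NumPy as follows. The pre-emphasis coefficient 0.97 and the 25 ms frame length with 10 ms frame shift are common defaults assumed here for illustration; the patent leaves these presets open.

```python
# Sketch of the speech-parsing pipeline: normalize amplitude, apply a
# first-order pre-emphasis filter (flattening the spectrum), then slice the
# signal into overlapping frames. Parameter values are assumed defaults.
import numpy as np

def parse_speech(signal, sample_rate=16000, frame_ms=25, shift_ms=10, alpha=0.97):
    x = signal / np.max(np.abs(signal))          # amplitude normalization (non-silent input assumed)
    x = np.append(x[0], x[1:] - alpha * x[:-1])  # pre-emphasis: y[n] = x[n] - a*x[n-1]
    frame_len = int(sample_rate * frame_ms / 1000)   # preset frame length
    shift = int(sample_rate * shift_ms / 1000)       # preset frame shift
    n_frames = 1 + max(0, len(x) - frame_len) // shift
    # Stack the frames into the frame set of basic speech frames.
    return np.stack([x[i * shift : i * shift + frame_len] for i in range(n_frames)])

frames = parse_speech(np.random.randn(16000))    # one second of 16 kHz audio
print(frames.shape)  # → (98, 400): 98 basic frames of 400 samples each
```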
进一步地,第一匹配模块50包括:Further, the first matching module 50 includes:
第一匹配单元51，用于针对每个外呼禁止用语，采用文本相似度算法，计算该外呼禁止用语与实时语音文本之间的相似度，若相似度大于或等于预设的相似度阈值，则将实时语音文本包含该外呼禁止用语作为第一匹配结果。The first matching unit 51 is configured to, for each prohibited outbound call term, calculate the similarity between that prohibited term and the real-time voice text using a text similarity algorithm; if the similarity is greater than or equal to a preset similarity threshold, the fact that the real-time voice text contains that prohibited term is taken as the first matching result.
进一步地,该语音识别装置还包括:Further, the voice recognition device further includes:
第二匹配模块70，用于在检测到坐席员的外呼操作终止时，将当前外呼文本与外呼必需用语进行文本匹配，得到第二匹配结果；A second matching module 70, configured to, when it is detected that the agent's outbound call operation has terminated, text-match the current outbound call text against the required outbound call terms to obtain a second matching result;
第二预警模块80,用于若第二匹配结果为当前外呼文本不包含外呼必需用语,执行第二预警措施。The second early warning module 80 is configured to execute a second early warning measure if the second matching result is that the current outgoing text does not contain the necessary words for the outgoing call.
本实施例提供的一种语音识别装置中各模块实现各自功能的过程,具体可参考前述方法实施例的描述,此处不再赘述。For a process in which each module in the voice recognition device provided by this embodiment implements their functions, reference may be made to the description of the foregoing method embodiments, and details are not described herein again.
本实施例提供一个或多个存储有计算机可读指令的非易失性可读存储介质，该非易失性可读存储介质上存储有计算机可读指令，该计算机可读指令被一个或多个处理器执行时，使得所述一个或多个处理器执行上述方法实施例中的语音识别方法，或者，实现上述装置实施例中各模块/单元的功能。为避免重复，这里不再赘述。This embodiment provides one or more non-volatile readable storage media storing computer-readable instructions. When the computer-readable instructions are executed by one or more processors, the one or more processors are caused to execute the voice recognition method in the above method embodiments, or to implement the functions of the modules/units in the above apparatus embodiments. To avoid repetition, details are not repeated here.
可以理解地，所述非易失性可读存储介质可以包括：能够携带所述计算机可读指令代码的任何实体或装置、记录介质、U盘、移动硬盘、磁碟、光盘、计算机存储器、只读存储器(Read-Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)、电载波信号和电信信号等。Understandably, the non-volatile readable storage medium may include: any entity or device capable of carrying the computer-readable instruction code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, and the like.
图8是本申请一实施例提供的计算机设备的示意图。如图8所示,该实施例的计算机设备90包括:处理器91、存储器92以及存储在存储器92中并可在处理器91上运行的计算机可读指令93,例如语音识别程序。处理器91执行计算机可读指令93时实现上述语音识别方法实施例中的步骤,例如图2所示的步骤S1至步骤S6。或者,处理器91执行计算机可读指令93时实现上述各装置实施例中各模块/单元的功能,例如图7所示模块10至模块60的功能。FIG. 8 is a schematic diagram of a computer device according to an embodiment of the present application. As shown in FIG. 8, the computer device 90 of this embodiment includes a processor 91, a memory 92, and computer-readable instructions 93 stored in the memory 92 and executable on the processor 91, such as a voice recognition program. When the processor 91 executes the computer-readable instructions 93, the steps in the foregoing embodiment of the speech recognition method are implemented, for example, steps S1 to S6 shown in FIG. 2. Alternatively, when the processor 91 executes the computer-readable instructions 93, the functions of the modules / units in the foregoing device embodiments are implemented, for example, the functions of the modules 10 to 60 shown in FIG. 7.
其中，计算机设备90可以是桌上型计算机、笔记本、掌上电脑及云端服务器等设备，图8仅为本实施例中计算机设备的示例，可以包括比图8所示更多或更少的部件，或者组合某些部件或者不同的部件。存储器92可以是计算机设备的内部存储单元，如硬盘或内存，也可以是计算机设备的外部存储单元，如插接式硬盘、智能存储卡(Smart Media Card,SMC)、安全数字(Secure Digital,SD)卡、闪存卡(Flash Card)等。计算机可读指令93包括程序代码，该程序代码可以为源代码形式、对象代码形式、可执行文件或某些中间形式等。The computer device 90 may be a desktop computer, a notebook computer, a palmtop computer, a cloud server, or the like. FIG. 8 is only an example of the computer device in this embodiment; the device may include more or fewer components than shown in FIG. 8, or combine certain components, or have different components. The memory 92 may be an internal storage unit of the computer device, such as a hard disk or memory, or an external storage unit of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card. The computer-readable instructions 93 include program code, which may be in source code form, object code form, an executable file, or some intermediate form.
所属领域的技术人员可以清楚地了解到，为了描述的方便和简洁，仅以上述各功能单元、模块的划分进行举例说明，实际应用中，可以根据需要而将上述功能分配由不同的功能单元、模块完成，即将所述装置的内部结构划分成不同的功能单元或模块，以完成以上描述的全部或者部分功能。Those skilled in the art can clearly understand that, for convenience and brevity of description, only the above division of functional units and modules is used as an example. In practical applications, the above functions may be assigned to different functional units and modules as needed; that is, the internal structure of the apparatus is divided into different functional units or modules to complete all or part of the functions described above.
以上所述实施例仅用以说明本申请的技术方案，而非对其限制；尽管参照前述实施例对本申请进行了详细的说明，本领域的普通技术人员应当理解：其依然可以对前述各实施例所记载的技术方案进行修改，或者对其中部分技术特征进行等同替换；而这些修改或者替换，并不使相应技术方案的本质脱离本申请各实施例技术方案的精神和范围，均应包含在本申请的保护范围之内。The above embodiments are only intended to describe the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be equivalently replaced; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application, and shall all fall within the protection scope of this application.

Claims (20)

  1. 一种语音识别方法,其特征在于,所述语音识别方法包括:A speech recognition method, characterized in that the speech recognition method includes:
    若监测到坐席员的外呼操作,则获取所述坐席员外呼过程中的语音数据和所述坐席员使用的外呼设备的设备标识;If an outbound operation of the agent is monitored, obtaining voice data during the outbound process of the agent and a device identifier of an outbound device used by the agent;
    基于所述设备标识,确定所述坐席员所属的业务部门;Determining a business department to which the agent belongs based on the device identification;
    获取所述业务部门对应的业务文本模板,其中,所述业务文本模板包括外呼必需用语和外呼禁止用语;Obtaining a business text template corresponding to the business department, wherein the business text template includes a term necessary for outbound calls and a term prohibited for outbound calls;
    对所述语音数据进行语音识别,得到实时语音文本,并将所述实时语音文本添加到当前外呼文本;Performing voice recognition on the voice data to obtain real-time voice text, and adding the real-time voice text to the current outgoing text;
    将所述实时语音文本与所述外呼禁止用语进行文本匹配,得到第一匹配结果;Text-matching the real-time voice text with the outgoing call prohibition term to obtain a first matching result;
    若所述第一匹配结果为所述实时语音文本包含所述外呼禁止用语,则执行第一预警措施。If the first matching result is that the real-time voice text includes the outbound call prohibition term, a first warning measure is performed.
  2. 如权利要求1所述的语音识别方法,其特征在于,所述对所述语音数据进行语音识别,得到实时语音文本包括:The voice recognition method according to claim 1, wherein the performing voice recognition on the voice data to obtain real-time voice text comprises:
    对所述语音数据进行语音解析,得到包含基础语音帧的帧集合;Performing speech analysis on the speech data to obtain a frame set including basic speech frames;
    对所述基础语音帧进行静音检测,得到所述基础语音帧中的K个连续静音帧,其中,K为自然数;Performing silence detection on the basic voice frame to obtain K consecutive silence frames in the basic voice frame, where K is a natural number;
    根据K个所述静音帧,将所述帧集合中包含的所述基础语音帧划分成K+1个目标语音帧;Dividing the basic voice frames included in the frame set into K + 1 target voice frames according to the K silence frames;
    将每个所述目标语音帧转换为所述实时语音文本。Converting each of the target speech frames into the real-time speech text.
  3. 如权利要求2所述的语音识别方法,其特征在于,所述对所述语音数据进行语音解析,得到包含基础语音帧的帧集合包括:The speech recognition method according to claim 2, wherein the performing a speech analysis on the speech data to obtain a frame set including a basic speech frame comprises:
    对所述语音数据进行幅值归一化处理,得到基础语音信号;Performing amplitude normalization processing on the voice data to obtain a basic voice signal;
    对所述基础语音信号进行预加重处理,生成具有平坦频谱的目标语音信号;Performing pre-emphasis processing on the basic voice signal to generate a target voice signal having a flat frequency spectrum;
    按照预设的帧长和预设的帧移,对所述目标语音信号进行分帧处理,得到包含基础语音帧的帧集合。Frame processing the target voice signal according to a preset frame length and a preset frame shift to obtain a frame set including a basic voice frame.
  4. 如权利要求1至3的任一项所述的语音识别方法,其特征在于,所述将所述实时语音文本与所述外呼禁止用语进行文本匹配,得到第一匹配结果包括:The speech recognition method according to any one of claims 1 to 3, wherein performing the text matching between the real-time speech text and the outbound call prohibited term to obtain a first matching result comprises:
    针对每个所述外呼禁止用语，采用文本相似度算法，计算该外呼禁止用语与所述实时语音文本之间的相似度，若所述相似度大于或等于预设的相似度阈值，则将所述实时语音文本包含该外呼禁止用语作为第一匹配结果。For each of the prohibited outbound call terms, a text similarity algorithm is used to calculate the similarity between that prohibited term and the real-time voice text; if the similarity is greater than or equal to a preset similarity threshold, the fact that the real-time voice text contains that prohibited term is taken as the first matching result.
  5. 如权利要求1至3任一项所述的语音识别方法，其特征在于，在所述将所述实时语音文本与所述外呼禁止用语进行文本匹配，得到第一匹配结果的步骤之后且在执行第一预警措施的步骤之前，所述语音识别方法还包括：The speech recognition method according to any one of claims 1 to 3, wherein after the step of text-matching the real-time voice text against the prohibited outbound call terms to obtain the first matching result, and before the step of executing the first warning measure, the voice recognition method further includes:
    在检测到所述坐席员的外呼操作终止时，将所述当前外呼文本与所述外呼必需用语进行文本匹配，得到第二匹配结果；When it is detected that the agent's outbound call operation has terminated, text matching is performed between the current outbound call text and the required outbound call terms to obtain a second matching result;
    若所述第二匹配结果为所述当前外呼文本不包含所述外呼必需用语,则执行第二预警措施。If the second matching result is that the current outgoing call text does not include the necessary words for the outgoing call, a second warning measure is performed.
  6. 一种语音识别装置,其特征在于,所述语音识别装置包括:A voice recognition device, characterized in that the voice recognition device includes:
    数据获取模块,用于若监测到坐席员的外呼操作,则获取所述坐席员外呼过程中的语音数据和使用的外呼设备的设备标识;A data acquisition module, configured to acquire voice data and an equipment identifier of an outbound device used by the agent if the outbound operation of the agent is monitored;
    部门确定模块,用于基于所述设备标识,确定所述坐席员所属的业务部门;A department determination module, configured to determine a business department to which the agent belongs based on the device identifier;
    模板选取模块,用于获取所述业务部门对应的业务文本模板,其中,所述业务文本模板包括外呼必需用语和外呼禁止用语;A template selection module, configured to obtain a business text template corresponding to the business department, wherein the business text template includes a term necessary for outbound calls and a term prohibited for outbound calls;
    语音识别模块,用于对所述语音数据进行语音识别,得到实时语音文本,并将所述实时语音文本添加到当前外呼文本;A voice recognition module, configured to perform voice recognition on the voice data to obtain real-time voice text, and add the real-time voice text to the current outgoing text;
    第一匹配模块,用于将所述实时语音文本与所述外呼禁止用语进行文本匹配,得到第一匹配结果;A first matching module, configured to perform text matching between the real-time voice text and the outgoing call prohibition term to obtain a first matching result;
    第一预警模块,用于若所述第一匹配结果为所述实时语音文本包含所述外呼禁止用语,则执行第一预警措施。A first warning module is configured to execute a first warning measure if the first matching result is that the real-time voice text includes the outbound call prohibition term.
  7. 如权利要求6所述的语音识别装置,其特征在于,所述语音识别模块包括:The voice recognition device according to claim 6, wherein the voice recognition module comprises:
    语音解析单元,用于对所述语音数据进行语音解析,得到包含基础语音帧的帧集合;A speech parsing unit, configured to perform speech parsing on the speech data to obtain a frame set including basic speech frames;
    静音检测单元,用于对所述基础语音帧进行静音检测,得到所述基础语音帧中的K个连续静音帧,其中,K为自然数;A silence detection unit, configured to perform silence detection on the basic voice frame to obtain K consecutive silence frames in the basic voice frame, where K is a natural number;
    帧集划分单元,用于根据K个所述静音帧,将所述帧集合中包含的所述基础语音帧划分成K+1个目标语音帧;A frame set dividing unit, configured to divide the basic voice frame included in the frame set into K + 1 target voice frames according to the K silence frames;
    文本转换单元,用于将每个所述目标语音帧转换为所述实时语音文本。A text conversion unit is configured to convert each of the target speech frames into the real-time speech text.
  8. 如权利要求7所述的语音识别装置,其特征在于,所述语音解析单元包括:The speech recognition device according to claim 7, wherein the speech analysis unit comprises:
    归一化子单元,用于对所述语音数据进行幅值归一化处理,得到基础语音信号;A normalization subunit, configured to perform amplitude normalization processing on the voice data to obtain a basic voice signal;
    预加重子单元,用于对所述基础语音信号进行预加重处理,生成具有平坦频谱的目标语音信号;A pre-emphasis subunit, configured to perform pre-emphasis processing on the basic voice signal to generate a target voice signal having a flat frequency spectrum;
    分帧子单元,用于按照预设的帧长和预设的帧移,对所述目标语音信号进行分帧处理,得到基础语音帧的帧集合。The frame sub-unit is configured to perform frame processing on the target voice signal according to a preset frame length and a preset frame shift to obtain a frame set of a basic voice frame.
  9. 如权利要求6至8任一项所述的语音识别装置,其特征在于,所述第一匹配模块包括:The speech recognition device according to any one of claims 6 to 8, wherein the first matching module comprises:
    第一匹配单元，用于针对每个所述外呼禁止用语，采用文本相似度算法，计算该外呼禁止用语与所述实时语音文本之间的相似度，若所述相似度大于或等于预设的相似度阈值，则将所述实时语音文本包含该外呼禁止用语作为第一匹配结果。A first matching unit, configured to, for each prohibited outbound call term, calculate the similarity between that prohibited term and the real-time voice text using a text similarity algorithm, and if the similarity is greater than or equal to a preset similarity threshold, take the fact that the real-time voice text contains that prohibited term as the first matching result.
  10. 如权利要求6至8任一项所述的语音识别装置,其特征在于,所述语音识别装置还包括:The voice recognition device according to any one of claims 6 to 8, wherein the voice recognition device further comprises:
    第二匹配模块，用于在检测到所述坐席员的外呼操作终止时，将所述当前外呼文本与所述外呼必需用语进行文本匹配，得到第二匹配结果；A second matching module, configured to perform text matching between the current outbound call text and the required outbound call terms when it is detected that the agent's outbound call operation has terminated, to obtain a second matching result;
    第二预警模块,用于若所述第二匹配结果为所述当前外呼文本不包含所述外呼必需用语,则执行第二预警措施。A second warning module is configured to execute a second warning measure if the second matching result is that the current outgoing call text does not include the necessary words for the outgoing call.
  11. 一种计算机设备，包括存储器、处理器以及存储在所述存储器中并可在所述处理器上运行的计算机可读指令，其特征在于，所述处理器执行所述计算机可读指令时实现如下步骤：A computer device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer-readable instructions:
    若监测到坐席员的外呼操作,则获取所述坐席员外呼过程中的语音数据和所述坐席员使用的外呼设备的设备标识;If an outbound operation of the agent is monitored, obtaining voice data during the outbound process of the agent and a device identifier of an outbound device used by the agent;
    基于所述设备标识,确定所述坐席员所属的业务部门;Determining a business department to which the agent belongs based on the device identification;
    获取所述业务部门对应的业务文本模板,其中,所述业务文本模板包括外呼必需用语和外呼禁止用语;Obtaining a business text template corresponding to the business department, wherein the business text template includes a term necessary for outbound calls and a term prohibited for outbound calls;
    对所述语音数据进行语音识别,得到实时语音文本,并将所述实时语音文本添加到当前外呼文本;Performing voice recognition on the voice data to obtain real-time voice text, and adding the real-time voice text to the current outgoing text;
    将所述实时语音文本与所述外呼禁止用语进行文本匹配,得到第一匹配结果;Text-matching the real-time voice text with the outgoing call prohibition term to obtain a first matching result;
    若所述第一匹配结果为所述实时语音文本包含所述外呼禁止用语,则执行第一预警措施。If the first matching result is that the real-time voice text includes the outbound call prohibition term, a first warning measure is performed.
  12. 如权利要求11所述的计算机设备，其特征在于，所述对所述语音数据进行语音识别，得到实时语音文本包括：The computer device according to claim 11, wherein the performing voice recognition on the voice data to obtain real-time voice text comprises:
    对所述语音数据进行语音解析,得到包含基础语音帧的帧集合;Performing speech analysis on the speech data to obtain a frame set including basic speech frames;
    对所述基础语音帧进行静音检测,得到所述基础语音帧中的K个连续静音帧,其中,K为自然数;Performing silence detection on the basic voice frame to obtain K consecutive silence frames in the basic voice frame, where K is a natural number;
    根据K个所述静音帧,将所述帧集合中包含的所述基础语音帧划分成K+1个目标语音帧;Dividing the basic voice frames included in the frame set into K + 1 target voice frames according to the K silence frames;
    将每个所述目标语音帧转换为所述实时语音文本。Converting each of the target speech frames into the real-time speech text.
  13. 如权利要求12所述的计算机设备，其特征在于，所述对所述语音数据进行语音解析，得到包含基础语音帧的帧集合包括：The computer device according to claim 12, wherein the performing voice analysis on the voice data to obtain a frame set including basic voice frames comprises:
    对所述语音数据进行幅值归一化处理,得到基础语音信号;Performing amplitude normalization processing on the voice data to obtain a basic voice signal;
    对所述基础语音信号进行预加重处理,生成具有平坦频谱的目标语音信号;Performing pre-emphasis processing on the basic voice signal to generate a target voice signal having a flat frequency spectrum;
    按照预设的帧长和预设的帧移,对所述目标语音信号进行分帧处理,得到包含基础语音帧的帧集合。Frame processing the target voice signal according to a preset frame length and a preset frame shift to obtain a frame set including a basic voice frame.
  14. 如权利要求11至13任一项所述的计算机设备，其特征在于，所述将所述实时语音文本与所述外呼禁止用语进行文本匹配，得到第一匹配结果包括：The computer device according to any one of claims 11 to 13, wherein the text matching of the real-time voice text against the prohibited outbound call terms to obtain the first matching result comprises:
    针对每个所述外呼禁止用语，采用文本相似度算法，计算该外呼禁止用语与所述实时语音文本之间的相似度，若所述相似度大于或等于预设的相似度阈值，则将所述实时语音文本包含该外呼禁止用语作为第一匹配结果。For each of the prohibited outbound call terms, a text similarity algorithm is used to calculate the similarity between that prohibited term and the real-time voice text; if the similarity is greater than or equal to a preset similarity threshold, the fact that the real-time voice text contains that prohibited term is taken as the first matching result.
  15. 如权利要求11至13任一项所述的计算机设备，其特征在于，在所述将所述实时语音文本与所述外呼禁止用语进行文本匹配，得到第一匹配结果的步骤之后且在执行第一预警措施的步骤之前，所述处理器执行所述计算机可读指令时还实现如下步骤：The computer device according to any one of claims 11 to 13, wherein after the step of text-matching the real-time voice text against the prohibited outbound call terms to obtain the first matching result, and before the step of executing the first warning measure, the processor further implements the following steps when executing the computer-readable instructions:
    在检测到所述坐席员的外呼操作终止时,将所述当前外呼文本与所述外呼必需用语进行文本匹配,得到第二匹配结果;When detecting the termination of the outbound call operation of the agent, text matching the current outbound call text with the necessary term for the outbound call to obtain a second matching result;
    若所述第二匹配结果为所述当前外呼文本不包含所述外呼必需用语,则执行第二预警措施。If the second matching result is that the current outgoing call text does not include the necessary words for the outgoing call, a second warning measure is performed.
  16. 一个或多个存储有计算机可读指令的非易失性可读存储介质，其特征在于，所述计算机可读指令被一个或多个处理器执行时，使得所述一个或多个处理器执行如下步骤：One or more non-volatile readable storage media storing computer-readable instructions, wherein when the computer-readable instructions are executed by one or more processors, the one or more processors are caused to perform the following steps:
    若监测到坐席员的外呼操作,则获取所述坐席员外呼过程中的语音数据和所述坐席员使用的外呼设备的设备标识;If an outbound operation of the agent is monitored, obtaining voice data during the outbound process of the agent and a device identifier of an outbound device used by the agent;
    基于所述设备标识,确定所述坐席员所属的业务部门;Determining a business department to which the agent belongs based on the device identification;
    获取所述业务部门对应的业务文本模板,其中,所述业务文本模板包括外呼必需用语和外呼禁止用语;Obtaining a business text template corresponding to the business department, wherein the business text template includes a term necessary for outbound calls and a term prohibited for outbound calls;
    对所述语音数据进行语音识别,得到实时语音文本,并将所述实时语音文本添加到当前外呼文本;Performing voice recognition on the voice data to obtain real-time voice text, and adding the real-time voice text to the current outgoing text;
    将所述实时语音文本与所述外呼禁止用语进行文本匹配,得到第一匹配结果;Text-matching the real-time voice text with the outgoing call prohibition term to obtain a first matching result;
    若所述第一匹配结果为所述实时语音文本包含所述外呼禁止用语,则执行第一预警措施。If the first matching result is that the real-time voice text includes the outbound call prohibition term, a first warning measure is performed.
  17. 如权利要求16所述的非易失性可读存储介质,其特征在于,所述对所述语音数据进行语音识别,得到实时语音文本包括:The non-volatile readable storage medium according to claim 16, wherein the performing voice recognition on the voice data to obtain real-time voice text comprises:
    对所述语音数据进行语音解析,得到包含基础语音帧的帧集合;Performing speech analysis on the speech data to obtain a frame set including basic speech frames;
    对所述基础语音帧进行静音检测,得到所述基础语音帧中的K个连续静音帧,其中,K为自然数;Performing silence detection on the basic voice frame to obtain K consecutive silence frames in the basic voice frame, where K is a natural number;
    根据K个所述静音帧,将所述帧集合中包含的所述基础语音帧划分成K+1个目标语音帧;Dividing the basic voice frames included in the frame set into K + 1 target voice frames according to the K silence frames;
    将每个所述目标语音帧转换为所述实时语音文本。Converting each of the target speech frames into the real-time speech text.
  18. 如权利要求17所述的非易失性可读存储介质,其特征在于,所述对所述语音数据进行语音解析,得到包含基础语音帧的帧集合包括:The non-volatile readable storage medium according to claim 17, wherein the performing voice analysis on the voice data to obtain a frame set including a basic voice frame comprises:
    对所述语音数据进行幅值归一化处理,得到基础语音信号;Performing amplitude normalization processing on the voice data to obtain a basic voice signal;
    对所述基础语音信号进行预加重处理,生成具有平坦频谱的目标语音信号;Performing pre-emphasis processing on the basic voice signal to generate a target voice signal having a flat frequency spectrum;
    按照预设的帧长和预设的帧移,对所述目标语音信号进行分帧处理,得到包含基础语音帧的帧集合。Frame processing the target voice signal according to a preset frame length and a preset frame shift to obtain a frame set including a basic voice frame.
  19. 如权利要求16至18任一项所述的非易失性可读存储介质，其特征在于，所述将所述实时语音文本与所述外呼禁止用语进行文本匹配，得到第一匹配结果包括：The non-volatile readable storage medium according to any one of claims 16 to 18, wherein the text matching of the real-time voice text against the prohibited outbound call terms to obtain the first matching result includes:
    针对每个所述外呼禁止用语，采用文本相似度算法，计算该外呼禁止用语与所述实时语音文本之间的相似度，若所述相似度大于或等于预设的相似度阈值，则将所述实时语音文本包含该外呼禁止用语作为第一匹配结果。For each of the prohibited outbound call terms, a text similarity algorithm is used to calculate the similarity between that prohibited term and the real-time voice text; if the similarity is greater than or equal to a preset similarity threshold, the fact that the real-time voice text contains that prohibited term is taken as the first matching result.
  20. 如权利要求16至18任一项所述的非易失性可读存储介质，其特征在于，在所述将所述实时语音文本与所述外呼禁止用语进行文本匹配，得到第一匹配结果的步骤之后且在执行第一预警措施的步骤之前，所述计算机可读指令被一个或多个处理器执行时，使得所述一个或多个处理器还执行如下步骤：The non-volatile readable storage medium according to any one of claims 16 to 18, wherein after the step of text-matching the real-time voice text against the prohibited outbound call terms to obtain the first matching result, and before the step of executing the first warning measure, the computer-readable instructions, when executed by the one or more processors, further cause the one or more processors to perform the following steps:
    在检测到所述坐席员的外呼操作终止时,将所述当前外呼文本与所述外呼必需用语进行文本匹配,得到第二匹配结果;When detecting the termination of the outbound call operation of the agent, text matching the current outbound call text with the necessary term for the outbound call to obtain a second matching result;
    若所述第二匹配结果为所述当前外呼文本不包含所述外呼必需用语,则执行第二预警措施。If the second matching result is that the current outgoing call text does not include the necessary words for the outgoing call, a second warning measure is performed.
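The post-call check of claim 20 reduces to verifying that every required phrase appears in the accumulated call transcript. A minimal sketch, with plain substring containment standing in for the unspecified text-matching step:

```python
def missing_required_terms(call_text, required_terms):
    """Sketch of claim 20: after the outbound call ends, report required
    phrases absent from the accumulated transcript; a non-empty result
    would trigger the second warning measure."""
    return [term for term in required_terms if term not in call_text]
```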
PCT/CN2018/094371 2018-05-29 2018-07-03 Voice recognition method, apparatus, computer device, and storage medium WO2019227580A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810529536.0 2018-05-29
CN201810529536.0A CN108833722B (en) 2018-05-29 2018-05-29 Speech recognition method, speech recognition device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2019227580A1 2019-12-05

Family

ID=64146099

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/094371 WO2019227580A1 (en) 2018-05-29 2018-07-03 Voice recognition method, apparatus, computer device, and storage medium

Country Status (2)

Country Link
CN (1) CN108833722B (en)
WO (1) WO2019227580A1 (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110047473B (en) * 2019-04-19 2022-02-22 交通银行股份有限公司太平洋信用卡中心 Man-machine cooperative interaction method and system
CN110265008A (en) * 2019-05-23 2019-09-20 中国平安人寿保险股份有限公司 Intelligence pays a return visit method, apparatus, computer equipment and storage medium
CN110472097A (en) * 2019-07-03 2019-11-19 平安科技(深圳)有限公司 Melody automatic classification method, device, computer equipment and storage medium
CN110633912A (en) * 2019-09-20 2019-12-31 苏州思必驰信息科技有限公司 Method and system for monitoring service quality of service personnel
CN110782318A (en) * 2019-10-21 2020-02-11 五竹科技(天津)有限公司 Marketing method and device based on audio interaction and storage medium
CN112735421A (en) * 2019-10-28 2021-04-30 北京京东尚科信息技术有限公司 Real-time quality inspection method and device for voice call
CN110807090A (en) * 2019-10-30 2020-02-18 福建工程学院 Unmanned invigilating method for online examination
CN111064849B (en) * 2019-12-25 2021-02-26 北京合力亿捷科技股份有限公司 Call center system based line resource utilization and management and control analysis method
CN111698374B (en) * 2020-06-28 2022-02-11 中国银行股份有限公司 Customer service voice processing method and device
CN112069796B (en) * 2020-09-03 2023-08-04 阳光保险集团股份有限公司 Voice quality inspection method and device, electronic equipment and storage medium
CN112530424A (en) * 2020-11-23 2021-03-19 北京小米移动软件有限公司 Voice processing method and device, electronic equipment and storage medium
CN114006986A (en) * 2021-10-29 2022-02-01 平安普惠企业管理有限公司 Outbound call compliance early warning method, device, equipment and storage medium
CN114220432A (en) * 2021-11-15 2022-03-22 交通运输部南海航海保障中心广州通信中心 Maritime single-side-band-based voice automatic monitoring method and system and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102625005A (en) * 2012-03-05 2012-08-01 广东天波信息技术股份有限公司 Call center system with function of real-timely monitoring service quality and implement method of call center system
CN105261362A (en) * 2015-09-07 2016-01-20 科大讯飞股份有限公司 Conversation voice monitoring method and system
CN105975514A (en) * 2016-04-28 2016-09-28 朱宇光 Automatic quality testing method and system
US20170161378A1 (en) * 2015-12-02 2017-06-08 International Business Machines Corporation Expansion of a question and answer database
CN107093431A (en) * 2016-02-18 2017-08-25 中国移动通信集团辽宁有限公司 A kind of method and device that quality inspection is carried out to service quality

Family Cites Families (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001100781A (en) * 1999-09-30 2001-04-13 Sony Corp Method and device for voice processing and recording medium
CN100566360C (en) * 2006-01-19 2009-12-02 北京讯鸟软件有限公司 Realize the call center services method of sitting service level evaluation
CN101662550B (en) * 2009-09-11 2012-10-03 中兴通讯股份有限公司 Method and system for service quality detection for call center
EP2622832B1 (en) * 2010-09-30 2019-03-13 British Telecommunications public limited company Speech comparison
CN102456344B (en) * 2010-10-22 2014-12-10 中国电信股份有限公司 System and method for analyzing customer behavior characteristic based on speech recognition technique
JP6438674B2 (en) * 2014-04-28 2018-12-19 エヌ・ティ・ティ・コミュニケーションズ株式会社 Response system, response method, and computer program
US9300801B1 (en) * 2015-01-30 2016-03-29 Mattersight Corporation Personality analysis of mono-recording system and methods
CN206332732U (en) * 2016-08-30 2017-07-14 国家电网公司客户服务中心南方分中心 A kind of real-time interfering system of operator's mood
CN107871497A (en) * 2016-09-23 2018-04-03 北京眼神科技有限公司 Audio recognition method and device
CN108010513B (en) * 2016-10-28 2021-05-14 北京回龙观医院 Voice processing method and device
CN106790004B (en) * 2016-12-12 2021-02-02 北京易掌云峰科技有限公司 Customer service auxiliary real-time prompt system based on artificial intelligence
CN106851032B (en) * 2016-12-31 2019-10-29 国家电网公司客户服务中心 A method of improving the abnormal fix-rate of seat application system
CN106981291A (en) * 2017-03-30 2017-07-25 上海航动科技有限公司 A kind of intelligent vouching quality inspection system based on speech recognition
CN107317942A (en) * 2017-07-18 2017-11-03 国家电网公司客户服务中心南方分中心 A kind of call center's customer service system is recognized and monitoring system with online voice mood
CN107945790B (en) * 2018-01-03 2021-01-26 京东方科技集团股份有限公司 Emotion recognition method and emotion recognition system


Also Published As

Publication number Publication date
CN108833722A (en) 2018-11-16
CN108833722B (en) 2021-05-11

Similar Documents

Publication Publication Date Title
WO2019227580A1 (en) Voice recognition method, apparatus, computer device, and storage medium
CN108922538B (en) Conference information recording method, conference information recording device, computer equipment and storage medium
US10249304B2 (en) Method and system for using conversational biometrics and speaker identification/verification to filter voice streams
CN109065075A (en) A kind of method of speech processing, device, system and computer readable storage medium
US8731936B2 (en) Energy-efficient unobtrusive identification of a speaker
WO2019227583A1 (en) Voiceprint recognition method and device, terminal device and storage medium
CN108766441B (en) Voice control method and device based on offline voiceprint recognition and voice recognition
WO2019227547A1 (en) Voice segmenting method and apparatus, and computer device and storage medium
WO2021051506A1 (en) Voice interaction method and apparatus, computer device and storage medium
US9171547B2 (en) Multi-pass speech analytics
US8005676B2 (en) Speech analysis using statistical learning
US8417524B2 (en) Analysis of the temporal evolution of emotions in an audio interaction in a service delivery environment
CN107818798A (en) Customer service quality evaluating method, device, equipment and storage medium
CN111312219B (en) Telephone recording labeling method, system, storage medium and electronic equipment
CN105489221A (en) Voice recognition method and device
WO2019210556A1 (en) Call reservation method, agent leaving processing method and apparatus, device, and medium
JP2004502985A (en) Recording device for recording voice information for subsequent offline voice recognition
US20060100866A1 (en) Influencing automatic speech recognition signal-to-noise levels
CN111128241A (en) Intelligent quality inspection method and system for voice call
CN110798578A (en) Incoming call transaction management method and device and related equipment
CN111683317A (en) Prompting method and device applied to earphone, terminal and storage medium
CN111508527B (en) Telephone answering state detection method, device and server
CN109994129A (en) Speech processing system, method and apparatus
CN111326159B (en) Voice recognition method, device and system
CN111179936B (en) Call recording monitoring method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18920982

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 15.03.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 18920982

Country of ref document: EP

Kind code of ref document: A1