WO2019227580A1 - Voice recognition method, apparatus, computer device, and storage medium - Google Patents

Voice recognition method, apparatus, computer device, and storage medium Download PDF

Info

Publication number
WO2019227580A1
WO2019227580A1 (PCT/CN2018/094371, priority CN2018094371W)
Authority
WO
WIPO (PCT)
Prior art keywords
text
voice
real
frame
speech
Prior art date
Application number
PCT/CN2018/094371
Other languages
French (fr)
Chinese (zh)
Inventor
黄锦伦
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology (Shenzhen) Co., Ltd. (平安科技(深圳)有限公司)
Publication of WO2019227580A1 publication Critical patent/WO2019227580A1/en

Links

Images

Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04M - TELEPHONIC COMMUNICATION
    • H04M 3/00 - Automatic or semi-automatic exchanges
    • H04M 3/22 - Arrangements for supervision, monitoring or testing
    • H04M 3/2281 - Call monitoring, e.g. for law enforcement purposes; Call tracing; Detection or prevention of malicious calls
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/26 - Speech to text systems
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04M - TELEPHONIC COMMUNICATION
    • H04M 3/00 - Automatic or semi-automatic exchanges
    • H04M 3/42 - Systems providing special services or facilities to subscribers
    • H04M 3/50 - Centralised arrangements for answering calls; Centralised arrangements for recording messages for absent or busy subscribers; Centralised arrangements for recording messages
    • H04M 3/51 - Centralised call answering arrangements requiring operator intervention, e.g. call or contact centers for telemarketing
    • H04M 3/5175 - Call or contact centers supervision arrangements

Definitions

  • the present application relates to the field of computer technology, and in particular, to a speech recognition method, device, computer device, and storage medium.
  • the call center consists of an interactive voice response system and an artificial traffic system.
  • the artificial traffic system consists of a check-in system, a traffic platform, and an interface machine.
  • in order to perform customer service, customer representatives (agents) need to perform a check-in operation in the check-in system. After successfully signing in to the traffic platform, the agent establishes a call with the customer according to the assigned manual-service request; that is, the agent calls out to perform customer service.
  • different business terms are set for different services to provide better service to customers.
  • the current practice is to listen to the recording afterwards and analyze it to find outbound calls that do not meet the specifications, and then deal with them accordingly. On the one hand, because the recording can only be heard after the event, no timely early warning is provided, and the agents' voice calls are not monitored as they happen. On the other hand, because all recordings must be listened to and analyzed manually, a great deal of time is consumed and monitoring efficiency is low.
  • the embodiments of the present application provide a method, a device, a computer device, and a storage medium for speech recognition, so as to solve the problems that the current outbound monitoring of agents is not timely and the monitoring efficiency is low.
  • An embodiment of the present application provides a voice recognition method, including: if an outbound operation of an agent is monitored, acquiring voice data and an equipment identifier of the outbound device used by the agent; determining the business department to which the agent belongs based on the device identifier; obtaining a business text template corresponding to the business department, where the business text template includes terms required for outbound calls and terms prohibited for outbound calls; performing voice recognition on the voice data to obtain real-time voice text, and adding the real-time voice text to the current outgoing text; performing text matching between the real-time voice text and the prohibited terms to obtain a first matching result; and, if the first matching result is that the real-time voice text includes a prohibited term, executing a first warning measure.
  • An embodiment of the present application provides a voice recognition device, including:
  • a data acquisition module configured to acquire voice data and an equipment identifier of an outbound device used by the agent if the outbound operation of the agent is monitored;
  • a department determination module configured to determine a business department to which the agent belongs based on the device identifier
  • a template selection module configured to obtain a business text template corresponding to the business department, wherein the business text template includes a term necessary for outbound calls and a term prohibited for outbound calls;
  • a voice recognition module configured to perform voice recognition on the voice data to obtain real-time voice text, and add the real-time voice text to the current outgoing text;
  • a first matching module configured to perform text matching between the real-time voice text and the outgoing call prohibition term to obtain a first matching result
  • a first warning module is configured to execute a first warning measure if the first matching result is that the real-time voice text includes the outbound call prohibition term.
  • An embodiment of the present application provides a computer device including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor.
  • the processor, when executing the computer-readable instructions, implements the steps of the above speech recognition method.
  • This embodiment of the present application provides one or more non-volatile readable storage media storing computer-readable instructions.
  • when the computer-readable instructions are executed by one or more processors, the one or more processors perform the steps of the speech recognition method described above.
  • FIG. 1 is a schematic diagram of an application environment of a speech recognition method according to an embodiment of the present application
  • FIG. 2 is an implementation flowchart of a speech recognition method provided by an embodiment of the present application
  • FIG. 3 is a flowchart of implementing step S4 in the speech recognition method according to an embodiment of the present application.
  • FIG. 4 is a flowchart of implementing step S41 in the voice recognition method provided by an embodiment of the present application.
  • FIG. 5 is an exemplary diagram of overlapping frames of speech signals in a speech recognition method according to an embodiment of the present application.
  • FIG. 6 is a flowchart of implementing a monitoring and early-warning phrase necessary for outgoing calls in the voice recognition method provided by an embodiment of the present application;
  • FIG. 7 is a schematic diagram of a voice recognition device provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of a computer device according to an embodiment of the present application.
  • FIG. 1 illustrates an application environment of a speech recognition method provided by an embodiment of the present application.
  • the speech recognition method is applied in an outbound agent scenario of a call center.
  • the call center includes a server, a client, and a monitoring terminal.
  • the server and the client are connected, and the server and the monitoring terminal are connected through a network.
  • Agents make outbound calls through the client.
  • the client can specifically, but not limited to, various direct-line telephones, telephone network telephones connected with program-controlled switches, mobile phones, walkie-talkies, or other smart devices used for communication.
  • the server and the monitoring terminal may each be implemented by an independent server or by a server cluster composed of multiple servers.
  • the speech recognition method provided in the embodiment of the present application is applied to a server.
  • FIG. 2 illustrates an implementation process of a speech recognition method provided by an embodiment of the present application.
  • This method is applied to the server in FIG. 1 as an example, and includes the following steps:
  • the server and the client are connected through a network, and the server can monitor the client in real time.
  • if an outbound operation of an agent is monitored, the device identifier of the outbound device used by the agent and the voice data generated during the outgoing call are obtained.
  • the client includes two or more outbound call devices, and each outbound call device is used by an agent for outbound calls.
  • the monitoring of the client by the server can be implemented by using the listening mode of the socket process communication, or it can be controlled by the Transmission Control Protocol (TCP) to control data transmission. It can also be implemented by a third-party tool with a monitoring function.
  • the method preferred in this embodiment is the listening mode of socket process communication; in practice, a suitable monitoring method can be selected according to the specific situation, and no restriction is imposed here.
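As an illustrative sketch only (the host, port, and message format are assumptions, not part of the application), the socket listening mode can be pictured as a server that accepts a connection from an outbound device and receives its event messages:

```python
import socket
import threading

# Sketch of socket-based listening: the server waits for a connection from
# an outbound device and records the event message it pushes.
HOST, PORT = "127.0.0.1", 0  # port 0: let the OS pick a free port

received = []

def serve(sock):
    conn, _ = sock.accept()          # block until a client connects
    with conn:
        data = conn.recv(1024)       # read the pushed event message
        received.append(data.decode("utf-8"))

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind((HOST, PORT))
server.listen(1)
port = server.getsockname()[1]

t = threading.Thread(target=serve, args=(server,))
t.start()

# A client (outbound device) reports that an outbound call has started,
# using a hypothetical "event + device identifier" message format.
with socket.create_connection((HOST, port)) as c:
    c.sendall(b"OUTBOUND_START 89757-KD-EN170-962346")

t.join()
server.close()
print(received[0])
```

In a real deployment the server would keep listening in a loop and dispatch each event to the monitoring logic; this sketch shows only a single accept/receive cycle.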
  • S2 Determine the business department to which the agent belongs based on the device identification.
  • the device identifier records the main information of the device, including but not limited to: the employee number of the agent, the department to which the agent belongs, the device type, or the device number. After the device identifier is obtained, the business department to which the agent belongs can be determined according to the device identifier.
  • the obtained device identifier is: 89757-KD-EN170-962346, and the device identifier contains information: the agent employee number is 89757, the agent's department is KD, and the device type is EN170. The equipment number is 962346.
  • the agent needs to verify the identity before using the outbound device.
  • the verification methods include, but are not limited to: account verification, voiceprint recognition, or fingerprint identification. After passing the verification, the outbound device obtains the corresponding information and records it in the device identifier.
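Following the example identifier format above (employee number, department, device type, and device number joined by hyphens), a hypothetical parser might look like this; the function name and returned field names are illustrative assumptions:

```python
# Hypothetical parser for the example identifier format
# "employee-department-devicetype-devicenumber".
def parse_device_id(device_id: str) -> dict:
    employee, department, dev_type, dev_number = device_id.split("-")
    return {
        "employee": employee,        # agent employee number, e.g. 89757
        "department": department,    # business department code, e.g. KD
        "device_type": dev_type,     # e.g. EN170
        "device_number": dev_number, # e.g. 962346
    }

info = parse_device_id("89757-KD-EN170-962346")
print(info["department"])  # KD
```

The `department` field is what step S2 would use to look up the business text template.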
  • each business department presets its own business text template. According to the business department determined in step S2, the corresponding business text template is obtained; each business text template contains the terms required for outbound calls and the terms prohibited for outbound calls.
  • the business department number is KD
  • the business text template KDYY corresponding to the business department number KD is found in the database.
  • the business text template KDYY is used as the standard business template for the agent's current outbound call; that is, after the voice data of the current outbound call is converted into text, the business text template KDYY is used to check the text and monitor whether the agent's outbound terms are standard.
  • S4 Perform speech recognition on the voice data to obtain real-time voice text, and add the real-time voice text to the current outgoing text.
  • voice recognition is performed on the voice data of the agent's outbound call obtained in step S1 to obtain real-time voice text during the outbound process, so that whether the agent's outbound terms are standard can be monitored by checking the real-time voice text; the real-time voice text is added to the current outgoing text.
  • real-time voice text is obtained by segmenting the outbound voice data according to the pauses and silences during the call, performing speech recognition on each segmented piece of voice data, and taking the recognized text as the speech recognition text.
  • for example, a piece of voice data acquired from 0 to 1.8 seconds is recorded as voice data E; the voice data acquired from 1.8 to 3 seconds is empty; and the voice data acquired from 3 to 8 seconds is recorded as voice data F. Voice recognition on voice data E yields the real-time voice text "Hello", and voice recognition on voice data F yields the real-time voice text "Is this China XX? May I help you?"
  • the voice data may be recognized by a voice recognition algorithm or a third-party tool with a voice recognition function, which is not limited in particular.
  • Speech recognition algorithms include, but are not limited to, speech recognition algorithms based on channel models, speech template matching recognition algorithms, or artificial neural network speech recognition algorithms.
  • the speech recognition algorithm used in the embodiment of the present application is a speech recognition algorithm based on a channel model.
  • the real-time voice text obtained in step S4 is matched against the prohibited terms in the business text template obtained in step S3 to determine whether the real-time voice text contains a prohibited term; this real-time matching ensures timely monitoring.
  • the first matching result includes: the real-time voice text includes an outgoing call prohibition term and the real-time voice text does not include an outgoing call prohibition term.
  • the prohibited terms can be set according to business requirements; there may be one, two, or more such terms.
  • the real-time voice text is one or more, and if there is at least one real-time voice text including an outgoing call prohibition term, it is determined that the first matching result is that the real-time voice text includes an outgoing call prohibition term.
  • in step S6, if the first matching result obtained in step S5 is that the real-time voice text contains a prohibited term, the agent has used at least one prohibited term in this outbound call, and the first warning measure is executed.
  • the first warning measures include, but are not limited to: sending a warning to the monitoring end that this outbound call is irregular, reminding the agent of the irregularity in the outbound call, and/or disconnecting the network connection of the current outbound call device; the measures can be set according to the actual situation and are not specifically limited here.
  • the first warning measure may also be set according to the severity of the prohibited term. For example, suppose the prohibited terms include word A, word B, and word C, where words A and B have severity level 1, word C has severity level 2, and level 1 is lower than level 2. The first warning measure corresponding to level 1 can be set to "send a warning to the monitoring end that this outbound call is irregular", and the measure corresponding to level 2 to "disconnect the network connection of the current outbound call device". When the real-time voice text contains word C, the first warning measure directly disconnects the network connection of the current outbound call device and terminates the agent's outbound call.
  • in this embodiment, the device identifier and voice data of the agent are obtained, the business department to which the agent belongs is determined from the device identifier, and the business text template corresponding to that department is obtained.
  • voice recognition is then performed on the voice data to obtain real-time voice text, the real-time voice text is stored into the current outgoing text, and text matching between the prohibited terms and the real-time voice text yields the first matching result; if the result is that the real-time voice text contains a prohibited term, the first warning measure is executed, realizing real-time monitoring of the agent's voice during the outbound call.
  • if the agent uses a prohibited term during the outbound call, it can be detected and warned in time, ensuring the timeliness of monitoring; and because there is no need to manually listen to and analyze recordings of outbound calls, time is saved and monitoring efficiency is improved.
  • the following uses a specific embodiment to describe in detail the implementation of performing voice recognition on the voice data in step S4 to obtain real-time voice text.
  • FIG. 3 illustrates a specific implementation process of step S4 provided by an embodiment of the present application, which is detailed as follows:
  • S41 Perform speech analysis on the speech data to obtain a frame set including basic speech frames.
  • Speech analysis is performed on the acquired speech data to obtain a frame set including basic speech frames.
  • Speech analysis includes, but is not limited to, speech encoding and pre-processing of speech signals.
  • speech coding encodes the analog speech signal, converting the analog signal into a digital signal, thereby reducing the transmission code rate and enabling digital transmission.
  • the basic methods of speech encoding can be divided into waveform encoding, parametric encoding (sound source encoding) and mixed coding.
  • the voice coding method used in this embodiment is waveform coding.
  • waveform coding samples, quantizes, and encodes the time-domain waveform of the analog voice signal to form a digital voice signal.
  • waveform coding can provide high voice quality.
  • the preprocessing of a voice signal refers to pre-emphasis, framing, windowing and other preprocessing operations on the voice signal before analysis and processing.
  • the purpose of voice signal pre-processing is to eliminate the effects of aliasing, higher-harmonic distortion, high frequency, and other factors, caused by the human vocal organs and by the equipment that collects the voice signal, on the quality of the signal. This ensures, as far as possible, that subsequent speech processing works on a more uniform and smooth signal, provides high-quality parameters for signal parameter extraction, and improves the quality of speech processing.
  • S42 Perform mute detection on the basic voice frame to obtain K consecutive mute frames in the basic voice frame, where K is a natural number.
  • the voice signal in the voice data can be divided into two states: an active period and a silent period. No signal is transmitted during the silent period, and the active and silent periods of the uplink and downlink are independent of each other.
  • the agent will have a pause state before and after each utterance. This state will cause a pause in the voice signal, that is, a silent period.
  • the silent period state needs to be detected. Then, the silent period and the activation period are separated to obtain a continuous activation period, and the voice signal of the remaining continuous activation period is used as a target voice frame.
  • the methods for detecting the state of the silence include, but are not limited to, voice endpoint detection, detection of audio mute algorithms, and voice activity detection (VAD) algorithms.
  • the mute detection on the basic voice frame used in the embodiment of the present application to obtain K consecutive mute frames in the basic voice frame includes steps A to E, which are detailed as follows:
  • Step A Calculate the frame energy of each basic speech frame.
  • the frame energy is the short-term energy of the voice signal, and reflects the data amount of the voice information of the voice frame.
  • the frame energy can be used to determine whether the voice frame is a sentence frame or a mute frame.
  • Step B For each basic speech frame, if the frame energy of the basic speech frame is less than a preset frame energy threshold, mark the basic speech frame as a silent frame.
  • the frame energy threshold is a preset parameter. If the calculated frame energy of a basic voice frame is less than the preset frame energy threshold, the corresponding basic voice frame is marked as a mute frame. The threshold may be set according to actual requirements (for example, 0.5), or set after analyzing the calculated frame energies of the basic voice frames; it is not limited here.
  • for example, with the frame energy threshold set to 0.5, the frame energies calculated for 6 basic speech frames J1, J2, J3, J4, J5, and J6 are: 1.6, 0.2, 0.4, 1.7, 1.1, and 0.8. From this result it is easy to see that basic speech frames J2 and J3 are silent frames.
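Steps A and B can be sketched as follows, reusing the toy energies from the example; the sum-of-squares energy definition is a common choice for short-term energy and an assumption here, since the application does not give an explicit formula:

```python
# Step A: short-term frame energy (sum of squared samples, an assumed
# but common definition).
def frame_energy(frame):
    return sum(s * s for s in frame)

# Step B: mark a frame as silent when its energy is below the threshold.
def mark_silent(energies, threshold=0.5):
    return [e < threshold for e in energies]

# Energies of J1..J6 from the example, threshold 0.5.
energies = [1.6, 0.2, 0.4, 1.7, 1.1, 0.8]
print(mark_silent(energies))  # [False, True, True, False, False, False]
```

Only J2 and J3 fall below the 0.5 threshold, matching the example's conclusion.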
  • Step C: if H consecutive mute frames are detected and H is greater than a preset continuity threshold I, the frame set composed of the H consecutive mute frames is treated as one continuous mute frame.
  • the continuity threshold I can be preset according to actual needs. If the number of consecutive silent frames is H and H is greater than the preset continuity threshold I, all the mute frames in the interval composed of the H consecutive silent frames are merged to obtain one continuous mute frame.
  • the preset continuous threshold I is 5, and at a certain moment, the status of the acquired mute frames is shown in Table 1.
  • Table 1 shows a frame set composed of 50 basic voice frames.
  • the intervals of 5 or more consecutive mute frames are: interval P, composed of the basic speech frames corresponding to frame number 7 to frame number 13, and interval Q, composed of the basic speech frames corresponding to frame number 21 to frame number 29. Therefore, the 7 basic voice frames of interval P are merged into a continuous mute frame P, whose duration is the sum of the durations of the 7 basic voice frames corresponding to frame numbers 7 to 13. In the same way, the 9 basic voice frames of interval Q are merged into another continuous mute frame Q, whose duration is the sum of the durations of the 9 basic speech frames corresponding to frame numbers 21 to 29.
  • Step D According to the method of steps A to C, obtain a total of K consecutive silent frames.
  • the continuous mute frames obtained are continuous mute frame P and continuous mute frame Q, so in the example corresponding to step C, the value of K is 2.
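Steps C and D (merging runs of silent frames whose length H exceeds the continuity threshold I, then counting the runs to obtain K) can be sketched as:

```python
# Merge runs of consecutive silent frames longer than the continuity
# threshold I into "continuous mute frames" (step C); the number of runs
# found is K (step D). Indices are 0-based here.
def continuous_mute_runs(silent_flags, threshold_i=5):
    runs, start = [], None
    for idx, silent in enumerate(silent_flags + [False]):  # sentinel ends a run
        if silent and start is None:
            start = idx
        elif not silent and start is not None:
            if idx - start > threshold_i:          # H = idx - start
                runs.append((start, idx - 1))
            start = None
    return runs

# The example: 50 frames, silent at frame numbers 7-13 and 21-29 (1-based).
flags = [False] * 50
for i in list(range(6, 13)) + list(range(20, 29)):
    flags[i] = True
runs = continuous_mute_runs(flags, threshold_i=5)
print([(a + 1, b + 1) for a, b in runs])  # [(7, 13), (21, 29)], so K = 2
```

Runs of length 7 and 9 both exceed I = 5, so two continuous mute frames (P and Q) result and K = 2, as in the example.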
  • the K continuous silent frames obtained in step S42 are used as dividing points to divide the basic speech frames in the frame set into K + 1 intervals, and all the basic speech frames in each interval are taken together as one target speech frame.
  • for example, the status of the acquired mute frames is shown in Table 1 of step C in S42, which shows two continuous mute frames: the 7 basic voice frames corresponding to frame number 7 to frame number 13, merged into continuous mute frame P, and the 9 basic voice frames corresponding to frame number 21 to frame number 29, merged into continuous mute frame Q. Using these two continuous mute frames as dividing points, the frame set of 50 basic speech frames is divided into three intervals: interval M1, composed of the basic speech frames corresponding to frame number 1 to frame number 6; interval M2, composed of the basic speech frames corresponding to frame number 14 to frame number 20; and interval M3, composed of the basic speech frames corresponding to frame number 30 to frame number 50. All the basic speech frames in interval M1 are combined to obtain a combined speech frame as target speech frame M1, and likewise for M2 and M3.
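A minimal sketch of this dividing step, reproducing the example (50 frames, continuous mute frames at frame numbers 7-13 and 21-29, yielding three target intervals M1, M2, M3):

```python
# The K continuous mute frames act as dividing points, splitting the
# frame set into K + 1 target intervals (frame numbers are 1-based).
def split_by_mute_runs(num_frames, mute_runs):
    intervals, start = [], 1
    for run_start, run_end in mute_runs:
        if run_start > start:
            intervals.append((start, run_start - 1))
        start = run_end + 1
    if start <= num_frames:
        intervals.append((start, num_frames))
    return intervals

# The example from the text: 50 frames, mute runs at frames 7-13 and 21-29.
print(split_by_mute_runs(50, [(7, 13), (21, 29)]))
# [(1, 6), (14, 20), (30, 50)] -> target speech frames M1, M2, M3
```

With K = 2 mute runs the function returns K + 1 = 3 intervals, matching the example.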
  • text conversion is performed on each target voice frame to obtain a real-time voice text corresponding to the target voice frame.
  • the text conversion may use a tool that supports speech conversion text, or a text conversion algorithm, which is not specifically limited here.
  • in this embodiment, the speech data is parsed to obtain a frame set of basic speech frames, silence detection is then performed on the basic speech frames to obtain K continuous silent frames, and the K continuous silent frames divide the basic voice frames in the frame set into K + 1 target voice frames. Each target voice frame is converted into real-time voice text, so that the received voice signal is converted in real time into independent pieces of real-time voice text that can then be matched against the prohibited terms, ensuring the timeliness of monitoring during outbound calls.
  • the following uses a specific embodiment to describe in detail the implementation of step S41, namely performing speech analysis on the voice data to obtain a frame set including basic voice frames.
  • FIG. 4 illustrates a specific implementation process of step S41 provided by an embodiment of the present application, which is detailed as follows:
  • S411 Perform amplitude normalization processing on the voice data to obtain a basic voice signal.
  • the voice data obtained by the device is an analog signal.
  • the voice data must therefore be encoded using Pulse Code Modulation (PCM) so that the analog signal is converted into a digital signal.
  • the analog signal in the voice data is sampled at sampling points spaced a predetermined time apart to discretize it; the sampled signal is then quantized, and the quantized digital signal is output as a binary code group. For example, the sampling rate can be set to 8 kHz and the quantization accuracy to 16 bits.
  • the amplitude normalization processing is performed on the discretized and quantized speech data.
  • the specific amplitude normalization processing method may be dividing the sampling value of each sampling point by the maximum value among the sampling values of the speech data. It is also possible to divide the sampling value of each sampling point by the average value of the corresponding sampling value of the speech data, and converge the data to a specific interval, which is convenient for data processing.
  • the sample value of each sampling point in the audio data is converted into a corresponding standard value, thereby obtaining a basic voice signal corresponding to the voice data.
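A sketch of the first amplitude normalization variant described above (dividing each sample by the maximum sample value); using the maximum absolute value, as below, is a common refinement and an assumption here:

```python
# Peak normalization: divide every sample by the maximum absolute sample
# value so the basic voice signal lies in [-1, 1].
def normalize(samples):
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)  # an all-zero signal stays unchanged
    return [s / peak for s in samples]

print(normalize([0.5, -2.0, 1.0]))  # [0.25, -1.0, 0.5]
```

The mean-based variant mentioned in the text would simply replace `peak` with the average of the sample values.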
  • S412 Perform pre-emphasis processing on the basic voice signal to generate a target voice signal with a flat frequency spectrum.
  • pre-emphasis is performed during pre-processing for this purpose: it boosts the high-frequency part so that the signal spectrum becomes flat and the spectrum can be obtained with the same signal-to-noise ratio over the entire band from low to high frequency, which is convenient for spectrum analysis or channel parameter analysis.
  • Pre-emphasis can be performed before the anti-aliasing filter when the voice signal is digitized. This not only can perform pre-emphasis, but also can compress the dynamic range of the signal and effectively improve the signal-to-noise ratio. Pre-emphasis can be implemented using a first-order digital filter, such as a Finite Impulse Response (FIR) filter.
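A first-order pre-emphasis filter of the kind described can be sketched as y(n) = x(n) - a·x(n-1); the coefficient a = 0.97 is a common choice and an assumption here, since the application does not fix it:

```python
# First-order FIR pre-emphasis: y[n] = x[n] - a * x[n-1].
# The first sample is passed through unchanged.
def pre_emphasis(x, a=0.97):
    return [x[0]] + [x[n] - a * x[n - 1] for n in range(1, len(x))]

# A constant (low-frequency) signal is strongly attenuated after the
# first sample, illustrating the high-frequency boost.
print([round(v, 2) for v in pre_emphasis([1.0, 1.0, 1.0])])  # [1.0, 0.03, 0.03]
```
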
  • S413 Perform frame processing on the target voice signal according to a preset frame length and a preset frame shift to obtain a frame set including a basic voice frame.
  • the speech signal is short-term stationary. After pre-emphasis, the signal is framed and windowed to maintain this short-term stationarity; generally, the frame rate is between 33 and 100 frames per second. To maintain continuity between frames, so that adjacent frames transition smoothly, the overlapping framing method is adopted, as shown in FIG. 5, which gives an example of overlapping framing; the overlap between the k-th frame and the (k+1)-th frame is the frame shift.
  • the range of the ratio of the frame shift to the frame length is (0, 0.5).
  • let the pre-emphasized voice signal be s'(n), the frame length be N sampling points, and the frame shift be M sampling points; then the l-th frame is x_l(n) = s'((l-1)M + n), where 0 ≤ n ≤ N-1.
  • the corresponding window function w(n) is multiplied with the framed speech signal x_l(n) to obtain the windowed speech signal S_w, and the windowed speech frames form the frame set of basic speech frames.
  • the window functions include, but are not limited to, rectangular windows, Hamming windows, and Hanning windows.
  • the rectangular window expression is: w(n) = 1 for 0 ≤ n ≤ N-1, and w(n) = 0 otherwise,
  • where w(n) is the window function,
  • N is the number of sampling points, and
  • n is the n-th sampling point.
  • the Hamming window expression is: w(n) = 0.54 - 0.46·cos(2πn/(N-1)) for 0 ≤ n ≤ N-1, and w(n) = 0 otherwise,
  • where π is the ratio of a circle's circumference to its diameter; preferably, the value of π in this embodiment is 3.1416.
  • the Hanning window expression is: w(n) = 0.5·(1 - cos(2πn/(N-1))) for 0 ≤ n ≤ N-1, and w(n) = 0 otherwise.
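The standard rectangular, Hamming, and Hanning window definitions can be implemented directly (N sampling points, n = 0 to N-1):

```python
import math

# Standard window functions; each returns 0 outside 0 <= n <= N-1.
def rectangular(n, N):
    return 1.0 if 0 <= n <= N - 1 else 0.0

def hamming(n, N):
    if not 0 <= n <= N - 1:
        return 0.0
    return 0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1))

def hanning(n, N):
    if not 0 <= n <= N - 1:
        return 0.0
    return 0.5 * (1 - math.cos(2 * math.pi * n / (N - 1)))

# Windowing a frame multiplies each sample by the window value.
N = 5
print([round(hamming(n, N), 2) for n in range(N)])  # [0.08, 0.54, 1.0, 0.54, 0.08]
```

The tapered Hamming and Hanning windows attenuate frame edges, which reduces the spectral leakage that a rectangular window would introduce.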
  • in this embodiment, amplitude normalization is performed on the speech data to obtain a basic speech signal, the basic speech signal is pre-emphasized to generate a target speech signal with a flat spectrum, and the target speech signal is framed according to the preset frame length and preset frame shift to obtain the frame set of basic speech frames. This improves the robustness of each basic speech frame in the frame set, benefits the subsequent text conversion of the basic speech frames, and thereby improves the accuracy of speech recognition.
  • the following uses a specific embodiment to describe in detail the implementation of step S5, namely performing text matching between the real-time voice text and the prohibited terms to obtain a first matching result. The detailed implementation process of step S5 provided in this embodiment is as follows:
  • a text similarity algorithm is used to calculate the similarity between each prohibited term and the real-time voice text. If the similarity is greater than or equal to a preset similarity threshold, it is determined that the real-time voice text contains the prohibited term, and this is taken as the first matching result.
  • after performing voice recognition in step S4 to obtain a real-time voice text, the similarity between the real-time voice text and each prohibited term is calculated and compared with a preset similarity threshold; if the similarity is greater than or equal to the threshold, it is determined that the real-time voice text includes the prohibited term.
  • the preset similarity threshold may be set to 0.8 or may be set according to actual needs, which is not specifically limited here.
  • a text similarity algorithm determines the similarity of two texts by calculating the ratio of the size of their intersection to the size of their union; the larger the calculated ratio, the more similar the two texts.
  • text similarity algorithms include, but are not limited to: cosine similarity, the k-nearest-neighbor (kNN) classification algorithm, Manhattan distance, the SimHash-based Hamming distance, and the like.
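As an illustrative sketch, the intersection-over-union ratio described above can be computed over character sets; this is one simple choice, and in practice one of the listed algorithms (cosine similarity, SimHash, and so on) would be substituted.

```python
def text_similarity(a, b):
    # Ratio of the intersection size to the union size of the two
    # texts' character sets; the larger the ratio, the more similar.
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)
```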
  • for example, the outgoing call prohibited terms obtained in step S3 include 15 phrases, namely V1, V2, V3, ..., V14, and V15.
  • first, the real-time voice text G is matched with V1. The matching process is: calculate the similarity between the real-time voice text G and V1; if the similarity is greater than or equal to the preset similarity threshold, it is determined that the real-time voice text contains the prohibited term, and this round of matching ends. If the similarity is less than the preset similarity threshold, matching continues with the prohibited term V2 that follows V1, in the same way as the matching of G with V1 above.
  • the same matching method is applied to match the real-time voice text G with the remaining outgoing call prohibited terms in turn. Whenever a similarity reaches the preset threshold during this process, it is determined that the real-time voice text contains an outgoing call prohibited term, and this round of matching ends.
  • in this embodiment, the similarity between the real-time voice text and each outgoing call prohibited term is calculated and compared with the preset similarity threshold to determine whether the real-time voice text contains a prohibited term, which improves the accuracy of the matching and ensures the accuracy of the first matching result.
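The sequential matching of the real-time voice text G against V1, V2, ... with early termination can be sketched as follows; the character-set similarity and the 0.8 threshold mirror the embodiment, and any of the listed text similarity algorithms could be used instead.

```python
def similarity(a, b):
    # Simple character-set intersection-over-union similarity.
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if (sa or sb) else 1.0

def match_prohibited(text, prohibited_terms, threshold=0.8):
    # Compare the real-time voice text with each prohibited term in turn;
    # end this round of matching as soon as one similarity reaches the threshold.
    for term in prohibited_terms:
        if similarity(text, term) >= threshold:
            return True, term      # first matching result: term is contained
    return False, None
```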
  • further, in addition to step S5, in which the real-time voice text is matched with the outgoing call prohibited terms to obtain the first matching result and the first warning measure is executed, the method may also monitor and warn, after the agent's outbound call ends, on whether all necessary outbound call terms were used. As shown in FIG. 6, the voice recognition method further includes:
  • for example, the preset time threshold is 10 seconds; it can be set according to actual needs and is not limited here.
  • for each necessary outbound call term, the similarity between that term and each of the Y real-time voice texts is calculated, yielding Y similarities; if all Y similarities are less than the preset similarity threshold, it is confirmed that the current outbound text does not contain that necessary outbound call term.
  • for example, the necessary outbound call terms include: "Hello", "Can I help you?", "Please wait a moment", "Thank you for your support", and "Goodbye".
  • the current outbound text was matched with the necessary outbound call terms, and it was found that it contained "Can I help you?", "Please wait a moment", "Thank you for your support", and "Goodbye", but did not include "Hello"; it is therefore confirmed that the second matching result is that the current outbound text does not contain the necessary outbound call terms.
  • besides matching the obtained current outbound text against the necessary outbound call terms by similarity, it is also possible to query each necessary outbound call term directly in the current outbound text. If every necessary term can be found, the second matching result is confirmed as "the current outbound text contains the necessary outbound call terms"; otherwise, it is confirmed as "the current outbound text does not contain the necessary outbound call terms".
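The direct-query variant described above can be sketched as follows; the phrase list mirrors the example, and the plain substring check is an illustrative assumption.

```python
def check_required(outbound_text, required_terms):
    # Second matching result: True only if every required outbound
    # call term can be queried in the current outbound text.
    missing = [t for t in required_terms if t not in outbound_text]
    return len(missing) == 0, missing

required = ["Hello", "Can I help you", "Please wait a moment",
            "Thank you for your support", "Goodbye"]
ok, missing = check_required(
    "Can I help you. Please wait a moment. "
    "Thank you for your support. Goodbye.", required)
# ok is False: "Hello" was never used, so the second warning measure fires
```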
  • if the second matching result is that the current outbound text does not contain the necessary outbound call terms, at least one required term was not used in this outbound call; at this time, a second warning measure is executed.
  • the second warning measures include, but are not limited to: sending a warning to the monitoring end that this outbound call is non-compliant, reminding the agent of the irregularities in this outbound call, and/or generating a record of this outbound call, etc.
  • different second warning measures may be set according to the importance of the necessary outbound call terms. For example, suppose the necessary terms include word G, word H, and word I, where words G and H have importance level one, word I has importance level two, and level one is lower than level two. Then the second warning measure corresponding to level one can be set to "remind the agent of the irregularity in this outbound call and generate a record of this outbound call", and the second warning measure corresponding to level two can be set to "send a non-compliance warning for this outbound call to the monitoring end and generate a record of this outbound call". When word I is missing from the current outbound text, the second warning measure executed is to send the non-compliance warning to the monitoring end and generate the record of this outbound call.
  • in this embodiment, the current outbound text is matched with the necessary outbound call terms to obtain a second matching result, and if the second matching result is that the current outbound text does not contain the necessary terms, the second warning measure is executed, so that an automatic warning is issued when the necessary outbound call terms are not used. This avoids monitoring by manually listening to and analyzing recordings, thereby improving monitoring efficiency.
  • it should be understood that the sequence numbers of the steps in the above embodiments do not imply an execution order; the execution order of each process should be determined by its function and internal logic, and does not constitute any limitation on the implementation of the embodiments of this application.
  • FIG. 7 shows a speech recognition device that corresponds one-to-one to the speech recognition method provided by the above method embodiment. For convenience of explanation, only the parts related to the embodiment of the present application are shown.
  • the voice recognition device includes a data acquisition module 10, a department determination module 20, a template selection module 30, a voice recognition module 40, a first matching module 50, and a first warning module 60.
  • the detailed description of each function module is as follows:
  • a data acquisition module 10 is configured to acquire voice data and an equipment identifier of an outbound device used by the agent when the outbound operation of the agent is monitored;
  • the department determination module 20 is configured to determine a business department to which the agent belongs based on the equipment identification;
  • the template selection module 30 is configured to obtain a business text template corresponding to a business department, where the business text template includes a required language for outbound calls and a prohibited language for outbound calls;
  • a voice recognition module 40 configured to perform voice recognition on voice data to obtain real-time voice text, and add the real-time voice text to the current outgoing text;
  • a first matching module 50 configured to perform text matching between the real-time voice text and the outgoing call prohibited words to obtain a first matching result
  • the first early warning module 60 is configured to execute a first early warning measure if the first matching result is that the real-time voice text includes an outbound call prohibition term.
  • the speech recognition module 40 includes:
  • a speech parsing unit 41 configured to perform speech parsing on speech data to obtain a frame set including basic speech frames
  • the silence detection unit 42 is configured to perform silence detection on the basic voice frame to obtain K consecutive silence frames in the basic voice frame, where K is a natural number;
  • a frame set dividing unit 43 configured to divide a basic voice frame included in the frame set into K + 1 target voice frames according to K consecutive mute frames;
  • the text conversion unit 44 is configured to convert each target speech frame into real-time speech text.
  • the speech parsing unit 41 includes:
  • a normalization subunit 411 configured to perform amplitude normalization processing on the voice data to obtain a basic voice signal
  • a pre-emphasis subunit 412 configured to perform pre-emphasis processing on a basic voice signal to generate a target voice signal having a flat frequency spectrum
  • the framing subunit 413 is configured to frame the target voice signal according to a preset frame length and a preset frame shift to obtain a frame set of basic voice frames.
  • the first matching module 50 includes:
  • the first matching unit 51 is configured to use a text similarity algorithm to calculate, for each outgoing call prohibited term, the similarity between that prohibited term and the real-time voice text, and, if the similarity is greater than or equal to the preset similarity threshold, to take "the real-time voice text contains the outgoing call prohibited term" as the first matching result.
  • the voice recognition device further includes:
  • a second matching module 70 configured to perform text matching between the current outgoing call text and the required words of the outgoing call when detecting that the outgoing call operation of the agent is terminated, to obtain a second matching result
  • the second early warning module 80 is configured to execute a second early warning measure if the second matching result is that the current outgoing text does not contain the necessary words for the outgoing call.
  • This embodiment provides one or more nonvolatile readable storage media storing computer readable instructions.
  • the nonvolatile readable storage medium stores computer-readable instructions; when the computer-readable instructions are executed by one or more processors, the one or more processors are caused to execute the speech recognition method in the foregoing method embodiment, or to implement the functions of the modules/units in the foregoing device embodiment. To avoid repetition, details are not repeated here.
  • the non-volatile readable storage medium may include: any entity or device capable of carrying the computer-readable instruction code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electric carrier signal, a telecommunication signal, and the like.
  • FIG. 8 is a schematic diagram of a computer device according to an embodiment of the present application.
  • the computer device 90 of this embodiment includes a processor 91, a memory 92, and computer-readable instructions 93 stored in the memory 92 and executable on the processor 91, such as a voice recognition program.
  • when the processor 91 executes the computer-readable instructions 93, the steps in the foregoing speech recognition method embodiment are implemented, for example, steps S1 to S6 shown in FIG. 2. Alternatively, when the processor 91 executes the computer-readable instructions 93, the functions of the modules/units in the foregoing device embodiment are implemented, for example, the functions of modules 10 to 60 shown in FIG. 7.
  • the computer device 90 may be a desktop computer, a notebook, a palmtop computer, or a cloud server.
  • FIG. 8 is only an example of the computer device in this embodiment; the device may include more or fewer components than shown in FIG. 8, or combine certain components, or have different components.
  • the memory 92 may be an internal storage unit of the computer device, such as a hard disk or a memory, or an external storage unit of the computer device, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card, etc.
  • the computer-readable instructions 93 include program code, which may be in a source code form, an object code form, an executable file, or some intermediate form.

Abstract

Disclosed in the present application are a voice recognition method, an apparatus, a computer device, and a storage medium. The method comprises: if an outbound call operation by an agent is detected, acquiring the device identifier and voice data of the agent, and determining the service department to which the agent belongs, so as to acquire the service text template corresponding to that department; and performing voice recognition on the voice data to obtain real-time voice text. By matching the service text template against the real-time voice text in real time to obtain a matching result, and taking corresponding warning measures according to the matching result, the present invention monitors the agent's voice in real time during outbound calling and can discover non-standard wording in a timely manner and issue a warning, thereby ensuring the timeliness of monitoring. Because outbound calling is monitored without manually listening to and analyzing recordings, time is saved and monitoring efficiency is improved.

Description

Speech recognition method, device, computer equipment and storage medium
This application is based on, and claims priority to, Chinese invention patent application No. 201810529536.0, filed on May 29, 2018 and entitled "A Voice Recognition Method, Device, Terminal Device and Storage Medium".
Technical Field
The present application relates to the field of computer technology, and in particular to a speech recognition method, device, computer device, and storage medium.
Background
The call center consists of an interactive voice response system and a manual call system. The manual call system consists of a check-in system, a call platform, and interface machines. To provide customer service, a customer representative, that is, an agent, needs to check in through the check-in system. After successfully checking in to the call platform, the agent establishes a call with a customer according to the manual service requests assigned by the platform; in other words, the agent makes outbound calls to serve customers. Usually, according to business requirements, different business wordings are set for different services so as to provide customers with better service.
Although every agent is informed of the relevant business wording before making outbound calls, in practice, because of job transfers or unfamiliarity with the business, agents often use inappropriate wording during outbound calls.
To deal with inappropriate outbound wording, the current practice is to listen to the recordings afterwards and analyze them, so as to identify non-compliant outbound calls and handle them accordingly. On the one hand, since the recordings can only be reviewed after the fact, timely warnings are impossible, and the monitoring of agents' outbound voice calls is not timely; on the other hand, manually listening to and analyzing all the recordings takes a great deal of time, resulting in low monitoring efficiency.
Summary
The embodiments of the present application provide a speech recognition method, device, computer device, and storage medium, to solve the problems that the monitoring of agents' outbound voice calls is currently not timely and the monitoring efficiency is low.
An embodiment of the present application provides a speech recognition method, including:
if an outbound call operation by an agent is detected, acquiring the voice data generated during the agent's outbound call and the device identifier of the outbound call device used;
determining the business department to which the agent belongs based on the device identifier;
obtaining the business text template corresponding to the business department, where the business text template includes necessary outbound call terms and outgoing call prohibited terms;
performing speech recognition on the voice data to obtain real-time voice text, and adding the real-time voice text to the current outbound text;
matching the real-time voice text against the outgoing call prohibited terms to obtain a first matching result; and
if the first matching result is that the real-time voice text contains an outgoing call prohibited term, executing a first warning measure.
An embodiment of the present application provides a speech recognition device, including:
a data acquisition module, configured to acquire, if an outbound call operation by an agent is detected, the voice data generated during the agent's outbound call and the device identifier of the outbound call device used;
a department determination module, configured to determine the business department to which the agent belongs based on the device identifier;
a template selection module, configured to obtain the business text template corresponding to the business department, where the business text template includes necessary outbound call terms and outgoing call prohibited terms;
a speech recognition module, configured to perform speech recognition on the voice data to obtain real-time voice text, and to add the real-time voice text to the current outbound text;
a first matching module, configured to match the real-time voice text against the outgoing call prohibited terms to obtain a first matching result; and
a first warning module, configured to execute a first warning measure if the first matching result is that the real-time voice text contains an outgoing call prohibited term.
An embodiment of the present application provides a computer device, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, where the processor implements the steps of the above speech recognition method when executing the computer-readable instructions.
An embodiment of the present application provides one or more non-volatile readable storage media storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of the above speech recognition method.
Details of one or more embodiments of the present application are set forth in the accompanying drawings and the description below; other features and advantages of the present application will become apparent from the description, the drawings, and the claims.
Brief Description of the Drawings
To explain the technical solutions in the embodiments of the present application more clearly, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application; for a person of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a schematic diagram of an application environment of the speech recognition method provided by an embodiment of the present application;
FIG. 2 is an implementation flowchart of the speech recognition method provided by an embodiment of the present application;
FIG. 3 is an implementation flowchart of step S4 in the speech recognition method provided by an embodiment of the present application;
FIG. 4 is an implementation flowchart of step S41 in the speech recognition method provided by an embodiment of the present application;
FIG. 5 is an example diagram of overlapping framing of a speech signal in the speech recognition method provided by an embodiment of the present application;
FIG. 6 is an implementation flowchart of the monitoring and warning on necessary outbound call terms in the speech recognition method provided by an embodiment of the present application;
FIG. 7 is a schematic diagram of the speech recognition device provided by an embodiment of the present application;
FIG. 8 is a schematic diagram of the computer device provided by an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. Based on the embodiments in the present application, all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the protection scope of the present application.
Please refer to FIG. 1, which shows the application environment of the speech recognition method provided by an embodiment of the present application. The speech recognition method is applied to the outbound-call scenario of call center agents. The call center includes a server, clients, and a monitoring end, where the server and the clients, as well as the server and the monitoring end, are connected through a network. Agents make outbound calls through the clients; a client may specifically be, but is not limited to, a direct-line telephone, a telephone-network phone connected through a program-controlled switch, a mobile phone, a walkie-talkie, or another smart device used for communication. The server and the monitoring end may each be implemented by an independent server or by a server cluster composed of multiple servers. The speech recognition method provided in the embodiments of the present application is applied to the server.
Please refer to FIG. 2, which shows the implementation flow of the speech recognition method provided by an embodiment of the present application. Taking the application of this method to the server in FIG. 1 as an example, the method includes the following steps:
S1: if an outbound call operation by an agent is detected, acquire the voice data generated during the agent's outbound call and the device identifier of the outbound call device used by the agent.
Specifically, the server and the client are connected through a network, and the server can monitor the client in real time. When an outbound call operation by an agent is detected at the client, the device identifier of the outbound call device used by the agent and the voice data generated during the outbound call are acquired.
The client includes at least two outbound call devices, each of which is used by one agent for outbound calls.
It should be noted that the server's monitoring of the client may be implemented through the listening mode of socket inter-process communication, through data transmission control based on the Transmission Control Protocol (TCP), or through a third-party tool with a monitoring function. The preferred approach in the embodiments of the present application is the listening mode of socket communication; in practice, a suitable monitoring method can be selected according to the specific situation, which is not limited here.
S2: determine the business department to which the agent belongs based on the device identifier.
Specifically, the device identifier records the main information of the device, including but not limited to: the agent's employee number, the department to which the agent belongs, the device type, and the device number. After the device identifier is obtained, the business department to which the agent belongs can be determined from it.
For example, in one specific implementation, the obtained device identifier is 89757-KD-EN170-962346, which encodes the following information: the agent's employee number is 89757, the agent's department is KD, the device type is EN170, and the device number is 962346.
It is worth noting that an agent needs to verify his or her identity before using an outbound call device; verification methods include but are not limited to account verification, voiceprint recognition, and fingerprint recognition. After verification, the outbound call device records the corresponding information into the device identifier.
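As an illustrative sketch (assuming the hyphen-separated layout of the example identifier 89757-KD-EN170-962346), the business department can be read out of the device identifier like this:

```python
def parse_device_id(device_id):
    # Layout assumed from the example: employee number, department,
    # device type, device number, separated by hyphens.
    employee_no, department, device_type, device_no = device_id.split("-")
    return {"employee_no": employee_no, "department": department,
            "device_type": device_type, "device_no": device_no}

info = parse_device_id("89757-KD-EN170-962346")
# info["department"] is "KD", which is then used to fetch the
# business text template of that department (e.g. template KDYY)
```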
S3: obtain the business text template corresponding to the business department, where the business text template includes necessary outbound call terms and outgoing call prohibited terms.
Specifically, each business department is preset with its own business text template. According to the business department determined in step S2, the business text template corresponding to that department is obtained; each business text template contains the necessary outbound call terms and the outgoing call prohibited terms.
Taking the business department obtained in step S2 as an example, the department code is KD. The business text template KDYY corresponding to department KD is found in the database and used as the compliance template for the current agent's outbound call; that is, after the voice data of the current outbound call is converted into text, the text is checked against the business text template KDYY, so as to monitor whether the agent's outbound wording is compliant.
S4: perform speech recognition on the voice data to obtain real-time voice text, and add the real-time voice text to the current outbound text.
Specifically, speech recognition is performed on the agent's outbound voice data acquired in step S1 to obtain the real-time voice text of the outbound call, so that the agent's outbound wording can be monitored for compliance by checking the real-time voice text; at the same time, the real-time voice text is added to the current outbound text.
Here, real-time voice text refers to the following: according to the pauses and silences during each outbound call, the outbound voice data is segmented into successive pieces, and speech recognition is performed on each piece to obtain the corresponding recognized text, which is the real-time voice text.
For example, in one specific implementation, a piece of voice data is acquired from second 0 to second 1.8 and recorded as voice data E; the voice data from second 1.8 to second 3 is empty; another piece of voice data is acquired from second 3 to second 8 and recorded as voice data F. Speech recognition on voice data E yields the real-time voice text "Hello", and speech recognition on voice data F yields the real-time voice text "This is China XX, how may I help you?".
Speech recognition on the voice data may use a speech recognition algorithm or a third-party tool with a speech recognition function, which is not limited here. Speech recognition algorithms include but are not limited to: vocal tract model based speech recognition algorithms, speech template matching algorithms, and artificial neural network based speech recognition algorithms.
Preferably, the speech recognition algorithm used in the embodiments of the present application is a vocal tract model based speech recognition algorithm.
S5: match the real-time voice text against the outgoing call prohibited terms to obtain a first matching result.
Specifically, the real-time voice text obtained in step S4 is text-matched against the outgoing call prohibited terms in the business text template obtained in step S3, to check whether the real-time voice text contains any prohibited term; this real-time monitoring effectively ensures the timeliness of the monitoring.
The first matching result is either that the real-time voice text contains an outgoing call prohibited term, or that it does not.
It is easy to understand that the outgoing call prohibited terms can be set according to business requirements; there may be one prohibited term, or two or more.
It is worth noting that there may be one or more real-time voice texts; if at least one real-time voice text contains an outgoing call prohibited term, the first matching result is determined to be that the real-time voice text contains an outgoing call prohibited term.
S6:若第一匹配结果为实时语音文本包含外呼禁止用语,则执行第一预警措施。S6: If the first matching result is that the real-time voice text contains outbound call prohibition words, a first warning measure is executed.
具体地,若步骤S5得到的第一匹配结果为实时语音文本包含外呼禁止用语,则说明坐席员在本次外呼中使用了至少一个外呼禁止用语,此时,将执行第一预警措施。Specifically, if the first matching result obtained in step S5 is that the real-time voice text contains a prohibited outbound-call term, the agent has used at least one prohibited term in this outbound call, and the first warning measure is executed.
其中,第一预警措施包括但不限于:向监控端发送本次外呼不规范的预警提示、提醒本次外呼的坐席员本次外呼中出现的不规范事项和/或断开当前外呼设备的网络连接等,其具体可根据实际情况设定,此处不作具体限制。The first warning measure includes, but is not limited to: sending the monitoring end a warning that this outbound call is non-compliant, reminding the agent of the irregularities that occurred in this call, and/or disconnecting the network connection of the current outbound-call device; it can be set according to the actual situation and is not specifically limited here.
进一步地,可以根据外呼禁止用语的严重程度,设置不同的第一预警措施。例如,若外呼禁止用语包括词语A、词语B和词语C,其中,词语A和词语B的严重程度为一级,词语C的严重程度为二级,并且一级低于二级,则可以设置一级对应的第一预警措施为“向监控端发送本次外呼不规范的预警提示”,同时设置二级对应的第一预警措施为“断开当前外呼设备的网络连接”。当实时语音文本包含词语C时,执行第一预警措施,直接断开当前外呼设备的网络连接,终止坐席员的外呼过程。Further, different first warning measures may be set according to the severity of the prohibited terms. For example, if the prohibited terms include word A, word B, and word C, where words A and B have severity level one, word C has severity level two, and level one is lower than level two, the level-one measure can be set to "send the monitoring end a warning that this outbound call is non-compliant" and the level-two measure to "disconnect the network connection of the current outbound-call device". When the real-time voice text contains word C, the first warning measure is executed: the network connection of the current outbound-call device is disconnected directly, terminating the agent's outbound call.
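The severity-tiered dispatch described above can be sketched as follows. This is a minimal, hypothetical Python illustration, not the application's implementation; the term list, the level assignments, and the measure names are assumed examples.

```python
from typing import Optional

# Hypothetical severity table: words A/B are level 1, word C is level 2.
BANNED_TERM_SEVERITY = {"词语A": 1, "词语B": 1, "词语C": 2}

MEASURES = {
    1: "send_alert_to_monitor",       # level 1: warn the monitoring end
    2: "disconnect_outbound_device",  # level 2: cut the device's network link
}

def first_warning_measure(recognized_text: str) -> Optional[str]:
    """Return the measure for the most severe banned term found, else None."""
    hits = [sev for term, sev in BANNED_TERM_SEVERITY.items()
            if term in recognized_text]
    return MEASURES[max(hits)] if hits else None
```

When several terms of different levels occur in one text, the sketch applies the measure of the highest severity level found.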
在图2对应的实施例中,若监测到有坐席员的外呼操作,则获取该坐席员的设备标识和语音数据,通过该设备标识,确定坐席员所属的业务部门,进而获取该业务部门对应的业务文本模板,并对语音数据进行语音识别,得到实时语音文本,将实时语音文本存入当前外呼文本,通过实时对外呼禁止用语和实时语音文本进行文本匹配,得到第一匹配结果,若第一匹配结果为实时语音文本包含外呼禁止用语,则执行第一预警措施,实现了对坐席员外呼过程中的语音进行实时监控,当坐席员在外呼过程中使用了外呼禁止用语时,能够及时发现并预警,从而确保了监控的及时性,并且,由于无需通过人工听取并分析录音来对外呼进行监控,从而节约了时间,提高了监控效率。In the embodiment corresponding to FIG. 2, if an agent's outbound-call operation is detected, the agent's device identifier and voice data are acquired; the business department the agent belongs to is determined from the device identifier, and the corresponding business text template is obtained; speech recognition is performed on the voice data to obtain real-time voice text, which is stored in the current outbound-call text; the prohibited outbound-call terms are text-matched against the real-time voice text in real time to obtain the first matching result; and if the first matching result is that the real-time voice text contains a prohibited term, the first warning measure is executed. This realizes real-time monitoring of the agent's speech during the outbound call: when the agent uses a prohibited term, it can be detected and warned of in time, ensuring the timeliness of monitoring; and since there is no need to manually listen to and analyze recordings, time is saved and monitoring efficiency is improved.
接下来,在图2对应的实施例的基础之上,下面通过一个具体的实施例来对步骤S4中所提及的对语音数据进行语音识别,得到实时语音文本的具体实现方法进行详细说明。Next, on the basis of the embodiment corresponding to FIG. 2, a specific embodiment is used to perform a detailed description on the specific implementation method of performing voice recognition on the voice data mentioned in step S4 to obtain real-time voice text.
请参阅图3,图3示出了本申请实施例提供的步骤S4的具体实现流程,详述如下:Please refer to FIG. 3, which illustrates a specific implementation process of step S4 provided by an embodiment of the present application, which is detailed as follows:
S41:对语音数据进行语音解析,得到包含基础语音帧的帧集合。S41: Perform speech analysis on the speech data to obtain a frame set including basic speech frames.
具体地,对获取到的语音数据进行语音解析,得到包含基础语音帧的帧集合,语音解析包括但不限于:语音编码和语音信号的预处理等。Specifically, speech analysis is performed on the acquired speech data to obtain a frame set including basic speech frames. Speech analysis includes, but is not limited to, speech encoding and pre-processing of speech signals.
其中,语音编码就是对模拟的语音信号进行编码,将模拟信号转化成数字信号,从而降低传输码率并进行数字传输,语音编码的基本方法可分为波形编码、参量编码(音源编码)和混合编码。Among them, speech coding is to encode analog speech signals and convert the analog signals into digital signals, thereby reducing the transmission code rate and digital transmission. The basic methods of speech encoding can be divided into waveform encoding, parametric encoding (sound source encoding) and mixed coding.
优选地,本申请使用的语音编码方式为波形编码,波形编码是将时域的模拟话音的波形信号经过取样、量化、编码而形成数字话音信号,波形编码可提供高话音质量。Preferably, the voice coding method used in the present application is waveform coding, in which the waveform signal of the analog voice in the time domain is sampled, quantized, and encoded to form a digital voice signal; waveform coding can provide high voice quality.
其中,语音信号的预处理是指在对语音信号进行分析和处理之前,对其进行预加重、分帧、加窗等预处理操作。语音信号的预处理的目的是消除因为人类发声器官本身和由于采集语音信号的设备所带来的混叠、高次谐波失真、高频等等因素,对语音信号质量的影响。尽可能保证后续语音处理得到的信号更均匀、平滑,为信号参数提取提供优质的参数,提高语音处理质量。Among them, the preprocessing of a voice signal refers to pre-emphasis, framing, windowing and other preprocessing operations on the voice signal before analysis and processing. The purpose of voice signal pre-processing is to eliminate the effects of aliasing, higher harmonic distortion, high frequency and other factors on the quality of the voice signal caused by the human vocal organ itself and the equipment that collects the voice signal. As far as possible, ensure that the signals obtained by subsequent speech processing are more uniform and smooth, provide high-quality parameters for signal parameter extraction, and improve the quality of speech processing.
S42:对基础语音帧进行静音检测,得到基础语音帧中的K个连续静音帧,其中,K 为自然数。S42: Perform mute detection on the basic voice frame to obtain K consecutive mute frames in the basic voice frame, where K is a natural number.
具体地,在外呼通话持续期间,语音数据中的语音信号可分为激活期和静默期两个状态,静默期不传送任何语音信号,上、下行链路的激活期和静默期相互独立。坐席员在外呼过程中,在每次发音前后,均会有停顿的状态,这个状态会带来语音信号的停顿,即静默期,在进行语音识别并转换文本的时候,需要检测出静默期状态,进而将静默期与激活期进行分离,以得到持续的激活期,将保留下来的持续的激活期的语音信号作为目标语音帧。Specifically, during the outbound call, the voice signal in the voice data can be divided into two states, an active period and a silent period; no voice signal is transmitted during the silent period, and the uplink and downlink active and silent periods are independent of each other. During the outbound call, the agent pauses before and after each utterance; such a pause in the voice signal is a silent period. When performing speech recognition and text conversion, the silent periods need to be detected and separated from the active periods to obtain continuous active periods, and the retained voice signal of each continuous active period is used as a target voice frame.
其中,检测静默音状态的方式包括但不限于:语音端点检测、探测音频静音算法和语音活动检测(Voice Activity Detection,VAD)算法等。The methods for detecting the state of the silence include, but are not limited to, voice endpoint detection, detection of audio mute algorithms, and voice activity detection (VAD) algorithms.
优选地,本申请实施例使用的对基础语音帧进行静音检测,得到基础语音帧中的K个连续静音帧的具体实现流程包括步骤A至步骤D,详述如下:Preferably, the specific implementation of performing silence detection on the basic voice frames to obtain K continuous silence frames, as used in the embodiment of the present application, includes steps A to D, detailed as follows:
步骤A:计算每帧基础语音帧的帧能量。Step A: Calculate the frame energy of each basic speech frame.
具体地,帧能量是语音信号的短时能量,反映了语音帧的语音信息的数据量,通过帧能量可以判断该语音帧是语句帧还是静音帧。Specifically, the frame energy is the short-time energy of the voice signal and reflects the amount of voice information in the voice frame; the frame energy can be used to determine whether a voice frame is a speech frame or a silence frame.
步骤B:针对每帧基础语音帧,若该基础语音帧的帧能量小于预设的帧能量阈值,则标记该基础语音帧为静音帧。Step B: For each basic speech frame, if the frame energy of the basic speech frame is less than a preset frame energy threshold, mark the basic speech frame as a silent frame.
具体地,帧能量阈值为预先设定的参数,若计算得到的基础语音帧的帧能量小于预设的帧能量阈值,则将对应的基础语音帧标记为静音帧,该帧能量阈值具体可以根据实际需求进行设置,如帧能量阈值设置为0.5,也可以根据计算得到的各个基础语音帧的帧能量进行具体分析设置,此处不做限制。Specifically, the frame energy threshold is a preset parameter. If the calculated frame energy of a basic voice frame is less than the preset threshold, the frame is marked as a silence frame. The threshold can be set according to actual requirements, for example 0.5, or set based on analysis of the calculated frame energies of the basic voice frames, which is not limited here.
例如,在一具体实施方式中,帧能量阈值设置为0.5,对6个基础语音帧:J1、J2、J3、J4、J5和J6计算帧能量,得到结果分别为:1.6、0.2、0.4、1.7、1.1和0.8,由此结果容易理解,基础语音帧J2和基础语音帧J3为静音帧。For example, in a specific embodiment, the frame energy threshold is set to 0.5, and the frame energies of 6 basic voice frames J1, J2, J3, J4, J5, and J6 are calculated as 1.6, 0.2, 0.4, 1.7, 1.1, and 0.8, respectively; it follows that basic voice frames J2 and J3 are silence frames.
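Steps A and B can be sketched as follows. The short-time energy is taken here as the sum of squared sample amplitudes, which is one common definition; the text does not fix a formula, so that choice is an assumption. The energy values and threshold reproduce the J1-J6 example.

```python
def frame_energy(frame):
    """Short-time energy: sum of squared sample amplitudes (assumed definition)."""
    return sum(x * x for x in frame)

def mark_silence(energies, threshold=0.5):
    """True where a basic voice frame's energy falls below the threshold."""
    return [e < threshold for e in energies]

# Energies of J1..J6 from the example; threshold 0.5 marks J2 and J3 as silence.
flags = mark_silence([1.6, 0.2, 0.4, 1.7, 1.1, 0.8])
```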
步骤C:若检测到H个连续的静音帧,且H大于预设的连续阈值I,则将该H个连续的静音帧组成的帧集合作为连续静音帧。Step C: If H consecutive silence frames are detected and H is greater than the preset continuity threshold I, the frame set composed of the H consecutive silence frames is taken as one continuous silence frame.
具体地,连续阈值I可以根据实际需要进行预先设置,若存在连续的静音帧且其数量H大于预设的连续阈值I,则将该H个连续的静音帧组成的区间中的所有静音帧进行合并,得到一个连续静音帧。Specifically, the continuity threshold I can be preset according to actual needs. If there are H consecutive silence frames and H is greater than the preset threshold I, all the silence frames in the interval formed by these H frames are merged into one continuous silence frame.
例如,在一具体实施方式中,预设的连续阈值I为5,在某一时刻,获取到的静音帧状态如表一所示,表一示出了50个基础语音帧组成的帧集合,由表一可知,包含多于5个连续静音帧的区间为:帧序号7至帧序号13对应的基础语音帧组成的区间P,以及帧序号21至帧序号29对应的基础语音帧组成的区间Q,因而,将区间P中包含的帧序号7至帧序号13对应的7个基础语音帧进行组合,得到一个连续静音帧P,该连续静音帧P的时长为帧序号7至帧序号13对应的7个基础语音帧的时长之和,按此方法,将区间Q中包含的帧序号21至帧序号29对应的基础语音帧进行组合,作为另一个连续静音帧Q,连续静音帧Q的时长为帧序号21至帧序号29对应的9个基础语音帧的时长之和。For example, in a specific embodiment, the preset continuity threshold I is 5. At a certain moment, the states of the acquired silence frames are as shown in Table 1, which shows a frame set of 50 basic voice frames. As can be seen from Table 1, the intervals containing more than 5 consecutive silence frames are: interval P, composed of the basic voice frames with frame numbers 7 to 13, and interval Q, composed of the basic voice frames with frame numbers 21 to 29. Therefore, the 7 basic voice frames with frame numbers 7 to 13 in interval P are combined into one continuous silence frame P, whose duration is the sum of the durations of those 7 frames; in the same way, the basic voice frames with frame numbers 21 to 29 in interval Q are combined into another continuous silence frame Q, whose duration is the sum of the durations of those 9 frames.
表一Table 1
帧序号Frame number: 1 2 3 4 5 6 7 8 9 10
是否静音帧Silence frame: no no yes no no no yes yes yes yes
帧序号Frame number: 11 12 13 14 15 16 17 18 19 20
是否静音帧Silence frame: yes yes yes no no no no no no no
帧序号Frame number: 21 22 23 24 25 26 27 28 29 30
是否静音帧Silence frame: yes yes yes yes yes yes yes yes yes no
帧序号Frame number: 31 32 33 34 35 36 37 38 39 40
是否静音帧Silence frame: yes yes no no no no no no yes yes
帧序号Frame number: 41 42 43 44 45 46 47 48 49 50
是否静音帧Silence frame: no yes yes no no yes no no no no
步骤D:按照步骤A至步骤C的方法,获取连续静音帧的总数K。Step D: Following the method of steps A to C, obtain the total number K of continuous silence frames.
以步骤C中列举的表一为例,获取的连续静音帧为连续静音帧P和连续静音帧Q,因此,在步骤C对应的举例中,K的值为2。Taking Table 1 in step C as an example, the acquired continuous silence frames are continuous silence frames P and Q; therefore, in that example, the value of K is 2.
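Steps C and D, which group runs of more than I consecutive silence frames and count them, can be sketched as follows. The silence pattern reproduces Table 1 (runs at frames 7-13 and 21-29), so the count K is 2; the run-detection loop itself is an illustrative implementation, not the application's.

```python
def continuous_silence_runs(is_silent, threshold=5):
    """Return 1-based (start, end) frame numbers of each run of H consecutive
    silence frames with H greater than the continuity threshold I."""
    runs, start = [], None
    for i, silent in enumerate(is_silent, start=1):
        if silent and start is None:
            start = i
        elif not silent and start is not None:
            if i - start > threshold:      # run length H = i - start
                runs.append((start, i - 1))
            start = None
    if start is not None and len(is_silent) - start + 1 > threshold:
        runs.append((start, len(is_silent)))
    return runs

# Silence frame numbers of Table 1 (1-based).
SILENT = {3, 7, 8, 9, 10, 11, 12, 13, 21, 22, 23, 24, 25, 26, 27, 28, 29,
          31, 32, 39, 40, 42, 43, 46}
flags = [i in SILENT for i in range(1, 51)]
runs = continuous_silence_runs(flags)   # K = len(runs)
```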
S43:根据K个连续静音帧,将帧集合中包含的基础语音帧划分成K+1个目标语音帧。S43: Divide the basic voice frames contained in the frame set into K + 1 target voice frames according to K consecutive mute frames.
具体地,将步骤S42中得到的K个连续静音帧作为分界点,将帧集合中包含的基础语音帧划分开来,得到K+1个基础语音帧的集合区间,将每个集合区间中包含的所有基础语音帧,作为一个目标语音帧。Specifically, using the K continuous silence frames obtained in step S42 as dividing points, the basic voice frames in the frame set are partitioned into K+1 interval sets, and all the basic voice frames in each interval set are taken as one target voice frame.
例如,在一具体实施方式中,获取到的静音帧的状态如S42中步骤C的表一所示,该表示出了两个连续静音帧,分别为帧序号7至帧序号13对应的7个基础语音帧进行组合得到的一个连续静音帧P,以及帧序号21至帧序号29对应的9个基础语音帧进行组合得到的一个连续静音帧Q,将这两个连续静音帧作为分界点,将这个包含50个基础语音帧的帧集合划分成了三个区间,分别为:帧序号1至帧序号6对应的基础语音帧组成的区间M1,帧序号14至帧序号20对应的基础语音帧组成的区间M2,以及帧序号30至帧序号50对应的基础语音帧组成的区间M3,将区间M1中所有的基础语音帧进行组合,得到一个组合后的语音帧,作为目标语音帧M1。For example, in a specific embodiment, the states of the acquired silence frames are as shown in Table 1 of step C in S42, which shows two continuous silence frames: continuous silence frame P, obtained by combining the 7 basic voice frames with frame numbers 7 to 13, and continuous silence frame Q, obtained by combining the 9 basic voice frames with frame numbers 21 to 29. Using these two continuous silence frames as dividing points, the frame set of 50 basic voice frames is divided into three intervals: interval M1, composed of the basic voice frames with frame numbers 1 to 6; interval M2, with frame numbers 14 to 20; and interval M3, with frame numbers 30 to 50. All the basic voice frames in interval M1 are combined into one voice frame, which serves as target voice frame M1.
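Step S43 can be sketched by using the K continuous silence runs as boundaries. With the Table 1 runs (7-13) and (21-29) over 50 frames, this yields the three target intervals M1, M2, and M3 described above; the splitting function is an illustrative sketch.

```python
def split_into_targets(total_frames, silence_runs):
    """Split frames 1..total_frames into K+1 target intervals, using the K
    continuous-silence runs (1-based, inclusive) as dividing points."""
    segments, cursor = [], 1
    for start, end in silence_runs:
        if start > cursor:
            segments.append((cursor, start - 1))
        cursor = end + 1
    if cursor <= total_frames:
        segments.append((cursor, total_frames))
    return segments

# M1 = frames 1-6, M2 = frames 14-20, M3 = frames 30-50
targets = split_into_targets(50, [(7, 13), (21, 29)])
```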
S44:将每个目标语音帧转换为实时语音文本。S44: Convert each target voice frame into real-time voice text.
具体地,对每个目标语音帧进行文本转换,得到该目标语音帧对应的实时语音文本。Specifically, text conversion is performed on each target voice frame to obtain a real-time voice text corresponding to the target voice frame.
其中,文本转换可使用支持语音转文本的工具,也可以使用文本转换算法,此处不作具体限制。The text conversion may use a speech-to-text tool or a text conversion algorithm, which is not specifically limited here.
在图3对应的实施例中,对语音数据进行语音解析,得到包含基础语音帧的帧集合,进而对基础语音帧进行静音检测,得到基础语音帧中的K个连续静音帧,根据这K个连续静音帧,将帧集合中包含的基础语音帧划分成K+1个目标语音帧,将每个目标语音帧均转换为一个实时语音文本,使得接收到的语音信号被实时转换成一个个独立的实时语音文本,以便于使用该实时语音文本来对外呼禁止用语进行匹配,保证了外呼过程中监控的及时性。In the embodiment corresponding to FIG. 3, speech parsing is performed on the voice data to obtain a frame set containing basic voice frames; silence detection is then performed to obtain K continuous silence frames; according to these K continuous silence frames, the basic voice frames in the frame set are divided into K+1 target voice frames, and each target voice frame is converted into a real-time voice text. The received voice signal is thus converted in real time into independent real-time voice texts, so that they can be matched against the prohibited outbound-call terms, ensuring the timeliness of monitoring during the outbound call.
接下来,在图3对应的实施例的基础之上,下面通过一个具体的实施例来对步骤S41中所提及的对语音数据进行语音解析,得到包含基础语音帧的帧集合的具体实现方法进行详细说明。Next, on the basis of the embodiment corresponding to FIG. 3, the specific implementation of performing speech parsing on the voice data mentioned in step S41 to obtain a frame set containing basic voice frames is described in detail through a specific embodiment.
请参阅图4,图4示出了本申请实施例提供的步骤S41的具体实现流程,详述如下:Please refer to FIG. 4, which illustrates a specific implementation process of step S41 provided by an embodiment of the present application, which is detailed as follows:
S411:对语音数据进行幅值归一化处理,得到基础语音信号。S411: Perform amplitude normalization processing on the voice data to obtain a basic voice signal.
具体地,利用设备获取的语音数据都是模拟信号,在获取到语音数据后,要对语音数据采用脉冲编码调制技术(Pulse Code Modulation,PCM)进行编码,使这些模拟信号转化为数字信号,并将语音数据中的模拟信号每隔预设的时间对一个采样点进行采样,使其离散化,进而对采样信号量化,以二进制码组的方式输出量化后的数字信号,根据语音的频谱范围200-3400Hz,采样率可设置为8KHz,量化精度为16bit。Specifically, the voice data acquired by the device is an analog signal. After the voice data is acquired, it is encoded using Pulse Code Modulation (PCM) to convert the analog signal into a digital signal: the analog signal is sampled at one sampling point per preset interval to discretize it, the sampled signal is then quantized, and the quantized digital signal is output as binary code groups. Given the speech spectrum range of 200-3400 Hz, the sampling rate can be set to 8 kHz with a quantization precision of 16 bits.
应理解,此处采样率和量化精度的数值范围,为本申请优选范围,但可以根据实际应用的需要进行设置,此处不做限制。It should be understood that the numerical ranges of the sampling rate and quantization accuracy herein are the preferred ranges of the present application, but can be set according to the needs of practical applications, and are not limited here.
进一步地,对经过离散化和量化的语音数据进行幅值归一化处理,具体幅值归一化处理方式可以是将每个采样点的采样值除以语音数据的采样值中的最大值,也可以将每个采样点的采样值除以对应语音数据的采样值的均值,将数据收敛到特定区间,方便进行数据处理。Further, the amplitude normalization processing is performed on the discretized and quantized speech data. The specific amplitude normalization processing method may be dividing the sampling value of each sampling point by the maximum value among the sampling values of the speech data. It is also possible to divide the sampling value of each sampling point by the average value of the corresponding sampling value of the speech data, and converge the data to a specific interval, which is convenient for data processing.
值得说明的是,在幅值归一化处理之后,将音频数据中每个采样点的采样值转换为对应的标准值,从而得到与语音数据对应的基础语音信号。It is worth noting that after the amplitude normalization process, the sample value of each sampling point in the audio data is converted into a corresponding standard value, thereby obtaining a basic voice signal corresponding to the voice data.
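The peak-based variant of the amplitude normalization in S411, dividing each sample by the maximum absolute sample value (one of the two options the text gives), can be sketched as follows; the 16-bit sample values are illustrative.

```python
def normalize_amplitude(samples):
    """Map quantized samples into [-1, 1] by the maximum absolute value."""
    peak = max(abs(x) for x in samples)
    if peak == 0:
        return list(samples)          # all-zero signal: nothing to scale
    return [x / peak for x in samples]

# Illustrative 16-bit sample values scaled to standard values.
normalized = normalize_amplitude([0, 16384, -32768])
```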
S412:对基础语音信号进行预加重处理,生成具有平坦频谱的目标语音信号。S412: Perform pre-emphasis processing on the basic voice signal to generate a target voice signal with a flat frequency spectrum.
具体地,由于声门激励和口鼻辐射会对基础语音信号的平均功率谱产生影响,导致高频在超过800Hz时会按6dB/倍频跌落,所以在计算基础语音信号频谱时,频率越高相应的成分越小,为此要在预处理中进行预加重(Pre-emphasis)处理,预加重的目的是提高高频部分,使信号的频谱变得平坦,保持在低频到高频的整个频带中,能用同样的信噪比求频谱,以便于频谱分析或者声道参数分析。预加重可在语音信号数字化时在反混叠滤波器之前进行,这样不仅可以进行预加重,而且可以压缩信号的动态范围,有效地提高信噪比。预加重可使用一阶的数字滤波器来实现,例如:有限脉冲响应(Finite Impulse Response,FIR)滤波器。Specifically, since glottal excitation and mouth-nose radiation affect the average power spectrum of the basic voice signal, high frequencies above 800 Hz fall off at 6 dB per octave; thus, when computing the spectrum of the basic voice signal, the higher the frequency, the smaller the corresponding component. Pre-emphasis is therefore applied during pre-processing: its purpose is to boost the high-frequency part and flatten the signal spectrum, so that the spectrum can be computed with the same signal-to-noise ratio over the whole band from low to high frequencies, facilitating spectrum analysis or vocal tract parameter analysis. Pre-emphasis can be performed before the anti-aliasing filter when the voice signal is digitized, which not only performs pre-emphasis but also compresses the dynamic range of the signal and effectively improves the signal-to-noise ratio. Pre-emphasis can be implemented with a first-order digital filter, for example a Finite Impulse Response (FIR) filter.
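A first-order FIR pre-emphasis filter of the kind mentioned above can be sketched as y[n] = x[n] - a·x[n-1]. The coefficient a = 0.97 is a common choice and is an assumption here, not a value given in the text.

```python
def pre_emphasis(samples, a=0.97):
    """First-order FIR high-frequency boost: y[n] = x[n] - a * x[n-1]."""
    if not samples:
        return []
    out = [samples[0]]                # first sample passes through unchanged
    for n in range(1, len(samples)):
        out.append(samples[n] - a * samples[n - 1])
    return out
```

A constant (purely low-frequency) signal is strongly attenuated after the first sample, which is exactly the high-pass behavior pre-emphasis is meant to provide.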
S413:按照预设的帧长和预设的帧移,对目标语音信号进行分帧处理,得到包含基础语音帧的帧集合。S413: Perform frame processing on the target voice signal according to a preset frame length and a preset frame shift to obtain a frame set including a basic voice frame.
具体地,语音信号具有短时平稳的性质,语音信号在经过预加重处理后,需要对其进行分帧和加窗处理,来保持信号的短时平稳性,通常情况下,每秒钟包含的帧数在33~100帧之间。为了保持帧与帧之间的连续性,使得相邻两帧都能平滑过渡,采用交叠分帧的方式,如图5所示,图5示出了交叠分帧的样例,图5中第k帧和第k+1帧之间的交叠部分即为帧移。Specifically, a voice signal is short-time stationary. After pre-emphasis, it needs to be framed and windowed to preserve this short-time stationarity; typically there are 33 to 100 frames per second. To maintain continuity between frames so that adjacent frames transition smoothly, overlapping framing is used, as shown in FIG. 5, which shows an example of overlapping framing; the overlapping part between the k-th frame and the (k+1)-th frame is the frame shift.
优选地,帧移与帧长的比值的取值范围为(0,0.5)。Preferably, the range of the ratio of the frame shift to the frame length is (0, 0.5).
例如,在一具体实施方式中,预加重后的语音信号为s'(n),帧长为N个采样点,帧移为M个采样点。当第l帧对应的采样点为第n个时,第l帧的语音信号x_l(n)与各参数之间的对应关系为:For example, in a specific embodiment, the pre-emphasized voice signal is s'(n), the frame length is N sampling points, and the frame shift is M sampling points. When the sampling point corresponding to the l-th frame is the n-th, the relationship between the l-th frame signal x_l(n) and the parameters is:
x_l(n) = x[(l-1)·M + n]
其中,n=0,1,...,N-1,N=256。where n = 0, 1, ..., N-1, and N = 256.
进一步地,目标语音信号经过分帧之后,使用相应的窗函数w(n)与分帧后的语音信号x_l(n)相乘,即得到加窗后的语音信号S_w,将该语音信号作为基础语音帧的帧集合。Further, after the target voice signal is framed, each framed signal x_l(n) is multiplied by the corresponding window function w(n) to obtain the windowed voice signal S_w, which serves as the frame set of basic voice frames.
其中,窗函数包括但不限于:矩形窗(Rectangular)、汉明窗(Hamming)和汉宁窗(Hanning)等。The window functions include, but are not limited to, rectangular windows, Hamming windows, and Hanning windows.
矩形窗表达式为:The rectangular window expression is:
w(n) = 1, 0 ≤ n ≤ N-1;  w(n) = 0, otherwise
其中,w(n)为窗函数,N为采样点的个数,n为第n个采样点。Among them, w (n) is a window function, N is the number of sampling points, and n is the nth sampling point.
汉明窗表达式为:The Hamming window expression is:
w(n) = 0.54 - 0.46·cos(2·pi·n/(N-1)), 0 ≤ n ≤ N-1;  w(n) = 0, otherwise
其中,pi为圆周率,优选地,本申请实施例中pi的取值为3.1416。Here, pi is the circle constant; preferably, the value of pi in the embodiment of the present application is 3.1416.
汉宁窗表达式为:The Hanning window expression is:
w(n) = 0.5·(1 - cos(2·pi·n/(N-1))), 0 ≤ n ≤ N-1;  w(n) = 0, otherwise
对经过预加重处理的语音信号进行分帧和加窗处理,使得语音信号保持帧与帧之间的连续性,并剔除掉一些异常的信号点,得到基础语音帧的帧集合,提高了语音信号的鲁棒性。Framing and windowing the pre-emphasized voice signal maintains continuity between frames and removes some abnormal signal points, yielding the frame set of basic voice frames and improving the robustness of the voice signal.
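The framing formula x_l(n) = x[(l-1)·M + n] and the Hamming window above can be sketched together as follows. N = 8 and M = 3 are toy values chosen so that the shift/length ratio lies in (0, 0.5); they stand in for the N = 256 of the example.

```python
import math

def split_frames(signal, frame_len, frame_shift):
    """Overlapping framing: frame l starts at sample (l-1) * frame_shift."""
    frames, start = [], 0
    while start + frame_len <= len(signal):
        frames.append(signal[start:start + frame_len])
        start += frame_shift
    return frames

def hamming(n_samples):
    """Hamming window: w(n) = 0.54 - 0.46 * cos(2*pi*n / (N-1)), n = 0..N-1."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]

def window_frames(frames):
    """Multiply each frame sample-by-sample with the window function."""
    w = hamming(len(frames[0]))
    return [[s * wn for s, wn in zip(frame, w)] for frame in frames]

frames = split_frames(list(range(20)), frame_len=8, frame_shift=3)
windowed = window_frames(frames)
```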
在图4对应的实施例中,通过对语音数据进行幅值归一化处理,得到基础语音信号,进而对基础语音信号进行预加重处理,生成具有平坦频谱的目标语音信号,按照预设的帧长和预设的帧移,对目标语音信号进行分帧处理,得到基础语音帧的帧集合,提升了帧集合中每个基础语音帧的鲁棒性,有利于在后续利用基础语音帧的帧集合来进行语音转换文本时,提升转换的准确性,从而有利于提高语音识别的准确率。In the embodiment corresponding to FIG. 4, amplitude normalization is performed on the voice data to obtain a basic voice signal; pre-emphasis is then applied to generate a target voice signal with a flat spectrum; and the target voice signal is framed according to the preset frame length and frame shift to obtain the frame set of basic voice frames. This improves the robustness of each basic voice frame in the frame set, which benefits the accuracy of subsequent speech-to-text conversion using the frame set and thus helps improve speech recognition accuracy.
在图2至图4对应的实施例的基础之上,下面通过一个具体的实施例来对步骤S5中所提及的将实时语音文本与外呼禁止用语进行文本匹配,得到第一匹配结果的具体实现方法进行详细说明。Based on the embodiments corresponding to FIG. 2 to FIG. 4, the following uses a specific embodiment to perform text matching on the real-time voice text and the outgoing call prohibition term mentioned in step S5 to obtain a first matching result. The specific implementation method will be described in detail.
本申请实施例提供的步骤S5的具体实现流程,详述如下:The detailed implementation process of step S5 provided in the embodiment of the present application is detailed as follows:
针对每个外呼禁止用语,采用文本相似度算法,计算该外呼禁止用语与实时语音文本之间的相似度,若相似度大于或等于预设的相似度阈值,则将实时语音文本包含该外呼禁止用语作为第一匹配结果。For each prohibited outbound-call term, a text similarity algorithm is used to calculate the similarity between that term and the real-time voice text; if the similarity is greater than or equal to the preset similarity threshold, the first matching result is that the real-time voice text contains that prohibited term.
具体地,经过步骤S4进行语音识别,得到实时语音文本之后,计算该实时语音文本与每个外呼禁止用语之间的相似度,并将该相似度与预设的相似度阈值进行比较,若该相似度大于或等于预设的相似度阈值,则确定实时语音文本包含该外呼禁止用语,预设的相似度阈值可以设置为0.8,也可以根据实际需要进行设置,此处不作具体限制。Specifically, after performing voice recognition in step S4 to obtain a real-time voice text, calculate the similarity between the real-time voice text and each of the outgoing call prohibited words, and compare the similarity with a preset similarity threshold. If the similarity is greater than or equal to a preset similarity threshold, it is determined that the real-time voice text includes the outbound prohibition term. The preset similarity threshold may be set to 0.8 or may be set according to actual needs, which is not specifically limited here.
其中,文本相似度算法是通过计算两个文本之间的交集和并集大小的比例来判断这两个文本的相似度的算法,计算出的比例越大,表示两个文本越相似。Among them, the text similarity algorithm is an algorithm that determines the similarity of two texts by calculating the ratio of the intersection and union sizes between two texts. The larger the calculated ratio, the more similar the two texts are.
文本相似度算法包括但不限于:余弦相似性、最近邻(k-NearestNeighbor,kNN)分类算法、曼哈顿距离(Manhattan Distance)、基于SimHash算法的汉明距离等。Text similarity algorithms include, but are not limited to: cosine similarity, k-NearestNeighbor (kNN) classification algorithm, Manhattan Distance, Hamming distance based on SimHash algorithm, and the like.
值得说明的是,在匹配过程中,若一外呼禁止用语与实时语音文本的相似度大于或等于预设的相似度阈值,则可确定匹配结果为实时语音文本包含该外呼禁止用语,并结束本次匹配,而无需继续与剩余的外呼禁止用语进行匹配。It is worth noting that, during matching, if the similarity between a prohibited term and the real-time voice text is greater than or equal to the preset similarity threshold, the matching result can be determined as the real-time voice text containing that prohibited term, and this round of matching ends without continuing to match the remaining prohibited terms.
例如,在一具体实施方式中,在步骤S3中获取到的外呼禁止用语包括15个短语,分别为V1,V2,V3,...,V14,V15,在获取到实时语音文本G后,将实时语音文本G与V1进行匹配,其匹配过程为:计算实时语音文本G与V1的相似度,若相似度大于或等于预设的相似度阈值,则确定实时语音文本包含禁用词汇,结束本次匹配,若相似度小于预设的相似度阈值,则继续将实时语音文本G与V1后面一个外呼禁止用语V2进行匹配,按照上述实时语音文本G与V1进行匹配的方法,来对实时语音文本G与剩余外呼禁止用语进行匹配,若匹配过程中出现相似度大于或等于预设的阈值时,则确定该实时语音文本包含外呼禁止用语,并结束本次匹配。For example, in a specific embodiment, the prohibited outbound-call terms obtained in step S3 include 15 phrases V1, V2, V3, ..., V14, V15. After real-time voice text G is obtained, it is matched against V1: the similarity between G and V1 is calculated; if it is greater than or equal to the preset similarity threshold, the real-time voice text is determined to contain a prohibited term and this round of matching ends; if it is less than the threshold, G is then matched against the next prohibited term V2. Following the same method, G is matched against the remaining prohibited terms; once a similarity greater than or equal to the preset threshold occurs, the real-time voice text is determined to contain a prohibited term and this round of matching ends.
在本实施例中,通过将实时语音文本与每个外呼禁止用语计算相似度,并通过比较相似度与预设的相似度阈值的大小来判断该实时语音文本是否包含外呼禁止用语,从而提高了匹配的准确度,确保第一匹配结果的正确率。In this embodiment, the similarity is calculated by comparing the real-time voice text with each of the outgoing call prohibitions, and comparing the similarity with a preset similarity threshold value to determine whether the real-time voice text contains an outgoing call prohibition, thereby The accuracy of the matching is improved, and the accuracy of the first matching result is ensured.
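The intersection-over-union similarity described above (Jaccard similarity) with the 0.8 threshold and early termination can be sketched as follows. Computing it over character sets is an assumed granularity, since the text does not fix the unit of comparison, and the banned term "保证收益率" is a hypothetical example.

```python
def jaccard(a, b):
    """Ratio of intersection to union of the two texts' character sets."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0

def contains_banned(text, banned_terms, threshold=0.8):
    """First matching result: any() stops at the first term that matches."""
    return any(jaccard(text, term) >= threshold for term in banned_terms)
```

With the hypothetical banned term "保证收益率", the text "保证收益" shares 4 of 5 distinct characters, giving a similarity of exactly 0.8 and a positive match.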
在图2至图4对应的实施例的基础之上,在步骤S5提及的将实时语音文本与外呼禁止用语进行文本匹配,得到第一匹配结果的步骤之后且在执行第一预警措施的步骤之前,还可以在该坐席员外呼结束后,对外呼过程中是否使用了所有外呼必需用语进行监控预警,如图6所示,该语音识别方法还包括:On the basis of the embodiments corresponding to FIG. 2 to FIG. 4, after the step of text-matching the real-time voice text against the prohibited outbound-call terms mentioned in step S5 to obtain the first matching result, and before the step of executing the first warning measure, monitoring and warning may also be performed, after the agent's outbound call ends, on whether all the required outbound-call terms were used during the call. As shown in FIG. 6, the voice recognition method further includes:
S7:在检测到坐席员的外呼操作终止时,将当前外呼文本与外呼必需用语进行文本匹配,得到第二匹配结果。S7: When it is detected that the agent's outbound call operation is terminated, the current outbound text is matched with the necessary terms of the outbound text to obtain a second matching result.
具体地,若监测到在预设的时间阈值范围内未产生语音数据,则确定本次外呼操作终止,进而将得到的当前外呼文本与外呼必需用语进行匹配,得到第二匹配结果,在本发明实施例中,预设的时间阈值为10秒钟,具体可以根据实际需求进行设置,此处不作限制。Specifically, if it is detected that no voice data has been produced within a preset time threshold, it is determined that this outbound-call operation has terminated, and the obtained current outbound-call text is matched against the required outbound-call terms to obtain the second matching result. In the embodiment of the present invention, the preset time threshold is 10 seconds, which can be set according to actual needs and is not limited here.
其中,将得到的当前外呼文本与外呼必需用语进行匹配的具体过程如下:The specific process of matching the current outbound text with the necessary terms of the outbound call is as follows:
通过获取当前外呼文本中存储的Y个实时语音文本,进而针对每个外呼必需用语,将该外呼必需用语与Y个实时语音文本进行相似度匹配,得到Y个相似度,若Y个相似度均小于预设的相似度阈值,则确认当前外呼文本中不包含该外呼必需用语。The Y real-time voice texts stored in the current outbound-call text are obtained; then, for each required term, similarity matching is performed between that term and the Y real-time voice texts to obtain Y similarities. If all Y similarities are less than the preset similarity threshold, it is confirmed that the current outbound-call text does not contain that required term.
值得说明的是,若存在至少一个外呼必需用语不被当前外呼文本所包含,则确认第二匹配结果为当前外呼文本不包含外呼必需用语。It is worth noting that if there is at least one required language for outbound calls that is not included in the current outbound text, it is confirmed that the second matching result is that the current outbound text does not contain necessary terms for outbound calls.
例如,在一具体实施方式中,外呼必需用语包括:“您好”、“请问有什么可以帮助的吗”、“请稍等”、“感谢您的支持”和“再见”,经过对当前外呼文本与外呼必需用语进行匹配,发现当前外呼文本中包含:“请问有什么可以帮助的吗”、“请稍等”、“感谢您的支持”和“再见”,但不包含“您好”,则确认第二匹配结果为当前外呼文本不包含外呼必需用语。For example, in a specific embodiment, the required outbound-call terms include "Hello", "How may I help you", "Please wait a moment", "Thank you for your support", and "Goodbye". After matching the current outbound-call text against these terms, it is found that the text contains "How may I help you", "Please wait a moment", "Thank you for your support", and "Goodbye", but not "Hello"; the second matching result is thus confirmed as the current outbound-call text not containing all required terms.
可选地，将得到的当前外呼文本与外呼必需用语进行匹配时，还可以通过在当前外呼文本中对每个外呼必需用语进行查询，若每个外呼必需用语均能查询到，则确认第二匹配结果为当前外呼文本包含外呼必需用语，反之，则确认第二匹配结果为当前外呼文本不包含外呼必需用语。Optionally, when matching the obtained current outbound call text against the required outbound call terms, each required term may instead be looked up in the current outbound call text. If every required term can be found, the second matching result is that the current outbound call text contains the required outbound call terms; otherwise, the second matching result is that the current outbound call text does not contain the required outbound call terms.
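As an illustrative sketch (not the patent's own implementation), the similarity-based matching described above can be expressed as follows. The similarity measure (`difflib`'s `SequenceMatcher.ratio`) and the 0.8 threshold are assumptions; any text similarity algorithm and preset threshold may be substituted.

```python
# Hypothetical sketch of matching required outbound-call terms against the
# Y real-time voice texts stored in the current outbound-call text.
# difflib's ratio and the 0.8 threshold are illustrative assumptions.
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.8  # preset similarity threshold (assumed value)

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

def missing_required_terms(call_texts, required_terms):
    """Return every required term whose similarity to all Y stored texts is
    below the threshold, i.e. the term was never used in this call."""
    missing = []
    for term in required_terms:
        if all(similarity(term, text) < SIMILARITY_THRESHOLD for text in call_texts):
            missing.append(term)
    return missing

texts = ["请问有什么可以帮助的吗", "请稍等", "感谢您的支持", "再见"]
required = ["您好", "请问有什么可以帮助的吗", "再见"]
print(missing_required_terms(texts, required))  # → ['您好']
```

A non-empty result corresponds to the second matching result that the current outbound call text does not contain the required outbound call terms; the optional lookup-based variant simply tests substring containment instead of similarity.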
S8:若第二匹配结果为当前外呼文本不包含外呼必需用语,则执行第二预警措施。S8: If the second matching result is that the current outbound text does not contain the necessary words for the outbound call, a second warning measure is performed.
具体地，若第二匹配结果为当前外呼文本中不包含外呼必需用语，则说明本次外呼中存在至少一个外呼必需用语没有被使用，此时，将执行第二预警措施。Specifically, if the second matching result is that the current outbound call text does not contain the required outbound call terms, at least one required term was not used during this outbound call; in this case, the second warning measure is executed.
其中，第二预警措施包括但不限于：向监控端发送本次外呼不规范的预警提示、提醒本次外呼的坐席员本次外呼中出现的不规范事项和生成本次外呼记录等。The second warning measures include, but are not limited to: sending a warning to the monitoring end that this outbound call was non-compliant, reminding the agent of this outbound call of the non-compliant items that occurred, and generating a record of this outbound call.
进一步地，可以根据外呼必需用语的重要程度，设置不同的第二预警措施。例如，若外呼必需用语包括词语G、词语H和词语I，其中，词语G和词语H的重要程度为一级，词语I的重要程度为二级，并且一级低于二级，则可以设置一级对应的第二预警措施为“提醒本次外呼的坐席员本次外呼中出现的不规范事项和生成本次外呼记录”，同时设置二级对应的第二预警措施为“向监控端发送本次外呼不规范的预警提示和生成本次外呼记录”。当当前外呼文本不包含词语I时，执行二级对应的第二预警措施，向监控端发送本次外呼不规范的预警提示和生成本次外呼记录。Further, different second warning measures may be set according to the importance of the required outbound call terms. For example, suppose the required terms include word G, word H, and word I, where words G and H have importance level one, word I has importance level two, and level one is lower than level two. The second warning measure for level one may then be set to "remind the agent of the non-compliant items in this outbound call and generate a record of this outbound call", while the second warning measure for level two is set to "send a warning to the monitoring end that this outbound call was non-compliant and generate a record of this outbound call". When the current outbound call text does not contain word I, the level-two second warning measure is executed: a warning that this outbound call was non-compliant is sent to the monitoring end and a record of this outbound call is generated.
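A minimal sketch of the level-dependent second warning measures described above; the level assignments and measure identifiers are hypothetical, chosen for demonstration only and not part of the claimed method.

```python
# Illustrative mapping from importance level to second warning measures.
# Level numbers and measure identifiers are assumed names.
WARNING_MEASURES = {
    1: {"remind_agent", "generate_record"},    # level one (lower importance)
    2: {"notify_monitor", "generate_record"},  # level two (higher importance)
}
TERM_LEVELS = {"G": 1, "H": 1, "I": 2}  # importance level of each required term

def second_warning(missing_terms):
    """Union of the warning measures triggered by the missing required terms."""
    measures = set()
    for term in missing_terms:
        measures |= WARNING_MEASURES[TERM_LEVELS[term]]
    return measures

# Word I (level two) missing -> notify the monitoring end and generate a record.
print(sorted(second_warning(["I"])))  # → ['generate_record', 'notify_monitor']
```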
在图6对应的实施例中，在检测到坐席员的外呼操作终止时，将当前外呼文本与外呼必需用语进行文本匹配，得到第二匹配结果，若第二匹配结果为当前外呼文本不包含外呼必需用语，执行第二预警措施，实现对外呼必需用语未被使用的情况进行自动预警，避免通过人工去听取录音并分析来进行监控，从而提升了监控的效率。应理解，上述实施例中各步骤的序号的大小并不意味着执行顺序的先后，各过程的执行顺序应以其功能和内在逻辑确定，而不应对本申请实施例的实施过程构成任何限定。In the embodiment corresponding to FIG. 6, when it is detected that the agent's outbound call operation has terminated, the current outbound call text is text-matched against the required outbound call terms to obtain a second matching result; if the second matching result is that the current outbound call text does not contain the required terms, the second warning measure is executed. This automatically warns when required outbound call terms have gone unused, avoiding monitoring by manually listening to and analyzing recordings, and thus improves monitoring efficiency. It should be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution; the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation of the embodiments of this application.
对应于上述方法实施例中的语音识别方法，图7示出了与上述方法实施例提供的语音识别方法一一对应的语音识别装置，为了便于说明，仅示出了与本申请实施例相关的部分。Corresponding to the voice recognition method in the above method embodiments, FIG. 7 shows a voice recognition apparatus in one-to-one correspondence with that method. For convenience of description, only the parts related to the embodiments of the present application are shown.
如图7所示,该语音识别装置包括:数据获取模块10、部门确定模块20、模板选取模块30、语音识别模块40、第一匹配模块50和第一预警模块60。各功能模块详细说明如下:As shown in FIG. 7, the voice recognition device includes a data acquisition module 10, a department determination module 20, a template selection module 30, a voice recognition module 40, a first matching module 50, and a first warning module 60. The detailed description of each function module is as follows:
数据获取模块10,用于若监测到坐席员的外呼操作,则获取该坐席员外呼过程中的语音数据和使用的外呼设备的设备标识;A data acquisition module 10 is configured to acquire voice data and an equipment identifier of an outbound device used by the agent when the outbound operation of the agent is monitored;
部门确定模块20,用于基于设备标识,确定坐席员所属的业务部门;The department determination module 20 is configured to determine a business department to which the agent belongs based on the equipment identification;
模板选取模块30,用于获取业务部门对应的业务文本模板,其中,业务文本模板包括外呼必需用语和外呼禁止用语;The template selection module 30 is configured to obtain a business text template corresponding to a business department, where the business text template includes a required language for outbound calls and a prohibited language for outbound calls;
语音识别模块40,用于对语音数据进行语音识别,得到实时语音文本,并将该实时语音文本添加到当前外呼文本;A voice recognition module 40, configured to perform voice recognition on voice data to obtain real-time voice text, and add the real-time voice text to the current outgoing text;
第一匹配模块50,用于将实时语音文本与外呼禁止用语进行文本匹配,得到第一匹配结果;A first matching module 50, configured to perform text matching between the real-time voice text and the outgoing call prohibited words to obtain a first matching result;
第一预警模块60,用于若第一匹配结果为实时语音文本包含外呼禁止用语,则执行第一预警措施。The first early warning module 60 is configured to execute a first early warning measure if the first matching result is that the real-time voice text includes an outbound call prohibition term.
进一步地，语音识别模块40包括：Further, the voice recognition module 40 includes:
语音解析单元41,用于对语音数据进行语音解析,得到包含基础语音帧的帧集合;A speech parsing unit 41, configured to perform speech parsing on speech data to obtain a frame set including basic speech frames;
静音检测单元42,用于对基础语音帧进行静音检测,得到基础语音帧中的K个连续静音帧,其中,K为自然数;The silence detection unit 42 is configured to perform silence detection on the basic voice frame to obtain K consecutive silence frames in the basic voice frame, where K is a natural number;
帧集划分单元43,用于根据K个连续静音帧,将帧集合中包含的基础语音帧划分成K+1个目标语音帧;A frame set dividing unit 43 configured to divide a basic voice frame included in the frame set into K + 1 target voice frames according to K consecutive mute frames;
文本转换单元44,用于将每个目标语音帧转换为实时语音文本。The text conversion unit 44 is configured to convert each target speech frame into real-time speech text.
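The division performed by the frame-set dividing unit 43 can be sketched as follows, assuming a simple per-frame silence predicate (real silence detection, e.g. energy-based or VAD-based, is more involved): K runs of consecutive silent frames split the basic voice frames into K+1 target speech segments.

```python
# Sketch: K runs of consecutive silent frames divide the basic voice frames
# into K+1 target voice frames (segments). The silence predicate is assumed.
def split_on_silence(frames, is_silent):
    """frames: sequence of per-frame values; is_silent: predicate per frame."""
    segments, current = [], []
    for frame in frames:
        if is_silent(frame):
            if current:            # close the current speech segment
                segments.append(current)
                current = []
        else:
            current.append(frame)
    if current:
        segments.append(current)
    return segments

# Two silent runs (the zeros) -> three target segments (K = 2, K + 1 = 3).
frames = [1, 1, 0, 0, 2, 0, 3, 3]
print(split_on_silence(frames, lambda f: f == 0))  # → [[1, 1], [2], [3, 3]]
```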
进一步地,语音解析单元41包括:Further, the speech parsing unit 41 includes:
归一化子单元411,用于对语音数据进行幅值归一化处理,得到基础语音信号;A normalization subunit 411, configured to perform amplitude normalization processing on the voice data to obtain a basic voice signal;
预加重子单元412,用于对基础语音信号进行预加重处理,生成具有平坦频谱的目标语音信号;A pre-emphasis subunit 412, configured to perform pre-emphasis processing on a basic voice signal to generate a target voice signal having a flat frequency spectrum;
分帧子单元413,用于按照预设的帧长和预设的帧移,对目标语音信号进行分帧处理,得到基础语音帧的帧集合。The frame sub-unit 413 is configured to perform frame processing on a target voice signal according to a preset frame length and a preset frame shift to obtain a frame set of a basic voice frame.
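The three sub-units above (amplitude normalization 411, pre-emphasis 412, framing 413) can be sketched with NumPy as follows. The pre-emphasis coefficient 0.97 and the 25 ms frame length with 10 ms frame shift are common defaults assumed here for illustration; the patent leaves these presets open.

```python
# Sketch of the speech-parsing pipeline: normalize amplitude, apply a
# first-order pre-emphasis filter (flattening the spectrum), then slice the
# signal into overlapping frames. Parameter values are assumed defaults.
import numpy as np

def parse_speech(signal, sample_rate=16000, frame_ms=25, shift_ms=10, alpha=0.97):
    x = signal / np.max(np.abs(signal))          # amplitude normalization (non-silent input assumed)
    x = np.append(x[0], x[1:] - alpha * x[:-1])  # pre-emphasis: y[n] = x[n] - a*x[n-1]
    frame_len = int(sample_rate * frame_ms / 1000)   # preset frame length
    shift = int(sample_rate * shift_ms / 1000)       # preset frame shift
    n_frames = 1 + max(0, len(x) - frame_len) // shift
    # Stack the frames into the frame set of basic speech frames.
    return np.stack([x[i * shift : i * shift + frame_len] for i in range(n_frames)])

frames = parse_speech(np.random.randn(16000))    # one second of 16 kHz audio
print(frames.shape)  # → (98, 400): 98 basic frames of 400 samples each
```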
进一步地,第一匹配模块50包括:Further, the first matching module 50 includes:
第一匹配单元51，用于针对每个外呼禁止用语，采用文本相似度算法，计算该外呼禁止用语与实时语音文本之间的相似度，若相似度大于或等于预设的相似度阈值，则将实时语音文本包含该外呼禁止用语作为第一匹配结果。The first matching unit 51 is configured to, for each prohibited outbound call term, calculate the similarity between that prohibited term and the real-time voice text using a text similarity algorithm; if the similarity is greater than or equal to a preset similarity threshold, the fact that the real-time voice text contains that prohibited term is taken as the first matching result.
进一步地,该语音识别装置还包括:Further, the voice recognition device further includes:
第二匹配模块70，用于在检测到坐席员的外呼操作终止时，将当前外呼文本与外呼必需用语进行文本匹配，得到第二匹配结果；A second matching module 70, configured to, when it is detected that the agent's outbound call operation has terminated, text-match the current outbound call text against the required outbound call terms to obtain a second matching result;
第二预警模块80,用于若第二匹配结果为当前外呼文本不包含外呼必需用语,执行第二预警措施。The second early warning module 80 is configured to execute a second early warning measure if the second matching result is that the current outgoing text does not contain the necessary words for the outgoing call.
本实施例提供的一种语音识别装置中各模块实现各自功能的过程,具体可参考前述方法实施例的描述,此处不再赘述。For a process in which each module in the voice recognition device provided by this embodiment implements their functions, reference may be made to the description of the foregoing method embodiments, and details are not described herein again.
本实施例提供一个或多个存储有计算机可读指令的非易失性可读存储介质，该非易失性可读存储介质上存储有计算机可读指令，该计算机可读指令被一个或多个处理器执行时，使得所述一个或多个处理器执行上述方法实施例中的语音识别方法，或者，实现上述装置实施例中各模块/单元的功能。为避免重复，这里不再赘述。This embodiment provides one or more non-volatile readable storage media storing computer-readable instructions. When the computer-readable instructions are executed by one or more processors, the one or more processors are caused to execute the voice recognition method in the above method embodiments, or to implement the functions of the modules/units in the above apparatus embodiments. To avoid repetition, details are not repeated here.
可以理解地，所述非易失性可读存储介质可以包括：能够携带所述计算机可读指令代码的任何实体或装置、记录介质、U盘、移动硬盘、磁碟、光盘、计算机存储器、只读存储器(Read-Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)、电载波信号和电信信号等。Understandably, the non-volatile readable storage medium may include: any entity or device capable of carrying the computer-readable instruction code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, and the like.
图8是本申请一实施例提供的计算机设备的示意图。如图8所示,该实施例的计算机设备90包括:处理器91、存储器92以及存储在存储器92中并可在处理器91上运行的计算机可读指令93,例如语音识别程序。处理器91执行计算机可读指令93时实现上述语音识别方法实施例中的步骤,例如图2所示的步骤S1至步骤S6。或者,处理器91执行计算机可读指令93时实现上述各装置实施例中各模块/单元的功能,例如图7所示模块10至模块60的功能。FIG. 8 is a schematic diagram of a computer device according to an embodiment of the present application. As shown in FIG. 8, the computer device 90 of this embodiment includes a processor 91, a memory 92, and computer-readable instructions 93 stored in the memory 92 and executable on the processor 91, such as a voice recognition program. When the processor 91 executes the computer-readable instructions 93, the steps in the foregoing embodiment of the speech recognition method are implemented, for example, steps S1 to S6 shown in FIG. 2. Alternatively, when the processor 91 executes the computer-readable instructions 93, the functions of the modules / units in the foregoing device embodiments are implemented, for example, the functions of the modules 10 to 60 shown in FIG. 7.
其中，计算机设备90可以是桌上型计算机、笔记本、掌上电脑及云端服务器等设备，图8仅为本实施例中计算机设备的示例，可以包括比图8所示更多或更少的部件，或者组合某些部件或者不同的部件。存储器92可以是计算机设备的内部存储单元，如硬盘或内存，也可以是计算机设备的外部存储单元，如插接式硬盘、智能存储卡(Smart Media Card,SMC)、安全数字(Secure Digital,SD)卡、闪存卡(Flash Card)等。计算机可读指令93包括程序代码，该程序代码可以为源代码形式、对象代码形式、可执行文件或某些中间形式等。The computer device 90 may be a desktop computer, a notebook computer, a palmtop computer, a cloud server, or the like. FIG. 8 is only an example of the computer device in this embodiment; the device may include more or fewer components than shown in FIG. 8, or combine certain components, or have different components. The memory 92 may be an internal storage unit of the computer device, such as a hard disk or memory, or an external storage unit of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card. The computer-readable instructions 93 include program code, which may be in source code form, object code form, an executable file, or some intermediate form.
所属领域的技术人员可以清楚地了解到，为了描述的方便和简洁，仅以上述各功能单元、模块的划分进行举例说明，实际应用中，可以根据需要而将上述功能分配由不同的功能单元、模块完成，即将所述装置的内部结构划分成不同的功能单元或模块，以完成以上描述的全部或者部分功能。Those skilled in the art can clearly understand that, for convenience and brevity of description, only the above division of functional units and modules is used as an example. In practical applications, the above functions may be assigned to different functional units and modules as needed; that is, the internal structure of the apparatus is divided into different functional units or modules to complete all or part of the functions described above.
以上所述实施例仅用以说明本申请的技术方案，而非对其限制；尽管参照前述实施例对本申请进行了详细的说明，本领域的普通技术人员应当理解：其依然可以对前述各实施例所记载的技术方案进行修改，或者对其中部分技术特征进行等同替换；而这些修改或者替换，并不使相应技术方案的本质脱离本申请各实施例技术方案的精神和范围，均应包含在本申请的保护范围之内。The above embodiments are only intended to describe the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be equivalently replaced; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application, and shall all fall within the protection scope of this application.

Claims (20)

  1. 一种语音识别方法,其特征在于,所述语音识别方法包括:A speech recognition method, characterized in that the speech recognition method includes:
    若监测到坐席员的外呼操作,则获取所述坐席员外呼过程中的语音数据和所述坐席员使用的外呼设备的设备标识;If an outbound operation of the agent is monitored, obtaining voice data during the outbound process of the agent and a device identifier of an outbound device used by the agent;
    基于所述设备标识,确定所述坐席员所属的业务部门;Determining a business department to which the agent belongs based on the device identification;
    获取所述业务部门对应的业务文本模板,其中,所述业务文本模板包括外呼必需用语和外呼禁止用语;Obtaining a business text template corresponding to the business department, wherein the business text template includes a term necessary for outbound calls and a term prohibited for outbound calls;
    对所述语音数据进行语音识别,得到实时语音文本,并将所述实时语音文本添加到当前外呼文本;Performing voice recognition on the voice data to obtain real-time voice text, and adding the real-time voice text to the current outgoing text;
    将所述实时语音文本与所述外呼禁止用语进行文本匹配,得到第一匹配结果;Text-matching the real-time voice text with the outgoing call prohibition term to obtain a first matching result;
    若所述第一匹配结果为所述实时语音文本包含所述外呼禁止用语,则执行第一预警措施。If the first matching result is that the real-time voice text includes the outbound call prohibition term, a first warning measure is performed.
  2. 如权利要求1所述的语音识别方法,其特征在于,所述对所述语音数据进行语音识别,得到实时语音文本包括:The voice recognition method according to claim 1, wherein the performing voice recognition on the voice data to obtain real-time voice text comprises:
    对所述语音数据进行语音解析,得到包含基础语音帧的帧集合;Performing speech analysis on the speech data to obtain a frame set including basic speech frames;
    对所述基础语音帧进行静音检测,得到所述基础语音帧中的K个连续静音帧,其中,K为自然数;Performing silence detection on the basic voice frame to obtain K consecutive silence frames in the basic voice frame, where K is a natural number;
    根据K个所述静音帧,将所述帧集合中包含的所述基础语音帧划分成K+1个目标语音帧;Dividing the basic voice frames included in the frame set into K + 1 target voice frames according to the K silence frames;
    将每个所述目标语音帧转换为所述实时语音文本。Converting each of the target speech frames into the real-time speech text.
  3. 如权利要求2所述的语音识别方法,其特征在于,所述对所述语音数据进行语音解析,得到包含基础语音帧的帧集合包括:The speech recognition method according to claim 2, wherein the performing a speech analysis on the speech data to obtain a frame set including a basic speech frame comprises:
    对所述语音数据进行幅值归一化处理,得到基础语音信号;Performing amplitude normalization processing on the voice data to obtain a basic voice signal;
    对所述基础语音信号进行预加重处理,生成具有平坦频谱的目标语音信号;Performing pre-emphasis processing on the basic voice signal to generate a target voice signal having a flat frequency spectrum;
    按照预设的帧长和预设的帧移,对所述目标语音信号进行分帧处理,得到包含基础语音帧的帧集合。Frame processing the target voice signal according to a preset frame length and a preset frame shift to obtain a frame set including a basic voice frame.
  4. 如权利要求1至3的任一项所述的语音识别方法,其特征在于,所述将所述实时语音文本与所述外呼禁止用语进行文本匹配,得到第一匹配结果包括:The speech recognition method according to any one of claims 1 to 3, wherein performing the text matching between the real-time speech text and the outbound call prohibited term to obtain a first matching result comprises:
    针对每个所述外呼禁止用语，采用文本相似度算法，计算该外呼禁止用语与所述实时语音文本之间的相似度，若所述相似度大于或等于预设的相似度阈值，则将所述实时语音文本包含该外呼禁止用语作为第一匹配结果。For each of the prohibited outbound call terms, a text similarity algorithm is used to calculate the similarity between that prohibited term and the real-time voice text; if the similarity is greater than or equal to a preset similarity threshold, the fact that the real-time voice text contains that prohibited term is taken as the first matching result.
  5. 如权利要求1至3任一项所述的语音识别方法，其特征在于，在所述将所述实时语音文本与所述外呼禁止用语进行文本匹配，得到第一匹配结果的步骤之后且在执行第一预警措施的步骤之前，所述语音识别方法还包括：The speech recognition method according to any one of claims 1 to 3, wherein after the step of text-matching the real-time voice text against the prohibited outbound call terms to obtain the first matching result, and before the step of executing the first warning measure, the voice recognition method further includes:
    在检测到所述坐席员的外呼操作终止时，将所述当前外呼文本与所述外呼必需用语进行文本匹配，得到第二匹配结果；When it is detected that the agent's outbound call operation has terminated, text matching is performed between the current outbound call text and the required outbound call terms to obtain a second matching result;
    若所述第二匹配结果为所述当前外呼文本不包含所述外呼必需用语,则执行第二预警措施。If the second matching result is that the current outgoing call text does not include the necessary words for the outgoing call, a second warning measure is performed.
  6. 一种语音识别装置,其特征在于,所述语音识别装置包括:A voice recognition device, characterized in that the voice recognition device includes:
    数据获取模块,用于若监测到坐席员的外呼操作,则获取所述坐席员外呼过程中的语音数据和使用的外呼设备的设备标识;A data acquisition module, configured to acquire voice data and an equipment identifier of an outbound device used by the agent if the outbound operation of the agent is monitored;
    部门确定模块,用于基于所述设备标识,确定所述坐席员所属的业务部门;A department determination module, configured to determine a business department to which the agent belongs based on the device identifier;
    模板选取模块,用于获取所述业务部门对应的业务文本模板,其中,所述业务文本模板包括外呼必需用语和外呼禁止用语;A template selection module, configured to obtain a business text template corresponding to the business department, wherein the business text template includes a term necessary for outbound calls and a term prohibited for outbound calls;
    语音识别模块,用于对所述语音数据进行语音识别,得到实时语音文本,并将所述实时语音文本添加到当前外呼文本;A voice recognition module, configured to perform voice recognition on the voice data to obtain real-time voice text, and add the real-time voice text to the current outgoing text;
    第一匹配模块,用于将所述实时语音文本与所述外呼禁止用语进行文本匹配,得到第一匹配结果;A first matching module, configured to perform text matching between the real-time voice text and the outgoing call prohibition term to obtain a first matching result;
    第一预警模块,用于若所述第一匹配结果为所述实时语音文本包含所述外呼禁止用语,则执行第一预警措施。A first warning module is configured to execute a first warning measure if the first matching result is that the real-time voice text includes the outbound call prohibition term.
  7. 如权利要求6所述的语音识别装置,其特征在于,所述语音识别模块包括:The voice recognition device according to claim 6, wherein the voice recognition module comprises:
    语音解析单元,用于对所述语音数据进行语音解析,得到包含基础语音帧的帧集合;A speech parsing unit, configured to perform speech parsing on the speech data to obtain a frame set including basic speech frames;
    静音检测单元,用于对所述基础语音帧进行静音检测,得到所述基础语音帧中的K个连续静音帧,其中,K为自然数;A silence detection unit, configured to perform silence detection on the basic voice frame to obtain K consecutive silence frames in the basic voice frame, where K is a natural number;
    帧集划分单元,用于根据K个所述静音帧,将所述帧集合中包含的所述基础语音帧划分成K+1个目标语音帧;A frame set dividing unit, configured to divide the basic voice frame included in the frame set into K + 1 target voice frames according to the K silence frames;
    文本转换单元,用于将每个所述目标语音帧转换为所述实时语音文本。A text conversion unit is configured to convert each of the target speech frames into the real-time speech text.
  8. 如权利要求7所述的语音识别装置,其特征在于,所述语音解析单元包括:The speech recognition device according to claim 7, wherein the speech analysis unit comprises:
    归一化子单元,用于对所述语音数据进行幅值归一化处理,得到基础语音信号;A normalization subunit, configured to perform amplitude normalization processing on the voice data to obtain a basic voice signal;
    预加重子单元,用于对所述基础语音信号进行预加重处理,生成具有平坦频谱的目标语音信号;A pre-emphasis subunit, configured to perform pre-emphasis processing on the basic voice signal to generate a target voice signal having a flat frequency spectrum;
    分帧子单元,用于按照预设的帧长和预设的帧移,对所述目标语音信号进行分帧处理,得到基础语音帧的帧集合。The frame sub-unit is configured to perform frame processing on the target voice signal according to a preset frame length and a preset frame shift to obtain a frame set of a basic voice frame.
  9. 如权利要求6至8任一项所述的语音识别装置,其特征在于,所述第一匹配模块包括:The speech recognition device according to any one of claims 6 to 8, wherein the first matching module comprises:
    第一匹配单元，用于针对每个所述外呼禁止用语，采用文本相似度算法，计算该外呼禁止用语与所述实时语音文本之间的相似度，若所述相似度大于或等于预设的相似度阈值，则将所述实时语音文本包含该外呼禁止用语作为第一匹配结果。A first matching unit, configured to, for each prohibited outbound call term, calculate the similarity between that prohibited term and the real-time voice text using a text similarity algorithm, and if the similarity is greater than or equal to a preset similarity threshold, take the fact that the real-time voice text contains that prohibited term as the first matching result.
  10. 如权利要求6至8任一项所述的语音识别装置,其特征在于,所述语音识别装置还包括:The voice recognition device according to any one of claims 6 to 8, wherein the voice recognition device further comprises:
    第二匹配模块，用于在检测到所述坐席员的外呼操作终止时，将所述当前外呼文本与所述外呼必需用语进行文本匹配，得到第二匹配结果；A second matching module, configured to perform text matching between the current outbound call text and the required outbound call terms when it is detected that the agent's outbound call operation has terminated, to obtain a second matching result;
    第二预警模块,用于若所述第二匹配结果为所述当前外呼文本不包含所述外呼必需用语,则执行第二预警措施。A second warning module is configured to execute a second warning measure if the second matching result is that the current outgoing call text does not include the necessary words for the outgoing call.
  11. 一种计算机设备，包括存储器、处理器以及存储在所述存储器中并可在所述处理器上运行的计算机可读指令，其特征在于，所述处理器执行所述计算机可读指令时实现如下步骤：A computer device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer-readable instructions:
    若监测到坐席员的外呼操作,则获取所述坐席员外呼过程中的语音数据和所述坐席员使用的外呼设备的设备标识;If an outbound operation of the agent is monitored, obtaining voice data during the outbound process of the agent and a device identifier of an outbound device used by the agent;
    基于所述设备标识,确定所述坐席员所属的业务部门;Determining a business department to which the agent belongs based on the device identification;
    获取所述业务部门对应的业务文本模板,其中,所述业务文本模板包括外呼必需用语和外呼禁止用语;Obtaining a business text template corresponding to the business department, wherein the business text template includes a term necessary for outbound calls and a term prohibited for outbound calls;
    对所述语音数据进行语音识别,得到实时语音文本,并将所述实时语音文本添加到当前外呼文本;Performing voice recognition on the voice data to obtain real-time voice text, and adding the real-time voice text to the current outgoing text;
    将所述实时语音文本与所述外呼禁止用语进行文本匹配,得到第一匹配结果;Text-matching the real-time voice text with the outgoing call prohibition term to obtain a first matching result;
    若所述第一匹配结果为所述实时语音文本包含所述外呼禁止用语,则执行第一预警措施。If the first matching result is that the real-time voice text includes the outbound call prohibition term, a first warning measure is performed.
  12. 如权利要求11所述的计算机设备，其特征在于，所述对所述语音数据进行语音识别，得到实时语音文本包括：The computer device according to claim 11, wherein the performing voice recognition on the voice data to obtain real-time voice text comprises:
    对所述语音数据进行语音解析,得到包含基础语音帧的帧集合;Performing speech analysis on the speech data to obtain a frame set including basic speech frames;
    对所述基础语音帧进行静音检测,得到所述基础语音帧中的K个连续静音帧,其中,K为自然数;Performing silence detection on the basic voice frame to obtain K consecutive silence frames in the basic voice frame, where K is a natural number;
    根据K个所述静音帧,将所述帧集合中包含的所述基础语音帧划分成K+1个目标语音帧;Dividing the basic voice frames included in the frame set into K + 1 target voice frames according to the K silence frames;
    将每个所述目标语音帧转换为所述实时语音文本。Converting each of the target speech frames into the real-time speech text.
  13. 如权利要求12所述的计算机设备，其特征在于，所述对所述语音数据进行语音解析，得到包含基础语音帧的帧集合包括：The computer device according to claim 12, wherein the performing voice analysis on the voice data to obtain a frame set including basic voice frames comprises:
    对所述语音数据进行幅值归一化处理,得到基础语音信号;Performing amplitude normalization processing on the voice data to obtain a basic voice signal;
    对所述基础语音信号进行预加重处理,生成具有平坦频谱的目标语音信号;Performing pre-emphasis processing on the basic voice signal to generate a target voice signal having a flat frequency spectrum;
    按照预设的帧长和预设的帧移,对所述目标语音信号进行分帧处理,得到包含基础语音帧的帧集合。Frame processing the target voice signal according to a preset frame length and a preset frame shift to obtain a frame set including a basic voice frame.
  14. 如权利要求11至13任一项所述的计算机设备，其特征在于，所述将所述实时语音文本与所述外呼禁止用语进行文本匹配，得到第一匹配结果包括：The computer device according to any one of claims 11 to 13, wherein the text matching of the real-time voice text against the prohibited outbound call terms to obtain the first matching result comprises:
    针对每个所述外呼禁止用语，采用文本相似度算法，计算该外呼禁止用语与所述实时语音文本之间的相似度，若所述相似度大于或等于预设的相似度阈值，则将所述实时语音文本包含该外呼禁止用语作为第一匹配结果。For each of the prohibited outbound call terms, a text similarity algorithm is used to calculate the similarity between that prohibited term and the real-time voice text; if the similarity is greater than or equal to a preset similarity threshold, the fact that the real-time voice text contains that prohibited term is taken as the first matching result.
  15. 如权利要求11至13任一项所述的计算机设备，其特征在于，在所述将所述实时语音文本与所述外呼禁止用语进行文本匹配，得到第一匹配结果的步骤之后且在执行第一预警措施的步骤之前，所述处理器执行所述计算机可读指令时还实现如下步骤：The computer device according to any one of claims 11 to 13, wherein after the step of text-matching the real-time voice text against the prohibited outbound call terms to obtain the first matching result, and before the step of executing the first warning measure, the processor further implements the following steps when executing the computer-readable instructions:
    在检测到所述坐席员的外呼操作终止时,将所述当前外呼文本与所述外呼必需用语进行文本匹配,得到第二匹配结果;When detecting the termination of the outbound call operation of the agent, text matching the current outbound call text with the necessary term for the outbound call to obtain a second matching result;
    若所述第二匹配结果为所述当前外呼文本不包含所述外呼必需用语,则执行第二预警措施。If the second matching result is that the current outgoing call text does not include the necessary words for the outgoing call, a second warning measure is performed.
  16. 一个或多个存储有计算机可读指令的非易失性可读存储介质，其特征在于，所述计算机可读指令被一个或多个处理器执行时，使得所述一个或多个处理器执行如下步骤：One or more non-volatile readable storage media storing computer-readable instructions, wherein when the computer-readable instructions are executed by one or more processors, the one or more processors are caused to perform the following steps:
    若监测到坐席员的外呼操作,则获取所述坐席员外呼过程中的语音数据和所述坐席员使用的外呼设备的设备标识;If an outbound operation of the agent is monitored, obtaining voice data during the outbound process of the agent and a device identifier of an outbound device used by the agent;
    基于所述设备标识,确定所述坐席员所属的业务部门;Determining a business department to which the agent belongs based on the device identification;
    获取所述业务部门对应的业务文本模板,其中,所述业务文本模板包括外呼必需用语和外呼禁止用语;Obtaining a business text template corresponding to the business department, wherein the business text template includes a term necessary for outbound calls and a term prohibited for outbound calls;
    对所述语音数据进行语音识别,得到实时语音文本,并将所述实时语音文本添加到当前外呼文本;Performing voice recognition on the voice data to obtain real-time voice text, and adding the real-time voice text to the current outgoing text;
    将所述实时语音文本与所述外呼禁止用语进行文本匹配,得到第一匹配结果;Text-matching the real-time voice text with the outgoing call prohibition term to obtain a first matching result;
    若所述第一匹配结果为所述实时语音文本包含所述外呼禁止用语,则执行第一预警措施。If the first matching result is that the real-time voice text includes the outbound call prohibition term, a first warning measure is performed.
  17. 如权利要求16所述的非易失性可读存储介质,其特征在于,所述对所述语音数据进行语音识别,得到实时语音文本包括:The non-volatile readable storage medium according to claim 16, wherein the performing voice recognition on the voice data to obtain real-time voice text comprises:
    对所述语音数据进行语音解析,得到包含基础语音帧的帧集合;Performing speech analysis on the speech data to obtain a frame set including basic speech frames;
    对所述基础语音帧进行静音检测,得到所述基础语音帧中的K个连续静音帧,其中,K为自然数;Performing silence detection on the basic voice frame to obtain K consecutive silence frames in the basic voice frame, where K is a natural number;
    根据K个所述静音帧,将所述帧集合中包含的所述基础语音帧划分成K+1个目标语音帧;Dividing the basic voice frames included in the frame set into K + 1 target voice frames according to the K silence frames;
    将每个所述目标语音帧转换为所述实时语音文本。Converting each of the target speech frames into the real-time speech text.
  18. 如权利要求17所述的非易失性可读存储介质,其特征在于,所述对所述语音数据进行语音解析,得到包含基础语音帧的帧集合包括:The non-volatile readable storage medium according to claim 17, wherein the performing voice analysis on the voice data to obtain a frame set including a basic voice frame comprises:
    对所述语音数据进行幅值归一化处理,得到基础语音信号;Performing amplitude normalization processing on the voice data to obtain a basic voice signal;
    对所述基础语音信号进行预加重处理,生成具有平坦频谱的目标语音信号;Performing pre-emphasis processing on the basic voice signal to generate a target voice signal having a flat frequency spectrum;
    按照预设的帧长和预设的帧移,对所述目标语音信号进行分帧处理,得到包含基础语音帧的帧集合。Frame processing the target voice signal according to a preset frame length and a preset frame shift to obtain a frame set including a basic voice frame.
  19. 如权利要求16至18任一项所述的非易失性可读存储介质，其特征在于，所述将所述实时语音文本与所述外呼禁止用语进行文本匹配，得到第一匹配结果包括：The non-volatile readable storage medium according to any one of claims 16 to 18, wherein the text matching of the real-time voice text against the prohibited outbound call terms to obtain the first matching result includes:
    针对每个所述外呼禁止用语，采用文本相似度算法，计算该外呼禁止用语与所述实时语音文本之间的相似度，若所述相似度大于或等于预设的相似度阈值，则将所述实时语音文本包含该外呼禁止用语作为第一匹配结果。For each of the prohibited outbound call terms, a text similarity algorithm is used to calculate the similarity between that prohibited term and the real-time voice text; if the similarity is greater than or equal to a preset similarity threshold, the fact that the real-time voice text contains that prohibited term is taken as the first matching result.
  20. 如权利要求16至18任一项所述的非易失性可读存储介质，其特征在于，在所述将所述实时语音文本与所述外呼禁止用语进行文本匹配，得到第一匹配结果的步骤之后且在执行第一预警措施的步骤之前，所述计算机可读指令被一个或多个处理器执行时，使得所述一个或多个处理器还执行如下步骤：The non-volatile readable storage medium according to any one of claims 16 to 18, wherein after the step of text-matching the real-time voice text against the prohibited outbound call terms to obtain the first matching result, and before the step of executing the first warning measure, the computer-readable instructions, when executed by the one or more processors, further cause the one or more processors to perform the following steps:
    在检测到所述坐席员的外呼操作终止时,将所述当前外呼文本与所述外呼必需用语进行文本匹配,得到第二匹配结果;When detecting the termination of the outbound call operation of the agent, text matching the current outbound call text with the necessary term for the outbound call to obtain a second matching result;
    若所述第二匹配结果为所述当前外呼文本不包含所述外呼必需用语,则执行第二预警措施。If the second matching result is that the current outgoing call text does not include the necessary words for the outgoing call, a second warning measure is performed.
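The post-call check of claim 20 reduces to verifying that every required phrase appears in the accumulated call transcript. A minimal sketch, with plain substring containment standing in for the unspecified text-matching step:

```python
def missing_required_terms(call_text, required_terms):
    """Sketch of claim 20: after the outbound call ends, report required
    phrases absent from the accumulated transcript; a non-empty result
    would trigger the second warning measure."""
    return [term for term in required_terms if term not in call_text]
```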
PCT/CN2018/094371 2018-05-29 2018-07-03 Voice recognition method, apparatus, computer device, and storage medium WO2019227580A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810529536.0 2018-05-29
CN201810529536.0A CN108833722B (en) 2018-05-29 2018-05-29 Speech recognition method, speech recognition device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2019227580A1 2019-12-05

Family

ID=64146099

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/094371 WO2019227580A1 (en) 2018-05-29 2018-07-03 Voice recognition method, apparatus, computer device, and storage medium

Country Status (2)

Country Link
CN (1) CN108833722B (en)
WO (1) WO2019227580A1 (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110047473B (en) * 2019-04-19 2022-02-22 交通银行股份有限公司太平洋信用卡中心 Man-machine cooperative interaction method and system
CN110265008A (en) * 2019-05-23 2019-09-20 中国平安人寿保险股份有限公司 Intelligence pays a return visit method, apparatus, computer equipment and storage medium
CN110472097A (en) * 2019-07-03 2019-11-19 平安科技(深圳)有限公司 Melody automatic classification method, device, computer equipment and storage medium
CN110633912A (en) * 2019-09-20 2019-12-31 苏州思必驰信息科技有限公司 Method and system for monitoring service quality of service personnel
CN110782318A (en) * 2019-10-21 2020-02-11 五竹科技(天津)有限公司 Marketing method and device based on audio interaction and storage medium
CN112735421A (en) * 2019-10-28 2021-04-30 北京京东尚科信息技术有限公司 Real-time quality inspection method and device for voice call
CN110807090A (en) * 2019-10-30 2020-02-18 福建工程学院 Unmanned invigilating method for online examination
CN111064849B (en) * 2019-12-25 2021-02-26 北京合力亿捷科技股份有限公司 Call center system based line resource utilization and management and control analysis method
CN111698374B (en) * 2020-06-28 2022-02-11 中国银行股份有限公司 Customer service voice processing method and device
CN112069796B (en) * 2020-09-03 2023-08-04 阳光保险集团股份有限公司 Voice quality inspection method and device, electronic equipment and storage medium
CN112530424A (en) * 2020-11-23 2021-03-19 北京小米移动软件有限公司 Voice processing method and device, electronic equipment and storage medium
CN114006986A (en) * 2021-10-29 2022-02-01 平安普惠企业管理有限公司 Outbound call compliance early warning method, device, equipment and storage medium
CN114220432A (en) * 2021-11-15 2022-03-22 交通运输部南海航海保障中心广州通信中心 Maritime single-side-band-based voice automatic monitoring method and system and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102625005A (en) * 2012-03-05 2012-08-01 广东天波信息技术股份有限公司 Call center system with function of real-timely monitoring service quality and implement method of call center system
CN105261362A (en) * 2015-09-07 2016-01-20 科大讯飞股份有限公司 Conversation voice monitoring method and system
CN105975514A (en) * 2016-04-28 2016-09-28 朱宇光 Automatic quality testing method and system
US20170161378A1 (en) * 2015-12-02 2017-06-08 International Business Machines Corporation Expansion of a question and answer database
CN107093431A (en) * 2016-02-18 2017-08-25 中国移动通信集团辽宁有限公司 A kind of method and device that quality inspection is carried out to service quality

Family Cites Families (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001100781A (en) * 1999-09-30 2001-04-13 Sony Corp Method and device for voice processing and recording medium
CN100566360C (en) * 2006-01-19 2009-12-02 北京讯鸟软件有限公司 Realize the call center services method of sitting service level evaluation
CN101662550B (en) * 2009-09-11 2012-10-03 中兴通讯股份有限公司 Method and system for service quality detection for call center
EP2622832B1 (en) * 2010-09-30 2019-03-13 British Telecommunications public limited company Speech comparison
CN102456344B (en) * 2010-10-22 2014-12-10 中国电信股份有限公司 System and method for analyzing customer behavior characteristic based on speech recognition technique
JP6438674B2 (en) * 2014-04-28 2018-12-19 エヌ・ティ・ティ・コミュニケーションズ株式会社 Response system, response method, and computer program
US9300801B1 (en) * 2015-01-30 2016-03-29 Mattersight Corporation Personality analysis of mono-recording system and methods
CN206332732U (en) * 2016-08-30 2017-07-14 国家电网公司客户服务中心南方分中心 A kind of real-time interfering system of operator's mood
CN107871497A (en) * 2016-09-23 2018-04-03 北京眼神科技有限公司 Audio recognition method and device
CN108010513B (en) * 2016-10-28 2021-05-14 北京回龙观医院 Voice processing method and device
CN106790004B (en) * 2016-12-12 2021-02-02 北京易掌云峰科技有限公司 Customer service auxiliary real-time prompt system based on artificial intelligence
CN106851032B (en) * 2016-12-31 2019-10-29 国家电网公司客户服务中心 A method of improving the abnormal fix-rate of seat application system
CN106981291A (en) * 2017-03-30 2017-07-25 上海航动科技有限公司 A kind of intelligent vouching quality inspection system based on speech recognition
CN107317942A (en) * 2017-07-18 2017-11-03 国家电网公司客户服务中心南方分中心 A kind of call center's customer service system is recognized and monitoring system with online voice mood
CN107945790B (en) * 2018-01-03 2021-01-26 京东方科技集团股份有限公司 Emotion recognition method and emotion recognition system


Also Published As

Publication number Publication date
CN108833722A (en) 2018-11-16
CN108833722B (en) 2021-05-11

Similar Documents

Publication Publication Date Title
WO2019227580A1 (en) Voice recognition method, apparatus, computer device, and storage medium
CN108922538B (en) Conference information recording method, conference information recording device, computer equipment and storage medium
US10249304B2 (en) Method and system for using conversational biometrics and speaker identification/verification to filter voice streams
CN109065075A (en) A kind of method of speech processing, device, system and computer readable storage medium
US8731936B2 (en) Energy-efficient unobtrusive identification of a speaker
WO2019227583A1 (en) Voiceprint recognition method and device, terminal device and storage medium
CN108766441B (en) Voice control method and device based on offline voiceprint recognition and voice recognition
WO2019227547A1 (en) Voice segmenting method and apparatus, and computer device and storage medium
WO2021051506A1 (en) Voice interaction method and apparatus, computer device and storage medium
US9171547B2 (en) Multi-pass speech analytics
US8005676B2 (en) Speech analysis using statistical learning
US8417524B2 (en) Analysis of the temporal evolution of emotions in an audio interaction in a service delivery environment
CN107818798A (en) Customer service quality evaluating method, device, equipment and storage medium
CN111312219B (en) Telephone recording labeling method, system, storage medium and electronic equipment
CN105489221A (en) Voice recognition method and device
WO2019210556A1 (en) Call reservation method, agent leaving processing method and apparatus, device, and medium
JP2004502985A (en) Recording device for recording voice information for subsequent offline voice recognition
US20060100866A1 (en) Influencing automatic speech recognition signal-to-noise levels
CN111128241A (en) Intelligent quality inspection method and system for voice call
CN110798578A (en) Incoming call transaction management method and device and related equipment
CN111683317A (en) Prompting method and device applied to earphone, terminal and storage medium
CN111508527B (en) Telephone answering state detection method, device and server
CN109994129A (en) Speech processing system, method and apparatus
CN111326159B (en) Voice recognition method, device and system
CN111179936B (en) Call recording monitoring method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18920982

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 15.03.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 18920982

Country of ref document: EP

Kind code of ref document: A1