WO2020140607A1 - Speech signal processing method, device, and computer-readable storage medium - Google Patents

Speech signal processing method, device, and computer-readable storage medium

Info

Publication number
WO2020140607A1
WO2020140607A1 (PCT/CN2019/116962)
Authority
WO
WIPO (PCT)
Prior art keywords
voice
speech
sample
information
target
Prior art date
Application number
PCT/CN2019/116962
Other languages
English (en)
French (fr)
Inventor
王健宗
程宁
肖京
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2020140607A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination


Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Quality & Reliability (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A speech signal processing method, device, and computer-readable storage medium. The method includes: acquiring a speech signal produced by an interviewee during an interview (S101); performing windowing and framing on the speech signal according to a first preset duration, splitting the speech signal into multiple speech frames of a second preset duration, the second preset duration being less than or equal to the first preset duration (S102); denoising each speech frame of the second preset duration and converting all denoised speech frames of the second preset duration into a speech signal sequence (S103); feeding the speech signal sequence into a speech recognition model for classification and determining the target speech category corresponding to the sequence (S104); and determining, from a preset correspondence between speech categories and scores, the target score corresponding to the target speech category, and then the target speech level corresponding to that score (S105). In this way, the efficiency and accuracy of speech recognition, and thereby the efficiency of interviews, can be improved.

Description

Speech signal processing method, device, and computer-readable storage medium
This application claims priority to Chinese patent application No. 201910014077.7, filed with the Chinese Patent Office on January 4, 2019 and entitled "Speech signal processing method, device, and computer-readable storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of speech recognition technology, and in particular to a speech signal processing method, device, and computer-readable storage medium.
Background
Recruitment is an indispensable activity for every enterprise, and recruitment efficiency is critical both to an enterprise's next development strategy and to its costs. Customer service is one of the basic positions in an enterprise. When recruiting customer-service staff, interviewers mainly judge whether a candidate's voice meets the requirements of the role through face-to-face conversation. Because the number of applicants is large and many résumés must be processed, the workload is considerable. How to improve the efficiency of customer-service recruitment more effectively has therefore become a focus of research.
Summary
Embodiments of this application provide a signal processing method, device, and computer-readable storage medium that can improve the efficiency of speech recognition and thereby the efficiency of interviews.
In a first aspect, an embodiment of this application provides a signal processing method, including:
acquiring a speech signal produced by an interviewee during an interview;
performing windowing and framing on the speech signal according to a first preset duration, splitting the speech signal into multiple speech frames of a second preset duration, the second preset duration being less than or equal to the first preset duration;
denoising each speech frame of the second preset duration and converting all denoised speech frames of the second preset duration into a speech signal sequence;
feeding the speech signal sequence into a speech recognition model for classification and determining the target speech category corresponding to the speech signal sequence;
determining, from a preset correspondence between speech categories and scores, the target score corresponding to the target speech category, and determining, from a preset correspondence between scores and speech levels, the target speech level corresponding to the target score, so that whether the interviewee passes the interview can be decided from the target speech level.
In a second aspect, an embodiment of this application provides a signal processing device comprising units for performing the signal processing method of the first aspect.
In a third aspect, an embodiment of this application provides another signal processing device comprising a processor, an input device, an output device, and a memory that are connected to one another, where the memory stores a computer program supporting the device in performing the above method, the computer program comprises program instructions, and the processor is configured to invoke the program instructions to perform the method of the first aspect.
In a fourth aspect, an embodiment of this application provides a computer-readable storage medium storing a computer program comprising program instructions that, when executed by a processor, cause the processor to perform the method of the first aspect.
By converting an acquired speech signal into a speech signal sequence and determining the target speech category of that sequence, the embodiments of this application determine the target score and target speech level corresponding to the target speech category, improving the efficiency and accuracy of speech recognition.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of a speech signal processing method provided by an embodiment of this application;
FIG. 2 is a schematic flowchart of another speech signal processing method provided by an embodiment of this application;
FIG. 3 is a schematic block diagram of a speech signal processing device provided by an embodiment of this application;
FIG. 4 is a schematic block diagram of another speech signal processing device provided by an embodiment of this application.
Detailed Description
The technical solutions in the embodiments of this application are described below clearly and completely with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of this application without creative effort fall within the scope of protection of this application.
The speech signal processing method provided by the embodiments of this application may be performed by a speech signal processing device, which in some embodiments may be installed on a smart terminal such as a mobile phone, computer, tablet, or smart watch. The device may acquire the speech signal produced by an interviewee during an interview, perform windowing and framing on the signal according to a first preset duration, and split it into multiple speech frames of a second preset duration, where the second preset duration is less than or equal to the first preset duration. The device may denoise each speech frame of the second preset duration, convert all denoised frames of the second preset duration into a speech signal sequence, and feed the sequence into a speech recognition model for classification to determine the target speech category corresponding to the sequence. After determining the target speech category corresponding to the sequence, the device may determine the target score corresponding to that category according to a preset correspondence between speech categories and scores and decide from the target score whether the interviewee passes the interview. The speech signal processing method of the embodiments of this application is described schematically below with reference to the drawings.
Referring to FIG. 1, FIG. 1 is a schematic flowchart of a speech signal processing method provided by an embodiment of this application. As shown in FIG. 1, the method may be performed by a speech signal processing device; the device has been explained above and is not described again here. Specifically, the method of this embodiment includes the following steps.
S101: Acquire the speech signal produced by the interviewee during the interview.
In this embodiment, the speech signal processing device may acquire the speech signal produced by the interviewee during the interview.
In one embodiment, when acquiring each interviewee's speech signal, the device may collect speech signals through a sensor in a quiet environment in advance and assign each speech signal a user identifier used to distinguish the speech signals of different interviewees. The device may store the correspondence between the collected speech signals and the user identifiers in a database; the sensor may be a wearable device or another smart terminal. In some embodiments, a wearable device may be used to capture the speech signals of multiple interviewees throughout the interview and transmit them to a cloud server for processing in real time.
As a concrete example, suppose the preset number is 50; the device then collects the speech of 50 people in an environment free of external speech interference. The speech signals are first captured by the sensor, recording 30 minutes of speech for each of the 50 interviewees. For all recordings, the piezoelectric sensor samples at 44.1 kHz and the signal is then resampled to 16 kHz. Note that different sampling rates select different data.
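As a minimal sketch of the resampling step just described (the variable names and the use of SciPy's polyphase resampler are illustrative assumptions, not from the patent), a 44.1 kHz recording can be brought down to 16 kHz as follows:

```python
# Hedged sketch: downsample a 44.1 kHz recording to 16 kHz.
# Assumes `recording` is a 1-D NumPy float array captured at 44.1 kHz.
import numpy as np
from scipy.signal import resample_poly

SRC_RATE = 44_100
DST_RATE = 16_000

# 16000 / 44100 reduces to 160 / 441, so resample by that rational factor.
recording = np.random.randn(SRC_RATE * 3)          # stand-in for 3 s of audio
downsampled = resample_poly(recording, up=160, down=441)

assert abs(len(downsampled) - 3 * DST_RATE) <= 1   # about 3 s at 16 kHz
```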
S102: Perform windowing and framing on the speech signal according to the first preset duration, splitting the signal into multiple speech frames of a second preset duration.
In this embodiment, the speech signal processing device may perform windowing and framing on the speech signal according to the first preset duration and split it into multiple speech frames of the second preset duration; in some embodiments, the second preset duration is less than or equal to the first preset duration. In some embodiments, a speech signal is non-stationary on a macroscopic scale but stationary on a microscopic scale, exhibiting short-time stationarity (for example, the signal can be regarded as approximately unchanged within 10-30 ms). The speech signal can therefore be divided into short segments for processing, each short segment being called a frame, which realizes the framing of the speech signal. In some embodiments, windowing means multiplying by a window function; it is applied before the Fourier expansion to make the signal more continuous globally and to avoid the Gibbs effect. After windowing, a speech signal that originally had no periodicity exhibits some characteristics of a periodic function.
For example, if the first preset duration is 30 ms and the second preset duration is 10 ms, the device may window and frame the speech signal and split it into three speech frames of 10 ms each.
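A minimal sketch of this windowing-and-framing step, assuming 16 kHz audio and the 30 ms / 10 ms durations from the example (the non-overlapping frame layout and the choice of a Hamming window are assumptions; the patent does not name a specific window function):

```python
import numpy as np

def window_and_frame(signal, sample_rate=16_000, frame_ms=10):
    """Split `signal` into consecutive frames of `frame_ms` milliseconds,
    each multiplied by a Hamming window to suppress the Gibbs effect."""
    frame_len = int(sample_rate * frame_ms / 1000)      # 160 samples at 16 kHz
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    return frames * np.hamming(frame_len)               # one window per frame

segment = np.random.randn(480)          # a 30 ms segment at 16 kHz
frames = window_and_frame(segment)      # three frames of 10 ms each
print(frames.shape)                     # (3, 160)
```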
S103: Denoise each speech frame of the second preset duration and convert all denoised speech frames of the second preset duration into a speech signal sequence.
In this embodiment, the speech signal processing device may denoise each speech frame of the second preset duration and convert all denoised frames of the second preset duration into a speech signal sequence. In some embodiments, the denoising may be performed on each frame of the second preset duration according to a preset denoising algorithm. In some embodiments, the denoising algorithm may be any one of adaptive filtering, spectral subtraction, Wiener filtering, and so on; in other embodiments other algorithms may also be used, and the embodiments of this application place no specific limitation on this.
In one embodiment, when converting all denoised speech frames of the second preset duration into a speech signal sequence, the device may use the discrete cosine transform. In some embodiments, the discrete cosine transform is a transform related to the Fourier transform; it is similar to the discrete Fourier transform but uses only real numbers. A discrete cosine transform is equivalent to a discrete Fourier transform of roughly twice its length operating on a real even function (because the Fourier transform of a real even function is itself a real even function). In other embodiments, the device may also convert each speech frame of the second preset duration in other ways.
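Continuing the sketch, spectral subtraction (one of the candidate denoising algorithms named above) followed by the DCT conversion could look as follows; the noise-floor estimate taken from the first frames and all parameter values are illustrative assumptions:

```python
import numpy as np
from scipy.fft import dct

def spectral_subtract(frames, noise_frames=2):
    """Tiny spectral-subtraction sketch: estimate the noise magnitude from
    the first `noise_frames` frames and subtract it from every frame."""
    spectra = np.fft.rfft(frames, axis=1)
    noise_mag = np.abs(spectra[:noise_frames]).mean(axis=0)
    clean_mag = np.maximum(np.abs(spectra) - noise_mag, 0.0)
    clean = clean_mag * np.exp(1j * np.angle(spectra))   # keep original phase
    return np.fft.irfft(clean, n=frames.shape[1], axis=1)

def frames_to_sequence(frames):
    """Convert denoised frames into one sequence of DCT coefficients
    (type-II DCT, the variant usually meant by 'the' DCT)."""
    return dct(frames, type=2, norm="ortho", axis=1).ravel()

frames = np.random.randn(3, 160)            # stand-in for the framed signal
sequence = frames_to_sequence(spectral_subtract(frames))
```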
S104: Feed the speech signal sequence into the speech recognition model for classification and determine the target speech category corresponding to the speech signal sequence.
In this embodiment, the speech signal processing device may feed the speech signal sequence into a speech recognition model for classification and determine the target speech category corresponding to the sequence. In some embodiments, the speech categories may include n categories, n being a positive integer greater than 0, and the voice may be classified by qualities such as sweet, soft, rich, husky, high-pitched, magnetic, or impetuous; the embodiments of this application place no specific limitation on this.
In one embodiment, before feeding the speech signal sequence into the speech recognition model for classification, the device may also obtain a sample data set, generate an initial recognition model according to a preset recognition algorithm, and train the initial recognition model on the sample speech signal sequences and the categories of the sample speech signals to obtain the speech recognition model. In some embodiments, the sample data set includes sample speech signal sequences and the categories of the sample speech signals.
In some embodiments, the speech recognition model is a recurrent neural network (RNN) implemented with a six-layer encoder-decoder structure, which allows the RNN to process and classify speech signal sequences of arbitrary length. In some embodiments, the six-layer encoder-decoder structure comprises an encoder, a fixed encoding layer, a decoder, and a classification layer, where the encoder consists of three layers: two bidirectional recurrent layers of 128 and 64 neurons respectively and a unidirectional layer of 32 recurrent neurons. The details are as follows:
1) Encoder: three layers, comprising two bidirectional recurrent layers of 128 and 64 neurons respectively and a unidirectional layer of 32 recurrent neurons. Our encoder is configured to handle any sequence up to a maximum length that we set. The encoder is the process of encoding and modelling with a neural network; it has several layers of structure and maps and compresses the original speech data.
2) Fixed encoding layer: the last layer of the encoder output is an activation layer of 32 neurons with fixed parameters, used to initialize the decoder.
3) Decoder: a single recurrent layer of 64 long short-term memory (LSTM) units combined with an attention mechanism. The attention mechanism makes the network focus mainly on the salient parts of the input features, ultimately improving classification performance. Currently, our decoder is set to output a single label for each input sequence, namely one of grades 1-5. The decoder converts and decodes the previously compressed data and finally outputs the classification.
4) Classification: the final classification layer uses the softmax function to output a class label. The softmax function maps the input to values in (0, 1), which can be interpreted as probabilities. Classification is the process of dividing human voices into multiple grades.
The softmax function outputs a probability for each class, for example (class 1, 0.2), (class 2, 0.1), (class 3, 0.01), (class 4, 0.01), (class 5, 0.68). By comparing the probabilities we select class 5 as the final class; this is the classification role of the softmax function.
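For concreteness, here is a minimal PyTorch sketch of a model with the layer sizes listed above (128/64 bidirectional layers, a 32-unit unidirectional layer, a fixed 32-neuron encoding that initializes a 64-unit LSTM decoder with attention, and a 5-way softmax). The input feature dimension, the module and variable names, the plain dot-product attention, and the single decoding step are illustrative assumptions, not the patented implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechGradeClassifier(nn.Module):
    """Sketch of the described encoder-decoder RNN grade classifier."""

    def __init__(self, input_dim=40, num_classes=5):
        super().__init__()
        self.enc1 = nn.LSTM(input_dim, 128, bidirectional=True, batch_first=True)
        self.enc2 = nn.LSTM(2 * 128, 64, bidirectional=True, batch_first=True)
        self.enc3 = nn.LSTM(2 * 64, 32, batch_first=True)
        self.fixed = nn.Linear(32, 64)      # fixed encoding layer -> decoder init
        self.attn = nn.Linear(64, 32)       # projects the attention query
        self.decoder = nn.LSTMCell(32, 64)  # single recurrent layer, 64 LSTM units
        self.classify = nn.Linear(64, num_classes)

    def forward(self, x):                   # x: (batch, time, input_dim)
        h, _ = self.enc1(x)
        h, _ = self.enc2(h)
        h, _ = self.enc3(h)                 # (batch, time, 32)
        hx = torch.tanh(self.fixed(h[:, -1]))          # decoder initial state
        cx = torch.zeros_like(hx)
        # Dot-product attention: weight encoder states by relevance to the query.
        scores = torch.bmm(h, self.attn(hx).unsqueeze(2)).squeeze(2)
        context = torch.bmm(F.softmax(scores, dim=1).unsqueeze(1), h).squeeze(1)
        hx, cx = self.decoder(context, (hx, cx))        # one decoding step
        return F.softmax(self.classify(hx), dim=1)      # one probability per grade

model = SpeechGradeClassifier()
probs = model(torch.randn(2, 300, 40))     # two sequences of 300 feature frames
print(probs.shape)                         # torch.Size([2, 5])
```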
In one embodiment, before obtaining the sample data set, the device may also obtain sample speech signals and determine, from a preset correspondence between score information and speech categories, the speech category corresponding to the score information carried by each sample speech signal. In some embodiments, each sample speech signal carries score information. For example, suppose the preset speech categories include five categories: a first, second, third, fourth, and fifth category, where the preset correspondence between score information and speech categories is that the first category corresponds to a first score range, the second category to a second score range, the third category to a third score range, the fourth category to a fourth score range, and the fifth category to a fifth score range. In some embodiments, the speech categories from the first to the fifth may be, in order: very unpleasant, unpleasant, average, pleasant, very pleasant.
In one embodiment, when obtaining the sample data set, the device may window and frame the sample speech signal according to a third preset duration, split the sample speech signal into multiple sample speech frames of a fourth preset duration, denoise each sample speech frame of the fourth preset duration, and convert all denoised sample speech frames of the second preset duration into a sample speech signal sequence, thereby determining the sample speech signal sequences and the speech categories corresponding to the sample speech signals as the sample data set. In some embodiments, the fourth preset duration is less than or equal to the third preset duration.
In one embodiment, the device may determine the target speech category corresponding to the speech signal sequence from the similarity between the sequence and each sample speech signal sequence in the speech recognition model. In some embodiments, the device may instead determine the target speech category corresponding to the sequence from the probability that the sequence belongs to each speech category.
S105: Determine, from the preset correspondence between speech categories and scores, the target score corresponding to the target speech category, and determine, from the preset correspondence between scores and speech levels, the target speech level corresponding to the target score.
In this embodiment, the target score corresponding to the target speech category is determined from the preset correspondence between speech categories and scores, and the target speech level corresponding to the target score is determined from the preset correspondence between scores and speech levels, so that whether the interviewee passes the interview can be decided from the target speech level.
In one embodiment, when deciding from the target score whether the interviewee passes the interview, the device may determine the target speech level corresponding to the target score from the preset correspondence between scores and speech levels; judge whether the target speech level exceeds a preset level threshold; if the judgment indicates that it does, store the correspondence between the target speech level and the interviewee's user identifier in the database; and, when the interview ends, select a preset number of target user identifiers from the database in descending order of target speech level and determine that the interviewees corresponding to those target user identifiers have passed the interview (a sketch of this selection logic follows below).
In one embodiment, if the device receives no speech signal from an interviewee within a preset time interval, it is triggered to determine that the interview has ended and to filter the scores in the database. In some embodiments, the device may instead determine that the interview has ended from a received end-of-interview instruction. In some embodiments, the end-of-interview instruction may be triggered by a user through an end-of-interview operation on the device, implemented for example with an end button or end switch; in other embodiments, the end-of-interview operation may be triggered in other ways, which the embodiments of this application do not specifically limit.
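A small sketch of that selection logic; the score values, the level mapping, the threshold, and the in-memory dict standing in for the database are all illustrative assumptions:

```python
# Map each category to a score, each score to a level, then keep the best N.
CATEGORY_SCORE = {1: 20, 2: 40, 3: 60, 4: 80, 5: 100}   # assumed score ranges
LEVEL_THRESHOLD = 3                                      # assumed level threshold

def score_to_level(score):
    """Assumed mapping from a 0-100 score to a 1-5 speech level."""
    return min(5, score // 20 + (score % 20 > 0))

database = {}                                            # user_id -> level

def record(user_id, category):
    level = score_to_level(CATEGORY_SCORE[category])
    if level > LEVEL_THRESHOLD:                          # keep only passing levels
        database[user_id] = level

def select_successful(preset_number):
    """At the end of the interview, pick the top `preset_number` user IDs
    in descending order of target speech level."""
    ranked = sorted(database, key=database.get, reverse=True)
    return ranked[:preset_number]

record("user-001", 5); record("user-002", 4); record("user-003", 2)
print(select_successful(2))      # ['user-001', 'user-002']
```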
By converting the speech signal into a speech signal sequence and determining the target score of the target speech category corresponding to that sequence, the embodiments of this application decide from the target score whether the interviewee passes the interview, improving the efficiency and accuracy of speech recognition.
Referring to FIG. 2, FIG. 2 is a schematic flowchart of another speech signal processing method provided by an embodiment of this application. As shown in FIG. 2, the method may be performed by a speech signal processing device, which has been explained above and is not described again here. This embodiment differs from the embodiment of FIG. 1 described above in that it schematically illustrates the process of determining, from a speech signal sequence, the target speech category corresponding to that sequence. Specifically, the method of this embodiment includes the following steps.
S201: Acquire the speech signal produced by the interviewee during the interview.
In this embodiment, the speech signal processing device may acquire the speech signal produced by the interviewee during the interview.
S202: Perform windowing and framing on the speech signal according to the first preset duration, splitting the signal into multiple speech frames of a second preset duration.
In this embodiment, the device may window and frame the speech signal according to the first preset duration and split it into multiple speech frames of the second preset duration. The specific embodiments and examples are as described above and are not repeated here.
S203: Denoise each speech frame of the second preset duration and convert all denoised speech frames of the second preset duration into a speech signal sequence.
In this embodiment, the device may denoise each speech frame of the second preset duration and convert all denoised frames of the second preset duration into a speech signal sequence. The specific embodiments are as described above and are not repeated here.
S204: Compute the similarity between the speech signal sequence and each sample speech signal sequence in the speech recognition model.
In this embodiment, the speech signal processing device may compute the similarity between the speech signal sequence and each sample speech signal sequence in the speech recognition model. In some embodiments, the device may compute these similarities with the cosine similarity algorithm; in other embodiments, it may use other similarity measures, which the embodiments of this application do not specifically limit.
S205: Obtain at least one sample speech signal sequence whose similarity exceeds a preset threshold.
In this embodiment, the device may obtain at least one sample speech signal sequence whose similarity exceeds a preset threshold.
In one embodiment, after computing the similarity between the speech signal sequence and each sample sequence in the model, the device may check whether each similarity exceeds the preset threshold and obtain, from the sample speech signal sequences of the model, at least one sample sequence whose similarity exceeds the threshold.
S206: Determine, from the at least one sample speech signal sequence, the target speech category corresponding to the sample sequence with the greatest similarity.
In this embodiment, the device may determine, from the at least one sample speech signal sequence, the target speech category corresponding to the sample speech signal sequence with the greatest similarity.
For example, if the device obtains n sample speech signal sequences whose similarity exceeds the preset threshold, it may determine, from those n sequences, the target speech category corresponding to the sample sequence with the greatest similarity.
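A minimal sketch of steps S204-S206, assuming the sequences are equal-length NumPy vectors; the threshold value and the sample data are illustrative:

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def classify_by_similarity(sequence, samples, threshold=0.8):
    """samples: list of (sample_sequence, category) pairs from the model.
    Keep samples whose similarity exceeds the threshold (S205) and return
    the category of the most similar one (S206); None if nothing passes."""
    scored = [(cosine_similarity(sequence, s), cat) for s, cat in samples]
    above = [(sim, cat) for sim, cat in scored if sim > threshold]
    return max(above)[1] if above else None

rng = np.random.default_rng(0)
samples = [(rng.standard_normal(64), cat) for cat in (1, 2, 3, 4, 5)]
query = samples[3][0] + 0.05 * rng.standard_normal(64)   # near category 4
print(classify_by_similarity(query, samples))             # -> 4
```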
In one embodiment, when determining the target speech category corresponding to the speech signal sequence, the device may instead compute, according to a preset normalized exponential function (the softmax function), the probability that the speech signal sequence belongs to each speech category, determine the maximum of those probabilities, and determine the speech category corresponding to the maximum probability as the target speech category corresponding to the sequence.
In some embodiments, the softmax function is usually given by the following formula:

sigma(z)_j = e^(z_j) / sum_{k=1}^{K} e^(z_k), for j = 1, ..., K

The softmax function "compresses" a K-dimensional vector z containing arbitrary real numbers into another K-dimensional real vector sigma(z) such that every component sigma(z)_j lies in the range (0, 1) and all components sum to 1. The softmax function is in fact the gradient-log normalization of a finite discrete probability distribution. It is therefore widely used in many probability-based multi-class classification methods, including multinomial logistic regression, multiclass linear discriminant analysis, naive Bayes classifiers, and artificial neural networks.
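A direct NumPy rendering of the formula above (the max-subtraction is a standard numerical-stability trick, not part of the patent text):

```python
import numpy as np

def softmax(z):
    """sigma(z)_j = exp(z_j) / sum_k exp(z_k); shift by max(z) for stability."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

probs = softmax(np.array([0.5, -0.2, -2.5, -2.5, 1.7]))
print(probs.round(2), probs.sum())   # five class probabilities summing to 1
```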
S207: Determine, from the preset correspondence between speech categories and scores, the target score corresponding to the target speech category, and determine, from the preset correspondence between scores and speech levels, the target speech level corresponding to the target score.
In this embodiment, the device may determine the target score corresponding to the target speech category from the preset correspondence between speech categories and scores, and determine the target speech level corresponding to the target score from the preset correspondence between scores and speech levels, so that whether the interviewee passes the interview can be decided from the target speech level. The specific embodiments are as described above and are not repeated here.
By converting the speech signal into a speech signal sequence and determining the target score of the target speech category corresponding to that sequence, the embodiments of this application decide from the target score whether the interviewee passes the interview, improving the efficiency and accuracy of speech recognition.
An embodiment of this application further provides a speech signal processing device comprising units for performing the method of any of the preceding embodiments. Specifically, referring to FIG. 3, FIG. 3 is a schematic block diagram of a speech signal processing device provided by an embodiment of this application. The device of this embodiment includes an acquisition unit 301, a splitting unit 302, a denoising unit 303, a classification unit 304, and a determination unit 305.
The acquisition unit 301 is configured to acquire the speech signal produced by the interviewee during the interview;
the splitting unit 302 is configured to window and frame the speech signal according to a first preset duration and split it into multiple speech frames of a second preset duration, the second preset duration being less than or equal to the first preset duration;
the denoising unit 303 is configured to denoise each speech frame of the second preset duration and convert all denoised speech frames of the second preset duration into a speech signal sequence;
the classification unit 304 is configured to feed the speech signal sequence into a speech recognition model for classification and determine the target speech category corresponding to the sequence;
the determination unit 305 is configured to determine, from the preset correspondence between speech categories and scores, the target score corresponding to the target speech category, and to determine, from the preset correspondence between scores and speech levels, the target speech level corresponding to the target score, so that whether the interviewee passes the interview can be decided from the target speech level.
Further, before feeding the speech signal sequence into the speech recognition model for classification, the classification unit 304 is further configured to:
obtain a sample data set comprising sample speech signal sequences and the categories of the sample speech signals;
generate an initial recognition model according to a preset recognition algorithm;
train the initial recognition model on the sample speech signal sequences and the categories of the sample speech signals to obtain the speech recognition model.
Further, before obtaining the sample data set, the classification unit 304 is further configured to:
obtain sample speech signals, where each sample speech signal carries score information;
determine, from the preset correspondence between score information and speech categories, the speech category corresponding to the score information carried by each sample speech signal.
Further, when obtaining the sample data set, the classification unit 304 is specifically configured to:
window and frame the sample speech signal according to a third preset duration and split it into multiple sample speech frames of a fourth preset duration, the fourth preset duration being less than or equal to the third preset duration;
denoise each sample speech frame of the fourth preset duration and convert all denoised sample speech frames of the second preset duration into a sample speech signal sequence;
determine the sample speech signal sequences and the speech categories corresponding to the sample speech signals as the sample data set.
Further, when feeding the speech signal sequence into the speech recognition model for classification and determining the target speech category corresponding to the sequence, the classification unit 304 is specifically configured to:
compute the similarity between the speech signal sequence and each sample speech signal sequence in the speech recognition model;
obtain at least one sample speech signal sequence whose similarity exceeds a preset threshold;
determine, from the at least one sample speech signal sequence, the target speech category corresponding to the sample sequence with the greatest similarity.
Further, when feeding the speech signal sequence into the speech recognition model for classification and determining the target speech category corresponding to the sequence, the classification unit 304 is specifically configured to:
compute, according to a preset normalized exponential function, the probability that the speech signal sequence belongs to each speech category, and determine the maximum of those probabilities;
determine the speech category corresponding to the maximum probability as the target speech category corresponding to the speech signal sequence.
Further, when determining the target speech level corresponding to the target score from the preset correspondence between scores and speech levels so that whether the interviewee passes the interview can be decided from the target speech level, the determination unit 305 is specifically configured to:
determine the target speech level corresponding to the target score from the preset correspondence between scores and speech levels;
judge whether the target speech level exceeds a preset level threshold;
if the judgment indicates that the target speech level exceeds the preset level threshold, store the correspondence between the target speech level and the interviewee's user identifier in the database;
when the interview ends, select a preset number of target user identifiers from the database in descending order of target speech level and determine that the interviewees corresponding to the target user identifiers have passed the interview.
Further, when acquiring the speech signal produced by the interviewee during the interview, the acquisition unit 301 is specifically configured to:
acquire the speech signal through a sensor;
add a user identifier to the acquired speech signal, where the user identifier is used to distinguish the speech signals of different interviewees.
Further, when converting all denoised sample speech frames of the second preset duration into a sample speech signal sequence, the classification unit 304 is specifically configured to:
convert all denoised speech frames of the second preset duration into a speech signal sequence by means of the discrete cosine transform;
where the discrete cosine transform is a Fourier transform performed on a real even function.
Further, when feeding the speech signal sequence into the speech recognition model for classification and determining the target speech category corresponding to the sequence, the classification unit 304 is specifically configured to:
compute the probability that the speech signal sequence belongs to each speech category;
determine, from those probabilities, the speech category with the highest probability as the target speech category corresponding to the speech signal sequence.
By converting the speech signal into a speech signal sequence and determining the target score of the target speech category corresponding to that sequence, the embodiments of this application decide from the target score whether the interviewee passes the interview, improving the efficiency and accuracy of speech recognition.
Referring to FIG. 4, FIG. 4 is a schematic block diagram of another speech signal processing device provided by an embodiment of this application. As shown in the figure, the device of this embodiment may include one or more processors 401, one or more input devices 402, one or more output devices 403, and a memory 404, connected to one another via a bus 405. The memory 404 stores a computer program comprising program instructions, and the processor 401 executes the program instructions stored in the memory 404. The processor 401 is configured to invoke the program instructions to perform:
acquiring the speech signal produced by the interviewee during the interview;
performing windowing and framing on the speech signal according to a first preset duration, splitting the speech signal into multiple speech frames of a second preset duration, the second preset duration being less than or equal to the first preset duration;
denoising each speech frame of the second preset duration and converting all denoised speech frames of the second preset duration into a speech signal sequence;
feeding the speech signal sequence into a speech recognition model for classification and determining the target speech category corresponding to the speech signal sequence;
determining, from a preset correspondence between speech categories and scores, the target score corresponding to the target speech category, and determining, from a preset correspondence between scores and speech levels, the target speech level corresponding to the target score, so that whether the interviewee passes the interview can be decided from the target speech level.
Further, before feeding the speech signal sequence into the speech recognition model for classification, the processor 401 is further configured to:
obtain a sample data set comprising sample speech signal sequences and the categories of the sample speech signals;
generate an initial recognition model according to a preset recognition algorithm;
train the initial recognition model on the sample speech signal sequences and the categories of the sample speech signals to obtain the speech recognition model.
Further, before obtaining the sample data set, the processor 401 is further configured to:
obtain sample speech signals, where each sample speech signal carries score information;
determine, from the preset correspondence between score information and speech categories, the speech category corresponding to the score information carried by each sample speech signal.
Further, when obtaining the sample data set, the processor 401 is specifically configured to:
window and frame the sample speech signal according to a third preset duration and split it into multiple sample speech frames of a fourth preset duration, the fourth preset duration being less than or equal to the third preset duration;
denoise each sample speech frame of the fourth preset duration and convert all denoised sample speech frames of the second preset duration into a sample speech signal sequence;
determine the sample speech signal sequences and the speech categories corresponding to the sample speech signals as the sample data set.
Further, when feeding the speech signal sequence into the speech recognition model for classification and determining the target speech category corresponding to the sequence, the processor 401 is specifically configured to:
compute the similarity between the speech signal sequence and each sample speech signal sequence in the speech recognition model;
obtain at least one sample speech signal sequence whose similarity exceeds a preset threshold;
determine, from the at least one sample speech signal sequence, the target speech category corresponding to the sample sequence with the greatest similarity.
Further, when feeding the speech signal sequence into the speech recognition model for classification and determining the target speech category corresponding to the sequence, the processor 401 is specifically configured to:
compute, according to a preset normalized exponential function, the probability that the speech signal sequence belongs to each speech category, and determine the maximum of those probabilities;
determine the speech category corresponding to the maximum probability as the target speech category corresponding to the speech signal sequence.
Further, when determining the target speech level corresponding to the target score from the preset correspondence between scores and speech levels so that whether the interviewee passes the interview can be decided from the target speech level, the processor 401 is specifically configured to:
determine the target speech level corresponding to the target score from the preset correspondence between scores and speech levels;
judge whether the target speech level exceeds a preset level threshold;
if the judgment indicates that the target speech level exceeds the preset level threshold, store the correspondence between the target speech level and the interviewee's user identifier in the database;
when the interview ends, select a preset number of target user identifiers from the database in descending order of target speech level and determine that the interviewees corresponding to the target user identifiers have passed the interview.
Further, when acquiring the speech signal produced by the interviewee during the interview, the processor 401 is specifically configured to:
acquire the speech signal through a sensor;
add a user identifier to the acquired speech signal, where the user identifier is used to distinguish the speech signals of different interviewees.
Further, when converting all denoised sample speech frames of the second preset duration into a sample speech signal sequence, the processor 401 is specifically configured to:
convert all denoised speech frames of the second preset duration into a speech signal sequence by means of the discrete cosine transform;
where the discrete cosine transform is a Fourier transform performed on a real even function.
Further, when feeding the speech signal sequence into the speech recognition model for classification and determining the target speech category corresponding to the sequence, the processor 401 is specifically configured to:
compute the probability that the speech signal sequence belongs to each speech category;
determine, from those probabilities, the speech category with the highest probability as the target speech category corresponding to the speech signal sequence.
By converting the speech signal into a speech signal sequence and determining the target score of the target speech category corresponding to that sequence, the embodiments of this application decide from the target score whether the interviewee passes the interview, improving the efficiency and accuracy of speech recognition.
It should be understood that in the embodiments of this application the processor 401 may be a central processing unit (CPU); the processor may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The input device 402 may include a touch pad, a microphone, and the like; the output device 403 may include a display (LCD, etc.), a speaker, and the like.
The memory 404 may include read-only memory and random-access memory, and provides instructions and data to the processor 401. Part of the memory 404 may also include non-volatile random-access memory. For example, the memory 404 may also store information about the device type.
In a specific implementation, the processor 401, input device 402, and output device 403 described in the embodiments of this application may perform the implementations described in the embodiments of the speech signal processing method of FIG. 1 or FIG. 2 of this application, and may also perform the implementations of the speech signal processing device described in FIG. 3 or FIG. 4 of the embodiments of this application; details are not repeated here.
An embodiment of this application further provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the speech signal processing method described in the embodiment corresponding to FIG. 1 or FIG. 2, and may also implement the speech signal processing device of the embodiment corresponding to FIG. 3 or FIG. 4 of this application; details are not repeated here. In some embodiments, the computer-readable storage medium may also be a non-volatile computer-readable storage medium, which is not specifically limited here in the embodiments of this application.
The computer-readable storage medium may be an internal storage unit of the speech signal processing device of any of the preceding embodiments, such as a hard disk or memory of the device. The computer-readable storage medium may also be an external storage device of the speech signal processing device, such as a plug-in hard disk, smart media card (SMC), secure digital (SD) card, or flash card provided on the device. Further, the computer-readable storage medium may include both an internal storage unit of the speech signal processing device and an external storage device. The computer-readable storage medium is used to store the computer program and the other programs and data required by the speech signal processing device, and may also be used to temporarily store data that has been or will be output.
The above are only some embodiments of this application, but the scope of protection of this application is not limited thereto. Any modification or replacement that a person skilled in the art can readily conceive within the technical scope disclosed in this application shall fall within the scope of protection of this application.

Claims (20)

  1. An image data processing method, characterized by comprising:
    receiving image data to be detected sent by a service terminal, the image data to be detected comprising field information;
    annotating the field information in the image data to be detected to obtain field annotation information;
    determining position information of the field information in the image data to be detected according to the field annotation information, and cropping the image data to be detected according to the position information to obtain field image data corresponding to the position information;
    obtaining text information in the field image data, and annotating the position information of the text in the field image data according to the text information to obtain text position annotation information;
    processing the text position annotation information and the field image data based on a recognition model to recognize the text information in the field image data.
  2. The method according to claim 1, characterized in that the field information comprises carrier data and field data within the carrier data, and the annotating of the field information in the image data to be detected to obtain field annotation information comprises:
    annotating the carrier data in the image data to be detected to obtain carrier annotation data; and
    annotating the field data in the carrier data to obtain field annotation data;
    determining the carrier annotation data and the field annotation data as the field annotation information.
  3. The method according to claim 2, characterized in that the determining of the position information of the field information in the image data to be detected according to the field annotation information comprises:
    determining the position information of the carrier in the image data to be detected from the carrier annotation data in the field annotation information;
    determining the relative position information of the field within the carrier from the position information of the carrier and the field annotation data in the field annotation information;
    and the cropping of the image data to be detected according to the position information to obtain field image data corresponding to the position information comprises:
    cropping the field in the carrier according to the relative position information of the field within the carrier to obtain field image data corresponding to the relative position information.
  4. The method according to claim 1, characterized in that the annotating of the position information of the text in the field image data according to the text information to obtain text position annotation information comprises:
    splitting the text information in the field image data according to the text information to obtain each character corresponding to the text information;
    annotating the position information of each character to obtain the text position annotation information of each character corresponding to the text information in the field image data.
  5. The method according to claim 4, characterized in that the processing of the text position annotation information and the field image data based on a recognition model to recognize the text information in the field image data comprises:
    recognizing, based on the recognition model, the text position annotation information of each character corresponding to the text information in the field image data, and determining the position information corresponding to the text position annotation information of each character;
    arranging and combining the characters in the text information according to the position information corresponding to the text position annotation information of each character, to obtain the text information in the field image data.
  6. The method according to claim 1, characterized in that, before the processing of the text position annotation information and the field image data based on a recognition model, the method further comprises:
    obtaining sample field image data, the sample field image data comprising text position annotation information;
    generating an initial recognition model according to a preset recognition algorithm;
    training the initial recognition model on the sample field image data comprising the text position annotation information to obtain the recognition model.
  7. The method according to claim 6, characterized in that, before the obtaining of the sample data set, the method further comprises:
    obtaining sample image data, the sample image data comprising sample field information;
    annotating the sample field information of the sample image data to obtain sample field annotation information;
    determining the position information of the sample field information in the sample image data according to the sample field annotation information;
    cropping the sample image data according to the position information of the sample field information to obtain sample field image data corresponding to the position information of the sample field information.
  8. The method according to claim 1, characterized in that the acquiring of the speech signal produced by the interviewee during the interview comprises:
    acquiring the speech signal through a sensor;
    adding a user identifier to the acquired speech signal, where the user identifier is used to distinguish the speech signals of different interviewees.
  9. The method according to claim 4, characterized in that the converting of all denoised sample speech frames of the second preset duration into a sample speech signal sequence comprises:
    converting all denoised speech frames of the second preset duration into a speech signal sequence by means of the discrete cosine transform;
    where the discrete cosine transform is a Fourier transform performed on a real even function.
  10. The method according to claim 5, characterized in that the feeding of the speech signal sequence into a speech recognition model for classification and the determining of the target speech category corresponding to the speech signal sequence comprises:
    computing the probability that the speech signal sequence belongs to each speech category;
    determining, from those probabilities, the speech category with the highest probability as the target speech category corresponding to the speech signal sequence.
  11. A signal processing device, characterized by comprising:
    an acquisition unit, configured to acquire the speech signal produced by an interviewee during an interview;
    a splitting unit, configured to window and frame the speech signal according to a first preset duration and split it into multiple speech frames of a second preset duration, the second preset duration being less than or equal to the first preset duration;
    a denoising unit, configured to denoise each speech frame of the second preset duration and convert all denoised speech frames of the second preset duration into a speech signal sequence;
    a classification unit, configured to feed the speech signal sequence into a speech recognition model for classification and determine the target speech category corresponding to the speech signal sequence;
    a determination unit, configured to determine, from a preset correspondence between speech categories and scores, the target score corresponding to the target speech category, and to determine, from a preset correspondence between scores and speech levels, the target speech level corresponding to the target score, so that whether the interviewee passes the interview can be decided from the target speech level.
  12. The device according to claim 11, characterized in that, before feeding the speech signal sequence into the speech recognition model for classification, the classification unit is further configured to:
    obtain a sample data set comprising sample speech signal sequences and the categories of the sample speech signals;
    generate an initial recognition model according to a preset recognition algorithm;
    train the initial recognition model on the sample speech signal sequences and the categories of the sample speech signals to obtain the speech recognition model.
  13. The device according to claim 12, characterized in that, before obtaining the sample data set, the classification unit is further configured to:
    obtain sample speech signals, where each sample speech signal carries score information;
    determine, from a preset correspondence between score information and speech categories, the speech category corresponding to the score information carried by each sample speech signal.
  14. The device according to claim 13, characterized in that, when obtaining the sample data set, the classification unit is specifically configured to:
    window and frame the sample speech signal according to a third preset duration and split it into multiple sample speech frames of a fourth preset duration, the fourth preset duration being less than or equal to the third preset duration;
    denoise each sample speech frame of the fourth preset duration and convert all denoised sample speech frames of the second preset duration into a sample speech signal sequence;
    determine the sample speech signal sequences and the speech categories corresponding to the sample speech signals as the sample data set.
  15. The device according to claim 14, characterized in that, when feeding the speech signal sequence into the speech recognition model for classification and determining the target speech category corresponding to the speech signal sequence, the classification unit is specifically configured to:
    compute the similarity between the speech signal sequence and each sample speech signal sequence in the speech recognition model;
    obtain at least one sample speech signal sequence whose similarity exceeds a preset threshold;
    determine, from the at least one sample speech signal sequence, the target speech category corresponding to the sample sequence with the greatest similarity.
  16. The device according to claim 14, characterized in that, when feeding the speech signal sequence into the speech recognition model for classification and determining the target speech category corresponding to the speech signal sequence, the classification unit is specifically configured to:
    compute, according to a preset normalized exponential function, the probability that the speech signal sequence belongs to each speech category, and determine the maximum of those probabilities;
    determine the speech category corresponding to the maximum probability as the target speech category corresponding to the speech signal sequence.
  17. The device according to claim 11, characterized in that, when determining the target speech level corresponding to the target score from the preset correspondence between scores and speech levels so that whether the interviewee passes the interview can be decided from the target speech level, the determination unit is specifically configured to:
    determine the target speech level corresponding to the target score from the preset correspondence between scores and speech levels;
    judge whether the target speech level exceeds a preset level threshold;
    if the judgment indicates that the target speech level exceeds the preset level threshold, store the correspondence between the target speech level and the interviewee's user identifier in a database;
    when the interview ends, select a preset number of target user identifiers from the database in descending order of target speech level and determine that the interviewees corresponding to the target user identifiers have passed the interview.
  18. The device according to claim 11, characterized in that, when acquiring the speech signal produced by the interviewee during the interview, the acquisition unit is specifically configured to:
    acquire the speech signal through a sensor;
    add a user identifier to the acquired speech signal, where the user identifier is used to distinguish the speech signals of different interviewees.
  19. A speech signal processing device, characterized by comprising a processor, an input device, an output device, and a memory that are connected to one another, where the memory stores a computer program comprising program instructions, and the processor is configured to invoke the program instructions to perform:
    acquiring the speech signal produced by the interviewee during the interview;
    performing windowing and framing on the speech signal according to a first preset duration, splitting the speech signal into multiple speech frames of a second preset duration, the second preset duration being less than or equal to the first preset duration;
    denoising each speech frame of the second preset duration and converting all denoised speech frames of the second preset duration into a speech signal sequence;
    feeding the speech signal sequence into a speech recognition model for classification and determining the target speech category corresponding to the speech signal sequence;
    determining, from a preset correspondence between speech categories and scores, the target score corresponding to the target speech category, and determining, from a preset correspondence between scores and speech levels, the target speech level corresponding to the target score, so that whether the interviewee passes the interview can be decided from the target speech level.
  20. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions that, when executed by a processor, cause the processor to perform the method of any one of claims 1-10.
PCT/CN2019/116962 2019-01-04 2019-11-11 Speech signal processing method, device, and computer-readable storage medium WO2020140607A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910014077.7A CN109658921B (zh) 2019-01-04 2019-01-04 Speech signal processing method, device, and computer-readable storage medium
CN201910014077.7 2019-01-04

Publications (1)

Publication Number Publication Date
WO2020140607A1 (zh)

Family

ID=66119555

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/116962 WO2020140607A1 (zh) 2019-01-04 2019-11-11 Speech signal processing method, device, and computer-readable storage medium

Country Status (2)

Country Link
CN (1) CN109658921B (zh)
WO (1) WO2020140607A1 (zh)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109658921B (zh) 2019-01-04 2024-05-28 平安科技(深圳)有限公司 Speech signal processing method, device, and computer-readable storage medium
CN110265025A (zh) 2019-06-13 2019-09-20 赵斌 Interview content recording system using voice and video devices
CN110503952B (zh) 2019-07-29 2022-02-22 北京搜狗科技发展有限公司 Speech processing method and apparatus, and electronic device
CN111292766B (zh) 2020-02-07 2023-08-08 抖音视界有限公司 Method, apparatus, electronic device and medium for generating speech samples
CN111696580B (zh) 2020-04-22 2023-06-16 广州多益网络股份有限公司 Speech detection method and apparatus, electronic device, and storage medium
CN112233664B (zh) 2020-10-15 2021-11-09 北京百度网讯科技有限公司 Training method, apparatus and device for a semantic prediction network, and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101739867A (zh) * 2008-11-19 2010-06-16 中国科学院自动化研究所 Method for scoring spoken-language interpretation quality by computer
CN103065626A (zh) * 2012-12-20 2013-04-24 中国科学院声学研究所 Automatic scoring method and device for read-aloud questions in a spoken English test system
CN104573126A (zh) * 2015-02-10 2015-04-29 同方知网(北京)技术有限公司 Drawing display method for patent-drawing reference annotations based on the patent full text
CN106407976A (zh) * 2016-08-30 2017-02-15 百度在线网络技术(北京)有限公司 Image character recognition model generation, and vertical-column character image recognition method and device
CN106777083A (zh) * 2016-12-13 2017-05-31 四川研宝科技有限公司 Method and device for marking objects in a picture
CN109658921A (zh) * 2019-01-04 2019-04-19 平安科技(深圳)有限公司 Speech signal processing method, device, and computer-readable storage medium
CN109829457A (zh) * 2019-01-04 2019-05-31 平安科技(深圳)有限公司 Image data processing method, device, and computer-readable storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104732977B (zh) * 2015-03-09 2018-05-11 广东外语外贸大学 Online spoken-language pronunciation quality evaluation method and system
CN107680597B (zh) * 2017-10-23 2019-07-09 平安科技(深圳)有限公司 Speech recognition method, apparatus and device, and computer-readable storage medium
CN108877835A (zh) * 2018-05-31 2018-11-23 深圳市路通网络技术有限公司 Method and system for evaluating speech signals

Also Published As

Publication number Publication date
CN109658921B (zh) 2024-05-28
CN109658921A (zh) 2019-04-19

Similar Documents

Publication Publication Date Title
WO2020140607A1 Speech signal processing method, device, and computer-readable storage medium
WO2021208287A1 Voice endpoint detection method and apparatus for emotion recognition, electronic device, and storage medium
WO2019223457A1 Mixed speech recognition method and apparatus, and computer-readable storage medium
CN109087670B Emotion analysis method, system, server, and storage medium
EP3839942A1 Quality inspection method, apparatus, device and computer storage medium for insurance recording
CN110443692B Enterprise credit review method, apparatus, and device, and computer-readable storage medium
WO2019196196A1 Whispered speech recovery method, apparatus, and device, and readable storage medium
WO2021082420A1 Voiceprint authentication method and apparatus, medium, and electronic device
WO2018113243A1 Speech segmentation method, apparatus, and device, and computer storage medium
US8731936B2 Energy-efficient unobtrusive identification of a speaker
CN109431507A Cough disease identification method and device based on deep learning
WO2020098083A1 Call separation method and apparatus, computer device, and storage medium
CN112328761B Intent label setting method and apparatus, computer device, and storage medium
WO2019136909A1 Voice liveness detection method based on deep learning, server, and storage medium
CN109299227B Information query method and apparatus based on speech recognition
WO2019136911A1 Speech recognition method for updating voiceprint data, terminal device, and storage medium
WO2020056995A1 Speech fluency recognition method and apparatus, computer device, and readable storage medium
WO2020140609A1 Speech recognition method and device, and computer-readable storage medium
CN116705034A Voiceprint feature extraction method, speaker recognition method, model training method, and apparatus
CN108920715B Intelligent assistance method and apparatus for customer service, server, and storage medium
CN116741155A Speech recognition method, speech recognition model training method, apparatus, and device
CN115455142A Text retrieval method, computer device, and storage medium
WO2021196477A1 Risk user identification method and apparatus based on voiceprint features and association graph data
CN111444319B Text matching method and apparatus, and electronic device
JP2018109739A Apparatus and method for speech frame processing

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19906850

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS (EPO FORM 1205A DATED 25.08.2021)

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 22.04.2022)

122 Ep: pct application non-entry in european phase

Ref document number: 19906850

Country of ref document: EP

Kind code of ref document: A1