CN110995943A - Multi-user streaming voice recognition method, system, device and medium

Publication number: CN110995943A (application CN201911358893.6A); granted as CN110995943B
Authority: CN (China)
Original language: Chinese (zh)
Prior art keywords: audio, information, user, layer, identity information
Inventors: 郝竹林, 罗超, 胡泓, 王俊彬, 任君
Applicant and assignee: Ctrip Computer Technology Shanghai Co Ltd
Legal status: granted; active

Classifications

    • H04M 3/5166: Centralised call answering arrangements requiring operator intervention (e.g. call or contact centres), in combination with interactive voice response systems or voice portals
    • G10L 15/02: Feature extraction for speech recognition; selection of recognition unit
    • G10L 15/063: Training of speech recognition systems (creation of reference templates; adaptation to the characteristics of the speaker's voice)
    • G10L 15/16: Speech classification or search using artificial neural networks
    • G10L 15/26: Speech-to-text systems

Abstract

The invention discloses a multi-user streaming voice recognition method, system, device and medium. The multi-user streaming voice recognition method comprises the following steps: acquiring audio streams of a plurality of corresponding users from a plurality of call lines; generating a request object according to the audio information and the identity information; analyzing the audio information in the request object by using a voice recognition model to obtain a decoding result corresponding to the identity information; and judging whether the audio stream is silent: if not, calling a partial decoder to analyze the decoding result to obtain intermediate identification characters, and if so, calling a final decoder to analyze the decoding result to obtain final identification characters. The invention realizes real-time recognition of parallel multi-user call speech, greatly increases the speed of audio recognition, speeds up the call handling of customer service agents and improves the response speed of the OTA intelligent customer service, so that users in a conversation do not need to wait for lengthy character recognition and enjoy a better call experience.

Description

Multi-user streaming voice recognition method, system, device and medium
Technical Field
The present invention relates to the field of speech recognition, and in particular, to a multi-user streaming speech recognition method, system, device, and medium.
Background
In the OTA (online travel agency) industry, when an OTA intelligent customer service agent holds a real-time voice call over the telephone with a guest or a hotel, it needs to recognize the speech of the guest or the hotel in real time and, once the information is recognized, give a feedback response by means of a text reply.
In the OTA industry, a plurality of OTA intelligent customer service agents usually talk with a plurality of users in parallel. In the prior art, however, real-time streaming voice recognition is limited to offline recognition of audio segments and can only recognize the streaming audio of a single user, so it cannot be applied to real-time recognition of the audio streams of multiple users talking in parallel in the OTA industry.
Disclosure of Invention
The invention provides a multi-user streaming voice recognition method, system, device and medium, aiming at overcoming the defect that the prior art cannot achieve real-time voice recognition of audio streams when a plurality of users in the OTA industry are on parallel calls.
The invention solves the technical problems through the following technical scheme:
a multi-user streaming voice recognition method, the multi-user streaming voice recognition method comprising:
acquiring audio streams of a plurality of corresponding users from a plurality of call lines, wherein the audio streams comprise audio information and identity information of the users;
generating a request object according to the audio information and the identity information;
analyzing the audio information in the request object by using a voice recognition model to obtain a decoding result corresponding to the identity information;
and judging whether the audio information in the audio stream is silent, if not, calling a partial decoder to analyze the decoding result to obtain intermediate identification characters corresponding to the identity information of the user of the audio stream, if so, calling a final decoder to analyze the decoding result to obtain final identification characters corresponding to the identity information of the user of the audio stream.
Preferably, the speech recognition model is a time-delay neural network model;
and/or the step of obtaining the audio stream of the corresponding user from the plurality of call lines comprises the following steps:
initializing a feature extractor;
initializing initial parameters of a speech recognition model;
initializing a speech decoder, the speech decoder comprising a partial decoder and a final decoder;
the step of analyzing the audio information in the request object by using a speech recognition model to obtain a decoding result corresponding to the identity information comprises:
extracting the characteristics of the audio information in the request object by using the characteristic extractor to obtain characteristic information corresponding to the identity information;
and analyzing the characteristic information by using the voice recognition model to obtain the decoding result.
Preferably, the multi-user streaming voice recognition method includes:
acquiring audio streams of corresponding users from a plurality of call lines in a main thread, and generating a request object according to the audio information and the identity information;
adding the request object into a request queue in a main thread;
setting a return object corresponding to the request object in the main thread;
acquiring the request object in the request queue in a sub-thread, judging whether the identity information in the request object is a new user for first transmission, if so, initializing environmental parameters of the voice recognition model, and analyzing the audio information according to the request object by using the voice recognition model to obtain the decoding result corresponding to the identity information; if not, directly utilizing the voice recognition model to obtain the decoding result;
judging whether the audio information in the audio stream is silent in the sub-thread, if not, calling a partial decoder to analyze the decoding result to obtain intermediate identification characters corresponding to the identity information of the user of the audio stream, and if so, calling a final decoder to analyze the decoding result to obtain final identification characters corresponding to the identity information of the user of the audio stream; assigning the character recognition result to the return object;
the main thread processes the return object.
Preferably, the step of the main thread processing the return object comprises:
and periodically inquiring whether a character recognition result exists in the return object corresponding to the request object in the main thread, and if so, displaying the character recognition result.
Preferably, the step of determining whether the audio stream is muted includes:
judging whether the audio information in the audio stream is silent by using an endpoint detection model;
the endpoint detection model comprises an input layer, an audio CNN layer, a splicing layer, a convolution layer, a full connection layer and an output layer;
the input layer is used for receiving frame information of a test audio and extracting features of the frame information to obtain features of a preset dimension;
the audio CNN layer comprises a plurality of parallel one-dimensional convolutional layers, and is used for performing convolution calculation on the preset dimensional characteristics through the one-dimensional convolutional layers to obtain first characteristic data corresponding to each one-dimensional convolutional layer;
the splicing layer is used for connecting the first characteristic data to obtain second characteristic data;
the convolution layer is used for performing convolution calculation on the second characteristic data to obtain third characteristic data;
the full connection layer is used for obtaining the probability of the frame information according to the third characteristic data;
and the output layer is used for obtaining a judgment result of whether the test audio is silent according to the probability.
A multi-user streaming voice recognition system comprises a transmission module, a generation module, a decoding module and a recognition module;
the transmission module acquires audio streams of a plurality of corresponding users from a plurality of call lines, wherein the audio streams comprise audio information and identity information of the users;
the generating module is used for generating a request object according to the audio information and the identity information;
the decoding module is used for analyzing the audio information in the request object by utilizing a voice recognition model to obtain a decoding result corresponding to the identity information;
the identification module is used for judging whether the audio information in the audio stream is silent, if not, calling a partial decoder to analyze the decoding result to obtain intermediate identification characters corresponding to the identity information of the user of the audio stream, and if so, calling a final decoder to analyze the decoding result to obtain final identification characters corresponding to the identity information of the user of the audio stream.
Preferably, the speech recognition model is a time-delay neural network model;
and/or,
the multi-user streaming voice recognition system further comprises an initialization module;
the initialization module is used for initializing the feature extractor;
and is also used for initializing the initial parameters of the voice recognition model;
also for initializing a speech decoder, the speech decoder comprising a partial decoder and a final decoder;
the decoding module is further configured to perform feature extraction on the audio information in the request object by using the feature extractor to obtain feature information corresponding to the identity information; and analyzing the characteristic information by using the voice recognition model to obtain the decoding result.
Preferably, the multi-user streaming voice recognition system comprises a main thread and a sub-thread;
the main thread is used for acquiring audio streams of corresponding users from a plurality of call lines and generating a request object according to the audio information and the identity information;
the main thread is also used for adding the request object into a request queue;
the main thread is also used for setting a return object corresponding to the request object;
the sub-thread is used for acquiring the request object in the request queue and judging whether the identity information in the request object is a new user for first transmission, if so, initializing the environmental parameters of the voice recognition model, and analyzing the audio information according to the request object by using the voice recognition model to obtain the decoding result corresponding to the identity information; if not, directly utilizing the voice recognition model to obtain the decoding result;
the sub-thread is further used for judging whether the audio information in the audio stream is silent, if not, calling a partial decoder to analyze the decoding result to obtain intermediate identification characters corresponding to the identity information of the user of the audio stream, and if so, calling a final decoder to analyze the decoding result to obtain final identification characters corresponding to the identity information of the user of the audio stream;
the main thread is also used for processing the return object.
Preferably, the main thread is further configured to periodically query whether a text recognition result exists in the returned object corresponding to the request object in the main thread, and if yes, display the text recognition result.
Preferably, the identification module is further configured to determine whether audio information in the audio stream is silent by using an endpoint detection model;
the endpoint detection model comprises an input layer, an audio CNN layer, a splicing layer, a convolution layer, a full connection layer and an output layer;
the input layer is used for receiving frame information of a test audio and extracting features of the frame information to obtain features of a preset dimension;
the audio CNN layer comprises a plurality of parallel one-dimensional convolutional layers, and is used for performing convolution calculation on the preset dimensional characteristics through the one-dimensional convolutional layers to obtain first characteristic data corresponding to each one-dimensional convolutional layer;
the splicing layer is used for connecting the first characteristic data to obtain second characteristic data;
the convolution layer is used for performing convolution calculation on the second characteristic data to obtain third characteristic data;
the full connection layer is used for obtaining the probability of the frame information according to the third characteristic data;
and the output layer is used for obtaining a judgment result of whether the test audio is silent according to the probability.
An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the multi-user streaming speech recognition method as described above when executing the computer program.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the multi-user streaming speech recognition method as described above.
The positive progress effects of the invention are as follows:
the invention aims at the problem that in the OTA industry, a smart phone needs to perform real-time streaming voice recognition when the smart phone establishes telephone communication with a plurality of users in parallel, by establishing a request object corresponding to the user identity information, analyzing the audio stream corresponding to each user identity information by using a voice recognition model to obtain a decoding result corresponding to each user, and realizes the character decoding of the effective voice section between the silences through the judgment of the silence in the audio stream, compared with the traditional single-user offline voice recognition method for transmitting a long-time voice frequency, the voice frequency recognition speed is greatly improved, the call processing operation speed of the customer service is improved, the response speed of the OTA intelligent customer service is improved, the user does not need to wait for long-time character recognition in a conversation, and the good call feeling of the user is improved.
Drawings
Fig. 1 is a flowchart of a multi-user streaming voice recognition method according to embodiment 1 of the present invention.
Fig. 2 is a flowchart of step 13 of the multi-user streaming voice recognition method according to embodiment 2 of the present invention.
Fig. 3 is a flowchart of thread management of the multi-user streaming voice recognition method according to embodiment 1 of the present invention.
Fig. 4 is a flowchart of step 30 of the multi-user streaming voice recognition method according to embodiment 1 of the present invention.
Fig. 5 is a schematic block diagram of an endpoint detection model according to embodiment 1 of the present invention.
Fig. 6 is a block diagram of a multi-user streaming voice recognition system according to embodiment 2 of the present invention.
Fig. 7 is a schematic structural diagram of an electronic device according to embodiment 3 of the present invention.
Detailed Description
The invention is further illustrated by the following examples, which are not intended to limit the scope of the invention.
Example 1
The present embodiment provides a multi-user streaming voice recognition method, as shown in fig. 1, the multi-user streaming voice recognition method includes:
and step 10, initializing parameters.
The specific steps of initializing the parameters include:
initializing a feature extractor, and initializing initial parameters of the voice recognition model;
initializing a voice decoder, wherein the voice decoder comprises a partial decoder and a final decoder;
and 11, acquiring audio streams of a plurality of corresponding users from a plurality of call lines.
Wherein the audio stream comprises audio information and identity information of the user.
And step 12, generating a request object according to the audio information and the identity information.
Step 13, analyzing the audio information in the request object by using a voice recognition model to obtain a decoding result corresponding to the identity information;
and step 14, judging whether the audio information in the audio stream is mute, if not, executing step 15, and if so, executing step 16.
And step 15, calling a partial decoder to analyze the decoding result to obtain intermediate identification characters corresponding to the identity information of the user of the audio stream, and returning to step 11.
And step 16, calling a final decoder, and analyzing a decoding result to obtain a final identification character corresponding to the identity information of the user of the audio stream.
As shown in fig. 2, step 13 includes:
step 131, performing feature extraction on the audio information in the request object by using a feature extractor to obtain feature information corresponding to the identity information;
step 132, analyzing the feature information by using the speech recognition model to obtain a decoding result.
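The per-chunk flow of steps 11 to 16 (and sub-steps 131 and 132) can be summarized in the following minimal sketch. The class and function names used here (feature_extractor, speech_model, partial_decoder, final_decoder, endpoint_detector) are illustrative assumptions rather than identifiers defined in this embodiment.

```python
# Illustrative sketch of steps 11-16; all object and function names are
# assumptions made for readability, not identifiers from the patent.

def recognize_chunk(call_id, audio_chunk,
                    feature_extractor, speech_model,
                    partial_decoder, final_decoder, endpoint_detector):
    """Process one audio chunk of one call line and return (text, is_final)."""
    # Step 12: wrap the audio and the caller identity into a request object.
    request = {"callid": call_id, "audio": audio_chunk}

    # Steps 131/132: feature extraction, then frame-level decoding with the
    # speech recognition model (a TDNN in this embodiment).
    features = feature_extractor.extract(request["audio"])
    decode_result = speech_model.forward(features)

    # Step 14: endpoint (silence) detection decides which decoder to call.
    if endpoint_detector.is_silent(request["audio"]):
        # Step 16: end of an utterance -> final decoding of the whole segment.
        return final_decoder.decode(decode_result), True
    # Step 15: utterance still in progress -> intermediate (partial) text.
    return partial_decoder.decode(decode_result), False
```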
In this embodiment, a multithreading implementation is used, and specifically, as shown in fig. 3, the multi-user streaming voice recognition method includes:
step 21, acquiring audio streams of corresponding users from a plurality of call lines in a main thread, and generating a request object according to audio information and identity information;
step 22, adding the request object into the request queue in the main thread;
step 23, setting a return object corresponding to the request object in the main thread;
step 24, acquiring a request object in the request queue in the sub-thread, judging whether the identity information in the request object is a new user for first transmission, and if so, executing step 25; if not, go to step 26.
And 25, initializing the environmental parameters of the voice recognition model, and analyzing the audio information according to the request object by using the voice recognition model to obtain a decoding result corresponding to the identity information.
Step 26, directly utilizing the voice recognition model to obtain a decoding result;
step 27, judging whether the audio information in the audio stream is mute in the sub-thread, if not, executing step 28; if so, step 29 is performed.
And step 28, calling a partial decoder to analyze the decoding result to obtain intermediate identification characters corresponding to the identity information of the user of the audio stream.
Step 29, calling a final decoder, and analyzing a decoding result to obtain a final identification character corresponding to the identity information of the user of the audio stream; and assigning the character recognition result to the return object.
Step 30, the main thread processes the return object.
As shown in fig. 4, step 30 includes:
step 301, periodically inquiring whether a return object corresponding to the request object has a character recognition result or not in the main thread, and if so, executing step 302.
Step 302, displaying the character recognition result.
The speech recognition model in this embodiment adopts a TDNN (time-delay neural network) model, which is a frame-level speech recognition model and supports frame-level prediction on an audio stream, so that it supports synchronous transcription of a real-time audio stream. An endpoint detection model is used to judge the start and end of speech: speech recognition is performed while the audio stream in synchronous transcription is ongoing, stops when a new endpoint is detected, and the effective speech between the two endpoints is then decoded. When facing multiple users, the following processes and steps are carried out:
it is arranged to transmit an audio stream every other block (chunk), for example 500ms (milliseconds), on the telephone transmission line.
Step 1: before multi-user response, the following parameters are initialized in the main thread:
initializing an audio feature extractor and the initial parameters of the TDNN network model calculation;
the following objects are defined:
adaptation_state_map<callid, speaker-adaptation feature>
The callid represents the identity information of a call user. Each call has a speaker-adaptation feature belonging to the guest party of the call, which is initialized and applied to the TDNN model each time a new audio stream block is recognized.
decoder2cache_map<callid, decoding cache>
The decoding cache (which contains, for example, an initial audio feature extractor) is used to store the complete decoding process of each call. When a user speaks continuously, the speech is split into several chunks that are requested in order, and the decoding cache is called for partial decoding of each chunk; when the endpoint detection model detects that the user has finished speaking, the last chunk of the utterance can be completely decoded, and the decoding cache is called for complete decoding; when the user starts speaking again, the sub-thread re-initializes a decoding cache for the audio stream of the newly started chunk and performs partial decoding.
tasks_lookup_map<key, decoding result>
The key is a request-object key value generated from the chunk number reqUuid of the audio stream and the identity information corresponding to the audio stream. It is used for storing the real-time decoding result after each partial or complete decoding; results are stored and retrieved by this key.
chunk_accepted_map<callid, number of chunks accepted>
Used for storing the number of reqUuid values received for each telephone call; each incoming chunk is assigned a new sequence value seq indicating the order of the chunk, so that it can be tracked which chunks have been decoded.
chunk_decoded_map<callid, number of chunks decoded>
Used for storing the number of decoded reqUuid values for each telephone call, where reqUuid is the number of each chunk audio stream; the count is increased by 1 each time a chunk is successfully partially or completely decoded. If the seq of the chunk currently to be decoded minus 1 is not equal to chunk_decoded_map[callid], the previous chunk of the current callid has not been decoded yet, and the thread should enter an idle mode and wait for the previous chunk's decoding to finish before decoding.
decoder2vad_state_map<callid, decoding VAD state>
The decoding state is defined as: vad_state = 0 means not activated, vad_state = 1 means activated.
At initialization, the state of the last frame is set to 1 (activated).
request_queue message consumption buffer queue
Used for: when a chunk audio stream of each call is requested to the server, the information object of the current chunk is stored in request_queue with key = callid + reqUuid, and the main thread then continuously takes <key, audio stream> pairs out of request_queue and hands them to the sub-threads for decoding.
thread_contexts network cache pool
Used for storing the sub-thread contexts initialized at program startup; the number of sub-threads can be set according to the specific usage scenario, and the default initialization number is 40 in this embodiment. A consolidated sketch of these structures follows.
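A minimal Python illustration of the server-side state defined above; the variable names mirror the maps just described, while the concrete container types are assumptions.

```python
import queue

# Illustrative sketch of the Step 1 server-side state; types are assumptions.

NUM_WORKER_THREADS = 40          # default sub-thread count in this embodiment

adaptation_state_map = {}        # callid -> speaker-adaptation feature for the TDNN
decoder2cache_map = {}           # callid -> decoding cache for the ongoing utterance
tasks_lookup_map = {}            # key (callid + reqUuid) -> consumer return object / result
chunk_accepted_map = {}          # callid -> number of chunks received (sequence value seq)
chunk_decoded_map = {}           # callid -> number of chunks already decoded
decoder2vad_state_map = {}       # callid -> decoding VAD state (0 = not activated, 1 = activated)
request_queue = queue.Queue()    # message-consumption buffer queue of request objects
thread_contexts = []             # network cache pool: one context per worker sub-thread
```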
It is then judged whether the byte-stream length of the audio stream is within a preset range and whether the audio sampling rate is 8 kHz (kilohertz); if the sampling rate is 8 kHz, the following steps continue.
Step 2: a plurality of audio streams are transmitted simultaneously over a plurality of call lines. The audio stream in each call line carries the identity information and audio information (callid, audio) of the related user, where callid represents the identity information of the user; together with the number reqUuid of the chunk stream this forms a key-value tuple <callid, reqUuid, chunk>, where reqUuid is the number of the chunk audio stream at each time and chunk is the specific content of the corresponding audio stream. The chunk number reqUuid and the identity information callid corresponding to the audio stream are then combined into the unique key of a request object. After the request is sent to the algorithm server, the server adds the request to its request consumption queue and at the same time initializes the consumer return object ResponseInfo(key) corresponding to the key and adds it by key to the return-object map tasks_lookup_map.
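A minimal sketch of this request handling, continuing the structures defined in the sketch above. The field layout of ResponseInfo, the key format and the byte-length bounds are assumptions made only for illustration.

```python
# Illustrative sketch of Step 2: one chunk of one call line becomes a request object.

class ResponseInfo:
    """Consumer return object; a sub-thread later fills in the decoding result."""
    def __init__(self, key):
        self.key = key
        self.text = None          # character recognition result, None until decoded

def accept_chunk(callid, req_uuid, chunk, sample_rate, min_len=160, max_len=160000):
    # Sanity checks on the incoming audio stream (byte-length range and 8 kHz rate).
    if not (min_len <= len(chunk) <= max_len) or sample_rate != 8000:
        return None

    chunk_accepted_map[callid] = chunk_accepted_map.get(callid, 0) + 1
    seq = chunk_accepted_map[callid]            # order of this chunk within the call

    key = f"{callid}:{req_uuid}"                # unique key of the request object
    tasks_lookup_map[key] = ResponseInfo(key)   # register the consumer return object
    request_queue.put({"key": key, "callid": callid, "seq": seq, "audio": chunk})
    return key
```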
The consumption function of a thread continuously monitors whether a request object exists in the consumption queue; if so, it pops the request object and calls the consumption basis function of the sub-thread to perform the decoding operation, and when the consumption basis function finishes, the decoding result is stored into the return-object map tasks_lookup_map.
The processing flow of the child thread is shown in Step 4.
Step 3: after the request object of the current identity information has been added to the consumption queue of the thread, the return object of the current identity information in the return-object map is monitored at specified time intervals for a decoding result. If a decoding result exists, it is returned directly; if not, monitoring continues at the specified time interval.
For example, it is queried every 10 ms whether the key exists in tasks_lookup_map (i.e., whether a decoding result is available); if so, the decoding result is fetched and the key is removed from tasks_lookup_map.
The above is the execution content of the main thread.
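A minimal sketch of this polling loop, continuing the structures defined above; the 10 ms interval comes from the example, while the loop shape itself is an assumption.

```python
import time

# Illustrative sketch of Step 3: the main thread polls the return object
# until a decoding result is available, then removes the key.

def wait_for_result(key, poll_interval_s=0.01):
    while True:
        response = tasks_lookup_map.get(key)
        if response is not None and response.text is not None:
            tasks_lookup_map.pop(key, None)   # remove the key once the result is taken
            return response.text
        time.sleep(poll_interval_s)
```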
Step 4: executing the sub-thread consumption basis function decoder_core:
It is determined whether the current callid belongs to a new user (speaker), i.e. whether this is its first transmission.
if so:
the adaptation information feature of the new speaker, i.e. adaptation_state_map<callid, speaker-adaptation feature>, is initialized.
The vad_state of the decoding VAD state corresponding to this callid is set to 1, indicating that decoding has been activated.
It is then judged whether the vad_state of the current callid is 1; if so, the environment of the current callid is re-initialized, i.e. a new feature extractor and a new decoding cache are set, and the current vad_state is set to 0, which represents starting a new decoding process for this callid.
If not:
the decoding cache is taken directly from decoder2cache_map.
The decoding result of the current audio stream is calculated using a decoding buffer, a feature extractor, and a network calculator.
An endpoint model is used to detect whether the current audio stream is an endpoint. If so: calling a final decoder to decode the characters, and setting vad _ state to be 1.
If not, only the partial decoder is used to decode the characters.
The consumer return object ResponseInfo(key) of the current callid is taken out of tasks_lookup_map, and the decoded text result is assigned to ResponseInfo(key).
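Continuing the sketches above, an illustrative version of decoder_core follows. The helper functions (init_speaker_adaptation, new_decode_cache, run_speech_model, is_endpoint, partial_decode, final_decode) are assumptions, and the "first transmission" check is implemented here simply as a missing entry in adaptation_state_map, which is also an assumption.

```python
# Illustrative sketch of the sub-thread consumption basis function decoder_core.

def decoder_core(request):
    callid, key, audio = request["callid"], request["key"], request["audio"]

    # New user (first transmission): initialize speaker-adaptation state and
    # mark the decoding VAD state as activated (1).
    if callid not in adaptation_state_map:
        adaptation_state_map[callid] = init_speaker_adaptation()
        decoder2vad_state_map[callid] = 1

    if decoder2vad_state_map.get(callid) == 1:
        # vad_state == 1: re-initialize the environment of this callid
        # (new feature extractor and decoding cache), then set vad_state to 0.
        decoder2cache_map[callid] = new_decode_cache()
        decoder2vad_state_map[callid] = 0
    # Otherwise the decoding cache is taken directly from decoder2cache_map.
    cache = decoder2cache_map[callid]

    # Decode the current chunk using the cache, the feature extractor and the TDNN.
    decode_result = run_speech_model(cache, adaptation_state_map[callid], audio)

    # Endpoint detection decides between partial and final decoding.
    if is_endpoint(audio):
        text = final_decode(cache, decode_result)
        decoder2vad_state_map[callid] = 1      # the next chunk starts a new utterance
    else:
        text = partial_decode(cache, decode_result)

    # Assign the decoded text result to the consumer return object.
    tasks_lookup_map[key].text = text
    chunk_decoded_map[callid] = chunk_decoded_map.get(callid, 0) + 1
```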
Step 5: the thread pool scheduling function executeWorker:
If request_queue is empty, the thread is made to wait.
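The original publication shows the executeWorker routine only as an image, so the sketch below is an assumed reconstruction that continues the sketches above; the in-order wait on chunk_decoded_map follows the description of that map.

```python
import queue
import threading
import time

# Illustrative sketch of the thread pool scheduling function executeWorker.

def execute_worker():
    while True:
        try:
            request = request_queue.get(timeout=0.1)
        except queue.Empty:
            time.sleep(0.01)       # request_queue is empty: let the thread wait
            continue

        callid, seq = request["callid"], request["seq"]
        # If the previous chunk of this call has not been decoded yet, wait idly
        # so that the chunks of one call are decoded strictly in order.
        while chunk_decoded_map.get(callid, 0) != seq - 1:
            time.sleep(0.001)

        decoder_core(request)
        request_queue.task_done()

# One worker per context in the network cache pool (40 by default).
workers = [threading.Thread(target=execute_worker, daemon=True)
           for _ in range(NUM_WORKER_THREADS)]
for w in workers:
    w.start()
```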
Through the above steps, the synchronous transcription process control of the network cache pool for multi-user response is completed, so that the process supports multiple users.
Determining whether the audio stream is silent using an endpoint detection model in step 27;
the endpoint detection model in this embodiment adopts the following structure:
as shown in fig. 5, the end point detection model includes an input layer 41, an audio CNN layer 42, a splice layer 43, a convolutional layer 44, a full connection layer 45, and an output layer 46.
The input layer 41 is used for receiving frame information of the test audio and extracting features from the frame information to obtain features of a preset dimension.
The audio CNN layer 42 includes a plurality of one-dimensional convolution layers 421, a first pooling layer 422, and a Flatten layer 423 that are parallel to each other, and the audio CNN layer 42 is configured to perform convolution calculation on a feature of a preset dimension through each one-dimensional convolution layer 421 and obtain first feature data corresponding to each one-dimensional convolution layer 421; the first pooling layer 422 is connected to the tail of the one-dimensional convolutional layer 421, and the Flatten layer 423 is connected to the tail of the first pooling layer 422; the first pooling layer 422 is configured to pool a result obtained by performing convolution calculation on the feature of the preset dimension by the one-dimensional convolution layer 421; the Flatten layer 423 is used for performing Flatten on the pooled results to obtain first characteristic data.
The splicing layer 43 is used for connecting the first characteristic data corresponding to each path of one-dimensional convolution layer to obtain second characteristic data;
the convolutional layer 44 comprises a plurality of two-dimensional convolutional layers 441 and a second pooling layer 442, wherein the two-dimensional convolutional layers 441 are used for performing layer-by-layer convolution calculation on the second characteristic data to obtain third characteristic data; the tail of each convolutional layer 441 in the multi-layer convolutional layer is connected with a second pooling layer 442, and the second pooling layer 442 is used for pooling results of the layer-by-layer convolution calculation to obtain the third feature data.
The full link layer 45 is used to obtain the probability of the frame information according to the third feature data.
The fully-connected layer 45 includes a random deactivation (Dropout) layer 452 and multiple fully-connected network (Dense) layers 451, and the random deactivation layer 452 is disposed between two adjacent Dense layers 451.
The output layer 46 is used for obtaining the judgment result of whether the test audio is silent (i.e. an endpoint) according to the probability.
The steps for training the endpoint detection model are as follows:
In this embodiment, the frame length of the audio for the endpoint detection model is set to 50 ms and the frame shift to 25 ms. The frame information of each frame is subjected to feature extraction using a raw spectrogram feature extraction method for speech signals, and the feature dimension is set to (128, 1). To fully consider the context information of adjacent frames, the frame information of the current frame and of its two adjacent frames (one to the left and one to the right), three frames in total, is used as a three-channel joint input, so the network input size is designed as (128, 3). Three channels are used in this embodiment; the number of channels can be adjusted according to the actual situation and is not specifically limited here.
Audio information in a telephone scenario has two domains: the time domain and the frequency domain. In this embodiment three parallel one-dimensional convolution paths are used, one for each of three scales (kernel sizes 1, 2 and 3), with the number of filters set to 100. Each one-dimensional convolution is followed by a one-dimensional pooling layer whose size is set to 5 - kernel_size + 1, and the pooling output is then flattened. After applying these structures at the three different scales, their results are concatenated (concat). After the concatenation, the convolution sizes of the two convolutional layers are set to 3x3x128 and 3x3x256 respectively, the dimensionalities of the two fully-connected network layers are both 512, the dropout rate of the random deactivation layer between the two fully-connected layers is set to 0.2, and finally the fully-connected layer of dimensionality 512 is connected to an output that gives the probability of whether the frame is an endpoint.
For the frame information of one frame, the endpoint detection model uses two labels: endpoint and non-endpoint. In the training stage of the endpoint detection model, the learning rate can be initialized to 0.0001, the learning-rate decay coefficient is set to 0.000001, and the loss function is set to the two-class (binary) cross entropy; other commonly used loss functions can also be chosen.
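A minimal Keras sketch of the endpoint detection network and training setup described above is given below. The arrangement of the concatenated branch outputs into a 2-D map before the two 3x3 convolutions, the ReLU activations, the sigmoid output head and the choice of Adam as optimizer are assumptions made only so that the sketch is runnable; they are not specified in this embodiment, and the learning-rate decay coefficient is omitted for simplicity.

```python
from tensorflow.keras import layers, models, optimizers

def build_endpoint_model(feat_dim=128, channels=3):
    inp = layers.Input(shape=(feat_dim, channels))            # input layer: (128, 3)

    # Audio CNN layer: three parallel 1-D convolutions with kernel sizes 1/2/3,
    # 100 filters each, 1-D pooling of size 5 - kernel_size + 1, then Flatten.
    branches = []
    for k in (1, 2, 3):
        x = layers.Conv1D(100, kernel_size=k, activation='relu')(inp)
        x = layers.MaxPooling1D(pool_size=5 - k + 1)(x)
        x = layers.Flatten()(x)
        branches.append(x)

    # Splicing layer: concatenate the three branches (second feature data).
    x = layers.Concatenate()(branches)

    # Assumed rearrangement into a 2-D map so the 3x3 convolutions can be applied;
    # 9800 = 25*100 + 31*100 + 42*100 for a 128-dimensional input.
    x = layers.Reshape((98, 100, 1))(x)

    # Convolution layer: two 3x3 convolutions with 128 and 256 filters,
    # each followed by a pooling layer (third feature data).
    x = layers.Conv2D(128, (3, 3), activation='relu')(x)
    x = layers.MaxPooling2D((2, 2))(x)
    x = layers.Conv2D(256, (3, 3), activation='relu')(x)
    x = layers.MaxPooling2D((2, 2))(x)
    x = layers.Flatten()(x)

    # Fully connected layer: two Dense(512) layers with Dropout(0.2) between them.
    x = layers.Dense(512, activation='relu')(x)
    x = layers.Dropout(0.2)(x)
    x = layers.Dense(512, activation='relu')(x)

    # Output layer: probability that the frame is an endpoint (silence).
    out = layers.Dense(1, activation='sigmoid')(x)

    model = models.Model(inp, out)
    # Training setup from the embodiment: learning rate 0.0001, binary cross entropy.
    model.compile(optimizer=optimizers.Adam(learning_rate=1e-4),
                  loss='binary_crossentropy', metrics=['accuracy'])
    return model
```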
To improve the accuracy of the model on data, a method of iterating the model over the data is adopted: a model can first be built from a coarser set of sample audio data; this model is then used to predict each frame of the whole audio for the existing training and test data, the incorrectly predicted frames are recorded, and the proportion of correctly predicted valid frames to the total number of valid frame records is computed as the detection and recognition accuracy of the model:
the end point activation area identifies the correct rate 1-incorrect number of frame records/total number of valid frame records.
When the model detection accuracy is below a preset threshold, for example 30%, the data can be manually trimmed (corrected); after trimming it is added to the training set and the model is trained again. The above steps are repeated, trimming again when the accuracy falls below a lower preset threshold, such as 20% or 10%.
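A small illustrative helper for the accuracy measure defined above; the per-frame prediction/label representation is an assumption.

```python
# Minimal sketch of the frame-level accuracy measure used during data iteration.

def endpoint_accuracy(predicted_frames, labeled_frames):
    """Both arguments are per-frame endpoint flags for the valid frames of one audio."""
    incorrect = sum(1 for p, y in zip(predicted_frames, labeled_frames) if p != y)
    return 1.0 - incorrect / len(labeled_frames)

# Data whose accuracy falls below the current threshold (30%, then 20%, then 10%)
# is manually corrected and added back to the training set before retraining.
```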
The endpoint detection model of the embodiment can realize endpoint detection and identification of the voice content of the conversation audio in the OTA industry through the input layer, the audio CNN layer, the splicing layer, the convolution layer, the full connection layer and the output layer.
This embodiment addresses the streaming speech recognition currently required in the OTA industry when intelligent telephone customer service establishes telephone communication with multiple users in parallel. By establishing a request object corresponding to each user's identity information, analyzing the audio stream corresponding to each user's identity information with a voice recognition model to obtain a decoding result for each user, and decoding the characters of the effective speech segments between silences based on the judgment of silence in the audio stream, the embodiment greatly increases the speed of audio recognition compared with traditional offline voice recognition that transmits a long audio recording, speeds up the call handling of customer service agents, improves the response speed of the OTA intelligent customer service, and spares users from waiting for lengthy character recognition during a conversation, improving their call experience.
The endpoint detection model in this embodiment ensures that, for the same telephone call, synchronous streaming voice recognition is performed only on effective speech: when the endpoint detection model recognizes a valid endpoint, synchronous streaming voice recognition performs real-time synchronous recognition of the audio stream between the recognized start point and end point of the effective speech.
Example 2
The present embodiment provides a multi-user streaming voice recognition system, as shown in fig. 6, the multi-user streaming voice recognition system includes an initialization module 00, a transmission module 01, a generation module 02, a decoding module 03, and an identification module 04;
the initialization module 00 is used for initializing the feature extractor, for initializing the initial parameters of the voice recognition model, and for initializing a voice decoder, which comprises a partial decoder and a final decoder;
the transmission module 01 acquires audio streams of a plurality of corresponding users from a plurality of call lines, wherein the audio streams comprise audio information and identity information of the users;
the generating module 02 is used for generating a request object according to the audio information and the identity information;
the decoding module 03 is configured to analyze the audio information in the request object by using the speech recognition model to obtain a decoding result corresponding to the identity information;
the identification module 04 is configured to determine whether the audio information in the audio stream is silent; if not, it calls a partial decoder to analyze the decoding result to obtain intermediate identification characters, and if so, it calls a final decoder to analyze the decoding result to obtain final identification characters.
Preferably, the speech recognition model is a time-delay neural network model;
the decoding module 03 is further configured to perform feature extraction on the audio information in the request object by using a feature extractor to obtain feature information corresponding to the identity information; and analyzing the characteristic information by using the voice recognition model to obtain a decoding result.
In the embodiment, a thread mode is adopted for management, and the multi-user streaming type voice recognition system comprises a main thread and a sub-thread;
the main thread is used for acquiring audio streams of corresponding users from a plurality of call lines and generating a request object according to the audio information and the identity information;
the main thread is also used for adding the request object into the request queue;
the main thread is also used for setting a return object corresponding to the request object;
the sub-thread is used for acquiring a request object in the request queue and judging whether the identity information in the request object is a new user for first transmission, if so, initializing the environmental parameters of the voice recognition model, and also used for analyzing the audio information according to the request object by using the voice recognition model to obtain a decoding result corresponding to the identity information; if not, directly utilizing the voice recognition model to obtain a decoding result;
the sub-thread is also used for judging whether the audio stream is silent; if not, it calls a partial decoder to analyze the decoding result to obtain intermediate identification characters corresponding to the user identity information of the audio stream, and if so, it calls a final decoder to analyze the decoding result to obtain final identification characters corresponding to the user identity information of the audio stream, and assigns the character recognition result to the return object;
the main thread is also used to process the return object.
Preferably, the main thread is further configured to periodically query whether a text recognition result exists in a returned object corresponding to the request object in the main thread, and if so, display the text recognition result.
Preferably, the identification module is further configured to determine whether the audio stream is silent using an endpoint detection model.
The speech recognition model in this embodiment adopts a TDNN (time delay neural network) network model, which is a frame-level speech recognition model and supports a frame-level prediction model of an audio stream, so as to support synchronous transcription of a real-time audio stream. The method comprises the following steps of judging the start and the end of audio by adopting an endpoint detection model, carrying out voice recognition when an audio stream in synchronous transcription starts, stopping the voice recognition when a new endpoint is restarted, then carrying out effective voice decoding between the two endpoints, and carrying out the following processes and steps when facing multiple users:
step 1: before multi-user response, the following parameters are initialized in the main thread:
initializing an audio feature extractor and calculating initial parameters of a network model by using the TDNN;
the following objects are defined:
it is arranged to transmit an audio stream every other block (chunk), for example 500ms (milliseconds), on the telephone transmission line.
Step 1: before multi-user response, the following parameters are initialized in the main thread:
initializing an audio feature extractor and calculating initial parameters of a network model by using the TDNN;
the following objects are defined:
adaptation _ state _ map < callld, speaker adaptive feature >
The callid represents identity information of a call user, each call has a speaker adaptive feature belonging to a guest party of the call, and the speaker adaptive feature belonging to the guest party of the call is initialized and applied to the TDNN model when a new audio stream block is identified each time.
decoder2cache _ map < callid, decode cache >
The decoding buffer (such as an initial audio feature extractor) is used for storing a complete decoding process of each call, when a user continuously speaks, the decoding buffer is divided into a plurality of chunks to make requests according to the sequence, and the decoding buffer is called for partial decoding when each chunk is partially decoded; when the endpoint detection model detects that the user finishes speaking, the last chunk information of the finish speaking can be completely decoded, and then the decoding cache is called for complete decoding; when the user resumes speaking, the sub-thread re-initializes a decode buffer to the audio stream of the newly started speech chunk information for partial decoding.
task _ lookup _ map < key, decoding result >
The key is a request object key value generated by the audio stream chunk number reqUuid and the identity information corresponding to the audio stream information, is used for storing a real-time decoding result after each partial decoding or complete decoding, and is stored or taken out according to the key.
chunk _ accepted _ map < callid, number of times chunk has been accepted >
For storing the number of received reqUuid times per phone call, each chunk is passed in with a new sequence value seq, indicating the order of the chunk, which sequence value seq has been decoded.
chunk _ decoded _ map < callid, number of times chunk has been decoded >
For: for storing the number of decoded requuids per phone call, reqUuid representing the number of the chunk audio stream at each time, plus 1 for each time the chunk is partially or completely decoded successfully, the currently decoded seq-1! When chunk _ decoded _ map [ valid ], it is found that the last chunk of the current valid has not been decoded yet, and the thread should enter idle mode to wait for the end of the last chunk decoding before decoding.
decoder2vad _ state _ map < callid, decoding vad state >
Defining a decoding state: vad 0 not activated vad 1 activated
Initializing the last frame, setting 1 to be activated
vadrequest _ queue message consumption buffer queue
For: when a chunk audio stream for each call is requested to the server, the current chunk related information object is stored in reqest _ queue according to key ═ callid + reqUuid, and then the main thread continuously takes out from the request _ queue and lets < key, audio stream > be handed to the sub-thread for decoding.
thread _ contexts network cache pool
The number of the sub-threads used for storing the initialization at the time of program startup may be set according to a specific usage scenario, and the default initialization number is 40 in this embodiment.
Judging whether the byte stream length of the audio stream is in a preset range, judging whether the audio sampling rate is 8kHz (kilohertz), and continuing the following steps if the audio sampling rate is 8 kHz.
Step 2: the method comprises the steps that a plurality of audio streams are transmitted in a plurality of call lines simultaneously, the audio stream in each call line carries identity information and audio information (callid, audio) related to a user, wherein callid represents the identity information of the user, the number reqUuid of a chunk stream is formed, and a key value pair < callid, rqUuid, chunk > is formed, reqUuid represents the number of the chunk audio stream at each time, chunk is the specific content of the corresponding audio stream, then the chunk number reqUuid and the identity information callid corresponding to the audio stream information are generated into a unique key and a reqUuid of a request object, after the request is sent to an algorithm service end, the algorithm service end adds the request into a request consumption queue of the service end, and simultaneously, a consumer corresponding to the key is initialized to return an object responself (key) and added into a tab _ lookup _ up of the return object by the key.
And continuously monitoring whether a request object exists in the consumption queue or not by the consumption function of the thread, popping up the request object if the request object exists, calling the consumption basic function of the sub-thread to perform decoding operation, and storing a decoding result into the tasks _ lookup _ map of the returned object when the consumption basic function is finished.
The processing flow of the child thread is shown in Step 4.
Step 3: after the request object of the current identity information is added into the consumption queue of the thread, the returned object of the current identity information of the map of the returned object is monitored to have a decoding result at specified time intervals. If the decoding result exists, the decoding result is directly returned. If no decoding result exists, monitoring is continued at specified time intervals.
For example, whether a request object exists in the consumption queue is queried every 10ms, that is, whether a key exists in the tasks _ lookup _ map or not, if yes, a decoding result is fetched, and the key is removed from the tasks _ lookup _ map.
The above is the execution content of the main thread.
Step 4: executing the child thread consuming basis function decoder _ core:
it is determined whether the current callid is the new user speaker, i.e. whether the first transmission,
if so:
the adaptation information feature of the new speaker, i.e., adaptation _ state _ map < valid, speaker adaptation feature > is initialized.
The vad _ state of the decoded vad state corresponding to the callled is set to 1, which indicates that the decoding has been activated.
Judging whether the vad _ state of the current callled is 1, if so, reinitializing the environment of the current callled, namely setting a new feature extractor and setting a new decoding cache, and simultaneously setting the current vad _ state to be 0, which represents starting a new callled decoding process.
If not:
the decode buffer is taken directly from the decoder2cache _ map.
The decoding result of the current audio stream is calculated using a decoding buffer, a feature extractor, and a network calculator.
An endpoint model is used to detect whether the current audio stream is an endpoint. If so: calling a final decoder to decode the characters, and setting vad _ state to be 1.
If not, only partial decoder is used to decode the character.
The consumer return object ResponseInfo (key) of the current callled is taken out from the tasks _ lookup _ map, and the decoded text result is assigned to ResponseInfo (key).
Step 4: a thread pool scheduling function executeWorker:
Figure BDA0002336652040000191
if request _ queue is empty, let the thread wait.
Through the steps, the synchronous transfer process control of the network cache pool of the multi-user response is completed, so that the process supports multiple users.
Determining whether the audio stream is silent using an endpoint detection model in step 27;
the endpoint detection model in this embodiment adopts the following structure:
as shown in fig. 5, the end point detection model includes an input layer 41, an audio CNN layer 42, a splice layer 43, a convolutional layer 44, a full connection layer 45, and an output layer 46.
The input layer 1 is used for receiving frame information of telephone background music and extracting features of the frame information to obtain features of preset dimensions.
The audio CNN layer 42 includes a plurality of one-dimensional convolution layers 421, a first pooling layer 422, and a Flatten layer 423 that are parallel to each other, and the audio CNN layer 42 is configured to perform convolution calculation on a feature of a preset dimension through each one-dimensional convolution layer 421 and obtain first feature data corresponding to each one-dimensional convolution layer 421; the first pooling layer 422 is connected to the tail of the one-dimensional convolutional layer 421, and the Flatten layer 423 is connected to the tail of the first pooling layer 422; the first pooling layer 422 is configured to pool a result obtained by performing convolution calculation on the feature of the preset dimension by the one-dimensional convolution layer 421; the Flatten layer 423 is used for performing Flatten on the pooled results to obtain first characteristic data.
The splicing layer 43 is used for connecting the first characteristic data corresponding to each path of one-dimensional convolution layer to obtain second characteristic data;
the convolutional layer 44 comprises a plurality of two-dimensional convolutional layers 441 and a second pooling layer 442, wherein the two-dimensional convolutional layers 441 are used for performing layer-by-layer convolution calculation on the second characteristic data to obtain third characteristic data; the tail of each convolutional layer 441 in the multi-layer convolutional layer is connected with a second pooling layer 442, and the second pooling layer 442 is used for pooling results of the layer-by-layer convolution calculation to obtain the third feature data.
The full link layer 45 is used to obtain the probability of the frame information according to the third feature data.
The fully-connected layer 45 includes a random deactivation layer Dropout layer 452 and a multi-layered fully-connected network layer, i.e., a sense layer 451, and the random deactivation layer 452 is disposed between adjacent two of the multi-layered sense layer 451.
The output layer 46 is used for obtaining the judgment result of whether the music is the telephone background music according to the probability.
The steps for training the endpoint detection model are as follows:
in this embodiment, the frame length of the audio of the endpoint detection model is set to 50ms, the frame shift is 25ms, the frame information of each frame is subjected to feature extraction by using an original spectrogram feature extraction method in the speech signal, and assuming that the feature dimension is set to (128,1), in order to fully consider the context information of adjacent frames, the frame information of the current frame and the frame information of two adjacent frames, namely three frames, left and right, are used as three-channel joint input, so that the network input size is designed to be (128, 3'), which is designed to be three channels in this embodiment, the number of the channels can be adjusted and designed according to the actual situation, and is not specifically limited herein.
There are two domains of audio information in the telephony scenario: in the time domain and the frequency domain, in this embodiment, three paths of parallel one-dimensional convolution layers are used, three paths of one-dimensional convolutions with three scales (scales kenerl size is 1, 2, and 3, respectively) respectively corresponding to the three paths are designed, the number of filters is set to 100, one-dimensional pooling layer is next followed after each one-dimensional convolution, the size (size) of one-dimensional pooling layer is set to 5-kernel size +1, and then the pooling layer is flattened. After using three different dimensions of such structures, their results were concat (ligated). After concat, the convolution sizes of the two convolutional layers are respectively set to be (3x3x128 and 3x3x256), the dimensionalities of the two fully-connected network layers are both 512, the random inactivation layer between the two fully-connected network layers is set to be 0.2, and finally the fully-connected layer with the dimensionality of 512 is connected to distinguish the probability of whether the convolution layer is an endpoint or not.
For the frame information of one frame, the endpoint detection model defines two labels: endpoint and non-endpoint. In the training stage of the endpoint detection model, the learning rate can be initialized to 0.0001, the learning attenuation coefficient can be set to 0.000001, and the loss function can be set to the two-class cross entropy; other commonly used loss functions can also be selected.
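A possible training configuration matching these hyper-parameters is sketched below; the choice of the Adam optimizer and of an inverse-time-decay schedule for the learning attenuation coefficient is an assumption, since the text names only the learning rate, the decay coefficient and the two-class cross-entropy loss.

```python
# Hedged training-configuration sketch for the endpoint detection model above.
schedule = tf.keras.optimizers.schedules.InverseTimeDecay(
    initial_learning_rate=1e-4,   # learning rate initialized to 0.0001
    decay_steps=1,
    decay_rate=1e-6)              # learning attenuation coefficient 0.000001
model = build_endpoint_model()
model.compile(optimizer=tf.keras.optimizers.Adam(schedule),
              loss='categorical_crossentropy',   # two-class cross entropy over one-hot labels
              metrics=['accuracy'])
```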
To improve the accuracy of the model on its data, a method of iterating the model over the data is adopted. A model can first be built from a coarsely labelled sample audio data set; the model is then used to predict each frame of every whole audio segment in the existing training and test data, the frames with incorrect predictions are recorded, and the proportion of correctly predicted valid frames to the total number of valid frame records is calculated as the model detection and identification accuracy:
the end point activation area identifies the correct rate 1-incorrect number of frame records/total number of valid frame records.
Audio for which the model detection accuracy falls below a preset threshold, for example 30%, can be manually re-annotated (trimmed) and then added back into the training set to train the model again. The above steps are repeated, with re-annotation again performed when the accuracy falls below a smaller preset threshold, such as 20% or 10%.
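The data-iteration procedure above can be sketched as follows; the `dataset` structure of (features, labels) pairs and the prediction interface are illustrative assumptions.

```python
import numpy as np

def frame_accuracy(pred_labels, true_labels):
    """Endpoint activation area identification accuracy as defined above."""
    wrong = int(np.sum(np.asarray(pred_labels) != np.asarray(true_labels)))
    return 1 - wrong / len(true_labels)

def one_refinement_round(model, dataset, threshold=0.3):
    """Flag utterances whose frame-level accuracy falls below the threshold so
    they can be manually re-annotated and added back to the training set."""
    to_review = []
    for feats, labels in dataset:                      # feats: (num_frames, 128, 3)
        preds = model.predict(feats).argmax(axis=-1)   # predicted endpoint/non-endpoint per frame
        if frame_accuracy(preds, labels) < threshold:
            to_review.append((feats, labels))
    return to_review
```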
The endpoint detection model of this embodiment can thus realize endpoint detection and identification of the voice content of conversation audio in the OTA industry through the input layer, the audio CNN layer, the splicing layer, the convolution layer, the fully-connected layer and the output layer.
This embodiment addresses the streaming speech recognition currently required when a smart phone system in the OTA industry establishes telephone communications with multiple users in parallel. A request object is established for each user's identity information, and the audio stream corresponding to each user's identity information is analyzed with a speech recognition model to obtain a decoding result for that user; by judging the silences in the audio stream, the text of each valid speech segment between silences is decoded. Compared with the traditional off-line speech recognition method of transmitting a long audio recording, the recognition speed is greatly improved, the call-handling speed of the customer service and the response speed of the OTA intelligent customer service are improved, the user no longer needs to wait a long time for text recognition during the call, and the user's call experience is improved.
The endpoint detection model in this embodiment ensures that synchronous streaming speech recognition is not performed indiscriminately on the whole call: only when the endpoint detection model identifies a valid endpoint does the synchronous streaming recognition perform real-time recognition on the audio stream between the start point and the end point of the identified valid speech.
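A minimal sketch of the silence-gated dispatch between the partial decoder and the final decoder is given below; the `recognizer`, `partial_decoder`, `final_decoder` and `endpoint_model` objects and their method names are hypothetical placeholders for the components described in this embodiment.

```python
from dataclasses import dataclass, field

@dataclass
class UserSession:
    user_id: str
    buffer: list = field(default_factory=list)   # per-user decoding results accumulated so far

def handle_chunk(session, chunk, recognizer, partial_decoder, final_decoder, endpoint_model):
    """Handle one streaming audio chunk for one user: decode it with the speech
    recognition model, then emit intermediate text while speech continues and
    final text once a silence/endpoint is detected (all interfaces hypothetical)."""
    session.buffer.append(recognizer.decode(chunk))          # decoding result for this chunk
    if endpoint_model.is_silent(chunk):
        text = final_decoder.finalize(session.buffer)        # final recognized characters
        session.buffer.clear()
        return 'final', text
    return 'partial', partial_decoder.peek(session.buffer)   # intermediate recognized characters
```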
Example 3
Fig. 7 is a schematic structural diagram of an electronic device according to embodiment 3 of the present invention. The electronic device comprises a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the multi-user streaming speech recognition method of embodiment 1 when executing the program. The electronic device 50 shown in fig. 7 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiment of the present invention.
As shown in fig. 7, the electronic device 50 may be embodied in the form of a general purpose computing device, which may be, for example, a server device. The components of the electronic device 50 may include, but are not limited to: the at least one processor 51, the at least one memory 52, and a bus 53 connecting the various system components (including the memory 52 and the processor 51).
The bus 53 includes a data bus, an address bus, and a control bus.
The memory 52 may include volatile memory, such as a random access memory (RAM) 521 and/or a cache memory 522, and may further include a read-only memory (ROM) 523.
Memory 52 may also include a program/utility 525 having a set (at least one) of program modules 524, such program modules 524 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
The processor 51 executes various functional applications and data processing, such as the multi-user streaming voice recognition method provided in embodiment 1 of the present invention, by executing the computer program stored in the memory 52.
The electronic device 50 may also communicate with one or more external devices 54 (e.g., a keyboard, a pointing device, etc.). Such communication may occur through an input/output (I/O) interface 55. The electronic device 50 may also communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) via a network adapter 56. As shown, the network adapter 56 communicates with the other modules of the electronic device 50 over the bus 53. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 50, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID (disk array) systems, tape drives, data backup storage systems, etc.
It should be noted that although several units/modules or sub-units/modules of the electronic device are mentioned in the above detailed description, such a division is merely exemplary and not mandatory. Indeed, according to embodiments of the invention, the features and functions of two or more of the units/modules described above may be embodied in one unit/module; conversely, the features and functions of one unit/module described above may be further divided so as to be embodied by a plurality of units/modules.
Example 4
The present embodiment provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the steps of the multi-user streaming speech recognition method provided in embodiment 1.
More specific examples of the readable storage medium may include, but are not limited to: a portable disk, a hard disk, a random access memory, a read-only memory, an erasable programmable read-only memory, an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In a possible implementation, the invention can also be implemented in the form of a program product comprising program code; when the program product is run on a terminal device, the program code causes the terminal device to perform the steps of the multi-user streaming speech recognition method of embodiment 1.
The program code for carrying out the invention may be written in any combination of one or more programming languages, and may execute entirely on the user device, partly on the user device as a stand-alone software package, partly on the user device and partly on a remote device, or entirely on the remote device.
While specific embodiments of the invention have been described above, it will be appreciated by those skilled in the art that this is by way of example only, and that the scope of the invention is defined by the appended claims. Various changes and modifications to these embodiments may be made by those skilled in the art without departing from the spirit and scope of the invention, and these changes and modifications are within the scope of the invention.

Claims (12)

1. A multi-user streaming voice recognition method, characterized in that the multi-user streaming voice recognition method comprises:
acquiring audio streams of a plurality of corresponding users from a plurality of call lines, wherein the audio streams comprise audio information and identity information of the users;
generating a request object according to the audio information and the identity information;
analyzing the audio information in the request object by using a voice recognition model to obtain a decoding result corresponding to the identity information;
and judging whether the audio information in the audio stream is silent, if not, calling a partial decoder to analyze the decoding result to obtain intermediate identification characters corresponding to the identity information of the user of the audio stream, if so, calling a final decoder to analyze the decoding result to obtain final identification characters corresponding to the identity information of the user of the audio stream.
2. The multi-user streaming speech recognition method of claim 1, wherein the speech recognition model is a time-delayed neural network model;
and/or, before the step of acquiring the audio streams of the corresponding users from the plurality of call lines, the multi-user streaming voice recognition method further comprises the following steps:
initializing a feature extractor;
initializing initial parameters of a speech recognition model;
initializing a speech decoder, the speech decoder comprising a partial decoder and a final decoder;
the step of analyzing the audio information in the request object by using a speech recognition model to obtain a decoding result corresponding to the identity information comprises:
extracting the characteristics of the audio information in the request object by using the characteristic extractor to obtain characteristic information corresponding to the identity information;
and analyzing the characteristic information by using the voice recognition model to obtain the decoding result.
3. The multi-user streaming speech recognition method of claim 1, wherein the multi-user streaming speech recognition method comprises:
acquiring audio streams of corresponding users from a plurality of call lines in a main thread, and generating a request object according to the audio information and the identity information;
adding the request object into a request queue in a main thread;
setting a return object corresponding to the request object in the main thread;
acquiring the request object in the request queue in a sub-thread, judging whether the identity information in the request object is a new user for first transmission, if so, initializing environmental parameters of the voice recognition model, and analyzing the audio information according to the request object by using the voice recognition model to obtain the decoding result corresponding to the identity information; if not, directly utilizing the voice recognition model to obtain the decoding result;
judging whether the audio information in the audio stream is silent in the sub-thread, if not, calling a partial decoder to analyze the decoding result to obtain intermediate identification characters corresponding to the identity information of the user of the audio stream, and if so, calling a final decoder to analyze the decoding result to obtain final identification characters corresponding to the identity information of the user of the audio stream; assigning the character recognition result to the return object;
the main thread processes the return object.
4. The multi-user streaming speech recognition method of claim 3, wherein the step of the main thread processing the return object comprises:
and periodically inquiring whether a character recognition result exists in the return object corresponding to the request object in the main thread, and if so, displaying the character recognition result.
5. The multi-user streaming speech recognition method of claim 1, wherein the step of determining whether the audio stream is silent comprises:
judging whether the audio information in the audio stream is silent by using an endpoint detection model;
the endpoint detection model comprises an input layer, an audio CNN layer, a splicing layer, a convolution layer, a full connection layer and an output layer;
the input layer is used for receiving frame information of a test audio and extracting features of the frame information to obtain features of a preset dimension;
the audio CNN layer comprises a plurality of parallel one-dimensional convolutional layers, and is used for performing convolution calculation on the preset dimensional characteristics through the one-dimensional convolutional layers to obtain first characteristic data corresponding to each one-dimensional convolutional layer;
the splicing layer is used for connecting the first characteristic data to obtain second characteristic data;
the convolution layer is used for performing convolution calculation on the second characteristic data to obtain third characteristic data;
the full connection layer is used for obtaining the probability of the frame information according to the third characteristic data;
and the output layer is used for obtaining a judgment result of whether the test audio is silent according to the probability.
6. The multi-user streaming voice recognition system is characterized by comprising a transmission module, a generation module, a decoding module and a recognition module;
the transmission module acquires audio streams of a plurality of corresponding users from a plurality of call lines, wherein the audio streams comprise audio information and identity information of the users;
the generating module is used for generating a request object according to the audio information and the identity information;
the decoding module is used for analyzing the audio information in the request object by utilizing a voice recognition model to obtain a decoding result corresponding to the identity information;
the identification module is used for judging whether the audio information in the audio stream is silent, if not, calling a part of decoders to analyze the decoding result to obtain intermediate identification characters corresponding to the identity information of the user of the audio stream, and if so, calling a final decoder to analyze the decoding result to obtain final identification characters corresponding to the identity information of the user of the audio stream.
7. The multi-user streaming speech recognition system of claim 6, wherein the speech recognition model is a time-delayed neural network model;
and/or,
the multi-user streaming voice recognition system further comprises an initialization module;
the initialization module is used for initializing the feature extractor;
the initialization module is also used for initializing initial parameters of the voice recognition model;
also for initializing a speech decoder, the speech decoder comprising a partial decoder and a final decoder;
the decoding module is further configured to perform feature extraction on the audio information in the request object by using the feature extractor to obtain feature information corresponding to the identity information; and analyzing the characteristic information by using the voice recognition model to obtain the decoding result.
8. The multi-user streaming voice recognition system of claim 6, wherein the multi-user streaming voice recognition system comprises a main thread and a sub-thread;
the main thread is used for acquiring audio streams of corresponding users from a plurality of call lines and generating a request object according to the audio information and the identity information;
the main thread is also used for adding the request object into a request queue;
the main thread is also used for setting a return object corresponding to the request object;
the sub-thread is used for acquiring the request object in the request queue and judging whether the identity information in the request object is a new user for first transmission, if so, initializing the environmental parameters of the voice recognition model, and analyzing the audio information according to the request object by using the voice recognition model to obtain the decoding result corresponding to the identity information; if not, directly utilizing the voice recognition model to obtain the decoding result;
the sub-thread is further used for judging whether the audio information in the audio stream is silent, if not, a part of decoders are called to analyze the decoding result to obtain intermediate identification characters corresponding to the identity information of the user of the audio stream, and if so, a final decoder is called to analyze the decoding result to obtain final identification characters corresponding to the identity information of the user of the audio stream;
the main thread is also used for processing the return object.
9. The multi-user streaming voice recognition system of claim 8, wherein the main thread is further configured to periodically query whether a character recognition result exists in the return object corresponding to the request object, and if so, display the character recognition result.
10. The multi-user streaming speech recognition system of claim 6, wherein the recognition module is further configured to determine whether audio information in the audio stream is silent using an endpoint detection model;
the endpoint detection model comprises an input layer, an audio CNN layer, a splicing layer, a convolution layer, a full connection layer and an output layer;
the input layer is used for receiving frame information of a test audio and extracting features of the frame information to obtain features of a preset dimension;
the audio CNN layer comprises a plurality of parallel one-dimensional convolutional layers, and is used for performing convolution calculation on the preset dimensional characteristics through the one-dimensional convolutional layers to obtain first characteristic data corresponding to each one-dimensional convolutional layer;
the splicing layer is used for connecting the first characteristic data to obtain second characteristic data;
the convolution layer is used for performing convolution calculation on the second characteristic data to obtain third characteristic data;
the full connection layer is used for obtaining the probability of the frame information according to the third characteristic data;
and the output layer is used for obtaining a judgment result of whether the test audio is silent according to the probability.
11. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the multi-user streaming speech recognition method of any of claims 1-5 when executing the computer program.
12. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the multi-user streaming speech recognition method of any one of claims 1 to 5.
CN201911358893.6A 2019-12-25 2019-12-25 Multi-user streaming voice recognition method, system, device and medium Active CN110995943B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911358893.6A CN110995943B (en) 2019-12-25 2019-12-25 Multi-user streaming voice recognition method, system, device and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911358893.6A CN110995943B (en) 2019-12-25 2019-12-25 Multi-user streaming voice recognition method, system, device and medium

Publications (2)

Publication Number Publication Date
CN110995943A true CN110995943A (en) 2020-04-10
CN110995943B CN110995943B (en) 2021-05-07

Family

ID=70075531

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911358893.6A Active CN110995943B (en) 2019-12-25 2019-12-25 Multi-user streaming voice recognition method, system, device and medium

Country Status (1)

Country Link
CN (1) CN110995943B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105139849A (en) * 2015-07-22 2015-12-09 百度在线网络技术(北京)有限公司 Speech recognition method and apparatus
US9263034B1 (en) * 2010-07-13 2016-02-16 Google Inc. Adapting enhanced acoustic models
CN107578777A (en) * 2016-07-05 2018-01-12 阿里巴巴集团控股有限公司 Word-information display method, apparatus and system, audio recognition method and device
CN107910008A (en) * 2017-11-13 2018-04-13 河海大学 A kind of audio recognition method based on more acoustic models for personal device
CN108630193A (en) * 2017-03-21 2018-10-09 北京嘀嘀无限科技发展有限公司 Audio recognition method and device
CN109360551A (en) * 2018-10-25 2019-02-19 珠海格力电器股份有限公司 A kind of audio recognition method and device
CN110265040A (en) * 2019-06-20 2019-09-20 Oppo广东移动通信有限公司 Training method, device, storage medium and the electronic equipment of sound-groove model
US20190294974A1 (en) * 2018-03-26 2019-09-26 International Business Machines Corporation Voice prompt avatar
CN110335612A (en) * 2019-07-11 2019-10-15 招商局金融科技有限公司 Minutes generation method, device and storage medium based on speech recognition

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112382278A (en) * 2020-11-18 2021-02-19 北京百度网讯科技有限公司 Streaming voice recognition result display method and device, electronic equipment and storage medium
CN112382278B (en) * 2020-11-18 2021-08-17 北京百度网讯科技有限公司 Streaming voice recognition result display method and device, electronic equipment and storage medium
CN113053392A (en) * 2021-03-26 2021-06-29 京东数字科技控股股份有限公司 Speech recognition method, speech recognition apparatus, electronic device, and medium
CN113053392B (en) * 2021-03-26 2024-04-05 京东科技控股股份有限公司 Speech recognition method, speech recognition device, electronic equipment and medium
CN113205800A (en) * 2021-04-22 2021-08-03 京东数字科技控股股份有限公司 Audio recognition method and device, computer equipment and storage medium
CN113205800B (en) * 2021-04-22 2024-03-01 京东科技控股股份有限公司 Audio identification method, device, computer equipment and storage medium
CN114822540A (en) * 2022-06-29 2022-07-29 广州小鹏汽车科技有限公司 Vehicle voice interaction method, server and storage medium

Also Published As

Publication number Publication date
CN110995943B (en) 2021-05-07

Similar Documents

Publication Publication Date Title
CN110995943B (en) Multi-user streaming voice recognition method, system, device and medium
JP7365985B2 (en) Methods, devices, electronic devices, computer-readable storage media and computer programs for recognizing speech
CN104766608A (en) Voice control method and voice control device
CN103514882A (en) Voice identification method and system
JP2021089438A (en) Selective adaptation and utilization of noise reduction technique in invocation phrase detection
CN111816172A (en) Voice response method and device
CN112767916A (en) Voice interaction method, device, equipment, medium and product of intelligent voice equipment
CN116417003A (en) Voice interaction system, method, electronic device and storage medium
CN113793599B (en) Training method of voice recognition model, voice recognition method and device
CN113782013B (en) Method, apparatus, storage medium and program product for speech recognition and model training
CN113611316A (en) Man-machine interaction method, device, equipment and storage medium
CN112087726B (en) Method and system for identifying polyphonic ringtone, electronic equipment and storage medium
CN112242135A (en) Voice data processing method and intelligent customer service device
CN115346517A (en) Streaming voice recognition method, device, equipment and storage medium
CN108932943A (en) Order word sound detection method, device, equipment and storage medium
CN114299955B (en) Voice interaction method and device, electronic equipment and storage medium
CN114171016B (en) Voice interaction method and device, electronic equipment and storage medium
CN110930985B (en) Telephone voice recognition model, method, system, equipment and medium
CN114512123A (en) Training method and device of VAD model and voice endpoint detection method and device
CN111049997B (en) Telephone background music detection model method, system, equipment and medium
CN114242064A (en) Speech recognition method and device, and training method and device of speech recognition model
CN114078478B (en) Voice interaction method and device, electronic equipment and storage medium
CN112820276B (en) Speech processing method, device, computer readable storage medium and processor
CN114743540A (en) Speech recognition method, system, electronic device and storage medium
CN114446307A (en) Model training method, voice transcription method, system, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant