CN115148191A - Voice processing method and server - Google Patents
Voice processing method and server
- Publication number
- CN115148191A CN115148191A CN202210750758.1A CN202210750758A CN115148191A CN 115148191 A CN115148191 A CN 115148191A CN 202210750758 A CN202210750758 A CN 202210750758A CN 115148191 A CN115148191 A CN 115148191A
- Authority
- CN
- China
- Prior art keywords
- voice
- acoustic
- algorithm
- decoding
- results
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G: PHYSICS
- G10: MUSICAL INSTRUMENTS; ACOUSTICS
- G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/02: Feature extraction for speech recognition; Selection of recognition unit (G10L15/00: Speech recognition)
- G10L15/16: Speech classification or search using artificial neural networks (G10L15/08: Speech classification or search)
- G10L15/285: Memory allocation or algorithm optimisation to reduce hardware requirements (G10L15/28: Constructional details of speech recognition systems)
- G10L19/005: Correction of errors induced by the transmission channel, if related to the coding algorithm (G10L19/00: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis)
- G10L25/24: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the type of extracted parameters, the extracted parameters being the cepstrum
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Computational Linguistics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Telephonic Communication Services (AREA)
Abstract
The application relates to a voice processing method and a server. The method includes: receiving a voice signal of a sound zone in a vehicle cabin, forwarded by the vehicle; performing feature extraction on the voice signal jointly for voice algorithms with different functions, obtaining feature extraction results and caching them; retrieving the cached feature extraction results according to preset configuration parameters of each voice algorithm, and computing corresponding acoustic calculation results through an acoustic model; and inputting each acoustic calculation result into the decoding module of the corresponding voice algorithm for decoding, aggregating the decoding results into a voice processing result, and sending it to the vehicle. With the scheme provided by the application, the voice algorithms share a single feature extraction pass over the voice signal, which avoids repeated computation and reduces the occupation of CPU resources.
Description
Technical Field
The present application relates to the field of intelligent voice technologies, and in particular, to a voice processing method and a server.
Background
Voice interaction is a new generation of interaction mode based on voice input. With the continuous development of the automobile industry and of human-computer interaction technology, intelligent automobiles now also provide voice interaction functions for users.
In the related art, an intelligent automobile can be equipped with a voice assistant that provides natural human-machine interaction: through voice, a user can control in-vehicle software such as navigation and music, and can also control in-vehicle hardware such as windows and air conditioning.
However, voice applications have high real-time requirements, and the various voice algorithms, such as voice recognition, voiceprint recognition and voice wakeup, involve a large amount of computation during data processing. In particular, when multiple algorithms process data at the same time, they occupy a large share of CPU resources and tend to stall other applications executed concurrently, which degrades the data processing efficiency of those applications.
Disclosure of Invention
In order to solve, or at least partially solve, the problems in the related art, the application provides a voice processing method and a server, which enable the voice algorithms to perform feature extraction on the voice signal jointly, avoiding repeated computation and reducing the occupation of CPU resources.
A first aspect of the present application provides a voice processing method, including: receiving a voice signal of a sound zone in a cabin forwarded by a vehicle; performing feature extraction on the voice signal jointly for voice algorithms with different functions, obtaining feature extraction results and caching them; retrieving the cached feature extraction results according to preset configuration parameters of each voice algorithm, and computing corresponding acoustic calculation results through an acoustic model; and inputting each acoustic calculation result into the decoding module of the corresponding voice algorithm for decoding, aggregating the decoding results into a voice processing result, and sending it to the vehicle. Because the algorithms share a single feature extraction pass over the voice signal, repeated computation is avoided and the occupation of CPU resources is reduced.
In the voice processing method of the present application, performing feature extraction on the voice signal jointly for the voice algorithms with different functions includes: within the corresponding sound zone, extracting features from the voice signal jointly for the voice algorithms with different functions according to a preset format. For different sound zones, the server performs feature extraction separately and configures the corresponding feature extraction formats independently of one another, so that the feature extraction requirements of multiple zones are met without mutual interference.
In the voice processing method of the present application, retrieving the cached feature extraction results according to the preset configuration parameters of each voice algorithm includes: according to the preset configuration parameters of each voice algorithm, retrieving the cached feature extraction results of the corresponding number of frames and splicing them to generate the corresponding feature matrices; the preset configuration parameters include at least one of the number of spliced frames, the padding behavior and the number of interval frames. The feature matrix input requirements of different acoustic models can thus be met simply by setting parameters, which simplifies the matrix splicing scheme.
In the voice processing method of the present application, retrieving the cached feature extraction results of the corresponding number of frames and splicing them according to the preset configuration parameters of the corresponding voice algorithm to generate the corresponding feature matrix includes: when a new voice signal is received, retrieving the cached feature extraction results through the feature splicing interface; and when the feature extraction results satisfy the preset configuration parameters of at least one voice algorithm, generating the corresponding feature matrix according to a splicing function in the feature splicing interface. With this follow-along streaming computation, the voice algorithms run independently without waiting for one another, meeting the timeliness requirements of voice technology.
In the voice processing method of the present application, before each acoustic calculation result is input into the decoding module of the corresponding voice algorithm for decoding, the method further includes: determining, for each voice algorithm, whether a feature matrix has been obtained; and if it is determined that a voice algorithm has obtained the corresponding feature matrix, decoding according to the acoustic calculation result output by the acoustic model. By first checking whether a valid feature matrix has been generated at the front end and only then deciding whether decoding is needed, invalid calls to the decoding module are avoided and CPU resources are saved.
In the voice processing method of the present application, receiving a voice signal of an in-cabin sound zone forwarded by the vehicle includes: receiving, separately, the voice signals of each sound zone in the cockpit forwarded by the vehicle; and performing feature extraction on the voice signal jointly for the voice algorithms with different functions includes: for the voice signals from the same sound zone, extracting features jointly for the voice algorithms with different functions. The server processes the voice signals of different sound zones separately, and the voice signal of each sound zone is feature-extracted only once, which saves CPU resources, avoids interference and ensures the accuracy of the final voice processing result.
In the voice processing method of the present application, obtaining the corresponding acoustic calculation results through acoustic model computation includes: calling the corresponding acoustic submodels for computation through a compute interface of a preset neural network processing engine, to obtain the acoustic calculation result of the corresponding voice algorithm. Migrating the acoustic models to a dedicated neural network processing engine reduces the occupation of CPU resources; since different voice algorithms have their own acoustic submodels, independent parallel computation is achieved and computational efficiency is improved.
In the voice processing method of the present application, the method further includes: when there are multiple sound zones in the cockpit, presetting either an acoustic model for each individual sound zone or one acoustic model shared by the multiple sound zones. Setting acoustic models for the sound zones either separately or jointly in the server allows a flexible model architecture.
A second aspect of the present application provides a server, including: an information transceiving module, configured to receive the voice signal of an in-cockpit sound zone forwarded by the vehicle; a feature extraction module, configured to perform feature extraction on the voice signal jointly for voice algorithms with different functions, and to obtain and cache the feature extraction results; an acoustic model calculation module, configured to retrieve the cached feature extraction results according to the preset configuration parameters of each voice algorithm and obtain the corresponding acoustic calculation results through acoustic model computation; and a decoding module, configured to decode according to each acoustic calculation result, aggregate the decoding results into a voice processing result, and send the voice processing result to the vehicle through the information transceiving module. The server improves voice processing efficiency while reducing the occupation of CPU resources.
A third aspect of the present application provides a server comprising:
a processor; and
a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method as described above.
A fourth aspect of the present application provides a computer readable storage medium having stored thereon executable code, which when executed by a processor of a server, causes the processor to perform the method as described above.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The above and other objects, features and advantages of the present application will become more apparent by describing in more detail exemplary embodiments thereof with reference to the attached drawings, in which like reference numerals generally represent like parts throughout.
FIG. 1 is a schematic flowchart of a voice processing method shown in the present application;
FIG. 2 is another schematic flowchart of a voice processing method shown in the present application;
FIG. 3 is a schematic structural diagram of a server shown in the present application;
FIG. 4 is another schematic structural diagram of the server shown in the present application;
FIG. 5 is a schematic structural diagram of the server for a single sound zone shown in the present application;
FIG. 6 is a schematic structural diagram of a server shown in the present application.
Detailed Description
Embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While embodiments of the present application are illustrated in the accompanying drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms "first," "third," etc. may be used herein to describe various information, such information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, the first information may also be referred to as third information, and similarly, the third information may also be referred to as the first information, without departing from the scope of the present application. Thus, a feature defined as "first" or "third" may explicitly or implicitly include one or more of that feature. In the description of the present application, "a plurality" means two or more unless specifically limited otherwise.
In the related art, when multiple functional algorithms such as voice recognition, voiceprint recognition and voice wakeup process the same voice data simultaneously, they occupy a large share of CPU resources and easily cause stalling of other applications executed at the same time.
In order to solve the above problems, the present application provides a voice processing method that enables the voice algorithms to perform feature extraction on the voice signal jointly, avoiding repeated computation and reducing the occupation of CPU resources.
The technical solution of the present application is described in detail below with reference to the accompanying drawings.
Fig. 1 is a flowchart illustrating a speech processing method according to the present application.
Referring to fig. 1, the present application illustrates a speech processing method, which includes:
and S110, receiving the voice signal of the cabin inner sound area forwarded by the vehicle.
The number of sound zones in the cabin of the vehicle may be one or more. In this step, when the execution subject is a server, the voice signals from different sound ranges are received, respectively. For example, the sound zones in the vehicle cabin are arranged in a manner that the sound zones are only illustrated by way of example and not by way of limitation. It can be understood that each sound zone is independently arranged, and the server respectively and independently processes the received voice signals of different sound zones.
S120, performing feature extraction on the voice signal jointly for the voice algorithms with different functions, obtaining the feature extraction results and caching them.
In order to analyze the voice signal more comprehensively, voice algorithms with different functions need to process it separately. The voice algorithms with different functions may be, for example, a voice recognition algorithm, a voice wakeup algorithm or a voiceprint recognition algorithm, without limitation. To implement its function, each voice algorithm needs features extracted from the voice signal. To reduce the occupation of CPU resources, in this step the voice algorithms of the various functions do not each extract features from the voice signal separately; the feature extraction actions are merged so that the voice signal is extracted only once, and the resulting feature extraction results can be used by the voice algorithms of all functions, reducing the computational load on the CPU.
Further, in this step, voice feature vectors such as MFCC (Mel Frequency Cepstral Coefficients, cepstral parameters extracted in the Mel-scale frequency domain) or FBank (filter banks, acoustic features obtained by applying a Mel filter bank to the energy spectrum) may be extracted from the voice signal using related techniques to obtain the feature extraction results.
After receiving the voice signal forwarded by the vehicle, feature extraction can be performed in real time to ensure timeliness, and the feature extraction results are cached so that subsequent steps can retrieve results of different numbers of frames according to the computation requirements of different voice algorithms. It can be understood that the cache stores the feature extraction results corresponding to more than one frame of the voice signal.
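As an illustration of this shared extract-once-and-cache step, the following Python sketch shows one possible per-zone front end. It is an assumption for illustration only: compute_features is a simplified log-spectrum stand-in rather than a real MFCC/FBank implementation, and FEATURE_DIM and CACHE_MAX_FRAMES are assumed values.

```python
from collections import deque

import numpy as np

FEATURE_DIM = 80          # assumed feature dimension (e.g. 80-dim FBank-style)
CACHE_MAX_FRAMES = 1000   # assumed upper bound on cached frames per zone


def compute_features(frame: np.ndarray) -> np.ndarray:
    """Simplified log-spectrum stand-in for a real FBank/MFCC front end."""
    spectrum = np.abs(np.fft.rfft(frame, n=2 * FEATURE_DIM))
    return np.log(spectrum[:FEATURE_DIM] + 1e-6)


class ZoneFeatureCache:
    """One instance per sound zone: extract once, cache for every algorithm."""

    def __init__(self):
        self.frames = deque(maxlen=CACHE_MAX_FRAMES)  # time-ordered feature vectors

    def push_audio_frame(self, frame: np.ndarray) -> None:
        self.frames.append(compute_features(frame))   # single extraction pass

    def latest(self, n: int):
        """Return up to the most recent n feature vectors, oldest first."""
        return list(self.frames)[-n:]
```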
S130, retrieving the cached feature extraction results according to the preset configuration parameters of each voice algorithm, and computing the corresponding acoustic calculation results through an acoustic model.
In this step, because the voice algorithms have different functions and require different acoustic models for computation, the number of frames of feature extraction results that must be fed into each acoustic model may differ. For this reason, each voice algorithm has its own preset configuration parameters. Based on the feature extraction results stored in the cache in time order, the feature extraction results of the corresponding number of frames can be processed into the input data required by the acoustic model of each voice algorithm according to the preset configuration parameters corresponding to that algorithm.
Specifically, the preset configuration parameters may include the number of spliced frames, the padding behavior, the number of interval frames, and so on. According to these configuration parameters, the feature extraction results of the corresponding number of frames can be retrieved from the cache and assembled into a feature matrix, yielding the input data required by the corresponding acoustic model, so that the acoustic model can compute an acoustic calculation result from the feature matrix.
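A minimal sketch of how cached feature frames might be spliced into an acoustic-model input matrix under such preset configuration parameters is given below; the function signature, parameter names and the zero/copy padding choices are illustrative assumptions rather than an interface defined by the application.

```python
import numpy as np


def splice_features(cached, splice_frames, pad_left, pad_right,
                    pad_mode="zero", interval=1):
    """Build one feature matrix from time-ordered cached feature vectors.

    cached        : list of 1-D feature vectors (oldest first)
    splice_frames : how many (possibly strided) frames to stack
    pad_left/right: frames of context padding added on each side
    pad_mode      : "zero" -> zero vectors, "copy" -> repeat the edge frames
    interval      : take every `interval`-th frame (frame skipping)
    """
    frames = cached[::interval][-splice_frames:]
    if len(frames) < splice_frames:
        return None                          # not enough context cached yet
    dim = frames[0].shape[0]
    if pad_mode == "zero":
        left = [np.zeros(dim)] * pad_left
        right = [np.zeros(dim)] * pad_right
    else:                                    # "copy": repeat the edge frames
        left = [frames[0]] * pad_left
        right = [frames[-1]] * pad_right
    return np.stack(left + frames + right)   # shape: (total_frames, dim)
```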
The acoustic calculation results of the acoustic models differ between voice algorithms. For example, for a voice algorithm with a voice recognition function, the corresponding acoustic model computes a phoneme feature sequence and its probabilities; for a voice algorithm with a voice wakeup function, the corresponding acoustic model computes the phoneme feature sequence of the wakeup word and its probabilities; for a voice algorithm with a voiceprint recognition function, the corresponding acoustic model computes a speaker feature sequence and its probabilities. These are only examples, and the type of acoustic calculation result of each acoustic model in practical applications is not limited here.
S140, inputting each acoustic calculation result into the decoding module of the corresponding voice algorithm for decoding, aggregating the decoding results, obtaining a voice processing result and sending it to the vehicle.
In this step, the voice algorithms of the various functions use different decoding modules. For example, in the related art, the voice algorithm of the voice recognition function mainly performs Viterbi decoding of the posterior probabilities computed by the acoustic model on the corresponding decoding graph; the voice algorithm of the voice wakeup function likewise performs Viterbi decoding of the posterior probabilities computed by its acoustic model on the corresponding decoding graph; and the voice algorithm of the voiceprint recognition function mainly decodes by computing cosine similarity. These are only examples, and the actual decoding scheme of each decoding module is not limited here.
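As one concrete example of the decoding step, the cosine-similarity decoding mentioned for voiceprint recognition could look roughly like the sketch below; the enrolled-speaker store and the threshold value are assumptions for illustration, and the Viterbi decoding used by the recognition and wakeup branches is not reproduced here.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))


def decode_voiceprint(speaker_embedding, enrolled, threshold=0.75):
    """Match an acoustic-model speaker embedding against enrolled speakers.

    enrolled : dict mapping speaker id -> enrollment embedding (assumed store)
    Returns (best speaker id or None, best similarity score).
    """
    best_id, best_score = None, -1.0
    for speaker_id, ref in enrolled.items():
        score = cosine_similarity(speaker_embedding, ref)
        if score > best_score:
            best_id, best_score = speaker_id, score
    return (best_id if best_score >= threshold else None), best_score
```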
Each acoustic calculation result is input into the corresponding decoding module for decoding, and the decoding result output by each decoding module is obtained. It can be understood that, because the decoding modules start and run at different times and for different durations, an individual decoding module may be unable to start decoding for lack of valid input data. To keep the user experience timely, the decoding results available at the current moment can be aggregated into a corresponding voice processing result and sent to the vehicle promptly.
It can be understood that the voice signals from the different sound zones are each processed according to steps S120 to S140 above to obtain their corresponding voice processing results. That is, the server processes the voice signals from different sound zones independently of one another.
As can be seen from this example, the voice processing method of the present application receives the voice signals of the sound zones forwarded by the vehicle and performs feature extraction on the voice signal of each sound zone jointly for the different voice algorithms, so that each voice algorithm does not need to extract features separately, which reduces repeated computation and therefore the occupation of CPU resources. In addition, different feature extraction results in the cache can be retrieved as input simply by adjusting the preset configuration parameters, and the acoustic calculation results are obtained from the corresponding acoustic models, so that voice algorithms with new functions are easy to add. Meanwhile, the decoding modules of the voice algorithms decode independently without waiting for one another, which guarantees the real-time output of decoding results and improves CPU utilization. Finally, the decoding results generated at each moment are aggregated and sent to the vehicle promptly, improving the user experience.
Fig. 2 is a flowchart illustrating a speech processing method according to the present application.
Referring to fig. 2, the present application illustrates a speech processing method, which includes:
and S210, respectively receiving the voice signals of each sound zone in the cockpit transferred by the vehicle.
The execution main body of the voice processing method can be a server, and the server can receive the voice signal forwarded by the vehicle in real time. Based on the mutual independence of the sound zones, the sound zones collect voice signals through the corresponding microphones and transmit the voice signals to the server in real time in the vehicle, and the server can receive the voice signals of different sound zones at the same time.
S220, for the voice signals from the same sound zone, performing feature extraction jointly for the voice algorithms with different functions, obtaining the feature extraction results and caching them.
The server can be preloaded with the engines of voice algorithms with different functions, such as a voice recognition algorithm engine, a voiceprint recognition algorithm engine and a voice wakeup algorithm engine. The algorithm engines of all the voice algorithms share the same feature extraction module, each feature extraction module is mapped to one sound zone, and the same feature extraction module performs feature extraction on the voice signals from that sound zone, so the same voice signal only needs to be feature-extracted once, without repeated computation. The feature extraction results obtained for the voice signals of different sound zones can be cached separately to avoid confusion.
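One way to express this mapping, with one shared feature extraction module per sound zone used by all algorithm engines, is sketched below; the zone identifiers and class names are hypothetical.

```python
from collections import deque


class SharedFeatureExtractor:
    """One per sound zone; all algorithm engines read from the same cache."""

    def __init__(self, max_frames: int = 1000):
        self.cache = deque(maxlen=max_frames)

    def push(self, feature_vector) -> None:
        self.cache.append(feature_vector)   # features were computed exactly once upstream


# Hypothetical zone identifiers; the actual zone layout is vehicle-specific.
zone_extractors = {zone: SharedFeatureExtractor()
                   for zone in ("front_left", "front_right", "rear_left", "rear_right")}


def on_feature_frame(zone_id: str, feature_vector) -> None:
    """Route a newly extracted feature vector to its zone's shared cache."""
    zone_extractors[zone_id].push(feature_vector)
```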
Furthermore, within the corresponding sound zone, features are extracted from the voice signal jointly for the voice algorithms with different functions according to a preset format. That is, for the voice signals from a single sound zone, when the voice algorithms use the same feature extraction module, feature extraction results in the preset format are obtained. For the voice signals of different sound zones, the preset formats of the respective feature extraction results can be configured independently. For example, the preset format may be MFCC (Mel Frequency Cepstral Coefficients, cepstral parameters extracted in the Mel-scale frequency domain) feature vectors or FBank (filter banks, acoustic features obtained by applying a Mel filter bank to the energy spectrum) feature vectors; for instance, the feature extraction results of all sound zones may uniformly use 80-dimensional first-order FBank features. This is only an example, and the feature extraction results of different sound zones may use the same or different preset formats.
S230, according to the preset configuration parameters of each voice algorithm, retrieving the cached feature extraction results of the corresponding number of frames and splicing them to generate the corresponding feature matrices.
The preset configuration parameters of each voice algorithm include at least one of the number of spliced frames, the padding behavior and the number of interval frames. The number of spliced frames refers to how many voice frames are stacked together; the number of padding frames refers to how many frames are padded to the left or right of the voice frames; the padding behavior is either zero-padding or copying; and the number of interval frames refers to the number of frames skipped between two frame features. Other configuration parameters may of course also be included, without limitation. It can be understood that specific splicing behavior can be configured through the specific values of the preset configuration parameters of the different voice algorithms, so that different feature matrices are produced by splicing and the feature input format requirements of the acoustic models of the different voice algorithms are met. Optionally, by adjusting the configuration parameters, a new voice algorithm can easily be added, enriching the human-computer interaction functions.
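A possible declaration of such per-algorithm preset configuration parameters is sketched below; adding support for a new voice algorithm then amounts to adding one more entry. The field names and numeric values are illustrative assumptions, not values given in the application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpliceConfig:
    splice_frames: int        # number of feature frames stacked per input
    pad_left: int             # context frames padded on the left
    pad_right: int            # context frames padded on the right
    pad_mode: str = "zero"    # "zero" or "copy"
    interval: int = 1         # take every `interval`-th frame


# Illustrative values only; each acoustic submodel defines its own needs.
PRESET_CONFIGS = {
    "speech_recognition": SpliceConfig(splice_frames=11, pad_left=5, pad_right=5),
    "voice_wakeup":       SpliceConfig(splice_frames=7, pad_left=3, pad_right=3,
                                       pad_mode="copy"),
    "voiceprint":         SpliceConfig(splice_frames=150, pad_left=0, pad_right=0,
                                       interval=2),
}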
Optionally, before feature splicing is performed on the feature extraction results of the same sound zone, CMVN (Cepstral Mean and Variance Normalization) can be applied to each feature extraction result (e.g., each voice feature vector). Specifically, after the voice feature vectors are obtained, they are transformed from one space to another so that the feature parameters better conform to a certain probability distribution and the dynamic range of the feature values is compressed, yielding more standardized voice feature vectors; this helps improve the robustness of the acoustic model's predictions in the subsequent steps.
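The CMVN step itself can be summarized by the following sketch, which normalizes each feature dimension to zero mean and unit variance over the frames being processed; treating the spliced frames as the statistics window is an assumption, since the application does not fix it.

```python
import numpy as np


def apply_cmvn(feature_matrix: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Cepstral mean and variance normalization over the time axis.

    feature_matrix : (num_frames, feature_dim) array of feature vectors.
    Each feature dimension is shifted to zero mean and scaled to unit variance.
    """
    mean = feature_matrix.mean(axis=0, keepdims=True)
    std = feature_matrix.std(axis=0, keepdims=True)
    return (feature_matrix - mean) / (std + eps)
```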
Further, when a new voice signal is received, the cached feature extraction results are retrieved through the feature splicing interfaces; and when the feature extraction results satisfy the preset configuration parameters of at least one voice algorithm, the corresponding feature matrix is generated according to a splicing function in the feature splicing interface. The feature extraction module of each sound zone may be provided with a feature splicing interface, and a splicing function is predefined in each feature splicing interface; the splicing function retrieves the cached feature extraction results of the corresponding number of frames according to the preset configuration parameters of each voice algorithm and splices them into the feature matrix required by the corresponding acoustic model.
That is, in the streaming computation mode, as the server receives new voice signals in real time, the feature extraction module extracts features from them in real time and generates feature extraction results. Because the number of frames of feature extraction results present in the cache at any moment is uncertain, whenever the number of cached frames satisfies the preset configuration parameters of any voice algorithm, the splicing function in the feature splicing interface can splice the appropriate number of feature extraction results into the corresponding feature matrix, producing a valid feature matrix that can be fed into the corresponding acoustic model for computation in the subsequent steps.
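In streaming terms, the behaviour described above can be reduced to a small check that runs on every newly cached frame and only returns a matrix once enough frames have accumulated; the sketch below is a simplified assumption that omits padding for brevity.

```python
import numpy as np


def try_splice(cache, splice_frames: int, interval: int = 1):
    """Streaming splice attempt for one voice algorithm.

    cache : time-ordered list of 1-D feature vectors for one sound zone.
    Returns a (splice_frames, dim) matrix once enough frames are cached,
    otherwise None, so the caller simply keeps waiting for more frames.
    """
    strided = cache[::interval]           # honour the interval-frame setting
    if len(strided) < splice_frames:
        return None
    return np.stack(strided[-splice_frames:])


# Usage sketch: invoked on every newly cached frame, independently per algorithm;
# a non-None return value is handed to that algorithm's acoustic submodel.
```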
S240, through a compute interface of a preset neural network processing engine, each voice algorithm calls its corresponding acoustic submodel and computes on the input feature matrix to obtain the acoustic calculation result of that voice algorithm.
It should be understood that, to satisfy the real-time requirements of voice technology, the acoustic models of the multiple voice algorithms involve a very large amount of computation and therefore occupy a large share of CPU resources. To reduce the CPU load, the acoustic models of the voice algorithms can be migrated together into a neural network processing engine, such as SNPE (a neural network processing engine provided by Qualcomm chips), which is mentioned here only as an example. That is, the acoustic models of the voice algorithms are integrated into one overall acoustic model inside the neural network processing engine, and the acoustic model of each voice algorithm corresponds to an acoustic submodel of that overall model; the acoustic models mentioned in the preceding steps are all treated as acoustic submodels in this step. The neural network processing engine provides a compute interface, and the algorithm engine of each voice algorithm selects the corresponding acoustic submodel for computation by calling this same compute interface, which helps reduce CPU resource consumption.
It should be noted that, because the feature matrices they require differ, the acoustic submodels of different voice algorithms may have partially or completely different computation frequencies. For example, the acoustic submodel of the voice recognition algorithm may be computed every n milliseconds while that of the voiceprint recognition algorithm is computed every m milliseconds. Although the computation frequencies differ, every time the server receives a new voice signal each voice algorithm can attempt to call its acoustic submodel, improving timeliness through streaming computation. Specifically, the cache holds feature extraction results generated in time order; before calling the acoustic submodel, each voice algorithm first calls the feature splicing interface and attempts to splice the feature extraction results through the splicing function. If feature extraction results satisfying the corresponding preset configuration parameters exist, the splicing functions generate the corresponding feature matrices, and the acoustic submodels compute their acoustic calculation results from those matrices. In other words, as soon as the feature splicing interface produces a valid feature matrix under any one set of preset configuration parameters, that matrix can be fed into the corresponding acoustic submodel for computation without waiting for the results of other acoustic submodels.
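Putting these pieces together, each algorithm engine could react to every new frame roughly as in the sketch below. The engine.run call is a generic stand-in for the compute interface of whatever neural network processing engine is used (it is not the actual SNPE API), and the attributes assumed on each algorithm object are hypothetical.

```python
class DummyEngine:
    """Stand-in for the compute interface of a neural network processing engine."""

    def run(self, submodel_id: str, feature_matrix):
        # A real engine would execute the acoustic submodel here; this stub just
        # echoes the input shape so the sketch stays runnable.
        return {"submodel": submodel_id, "frames": len(feature_matrix)}


engine = DummyEngine()


def process_new_frame(zone_cache, algorithms):
    """Let every voice algorithm react independently to the current cache state.

    Each item in `algorithms` is assumed to expose:
      .name               algorithm identifier
      .try_splice(cache)  -> feature matrix or None (its own preset parameters)
      .submodel_id        -> key selecting its acoustic submodel in the engine
      .decode(output)     -> its decoding-module result
    """
    results = {}
    for alg in algorithms:
        matrix = alg.try_splice(zone_cache)       # streaming: may not be ready yet
        if matrix is None:
            continue                              # no valid input, so skip decoding too
        acoustic_output = engine.run(alg.submodel_id, matrix)  # acoustic submodel
        results[alg.name] = alg.decode(acoustic_output)        # algorithm decoder
    return results
```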
Preferably, when there are multiple sound zones in the cabin, either an acoustic model for each individual sound zone or one acoustic model shared by the multiple sound zones is preset in the server. For example, each sound zone may be mapped to its own acoustic model in the server, and after receiving the voice signal of a sound zone the server performs the corresponding acoustic computation with that model; each such acoustic model contains the acoustic submodels of the different voice algorithms. Alternatively, all sound zones may be mapped to a single acoustic model in the server; that model likewise contains the acoustic submodels of the different voice algorithms, and each acoustic submodel is responsible for the acoustic computation of the different sound zones. Flexibly setting the number of acoustic models makes it possible to meet the needs of different architectures within the server.
S250, inputting each acoustic calculation result into the decoding module of the corresponding voice algorithm for decoding to obtain the corresponding decoding result.
It can be understood that the frequencies at which the voice algorithms obtain acoustic calculation results may differ. On this basis, it is determined for each voice algorithm whether a feature matrix has been obtained; if a voice algorithm is determined to have obtained its feature matrix, decoding is performed according to the acoustic calculation result output by the acoustic model. In other words, whether a feature matrix for any voice algorithm was generated by the feature splicing interface in step S230 determines whether the acoustic submodel performed a computation in step S240, and therefore whether it output an acoustic calculation result and whether the corresponding voice algorithm engine should invoke its decoding module.
That is, if the feature splicing interface of step S230 did not generate any feature matrix, the acoustic submodel of step S240 cannot compute for lack of input data and obviously cannot output an acoustic calculation result. In that case, the algorithm engines do not need to call any decoding module. Deciding in this way whether to call a decoding module reduces CPU resource consumption and avoids invalid calls.
S260, aggregating the decoding results, obtaining the voice processing result and sending it to the vehicle.
Following the streaming computation, the decoding results currently obtained by the voice algorithms are aggregated into a voice processing result; that is, the content of the voice processing results obtained at different moments differs. To ensure timeliness, the voice processing result obtained at the current moment can be sent to the vehicle in real time, so that the vehicle can respond to the user according to it.
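The aggregate-and-send step can be as simple as the sketch below: collect whatever decoding results exist at the current moment, tag them with the sound zone, and push them back to the vehicle. The send_to_vehicle transport call and the message fields are placeholders, not an interface defined by the application.

```python
import json
import time


def send_to_vehicle(payload: str) -> None:
    """Placeholder transport back to the vehicle (e.g. over the existing link)."""
    print(payload)


def aggregate_and_send(zone_id: str, decode_results: dict) -> None:
    """Bundle the decoding results available right now and send them out."""
    message = {
        "zone": zone_id,
        "timestamp_ms": int(time.time() * 1000),
        # Only algorithms that actually produced a result at this moment appear.
        "results": {name: result for name, result in decode_results.items()
                    if result is not None},
    }
    send_to_vehicle(json.dumps(message, ensure_ascii=False))
```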
As can be seen from this example, in the voice processing method of the present application the server can receive the voice signals of different sound zones and process them independently. By using a corresponding feature extraction module for each zone, the voice algorithms share feature extraction instead of extracting features repeatedly, which reduces the occupation of CPU resources. In addition, the splicing requirements of each voice algorithm are met through different preset configuration parameters, and the corresponding feature matrices are spliced from the cache through the feature splicing interfaces. Meanwhile, only when a feature matrix is obtained for a voice algorithm is the corresponding acoustic submodel invoked and the corresponding decoding module called, which avoids invalid computation, reduces the occupation of CPU resources, and lessens the stalling that CPU contention would cause for other applications. Moreover, the acoustic models of the voice algorithms are integrated as acoustic submodels in the same neural network processing engine rather than being deployed separately, further reducing the occupation of CPU resources.
Corresponding to the above embodiments of the method, the present application also provides a server and corresponding embodiments.
Fig. 3 is a schematic structural diagram of a server shown in the present application.
Referring to fig. 3, the present application illustrates a server, which includes an information transceiving module 310, a feature extraction module 320, an acoustic model calculation module 330 and a decoding module 340, wherein:
the information transceiving module 310 is configured to receive the voice signal of the in-cabin sound zone forwarded by the vehicle;
the feature extraction module 320 is configured to perform feature extraction on the voice signal jointly for the voice algorithms with different functions, and to obtain and cache the feature extraction results;
the acoustic model calculation module 330 is configured to retrieve the cached feature extraction results according to the preset configuration parameters of each voice algorithm and obtain the corresponding acoustic calculation results through acoustic model computation;
the decoding module 340 is configured to decode according to each acoustic calculation result, aggregate the decoding results into a voice processing result, and send the voice processing result to the vehicle through the information transceiving module.
Fig. 4 is another schematic structural diagram of the server shown in the present application. Fig. 5 is a schematic structural diagram of the server for a single sound zone.
Referring to fig. 4 and 5, specifically, the information transceiving module 310 is configured to receive the voice signals of each sound zone in the cabin forwarded by the vehicle. The feature extraction module 320 is configured to perform feature extraction on the voice signals from the same sound zone jointly for the voice algorithms with different functions. There may be more than one feature extraction module 320, each handling the voice signal of one sound zone, and the voice algorithms of all functions share one feature extraction module for the voice signals of that sound zone. Optionally, within the corresponding sound zone, the feature extraction module 320 extracts features from the voice signal jointly for the voice algorithms with different functions according to a preset format.
The acoustic model calculation module 330 includes a feature splicing module 331 and a calculation module 332. The feature splicing module 331 is configured to retrieve the cached feature extraction results of the corresponding number of frames and splice them according to the preset configuration parameters of each voice algorithm to generate the corresponding feature matrices; the preset configuration parameters include at least one of the number of spliced frames, the padding behavior and the number of interval frames. Optionally, before feature splicing, the feature splicing module 331 applies CMVN normalization to the feature extraction results.
Further, the feature splicing module 331 is configured to retrieve the cached feature extraction results through the feature splicing interface when a new voice signal is received, and, when the feature extraction results satisfy the preset configuration parameters of at least one voice algorithm, to generate the corresponding feature matrix according to a splicing function in the feature splicing interface. Optionally, the number of feature splicing modules 331 corresponds to the number of voice algorithms, i.e., each feature splicing module 331 has its own splicing function for splicing the feature matrix required by the acoustic model of the corresponding voice algorithm. Alternatively, the voice algorithms may share one feature splicing module 331, with the same module splicing feature matrices according to the different splicing functions.
Further, when there are multiple sound zones in the cabin, either an acoustic model for each individual sound zone or one acoustic model shared by the sound zones is preset, and each acoustic model contains the acoustic submodel of each algorithm. The calculation module 332 is configured to call the corresponding acoustic submodels for computation through the compute interface of the preset neural network processing engine, to obtain the acoustic calculation result of the corresponding voice algorithm.
The number of decoding modules 340 corresponds to the number of voice algorithms with different functions. Before decoding, each decoding module 340 calls the feature splicing interface of the feature splicing module to determine whether the corresponding voice algorithm has obtained a feature matrix; if it is determined that the voice algorithm has obtained its feature matrix, decoding is performed according to the acoustic calculation result output by the acoustic model. This design avoids invalid calls to the decoding module 340 and saves CPU resources.
In summary, the server of the present application performs feature extraction on the voice signals of the same sound zone forwarded by the vehicle in a shared feature extraction module, so features do not need to be extracted repeatedly and CPU resources are saved. The feature splicing module produces the feature matrix required by each voice algorithm according to its preset configuration parameters, and the calculation module calls the corresponding acoustic submodels through a unified compute interface to obtain the corresponding acoustic calculation results. In addition, the decoding module performs acoustic model computation and decoding in real time, following the feature matrices produced by the feature splicing module, which meets the timeliness requirements of voice technology, avoids invalid calls to the decoding module and reduces the occupation of CPU resources. Each decoding module runs its decoding independently, and this parallel decoding improves processing efficiency and shortens CPU occupation time.
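Read as software components, the module split of FIG. 3 could be wired together roughly as in the following sketch; the class, method and collaborator names simply mirror modules 310 to 340 and are illustrative only.

```python
class Server:
    """Illustrative wiring of the modules described above (310 to 340)."""

    def __init__(self, transceiver, feature_extractor, acoustic_calculator, decoders):
        self.transceiver = transceiver                  # information transceiving module 310
        self.feature_extractor = feature_extractor      # feature extraction module 320
        self.acoustic_calculator = acoustic_calculator  # acoustic model calculation module 330
        self.decoders = decoders                        # decoding modules 340, one per algorithm

    def handle_frame(self, zone_id, audio_frame):
        # Extract features once per zone and cache them for all algorithms.
        self.feature_extractor.extract_and_cache(zone_id, audio_frame)
        results = {}
        for name, decoder in self.decoders.items():
            acoustic_output = self.acoustic_calculator.compute(zone_id, name)
            if acoustic_output is not None:             # decode only valid acoustic output
                results[name] = decoder.decode(acoustic_output)
        if results:
            self.transceiver.send(zone_id, results)     # aggregated result back to the vehicle
```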
With regard to the server in the above embodiment, the specific manner in which each module performs operations has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 6 is a schematic structural diagram of a server shown in the present application.
Referring to fig. 6, the server 1000 includes a memory 1010 and a processor 1020.
The Processor 1020 may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic, discrete hardware components, etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 1010 may include various types of storage units, such as system memory, read-only memory (ROM) and permanent storage. The ROM may store static data or instructions needed by the processor 1020 or other modules of the computer. The persistent storage device may be a readable and writable storage device, i.e., a non-volatile storage device that does not lose stored instructions and data even after the computer is powered off. In some embodiments, a mass storage device (e.g., a magnetic or optical disk, or flash memory) is used as the persistent storage; in other embodiments, the permanent storage may be a removable storage device (e.g., a floppy disk or an optical drive). The system memory may be a readable and writable or volatile readable and writable memory device, such as dynamic random access memory, and may store instructions and data that some or all of the processors require at runtime. Furthermore, the memory 1010 may comprise any combination of computer-readable storage media, including various types of semiconductor memory chips (e.g., DRAM, SRAM, SDRAM, flash memory, programmable read-only memory), magnetic disks and/or optical disks. In some embodiments, the memory 1010 may include a readable and/or writable removable storage device, such as a compact disc (CD), a read-only digital versatile disc (e.g., DVD-ROM or dual-layer DVD-ROM), a read-only Blu-ray disc, an ultra-density optical disc, a flash memory card (e.g., an SD card, a mini SD card or a micro SD card) or a magnetic floppy disk. Computer-readable storage media do not contain carrier waves or transitory electronic signals transmitted by wireless or wired means.
The memory 1010 has stored thereon executable code that, when processed by the processor 1020, may cause the processor 1020 to perform some or all of the methods described above.
Furthermore, the method according to the present application may also be implemented as a computer program or computer program product comprising computer program code instructions for performing some or all of the steps of the above-described method of the present application.
Alternatively, the present application may also be embodied as a computer-readable storage medium (or non-transitory machine-readable storage medium or machine-readable storage medium) having executable code (or a computer program or computer instruction code) stored thereon, which, when executed by a processor of a server (or server, etc.), causes the processor to perform some or all of the various steps of the above-described methods according to the present application.
Having described embodiments of the present application, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
Claims (11)
1. A method of speech processing, comprising:
receiving a voice signal of a sound area in a cabin forwarded by a vehicle;
combining voice algorithms with different functions to extract the characteristics of voice signals, obtaining characteristic extraction results and caching the characteristic extraction results;
calling feature extraction results in the cache according to preset configuration parameters of each voice algorithm, and calculating through an acoustic model to obtain corresponding acoustic calculation results;
and inputting the acoustic calculation results into the decoding modules of the corresponding voice algorithms for decoding, summarizing the decoding results, obtaining voice processing results and sending the voice processing results to the vehicle.
2. The method of claim 1, wherein the combining voice algorithms with different functions to extract the characteristics of the voice signals comprises:
and in the corresponding sound zone, combining the voice algorithms with different functions according to a preset format to extract the characteristics in the voice signal.
3. The method according to claim 1, wherein the invoking the feature extraction result in the cache according to the preset configuration parameters of each speech algorithm respectively comprises:
according to the preset configuration parameters of each voice algorithm, calling feature extraction results of corresponding frame numbers in the cache for splicing to generate corresponding feature matrixes; the preset configuration parameters comprise at least one of splicing frame number, filling behavior and interval frame number.
4. The method according to claim 3, wherein said calling the feature extraction results of the corresponding frames in the cache for splicing according to the preset configuration parameters of the corresponding speech algorithm to generate the corresponding feature matrix, comprises:
when a new voice signal is received, calling feature extraction results in the cache according to the feature splicing interfaces respectively;
and when the feature extraction result meets the preset configuration parameters of at least one voice algorithm, generating a corresponding feature matrix according to a splicing function in the feature splicing interface.
5. The method according to claim 3, wherein before the step of inputting each acoustic calculation result into the decoding module of the corresponding speech algorithm for decoding, the method further comprises:
respectively determining whether the corresponding voice algorithm obtains a feature matrix;
and if the voice algorithm is determined to obtain the corresponding feature matrix, decoding according to an acoustic calculation result output by the acoustic model.
6. The method of claim 1, wherein the receiving a voice signal of a sound area in a cabin forwarded by a vehicle comprises:
respectively receiving voice signals of each sound zone in a cockpit forwarded by a vehicle;
the combining the voice algorithms with different functions to extract the characteristics of the voice signals comprises the following steps:
and combining the voice signals from the same sound area according to the voice algorithms with different functions to extract the characteristics of the voice signals.
7. The method of claim 1, wherein the obtaining corresponding acoustic computation results through acoustic model computation comprises:
and respectively calling the corresponding acoustic submodels to calculate according to a calculation interface of a preset neural network processing engine to obtain acoustic calculation results of the corresponding voice algorithm.
8. The method of claim 7, further comprising:
when the number of the sound zones in the cockpit is multiple, the acoustic models corresponding to the single sound zone or the acoustic models corresponding to the multiple sound zones are preset.
9. A server, comprising:
the information transceiving module is used for receiving the voice signals of the sound area in the cockpit forwarded by the vehicle;
the feature extraction module is used for combining the voice algorithms with different functions to extract the features of the voice signals, and obtaining and caching the feature extraction results;
the acoustic model calculation module is used for calling the feature extraction result in the cache according to the preset configuration parameters of each voice algorithm and obtaining the corresponding acoustic calculation result through the calculation of the acoustic model;
and the decoding module is used for decoding according to the acoustic calculation results respectively, gathering the decoding results to obtain a voice processing result, and sending the voice processing result to the vehicle through the information transceiving module.
10. A server, comprising:
a processor; and
a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method of any one of claims 1-8.
11. A computer readable storage medium having stored thereon executable code which, when executed by a processor of a server, causes the processor to perform the method of any one of claims 1-8.
Priority Applications (1)
- CN202210750758.1A | Priority date: 2022-06-29 | Filing date: 2022-06-29 | Title: Voice processing method and server | Publication: CN115148191A (en)
Applications Claiming Priority (1)
- CN202210750758.1A | Priority date: 2022-06-29 | Filing date: 2022-06-29 | Title: Voice processing method and server | Publication: CN115148191A (en)
Publications (1)
- CN115148191A | Publication date: 2022-10-04
Family
ID=83410166
Family Applications (1)
- CN202210750758.1A | CN115148191A (en) | Voice processing method and server | Pending
Country Status (1)
- CN | CN115148191A (en)
Cited By (1)
- CN118398011A | Priority date: 2024-06-26 | Publication date: 2024-07-26 | Assignee: 广州小鹏汽车科技有限公司 (Guangzhou Xiaopeng Motors Technology Co., Ltd.) | Title: Voice request processing method, server device and storage medium
Events
- 2022-06-29: Application filed | CN202210750758.1A | patent CN115148191A/en | status Pending
Legal Events
- PB01: Publication
- SE01: Entry into force of request for substantive examination