CN116741161A - Voice processing method, device, terminal equipment and storage medium

Voice processing method, device, terminal equipment and storage medium

Info

Publication number
CN116741161A
Authority
CN
China
Prior art keywords
voice
evaluation
preset
processed
sound
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310689788.0A
Other languages
Chinese (zh)
Inventor
刘宗栋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Xiaopeng Motors Technology Co Ltd
Original Assignee
Guangzhou Xiaopeng Motors Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Xiaopeng Motors Technology Co Ltd
Priority to CN202310689788.0A
Publication of CN116741161A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command

Abstract

The application discloses a voice processing method, a device, terminal equipment and a storage medium, wherein the voice processing method comprises the following steps: acquiring to-be-processed voice signals corresponding to each of a plurality of sound partitions; performing definition evaluation on the voice signals to be processed based on a preset definition evaluation model to obtain corresponding evaluation results; and determining a target sound partition based on the evaluation results. The scheme of the application eliminates the dependence on a reference voice signal: the definition evaluation model can adapt to the dynamic influence of factors such as environmental noise and speaker pose changes on the voice signal to be processed, and the target sound partition can be accurately determined on that basis, so that occurrences of sound partition leakage are effectively reduced.

Description

Voice processing method, device, terminal equipment and storage medium
Technical Field
The present application relates to the field of speech processing technologies, and in particular, to a speech processing method, apparatus, terminal device, and storage medium.
Background
Automobiles are becoming increasingly intelligent, and the voice interaction functions they support are increasingly rich. According to the seat layout of an automobile, the cabin can be divided into a plurality of sound partitions, and in certain voice interaction scenarios the sound partition where the speaker is located, i.e., the target sound partition, needs to be accurately identified.
Currently, the target sound partition is identified by comparing the collected voice signal to be processed with a reference voice signal and determining the target sound partition from the comparison result. However, environmental noise and changes in the speaker's pose continuously and dynamically affect the collected voice signal to be processed, and the reference voice signal cannot adapt to these dynamics, so the identified sound partition is often not the one where the speaker is located; this misidentification is called sound partition leakage.
In summary, the current method for identifying the target sound partition is prone to sound partition leakage.
Disclosure of Invention
The application mainly aims to provide a voice processing method, a voice processing device, a terminal device and a storage medium, so as to solve or mitigate the problem that the current method for identifying a target sound partition is prone to sound partition leakage.
To achieve the above object, the present application provides a voice processing method, including:
acquiring to-be-processed voice signals corresponding to each of a plurality of sound partitions;
performing definition evaluation on the voice signal to be processed based on a preset definition evaluation model to obtain a corresponding evaluation result;
based on the evaluation result, a target sound zone is determined.
Optionally, the definition evaluation model includes a frame-by-frame convolution model, a time sequence model and a pooling model, and the step of performing definition evaluation on the to-be-processed voice signal based on the preset definition evaluation model to obtain a corresponding evaluation result includes:
performing spectrum segmentation on the voice signal to be processed based on a preset window length to obtain a plurality of corresponding frame spectrums;
carrying out frame-by-frame convolution on the frame spectrums by a preset frame-by-frame convolution model to obtain first-class high-dimensional features corresponding to the frame spectrums;
modeling the time dependence of the first-class high-dimensional features through a preset time sequence model to obtain second-class high-dimensional features corresponding to the frame spectrums;
performing feature aggregation on the second-class high-dimensional features corresponding to the frame spectrums through a preset pooling model to obtain aggregated features;
and analyzing the aggregated features to obtain a corresponding evaluation result.
Optionally, the step of analyzing the aggregated features to obtain the corresponding evaluation result includes:
analyzing the aggregated features to obtain a corresponding evaluation score, wherein the type of the evaluation score includes at least one of a MOS value, a noise evaluation score and a voice evaluation score.
Optionally, the step of analyzing the aggregated features to obtain the corresponding evaluation score includes:
analyzing the aggregated features according to a preset voice quality scoring standard to obtain the corresponding evaluation score.
Optionally, the step of determining the target sound partition based on the evaluation result includes:
screening the plurality of sound partitions according to the evaluation scores and a preset threshold screening rule, and determining at least one candidate sound partition;
and comparing the evaluation scores corresponding to the candidate sound partitions according to a preset score comparison rule, and determining the target sound partition.
Optionally, before the step of performing definition evaluation on the to-be-processed voice signal based on the preset definition evaluation model to obtain the corresponding evaluation result, the method further includes:
performing voice activity detection on the voice signal to be processed, and determining at least one sound partition with voice activity;
the step of performing definition evaluation on the voice signal to be processed based on the preset definition evaluation model to obtain a corresponding evaluation result then includes:
performing definition evaluation on the voice signals to be processed corresponding to the sound partitions with voice activity based on the preset definition evaluation model to obtain corresponding evaluation results.
Optionally, after the step of determining the target sound partition based on the evaluation result, the method further includes:
and adjusting a preset sound zone allocation strategy according to the target sound partition to obtain an adjusted sound zone allocation strategy, wherein the adjusted sound zone allocation strategy is used for controlling a voice interaction task corresponding to the target sound partition.
The embodiment of the application also provides a voice processing device, which comprises:
the acquisition module is used for acquiring the voice signals to be processed corresponding to each of the plurality of sound partitions;
the evaluation module is used for performing definition evaluation on the voice signal to be processed based on a preset definition evaluation model to obtain a corresponding evaluation result;
and the determining module is used for determining the target sound partition based on the evaluation result.
The embodiment of the application also provides a terminal device, which comprises a memory, a processor and a voice processing program stored on the memory and capable of running on the processor, wherein the voice processing program realizes the steps of the voice processing method when being executed by the processor.
The embodiment of the application also provides a computer readable storage medium, wherein the computer readable storage medium stores a voice processing program, and the voice processing program realizes the steps of the voice processing method when being executed by a processor.
The voice processing method, apparatus, terminal device and storage medium provided by the embodiments of the application acquire to-be-processed voice signals corresponding to each of a plurality of sound partitions; perform definition evaluation on the voice signals to be processed based on a preset definition evaluation model to obtain corresponding evaluation results; and determine a target sound partition based on the evaluation results. In this scheme, the definition evaluation model evaluates the clarity of the voice signal to be processed to obtain an evaluation result reflecting the clarity of the voice, and the target sound partition is then determined from the evaluation result. Dependence on a reference voice signal is thereby eliminated: the definition evaluation model adapts to the dynamic influence of factors such as environmental noise and speaker pose changes on the voice signal to be processed, so the target sound partition can be accurately determined and occurrences of sound partition leakage are effectively reduced.
Drawings
FIG. 1 is a schematic diagram of functional modules of a terminal device to which a speech processing apparatus of the present application belongs;
FIG. 2 is a flowchart of a first exemplary embodiment of a speech processing method according to the present application;
FIG. 3 is a flowchart of a second exemplary embodiment of a speech processing method according to the present application;
FIG. 4 is a schematic diagram of a definition evaluation model according to the speech processing method of the present application;
FIG. 5 is a flowchart of a third exemplary embodiment of a speech processing method according to the present application;
FIG. 6 is a flowchart of a fourth exemplary embodiment of a speech processing method according to the present application;
FIG. 7 is a flowchart of a fifth exemplary embodiment of a speech processing method according to the present application;
FIG. 8 is a flowchart of a sixth exemplary embodiment of a speech processing method according to the present application;
FIG. 9 is a flowchart of a seventh exemplary embodiment of a speech processing method according to the present application.
The achievement of the objects, functional features and advantages of the present application will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
The main solution of the embodiments of the present application is: acquiring to-be-processed voice signals corresponding to each of a plurality of sound partitions; performing definition evaluation on the voice signals to be processed based on a preset definition evaluation model to obtain corresponding evaluation results; and determining a target sound partition based on the evaluation results. The definition evaluation model evaluates the clarity of the voice signal to be processed to obtain an evaluation result reflecting the clarity of the voice, and the target sound partition is then determined from the evaluation result. Dependence on a reference voice signal is thereby eliminated: the definition evaluation model adapts to the dynamic influence of factors such as environmental noise and speaker pose changes on the voice signal to be processed, so the target sound partition can be accurately determined and occurrences of sound partition leakage are effectively reduced.
Specifically, referring to fig. 1, fig. 1 is a schematic diagram of the functional modules of a terminal device to which the speech processing apparatus of the present application belongs. The speech processing apparatus may be a device, independent of the terminal device, that is capable of speech processing and may be carried on the terminal device in the form of hardware or software. The terminal device may be an intelligent mobile terminal with a data processing function, such as a mobile phone or a tablet computer, or may be a fixed terminal device or a server with a data processing function.
In this embodiment, the terminal device to which the speech processing apparatus belongs at least includes an output module 110, a processor 120, a memory 130, and a communication module 140.
The memory 130 stores an operating system and a voice processing program. The voice processing apparatus may acquire the to-be-processed voice signals corresponding to each of the plurality of sound partitions, perform definition evaluation on the voice signals to be processed based on a preset definition evaluation model to obtain corresponding evaluation results, and store information such as the sound partition information corresponding to the determined target sound partition in the memory 130. The output module 110 may be a display screen or the like. The communication module 140 may include a Wi-Fi module, a mobile communication module, a Bluetooth module, and the like, and the terminal device communicates with external devices or servers through the communication module 140.
Wherein the voice processing program in the memory 130 when executed by the processor performs the steps of:
acquiring to-be-processed voice signals corresponding to each of a plurality of sound partitions;
performing definition evaluation on the voice signal to be processed based on a preset definition evaluation model to obtain a corresponding evaluation result;
based on the evaluation result, a target sound zone is determined.
Further, the speech processing program in the memory 130 when executed by the processor also implements the steps of:
performing spectrum segmentation on the voice signal to be processed based on a preset window length to obtain a plurality of corresponding frame spectrums;
carrying out frame-by-frame convolution on the frame spectrums by a preset frame-by-frame convolution model to obtain first-class high-dimensional features corresponding to the frame spectrums;
modeling the time dependence of the first-class high-dimensional features through a preset time sequence model to obtain second-class high-dimensional features corresponding to the frame spectrums;
performing feature aggregation on the second-class high-dimensional features corresponding to the frame spectrums through a preset pooling model to obtain aggregated features;
and analyzing the aggregated features to obtain a corresponding evaluation result.
Further, the speech processing program in the memory 130 when executed by the processor also implements the steps of:
analyzing the aggregated features to obtain a corresponding evaluation score, wherein the type of the evaluation score includes at least one of a MOS value, a noise evaluation score and a voice evaluation score.
Further, the speech processing program in the memory 130 when executed by the processor also implements the steps of:
analyzing the aggregated features according to a preset voice quality scoring standard to obtain the corresponding evaluation score.
Further, the speech processing program in the memory 130 when executed by the processor also implements the steps of:
screening the plurality of sound partitions according to the evaluation scores and a preset threshold screening rule, and determining at least one candidate sound partition;
and comparing the evaluation scores corresponding to the candidate sound partitions according to a preset score comparison rule, and determining the target sound partition.
Further, the speech processing program in the memory 130 when executed by the processor also implements the steps of:
performing voice activity detection on the voice signal to be processed, and determining at least one sound partition with voice activity;
and performing definition evaluation on the voice signals to be processed corresponding to the sound partitions with voice activity based on a preset definition evaluation model to obtain corresponding evaluation results.
Further, the speech processing program in the memory 130 when executed by the processor also implements the steps of:
and adjusting a preset sound zone allocation strategy according to the target sound partition to obtain an adjusted sound zone allocation strategy, wherein the adjusted sound zone allocation strategy is used for controlling a voice interaction task corresponding to the target sound partition.
According to the above scheme, to-be-processed voice signals corresponding to each of the plurality of sound partitions are acquired; definition evaluation is performed on the voice signals to be processed based on a preset definition evaluation model to obtain corresponding evaluation results; and a target sound partition is determined based on the evaluation results. In this embodiment, the definition evaluation model evaluates the clarity of the voice signal to be processed to obtain an evaluation result reflecting the clarity of the voice, and the target sound partition is then determined from the evaluation result. Dependence on a reference voice signal is thereby eliminated: the definition evaluation model adapts to the dynamic influence of factors such as environmental noise and speaker pose changes on the voice signal to be processed, so the target sound partition can be accurately determined and occurrences of sound partition leakage are effectively reduced.
Referring to fig. 2, a first embodiment of a speech processing method of the present application provides a flowchart, where the speech processing method includes:
step S10, a voice signal to be processed corresponding to each of the plurality of sound partitions is obtained.
Specifically, the voice processing method of this embodiment may be applied to multi-zone voice interaction scenarios, for example to the intelligent cabin of an automobile. A sound partition is one of several regions into which the overall in-cabin sound system is divided; taking a four-seat or five-seat intelligent cabin system as an example, the cabin may be divided into a driver sound partition, a front passenger sound partition, a left rear sound partition and a right rear sound partition. Furthermore, on the basis of this division, each sound partition can independently support at least one of sound collection, sound playback and linkage component control.
In order to determine the sound partition where the speaker is located, the to-be-processed voice signals corresponding to each of the plurality of sound partitions need to be acquired. More specifically, a recording unit, such as a microphone unit, may be preset for each sound partition, so that the to-be-processed voice signal corresponding to each sound partition can be obtained.
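As a minimal illustration of this acquisition step, the sketch below treats the per-partition signals as the channels of a multi-channel recording, one microphone channel per sound partition. The zone names, the file name and the choice of the soundfile library are assumptions for illustration, not details taken from the patent.

```python
import soundfile as sf  # assumed I/O library: pip install soundfile

# Hypothetical four-partition cabin layout, one microphone channel per partition.
ZONES = ["driver", "front_passenger", "rear_left", "rear_right"]

def acquire_zone_signals(wav_path):
    """Read a multi-channel recording and split it into per-partition signals."""
    data, sample_rate = sf.read(wav_path)  # data shape: (frames, channels)
    assert data.shape[1] == len(ZONES), "expected one channel per sound partition"
    return {zone: data[:, i] for i, zone in enumerate(ZONES)}, sample_rate

signals, fs = acquire_zone_signals("cabin_recording.wav")  # hypothetical file
```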
And step S20, performing definition evaluation on the voice signal to be processed based on a preset definition evaluation model to obtain a corresponding evaluation result.
Specifically, after the to-be-processed voice signals corresponding to the plurality of sound partitions are obtained, they are used as input to the definition evaluation model, which evaluates their clarity and yields an evaluation result for each sound partition. The definition evaluation model may perform spectrum segmentation, convolution, time-dependency modeling, feature aggregation and similar processing on the voice signal to be processed, first outputting aggregated features that characterize the clarity of the voice signal to be processed; the aggregated features are then further quantized to obtain the corresponding evaluation result.
It will be appreciated that the evaluation result may be an evaluation score, an evaluation level or any other data form that can characterize the clarity of the voice signal to be processed.
And step S30, determining a target sound partition based on the evaluation result.
Specifically, the evaluation result reflects the clarity of the voice in the corresponding sound partition, and the sound partition with the best clarity can be determined as the target sound partition. For example, when the evaluation result is an evaluation score, the sound partition with the highest evaluation score may be determined as the target sound partition by score comparison; when the evaluation result is an evaluation level, the sound partition with the highest evaluation level may be determined by level comparison. Similarly, if the evaluation result takes another data form that characterizes the clarity of the voice signal to be processed, the sound partition with the best clarity can be determined as the target sound partition by the corresponding analysis.
According to the above scheme, to-be-processed voice signals corresponding to each of the plurality of sound partitions are acquired; definition evaluation is performed on the voice signals to be processed based on a preset definition evaluation model to obtain corresponding evaluation results; and a target sound partition is determined based on the evaluation results. In this embodiment, the definition evaluation model evaluates the clarity of the voice signal to be processed to obtain an evaluation result reflecting the clarity of the voice, and the target sound partition is then determined from the evaluation result. Dependence on a reference voice signal is thereby eliminated: the definition evaluation model adapts to the dynamic influence of factors such as environmental noise and speaker pose changes on the voice signal to be processed, so the target sound partition can be accurately determined and occurrences of sound partition leakage are effectively reduced.
Further, referring to fig. 3, fig. 3 is a flowchart of a second embodiment of the speech processing method of the present application. Based on the embodiment shown in fig. 2, step S20, performing definition evaluation on the voice signal to be processed based on a preset definition evaluation model to obtain a corresponding evaluation result, is further refined to include:
step S201, performing spectrum segmentation on the to-be-processed voice signal based on a preset window length, to obtain a plurality of corresponding frame spectrums.
Specifically, a window length (i.e., a window function) may be preset according to an actual spectrum analysis requirement, a short-time fourier transform (short-time Fourier transform, STFT) is used to divide a to-be-processed voice signal corresponding to each of a plurality of sound partitions into a plurality of frames, and then each frame is multiplied by the window length to obtain a corresponding spectrum, so as to obtain a plurality of corresponding frame spectrums.
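A minimal sketch of this segmentation using SciPy's STFT follows. The 32 ms window and 16 ms hop are illustrative choices; the patent only requires that some window length be preset.

```python
import numpy as np
from scipy.signal import stft

def frame_spectrums(signal, fs, window_ms=32, hop_ms=16):
    """Split a signal into windowed frames and return one magnitude spectrum
    per frame. Window and hop sizes are illustrative, not from the patent."""
    nperseg = int(fs * window_ms / 1000)
    noverlap = nperseg - int(fs * hop_ms / 1000)
    _, _, zxx = stft(signal, fs=fs, window="hann", nperseg=nperseg, noverlap=noverlap)
    return np.abs(zxx).T  # shape: (num_frames, num_freq_bins)
```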
Step S202, carrying out frame-by-frame convolution on the frame spectrums by a preset frame-by-frame convolution model to obtain first-class high-dimensional features corresponding to the frame spectrums.
Specifically, as shown in fig. 4, fig. 4 is a schematic diagram of the definition evaluation model of the speech processing method of the present application; the definition evaluation model includes a frame-by-frame convolution model, which is a model based on convolutional neural networks (CNN). The frame spectrums are respectively used as input to the frame-by-frame convolution model, which convolves them frame by frame to obtain the first-class high-dimensional features corresponding to the frame spectrums. It can be understood that the first-class high-dimensional features are high-dimensional features extracted in the spatial (spectral) dimension.
Step S203, performing time-dependency modeling on the first-class high-dimensional features through a preset time sequence model to obtain second-class high-dimensional features corresponding to the frame spectrums.
Specifically, as shown in fig. 4, the definition evaluation model includes a time sequence model, which is a model based on Long Short-Term Memory (LSTM) networks. The first-class high-dimensional features corresponding to the frame spectrums are used as input to the time sequence model, which models their time dependence to obtain the second-class high-dimensional features corresponding to the frame spectrums. It can be understood that the second-class high-dimensional features are high-dimensional features extracted in the time dimension.
Step S204, performing feature aggregation on the second-class high-dimensional features corresponding to the frame spectrums through a preset pooling model to obtain aggregated features.
Specifically, as shown in fig. 4, the definition evaluation model includes a pooling model, which reduces the amount of computation while retaining the main features. The second-class high-dimensional features corresponding to the frame spectrums are used as input to the pooling model, which aggregates them to obtain the aggregated features. It can be understood that the aggregated features are acoustic features that characterize the clarity of the voice.
Step S205, analyzing the aggregated features to obtain a corresponding evaluation result.
Specifically, after the aggregated features are obtained, they are further quantized to obtain the corresponding evaluation result. The evaluation result may be an evaluation score, an evaluation level or any other data form that can characterize the clarity of the voice signal to be processed.
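Pulling steps S201 to S205 together, a minimal PyTorch sketch of the three-stage definition evaluation model follows. The patent fixes only the structure (a frame-by-frame CNN, an LSTM-based time sequence model, pooling and a score output); all layer sizes, the mean-pooling choice and the sigmoid mapping onto a 1 to 5 scale are assumptions.

```python
import torch
import torch.nn as nn

class DefinitionEvaluationModel(nn.Module):
    """Frame-by-frame CNN -> LSTM -> pooling -> evaluation scores (sizes illustrative)."""

    def __init__(self, n_bins=257, hidden=128, n_scores=3):  # e.g. MOS, noise, voice
        super().__init__()
        # Frame-by-frame convolution: the same 1-D CNN is applied to every frame
        # spectrum, producing the first-class (spatially extracted) features.
        self.frame_cnn = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(32), nn.Flatten(),  # -> 16 * 32 = 512 dims per frame
        )
        # Time sequence model: models time dependence across frames
        # (second-class, temporally extracted features).
        self.lstm = nn.LSTM(16 * 32, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_scores)

    def forward(self, frames):  # frames: (batch, num_frames, n_bins)
        b, t, f = frames.shape
        x = self.frame_cnn(frames.reshape(b * t, 1, f))  # first-class features
        x, _ = self.lstm(x.reshape(b, t, -1))            # second-class features
        pooled = x.mean(dim=1)                           # pooling -> aggregated features
        return 1.0 + 4.0 * torch.sigmoid(self.head(pooled))  # scores on a 1-5 scale
```

With a 32 ms window at 16 kHz (512 samples), the STFT sketch above yields 257 frequency bins per frame, which is why n_bins defaults to 257 here.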
According to the above scheme, spectrum segmentation is performed on the voice signal to be processed based on a preset window length to obtain a plurality of corresponding frame spectrums; a preset frame-by-frame convolution model convolves the frame spectrums frame by frame to obtain first-class high-dimensional features; a preset time sequence model models the time dependence of the first-class high-dimensional features to obtain second-class high-dimensional features; a preset pooling model aggregates the second-class high-dimensional features to obtain aggregated features; and the aggregated features are analyzed to obtain a corresponding evaluation result. In this embodiment, the to-be-processed voice signals corresponding to the plurality of sound partitions undergo spectrum segmentation, convolution, time-dependency modeling and feature aggregation, so that their aggregated features can be obtained, and further quantizing the aggregated features yields evaluation results reflecting the clarity of the voice. The target sound partition can therefore be accurately identified on the basis of the evaluation results, effectively reducing occurrences of sound partition leakage.
Further, referring to fig. 5, fig. 5 is a flowchart of a third embodiment of the speech processing method of the present application. Based on the embodiment shown in fig. 3, step S205, analyzing the aggregated features to obtain the corresponding evaluation result, is further refined to include:
Step S2051, analyzing the aggregated features to obtain a corresponding evaluation score, where the type of the evaluation score includes at least one of a MOS value, a noise evaluation score and a voice evaluation score.
Specifically, this embodiment adopts the evaluation score as the data form of the evaluation result, and the type of the evaluation score includes at least one of a MOS value (Mean Opinion Score), a noise evaluation score and a voice evaluation score. Notably, the voice evaluation score reflects the loudness of the voice.
Generally, the MOS value is a mandatory evaluation score type, and at least one of the noise evaluation score and the voice evaluation score may be combined with it. For example, two score types may be used, the MOS value and the noise evaluation score; or the MOS value and the voice evaluation score; or all three types, the MOS value, the noise evaluation score and the voice evaluation score.
It is noted that the MOS value, the noise evaluation score and the voice evaluation score each have a corresponding score range, for example 1 to 5 points, and generally a higher score indicates better performance on the corresponding evaluation dimension.
According to the above scheme, the aggregated features are analyzed to obtain a corresponding evaluation score, where the type of the evaluation score includes at least one of a MOS value, a noise evaluation score and a voice evaluation score. In this embodiment, selectively combining the MOS value, the noise evaluation score and the voice evaluation score as evaluation score types allows the clarity of the to-be-processed voice signal to be evaluated more comprehensively, so that the target sound partition can be accurately determined and occurrences of sound partition leakage are effectively reduced.
Further, referring to fig. 6, fig. 6 is a flowchart of a fourth embodiment of the speech processing method of the present application. Based on the embodiment shown in fig. 5, step S2051, analyzing the aggregated features to obtain the corresponding evaluation score, is further refined to include:
Step S2052, analyzing the aggregated features according to a preset voice quality scoring standard to obtain a corresponding evaluation score.
Specifically, in order for the evaluation score to reflect the clarity of the voice signal to be processed more reasonably, this embodiment analyzes the aggregated features according to a preset voice quality scoring standard to obtain the corresponding evaluation score. The preset voice quality scoring standard may be a scoring standard based on prior knowledge; for example, the P.808 speech quality assessment standard established by the International Telecommunication Union (ITU-T P.808) may be adopted. The voice quality score may cover multiple dimensions, such as the MOS value, the noise evaluation score and the voice evaluation score.
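For illustration only, the snippet below maps a continuous MOS onto the absolute category rating labels used in MOS-style listening tests such as ITU-T P.808; the nearest-label rounding rule is an assumption, not a rule taken from the standard or the patent.

```python
# Standard absolute-category-rating labels for MOS-style quality tests.
ACR_LABELS = {5: "Excellent", 4: "Good", 3: "Fair", 2: "Poor", 1: "Bad"}

def describe_score(mos):
    """Map a continuous MOS in [1, 5] to its nearest ACR category label."""
    return ACR_LABELS[min(5, max(1, round(mos)))]

print(describe_score(3.6))  # -> "Good"
```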
According to the above scheme, the corresponding evaluation score is obtained by analyzing the aggregated features according to a preset voice quality scoring standard. This embodiment introduces a voice quality scoring standard into the evaluation process; since such a standard can be derived from prior knowledge, for example international standards, the evaluation score characterizes the clarity of the voice signal to be processed more reasonably.
Further, referring to fig. 7, fig. 7 is a flowchart of a fifth embodiment of the speech processing method of the present application. Based on the embodiment shown in fig. 5, step S30, determining the target sound partition based on the evaluation result, is further refined to include:
step S301, screening the plurality of sound partitions according to the evaluation value and a preset threshold screening rule, and determining at least one candidate sound partition.
Specifically, if the evaluation value corresponding to a certain sound partition shows extremely poor performance, it may be determined that the to-be-processed speech signal corresponding to the sound partition is not clear enough and lacks the necessity of further processing. Therefore, the plurality of sound partitions can be screened according to the evaluation values corresponding to the plurality of sound partitions and a preset threshold screening rule, and at least one candidate sound partition is determined. The preset threshold screening rule is to preset a corresponding threshold according to the type of the evaluation score, compare a certain type of the evaluation score with the corresponding threshold, and consider the evaluation score of the type as an effective evaluation score if the evaluation score is greater than (or equal to) the corresponding threshold; if it is less than or equal to (or less than) the corresponding threshold value, then the evaluation score of that type is considered an invalid evaluation score.
For example, a threshold corresponding to the MOS value may be preset to be 1.5, a threshold corresponding to the noise evaluation value is 1, and a threshold corresponding to the voice evaluation value is 2. Then, when the MOS value is greater than or equal to 1.5, the noise evaluation value is greater than or equal to 1, and the voice evaluation value is greater than or equal to 2, it may be determined that the MOS value, the noise evaluation value, and the voice evaluation value are all valid, and further it is determined that the corresponding voice partition is the to-be-selected voice partition.
It is understood that the candidate sound partition refers to a sound partition in which evaluation scores of the corresponding to-be-processed voice signals are valid.
Step S302, comparing the evaluation scores corresponding to the candidate sound partitions according to a preset score comparison rule, and determining the target sound partition.
Specifically, after at least one candidate sound partition is determined by screening, the evaluation scores corresponding to the candidate sound partitions may be compared according to a preset score comparison rule, and the candidate sound partition with the highest score determined as the target sound partition. For the score comparison, the preferred evaluation score type is the MOS value, since the MOS value best characterizes the clarity of the voice signal to be processed for a given candidate sound partition.
For example, suppose there are three candidate sound partitions, with MOS values of 3, 2 and 1 respectively. Then, by score comparison, the first candidate sound partition, which has the highest MOS value, is taken as the target sound partition.
As another example, if there is only one candidate sound partition, it may be directly determined as the target sound partition.
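Combining steps S301 and S302, a sketch of the screen-then-compare logic follows, using the illustrative thresholds from the example above (MOS 1.5, noise 1, voice 2). The dictionary layout and zone names are assumptions.

```python
# Thresholds taken from the worked example above; they are illustrative presets.
THRESHOLDS = {"mos": 1.5, "noise": 1.0, "voice": 2.0}

def select_target_partition(zone_scores):
    """zone_scores: {zone: {"mos": ..., "noise": ..., "voice": ...}}.
    Screen partitions against the thresholds, then pick the highest MOS."""
    candidates = {
        zone: s for zone, s in zone_scores.items()
        if all(s[key] >= THRESHOLDS[key] for key in THRESHOLDS)
    }
    if not candidates:
        return None  # no partition passed the threshold screening
    return max(candidates, key=lambda zone: candidates[zone]["mos"])

scores = {
    "driver":          {"mos": 3.0, "noise": 2.5, "voice": 3.1},
    "front_passenger": {"mos": 2.0, "noise": 1.8, "voice": 2.4},
    "rear_left":       {"mos": 1.0, "noise": 0.8, "voice": 1.5},  # screened out
}
print(select_target_partition(scores))  # -> "driver"
```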
According to the above scheme, the plurality of sound partitions are screened according to the evaluation scores and a preset threshold screening rule, and at least one candidate sound partition is determined; the evaluation scores corresponding to the candidate sound partitions are then compared according to a preset score comparison rule, and the target sound partition is determined. In this embodiment, sound partitions with poor clarity are first screened out under the threshold screening rule, leaving at least one candidate sound partition; the evaluation scores of the candidates are then compared under the preset score comparison rule, and the candidate with the highest evaluation score is determined to be the target sound partition. The target sound partition can thus be accurately determined, effectively reducing occurrences of sound partition leakage.
Further, referring to fig. 8, fig. 8 is a flowchart of a sixth embodiment of the speech processing method of the present application. Based on the embodiment shown in fig. 2, before step S20, performing definition evaluation on the voice signal to be processed based on a preset definition evaluation model to obtain a corresponding evaluation result, the method further includes:
Step S001, performing voice activity detection on the voice signal to be processed, and determining at least one sound partition with voice activity.
Specifically, voice activity detection (VAD) is a speech processing technique that aims to detect the presence or absence of a voice signal. In order to eliminate the interference of invalid voice signals, voice activity detection can be performed on the to-be-processed voice signals corresponding to each of the plurality of sound partitions, and at least one sound partition with voice activity determined. It can be understood that the to-be-processed voice signal corresponding to a sound partition with voice activity contains voice information of the speaker.
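The patent does not name a particular VAD algorithm. As a stand-in for this step, the sketch below uses a crude frame-energy test; every parameter is an assumption, and a production system would typically use a trained VAD instead.

```python
import numpy as np

def has_voice_activity(signal, fs, frame_ms=30, energy_thresh=1e-4, min_active=0.1):
    """Energy-based VAD sketch: a partition counts as active when at least
    `min_active` of its frames exceed the energy threshold. Assumes float
    samples in [-1, 1]; all parameter values are illustrative."""
    frame_len = int(fs * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    energies = np.mean(frames.astype(np.float64) ** 2, axis=1)
    return np.mean(energies > energy_thresh) >= min_active

# Reusing the per-partition signals from the acquisition sketch above:
# active_zones = [z for z, sig in signals.items() if has_voice_activity(sig, fs)]
```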
Step S20, performing definition evaluation on the to-be-processed voice signal based on a preset definition evaluation model to obtain a corresponding evaluation result, is then refined to include:
Step S206, performing definition evaluation on the to-be-processed voice signals corresponding to the sound partitions with voice activity based on a preset definition evaluation model to obtain corresponding evaluation results.
Specifically, after at least one sound partition with voice activity is determined, the to-be-processed voice signals corresponding to those partitions may be used as input to the definition evaluation model, which evaluates their clarity to obtain the corresponding evaluation results.
According to the above scheme, voice activity detection is performed on the voice signal to be processed, and at least one sound partition with voice activity is determined; definition evaluation is then performed on the to-be-processed voice signals corresponding to the sound partitions with voice activity based on a preset definition evaluation model to obtain corresponding evaluation results. In this embodiment, voice activity detection first removes the sound partitions without voice activity, and only the remaining partitions are evaluated for clarity. This reduces the computational load of the definition evaluation model and improves the accuracy of the evaluation results.
Further, referring to fig. 9, fig. 9 is a flowchart of a seventh embodiment of the speech processing method of the present application. Based on the embodiment shown in fig. 2, after step S30, determining the target sound partition based on the evaluation result, the method further includes:
Step S002, adjusting a preset sound zone allocation strategy according to the target sound partition to obtain an adjusted sound zone allocation strategy, where the adjusted sound zone allocation strategy is used to control a voice interaction task corresponding to the target sound partition.
Specifically, after the target sound partition is determined, this embodiment may adjust the preset sound zone allocation strategy according to the target sound partition to obtain the adjusted sound zone allocation strategy, which controls the execution of the voice interaction task corresponding to the target sound partition.
Taking the intelligent cabin system as an example, suppose the target sound partition is determined to be the driver sound partition and the sound zone allocation strategy is adjusted accordingly. If voice recognition shows that the speaker (here, the driver) said "open the window", it can be judged from the adjusted sound zone allocation strategy that the window to be opened is the driver's window, and the corresponding linkage component can be controlled to open it. It will be appreciated that this voice-controlled window-opening task is a voice interaction task.
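A toy sketch of such a sound zone allocation strategy follows. The component mapping and command handling are hypothetical, since the patent does not detail the strategy's data structure.

```python
# Hypothetical binding from each sound partition to the cabin components it controls.
ZONE_COMPONENTS = {
    "driver":          {"window": "driver_window"},
    "front_passenger": {"window": "front_passenger_window"},
    "rear_left":       {"window": "rear_left_window"},
    "rear_right":      {"window": "rear_right_window"},
}

def dispatch_command(target_zone, command):
    """Route a recognized command to the component bound to the target partition."""
    if command == "open the window":
        component = ZONE_COMPONENTS[target_zone]["window"]
        print(f"opening {component}")  # stand-in for the actual actuator call

dispatch_command("driver", "open the window")  # -> opening driver_window
```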
According to the above scheme, the preset sound zone allocation strategy is adjusted according to the target sound partition to obtain an adjusted sound zone allocation strategy used to control the voice interaction task corresponding to the target sound partition. In this embodiment, once the target sound partition is determined, the sound zone allocation strategy can be further adjusted to control the corresponding voice interaction task; the method can thus be applied to voice interaction scenarios such as an intelligent vehicle cabin, effectively improving the user's voice interaction or driving experience.
In addition, an embodiment of the present application further provides a voice processing apparatus, where the voice processing apparatus includes:
the acquisition module is used for acquiring the voice signals to be processed corresponding to each of the plurality of sound partitions;
the evaluation module is used for performing definition evaluation on the voice signal to be processed based on a preset definition evaluation model to obtain a corresponding evaluation result;
and the determining module is used for determining the target sound partition based on the evaluation result.
For the principles and implementation of voice processing in this embodiment, please refer to the above embodiments; they are not repeated here.
In addition, the embodiment of the application also provides a terminal device, which comprises a memory, a processor and a voice processing program stored on the memory and capable of running on the processor, wherein the voice processing program realizes the steps of the voice processing method when being executed by the processor.
Since the voice processing program, when executed by the processor, adopts all the technical solutions of all the embodiments above, it has at least all the beneficial effects brought by those technical solutions, which are not described in detail here.
In addition, the embodiment of the application also provides a computer readable storage medium, wherein the computer readable storage medium stores a voice processing program, and the voice processing program realizes the steps of the voice processing method when being executed by a processor.
Since the voice processing program, when executed by the processor, adopts all the technical solutions of all the embodiments above, it has at least all the beneficial effects brought by those technical solutions, which are not described in detail here.
Compared with the prior art, the voice processing method, device, terminal equipment and storage medium provided by the embodiments of the application acquire to-be-processed voice signals corresponding to each of a plurality of sound partitions; perform definition evaluation on the voice signals to be processed based on a preset definition evaluation model to obtain corresponding evaluation results; and determine a target sound partition based on the evaluation results. In this scheme, the definition evaluation model evaluates the clarity of the voice signal to be processed to obtain an evaluation result reflecting the clarity of the voice, and the target sound partition is then determined from the evaluation result. Dependence on a reference voice signal is thereby eliminated: the definition evaluation model adapts to the dynamic influence of factors such as environmental noise and speaker pose changes on the voice signal to be processed, so the target sound partition can be accurately determined and occurrences of sound partition leakage are effectively reduced.
It should be noted that, in this document, the terms "comprises", "comprising" or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article or system that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article or system. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article or system that comprises the element.
The foregoing embodiment numbers of the present application are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by software plus a necessary general-purpose hardware platform, or of course by hardware, though in many cases the former is preferred. Based on such understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) as described above, comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a controlled terminal, a network device, etc.) to perform the method of each embodiment of the present application.
The foregoing description covers only the preferred embodiments of the present application and is not intended to limit its scope; any equivalent structural or process transformation made using the contents of this specification and the accompanying drawings, whether applied directly or indirectly in other related technical fields, falls equally within the scope of patent protection of the present application.

Claims (10)

1. A speech processing method, the speech processing method comprising:
acquiring to-be-processed voice signals corresponding to each of a plurality of sound partitions;
performing definition evaluation on the voice signal to be processed based on a preset definition evaluation model to obtain a corresponding evaluation result;
based on the evaluation result, a target sound zone is determined.
2. The speech processing method according to claim 1, wherein the step of performing definition evaluation on the voice signal to be processed based on a preset definition evaluation model to obtain the corresponding evaluation result comprises:
performing spectrum segmentation on the voice signal to be processed based on a preset window length to obtain a plurality of corresponding frame spectrums;
carrying out frame-by-frame convolution on the frame spectrums by a preset frame-by-frame convolution model to obtain first-class high-dimensional features corresponding to the frame spectrums;
modeling the time dependence of the first-class high-dimensional features through a preset time sequence model to obtain second-class high-dimensional features corresponding to the frame spectrums;
performing feature aggregation on the second-class high-dimensional features corresponding to the frame spectrums through a preset pooling model to obtain aggregated features;
and analyzing the aggregated features to obtain a corresponding evaluation result.
3. The speech processing method according to claim 2, wherein the step of analyzing the aggregated features to obtain the corresponding evaluation result comprises:
analyzing the aggregated features to obtain a corresponding evaluation score, wherein the type of the evaluation score comprises at least one of a MOS value, a noise evaluation score and a voice evaluation score.
4. The speech processing method according to claim 3, wherein the step of analyzing the aggregated features to obtain the corresponding evaluation score comprises:
analyzing the aggregated features according to a preset voice quality scoring standard to obtain the corresponding evaluation score.
5. The voice processing method according to claim 3, wherein the step of determining the target sound partition based on the evaluation result comprises:
screening the plurality of sound partitions according to the evaluation scores and a preset threshold screening rule, and determining at least one candidate sound partition;
and comparing the evaluation scores corresponding to the candidate sound partitions according to a preset score comparison rule, and determining the target sound partition.
6. The speech processing method according to claim 1, wherein before the step of performing definition evaluation on the voice signal to be processed based on the preset definition evaluation model to obtain the corresponding evaluation result, the method further comprises:
performing voice activity detection on the voice signal to be processed, and determining at least one sound partition with voice activity;
the step of performing definition evaluation on the voice signal to be processed based on the preset definition evaluation model to obtain a corresponding evaluation result then comprises:
performing definition evaluation on the voice signals to be processed corresponding to the sound partitions with voice activity based on a preset definition evaluation model to obtain corresponding evaluation results.
7. The voice processing method according to claim 1, wherein after the step of determining the target sound partition based on the evaluation result, the method further comprises:
adjusting a preset sound zone allocation strategy according to the target sound partition to obtain an adjusted sound zone allocation strategy, wherein the adjusted sound zone allocation strategy is used for controlling a voice interaction task corresponding to the target sound partition.
8. A speech processing apparatus, characterized in that the speech processing apparatus comprises:
the acquisition module is used for acquiring the voice signals to be processed corresponding to each of the plurality of sound partitions;
the evaluation module is used for performing definition evaluation on the voice signal to be processed based on a preset definition evaluation model to obtain a corresponding evaluation result;
and the determining module is used for determining the target sound partition based on the evaluation result.
9. A terminal device, characterized in that the terminal device comprises a memory, a processor and a speech processing program stored on the memory and executable on the processor, which speech processing program, when executed by the processor, realizes the steps of the speech processing method according to any of claims 1-7.
10. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a speech processing program which, when executed by a processor, implements the steps of the speech processing method according to any of claims 1-7.
Application CN202310689788.0A, filed 2023-06-09 (priority date 2023-06-09): Voice processing method, device, terminal equipment and storage medium. Status: Pending. Published as CN116741161A (en).

Priority Applications (1)

Application Number: CN202310689788.0A
Priority Date / Filing Date: 2023-06-09
Title: Voice processing method, device, terminal equipment and storage medium

Publications (1)

Publication Number: CN116741161A
Publication Date: 2023-09-12

Family

ID=87918065

Family Applications (1)

Application Number: CN202310689788.0A
Priority Date / Filing Date: 2023-06-09
Title: Voice processing method, device, terminal equipment and storage medium

Country Status (1)

Country: CN
Publication: CN116741161A (en)


Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination