CN117524259A - Audio processing method and system - Google Patents

Audio processing method and system

Info

Publication number: CN117524259A
Application number: CN202311463138.0A
Authority: CN (China)
Prior art keywords: data, emotion, audio, voice, intonation
Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Other languages: Chinese (zh)
Inventor: 黄焕伟
Current assignee: Shenzhen Desfine Acoustics Co., Ltd. (the listed assignee may be inaccurate)
Original assignee: Shenzhen Desfine Acoustics Co., Ltd.
Application filed by: Shenzhen Desfine Acoustics Co., Ltd.

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 - characterised by the type of extracted parameters
    • G10L 25/27 - characterised by the analysis technique
    • G10L 25/48 - specially adapted for particular use
    • G10L 25/51 - specially adapted for particular use for comparison or discrimination
    • G10L 25/63 - for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • General Health & Medical Sciences (AREA)
  • Child & Adolescent Psychology (AREA)
  • Machine Translation (AREA)

Abstract

The present invention relates to the field of audio processing technologies, and in particular to an audio processing method and system. The method comprises the following steps: the audio is subjected to signal collection, structure quantization processing and digital code conversion to obtain audio digital data; audio content is extracted from the audio and subjected to text word segmentation and emotion analysis to obtain text emotion analysis data; the audio is time-stamped, and ambient sound correction and emotion feature analysis are performed to obtain speech speed emotion feature data and intonation emotion feature data, after which feature structure integration yields speech emotion feature data and emotion recognition yields speech emotion analysis data; finally, an audio scene model is constructed from the text emotion analysis data and the speech emotion analysis data to obtain an audio scene restoration model, so as to realize restoration and playback of the audio scene. Through this optimized processing of the audio, the invention makes the audio processing process more accurate.

Description

Audio processing method and system
Technical Field
The present invention relates to the field of audio processing technologies, and in particular, to an audio processing method and system.
Background
Audio processing is widely used in modern communication, entertainment and broadcasting. However, conventional audio processing technology is concerned only with the auditory effect of the audio: it cannot accurately analyze the scene effect that the audio presents, reproduces that scene effect poorly, and introduces large delays.
Disclosure of Invention
Based on the foregoing, it is necessary to provide an audio processing method and system to solve at least one of the above-mentioned problems.
In order to achieve the above object, an audio processing method and system are provided, the method comprising the following steps:
step S1: collecting the audio signals to obtain audio sampling signals; performing structure quantization processing on the audio sampling signal to obtain an audio sampling quantized signal; performing digital coding conversion on the audio sampling quantized signal to obtain audio digital data;
step S2: audio content extraction processing is carried out on the audio according to the audio digital data, so that audio content cleaning data are obtained; performing text word segmentation processing according to the audio content cleaning data to obtain content text word segmentation data; carrying out emotion analysis according to the content text word segmentation data to obtain text emotion analysis data;
step S3: performing time stamp marking on the audio to obtain voice time stamp data, and performing ambient sound correction on the voice time stamp data to obtain corrected voice data; performing emotion feature analysis on the corrected voice data to obtain speech speed emotion feature data and intonation emotion feature data; feature structure integration is carried out on the speech speed emotion feature data and the intonation emotion feature data to obtain speech emotion feature data; carrying out emotion recognition on the voice emotion characteristic data to obtain voice emotion analysis data;
Step S4: and constructing an audio scene model according to the text emotion analysis data and the voice emotion analysis data to obtain an audio scene restoration model so as to realize restoration and play of the audio scene.
By collecting signals from the audio, the invention obtains the original sampled signal of the audio, which helps capture its details and characteristics and provides the basic data required by subsequent processing. Structure quantization converts the continuous analog sampled signal into a discrete digital signal, so that the audio data can be processed and stored by a computer, and digital code conversion then represents the quantized signal as digital data that a computer can understand and process, providing the basis for later analysis.
By performing audio content extraction on the audio digital data, useful information and content can be drawn out of the audio, which helps in understanding the speech, dialogue and music it contains and provides basic data for subsequent analysis. Text word segmentation on the cleaned audio content converts the extracted content into meaningful words, dividing long text into smaller units that are easier to process and analyze, and emotion analysis on the segmented text infers the emotional tendency the text expresses; by analyzing its vocabulary, semantics and sentence structure, the emotional state of the text, for example positive, negative or neutral, can be judged, which helps in deeply understanding the emotional content expressed in the audio.
By time-stamping the audio, the voice data can be tied to specific time points, so that different parts of the audio can be accurately located and referenced in subsequent analysis, enabling finer processing. Ambient sound correction of the voice time stamp data removes environmental noise, improves the quality and clarity of the voice data, and reduces the interference of noise with emotion analysis, making the extraction of emotion features more accurate and reliable. Emotion feature analysis of the corrected voice data extracts speech speed and intonation features: the speech speed emotion feature data reflect how fast the speech is, while the intonation emotion feature data reflect its pitch changes, and both help in understanding the emotional information expressed in the speech. Feature structure integration fuses the speech speed and intonation emotion features into comprehensive speech emotion feature data that captures the rich emotional expression in the speech, and emotion recognition based on these features identifies the emotion expressed in the speech and provides information about the emotional state.
By combining the text emotion analysis data and the voice emotion analysis data, an audio scene model can be constructed. The model combines emotion information with audio characteristics and learns the audio feature patterns under different emotional states, so that the correlation between emotion and the audio scene is established and a foundation is provided for subsequent audio scene restoration. Based on the constructed model, the audio can then be restored to its scene: by analyzing the audio characteristics, the emotional state and scene environment corresponding to the audio are inferred, so that the emotional atmosphere, background environment and context information in the audio are restored and the audio feels richer and more real. The audio processing method and system are therefore an optimization of traditional audio processing: they solve the problems that traditional methods cannot accurately analyze the scene effect presented by the audio, reproduce that effect poorly and introduce large delays, and instead analyze the scene effect accurately, improve its reproduction and reduce delay.
Preferably, step S1 comprises the steps of:
step S11: collecting the audio signals to obtain audio signals;
step S12: performing sound signal sampling processing on the audio signal to obtain an audio sampling signal;
step S13: performing structure quantization processing on the audio sampling signal to obtain an audio sampling quantized signal;
step S14: and performing digital coding conversion on the audio sampling quantized signal to obtain audio digital data.
The invention can acquire the original audio signal by collecting the audio, which is helpful for acquiring the audio data from the audio source and provides a basis for the subsequent processing and analysis; performing sound signal sampling processing on the audio signal, converting the continuous analog audio signal into discrete audio sampling signals, and discretizing the audio signal into a series of sampling points through the sampling processing, so that the audio signal can be processed and represented in a digital system; performing a structural quantization process on the audio sample signal, mapping the continuous sample values into discrete quantized values, which helps to reduce the storage space and transmission bandwidth of the audio data while maintaining the perceptual quality of the audio; the quantized signal of the audio sample is digitally encoded, the quantized value is represented as audio data in digital form, and by digital encoding, the audio signal is converted into binary data which can be processed and stored by a computer.
Preferably, step S13 comprises the steps of:
step S131: carrying out sampling signal structure division on the audio sampling signals by using a preset amplitude structure division manual to obtain an audio structure signal set;
step S132: performing adjacent amplitude difference calculation on the audio structure signal set to obtain an audio amplitude signal set;
step S133: zero-crossing rate calculation is carried out on the audio amplitude signal set, and a zero-crossing rate signal set is obtained;
step S134: extracting overlapping parts of the audio amplitude signal sets to obtain audio amplitude overlapping signals;
step S135: and carrying out nonlinear quantization processing on the audio sampling signal according to the audio amplitude overlapping signal and the zero crossing rate signal set to obtain an audio sampling quantized signal.
The invention divides the audio sampling signal by using the preset amplitude structure division manual, and divides the audio signal into different structures, which is helpful to decompose the audio signal into smaller parts, so that the subsequent analysis and processing are finer and more accurate; performing adjacent amplitude difference calculation on the audio structure signal set to obtain an audio amplitude signal set, wherein the adjacent amplitude difference reflects the amplitude change condition of the audio signal, which is helpful for extracting amplitude characteristics for subsequent analysis and processing; the zero-crossing rate calculation is carried out on the audio amplitude signal set to obtain a zero-crossing rate signal set, wherein the zero-crossing rate represents the frequency of the audio signal passing through the zero point, reflects the rapid change condition of the audio signal, and can extract the instantaneous characteristic of the audio for subsequent analysis and processing; the overlapping part extraction is carried out on the audio amplitude signal set to obtain audio amplitude overlapping signals, and the overlapping part extraction can identify overlapping phenomena existing in audio, namely overlapping parts of a plurality of audio signals, which is helpful for separating and processing the overlapping signals and improves the quality and definition of audio data; according to the audio amplitude overlapping signal and the zero crossing rate signal set, nonlinear quantization processing is carried out on the audio sampling signal to obtain an audio sampling quantized signal, and the nonlinear quantization processing can adjust the dynamic range and amplitude distribution of the audio signal, so that the audio data is more suitable for storage and transmission.
Preferably, step S2 comprises the steps of:
step S21: audio content extraction processing is carried out on the audio according to the audio digital data, so that audio content text data are obtained;
step S22: data cleaning is carried out on the text data of the audio content, and audio content cleaning data are obtained;
step S23: performing text word segmentation processing on the audio content text data according to the audio content cleaning data to obtain content text word segmentation data;
step S24: carrying out semantic analysis on the content text word segmentation data to obtain content text semantic data;
step S25: performing associated entity labeling on the content text semantic data to obtain content text associated data;
step S26: and carrying out emotion analysis on the semantic data of the content text according to the associated data of the content text to obtain text emotion analysis data.
According to the invention, audio content extraction processing is carried out according to audio digital data, voice information in audio is converted into text data, which is helpful for extracting information in audio, so that subsequent text processing and analysis are possible; the data cleaning is carried out on the audio content text data, noise, irrelevant information and error data are removed, clean audio content data are obtained, the quality and accuracy of the data can be improved through the data cleaning, and a reliable basis is provided for subsequent processing and analysis; text word segmentation processing is carried out on the audio content cleaning data, the text is split into meaningful words or phrases, the text word segmentation can convert complex text data into a form which is easier to process and analyze, and a foundation is provided for subsequent semantic analysis and entity labeling; semantic analysis is carried out on the text word segmentation data of the content, the semantics and meaning of the text data are understood, the semantic analysis can identify entities, relations and contexts in the text, the content of the text is helped to be understood, and a foundation is provided for the follow-up associated entity labeling and emotion analysis; according to the content text cleaning data, carrying out associated entity labeling on the content text word segmentation data, identifying the entities in the text, and giving corresponding labels, wherein the associated entity labeling can identify important entities and keywords in the text, so that more accurate information is provided for subsequent analysis and application; according to the content text associated data, emotion analysis is carried out on the content text semantic data, emotion tendencies and emotion states in the text are identified, emotion analysis can help to understand emotion meanings of the text, emotion information is extracted, and a basis is provided for emotion analysis and application.
Preferably, step S26 includes the steps of:
step S261: performing emotion part-of-speech screening on the content text associated data to obtain a key emotion part-of-speech list;
step S262: carrying out vocabulary combination conversion analysis according to the key emotion part list to obtain a combined part emotion list;
step S263: establishing an emotion dictionary according to the key emotion part list and the combined part emotion list to obtain a key emotion dictionary;
step S264: performing syntactic analysis on the content text semantic data to obtain text syntactic structure data;
step S265: and carrying out emotion analysis on the text syntactic structure data according to the key emotion dictionary to obtain text emotion analysis data.
According to the invention, emotion parts of speech are screened for content text associated data, key emotion parts of speech are extracted, and specific parts of speech are screened out, so that key words expressing emotion can be focused, redundant information is reduced, and accuracy and effect of emotion analysis are improved; based on the key emotion part-of-speech list, carrying out vocabulary combination conversion analysis on the text, which means that the key emotion parts-of-speech are combined according to a certain rule to form a new combined part-of-speech, and the analysis can capture more complex emotion expression modes and provide more comprehensive and rich emotion information; according to the key emotion part list and the combined part emotion list, an emotion dictionary is established, wherein the emotion dictionary contains emotion words related to the key emotion parts and emotion trends corresponding to the emotion words, and the establishment of the emotion dictionary can provide reference for subsequent emotion analysis and help judge emotion in a text; syntactic analysis is carried out on semantic data of the content text, grammar relations and syntax structures among words in the text are analyzed, the syntactic analysis is helpful for understanding the grammar structures of the text, and context relations of emotion expression are captured, so that emotion meanings of the text are accurately interpreted; according to the key emotion dictionary and the text syntactic structure data, emotion analysis is carried out to identify emotion tendencies and emotion states in the text, emotion analysis can be based on the emotion dictionary and the syntactic structure, emotion information is associated with the context, and deeper and accurate results are provided for emotion analysis.
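For illustration, a minimal Python sketch of the dictionary-based scoring described in steps S261 to S265 is given below. The jieba part-of-speech tagger, the part-of-speech filter, the toy lexicon entries and the single-word negation rule are all assumptions standing in for the patent's key emotion dictionary and syntactic analysis, not an implementation of them.

```python
# Illustrative only: a toy emotion dictionary applied to part-of-speech filtered text.
# The lexicon entries, weights and negation rule are invented for demonstration and
# are not the patent's key emotion dictionary or syntactic-structure analysis.
import jieba.posseg as pseg  # Chinese word segmentation with part-of-speech tags

EMOTION_POS = {"a", "d", "v"}                                          # assumed key emotion parts of speech
EMOTION_DICT = {"高兴": 1.0, "满意": 0.8, "愤怒": -1.0, "失望": -0.8}   # toy lexicon

def text_emotion_score(text: str) -> float:
    """Sum dictionary polarities of emotion-bearing words, flipping the sign after a
    negation word -- a crude stand-in for syntactic analysis of the sentence."""
    score, negate = 0.0, False
    for word, flag in pseg.cut(text):
        if word in ("不", "没有"):                 # simple negation handling
            negate = True
            continue
        if flag[0] in EMOTION_POS and word in EMOTION_DICT:
            value = EMOTION_DICT[word]
            score += -value if negate else value
            negate = False
    return score
```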
Preferably, step S3 comprises the steps of:
step S31: collecting character voice data of the audio to obtain character voice data;
step S32: performing time stamp marking on the voice data to obtain voice time stamp data;
step S33: performing ambient sound correction on the voice time stamp data by using an ambient sound correction algorithm to obtain corrected voice data;
step S34: carrying out speech speed phoneme feature extraction and intonation phoneme feature extraction on the corrected voice data to obtain speech speed phoneme feature data and intonation phoneme feature data;
step S35: carrying out emotion feature analysis on the speech speed phoneme feature data and the intonation phoneme feature data to obtain speech speed emotion feature data and intonation emotion feature data;
step S36: carrying out speech speed emotion assessment on the speech speed emotion feature data according to the text emotion analysis data to obtain speech speed emotion assessment data;
step S37: carrying out emotion adaptation evaluation on the emotion feature data of the intonation according to the emotion evaluation data of the speed of speech to obtain emotion evaluation data of the intonation;
step S38: feature structure integration is carried out on the speech speed emotion estimation data and the intonation emotion estimation data to obtain speech emotion feature data;
Step S39: and carrying out emotion recognition on the voice emotion feature data through a voice analysis emotion recognizer to obtain voice emotion analysis data.
The invention collects the voice data of the audio frequency in a role to acquire the voice data of a specific role, which is helpful to correlate the audio frequency with the specific role or a speaker and provide a role context for subsequent analysis and processing; performing time stamping on the voice data, namely adding time stamping for each segment or word in the voice data, wherein the time stamping can help accurately identify and position different parts of the voice data in subsequent processing; performing ambient sound correction on the voice time stamp data by using an ambient sound correction algorithm, namely correcting and adjusting the voice data according to ambient noise so as to improve the quality and the understandability of the voice data; the voice speed phoneme feature extraction is carried out on the corrected voice data, and the voice speed related features in the voice data are extracted, so that the voice speed rhythm and beat can be captured, and a basis is provided for subsequent emotion analysis; the modified voice data is subjected to intonation phoneme feature extraction, intonation related features in the voice data are extracted, the voice pitch, the tone and the intonation change of the voice can be analyzed, and a basis is provided for subsequent emotion analysis; emotion feature analysis is carried out on the speech speed phoneme feature data and the intonation phoneme feature data, namely information related to emotion is extracted from the speech speed and the intonation features, so that the understanding of the relation between the speech speed and the intonation and emotion can be facilitated, and a basis is provided for subsequent emotion assessment; according to the text emotion analysis data, carrying out emotion evaluation on the emotion characteristic data of the speech speed, and evaluating the relation between the emotion and the speech speed characteristics, so that the influence degree of the speech speed on emotion expression can be judged, and an emotion analysis result related to the speech speed is provided; according to the emotion evaluation data of the speech speed, emotion adaptation evaluation is carried out on emotion feature data of the speech, and the relation between the emotion and the intonation feature is evaluated, so that the adaptation degree of the intonation to emotion expression can be judged, and emotion analysis results related to the intonation can be provided; feature structure integration is carried out on the speech speed emotion estimation data and the intonation emotion estimation data, and the estimation results of the speech speed emotion estimation data and the intonation emotion estimation data are integrated to obtain integrated speech emotion feature data, so that influence of speech speed and intonation on emotion can be comprehensively considered, and a more comprehensive and accurate speech emotion analysis result is provided; emotion recognition is carried out on the voice emotion feature data through the voice analysis emotion recognizer, namely, the voice data and emotion are associated and classified to obtain voice emotion analysis data, so that emotion expressed in voice can be recognized, and information about emotion states is provided.
Preferably, the ambient sound correction algorithm in step S33 is as follows:
wherein f represents corrected voice data, x represents input voice time stamp data, λ represents propagation velocity value of sound wave, μ represents environmental noise coefficient, α represents air damping coefficient, β represents sound wave amplitude coefficient, γ represents carrier frequency value, t represents voice duration value, and R represents deviation correction value of environmental sound correction algorithm.
The invention constructs an ambient sound correction algorithm in which each parameter has an important influence on the quality and adaptability of the corrected voice data; by adjusting these parameters appropriately, the intelligibility, clarity and naturalness of the voice data can be improved so that it adapts better to different environmental noise and acoustic conditions. The algorithm takes as input the voice time stamp data x, i.e. the original voice data to be corrected after time stamping, which supplies the time information needed to compute the corrected voice data. The propagation velocity value λ is the speed at which sound travels in the environment, and adjusting it lets the correction adapt to different propagation environments. The environmental noise coefficient μ represents the noise level in the environment; increasing it strengthens the correction of environmental noise, reduces its influence on the voice data and improves the clarity and intelligibility of the speech. The air damping coefficient α adjusts the damping of sound waves as they propagate through air; a suitable value reduces the attenuation caused by longer propagation distances and improves the timbre and audibility of the speech. The sound wave amplitude coefficient β represents the amplitude of the sound wave, and adjusting it strengthens the sound so that the corrected voice data is clearer. The carrier frequency value γ adjusts the frequency characteristics of the correction; a suitable value makes the corrected voice data more balanced in the frequency domain, reduces the influence of frequency offset and improves the accuracy and naturalness of the speech. The voice duration value t represents the time point or time span of the voice data and is used to compute the time position of the corrected voice data, ensuring its temporal consistency and accuracy. The deviation correction value R applies a bias correction to the result to further optimize the corrected voice data; adjusting it lets the algorithm adapt to different environmental conditions and improves its accuracy and robustness. The purpose of the algorithm is to correct the speech; conventional speech processing techniques can also perform this correction, but usually with a less good effect.
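Since the closed-form expression of the algorithm is not reproduced in the text above, the sketch below instead shows one of the conventional stand-ins mentioned at the end of the paragraph: a basic spectral-subtraction noise reducer. The frame length and the assumption that the first half second of the recording contains only ambient noise are purely illustrative, and this is not the patent's λ/μ/α/β/γ correction formula.

```python
# Conventional noise-reduction stand-in (spectral subtraction), NOT the patent's
# closed-form ambient sound correction formula, which is not reproduced in the text.
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(x: np.ndarray, fs: int, noise_seconds: float = 0.5) -> np.ndarray:
    """Estimate the noise spectrum from the first `noise_seconds` of the signal
    and subtract it from the magnitude spectrogram."""
    _, _, Z = stft(x, fs=fs, nperseg=512)
    mag, phase = np.abs(Z), np.angle(Z)
    noise_frames = max(1, int(noise_seconds * fs / 256))   # hop length = nperseg // 2
    noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)
    clean_mag = np.maximum(mag - noise_mag, 0.0)            # floor the subtracted magnitude at zero
    _, y = istft(clean_mag * np.exp(1j * phase), fs=fs, nperseg=512)
    return y[: len(x)]
```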
Preferably, step S35 includes the steps of:
step S351: carrying out three-dimensional thermodynamic diagram drawing on the speech speed phoneme characteristic data and the intonation phoneme characteristic data to respectively obtain a speech speed phoneme characteristic diagram and a intonation phoneme characteristic diagram;
step S352: the method comprises the steps of carrying out central thermal region marking on a speech speed phoneme feature graph and a intonation phoneme feature graph to obtain speech speed thermal region data and intonation thermal region data;
step S353: carrying out distribution density calculation on the speech speed thermodynamic region data according to the speech speed phoneme characteristic diagram to obtain speech speed thermodynamic density data;
step S354: carrying out fluctuation change calculation on the intonation thermal region data according to the intonation phoneme feature diagram to obtain intonation fluctuation change data;
step S355: carrying out regional density random extraction on the speech speed thermodynamic density data to obtain speech speed random density data; randomly extracting the intonation fluctuation data to obtain intonation random fluctuation data;
step S356: respectively carrying out Monte Carlo simulation on the speech speed random density data and the intonation random fluctuation data to respectively obtain speech speed simulation output data and intonation simulation output data;
step S357: and carrying out emotion feature analysis according to the speech speed simulation output data and the intonation simulation output data to obtain speech speed emotion feature data and intonation emotion feature data.
According to the invention, by drawing the three-dimensional thermodynamic diagram of the speech speed phoneme characteristic data and the intonation phoneme characteristic data, the distribution condition of the speech speed and the intonation on different phonemes can be intuitively displayed, which is helpful for understanding the integral characteristics of the speech speed and the intonation and finding the rules and trends therein; by carrying out central thermal region marking on the speech speed phoneme feature diagram and the intonation phoneme feature diagram, the thermal regions of the speech speed and the intonation, namely the regions with higher or lower values in a specific range, can be determined, so that the key feature regions of the speech speed and the intonation are focused, and related information is extracted; according to the speech speed phoneme feature diagram, carrying out distribution density calculation on the speech speed thermodynamic region data to obtain speech speed thermodynamic density data, so that the distribution features of the speech speed can be quantized, the density distribution condition of the speech speed in different regions is known, and a basis is provided for subsequent analysis; the intonation thermal region data is calculated to obtain intonation fluctuation change data according to the intonation phoneme feature diagram, and the fluctuation degree of intonation in different regions, namely the fluctuation range of intonation, can be reflected. This helps understand the dynamics and expression characteristics of intonation; carrying out regional density random extraction on the speech speed thermodynamic density data to obtain speech speed random density data; the intonation fluctuation data are subjected to fluctuation random extraction to obtain intonation random fluctuation data, and the randomly extracted data can be used for simulating the random change conditions of the speed and intonation, so that data samples are further enriched, and the analysis diversity is increased; monte Carlo simulation is carried out on the speech speed random density data and the intonation random fluctuation data, so that simulation output data can be generated, the simulation output data can help to evaluate the influence of the change of the speech speed and the intonation on the emotion characteristics, and the basis for evaluation and prediction is provided; according to the speech speed simulation output data and the intonation simulation output data, emotion feature analysis is carried out, so that speech speed emotion feature data and intonation emotion feature data can be obtained, the data reflect the relation between the speech speed and the intonation and emotion, the understanding of the roles of the speech speed and the intonation in emotion expression is facilitated, and references are provided for applications such as emotion recognition and emotion generation.
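A minimal sketch of the random extraction and Monte Carlo step of S355 to S356 is shown below: the measured speech-speed densities and intonation fluctuations are resampled many times and each run's mean is recorded. The bootstrap-style resampling and the example arrays are assumptions for illustration, not the patent's exact simulation procedure.

```python
# Illustrative Monte Carlo step for S355-S356: repeatedly resample the measured
# speech-speed densities and intonation fluctuations, then summarise the simulated
# outputs. The resampling scheme and example values are for demonstration only.
import numpy as np

rng = np.random.default_rng(0)

def monte_carlo_summary(values: np.ndarray, n_runs: int = 10_000) -> dict:
    """Each run draws random samples (with replacement) from the observed values
    and records their mean; the spread of run means is the simulation output."""
    draws = rng.choice(values, size=(n_runs, len(values)), replace=True)
    run_means = draws.mean(axis=1)
    return {"mean": float(run_means.mean()), "std": float(run_means.std())}

# Hypothetical outputs of steps S353-S354
speed_density = np.array([0.8, 1.1, 0.9, 1.4, 1.0])
pitch_fluctuation = np.array([12.0, 35.0, 20.0, 28.0, 15.0])
speed_sim = monte_carlo_summary(speed_density)        # speech-speed simulation output data
pitch_sim = monte_carlo_summary(pitch_fluctuation)    # intonation simulation output data
```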
Drawings
FIG. 1 is a schematic flow chart of steps of an audio processing method and system;
FIG. 2 is a flowchart illustrating the detailed implementation of step S3 in FIG. 1;
FIG. 3 is a detailed flowchart illustrating the implementation of step S35 in FIG. 2;
the achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
The following is a clear and complete description of the technical method of the present patent in conjunction with the accompanying drawings, and it is evident that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, are intended to fall within the scope of the present invention.
Furthermore, the drawings are merely schematic illustrations of the present invention and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus a repetitive description thereof will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. The functional entities may be implemented in software or in one or more hardware modules or integrated circuits or in different networks and/or processor methods and/or microcontroller methods.
It will be understood that, although the terms "first," "second," etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another element. For example, a first element could be termed a second element, and, similarly, a second element could be termed a first element, without departing from the scope of example embodiments. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
To achieve the above objective, and with reference to fig. 1 to 3, an audio processing method and system are provided, the method comprising the following steps:
step S1: collecting the audio signals to obtain audio sampling signals; performing structure quantization processing on the audio sampling signal to obtain an audio sampling quantized signal; performing digital coding conversion on the audio sampling quantized signal to obtain audio digital data;
step S2: audio content extraction processing is carried out on the audio according to the audio digital data, so that audio content cleaning data are obtained; performing text word segmentation processing according to the audio content cleaning data to obtain content text word segmentation data; carrying out emotion analysis according to the content text word segmentation data to obtain text emotion analysis data;
Step S3: performing time stamp marking on the audio to obtain voice time stamp data, and performing ambient sound correction on the voice time stamp data to obtain corrected voice data; performing emotion feature analysis on the corrected voice data to obtain speech speed emotion feature data and intonation emotion feature data; feature structure integration is carried out on the speech speed emotion feature data and the intonation emotion feature data to obtain speech emotion feature data; carrying out emotion recognition on the voice emotion characteristic data to obtain voice emotion analysis data;
step S4: and constructing an audio scene model according to the text emotion analysis data and the voice emotion analysis data to obtain an audio scene restoration model so as to realize restoration and play of the audio scene.
In the embodiment of the present invention, fig. 1 shows a schematic flow chart of the steps of the audio processing method and system of the present invention; in this example, the audio processing method comprises the following steps:
step S1: collecting the audio signals to obtain audio sampling signals; performing structure quantization processing on the audio sampling signal to obtain an audio sampling quantized signal; performing digital coding conversion on the audio sampling quantized signal to obtain audio digital data;
In the embodiment of the present invention, an audio signal is collected by a microphone or other audio device and then sampled, i.e. the continuous audio signal is discretized into a series of discrete audio samples. The sampled signal is quantized, converting the continuous analog values into discrete digital values by mapping each sample into a specific discrete value range, typically with a uniform or non-uniform quantization method. The quantized audio signal is finally digitally encoded and represented as digital data that a computer can understand and process; common encoding methods include pulse code modulation (PCM) and compression encoding.
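A minimal sketch of this chain is given below, assuming a 16 kHz sampling rate, uniform 16-bit quantization and a PCM container; a synthetic tone and the output file name stand in for real microphone input.

```python
# Minimal sketch of the S1 chain: a (here synthetic) signal is sampled, uniformly
# quantized to 16 bits and written out as PCM. Sample rate, tone and file name are
# assumptions for illustration only.
import wave
import numpy as np

FS = 16_000                                   # assumed sampling rate (Hz)
t = np.arange(0, 1.0, 1.0 / FS)               # 1 s of "sampled" signal
samples = 0.5 * np.sin(2 * np.pi * 440 * t)   # stand-in for microphone input in [-1, 1]

# Uniform quantization to the 16-bit integer range
pcm16 = np.clip(np.round(samples * 32767), -32768, 32767).astype(np.int16)

with wave.open("audio_digital_data.wav", "wb") as wf:   # PCM encoding into a WAV container
    wf.setnchannels(1)
    wf.setsampwidth(2)        # 16-bit samples
    wf.setframerate(FS)
    wf.writeframes(pcm16.tobytes())
```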
Step S2: audio content extraction processing is carried out on the audio according to the audio digital data, so that audio content cleaning data are obtained; performing text word segmentation processing according to the audio content cleaning data to obtain content text word segmentation data; carrying out emotion analysis according to the content text word segmentation data to obtain text emotion analysis data;
In the embodiment of the invention, content extraction processing is performed on the audio according to the audio digital data to obtain the useful information in it; this may include speech recognition, speech segmentation and voice activity detection techniques that convert the audio into text or mark its meaningful speech parts. The text data extracted from the audio is cleaned and preprocessed to remove noise, silence, repetition and irrelevant parts, giving clean audio content data; this may involve text cleaning, stop-word removal and spelling correction. Word segmentation is then applied to the cleaned audio content data to split the text into meaningful words or phrases, for which natural language processing (NLP) techniques such as word segmentation algorithms and part-of-speech tagging can be used. Finally, emotion analysis is performed on the segmented text to identify its emotional tendency and emotional state, which may involve emotion dictionaries, machine learning models or deep learning models for inferring the emotion and emotional polarity of the text.
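A minimal sketch of the preprocessing part of this step is shown below, assuming a placeholder recognizer, a regular-expression cleaner and jieba for Chinese word segmentation; the stop-word list is illustrative. The resulting tokens would then be scored against an emotion dictionary, for example as in the sketch after step S265 above.

```python
# Minimal sketch of S2 preprocessing: ASR output (placeholder), text cleaning and
# Chinese word segmentation. The recognizer call and stop-word list are assumptions.
import re
import jieba

def transcribe(wav_path: str) -> str:
    """Placeholder for a speech recognizer (an ASR model or service would go here)."""
    raise NotImplementedError

STOP_WORDS = {"的", "了", "和"}                           # illustrative stop words

def clean_and_segment(text: str) -> list[str]:
    text = re.sub(r"[^\w\u4e00-\u9fff]+", " ", text)      # drop punctuation and special characters
    tokens = [w for w in jieba.cut(text) if w.strip() and w not in STOP_WORDS]
    return tokens                                         # content text word segmentation data
```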
Step S3: performing time stamp marking on the audio to obtain voice time stamp data, and performing ambient sound correction on the voice time stamp data to obtain corrected voice data; performing emotion feature analysis on the corrected voice data to obtain speech speed emotion feature data and intonation emotion feature data; feature structure integration is carried out on the speech speed emotion feature data and the intonation emotion feature data to obtain speech emotion feature data; carrying out emotion recognition on the voice emotion characteristic data to obtain voice emotion analysis data;
In the embodiment of the invention, each audio sample is time-stamped according to the audio data so that it is associated with its corresponding time; the time stamp of each sample can be computed from the audio sampling rate and the number of sample points. According to the voice time stamp data, ambient sound correction is applied to the audio to remove the interference of background noise and environmental sound and improve the accuracy of the emotion features; noise reduction algorithms and filtering techniques can be used for this correction. Emotion features are then extracted from the corrected voice data, including the speech speed emotion feature data and the intonation emotion feature data, using speech signal processing and feature extraction algorithms such as fundamental frequency extraction, pitch analysis and speech rate analysis. The extracted speech speed and intonation emotion features are combined into one set of comprehensive speech emotion feature data, for example by feature fusion or feature combination. Finally, emotion recognition is performed on the comprehensive speech emotion feature data to judge the emotional state expressed in the speech and obtain speech emotion analysis data; classification algorithms, support vector machines and deep learning models can be used here.
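A minimal sketch of the feature extraction and fusion described here follows, using librosa's pYIN pitch tracker for intonation statistics and onset density as a crude speech-rate proxy; the 16 kHz resampling, the pitch range and the onset-based proxy are assumptions rather than the patent's phoneme-level method.

```python
# Minimal sketch of S3 feature extraction and fusion: pitch (intonation) statistics via
# pYIN and a crude speech-rate proxy via onset density, concatenated into one vector.
import numpy as np
import librosa

def speech_emotion_features(wav_path: str) -> np.ndarray:
    y, sr = librosa.load(wav_path, sr=16_000)

    # Intonation features: fundamental-frequency mean and spread
    f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                            fmax=librosa.note_to_hz("C7"), sr=sr)
    f0 = f0[~np.isnan(f0)]
    intonation = [f0.mean(), f0.std()] if f0.size else [0.0, 0.0]

    # Speech-rate proxy: onset events per second (an assumption, not phoneme counting)
    onsets = librosa.onset.onset_detect(y=y, sr=sr)
    rate = [len(onsets) / (len(y) / sr)]

    # Feature structure integration: simple concatenation
    return np.array(intonation + rate, dtype=np.float32)
```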
Step S4: and constructing an audio scene model according to the text emotion analysis data and the voice emotion analysis data to obtain an audio scene restoration model so as to realize restoration and play of the audio scene.
In the embodiment of the invention, a data set is collected that has undergone both text emotion analysis and speech emotion analysis, containing the text emotion analysis results and the corresponding audio emotion analysis results. Features are extracted from the text emotion analysis data and the speech emotion analysis data; this may involve extracting word vectors or other semantic features from the text and extracting audio features such as speech speed, intonation and emotional tendency from the audio. The labelled text emotion analysis data and speech emotion analysis data are used as the training set to train the audio scene model, which can be built with various machine learning algorithms or deep learning models such as support vector machines, random forests and neural networks. The trained model is then evaluated and optimized: its performance is assessed with a validation set or cross-validation, and its parameters are tuned to improve accuracy and generalization. Finally, the trained audio scene model is applied to new audio data and, from the input audio features, outputs the emotional scene restoration result for the audio, which can be fed back to the terminal by playing back speech that restores the emotional atmosphere, background environment and context information of the audio.
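A minimal sketch of this model-construction step is given below, assuming the text and speech emotion features have already been extracted as fixed-length vectors; the random placeholder data, the four scene classes and the random-forest classifier are illustrative choices only.

```python
# Minimal sketch of S4: text-emotion and speech-emotion features are concatenated and
# a classifier is trained to predict the audio scene label. All data here are
# random placeholders standing in for the outputs of steps S2 and S3.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

text_feats = np.random.rand(200, 8)                 # stand-in for text emotion analysis data
speech_feats = np.random.rand(200, 3)               # stand-in for speech emotion analysis data
scene_labels = np.random.randint(0, 4, size=200)    # four hypothetical scene classes

X = np.hstack([text_feats, speech_feats])           # combine the two modalities
model = RandomForestClassifier(n_estimators=200, random_state=0)
print(cross_val_score(model, X, scene_labels, cv=5).mean())   # cross-validated evaluation
model.fit(X, scene_labels)                          # the audio scene restoration model
```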
By collecting signals from the audio, the invention obtains the original sampled signal of the audio, which helps capture its details and characteristics and provides the basic data required by subsequent processing. Structure quantization converts the continuous analog sampled signal into a discrete digital signal, so that the audio data can be processed and stored by a computer, and digital code conversion then represents the quantized signal as digital data that a computer can understand and process, providing the basis for later analysis.
By performing audio content extraction on the audio digital data, useful information and content can be drawn out of the audio, which helps in understanding the speech, dialogue and music it contains and provides basic data for subsequent analysis. Text word segmentation on the cleaned audio content converts the extracted content into meaningful words, dividing long text into smaller units that are easier to process and analyze, and emotion analysis on the segmented text infers the emotional tendency the text expresses; by analyzing its vocabulary, semantics and sentence structure, the emotional state of the text, for example positive, negative or neutral, can be judged, which helps in deeply understanding the emotional content expressed in the audio.
By time-stamping the audio, the voice data can be tied to specific time points, so that different parts of the audio can be accurately located and referenced in subsequent analysis, enabling finer processing. Ambient sound correction of the voice time stamp data removes environmental noise, improves the quality and clarity of the voice data, and reduces the interference of noise with emotion analysis, making the extraction of emotion features more accurate and reliable. Emotion feature analysis of the corrected voice data extracts speech speed and intonation features: the speech speed emotion feature data reflect how fast the speech is, while the intonation emotion feature data reflect its pitch changes, and both help in understanding the emotional information expressed in the speech. Feature structure integration fuses the speech speed and intonation emotion features into comprehensive speech emotion feature data that captures the rich emotional expression in the speech, and emotion recognition based on these features identifies the emotion expressed in the speech and provides information about the emotional state.
By combining the text emotion analysis data and the voice emotion analysis data, an audio scene model can be constructed. The model combines emotion information with audio characteristics and learns the audio feature patterns under different emotional states, so that the correlation between emotion and the audio scene is established and a foundation is provided for subsequent audio scene restoration. Based on the constructed model, the audio can then be restored to its scene: by analyzing the audio characteristics, the emotional state and scene environment corresponding to the audio are inferred, so that the emotional atmosphere, background environment and context information in the audio are restored and the audio feels richer and more real. The audio processing method and system are therefore an optimization of traditional audio processing: they solve the problems that traditional methods cannot accurately analyze the scene effect presented by the audio, reproduce that effect poorly and introduce large delays, and instead analyze the scene effect accurately, improve its reproduction and reduce delay.
Preferably, step S1 comprises the steps of:
step S11: collecting the audio signals to obtain audio signals;
step S12: performing sound signal sampling processing on the audio signal to obtain an audio sampling signal;
step S13: performing structure quantization processing on the audio sampling signal to obtain an audio sampling quantized signal;
step S14: and performing digital coding conversion on the audio sampling quantized signal to obtain audio digital data.
In the embodiment of the invention, a suitable audio collection device such as a microphone or recorder is prepared, and the collection environment is kept relatively quiet so that interference and noise do not affect the audio signal; the device then collects the audio signal, either in real time or from an existing audio file. The sampling frequency (sampling rate) is determined, common values being 8 kHz, 16 kHz and 44.1 kHz, together with the bit depth of the samples, i.e. the precision of each sample point, commonly 8, 16 or 24 bits; the audio signal is sampled at this rate and depth to obtain a series of discrete sample points. The quantization range of the sampled signal, i.e. the range between its maximum and minimum values, is determined, commonly between -1 and 1 or between -32767 and 32767, and the sampled signal is mapped onto discrete quantization levels within this range to obtain discrete quantized sample values. Finally an encoding for the audio digital data is chosen, such as pulse code modulation (PCM) or another compression encoding, and the quantized samples are converted into the corresponding digital representation, for example 16-bit integers or floating-point numbers for PCM.
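A short sketch of the quantization-range mapping described above: samples in [-1, 1] are mapped to the signed integer range implied by a chosen bit depth and back, so the quantization error can be inspected. The helper names are illustrative.

```python
# Sketch of the quantization-range mapping: samples in [-1, 1] are mapped to the
# integer range implied by a chosen bit depth (8/16/24) and back again.
import numpy as np

def uniform_quantize(samples: np.ndarray, bits: int) -> np.ndarray:
    """Map samples in [-1, 1] to signed integers of the given bit depth."""
    q_max = 2 ** (bits - 1) - 1                      # e.g. 32767 for 16-bit
    return np.clip(np.round(samples * q_max), -q_max - 1, q_max).astype(np.int64)

def dequantize(codes: np.ndarray, bits: int) -> np.ndarray:
    return codes / (2 ** (bits - 1) - 1)

x = np.linspace(-1.0, 1.0, 5)
codes = uniform_quantize(x, 16)                      # audio sampling quantized signal
error = np.abs(x - dequantize(codes, 16)).max()      # quantization error stays below 1/32767
```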
The invention can acquire the original audio signal by collecting the audio, which is helpful for acquiring the audio data from the audio source and provides a basis for the subsequent processing and analysis; performing sound signal sampling processing on the audio signal, converting the continuous analog audio signal into discrete audio sampling signals, and discretizing the audio signal into a series of sampling points through the sampling processing, so that the audio signal can be processed and represented in a digital system; performing a structural quantization process on the audio sample signal, mapping the continuous sample values into discrete quantized values, which helps to reduce the storage space and transmission bandwidth of the audio data while maintaining the perceptual quality of the audio; the quantized signal of the audio sample is digitally encoded, the quantized value is represented as audio data in digital form, and by digital encoding, the audio signal is converted into binary data which can be processed and stored by a computer.
Preferably, step S13 comprises the steps of:
step S131: carrying out sampling signal structure division on the audio sampling signals by using a preset amplitude structure division manual to obtain an audio structure signal set;
step S132: performing adjacent amplitude difference calculation on the audio structure signal set to obtain an audio amplitude signal set;
Step S133: zero-crossing rate calculation is carried out on the audio amplitude signal set, and a zero-crossing rate signal set is obtained;
step S134: extracting overlapping parts of the audio amplitude signal sets to obtain audio amplitude overlapping signals;
step S135: and carrying out nonlinear quantization processing on the audio sampling signal according to the audio amplitude overlapping signal and the zero crossing rate signal set to obtain an audio sampling quantized signal.
In the embodiment of the invention, a set of amplitude structure division manuals is predefined according to the requirements and the application scenario, containing division criteria for different amplitude ranges; the audio sampling signal is divided according to these manuals and each sample point is assigned to its corresponding amplitude structure, forming the audio structure signal set. For each structure in the set, the adjacent amplitude differences, i.e. the differences between neighbouring amplitudes, are computed and collected into the audio amplitude signal set. The zero-crossing rate of each amplitude sequence, i.e. the number of times the signal passes through zero, is computed and collected into the zero-crossing rate signal set. The overlapping parts, i.e. the parts whose amplitude differences are smaller than a certain threshold, are extracted from the adjacent amplitude differences and collected into the audio amplitude overlapping signal. Finally, a suitable nonlinear quantization function is designed from the audio amplitude overlapping signal and the zero-crossing rate signal set, nonlinear quantization is applied to the audio sampling signal, and the result is taken as the audio sampling quantized signal.
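A minimal sketch of the quantities named in steps S131 to S135 is given below: adjacent amplitude differences, zero-crossing rate, and a nonlinear quantization pass. The μ-law companding curve is used here as a generic nonlinear quantizer and is a stand-in for, not an implementation of, the patent's amplitude-overlap and zero-crossing driven quantization function.

```python
# Sketch of S131-S135 quantities: adjacent amplitude differences, zero-crossing rate,
# and a nonlinear (mu-law) quantization pass as a generic stand-in quantizer.
import numpy as np

def zero_crossing_rate(frame: np.ndarray) -> float:
    """Fraction of neighbouring sample pairs whose signs differ."""
    return float(np.mean(np.abs(np.diff(np.signbit(frame).astype(np.int8)))))

def adjacent_amplitude_diff(frame: np.ndarray) -> np.ndarray:
    return np.diff(frame)                            # amplitude change between neighbours

def mu_law_quantize(x: np.ndarray, mu: int = 255, levels: int = 256) -> np.ndarray:
    """Standard mu-law companding followed by uniform quantization of the compressed signal."""
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return np.round((compressed + 1) / 2 * (levels - 1)).astype(np.int32)
```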
The invention divides the audio sampling signal by using the preset amplitude structure division manual, and divides the audio signal into different structures, which is helpful to decompose the audio signal into smaller parts, so that the subsequent analysis and processing are finer and more accurate; performing adjacent amplitude difference calculation on the audio structure signal set to obtain an audio amplitude signal set, wherein the adjacent amplitude difference reflects the amplitude change condition of the audio signal, which is helpful for extracting amplitude characteristics for subsequent analysis and processing; the zero-crossing rate calculation is carried out on the audio amplitude signal set to obtain a zero-crossing rate signal set, wherein the zero-crossing rate represents the frequency of the audio signal passing through the zero point, reflects the rapid change condition of the audio signal, and can extract the instantaneous characteristic of the audio for subsequent analysis and processing; the overlapping part extraction is carried out on the audio amplitude signal set to obtain audio amplitude overlapping signals, and the overlapping part extraction can identify overlapping phenomena existing in audio, namely overlapping parts of a plurality of audio signals, which is helpful for separating and processing the overlapping signals and improves the quality and definition of audio data; according to the audio amplitude overlapping signal and the zero crossing rate signal set, nonlinear quantization processing is carried out on the audio sampling signal to obtain an audio sampling quantized signal, and the nonlinear quantization processing can adjust the dynamic range and amplitude distribution of the audio signal, so that the audio data is more suitable for storage and transmission.
Preferably, step S2 comprises the steps of:
step S21: audio content extraction processing is carried out on the audio according to the audio digital data, so that audio content text data are obtained;
step S22: data cleaning is carried out on the text data of the audio content, and audio content cleaning data are obtained;
step S23: performing text word segmentation processing on the audio content text data according to the audio content cleaning data to obtain content text word segmentation data;
step S24: carrying out semantic analysis on the content text word segmentation data to obtain content text semantic data;
step S25: performing associated entity labeling on the content text semantic data to obtain content text associated data;
step S26: and carrying out emotion analysis on the semantic data of the content text according to the associated data of the content text to obtain text emotion analysis data.
In the embodiment of the invention, a suitable audio content extraction technique, such as speech recognition or audio feature extraction, is selected and applied to the audio digital data, so that the speech content in the audio is converted into text form and the audio content text data is obtained. The data cleaning operations to be performed are determined according to the application requirements and data quality requirements, such as removing special characters, punctuation marks and stop words; these operations are applied to the audio content text data, removing unnecessary content or normalizing the text as required, to obtain the audio content cleaning data. A suitable word segmentation tool or algorithm, such as a Chinese word segmenter or a bag-of-words model, is selected and used to segment the audio content cleaning data, splitting the text into a series of words or phrases to obtain the content text word segmentation data. A suitable semantic analysis technique, such as a word vector model or a deep learning model, is selected for understanding and inferring the semantics of the text, and the content text word segmentation data is processed with it to extract the semantic information of the text and obtain the content text semantic data. A suitable entity labeling tool or algorithm, such as a named entity recognizer or an entity linking model, is selected for identifying and labeling the key entities in the text; the content text semantic data is processed with it to identify and label the key entities related to a specific field and generate the content text association data. Finally, a suitable emotion analysis technique, such as an emotion dictionary or a deep learning model, is selected for identifying and analyzing the emotion tendency of the text.
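A hedged sketch of the text-processing chain described above, assuming that content extraction (speech recognition) has already produced a transcript; jieba is used here only as an example Chinese word-segmentation tool, and the stop-word list and cleaning pattern are illustrative.

```python
import re
import jieba  # example Chinese word-segmentation tool

STOP_WORDS = {"的", "了", "是", "在", "和"}   # illustrative stop-word list

def clean_text(transcript: str) -> str:
    """Remove special characters and punctuation, then collapse whitespace."""
    cleaned = re.sub(r"[^\w\u4e00-\u9fff]+", " ", transcript)
    return re.sub(r"\s+", " ", cleaned).strip()

def segment(cleaned: str) -> list:
    """Split cleaned text into words and drop stop words."""
    return [w for w in jieba.lcut(cleaned) if w.strip() and w not in STOP_WORDS]

transcript = "今天的音频内容非常精彩，大家都很开心！"   # assumed speech-recognition output
tokens = segment(clean_text(transcript))
print(tokens)   # content text word segmentation data for downstream semantic analysis
```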
According to the invention, audio content extraction processing is carried out according to the audio digital data, and the voice information in the audio is converted into text data, which helps to extract the information contained in the audio and makes subsequent text processing and analysis possible; data cleaning is carried out on the audio content text data to remove noise, irrelevant information and erroneous data and obtain clean audio content data, which improves the quality and accuracy of the data and provides a reliable basis for subsequent processing and analysis; text word segmentation processing is carried out according to the audio content cleaning data to split the text into meaningful words or phrases, which converts complex text data into a form that is easier to process and analyze and lays a foundation for subsequent semantic analysis and entity labeling; semantic analysis is carried out on the content text word segmentation data to understand the semantics and meaning of the text, which makes it possible to identify the entities, relations and contexts in the text, helps to understand the content of the text, and lays a foundation for the subsequent associated entity labeling and emotion analysis; associated entity labeling is carried out on the content text semantic data to identify the entities in the text and give them corresponding labels, which identifies the important entities and keywords in the text and provides more accurate information for subsequent analysis and application; according to the content text associated data, emotion analysis is carried out on the content text semantic data to identify the emotion tendencies and emotion states in the text, which helps to understand the emotional meaning of the text, extracts the emotion information, and provides a basis for emotion analysis and application.
Preferably, step S26 includes the steps of:
step S261: performing emotion part-of-speech screening on the content text associated data to obtain a key emotion part-of-speech list;
step S262: carrying out vocabulary combination conversion analysis according to the key emotion part list to obtain a combined part emotion list;
step S263: establishing an emotion dictionary according to the key emotion part list and the combined part emotion list to obtain a key emotion dictionary;
step S264: performing syntactic analysis on the content text semantic data to obtain text syntactic structure data;
step S265: and carrying out emotion analysis on the text syntactic structure data according to the key emotion dictionary to obtain text emotion analysis data.
In the embodiment of the invention, the emotion parts of speech to be considered are determined, such as positive emotion parts of speech (for example adjectives and adverbs) and negative emotion parts of speech (for example negation words); the parts of speech in the content text associated data are screened according to the defined emotion part-of-speech list, and the words containing key emotion parts of speech are extracted to obtain the key emotion part-of-speech list. The combined parts of speech to be considered are then determined, such as positive emotion part-of-speech combinations and negative emotion part-of-speech combinations; combination conversion analysis is carried out on the words in the content text associated data according to the key emotion part-of-speech list, and the words with specific combined parts of speech are found to obtain the combined part emotion list. An emotion dictionary containing positive and negative emotion vocabulary is established according to the key emotion part-of-speech list and the combined part emotion list: the words screened out by the two lists are added to the dictionary and each word is assigned a corresponding emotion polarity. A suitable syntactic analysis tool or algorithm, such as a dependency parser or a chunk parser, is selected for analyzing the syntactic structure of the text, and the content text semantic data is processed with it to obtain the syntactic dependency relations and syntactic structure of the text, giving the text syntactic structure data. Finally, a suitable emotion analysis algorithm or model, such as a rule-based method or a machine learning-based method, is selected for emotion reasoning according to the key emotion dictionary and the text syntactic structure data; the text syntactic structure data is processed with the selected algorithm, and emotion reasoning and analysis are performed in combination with the key emotion dictionary to obtain the text emotion analysis data.
According to the invention, emotion parts of speech are screened for content text associated data, key emotion parts of speech are extracted, and specific parts of speech are screened out, so that key words expressing emotion can be focused, redundant information is reduced, and accuracy and effect of emotion analysis are improved; based on the key emotion part-of-speech list, carrying out vocabulary combination conversion analysis on the text, which means that the key emotion parts-of-speech are combined according to a certain rule to form a new combined part-of-speech, and the analysis can capture more complex emotion expression modes and provide more comprehensive and rich emotion information; according to the key emotion part list and the combined part emotion list, an emotion dictionary is established, wherein the emotion dictionary contains emotion words related to the key emotion parts and emotion trends corresponding to the emotion words, and the establishment of the emotion dictionary can provide reference for subsequent emotion analysis and help judge emotion in a text; syntactic analysis is carried out on semantic data of the content text, grammar relations and syntax structures among words in the text are analyzed, the syntactic analysis is helpful for understanding the grammar structures of the text, and context relations of emotion expression are captured, so that emotion meanings of the text are accurately interpreted; according to the key emotion dictionary and the text syntactic structure data, emotion analysis is carried out to identify emotion tendencies and emotion states in the text, emotion analysis can be based on the emotion dictionary and the syntactic structure, emotion information is associated with the context, and deeper and accurate results are provided for emotion analysis.
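The dictionary-plus-rules idea of step S26 can be sketched as follows; the dictionary entries, polarity values, negation words and combined-part-of-speech entries are illustrative assumptions, not the patent's actual emotion dictionary.

```python
# Illustrative key emotion dictionary: word -> polarity (+ positive, - negative).
EMOTION_DICT = {
    "excellent": 1.0, "happy": 1.0, "clear": 0.5,
    "terrible": -1.0, "sad": -1.0, "noisy": -0.5,
}
NEGATIONS = {"not", "never", "no"}          # illustrative negation words
COMBINED = {("very", "happy"): 1.5}         # illustrative combined part-of-speech entries

def score_sentence(tokens):
    """Rule-based emotion score using the dictionary and a simple negation scope."""
    score, i = 0.0, 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in COMBINED:                       # combined part-of-speech match
            score += COMBINED[pair]
            i += 2
            continue
        polarity = EMOTION_DICT.get(tokens[i], 0.0)
        if i > 0 and tokens[i - 1] in NEGATIONS:   # negation flips polarity
            polarity = -polarity
        score += polarity
        i += 1
    return score

print(score_sentence(["the", "audio", "is", "not", "terrible", "and", "very", "happy"]))
# 2.5 with these illustrative entries: negated "terrible" counts +1, "very happy" adds 1.5
```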
Preferably, step S3 comprises the steps of:
step S31: collecting character voice data of the audio to obtain character voice data;
step S32: performing time stamp marking on the voice data to obtain voice time stamp data;
step S33: performing ambient sound correction on the voice time stamp data by using an ambient sound correction algorithm to obtain corrected voice data;
step S34: carrying out speech speed phoneme feature extraction and intonation phoneme feature extraction on the corrected voice data to obtain speech speed phoneme feature data and intonation phoneme feature data;
step S35: carrying out emotion feature analysis on the speech speed phoneme feature data and the intonation phoneme feature data to obtain speech speed emotion feature data and intonation emotion feature data;
step S36: carrying out speech speed emotion assessment on the speech speed emotion feature data according to the text emotion analysis data to obtain speech speed emotion assessment data;
step S37: carrying out emotion adaptation evaluation on the emotion feature data of the intonation according to the emotion evaluation data of the speed of speech to obtain emotion evaluation data of the intonation;
step S38: feature structure integration is carried out on the speech speed emotion estimation data and the intonation emotion estimation data to obtain speech emotion feature data;
Step S39: and carrying out emotion recognition on the voice emotion feature data through a voice analysis emotion recognizer to obtain voice emotion analysis data.
As an example of the present invention, referring to fig. 2, the step S3 in this example includes:
step S31: collecting character voice data of the audio to obtain character voice data;
in the embodiment of the invention, the character types whose voice data is to be collected, such as men, women and children, and the character characteristics, such as accent and speaking speed, are first made clear; suitable recording equipment or a recording system, such as a professional microphone, a recording studio or portable recording equipment, is prepared and kept in good working condition; the collected voice data is then sorted and archived so that the voice data of each character can be accurately identified and managed.
Step S32: performing time stamp marking on the voice data to obtain voice time stamp data;
in the embodiment of the invention, the character voice data to be marked is prepared, and the file format and sampling rate of the data are checked for compatibility with the selected tool or algorithm; the granularity and interval of the time stamp marking are set as required, and the marking can be carried out by sentence, by syllable or by another suitable unit.
Step S33: performing ambient sound correction on the voice time stamp data by using an ambient sound correction algorithm to obtain corrected voice data;
in the embodiment of the invention, environmental sound data, such as silence segments or background noise, is collected together with the voice data; a suitable ambient sound correction algorithm or technique, such as noise suppression or speech enhancement, is selected for correcting the ambient sound in the voice time stamp data, and the voice time stamp data is processed with the selected ambient sound correction algorithm to remove or suppress the ambient sound and obtain the corrected voice data.
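The patent defines its own ambient sound correction formula (described later, with parameters λ, μ, α, β, γ, t and R); purely as a generic stand-in, the sketch below shows spectral subtraction, one conventional noise-suppression technique, assuming a separate noise-only clip is available for estimating the noise spectrum.

```python
import numpy as np

def spectral_subtraction(speech, noise, frame=512, hop=256):
    """Generic spectral-subtraction denoiser (not the patent's correction formula)."""
    window = np.hanning(frame)
    # Average noise magnitude spectrum estimated from a noise-only segment.
    noise_frames = [noise[i:i + frame] * window
                    for i in range(0, len(noise) - frame, hop)]
    noise_mag = np.mean([np.abs(np.fft.rfft(f)) for f in noise_frames], axis=0)

    out = np.zeros(len(speech))
    for i in range(0, len(speech) - frame, hop):
        spec = np.fft.rfft(speech[i:i + frame] * window)
        # Subtract the noise spectrum, keeping a small spectral floor.
        mag = np.maximum(np.abs(spec) - noise_mag, 0.05 * np.abs(spec))
        out[i:i + frame] += np.fft.irfft(mag * np.exp(1j * np.angle(spec)), n=frame) * window
    return out

# Usage sketch: `speech` is a time-stamped voice segment, `noise` a background-only clip.
rng = np.random.default_rng(0)
noise = 0.02 * rng.standard_normal(16000)
speech = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000) + 0.02 * rng.standard_normal(16000)
corrected = spectral_subtraction(speech, noise)
```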
Step S34: carrying out speech speed phoneme feature extraction and intonation phoneme feature extraction on the corrected voice data to obtain speech speed phoneme feature data and intonation phoneme feature data;
in the embodiment of the invention, the corrected voice data is processed with a speech processing tool or algorithm, such as acoustic analysis or speech recognition, to extract the phoneme features related to speech speed, such as syllable duration and speech speed variation; the corrected voice data is likewise processed with a speech processing tool or algorithm, such as fundamental frequency analysis or speech recognition, to extract the phoneme features related to intonation, such as fundamental frequency variation and intonation contour.
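One possible way to extract the speech-speed and intonation phoneme features described above, sketched with librosa (an assumed library choice): speech speed is approximated by onset density, and intonation by the fundamental-frequency contour. The file name and sampling rate are hypothetical.

```python
import numpy as np
import librosa

def rate_and_intonation_features(path):
    y, sr = librosa.load(path, sr=16000)
    duration = len(y) / sr

    # Speech-speed proxy: onset (roughly syllable) density and its regularity.
    onsets = librosa.onset.onset_detect(y=y, sr=sr, units="time")
    rate_features = {
        "onsets_per_second": len(onsets) / duration,
        "onset_interval_std": float(np.std(np.diff(onsets))) if len(onsets) > 1 else 0.0,
    }

    # Intonation: fundamental-frequency (f0) contour statistics.
    f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                            fmax=librosa.note_to_hz("C7"), sr=sr)
    f0 = f0[~np.isnan(f0)]
    intonation_features = {
        "f0_mean": float(np.mean(f0)) if f0.size else 0.0,
        "f0_range": float(np.ptp(f0)) if f0.size else 0.0,
        "f0_std": float(np.std(f0)) if f0.size else 0.0,
    }
    return rate_features, intonation_features

# rate, intonation = rate_and_intonation_features("corrected_speech.wav")  # hypothetical file
```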
Step S35: carrying out emotion feature analysis on the speech speed phoneme feature data and the intonation phoneme feature data to obtain speech speed emotion feature data and intonation emotion feature data;
in the embodiment of the invention, a suitable emotion feature analysis algorithm or model, such as a statistical method or a machine learning-based method, is selected for extracting and analyzing the emotion features of the speech speed phoneme feature data and the intonation phoneme feature data; the selected emotion feature analysis algorithm is used to process the speech speed phoneme feature data and the intonation phoneme feature data, and the features related to emotion, such as how fast the speech is and how much the intonation rises and falls, are extracted to obtain the speech speed emotion feature data and the intonation emotion feature data.
Step S36: carrying out speech speed emotion assessment on the speech speed emotion feature data according to the text emotion analysis data to obtain speech speed emotion assessment data;
in the embodiment of the invention, a proper speech speed emotion assessment algorithm or model is selected, such as a rule-based method or a machine learning-based method, and is used for carrying out emotion assessment on speech speed emotion feature data according to text emotion analysis data, processing the speech speed emotion feature data by using the selected speech speed emotion assessment algorithm, and carrying out emotion reasoning and assessment by combining the text emotion analysis data to obtain speech speed emotion assessment data.
Step S37: carrying out emotion adaptation evaluation on the emotion feature data of the intonation according to the emotion evaluation data of the speed of speech to obtain emotion evaluation data of the intonation;
in the embodiment of the invention, a proper intonation emotion adaptation evaluation algorithm or model is selected, such as a rule-based method or a machine learning-based method, and is used for carrying out emotion adaptation evaluation on the intonation emotion feature data according to the emotion speed emotion evaluation data, the selected intonation emotion adaptation evaluation algorithm is used for processing the intonation emotion feature data, and emotion reasoning and evaluation are carried out by combining the emotion speed emotion evaluation data to obtain the intonation emotion evaluation data.
Step S38: feature structure integration is carried out on the speech speed emotion estimation data and the intonation emotion estimation data to obtain speech emotion feature data;
in the embodiment of the invention, the speech speed emotion estimation data and the intonation emotion estimation data are subjected to feature structure integration, and the features of the speech speed emotion estimation data and the intonation emotion estimation data can be combined and weighted to obtain integrated speech emotion feature data.
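A small sketch of the feature structure integration, assuming the two assessments are plain numeric feature dictionaries and that the weights are hand-chosen for illustration.

```python
import numpy as np

def integrate_features(rate_eval, intonation_eval, w_rate=0.6, w_intonation=0.4):
    """Weighted concatenation of speech-speed and intonation assessment features."""
    rate_vec = np.array(list(rate_eval.values()), dtype=float)
    intonation_vec = np.array(list(intonation_eval.values()), dtype=float)
    return np.concatenate([w_rate * rate_vec, w_intonation * intonation_vec])

# Hypothetical assessment values combined into one voice emotion feature vector.
features = integrate_features({"rate_score": 0.8, "rate_stability": 0.5},
                              {"pitch_rise": 0.3, "pitch_range": 0.7})
```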
Step S39: and carrying out emotion recognition on the voice emotion feature data through a voice analysis emotion recognizer to obtain voice emotion analysis data.
In the embodiment of the invention, a proper voice emotion recognizer or model is selected, such as a feature-based method or a deep learning-based method, for emotion recognition of voice emotion feature data, and the selected voice emotion recognizer is used for processing and analyzing the integrated voice emotion feature data to perform emotion reasoning and recognition so as to obtain voice emotion analysis data.
The invention collects the voice data of the audio frequency in a role to acquire the voice data of a specific role, which is helpful to correlate the audio frequency with the specific role or a speaker and provide a role context for subsequent analysis and processing; performing time stamping on the voice data, namely adding time stamping for each segment or word in the voice data, wherein the time stamping can help accurately identify and position different parts of the voice data in subsequent processing; performing ambient sound correction on the voice time stamp data by using an ambient sound correction algorithm, namely correcting and adjusting the voice data according to ambient noise so as to improve the quality and the understandability of the voice data; the voice speed phoneme feature extraction is carried out on the corrected voice data, and the voice speed related features in the voice data are extracted, so that the voice speed rhythm and beat can be captured, and a basis is provided for subsequent emotion analysis; the modified voice data is subjected to intonation phoneme feature extraction, intonation related features in the voice data are extracted, the voice pitch, the tone and the intonation change of the voice can be analyzed, and a basis is provided for subsequent emotion analysis; emotion feature analysis is carried out on the speech speed phoneme feature data and the intonation phoneme feature data, namely information related to emotion is extracted from the speech speed and the intonation features, so that the understanding of the relation between the speech speed and the intonation and emotion can be facilitated, and a basis is provided for subsequent emotion assessment; according to the text emotion analysis data, carrying out emotion evaluation on the emotion characteristic data of the speech speed, and evaluating the relation between the emotion and the speech speed characteristics, so that the influence degree of the speech speed on emotion expression can be judged, and an emotion analysis result related to the speech speed is provided; according to the emotion evaluation data of the speech speed, emotion adaptation evaluation is carried out on emotion feature data of the speech, and the relation between the emotion and the intonation feature is evaluated, so that the adaptation degree of the intonation to emotion expression can be judged, and emotion analysis results related to the intonation can be provided; feature structure integration is carried out on the speech speed emotion estimation data and the intonation emotion estimation data, and the estimation results of the speech speed emotion estimation data and the intonation emotion estimation data are integrated to obtain integrated speech emotion feature data, so that influence of speech speed and intonation on emotion can be comprehensively considered, and a more comprehensive and accurate speech emotion analysis result is provided; emotion recognition is carried out on the voice emotion feature data through the voice analysis emotion recognizer, namely, the voice data and emotion are associated and classified to obtain voice emotion analysis data, so that emotion expressed in voice can be recognized, and information about emotion states is provided.
Preferably, the ambient sound correction algorithm in step S33 is as follows:
wherein f represents corrected voice data, x represents input voice time stamp data, λ represents propagation velocity value of sound wave, μ represents environmental noise coefficient, α represents air damping coefficient, β represents sound wave amplitude coefficient, γ represents carrier frequency value, t represents voice duration value, and R represents deviation correction value of environmental sound correction algorithm.
The invention constructs an ambient sound correction algorithm in which each parameter has an important influence on the quality and adaptability of the corrected voice data; by reasonably adjusting these parameters, the intelligibility, clarity and naturalness of the voice data can be improved, so that the voice data better adapts to different environmental noise and acoustic conditions. The algorithm fully considers the input voice time stamp data x, which is the result of time stamp marking of the original voice data to be corrected and provides the time information needed for calculating the corrected voice data; the propagation velocity value λ of the sound wave, which is the speed at which sound travels in the environment and can be adjusted in the ambient sound correction to adapt to different propagation environments; the environmental noise coefficient μ, which represents the noise level in the environment, where increasing the coefficient strengthens the correction of environmental noise, reduces its influence on the voice data and improves the clarity and intelligibility of the voice; the air damping coefficient α, which adjusts the damping effect on the sound wave as it propagates through the air, a suitable value reducing the attenuation caused by increasing propagation distance and improving the tone quality and audibility of the voice; the sound wave amplitude coefficient β, which represents the amplitude of the sound wave and can be adjusted to strengthen the sound so that the corrected voice data is clearer and more distinct; the carrier frequency value γ, which adjusts the frequency characteristic of the correction algorithm, a suitable value making the corrected voice data better balanced in the frequency domain, reducing the influence of frequency offset and improving the accuracy and naturalness of the voice; the voice duration value t, which represents the time point or time period of the voice data and is used in the ambient sound correction to calculate the time position of the corrected voice data so as to ensure its temporal consistency and accuracy; and the deviation correction value R of the ambient sound correction algorithm, which applies a deviation correction to the correction result to further optimize the corrected voice data, adjustment of this value allowing the algorithm to better adapt to different environmental conditions and improving its accuracy and robustness. The purpose of the algorithm is to correct the speech; the speech could also be corrected by conventional speech processing techniques, but their effect is generally inferior to that of this algorithm.
Preferably, step S35 includes the steps of:
step S351: carrying out three-dimensional thermodynamic diagram drawing on the speech speed phoneme characteristic data and the intonation phoneme characteristic data to respectively obtain a speech speed phoneme characteristic diagram and an intonation phoneme characteristic diagram;
step S352: carrying out central thermal region marking on the speech speed phoneme feature graph and the intonation phoneme feature graph to obtain speech speed thermal region data and intonation thermal region data;
step S353: carrying out distribution density calculation on the speech speed thermodynamic region data according to the speech speed phoneme characteristic diagram to obtain speech speed thermodynamic density data;
step S354: carrying out fluctuation change calculation on the intonation thermal region data according to the intonation phoneme feature diagram to obtain intonation fluctuation change data;
step S355: carrying out regional density random extraction on the speech speed thermodynamic density data to obtain speech speed random density data; randomly extracting the intonation fluctuation data to obtain intonation random fluctuation data;
step S356: respectively carrying out Monte Carlo simulation on the speech speed random density data and the intonation random fluctuation data to respectively obtain speech speed simulation output data and intonation simulation output data;
step S357: and carrying out emotion feature analysis according to the speech speed simulation output data and the intonation simulation output data to obtain speech speed emotion feature data and intonation emotion feature data.
As an example of the present invention, referring to fig. 3, the step S35 in this example includes:
step S351: carrying out three-dimensional thermodynamic diagram drawing on the speech speed phoneme characteristic data and the intonation phoneme characteristic data to respectively obtain a speech speed phoneme characteristic diagram and an intonation phoneme characteristic diagram;
in the embodiment of the invention, the speech speed phoneme feature data and the intonation phoneme feature data are prepared, the data can be corresponding features extracted from a speech signal, such as changes of speech speed and intonation, a proper algorithm or tool is used for extracting specific numerical features from the speech speed phoneme feature data and the intonation phoneme feature data, the features can be speech speed and intonation values of corresponding phonemes, coordinate axes of a three-dimensional thermodynamic diagram are determined according to the dimension of the data, generally, the horizontal axis can represent time, the vertical axis can represent phonemes, the color can represent values of speech speed or intonation, the speech speed phoneme feature data are converted into the three-dimensional thermodynamic diagram by a proper tool or library, the corresponding thermodynamic diagram is drawn according to the determined coordinate axes, the changes of color reflect the changes of speech speed, the intonation phoneme feature data are converted into the three-dimensional thermodynamic diagram by a proper tool or library, and the corresponding thermodynamic diagram is drawn according to the determined coordinate axes, wherein the changes of color reflect the changes of intonation.
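A sketch of how such a heat map can be built and rendered with numpy and matplotlib, using a synthetic feature matrix; time runs along the horizontal axis, phonemes along the vertical axis, and color encodes the speech-speed (or intonation) value, which serves as the third dimension.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic feature matrix: rows = phonemes, columns = time frames,
# values = speech-speed (or intonation) measurements.
rng = np.random.default_rng(1)
n_phonemes, n_frames = 12, 100
rate_map = rng.random((n_phonemes, n_frames))

fig, ax = plt.subplots()
image = ax.imshow(rate_map, aspect="auto", origin="lower", cmap="hot")
ax.set_xlabel("time frame")
ax.set_ylabel("phoneme index")
fig.colorbar(image, ax=ax, label="speech speed")   # color encodes the third dimension
fig.savefig("rate_phoneme_heatmap.png")
```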
Step S352: carrying out central thermal region marking on the speech speed phoneme feature graph and the intonation phoneme feature graph to obtain speech speed thermal region data and intonation thermal region data;
in the embodiment of the invention, an appropriate threshold value is first determined for marking the central thermal regions; the threshold can be chosen according to the specific application and the characteristics of the data, such as the color distribution in the thermodynamic diagram or the statistical characteristics of the data. The center points in the speech speed phoneme feature graph and the intonation phoneme feature graph are detected with a suitable method or algorithm; these center points generally correspond to the hottest or densest areas and represent higher speech speed or intonation features. The areas around the detected center points are marked as thermal regions according to the predetermined threshold; a circular, elliptical or other region marking shape can be used, and the specific shape and size can be adjusted as required. The marked thermal region data in the speech speed phoneme feature graph is recorded or extracted; this data can be the coordinates, shape, area or other relevant attributes of the thermal regions together with the corresponding speech speed feature values. The marked thermal region data in the intonation phoneme feature graph is likewise recorded or extracted; this data can be the coordinates, shape, area or other relevant attributes of the thermal regions together with the corresponding intonation feature values.
Step S353: carrying out distribution density calculation on the speech speed thermodynamic region data according to the speech speed phoneme characteristic diagram to obtain speech speed thermodynamic density data;
in the embodiment of the present invention, the speech speed thermal region data obtained in step S352 is used; this data includes the coordinates, shape and area of the thermal regions and the corresponding speech speed feature values. The speech speed phoneme feature graph is divided into grids so that each grid cell has a fixed size; the grid size can be chosen according to the distribution and resolution of the data. For each grid cell, the number of thermal regions from the speech speed thermal region data that fall into the cell is counted, which can be done by checking whether the coordinates of each thermal region center point lie inside the cell. The counted number for each cell is then normalized, for example by maximum-minimum scaling, so that the density values fall within a comparable range, and each normalized density value is associated with the corresponding grid cell to construct the speech speed thermal density data; this data can be represented as a thermodynamic diagram or a matrix in which each grid cell corresponds to one density value.
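A sketch of the grid-based density computation, assuming the speech-speed thermal region data is reduced to a list of region center coordinates; the grid size and min-max normalization are illustrative.

```python
import numpy as np

def thermal_density(centers, extent, grid=(10, 10)):
    """Count thermal-region centers per grid cell and min-max normalize the counts."""
    counts, _, _ = np.histogram2d(centers[:, 0], centers[:, 1], bins=grid,
                                  range=[[0, extent[0]], [0, extent[1]]])
    span = counts.max() - counts.min()
    return (counts - counts.min()) / span if span > 0 else counts

# Centers of detected speech-speed thermal regions as (time frame, phoneme index) pairs.
centers = np.array([[5, 2], [6, 2], [40, 7], [41, 8], [90, 3]], dtype=float)
density = thermal_density(centers, extent=(100, 12))
```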
Step S354: carrying out fluctuation change calculation on the intonation thermal region data according to the intonation phoneme feature diagram to obtain intonation fluctuation change data;
in the embodiment of the present invention, the intonation thermal region data obtained in step S352 is used; this data includes the coordinates, shape and area of the thermal regions and the corresponding intonation feature values. The features related to intonation, such as pitch or fundamental frequency values representing intonation change, are extracted from the intonation phoneme feature graph. For each intonation thermal region, the average of all intonation feature values within the region is calculated, which can be done by summing the intonation feature values inside the region and dividing by the area of the region. The average intonation feature of each region is then compared with the global average of the whole phoneme feature graph to obtain the fluctuation value of that region; a difference or a ratio can be used to represent the degree of fluctuation. Finally, the fluctuation value of each intonation thermal region is associated with the corresponding region to construct the intonation fluctuation change data, which can be represented as a thermodynamic diagram or a matrix in which each region corresponds to one fluctuation value.
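A sketch of the fluctuation computation, assuming each intonation thermal region is represented by a boolean mask over the intonation feature map and that the fluctuation value is the difference between the region mean and the global mean.

```python
import numpy as np

def intonation_fluctuation(feature_map, region_masks):
    """Per-region mean intonation minus the global mean of the whole feature map."""
    global_mean = feature_map.mean()
    return {name: float(feature_map[mask].mean() - global_mean)
            for name, mask in region_masks.items()}

rng = np.random.default_rng(2)
pitch_map = rng.random((12, 100))             # synthetic intonation feature map
masks = {"region_a": pitch_map > 0.9,         # illustrative thermal regions
         "region_b": pitch_map < 0.1}
fluctuation = intonation_fluctuation(pitch_map, masks)
```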
Step S355: carrying out regional density random extraction on the speech speed thermodynamic density data to obtain speech speed random density data; randomly extracting the intonation fluctuation data to obtain intonation random fluctuation data;
in the embodiment of the invention, the speech speed thermal density data, which represents the distribution density of the speech speed in different regions, is obtained; a certain amount of this data is extracted as required, and the amount can be chosen by extraction ratio or by a specific number. The random extraction can be realized by randomly selecting regions from the speech speed thermal density data or by using a random number generator, and the extracted values are combined to construct the speech speed random density data, which can be represented as a thermodynamic diagram or a matrix in which each region corresponds to one random density value. Similarly, the intonation fluctuation change data, which represents the fluctuation of intonation in different regions, is obtained; a certain amount of this data is extracted as required, by extraction ratio or by a specific number, with the random extraction realized by randomly selecting regions from the intonation fluctuation change data or by using a random number generator; the extracted values are combined to construct the intonation random fluctuation data, which can be represented as a thermodynamic diagram or a matrix in which each region corresponds to one random fluctuation value.
Step S356: respectively carrying out Monte Carlo simulation on the speech speed random density data and the intonation random fluctuation data to respectively obtain speech speed simulation output data and intonation simulation output data;
in the embodiment of the invention, the random density data of the speech speed is obtained, the data represent the random density conditions of the speech speed in different areas, parameters of Monte Carlo simulation such as simulation times and simulation time steps are determined, the parameters influence the accuracy and precision of the simulation, and a blank matrix with the same size as the random density data of the speech speed is created and used for storing the simulated output data, and each area is simulated. For each time step, according to the random density value of the area, using a proper model or method to simulate to obtain simulated output data, selecting a proper model such as a random walk model and a random diffusion model according to the need, updating a simulated output matrix according to the simulated output data, and accumulating and storing the simulation result of each time step; and acquiring intonation random fluctuation data, wherein the data represent random fluctuation change conditions of intonation in different areas, determining parameters of Monte Carlo simulation, such as simulation times and simulation time steps, which influence the accuracy and precision of the simulation, creating a blank matrix with the same size as the intonation random fluctuation data, and storing the simulated output data to simulate each area. For each time step, the simulation is performed using an appropriate model or method based on the random fluctuation value of the region, resulting in simulated output data. Suitable models, such as a fluctuation model and a random oscillation model, can be selected according to the needs, a simulation output matrix is updated according to the simulated output data, and the simulation result of each time step is accumulated and stored.
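A minimal Monte Carlo sketch matching the description above, with a random-walk model as the assumed per-region simulator; the number of runs, time steps and step scale are illustrative.

```python
import numpy as np

def monte_carlo_random_walk(region_values, n_runs=1000, n_steps=50, step_scale=0.05):
    """Average many random walks started from each region's value."""
    rng = np.random.default_rng(3)
    values = np.asarray(region_values, dtype=float)        # one value per region
    # Gaussian steps of shape (runs, regions, time); cumulative sum gives each walk.
    steps = rng.normal(0.0, step_scale, size=(n_runs, values.size, n_steps))
    walks = values[None, :, None] + np.cumsum(steps, axis=2)
    return walks.mean(axis=0)                              # regions x time steps

rate_sim = monte_carlo_random_walk([0.2, 0.8, 0.5])        # speech-speed random densities
intonation_sim = monte_carlo_random_walk([-0.1, 0.3, 0.0]) # intonation random fluctuations
```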
Step S357: and carrying out emotion feature analysis according to the speech speed simulation output data and the intonation simulation output data to obtain speech speed emotion feature data and intonation emotion feature data.
According to the embodiment of the invention, the speech speed simulation output data is obtained, the data represent simulation results of the speech speed in different areas, some emotion characteristic indexes such as average speech speed, fluctuation degree and change rate are defined according to requirements, the indexes are used for analyzing emotion characteristics of the speech speed, for each area or whole data, calculation is carried out according to the defined emotion characteristic indexes, for example, the average speech speed of each area is calculated, the fluctuation degree of the whole data is calculated, the calculated emotion characteristic indexes are associated with the corresponding area or whole data, and the speech speed emotion characteristic data are constructed, and can be expressed as thermodynamic diagrams, matrixes or other forms, wherein each area or whole data corresponds to one emotion characteristic value; and obtaining intonation simulation output data, wherein the data represent simulation results of intonation in different areas, defining emotion characteristic indexes such as a change range of pitch and tone stability according to requirements, calculating according to the defined emotion characteristic indexes for each area or whole data, such as calculating the pitch change range of each area and calculating the tone stability of whole data, associating the calculated emotion characteristic indexes with the corresponding area or whole data to construct intonation emotion characteristic data, wherein the data can be expressed as a thermodynamic diagram, a matrix or other forms, and each area or whole data corresponds to one emotion characteristic value.
According to the invention, by drawing the three-dimensional thermodynamic diagram of the speech speed phoneme characteristic data and the intonation phoneme characteristic data, the distribution condition of the speech speed and the intonation on different phonemes can be intuitively displayed, which is helpful for understanding the integral characteristics of the speech speed and the intonation and finding the rules and trends therein; by carrying out central thermal region marking on the speech speed phoneme feature diagram and the intonation phoneme feature diagram, the thermal regions of the speech speed and the intonation, namely the regions with higher or lower values in a specific range, can be determined, so that the key feature regions of the speech speed and the intonation are focused, and related information is extracted; according to the speech speed phoneme feature diagram, carrying out distribution density calculation on the speech speed thermodynamic region data to obtain speech speed thermodynamic density data, so that the distribution features of the speech speed can be quantized, the density distribution condition of the speech speed in different regions is known, and a basis is provided for subsequent analysis; the intonation thermal region data is calculated to obtain intonation fluctuation change data according to the intonation phoneme feature diagram, and the fluctuation degree of intonation in different regions, namely the fluctuation range of intonation, can be reflected. This helps understand the dynamics and expression characteristics of intonation; carrying out regional density random extraction on the speech speed thermodynamic density data to obtain speech speed random density data; the intonation fluctuation data are subjected to fluctuation random extraction to obtain intonation random fluctuation data, and the randomly extracted data can be used for simulating the random change conditions of the speed and intonation, so that data samples are further enriched, and the analysis diversity is increased; monte Carlo simulation is carried out on the speech speed random density data and the intonation random fluctuation data, so that simulation output data can be generated, the simulation output data can help to evaluate the influence of the change of the speech speed and the intonation on the emotion characteristics, and the basis for evaluation and prediction is provided; according to the speech speed simulation output data and the intonation simulation output data, emotion feature analysis is carried out, so that speech speed emotion feature data and intonation emotion feature data can be obtained, the data reflect the relation between the speech speed and the intonation and emotion, the understanding of the roles of the speech speed and the intonation in emotion expression is facilitated, and references are provided for applications such as emotion recognition and emotion generation.
Preferably, the constructing step of the emotion recognizer for voice analysis includes the steps of:
acquiring historical voice data;
carrying out structural segmentation processing according to the historical voice data to obtain a historical voice data set;
performing time period marking on the historical voice data set to obtain a voice time period data set;
extracting voice fluctuation points according to the voice time period data set to obtain a voice fluctuation point data set;
extracting a voice edge time period according to the voice fluctuation point data set to obtain a voice edge data set;
and performing voice machine learning on the voice fluctuation point data set by utilizing the Scikit-learn machine learning library, and performing coupling association through the voice edge data set to obtain the voice analysis emotion recognizer.
In the embodiment of the invention, a sufficient amount of historical voice data is collected, the data should cover different emotion states and voice characteristics, and the historical voice data can come from different sources, such as a voice database and user records; determining an appropriate segmentation strategy according to requirements and tasks, for example, segmenting historical voice data according to a voice length, voice pause and voice characteristics, segmenting historical voice data according to a selected segmentation strategy, cutting the voice data according to a fixed length or a variable length based on the voice length, determining the length of each voice segment according to a time window, detecting pause areas (namely silent areas among voices) in voice based on voice pause, segmenting the voice data into different paragraphs, and adjusting the definition and threshold of the pause according to specific requirements; based on the voice characteristics, segmenting according to the characteristics of the voice, such as energy, frequency spectrum and zero crossing rate, judging the segmentation position of the voice according to the change of the characteristics, and marking the voice paragraphs obtained by each segment so as to facilitate subsequent emotion recognition or other analysis tasks, wherein the marks can be emotion tags, time tags or other required tags; selecting a proper fluctuation point detection method, wherein common methods comprise threshold detection based on energy and frequency spectrum, differential detection and classification method based on machine learning, selecting the most proper method according to actual conditions and task demands, detecting and extracting fluctuation points in each voice time period according to the selected fluctuation point detection method, marking the extracted fluctuation points to facilitate subsequent analysis and training, marking the fluctuation points as starting points or end points of emotion changes, distributing corresponding emotion labels, combining the marked fluctuation points and the corresponding emotion labels into a voice fluctuation point data set, and each sample comprises fluctuation point characteristics and corresponding emotion labels; a speech relief point data set is prepared that includes relief point features and corresponding emotion tags. Each sample represents a voice fluctuation point and emotion labels thereof, definition of an edge time period is determined, the edge time period refers to voice paragraphs before and after the fluctuation point and is used for capturing context information of emotion change, the length of the edge time period or other standards can be defined according to the requirement, the edge time period is extracted from a voice fluctuation point data set according to the position of the fluctuation point, the voice paragraphs with the corresponding length are intercepted before and after each fluctuation point according to the defined edge time period length, the extracted edge time periods are marked for subsequent analysis and training, the edge time period can be marked as context of emotion change, corresponding emotion labels are reserved, the marked edge time period and the corresponding emotion labels are combined into a voice edge data set, and each sample contains voice data of the edge time period and the corresponding emotion labels; a speech relief point data set and a speech edge data set are prepared. 
The features and labels in the data sets are first confirmed to have been extracted and marked. Feature selection is performed on the voice fluctuation point data set to retain the most relevant and distinguishing features; a feature selection algorithm such as correlation analysis or information gain can be used to keep the features that help the emotion recognition task. A suitable machine learning model is then selected for the emotion recognition task; Scikit-learn provides various common machine learning algorithms, such as the Support Vector Machine (SVM), Random Forest and neural networks, and an appropriate model is chosen according to the task requirements and data characteristics. The selected model is trained with the voice fluctuation point data set: the data set is divided into a training set and a validation set, the model is trained on the training set, and its performance is evaluated on the validation set by calculating indexes such as accuracy and recall. The model is adjusted and optimized according to the evaluation results, for example by tuning the model parameters or trying different feature selection methods. Finally, the trained model is coupled and associated with the voice edge data set, so that the context information around the fluctuation points is taken into account, and the resulting model is used as the voice analysis emotion recognizer, improving the accuracy of emotion recognition.
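A hedged sketch of the Scikit-learn training flow described above; the fluctuation-point features, edge-period features and emotion labels are synthetic placeholders, and the SVM is only one of the algorithm choices the text mentions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Synthetic stand-ins: fluctuation-point features coupled with edge-period features.
rng = np.random.default_rng(4)
fluctuation_features = rng.random((300, 8))   # e.g. pitch jump, speed change, energy
edge_features = rng.random((300, 4))          # e.g. context energy / f0 statistics
X = np.hstack([fluctuation_features, edge_features])
y = rng.integers(0, 3, size=300)              # placeholder emotion labels (3 classes)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Train the recognizer and evaluate it on the validation set.
recognizer = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
recognizer.fit(X_train, y_train)
print(classification_report(y_val, recognizer.predict(X_val)))
```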
According to the invention, by acquiring the historical voice data, a voice data set is constructed, which is the basis for constructing the emotion recognizer, and the historical voice data can contain voice samples under different emotion states, so that the emotion recognizer can learn the characteristics and modes of different emotions; the historical voice data is subjected to structural segmentation processing, and the voice data can be divided into different segments or paragraphs, so that each segment contains a complete voice unit, and the extraction and analysis of the emotion characteristics of each voice unit are facilitated; through time period marking on the historical voice data set, each voice fragment can be associated with the corresponding time period, so that the recognizer can know voice characteristic changes in different time periods, and dynamic changes of emotion can be captured better; the recognizer can capture emotion change points in the voice by extracting fluctuation points of the voice time period data set, wherein the fluctuation points can be high-low voice change, voice speed change and the like of the voice, and are one of important characteristics in emotion expression; according to the voice fluctuation point data set, extracting a voice edge data set, wherein the voice edge refers to a time period around the fluctuation point, the voice edge data set contains important information in emotion expression, and by extracting the voice edge data set, a recognizer can pay attention to voice characteristics around the fluctuation point, so that fine emotion changes can be captured better; performing a machine learning analysis on the speech fluctuation point data set using a Scikit-learn machine learning library, which includes using various machine learning algorithms such as support vector machines, decision trees, random forests to train emotion recognition models by which the recognizer can learn patterns and features of emotion from the fluctuation point data; the coupling association of the voice edge data set and the voice fluctuation point data set is utilized to further optimize the performance of the emotion recognizer, and the association can help the recognizer to better understand the association between voice features and emotion around the fluctuation points, so that the accuracy and the robustness of emotion recognition are improved.
Preferably, the invention further provides an audio processing system for implementing the above audio processing method, the system comprising:
the audio conversion module is used for collecting the audio signals to obtain audio sampling signals; performing structure quantization processing on the audio sampling signal to obtain an audio sampling quantized signal; and performing digital coding conversion on the audio sampling quantized signal to obtain audio digital data.
The text analysis module is used for extracting and processing audio content according to the audio digital data to obtain audio content cleaning data; performing text word segmentation processing according to the audio content cleaning data to obtain content text word segmentation data; carrying out emotion analysis according to the content text word segmentation data to obtain text emotion analysis data;
the voice analysis module is used for performing time stamp marking on the audio to obtain voice time stamp data, and performing ambient sound correction on the voice time stamp data to obtain corrected voice data; performing emotion feature analysis on the corrected voice data to obtain speech speed emotion feature data and intonation emotion feature data; feature structure integration is carried out on the speech speed emotion feature data and the intonation emotion feature data to obtain speech emotion feature data; carrying out emotion recognition on the voice emotion feature data through a voice analysis emotion recognizer to obtain voice emotion analysis data;
The audio scene restoration module is used for constructing an audio scene model according to the text emotion analysis data and the voice emotion analysis data to obtain an audio scene restoration model so as to realize audio scene restoration and play.
The invention has the advantages that the original sampling signal of the audio can be obtained by collecting the signal of the audio, which is helpful for capturing the details and characteristics of the audio, providing the basic data required by the subsequent processing, carrying out the structure quantization processing on the audio sampling signal, converting the continuous analog signal into the discrete digital signal, which is helpful for digitizing the audio data, enabling the audio data to be processed and stored by a computer, and representing the quantized audio signal as the digital data by carrying out the digital coding conversion on the quantized audio signal, which is helpful for converting the audio signal into the form understood and processed by the computer, and providing the basis for the subsequent analysis and processing; by carrying out audio content extraction processing on audio digital data, useful information and content can be extracted from audio, which is helpful for understanding voice, dialogue and music content contained in the audio, basic data is provided for subsequent analysis and processing, text word segmentation processing is carried out according to audio content cleaning data, extracted content is converted into meaningful words, which is helpful for analyzing and understanding semantics and meaning in the audio, the text in the long term is divided into smaller units, further processing and analysis are convenient, emotion analysis is carried out according to content text word segmentation data, emotion tendency of text expression can be deduced, emotion state of the text can be judged through analyzing vocabulary, semantics and sentence structure information in the text, for example, positive, negative or neutral emotion state is helpful for deeply understanding emotion content expressed in the audio; by means of time stamping the audio, voice data can be corresponding to specific time points, the voice data can be accurately positioned and quoted in different parts of the audio in subsequent analysis and processing, fine analysis and processing are facilitated, environmental sound correction is conducted on the voice time stamp data, environmental noise and noise can be removed, quality and definition of the voice data are improved, interference of noise on emotion analysis is reduced, extraction of emotion features is enabled to be more accurate and reliable, emotion feature analysis is conducted on the corrected voice data, voice speed and intonation features can be extracted, voice speed emotion feature data reflect the speed degree of voice, intonation emotion feature data reflect the pitch change of voice, the features can help to deeply understand emotion information expressed in voice, feature structure integration is conducted on the voice speed emotion feature data and intonation emotion feature data, different features are fused together, comprehensive and comprehensive emotion feature data are obtained, the voice capture is beneficial to capturing the rich emotion expression in voice, accurate and comprehensive emotion analysis results are provided, recognition can be carried out based on emotion feature data, and emotion information expressed in a sense of voice can be used for understanding, and emotion information can be provided; by combining text emotion analysis data and voice emotion analysis data, an audio scene model can be constructed, emotion information and audio characteristics can be combined by the model, and audio 
characteristic modes under different emotion states can be learned, so that the relevance between emotion and the audio scene can be established and a foundation is provided for the subsequent audio scene restoration; based on the constructed audio scene model, the audio can be subjected to scene restoration, and the emotion state and scene environment corresponding to the audio can be deduced by analyzing and processing the audio characteristics, so that the emotion atmosphere, background environment and context information in the audio are restored and the feeling and experience of the audio become richer and more real. Therefore, the audio processing method and system of the invention are an optimization of the traditional audio processing method; they solve the problems that the traditional method cannot accurately analyze the scene effect presented by the audio, reproduces the scene effect poorly and has a large delay, and they accurately analyze the scene effect presented by the audio, improve the reproduction of the scene effect and reduce the delay.
The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.
The foregoing is only a specific embodiment of the invention to enable those skilled in the art to understand or practice the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. An audio processing method, comprising the steps of:
step S1: collecting the audio signals to obtain audio sampling signals; performing structure quantization processing on the audio sampling signal to obtain an audio sampling quantized signal; performing digital coding conversion on the audio sampling quantized signal to obtain audio digital data;
Step S2: audio content extraction processing is carried out on the audio according to the audio digital data, so that audio content cleaning data are obtained; performing text word segmentation processing according to the audio content cleaning data to obtain content text word segmentation data; carrying out emotion analysis according to the content text word segmentation data to obtain text emotion analysis data;
step S3: performing time stamp marking on the audio to obtain voice time stamp data, and performing ambient sound correction on the voice time stamp data to obtain corrected voice data; performing emotion feature analysis on the corrected voice data to obtain speech speed emotion feature data and intonation emotion feature data; feature structure integration is carried out on the speech speed emotion feature data and the intonation emotion feature data to obtain speech emotion feature data; carrying out emotion recognition on the speech emotion feature data to obtain voice emotion analysis data;
step S4: and constructing an audio scene model according to the text emotion analysis data and the voice emotion analysis data to obtain an audio scene restoration model so as to realize restoration and play of the audio scene.
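Read as a data flow, claim 1 chains four stages: digitization (step S1), text-side emotion analysis (step S2), speech-side emotion analysis (step S3) and scene-model construction (step S4). The Python sketch below only illustrates that flow; every function name is a hypothetical placeholder, and the bodies of steps S2-S4 are stubs rather than the claimed processing.

```python
import numpy as np

def digitize(samples: np.ndarray, n_bits: int = 16) -> np.ndarray:
    """S1 (simplified): quantize sampled audio and encode it as 16-bit integers."""
    levels = 2 ** (n_bits - 1) - 1
    return np.round(np.clip(samples, -1.0, 1.0) * levels).astype(np.int16)

def text_emotion(audio_digital: np.ndarray) -> dict:
    """S2 placeholder: content extraction, word segmentation and text emotion analysis."""
    return {"polarity": "neutral"}

def speech_emotion(audio_digital: np.ndarray, sr: int) -> dict:
    """S3 placeholder: speech-speed and intonation emotion feature analysis."""
    return {"speech_speed": 0.0, "intonation": 0.0}

def build_scene_model(text_feat: dict, speech_feat: dict) -> dict:
    """S4 placeholder: fuse both emotion views into a scene description."""
    return {"text": text_feat, "speech": speech_feat}

if __name__ == "__main__":
    sr = 16_000
    samples = np.random.uniform(-1, 1, sr)                  # one second of dummy audio
    digital = digitize(samples)                             # S1
    scene = build_scene_model(text_emotion(digital),        # S2
                              speech_emotion(digital, sr))  # S3 + S4
    print(scene)
```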
2. The audio processing method according to claim 1, wherein step S1 includes the steps of:
step S11: collecting the sound signal of the audio to obtain an audio signal;
step S12: performing sound signal sampling processing on the audio signal to obtain an audio sampling signal;
step S13: performing structure quantization processing on the audio sampling signal to obtain an audio sampling quantized signal;
step S14: and performing digital coding conversion on the audio sampling quantized signal to obtain audio digital data.
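Steps S12-S14 follow the usual sample → quantize → encode chain of PCM digitization. A minimal numpy sketch, assuming uniform 16-bit quantization and little-endian PCM byte encoding (the claim itself leaves both choices to the later steps):

```python
import numpy as np

def sample_signal(analog, duration_s: float, sample_rate: int = 16_000) -> np.ndarray:
    """S12: sample a continuous signal, given as a function of time, at sample_rate Hz."""
    t = np.arange(0.0, duration_s, 1.0 / sample_rate)
    return analog(t)

def quantize(samples: np.ndarray, n_bits: int = 16) -> np.ndarray:
    """S13 (uniform variant): map each sample onto one of 2**n_bits discrete levels."""
    levels = 2 ** (n_bits - 1) - 1
    return np.round(np.clip(samples, -1.0, 1.0) * levels)

def encode(quantized: np.ndarray) -> bytes:
    """S14: digital coding conversion -- here plain little-endian 16-bit PCM bytes."""
    return quantized.astype("<i2").tobytes()

if __name__ == "__main__":
    tone = lambda t: 0.5 * np.sin(2 * np.pi * 440 * t)    # 440 Hz test tone
    pcm = encode(quantize(sample_signal(tone, duration_s=0.01)))
    print(len(pcm), "bytes of audio digital data")
```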
3. The audio processing method according to claim 2, wherein step S13 includes the steps of:
step S131: carrying out sampling signal structure division on the audio sampling signals by using a preset amplitude structure division manual to obtain an audio structure signal set;
step S132: performing adjacent amplitude difference calculation on the audio structure signal set to obtain an audio amplitude signal set;
step S133: zero-crossing rate calculation is carried out on the audio amplitude signal set, and a zero-crossing rate signal set is obtained;
step S134: extracting overlapping parts of the audio amplitude signal sets to obtain audio amplitude overlapping signals;
step S135: and carrying out nonlinear quantization processing on the audio sampling signal according to the audio amplitude overlapping signal and the zero crossing rate signal set to obtain an audio sampling quantized signal.
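Claim 3 refines the quantization step into frame-wise amplitude analysis (adjacent amplitude differences, zero-crossing rate, overlap extraction) followed by a non-linear quantization. The sketch below computes the first two quantities per frame and applies μ-law companding as one possible non-linear quantization; the frame length and the use of μ-law are assumptions, and the overlap-extraction step S134 is omitted.

```python
import numpy as np

def frame_signal(x: np.ndarray, frame_len: int = 256) -> np.ndarray:
    """S131: divide the sampled signal into equal-length structural frames."""
    n = len(x) // frame_len
    return x[: n * frame_len].reshape(n, frame_len)

def adjacent_amplitude_diff(frames: np.ndarray) -> np.ndarray:
    """S132: first-order difference of the sample amplitudes inside each frame."""
    return np.diff(frames, axis=1)

def zero_crossing_rate(frames: np.ndarray) -> np.ndarray:
    """S133: fraction of sign changes per frame."""
    return (np.diff(np.sign(frames), axis=1) != 0).mean(axis=1)

def mu_law_quantize(x: np.ndarray, mu: float = 255.0, n_bits: int = 8) -> np.ndarray:
    """S135 (one option): mu-law companding, a widely used non-linear quantization."""
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    levels = 2 ** (n_bits - 1) - 1
    return np.round(compressed * levels).astype(np.int8)

if __name__ == "__main__":
    sig = 0.8 * np.sin(2 * np.pi * 440 * np.arange(16_000) / 16_000)
    frames = frame_signal(sig)
    print(adjacent_amplitude_diff(frames).shape, zero_crossing_rate(frames)[:3])
    print(mu_law_quantize(sig)[:5])
```

μ-law companding is the kind of non-linear quantization used in telephony codecs; it spends more quantization levels on small amplitudes, which is one way of quantizing according to the amplitude structure of the signal.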
4. The audio processing method according to claim 3, wherein step S2 comprises the steps of:
step S21: audio content extraction processing is carried out on the audio according to the audio digital data, so that audio content text data are obtained;
step S22: data cleaning is carried out on the text data of the audio content, and audio content cleaning data are obtained;
step S23: performing text word segmentation processing on the audio content text data according to the audio content cleaning data to obtain content text word segmentation data;
step S24: carrying out semantic analysis on the content text word segmentation data to obtain content text semantic data;
step S25: performing associated entity labeling on the content text semantic data to obtain content text associated data;
step S26: and carrying out emotion analysis on the content text semantic data according to the content text associated data to obtain text emotion analysis data.
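Claim 4 describes a text pipeline: cleaning the transcript, segmenting it into words, and running emotion analysis over the segmented words. A minimal sketch, assuming the transcript has already been produced by a speech recognizer, and using a plain whitespace tokenizer plus a tiny hand-made polarity lexicon as stand-ins for the segmenter and emotion resources the claim leaves unspecified:

```python
import re

# Toy polarity lexicon -- a stand-in for real emotion resources.
POLARITY = {"happy": 1, "great": 1, "calm": 0, "sad": -1, "angry": -1}

def clean(transcript: str) -> str:
    """S22: strip punctuation and collapse whitespace in the audio content text."""
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s']", " ", transcript)).strip().lower()

def segment(cleaned: str) -> list:
    """S23: word segmentation -- a plain whitespace split stands in for a real segmenter."""
    return cleaned.split()

def emotion(tokens: list) -> str:
    """S26 (simplified): aggregate lexicon polarity into positive / negative / neutral."""
    score = sum(POLARITY.get(t, 0) for t in tokens)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

if __name__ == "__main__":
    print(emotion(segment(clean("I am SO happy, this sounds great!"))))  # -> positive
```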
5. The audio processing method according to claim 4, wherein step S26 includes the steps of:
step S261: performing emotion part-of-speech screening on the content text associated data to obtain a key emotion part-of-speech list;
step S262: carrying out vocabulary combination conversion analysis according to the key emotion part-of-speech list to obtain a combined part-of-speech emotion list;
step S263: establishing an emotion dictionary according to the key emotion part-of-speech list and the combined part-of-speech emotion list to obtain a key emotion dictionary;
step S264: performing syntactic analysis on the content text semantic data to obtain text syntactic structure data;
step S265: and carrying out emotion analysis on the text syntactic structure data according to the key emotion dictionary to obtain text emotion analysis data.
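Claim 5 builds a key emotion dictionary from single emotion words and word combinations, then scores the text against it while taking sentence structure into account. The toy sketch below illustrates that idea with a hand-made dictionary and simple negation handling standing in for the syntactic analysis; none of the dictionary entries come from the application.

```python
# Toy key emotion dictionary: single emotion words and two-word combinations.
KEY_EMOTION = {"joyful": 2, "pleasant": 1, "gloomy": -1, "furious": -2}
COMBINED_EMOTION = {("very", "joyful"): 3, ("not", "pleasant"): -1}
NEGATORS = {"not", "never", "no"}

def score_tokens(tokens):
    """S263-S265 (simplified): match combinations first, then single words,
    flipping polarity after a negator as a crude stand-in for syntactic structure."""
    score, i = 0, 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in COMBINED_EMOTION:          # combined part-of-speech emotion entry
            score += COMBINED_EMOTION[pair]
            i += 2
            continue
        word_score = KEY_EMOTION.get(tokens[i], 0)
        if i > 0 and tokens[i - 1] in NEGATORS:
            word_score = -word_score
        score += word_score
        i += 1
    return score

if __name__ == "__main__":
    print(score_tokens("the music is very joyful".split()))     # 3, via the combined entry
    print(score_tokens("this is not pleasant at all".split()))  # -1
```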
6. The audio processing method according to claim 5, wherein step S3 includes the steps of:
step S31: collecting the character voice of the audio to obtain character voice data;
step S32: performing time stamp marking on the character voice data to obtain voice time stamp data;
step S33: performing ambient sound correction on the voice time stamp data by using an ambient sound correction algorithm to obtain corrected voice data;
step S34: carrying out speech speed phoneme feature extraction and intonation phoneme feature extraction on the corrected voice data to obtain speech speed phoneme feature data and intonation phoneme feature data;
step S35: carrying out emotion feature analysis on the speech speed phoneme feature data and the intonation phoneme feature data to obtain speech speed emotion feature data and intonation emotion feature data;
step S36: carrying out speech speed emotion evaluation on the speech speed emotion feature data according to the text emotion analysis data to obtain speech speed emotion evaluation data;
step S37: carrying out emotion adaptation evaluation on the intonation emotion feature data according to the speech speed emotion evaluation data to obtain intonation emotion evaluation data;
step S38: feature structure integration is carried out on the speech speed emotion evaluation data and the intonation emotion evaluation data to obtain speech emotion feature data;
step S39: and carrying out emotion recognition on the speech emotion feature data through a voice analysis emotion recognizer to obtain voice emotion analysis data.
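Claim 6 reduces the corrected speech to two feature streams: speech speed (how fast the speaker talks) and intonation (how the pitch moves). Below is a compact sketch of both extractions, using energy-peak counting as a speech-speed proxy and autocorrelation pitch tracking for the intonation contour; these are common stand-ins, not the phoneme-level extraction of step S34.

```python
import numpy as np

def frame(x: np.ndarray, size: int, hop: int) -> np.ndarray:
    """Slice a signal into overlapping frames of `size` samples every `hop` samples."""
    idx = np.arange(0, len(x) - size, hop)[:, None] + np.arange(size)
    return x[idx]

def speech_speed(x: np.ndarray, sr: int) -> float:
    """Rough speech-speed proxy: peaks of the short-time energy envelope per second."""
    frames = frame(x, size=int(0.025 * sr), hop=int(0.010 * sr))
    energy = (frames ** 2).sum(axis=1)
    peaks = (energy[1:-1] > energy[:-2]) & (energy[1:-1] > energy[2:]) \
            & (energy[1:-1] > energy.mean())
    return peaks.sum() / (len(x) / sr)

def intonation_contour(x: np.ndarray, sr: int) -> np.ndarray:
    """Per-frame fundamental frequency via autocorrelation (60-400 Hz search range)."""
    f0 = []
    for fr in frame(x, size=int(0.040 * sr), hop=int(0.010 * sr)):
        ac = np.correlate(fr, fr, mode="full")[len(fr) - 1:]
        lag_min, lag_max = sr // 400, sr // 60
        lag = lag_min + np.argmax(ac[lag_min:lag_max])
        f0.append(sr / lag)
    return np.array(f0)

if __name__ == "__main__":
    sr = 16_000
    t = np.arange(sr) / sr
    voiced = np.sin(2 * np.pi * (120 + 30 * t) * t)   # test tone with rising pitch
    print(round(speech_speed(voiced, sr), 2), intonation_contour(voiced, sr)[:3].round(1))
```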
7. The audio processing method according to claim 6, wherein the ambient sound correction algorithm in step S33 is as follows:
wherein f represents the corrected voice data, x represents the input voice time stamp data, λ represents the propagation velocity value of the sound wave, μ represents the environmental noise coefficient, α represents the air damping coefficient, β represents the sound wave amplitude coefficient, γ represents the carrier frequency value, t represents the voice duration value, and R represents the deviation correction value of the ambient sound correction algorithm.
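The expression of the ambient sound correction algorithm is not reproduced in the text above; only its symbols are listed. Purely as an illustration of how such quantities could combine (a damped, frequency-dependent attenuation of the input plus an additive correction term), one might write a form like the following; this is an assumption, not the claimed formula:

$$ f(x) = \frac{\beta \, x \, e^{-\alpha t} \cos\left(2\pi\gamma t\right)}{\lambda\,\mu} + R $$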
8. The audio processing method according to claim 6, wherein step S35 includes the steps of:
step S351: carrying out three-dimensional thermodynamic diagram drawing on the speech speed phoneme characteristic data and the intonation phoneme characteristic data to respectively obtain a speech speed phoneme characteristic diagram and an intonation phoneme characteristic diagram;
step S352: carrying out central thermodynamic region marking on the speech speed phoneme characteristic diagram and the intonation phoneme characteristic diagram to obtain speech speed thermodynamic region data and intonation thermodynamic region data;
step S353: carrying out distribution density calculation on the speech speed thermodynamic region data according to the speech speed phoneme characteristic diagram to obtain speech speed thermodynamic density data;
step S354: carrying out fluctuation change calculation on the intonation thermodynamic region data according to the intonation phoneme characteristic diagram to obtain intonation fluctuation change data;
step S355: carrying out regional density random extraction on the speech speed thermodynamic density data to obtain speech speed random density data; randomly extracting the intonation fluctuation change data to obtain intonation random fluctuation data;
step S356: respectively carrying out Monte Carlo simulation on the speech speed random density data and the intonation random fluctuation data to respectively obtain speech speed simulation output data and intonation simulation output data;
step S357: and carrying out emotion feature analysis according to the speech speed simulation output data and the intonation simulation output data to obtain speech speed emotion feature data and intonation emotion feature data.
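The heart of claim 8 is a Monte Carlo step: values are drawn at random from the speech speed density data and the intonation fluctuation data, and the simulated outputs are then summarized into emotion features. A small numpy sketch of that resampling idea; the heat-map construction of steps S351-S354 is replaced here by synthetic density and fluctuation arrays, and the final emotion readout is an invented illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the S353/S354 outputs: per-region speech speed density and pitch fluctuation.
speech_speed_density = rng.normal(loc=4.5, scale=0.8, size=500)      # syllables per second
intonation_fluctuation = rng.normal(loc=25.0, scale=10.0, size=500)  # Hz of pitch swing

def monte_carlo(values: np.ndarray, n_draws: int = 10_000) -> dict:
    """S355-S356: random extraction with replacement, then Monte Carlo summary statistics."""
    draws = rng.choice(values, size=n_draws, replace=True)
    return {"mean": draws.mean(), "std": draws.std(), "p90": np.percentile(draws, 90)}

if __name__ == "__main__":
    speed_stats = monte_carlo(speech_speed_density)
    pitch_stats = monte_carlo(intonation_fluctuation)
    # S357 (toy readout): fast, strongly fluctuating speech is treated as emotionally aroused.
    aroused = speed_stats["mean"] > 5.0 and pitch_stats["std"] > 8.0
    print(speed_stats, pitch_stats, "aroused" if aroused else "calm")
```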
9. The audio processing method according to claim 6, wherein the construction of the voice analysis emotion recognizer comprises the steps of:
acquiring historical voice data;
carrying out structural segmentation processing according to the historical voice data to obtain a historical voice data set;
performing time period marking on the historical voice data set to obtain a voice time period data set;
extracting voice fluctuation points according to the voice time period data set to obtain a voice fluctuation point data set;
extracting a voice edge time period according to the voice fluctuation point data set to obtain a voice edge data set;
and performing voice machine learning on the voice fluctuation point data set by utilizing the Scikit-learn machine learning library, and performing coupling association through the voice edge data set to obtain the voice analysis emotion recognizer.
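Claim 9 names the Scikit-learn library for training the voice analysis emotion recognizer on the voice fluctuation point data set. A minimal sketch with synthetic fluctuation-point features and an off-the-shelf classifier; the three-column feature layout, the labels and the choice of RandomForestClassifier are assumptions, and the coupling association with the voice edge data set is not modeled.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the historical voice fluctuation point data set:
# each row = [fluctuation amplitude, fluctuation rate, edge-period length], label = emotion class.
X = rng.normal(size=(600, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)      # two toy emotion classes

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

recognizer = RandomForestClassifier(n_estimators=100, random_state=0)
recognizer.fit(X_train, y_train)                    # the "voice machine learning" step
print("held-out accuracy:", accuracy_score(y_test, recognizer.predict(X_test)))
```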
10. An audio processing system for performing the audio processing method as claimed in claim 1, the audio processing system comprising:
the audio conversion module is used for collecting the audio signals to obtain audio sampling signals; performing structure quantization processing on the audio sampling signal to obtain an audio sampling quantized signal; and performing digital coding conversion on the audio sampling quantized signal to obtain audio digital data;
the text analysis module is used for carrying out audio content extraction processing according to the audio digital data to obtain audio content cleaning data; performing text word segmentation processing according to the audio content cleaning data to obtain content text word segmentation data; carrying out emotion analysis according to the content text word segmentation data to obtain text emotion analysis data;
the voice analysis module is used for performing time stamp marking on the audio to obtain voice time stamp data, and performing ambient sound correction on the voice time stamp data to obtain corrected voice data; performing emotion feature analysis on the corrected voice data to obtain speech speed emotion feature data and intonation emotion feature data; feature structure integration is carried out on the speech speed emotion feature data and the intonation emotion feature data to obtain speech emotion feature data; carrying out emotion recognition on the speech emotion feature data through a voice analysis emotion recognizer to obtain voice emotion analysis data;
the audio scene restoration module is used for constructing an audio scene model according to the text emotion analysis data and the voice emotion analysis data to obtain an audio scene restoration model so as to realize audio scene restoration and play.