CN116665669A - Voice interaction method and system based on artificial intelligence - Google Patents

Voice interaction method and system based on artificial intelligence

Info

Publication number
CN116665669A
Authority
CN
China
Prior art keywords
data
user
voice
emotion
user voice
Prior art date
Legal status
Pending
Application number
CN202310892002.5A
Other languages
Chinese (zh)
Inventor
吉祥
Current Assignee
Shanghai Haiqi Technology Co ltd
Original Assignee
Shanghai Haiqi Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Haiqi Technology Co ltd
Priority to CN202310892002.5A
Publication of CN116665669A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks
    • G10L 15/18 Speech classification or search using natural language modelling
    • G10L 15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L 15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 Noise filtering
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L 25/18 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G10L 25/21 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being power information
    • G10L 25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L 25/30 Speech or voice analysis techniques characterised by the analysis technique using neural networks
    • G10L 25/48 Speech or voice analysis techniques specially adapted for particular use
    • G10L 25/51 Speech or voice analysis techniques specially adapted for particular use for comparison or discrimination
    • G10L 25/63 Speech or voice analysis techniques specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G10L 2015/225 Feedback of the input speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Quality & Reliability (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Child & Adolescent Psychology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the technical field of artificial intelligence, and in particular to a voice interaction method and system based on artificial intelligence. The method comprises the following steps: acquiring user voice data and performing scene noise reduction to obtain user voice scene noise reduction data; performing optimized framing on the user voice scene noise reduction data to obtain user voice optimized framing data; performing semantic feature extraction and emotion feature extraction on the user voice optimized framing data to obtain user voice semantic feature data and user voice emotion feature data; recognizing the user voice semantic feature data to obtain user voice text data, and correcting the user voice text data with the user voice emotion feature data to obtain user emotion voice text data; and mining the user intention from the user emotion voice text data to obtain user intention data. The invention generates a corresponding voice response from the voice input, improving interaction efficiency and user satisfaction.

Description

Voice interaction method and system based on artificial intelligence
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a voice interaction method and system based on artificial intelligence.
Background
Artificial intelligence (AI) refers to the technologies and methods for studying and developing intelligence that simulates, extends and expands human intelligence. It encompasses a number of fields, including machine learning, natural language processing and computer vision, with the aim of giving computer systems the ability to perceive, understand, learn and make decisions in a manner similar to human intelligence. A voice interaction method is based on speech technology and human-computer interaction: interaction between a person and a computer is realized through voice input and voice output. It enables users to communicate with and operate computers using natural language and speech, without relying on conventional input devices such as keyboards and mice. Current voice interaction systems have certain limitations in understanding and reasoning about the user's intention and context; for complex semantic understanding, inference and dialogue management, they cannot accurately understand and promptly respond to the user's intention.
Disclosure of Invention
The application provides a voice interaction method and a voice interaction system based on artificial intelligence to solve at least one technical problem.
The application provides a voice interaction method based on artificial intelligence, which comprises the following steps:
Step S1: acquiring user voice data, and performing scene noise reduction on the user voice data so as to acquire user voice scene noise reduction data;
step S2: performing optimized framing treatment on the denoising data of the user voice scene, thereby obtaining user voice optimized framing data;
step S3: carrying out semantic feature extraction and emotion feature extraction on the user voice optimization framing data so as to obtain user voice semantic feature data and user voice emotion feature data;
step S4: identifying the user voice semantic feature data by using a preset user voice identification model so as to obtain user voice text data, and correcting the user voice text data by using the user voice emotion feature data so as to obtain user emotion voice text data;
step S5: mining user intention according to the user emotion voice text data, so as to acquire user intention data;
step S6: and acquiring voice interaction time data corresponding to the user voice data, and performing voice interaction generation according to the voice interaction time data, the user intention data and the user voice text data, so as to acquire voice interaction text data for performing voice interaction operation.
The invention can reduce the influence of the environmental noise on the voice signal and extract the clear user voice data by the scene noise reduction and the optimized framing processing, thereby improving the accuracy of voice recognition. Through semantic feature extraction and emotion feature extraction, semantic information and emotion features can be extracted from user voice data, so that a system can better understand user intention and emotion state. By converting user speech into text data and combining the user intent mining and the speech interaction time data, more contextual and personalized speech interaction text data can be generated, thereby providing a more intelligent, flexible and natural speech interaction experience. Through correcting the user voice text data and combining the user emotion feature data, the perception and response of the user emotion can be realized, and the system can perform emotional interaction with the user more humanizedly. Through user intention mining and voice interaction generation, user intention data can be obtained from user voice data, corresponding voice interaction text data is deduced and generated according to context and voice interaction time, and accurate understanding and response capability of user intention are improved.
Preferably, the user voice scene noise reduction data includes user voice high-frequency noise reduction data, user voice low-frequency noise reduction data and user voice deep noise reduction data, and step S1 specifically includes:
step S11: acquiring user voice data and scene noise data in real time;
step S12: extracting scene noise characteristics from the scene noise data, thereby obtaining scene noise characteristic data;
step S13: noise characteristic processing is carried out according to the scene noise characteristic data, so that scene noise characteristic data are obtained;
step S14: when the scene noise characteristic data are determined to be high-frequency noise characteristic data, performing high-frequency noise reduction processing on the user voice data, thereby obtaining the high-frequency noise reduction data of the user voice;
step S15: when the scene noise characteristic data are determined to be low-frequency noise characteristic data, performing low-frequency noise reduction processing on the user voice data, so as to obtain the user voice low-frequency noise reduction data;
step S16: when the scene noise characteristic data are the mixed noise characteristic data, the noise reduction processing is carried out on the user voice data by using a preset scene depth noise reduction model, so that the user voice depth noise reduction data are obtained, wherein the scene depth noise reduction model is formed by carrying out deep neural network learning by using a voice signal and a preset environment noise set, and parameter adjustment is carried out according to the scene noise characteristic data so as to carry out noise reduction operation.
According to the invention, the voice input of the user and the noise condition in the current environment can be captured in time by acquiring the voice data of the user and the scene noise data in real time, so that an accurate data base is provided for subsequent processing. By extracting features from the scene noise data, feature information about ambient noise can be obtained, helping the system to better understand and analyze the noise characteristics of the current environment. And according to the scene noise characteristic data, performing corresponding noise characteristic processing, such as high-frequency noise reduction, low-frequency noise reduction or mixed noise reduction, on the user voice data. Therefore, the interference of the environmental noise to the voice signal can be reduced, and the quality and the identifiability of the voice signal are improved. When scene noise characteristic data are mixed noise characteristic data, the noise reduction processing is carried out on the user voice data by utilizing a preset scene depth noise reduction model, so that complex noise interference can be removed more accurately, and the definition and the understandability of voice signals are improved. According to the scene noise characteristic data, parameters of the scene depth noise reduction model are adjusted and optimized to adapt to noise reduction requirements of different scenes and noise characteristics, and noise reduction effect and voice quality are further improved.
Preferably, the user voice optimizing framing data includes first user voice framing data, second user voice framing data and third user voice framing data, and step S2 specifically includes:
step S21: when the scene noise characteristic data are high-frequency noise characteristic data, performing first framing processing on the user voice scene denoising data, so as to obtain first user voice framing data;
step S22: when the scene noise characteristic data are determined to be low-frequency noise characteristic data, performing second framing processing on the user voice scene denoising data, so as to obtain second user voice framing data;
step S23: when the scene noise characteristic data are the mixed noise characteristic data, performing third framing processing on the user voice scene denoising data, so as to obtain third user voice framing data, wherein the framing length of the first user voice framing data is larger than that of the third user voice framing data, and the framing length of the third user voice framing data is larger than or equal to that of the second user voice framing data.
The invention divides a continuous speech signal into a series of frames, each frame containing shorter speech segments. The framing process helps to extract local features of the speech and convert the speech data into a more manageable form. Finer framing operations are performed according to the requirements of low frequency noise characteristics. Unlike the first framing process, the second framing process may employ different framing parameters to better accommodate the processing requirements of low frequency noise. The third framing process may have a more flexible framing length and parameter settings than the first two framing processes to accommodate voice data characteristics in complex noise environments. After framing, each frame may better represent features of short-time speech segments, such as audio spectrum, energy. This facilitates subsequent processing steps of semantic feature extraction and speech recognition. After framing, the speech signal in each frame is relatively short and the effect of noise in each frame is relatively small. This can alleviate the influence of noise on speech processing, improve the accuracy of subsequent processing. According to the different scene noise characteristic data, the corresponding framing processing strategy is selected, so that different types of noise can be better adapted, and better voice signal quality is provided.
Preferably, step S3 is specifically:
carrying out normalization processing on the user voice optimized framing data so as to obtain user normalized framing data;
and carrying out semantic feature extraction and emotion feature extraction on the user normalized framing data so as to obtain user voice semantic feature data and user voice emotion feature data.
In the invention, normalization processing is carried out on user voice optimization framing data. The normalization can unify the numerical ranges among different frames and eliminate the amplitude difference among different frames. This helps to improve data comparability and stability in the subsequent feature extraction process. Semantic features are extracted from the normalized framing data. Semantic features refer to information contained in speech that is related to semantic content, such as the pitch, pace, prosody of the speaker. By extracting semantic features, the semantic content of the voice can be captured better, and an important clue is provided for subsequent tasks of voice recognition and intention understanding. And extracting emotion characteristics from the normalized framing data. Emotional characteristics refer to emotional states expressed in speech, such as happiness, anger, sadness. By extracting the emotion characteristics, the emotion tendency and emotion state of the user can be known, and a basis is provided for emotion analysis and emotion recognition application.
Preferably, step S4 is specifically:
step S41: recognizing the user voice semantic feature data by using a preset user voice recognition model so as to acquire user voice text data;
step S42: carrying out emotion label identification and scoring processing on the user voice emotion feature data so as to obtain user voice emotion label data and user voice emotion scoring data;
step S43: carrying out emotion correction on the user voice text data according to the user voice emotion label data and the user voice emotion scoring data, so as to obtain user emotion voice correction data;
step S44: generating user emotion voice text according to the user emotion voice correction data and the user voice emotion feature data, thereby obtaining the user emotion voice text data;
the scoring processing is performed through a user voice emotion scoring calculation formula, wherein the user voice emotion scoring calculation formula specifically comprises:
s is user speech emotion scoring data, e is a natural index term,the method is characterized in that the method comprises the steps of taking user voice emotion characteristic data, wherein beta is a user voice emotion tag data weight item, f (t) is a user voice emotion change function, t is time data, u is a user voice emotion stability item, r is a user voice emotion adjustment item, d is a scaling index, k is a user voice emotion constant item, and q is a user voice emotion random item.
The invention constructs a user voice emotion score calculation formula, which converts the user's voice emotion characteristic data into emotion scoring data S. This allows the intensity or tendency of the emotion expressed in the user's speech to be quantified and measured. The parameter β in the formula is used to adjust the weight of the emotion label data, that is, to adjust the importance of different emotion labels. By adjusting the value of β, the contribution of different emotion labels to the score can be weighted flexibly so as to meet the requirements of actual emotion analysis. f(t) in the formula represents the change function of the user's voice emotion, where t is the time data. The emotion change function describes how the user's emotion evolves over time; through the shape and parameter adjustment of this function, the trend of the emotion in different time periods can be captured, further refining the accuracy of the emotion score. The parameters u and r in the formula represent the stability and adjustment terms of the emotion, respectively. The emotion stability term u reflects how stable the user's emotion is, allowing emotional persistence and consistency to be taken into account. The emotion adjustment term r is used to correct deviations of the emotion score or to adjust the flexibility of the score; the emotion score can be fine-tuned by adjusting the values of u and r to meet the requirements of specific situations. The parameters d, k and q in the formula represent the scaling index, constant term and random term, respectively, and allow the emotion score to be scaled, offset and randomly adjusted, so that the scoring result can be further corrected and personalized, increasing the variability and richness of the score. The user voice emotion feature data are converted into emotion scoring data, and emotion characteristics of different aspects can be weighted and adjusted by tuning the values of the parameters, so as to obtain a more accurate and fine-grained emotion scoring result. Such emotion scoring data provide a basis for emotion recognition and understanding in voice interaction, offering support and guidance for the system's emotional interaction and responses.
In the invention, the preset user voice recognition model is utilized to recognize the user voice semantic feature data, so that the voice is converted into text data. This allows the system to understand the user's voice input and perform subsequent semantic processing and intent analysis. And carrying out emotion label identification and scoring processing on the user voice emotion characteristic data. This allows the emotional state expressed in the user's speech to be identified and scored quantitatively. Through emotion labels and scoring, the system can better understand the emotion tendencies and emotional states of the user. And carrying out emotion correction on the user voice text data according to the user voice emotion label data and the user voice emotion scoring data. This means that the system can adjust the generated phonetic text to better match the user's emotional state based on the user's emotional characteristics. And generating voice text data according with the emotion state of the user according to the emotion voice correction data of the user and the emotion characteristic data of the user. This enables the system to produce a voice response consistent with the user's emotion, enhancing the emotion resonance and interaction effects of the user interaction.
Preferably, the construction step of the user voice recognition model specifically comprises the following steps:
Step S411: standard user voice data and corresponding meta user voice identification data are acquired;
step S412: preprocessing standard user voice data so as to obtain standard user voice preprocessing data;
step S413: extracting features of the standard user voice preprocessing data so as to obtain standard user voice feature data;
step S414: performing convolution calculation according to the standard user voice characteristic data so as to obtain standard user voice convolution layer data;
step S415: carrying out pooling calculation on the standard user voice convolution layer data so as to obtain standard user voice pooling layer data;
step S416: carrying out full connection calculation on the standard user voice pooling layer data so as to obtain standard user voice full connection data;
step S417: and identifying the standard user voice full-connection data by using the meta user voice identification data so as to construct a user voice recognition model.
In the present invention, through step S411, standard user voice data and corresponding meta user voice identification data are obtained. These data are critical to building accurate and reliable speech recognition models because they provide a true speech sample and corresponding label data for training and evaluating the model. Standard user speech data is preprocessed to prepare for feature extraction and model training, via step S412. The preprocessing step may include operations to remove noise, speech signal enhancement, audio segmentation to enhance the effectiveness of subsequent steps. The voice features are extracted from the standard user voice pre-processing data, via step S413. The speech features may be vocal spectral features, mel-spectral features, cepstral coefficients, which are capable of capturing important information of the speech signal and providing useful input to the model. The standard user voice feature data is convolved and pooled by steps S414 and S415. These computational operations help extract local and global structural information of the features and reduce the dimensionality of the data, thereby reducing the complexity and computational effort of the model. Full connection calculation is performed on the standard user voice pooling layer data, via step S416. The fully connected layer is able to learn higher level feature representations and provide decision making capability for classification and recognition tasks of the model. Through step S417, standard user voice full connection data is identified by using meta user voice identification data, thereby constructing a user voice recognition model. The model can be used for mapping voice characteristic data of a user to corresponding identifications or texts so as to realize the function of voice recognition.
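As an illustration of the convolution, pooling and full-connection calculations in steps S414 to S416, the following is a minimal sketch in PyTorch; the layer widths, input shape and label vocabulary size are illustrative assumptions and are not values taken from the patent.

```python
import torch
import torch.nn as nn

class UserSpeechRecognizer(nn.Module):
    """Convolution -> pooling -> fully connected, as in steps S414-S416."""

    def __init__(self, n_labels: int = 500):
        super().__init__()
        # Step S414: convolution over the (time x frequency) feature map.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                 # step S415: pooling
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.squeeze = nn.AdaptiveAvgPool2d((4, 4))
        # Step S416: full connection producing label scores.
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 4 * 4, 256), nn.ReLU(),
            nn.Linear(256, n_labels),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, 1, n_frames, n_mels) standard-user feature data.
        return self.fc(self.squeeze(self.conv(features)))

# Step S417: fit the network against the meta user voice identification data
# (integer labels), e.g. with cross-entropy loss.
model = UserSpeechRecognizer()
scores = model(torch.randn(2, 1, 100, 40))   # two dummy utterances
loss = nn.CrossEntropyLoss()(scores, torch.tensor([3, 7]))
```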
Preferably, step S5 is specifically:
text matching is carried out according to the user emotion voice text data through a preset user intention rule set, so that user intention data are obtained;
the text matching method specifically comprises the following steps:
step S51: performing first text matching according to user emotion voice text data and a preset user intention rule set, so as to obtain first user intention data;
step S52: performing second text matching according to the user emotion voice text data and a preset user intention rule set so as to obtain second user intention data, wherein the first text matching and the second text matching are different character string matching algorithms, and the primary user intention data comprise first user intention data and second user intention data;
step S53: when the primary user intention data is primary user intention data which does not contain user intention data, acquiring scene audio data, and mining scene intention according to the scene audio data so as to acquire the user intention data;
step S54: when the primary user intention data is determined to be primary user intention data containing only one user intention data, the primary user intention data is determined to be user intention data;
Step S55: when the primary user intention data is determined to contain at least two user intention data, confidence calculation is carried out on the primary user intention data according to the historical user intention matching data, so that primary user intention confidence data is obtained;
step S56: and performing optimal confidence screening on the primary user intention data by using the primary user intention confidence data so as to acquire the user intention data.
According to the method, through step S5, text matching is carried out according to the user emotion voice text data through a preset user intention rule set, so that user intention data are obtained. This can help the system understand the intent and needs of the user and provide guidance and corresponding response for subsequent voice interaction processes. Through steps S51 and S52, text matching is performed using different character string matching algorithms, thereby obtaining first user intention data and second user intention data. By adopting different matching algorithms, the accuracy and coverage range of matching can be improved, and the understanding and recognition capability of the system to the intention of the user can be enhanced. The primary user intention data is processed through steps S53 and S54. When the primary user intention data contains only one user intention, directly determining as user intention data; further processing is required when the primary user intent data does not contain a user intent or contains multiple user intents. When the primary user intention data does not contain a user intention, scene intention mining is performed using the scene audio data, through step S53. This helps to infer the user's intent from the context information of the scene, thereby capturing the user's needs more accurately. Through step S55, when the primary user intention data contains a plurality of user intents, confidence calculation is performed on the primary user intention data according to the historical user intention matching data. This may infer the reliability and accuracy of the primary user intent from the historical matching results and patterns. Through step S56, the primary user intention data is optimally confidence filtered using the primary user intention confidence data, thereby obtaining user intention data. The intention with lower confidence can be filtered, and the accuracy and reliability of the system on the intention of the user are improved.
Preferably, the confidence calculation is performed through a user intention confidence calculation formula, wherein:
C is the primary user intention confidence data; x is the primary user intention data; n is the number of primary user intention data; i is the ordinal term; o is a base constant term; a_i is the historical primary user intention matching degree term; x_i is the i-th primary user intention data; b is the recent historical user intention fitting degree term; c is the user scene adjustment term; e is a correction error term; f is a fuzzy error term; g is a basic constant term; and h is a random term.
The invention constructs a user intention confidence calculation formula, which is used to calculate the confidence of the primary user intention data, that is, to measure the degree of confidence the system has in the user's intention. Through this calculation, the primary user intention data can be mapped to a confidence value used to judge the reliability and accuracy of the intention. Different parameters play different roles in the formula. For example, the historical primary user intention matching degree term a_i and the primary user intention data term x_i reflect the degree of match between historical matching results and the current intention; the recent-intention fitting degree term b reflects how well the most recent historical user intentions fit; the user scene adjustment term c adjusts the confidence according to the current scene; and the correction error term e and the fuzzy error term f account for correction of errors and for fuzziness. The confidence data of the primary user intention are calculated through the interaction of these terms. This facilitates the system's understanding and inference of the user's intention and provides more reliable and accurate intention recognition results. Meanwhile, the settings of the different parameters can be flexibly adjusted according to specific conditions to adapt to different application scenes and user requirements.
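As a sketch of the confidence screening in steps S55 and S56: the patent's exact confidence formula is not reproduced here, so the example below substitutes a simple historical-match-rate proxy plus a scene adjustment bonus; the function names, data and thresholds are assumptions.

```python
from collections import Counter
from typing import Dict, List, Optional

def screen_by_confidence(candidates: List[str],
                         history: List[str],
                         scene_bonus: Optional[Dict[str, float]] = None) -> str:
    """Steps S55-S56: score each candidate intent and keep the best one."""
    counts = Counter(history)
    total = max(len(history), 1)
    bonus = scene_bonus or {}

    def confidence(intent: str) -> float:
        # Historical agreement (a_i analogue) plus a scene adjustment (c analogue);
        # both terms are illustrative stand-ins for the patent's formula.
        return counts[intent] / total + bonus.get(intent, 0.0)

    return max(candidates, key=confidence)

# History favours "play_music", so it wins over "set_alarm".
print(screen_by_confidence(["play_music", "set_alarm"],
                           ["play_music", "play_music", "set_alarm"]))
```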
Preferably, step S6 is specifically:
step S61: according to the voice interaction time data, time sequence labeling and association are carried out on the user intention data and the user voice text data, so that the time sequence data of the user intention voice text are obtained;
step S62: generating voice interaction form data according to the user intention data;
step S63: performing recognition calculation on the user intention voice text time sequence data by using a preset artificial intelligent voice interaction generation model so as to acquire voice interaction text data;
step S64: and performing voice interaction operation according to the voice interaction form data and the voice interaction text data.
In the invention, through step S61, the user intention data and the user voice text data are marked and associated in time sequence, so that the user intention voice text time sequence data is obtained. Such labeling and association processes can help the system understand and analyze the user's intent and voice content in voice interactions and establish correspondence between the two. In step S62, voice interactive form data is generated from the user intention data. This may enable the system to talk and interact with the user in a voice interaction manner, enhancing the user experience and naturalness of the interaction. Through step S63, recognition calculation is performed on the user intention voice text time sequence data by using the preset artificial intelligence voice interaction generation model, so as to obtain voice interaction text data. This allows the system to automatically recognize and understand the user's voice input and convert it into text form for subsequent processing and response. Step S64 performs a voice interaction operation using the voice interaction form data and the voice interaction text data. This means that the system can respond to the user's intention and voice input, perform specific tasks or provide relevant information, thereby achieving meaningful voice interaction with the user.
Preferably, a voice interaction system based on artificial intelligence comprises:
user voice data acquisition module: the method comprises the steps of acquiring user voice data, and performing scene noise reduction on the user voice data so as to acquire user voice scene noise reduction data;
user voice optimization framing module: the method comprises the steps of carrying out optimized framing treatment on user voice scene denoising data so as to obtain user voice optimized framing data;
the user voice feature extraction module: the method comprises the steps of carrying out semantic feature extraction and emotion feature extraction on user voice optimization framing data so as to obtain user voice semantic feature data and user voice emotion feature data;
the user emotion voice text processing module: the method comprises the steps of identifying user voice semantic feature data by using a preset user voice identification model so as to obtain user voice text data, and correcting the user voice text data by using user voice emotion feature data so as to obtain user emotion voice text data;
user intention mining module: the method comprises the steps of carrying out user intention mining according to user emotion voice text data so as to obtain user intention data;
the voice interaction text processing module: the voice interaction method comprises the steps of acquiring voice interaction time data corresponding to user voice data, and performing voice interaction generation according to the voice interaction time data, user intention data and user voice text data, so that voice interaction text data are acquired, and voice interaction operation is performed.
The invention has the beneficial effects that: through scene noise reduction processing, the interference of environmental noise on the voice signal can be reduced, and the accuracy and effect of the subsequent processing steps are improved. Framing the user voice scene noise reduction data yields the user voice optimized framing data; segmenting the continuous speech signal into short-time frames allows subsequent feature extraction and speech recognition algorithms to analyse and process the speech at finer granularity. Semantic feature extraction and emotion feature extraction are carried out on the user voice optimized framing data so as to acquire the semantic information and emotion information of the user voice. These features are important for the subsequent tasks of speech recognition, intent mining and emotion analysis, and can provide deeper and more accurate speech understanding and emotion recognition. The user voice semantic feature data are recognized with a preset user voice recognition model to obtain the user voice text data; at the same time, the text data are corrected according to the user emotion feature data to obtain the user emotion voice text data. This step improves the accuracy of speech recognition and corrects the recognition result according to the user's emotional characteristics, so that the final voice text better matches the user's expression and emotion. User intention mining is carried out on the user emotion voice text data so as to acquire the user intention data. By combining the user's voice input and emotion information, the user's intention can be understood more accurately and a more precise and personalized voice interaction service can be provided. Voice interaction generation is carried out according to the user intention data, the voice interaction time data and the user voice text data to obtain the voice interaction text data. This step converts the user's intention and voice input into interactable voice text so as to enable meaningful voice interaction with the user. By considering the interaction time, the user intention and the voice content, this step can generate more natural, smooth and personalized voice interaction text, improving the user experience and interaction effect.
Drawings
Other features, objects and advantages of the application will become more apparent upon reading of the detailed description of a non-limiting implementation, made with reference to the accompanying drawings in which:
FIG. 1 illustrates a flow chart of steps of an artificial intelligence based voice interaction method of an embodiment;
FIG. 2 shows a step flow diagram of step S1 of an embodiment;
FIG. 3 shows a step flow diagram of step S2 of an embodiment;
FIG. 4 shows a step flow diagram of step S4 of an embodiment;
FIG. 5 is a flow chart illustrating steps of a method for constructing a user speech recognition model of an embodiment;
FIG. 6 shows a step flow diagram of step S5 of an embodiment;
fig. 7 shows a step flow diagram of step S6 of an embodiment.
Detailed Description
The following is a clear and complete description of the technical method of the present patent in conjunction with the accompanying drawings, and it is evident that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present application without making any inventive effort, are intended to fall within the scope of the present application.
Furthermore, the drawings are merely schematic illustrations of the present application and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus a repetitive description thereof will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. The functional entities may be implemented in software or in one or more hardware modules or integrated circuits or in different networks and/or processor methods and/or microcontroller methods.
It will be understood that, although the terms "first," "second," etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another element. For example, a first element could be termed a second element, and, similarly, a second element could be termed a first element, without departing from the scope of example embodiments. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
Referring to fig. 1 to 7, the application provides a voice interaction method and a system based on artificial intelligence, wherein the voice interaction method based on artificial intelligence comprises the following steps:
step S1: acquiring user voice data, and performing scene noise reduction on the user voice data so as to acquire user voice scene noise reduction data;
specifically, voice data of the user is collected, for example, using a microphone or other audio device. Then, a scene noise reduction algorithm, such as a noise reduction method based on an acoustic model or a noise reduction method based on deep learning, is applied to perform noise reduction processing on the voice data, so as to remove background noise and environmental interference and obtain user voice scene noise reduction data.
Step S2: performing optimized framing treatment on the denoising data of the user voice scene, thereby obtaining user voice optimized framing data;
specifically, for example, the user voice scene noise reduction data are subjected to optimized framing processing. First, the continuous speech signal is segmented into short-time frames, typically using the short-time Fourier transform (STFT) or other framing methods. A window function, such as a Hamming or Hanning window, may then be applied to each frame to reduce spectral leakage. Finally, the user voice optimized framing data are obtained.
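A minimal sketch of this framing step, assuming NumPy; the 25 ms frame length and 10 ms hop are illustrative defaults, not values prescribed by the patent.

```python
import numpy as np

def frame_signal(signal: np.ndarray, sample_rate: int,
                 frame_ms: float = 25.0, hop_ms: float = 10.0) -> np.ndarray:
    """Split a 1-D signal into overlapping Hamming-windowed frames.

    Assumes the signal is at least one frame long.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    window = np.hamming(frame_len)
    return np.stack([
        signal[i * hop_len: i * hop_len + frame_len] * window
        for i in range(n_frames)
    ])                                    # shape: (n_frames, frame_len)

frames = frame_signal(np.random.randn(16000), sample_rate=16000)
```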
Step S3: carrying out semantic feature extraction and emotion feature extraction on the user voice optimization framing data so as to obtain user voice semantic feature data and user voice emotion feature data;
specifically, semantic features and emotional features are captured, for example, by feature extraction from the user voice optimized framing data. Common semantic feature extraction methods include Mel-frequency cepstral coefficients (MFCC) and filter bank features. Emotion feature extraction may use emotion recognition algorithms, such as emotion analysis methods based on features like sound energy, fundamental frequency and pitch. These feature extraction methods yield the semantic information and emotion information of the user voice.
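A minimal sketch of MFCC-based semantic feature extraction, assuming the librosa library (the patent does not name a toolkit); the file name and the number of coefficients are assumptions.

```python
import librosa
import numpy as np

y, sr = librosa.load("user_utterance.wav", sr=16000)      # hypothetical file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)         # (13, n_frames)
# First-order deltas are appended as a simple dynamic feature.
semantic_features = np.vstack([mfcc, librosa.feature.delta(mfcc)])
```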
Step S4: identifying the user voice semantic feature data by using a preset user voice identification model so as to obtain user voice text data, and correcting the user voice text data by using the user voice emotion feature data so as to obtain user emotion voice text data;
specifically, for example, pre-trained user speech recognition models, such as deep learning models (e.g., recurrent neural networks, convolutional neural networks), are used to recognize semantic feature data of user speech and convert it into corresponding text data. And then, according to emotion characteristic data of the user, such as emotion analysis results or emotion tendencies of the user, correcting or adjusting the voice text data so as to obtain emotion voice text data of the user.
Step S5: mining user intention according to the user emotion voice text data, so as to acquire user intention data;
specifically, the user emotion voice text data are analyzed and processed to extract the user's intention information, for example using natural language processing (NLP) techniques such as text classification, keyword extraction and named entity recognition. This may include identifying a user's question, demand, instruction or other expressed intent so as to obtain the user intention data.
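A minimal sketch of rule-set text matching with two different string-matching strategies, in the spirit of steps S51 and S52; the rule set, keywords and fuzzy-matching cutoff are illustrative assumptions.

```python
import difflib
from typing import Optional

# Hypothetical preset user intention rule set.
INTENT_RULES = {
    "play_music": ["play", "music", "song"],
    "weather_query": ["weather", "temperature", "rain"],
}

def match_exact(text: str) -> Optional[str]:
    """First text matching (step S51): plain keyword containment."""
    for intent, keywords in INTENT_RULES.items():
        if any(k in text.lower() for k in keywords):
            return intent
    return None

def match_fuzzy(text: str, cutoff: float = 0.8) -> Optional[str]:
    """Second text matching (step S52): fuzzy tokens tolerate recognition errors."""
    tokens = text.lower().split()
    for intent, keywords in INTENT_RULES.items():
        for k in keywords:
            if difflib.get_close_matches(k, tokens, n=1, cutoff=cutoff):
                return intent
    return None

text = "please play a songg"
primary_intents = {m for m in (match_exact(text), match_fuzzy(text)) if m}
```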
Step S6: and acquiring voice interaction time data corresponding to the user voice data, and performing voice interaction generation according to the voice interaction time data, the user intention data and the user voice text data, so as to acquire voice interaction text data for performing voice interaction operation.
Specifically, voice interaction time data is obtained, for example, in combination with voice data of a user and corresponding time stamp information. Then, based on the user intent data and the phonetic text data, a natural language generation (Natural Language Generation, NLG) technique or other generation model is utilized to generate phonetic interactive text data that matches the user intent. This may include answering a user's question, providing relevant information, executing instructions, or performing other interactive operations to meet the user's needs.
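A minimal sketch of combining interaction time, intent and recognized text into a response; a template lookup stands in here for the patent's artificial intelligence voice interaction generation model, and the intents and templates are assumptions.

```python
from datetime import datetime

def generate_interaction_text(intent: str, user_text: str,
                              interaction_time: datetime) -> str:
    """Combine interaction time, user intent and recognized text into a reply."""
    greeting = "Good morning" if interaction_time.hour < 12 else "Hello"
    templates = {
        "weather_query": f"{greeting}! Here is the weather you asked about.",
        "play_music": f"{greeting}! Playing music for you now.",
    }
    return templates.get(intent, f"{greeting}! You said: '{user_text}'.")

print(generate_interaction_text("weather_query", "what's the weather like",
                                datetime.now()))
```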
The invention can reduce the influence of the environmental noise on the voice signal and extract the clear user voice data by the scene noise reduction and the optimized framing processing, thereby improving the accuracy of voice recognition. Through semantic feature extraction and emotion feature extraction, semantic information and emotion features can be extracted from user voice data, so that a system can better understand user intention and emotion state. By converting user speech into text data and combining the user intent mining and the speech interaction time data, more contextual and personalized speech interaction text data can be generated, thereby providing a more intelligent, flexible and natural speech interaction experience. Through correcting the user voice text data and combining the user emotion feature data, the perception and response of the user emotion can be realized, and the system can perform emotional interaction with the user more humanizedly. Through user intention mining and voice interaction generation, user intention data can be obtained from user voice data, corresponding voice interaction text data is deduced and generated according to context and voice interaction time, and accurate understanding and response capability of user intention are improved.
Preferably, the user voice scene noise reduction data includes user voice high-frequency noise reduction data, user voice low-frequency noise reduction data and user voice deep noise reduction data, and step S1 specifically includes:
step S11: acquiring user voice data and scene noise data in real time;
specifically, voice data of the user are collected in real time, for example using a microphone or other audio device, while noise data in the environment, such as background noise in the scene and traffic noise, are collected simultaneously.
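A minimal sketch of the real-time capture in step S11, assuming the sounddevice package and a setup in which a second channel (or reference microphone) carries mostly scene noise; the duration and sample rate are illustrative.

```python
import sounddevice as sd

SAMPLE_RATE = 16000
recording = sd.rec(int(3 * SAMPLE_RATE), samplerate=SAMPLE_RATE,
                   channels=2, dtype="float32")
sd.wait()                                 # block until the 3-second capture ends
user_voice = recording[:, 0]              # primary (close-talk) microphone
scene_noise = recording[:, 1]             # reference/ambient channel
```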
Step S12: extracting scene noise characteristics from the scene noise data, thereby obtaining scene noise characteristic data;
specifically, for example, feature extraction is performed on the acquired scene noise data. Common scene noise characteristics include noise energy and spectral morphology. Spectral features of the scene noise may be extracted using signal processing algorithms such as the short-time Fourier transform (STFT) or other spectral analysis methods.
Step S13: noise characteristic processing is carried out according to the scene noise characteristic data, so that scene noise characteristic data are obtained;
specifically, noise characteristic processing is performed according to the scene noise characteristic data. For example, the type of the scene noise, such as high-frequency noise, low-frequency noise or mixed noise, may be determined from the spectral morphology and energy distribution of the noise.
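A minimal sketch of deciding among high-frequency, low-frequency and mixed noise from the band energy distribution; the 1 kHz split frequency and 0.7 dominance threshold are assumptions.

```python
import numpy as np

def classify_noise(noise: np.ndarray, sample_rate: int,
                   split_hz: float = 1000.0, dominance: float = 0.7) -> str:
    """Label scene noise as high_frequency, low_frequency or mixed."""
    spectrum = np.abs(np.fft.rfft(noise)) ** 2
    freqs = np.fft.rfftfreq(len(noise), d=1.0 / sample_rate)
    low = spectrum[freqs < split_hz].sum()
    high = spectrum[freqs >= split_hz].sum()
    total = low + high + 1e-12
    if high / total > dominance:
        return "high_frequency"
    if low / total > dominance:
        return "low_frequency"
    return "mixed"

label = classify_noise(np.random.randn(16000), sample_rate=16000)
```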
Step S14: when the scene noise characteristic data are determined to be high-frequency noise characteristic data, performing high-frequency noise reduction processing on the user voice data, thereby obtaining the high-frequency noise reduction data of the user voice;
specifically, for example, when the scene noise characteristic data is determined to be high-frequency noise, the user voice data may be processed using a high-frequency noise reduction algorithm. For example, a digital filter or other noise reduction algorithm may be applied to suppress the high frequency noise component, thereby obtaining high frequency noise reduction data of the user's voice.
Step S15: when the scene noise characteristic data are determined to be low-frequency noise characteristic data, performing low-frequency noise reduction processing on the user voice data, so as to obtain the user voice low-frequency noise reduction data;
specifically, for example, when the scene noise characteristic data is determined to be low-frequency noise, a low-frequency noise reduction algorithm may be employed to process the user voice data. For example, a digital filter or other noise reduction algorithm may be applied to suppress the low frequency noise component, thereby obtaining low frequency noise reduction data for the user's voice.
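A minimal sketch of the high-frequency and low-frequency noise reduction branches (steps S14 and S15) using Butterworth filters from scipy.signal; the cut-off frequencies and filter order are assumptions.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def denoise(voice: np.ndarray, sample_rate: int, noise_type: str) -> np.ndarray:
    """Low-pass for high-frequency scene noise, high-pass for low-frequency noise."""
    nyquist = sample_rate / 2
    if noise_type == "high_frequency":
        b, a = butter(4, 4000 / nyquist, btype="low")    # keep the speech band
    elif noise_type == "low_frequency":
        b, a = butter(4, 100 / nyquist, btype="high")    # remove rumble/hum
    else:
        return voice                                      # mixed noise: see step S16
    return filtfilt(b, a, voice)

cleaned = denoise(np.random.randn(16000), 16000, "high_frequency")
```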
Step S16: when the scene noise characteristic data are the mixed noise characteristic data, the noise reduction processing is carried out on the user voice data by using a preset scene depth noise reduction model, so that the user voice depth noise reduction data are obtained, wherein the scene depth noise reduction model is formed by carrying out deep neural network learning by using a voice signal and a preset environment noise set, and parameter adjustment is carried out according to the scene noise characteristic data so as to carry out noise reduction operation.
In particular, the user speech data may be processed using a pre-trained scene depth noise reduction model, for example, when the scene noise characteristic data is determined to be mixed noise. Such models are typically trained based on deep neural networks (Deep Neural Network, DNN), learning using speech signals and a set of preset ambient noise. And carrying out parameter adjustment according to the scene noise characteristic data so as to carry out noise reduction operation, and finally obtaining the deep noise reduction data of the user voice.
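A minimal sketch of a mask-estimating network standing in for the scene depth noise reduction model of step S16, written in PyTorch; the architecture, feature dimension and loss are assumptions, not the patent's model.

```python
import torch
import torch.nn as nn

class SceneDepthDenoiser(nn.Module):
    """Estimates a per-frequency mask that is applied to noisy STFT magnitudes."""

    def __init__(self, n_freq_bins: int = 257):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_freq_bins, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, n_freq_bins), nn.Sigmoid(),   # mask values in [0, 1]
        )

    def forward(self, noisy_magnitude: torch.Tensor) -> torch.Tensor:
        # noisy_magnitude: (n_frames, n_freq_bins) STFT magnitudes of noisy speech.
        return self.net(noisy_magnitude) * noisy_magnitude

# Training pairs noisy inputs (clean speech mixed with preset environment noise)
# with clean targets, matching the learning setup described above.
model = SceneDepthDenoiser()
noisy = torch.rand(100, 257)
clean_estimate = model(noisy)
loss = nn.MSELoss()(clean_estimate, torch.rand(100, 257))   # dummy target
```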
According to the invention, the voice input of the user and the noise condition in the current environment can be captured in time by acquiring the voice data of the user and the scene noise data in real time, so that an accurate data base is provided for subsequent processing. By extracting features from the scene noise data, feature information about ambient noise can be obtained, helping the system to better understand and analyze the noise characteristics of the current environment. And according to the scene noise characteristic data, performing corresponding noise characteristic processing, such as high-frequency noise reduction, low-frequency noise reduction or mixed noise reduction, on the user voice data. Therefore, the interference of the environmental noise to the voice signal can be reduced, and the quality and the identifiability of the voice signal are improved. When scene noise characteristic data are mixed noise characteristic data, the noise reduction processing is carried out on the user voice data by utilizing a preset scene depth noise reduction model, so that complex noise interference can be removed more accurately, and the definition and the understandability of voice signals are improved. According to the scene noise characteristic data, parameters of the scene depth noise reduction model are adjusted and optimized to adapt to noise reduction requirements of different scenes and noise characteristics, and noise reduction effect and voice quality are further improved.
Preferably, the user voice optimizing framing data includes first user voice framing data, second user voice framing data and third user voice framing data, and step S2 specifically includes:
step S21: when the scene noise characteristic data are high-frequency noise characteristic data, performing first framing processing on the user voice scene denoising data, so as to obtain first user voice framing data;
specifically, for example, after determining that the scene noise characteristic is high-frequency noise, the user voice scene denoising data may be subjected to a first framing process. Framing is the segmentation of a continuous speech signal into short-period frames, each frame typically containing a fixed number of speech samples. The process may utilize a window function (e.g., a hamming window) to divide the speech signal into overlapping frames and apply a Fast Fourier Transform (FFT) on each frame to obtain the spectral representation.
Specifically, for example, the signal is divided into frames of 30-40 ms.
Step S22: when the scene noise characteristic data are determined to be low-frequency noise characteristic data, performing second framing processing on the user voice scene denoising data, so as to obtain second user voice framing data;
specifically, for example, after determining that the scene noise characteristic is low-frequency noise, the user voice scene denoising data may be subjected to a second framing process. The process is similar to the first framing process, where the speech signal is divided into short-period frames, using a window function and FFT conversion. However, for low frequency noise, it may be necessary to adjust the parameter settings of the framing, such as window length and frame shift, to better accommodate the characteristics of the low frequency signal.
Specifically, for example, the signal is divided into frames of 10-20 ms.
Step S23: when the scene noise characteristic data are the mixed noise characteristic data, performing third framing processing on the user voice scene denoising data, so as to obtain third user voice framing data, wherein the framing length of the first user voice framing data is larger than that of the third user voice framing data, and the framing length of the third user voice framing data is larger than or equal to that of the second user voice framing data.
Specifically, for example, after determining that the scene noise characteristic is the mixed noise, third framing processing may be performed on the user voice scene denoising data. The process is similar to the first two steps, but the frame lengths may be different. Depending on the given scene noise characteristics, the frame length of the first user speech framing data may be longer to accommodate the processing of higher frequency components. The frame length of the third user speech framing data may be equal to or longer than the second user speech framing data to address the characteristics of the mixed noise.
Specifically, for example, the signal may be divided into frames of 15-25 ms each.
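For illustration, the noise-dependent framing of steps S21 to S23 might look like the NumPy sketch below; the concrete frame lengths chosen within the ranges above, the 50% frame shift and the helper name `frame_and_fft` are assumptions made for this example.

```python
import numpy as np

# Frame-length choices (ms) keyed by the detected noise characteristic, following
# the example ranges above; the 50% frame shift is an assumption of this sketch.
FRAME_MS = {"high_freq": 35, "low_freq": 15, "mixed": 20}

def frame_and_fft(signal, sr, noise_type):
    """Split a denoised utterance into overlapping Hamming-windowed frames
    and return the per-frame magnitude spectra."""
    frame_len = int(sr * FRAME_MS[noise_type] / 1000)
    hop = frame_len // 2                          # 50% overlap between frames
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    window = np.hamming(frame_len)
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    return np.abs(np.fft.rfft(frames, axis=1))    # shape (n_frames, frame_len // 2 + 1)

# Example: one second of 16 kHz audio in a mixed-noise scene -> ~20 ms frames.
spectra = frame_and_fft(np.random.randn(16000), 16000, "mixed")
```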
The invention divides a continuous speech signal into a series of frames, each frame containing shorter speech segments. The framing process helps to extract local features of the speech and convert the speech data into a more manageable form. Finer framing operations are performed according to the requirements of low frequency noise characteristics. Unlike the first framing process, the second framing process may employ different framing parameters to better accommodate the processing requirements of low frequency noise. The third framing process may have a more flexible framing length and parameter settings than the first two framing processes to accommodate voice data characteristics in complex noise environments. After framing, each frame may better represent features of short-time speech segments, such as audio spectrum, energy. This facilitates subsequent processing steps of semantic feature extraction and speech recognition. After framing, the speech signal in each frame is relatively short and the effect of noise in each frame is relatively small. This can alleviate the influence of noise on speech processing, improve the accuracy of subsequent processing. According to the different scene noise characteristic data, the corresponding framing processing strategy is selected, so that different types of noise can be better adapted, and better voice signal quality is provided.
Preferably, step S3 is specifically:
carrying out normalization processing on the user voice optimized framing data so as to obtain user normalized framing data;
Specifically, for example, the user voice optimized framing data may be normalized so that the data fall within a uniform range for subsequent feature extraction and processing. One common normalization method is to subtract the mean from each frame of data and then divide by the standard deviation, giving the data a mean of 0 and a standard deviation of 1. This eliminates scale differences between the data and ensures that the data have similar ranges of variation.
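A minimal sketch of that per-frame normalisation, assuming the frames are stored as rows of a NumPy array; the small epsilon guard against silent frames is an addition of this example, not part of the patent.

```python
import numpy as np

def normalize_frames(frames, eps=1e-8):
    """Zero-mean, unit-variance normalisation applied frame by frame (rows)."""
    mean = frames.mean(axis=1, keepdims=True)
    std = frames.std(axis=1, keepdims=True)
    return (frames - mean) / (std + eps)   # eps guards against silent (all-zero) frames
```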
And carrying out semantic feature extraction and emotion feature extraction on the user normalized framing data so as to obtain user voice semantic feature data and user voice emotion feature data.
Specifically, for example, semantic feature extraction and emotion feature extraction may be performed on the user normalized framing data to obtain the semantic information and emotion information of the user's speech. Semantic feature extraction: common semantic features include Mel-frequency cepstral coefficients (MFCCs), linear predictive coding (Linear Predictive Coding, LPC) and filter bank frequency responses. These feature extraction methods extract important speech features from the spectrum to represent the semantic content of speech. Emotion feature extraction: emotion features are features extracted from speech that reflect the emotional state. Common emotion features include fundamental frequency (F0), intonation features, speech rate and energy. These features can reveal emotional colors in speech, such as happiness, sadness and anger.
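As one possible realisation, the sketch below extracts MFCCs as the semantic features and pitch/energy statistics as the emotion features using librosa; the choice of exactly these features, their dimensionality and the helper name `extract_features` are assumptions, not the patent's prescribed set.

```python
import numpy as np
import librosa

def extract_features(y, sr=16000):
    """Return (semantic_features, emotion_features) for one utterance y."""
    # Semantic side: MFCCs summarise the spectral envelope carrying phonetic content.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)            # shape (13, n_frames)

    # Emotion side: pitch, energy, and a crude voicing-ratio proxy for speech rate.
    f0, voiced, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                                 fmax=librosa.note_to_hz("C7"), sr=sr)
    energy = librosa.feature.rms(y=y)[0]
    emotion = np.array([
        np.nanmean(f0), np.nanstd(f0),    # pitch level and variability
        energy.mean(), energy.std(),      # loudness statistics
        float(np.mean(voiced)),           # fraction of voiced frames
    ])
    return mfcc, emotion
```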
In the invention, normalization processing is carried out on user voice optimization framing data. The normalization can unify the numerical ranges among different frames and eliminate the amplitude difference among different frames. This helps to improve data comparability and stability in the subsequent feature extraction process. Semantic features are extracted from the normalized framing data. Semantic features refer to information contained in speech that is related to semantic content, such as the pitch, pace, prosody of the speaker. By extracting semantic features, the semantic content of the voice can be captured better, and an important clue is provided for subsequent tasks of voice recognition and intention understanding. And extracting emotion characteristics from the normalized framing data. Emotional characteristics refer to emotional states expressed in speech, such as happiness, anger, sadness. By extracting the emotion characteristics, the emotion tendency and emotion state of the user can be known, and a basis is provided for emotion analysis and emotion recognition application.
Preferably, step S4 is specifically:
step S41: recognizing the user voice semantic feature data by using a preset user voice recognition model so as to acquire user voice text data;
specifically, the speech semantic feature data of the user is identified, for example, using a speech recognition model that has been trained, such as an end-to-end speech recognition model based on deep learning. This model may convert the speech input into a corresponding text output, resulting in speech-text data for the user.
Step S42: carrying out emotion label identification and scoring processing on the user voice emotion feature data so as to obtain user voice emotion label data and user voice emotion scoring data;
Specifically, for example, an emotion analysis algorithm is applied to the emotion feature data of the user's speech to identify emotion labels such as "happy", "sad" or "angry". At the same time, the degree of emotion in the user's speech is scored to represent the strength or intensity of the emotion. In this way, the emotion label data and the emotion score data of the user's speech are obtained.
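Purely as a hypothetical stand-in for such an emotion analysis algorithm, the sketch below trains a logistic-regression classifier on random placeholder data and uses the winning class probability as the emotion score; the labels, the feature layout and the training data are all invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

LABELS = ["happy", "sad", "angry"]

# Placeholder training data: rows are emotion feature vectors (e.g. the 5-dim
# vectors from extract_features above); targets are label indices.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 5))
y_train = rng.integers(0, len(LABELS), size=300)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

def label_and_score(emotion_features):
    """Return (emotion label, emotion score in [0, 1]) for one utterance."""
    probs = clf.predict_proba(emotion_features.reshape(1, -1))[0]
    idx = int(np.argmax(probs))
    return LABELS[idx], float(probs[idx])   # class probability used as the score
```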
Step S43: carrying out emotion correction on the user voice text data according to the user voice emotion label data and the user voice emotion scoring data, so as to obtain user emotion voice correction data;
specifically, emotion correction can be performed on the voice text data of the user, for example, according to the emotion label data and emotion score data of the user. For example, for a particular emotion label and score, the vocabulary, mood, or grammar in the phonetic text may be adjusted to better match the emotion content expressed by the user.
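A toy, rule-based illustration of such a correction is sketched below; the marker table and the 0.6 threshold are invented for this example, and a real system would adjust vocabulary, mood and grammar far more carefully.

```python
# Hypothetical mood markers appended when the emotion score passes a threshold.
EMOTION_MARKERS = {"happy": "!", "sad": "...", "angry": "!!"}

def correct_text(text, label, score, threshold=0.6):
    """Lightly adjust recognised text so its tone matches the detected emotion."""
    if score >= threshold and label in EMOTION_MARKERS:
        return text.rstrip(".!?") + EMOTION_MARKERS[label]
    return text

print(correct_text("I passed the exam", "happy", 0.9))   # -> "I passed the exam!"
```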
Step S44: generating a user emotion voice book according to the user emotion voice correction data and the user voice emotion feature data, so that user emotion voice text data;
Specifically, for example, using the user emotion voice correction data and the voice emotion feature data, voice text data conforming to the emotion of the user can be generated. This process may include adjustments to the pitch, speed and emotional color of the synthesized speech to produce phonetic text consistent with the user's emotion.
The scoring processing is performed through a user voice emotion scoring calculation formula, wherein the user voice emotion scoring calculation formula specifically comprises:
S is the user voice emotion scoring data, e is the natural exponential term, the feature term shown in the formula is the user voice emotion feature data, β is the user voice emotion label data weight term, f(t) is the user voice emotion change function, t is time data, u is the user voice emotion stability term, r is the user voice emotion adjustment term, d is the scaling index, k is the user voice emotion constant term, and q is the user voice emotion random term.
The invention constructs a calculation formula for the user voice emotion score, which converts the voice emotion feature data of the user into emotion score data S. This allows the intensity or tendency of the emotion expressed in the user's speech to be quantified and measured. The parameter β in the formula is used to adjust the weight of the emotion label data, that is, to adjust the importance of different emotion labels. By adjusting the value of β, the contribution of different emotion labels to the score can be weighted flexibly to meet the needs of the actual emotion analysis. f(t) in the formula represents the change function of the user's voice emotion, where t is time data. The emotion change function describes how the user's emotion evolves over time; through the shape and parameter adjustment of this function, the trend of the emotion in different time periods can be captured, further refining the accuracy of the emotion score. The parameters u and r in the formula represent the stability and adjustment terms of the emotion, respectively. The emotion stability term u reflects how stable the user's emotion is, taking the persistence and consistency of the emotion into account. The emotion adjustment term r is used to correct deviations in the emotion score or to adjust the flexibility of the score; by adjusting the values of u and r, the emotion score can be fine-tuned to meet the requirements of specific situations. The parameters d, k and q in the formula represent the scaling index, the constant term and the random term, respectively, and allow the emotion score to be scaled, offset and randomly adjusted, so that the score results can be further corrected and personalized, increasing the variability and richness of the scores. The user voice emotion feature data is thus converted into emotion score data, and by adjusting the values of the parameters, different aspects of the emotion features can be weighted and adjusted to obtain a more accurate and fine-grained emotion score. Such emotion score data provides a basis for emotion recognition and understanding in voice interaction, supporting and guiding the system's emotional interactions and responses.
In the invention, the preset user voice recognition model is utilized to recognize the user voice semantic feature data, so that the voice is converted into text data. This allows the system to understand the user's voice input and perform subsequent semantic processing and intent analysis. And carrying out emotion label identification and scoring processing on the user voice emotion characteristic data. This allows the emotional state expressed in the user's speech to be identified and scored quantitatively. Through emotion labels and scoring, the system can better understand the emotion tendencies and emotional states of the user. And carrying out emotion correction on the user voice text data according to the user voice emotion label data and the user voice emotion scoring data. This means that the system can adjust the generated phonetic text to better match the user's emotional state based on the user's emotional characteristics. And generating voice text data according with the emotion state of the user according to the emotion voice correction data of the user and the emotion characteristic data of the user. This enables the system to produce a voice response consistent with the user's emotion, enhancing the emotion resonance and interaction effects of the user interaction.
Preferably, the construction step of the user voice recognition model specifically comprises the following steps:
Step S411: standard user voice data and corresponding meta user voice identification data are acquired;
specifically, for example, representative standard user speech data is collected, including variations in pronunciation, speech speed, intonation of different people, and each speech sample is annotated with corresponding identification data, such as corresponding text transcription or voice instructions.
Step S412: preprocessing standard user voice data so as to obtain standard user voice preprocessing data;
specifically, for example, standard user speech data is preprocessed, including noise reduction of the audio signal, removal of irrelevant background sounds, audio gain adjustment, to improve the quality and clarity of the speech signal.
Step S413: extracting features of the standard user voice preprocessing data so as to obtain standard user voice feature data;
specifically, standard user speech pre-processing data is converted into a set of feature vectors, for example, using signal processing and audio feature extraction techniques such as Short Time Fourier Transform (STFT), mel Frequency Cepstral Coefficient (MFCC), to capture the spectrum, sound intensity and time domain features of the speech.
Step S414: performing convolution calculation according to the standard user voice characteristic data so as to obtain standard user voice convolution layer data;
Specifically, the standard user speech feature data is convolved, for example, using a Convolutional Neural Network (CNN) structure, to extract spatial correlation and local patterns of speech features.
Step S415: carrying out pooling calculation on the standard user voice convolution layer data so as to obtain standard user voice pooling layer data;
in particular, standard user speech convolution layer data is downsampled, for example, by a pooling operation, such as maximum pooling or average pooling, to reduce feature dimensionality and computational complexity, and to preserve important feature information.
Step S416: carrying out full connection calculation on the standard user voice pooling layer data so as to obtain standard user voice full connection data;
specifically, for example, standard user speech pooling layer data is input into a fully connected neural network (FCN) for fully connected computation to learn and extract higher level speech feature representations.
Step S417: and identifying the standard user voice full-connection data by using the meta user voice identification data so as to construct a user voice recognition model.
Specifically, for example, using the meta user voice identification data as the supervisory signal, the standard user voice full-connection data is trained, and a classification algorithm (such as softmax) or a sequence modeling algorithm (such as a recurrent neural network or connectionist temporal classification, CTC) is used to construct the user voice recognition model, realizing the mapping from voice features to the corresponding text identifications.
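The following PyTorch sketch mirrors the convolution (S414), pooling (S415), full-connection (S416) and classification (S417) stages as one compact model; the channel counts, the input size (13 MFCCs × 100 frames) and the treatment of recognition as utterance-level classification are assumptions of this example — the recurrent/CTC variants mentioned above are not shown.

```python
import torch
import torch.nn as nn

class UserSpeechRecognizer(nn.Module):
    """Conv -> pool -> fully connected -> class scores, mirroring steps S414-S417."""
    def __init__(self, n_classes: int, n_mfcc: int = 13, n_frames: int = 100):
        super().__init__()
        self.conv = nn.Sequential(                       # S414: convolution layers
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                             # S415: pooling layers
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        flat = 32 * (n_mfcc // 4) * (n_frames // 4)      # spatial size after two 2x pools
        self.fc = nn.Sequential(                         # S416: fully connected layers
            nn.Flatten(), nn.Linear(flat, 128), nn.ReLU(),
            nn.Linear(128, n_classes),                   # S417: scores (softmax via the loss)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, n_mfcc, n_frames) MFCC "images" from the feature step S413
        return self.fc(self.conv(x))

model = UserSpeechRecognizer(n_classes=50)
logits = model(torch.randn(4, 1, 13, 100))               # -> (4, 50)
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 50, (4,)))
```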
In the present invention, through step S411, standard user voice data and corresponding meta user voice identification data are obtained. These data are critical to building accurate and reliable speech recognition models because they provide a true speech sample and corresponding label data for training and evaluating the model. Standard user speech data is preprocessed to prepare for feature extraction and model training, via step S412. The preprocessing step may include operations to remove noise, speech signal enhancement, audio segmentation to enhance the effectiveness of subsequent steps. The voice features are extracted from the standard user voice pre-processing data, via step S413. The speech features may be vocal spectral features, mel-spectral features, cepstral coefficients, which are capable of capturing important information of the speech signal and providing useful input to the model. The standard user voice feature data is convolved and pooled by steps S414 and S415. These computational operations help extract local and global structural information of the features and reduce the dimensionality of the data, thereby reducing the complexity and computational effort of the model. Full connection calculation is performed on the standard user voice pooling layer data, via step S416. The fully connected layer is able to learn higher level feature representations and provide decision making capability for classification and recognition tasks of the model. Through step S417, standard user voice full connection data is identified by using meta user voice identification data, thereby constructing a user voice recognition model. The model can be used for mapping voice characteristic data of a user to corresponding identifications or texts so as to realize the function of voice recognition.
Preferably, step S5 is specifically:
text matching is carried out according to the user emotion voice text data through a preset user intention rule set, so that user intention data are obtained;
Specifically, for example, a set of user intention rules is predefined, covering common intention categories, keywords, phrases or patterns, such as rules for "query weather", "book air ticket" or "play music". A matching algorithm is then built: based on text matching algorithms such as string matching, regular expressions or pattern matching, combined with the user intention rule set, the user emotion voice text data is matched. Different matching algorithms can be selected as required.
The text matching method specifically comprises the following steps:
step S51: performing first text matching according to user emotion voice text data and a preset user intention rule set, so as to obtain first user intention data;
specifically, for example, using a string matching algorithm, such as regular expression, keyword matching, the user emotion voice text data is matched with a preset user intention rule set, and information related to the user intention, such as requirements, problems or instructions of the user, is extracted therefrom.
Step S52: performing second text matching according to the user emotion voice text data and a preset user intention rule set so as to obtain second user intention data, wherein the first text matching and the second text matching are different character string matching algorithms, and the primary user intention data comprise first user intention data and second user intention data;
Specifically, for example, another string matching algorithm, such as fuzzy matching or edit distance, is used to match the user emotion voice text data against the preset user intention rule set, so as to obtain more information related to the user's intention and further enrich the user intention data.
Step S53: when the primary user intention data is primary user intention data which does not contain user intention data, acquiring scene audio data, and mining scene intention according to the scene audio data so as to acquire the user intention data;
specifically, when primary user intention data cannot be obtained from user emotion voice text data, for example, scene audio data (such as environmental sounds and dialogue recordings) can be collected, scene intention mining algorithms (such as clustering, classification and neural networks) can be applied to analyze the scene, and user intention data can be extracted from the scene.
Step S54: when the primary user intention data is determined to be primary user intention data containing only one user intention data, the primary user intention data is determined to be user intention data;
specifically, for example, when only one explicit user intention is included in the primary user intention data, the intention is directly used as the user intention data.
Step S55: when the primary user intention data is determined to contain at least two user intention data, confidence calculation is carried out on the primary user intention data according to the historical user intention matching data, so that primary user intention confidence data is obtained;
specifically, for example, based on historical user intent matching data, past behavior patterns and intentions of the user are analyzed, and frequency, similarity or other relevant indexes of occurrence of primary user intent data in the historical data are calculated to obtain primary user intent confidence data.
Step S56: and performing optimal confidence screening on the primary user intention data by using the primary user intention confidence data so as to acquire the user intention data.
Specifically, for example, the user intention with the highest confidence is selected as the final user intention data by screening and sorting according to the primary user intention confidence data, so as to ensure accuracy and reliability.
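The sketch below illustrates steps S51 to S56 with two string-matching passes (regex keyword matching, then difflib fuzzy matching) and a history-frequency stand-in for the confidence calculation; the rule set, the cutoff value and the helper names are assumptions for this example only.

```python
import re
import difflib
from collections import Counter

# Hypothetical user intention rule set (intention -> trigger keywords).
INTENT_RULES = {
    "query_weather": ["weather", "temperature", "rain"],
    "book_flight":   ["flight", "ticket", "fly"],
    "play_music":    ["play", "music", "song"],
}

def match_intents(text, fuzzy_cutoff=0.8):
    """Two-pass matching: exact keyword/regex first (S51), fuzzy second (S52)."""
    text = text.lower()
    hits = {intent for intent, kws in INTENT_RULES.items()
            if any(re.search(rf"\b{re.escape(k)}\b", text) for k in kws)}
    words = text.split()
    for intent, kws in INTENT_RULES.items():             # fuzzy pass
        if any(difflib.get_close_matches(k, words, n=1, cutoff=fuzzy_cutoff) for k in kws):
            hits.add(intent)
    return sorted(hits)

def pick_intent(candidates, history):
    """Steps S53-S56: resolve the candidate set into one user intention."""
    if not candidates:
        return None                      # S53: would fall back to scene-audio mining
    if len(candidates) == 1:
        return candidates[0]             # S54: single unambiguous intention
    freq = Counter(history)              # S55: confidence ~ historical match frequency
    return max(candidates, key=lambda c: freq[c])   # S56: best-confidence screening

intents = match_intents("what is the weather and can you play a song")
print(pick_intent(intents, history=["play_music", "query_weather", "query_weather"]))
```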
According to the method, through step S5, text matching is carried out according to the user emotion voice text data through a preset user intention rule set, so that user intention data are obtained. This can help the system understand the intent and needs of the user and provide guidance and corresponding response for subsequent voice interaction processes. Through steps S51 and S52, text matching is performed using different character string matching algorithms, thereby obtaining first user intention data and second user intention data. By adopting different matching algorithms, the accuracy and coverage range of matching can be improved, and the understanding and recognition capability of the system to the intention of the user can be enhanced. The primary user intention data is processed through steps S53 and S54. When the primary user intention data contains only one user intention, directly determining as user intention data; further processing is required when the primary user intent data does not contain a user intent or contains multiple user intents. When the primary user intention data does not contain a user intention, scene intention mining is performed using the scene audio data, through step S53. This helps to infer the user's intent from the context information of the scene, thereby capturing the user's needs more accurately. Through step S55, when the primary user intention data contains a plurality of user intents, confidence calculation is performed on the primary user intention data according to the historical user intention matching data. This may infer the reliability and accuracy of the primary user intent from the historical matching results and patterns. Through step S56, the primary user intention data is optimally confidence filtered using the primary user intention confidence data, thereby obtaining user intention data. The intention with lower confidence can be filtered, and the accuracy and reliability of the system on the intention of the user are improved.
Preferably, the confidence calculation is performed by a user intention confidence calculation formula, wherein the user intention confidence calculation formula is specifically:
C is the primary user intention confidence data, x is the primary user intention data, n is the number of primary user intention data, i is the ordinal term, o is a base constant term, a_i is the historical primary user intention matching degree term, x_i is the i-th primary user intention data, b is the recent historical user intention fitness term, c is the user scene adjustment term, the term shown in the formula is the correction error term, f is the fuzzy error term, g is a base constant term, and h is a random term.
The invention constructs a user intention confidence calculation formula for computing the confidence of the primary user intention data, that is, for measuring how much the system trusts the user's intention. Through this calculation, the primary user intention data can be mapped to a confidence value used to judge the reliability and accuracy of the intention. Different parameters play different roles in the formula. For example, the historical primary user intention matching degree term a_i and the primary user intention data term x_i reflect the degree of matching between the historical matching results and the current intention; the recent historical user intention fitness term b reflects how well the user's most recent intentions fit; the user scene adjustment term c adjusts the confidence according to the current scene; the correction error term and the fuzzy error term f account for correction errors and fuzziness. The confidence data of the primary user intention is calculated through the interaction of these terms. This facilitates the system's understanding and inference of user intention and provides more reliable and accurate intention recognition results. Meanwhile, the settings of the different parameters can be flexibly adjusted according to specific conditions to adapt to different application scenes and user requirements.
Preferably, step S6 is specifically:
step S61: according to the voice interaction time data, time sequence labeling and association are carried out on the user intention data and the user voice text data, so that the time sequence data of the user intention voice text are obtained;
specifically, the user intent data and the user voice text data are correspondingly associated, for example, according to a time stamp or time sequence of the voice interaction, ensuring that they are annotated in time sequence, and forming the user intent voice text time sequence data.
Step S62: generating voice interaction form data according to the user intention data;
specifically, corresponding voice interaction form data, such as synthesized voice, prompt tone and reply information, are generated according to the user intention data, so as to meet the interaction requirement between the user and the system.
Step S63: performing recognition calculation on the user intention voice text time sequence data by using a preset artificial intelligent voice interaction generation model so as to acquire voice interaction text data;
specifically, for example, a pre-trained artificial intelligent voice interaction generation model is used, recognition calculation is performed on the user intention voice text time sequence data input model, voice interaction is converted into a text form, and voice interaction text data are obtained.
Step S64: and performing voice interaction operation according to the voice interaction form data and the voice interaction text data.
Specifically, corresponding voice interaction tasks, such as reply of a voice assistant and indication of voice navigation, are performed according to the voice interaction form data and the voice interaction text data, so as to meet the requirements of users.
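Finally, a toy sketch of steps S61, S62 and S64: time-stamped association of intention and recognised text, followed by a templated response that stands in for the trained generation model of step S63; the `Turn` dataclass and the response table are illustrative assumptions only.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    """S61: one time-aligned pairing of user intention and recognised text."""
    timestamp: float
    intent: str
    text: str

# Hypothetical response templates standing in for the trained generation model (S63).
RESPONSES = {
    "query_weather": "Here is the weather you asked about.",
    "play_music":    "Playing some music for you now.",
}

def build_timeline(events):
    """Associate intention and text by interaction time and keep them ordered (S61)."""
    return sorted((Turn(t, i, x) for t, i, x in events), key=lambda tr: tr.timestamp)

def respond(turn):
    """S62/S64: choose the voice interaction form for the most recent user turn."""
    return RESPONSES.get(turn.intent, "Sorry, could you rephrase that?")

timeline = build_timeline([(3.2, "play_music", "play a song"),
                           (1.1, "query_weather", "what's the weather")])
print(respond(timeline[-1]))   # respond to the latest turn in the interaction
```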
In the invention, through step S61, the user intention data and the user voice text data are marked and associated in time sequence, so that the user intention voice text time sequence data is obtained. Such labeling and association processes can help the system understand and analyze the user's intent and voice content in voice interactions and establish correspondence between the two. In step S62, voice interactive form data is generated from the user intention data. This may enable the system to talk and interact with the user in a voice interaction manner, enhancing the user experience and naturalness of the interaction. Through step S63, recognition calculation is performed on the user intention voice text time sequence data by using the preset artificial intelligence voice interaction generation model, so as to obtain voice interaction text data. This allows the system to automatically recognize and understand the user's voice input and convert it into text form for subsequent processing and response. Step S64 performs a voice interaction operation using the voice interaction form data and the voice interaction text data. This means that the system can respond to the user's intention and voice input, perform specific tasks or provide relevant information, thereby achieving meaningful voice interaction with the user.
Preferably, a voice interaction system based on artificial intelligence comprises:
user voice data acquisition module: the method comprises the steps of acquiring user voice data, and performing scene noise reduction on the user voice data so as to acquire user voice scene noise reduction data;
user voice optimization framing module: the method comprises the steps of carrying out optimized framing treatment on user voice scene denoising data so as to obtain user voice optimized framing data;
the user voice feature extraction module: the method comprises the steps of carrying out semantic feature extraction and emotion feature extraction on user voice optimization framing data so as to obtain user voice semantic feature data and user voice emotion feature data;
the user emotion voice text processing module: the method comprises the steps of identifying user voice semantic feature data by using a preset user voice identification model so as to obtain user voice text data, and correcting the user voice text data by using user voice emotion feature data so as to obtain user emotion voice text data;
user intention mining module: the method comprises the steps of carrying out user intention mining according to user emotion voice text data so as to obtain user intention data;
the voice interaction text processing module: the voice interaction method comprises the steps of acquiring voice interaction time data corresponding to user voice data, and performing voice interaction generation according to the voice interaction time data, user intention data and user voice text data, so that voice interaction text data are acquired, and voice interaction operation is performed.
The invention has the beneficial effects that: through scene noise reduction processing, the interference of environmental noise on the voice signal can be reduced, and the accuracy and effect of the subsequent processing steps are improved. Framing the user voice scene denoising data yields the user voice optimized framing data; the advantage of this is that segmenting the continuous speech signal into short-time frames allows subsequent feature extraction and speech recognition algorithms to analyze and process the speech at a finer granularity. Semantic feature extraction and emotion feature extraction are performed on the user voice optimized framing data to obtain the semantic information and emotion information of the user's speech. These features are important for the subsequent tasks of speech recognition, intention mining and emotion analysis, and can provide deeper and more accurate speech understanding and emotion recognition. The preset user voice recognition model is used to recognize the user voice semantic feature data and obtain the user voice text data. At the same time, the text data is corrected according to the user's emotion feature data to obtain the user emotion voice text data. This step improves the accuracy of speech recognition and makes more reasonable corrections to the recognition result according to the user's emotional characteristics, so that the final voice text better matches the user's expression and emotion. User intention mining is performed on the user emotion voice text data to obtain the user intention data. By combining the user's voice input with the emotion information, the user's intention can be understood more accurately and more precise, personalized voice interaction services can be provided. Voice interaction generation is performed according to the user intention data, the voice interaction time data and the user voice text data to obtain voice interaction text data. This step converts the user's intention and voice input into interactable voice text, enabling meaningful voice interaction with the user. By considering the interaction time, the user's intention and the voice content, this step can generate more natural and fluent personalized voice interaction text, improving the user experience and the interaction effect.
The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.
The foregoing is only a specific embodiment of the invention to enable those skilled in the art to understand or practice the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. The voice interaction method based on the artificial intelligence is characterized by comprising the following steps of:
step S1: acquiring user voice data, and performing scene noise reduction on the user voice data so as to acquire user voice scene noise reduction data;
step S2: performing optimized framing treatment on the denoising data of the user voice scene, thereby obtaining user voice optimized framing data;
Step S3: carrying out semantic feature extraction and emotion feature extraction on the user voice optimization framing data so as to obtain user voice semantic feature data and user voice emotion feature data;
step S4: identifying the user voice semantic feature data by using a preset user voice identification model so as to obtain user voice text data, and correcting the user voice text data by using the user voice emotion feature data so as to obtain user emotion voice text data;
step S5: mining user intention according to the user emotion voice text data, so as to acquire user intention data;
step S6: and acquiring voice interaction time data corresponding to the user voice data, and performing voice interaction generation according to the voice interaction time data, the user intention data and the user voice text data, so as to acquire voice interaction text data for performing voice interaction operation.
2. The method according to claim 1, wherein the user speech scene noise reduction data includes user speech high frequency noise reduction data, user speech low frequency noise reduction data, and user speech deep noise reduction data, and step S1 is specifically:
acquiring user voice data and scene noise data in real time;
Extracting scene noise characteristics from the scene noise data, thereby obtaining scene noise characteristic data;
noise characteristic processing is carried out according to the scene noise characteristic data, so that scene noise characteristic data are obtained;
when the scene noise characteristic data are determined to be high-frequency noise characteristic data, performing high-frequency noise reduction processing on the user voice data, thereby obtaining the high-frequency noise reduction data of the user voice;
when the scene noise characteristic data are determined to be low-frequency noise characteristic data, performing low-frequency noise reduction processing on the user voice data, so as to obtain the user voice low-frequency noise reduction data;
when the scene noise characteristic data are the mixed noise characteristic data, the noise reduction processing is carried out on the user voice data by using a preset scene depth noise reduction model, so that the user voice depth noise reduction data are obtained, wherein the scene depth noise reduction model is formed by carrying out deep neural network learning by using a voice signal and a preset environment noise set, and parameter adjustment is carried out according to the scene noise characteristic data so as to carry out noise reduction operation.
3. The method according to claim 2, wherein the user speech optimized framing data comprises first user speech framing data, second user speech framing data and third user speech framing data, and step S2 is specifically:
When the scene noise characteristic data are high-frequency noise characteristic data, performing first framing processing on the user voice scene denoising data, so as to obtain first user voice framing data;
when the scene noise characteristic data are determined to be low-frequency noise characteristic data, performing second framing processing on the user voice scene denoising data, so as to obtain second user voice framing data;
when the scene noise characteristic data are the mixed noise characteristic data, performing third framing processing on the user voice scene denoising data, so as to obtain third user voice framing data, wherein the framing length of the first user voice framing data is larger than that of the third user voice framing data, and the framing length of the third user voice framing data is larger than or equal to that of the second user voice framing data.
4. The method according to claim 1, wherein step S3 is specifically:
carrying out normalization processing on the user voice optimized framing data so as to obtain user normalized framing data;
and carrying out semantic feature extraction and emotion feature extraction on the user normalized framing data so as to obtain user voice semantic feature data and user voice emotion feature data.
5. The method according to claim 1, wherein step S4 is specifically:
recognizing the user voice semantic feature data by using a preset user voice recognition model so as to acquire user voice text data;
carrying out emotion label identification and scoring processing on the user voice emotion feature data so as to obtain user voice emotion label data and user voice emotion scoring data;
carrying out emotion correction on the user voice text data according to the user voice emotion label data and the user voice emotion scoring data, so as to obtain user emotion voice correction data;
generating user emotion voice text according to the user emotion voice correction data and the user voice emotion feature data, so as to obtain user emotion voice text data;
the scoring processing is performed through a user voice emotion scoring calculation formula, wherein the user voice emotion scoring calculation formula specifically comprises:
wherein S is the user voice emotion scoring data, e is the natural exponential term, the feature term shown in the formula is the user voice emotion feature data, β is the user voice emotion label data weight term, f(t) is the user voice emotion change function, t is time data, u is the user voice emotion stability term, r is the user voice emotion adjustment term, d is the scaling index, k is the user voice emotion constant term, and q is the user voice emotion random term.
6. The method according to claim 5, wherein the step of constructing the user speech recognition model is specifically:
standard user voice data and corresponding meta user voice identification data are acquired;
preprocessing standard user voice data so as to obtain standard user voice preprocessing data;
extracting features of the standard user voice preprocessing data so as to obtain standard user voice feature data;
performing convolution calculation according to the standard user voice characteristic data so as to obtain standard user voice convolution layer data;
carrying out pooling calculation on the standard user voice convolution layer data so as to obtain standard user voice pooling layer data;
carrying out full connection calculation on the standard user voice pooling layer data so as to obtain standard user voice full connection data;
and identifying the standard user voice full-connection data by using the meta user voice identification data so as to construct a user voice recognition model.
7. The method according to claim 1, wherein step S5 is specifically:
text matching is carried out according to the user emotion voice text data through a preset user intention rule set, so that user intention data are obtained;
The text matching method specifically comprises the following steps:
performing first text matching according to user emotion voice text data and a preset user intention rule set, so as to obtain first user intention data;
performing second text matching according to the user emotion voice text data and a preset user intention rule set so as to obtain second user intention data, wherein the first text matching and the second text matching are different character string matching algorithms, and the primary user intention data comprise first user intention data and second user intention data;
when the primary user intention data is primary user intention data which does not contain user intention data, acquiring scene audio data, and mining scene intention according to the scene audio data so as to acquire the user intention data;
when the primary user intention data is determined to be primary user intention data containing only one user intention data, the primary user intention data is determined to be user intention data;
when the primary user intention data is determined to contain at least two user intention data, confidence calculation is carried out on the primary user intention data according to the historical user intention matching data, so that primary user intention confidence data is obtained;
And performing optimal confidence screening on the primary user intention data by using the primary user intention confidence data so as to acquire the user intention data.
8. The method of claim 7, wherein the confidence level calculation is performed by a user intent confidence level calculation formula, wherein the user intent confidence level calculation formula is specifically:
wherein C is the primary user intention confidence data, x is the primary user intention data, n is the number of primary user intention data, i is the ordinal term, o is a base constant term, a_i is the historical primary user intention matching degree term, x_i is the i-th primary user intention data, b is the recent historical user intention fitness term, c is the user scene adjustment term, the term shown in the formula is the correction error term, f is the fuzzy error term, g is a base constant term, and h is a random term.
9. The method according to claim 1, wherein step S6 is specifically:
according to the voice interaction time data, time sequence labeling and association are carried out on the user intention data and the user voice text data, so that the time sequence data of the user intention voice text are obtained;
generating voice interaction form data according to the user intention data;
performing recognition calculation on the user intention voice text time sequence data by using a preset artificial intelligent voice interaction generation model so as to acquire voice interaction text data;
And performing voice interaction operation according to the voice interaction form data and the voice interaction text data.
10. A speech interaction system based on artificial intelligence, comprising:
user voice data acquisition module: the method comprises the steps of acquiring user voice data, and performing scene noise reduction on the user voice data so as to acquire user voice scene noise reduction data;
user voice optimization framing module: the method comprises the steps of carrying out optimized framing treatment on user voice scene denoising data so as to obtain user voice optimized framing data;
the user voice feature extraction module: the method comprises the steps of carrying out semantic feature extraction and emotion feature extraction on user voice optimization framing data so as to obtain user voice semantic feature data and user voice emotion feature data;
the user emotion voice text processing module: the method comprises the steps of identifying user voice semantic feature data by using a preset user voice identification model so as to obtain user voice text data, and correcting the user voice text data by using user voice emotion feature data so as to obtain user emotion voice text data;
user intention mining module: the method comprises the steps of carrying out user intention mining according to user emotion voice text data so as to obtain user intention data;
The voice interaction text processing module: the voice interaction method comprises the steps of acquiring voice interaction time data corresponding to user voice data, and performing voice interaction generation according to the voice interaction time data, user intention data and user voice text data, so that voice interaction text data are acquired, and voice interaction operation is performed.
CN202310892002.5A 2023-07-19 2023-07-19 Voice interaction method and system based on artificial intelligence Pending CN116665669A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310892002.5A CN116665669A (en) 2023-07-19 2023-07-19 Voice interaction method and system based on artificial intelligence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310892002.5A CN116665669A (en) 2023-07-19 2023-07-19 Voice interaction method and system based on artificial intelligence

Publications (1)

Publication Number Publication Date
CN116665669A true CN116665669A (en) 2023-08-29

Family

ID=87715464

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310892002.5A Pending CN116665669A (en) 2023-07-19 2023-07-19 Voice interaction method and system based on artificial intelligence

Country Status (1)

Country Link
CN (1) CN116665669A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117219046A (en) * 2023-10-24 2023-12-12 广东广信通信服务有限公司 Interactive voice emotion control method and system
CN117219058A (en) * 2023-11-09 2023-12-12 广州云趣信息科技有限公司 Method, system and medium for improving speech recognition accuracy
CN117219058B (en) * 2023-11-09 2024-02-06 广州云趣信息科技有限公司 Method, system and medium for improving speech recognition accuracy
CN117238322A (en) * 2023-11-10 2023-12-15 深圳市齐奥通信技术有限公司 Self-adaptive voice regulation and control method and system based on intelligent perception
CN117238322B (en) * 2023-11-10 2024-01-30 深圳市齐奥通信技术有限公司 Self-adaptive voice regulation and control method and system based on intelligent perception
CN117932005A (en) * 2024-03-21 2024-04-26 成都市技师学院(成都工贸职业技术学院、成都市高级技工学校、成都铁路工程学校) Voice interaction method based on artificial intelligence


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination