CN117555916A - Voice interaction method and system based on natural language processing

Voice interaction method and system based on natural language processing

Info

Publication number
CN117555916A
CN117555916A (application CN202311467616.5A)
Authority
CN
China
Prior art keywords
natural language
voice
text
model
information
Prior art date
Legal status
Pending
Application number
CN202311467616.5A
Other languages
Chinese (zh)
Inventor
皇甫汉聪
王永才
关兆雄
林浩
李沐栩
宋才华
杜家兵
刘胜强
Current Assignee
Foshan Power Supply Bureau of Guangdong Power Grid Corp
Original Assignee
Foshan Power Supply Bureau of Guangdong Power Grid Corp
Priority date
Filing date
Publication date
Application filed by Foshan Power Supply Bureau of Guangdong Power Grid Corp
Priority to CN202311467616.5A
Publication of CN117555916A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/242Query formulation
    • G06F16/243Natural language query formulation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2452Query translation
    • G06F16/24522Translation of natural language queries to structured queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • G06N3/0442Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/086Learning methods using evolutionary algorithms, e.g. genetic algorithms or genetic programming
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/04Inference or reasoning models
    • G06N5/041Abduction
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/08Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/14Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142Hidden Markov Models [HMMs]
    • G10L15/144Training of HMMs
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/16Speech classification or search using artificial neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/18Speech classification or search using natural language modelling
    • G10L15/1822Parsing for meaning understanding
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/025Phonemes, fenemes or fenones being the recognition units
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/225Feedback of the input speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Physiology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a voice interaction method and system based on natural language processing. The method comprises the following steps: acquiring voice information input by a user and converting the voice information into natural language text information based on a voice recognition technology; performing text preprocessing on the natural language text information to obtain text-preprocessed natural language text information; converting the text-preprocessed natural language text information into a structured query language based on an NLP semantic analysis model; acquiring corresponding target data based on the structured query language; inputting the corresponding target data into an optimized natural language generation model and converting the target data into a response text based on the optimized natural language generation model; and converting the response text into response voice based on a voice synthesis technology and outputting the response voice. The invention can provide more accurate, reliable and rapid voice interaction services and improve the user experience of voice interaction.

Description

Voice interaction method and system based on natural language processing
Technical Field
The invention relates to the technical field of computers, in particular to a voice interaction method and system based on natural language processing.
Background
With the progress of artificial intelligence technology and the continuous change of people's demands in daily life, a quicker and more convenient mode of interaction is needed to facilitate daily life and work. Voice interaction has therefore developed rapidly and been widely applied. Throughout the development of voice interaction, how to accurately recognize speech and how to accurately parse its semantics have remained key problems for enterprises.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a voice interaction method and system based on natural language processing, which can not only provide more accurate and rapid voice interaction services but also improve the user experience of voice interaction.
In order to solve the technical problems, the invention provides a voice interaction method based on natural language processing, which comprises the following steps:
Acquiring voice information input by a user, and converting the voice information into natural language text information based on a voice recognition technology;
performing text preprocessing on the natural language text information to obtain the natural language text information after text preprocessing;
inputting the text-preprocessed natural language text information into an NLP semantic analysis model, and converting the text-preprocessed natural language text information into a structured query language based on the NLP semantic analysis model;
acquiring corresponding target data based on the structured query language;
inputting the corresponding target data into an optimized natural language generation model, and converting the corresponding target data into a response text based on the optimized natural language generation model;
the response text is converted into response voice based on voice synthesis technology, and the response voice is output.
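The six steps above form a single pipeline from input speech to output speech. A minimal end-to-end sketch of how such a pipeline could be wired together is shown below; every class, and the electricity-usage example, is a hypothetical placeholder for the models described in this disclosure, not a reference implementation.

```python
# Minimal sketch of the voice interaction pipeline described above.
# All component classes are hypothetical stand-ins; their bodies are
# placeholders so that only the flow of data between steps is visible.

class SpeechRecognizer:            # step 1: speech -> natural language text
    def transcribe(self, audio):   return "query electricity usage in october"

class TextPreprocessor:            # step 2: cleaning, segmentation, POS tagging
    def run(self, text):           return text.split()

class SemanticParser:              # step 3: text -> structured query language
    def to_sql(self, tokens):      return "SELECT usage FROM bills WHERE month = 10"

class DataRetriever:               # step 4: SQL -> target data
    def query(self, sql):          return {"usage": 321.5}

class ResponseGenerator:           # step 5: target data -> response text
    def generate(self, data):      return f"Your usage in October was {data['usage']} kWh."

class SpeechSynthesizer:           # step 6: response text -> response speech
    def synthesize(self, text):    return text.encode("utf-8")  # placeholder waveform

def answer(audio: bytes) -> bytes:
    text = SpeechRecognizer().transcribe(audio)
    tokens = TextPreprocessor().run(text)
    sql = SemanticParser().to_sql(tokens)
    data = DataRetriever().query(sql)
    reply = ResponseGenerator().generate(data)
    return SpeechSynthesizer().synthesize(reply)

print(answer(b"...raw audio bytes..."))
```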
Optionally, the converting the voice information into natural language text information based on the voice recognition technology includes:
preprocessing the voice information to obtain preprocessed voice information;
performing feature extraction processing on the preprocessed voice information based on a perception linear prediction algorithm to obtain a voice feature vector;
Inputting the voice feature vector into an acoustic model, and outputting a phoneme sequence based on the acoustic model;
inputting the phoneme sequence into a language text model, and outputting natural language text information based on the language text model.
Optionally, the inputting the speech feature vector into an acoustic model, outputting a phoneme sequence based on the acoustic model, includes:
constructing a GMM-HMM model as an initial acoustic model, wherein the structure of the GMM-HMM model comprises: an input layer, an acoustic state layer, a hidden layer, an observable layer, a Gaussian component layer, and an output layer;
training the initial acoustic model to obtain a trained acoustic model;
inputting the voice feature vector into a trained acoustic model, and generating a phoneme sequence by using a decoding algorithm based on the trained acoustic model.
Optionally, the performing text preprocessing on the natural language text information to obtain text preprocessed natural language text information includes:
carrying out corpus cleaning on the natural language text information to obtain corpus-cleaned natural language text information;
performing word segmentation on the corpus-cleaned natural language text information to obtain a plurality of text word segments;
and performing part-of-speech tagging on the text word segments to obtain the text-preprocessed natural language text information.
Optionally, the inputting the text-preprocessed natural language text information into an NLP semantic parsing model, converting the text-preprocessed natural language text information into the structured query language based on the NLP semantic parsing model, includes:
carrying out feature extraction processing on the natural language text information after text pretreatment to obtain corresponding feature codes;
training the NLP semantic analysis model based on the corresponding feature codes to obtain a trained NLP semantic analysis model;
and performing language conversion based on the trained NLP semantic analysis model to obtain a structured query language.
Optionally, training the NLP semantic analysis model based on the corresponding feature codes to obtain a trained NLP semantic analysis model, including:
inputting the corresponding feature codes into an NLP semantic analysis model, and processing the corresponding feature codes by using a first decoder in the NLP semantic analysis model to obtain a processing result of selecting clauses;
processing the corresponding feature codes by using a second decoder in the NLP semantic parsing model to obtain a processing result of the condition clause;
Calculating a loss function based on the processing result of the selection clause and the processing result of the condition clause;
and updating and adjusting parameters of the NLP semantic analysis model by using an optimizer based on the loss function, and optimizing the NLP semantic analysis model based on the updated and adjusted parameters to obtain a trained NLP semantic analysis model.
Optionally, the obtaining the corresponding target data based on the structured query language includes:
acquiring a target storage position by using a query engine based on the structured query language;
and acquiring corresponding target data by utilizing logic rules and a calculation engine in the structured query language based on the target storage position.
Optionally, the inputting the corresponding target data into an optimized natural language generation model and converting the corresponding target data into the response text based on the optimized natural language generation model includes:
constructing an LSTM model as an initial natural language generation model, wherein the structure of the LSTM model comprises: an input layer, a hidden layer and an output layer, wherein the hidden layer comprises a plurality of LSTM units, and each LSTM unit comprises an input gate, a forget gate, an output gate and a cell state;
Training the initial natural language generation model to obtain a trained natural language generation model;
performing structural optimization on the trained natural language generation model by using a genetic particle swarm optimization algorithm to obtain an optimized natural language generation model;
and converting the corresponding target data into response text based on the optimized natural language generation model.
Optionally, the converting the response text into response voice based on the voice synthesis technology and outputting the response voice includes:
performing text preprocessing on the response text to obtain a text preprocessed response text;
performing prosody prediction and grapheme-to-phoneme conversion on the text-preprocessed response text to obtain corresponding phonemes;
converting the corresponding phonemes into acoustic features based on a preset autoregressive acoustic model;
the acoustic features are converted to responsive speech based on a vocoder and the responsive speech is output.
In addition, the invention also provides a voice interaction system based on natural language processing, which comprises:
the voice recognition module is used for acquiring voice information input by a user and converting the voice information into natural language text information based on a voice recognition technology;
The text preprocessing module is used for carrying out text preprocessing on the natural language text information to obtain the natural language text information after text preprocessing;
the semantic analysis module is used for inputting the text-preprocessed natural language text information into an NLP semantic analysis model, and converting the text-preprocessed natural language text information into a structured query language based on the NLP semantic analysis model;
the target data acquisition module is used for acquiring corresponding target data based on the structured query language;
the natural language generation module is used for inputting the corresponding target data into the optimized natural language generation model, and converting the corresponding target data into a response text based on the optimized natural language generation model;
and the voice synthesis module is used for converting the response text into response voice based on a voice synthesis technology and outputting the response voice.
In the embodiment of the invention, the voice information input by the user is accurately recognized through the voice recognition technology and converted into natural language text information; the natural language text information is converted into a structured query language through the NLP semantic analysis model, so that the semantics in the natural language text information can be accurately parsed and represented; the target data can be accurately acquired through the logic rules and the calculation engine associated with the structured query language; the target data are processed through the structurally optimized natural language generation model, so that the response text can be generated accurately and rapidly; and the voice synthesis technology converts the response text into response voice and outputs it, completing the full voice interaction flow. A more accurate and rapid voice interaction service is thereby provided, and the user experience of voice interaction is improved.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings which are required in the description of the embodiments or the prior art will be briefly described, it being obvious that the drawings in the description below are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a natural language processing based voice interaction method in an embodiment of the invention;
fig. 2 is a schematic structural diagram of a speech interaction system based on natural language processing in an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
Referring to fig. 1, fig. 1 is a flow chart of a voice interaction method based on natural language processing in an embodiment of the invention.
As shown in fig. 1, a voice interaction method based on natural language processing, the method includes:
s11: acquiring voice information input by a user, and converting the voice information into natural language text information based on a voice recognition technology;
in the implementation process of the invention, the voice information is converted into natural language text information based on the voice recognition technology, which comprises the following steps: preprocessing the voice information to obtain preprocessed voice information; performing feature extraction processing on the preprocessed voice information based on a perception linear prediction algorithm to obtain a voice feature vector; inputting the voice feature vector into an acoustic model, and outputting a phoneme sequence based on the acoustic model; inputting the phoneme sequence into a language text model, and outputting natural language text information based on the language text model.
Further, the inputting the speech feature vector into an acoustic model, outputting a phoneme sequence based on the acoustic model, includes: constructing a GMM-HMM model as an initial acoustic model, wherein the structure of the GMM-HMM model comprises: an input layer, an acoustic state layer, an hidden layer, an observable layer, a Gao Sicheng layer, and an output layer; training the initial acoustic model to obtain a trained acoustic model; inputting the voice feature vector into a trained acoustic model, and generating a phoneme sequence by using a decoding algorithm based on the trained acoustic model.
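By way of illustration only, the following sketch shows how a GMM-HMM acoustic model of the kind described above could be fitted to speech feature vectors and used to decode a state (phoneme) sequence. It assumes the third-party hmmlearn package and randomly generated features; it is not the specific six-layer model structure of this disclosure.

```python
# Illustrative GMM-HMM acoustic-model sketch (assumes the hmmlearn package).
# Random feature vectors stand in for the perceptual-linear-prediction
# features; the decoded state sequence stands in for the phoneme sequence.
import numpy as np
from hmmlearn.hmm import GMMHMM

# Toy "PLP" feature matrix: 200 frames of 13-dimensional features.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 13))

# One hidden state per (toy) phoneme class, two Gaussian mixtures per state.
acoustic_model = GMMHMM(n_components=5, n_mix=2,
                        covariance_type="diag", n_iter=20, random_state=0)
acoustic_model.fit(features)                          # training on the features

phoneme_sequence = acoustic_model.predict(features)   # Viterbi-style decoding
print(phoneme_sequence[:20])
```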
Specifically, the voice information is first preprocessed. Noise reduction is performed by spectral subtraction: the frame length, frame shift and Fourier transform length are initialized, the number of frames is calculated from the frame length and frame shift, and a noise spectrum is estimated from those frames; each frame is Fourier transformed and the noise spectrum is subtracted from it to obtain an enhanced speech amplitude spectrum; if the enhanced amplitude spectrum contains negative values, a replacement value is calculated with an over-subtraction formula using an over-subtraction factor (which prevents noise peaks from being generated) and substituted for the negative values; after all negative values have been replaced, the signal is reconstructed from the signal phase and the amplitude spectrum and inverse transformed back to the time domain, and the noise reduction of the speech is complete once every frame has been inverse transformed. The noise-reduced voice information then undergoes analog-to-digital (AD) sampling: the time domain of the noise-reduced signal is mapped onto digitally quantized discrete points according to the signal period, a corresponding mathematical function is constructed from the discrete points, and the time-domain variation of the speech signal is described by that function to obtain a speech time-domain signal. Pre-emphasis is applied to the time-domain signal, leaving the low-frequency part unchanged and boosting the high-frequency part to compensate for the attenuation of high-frequency components during transmission. The pre-emphasized signal is framed, that is, segmented: unnecessary parts of the speech are cut off, only the useful parts are retained, and the signal is divided into a plurality of speech frames of a designated length. Because framing causes energy leakage, each speech frame is windowed by multiplying every frame signal by a window function (a function that truncates energy leakage in the signal), yielding a plurality of windowed speech frames and completing the preprocessing of the voice information.

Feature extraction is then performed on the preprocessed voice information with a perceptual linear prediction algorithm. A discrete Fourier transform converts the preprocessed speech from the time domain to the frequency domain period by period to obtain a speech spectrogram; the sum of the squares of the real and imaginary parts of the spectrogram gives a short-time power spectrum; the frequencies of the power spectrum are mapped into critical bands, which are divided so as to reflect human auditory perception; the mapped frequencies are pre-emphasized with an equal-loudness curve (a curve reflecting the relation between perceived loudness and frequency); and, to approximate the nonlinear relation between sound intensity and the loudness perceived by the human ear, an intensity-to-loudness conversion is applied, after which an inverse Fourier transform yields the speech feature vector.

A GMM-HMM model (Gaussian Mixture Model-Hidden Markov Model), which is a phoneme-level acoustic model, is constructed as the initial acoustic model. Its structure comprises an input layer, an acoustic state layer, a hidden layer, an observable layer, a Gaussian component layer and an output layer: the input layer receives the speech feature vectors; the acoustic state layer observes the distribution of their state sequences and determines their acoustic states; the hidden layer maps the speech features into a higher-dimensional feature space according to that state-sequence distribution to extract more specific features; the observable layer refines the features, through a receptive field and a recurrent layer, into features with stronger expressive power and interpretability, removing redundant information; the Gaussian component layer clusters and recognizes the speech features, and by fitting them and computing their probability distribution obtains the phonemes corresponding to the features; and the output layer outputs the phonemes. The speech feature vector is input into the trained acoustic model, and a phoneme sequence is generated with a decoding algorithm: during decoding an optimal excitation vector is searched for, the decoded value of the excitation vector is the index of a phoneme, and the phoneme sequence is assembled from the decoded values.

An n-gram model can be constructed as the initial language text model; the n-gram model is a language model for large-vocabulary continuous speech recognition that determines the final text sequence by computing the composition probability of sentences. The initial language text model is trained to obtain a trained language text model, the phoneme sequence is input into it, and natural language text information is output. The voice recognition technology thus accurately recognizes the voice information input by the user and quickly converts it into natural language text information.
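A compact sketch of the spectral-subtraction noise reduction described above is given below. It is a minimal NumPy illustration under assumed parameter values (frame length, frame shift, over-subtraction factor, and the use of the first frames as the noise estimate), not the exact formula of this disclosure.

```python
# Minimal spectral-subtraction sketch (NumPy only). Frame length, frame shift,
# the over-subtraction factor and the noise estimate are illustrative assumptions.
import numpy as np

def spectral_subtraction(signal, frame_len=256, frame_shift=128,
                         noise_frames=5, over_subtraction=2.0, floor=0.002):
    n_frames = 1 + (len(signal) - frame_len) // frame_shift
    frames = np.stack([signal[i * frame_shift: i * frame_shift + frame_len]
                       for i in range(n_frames)])
    spectra = np.fft.rfft(frames * np.hanning(frame_len), axis=1)
    magnitude, phase = np.abs(spectra), np.angle(spectra)

    # Noise spectrum estimated from the first few (assumed non-speech) frames.
    noise_mag = magnitude[:noise_frames].mean(axis=0)

    # Over-subtraction, with negative values replaced by a small spectral floor.
    enhanced = magnitude - over_subtraction * noise_mag
    enhanced = np.maximum(enhanced, floor * noise_mag)

    # Reconstruct each frame from the enhanced magnitude and the original phase,
    # then overlap-add back to the time domain.
    clean_frames = np.fft.irfft(enhanced * np.exp(1j * phase), n=frame_len, axis=1)
    out = np.zeros(len(signal))
    for i, frame in enumerate(clean_frames):
        out[i * frame_shift: i * frame_shift + frame_len] += frame
    return out

noisy = np.sin(np.linspace(0, 200 * np.pi, 8000)) + 0.3 * np.random.randn(8000)
print(spectral_subtraction(noisy).shape)
```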
S12: performing text preprocessing on the natural language text information to obtain the natural language text information after text preprocessing;
in the implementation process of the invention, the text preprocessing is performed on the natural language text information to obtain the text preprocessed natural language text information, which comprises the following steps: carrying out corpus cleaning treatment on the natural language text information to obtain natural language text information after corpus cleaning treatment; word segmentation processing is carried out on the natural language text information after the language material is cleaned, and a plurality of text word segmentation is obtained; and performing part-of-speech tagging on the text segmentation to obtain the natural language text information after text preprocessing.
Specifically, corpus cleaning is carried out on the natural language text information: regular-expression matching rules are used to match character strings that conform to the rules, and blank characters, special characters, duplicated data and stop words are removed, yielding the corpus-cleaned natural language text information. Word segmentation is then carried out on the corpus-cleaned text: the frequency with which adjacent characters co-occur is used to reflect the reliability of word formation, the co-occurrence frequency of every pair of adjacent characters in the text is counted, and when the combination frequency exceeds a threshold the character combination is considered to form a word, yielding a plurality of text word segments. Part-of-speech tagging is then performed on the text word segments as a sequence-labeling task: given a sequence of units, a tag is assigned to each unit, the probability distribution of the possible tag sequences is calculated, the optimal tag sequence is selected, the most probable part of speech is determined from it, and the words are tagged accordingly; once part-of-speech tagging is complete, the text preprocessing flow is finished. Text preprocessing makes the natural language text information cleaner, more accurate and more reliable.
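As an illustration of the cleaning and frequency-based segmentation just described, the sketch below cleans a string with a regular expression and groups adjacent characters into words when their co-occurrence count exceeds a threshold. The regular expression, stop-word list and threshold are assumptions made for the example, not values specified by this disclosure.

```python
# Illustrative corpus cleaning and frequency-based word segmentation.
# The cleaning pattern, stop words and co-occurrence threshold are assumptions.
import re
from collections import Counter

STOP_WORDS = {"the", "a", "of"}          # assumed stop-word list

def clean(text: str) -> str:
    text = re.sub(r"[^0-9A-Za-z\u4e00-\u9fff ]+", " ", text)   # drop special chars
    tokens = [t for t in text.split() if t.lower() not in STOP_WORDS]
    return " ".join(tokens)

def segment(chars: str, threshold: int = 2) -> list[str]:
    """Group adjacent characters into a 'word' when their pair count >= threshold."""
    pair_counts = Counter(chars[i:i + 2] for i in range(len(chars) - 1))
    words, current = [], chars[0] if chars else ""
    for i in range(1, len(chars)):
        if pair_counts[chars[i - 1:i + 1]] >= threshold:
            current += chars[i]          # frequent pair: extend the current word
        else:
            words.append(current)        # infrequent pair: start a new word
            current = chars[i]
    if current:
        words.append(current)
    return words

print(clean("Query the electricity usage of October!!"))
print(segment("abababcd"))   # frequent 'ab'/'ba' pairs merge into one segment
```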
S13: inputting the text-preprocessed natural language text information into an NLP semantic analysis model, and converting the text-preprocessed natural language text information into a structured query language based on the NLP semantic analysis model;
in the implementation process of the invention, the method for inputting the text pre-processed natural language text information into the NLP semantic analysis model and converting the text pre-processed natural language text information into the structured query language based on the NLP semantic analysis model comprises the following steps: carrying out feature extraction processing on the natural language text information after text pretreatment to obtain corresponding feature codes; training the NLP semantic analysis model based on the corresponding feature codes to obtain a trained NLP semantic analysis model; and performing language conversion based on the trained NLP semantic analysis model to obtain a structured query language.
Further, training the NLP semantic analysis model based on the corresponding feature codes to obtain a trained NLP semantic analysis model, including: inputting the corresponding feature codes into an NLP semantic analysis model, and processing the corresponding feature codes by using a first decoder in the NLP semantic analysis model to obtain a processing result of selecting clauses; processing the corresponding feature codes by using a second decoder in the NLP semantic parsing model to obtain a processing result of the condition clause; calculating a loss function based on the processing result of the selection clause and the processing result of the condition clause; and updating and adjusting parameters of the NLP semantic analysis model by using an optimizer based on the loss function, and optimizing the NLP semantic analysis model based on the updated and adjusted parameters to obtain a trained NLP semantic analysis model.
Specifically, feature extraction is performed on the text-preprocessed natural language text information with a term frequency-inverse document frequency algorithm: the occurrence frequency of each word and the degree to which each word distinguishes the text are counted; a hash function maps the occurrence frequencies into a hash feature vector; the hash feature vector is corrected according to each word's distinguishing degree with a fit() function (a function used to fit and correct data), giving a word-frequency feature vector; and, because the word-frequency feature vector obtained after feature extraction is high dimensional and would increase the computational complexity, it is compressed by sparse coding to obtain the corresponding feature codes. The corresponding feature codes are input into the NLP (Natural Language Processing) semantic analysis model. A first decoder in the model processes the feature codes to obtain the processing result of the selection clause: the codes pass through two fully connected layers, the first of which determines the prediction result of the selection clause and the second of which determines the probability distribution of that prediction, and the processing result of the selection clause is determined from the probability distribution. A second decoder processes the feature codes to determine the condition clause through four fully connected layers: the first determines the prediction of the corresponding condition column name, the second the prediction of the operator in the condition, the third the prediction of the number of conditions, and the fourth the correspondence among the operator, the condition column name and the number of conditions; once the correspondence is determined, the processing result of the condition clause is obtained. A loss function is then calculated from the two processing results: for the first loss, the processing result of the selection clause is taken as a training point, a nonlinear factor is introduced through an activation function, an optimal fit line is obtained from the activation function, the distance from the training point to the optimal fit line is computed, and the first loss is calculated from that distance using the mean-squared-error loss definition; the second loss is calculated from the processing result of the condition clause in the same way; and the first and second losses are weighted and summed to give the final loss function. Based on the loss function, an optimizer computes the gradients with respect to the model parameters, the parameters are updated and adjusted according to their gradients, and gradient computation and parameter updating are repeated until a convergence condition is reached; the NLP semantic analysis model is optimized with the updated parameters to obtain the trained NLP semantic analysis model. Language conversion is then performed according to the structure of the trained NLP semantic analysis model, converting the natural language text information into a structured query language; using the NLP semantic analysis model in this way allows the semantics of the natural language text information to be parsed accurately.
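The sketch below illustrates the general shape of such a two-decoder model: one head scores candidate SELECT columns, another scores WHERE operators, and the two losses are weighted and summed before an optimizer step. It is a minimal PyTorch illustration under assumed dimensions and label sets; cross-entropy losses are used here in place of the mean-squared-error construction described above purely for brevity, and this is not the model architecture claimed in this disclosure.

```python
# Minimal two-head "text-to-SQL" training-step sketch (assumes PyTorch).
# Feature codes, column/operator vocabularies and loss weights are assumptions.
import torch
import torch.nn as nn

N_COLUMNS, N_OPERATORS, FEAT_DIM = 8, 4, 64   # assumed sizes

class TwoDecoderParser(nn.Module):
    def __init__(self):
        super().__init__()
        # "first decoder": predicts the SELECT column.
        self.select_head = nn.Sequential(nn.Linear(FEAT_DIM, 32), nn.ReLU(),
                                         nn.Linear(32, N_COLUMNS))
        # "second decoder": predicts the WHERE operator (condition clause).
        self.where_head = nn.Sequential(nn.Linear(FEAT_DIM, 32), nn.ReLU(),
                                        nn.Linear(32, N_OPERATORS))

    def forward(self, feature_codes):
        return self.select_head(feature_codes), self.where_head(feature_codes)

model = TwoDecoderParser()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Toy batch: 16 feature codes with column and operator labels.
x = torch.randn(16, FEAT_DIM)
y_col = torch.randint(0, N_COLUMNS, (16,))
y_op = torch.randint(0, N_OPERATORS, (16,))

select_logits, where_logits = model(x)
loss = 0.5 * criterion(select_logits, y_col) + 0.5 * criterion(where_logits, y_op)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(float(loss))
```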
S14: acquiring corresponding target data based on the structured query language;
in the implementation process of the invention, the obtaining the corresponding target data based on the structured query language includes: acquiring a target storage position by using a query engine based on the structured query language; and acquiring corresponding target data by utilizing logic rules and a calculation engine in the structured query language based on the target storage position.
Specifically, a target storage location is obtained with a query engine according to the structured query language: the selection statement of the structured query language can query the field names and thereby determine where the target is stored. The corresponding target data are then acquired from the target storage location using the logic rules and the calculation engine of the structured query language: the conditional clauses of the structured query language limit the rows and columns according to the target storage location, and the corresponding target data are obtained according to the resulting correspondence. The logic rules and calculation engine contained in the structured query language allow the target data to be acquired more accurately and rapidly.
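By way of illustration, the sketch below builds a small in-memory database and executes a structured query of the kind the semantic analysis model would produce; the table name, columns and query text are hypothetical examples, and sqlite3 simply stands in for the query engine and calculation engine described above.

```python
# Illustrative retrieval of target data from a generated SQL statement.
# The table, columns and query text are hypothetical; sqlite3 stands in for
# the query engine and calculation engine described above.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bills (month INTEGER, usage_kwh REAL)")
conn.executemany("INSERT INTO bills VALUES (?, ?)",
                 [(9, 298.4), (10, 321.5), (11, 305.0)])

# SQL produced by the (hypothetical) NLP semantic analysis step:
# the SELECT clause chooses the field, the WHERE clause limits the rows.
generated_sql = "SELECT usage_kwh FROM bills WHERE month = ?"
target_data = conn.execute(generated_sql, (10,)).fetchall()
print(target_data)      # [(321.5,)]
conn.close()
```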
S15: inputting the corresponding target data into an optimized natural language generation model, and converting the corresponding target data into a response text based on the optimized natural language generation model;
In the implementation process of the invention, the inputting the corresponding target data into the optimized natural language generation model and converting the corresponding target data into the response text based on the optimized natural language generation model comprises the following steps: constructing an LSTM model as an initial natural language generation model, wherein the structure of the LSTM model comprises an input layer, a hidden layer and an output layer, the hidden layer comprises a plurality of LSTM units, and each LSTM unit comprises an input gate, a forget gate, an output gate and a cell state; training the initial natural language generation model to obtain a trained natural language generation model; performing structural optimization on the trained natural language generation model by using a genetic particle swarm optimization algorithm to obtain an optimized natural language generation model; and converting the corresponding target data into response text based on the optimized natural language generation model.
Specifically, an LSTM (Long Short-Term Memory) model is constructed as the initial natural language generation model. The LSTM model is a neural network with the ability to memorize both long-term and short-term information and has a chain structure comprising an input layer, a hidden layer and an output layer. The input layer receives the target data. The hidden layer comprises a plurality of LSTM units; each unit introduces a gating mechanism and a description of a cell state, and comprises an input gate, a forget gate, an output gate and a cell state. The cell state provides the memory function of the model, so that information from earlier and later layers can still be transmitted as the network deepens. The input gate determines which new information is added to the cell state, that is, how the cell state is updated: the current input is passed to an activation function, which determines the candidate update and evaluates its importance so that unimportant information is removed, while the current input is also passed to a tanh (hyperbolic tangent) function used to regulate the values. The forget gate determines which information should be discarded by combining the tanh output with the activation output: after the input at the current moment is concatenated with the information from the previous moment, the forget gate decides which parts of the cell state are discarded, the input gate performs a weighted calculation on the concatenated information, and the output gate decides which part of the updated cell state is retained and output. The output layer outputs the text information. The initial natural language generation model is trained to obtain a trained natural language generation model.

Structural optimization of the trained natural language generation model is then performed with a genetic particle swarm optimization algorithm. An initial population of particles is initialized, each particle representing a candidate solution and each candidate solution representing a hierarchical structure of the model. A fitness function is defined, namely a mathematical function that converts the characteristics of a particle into an adaptation value, and a fitness value is calculated for every particle in the initial population. Each particle is expressed as a binary data string and mapped into the solution space, where the quality of each particle is evaluated according to the fitness function; the particles are ranked from high to low by that evaluation, and the two particles with the highest fitness are selected for genetic computation. The two selected particles are crossed according to a crossover operator to generate an offspring population, and particles in the offspring population are mutated according to a mutation operator by randomly flipping the corresponding bits. The fitness of the offspring is evaluated again, and selection, crossover and mutation are repeated over successive generations until convergence, after which the best candidate structure found is taken as the optimized natural language generation model, improving the accuracy and robustness of the output results. The corresponding target data are then converted into the response text with the optimized natural language generation model; processing the target data with the structurally optimized model allows the response text to be generated more accurately and rapidly.
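As an illustration of the structural search just described, the sketch below encodes candidate LSTM hidden sizes as binary strings, evaluates an assumed fitness function, and applies selection, crossover and bit-flip mutation over several generations. The fitness function and parameter range are assumptions made for the example; a full GA-PSO implementation would also include the particle swarm velocity update, which is omitted here for brevity.

```python
# Simplified genetic search over LSTM structure (hidden size encoded in 6 bits).
# The fitness function is an assumed stand-in for validation quality; the
# particle-swarm velocity update of the full GA-PSO algorithm is omitted.
import random

random.seed(0)
BITS = 6                                   # candidate hidden sizes 1..64

def decode(bits):                          # binary string -> hidden size
    return 1 + int("".join(map(str, bits)), 2)

def fitness(bits):                         # assumed surrogate for model quality
    hidden = decode(bits)
    return -abs(hidden - 40) - 0.05 * hidden   # pretend ~40 units is near-optimal

def crossover(a, b):
    cut = random.randrange(1, BITS)
    return a[:cut] + b[cut:]

def mutate(bits, rate=0.1):
    return [1 - b if random.random() < rate else b for b in bits]

population = [[random.randint(0, 1) for _ in range(BITS)] for _ in range(8)]
for generation in range(20):
    population.sort(key=fitness, reverse=True)
    best_two = population[:2]              # select the two fittest particles
    children = [mutate(crossover(*random.sample(best_two, 2)))
                for _ in range(len(population) - 2)]
    population = best_two + children       # elitism: keep the two parents

best = max(population, key=fitness)
print("chosen hidden size:", decode(best))
```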
S16: the response text is converted into response voice based on voice synthesis technology, and the response voice is output.
In the implementation process of the invention, the converting the response text into response voice based on the voice synthesis technology and outputting the response voice comprises the following steps: performing text preprocessing on the response text to obtain a text-preprocessed response text; performing prosody prediction and grapheme-to-phoneme conversion on the text-preprocessed response text to obtain corresponding phonemes; converting the corresponding phonemes into acoustic features based on a preset autoregressive acoustic model; and converting the acoustic features into response voice based on a vocoder and outputting the response voice.
Specifically, text preprocessing is carried out on the response text: corpus cleaning removes blank characters, special characters, duplicated data and stop words to obtain the corpus-cleaned response text; word segmentation is performed on the corpus-cleaned response text, using the co-occurrence frequency of adjacent characters to reflect the reliability of word formation, counting the combination frequency of characters that commonly occur adjacently and treating a combination as a word when its frequency exceeds a threshold, yielding a plurality of text word segments; and part-of-speech tagging is performed on the word segments by calculating the probability distribution of each part of speech and tagging each segment with the part of speech indicated by that distribution, giving the text-preprocessed response text. Prosody prediction and grapheme-to-phoneme conversion are then carried out on the text-preprocessed response text: prosody is predicted at the level of prosodic words, prosodic phrases and intonation phrases so that the synthesized speech better matches a real human voice, and grapheme-to-phoneme conversion looks up the phonemes corresponding to the text so that the correct pronunciation can be determined, yielding the corresponding phonemes. The corresponding phonemes are converted into acoustic features based on a preset autoregressive acoustic model; the preset autoregressive acoustic model may adopt a TTS (Text-to-Speech) model, an end-to-end neural network whose structure comprises an encoder module and a decoder module, the encoder module being responsible for mapping the input phonemes into discrete encoding vectors and the decoder module for decoding the encoding vectors into speech frames, so that the corresponding phonemes are converted into acoustic features. Finally, the acoustic features are converted into response voice based on a vocoder: the vocoder converts the acoustic features into a sound waveform through its up-sampling and down-sampling modules, which avoids loss of sound information, and the response voice is output.
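The sketch below illustrates only the front-end step just described: a toy grapheme-to-phoneme lookup that turns a response text into a phoneme sequence. The pronunciation dictionary is a small invented example, and converting the phonemes into acoustic features and waveforms would require a trained acoustic model and vocoder, which are represented here only by placeholder functions.

```python
# Toy text-to-speech front end: grapheme-to-phoneme lookup on a response text.
# The pronunciation dictionary is an invented example; the acoustic model and
# vocoder of the full pipeline are represented by placeholder functions.
import re

PRONUNCIATIONS = {                 # assumed mini pronunciation dictionary
    "your": ["Y", "AO1", "R"],
    "usage": ["Y", "UW1", "S", "IH0", "JH"],
    "was": ["W", "AA1", "Z"],
}

def grapheme_to_phoneme(text: str) -> list[str]:
    phonemes = []
    for word in re.findall(r"[a-z']+", text.lower()):
        # Unknown words fall back to spelling out letters, one per "phoneme".
        phonemes.extend(PRONUNCIATIONS.get(word, list(word.upper())))
    return phonemes

def acoustic_model(phonemes):      # placeholder for the autoregressive model
    return [[0.0] * 80 for _ in phonemes]      # fake mel-spectrogram frames

def vocoder(frames):               # placeholder for waveform generation
    return b"\x00" * (len(frames) * 160)       # fake PCM bytes

phons = grapheme_to_phoneme("Your usage was 321.5 kWh")
print(phons[:8])
print(len(vocoder(acoustic_model(phons))), "bytes of (placeholder) audio")
```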
In the embodiment of the invention, the voice information input by the user is accurately recognized through the voice recognition technology and converted into natural language text information; the natural language text information is converted into a structured query language through the NLP semantic analysis model, so that the semantics in the natural language text information can be accurately parsed and represented; the target data can be accurately acquired through the logic rules and the calculation engine associated with the structured query language; the target data are processed through the structurally optimized natural language generation model, so that the response text can be generated accurately and rapidly; and the voice synthesis technology converts the response text into response voice and outputs it, completing the full voice interaction flow. A more accurate and rapid voice interaction service is thereby provided, and the user experience of voice interaction is improved.
Example two
Referring to fig. 2, fig. 2 is a schematic structural diagram of a voice interaction system based on natural language processing according to an embodiment of the invention.
As shown in fig. 2, a voice interaction system based on natural language processing, the system comprising:
the voice recognition module 21 is used for acquiring voice information input by a user and converting the voice information into natural language text information based on a voice recognition technology;
In the implementation process of the invention, the voice information is converted into natural language text information based on the voice recognition technology, which comprises the following steps: preprocessing the voice information to obtain preprocessed voice information; performing feature extraction processing on the preprocessed voice information based on a perception linear prediction algorithm to obtain a voice feature vector; inputting the voice feature vector into an acoustic model, and outputting a phoneme sequence based on the acoustic model; inputting the phoneme sequence into a language text model, and outputting natural language text information based on the language text model.
Further, the inputting the speech feature vector into an acoustic model, outputting a phoneme sequence based on the acoustic model, includes: constructing a GMM-HMM model as an initial acoustic model, wherein the structure of the GMM-HMM model comprises: an input layer, an acoustic state layer, an hidden layer, an observable layer, a Gao Sicheng layer, and an output layer; training the initial acoustic model to obtain a trained acoustic model; inputting the voice feature vector into a trained acoustic model, and generating a phoneme sequence by using a decoding algorithm based on the trained acoustic model.
Specifically, the voice information is first preprocessed. Noise reduction is performed by spectral subtraction: the frame length, frame shift and Fourier transform length are initialized, the number of frames is calculated from the frame length and frame shift, and a noise spectrum is estimated from those frames; each frame is Fourier transformed and the noise spectrum is subtracted from it to obtain an enhanced speech amplitude spectrum; if the enhanced amplitude spectrum contains negative values, a replacement value is calculated with an over-subtraction formula using an over-subtraction factor (which prevents noise peaks from being generated) and substituted for the negative values; after all negative values have been replaced, the signal is reconstructed from the signal phase and the amplitude spectrum and inverse transformed back to the time domain, and the noise reduction of the speech is complete once every frame has been inverse transformed. The noise-reduced voice information then undergoes AD sampling: the time domain of the noise-reduced signal is mapped onto digitally quantized discrete points according to the signal period, a corresponding mathematical function is constructed from the discrete points, and the time-domain variation of the speech signal is described by that function to obtain a speech time-domain signal. Pre-emphasis is applied to the time-domain signal, leaving the low-frequency part unchanged and boosting the high-frequency part to compensate for the attenuation of high-frequency components during transmission. The pre-emphasized signal is framed, that is, segmented: unnecessary parts of the speech are cut off, only the useful parts are retained, and the signal is divided into a plurality of speech frames of a designated length. Because framing causes energy leakage, each speech frame is windowed by multiplying every frame signal by a window function (a function that truncates energy leakage in the signal), yielding a plurality of windowed speech frames and completing the preprocessing of the voice information.

Feature extraction is then performed on the preprocessed voice information with a perceptual linear prediction algorithm. A discrete Fourier transform converts the preprocessed speech from the time domain to the frequency domain period by period to obtain a speech spectrogram; the sum of the squares of the real and imaginary parts of the spectrogram gives a short-time power spectrum; the frequencies of the power spectrum are mapped into critical bands, which are divided so as to reflect human auditory perception; the mapped frequencies are pre-emphasized with an equal-loudness curve (a curve reflecting the relation between perceived loudness and frequency); and, to approximate the nonlinear relation between sound intensity and the loudness perceived by the human ear, an intensity-to-loudness conversion is applied, after which an inverse Fourier transform yields the speech feature vector.

A GMM-HMM model, which is a phoneme-level acoustic model, is constructed as the initial acoustic model. Its structure comprises an input layer, an acoustic state layer, a hidden layer, an observable layer, a Gaussian component layer and an output layer: the input layer receives the speech feature vectors; the acoustic state layer observes the distribution of their state sequences and determines their acoustic states; the hidden layer maps the speech features into a higher-dimensional feature space according to that state-sequence distribution to extract more specific features; the observable layer refines the features, through a receptive field and a recurrent layer, into features with stronger expressive power and interpretability, removing redundant information; the Gaussian component layer clusters and recognizes the speech features, and by fitting them and computing their probability distribution obtains the phonemes corresponding to the features; and the output layer outputs the phonemes. The speech feature vector is input into the trained acoustic model, and a phoneme sequence is generated with a decoding algorithm: during decoding an optimal excitation vector is searched for, the decoded value of the excitation vector is the index of a phoneme, and the phoneme sequence is assembled from the decoded values. The initial language text model is trained to obtain a trained language text model, the phoneme sequence is input into it, and natural language text information is output; the voice recognition technology thus accurately recognizes the voice information input by the user and quickly converts it into natural language text information.
Text preprocessing module 22: used for performing text preprocessing on the natural language text information to obtain text-preprocessed natural language text information;
in the implementation process of the invention, performing text preprocessing on the natural language text information to obtain text-preprocessed natural language text information comprises the following steps: carrying out corpus cleaning on the natural language text information to obtain corpus-cleaned natural language text information; carrying out word segmentation on the corpus-cleaned natural language text information to obtain a plurality of text word segments; and carrying out part-of-speech tagging on the text word segments to obtain the text-preprocessed natural language text information.
Specifically, corpus cleaning is performed on the natural language text information: character strings that conform to regular matching rules are matched, and blank characters, special characters, duplicate data and stop words are removed, giving corpus-cleaned natural language text information. Word segmentation is then performed on the corpus-cleaned text: the frequency with which adjacent characters co-occur reflects how reliably they form a word, so the frequency of every adjacent character combination in the text is counted, and when the combination frequency exceeds a critical value the combination is taken to form a word, yielding a plurality of text word segments. Part-of-speech tagging is then performed on the text word segments as a sequence labeling task: given the sequence of units, a tag is assigned to each unit, the probability distribution over possible tag sequences is computed, the optimal tag sequence is selected, the most probable part of speech is determined from that sequence, and each word segment is tagged accordingly; once part-of-speech tagging is complete, the text preprocessing flow is finished. Text preprocessing makes the natural language text information cleaner, more accurate and more reliable.
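As an illustration of the cleaning and frequency-based segmentation described above (the regular expressions, the stop-word list and the frequency threshold are assumed for the example and are not prescribed by this embodiment):

    import re
    from collections import Counter

    STOP_WORDS = {"的", "了", "和"}   # assumed stop-word list

    def clean(text):
        """Corpus cleaning: drop blank characters, special characters and stop words."""
        text = re.sub(r"\s+", "", text)                   # blank characters
        text = re.sub(r"[^\w\u4e00-\u9fff]", "", text)    # special characters
        return "".join(ch for ch in text if ch not in STOP_WORDS)

    def segment(sentences, threshold=2):
        """Adjacent characters whose co-occurrence count reaches the threshold form a word."""
        pair_freq = Counter()
        for sent in sentences:
            pair_freq.update(sent[i:i + 2] for i in range(len(sent) - 1))
        words = []
        for sent in sentences:
            i = 0
            while i < len(sent):
                if i + 1 < len(sent) and pair_freq[sent[i:i + 2]] >= threshold:
                    words.append(sent[i:i + 2]); i += 2
                else:
                    words.append(sent[i]); i += 1
        return words

    corpus = [clean(s) for s in ["今天 天气 很好!", "今天 加班。", "今天 休息?"]]
    print(segment(corpus))   # "今天" co-occurs often enough to be kept as one word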
Semantic parsing module 23: used for inputting the text-preprocessed natural language text information into an NLP semantic analysis model, and converting the text-preprocessed natural language text information into a structured query language based on the NLP semantic analysis model;
in the implementation process of the invention, inputting the text-preprocessed natural language text information into the NLP semantic analysis model and converting it into a structured query language based on the NLP semantic analysis model comprises the following steps: performing feature extraction on the text-preprocessed natural language text information to obtain corresponding feature codes; training the NLP semantic analysis model based on the corresponding feature codes to obtain a trained NLP semantic analysis model; and performing language conversion based on the trained NLP semantic analysis model to obtain the structured query language.
Further, training the NLP semantic analysis model based on the corresponding feature codes to obtain a trained NLP semantic analysis model includes: inputting the corresponding feature codes into the NLP semantic analysis model, and processing the corresponding feature codes with a first decoder in the NLP semantic analysis model to obtain a processing result of the selection clause; processing the corresponding feature codes with a second decoder in the NLP semantic analysis model to obtain a processing result of the condition clause; calculating a loss function based on the processing result of the selection clause and the processing result of the condition clause; and updating and adjusting parameters of the NLP semantic analysis model with an optimizer based on the loss function, and optimizing the NLP semantic analysis model based on the updated and adjusted parameters to obtain the trained NLP semantic analysis model.
Specifically, feature extraction is performed on the text-preprocessed natural language text information using a term frequency-inverse document frequency algorithm: the frequency of occurrence of each word in the natural language text information and the degree to which each word distinguishes the text are counted; a hash function is applied to the word occurrence frequencies to obtain a hash feature vector; the hash feature vector is corrected according to each word's distinguishing degree by means of a fit() function, a function used to fit and correct the data, giving a word-frequency feature vector; because the word-frequency feature vector obtained after feature extraction is high-dimensional and would increase the computational complexity, it is compressed by sparse coding to obtain the corresponding feature codes. The corresponding feature codes are input into the NLP semantic analysis model. A first decoder in the NLP semantic analysis model processes the feature codes to obtain the processing result of the selection clause: the feature codes pass through two fully connected layers in the first decoder, the first fully connected layer determining the prediction result of the selection clause and the second fully connected layer determining the probability distribution over the candidate predictions, from which the processing result of the selection clause is determined. A second decoder in the NLP semantic analysis model, responsible for the condition clause, processes the feature codes through four fully connected layers: the first determines the prediction of the condition column name, the second the prediction of the operator in the condition, the third the prediction of the number of conditions, and the fourth the correspondence among the operator, the condition column name and the number of conditions; once this correspondence is determined, the processing result of the condition clause is obtained. A loss function is then calculated from the processing results of the selection clause and the condition clause: for the first loss function, the processing result of the selection clause is taken as the training points, a nonlinear factor is introduced through the activation function, a best-fit line is obtained from the activation function, the distance from each training point to the best-fit line is computed, and the first loss is calculated from these distances using the mean squared error definition; the second loss function is calculated in the same way from the processing result of the condition clause; and the first and second loss functions are combined by weighted summation into the final loss function. Based on the loss function, an optimizer updates and adjusts the parameters of the NLP semantic analysis model: the optimizer computes the gradients of the loss with respect to the parameters, updates the parameters along those gradients, and repeats the gradient computation and parameter update until a convergence condition is reached; the NLP semantic analysis model optimized with the updated and adjusted parameters is the trained NLP semantic analysis model. Language conversion is then performed according to the structure of the trained NLP semantic analysis model, converting natural language text information into the structured query language; using the NLP semantic analysis model for this conversion allows the semantics in the natural language text information to be parsed accurately.
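A minimal sketch of the two-decoder parsing head and the weighted loss described above is shown below; the hidden size, the numbers of candidate columns and operators, and the loss weights are illustrative assumptions, and cross-entropy is used here for the classification heads in place of the mean-squared-error formulation given in the text:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SelectionDecoder(nn.Module):
        def __init__(self, hidden=256, n_columns=32):
            super().__init__()
            self.fc1 = nn.Linear(hidden, hidden)      # prediction result of the selection clause
            self.fc2 = nn.Linear(hidden, n_columns)   # probability distribution over candidates
        def forward(self, enc):
            return self.fc2(torch.relu(self.fc1(enc)))

    class ConditionDecoder(nn.Module):
        def __init__(self, hidden=256, n_columns=32, n_ops=4, max_conds=4):
            super().__init__()
            self.col = nn.Linear(hidden, n_columns)            # condition column name
            self.op = nn.Linear(hidden, n_ops)                 # operator in the condition
            self.num = nn.Linear(hidden, max_conds)            # number of conditions
            self.link = nn.Linear(hidden, n_columns * n_ops)   # column/operator correspondence
        def forward(self, enc):
            return self.col(enc), self.op(enc), self.num(enc), self.link(enc)

    def total_loss(sel_logits, cond_logits, targets, w_sel=0.5, w_cond=0.5):
        """Weighted sum of the selection-clause loss and the condition-clause loss."""
        loss_sel = F.cross_entropy(sel_logits, targets["sel"])
        loss_cond = sum(F.cross_entropy(l, targets[k])
                        for l, k in zip(cond_logits, ("col", "op", "num", "link")))
        return w_sel * loss_sel + w_cond * loss_cond

During training, an optimizer such as torch.optim.Adam would back-propagate total_loss and update both decoders until a convergence condition is met, mirroring the gradient-based update loop described above.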
The target data acquisition module 24: used for obtaining corresponding target data based on the structured query language;
in the implementation process of the invention, obtaining the corresponding target data based on the structured query language includes: acquiring a target storage position by using a query engine based on the structured query language; and acquiring the corresponding target data by utilizing logic rules and a calculation engine in the structured query language based on the target storage position.
Specifically, the target storage position is obtained from the structured query language by the query engine: the selection statement in the structured query language queries the field names, thereby determining where the target is stored. The corresponding target data are then obtained from the target storage position using the logic rules and the calculation engine of the structured query language: based on the target storage position, the condition statements in the structured query language restrict the rows and columns, and the corresponding target data are retrieved according to this correspondence; the logic and calculation engine built into the structured query language allows the target data to be obtained more accurately and quickly.
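For illustration, executing the generated structured query language against a relational store might look like the following sketch; the database file, table and column names are hypothetical and not part of this embodiment:

    import sqlite3

    def fetch_target_data(db_path, sql, params=()):
        """Run the generated SELECT statement; the selected field names determine the
        target storage position and the WHERE clause limits the rows returned."""
        with sqlite3.connect(db_path) as conn:
            return conn.execute(sql, params).fetchall()

    # Hypothetical query produced by the semantic parsing step for a question
    # such as "what was the load of Substation A on 5 November".
    rows = fetch_target_data(
        "power_grid.db",
        "SELECT load_mw, recorded_at FROM substation_load "
        "WHERE station_name = ? AND recorded_at >= ?",
        ("Substation A", "2023-11-05"),
    )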
The natural language generation module 25: used for inputting the corresponding target data into an optimized natural language generation model, and converting the corresponding target data into a response text based on the optimized natural language generation model;
In the implementation process of the invention, inputting the corresponding target data into the optimized natural language generation model and converting the corresponding target data into the response text based on the optimized natural language generation model comprises the following steps: constructing an LSTM model as an initial natural language generation model, wherein the structure of the LSTM model comprises an input layer, a hidden layer and an output layer, the hidden layer comprises a plurality of LSTM units, and each LSTM unit comprises an input gate, a forget gate, an output gate and a cell state; training the initial natural language generation model to obtain a trained natural language generation model; performing structural optimization on the trained natural language generation model by using a genetic particle swarm optimization algorithm to obtain an optimized natural language generation model; and converting the corresponding target data into the response text based on the optimized natural language generation model.
Specifically, an LSTM model is constructed as the initial natural language generation model. The LSTM model is a neural network with the ability to memorize both long-term and short-term information and has a chain structure comprising an input layer, a hidden layer and an output layer. The input layer receives the target data. The hidden layer comprises a plurality of LSTM units whose structure is described by a gate mechanism and a cell state; each LSTM unit contains an input gate, a forget gate, an output gate and a cell state. The cell state provides the memory function of the model, so that information from earlier and later layers can still be propagated as the model structure deepens. The input gate decides which new information is added to the cell state, i.e. how the cell state is updated: the current input is passed to an activation function that determines the information to be updated and evaluates its importance, removing unimportant information, while the current input is also passed through a tanh function, a hyperbolic tangent used to regulate the model, and the update combines the value produced by the tanh function with the output of the activation function. The forget gate decides which information should be discarded: the input at the current moment is concatenated with the information from the previous moment, the concatenated information is weighted, and the forget gate judges from this weighted result whether parts of the cell state should be dropped. The output gate decides which information in the cell state is retained and output. The output layer outputs the text information. The initial natural language generation model is then trained to obtain a trained natural language generation model. Structural optimization of the trained natural language generation model is performed with a genetic particle swarm optimization algorithm: an initial particle population consisting of a plurality of particles is initialized, each particle representing a candidate solution and each candidate solution representing a hierarchical structure of the model; a fitness function is defined that converts the characteristics of each particle into a mathematical fitness value; each particle in the initial population is encoded as a binary data string and mapped into the solution space, where the quality of each particle is evaluated according to the fitness function; the particles are ranked from high to low by this evaluation, the two particles with the highest fitness are selected for genetic computation, and a crossover operator is applied to them to generate an offspring particle population; a mutation operator then randomly flips bits of selected particles in the offspring population, producing a plurality of new candidate model structures; the fitness of the offspring is evaluated again, and selection, crossover and mutation are repeated until a termination condition is reached, the particle with the best fitness determining the structure of the optimized natural language generation model, thereby improving the accuracy and robustness of the output results. Based on the optimized natural language generation model, the corresponding target data are converted into the response text; processing the target data with the structurally optimized natural language generation model allows the response text to be generated more accurately and quickly.
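The hidden-layer structure and decoding loop described above can be sketched as follows; the vocabulary size, embedding and hidden dimensions, layer count and greedy decoding are illustrative assumptions, and the genetic particle swarm step would search over structural parameters such as the hidden size and the number of layers rather than fixing them as done here:

    import torch
    import torch.nn as nn

    class LSTMGenerator(nn.Module):
        def __init__(self, vocab_size=8000, embed=128, hidden=256, num_layers=2):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed)
            # Each LSTM unit internally applies the input, forget and output gates
            # and maintains the cell state described above.
            self.lstm = nn.LSTM(embed, hidden, num_layers, batch_first=True)
            self.out = nn.Linear(hidden, vocab_size)   # output layer over tokens

        def forward(self, tokens, state=None):
            h, state = self.lstm(self.embed(tokens), state)
            return self.out(h), state

    @torch.no_grad()
    def generate(model, start_token, max_len=30, eos_token=1):
        """Greedy decoding: feed the previous token back in until EOS is produced."""
        tokens, state = [start_token], None
        for _ in range(max_len):
            inp = torch.tensor([[tokens[-1]]])
            logits, state = model(inp, state)
            nxt = int(logits[0, -1].argmax())
            tokens.append(nxt)
            if nxt == eos_token:
                break
        return tokens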
The speech synthesis module 26: used for converting the response text into response voice based on a voice synthesis technology and outputting the response voice.
In the implementation process of the invention, converting the response text into response voice based on the voice synthesis technology and outputting the response voice comprises the following steps: performing text preprocessing on the response text to obtain a text-preprocessed response text; performing prosody prediction and word-to-sound conversion on the text-preprocessed response text to obtain corresponding phonemes; converting the corresponding phonemes into acoustic features based on a preset autoregressive acoustic model; and converting the acoustic features into response voice based on a vocoder and outputting the response voice.
Specifically, text preprocessing is performed on the response text: corpus cleaning removes blank characters, special characters, duplicate data and stop words, giving a corpus-cleaned response text; word segmentation is performed on the corpus-cleaned response text, the frequency of adjacent character co-occurrence reflecting word-forming reliability, the frequency of every adjacent character combination being counted and combinations whose frequency exceeds a critical value being taken as words, yielding a plurality of text word segments; part-of-speech tagging is performed on the text word segments by computing the probability distribution of each part of speech and tagging each segment with the most probable part of speech, giving the text-preprocessed response text. Prosody prediction and word-to-sound conversion are then performed on the text-preprocessed response text: prosody is predicted at the level of prosodic words, prosodic phrases and intonation phrases so that the synthesized speech better matches real human voice, and word-to-sound conversion looks up the phonemes corresponding to the text information so that the correct pronunciation can be determined, giving the corresponding phonemes. The corresponding phonemes are converted into acoustic features by a preset autoregressive acoustic model, for which a TTS model may be adopted; the TTS model is an end-to-end neural network model whose structure comprises an encoder module and a decoder module, the encoder module mapping the input phonemes into discrete encoding vectors and the decoder module decoding the encoding vectors into speech frames, so that the corresponding phonemes can be converted into acoustic features. The acoustic features are converted into response voice by the vocoder: the vocoder converts the acoustic features into a sound waveform, sampling them through its up-sampling and down-sampling modules so that the acoustic features become a waveform without loss of sound information, and finally the response voice is output.
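A structural sketch of this synthesis flow is given below; the phoneme table, the stand-in acoustic model and the stand-in vocoder are placeholders assumed for illustration and do not correspond to any specific trained model in this embodiment:

    import numpy as np

    PHONE_TABLE = {"你好": ["n", "i3", "h", "ao3"]}   # assumed word-to-sound entries

    def text_to_phonemes(text):
        """Word-to-sound conversion: look up the phonemes for each known word in the text."""
        phones = []
        for word, ph in PHONE_TABLE.items():
            if word in text:
                phones.extend(ph)
        return phones

    def acoustic_model(phonemes, n_mels=80, frames_per_phone=5):
        """Stand-in for the encoder-decoder acoustic model: one block of mel frames per phoneme."""
        rng = np.random.default_rng(0)
        return rng.standard_normal((len(phonemes) * frames_per_phone, n_mels))

    def vocoder(mel_frames, hop=160):
        """Stand-in for the neural vocoder: map acoustic frames to a waveform of matching length."""
        return np.zeros(len(mel_frames) * hop, dtype=np.float32)

    waveform = vocoder(acoustic_model(text_to_phonemes("你好")))

In a real deployment the two stand-in functions would be replaced by the trained encoder-decoder acoustic model and the neural vocoder described above.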
In the embodiment of the invention, the voice information input by the user is accurately recognized by the voice recognition technology and converted into natural language text information; the natural language text information is converted into a structured query language by the NLP semantic analysis model, so that the semantics in the natural language text information can be accurately parsed and represented by the structured query language; the target data can be obtained accurately through the logic and calculation engine contained in the structured query language; processing the target data with the structurally optimized natural language generation model allows the response text to be generated accurately and quickly; and the voice synthesis technology converts the response text into response voice and outputs it, completing the full voice interaction flow, thereby providing a more accurate and faster voice interaction service and improving the user experience of voice interaction.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be implemented by a program instructing related hardware, and the program may be stored in a computer-readable storage medium, which may include: read-only memory (ROM), random access memory (RAM), magnetic disk, optical disk, and the like.
In addition, the method and system for voice interaction based on natural language processing provided in the embodiments of the present invention are described in detail above. Specific examples are used herein to illustrate the principles and embodiments of the present invention, and the description of the above embodiments is intended only to help understand the method and its core ideas; at the same time, those skilled in the art may make changes to the specific embodiments and the scope of application in accordance with the ideas of the present invention. In summary, the contents of this description should not be construed as limiting the present invention.

Claims (10)

1. A method of voice interaction based on natural language processing, the method comprising:
acquiring voice information input by a user, and converting the voice information into natural language text information based on a voice recognition technology;
performing text preprocessing on the natural language text information to obtain the natural language text information after text preprocessing;
inputting the text-preprocessed natural language text information into an NLP semantic analysis model, and converting the text-preprocessed natural language text information into a structured query language based on the NLP semantic analysis model;
Acquiring corresponding target data based on the structured query language;
inputting the corresponding target data into an optimized natural language generation model, and converting the corresponding target data into a response text based on the optimized natural language generation model;
and converting the response text into response voice based on a voice synthesis technology, and outputting the response voice.
2. The natural language processing based voice interaction method according to claim 1, wherein the converting the voice information into natural language text information based on a voice recognition technology comprises:
preprocessing the voice information to obtain preprocessed voice information;
performing feature extraction processing on the preprocessed voice information based on a perception linear prediction algorithm to obtain a voice feature vector;
inputting the voice feature vector into an acoustic model, and outputting a phoneme sequence based on the acoustic model;
inputting the phoneme sequence into a language text model, and outputting natural language text information based on the language text model.
3. The natural language processing based voice interaction method according to claim 2, wherein the inputting the voice feature vector into an acoustic model and outputting a phoneme sequence based on the acoustic model comprises:
constructing a GMM-HMM model as an initial acoustic model, wherein the structure of the GMM-HMM model comprises: an input layer, an acoustic state layer, a hidden layer, an observable layer, a Gaussian component layer, and an output layer;
training the initial acoustic model to obtain a trained acoustic model;
inputting the voice feature vector into a trained acoustic model, and generating a phoneme sequence by using a decoding algorithm based on the trained acoustic model.
4. The voice interaction method based on natural language processing according to claim 1, wherein the performing text preprocessing on the natural language text information to obtain text-preprocessed natural language text information comprises:
carrying out corpus cleaning treatment on the natural language text information to obtain natural language text information after corpus cleaning treatment;
performing word segmentation on the natural language text information after corpus cleaning to obtain a plurality of text word segments;
and performing part-of-speech tagging on the text word segments to obtain the text-preprocessed natural language text information.
5. The method for voice interaction based on natural language processing according to claim 1, wherein the inputting the text-preprocessed natural language text information into an NLP semantic parsing model and converting the text-preprocessed natural language text information into a structured query language based on the NLP semantic parsing model comprises:
carrying out feature extraction processing on the text-preprocessed natural language text information to obtain corresponding feature codes;
training the NLP semantic analysis model based on the corresponding feature codes to obtain a trained NLP semantic analysis model;
and performing language conversion based on the trained NLP semantic analysis model to obtain a structured query language.
6. The method for voice interaction based on natural language processing according to claim 5, wherein the training the NLP semantic parsing model based on the corresponding feature codes to obtain a trained NLP semantic parsing model comprises:
inputting the corresponding feature codes into an NLP semantic analysis model, and processing the corresponding feature codes by using a first decoder in the NLP semantic analysis model to obtain a processing result of selecting clauses;
processing the corresponding feature codes by using a second decoder in the NLP semantic parsing model to obtain a processing result of the condition clause;
calculating a loss function based on the processing result of the selection clause and the processing result of the condition clause;
and updating and adjusting parameters of the NLP semantic analysis model by using an optimizer based on the loss function, and optimizing the NLP semantic analysis model based on the updated and adjusted parameters to obtain a trained NLP semantic analysis model.
7. The method for voice interaction based on natural language processing according to claim 1, wherein the obtaining the corresponding target data based on the structured query language comprises:
acquiring a target storage position by using a query engine based on the structured query language;
and acquiring corresponding target data by utilizing logic rules and a calculation engine in the structured query language based on the target storage position.
8. The method for voice interaction based on natural language processing according to claim 1, wherein the inputting the corresponding target data into an optimized natural language generation model and converting the corresponding target data into a response text based on the optimized natural language generation model comprises:
constructing an LSTM model as an initial natural language generation model, wherein the structure of the LSTM model comprises an input layer, a hidden layer and an output layer, the hidden layer comprises a plurality of LSTM units, and each LSTM unit comprises an input gate, a forget gate, an output gate and a cell state;
training the initial natural language generation model to obtain a trained natural language generation model;
performing structural optimization on the trained natural language generation model by using a genetic particle swarm optimization algorithm to obtain an optimized natural language generation model;
And converting the corresponding target data into response text based on the optimized natural language generation model.
9. The natural language processing based voice interaction method according to claim 1, wherein the converting the response text into response voice based on a voice synthesis technology and outputting the response voice comprises:
performing text preprocessing on the response text to obtain a text preprocessed response text;
performing prosody prediction and word-to-sound conversion on the text-preprocessed response text to obtain corresponding phonemes;
converting the corresponding phonemes into acoustic features based on a preset autoregressive acoustic model;
and converting the acoustic features into response voice based on a vocoder, and outputting the response voice.
10. A natural language processing based voice interaction system, the system comprising:
the voice recognition module is used for acquiring voice information input by a user and converting the voice information into natural language text information based on a voice recognition technology;
the text preprocessing module is used for carrying out text preprocessing on the natural language text information to obtain the natural language text information after text preprocessing;
The semantic analysis module is used for inputting the text-preprocessed natural language text information into an NLP semantic analysis model, and converting the text-preprocessed natural language text information into a structured query language based on the NLP semantic analysis model;
the target data acquisition module is used for acquiring corresponding target data based on the structured query language;
the natural language generation module is used for inputting the corresponding target data into the optimized natural language generation model, and converting the corresponding target data into a response text based on the optimized natural language generation model;
and the voice synthesis module is used for converting the response text into response voice based on a voice synthesis technology and outputting the response voice.
CN202311467616.5A 2023-11-06 2023-11-06 Voice interaction method and system based on natural language processing Pending CN117555916A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311467616.5A CN117555916A (en) 2023-11-06 2023-11-06 Voice interaction method and system based on natural language processing

Publications (1)

Publication Number Publication Date
CN117555916A true CN117555916A (en) 2024-02-13

Family

ID=89810194

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311467616.5A Pending CN117555916A (en) 2023-11-06 2023-11-06 Voice interaction method and system based on natural language processing

Country Status (1)

Country Link
CN (1) CN117555916A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111177184A (en) * 2019-12-24 2020-05-19 深圳壹账通智能科技有限公司 Structured query language conversion method based on natural language and related equipment thereof
CN112597276A (en) * 2020-12-25 2021-04-02 苏州思必驰信息科技有限公司 Data output method and device
CN112783921A (en) * 2021-01-26 2021-05-11 中国银联股份有限公司 Database operation method and device
CN114168885A (en) * 2021-12-03 2022-03-11 武汉百智诚远科技有限公司 Intelligent class retrieval method based on voice recognition and NL2SQL model
CN114547329A (en) * 2022-01-25 2022-05-27 阿里巴巴(中国)有限公司 Method for establishing pre-training language model, semantic analysis method and device
CN115827819A (en) * 2022-10-12 2023-03-21 航天信息股份有限公司 Intelligent question and answer processing method and device, electronic equipment and storage medium
CN116312505A (en) * 2021-12-21 2023-06-23 国网上海市电力公司 Supply chain voice recognition method based on natural language processing

Similar Documents

Publication Publication Date Title
US11189272B2 (en) Dialect phoneme adaptive training system and method
O'shaughnessy Interacting with computers by voice: automatic speech recognition and synthesis
CN112017644B (en) Sound transformation system, method and application
Ghai et al. Literature review on automatic speech recognition
CN101944359B (en) Voice recognition method facing specific crowd
Jemine Real-time voice cloning
CN113470662A (en) Generating and using text-to-speech data for keyword spotting systems and speaker adaptation in speech recognition systems
EP4273854A1 (en) Information synthesis method and apparatus, electronic device, and computer readable storage medium
US20200013391A1 (en) Acoustic information based language modeling system and method
CN114023300A (en) Chinese speech synthesis method based on diffusion probability model
Suyanto et al. End-to-End speech recognition models for a low-resourced Indonesian Language
CN113450757A (en) Speech synthesis method, speech synthesis device, electronic equipment and computer-readable storage medium
CN113450761A (en) Parallel speech synthesis method and device based on variational self-encoder
Dossou et al. OkwuGbé: End-to-End Speech Recognition for Fon and Igbo
CN114495969A (en) Voice recognition method integrating voice enhancement
Hadwan et al. An End-to-End Transformer-Based Automatic Speech Recognition for Qur'an Reciters.
Fu et al. A survey on Chinese speech recognition
CN117555916A (en) Voice interaction method and system based on natural language processing
Qiu et al. A voice cloning method based on the improved hifi-gan model
Zhang Research on Phoneme Recognition using Attention-based Methods
Khorram et al. Soft context clustering for F0 modeling in HMM-based speech synthesis
Pour et al. Persian Automatic Speech Recognition by the use of Whisper Model
Vyas et al. Study of Speech Recognition Technology and its Significance in Human-Machine Interface
Wang et al. Artificial Intelligence and Machine Learning Application in NPP MCR Speech Monitoring System
Frikha et al. Hidden Markov models (HMMs) isolated word recognizer with the optimization of acoustical analysis and modeling techniques

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination