CN117609574A - Speaking recommendation method and device, computer equipment and storage medium - Google Patents

Speaking recommendation method and device, computer equipment and storage medium

Info

Publication number
CN117609574A
CN117609574A
Authority
CN
China
Prior art keywords
text
data
emotion
audio
emotion recognition
Prior art date
Legal status
Pending
Application number
CN202311630552.6A
Other languages
Chinese (zh)
Inventor
刘喜声
Current Assignee
Ping An Life Insurance Company of China Ltd
Original Assignee
Ping An Life Insurance Company of China Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Life Insurance Company of China Ltd filed Critical Ping An Life Insurance Company of China Ltd
Priority to CN202311630552.6A priority Critical patent/CN117609574A/en
Publication of CN117609574A publication Critical patent/CN117609574A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
        • G06 COMPUTING; CALCULATING OR COUNTING
            • G06F ELECTRIC DIGITAL DATA PROCESSING
                • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
                    • G06F 16/90 Details of database functions independent of the retrieved data types
                        • G06F 16/903 Querying
                            • G06F 16/9032 Query formulation
                                • G06F 16/90332 Natural language query formulation or dialogue systems
                            • G06F 16/9035 Filtering based on additional data, e.g. user or group profiles
                        • G06F 16/95 Retrieval from the web
                            • G06F 16/953 Querying, e.g. by the use of web search engines
                                • G06F 16/9535 Search customisation based on user profiles and personalisation
            • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
                • G06Q 30/00 Commerce
                    • G06Q 30/01 Customer relationship services
                        • G06Q 30/015 Providing customer assistance, e.g. assisting a customer within a business location or via helpdesk
                    • G06Q 30/06 Buying, selling or leasing transactions
                        • G06Q 30/0601 Electronic shopping [e-shopping]
                            • G06Q 30/0613 Third-party assisted
                • G06Q 40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
                    • G06Q 40/08 Insurance
        • G10 MUSICAL INSTRUMENTS; ACOUSTICS
            • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
                • G10L 15/00 Speech recognition
                    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
                    • G10L 15/26 Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Finance (AREA)
  • Accounting & Taxation (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Business, Economics & Management (AREA)
  • General Engineering & Computer Science (AREA)
  • Strategic Management (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Technology Law (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the field of artificial intelligence and discloses a speaking recommendation method, which comprises: performing voice recognition on acquired audio data to obtain text data; performing feature extraction on the text data through a text feature extraction layer in a GPT model to obtain text features; performing feature extraction on the audio data through a voice feature extraction layer in the GPT model to obtain audio features; performing emotion recognition on the text features and the audio features through an emotion recognition layer in the GPT model to obtain an emotion recognition result; and performing speaking recommendation on the audio data based on the emotion recognition result to obtain a target script. The invention is applied to script recommendation in insurance or financial services. By performing emotion recognition on both the text features and the audio features with the GPT model, the emotion in the audio data is recognized accurately, the accuracy of the emotion recognition result is improved, and the accuracy of script recommendation is improved in turn.

Description

Speaking recommendation method and device, computer equipment and storage medium
Technical Field
The present invention relates to the field of artificial intelligence technologies, and in particular to a speaking recommendation method and apparatus, a computer device, and a storage medium.
Background
With the development of science and technology, the business of insurance companies has grown rapidly, and communication between customers and enterprises has evolved from face-to-face consultation to remote channels such as the telephone and the Internet. In the insurance industry, customer service centers handle a large volume of telephone voice services every day and must process diverse customer demands; during a telephone service, an agent needs to deal with customers in different moods and respond appropriately.
In the prior art, there are two general classes of emotion recognition methods. The first is rule-based: emotion words appearing in a text are found according to an emotion dictionary, simple emotion polarity statistics are computed, and the final score is compared with a set threshold to obtain an emotion polarity result. The second is based on machine learning: an emotion classifier is trained on a large amount of labeled corpus. Rule-based methods rely on an emotion dictionary, and the quality of that dictionary directly affects the accuracy of the final emotion analysis. Machine learning methods consider only the grammatical structure of the text, so the accuracy of emotion recognition is low, and the accuracy of the recommended script is therefore also low.
Disclosure of Invention
The embodiments of the invention provide a speaking recommendation method, apparatus, computer device and storage medium, to solve the problems of low emotion recognition accuracy and low script recommendation accuracy in the prior art.
A speaking recommendation method, comprising:
acquiring audio data, and performing voice recognition on the audio data to obtain text data;
inputting the text data into a GPT model, and performing feature extraction on the text data through a text feature extraction layer in the GPT model to obtain text features;
inputting the audio data into the GPT model, and performing feature extraction on the audio data through a voice feature extraction layer in the GPT model to obtain audio features;
performing emotion recognition on the text features and the audio features through an emotion recognition layer in the GPT model to obtain an emotion recognition result;
and performing speaking recommendation on the audio data based on the emotion recognition result to obtain a target script.
A speaking recommendation apparatus, comprising:
a voice recognition module, configured to acquire audio data and perform voice recognition on the audio data to obtain text data;
a text feature module, configured to input the text data into a GPT model and perform feature extraction on the text data through a text feature extraction layer in the GPT model to obtain text features;
an audio feature module, configured to input the audio data into the GPT model and perform feature extraction on the audio data through a voice feature extraction layer in the GPT model to obtain audio features;
an emotion recognition module, configured to perform emotion recognition on the text features and the audio features through an emotion recognition layer in the GPT model to obtain an emotion recognition result;
and a speaking recommendation module, configured to perform speaking recommendation on the audio data based on the emotion recognition result to obtain a target script.
A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the above speaking recommendation method when executing the computer program.
A computer readable storage medium storing a computer program which, when executed by a processor, implements the above speaking recommendation method.
The invention provides a speaking recommendation method and apparatus, a computer device and a storage medium. Voice recognition is performed on the acquired audio data, so that the audio data generated in insurance or financial services is converted into text data. Feature extraction is performed on the text data and the audio data respectively through the GPT model, so that the text features and the audio features are obtained. Emotion recognition is performed on the text features and the audio features through the emotion recognition layer, so that the emotion in the audio data of the insurance or financial service is recognized accurately, the emotion recognition result is determined, and the accuracy of the emotion recognition result is improved. Speaking recommendation is then performed on the audio data based on the emotion recognition result, so that a target script for the insurance or financial service is recommended and the accuracy of script recommendation in insurance or financial services is further improved.
Drawings
In order to explain the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. It is apparent that the drawings described below show only some embodiments of the present invention, and a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a schematic view of an application environment of a speaking recommendation method according to an embodiment of the present invention;
FIG. 2 is a flowchart of a speaking recommendation method in an embodiment of the invention;
FIG. 3 is a flowchart of step S20 of the speaking recommendation method according to an embodiment of the present invention;
FIG. 4 is a flowchart of step S40 of the speaking recommendation method according to an embodiment of the present invention;
FIG. 5 is a functional block diagram of a speaking recommendation apparatus in an embodiment of the present invention;
FIG. 6 is a schematic diagram of a computer device in accordance with an embodiment of the invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The speaking recommendation method provided by the embodiment of the invention can be applied to the application environment shown in FIG. 1. Specifically, the speaking recommendation method is applied in a speaking recommendation system that comprises a client and a server as shown in FIG. 1; the client and the server communicate through a network, so as to solve the problems of low emotion recognition accuracy and low script recommendation accuracy in the prior art. The server may be a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), and basic cloud computing services such as big data and invoice recognition platforms. The client, also referred to as the user side, is the program that corresponds to the server and provides services to the user. The client may be installed on, but is not limited to, various computers, notebook computers, smartphones, tablet computers, and portable wearable devices.
In one embodiment, as shown in fig. 2, a speaking recommendation method is provided, and the method is applied to the server in fig. 1 for illustration, and includes the following steps:
S10: acquiring audio data, and performing voice recognition on the audio data to obtain text data.
Understandably, the audio data is the conversational speech between a customer service agent and a customer, and the text data is the text corresponding to the audio data.
Specifically, audio data collected by a voice acquisition device is obtained from a database. Voice recognition is then performed on the audio data: a pre-trained voice recognition model may be obtained and the audio data input into it to obtain the corresponding text data, for example with a deep model used for recognition. Alternatively, the audio data is segmented with a moving window function and cut into short speech signals; each short segment is called a frame, and adjacent frames overlap. Feature extraction is then performed on each frame of the speech signal, that is, each frame is turned into a multidimensional vector through linear prediction cepstral coefficients and Mel-frequency cepstral coefficients, so that each frame is represented by a feature vector. All the multidimensional vectors are converted by an acoustic model to obtain the phoneme information corresponding to the audio data. The phoneme information is then converted back through a language model and decoded against an existing dictionary to obtain the text data. Understandably, a voice recognition module can be added to the model so that audio is acquired and recognized in real time in an integrated manner.
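A minimal sketch of the front end described above is given below: overlapping frames and Mel cepstral (MFCC) features per frame. It assumes the librosa library; the sampling rate, window and hop sizes are illustrative, and the later stages (acoustic model to phonemes, language model and dictionary to text) are only indicated by the closing comment, not implemented.

```python
import numpy as np
import librosa

def extract_frame_features(wav_path: str, n_mfcc: int = 13) -> np.ndarray:
    signal, sr = librosa.load(wav_path, sr=16000)            # mono, 16 kHz
    # 25 ms windows with a 10 ms hop give overlapping frames.
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc,
                                n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))
    return mfcc.T                                            # (num_frames, n_mfcc)

# A full recognizer would feed these frame vectors to an acoustic model to
# obtain phonemes and decode them with a language model and dictionary to text.
```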
S20: inputting the text data into a GPT model, and performing feature extraction on the text data through a text feature extraction layer in the GPT model to obtain text features.
Understandably, the GPT model is a Generative Pre-trained Transformer model. The text features are used to characterize the content of the text data.
Specifically, after the text data is obtained, it is input into the GPT model, and feature extraction is performed on the text data through the text feature extraction layer in the GPT model. That is, the encoding unit first performs vector encoding and position encoding on the text data, and the encoding vector and the position vector are added to obtain a text encoding vector. Feature extraction is then performed on the text encoding vector through the self-attention unit in the text feature extraction layer, that is, the self-attention unit performs left-to-right masked prediction on the context in the text encoding vector to obtain an attention vector. Layer normalization is applied to the attention vector, the normalized result is predicted through the feed-forward neural network unit to obtain a corresponding prediction result, and layer normalization is applied to that prediction result to obtain the text features.
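A minimal sketch of such a text feature extraction layer is given below (token embedding plus position embedding, one masked self-attention unit, layer normalization and a feed-forward unit). PyTorch is an assumption, and the vocabulary size, model width, head count and single-block depth are illustrative only; the patent does not fix these choices.

```python
import torch
import torch.nn as nn

class TextFeatureExtractor(nn.Module):
    def __init__(self, vocab_size=30000, d_model=256, n_heads=4, max_len=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        seq_len = token_ids.size(1)
        pos = torch.arange(seq_len, device=token_ids.device)
        x = self.tok_emb(token_ids) + self.pos_emb(pos)      # text encoding vector
        # Left-to-right (causal) mask: each position attends only to itself
        # and to earlier positions.
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool,
                                     device=token_ids.device), diagonal=1)
        attn_out, _ = self.attn(x, x, x, attn_mask=mask)
        x = self.ln1(x + attn_out)                           # attention vector
        x = self.ln2(x + self.ffn(x))                        # text features
        return x
```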
S30: inputting the audio data into the GPT model, and performing feature extraction on the audio data through a voice feature extraction layer in the GPT model to obtain audio features.
The audio features are understandably used to characterize the content of the audio data.
Specifically, after the audio data is obtained, it is input into the GPT model, and feature extraction is performed on the audio data through the voice feature extraction layer in the GPT model. That is, the encoding unit performs vector encoding and position encoding on the audio data to obtain an audio encoding vector. Feature extraction is performed on the audio encoding vector through the self-attention unit in the voice feature extraction layer, that is, the self-attention unit performs left-to-right masked prediction on the context in the audio encoding vector to obtain an attention vector. Layer normalization is applied to the attention vector, the normalized result is predicted through the feed-forward neural network unit to obtain a corresponding prediction result, and layer normalization is applied to that prediction result to obtain the audio features. The voice feature extraction layer and the text feature extraction layer can perform feature extraction simultaneously.
S40: performing emotion recognition on the text features and the audio features through an emotion recognition layer in the GPT model to obtain an emotion recognition result.
Understandably, the emotion recognition result is the emotion contained in the audio data and the text data, for example happy or angry.
Specifically, after the audio features and the text features are obtained, emotion recognition is performed on them through the emotion recognition layer in the GPT model: emotion recognition is performed on the text features through a first recognition network in the emotion recognition layer to obtain a text emotion, emotion recognition is performed on the audio features through a second recognition network in the emotion recognition layer to obtain an audio emotion, and the emotion recognition result is determined according to the text emotion and the audio emotion corresponding to the same audio data.
S50: performing speaking recommendation on the audio data based on the emotion recognition result to obtain a target script.
Understandably, the target script is a script that matches the customer's mood and intent.
Specifically, after the emotion recognition result is obtained, speaking recommendation is performed on the audio data based on the emotion recognition result, that is, a preset script corresponding to the audio data is selected from a preset script database according to the recognized emotion. Concretely, intention recognition is performed on the audio data and the text data to obtain a recognition intention; a preset script database is then obtained, the preset script database comprising at least one preset script; and the target script is screened out of all preset scripts based on the emotion recognition result and the recognition intention. For example, in the sales process of an insurance or financial product, accurately recognizing the customer's emotion and adopting the preset script corresponding to that emotion improves the purchase success rate, and with this method the accuracy of both emotion recognition and script recommendation is improved. An end-to-end sketch of steps S10 to S50 is given below.
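The sketch below only fixes the data flow of steps S10 through S50. Each stage is passed in as a callable because the concrete models (the ASR front end, the GPT feature extraction layers, the emotion recognition layer, the intent recognizer and the script matcher) are described elsewhere in this text; the function names and signatures are assumptions, not the patent's API.

```python
from typing import Callable

def recommend_script(wav_path: str,
                     recognize_speech: Callable,      # S10: audio -> text
                     extract_text_feat: Callable,     # S20: text feature layer
                     extract_audio_feat: Callable,    # S30: voice feature layer
                     recognize_emotion: Callable,     # S40: emotion recognition layer
                     recognize_intent: Callable,      # S50: intention recognition
                     select_script: Callable) -> str: # S50: script matching
    text = recognize_speech(wav_path)
    emotion = recognize_emotion(extract_text_feat(text), extract_audio_feat(wav_path))
    intent = recognize_intent(wav_path, text)
    return select_script(emotion, intent)
```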
With the speaking recommendation method above, voice recognition is performed on the acquired audio data, so that the audio data from insurance or financial business is converted and the text data is obtained. Feature extraction is performed on the text data and the audio data respectively through the GPT model, so that the text features and the audio features are extracted. Emotion recognition is performed on the text features and the audio features through the emotion recognition layer, so that the emotion in the audio data of the insurance or financial service is recognized accurately, the emotion recognition result is determined, and the accuracy of the emotion recognition result is improved. Speaking recommendation is then performed on the audio data based on the emotion recognition result, so that the target script for the insurance or financial service is recommended and the accuracy of script recommendation in insurance or financial services is further improved.
In one embodiment, before step S10, that is, before the audio data is acquired, the method includes:
S101: acquiring original voice data, and performing framing processing on the original voice data to obtain at least one piece of framing data.
S102: performing endpoint detection on all the framing data to obtain a starting point and an ending point corresponding to each piece of framing data.
S103: denoising the original voice data according to the starting points and ending points of all the framing data to obtain the audio data.
Understandably, the original voice data is audio that requires voice detection; for example, in an artificial-intelligence customer service scenario, the original voice data is the dialogue between the user and the customer service agent. The framing data is obtained by dividing the original voice data. The starting point is the start position of a voice region in a piece of framing data, and the ending point is the end position of a voice region in a piece of framing data; a given piece of framing data may or may not contain such an endpoint. The audio data is the data that contains only voice regions.
Specifically, the original voice data collected by the voice acquisition device is retrieved from a database or a client. The original voice data is segmented, that is, it can be divided into segments of voice data at a fixed interval; for example, in an insurance purchase scenario, original voice data of 2 seconds between a salesperson and a customer is divided into 180 segments. Each segment contains the same number of signal sampling points, and each segment is taken as one piece of framing data. The energy value of the signal in each piece of framing data is then calculated. If the energy values of several consecutive pieces of framing data at the front of the original voice data are lower than a preset energy threshold (which can be set as required) and the energy values of the following consecutive pieces are greater than or equal to the preset energy threshold, the position where the signal energy rises is the starting point of the voice data. Similarly, if the speech energy in several consecutive pieces of framing data is high and then becomes low and remains low for a certain period of time, the point where the energy drops can be regarded as the ending point of the original voice data. The starting point and ending point in each piece of framing data are thereby determined. The voice data between the starting point and the ending point in each piece of framing data is retained, the data between pieces of framing data (for example, between the ending point of the first piece and the starting point of the second piece) is deleted, and all non-voice data is removed in turn. All retained framing data is spliced in the original segmentation order to obtain the audio data.
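A minimal sketch of this energy-based endpoint detection follows: the signal is split into equal-length frames, the short-time energy of each frame is compared with a threshold, and only frames above the threshold are kept and re-spliced in order. The frame length and threshold values are illustrative assumptions.

```python
import numpy as np

def trim_silence(signal: np.ndarray, frame_len: int = 400,
                 energy_threshold: float = 1e-4) -> np.ndarray:
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).mean(axis=1)          # short-time energy per frame
    voiced = energy >= energy_threshold          # True where speech is present
    return frames[voiced].reshape(-1)            # splice voiced frames in order
```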
In this embodiment of the invention, the energy value of the signal in each piece of framing data is calculated and compared with the preset energy threshold, so that the starting point and/or ending point in each piece of framing data is determined, the audio data is extracted, and the redundancy of the voice data is reduced.
In one embodiment, before step S20, i.e. before inputting the text data into the GPT model, the method includes:
S201: acquiring a sample data set, where the sample data set includes at least one piece of sample data and a sample label corresponding to the sample data.
Understandably, the sample data is historical audio data and the historical text data corresponding to it, that is, each piece of sample data comprises a pair of audio data and text data; for example, in a sales scenario for an insurance product, the historical audio data is the conversational speech between customer service agents and previous customers. The sample label represents the true emotion and is obtained either through recognition by an emotion recognition model or through manual annotation, where the emotion recognition model is obtained by fine-tuning a pre-trained model with text-and-label pairs; the historical audio data and historical text data are input into the emotion recognition model to obtain the sample label. The sample data and sample labels may be collected from different databases or different clients. A sample data set is then constructed from all the collected sample data and sample labels. Some negative samples can also be added when training the preset training model, so as to improve its accuracy.
S202, acquiring a preset training model, and carrying out emotion recognition on the sample data through the preset training model to obtain an emotion label.
Understandably, the emotion label is the emotion result obtained when the preset training model recognizes the sample data. The preset training model is a pre-trained GPT model, with a built-in sentiment-analysis pipeline from the transformers library added for emotion recognition.
Specifically, the preset training model is obtained, all sample data is input into it, and emotion recognition is performed on the sample data through the preset training model. That is, feature extraction is performed on the historical text data through the text feature extraction unit in the feature extraction layer to obtain historical text features, and feature extraction is performed on the historical audio data through the audio feature extraction unit in the feature extraction layer to obtain historical audio features. Emotion recognition is then performed on the historical text features through the first recognition network of the emotion recognition module to obtain a text-recognized emotion, and emotion recognition is performed on the historical audio features through the second recognition network of the emotion recognition module to obtain an audio-recognized emotion. The specific recognition process is not repeated here; refer to the other steps. For example, in a financial product sales scenario, accurately recognizing customer emotion can increase the success rate: when a financial product the customer is interested in is mentioned, the customer may show an interested, i.e. happy, emotion.
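The built-in sentiment pipeline of the Hugging Face transformers library, mentioned above, can label a text as shown in this minimal sketch. The default English sentiment model it downloads is an assumption; the patent's labels may instead come from a fine-tuned model or manual annotation.

```python
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")
result = sentiment("I'm very interested in this savings product.")[0]
print(result["label"], result["score"])   # e.g. POSITIVE 0.99 (score = confidence)
```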
S203, determining a predicted loss value of the preset training model according to the emotion label and the sample label corresponding to the same sample data.
Understandably, the predictive loss value is generated during emotion recognition of the sample training data.
Specifically, after the emotion labels are obtained, the emotion labels corresponding to the sample data are arranged in the order of the sample data in the sample data set, and the sample label associated with each piece of sample data is compared with the emotion label of the sample data at the same position. That is, following the sample ordering, the sample label of the first piece of sample data is compared with the emotion label of the first piece, and the loss value between them is calculated through a loss function; then the sample label of the second piece is compared with the emotion label of the second piece, and their loss value is calculated through the loss function; this continues until the loss values of all sample labels and emotion labels have been computed, yielding the predicted loss value.
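A minimal sketch of this comparison follows: each predicted emotion is compared, in order, with the corresponding sample label, and the per-sample losses are aggregated into the predicted loss value. Cross-entropy over class logits is an assumed choice; the patent does not name a specific loss function.

```python
import torch
import torch.nn.functional as F

def predicted_loss(pred_logits: torch.Tensor, sample_labels: torch.Tensor) -> torch.Tensor:
    # pred_logits: (num_samples, num_emotions); sample_labels: (num_samples,)
    per_sample = F.cross_entropy(pred_logits, sample_labels, reduction="none")
    return per_sample.mean()    # aggregate loss over the whole sample set
```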
S204: performing parameter adjustment on the preset training model through a target policy network and the predicted loss value so that the preset training model converges, and recording the converged preset training model as the GPT model.
Understandably, the convergence condition may be that the predicted loss value is smaller than a set threshold, or that after 500 iterations the predicted loss value is small and no longer decreases, at which point training is stopped.
Specifically, after the predicted loss value is obtained, if it does not meet the preset convergence condition, the parameters of the preset training model are adjusted through the target policy network and the predicted loss value, and all the sample data and sample labels are fed into the preset training model again to re-adjust the initial parameters. That is, the sample labels are used as reward signals, and these reward signals are optimized through the target policy network so that the predicted emotion labels gradually approach the sample labels and the predicted loss value keeps moving toward the convergence condition. In this way the predicted emotion recognition results are continually pulled toward the correct results and the preset training model becomes more and more accurate; when the predicted loss value reaches the preset convergence condition, the converged preset training model is determined as the GPT model. The recognition result of the emotion recognition model is used as the reward of the GPT model, and a reinforcement learning algorithm, for example a policy-gradient-based method, uses the confidence score of the emotion recognition model as the reward signal to guide the update of the GPT generation model.
In this embodiment of the invention, the preset training model is trained iteratively on a large amount of sample data, and the overall loss value of the preset training model is calculated through the loss function, so that the predicted loss value is determined. The parameters of the preset training model are adjusted through the predicted loss value and the target policy network until the model converges, so that the GPT model is determined and a high accuracy of the GPT model is ensured.
In an embodiment, before step S204, that is, before performing parameter adjustment on the preset training model through the target policy network and the predicted loss value so that the preset training model converges, the method includes:
S2041: obtaining a confidence value of the emotion recognition model's prediction for each piece of sample data.
Understandably, the confidence value is the confidence with which the emotion recognition model predicts each piece of sample data.
Specifically, the confidence value is obtained while the emotion recognition model performs emotion recognition on the sample data: emotion recognition is performed on each piece of sample data through a bidirectional long short-term memory network to obtain a recognition result, i.e. the sample label, together with a confidence score; the confidence score is associated with the sample data and the sample label and stored in a database, so that the confidence value of the emotion recognition model's prediction for each piece of sample data can be obtained directly from the database and used as the reward.
S2042: predicting the sample data and the confidence values corresponding to the sample data through a preset policy network to obtain a sequence probability distribution.
S2043: updating the parameters of the preset policy network through a policy gradient algorithm and the sequence probability distribution to obtain the target policy network.
Understandably, the preset policy network is a neural network model, such as a recurrent neural network, that observes the environment state and directly predicts the policy that should currently be executed, so that executing that policy yields the maximum expected return.
Specifically, all sample data and the confidence values corresponding to the sample data are input into the preset policy network, and probability prediction is performed on the sample data through the preset policy network, that is, the probability distribution over the emotions in the sample data is output; training with the confidence values as rewards yields the sequence probability distribution. The parameters of the preset policy network are then updated through a policy gradient algorithm and the sequence probability distribution, that is, the REINFORCE algorithm is used to update the preset policy network based on the sequence probability distribution: the confidence value actually obtained serves as an unbiased Monte Carlo approximation of the action value Q_π in the policy gradient, stochastic gradient ascent is performed with a preset formula, and the preset policy network is updated to obtain the target policy network. The preset formula is θ_new = θ_now + β · Σ_{t=1}^{n} q_t · ∇_θ ln π(a_t | s_t; θ_now), where β is the learning rate, n is the number of steps performed in each training round, q_t is the confidence value used as the Monte Carlo approximation of Q_π at step t, π(a_t | s_t; θ_now) is the output of the policy network, ln π(a_t | s_t; θ_now) is the logarithm of the policy network output value, and ∇_θ ln π(a_t | s_t; θ_now) is the gradient of that logarithm with respect to the policy network parameters.
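A minimal sketch of one REINFORCE update follows: the log-probability of each taken action is weighted by its reward (here, the confidence value of the emotion recognition model), and the policy parameters are moved by stochastic gradient ascent. The network shape, number of emotion classes and learning rate are illustrative assumptions.

```python
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(128, 64), nn.Tanh(), nn.Linear(64, 8))  # 8 emotions
optimizer = torch.optim.SGD(policy.parameters(), lr=1e-3)                # beta

def reinforce_step(states: torch.Tensor, actions: torch.Tensor,
                   confidences: torch.Tensor) -> None:
    log_probs = torch.log_softmax(policy(states), dim=-1)        # ln pi(a|s; theta)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    # Ascent on sum_t q_t * ln pi(a_t|s_t) == descent on its negative.
    loss = -(confidences * chosen).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```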
In this embodiment of the invention, the confidence value predicted by the emotion recognition model is used as the reward signal to train the preset policy network, so that the target policy network is obtained and the accuracy of the target policy network is improved.
In an embodiment, in step S20, that is, performing feature extraction on the text data through a text feature extraction layer in the GPT model to obtain text features, the method includes:
S205: encoding the text data through an encoding unit in the text feature extraction layer to obtain a text encoding vector.
Specifically, after the text data is obtained, it is input into the GPT model, and one-hot encoding is applied to the text data in the GPT model to obtain the word vectors of the text data. The word vectors are then weighted and fused, that is, the semantic vector of the sentence is added to each word vector to obtain the text vector. Finally, the position vector of the text vector is determined through the position function learned by the encoding unit in the GPT model, and the text vector and the position vector corresponding to the same text content are added to obtain the text encoding vector.
S206: performing feature extraction on the text encoding vector through the self-attention unit in the text feature extraction layer to obtain an attention vector.
S207: predicting the attention vector through a feed-forward neural network unit in the text feature extraction layer to obtain the text features.
Understandably, the attention vector is the result of feature extraction on the text encoding vector.
Specifically, attention processing is performed on the text encoding vector through the self-attention unit in the text feature extraction layer: the text encoding vector is first converted into a Q vector, a K vector and a V vector through three transformation matrices. The Q, K and V vectors are then processed through several groups of attention heads, that is, the correlation scores between the Q vectors and the K vectors of different word vectors are computed with a dot product (the Q vector of a word vector is dotted with the K vectors of neighbouring word vectors), the correlation scores are normalized, and masked prediction of the next word vector is performed. The scores are converted into a probability distribution in [0, 1] through a softmax function and multiplied by the corresponding value vectors to obtain the masked prediction result. The masked prediction results of the several attention heads are concatenated and layer-normalized to obtain the attention vector. Further, the attention vector is predicted through the feed-forward neural network unit in the text feature extraction layer, that is, it is propagated forward through the feed-forward network: the attention vector is transformed by multiple hidden layers with different weights and layer-normalized to obtain probability values, and feature extraction is performed on the attention vector according to these probability values to obtain the text features. Text features can be extracted from the text encoding vector by a multi-layer stack of self-attention units and feed-forward neural network units.
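A minimal sketch of the scaled dot-product attention step just described follows: Q, K and V come from three transformation matrices, the dot-product scores are normalized with softmax, and a causal mask restricts each position to earlier positions. The shapes of the inputs are illustrative assumptions.

```python
import math
import torch

def causal_self_attention(x: torch.Tensor, w_q: torch.Tensor,
                          w_k: torch.Tensor, w_v: torch.Tensor) -> torch.Tensor:
    q, k, v = x @ w_q, x @ w_k, x @ w_v                  # three transformation matrices
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    mask = torch.triu(torch.ones_like(scores), diagonal=1).bool()
    scores = scores.masked_fill(mask, float("-inf"))     # left-to-right masking
    weights = torch.softmax(scores, dim=-1)              # probabilities in [0, 1]
    return weights @ v                                   # weighted value vectors
```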
In this embodiment of the invention, the text data is encoded through the encoding unit, so that both content encoding and position encoding of the text are realized and the text encoding vector is obtained. Feature extraction is performed on the text encoding vector through the self-attention unit, so that the attention vector is obtained and left-to-right prediction is realized. The attention vector is predicted through the feed-forward neural network unit, so that the text features are extracted and the accuracy of subsequent emotion recognition is improved.
In an embodiment, in step S40, that is, performing emotion recognition on the text feature and the audio feature through the emotion recognition layer in the GPT model, to obtain an emotion recognition result, including:
S401: performing emotion recognition on the text features through a first recognition network in the emotion recognition layer to obtain a text emotion.
Understandably, both the first recognition network and the second recognition network are built on the BERT model, which focuses more on semantic understanding. The text emotion is the emotion result recognized from the text features, and the audio emotion is the emotion result recognized from the audio features. Both may consist of several emotions with probabilities, for example an 85% probability of happy, a 79% probability of another emotion, and so on.
Specifically, after the text features and the audio features are obtained, emotion recognition is performed on them through the emotion recognition layer. Emotion recognition is performed on the text features through the first recognition network in the emotion recognition layer: the input layer of the first recognition network encodes the text features into word vectors, sentence vectors and position vectors, that is, the text features are embedded. The word vectors of the text features are determined, with the two special marker tokens CLS and SEP added at the beginning of the text features and after each segment. The word vectors are then weighted and fused, that is, the semantic vector of the sentence is added to each word vector to obtain the sentence vector. Finally, the position vectors of the word vectors and sentence vectors are determined through the learned position function, and the word vector, sentence vector and position vector corresponding to the same text content are added to obtain the text input vector. Semantic understanding is then performed on the text input vector through the attention layer, that is, emotion recognition is performed on the text input vector through bidirectional Transformers and prediction is performed through a feed-forward neural network, so that the text emotion is obtained.
S402: performing emotion recognition on the audio features through a second recognition network in the emotion recognition layer to obtain an audio emotion.
S403: determining the emotion recognition result according to the text emotion and the audio emotion corresponding to the same audio data.
Specifically, emotion recognition is performed on the audio features through the second recognition network in the emotion recognition layer: the audio features are encoded through the input layer of the second recognition network to obtain an audio input vector. Semantic understanding is then performed on the audio input vector through the attention layer of the second recognition network, that is, emotion recognition is performed on the audio input vector through bidirectional Transformers and prediction is performed through a feed-forward neural network, so that the audio emotion is obtained. The recognition processes of the first recognition network and the second recognition network are not described in detail here; reference may be made to the recognition process of the BERT model. Further, the emotion recognition result is determined according to the text emotion and the audio emotion corresponding to the same audio data: a first preset weight and a second preset weight are obtained, the product of the first preset weight and the text emotion and the product of the second preset weight and the audio emotion are calculated respectively, and the two products are summed to obtain the emotion recognition result. In another embodiment, the text emotion and the audio emotion are each scored, the scores are added, and the emotion with the highest total score is selected as the emotion recognition result.
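A minimal sketch of the weighted fusion described above follows: the per-emotion probabilities from the text branch and the audio branch are combined with two preset weights and the highest-scoring emotion is returned. The weight values and emotion names are illustrative assumptions.

```python
def fuse_emotions(text_probs: dict, audio_probs: dict,
                  w_text: float = 0.6, w_audio: float = 0.4) -> str:
    emotions = set(text_probs) | set(audio_probs)
    fused = {e: w_text * text_probs.get(e, 0.0) + w_audio * audio_probs.get(e, 0.0)
             for e in emotions}
    return max(fused, key=fused.get)            # emotion with the highest fused score

# Example: fuse_emotions({"happy": 0.85, "neutral": 0.10},
#                        {"happy": 0.79, "angry": 0.05}) -> "happy"
```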
In this embodiment of the invention, emotion recognition is performed on the text features through the first recognition network, so that the emotion in the text features is recognized and the text emotion is obtained. Emotion recognition is performed on the audio features through the second recognition network, so that the emotion in the audio features is recognized, the audio emotion is obtained, and the emotion recognition result is then determined.
In one embodiment, step S50, that is, performing speaking recommendation on the audio data based on the emotion recognition result to obtain a target script, includes:
S501: performing intention recognition on the audio data and the text data to obtain a recognition intention.
S502: obtaining a preset script database, where the preset script database comprises at least one preset script.
S503: screening the target script out of all the preset scripts based on the emotion recognition result and the recognition intention.
Understandably, the recognition intention is the purpose of the query contained in the audio data. A preset script is an answer generated by the model for a particular kind of question. The target script is the preset script that matches the emotion recognition result and the recognition intention.
Specifically, after the emotion recognition result is obtained, an intention recognition model is obtained, the audio data and the text data are input into it, and intention recognition is performed on the audio data and the text data through the intention recognition model to obtain a text intention recognition result and an audio intention recognition result. The text intention recognition result and the audio intention recognition result can then be multiplied by different weights and summed to obtain the recognition intention corresponding to the audio data. Alternatively, keywords are extracted from the audio data and the text data, that is, the text is first segmented into words and the words are selected by calculating word weights; the extracted words are matched against the keywords under each intention label, and the intention label with the largest number of matched keywords is determined as the recognition intention.
Further, the preset script database is obtained from the database, the preset script database comprising at least one preset script. Semantic matching is performed between the recognition intention and each preset script, that is, the semantic similarity between the recognition intention and each preset script is calculated, for example as the Euclidean distance or the cosine similarity between them, to obtain a similarity value. A semantic threshold is obtained and all similarity values are compared with it: when a similarity value is smaller than the semantic threshold, the corresponding preset script is discarded; when it is greater than or equal to the semantic threshold, the corresponding preset script is retained. Emotion screening is then applied to all retained preset scripts through the emotion recognition result, that is, the semantic similarity between the emotion recognition result and each retained preset script is calculated, and the preset script that best matches the emotion recognition result is taken as the target script. For example, in an insurance scenario, when a user asks about a product they are interested in, their speech emotion tends to be more pleased, so the salesperson can be given an appropriate script recommendation, which helps the user complete the purchase.
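A minimal sketch of this screening step follows: preset scripts are first filtered by cosine similarity between the intention embedding and each script embedding, then the script whose embedding is closest to the emotion embedding is returned. The embedding function is passed in as a placeholder, and the threshold value is an illustrative assumption.

```python
from typing import Callable
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def select_script(intent_vec: np.ndarray, emotion_vec: np.ndarray,
                  scripts: list[str], embed: Callable[[str], np.ndarray],
                  threshold: float = 0.5) -> str:
    # Keep scripts whose similarity to the intention reaches the semantic threshold.
    candidates = [s for s in scripts if cosine(intent_vec, embed(s)) >= threshold] or scripts
    # Among intent-matching scripts, pick the one closest to the emotion.
    return max(candidates, key=lambda s: cosine(emotion_vec, embed(s)))
```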
In this embodiment of the invention, intention recognition is performed on the audio data and the text data, so that the recognition intention is determined, the target script is screened out, and the accuracy of script recommendation is improved.
It should be understood that the sequence number of each step in the foregoing embodiment does not mean the sequence of execution, and the execution sequence of each process should be determined by the function and the internal logic, and should not limit the implementation process of the embodiment of the present invention in any way.
In an embodiment, a speaking recommendation apparatus is provided, and the speaking recommendation apparatus corresponds one-to-one to the speaking recommendation method in the above embodiment. As shown in FIG. 5, the speaking recommendation apparatus includes a voice recognition module 11, a text feature module 12, an audio feature module 13, an emotion recognition module 14, and a speaking recommendation module 15. The functional modules are described in detail as follows:
a voice recognition module 11, configured to acquire audio data and perform voice recognition on the audio data to obtain text data;
a text feature module 12, configured to input the text data into a GPT model and perform feature extraction on the text data through a text feature extraction layer in the GPT model to obtain text features;
an audio feature module 13, configured to input the audio data into the GPT model and perform feature extraction on the audio data through a voice feature extraction layer in the GPT model to obtain audio features;
an emotion recognition module 14, configured to perform emotion recognition on the text features and the audio features through an emotion recognition layer in the GPT model to obtain an emotion recognition result;
and a speaking recommendation module 15, configured to perform speaking recommendation on the audio data based on the emotion recognition result to obtain a target script.
For specific limitations of the speaking recommendation apparatus, reference may be made to the limitations of the speaking recommendation method above, which are not repeated here. Each module in the above speaking recommendation apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in, or independent of, the processor of the computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can call and execute the operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a server, the internal structure of which may be as shown in fig. 6. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is used to store the data used in the speaking recommendation method in the above embodiment. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a speech recommendation method.
In one embodiment, a computer device is provided that includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the above speaking recommendation method when executing the computer program.
In one embodiment, a computer readable storage medium is provided, the computer readable storage medium storing a computer program which, when executed by a processor, implements the above speaking recommendation method.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the various embodiments provided herein may include non-volatile and/or volatile memory. The nonvolatile memory can include Read Only Memory (ROM), programmable ROM (PROM), electrically Programmable ROM (EPROM), electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double Data Rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous Link DRAM (SLDRAM), memory bus direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), among others.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions.
The above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and are intended to be included in the scope of the present invention.

Claims (10)

1. A speaking recommendation method, comprising:
acquiring audio data, and performing voice recognition on the audio data to obtain text data;
inputting the text data into a GPT model, and performing feature extraction on the text data through a text feature extraction layer in the GPT model to obtain text features;
inputting the audio data into the GPT model, and performing feature extraction on the audio data through a voice feature extraction layer in the GPT model to obtain audio features;
performing emotion recognition on the text features and the audio features through an emotion recognition layer in the GPT model to obtain an emotion recognition result;
and performing speaking recommendation on the audio data based on the emotion recognition result to obtain a target script.
2. The speaking recommendation method as claimed in claim 1, wherein said performing emotion recognition on the text feature and the audio feature by the emotion recognition layer in the GPT model to obtain an emotion recognition result comprises:
carrying out emotion recognition on the text features through a first recognition network in the emotion recognition layer to obtain text emotion;
carrying out emotion recognition on the audio features through a second recognition network in the emotion recognition layer to obtain audio emotion;
and determining an emotion recognition result according to the text emotion and the audio emotion corresponding to the same audio data.
3. The speaking recommendation method as claimed in claim 1, wherein the performing feature extraction on the text data through the text feature extraction layer in the GPT model to obtain text features comprises:
encoding the text data through an encoding unit in the text feature extraction layer to obtain a text encoding vector;
performing feature extraction on the text encoding vector through a self-attention unit in the text feature extraction layer to obtain an attention vector;
and predicting the attention vector through a feedforward neural network unit in the text feature extraction layer to obtain text features.
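The claim-3 text path (encoding unit, then self-attention unit, then feedforward unit) mirrors a standard transformer block. A compact single-head sketch in plain numpy; the hash-based token encoding and the tiny dimensions are illustrative assumptions, not the disclosed layer shapes.

```python
# Single-head self-attention + feedforward sketch for the claim-3 text path;
# the toy token encoding and the small dimensions are illustrative assumptions.
import numpy as np

D = 16  # embedding width (arbitrary)
rng = np.random.default_rng(1)
Wq, Wk, Wv = (rng.normal(scale=0.1, size=(D, D)) for _ in range(3))
W1 = rng.normal(scale=0.1, size=(D, 4 * D))
W2 = rng.normal(scale=0.1, size=(4 * D, D))

def encode(text: str) -> np.ndarray:
    # "Encoding unit": map each token to a deterministic pseudo-embedding.
    return np.stack([np.random.default_rng(abs(hash(tok)) % 2**32).normal(size=D)
                     for tok in text.split()])

def self_attention(x: np.ndarray) -> np.ndarray:
    # "Self-attention unit": scaled dot-product attention over the tokens.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(D)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def feedforward(x: np.ndarray) -> np.ndarray:
    # "Feedforward neural network unit": position-wise two-layer MLP.
    return np.maximum(x @ W1, 0) @ W2

tokens = encode("please check my policy status")          # text encoding vectors
text_features = feedforward(self_attention(tokens)).mean(axis=0)
print(text_features.shape)  # (16,) pooled text feature vector
```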
4. The speaking recommendation method as claimed in claim 1, wherein the performing speaking recommendation on the audio data based on the emotion recognition result to obtain a target speaking comprises:
performing intention recognition on the audio data and the text data to obtain a recognition intention;
acquiring a preset speaking database, wherein the preset speaking database comprises at least one preset speaking;
and screening out the target speaking from all the preset speakings based on the emotion recognition result and the recognition intention.
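Claim 4 screens a preset speaking database by both the recognized emotion and the recognized intention. A minimal sketch, assuming each preset speaking is stored with the emotion and intention tags it targets; that tag-based storage format is an assumption made for illustration.

```python
# Filtering a preset speaking database by emotion and intention (claim 4);
# the tag-based record format is an assumption made for this sketch.
from dataclasses import dataclass

@dataclass
class PresetSpeaking:
    text: str
    emotions: set[str]    # emotions this script is suited to
    intentions: set[str]  # intentions this script addresses

DATABASE = [
    PresetSpeaking("I'm sorry about the delay; I'll escalate this now.",
                   {"negative"}, {"complaint"}),
    PresetSpeaking("Happy to help: your policy renews on the 1st.",
                   {"neutral", "positive"}, {"policy_inquiry"}),
]

def recommend(emotion: str, intention: str) -> list[str]:
    # Screen out the target speakings that match both signals.
    return [p.text for p in DATABASE
            if emotion in p.emotions and intention in p.intentions]

print(recommend("negative", "complaint"))
```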
5. The speaking recommendation method of claim 1, wherein, before the inputting the text data into a GPT model, the method further comprises:
obtaining a sample data set, wherein the sample data set comprises at least one sample data and a sample label corresponding to the sample data;
acquiring a preset training model, and performing emotion recognition on the sample data through the preset training model to obtain an emotion label;
determining a predicted loss value of the preset training model according to the emotion label and the sample label corresponding to the same sample data;
and performing parameter adjustment on the preset training model through a target policy network and the predicted loss value until the preset training model converges, and recording the converged preset training model as the GPT model.
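Claim 5 amounts to supervised training of the preset training model on labelled samples until convergence. The sketch below shows only that supervised part (prediction, loss, parameter adjustment, convergence check), with a tiny logistic model standing in for the real training model; the policy-network mediation is sketched separately under claim 6.

```python
# Supervised training sketch for claim 5: predict emotion labels, compute a
# loss, and adjust parameters until the loss stops improving. The logistic
# model and the convergence test are illustrative stand-ins.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(64, 10))                   # sample data (features)
y = (X[:, 0] + X[:, 1] > 0).astype(float)       # sample labels (0/1 emotion)

w = np.zeros(10)
prev_loss, lr = np.inf, 0.5
for step in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))          # emotion label predictions
    loss = -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))
    if abs(prev_loss - loss) < 1e-6:            # crude convergence check
        break
    prev_loss = loss
    w -= lr * X.T @ (p - y) / len(y)            # parameter adjustment step
print(f"stopped at step {step}, loss {loss:.4f}")
```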
6. The speaking recommendation method of claim 5, wherein, before the performing parameter adjustment on the preset training model through the target policy network and the predicted loss value until the preset training model converges, the method further comprises:
obtaining a confidence value of the prediction made on the sample data by an emotion recognition model;
predicting, through a preset policy network, the sample data and the confidence value corresponding to the sample data to obtain a sequence probability distribution;
and updating parameters of the preset policy network through a policy gradient algorithm and the sequence probability distribution to obtain the target policy network.
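Claim 6 updates the policy network with a policy gradient driven by the recognition confidence values. The REINFORCE-style sketch below treats the confidence value as the reward signal; that reward choice, the two-action policy, and the softmax parameterization are assumptions made for illustration, not details taken from the disclosure.

```python
# REINFORCE-style policy-gradient sketch for claim 6; using the recognition
# confidence as the reward and a two-action softmax policy are assumptions.
import numpy as np

rng = np.random.default_rng(3)
theta = np.zeros((2, 5))   # preset policy network: 5-dim input, 2 actions
lr = 0.1

def policy(x):
    # Probability distribution over actions (the claim's "sequence probability
    # distribution", reduced here to a single decision step).
    logits = theta @ x
    e = np.exp(logits - logits.max())
    return e / e.sum()

for episode in range(200):
    x = rng.normal(size=5)                 # sample data features
    confidence = rng.uniform(0.2, 1.0)     # emotion model's prediction confidence
    probs = policy(x)
    action = rng.choice(2, p=probs)
    reward = confidence if action == 1 else 1.0 - confidence
    # Policy gradient: grad of log pi(action | x), scaled by the reward.
    grad_log = -probs[:, None] * x[None, :]
    grad_log[action] += x
    theta += lr * reward * grad_log        # update the preset policy network

print(policy(np.ones(5)))                  # resulting (target) policy's output
```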
7. The speaking recommendation method of claim 1, further comprising, before the acquiring audio data:
acquiring original voice data, and performing framing processing on the original voice data to obtain at least one piece of framed data;
performing endpoint detection on all the framed data to obtain a starting point and an ending point corresponding to each piece of framed data;
and denoising the original voice data according to the starting points and the ending points of all the framed data to obtain the audio data.
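Claim 7's preprocessing (framing, endpoint detection, denoising) can be approximated by splitting the waveform into fixed-length frames and keeping only frames whose energy clears a threshold. The frame length and the energy gate below are assumptions standing in for a real voice-activity-detection and denoising front end.

```python
# Framing + energy-based endpoint detection sketch for claim 7; the frame
# size and the energy threshold are illustrative assumptions, not a real VAD.
import numpy as np

def preprocess(voice: np.ndarray, frame_len: int = 160) -> np.ndarray:
    n_frames = len(voice) // frame_len
    frames = voice[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).mean(axis=1)
    keep = energy > 0.1 * energy.max()   # energy gate approximating start/end points
    return frames[keep].reshape(-1)      # concatenate retained frames; noise dropped

rng = np.random.default_rng(4)
silence = 0.01 * rng.normal(size=800)
speech = np.sin(np.linspace(0, 200, 1600)) + 0.01 * rng.normal(size=1600)
raw = np.concatenate([silence, speech, silence])
print(len(raw), "->", len(preprocess(raw)))  # silent frames are removed
```

A production system would typically use spectral or model-based voice activity detection instead of a raw energy gate, but the data flow matches the claim's framing-then-endpoint-then-denoise order.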
8. A speaking recommendation apparatus, comprising:
the voice recognition module is used for acquiring audio data and performing voice recognition on the audio data to obtain text data;
the text feature module is used for inputting the text data into a GPT model, and extracting features of the text data through a text feature extraction layer in the GPT model to obtain text features;
the audio feature module is used for inputting the audio data into a GPT model, and extracting features of the audio data through a voice feature extraction layer in the GPT model to obtain audio features;
the emotion recognition module is used for performing emotion recognition on the text features and the audio features through an emotion recognition layer in the GPT model to obtain an emotion recognition result;
and the speaking recommendation module is used for performing speaking recommendation on the audio data based on the emotion recognition result to obtain a target speaking.
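The claim-8 apparatus packages the claim-1 steps into cooperating modules. A minimal structural sketch; the class and method names are illustrative choices rather than terms from the disclosure, and the stage functions are assumed to be the stand-ins from the claim-1 sketch.

```python
# Module decomposition sketch for the claim-8 apparatus; class names and the
# delegation to the earlier claim-1 stand-in functions are assumptions.
class SpeakingRecommendationDevice:
    """Bundles the voice recognition, feature, emotion, and recommendation modules."""

    def __init__(self, asr, text_features, audio_features, emotion, recommender):
        self.asr = asr                      # voice recognition module
        self.text_features = text_features  # text feature module
        self.audio_features = audio_features  # audio feature module
        self.emotion = emotion              # emotion recognition module
        self.recommender = recommender      # speaking recommendation module

    def run(self, audio: bytes) -> str:
        text = self.asr(audio)
        emotion = self.emotion(self.text_features(text), self.audio_features(audio))
        return self.recommender(emotion)

# Usage with the stand-ins from the claim-1 sketch:
# device = SpeakingRecommendationDevice(speech_to_text, extract_text_features,
#                                       extract_audio_features, recognize_emotion,
#                                       recommend_speaking)
# print(device.run(b"\x10\x80\x7f\x20"))
```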
9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the speaking recommendation method according to any one of claims 1 to 7 when executing the computer program.
10. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the speaking recommendation method according to any one of claims 1 to 7.
CN202311630552.6A 2023-11-30 2023-11-30 Speaking recommendation method and device, computer equipment and storage medium Pending CN117609574A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311630552.6A CN117609574A (en) 2023-11-30 2023-11-30 Speaking recommendation method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311630552.6A CN117609574A (en) 2023-11-30 2023-11-30 Speaking recommendation method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117609574A true CN117609574A (en) 2024-02-27

Family

ID=89955851

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311630552.6A Pending CN117609574A (en) 2023-11-30 2023-11-30 Speaking recommendation method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117609574A (en)

Similar Documents

Publication Publication Date Title
CN110287283B (en) Intention model training method, intention recognition method, device, equipment and medium
US10559225B1 (en) Computer-implemented systems and methods for automatically generating an assessment of oral recitations of assessment items
CN107610709B (en) Method and system for training voiceprint recognition model
CN111312245B (en) Voice response method, device and storage medium
CN113094578B (en) Deep learning-based content recommendation method, device, equipment and storage medium
CN111897935B (en) Knowledge graph-based conversational path selection method and device and computer equipment
Bai et al. Learn spelling from teachers: Transferring knowledge from language models to sequence-to-sequence speech recognition
CN111402891A (en) Speech recognition method, apparatus, device and storage medium
US11961515B2 (en) Contrastive Siamese network for semi-supervised speech recognition
CN111930914A (en) Question generation method and device, electronic equipment and computer-readable storage medium
CN113646835A (en) Joint automatic speech recognition and speaker binarization
CN112233680A (en) Speaker role identification method and device, electronic equipment and storage medium
CN113254613A (en) Dialogue question-answering method, device, equipment and storage medium
CN115497465A (en) Voice interaction method and device, electronic equipment and storage medium
CN111159405B (en) Irony detection method based on background knowledge
CN114220461A (en) Customer service call guiding method, device, equipment and storage medium
Orken et al. Identifying the influence of transfer learning method in developing an end-to-end automatic speech recognition system with a low data level
CN117524202A (en) Voice data retrieval method and system for IP telephone
Bai et al. Integrating knowledge into end-to-end speech recognition from external text-only data
KR20210123545A (en) Method and apparatus for conversation service based on user feedback
US11989514B2 (en) Identifying high effort statements for call center summaries
CN115985320A (en) Intelligent device control method and device, electronic device and storage medium
Deng et al. History utterance embedding transformer lm for speech recognition
CN117609574A (en) Speaking recommendation method and device, computer equipment and storage medium
CN113470617A (en) Speech recognition method, electronic device and storage device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination