CN113488025B - Text generation method, device, electronic equipment and readable storage medium - Google Patents

Text generation method, device, electronic equipment and readable storage medium

Info

Publication number
CN113488025B
CN113488025B (application CN202110794502.6A)
Authority
CN
China
Prior art keywords
voice
voice data
feature
matrix
preset
Prior art date
Legal status
Active
Application number
CN202110794502.6A
Other languages
Chinese (zh)
Other versions
CN113488025A (en)
Inventor
姚晓颖
Current Assignee
Vivo Mobile Communication Hangzhou Co Ltd
Original Assignee
Vivo Mobile Communication Hangzhou Co Ltd
Priority date
Filing date
Publication date
Application filed by Vivo Mobile Communication Hangzhou Co Ltd filed Critical Vivo Mobile Communication Hangzhou Co Ltd
Priority to CN202110794502.6A priority Critical patent/CN113488025B/en
Publication of CN113488025A publication Critical patent/CN113488025A/en
Application granted granted Critical
Publication of CN113488025B publication Critical patent/CN113488025B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/26 Speech to text systems
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a text generation method and apparatus, an electronic device, and a readable storage medium, belonging to the field of communication. The method includes the following steps: acquiring voice data; identifying feature information of the voice data, where the feature information includes at least one of a mood feature and a character feature; extracting key voice of the voice data based on the feature information of the voice data; and displaying text corresponding to the voice data based on the key voice.

Description

Text generation method, device, electronic equipment and readable storage medium
Technical Field
The application belongs to the technical field of communication, and in particular relates to a text generation method and apparatus, an electronic device, and a readable storage medium.
Background
Text summaries for scenarios such as conferences and courses generally need to be compiled from recorded audio or video files. At present, an audio or video file is typically processed by converting all of its voice data into text data, and the text generated in this way is lengthy.
It can be seen that text generated from voice data in the prior art is lengthy.
Disclosure of Invention
The embodiments of the present application aim to provide a text generation method and apparatus, an electronic device, and a readable storage medium, which can solve the problem that text generated from voice data in the prior art is lengthy.
In a first aspect, an embodiment of the present application provides a text generating method, where the method includes:
Acquiring voice data;
Identifying characteristic information of the voice data, wherein the characteristic information comprises at least one of a mood characteristic and a character characteristic;
Extracting key voice of the voice data based on the characteristic information of the voice data;
And displaying the text corresponding to the voice data based on the key voice.
In a second aspect, an embodiment of the present application provides a text generating apparatus, including:
The acquisition module is used for acquiring voice data;
The recognition module is used for recognizing characteristic information of the voice data, wherein the characteristic information comprises at least one of a mood characteristic and a character characteristic;
The extraction module is used for extracting key voices of the voice data based on the characteristic information of the voice data;
And the generation module is used for displaying the text corresponding to the voice data based on the key voice.
In a third aspect, an embodiment of the present application provides an electronic device, including a processor, a memory, and a program or instructions stored in the memory and executable on the processor, where the program or instructions, when executed by the processor, implement the steps of the method according to the first aspect.
In a fourth aspect, an embodiment of the present application provides a readable storage medium having stored thereon a program or instructions which, when executed by a processor, implement the steps of the method according to the first aspect.
In a fifth aspect, an embodiment of the present application provides a chip, where the chip includes a processor and a communication interface, where the communication interface is coupled to the processor, and where the processor is configured to execute a program or instructions to implement a method according to the first aspect.
In the embodiments of the application, voice data is acquired; feature information of the voice data is identified, where the feature information includes at least one of a mood feature and a character feature; key voice of the voice data is extracted based on the feature information of the voice data; and text corresponding to the voice data is displayed based on the key voice. In this way, the key voice is extracted based on the feature information of the voice data, and the text corresponding to the voice data can be displayed based on the key voice; compared with text obtained by converting all of the voice data in the prior art, the resulting text is more concise and its key points are more prominent.
Drawings
FIG. 1 is one of the flowcharts of a text generation method provided by an embodiment of the present application;
FIG. 2 is a schematic diagram of dividing sentence-level speech segments according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a language prediction model according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a method for determining a high-frequency phrase according to an embodiment of the present application;
FIG. 5 is a second flowchart of a text generation method according to an embodiment of the present application;
FIG. 6 is a third flowchart of a text generation method according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a display interface according to an embodiment of the present application;
FIG. 8 is a fourth flowchart of a text generation method according to an embodiment of the present application;
FIG. 9 is a fifth flowchart of a text generation method according to an embodiment of the present application;
FIG. 10 is a flowchart of a text generation method according to an embodiment of the present application;
FIG. 11 is a block diagram of a text generating apparatus according to an embodiment of the present application;
FIG. 12 is one of the block diagrams of an electronic device provided by an embodiment of the present application;
FIG. 13 is a second block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly described below with reference to the drawings in the embodiments of the present application. It is apparent that the described embodiments are some, but not all, of the embodiments of the present application. All other embodiments obtained by a person skilled in the art based on the embodiments of the present application fall within the scope of protection of the present application.
The terms "first", "second", and the like in the description and the claims are used to distinguish between similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that the data so used may be interchanged where appropriate, so that the embodiments of the present application can be implemented in orders other than those illustrated or described herein. The objects distinguished by "first", "second", and the like are generally of one type, and the number of objects is not limited; for example, the first object may be one or more than one. Furthermore, "and/or" in the description and the claims indicates at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the associated objects.
The text generation method provided by the embodiment of the application is described in detail through specific embodiments and application scenes thereof with reference to the accompanying drawings.
Referring to fig. 1, fig. 1 is a flowchart of a text generating method according to an embodiment of the present application.
As shown in fig. 1, the method comprises the steps of:
step 101, voice data are acquired.
In a specific implementation, the voice data may be voice data collected in real time, or voice data from a pre-collected audio file or video file; this may be determined according to the actual situation, and the embodiment of the present application is not limited herein.
Step 102, identifying characteristic information of the voice data, wherein the characteristic information comprises at least one of a mood characteristic and a character characteristic.
In the embodiments of the application, a mood feature may be a feature that characterizes the emotional color and emotional weight of voice content. The emotion expressed by different voice content can be determined based on the mood feature, and thus the importance of different voice content can be determined; generally, voice content with a stronger emotional color expresses stronger emotion and is more likely to be important or key voice content. The key voice extracted based on the mood feature may be voice content in the voice data that requires particular attention, or summarizing or guiding voice content.
In a specific implementation, the mood feature may be determined from the audio pitch of the voice data. For example, voice with a faster pace and a higher pitch has a more pronounced mood and a stronger emotional color, and may be summarizing voice content or voice content requiring particular attention; such voice can be extracted as key voice.
The mood feature may also be determined from phrases in the voice data that express mood, that is, phrases whose word sense carries emotional color and weight. For example, exclamatory words such as "how" and "what", and similar adjectives or adverbs, carry strong emotional color, and voice content containing such phrases may be summarizing voice content or voice content requiring particular attention; such voice can be extracted as key voice.
The mood feature may also be determined from the sentence pattern of the voice data, where sentence patterns include declarative, interrogative, imperative, exclamatory, and the like, and different sentence patterns express different emotional colors. For example, voice content in an interrogative pattern may introduce the topics of a meeting or the issues to be discussed next, and voice content in an imperative pattern may describe work that needs to be completed or content requiring particular attention; such voice can be extracted as key voice.
It should be appreciated that the implementation of the mood feature is not limited thereto, and the embodiments of the present application are not limited herein.
In the embodiments of the application, a character feature may be a feature of the speaker corresponding to the voice content, for example, the identity of the speaker. Generally, the more important the speaker represented by the character feature, the more important the voice content of that speaker. The key voice extracted based on the character features of the voice data may be voice content in the voice data that needs to be analyzed with emphasis, or summarizing or guiding voice content.
In a specific implementation, the character feature may be determined from the identity of the speaker corresponding to the voice content. The identity may be determined based on manual annotation by the user, or based on the voiceprint feature of the voice content; for example, after the voiceprint feature of the voice content is identified, the identity of the corresponding speaker can be determined through a preconfigured correspondence between voiceprint features and identities. The identity may also be determined based on a preset rule, for example, based on the position of the voice content within the voice data, where voice content located later corresponds to a more important identity.
Step 103, extracting key voices of the voice data based on the characteristic information of the voice data.
In the embodiment of the present application, key voices may be extracted from the voice data based on the at least one feature information, so that voice contents corresponding to the extracted key voices may be topics, key points, key contents, etc. of the voice data, and then step 104 is continued.
And 104, displaying the text corresponding to the voice data based on the key voice.
In a specific implementation, the key voice may be organized, converted into text, and displayed.
Optionally, organizing the key voice may include de-duplicating the key voice, and may further include at least one of the following: 1) sorting the de-duplicated key voice in chronological order; 2) classifying the de-duplicated key voice based on preset key phrases; for example, if the preset key phrases of the voice data are "venue", "materials", "guests", and "security", the key voice associated with each of these phrases can be determined separately, so that the key voice is classified; 3) classifying the de-duplicated key voice based on voiceprint features, that is, grouping key voice with the same voiceprint feature into one class. It should be appreciated that the implementation of organizing the key voice is not limited thereto and may be determined according to the actual situation; the embodiments of the present application are not limited herein.
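For illustration, the organizing step described above could look like the following Python sketch; the segment data model (text, start time, voiceprint id), the grouping keys, and the de-duplication rule are assumptions for the example rather than details taken from this embodiment.

```python
# Minimal sketch of organizing key voice segments: de-duplicate, then either
# keep chronological order, group by preset key phrases, or group by voiceprint.
from collections import OrderedDict

def organize_key_segments(segments, preset_keywords=None, group_by="time"):
    """segments: list of dicts like {"text": ..., "start": ..., "voiceprint": ...}."""
    # 1) De-duplicate segments with identical text, keeping the earliest one.
    seen, deduped = set(), []
    for seg in sorted(segments, key=lambda s: s["start"]):
        if seg["text"] not in seen:
            seen.add(seg["text"])
            deduped.append(seg)

    if group_by == "time":                              # 2) chronological ordering
        return {"all": deduped}

    groups = OrderedDict()
    if group_by == "keyword" and preset_keywords:       # 3) group by preset key phrases
        for kw in preset_keywords:
            groups[kw] = [s for s in deduped if kw in s["text"]]
    elif group_by == "voiceprint":                      # 4) group by speaker voiceprint
        for seg in deduped:
            groups.setdefault(seg.get("voiceprint", "unknown"), []).append(seg)
    return groups
```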
When processing the voice data, the voice data may first be converted into text data by voice recognition and the feature information then identified from the text data; or the feature information of the voice data may be identified first and the key voice converted into text after it has been determined; whether to identify the feature information first or to convert the voice into text first may be decided according to the actual situation. In the embodiments of the application, the description uses the example of identifying the feature information of the voice data first and converting the key voice into text after it has been extracted.
The text generation method provided by the embodiments of the application acquires voice data, identifies the feature information of the voice data, extracts key voice of the voice data based on the feature information, and displays text corresponding to the voice data based on the key voice. In this way, the key voice is extracted based on the feature information of the voice data and the corresponding text can be displayed based on the key voice; compared with text obtained by converting all of the voice data in the prior art, the resulting text is more concise and its key points are more prominent.
In the embodiments of the application, key voice can be extracted by identifying at least one of the mood feature and the character feature; the two cases are described separately below.
1) The case where the feature information includes a mood feature.
In this case, optionally, the identifying feature information of the voice data includes:
Identifying a phrase used for representing the mood in the voice data, and determining the mood characteristic of the voice data based on the phrase used for representing the mood;
or, identifying a voice feature of the voice data, and determining the mood feature of the voice data based on the voice feature of the voice data.
In a first embodiment, the identifying characteristic information of the voice data includes: and recognizing a phrase used for representing the mood in the voice data, and determining the mood characteristic of the voice data based on the phrase used for representing the mood.
In a specific implementation, a phrase may be a word, a term, or a short phrase. By splitting the voice data into phrases, a plurality of phrases contained in the voice data can be obtained, which are usually noun phrases or verb phrases. Alternatively, a lexicon of mood words may be preset, containing mood words that express various moods; for example, words expressing an emphasized mood include adjectives or adverbs such as "very", "important", and "special". Based on the preset mood words in this lexicon, phrases matching a preset mood word can be identified in the voice data, that is, phrases identical to a preset mood word, or identical or similar to one in word sense. If a phrase matching any word in the preset lexicon is identified in the voice data, the mood feature of the voice data may be determined based on the mood expressed by that mood word.
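A minimal sketch of such lexicon matching, assuming the voice data has already been converted to text and the preset mood-word lexicon is a plain set of strings (the lexicon contents and the tokenization below are illustrative only):

```python
# Sketch of matching a text fragment against a preset mood-word lexicon.
PRESET_MOOD_WORDS = {"very", "important", "special", "must", "crucial"}

def emphasized_mood_words(text_fragment, lexicon=PRESET_MOOD_WORDS):
    """Return the preset mood words that appear in the fragment."""
    tokens = text_fragment.lower().split()   # a real system would use a proper tokenizer
    return [tok for tok in tokens if tok in lexicon]

# A fragment that hits the lexicon can then have its mood feature set accordingly.
print(emphasized_mood_words("this promotion is very important"))  # ['very', 'important']
```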
It should be noted that, optionally, the voice data may be divided into a plurality of voice segments before being processed. A voice segment may be a word or phrase in the voice data, a sentence, or a paragraph. In an exemplary implementation, the voice data may be pre-divided into a plurality of sentence-level voice segments based on a preset division rule, where the preset division rule may be determined based on voice pauses in the voice data or based on sentence structure; this may be determined according to the actual situation, and the embodiments of the present application are not limited herein. If a phrase matching any word in the preset lexicon is identified in a certain voice segment, the mood feature of that voice segment may be determined based on the mood expressed by that mood word.
For ease of understanding, the division of sentence-level voice segments based on voice pauses is described below. As shown in fig. 2, when dividing sentence-level voice segments, sentences may be divided based on pauses in the audio waveform corresponding to the voice data: when the decibel level of the voice data stays below a preset decibel threshold for more than a certain time, the voice data between line A and line B in the figure can be determined to be a pause, the voice data before line A can be determined to be one sentence-level voice segment, and the voice data after line B can be determined to be another sentence-level voice segment.
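As an illustration, pause-based division like that of fig. 2 might be sketched as follows; the frame length, the decibel threshold (expressed here relative to the loudest frame), and the minimum pause duration are placeholder assumptions, not values from this embodiment:

```python
import numpy as np

def split_on_pauses(samples, sr, db_threshold=-40.0, min_pause_ms=700, frame_ms=20):
    """Split a mono waveform into sentence-level segments at long low-energy pauses."""
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames.astype(np.float64) ** 2, axis=1)) + 1e-12
    db = 20 * np.log10(rms / np.max(rms))        # dB relative to the loudest frame
    quiet = db < db_threshold

    segments, start = [], 0
    pause_frames = int(min_pause_ms / frame_ms)
    i = 0
    while i < n_frames:
        if quiet[i]:
            j = i
            while j < n_frames and quiet[j]:
                j += 1
            if j - i >= pause_frames:            # pause long enough: close a segment
                if i > start:
                    segments.append((start * frame_len, i * frame_len))
                start = j
            i = j
        else:
            i += 1
    if start < n_frames:
        segments.append((start * frame_len, n_frames * frame_len))
    return segments                               # list of (start_sample, end_sample)
```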
In a second embodiment, the identifying characteristic information of the voice data includes: and recognizing the voice characteristics of the voice data, and determining the language characteristics of the voice data based on the voice characteristics of the voice data.
In a specific implementation, the voice feature may be identified based on information such as the sound features, audio spectrum features, and voiceprint features of the voice data.
In this embodiment, optionally, the identifying the voice feature of the voice data and determining the mood feature of the voice data based on the voice feature of the voice data includes:
dividing the voice data into a plurality of voice segments;
determining voice feature vectors of N voice frames in the voice fragments based on voice features of the N voice frames in the voice fragments, wherein N is a positive integer;
inputting the voice feature vectors of the N voice frames into a pre-trained mood prediction model to predict the mood level at which the voice segment contains a preset mood;
obtaining a target vector output by the mood prediction model, where the target vector is used to represent the mood level at which the voice segment contains the preset mood;
and determining the mood characteristics of the voice fragments based on the target vector.
In this implementation, the voice data may be divided into a plurality of voice segments, and the voice feature vectors may be extracted from N voice frames of a voice segment. In a specific implementation, the N voice frames may be determined based on the frame length of the voice data; for example, when the frame length is 20 ms, every 20 ms of voice data forms one frame. For each voice frame, feature extraction may be performed in k dimensions to obtain a 1×k voice feature vector, where k is a positive integer, and the voice feature vectors of the N voice frames of one voice segment may form an N×k voice feature matrix to be used as the input matrix of the mood prediction model. In this implementation, the k dimensions are described using k low-level descriptors (LLDs) as an example, which is not limiting.
The k low-level descriptors may be customized in advance. Optionally, the k low-level descriptors are determined from at least one of sound features and audio spectrum features. The sound features may include, but are not limited to, at least one of a pitch feature, a timbre feature, and a loudness feature, such as the pitch of the sound or the brightness of the sound; the audio spectrum features may include, but are not limited to, at least one of temporal features and frequency features, where the temporal features may include, for example, attack time, spectral centroid, and zero-crossing rate, and the frequency features may include amplitude, fundamental frequency, sinusoidal components, residual components, and the like. In an exemplary implementation with N = 20 and k = 65, the processing flow for a voice segment is as follows: original voice segment → 20 original voice frames → extract the voice feature vector of each voice frame based on the 65 customized LLDs → input the 20 voice feature vectors into the mood prediction model for mood prediction → determine the mood feature.
Alternatively, the k low-level descriptors may be learned by a neural network, such as a convolutional neural network (CNN). In an exemplary implementation with N = 20 and k = 65, the processing flow for a voice segment is as follows: original voice segment → 20 original voice frames → obtain 65 LLDs through CNN training → extract the voice feature vector of each voice frame based on the 65 learned LLDs → input the 20 voice feature vectors into the mood prediction model for mood prediction → determine the mood feature. It should be understood that the implementation form of the k low-level descriptors is not limited thereto and may be determined according to the actual situation; the embodiments of the present application are not limited herein.
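A simplified sketch of assembling the N×k input matrix is shown below; it computes only three hand-rolled descriptors (RMS energy, zero-crossing rate, spectral centroid) in place of the 65 LLDs discussed above, so the shapes and the descriptor set are illustrative assumptions:

```python
import numpy as np

def frame_descriptors(frame, sr):
    """Compute a few illustrative low-level descriptors for one speech frame."""
    frame = frame.astype(np.float64)
    rms = np.sqrt(np.mean(frame ** 2))                           # loudness-like feature
    zcr = np.mean(np.abs(np.diff(np.sign(frame))) > 0)           # zero-crossing rate
    spectrum = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    centroid = np.sum(freqs * spectrum) / (np.sum(spectrum) + 1e-12)  # spectral centroid
    return np.array([rms, zcr, centroid])

def segment_feature_matrix(segment, sr, n_frames=20):
    """Split a segment into n_frames frames and stack per-frame descriptors into an N x k matrix."""
    frame_len = len(segment) // n_frames
    rows = [frame_descriptors(segment[i * frame_len:(i + 1) * frame_len], sr)
            for i in range(n_frames)]
    return np.stack(rows)        # shape (20, 3) here; (20, 65) with the full LLD set
```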
In an alternative implementation, the voice feature vectors may also be decoded by an adversarial autoencoder (AAE) to re-represent the voice frames, where the latent factors of the AAE include the emotional state of the voice segment. Since the voice signal of a voice frame may be determined by multiple latent factors, such as emotional state, age, gender, and speech content, decoding the voice frame with the AAE can infer the latent factors that determine the voice signal, so as to reconstruct the voice frame and re-represent its voice features. The reconstructed voice frame can carry an emotion label, making the determined voice features more distinct. In an exemplary implementation with N = 20 and k = 65, the processing flow for a voice segment is as follows: original voice segment → 20 original voice frames → extract the voice feature vector of each voice frame based on the 65 LLDs → re-represent each voice feature vector with the AAE → input the 20 voice feature vectors into the mood prediction model for mood prediction → determine the mood feature.
The mood prediction model is trained based on N voice feature vectors and the mood levels of voice segments. Specifically, the mood prediction model may be trained for a predetermined mood, and its prediction result is used to predict whether a voice segment contains the preset mood and the mood level at which it does, where the mood level can be understood as the intensity of the preset mood. Taking the preset mood as an emphasized mood as an example, the training samples of the mood prediction model include a plurality of voice segments containing an emphasized mood and a predetermined mood level for each of them. The mood prediction model is trained on these samples so that its prediction result can characterize whether a voice segment contains an emphasized mood and the corresponding mood level, for example a mood level of 1-5, where 1 represents slightly emphasized and 5 represents very emphasized. In other implementations, a mood prediction model for predicting the mood class of a voice segment may be trained in advance, and the prediction result of such a model is used to determine which preset mood the voice segment contains.
Further optionally, the mood prediction model includes an input layer, at least one convolution layer, a fully-connected layer, and an output layer, the fully-connected layer including a first hidden layer and a second hidden layer;
Inputting the voice feature vectors of the N voice frames into a pre-trained mood prediction model to predict the mood level at which the voice segment contains the preset mood includes:
acquiring the N voice feature vectors in the input layer;
Inputting the N voice feature vectors into the at least one convolution layer for feature extraction to obtain a target feature matrix;
Inputting the target feature matrix into the first hidden layer, multiplying a first preset weight matrix with the target feature matrix in the first hidden layer, adding a first preset bias matrix, and activating by a first preset activation function to obtain a first matrix;
Inputting the first matrix into the second hidden layer, multiplying a second preset weight matrix with the first matrix in the second hidden layer, adding a second preset bias matrix, and activating by a second preset activation function to obtain a second matrix;
inputting the second matrix into the output layer, multiplying a third preset weight matrix by the second matrix in the output layer, adding a third preset bias matrix, and activating through a third preset activation function to obtain a third matrix, where the third matrix is a column vector, each row in the third matrix corresponds to one mood level, the value of a target row in the third matrix represents the matching probability of the voice segment for the corresponding target mood level, and the target row is any row in the third matrix;
and outputting the target vector through the output layer, where the target vector is used to represent the mood level with the highest matching probability in the third matrix.
In a specific implementation, taking N = 20, k = 65, and the preset mood being an emphasized mood as an example, an exemplary implementation of the mood prediction model is as follows, as shown in fig. 3:
The mood prediction model includes an input layer, at least one convolution layer, a fully connected layer, and an output layer, where the fully connected layer includes a first hidden layer (hidden layer 1) and a second hidden layer (hidden layer 2). The first hidden layer and the second hidden layer each have 256 nodes, and the output layer has 5 nodes corresponding to the 5 emphasized-mood levels 1-5, where 1 represents slightly emphasized and 5 represents very emphasized, and the intensity of the emphasized mood increases from 1 to 5.
The input layer of the mood prediction model has 20 nodes, each corresponding to the voice feature vector of one voice frame. The N voice feature vectors are input into the at least one convolution layer, and feature extraction is performed on the voice feature matrix formed by the N voice feature vectors in the at least one convolution layer to obtain the target feature matrix.
And then inputting the target feature matrix into the first hidden layer. Each node in the first hidden layer is connected with each node in the input layer, a weight value on each connecting line is determined in pre-training, and a preset weight value on 256 x 20 connecting lines between 256 nodes in the first hidden layer and 20 nodes in the input layer can form the first preset weight matrix. In the first hidden layer, a first preset weight matrix is multiplied by the target feature matrix, a first preset bias matrix is added, and activation is performed through a first preset activation function, so that an output result of the first hidden layer can be obtained, and the output result is recorded as the first matrix. The first preset activation function may be a ReLU function.
Then, the first matrix is input into the second hidden layer. Each node in the second hidden layer is connected with each node in the first hidden layer, a weight value on each connecting line is determined in pre-training, and a preset weight value on 256 lines between 256 nodes in the second hidden layer and 256 nodes in the first hidden layer can form the second preset weight matrix. And multiplying a second preset weight matrix with the first matrix in the second hidden layer, adding a second preset bias matrix, and activating through a second preset activation function to obtain an output result of the second hidden layer, which is marked as the second matrix. The second preset activation function may also be a ReLU function.
And then inputting the second matrix into the output layer. Each node in the output layer is connected with each node in the second hidden layer, a weight value on each connecting line is determined in pre-training, and preset weight values on 5 x 256 connecting lines between 5 nodes in the output layer and 256 nodes in the second hidden layer can form the third preset weight matrix. And multiplying a third preset weight matrix with the second matrix in the output layer, adding a third preset bias matrix, and activating by a third preset activation function to obtain the third matrix. Wherein the third preset activation function may be a softmax function.
The third matrix is a 5×1 column vector, each row corresponds to one mood level, the value of each row is a real number in (0, 1), the value of a row represents the matching probability of the voice segment for the mood level corresponding to that row, and the values of the 5 rows sum to 1. Illustratively, if the third matrix is [0.2, 0.2, 0.4, 0.1, 0.1]^T, then:
The probability of the current speech segment matching to the mood level 1 is 0.2, the probability of the current speech segment matching to the mood level 2 is 0.2, the probability of the current speech segment matching to the mood level 3 is 0.4, the probability of the current speech segment matching to the mood level 4 is 0.1, and the probability of the current speech segment matching to the mood level 5 is 0.1.
Thereafter, the target vector may be determined based on the third matrix. For example, the target vector may be a 5×1 column vector in which each row has a value of 1 or 0, where the row with the highest probability in the third matrix has a value of 1 and the other rows have a value of 0; in the above example, the target vector is [0, 0, 1, 0, 0]^T.
The mood prediction model outputs the target vector, and based on the target vector, the mood level of the voice segment can be determined to be 3. It should be noted that if the voice segment input into the mood prediction model contains the emphasized mood, the target row of the target vector has a value of 1, and the mood level corresponding to the target row is the mood level of the voice segment. If the voice segment input into the mood prediction model does not contain the emphasized mood, the matching probability of the voice segment for every mood level is 0, and the value of every row of the target vector is 0. It should be understood that the implementation form of the target vector is not limited thereto and may be determined according to the actual situation; the embodiments of the present application are not limited herein.
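For illustration, the fully connected part of the forward pass described above might be sketched as follows in numpy; the convolution layers are abstracted into a single flattened target feature vector, and the weight matrices are random placeholders standing in for the pre-trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Placeholder parameters standing in for the pre-trained weights and biases.
# 1300 = 20 frames x 65 LLDs, assuming the conv output is flattened to that size.
W1, b1 = rng.normal(size=(256, 1300)), np.zeros(256)   # first hidden layer
W2, b2 = rng.normal(size=(256, 256)),  np.zeros(256)   # second hidden layer
W3, b3 = rng.normal(size=(5, 256)),    np.zeros(5)     # output layer: 5 mood levels

def predict_mood_level(target_features):
    """target_features: flattened output of the convolution layer(s), shape (1300,)."""
    h1 = relu(W1 @ target_features + b1)       # first matrix
    h2 = relu(W2 @ h1 + b2)                    # second matrix
    third_matrix = softmax(W3 @ h2 + b3)       # matching probability per mood level
    target_vector = np.zeros(5)
    target_vector[np.argmax(third_matrix)] = 1.0
    return third_matrix, target_vector         # e.g. [0.2,0.2,0.4,0.1,0.1] -> [0,0,1,0,0]
```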
It should be noted that, in other implementation forms, the 5 nodes of the output layer may correspond to 5 emphasized-mood levels 1-5 where 1 indicates no emphasized mood and 5 indicates very emphasized, with the intensity of the emphasized mood increasing from 1 to 5; this may be determined according to the actual situation, and the embodiments of the present application are not limited herein.
In this case, optionally, the extracting key voices of the voice data based on feature information of the voice data includes:
determining first voice data containing a preset mood in the voice data, where the first voice data includes at least one first voice segment;
determining a first keyword based on the occurrence frequency of phrases in the voice data, where the first keyword satisfies at least one of the following: a phrase whose occurrence frequency in the voice data is greater than a second threshold; a phrase whose weighted frequency in the voice data is greater than a third threshold, where the weighted frequency is the product of the occurrence frequency and a second weight corresponding to the phrase;
determining a first weight corresponding to the first voice segment based on the occurrence frequency or weighted frequency of the first keyword in the first voice segment, or based on the degree of correlation between the first voice segment and a target word stock, where the degree of correlation with the target word stock is determined based on the occurrence frequency or weighted frequency, in the first voice segment, of the first keywords that match the target word stock;
And extracting the first voice segment with the first weight greater than a first threshold value to obtain second voice data, wherein the key voice comprises the second voice data.
In this embodiment, taking the preset mood as an emphasized mood as an example, the first voice data is the voice data containing an emphasized mood. However, because of differences in speaking habits, the first voice data may contain voice whose mood is emphasized but whose content is not significant. To filter out such voice, a first voice segment with a higher first weight in the first voice data may be extracted as key voice, where the first weight characterizes the degree of correlation between the first voice segment and the main content of the voice data.
It should be noted that, if the voice data is divided into a plurality of voice segments before the voice data is processed, the first voice segment is a voice segment containing a preset language among the plurality of voice segments. If the voice data is not divided into a plurality of voice segments before the voice data is processed, the first voice data may be divided into a plurality of first voice segments based on the preset dividing rule after the first voice data is determined, which may be specifically determined according to the actual situation, and embodiments of the present application are not limited herein.
In a specific implementation, the first weight may be determined based on the occurrence frequency or weighted frequency of the first keyword in the first voice segment, or based on the degree of correlation between the first voice segment and a target word stock, where the first keyword is a high-frequency keyword. After the plurality of phrases in the voice data are determined, the occurrence frequency of each phrase can be determined, and the phrases can be ranked from high to low by occurrence frequency. When the occurrence frequency is the natural frequency, the phrases are divided into high-frequency phrases and low-frequency phrases based on the second threshold; when the occurrence frequency is the weighted frequency, the phrases are divided into high-frequency phrases and low-frequency phrases based on the third threshold; the high-frequency phrases can be determined as first keywords. For example, the occurrence frequencies of phrase 1, phrase 2, phrase 3, phrase 4, and phrase 5 are shown in fig. 4; based on the threshold, the 5 phrases can be divided into two groups, where phrase 1, phrase 2, and phrase 3 have higher occurrence frequencies and are high-frequency phrases, and phrase 4 and phrase 5 have lower occurrence frequencies and are low-frequency phrases.
The occurrence frequency of the first keyword refers to the number of times the first keyword appears in the first voice segment. For example, assuming the first keyword includes "S9", in the voice segment "The promotion activity of S9 this time is very important; for the promotion activity of S9 this time, we have the following scheme", the occurrence frequency of "S9" is 2.
The weighted frequency of the first keyword refers to the product of the occurrence frequency of the first keyword and the second weight corresponding to the first keyword. For example, assuming the second weight of the first keyword "S9" is 1.5, in the voice segment "The promotion activity of S9 this time is very important; for the promotion activity of S9 this time, we have the following scheme", the weighted frequency of "S9" is 3. The second weight of a phrase may be set by the user, or determined based on the occurrence frequency of the phrase in historical voice data previously collected on the user equipment; this may be determined according to the actual situation, and the embodiments of the present application are not limited herein.
The degree of correlation between the first voice segment and the target word stock may be determined based on the occurrence frequency or weighted frequency, in the first voice segment, of the first keywords that match the target word stock. Specifically, a plurality of word stocks may be preset, each collecting one type of phrase; for example, a model word stock collects phrases related to models of the user equipment, a software function word stock collects phrases related to application functions on the user equipment, and a marketing scheme word stock collects phrases related to marketing schemes. When a phrase in the first voice segment is identified as belonging to the target word stock, its occurrence frequency or weighted frequency can be counted toward the degree of correlation between the first voice segment and the target word stock.
For ease of understanding, the following example is given. Assume that the first keywords contained in voice segment 1 are "S9", "S8", and "promotion activity", and the first keywords contained in voice segment 2 are "S9" and "promotion activity", where "S9" and "S8" belong to the model word stock and "promotion activity" belongs to the activity word stock, and the second weights of "S9", "S8", and "promotion activity" are 1.5, 1.5, and 1, respectively. Then the degree of correlation between voice segment 1 and the model word stock is 3 and with the activity word stock is 1, and the degree of correlation between voice segment 2 and the model word stock is 1.5 and with the activity word stock is 1. If first voice segments whose degree of correlation with the model word stock is greater than 2 are extracted as key voice, voice segment 1 is extracted as key voice and voice segment 2 is not; if first voice segments whose degree of correlation with the activity word stock is greater than 2 are extracted as key voice, neither voice segment 1 nor voice segment 2 is extracted as key voice.
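The worked example above can be reproduced with a short sketch such as the following; the word stocks, second weights, and threshold are taken from the example, and everything else is an assumption:

```python
SECOND_WEIGHTS = {"S9": 1.5, "S8": 1.5, "promotion activity": 1.0}
MODEL_WORD_STOCK = {"S9", "S8"}
ACTIVITY_WORD_STOCK = {"promotion activity"}

def correlation(segment_keywords, word_stock, weights=SECOND_WEIGHTS):
    """Sum the weighted frequency of the segment's keywords that hit the word stock."""
    return sum(count * weights.get(kw, 1.0)
               for kw, count in segment_keywords.items() if kw in word_stock)

# Keyword occurrence counts per first voice segment (from the example above).
segment1 = {"S9": 1, "S8": 1, "promotion activity": 1}
segment2 = {"S9": 1, "promotion activity": 1}

print(correlation(segment1, MODEL_WORD_STOCK))     # 3.0
print(correlation(segment2, MODEL_WORD_STOCK))     # 1.5
print(correlation(segment1, ACTIVITY_WORD_STOCK))  # 1.0

threshold = 2.0
key_segments = [s for s in (segment1, segment2)
                if correlation(s, MODEL_WORD_STOCK) > threshold]   # only segment 1
```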
It should be noted that after the second voice data is extracted as the key voice, the first voice segment may be associated with a target word stock with a higher degree of correlation, and then when the key voice is arranged, the text may be orderly classified and arranged based on the target word stock.
It should be noted that, in other embodiments of the present application, optionally, the first weight may be further determined based on the following two implementation forms:
In a first implementation form, the first weight is determined based on the temporal position of the first voice segment in the voice data. In a specific implementation, the correspondence between the temporal position and the first weight may be determined based on a preset rule, where the preset rule may be set by the user or set by default by the electronic device. For example, since important remarks usually come near the end, a first voice segment located later in the voice data has a higher first weight, and a first voice segment located earlier or in the middle has a lower first weight.
In a second implementation, the first weight is determined based on a character feature of the second speech segment. In specific implementation, the correspondence between the character features and the first weights may be determined based on a preset rule, where the preset rule may be a rule set by user definition, or may be a rule set by default by the electronic device, and, for example, when the character features of the first voice segment represent that the speaking character corresponding to the character features is a professor or a general manager, the first weight of the first voice segment is higher; when the character features of the first voice segment represent that the corresponding speaking character is a host, the first weight of the first voice segment is lower. After identifying the character feature of the speech data, a first weight of the first speech segment may be correspondingly determined.
For ease of understanding, as shown in fig. 5, an exemplary implementation flow in this case is as follows:
and 5-1, dividing the voice data into a plurality of voice fragments, and acquiring text fragments corresponding to the voice fragments.
In this example, the voice segments are sentence-level voice segments. After the voice data is acquired, the audio waveform corresponding to the voice data is obtained first. Then, based on the audio waveform, as shown in fig. 2, if the decibel level of a certain part of the voice data stays below 70 dBFS for more than 700 milliseconds, that part is determined to be a pause; sentence-level voice segments are divided at the pauses, and the resulting voice segments are converted into text segments. Thereafter, step 5-2 or step 5-3 may be performed.
And 5-2, extracting voice segments containing the preset mood and converting them into text segments.
Voice feature vectors are extracted from each voice segment frame by frame to obtain an N×65 input matrix, which is input into a pre-trained mood prediction model to predict the level of emphasized mood; voice segments with an emphasized mood are then extracted from the voice segments and converted into text segments. As shown in fig. 6, the specific implementation flow is as follows:
a. Training samples of the mood prediction model are collected. And collecting a large number of sample voice fragments containing emphasized mood, extracting voice feature vectors of a plurality of voice frames in each sample voice fragment, and determining the mood level of the emphasized mood of each sample voice fragment to obtain the training sample.
B. And decoding the voice feature vector by utilizing the AAE so as to re-express the characteristics of the voice feature vector.
C. The mood prediction model is trained using the re-represented voice feature vectors and the predetermined mood level of each sample voice segment. During training, the weight values on the connections between the layers of the mood prediction model are adjusted to determine the optimal weight values, thereby obtaining the mood prediction model used subsequently.
D. And acquiring a plurality of voice fragments in the current voice data, extracting voice feature vectors of a plurality of voice frames in each voice fragment, and decoding the voice feature vectors by utilizing an AAE.
E. After decoding, the voice feature vectors of a voice segment are input into the trained mood prediction model to predict the level of emphasized mood, and whether the voice segment contains an emphasized mood is determined based on the output of the mood prediction model.
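A compact sketch of steps C and E, written here with PyTorch under the N = 20, k = 65, five-level setup, might look as follows; the convolution configuration (channel count, kernel size) is an assumption, and the AAE re-representation step is omitted for brevity:

```python
import torch
import torch.nn as nn

class MoodPredictor(nn.Module):
    """Conv layer over (k=65 channels, N=20 frames) followed by 256-256-5 fully connected layers."""
    def __init__(self, k=65, n_frames=20, n_levels=5):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv1d(k, 32, kernel_size=3, padding=1), nn.ReLU())
        self.fc = nn.Sequential(
            nn.Linear(32 * n_frames, 256), nn.ReLU(),   # first hidden layer
            nn.Linear(256, 256), nn.ReLU(),             # second hidden layer
            nn.Linear(256, n_levels),                   # output layer (softmax applied in the loss)
        )

    def forward(self, x):                # x: (batch, k, n_frames)
        h = self.conv(x).flatten(1)
        return self.fc(h)

def train(model, loader, epochs=10):
    """loader yields (features, mood_level) with mood_level in 0..4, i.e. level 1..5 minus one."""
    optim = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for feats, level in loader:
            optim.zero_grad()
            loss = loss_fn(model(feats), level)
            loss.backward()
            optim.step()
```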
And 5-3, extracting text fragments containing preset mood words.
Phrase splitting is carried out on the text segments to identify whether any text segment contains a preset emphasized mood word, where the preset emphasized mood words include "very", "important", "special", and the like.
And 5-4, determining a first weight of the speech segment containing the emphasized mood and the text segment containing the emphasized mood word.
A corresponding first weight is determined based on the degree of correlation between the voice segment containing an emphasized mood, or the text segment containing an emphasized mood word, and the target word stock. Take as an example that the predetermined first keywords include "S9", "S8", and "promotion activity", that the second weights of "S9", "S8", and "promotion activity" are 1.5, 1.5, and 1, respectively, and that the first threshold is 3; voice segment 1 with an emphasized mood hits "S9", "S8", and "promotion activity", voice segment 2 with an emphasized mood hits "S8" and "promotion activity", and voice segment 3 with an emphasized mood hits "promotion activity". Then the degree of correlation between voice segment 1 and the model word stock is 3 and with the activity word stock is 1; the degree of correlation between voice segment 2 and the model word stock is 1.5 and with the activity word stock is 1; and the degree of correlation between voice segment 3 and the model word stock is 0 and with the activity word stock is 1.
And 5-5, extracting the voice fragments with the first weight larger than the first threshold value as key voices.
Following the above example, speech segment 1 is extracted as the key speech. It should be noted that the text segment with the first weight greater than the first threshold may be directly determined as the key text corresponding to the key voice.
2) The case where the feature information includes a character feature.
In this case, three implementation forms are optionally included:
In a first implementation form, the identifying feature information of the voice data includes:
Recognizing a phrase used for representing the identity of a person in the voice data, and extracting third voice data containing the phrase used for representing the identity of the person, wherein the third voice data comprises at least one second voice segment;
determining a target sentence pattern matched with the second voice fragment in a preset sentence pattern set;
determining a third speech segment associated with the second speech segment in the speech data based on the target sentence pattern;
and determining the character characteristics of the third voice segment based on the phrase used for representing the character identity and contained in the second voice segment.
In this implementation form, for ease of reading, the phrase used to represent a person's identity is referred to as the second keyword. The second keyword may include a phrase representing a name, such as "Zhang San" or "Liu Si", a phrase representing a title or position, such as "professor", "teacher", or "manager", or a phrase representing a relative, such as "mom" or "grandpa". The character feature corresponding to each second keyword may be preset by the user or determined automatically based on the meaning of the second keyword; this may be determined according to the actual situation and is not limited herein.
In a specific implementation, the preset sentence pattern set is a sentence pattern set for determining character features of a voice segment, and optionally, the preset sentence pattern set may include the following three sentence patterns:
First, a sentence pattern used to determine the character feature of the voice data within a preset time period before the second voice segment. For example, the preset sentence pattern set may include a first sentence pattern such as "what xxx said is quite right / xxx summarized it very well". Based on the first sentence pattern, it can be determined that within a preset time period before the current second voice segment there may be a voice segment whose speaker is "xxx", and the character feature of that voice segment is then the character feature of "xxx". If the sentence pattern matched by the second voice segment is the first sentence pattern or a similar sentence pattern, the third voice segment associated with the second voice segment is a voice segment within the preset time period before the second voice segment, and the character feature of the third voice segment is the character feature of "xxx".
Second, a sentence pattern used to determine the character feature of the voice data within a preset time period after the second voice segment. For example, the preset sentence pattern set may include a second sentence pattern such as "yyy, what do you think / how do you see it?". Based on the second sentence pattern, it can be determined that within a preset time period after the current second voice segment there may be a voice segment whose speaker is "yyy", and the character feature of that voice segment is then the character feature of "yyy". If the sentence pattern matched by the second voice segment is the second sentence pattern or a similar sentence pattern, the third voice segment associated with the second voice segment is a voice segment within the preset time period after the second voice segment, and the character feature of the third voice segment is the character feature of "yyy".
Third, a sentence pattern used to determine the character feature of the second voice segment itself. For example, the preset sentence pattern set may include a third sentence pattern such as "I am zz". Based on the third sentence pattern, it can be determined that the speaker corresponding to the current second voice segment is "zz", and the character feature of the current second voice segment is then the character feature of "zz". If the sentence pattern matched by the second voice segment is the third sentence pattern or a similar sentence pattern, the third voice segment associated with the second voice segment is the second voice segment itself.
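The three sentence patterns can be approximated with regular expressions as in the sketch below; the English phrasings are loose equivalents of the examples above and are assumptions, as is the way the associated segment is indicated:

```python
import re

# (pattern, where the associated third voice segment lies relative to the match)
SENTENCE_PATTERNS = [
    (re.compile(r"what (?P<name>\w+) (?:just )?said is (?:quite )?right"), "previous"),  # first pattern
    (re.compile(r"(?P<name>\w+), (?:what do you think|how do you see it)"), "next"),     # second pattern
    (re.compile(r"\bI am (?P<name>\w+)\b"), "self"),                                     # third pattern
]

def match_person(text_fragment):
    """Return (speaker name, which neighbouring segment the name applies to), or None."""
    for pattern, where in SENTENCE_PATTERNS:
        m = pattern.search(text_fragment)
        if m:
            return m.group("name"), where
    return None

print(match_person("Zhangsan, what do you think?"))   # ('Zhangsan', 'next')
```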
A second implementation form, the identifying feature information of the voice data includes:
determining target voiceprint features corresponding to the voice data in a preset voiceprint set, wherein the preset voiceprint set is preset with the corresponding relation between the voiceprint features and the character features;
and determining the character feature of the voice data according to the target voiceprint feature based on the corresponding relation between the voiceprint feature and the character feature.
In this implementation, the character feature of the voice data may be determined based on the voiceprint feature of the voice data. The preset voiceprint set may store in advance the voiceprint feature and the character feature of each speaker in the voice data, together with the correspondence between voiceprint features and character features. By identifying the target voiceprint feature of the voice data, the character feature corresponding to the target voiceprint feature can be looked up in the preset voiceprint set and used as the character feature of the voice data. For example, if the preset voiceprint set stores voiceprint feature 1, voiceprint feature 2, and voiceprint feature 3, where voiceprint feature 1 corresponds to "yangzhu", voiceprint feature 2 corresponds to "yangzhu", and voiceprint feature 3 corresponds to "host A", then when the voiceprint feature of a certain part of the voice data is identified as voiceprint feature 1, the character feature of that part of the voice data is the character feature of "yangzhu".
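Assuming the voiceprint features are fixed-length embedding vectors, the lookup against the preset voiceprint set can be sketched as a nearest-neighbour match; the similarity measure and threshold below are assumptions:

```python
import numpy as np

def best_voiceprint_match(embedding, preset_voiceprints, min_similarity=0.8):
    """preset_voiceprints: dict mapping character feature -> enrolled embedding vector."""
    best_name, best_sim = None, -1.0
    for name, enrolled in preset_voiceprints.items():
        sim = float(np.dot(embedding, enrolled) /
                    (np.linalg.norm(embedding) * np.linalg.norm(enrolled) + 1e-12))
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name if best_sim >= min_similarity else None   # None: unknown speaker
```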
In a third implementation form, the identifying feature information of the voice data includes:
Receiving a first input of a user;
determining a character feature of the voice data in response to the first input.
In this implementation form, the character features of the voice data are determined based on custom input from the user.
In an optional implementation manner, the plurality of speaking characters in the voice data may be distinguished based on the voiceprint features of the voice data, so as to obtain a speaking character list corresponding to the voice data, where one entry in the speaking character list displays the voice data corresponding to one speaking character. By displaying the speaking character list, the user can listen to the voice data in the list one by one and annotate the character features through the first input.
In this embodiment, optionally, in the case where the feature information includes a character feature, the extracting key voice of the voice data based on the feature information of the voice data includes:
And extracting fourth voice data matched with the preset character features based on the character features of the voice data, wherein the key voice comprises the fourth voice data.
In this implementation manner, optionally, the fourth voice data matched with the preset character feature includes: voice data whose character feature corresponds to a third weight greater than a fourth threshold. The third weight may be used to characterize the importance of the speaking character represented by different character features; for example, the third weight determined by the character feature of a "teacher" may be higher than that determined by the character feature of a "student", and the third weight determined by the character feature of a "group leader" may be higher than that determined by the character feature of a "group member".
The third weight may be determined based on the character feature. For example, if voice segment 1 and voice segment 2 are both voice segments in the voice data of an academic conference, the character feature of voice segment 1 is that of a "professor", and the character feature of voice segment 2 is that of the "host Zhang San", the third weight of voice segment 1 may be preset to 8 and the third weight of voice segment 2 may be preset to 2.
The third weight may also be determined based on user input. By identifying the voiceprint features of the voice data in advance to distinguish the different speaking characters in the voice data, a speaking character list may be generated and presented to the user. A selection input of the user on at least one speaking character may then be received, and the third weight of the character feature of a speaking character selected by that input is higher than the third weight of the character feature of a speaking character not selected. For example, as shown in fig. 7, in the case of displaying the voice segment list, if an input of the user acting on "user 1" is received, it can be determined that the speaking character corresponding to "user 1" is a key character, and the third weight of the character feature corresponding to "user 1" is higher than the third weights of the character features of the speaking characters corresponding to "user 2" and "user 3".
In the case that the third weight of a certain character feature is greater than the fourth threshold, that character feature may be considered to match the preset character feature, and the voice data corresponding to it may be extracted as the key voice.
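A short sketch of this weight-and-threshold selection follows; the preset weight table, the user-selected characters and the value of the fourth threshold are illustrative assumptions:

```python
# Assumed preset third weights per character feature and fourth threshold.
PRESET_THIRD_WEIGHTS = {"professor": 8, "host": 2, "teacher": 6, "student": 3}
FOURTH_THRESHOLD = 5

def third_weight(character_feature, user_selected=()):
    # A character the user explicitly selected is treated as a key character
    # and given a weight above the threshold regardless of the preset table.
    if character_feature in user_selected:
        return FOURTH_THRESHOLD + 1
    return PRESET_THIRD_WEIGHTS.get(character_feature, 1)

def extract_fourth_voice_data(segments, user_selected=()):
    """segments: list of (character_feature, speech_segment) pairs."""
    return [seg for character, seg in segments
            if third_weight(character, user_selected) > FOURTH_THRESHOLD]

meeting = [("professor", "segment 1"), ("host", "segment 2"), ("student", "segment 3")]
print(extract_fourth_voice_data(meeting))                              # ['segment 1']
print(extract_fourth_voice_data(meeting, user_selected=("student",)))  # ['segment 1', 'segment 3']
```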
For ease of understanding, as shown in fig. 8, an exemplary implementation flow in this case is as follows:
Step 8-1, determining key character features in the voice data.
This may be done in two ways: in the first, the character features of the voice data are recognized based on the preset sentence pattern set, the third weight of each character feature is determined, and the key character features whose third weight is greater than the fourth threshold are selected; in the second, the third weights are determined based on user input, as described above.
Step 8-2, acquiring the voiceprint features of the voice data corresponding to the key character features.
These are denoted herein as key voiceprint features.
Step 8-3, extracting the voice data associated with the key voiceprint features as the key voice.
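The Fig. 8 flow can be summarized by the following sketch, in which the data layout (segments annotated with a voiceprint identifier) is an assumption introduced only for illustration:

```python
# Step 8-2 has already mapped the key character features to their voiceprints
# (the "key voiceprint features"); step 8-3 keeps every segment whose
# voiceprint matches one of them.
voice_data = [
    {"voiceprint": "vp1", "audio": "segment A"},
    {"voiceprint": "vp2", "audio": "segment B"},
    {"voiceprint": "vp1", "audio": "segment C"},
]
key_voiceprints = {"vp1"}  # voiceprints of the key character features

key_voice = [seg["audio"] for seg in voice_data
             if seg["voiceprint"] in key_voiceprints]  # step 8-3
print(key_voice)  # -> ['segment A', 'segment C']
```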
3) The case where the feature information includes both a mood feature and a character feature
In this case, optionally, the extracting key voices of the voice data based on feature information of the voice data includes:
extracting fifth voice data matched with the preset character features based on the character features of the voice data, wherein the fifth voice data comprises at least one fourth voice segment;
and extracting a fourth voice segment containing a preset mood to obtain sixth voice data, wherein the key voice comprises the sixth voice data.
In this embodiment, after the fifth voice data matched with the preset character feature is extracted, the sixth voice data containing the preset mood may be further extracted from the fifth voice data. That is, the implementation forms in the above case 1) and case 2) may be combined; for the specific implementation forms, reference may be made to the description in the above embodiments, which is not repeated here.
It should be noted that, in other embodiments of the present application, the feature information may further include a phrase frequency, and optionally, in a case that the feature information includes a phrase frequency, extracting the key voice of the voice data based on the feature information of the voice data includes:
determining a first keyword based on the occurrence frequency of phrases in the voice data, wherein the first keyword meets at least one of the following: a phrase whose occurrence frequency in the voice data is greater than a second threshold; a phrase in the voice data whose weighted frequency is greater than a third threshold, wherein the weighted frequency is the product of the occurrence frequency and a second weight corresponding to the phrase;
And extracting seventh voice data containing the first keyword from the voice data, wherein the key voice comprises the seventh voice data.
In this embodiment, the occurrence frequency of different phrases may represent the importance degree of different phrases in the voice data, and for example, the phrase with higher occurrence frequency may be a topic or a key point of the voice content. That is, key voices extracted based on the occurrence frequency of phrases may be used to determine the subject or key point of voice data.
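A sketch of the two keyword criteria is given below; the thresholds and the second weights are illustrative assumptions:

```python
# Assumed thresholds and second weights (e.g. a higher weight for domain nouns).
SECOND_THRESHOLD = 5          # on raw occurrence frequency
THIRD_THRESHOLD = 6.0         # on weighted frequency
SECOND_WEIGHTS = {"S9": 2.0, "promotion campaign": 1.5}   # default weight is 1.0

def first_keywords(frequencies):
    """frequencies: dict mapping phrase -> occurrence count in the voice data."""
    keywords = set()
    for phrase, freq in frequencies.items():
        weighted = freq * SECOND_WEIGHTS.get(phrase, 1.0)
        if freq > SECOND_THRESHOLD or weighted > THIRD_THRESHOLD:
            keywords.add(phrase)
    return keywords

print(first_keywords({"S9": 4, "plan": 7, "we": 3}))  # -> {'S9', 'plan'} (order may vary)
```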
For easy understanding, as shown in fig. 9, an exemplary implementation flow in this case is as follows:
Step 9-1, acquiring text data of the voice data.
The voice data is converted into text data based on voice recognition, and the text data is divided according to a preset sentence-division rule to obtain a plurality of sentence-level text fragments.
Step 9-2, splitting the sentence-level text fragments into phrases.
Taking the speech segment "this S9 promotion campaign is very important; for this S9 promotion campaign, we have the following plan" as an example, splitting yields the phrases "S9", "promotion campaign" and "plan". The other sentence-level speech segments in the voice data may likewise each be split into at least one phrase.
Step 9-3, determining the occurrence frequency of each phrase in the voice data.
Step 9-4, determining the high-frequency keywords in the voice data based on the occurrence frequency of each phrase.
Step 9-5, extracting the voice fragments containing the high-frequency keywords as the key voice.
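The following sketch walks through the Fig. 9 flow on already-transcribed text; the sentence-splitting rule, the phrase vocabulary and the frequency threshold are simplifying assumptions, and a real implementation would rely on speech recognition plus a proper tokenizer:

```python
import re
from collections import Counter

def split_sentences(text):
    # step 9-1 (simplified): split the transcript on sentence-ending punctuation
    return [s.strip() for s in re.split(r"[.!?]", text) if s.strip()]

def split_phrases(sentence, vocabulary):
    # step 9-2 (simplified): keep only phrases from an assumed noun/verb vocabulary
    return [p for p in vocabulary if p in sentence]

def key_segments(text, vocabulary, threshold=2):
    sentences = split_sentences(text)
    counts = Counter(p for s in sentences for p in split_phrases(s, vocabulary))   # step 9-3
    high_freq = {p for p, c in counts.items() if c >= threshold}                   # step 9-4
    return [s for s in sentences if any(p in s for p in high_freq)]                # step 9-5

text = ("The S9 promotion campaign is very important. "
        "For the S9 promotion campaign we have the following plan. "
        "Lunch will be provided.")
print(key_segments(text, vocabulary=["S9", "promotion campaign", "plan"]))
# -> the two sentences mentioning the high-frequency phrases 'S9' and 'promotion campaign'
```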
As shown in fig. 10, in an exemplary embodiment of the present application, the implementation flow of the text generation method is as follows:
Step 10-1, acquiring voice data.
Step 10-2, preprocessing the voice data.
The preprocessing may include phrase division, i.e., extracting noun or verb phrases in the voice data and determining the high-frequency keywords; sentence division, i.e., dividing the voice data into a plurality of sentence-level voice segments according to a preset division rule; voiceprint recognition, i.e., distinguishing the speaking characters corresponding to the voiceprint features in the voice data and generating a speaking character list; and receiving user input on the speaking characters to determine the key character features in the voice data.
Step 10-3, extracting first key voices containing the high-frequency keywords.
Step 10-4, extracting voice data containing a preset mood, and extracting second key voices based on the first weights of the voice segments.
Step 10-5, extracting third key voices whose character features meet the preset character features.
Step 10-6, sorting the first key voices, the second key voices and the third key voices.
After the first key voices, the second key voices and the third key voices are converted into text fragments, the text fragments are sorted by time of occurrence and de-duplicated, so as to obtain a list of key text fragments arranged along a timeline.
Step 10-7, the user manually edits the list of key text fragments, and a text corresponding to the voice data is generated and displayed.
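Steps 10-6 and 10-7 amount to merging, time-sorting and de-duplicating the three key-voice sets, as in the following sketch; the (start_time, text) fragment representation is an assumption:

```python
def build_key_text_list(first, second, third):
    """Merge three lists of (start_time, text) fragments, order them along the
    timeline and drop duplicates."""
    merged = first + second + third
    seen, timeline = set(), []
    for start, text in sorted(merged, key=lambda item: item[0]):   # order by time
        if (start, text) in seen:                                  # de-duplicate
            continue
        seen.add((start, text))
        timeline.append((start, text))
    return timeline

first_key  = [(12.0, "S9 promotion plan"), (40.5, "budget is approved")]
second_key = [(40.5, "budget is approved")]          # extracted twice -> removed once
third_key  = [(3.2, "I am Zhang San")]
print(build_key_text_list(first_key, second_key, third_key))
# -> [(3.2, 'I am Zhang San'), (12.0, 'S9 promotion plan'), (40.5, 'budget is approved')]
```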
In summary, according to the text generation method provided by the embodiments of the application, voice data is acquired; characteristic information of the voice data is identified; key voice of the voice data is extracted based on the characteristic information of the voice data; and a text corresponding to the voice data is displayed based on the key voice. In this way, the key voice of the voice data is extracted based on the characteristic information of the voice data, and the text corresponding to the voice data can be displayed based on the key voice; compared with a text obtained by converting all of the voice data as in the prior art, the text obtained in this way is more concise and its key points are more prominent.
It should be noted that, in the text generation method provided by the embodiments of the application, the execution subject may be a text generation device, or a control module in the text generation device for executing the text generation method. In the embodiments of the application, the text generation device executing the text generation method is taken as an example to describe the text generation device provided by the embodiments of the application.
Referring to fig. 11, fig. 11 is a block diagram of a text generating apparatus according to an embodiment of the present application.
As shown in fig. 11, the text generating apparatus 1100 includes:
an acquisition module 1101, configured to acquire voice data;
The recognition module 1102 is configured to recognize feature information of the voice data, where the feature information includes at least one of a mood feature and a character feature;
an extracting module 1103, configured to extract key voices of the voice data based on feature information of the voice data;
And the generating module 1104 is used for displaying the text corresponding to the voice data based on the key voice.
Optionally, in the case that the feature information includes a mood feature, the identifying module 1102 includes:
The first recognition unit is used for recognizing the phrase used for representing the mood in the voice data and determining the mood characteristic of the voice data based on the phrase used for representing the mood;
Or, a second recognition unit, configured to recognize the voice feature of the voice data, and determine the mood feature of the voice data based on the voice feature of the voice data.
Optionally, the second identifying unit includes:
A dividing subunit for dividing the voice data into a plurality of voice fragments;
a first determining subunit, configured to determine, based on speech features of N speech frames in the speech segment, speech feature vectors of N speech frames in the speech segment, where N is a positive integer;
The prediction subunit is used for inputting the voice feature vectors of the N voice frames into a pre-trained mood prediction model to predict the mood level of the voice segment containing a preset mood;
the first obtaining subunit is used for obtaining a target vector output by the mood prediction model, wherein the target vector is used for representing the mood level of the voice segment containing the preset mood;
And the second determining subunit is used for determining the mood characteristics of the voice fragments based on the target vector.
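The preparation of the N voice feature vectors for one segment might look like the following sketch; the frame length, hop size and the per-band log-energy feature are assumptions, since the method does not prescribe a particular frame-level feature:

```python
import numpy as np

def frame_signal(signal, frame_len=400, hop=160):
    # split the segment's waveform into overlapping frames (assumed 25 ms / 10 ms at 16 kHz)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frames.append(signal[start:start + frame_len])
    return np.array(frames)

def frame_features(frames, n_bands=13):
    # toy per-frame feature: log energy in n_bands roughly equal chunks of the spectrum
    spectrum = np.abs(np.fft.rfft(frames, axis=1))
    bands = np.array_split(spectrum, n_bands, axis=1)
    return np.log(np.stack([b.sum(axis=1) for b in bands], axis=1) + 1e-8)

signal = np.random.randn(16000)            # 1 s of audio at an assumed 16 kHz rate
features = frame_features(frame_signal(signal))
print(features.shape)                      # -> (N, 13): N voice feature vectors for the segment
```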
Optionally, the mood prediction model includes an input layer, at least one convolution layer, a full connection layer, and an output layer, where the full connection layer includes a first hidden layer and a second hidden layer;
the prediction subunit is specifically configured to:
acquiring the N voice feature vectors in the input layer;
Inputting the N voice feature vectors into the at least one convolution layer for feature extraction to obtain a target feature matrix;
Inputting the target feature matrix into the first hidden layer, multiplying a first preset weight matrix with the target feature matrix in the first hidden layer, adding a first preset bias matrix, and activating by a first preset activation function to obtain a first matrix;
Inputting the first matrix into the second hidden layer, multiplying a second preset weight matrix with the first matrix in the second hidden layer, adding a second preset bias matrix, and activating by a second preset activation function to obtain a second matrix;
Inputting the second matrix into the output layer, multiplying a third preset weight matrix with the second matrix in the output layer, adding a third preset offset vector, and activating by a third preset activation function to obtain a third matrix; the third matrix is a column vector, one row in the third matrix corresponds to one mood level, the value of a target row in the third matrix represents the matching probability of the voice segment to the target mood level, and the target row is any row in the third matrix;
and outputting the target vector through the output layer, wherein the target vector is used for representing the mood level with the highest matching probability in the third matrix.
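A minimal numpy sketch of this forward pass follows; the layer sizes, the random weights and the choice of ReLU and softmax activations are assumptions, and only the overall structure (convolution, two hidden layers, one output row per mood level) follows the description above:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, HIDDEN1, HIDDEN2, LEVELS = 98, 13, 64, 32, 5

relu = lambda x: np.maximum(x, 0.0)
softmax = lambda x: np.exp(x - x.max()) / np.exp(x - x.max()).sum()

def conv_layer(x, kernel):
    # 1-D convolution over time per feature dimension, then mean pooling;
    # simplified here to a pooled feature vector rather than a full matrix.
    out = np.stack([np.convolve(x[:, d], kernel, mode="valid") for d in range(x.shape[1])], axis=1)
    return relu(out).mean(axis=0)

# randomly initialised preset weight matrices and bias terms (assumed shapes)
W1, b1 = rng.normal(size=(HIDDEN1, D)), np.zeros(HIDDEN1)
W2, b2 = rng.normal(size=(HIDDEN2, HIDDEN1)), np.zeros(HIDDEN2)
W3, b3 = rng.normal(size=(LEVELS, HIDDEN2)), np.zeros(LEVELS)

def predict_mood_level(feature_vectors):
    target = conv_layer(feature_vectors, kernel=np.ones(3) / 3)   # convolution layer(s)
    h1 = relu(W1 @ target + b1)                                   # first hidden layer
    h2 = relu(W2 @ h1 + b2)                                       # second hidden layer
    third_matrix = softmax(W3 @ h2 + b3)                          # one entry per mood level
    return int(np.argmax(third_matrix)), third_matrix             # level with highest probability

level, probs = predict_mood_level(rng.normal(size=(N, D)))
print(level, probs.round(3))
```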
Optionally, in the case that the feature information includes a mood feature, the extracting module 1103 includes:
the first determining unit is used for determining first voice data containing a preset mood in the voice data, wherein the first voice data comprises at least one first voice segment;
the second determining unit is used for determining a first keyword based on the occurrence frequency of phrases in the voice data, wherein the first keyword meets at least one of the following: a phrase whose occurrence frequency in the voice data is greater than a second threshold; a phrase in the voice data whose weighted frequency is greater than a third threshold, wherein the weighted frequency is the product of the occurrence frequency and a second weight corresponding to the phrase;
a third determining unit, configured to determine a first weight corresponding to the first voice segment based on the occurrence frequency or weighted frequency of the first keyword in the first voice segment, or the degree of correlation between the first voice segment and a target word stock, wherein the degree of correlation with the target word stock is determined based on the occurrence frequency or weighted frequency of the first keyword in the first voice segment that matches the target word stock;
the first extraction unit is used for extracting the first voice segment with the first weight larger than a first threshold value to obtain second voice data, and the key voice comprises the second voice data.
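One way the first weight could be computed is sketched below; the keyword set, the target word stock, the boost for word-stock matches and the first threshold are illustrative assumptions:

```python
# Assumed first keywords (with their second weights), target word stock and threshold.
FIRST_KEYWORDS = {"S9": 2.0, "plan": 1.0}       # keyword -> second weight
TARGET_WORD_STOCK = {"S9"}                      # assumed domain word stock
FIRST_THRESHOLD = 2.5

def first_weight(segment_text):
    """Weighted keyword frequency of a first voice segment, boosted when a
    keyword also belongs to the target word stock."""
    weight = 0.0
    for keyword, second_weight in FIRST_KEYWORDS.items():
        count = segment_text.count(keyword)
        weight += count * second_weight
        if keyword in TARGET_WORD_STOCK:        # correlation with the word stock
            weight += count * 0.5
    return weight

segments = ["the S9 plan is great, really great!", "what a surprise!"]
second_voice_data = [s for s in segments if first_weight(s) > FIRST_THRESHOLD]
print(second_voice_data)   # -> ['the S9 plan is great, really great!']
```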
Optionally, in the case that the feature information includes a character feature, the identifying module 1102 includes:
the third recognition unit is used for recognizing the phrase used for representing the identity of the person in the voice data and extracting third voice data containing the phrase used for representing the identity of the person, wherein the third voice data comprises at least one second voice segment;
a fourth determining unit, configured to determine, in a preset sentence pattern set, a target sentence pattern that matches the second speech segment;
a fifth determining unit configured to determine a third speech segment associated with the second speech segment in the speech data based on the target sentence pattern;
and a sixth determining unit, configured to determine a character feature of the third speech segment based on a phrase for characterizing the character identity included in the second speech segment.
Alternatively, in the case where the feature information includes a character feature, the extracting module 1103 includes:
and the second extraction unit is used for extracting fourth voice data matched with the preset character features based on the character features of the voice data, and the key voice comprises the fourth voice data.
Optionally, in the case where the feature information includes a mood feature and a character feature, the extracting module 1103 includes:
A third extraction unit for extracting fifth voice data matched with a preset character feature based on the character feature of the voice data, wherein the fifth voice data comprises at least one fourth voice segment;
And the fourth extraction unit is used for extracting a fourth voice segment whose mood feature meets the preset mood to obtain sixth voice data, and the key voice comprises the sixth voice data.
The text generation device provided by the embodiments of the application acquires voice data; identifies feature information of the voice data, wherein the feature information comprises at least one of a mood feature and a character feature; extracts key voice of the voice data based on the feature information of the voice data; and displays a text corresponding to the voice data based on the key voice. In this way, the key voice of the voice data is extracted based on the feature information of the voice data, and the text corresponding to the voice data can be displayed based on the key voice; compared with a text obtained by converting all of the voice data as in the prior art, the text obtained in this way is more concise and its key points are more prominent.
The text generating device in the embodiment of the application can be a device, and can also be a component, an integrated circuit or a chip in a terminal. The device may be a mobile electronic device or a non-mobile electronic device. By way of example, the mobile electronic device may be a mobile phone, a tablet computer, a notebook computer, a palm computer, a vehicle-mounted electronic device, a wearable device, an ultra-mobile personal computer (UMPC), a netbook or a Personal Digital Assistant (PDA), etc., and the non-mobile electronic device may be a server, a network attached storage (Network Attached Storage, NAS), a personal computer (personal computer, PC), a Television (TV), a teller machine, a self-service machine, etc., and the embodiments of the present application are not limited in particular.
The text generating device in the embodiment of the application can be a device with an operating system. The operating system may be an Android operating system, an iOS operating system, or another possible operating system, which is not specifically limited in the embodiments of the present application.
The text generating device provided by the embodiment of the present application can implement each process implemented by the embodiments of the methods of fig. 1 to 10, and in order to avoid repetition, a description is omitted here.
Optionally, as shown in fig. 12, an electronic device 1200 is further provided in the embodiment of the present application, which includes a processor 1201, a memory 1202, and a program or an instruction stored in the memory 1202 and capable of being executed by the processor 1201, where the program or the instruction implements each process of the embodiment of the text generation method and achieves the same technical effect, and in order to avoid repetition, a description is omitted here.
The electronic device in the embodiment of the application includes the mobile electronic device and the non-mobile electronic device.
Fig. 13 is a schematic diagram of a hardware structure of an electronic device implementing an embodiment of the present application.
The electronic device 1300 includes, but is not limited to: radio frequency unit 1301, network module 1302, audio output unit 1303, input unit 1304, sensor 1305, display unit 1306, user input unit 1307, interface unit 1308, memory 1309, and processor 1310.
Those skilled in the art will appreciate that the electronic device 1300 may also include a power source (e.g., a battery) for powering the various components, which may be logically connected to the processor 1310 by a power management system, such as to perform functions such as managing charging, discharging, and power consumption by the power management system. The electronic device structure shown in fig. 13 does not constitute a limitation of the electronic device, and the electronic device may include more or less components than shown, or may combine certain components, or may be arranged in different components, which are not described in detail herein.
Wherein the processor 1310 is configured to:
Acquiring voice data;
Identifying characteristic information of the voice data, wherein the characteristic information comprises at least one of a mood characteristic and a character characteristic;
Extracting key voice of the voice data based on the characteristic information of the voice data;
And displaying the text corresponding to the voice data based on the key voice.
Optionally, where the feature information includes a mood feature, the processor 1310 is specifically configured to:
Identifying a phrase used for representing the mood in the voice data, and determining the mood characteristic of the voice data based on the phrase used for representing the mood;
or, identifying the voice features of the voice data, and determining the mood features of the voice data based on the voice features of the voice data.
Optionally, the processor 1310 is specifically configured to:
dividing the voice data into a plurality of voice segments;
determining voice feature vectors of N voice frames in the voice fragments based on voice features of the N voice frames in the voice fragments, wherein N is a positive integer;
Inputting the voice feature vectors of the N voice frames into a pre-trained mood prediction model, and predicting the mood level of the voice segment containing a preset mood;
obtaining a target vector output by the mood prediction model, wherein the target vector is used for representing the mood level of the voice segment containing the preset mood;
and determining the mood characteristics of the voice fragments based on the target vector.
Optionally, the mood prediction model includes an input layer, at least one convolution layer, a full connection layer, and an output layer, where the full connection layer includes a first hidden layer and a second hidden layer; the processor 1310 is specifically configured to:
acquiring the N voice feature vectors in the input layer;
Inputting the N voice feature vectors into the at least one convolution layer for feature extraction to obtain a target feature matrix;
Inputting the target feature matrix into the first hidden layer, multiplying a first preset weight matrix with the target feature matrix in the first hidden layer, adding a first preset bias matrix, and activating by a first preset activation function to obtain a first matrix;
Inputting the first matrix into the second hidden layer, multiplying a second preset weight matrix with the first matrix in the second hidden layer, adding a second preset bias matrix, and activating by a second preset activation function to obtain a second matrix;
Inputting the second matrix into the output layer, multiplying a third preset weight matrix with the second matrix in the output layer, adding a third preset offset vector, and activating by a third preset activation function to obtain a third matrix; the third matrix is a column vector, one row in the third matrix corresponds to one mood level, the value of a target row in the third matrix represents the matching probability of the voice segment to the target mood level, and the target row is any row in the third matrix;
and outputting the target vector through the output layer, wherein the target vector is used for representing the mood level with the highest matching probability in the third matrix.
Optionally, where the feature information includes a mood feature, the processor 1310 is specifically configured to:
Determining first voice data containing a preset mood in the voice data, wherein the first voice data comprises at least one first voice segment;
determining a first keyword based on the occurrence frequency of phrases in the voice data, wherein the first keyword meets at least one of the following: a phrase whose occurrence frequency in the voice data is greater than a second threshold; a phrase in the voice data whose weighted frequency is greater than a third threshold, wherein the weighted frequency is the product of the occurrence frequency and a second weight corresponding to the phrase;
determining a first weight corresponding to the first voice segment based on the occurrence frequency or weighted frequency of the first keyword in the first voice segment, or the degree of correlation between the first voice segment and a target word stock, wherein the degree of correlation with the target word stock is determined based on the occurrence frequency or weighted frequency of the first keyword in the first voice segment that matches the target word stock;
And extracting the first voice segment with the first weight greater than a first threshold value to obtain second voice data, wherein the key voice comprises the second voice data.
Optionally, where the feature information includes a character feature, the processor 1310 is specifically configured to:
Recognizing a phrase used for representing the identity of a person in the voice data, and extracting third voice data containing the phrase used for representing the identity of the person, wherein the third voice data comprises at least one second voice segment;
determining a target sentence pattern matched with the second voice fragment in a preset sentence pattern set;
determining a third speech segment associated with the second speech segment in the speech data based on the target sentence pattern;
and determining the character characteristics of the third voice segment based on the phrase used for representing the character identity and contained in the second voice segment.
Optionally, where the feature information includes a character feature, the processor 1310 is further configured to:
And extracting fourth voice data matched with the preset character features based on the character features of the voice data, wherein the key voice comprises the fourth voice data.
Optionally, where the characteristic information includes a mood feature and a character feature, the processor 1310 is specifically configured to:
extracting fifth voice data matched with the preset character features based on the character features of the voice data, wherein the fifth voice data comprises at least one fourth voice segment;
And extracting a fourth voice segment whose mood feature meets the preset mood to obtain sixth voice data, wherein the key voice comprises the sixth voice data.
The electronic equipment provided by the embodiments of the application acquires voice data; identifies feature information of the voice data, wherein the feature information comprises at least one of a mood feature and a character feature; extracts key voice of the voice data based on the feature information of the voice data; and displays a text corresponding to the voice data based on the key voice. In this way, the key voice of the voice data is extracted based on the feature information of the voice data, and the text corresponding to the voice data can be displayed based on the key voice; compared with a text obtained by converting all of the voice data as in the prior art, the text obtained in this way is more concise and its key points are more prominent.
It should be appreciated that in embodiments of the present application, the input unit 1304 may include a graphics processor (Graphics Processing Unit, GPU) 13041 and a microphone 13042, the graphics processor 13041 processing image data of still pictures or video obtained by an image capture device (e.g., a camera) in a video capture mode or an image capture mode. The display unit 1306 may include a display panel 13061, and the display panel 13061 may be configured in the form of a liquid crystal display, an organic light emitting diode, or the like. The user input unit 1307 includes a touch panel 13071 and other input devices 13072. Touch panel 13071, also referred to as a touch screen. The touch panel 13071 may include two parts, a touch detection device and a touch controller. Other input devices 13072 may include, but are not limited to, a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and so forth, which are not described in detail herein. Memory 1309 may be used to store software programs as well as various data including, but not limited to, application programs and an operating system. The processor 1310 may integrate an application processor that primarily handles operating systems, user interfaces, applications, etc., with a modem processor that primarily handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 1310.
The embodiment of the application also provides a readable storage medium, on which a program or an instruction is stored, which when executed by a processor, implements each process of the above text generation method embodiment, and can achieve the same technical effects, and in order to avoid repetition, the description is omitted here.
Wherein the processor is a processor in the electronic device described in the above embodiment. The readable storage medium includes a computer readable storage medium such as a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk or an optical disk, and the like.
The embodiment of the application further provides a chip, which comprises a processor and a communication interface, wherein the communication interface is coupled with the processor, and the processor is used for running programs or instructions to realize the processes of the embodiment of the text generation method, and can achieve the same technical effects, so that repetition is avoided, and the description is omitted here.
It should be understood that the chips referred to in the embodiments of the present application may also be referred to as system-on-chip chips, chip systems, or system-on-chip chips, etc.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element. Furthermore, it should be noted that the scope of the methods and apparatus in the embodiments of the present application is not limited to performing the functions in the order shown or discussed, but may also include performing the functions in a substantially simultaneous manner or in an opposite order depending on the functions involved, e.g., the described methods may be performed in an order different from that described, and various steps may be added, omitted, or combined. Additionally, features described with reference to certain examples may be combined in other examples.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a computer software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method according to the embodiments of the present application.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above-described embodiments, which are merely illustrative and not restrictive, and many forms may be made by those having ordinary skill in the art without departing from the spirit of the present application and the scope of the claims, which are to be protected by the present application.

Claims (16)

1. A method of text generation, the method comprising:
Acquiring voice data;
Identifying characteristic information of the voice data, wherein the characteristic information comprises at least one of a mood characteristic and a character characteristic;
Extracting key voice of the voice data based on the characteristic information of the voice data;
displaying a text corresponding to the voice data based on the key voice;
and in the case that the characteristic information comprises a mood characteristic, the extracting key voice of the voice data based on the characteristic information of the voice data comprises:
determining first voice data containing a preset mood in the voice data, wherein the first voice data comprises at least one first voice segment;
determining a first keyword based on the occurrence frequency of phrases in the voice data, wherein the first keyword meets at least one of the following: a phrase whose occurrence frequency in the voice data is greater than a second threshold; a phrase in the voice data whose weighted frequency is greater than a third threshold, wherein the weighted frequency is the product of the occurrence frequency and a second weight corresponding to the phrase;
determining a first weight corresponding to the first voice segment based on the occurrence frequency or weighted frequency of the first keyword in the first voice segment, or the degree of correlation between the first voice segment and a target word stock, wherein the degree of correlation with the target word stock is determined based on the occurrence frequency or weighted frequency of the first keyword in the first voice segment that matches the target word stock;
And extracting the first voice segment with the first weight greater than a first threshold value to obtain second voice data, wherein the key voice comprises the second voice data.
2. The method according to claim 1, wherein, in the case where the feature information includes a mood feature, the identifying feature information of the voice data includes:
Identifying a phrase used for representing the mood in the voice data, and determining the mood characteristic of the voice data based on the phrase used for representing the mood;
or, identifying the voice characteristics of the voice data, and determining the mood characteristics of the voice data based on the voice characteristics of the voice data.
3. The method of claim 2, wherein the identifying the speech characteristics of the speech data and determining the mood characteristics of the speech data based on the speech characteristics of the speech data comprises:
dividing the voice data into a plurality of voice segments;
determining voice feature vectors of N voice frames in the voice fragments based on voice features of the N voice frames in the voice fragments, wherein N is a positive integer;
Inputting the voice feature vectors of the N voice frames into a pre-trained mood prediction model, and predicting the mood level of the voice segment containing a preset mood;
obtaining a target vector output by the mood prediction model, wherein the target vector is used for representing the mood level of the voice segment containing the preset mood;
and determining the mood characteristics of the voice fragments based on the target vector.
4. The method of claim 3, wherein the mood prediction model comprises an input layer, at least one convolution layer, a fully-connected layer, and an output layer, the fully-connected layer comprising a first hidden layer and a second hidden layer;
Inputting the voice feature vectors of the N voice frames into a pre-trained mood prediction model, and predicting the mood level of the voice segment containing the preset mood, includes:
acquiring the N voice feature vectors in the input layer;
Inputting the N voice feature vectors into the at least one convolution layer for feature extraction to obtain a target feature matrix;
Inputting the target feature matrix into the first hidden layer, multiplying a first preset weight matrix with the target feature matrix in the first hidden layer, adding a first preset bias matrix, and activating by a first preset activation function to obtain a first matrix;
Inputting the first matrix into the second hidden layer, multiplying a second preset weight matrix with the first matrix in the second hidden layer, adding a second preset bias matrix, and activating by a second preset activation function to obtain a second matrix;
Inputting the second matrix into the output layer, multiplying a third preset weight matrix with the second matrix in the output layer, adding a third preset offset vector, and activating by a third preset activation function to obtain a third matrix; the third matrix is a column vector, one row in the third matrix corresponds to one mood level, the value of a target row in the third matrix represents the matching probability of the voice segment to the target mood level, and the target row is any row in the third matrix;
and outputting the target vector through the output layer, wherein the target vector is used for representing the mood level with the highest matching probability in the third matrix.
5. The method according to claim 1, wherein in the case where the feature information includes a character feature, the identifying feature information of the voice data includes:
Recognizing a phrase used for representing the identity of a person in the voice data, and extracting third voice data containing the phrase used for representing the identity of the person, wherein the third voice data comprises at least one second voice segment;
determining a target sentence pattern matched with the second voice fragment in a preset sentence pattern set;
determining a third speech segment associated with the second speech segment in the speech data based on the target sentence pattern;
and determining the character characteristics of the third voice segment based on the phrase used for representing the character identity and contained in the second voice segment.
6. The method according to claim 1 or 5, wherein, in the case where the feature information includes character features, the extracting key voices of the voice data based on feature information of the voice data includes:
And extracting fourth voice data matched with the preset character features based on the character features of the voice data, wherein the key voice comprises the fourth voice data.
7. The method according to claim 1, wherein, in the case where the feature information includes a mood feature and a character feature, the extracting key voices of the voice data based on feature information of the voice data includes:
extracting fifth voice data matched with the preset character features based on the character features of the voice data, wherein the fifth voice data comprises at least one fourth voice segment;
And extracting a fourth voice segment whose mood features meet the preset mood to obtain sixth voice data, wherein the key voice comprises the sixth voice data.
8. A text generation apparatus, the apparatus comprising:
The acquisition module is used for acquiring voice data;
The recognition module is used for recognizing characteristic information of the voice data, wherein the characteristic information comprises at least one of a mood characteristic and a character characteristic;
The extraction module is used for extracting key voices of the voice data based on the characteristic information of the voice data;
The generation module is used for displaying texts corresponding to the voice data based on the key voices;
In the case where the feature information includes a mood feature, the extracting module includes:
the first determining unit is used for determining first voice data containing a preset mood in the voice data, wherein the first voice data comprises at least one first voice segment;
the second determining unit is used for determining a first keyword based on the occurrence frequency of phrases in the voice data, wherein the first keyword meets at least one of the following: a phrase whose occurrence frequency in the voice data is greater than a second threshold; a phrase in the voice data whose weighted frequency is greater than a third threshold, wherein the weighted frequency is the product of the occurrence frequency and a second weight corresponding to the phrase;
a third determining unit, configured to determine a first weight corresponding to the first voice segment based on the occurrence frequency or weighted frequency of the first keyword in the first voice segment, or the degree of correlation between the first voice segment and a target word stock, wherein the degree of correlation with the target word stock is determined based on the occurrence frequency or weighted frequency of the first keyword in the first voice segment that matches the target word stock;
the first extraction unit is used for extracting the first voice segment with the first weight larger than a first threshold value to obtain second voice data, and the key voice comprises the second voice data.
9. The apparatus of claim 8, wherein, in the case where the characteristic information includes a mood characteristic, the identifying module includes:
The first recognition unit is used for recognizing the phrase used for representing the mood in the voice data and determining the mood characteristic of the voice data based on the phrase used for representing the mood;
Or, a second recognition unit, configured to recognize the voice feature of the voice data, and determine the mood feature of the voice data based on the voice feature of the voice data.
10. The apparatus of claim 9, wherein the second identifying unit comprises:
A dividing subunit for dividing the voice data into a plurality of voice fragments;
a first determining subunit, configured to determine, based on speech features of N speech frames in the speech segment, speech feature vectors of N speech frames in the speech segment, where N is a positive integer;
The prediction subunit is used for inputting the voice feature vectors of the N voice frames into a pre-trained mood prediction model to predict the mood level of the voice segment containing a preset mood;
the first obtaining subunit is used for obtaining a target vector output by the mood prediction model, wherein the target vector is used for representing the mood level of the voice segment containing the preset mood;
And the second determining subunit is used for determining the mood characteristics of the voice fragments based on the target vector.
11. The apparatus of claim 10 wherein the mood prediction model comprises an input layer, at least one convolution layer, a fully connected layer, and an output layer, the fully connected layer comprising a first hidden layer and a second hidden layer;
the prediction subunit is specifically configured to:
acquiring the N voice feature vectors in the input layer;
Inputting the N voice feature vectors into the at least one convolution layer for feature extraction to obtain a target feature matrix;
Inputting the target feature matrix into the first hidden layer, multiplying a first preset weight matrix with the target feature matrix in the first hidden layer, adding a first preset bias matrix, and activating by a first preset activation function to obtain a first matrix;
Inputting the first matrix into the second hidden layer, multiplying a second preset weight matrix with the first matrix in the second hidden layer, adding a second preset bias matrix, and activating by a second preset activation function to obtain a second matrix;
Inputting the second matrix into the output layer, multiplying a third preset weight matrix with the second matrix in the output layer, adding a third preset offset vector, and activating by a third preset activation function to obtain a third matrix; the third matrix is a column vector, one row in the third matrix corresponds to one mood level, the value of a target row in the third matrix represents the matching probability of the voice segment to the target mood level, and the target row is any row in the third matrix;
and outputting the target vector through the output layer, wherein the target vector is used for representing the mood level with the highest matching probability in the third matrix.
12. The apparatus of claim 8, wherein, in the case where the characteristic information includes a character characteristic, the identification module includes:
the third recognition unit is used for recognizing the phrase used for representing the identity of the person in the voice data and extracting third voice data containing the phrase used for representing the identity of the person, wherein the third voice data comprises at least one second voice segment;
a fourth determining unit, configured to determine, in a preset sentence pattern set, a target sentence pattern that matches the second speech segment;
a fifth determining unit configured to determine a third speech segment associated with the second speech segment in the speech data based on the target sentence pattern;
and a sixth determining unit, configured to determine a character feature of the third speech segment based on a phrase for characterizing the character identity included in the second speech segment.
13. The apparatus according to claim 8 or 12, wherein in case the feature information includes a character feature, the extracting module includes:
and the second extraction unit is used for extracting fourth voice data matched with the preset character features based on the character features of the voice data, and the key voice comprises the fourth voice data.
14. The apparatus of claim 8, wherein, in the case where the feature information includes a mood feature and a character feature, the extracting module includes:
A third extraction unit for extracting fifth voice data matched with a preset character feature based on the character feature of the voice data, wherein the fifth voice data comprises at least one fourth voice segment;
And the fourth extraction unit is used for extracting a fourth voice segment whose mood features meet the preset mood to obtain sixth voice data, and the key voice comprises the sixth voice data.
15. An electronic device comprising a processor, a memory and a program or instruction stored on the memory and executable on the processor, which when executed by the processor, implements the steps of the text generation method of any of claims 1-7.
16. A readable storage medium, characterized in that the readable storage medium has stored thereon a program or instructions which, when executed by a processor, implement the steps of the text generation method according to any of claims 1-7.
CN202110794502.6A 2021-07-14 2021-07-14 Text generation method, device, electronic equipment and readable storage medium Active CN113488025B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110794502.6A CN113488025B (en) 2021-07-14 2021-07-14 Text generation method, device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110794502.6A CN113488025B (en) 2021-07-14 2021-07-14 Text generation method, device, electronic equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN113488025A CN113488025A (en) 2021-10-08
CN113488025B true CN113488025B (en) 2024-05-14

Family

ID=77939145

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110794502.6A Active CN113488025B (en) 2021-07-14 2021-07-14 Text generation method, device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN113488025B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110335612A (en) * 2019-07-11 2019-10-15 招商局金融科技有限公司 Minutes generation method, device and storage medium based on speech recognition
CN110827853A (en) * 2019-11-11 2020-02-21 广州国音智能科技有限公司 Voice feature information extraction method, terminal and readable storage medium
CN112863495A (en) * 2020-12-31 2021-05-28 维沃移动通信有限公司 Information processing method and device and electronic equipment
CN112925945A (en) * 2021-04-12 2021-06-08 平安科技(深圳)有限公司 Conference summary generation method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN113488025A (en) 2021-10-08

Similar Documents

Publication Publication Date Title
CN110444198B (en) Retrieval method, retrieval device, computer equipment and storage medium
CN110853618B (en) Language identification method, model training method, device and equipment
US10013977B2 (en) Smart home control method based on emotion recognition and the system thereof
US9547471B2 (en) Generating computer responses to social conversational inputs
CN111833853B (en) Voice processing method and device, electronic equipment and computer readable storage medium
CN110517689B (en) Voice data processing method, device and storage medium
CN107818781A (en) Intelligent interactive method, equipment and storage medium
US20200075024A1 (en) Response method and apparatus thereof
CN110597952A (en) Information processing method, server, and computer storage medium
CN112309365B (en) Training method and device of speech synthesis model, storage medium and electronic equipment
CN109256133A (en) A kind of voice interactive method, device, equipment and storage medium
CN107155121B (en) Voice control text display method and device
Kaushik et al. Automatic sentiment detection in naturalistic audio
CN108345612A (en) A kind of question processing method and device, a kind of device for issue handling
CN111930792A (en) Data resource labeling method and device, storage medium and electronic equipment
CN108710653B (en) On-demand method, device and system for reading book
CN113591489B (en) Voice interaction method and device and related equipment
CN114138960A (en) User intention identification method, device, equipment and medium
CN113743267A (en) Multi-mode video emotion visualization method and device based on spiral and text
US10282417B2 (en) Conversational list management
CN109948155B (en) Multi-intention selection method and device and terminal equipment
CN113488025B (en) Text generation method, device, electronic equipment and readable storage medium
CN114125506A (en) Voice auditing method and device
CN111968646A (en) Voice recognition method and device
CN111091821A (en) Control method based on voice recognition and terminal equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant