CN113488025A - Text generation method and device, electronic equipment and readable storage medium - Google Patents

Text generation method and device, electronic equipment and readable storage medium

Info

Publication number
CN113488025A
CN113488025A
Authority
CN
China
Prior art keywords
voice
voice data
matrix
preset
speech
Prior art date
Legal status
Granted
Application number
CN202110794502.6A
Other languages
Chinese (zh)
Other versions
CN113488025B (en)
Inventor
姚晓颖
Current Assignee
Vivo Mobile Communication Hangzhou Co Ltd
Original Assignee
Vivo Mobile Communication Hangzhou Co Ltd
Priority date
Filing date
Publication date
Application filed by Vivo Mobile Communication Hangzhou Co Ltd
Priority to CN202110794502.6A
Publication of CN113488025A
Application granted
Publication of CN113488025B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/26 Speech to text systems
    • G10L 17/00 Speaker identification or verification techniques
    • G10L 17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a text generation method and device, electronic equipment and a readable storage medium, and belongs to the field of communication. The method comprises the following steps: acquiring voice data; identifying feature information of the voice data, wherein the feature information comprises at least one of tone features and character features; extracting key voice from the voice data based on the feature information of the voice data; and displaying a text corresponding to the voice data based on the key voice.

Description

Text generation method and device, electronic equipment and readable storage medium
Technical Field
The application belongs to the technical field of communication, and particularly relates to a text generation method and device, an electronic device and a readable storage medium.
Background
In scenes such as meetings and lectures, a text summary generally needs to be compiled based on a recorded audio file or video file. At present, the audio or video file is usually processed by converting all of the voice data into text data, so the generated text is long.
As can be seen, text generated from voice data in the prior art is lengthy.
Disclosure of Invention
An embodiment of the present application provides a text generation method, an apparatus, an electronic device, and a readable storage medium, which can solve the problem in the prior art that a text generated based on voice data is relatively long.
In a first aspect, an embodiment of the present application provides a text generation method, where the method includes:
acquiring voice data;
identifying feature information of the voice data, wherein the feature information comprises at least one of tone features and character features;
extracting key voice of the voice data based on the feature information of the voice data;
and displaying a text corresponding to the voice data based on the key voice.
In a second aspect, an embodiment of the present application provides a text generation apparatus, where the apparatus includes:
the acquisition module is used for acquiring voice data;
the recognition module is used for recognizing the characteristic information of the voice data, wherein the characteristic information comprises at least one of tone characteristics and character characteristics;
the extraction module is used for extracting key voice of the voice data based on the characteristic information of the voice data;
and the generating module is used for displaying the text corresponding to the voice data based on the key voice.
In a third aspect, an embodiment of the present application provides an electronic device, which includes a processor, a memory, and a program or instructions stored on the memory and executable on the processor, and when executed by the processor, the program or instructions implement the steps of the method according to the first aspect.
In a fourth aspect, embodiments of the present application provide a readable storage medium, on which a program or instructions are stored, which when executed by a processor implement the steps of the method according to the first aspect.
In a fifth aspect, an embodiment of the present application provides a chip, where the chip includes a processor and a communication interface, where the communication interface is coupled to the processor, and the processor is configured to execute a program or instructions to implement the method according to the first aspect.
In the embodiment of the application, voice data is obtained; feature information of the voice data is identified, wherein the feature information comprises at least one of tone features and character features; key voice of the voice data is extracted based on the feature information of the voice data; and a text corresponding to the voice data is displayed based on the key voice. In this way, the key voice is extracted based on the feature information of the voice data, so that the text displayed for the voice data is generated from the key voice rather than from all of the voice data, which avoids producing a lengthy text.
Drawings
Fig. 1 is a flowchart of a text generation method provided in an embodiment of the present application;
FIG. 2 is a schematic diagram of a principle of dividing sentence-level speech segments according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a mood prediction model according to an embodiment of the present application;
FIG. 4 is a schematic diagram illustrating a principle of determining a high-frequency phrase according to an embodiment of the present application;
fig. 5 is a second flowchart of a text generation method according to an embodiment of the present application;
fig. 6 is a third flowchart of a text generation method according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a display interface provided by an embodiment of the present application;
fig. 8 is a fourth flowchart of a text generation method according to an embodiment of the present application;
fig. 9 is a fifth flowchart of a text generation method provided in the embodiment of the present application;
fig. 10 is a sixth flowchart of a text generation method provided in an embodiment of the present application;
fig. 11 is a block diagram of a text generation apparatus according to an embodiment of the present application;
fig. 12 is one of structural diagrams of an electronic device according to an embodiment of the present application;
fig. 13 is a second structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments that can be derived by one of ordinary skill in the art from the embodiments given herein are intended to be within the scope of the present disclosure.
The terms "first", "second" and the like in the description and in the claims of the present application are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be appreciated that the data so used may be interchanged under appropriate circumstances, so that the embodiments of the application may be practiced in sequences other than those illustrated or described herein. The terms "first", "second" and the like do not limit the number of objects; for example, the first object may be one or more than one. In addition, "and/or" in the specification and claims means at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the objects before and after it.
The text generation method provided by the embodiment of the present application is described in detail below with reference to the accompanying drawings through specific embodiments and application scenarios thereof.
Referring to fig. 1, fig. 1 is a flowchart of a text generation method according to an embodiment of the present disclosure.
As shown in fig. 1, the method comprises the steps of:
step 101, voice data is obtained.
In a specific implementation, the voice data may be voice data collected in real time, or voice data in an audio file or a video file collected in advance, which may be determined according to an actual situation, and the embodiment of the present application is not limited herein.
And 102, identifying characteristic information of the voice data, wherein the characteristic information comprises at least one of tone characteristics and character characteristics.
In the embodiment of the application, the mood feature may be a feature used for representing the emotional color and intensity of the voice content. The emotion and sentiment expressed by different voice contents can be determined based on the mood feature, so as to determine the importance of the different voice contents. The key voice extracted based on the mood feature may be voice content in the voice data that needs important attention, or summarizing and guiding voice content.
In a specific implementation, the mood feature can be determined from the audio tone of the voice data. For example, speech with a faster and higher tone has a more vivid mood, so the corresponding voice content may carry stronger emotional color and may be summarizing content or content needing important attention; such voice can be extracted as key voice.
The mood feature may also be determined from phrases contained in the voice data that are used for representing mood, that is, phrases whose word sense carries emotional color. For example, adjectives or adverbs such as "how", "very", "important" and "special" carry strong emotional color, so voice content containing such phrases may be summarizing content or content needing important attention, and such voice can be extracted as key voice.
The mood feature can also be determined from the sentence pattern of the voice data, where the sentence patterns include declarative, interrogative, imperative and exclamatory patterns, and different sentence patterns carry different emotional colors. Illustratively, the voice content of an interrogative sentence may raise a meeting issue or a problem that needs discussion, and the voice content of an imperative sentence may be a task to be completed or content needing important attention; such voices can be extracted as key voices.
It is to be understood that the implementation form of the mood characteristic is not limited thereto, and the embodiment of the present application does not limit this.
In this embodiment of the application, the character feature may be a feature of the speaker corresponding to the voice content, for example, the identity of the speaker. Generally, the more important the speaker characterized by the character feature, the more important the corresponding voice content. The key voice extracted based on the character features of the voice data may be voice content in the voice data that needs important attention, or summarizing and guiding voice content.
In specific implementation, the character characteristics can be determined by the identity characteristics of the speaker corresponding to the voice content. The identity characteristics can be determined based on manual labeling of a user or based on voiceprint characteristics of voice content, and for example, through a preset corresponding relationship between the voiceprint characteristics and the identity characteristics, after the voiceprint characteristics of the voice content are recognized, the identity characteristics of a speaker corresponding to the voice content can be determined. The identity characteristic can also be determined based on a preset rule, for example, based on the position of the voice content in the voice data, the identity characteristic corresponding to the voice content with the later position is more important.
And 103, extracting key voice of the voice data based on the characteristic information of the voice data.
In this embodiment, a key speech may be extracted from the speech data based on the at least one feature information, and then the speech content corresponding to the extracted key speech may be a topic, a key point, a key content, and the like of the speech data, and then step 104 is executed continuously.
And 104, displaying a text corresponding to the voice data based on the key voice.
In specific implementation, the key speech can be sorted, converted into a text and displayed.
Optionally, the sorting the key voices may include deduplication of the key voices, and may further include at least one of: 1) sorting the key voices after the duplication is removed according to a time sequence; 2) classifying the key voices after duplication removal based on preset key phrases, for example, if the preset key phrases of the voice data are a site, materials, guests and security, the key voices associated with the site, the key voices associated with the materials, the key voices associated with the guests and the key voices associated with the security can be respectively determined, so as to classify the key voices; 3) and classifying the key voices after the duplication is removed based on the voiceprint characteristics, namely classifying the key voices corresponding to the same voiceprint characteristics into one class. It is to be understood that the implementation form of the key speech processing is not limited thereto, and may be determined according to practical situations, and the embodiment of the present application is not limited thereto.
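For illustration, a minimal Python sketch of this optional sorting step is given below. The segment structure (text, start time, voiceprint id) and the example keyword groups are assumptions introduced here for clarity, not elements defined by the method:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class KeySegment:
    text: str
    start_ms: int
    voiceprint_id: str

def organize_key_voices(segments, keyword_groups):
    # 1) de-duplicate identical segments while keeping chronological order
    unique = sorted(set(segments), key=lambda s: s.start_ms)
    # 2) classify by preset key phrases (a segment may fall into several groups)
    by_keyword = {label: [s for s in unique if any(k in s.text for k in keys)]
                  for label, keys in keyword_groups.items()}
    # 3) classify by voiceprint feature (same speaker -> same class)
    by_speaker = {}
    for s in unique:
        by_speaker.setdefault(s.voiceprint_id, []).append(s)
    return unique, by_keyword, by_speaker
```

A caller could pass keyword_groups such as {"site": {"site"}, "materials": {"materials"}} to reproduce the classification example above.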
It should be noted that, when processing the voice data, the voice data may be converted into text data based on voice recognition, and then the feature information of the text data may be recognized; or, the feature information of the voice data may be recognized first, and after the key voice is determined, the key voice is converted into a text; or, for different feature information, it is determined whether to recognize feature information first and then recognize speech as a text, or to recognize speech as a text and then recognize feature information, which may be determined according to actual conditions.
The text generation method provided by the embodiment of the application acquires voice data, identifies feature information of the voice data, extracts key voice of the voice data based on the feature information, and displays a text corresponding to the voice data based on the key voice. In this way, the displayed text is generated from the key voice rather than from all of the voice data, which avoids producing a lengthy text.
In the embodiment of the present application, the key speech may be extracted by recognizing at least one of the mood features and the character features, which are described below.
1) The feature information includes a situation of the mood feature.
In this case, optionally, the recognizing the feature information of the voice data includes:
recognizing phrases used for representing tone in the voice data, and determining tone characteristics of the voice data based on the phrases used for representing tone;
or recognizing the voice characteristics of the voice data and determining the tone characteristics of the voice data based on the voice characteristics of the voice data.
In a first embodiment, the recognizing the feature information of the voice data includes: and identifying phrases used for representing tone in the voice data, and determining tone characteristics of the voice data based on the phrases used for representing tone.
In a specific implementation, a phrase may be a word, a term or a short phrase. By splitting the voice data into phrases, a plurality of phrases contained in the voice data can be obtained, where a phrase generally refers to a noun phrase or a verb phrase. Alternatively, a mood word library may be preset, where the mood word library includes mood words used for representing various moods; for example, phrases used for representing an emphasized mood include adjectives or adverbs such as "important" and "special". Based on the preset mood words contained in the preset mood word library, phrases matching the preset mood words can be identified in the voice data, where a matching phrase may be identical to a preset mood word or may have the same or a similar meaning. If a phrase matching any mood word in the preset mood word library is identified in the voice data, the mood feature of the voice data can be determined based on the mood represented by that mood word.
It should be noted that, optionally, before the voice data is processed, the voice data may be divided into a plurality of voice segments. The voice segment may be a word or a phrase in the voice data, or may be a sentence, or may be a paragraph. In an exemplary implementation form, the voice data may be pre-divided into a plurality of sentence-level voice segments based on a preset division rule, where the preset division rule may be determined based on a voice pause in the voice data, or may be determined based on a sentence structure in the voice data, and may be determined according to practical situations, and the embodiment of the present application is not limited herein. If a phrase matching with any of the mood words in the preset mood word bank is identified in the certain voice fragment, the mood characteristics of the voice fragment can be determined based on the mood represented by the mood word.
For ease of understanding, the following describes the division of sentence-level speech segments based on speech pauses in the speech data. As shown in fig. 2, when sentence-level speech segments are divided, the sentences may be split at pauses in the audio waveform corresponding to the speech data: when the decibel level of the speech data stays below a preset decibel threshold for more than a certain time, the speech data between line A and line B in the figure may be determined as a pause, the speech data before line A may be determined as one sentence-level speech segment, and the speech data after line B may be determined as another sentence-level speech segment.
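For illustration only, the following is a minimal Python sketch of such pause-based segmentation, assuming a mono waveform normalized to [-1, 1], a 20 ms frame length, a fixed decibel threshold and a 700 ms minimum pause; the specific values are illustrative defaults, not values mandated by the method:

```python
import numpy as np

def split_sentences(samples, sample_rate, db_threshold=-40.0, min_pause_ms=700):
    """Split a mono waveform (normalized to [-1, 1]) into sentence-level segments
    at long quiet pauses. The threshold values are illustrative defaults."""
    frame = int(sample_rate * 0.02)                          # 20 ms analysis frames
    n = len(samples) // frame
    rms = np.sqrt(np.mean(samples[:n * frame].reshape(n, frame) ** 2, axis=1))
    db = 20 * np.log10(np.maximum(rms, 1e-10))               # per-frame level in dBFS
    quiet = db < db_threshold
    segments, seg_start, pause_len = [], 0, 0
    for i, q in enumerate(quiet):
        if q:
            pause_len += 1
        else:
            if pause_len * 20 >= min_pause_ms:               # a long pause just ended
                end = (i - pause_len) * frame                # close the segment before it
                if end > seg_start:
                    segments.append((seg_start, end))
                seg_start = i * frame                        # next segment starts here
            pause_len = 0
    if seg_start < len(samples):
        segments.append((seg_start, len(samples)))
    return segments                                          # list of (start, end) sample indices
```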
In a second embodiment, the recognizing the feature information of the voice data includes: and recognizing the voice characteristics of the voice data, and determining the tone characteristics of the voice data based on the voice characteristics of the voice data.
In a specific implementation, the mood features of the voice segments may be identified based on information such as the sound features, audio spectrum features or voiceprint features of the voice data.
In this embodiment, optionally, the recognizing the voice feature of the voice data and determining the mood feature of the voice data based on the voice feature of the voice data includes:
dividing the voice data into a plurality of voice segments;
determining voice feature vectors of N voice frames in the voice segment based on the voice features of the N voice frames in the voice segment, wherein N is a positive integer;
inputting the voice feature vectors of the N voice frames into a pre-trained tone prediction model, and predicting tone levels of the voice segments containing preset tones;
acquiring a target vector output by the tone prediction model, wherein the target vector is used for representing tone levels of the voice segments containing preset tones;
based on the target vector, determining the tone features of the voice segments.
In this implementation form, the voice data may be divided into a plurality of voice segments, and a voice feature vector may be extracted from N voice frames in each voice segment. In a specific implementation, the N speech frames may be determined based on the frame length of the speech data; for example, when the frame length is 20 milliseconds, every 20 milliseconds of the speech data forms one frame. For each speech frame, feature extraction can be performed in k dimensions to obtain a 1 x k speech feature vector, where k is a positive integer, and the speech feature vectors of the N speech frames corresponding to one speech segment can form an N x k speech feature matrix that serves as the input matrix of the mood prediction model. In this implementation form, k low-level descriptors (LLDs) are used to describe the k dimensions, and the specific descriptors are not limited here.
The k low-level descriptors may be pre-customized. Optionally, the k low-level descriptors are determined from at least one of sound features and audio spectrum features. The sound features may include, but are not limited to, at least one of tonal features, timbre features and loudness features, such as the pitch or tone of the sound and the brightness of the sound. The audio spectrum features may include, but are not limited to, at least one of temporal features, such as attack time, spectral centroid and zero-crossing rate, and frequency features, such as amplitude, fundamental frequency, sinusoidal components and residual components. In one exemplary implementation, with N = 20 and k = 65, the processing flow of a speech segment is as follows: original speech segment → 20 original speech frames → extract the speech feature vector of each speech frame based on the 65 customized LLDs → input the 20 speech feature vectors into the mood prediction model for mood prediction → determine the mood feature.
Alternatively, the k low-level descriptors may be obtained by training a neural network, such as a Convolutional Neural Network (CNN). In one exemplary implementation, with N = 20 and k = 65, the processing flow of a speech segment is as follows: original speech segment → 20 original speech frames → 65 LLDs obtained by CNN training → extract the speech feature vector of each speech frame based on the 65 trained LLDs → input the 20 speech feature vectors into the mood prediction model for mood prediction → determine the mood feature. It is to be understood that the implementation form of the k low-level descriptors is not limited thereto and may be determined according to the actual situation; the embodiment of the present application is not limited herein.
In an optional implementation form, the speech feature vectors may further be decoded by an Adaptive Automatic Encoder (AAE) to re-represent the features of the speech frames. The implicit factors of the AAE include the emotional state of the speech segment. Since the speech signal of a speech frame may be determined by multiple implicit factors, such as emotional state, age, gender and speech content, the implicit factors determining the speech signal can be estimated by the AAE to reconstruct the speech frame and re-represent its speech features. In this way, the reconstructed speech frames carry emotion labels, and the determined mood features are more distinct. In one exemplary implementation, with N = 20 and k = 65, the processing flow of a speech segment is as follows: original speech segment → 20 original speech frames → extract the speech feature vector of each speech frame based on the 65 LLDs → the AAE re-represents each speech feature vector → input the 20 speech feature vectors into the mood prediction model for mood prediction → determine the mood feature.
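Before turning to training, the following Python sketch shows how the N x k input matrix could be assembled for one speech segment, assuming N = 20 frames of 20 ms and k = 65 dimensions; the low-level descriptors are stubbed with simple per-frame statistics purely for illustration, since the method leaves the concrete LLD set open:

```python
import numpy as np

def speech_feature_matrix(segment_samples, sample_rate, n_frames=20, k=65):
    """Build the N x k input matrix for the mood prediction model.
    The 65 LLDs are stubbed with simple per-frame statistics for illustration."""
    frame = int(sample_rate * 0.02)                 # 20 ms per speech frame
    rows = []
    for i in range(n_frames):
        chunk = segment_samples[i * frame:(i + 1) * frame]
        if len(chunk) < frame:
            chunk = np.pad(chunk, (0, frame - len(chunk)))  # zero-pad short frames
        spectrum = np.abs(np.fft.rfft(chunk, n=frame))
        feats = np.concatenate([
            [np.sqrt(np.mean(chunk ** 2))],                      # loudness-like feature
            [np.mean(np.abs(np.diff(np.sign(chunk)))) / 2],      # zero-crossing rate
            spectrum[:k - 2],                                    # spectral slice as filler
        ])
        rows.append(np.resize(feats, k))            # pad/trim to exactly k descriptors
    return np.stack(rows)                           # shape (n_frames, k) == (20, 65)
```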
The mood prediction model is trained based on the N speech feature vectors and the mood levels of the speech segments. Specifically, the mood prediction model may be trained for a preset mood, and the prediction result is used to indicate whether a speech segment contains the preset mood and, if so, the mood level, where the mood level may be understood as the strength of the preset mood. Taking an emphasized mood as the preset mood as an example, the training samples of the mood prediction model include a large number of speech segments containing the emphasized mood, together with a predetermined mood level for each of them. The mood prediction model is trained on these samples so that its prediction result can indicate whether a speech segment contains the emphasized mood and at which level, for example on a scale of 1 to 5, where 1 represents slight emphasis and 5 represents strong emphasis. It should be noted that, in other implementation forms, a mood prediction model for predicting the mood class of a speech segment may be trained in advance, and the prediction result of such a model is used to determine which preset mood the speech segment contains.
Further, optionally, the mood prediction model includes an input layer, at least one convolutional layer, a fully-connected layer, and an output layer, where the fully-connected layer includes a first hidden layer and a second hidden layer;
inputting the voice feature vectors of the N voice frames into a pre-trained tone prediction model, predicting tone levels of the voice segments containing preset tones, and including:
acquiring the N voice feature vectors in the input layer;
inputting the N voice feature vectors into the at least one convolution layer for feature extraction to obtain a target feature matrix;
inputting the target characteristic matrix into the first hidden layer, multiplying a first preset weight matrix by the target characteristic matrix in the first hidden layer, adding a first preset bias matrix, and activating through a first preset activation function to obtain a first matrix;
inputting the first matrix into the second hidden layer, multiplying a second preset weight matrix with the first matrix in the second hidden layer, adding a second preset bias matrix, and activating through a second preset activation function to obtain a second matrix;
inputting the second matrix into the output layer, multiplying a third preset weight matrix by the second matrix in the output layer, adding a third preset offset vector, and activating through a third preset activation function to obtain a third matrix; the third matrix is a column vector, one row in the third matrix corresponds to one tone level, the value of a target row in the third matrix represents the matching probability of the voice fragment matched to the target tone level, and the target row represents any row in the third matrix;
and outputting the target vector through the output layer, wherein the target vector is used for representing the tone level with the highest matching probability in the third matrix.
In a specific implementation, taking N = 20 and k = 65 as an example and the emphasized mood as the preset mood, an exemplary implementation form of the mood prediction model is as follows, as shown in fig. 3:
the language prediction model comprises an input layer (input layer), at least one convolutional layer (convolutional layer), a fully connected layer (fully connected layer) and an output layer (output layer), wherein the fully connected layer comprises a first hidden layer (hidden layer 1) and a second hidden layer (hidden layer 2), the first hidden layer and the second hidden layer are respectively provided with 256 nodes, the output layer is provided with 5 nodes, the nodes respectively correspond to 5 levels 1-5 containing emphasized language, wherein 1 represents slight emphasis, 5 represents strong emphasis, and the strength containing emphasized language is gradually increased from 1 to 5.
The input layer of the mood prediction model has 20 nodes, each corresponding to the speech feature vector of one speech frame. The N speech feature vectors are input into the at least one convolutional layer, where feature extraction is performed on the speech feature matrix formed by the N speech feature vectors to obtain the target feature matrix.
And then inputting the target characteristic matrix into the first hidden layer. Each node in the first hidden layer is connected with each node in the input layer, a weight value on each line is determined in pre-training, and preset weight values on 256 × 20 lines between 256 nodes in the first hidden layer and 20 nodes in the input layer can form the first preset weight matrix. In the first hidden layer, a first preset weight matrix is multiplied by the target feature matrix and added with a first preset bias matrix, and activation is performed through a first preset activation function, so that an output result of the first hidden layer can be obtained, and the output result is marked as the first matrix. Wherein the first preset activation function may be a ReLU function.
Then, the first matrix is input into the second hidden layer. Each node in the second hidden layer is connected with each node in the first hidden layer, a weight value on each connection line is determined in pre-training, and preset weight values on 256 × 256 connection lines between 256 nodes in the second hidden layer and 256 nodes in the first hidden layer can form the second preset weight matrix. In the second hidden layer, a second preset weight matrix is multiplied by the first matrix and added with a second preset bias matrix, and activation is performed through a second preset activation function, so that an output result of the second hidden layer can be obtained, and the output result is marked as the second matrix. Wherein, the second preset activation function may also be a ReLU function.
Thereafter, the second matrix is input to the output layer. Each node in the output layer is connected with each node in the second hidden layer, a weight value on each connection line is determined in pre-training, and preset weight values on 5 × 256 connection lines between 5 nodes in the output layer and 256 nodes in the second hidden layer can form the third preset weight matrix. In the output layer, a third preset weight matrix is multiplied by the second matrix and added with a third preset bias matrix, and activation is performed through a third preset activation function, so that the third matrix can be obtained. Wherein the third preset activation function may be a softmax function.
The third matrix is a column vector of 5 × 1, each row corresponds to a tone level, a value of each row is a real number of (0,1), a value of a certain row represents a matching probability that the speech segment is matched to the tone level corresponding to the row, and a sum of values of 5 rows is 1, for example, if the third matrix is:
(0.2, 0.2, 0.4, 0.1, 0.1)^T, then
the probability of the current speech fragment matching to the mood level 1 is 0.2, the probability of the current speech fragment matching to the mood level 2 is 0.2, the probability of the current speech fragment matching to the mood level 3 is 0.4, the probability of the current speech fragment matching to the mood level 4 is 0.1, and the probability of the current speech fragment matching to the mood level 5 is 0.1.
Then, the target vector may be determined based on the third matrix, for example, the target vector may be a column vector of 5 × 1 dimensions, and a value of each row is 1 or 0, and a row with the highest probability in the third matrix is taken as 1, and other rows are taken as 0, for example, in the following example, the target vector is:
(0, 0, 1, 0, 0)^T
The mood prediction model outputs the target vector, and based on the target vector the mood level of the speech segment can be determined to be 3. It should be noted that, if a speech segment input into the mood prediction model contains the emphasized mood, the value of a target row of the target vector is 1, and the mood level corresponding to that target row is the mood level of the speech segment. If the speech segment input into the mood prediction model does not contain the emphasized mood, the matching probability of the speech segment for any mood level is 0, and the value of every row of the target vector is 0. It is to be understood that the implementation form of the target vector is not limited thereto and may be determined according to the actual situation; the embodiment of the present application is not limited herein.
It should be noted that, in other implementation forms, the 5 nodes of the output layer may instead correspond to 5 levels 1-5 of the emphasized mood where 1 denotes that no emphasized mood is contained and 5 denotes strong emphasis, with the strength of the emphasized mood increasing gradually from 1 to 5; this may be determined according to the actual situation, and the embodiment of the present application is not limited herein.
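As an aid to understanding, the following numpy sketch reproduces the forward pass of the fully connected part described above; it assumes the convolutional layer's output has already been reduced to a 20-dimensional feature vector, and the weight matrices and biases stand in for the preset values learned in training:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def predict_mood_level(target_features, w1, b1, w2, b2, w3, b3):
    """Forward pass of the fully connected part of the mood prediction model.
    target_features: 20-dim vector from the convolutional layer (simplified here).
    w1: 256 x 20, w2: 256 x 256, w3: 5 x 256 preset weight matrices; b*: preset biases."""
    h1 = relu(w1 @ target_features + b1)            # first hidden layer, 256 nodes
    h2 = relu(w2 @ h1 + b2)                         # second hidden layer, 256 nodes
    third_matrix = softmax(w3 @ h2 + b3)            # 5 x 1, matching probability per mood level
    target_vector = np.zeros_like(third_matrix)
    target_vector[np.argmax(third_matrix)] = 1.0    # one-hot: the best-matching mood level
    return third_matrix, target_vector, int(np.argmax(third_matrix)) + 1
```

The handling described above for segments that contain no emphasized mood at all (an all-zero target vector) is omitted from this sketch.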
In this case, optionally, the extracting the key voice of the voice data based on the feature information of the voice data includes:
determining first voice data containing preset voice mood in the voice data, wherein the first voice data comprises at least one first voice segment;
determining a first keyword based on the occurrence frequency of phrases in the voice data, wherein the first keyword meets at least one of the following conditions: it is a phrase whose occurrence frequency in the voice data is greater than a second threshold; or it is a phrase whose weighted frequency in the voice data is greater than a corresponding preset threshold, where the weighted frequency is the product of the occurrence frequency and a second weight corresponding to the phrase;
determining a first weight corresponding to the first voice fragment based on the occurrence frequency or the weighted frequency of the first keyword in the first voice fragment or the degree of correlation between the first voice fragment and a target word bank, wherein the degree of correlation of the target word bank is determined based on the occurrence frequency or the weighted frequency of the first keyword matched with the target word bank in the first voice fragment;
and extracting the first voice segment with the first weight larger than a first threshold value to obtain second voice data, wherein the key voice comprises the second voice data.
In this embodiment, the emphasized mood is taken as the preset mood, and the first voice data is voice data containing the emphasized mood. However, due to differences in speaking habits, the first voice data may contain voice whose mood is emphasized but whose content is not actually significant. To screen out such voice, a first voice segment with a higher first weight in the first voice data may be extracted as key voice, where the first weight is used to represent the degree of correlation between the first voice segment and the main content of the voice data.
It should be noted that, before the voice data is processed, the voice data is divided into a plurality of voice segments, and the first voice segment is a voice segment containing a preset mood in the plurality of voice segments. If the voice data is not divided into a plurality of voice segments before the voice data is processed, after the first voice data is determined, the first voice data may be divided into a plurality of first voice segments based on the preset division rule, which may be determined according to actual situations, and the embodiment of the present application is not limited herein.
In a specific implementation, the first weight may be determined based on the occurrence frequency or weighted frequency of the first keyword in the first speech segment, or based on the degree of correlation between the first speech segment and a target lexicon. The first keyword is a high-frequency keyword. After determining the plurality of phrases in the voice data, the occurrence frequency of each phrase may be determined, and the phrases may be sorted from high to low by occurrence frequency. When the natural occurrence frequency is used, the phrases are divided into high-frequency phrases and low-frequency phrases based on the second threshold; when the weighted frequency is used, they are divided based on the corresponding preset threshold. The high-frequency phrases can then be determined as the first keywords. Illustratively, the occurrence frequencies of phrase 1, phrase 2, phrase 3, phrase 4 and phrase 5 are as shown in fig. 4; based on the threshold, the 5 phrases can be divided into two groups, where phrase 1, phrase 2 and phrase 3 have higher occurrence frequencies and phrase 4 and phrase 5 have lower occurrence frequencies.
The occurrence frequency of the first keyword refers to how often the first keyword appears in the first speech segment. For example, if the first keyword includes "S9", then in the speech segment "we will pay more attention to this promotional activity of S9, and for this promotional activity of S9 we have the following scheme", the occurrence frequency of "S9" is 2.
The weighted frequency of the first keyword is the product of its occurrence frequency and the second weight corresponding to the first keyword. For example, assuming the second weight of the first keyword "S9" is 1.5, then in the speech segment "this promotional activity of S9 is regarded as important, and for this promotional activity of S9 we have the following scheme", the weighted frequency of "S9" is 3. The second weight of a phrase may be set by the user, or may be determined based on the occurrence frequency of the phrase in historical speech data collected in advance on the user equipment; this may be determined according to the actual situation, and the embodiment of the present application is not limited herein.
The degree of correlation between the first speech segment and a target lexicon can be determined based on the occurrence frequency or weighted frequency, in the first speech segment, of the first keywords that match the target lexicon. Specifically, a plurality of lexicons may be preset, each collecting one type of phrase; for example, a model lexicon collects phrases related to models of the user equipment, a software function lexicon collects phrases related to application functions on the user equipment, and a marketing scheme lexicon collects phrases related to marketing schemes. When a phrase in the first speech segment is recognized as belonging to a target lexicon, the occurrence frequency or weighted frequency of that phrase is counted into the degree of correlation between the first speech segment and that target lexicon.
For ease of understanding, the following are exemplified: assuming that the first keywords contained in the speech segment 1 are "S9", "S8" and "promotional activity", respectively, and the first keywords contained in the speech segment 2 are "S9" and "promotional activity", respectively, wherein "S9" and "S8" belong to a model thesaurus and "promotional activity" belongs to an activity thesaurus, and the second weights of "S9", "S8" and "promotional activity" are 1.5, 1.5 and 1, respectively, then, the degree of correlation between the speech segment 1 and the model thesaurus is 3, and the degree of correlation between the speech segment 1 and the activity thesaurus is 1; the degree of correlation between the voice segment 2 and the model lexicon is 1.5, and the degree of correlation between the voice segment 2 and the active lexicon is 1. If the first voice segment with the correlation degree with the model lexicon larger than 2 is extracted as the key voice, the voice segment 1 can be extracted as the key voice, and the voice segment 2 can not be extracted as the key voice; if the first speech segment having a degree of correlation with the active thesaurus greater than 2 is extracted as the key speech, neither the speech segment 1 nor the speech segment 2 may be extracted as the key speech.
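A small Python sketch of this weighted-frequency relevance computation is given below; the example lexicons, second weights and segment text only mirror the worked example above and are not fixed by the method:

```python
def lexicon_relevance(segment_text, lexicons, second_weights):
    """Degree of correlation between a speech segment (as text) and each preset
    lexicon, computed as the weighted frequency of the matching keywords."""
    relevance = {name: 0.0 for name in lexicons}
    for name, keywords in lexicons.items():
        for kw in keywords:
            count = segment_text.count(kw)          # occurrence frequency in the segment
            relevance[name] += count * second_weights.get(kw, 1.0)
    return relevance

# Reproducing the example above: segment 1 hits "S9", "S8" and "promotional activity".
seg1 = "the S9 and S8 promotional activity plan"
print(lexicon_relevance(seg1,
                        {"model": {"S9", "S8"}, "activity": {"promotional activity"}},
                        {"S9": 1.5, "S8": 1.5, "promotional activity": 1.0}))
# -> {'model': 3.0, 'activity': 1.0}
```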
It should be noted that, after the second speech data is extracted as the key speech, each first speech segment may be associated with the target lexicon with which it has the highest degree of correlation; subsequently, when sorting the key speech, the text may be grouped and ordered based on the target lexicons.
It should be noted that, in other embodiments of the present application, optionally, the first weight may also be determined based on the following two implementation forms:
in a first implementation form, the first weight is determined based on a temporal position of the first speech segment in the speech data. In a specific implementation, the corresponding relationship between the time position and the first weight may be determined based on a preset rule, where the preset rule may be a rule set by a user in a self-defined manner or a rule set by an electronic device in a default manner, and for example, because important speech content is usually in a pressure axis position, the first weight of a first speech segment in the speech data at a later time position is higher, and the first weight of a second speech segment in the speech data at a middle or earlier time position is lower.
In a second implementation form, the first weight is determined based on the character feature of the first voice segment. In a specific implementation, the correspondence between character features and the first weight may be determined based on a preset rule, which may be user-defined or set by default on the electronic device. Illustratively, when the character feature of the first speech segment indicates that the corresponding speaker is a professor or a general manager, the first weight of the first speech segment is higher; when the character feature indicates that the corresponding speaker is a host, the first weight is lower. After the character features of the voice data are recognized, the first weight of each first voice segment can be determined accordingly.
For convenience of understanding, as shown in fig. 5, an exemplary implementation flow in the present case is as follows:
and 5-1, dividing the voice data into a plurality of voice fragments, and acquiring text fragments corresponding to the voice fragments.
In this example, the speech segments are sentence-level speech segments. And after the voice data is acquired, acquiring an audio waveform corresponding to the voice data. Then, based on the audio waveform, as shown in fig. 2, if the decibel of the voice data in a certain portion is less than 70dBFS for more than 700 ms, the certain portion is determined as a pause, sentence-level voice segments are divided based on the pause, and the obtained voice segments are converted into text segments. Thereafter, step 5-2 or step 5-3 may be performed.
And 5-2, extracting the voice fragment containing the preset tone, and converting the voice fragment into a text fragment.
Extracting a voice feature vector from each voice segment according to frames to obtain an input matrix of N x 65, inputting a pre-trained tone prediction model, predicting tone levels of emphasized tones of the voice segments, extracting voice segments containing the emphasized tones from the voice segments, and converting the voice segments into text segments. As shown in fig. 6, the specific implementation flow is as follows:
a. and collecting training samples of the tone prediction model. Collecting a large number of sample voice fragments containing the emphasized tone, then extracting voice feature vectors of a plurality of voice frames in each sample voice fragment, and determining the tone level of the emphasized tone of each sample voice fragment to obtain the training sample.
b. And decoding the voice feature vector by using AAE to perform feature re-representation on the voice feature vector.
c. And training a tone prediction model by using the speech feature vector after the feature re-representation and the tone level of the emphasized tone of each sample speech segment determined in advance. And correcting the weight values on a plurality of connecting lines between each layer in the tone prediction model in the training process to determine the optimal weight values so as to obtain the subsequently used tone prediction model.
d. Obtaining a plurality of voice segments in current voice data, extracting voice feature vectors of a plurality of voice frames in each voice segment, and decoding the voice feature vectors by using AAE.
e. After decoding, inputting a plurality of speech feature vectors of a speech segment into the trained tone prediction model to predict tone levels of the emphasized tone, and determining whether the speech segment contains the speech segment of the emphasized tone based on the output result of the tone prediction model.
And 5-3, extracting text segments containing preset tone words.
Performing phrase splitting on the plurality of text segments to identify whether a text segment containing preset emphatic words exists in the plurality of text segments, wherein the preset emphatic words comprise 'important', 'special' and the like.
And 5-4, determining first weights of the speech segment containing the emphasized tone and the text segment containing the emphasized tone word.
A corresponding first weight is determined based on the degree of correlation between the speech segment containing the emphasized tone, or the text segment containing the emphasized tone word, and the target lexicon. Taking as an example that the predetermined first keywords include "S9", "S8" and "promotional activity", that the second weights of "S9", "S8" and "promotional activity" are 1.5, 1.5 and 1 respectively, and that the first threshold is 3, suppose speech segment 1 with emphasized tone hits "S9", "S8" and "promotional activity", speech segment 2 with emphasized tone hits "S8" and "promotional activity", and speech segment 3 with emphasized tone hits "promotional activity". Then the degree of correlation between speech segment 1 and the model lexicon is 3 and between speech segment 1 and the activity lexicon is 1; the degree of correlation between speech segment 2 and the model lexicon is 1.5 and between speech segment 2 and the activity lexicon is 1; and the degree of correlation between speech segment 3 and the model lexicon is 0 and between speech segment 3 and the activity lexicon is 1.
And 5-5, extracting the voice segments with the first weight larger than the first threshold value as key voice.
Following the above example, speech segment 1 is extracted as the key speech. It should be noted that the text segment with the first weight greater than the first threshold may be directly determined as the key text corresponding to the key speech.
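Putting steps 5-2 to 5-5 together, a minimal filtering sketch might look as follows; segment ids, a precomputed mood level per segment, a flag for preset mood words and a precomputed first weight per segment are all assumptions made for illustration:

```python
def extract_key_speech(segment_ids, mood_levels, has_mood_word, first_weights, first_threshold):
    """Steps 5-2 to 5-5 in miniature: a segment is a candidate if it carries an
    emphasized mood (predicted level > 0) or contains a preset mood word, and it
    is kept as key speech only if its first weight exceeds the first threshold."""
    return [sid for sid in segment_ids
            if (mood_levels.get(sid, 0) > 0 or has_mood_word.get(sid, False))
            and first_weights.get(sid, 0.0) > first_threshold]
```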
2) The feature information includes a case of a character feature.
In this case, optionally, three implementation forms are included:
in a first implementation form, the recognizing the feature information of the voice data includes:
recognizing a phrase used for representing the identity of a person in the voice data, and extracting third voice data containing the phrase used for representing the identity of the person, wherein the third voice data comprises at least one second voice segment;
determining a target sentence pattern matched with the second voice fragment in a preset sentence pattern set;
determining a third speech segment in the speech data associated with the second speech segment based on the target sentence pattern;
and determining the character characteristics of the third voice segment based on the phrase used for representing the character identity and contained in the second voice segment.
In this implementation form, in order to facilitate reading, the phrase for characterizing the identity of the person is represented as a second keyword. The second keyword may include a phrase for characterizing a name, such as "zhang san" or "li si", a phrase for characterizing a title or a job, such as "professor", "teacher" or "manager", and a phrase for characterizing a relationship, such as "mom" or "vintage". The character features corresponding to each of the second keywords may be preset by a user, or may be determined by itself based on the word senses of the second keywords, which may be determined specifically according to actual situations, and the embodiment of the present application is not limited herein.
In a specific implementation, the preset sentence pattern set is a sentence pattern set used for determining the character features of the speech segment, and optionally, the following three sentence patterns may be included:
first, the sentence pattern of the character features of the voice data in the preset time period before the second voice segment is determined. Illustratively, the preset sentence pattern set may include a first sentence pattern "xxx just says very well/summarized very well", based on which it may be determined that there may be a speech segment with the speaker character "xxx" in the preset time period before the current second speech segment, and the character characteristic of the speech segment is the character characteristic of "xxx". If the sentence pattern matched with the second voice fragment is the first sentence pattern or the sentence pattern similar to the first sentence pattern, a third voice fragment associated with the second voice fragment is a voice fragment in a preset time period before the second voice fragment, and the character feature of the third voice fragment is the character feature of 'xxx'.
Second, a sentence pattern for determining the character features of the voice data within a preset time period after the second voice segment. Illustratively, the preset sentence pattern set may include a second sentence pattern "yyy, how do you feel / what do you think?". Based on the second sentence pattern, it may be determined that there may be a speech segment whose speaker is "yyy" within a preset time period after the current second speech segment, and the character feature of that speech segment is the character feature of "yyy". If the sentence pattern matched by the second voice segment is the second sentence pattern or a similar sentence pattern, the third voice segment associated with the second voice segment is a voice segment within the preset time period after the second voice segment, and the character feature of the third voice segment is the character feature of "yyy".
And thirdly, determining the sentence pattern of the character characteristics of the second voice fragment. Illustratively, the preset sentence pattern set may include a third sentence pattern, where the third sentence pattern is "my is zzz", and based on the third sentence pattern, it may be determined that the speaker corresponding to the current second speech segment is "zzz", and then the character feature of the current second speech segment is the character feature of "zzz". And if the sentence pattern matched with the second voice fragment is the third sentence pattern or the sentence pattern similar to the third sentence pattern, the third voice fragment associated with the second voice fragment is the second voice fragment.
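As a rough illustration of this sentence-pattern matching, the sketch below uses English stand-ins for the three preset patterns; a real deployment would use patterns in the language of the speech data, and the exact wording here is an assumption:

```python
import re

# Illustrative stand-ins for the three preset sentence patterns.
PATTERNS = [
    # (regex capturing the person phrase, which segment the character feature labels)
    (re.compile(r"^(?P<who>.+?) just said it very well", re.IGNORECASE), "previous"),
    (re.compile(r"^(?P<who>.+?), how do you feel", re.IGNORECASE), "next"),
    (re.compile(r"^(?:I am|my name is) (?P<who>.+)$", re.IGNORECASE), "current"),
]

def match_person_pattern(segment_text):
    """Return (person phrase, which associated segment it labels) or None."""
    for regex, applies_to in PATTERNS:
        m = regex.search(segment_text)
        if m:
            return m.group("who").strip(), applies_to
    return None

print(match_person_pattern("Professor Yang just said it very well"))
# -> ('Professor Yang', 'previous')
```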
In a second implementation form, the recognizing the feature information of the voice data includes:
determining a target voiceprint characteristic corresponding to the voice data in a preset voiceprint set, wherein the preset voiceprint set is preset with a corresponding relation between the voiceprint characteristic and a character characteristic;
and determining the character features of the voice data according to the target voiceprint features based on the corresponding relation between the voiceprint features and the character features.
In this implementation form, the human feature of the voice data may be determined based on a voiceprint feature of the voice data. The preset voiceprint set can store the voiceprint characteristics and the character characteristics of each speaking character in the voice data and the corresponding relation between the voiceprint characteristics and the character characteristics in advance. By identifying the target voiceprint characteristics of the voice data, the person characteristics corresponding to the target voiceprint characteristics can be searched in the preset voiceprint set and used as the person characteristics of the voice data. Illustratively, the preset voiceprint set stores voiceprint feature 1, voiceprint feature 2 and voiceprint feature 3, the voiceprint feature 1 corresponds to "professor yang", the voiceprint feature 2 corresponds to "blogger manager", the voiceprint feature 3 corresponds to "moderator a", and when the voiceprint feature of a certain part of voice data is recognized as the voiceprint feature 1, the character feature of the part of voice data is the character feature of "professor yang".
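A voiceprint lookup of this kind could be sketched as follows, assuming voiceprint features are represented as embedding vectors compared by cosine similarity with an illustrative 0.8 floor; the method itself only requires that a correspondence between voiceprint features and character features be preset:

```python
import numpy as np

def character_of(voiceprint, preset_voiceprints, min_similarity=0.8):
    """Return the character feature whose stored voiceprint embedding is most
    similar to the recognized one, or None if nothing is similar enough."""
    best_name, best_sim = None, -1.0
    for name, ref in preset_voiceprints.items():
        sim = float(np.dot(voiceprint, ref) /
                    (np.linalg.norm(voiceprint) * np.linalg.norm(ref) + 1e-10))
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name if best_sim >= min_similarity else None
```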
In a third implementation form, the recognizing the feature information of the voice data includes:
receiving a first input of a user;
and determining, in response to the first input, a character feature of the voice data.
In this implementation form, the character feature of the voice data is determined based on a user-defined input performed by the user.
In a specific implementation, in an optional implementation form, a plurality of speakers in the voice data may be distinguished based on the voiceprint features of the voice data to obtain a speaker list corresponding to the voice data, where one entry in the speaker list is used to display one speaker and the voice data corresponding to that speaker. By displaying the speaker list, the user can listen to the voice data of the speakers in the list one by one and mark their character features through inputs.
In this embodiment, optionally, when the feature information includes a character feature, the extracting a key voice of the voice data based on the feature information of the voice data includes:
and extracting fourth voice data matched with preset character features based on the character features of the voice data, wherein the key voice comprises the fourth voice data.
In this implementation form, optionally, the fourth voice data matched with the preset character features includes voice data whose character feature corresponds to a third weight larger than a fourth threshold value. The third weight may be used to characterize the importance of the speakers represented by different character features; for example, the third weight determined by the character feature of "teacher" may be higher than the third weight determined by the character feature of "student", and the third weight determined by the character feature of "team leader" may be higher than the third weight determined by the character feature of "team member".
The third weight may be determined based on the character feature. For example, if voice segment 1 and voice segment 2 are both segments of the voice data of an academic conference, the character feature of voice segment 1 is the character feature of "professor" and the character feature of voice segment 2 is the character feature of "third presenter", then the third weight of voice segment 1 may be preset to 8 and the third weight of voice segment 2 may be preset to 2.
The third weight may also be determined based on a user input. Different speakers in the voice data can be distinguished in advance by recognizing the voiceprint features of the voice data, and a speaker list can be generated and displayed to the user. A selection input acting on at least one speaker can then be received, and the third weight of the character feature of a speaker on which the selection input acts is set higher than the third weight of the character feature of a speaker on which no selection input acts. For example, as shown in fig. 7, in the case of displaying the speaker list, if an input from the user acting on "user 1" is received and it is thereby determined that the speaker corresponding to "user 1" is a key person, the third weight of the character feature corresponding to "user 1" is higher than the third weights of the character features of the speakers corresponding to "user 2" and "user 3".
In the case that the third weight of a certain character feature is greater than the fourth threshold, the character feature may be considered to match the preset character features, and the voice data corresponding to that character feature may be extracted as key voice.
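Purely as an illustration of this weighting scheme, the following sketch filters segments by the third weight of their character feature; the weight values, the threshold value and the assumption that each segment object carries a `character` attribute (as in the earlier sketch) are illustrative and not fixed by the embodiment:

```python
# Illustrative third weights per character feature and an illustrative fourth threshold.
THIRD_WEIGHTS = {"professor": 8, "team leader": 6, "third presenter": 2, "student": 1}
FOURTH_THRESHOLD = 5

def extract_key_voice_by_character(segments):
    """Keep segments whose character feature carries a third weight above the
    fourth threshold, i.e. segments matching the preset character features."""
    return [seg for seg in segments
            if THIRD_WEIGHTS.get(seg.character, 0) > FOURTH_THRESHOLD]
```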
For convenience of understanding, as shown in fig. 8, an exemplary implementation flow in the present case is as follows:
and 8-1, determining key character characteristics in the voice data.
The key character features may be determined in either of two ways: in the first way, the character features of the voice data are recognized based on the preset sentence pattern set, the third weight of each character feature is determined, and the key character features whose third weight is larger than the fourth threshold are determined; in the second way, the key character features are determined based on a user input acting on the speaker list, as described above.
And 8-2, acquiring the voiceprint characteristics of the voice data corresponding to the key character characteristics.
Herein denoted as key voiceprint features.
And 8-3, extracting the voice data associated with the key voiceprint characteristics as key voice.
3) The case where the feature information includes both a mood feature and a character feature
In this case, optionally, the extracting the key voice of the voice data based on the feature information of the voice data includes:
extracting fifth voice data matched with preset character features based on the character features of the voice data, wherein the fifth voice data comprises at least one fourth voice segment;
and extracting a fourth voice segment containing a preset mood, so as to obtain sixth voice data, wherein the key voice comprises the sixth voice data.
In this embodiment, after fifth voice data matched with a preset character feature is extracted, sixth voice data containing a preset mood may be further extracted from the fifth voice data. That is, the implementation forms in the above cases 1) and 2) may be combined, and the specific implementation form may refer to the description in the above embodiments, which is not described herein again.
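A minimal sketch of this chained extraction is shown below; the `third_weights` mapping, the threshold and the `contains_preset_mood` predicate (for example, backed by the mood prediction model) are caller-supplied assumptions, and segments are again assumed to carry a `character` attribute:

```python
def extract_key_voice_combined(segments, third_weights, fourth_threshold,
                               contains_preset_mood):
    """Chain the two filters: first keep segments whose character feature
    matches the preset character features (fifth voice data), then keep those
    that also contain a preset mood (sixth voice data)."""
    fifth = [seg for seg in segments
             if third_weights.get(seg.character, 0) > fourth_threshold]
    sixth = [seg for seg in fifth if contains_preset_mood(seg)]
    return sixth
```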
It should be noted that, in other embodiments of the present application, the feature information may further include a phrase frequency, and optionally, in a case that the feature information includes the phrase frequency, the extracting the key speech of the speech data based on the feature information of the speech data includes:
determining the first keyword based on the occurrence frequency of phrases in the voice data, wherein the first keyword meets at least one of the following conditions: it is a phrase whose occurrence frequency in the voice data is greater than a second threshold; or it is a phrase whose weighted frequency in the voice data is greater than a preset threshold, the weighted frequency being the product of the occurrence frequency and a second weight corresponding to the phrase;
and extracting seventh voice data containing the first keyword in the voice data, wherein the key voice comprises the seventh voice data.
In this embodiment, the occurrence frequency of different phrases may represent the importance degree of different phrases in the voice data, and exemplarily, a phrase with a higher occurrence frequency may be a subject or a key point of the voice content. That is, the key speech extracted based on the occurrence frequency of the phrase may be used to determine the subject or key point of the speech data.
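As an illustration of the two keyword conditions, the sketch below counts phrase frequencies and applies both the plain-frequency and weighted-frequency tests; the two threshold values and the per-phrase second weights are assumptions chosen for the example:

```python
from collections import Counter

def first_keywords(phrases, second_threshold=5, second_weights=None,
                   weighted_threshold=5.0):
    """Determine first keywords: a phrase qualifies if its occurrence frequency
    exceeds the second threshold, or if its weighted frequency (occurrence
    frequency multiplied by the phrase's second weight) exceeds a threshold."""
    second_weights = second_weights or {}
    counts = Counter(phrases)
    keywords = set()
    for phrase, freq in counts.items():
        weighted = freq * second_weights.get(phrase, 1.0)
        if freq > second_threshold or weighted > weighted_threshold:
            keywords.add(phrase)
    return keywords
```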
For convenience of understanding, as shown in fig. 9, an exemplary implementation flow in this case is as follows:
and 9-1, acquiring text data of the voice data.
And converting the voice data into text data based on voice recognition, and dividing the text data according to a preset sentence division rule to obtain a plurality of sentence-level text segments.
And 9-2, carrying out phrase splitting on the text segments at the sentence levels.
Taking the sentence-level segment "this popularization activity of S9 deserves more attention; for this popularization activity of S9, we have the following scheme" as an example, the phrases "S9", "popularization activity" and "scheme" can be split out. The other sentence-level segments in the voice data can be split in the same way to obtain at least one phrase each.
And 9-3, determining the occurrence frequency of each phrase in the voice data.
And 9-4, determining high-frequency keywords in the voice data based on the occurrence frequency of each phrase.
And 9-5, extracting the voice fragment containing the high-frequency keyword as key voice.
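For illustration only, the sketch below strings the fig. 9 steps together; a simple regex tokenizer stands in for the real phrase-splitting step (which would require proper segmentation for the target language), and the frequency cut-off is an assumed value:

```python
import re
from collections import Counter

def split_phrases(sentence):
    # Stand-in tokenizer; a real implementation would use proper
    # phrase segmentation for the target language.
    return re.findall(r"[A-Za-z0-9]+", sentence)

def extract_by_high_frequency(sentences, min_count=2):
    """Fig. 9 flow sketch: split sentence-level text segments into phrases,
    count occurrence frequencies, take high-frequency phrases as keywords and
    keep the sentences that contain them."""
    counts = Counter(p for s in sentences for p in split_phrases(s))
    keywords = {p for p, c in counts.items() if c >= min_count}
    return [s for s in sentences if any(k in split_phrases(s) for k in keywords)]
```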
As shown in fig. 10, in an exemplary embodiment of the present application, an implementation flow of the text generation method is as follows:
step 10-1, acquiring voice data;
and step 10-2, preprocessing the voice data.
Preprocessing can include phrase division, namely extracting noun or verb phrases in the voice data and determining the high-frequency keywords; sentence division, namely dividing the voice data into a plurality of sentence-level voice segments according to a preset division rule; voiceprint recognition, namely distinguishing the speakers corresponding to the plurality of voiceprint features in the voice data and generating a speaker list; and receiving an input performed by the user on a speaker, so as to determine the key characters in the voice data.
And step 10-3, extracting first key voice containing high-frequency key words.
And step 10-4, extracting voice data containing preset tone, and extracting second key voice based on the first weight of the voice fragment.
And step 10-5, extracting third key voices with the character characteristics meeting the preset character characteristics.
And 10-6, sorting the first key voice, the second key voice and the third key voice.
And after the first key voice, the second key voice and the third key voice are converted into text segments, sequencing and removing duplication according to the appearance time sequence to obtain a key text segment list arranged according to a time line.
And 10-7, manually editing the key text fragment list by a user, generating and displaying a text corresponding to the voice data.
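For illustration, the sketch below shows one way the merging, time-ordering and de-duplication of steps 10-6 and 10-7 could be performed; the segment attributes (`start`, `end`) and the caller-supplied `transcribe` speech-to-text function are assumptions made for the example:

```python
def build_key_text(first_key, second_key, third_key, transcribe):
    """Merge the three groups of key voice segments, convert each to text,
    sort by appearance time and de-duplicate, yielding a timeline-ordered
    list of key text fragments."""
    merged = first_key + second_key + third_key
    merged.sort(key=lambda seg: seg.start)
    key_text, seen = [], set()
    for seg in merged:
        ident = (round(seg.start, 2), round(seg.end, 2))
        if ident in seen:          # drop segments extracted by more than one rule
            continue
        seen.add(ident)
        key_text.append(transcribe(seg))
    return key_text
```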
To sum up, the text generation method provided by the embodiment of the present application acquires voice data; identifies feature information of the voice data; extracts key voice of the voice data based on the feature information of the voice data; and displays a text corresponding to the voice data based on the key voice. Since the key voice is extracted based on the feature information of the voice data, the text displayed on the basis of the key voice can focus on the key content of the voice data.
It should be noted that, in the text generation method provided in the embodiment of the present application, the execution subject may be a text generation apparatus, or a control module in the text generation apparatus for executing the text generation method. In the embodiment of the present application, a method for executing text generation by a text generation device is taken as an example, and a text generation device provided in the embodiment of the present application is described.
Referring to fig. 11, fig. 11 is a structural diagram of a text generating apparatus according to an embodiment of the present application.
As shown in fig. 11, the text generation apparatus 1100 includes:
an obtaining module 1101, configured to obtain voice data;
a recognition module 1102, configured to recognize feature information of the voice data, where the feature information includes at least one of a mood feature and a character feature;
an extracting module 1103, configured to extract a key voice of the voice data based on the feature information of the voice data;
a generating module 1104, configured to display a text corresponding to the voice data based on the key voice.
Optionally, in a case that the feature information includes a mood feature, the identifying module 1102 includes:
the first identification unit is used for identifying phrases used for representing tone in the voice data and determining tone characteristics of the voice data based on the phrases used for representing tone;
or the second recognition unit is used for recognizing the voice characteristics of the voice data and determining the tone characteristics of the voice data based on the voice characteristics of the voice data.
Optionally, the second identification unit includes:
a dividing subunit, configured to divide the voice data into a plurality of voice segments;
a first determining subunit, configured to determine, based on speech features of N speech frames in the speech segment, speech feature vectors of the N speech frames in the speech segment, where N is a positive integer;
the prediction subunit is used for inputting the voice feature vectors of the N voice frames into a pre-trained tone prediction model and predicting tone levels of the voice segments containing preset tones;
the first obtaining subunit is configured to obtain a target vector output by the mood prediction model, where the target vector is used to represent a mood level of the speech segment that contains a preset mood;
and the second determining subunit is used for determining the tone features of the voice segments on the basis of the target vectors.
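The embodiment does not specify which per-frame speech features are used; one common choice, shown in the hedged sketch below, is frame-level MFCC vectors computed with librosa (the feature type, dimensionality and library are implementation assumptions, not requirements of the described model):

```python
import numpy as np
import librosa

def frame_feature_vectors(wav_path, n_mfcc=13):
    """Return an (N, n_mfcc) matrix: one speech feature vector for each of the
    N frames of the speech segment stored in wav_path."""
    y, sr = librosa.load(wav_path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape (n_mfcc, N)
    return mfcc.T.astype(np.float32)
```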
Optionally, the mood prediction model comprises an input layer, at least one convolutional layer, a fully-connected layer and an output layer, wherein the fully-connected layer comprises a first hidden layer and a second hidden layer;
the predictor unit is specifically configured to:
acquiring the N voice feature vectors in the input layer;
inputting the N voice feature vectors into the at least one convolution layer for feature extraction to obtain a target feature matrix;
inputting the target characteristic matrix into the first hidden layer, multiplying a first preset weight matrix by the target characteristic matrix in the first hidden layer, adding a first preset bias matrix, and activating through a first preset activation function to obtain a first matrix;
inputting the first matrix into the second hidden layer, multiplying a second preset weight matrix with the first matrix in the second hidden layer, adding a second preset bias matrix, and activating through a second preset activation function to obtain a second matrix;
inputting the second matrix into the output layer, multiplying a third preset weight matrix by the second matrix in the output layer, adding a third preset offset vector, and activating through a third preset activation function to obtain a third matrix; the third matrix is a column vector, one row in the third matrix corresponds to one tone level, the value of a target row in the third matrix represents the matching probability of the voice fragment matched to the target tone level, and the target row represents any row in the third matrix;
and outputting the target vector through the output layer, wherein the target vector is used for representing the tone level with the highest matching probability in the third matrix.
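The fully connected and output layers described above amount to repeated affine transforms followed by activations. A minimal NumPy sketch of that forward pass is given below; the layer sizes, the number of mood levels, the ReLU activations and the softmax used to produce the matching probabilities of the third matrix are illustrative assumptions, since the embodiment only refers to "preset" weights, biases and activation functions:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def predict_mood_level(target_features, params):
    """Forward pass of the fully connected part of the mood prediction model.
    `target_features` is the flattened target feature matrix produced by the
    convolutional layers; `params` holds preset weight matrices and biases."""
    h1 = relu(params["W1"] @ target_features + params["b1"])   # first hidden layer
    h2 = relu(params["W2"] @ h1 + params["b2"])                # second hidden layer
    third_matrix = softmax(params["W3"] @ h2 + params["b3"])   # one row per mood level
    return int(np.argmax(third_matrix)), third_matrix          # target vector ~ argmax

# Example with random parameters, purely to show the shapes involved:
rng = np.random.default_rng(0)
params = {
    "W1": rng.normal(size=(64, 256)), "b1": np.zeros(64),
    "W2": rng.normal(size=(32, 64)),  "b2": np.zeros(32),
    "W3": rng.normal(size=(5, 32)),   "b3": np.zeros(5),   # 5 assumed mood levels
}
level, probs = predict_mood_level(rng.normal(size=256), params)
```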
Optionally, in a case that the feature information includes a mood feature, the extracting module 1103 includes:
the first determining unit is used for determining first voice data containing a preset mood in the voice data, and the first voice data comprises at least one first voice segment;
a second determining unit, configured to determine a first keyword based on the occurrence frequency of phrases in the voice data, where the first keyword satisfies at least one of the following: it is a phrase whose occurrence frequency in the voice data is greater than a second threshold; or it is a phrase whose weighted frequency in the voice data is greater than a preset threshold, the weighted frequency being the product of the occurrence frequency and a second weight corresponding to the phrase;
a third determining unit, configured to determine a first weight corresponding to the first speech segment based on an occurrence frequency or a weighted frequency of the first keyword in the first speech segment, or a degree of correlation between the first speech segment and a target lexicon, where the degree of correlation of the target lexicon is determined based on the occurrence frequency or the weighted frequency of the first keyword in the first speech segment matching the target lexicon;
a first extracting unit, configured to extract the first voice segment with the first weight being greater than a first threshold value, to obtain second voice data, where the key voice includes the second voice data.
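For illustration, the sketch below computes a first weight per first voice segment from the (weighted) occurrence of first keywords within it and keeps segments above the first threshold; the weighting scheme and threshold value are assumptions chosen for the example:

```python
def first_weight(segment_phrases, keywords, second_weights=None):
    """Compute the first weight of a first voice segment from the occurrence
    (or weighted) frequency of first keywords inside it."""
    second_weights = second_weights or {}
    return sum(second_weights.get(p, 1.0)
               for p in segment_phrases if p in keywords)

def extract_second_voice_data(segments, segment_phrases, keywords,
                              first_threshold=3.0):
    """Keep first voice segments whose first weight exceeds the first
    threshold, yielding the second voice data."""
    return [seg for seg, phrases in zip(segments, segment_phrases)
            if first_weight(phrases, keywords) > first_threshold]
```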
Optionally, in a case where the feature information includes a character feature, the identifying module 1102 includes:
the third recognition unit is used for recognizing a phrase used for representing the identity of a person in the voice data and extracting third voice data containing the phrase used for representing the identity of the person, wherein the third voice data comprises at least one second voice segment;
a fourth determining unit, configured to determine a target sentence pattern matched with the second speech segment in a preset sentence pattern set;
a fifth determining unit configured to determine, based on the target sentence pattern, a third speech piece associated with the second speech piece in the speech data;
a sixth determining unit, configured to determine, based on a phrase included in the second voice segment and used for characterizing a person identity, a person feature of the third voice segment.
Optionally, in a case where the feature information includes a character feature, the extraction module 1103 includes:
and the second extraction unit is used for extracting fourth voice data matched with preset character features based on the character features of the voice data, and the key voice comprises the fourth voice data.
Optionally, in a case that the feature information includes a mood feature and a character feature, the extracting module 1103 includes:
the third extraction unit is used for extracting fifth voice data matched with preset character features based on the character features of the voice data, and the fifth voice data comprises at least one fourth voice segment;
and the fourth extraction unit is used for extracting a fourth voice segment with the characteristic of tone meeting the preset tone to obtain sixth voice data, and the key voice comprises the sixth voice data.
The text generation device provided by the embodiment of the application acquires voice data; identifies feature information of the voice data, wherein the feature information comprises at least one of a mood feature and a character feature; extracts key voice of the voice data based on the feature information of the voice data; and displays a text corresponding to the voice data based on the key voice. Since the key voice is extracted based on the feature information of the voice data, the text displayed on the basis of the key voice can focus on the key content of the voice data.
The text generation device in the embodiment of the present application may be a device, or may be a component, an integrated circuit, or a chip in a terminal. The device can be mobile electronic equipment or non-mobile electronic equipment. By way of example, the mobile electronic device may be a mobile phone, a tablet computer, a notebook computer, a palm top computer, a vehicle-mounted electronic device, a wearable device, an ultra-mobile personal computer (UMPC), a netbook or a Personal Digital Assistant (PDA), and the like, and the non-mobile electronic device may be a server, a Network Attached Storage (NAS), a Personal Computer (PC), a Television (TV), a teller machine or a self-service machine, and the like, and the embodiments of the present application are not particularly limited.
The text generation device in the embodiment of the present application may be a device having an operating system. The operating system may be an Android (Android) operating system, an ios operating system, or other possible operating systems, and embodiments of the present application are not limited specifically.
The text generation device provided in the embodiment of the present application can implement each process implemented by the method embodiments of fig. 1 to fig. 10, and is not described here again to avoid repetition.
Optionally, as shown in fig. 12, an electronic device 1200 is further provided in an embodiment of the present application, and includes a processor 1201, a memory 1202, and a program or an instruction stored in the memory 1202 and executable on the processor 1201, where the program or the instruction is executed by the processor 1201 to implement each process of the text generation method embodiment, and can achieve the same technical effect, and no further description is provided here to avoid repetition.
It should be noted that the electronic device in the embodiment of the present application includes the mobile electronic device and the non-mobile electronic device described above.
Fig. 13 is a schematic hardware structure diagram of an electronic device implementing an embodiment of the present application.
The electronic device 1300 includes, but is not limited to: a radio frequency unit 1301, a network module 1302, an audio output unit 1303, an input unit 1304, a sensor 1305, a display unit 1306, a user input unit 1307, an interface unit 1308, a memory 1309, a processor 1310, and the like.
Those skilled in the art will appreciate that the electronic device 1300 may further comprise a power source (e.g., a battery) for supplying power to the various components, and the power source may be logically connected to the processor 1310 via a power management system, so as to manage charging, discharging, and power consumption management functions via the power management system. The electronic device structure shown in fig. 13 does not constitute a limitation of the electronic device, and the electronic device may include more or less components than those shown, or combine some components, or arrange different components, and thus, the description is omitted here.
Wherein the processor 1310 is configured to:
acquiring voice data;
identifying feature information of the voice data, wherein the feature information comprises at least one of tone features and character features;
extracting key voice of the voice data based on the feature information of the voice data;
and displaying a text corresponding to the voice data based on the key voice.
Optionally, in a case that the feature information includes a mood feature, the processor 1310 is specifically configured to:
recognizing phrases used for representing tone in the voice data, and determining tone characteristics of the voice data based on the phrases used for representing tone;
or recognizing the voice characteristics of the voice data and determining the tone characteristics of the voice data based on the voice characteristics of the voice data.
Optionally, the processor 1310 is specifically configured to:
dividing the voice data into a plurality of voice segments;
determining voice feature vectors of N voice frames in the voice segment based on the voice features of the N voice frames in the voice segment, wherein N is a positive integer;
inputting the voice feature vectors of the N voice frames into a pre-trained tone prediction model, and predicting tone levels of the voice segments containing preset tones;
acquiring a target vector output by the tone prediction model, wherein the target vector is used for representing tone levels of the voice segments containing preset tones;
based on the target vector, determining the tone features of the voice segments.
Optionally, the mood prediction model comprises an input layer, at least one convolutional layer, a fully-connected layer and an output layer, wherein the fully-connected layer comprises a first hidden layer and a second hidden layer; the processor 1310 is specifically configured to:
acquiring the N voice feature vectors in the input layer;
inputting the N voice feature vectors into the at least one convolution layer for feature extraction to obtain a target feature matrix;
inputting the target characteristic matrix into the first hidden layer, multiplying a first preset weight matrix by the target characteristic matrix in the first hidden layer, adding a first preset bias matrix, and activating through a first preset activation function to obtain a first matrix;
inputting the first matrix into the second hidden layer, multiplying a second preset weight matrix with the first matrix in the second hidden layer, adding a second preset bias matrix, and activating through a second preset activation function to obtain a second matrix;
inputting the second matrix into the output layer, multiplying a third preset weight matrix by the second matrix in the output layer, adding a third preset offset vector, and activating through a third preset activation function to obtain a third matrix; the third matrix is a column vector, one row in the third matrix corresponds to one tone level, the value of a target row in the third matrix represents the matching probability of the voice fragment matched to the target tone level, and the target row represents any row in the third matrix;
and outputting the target vector through the output layer, wherein the target vector is used for representing the tone level with the highest matching probability in the third matrix.
Optionally, in a case that the feature information includes a mood feature, the processor 1310 is specifically configured to:
determining first voice data containing preset voice mood in the voice data, wherein the first voice data comprises at least one first voice segment;
determining a first keyword based on the occurrence frequency of phrases in the voice data, wherein the first keyword meets at least one of the following conditions: it is a phrase whose occurrence frequency in the voice data is greater than a second threshold; or it is a phrase whose weighted frequency in the voice data is greater than a preset threshold, the weighted frequency being the product of the occurrence frequency and a second weight corresponding to the phrase;
determining a first weight corresponding to the first voice fragment based on the occurrence frequency or the weighted frequency of the first keyword in the first voice fragment or the degree of correlation between the first voice fragment and a target word bank, wherein the degree of correlation of the target word bank is determined based on the occurrence frequency or the weighted frequency of the first keyword matched with the target word bank in the first voice fragment;
and extracting the first voice segment with the first weight larger than a first threshold value to obtain second voice data, wherein the key voice comprises the second voice data.
Optionally, in a case that the feature information includes a character feature, the processor 1310 is specifically configured to:
recognizing a phrase used for representing the identity of a person in the voice data, and extracting third voice data containing the phrase used for representing the identity of the person, wherein the third voice data comprises at least one second voice segment;
determining a target sentence pattern matched with the second voice fragment in a preset sentence pattern set;
determining a third speech segment in the speech data associated with the second speech segment based on the target sentence pattern;
and determining the character characteristics of the third voice segment based on the phrase used for representing the character identity and contained in the second voice segment.
Optionally, in a case that the feature information includes a character feature, the processor 1310 is further configured to:
and extracting fourth voice data matched with preset character features based on the character features of the voice data, wherein the key voice comprises the fourth voice data.
Optionally, in a case that the feature information includes a mood feature and a character feature, the processor 1310 is specifically configured to:
extracting fifth voice data matched with preset character features based on the character features of the voice data, wherein the fifth voice data comprises at least one fourth voice segment;
and extracting a fourth voice fragment with the characteristic of tone meeting the preset tone to obtain sixth voice data, wherein the key voice comprises the sixth voice data.
The electronic equipment provided by the embodiment of the application acquires voice data; identifies feature information of the voice data, wherein the feature information comprises at least one of a mood feature and a character feature; extracts key voice of the voice data based on the feature information of the voice data; and displays a text corresponding to the voice data based on the key voice. Since the key voice is extracted based on the feature information of the voice data, the text displayed on the basis of the key voice can focus on the key content of the voice data.
It should be understood that in the embodiment of the present application, the input Unit 1304 may include a Graphics Processing Unit (GPU) 13041 and a microphone 13042, and the Graphics processor 13041 processes image data of still pictures or videos obtained by an image capturing apparatus (such as a camera) in a video capturing mode or an image capturing mode. The display unit 1306 may include a display panel 13061, and the display panel 13061 may be configured in the form of a liquid crystal display, an organic light emitting diode, or the like. The user input unit 1307 includes a touch panel 13071 and other input devices 13072. A touch panel 13071, also referred to as a touch screen. The touch panel 13071 may include two parts, a touch detection device and a touch controller. Other input devices 13072 may include, but are not limited to, a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, a mouse, and a joystick, which are not described in detail herein. Memory 1309 may be used to store software programs as well as various data, including but not limited to application programs and operating systems. The processor 1310 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into processor 1310.
The embodiment of the present application further provides a readable storage medium, where a program or an instruction is stored on the readable storage medium, and when the program or the instruction is executed by a processor, the program or the instruction implements each process of the text generation method embodiment, and can achieve the same technical effect, and in order to avoid repetition, details are not repeated here.
The processor is the processor in the electronic device described in the above embodiment. The readable storage medium includes a computer readable storage medium, such as a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and so on.
The embodiment of the present application further provides a chip, where the chip includes a processor and a communication interface, the communication interface is coupled to the processor, and the processor is configured to run a program or an instruction to implement each process of the text generation method embodiment, and can achieve the same technical effect, and the details are not repeated here to avoid repetition.
It should be understood that the chips mentioned in the embodiments of the present application may also be referred to as system-on-chip, system-on-chip or system-on-chip, etc.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element. Further, it should be noted that the scope of the methods and apparatus of the embodiments of the present application is not limited to performing the functions in the order illustrated or discussed, but may include performing the functions in a substantially simultaneous manner or in a reverse order based on the functions involved, e.g., the methods described may be performed in an order different than that described, and various steps may be added, omitted, or combined. In addition, features described with reference to certain examples may be combined in other examples.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a computer software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present application.
While the present embodiments have been described with reference to the accompanying drawings, it is to be understood that the invention is not limited to the precise embodiments described above, which are meant to be illustrative and not restrictive, and that various changes may be made therein by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (18)

1. A method of text generation, the method comprising:
acquiring voice data;
identifying feature information of the voice data, wherein the feature information comprises at least one of tone features and character features;
extracting key voice of the voice data based on the feature information of the voice data;
and displaying a text corresponding to the voice data based on the key voice.
2. The method according to claim 1, wherein in the case that the feature information includes a mood feature, the recognizing feature information of the voice data includes:
recognizing phrases used for representing tone in the voice data, and determining tone characteristics of the voice data based on the phrases used for representing tone;
or recognizing the voice characteristics of the voice data and determining the tone characteristics of the voice data based on the voice characteristics of the voice data.
3. The method of claim 2, wherein the recognizing the speech features of the speech data and determining the mood features of the speech data based on the speech features of the speech data comprises:
dividing the voice data into a plurality of voice segments;
determining voice feature vectors of N voice frames in the voice segment based on the voice features of the N voice frames in the voice segment, wherein N is a positive integer;
inputting the voice feature vectors of the N voice frames into a pre-trained tone prediction model, and predicting tone levels of the voice segments containing preset tones;
acquiring a target vector output by the tone prediction model, wherein the target vector is used for representing tone levels of the voice segments containing preset tones;
based on the target vector, determining the tone features of the voice segments.
4. The method of claim 3, wherein the mood prediction model comprises an input layer, at least one convolutional layer, a fully-connected layer, and an output layer, the fully-connected layer comprising a first hidden layer and a second hidden layer;
inputting the voice feature vectors of the N voice frames into a pre-trained tone prediction model, predicting tone levels of the voice segments containing preset tones, and including:
acquiring the N voice feature vectors in the input layer;
inputting the N voice feature vectors into the at least one convolution layer for feature extraction to obtain a target feature matrix;
inputting the target characteristic matrix into the first hidden layer, multiplying a first preset weight matrix by the target characteristic matrix in the first hidden layer, adding a first preset bias matrix, and activating through a first preset activation function to obtain a first matrix;
inputting the first matrix into the second hidden layer, multiplying a second preset weight matrix with the first matrix in the second hidden layer, adding a second preset bias matrix, and activating through a second preset activation function to obtain a second matrix;
inputting the second matrix into the output layer, multiplying a third preset weight matrix by the second matrix in the output layer, adding a third preset offset vector, and activating through a third preset activation function to obtain a third matrix; the third matrix is a column vector, one row in the third matrix corresponds to one tone level, the value of a target row in the third matrix represents the matching probability of the voice fragment matched to the target tone level, and the target row represents any row in the third matrix;
and outputting the target vector through the output layer, wherein the target vector is used for representing the tone level with the highest matching probability in the third matrix.
5. The method according to any one of claims 1 to 4, wherein, in a case where the feature information includes a mood feature, the extracting a key voice of the voice data based on the feature information of the voice data includes:
determining first voice data containing preset voice mood in the voice data, wherein the first voice data comprises at least one first voice segment;
determining a first keyword based on the occurrence frequency of phrases in the voice data, wherein the first keyword meets at least one of the following conditions: it is a phrase whose occurrence frequency in the voice data is greater than a second threshold; or it is a phrase whose weighted frequency in the voice data is greater than a preset threshold, the weighted frequency being the product of the occurrence frequency and a second weight corresponding to the phrase;
determining a first weight corresponding to the first voice fragment based on the occurrence frequency or the weighted frequency of the first keyword in the first voice fragment or the degree of correlation between the first voice fragment and a target word bank, wherein the degree of correlation of the target word bank is determined based on the occurrence frequency or the weighted frequency of the first keyword matched with the target word bank in the first voice fragment;
and extracting the first voice segment with the first weight larger than a first threshold value to obtain second voice data, wherein the key voice comprises the second voice data.
6. The method according to claim 1, wherein in the case where the feature information includes a character feature, the recognizing feature information of the voice data includes:
recognizing a phrase used for representing the identity of a person in the voice data, and extracting third voice data containing the phrase used for representing the identity of the person, wherein the third voice data comprises at least one second voice segment;
determining a target sentence pattern matched with the second voice fragment in a preset sentence pattern set;
determining a third speech segment in the speech data associated with the second speech segment based on the target sentence pattern;
and determining the character characteristics of the third voice segment based on the phrase used for representing the character identity and contained in the second voice segment.
7. The method according to claim 1 or 6, wherein, in a case where the feature information includes a human character, the extracting a key voice of the voice data based on the feature information of the voice data includes:
and extracting fourth voice data matched with preset character features based on the character features of the voice data, wherein the key voice comprises the fourth voice data.
8. The method according to claim 1, wherein in a case where the feature information includes a mood feature and a character feature, the extracting a key voice of the voice data based on the feature information of the voice data includes:
extracting fifth voice data matched with preset character features based on the character features of the voice data, wherein the fifth voice data comprises at least one fourth voice segment;
and extracting a fourth voice fragment with the characteristic of tone meeting the preset tone to obtain sixth voice data, wherein the key voice comprises the sixth voice data.
9. An apparatus for generating text, the apparatus comprising:
the acquisition module is used for acquiring voice data;
the recognition module is used for recognizing the characteristic information of the voice data, wherein the characteristic information comprises at least one of tone characteristics and character characteristics;
the extraction module is used for extracting key voice of the voice data based on the characteristic information of the voice data;
and the generating module is used for displaying the text corresponding to the voice data based on the key voice.
10. The apparatus of claim 9, wherein in the case that the feature information includes a mood feature, the identifying module comprises:
the first identification unit is used for identifying phrases used for representing tone in the voice data and determining tone characteristics of the voice data based on the phrases used for representing tone;
or the second recognition unit is used for recognizing the voice characteristics of the voice data and determining the tone characteristics of the voice data based on the voice characteristics of the voice data.
11. The apparatus of claim 10, wherein the second identification unit comprises:
a dividing subunit, configured to divide the voice data into a plurality of voice segments;
a first determining subunit, configured to determine, based on speech features of N speech frames in the speech segment, speech feature vectors of the N speech frames in the speech segment, where N is a positive integer;
the prediction subunit is used for inputting the voice feature vectors of the N voice frames into a pre-trained tone prediction model and predicting tone levels of the voice segments containing preset tones;
the first obtaining subunit is configured to obtain a target vector output by the mood prediction model, where the target vector is used to represent a mood level of the speech segment that contains a preset mood;
and the second determining subunit is used for determining the tone features of the voice segments on the basis of the target vectors.
12. The apparatus of claim 11, wherein the mood prediction model comprises an input layer, at least one convolutional layer, a fully-connected layer, and an output layer, the fully-connected layer comprising a first hidden layer and a second hidden layer;
the predictor unit is specifically configured to:
acquiring the N voice feature vectors in the input layer;
inputting the N voice feature vectors into the at least one convolution layer for feature extraction to obtain a target feature matrix;
inputting the target characteristic matrix into the first hidden layer, multiplying a first preset weight matrix by the target characteristic matrix in the first hidden layer, adding a first preset bias matrix, and activating through a first preset activation function to obtain a first matrix;
inputting the first matrix into the second hidden layer, multiplying a second preset weight matrix with the first matrix in the second hidden layer, adding a second preset bias matrix, and activating through a second preset activation function to obtain a second matrix;
inputting the second matrix into the output layer, multiplying a third preset weight matrix by the second matrix in the output layer, adding a third preset offset vector, and activating through a third preset activation function to obtain a third matrix; the third matrix is a column vector, one row in the third matrix corresponds to one tone level, the value of a target row in the third matrix represents the matching probability of the voice fragment matched to the target tone level, and the target row represents any row in the third matrix;
and outputting the target vector through the output layer, wherein the target vector is used for representing the tone level with the highest matching probability in the third matrix.
13. The apparatus according to any one of claims 9 to 12, wherein in the case where the feature information includes a mood feature, the extraction module includes:
the first determining unit is used for determining first voice data containing a preset mood in the voice data, and the first voice data comprises at least one first voice segment;
a second determining unit, configured to determine a first keyword based on the occurrence frequency of phrases in the voice data, where the first keyword satisfies at least one of the following: it is a phrase whose occurrence frequency in the voice data is greater than a second threshold; or it is a phrase whose weighted frequency in the voice data is greater than a preset threshold, the weighted frequency being the product of the occurrence frequency and a second weight corresponding to the phrase;
a third determining unit, configured to determine a first weight corresponding to the first speech segment based on an occurrence frequency or a weighted frequency of the first keyword in the first speech segment, or a degree of correlation between the first speech segment and a target lexicon, where the degree of correlation of the target lexicon is determined based on the occurrence frequency or the weighted frequency of the first keyword in the first speech segment matching the target lexicon;
a first extracting unit, configured to extract the first voice segment with the first weight being greater than a first threshold value, to obtain second voice data, where the key voice includes the second voice data.
14. The apparatus of claim 9, wherein in the case where the feature information includes a character feature, the identifying module comprises:
the third recognition unit is used for recognizing a phrase used for representing the identity of a person in the voice data and extracting third voice data containing the phrase used for representing the identity of the person, wherein the third voice data comprises at least one second voice segment;
a fourth determining unit, configured to determine a target sentence pattern matched with the second speech segment in a preset sentence pattern set;
a fifth determining unit configured to determine, based on the target sentence pattern, a third speech piece associated with the second speech piece in the speech data;
a sixth determining unit, configured to determine, based on a phrase included in the second voice segment and used for characterizing a person identity, a person feature of the third voice segment.
15. The apparatus according to claim 9 or 14, wherein in the case where the feature information includes a character feature, the extraction module includes:
and the second extraction unit is used for extracting fourth voice data matched with preset character features based on the character features of the voice data, and the key voice comprises the fourth voice data.
16. The apparatus of claim 9, wherein in the case that the feature information includes a mood feature and a character feature, the extraction module comprises:
the third extraction unit is used for extracting fifth voice data matched with preset character features based on the character features of the voice data, and the fifth voice data comprises at least one fourth voice segment;
and the fourth extraction unit is used for extracting a fourth voice segment with the characteristic of tone meeting the preset tone to obtain sixth voice data, and the key voice comprises the sixth voice data.
17. An electronic device comprising a processor, a memory, and a program or instructions stored on the memory and executable on the processor, the program or instructions when executed by the processor implementing the steps of the text generation method of any of claims 1-8.
18. A readable storage medium, on which a program or instructions are stored, which when executed by a processor, carry out the steps of the text generation method according to any one of claims 1 to 8.
CN202110794502.6A 2021-07-14 2021-07-14 Text generation method, device, electronic equipment and readable storage medium Active CN113488025B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110794502.6A CN113488025B (en) 2021-07-14 2021-07-14 Text generation method, device, electronic equipment and readable storage medium


Publications (2)

Publication Number Publication Date
CN113488025A true CN113488025A (en) 2021-10-08
CN113488025B CN113488025B (en) 2024-05-14

Family

ID=77939145

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110794502.6A Active CN113488025B (en) 2021-07-14 2021-07-14 Text generation method, device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN113488025B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110335612A (en) * 2019-07-11 2019-10-15 招商局金融科技有限公司 Minutes generation method, device and storage medium based on speech recognition
CN110827853A (en) * 2019-11-11 2020-02-21 广州国音智能科技有限公司 Voice feature information extraction method, terminal and readable storage medium
CN112863495A (en) * 2020-12-31 2021-05-28 维沃移动通信有限公司 Information processing method and device and electronic equipment
CN112925945A (en) * 2021-04-12 2021-06-08 平安科技(深圳)有限公司 Conference summary generation method, device, equipment and storage medium


Also Published As

Publication number Publication date
CN113488025B (en) 2024-05-14

Similar Documents

Publication Publication Date Title
CN110853618B (en) Language identification method, model training method, device and equipment
US11475897B2 (en) Method and apparatus for response using voice matching user category
CN111833853B (en) Voice processing method and device, electronic equipment and computer readable storage medium
CN110838286A (en) Model training method, language identification method, device and equipment
US20240021202A1 (en) Method and apparatus for recognizing voice, electronic device and medium
CN110808034A (en) Voice conversion method, device, storage medium and electronic equipment
CN108428446A (en) Audio recognition method and device
CN110853617B (en) Model training method, language identification method, device and equipment
CN113420556B (en) Emotion recognition method, device, equipment and storage medium based on multi-mode signals
CN112309365A (en) Training method and device of speech synthesis model, storage medium and electronic equipment
CN108345612A (en) A kind of question processing method and device, a kind of device for issue handling
CN112632244A (en) Man-machine conversation optimization method and device, computer equipment and storage medium
CN113658577A (en) Speech synthesis model training method, audio generation method, device and medium
CN113327586A (en) Voice recognition method and device, electronic equipment and storage medium
CN117851871A (en) Multi-mode data identification method for overseas Internet social network site
CN113743267A (en) Multi-mode video emotion visualization method and device based on spiral and text
CN117524202A (en) Voice data retrieval method and system for IP telephone
CN115798459B (en) Audio processing method and device, storage medium and electronic equipment
US20170242845A1 (en) Conversational list management
CN113470617B (en) Speech recognition method, electronic equipment and storage device
CN107251137A (en) Improve method, device and the computer readable recording medium storing program for performing of the set of at least one semantic primitive using voice
CN115132170A (en) Language classification method and device and computer readable storage medium
CN113488025B (en) Text generation method, device, electronic equipment and readable storage medium
CN114170997A (en) Pronunciation skill detection method, pronunciation skill detection device, storage medium and electronic equipment
CN114373443A (en) Speech synthesis method and apparatus, computing device, storage medium, and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant