Disclosure of Invention
The invention aims to provide a voice data text conversion method based on multi-party communication, so as to solve the problems raised in the background art.
In order to achieve the above object, the present invention provides a voice data text conversion method based on multi-party communication, which comprises the following steps:
firstly, a preset password input by each of the multi-party equipment ends is identified, and two cases are distinguished:
case 1: if the preset password is correct, the equipment end is marked, the mark of each equipment end is output, and a group chat is constructed according to the marks of the equipment ends;
case 2: if the preset password is incorrect, the input window continues to pop up;
performing character conversion on voice data communicated by each equipment end in group chat;
storing the voice data and the converted text data through a memory;
extracting, from the memory, the voice data output by a preselected marking equipment end and the character data converted therefrom; identifying key data information of the preselected marking equipment end from the extracted character data to form a key title; and then extracting the voice data output by the other marking equipment ends after the key title and before the next key title appears, together with the character data converted therefrom, to form key character data;
and integrating the key character data and the key title, specifically, screening the key character data according to the key title to obtain valuable character data, and supplementing the valuable character data, the corresponding voice data and the equipment end marks into the display frame of the group chat in a mutually corresponding manner (a sketch of this screening step is given below).
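To make the screening step concrete, the following is a minimal Python sketch of how a key title could be used to screen key character data and assemble display entries. The helper names (`Message`, `matches_key_title`, the small "non-answer" word set) are illustrative assumptions made for this sketch and are not part of the claimed method.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Message:
    device_mark: str   # e.g. "boss A", "employee a1"
    voice_ref: str     # reference to the stored voice data in the memory
    text: str          # character data converted from the voice data

# Illustrative stand-in for the screening rule: replies with no substantive
# content are treated as not conforming to the key title.
NON_ANSWERS = {"unknown", "no questions", "none"}

def matches_key_title(key_title: str, text: str) -> bool:
    body = text.split(":", 1)[-1].strip().lower()
    return body not in NON_ANSWERS

def integrate(key_title: str, key_messages: List[Message]) -> List[dict]:
    """Screen the key character data by the key title and return the valuable
    character data together with voice data and equipment end marks, ready to
    be supplemented into the group-chat display frame."""
    kept = [m for m in key_messages if matches_key_title(key_title, m.text)]
    return [{"mark": m.device_mark, "voice": m.voice_ref, "text": m.text} for m in kept]
```

In a real system the relevance test would of course use the key data information (keywords, modal particles) rather than a fixed word list; the structure of the screening loop is the point of the sketch.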
As a further improvement of the technical scheme, the key data information of the preselected marking equipment end includes key character information, modal particle information and keyword extraction information.
As a further improvement of the technical scheme, the key data information extraction adopts a weighted extraction algorithm, and the algorithm steps are as follows:
sentence segmentation and punctuation are carried out according to the pauses between sounds and the tone of the sounds in the voice data, wherein the punctuation marks include periods, question marks and exclamation marks (a rough sketch of this segmentation step follows this list);
the word frequency, word length, part-of-speech, position and dictionary factors of the character data of the preselected marking equipment end are quantized using weighting coefficients, and weight calculation is performed after quantization to obtain the total weight of each word;
and the words are sorted in descending order of their total weights to obtain a keyword list, and the key data information is acquired through the keyword list.
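As a rough illustration of the first step, the sketch below segments a sequence of recognized words (with timestamps and an estimated tone label) into punctuated sentences using a pause-length threshold. The 0.6 s threshold, the `tone` labels and the input layout are assumptions made only for this example.

```python
from typing import List, Tuple

PAUSE_THRESHOLD = 0.6  # seconds of silence treated as a sentence boundary (assumed value)

def punctuate(words: List[Tuple[str, float, float, str]]) -> str:
    """words: (word, start_time, end_time, tone), where tone is one of
    'question', 'exclaim', 'neutral' as estimated from the voice data."""
    sentences, current = [], []
    for i, (word, start, end, tone) in enumerate(words):
        current.append(word)
        next_start = words[i + 1][1] if i + 1 < len(words) else None
        if next_start is None or next_start - end >= PAUSE_THRESHOLD:
            mark = {"question": "?", "exclaim": "!"}.get(tone, ".")
            sentences.append(" ".join(current) + mark)
            current = []
    return " ".join(sentences)
```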
As a further improvement of the technical solution, the total weight of each word is calculated by the following formula:
W(w) = a1·f_tf(w) + a2·f_len(w) + a3·f_pos(w) + a4·f_loc(w) + a5·f_dict(w)
wherein W(w) is the factor total weight of word w in the character data; a1 is the word frequency factor ratio and f_tf(w) is the word frequency factor; a2 is the word length factor ratio and f_len(w) is the word length factor; a3 is the part-of-speech factor ratio and f_pos(w) is the part-of-speech factor; a4 is the position factor ratio and f_loc(w) is the position factor; a5 is the dictionary factor ratio and f_dict(w) is the dictionary factor; and a1 + a2 + a3 + a4 + a5 = 1.
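A small Python rendering of this weighted sum may help; the factor scores shown are placeholders, and the ratio values are the ones assigned in embodiment 4 below, used here purely for illustration.

```python
from typing import Dict

def total_weight(factors: Dict[str, float], ratios: Dict[str, float]) -> float:
    """W(w) = a1*f_tf + a2*f_len + a3*f_pos + a4*f_loc + a5*f_dict,
    with the factor ratios summing to 1."""
    assert abs(sum(ratios.values()) - 1.0) < 1e-9
    return sum(ratios[name] * factors[name] for name in ratios)

# Ratios as assigned in embodiment 4; factor scores are made-up placeholders.
ratios = {"tf": 0.4, "len": 0.2, "pos": 0.15, "loc": 0.15, "dict": 0.1}
factors = {"tf": 0.8, "len": 0.5, "pos": 1.0, "loc": 0.7, "dict": 0.0}
print(total_weight(factors, ratios))  # ≈ 0.675
```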
As a further improvement of the technical scheme, the character conversion comprises the following specific steps:
firstly, extracting audio data output by an equipment end, and then training the audio data by using a Gaussian mixture learning algorithm;
decomposing the source audio output voice with a harmonic-plus-noise model, correcting the decomposed model by using the average fundamental frequency ratio to obtain corrected harmonic amplitude and phase parameters, performing feature extraction on the harmonic amplitude and phase parameters to obtain line spectral frequency (LSF) parameters, mapping the line spectral frequency parameters by using the Gaussian mixture model, and fusing the mapped line spectral frequency parameter features;
and performing mixed output by using the corrected harmonic amplitude and phase parameters, and then extracting the text data of the source audio output voice.
As a further improvement of the technical solution, the gaussian mixture learning algorithm includes the following steps:
firstly, training on the source audio output voice and the target audio output voice, and decomposing the corresponding harmonic-plus-noise models;
calculating the average fundamental frequency ratio of the fundamental frequency tracks of the two output voices, and simultaneously performing feature extraction on the harmonic amplitude and phase parameters of the two output voices to obtain the corresponding line spectral frequency parameters;
and performing dynamic time warping on the obtained line spectral frequency parameters, and obtaining the Gaussian mixture model by using a variational Bayes estimation algorithm.
As a further improvement of the technical solution, the calculation formula of the variational Bayes estimation algorithm is as follows:
ln p(X) = ln ∫ p(X | Y) p(Y) dY
wherein ln p(X) is the logarithmic marginal density; X is the observed audio variable; Y is the text variable of the source audio output voice; p(Y | X) is the posterior probability of Y for a given X, which the variational Bayes algorithm approximates; and p(Y) is the prior probability of Y.
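For readers who want the standard variational form behind this estimate, the LaTeX fragment below writes out the usual decomposition of the log marginal density; the variational distribution q(Y) is an assumption of this sketch and is not named in the original text.

```latex
% Log marginal density and its standard variational decomposition (sketch).
\begin{align}
  \ln p(X) &= \ln \int p(X \mid Y)\, p(Y)\, \mathrm{d}Y \\
           &= \int q(Y)\,\ln \frac{p(X \mid Y)\, p(Y)}{q(Y)}\, \mathrm{d}Y
              \;+\; \mathrm{KL}\!\left( q(Y) \,\big\|\, p(Y \mid X) \right),
\end{align}
% so maximizing the first (lower-bound) term over q(Y) drives q(Y) toward the
% posterior p(Y | X), which is how the variational Bayes estimate is obtained.
```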
Compared with the prior art, the invention has the beneficial effects that:
According to the voice data text conversion method based on multi-party communication, the characters converted from the voice data of the multi-party communication are integrated through the key titles and the key character data, and the key titles are determined in a preselected marking mode, so that the problem of insufficient pertinence of voice data conversion in the prior art is solved, and the efficiency of later manual screening after the arrangement is greatly improved.
Detailed Description
Example 1
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, the present invention provides a technical solution:
the invention provides a voice data text conversion method based on multi-party communication, which comprises the following steps:
performing character conversion on voice data communicated by each equipment end in group chat;
the voice data and the text data converted therefrom are stored through the memory, and the converted text data are then displayed in the display frame, referring to fig. 4; the displayed text data facilitate review when meeting records or study notes are arranged at a later stage, which solves the problem that text cannot be extracted for review after a video conference or video learning.
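A minimal sketch of this storage step follows, assuming a local SQLite file as the "memory" and a caller-supplied `transcribe` callable standing in for the character conversion; neither detail is specified in the embodiment.

```python
import sqlite3
from typing import Callable

def store_message(db_path: str, device_mark: str, voice_blob: bytes,
                  transcribe: Callable[[bytes], str]) -> str:
    """Convert one voice clip to text and store both the voice data and the
    converted text, so that both can be shown in the display frame and
    reviewed later. `transcribe` is a placeholder for the character-conversion
    pipeline (e.g. the HNM/GMM method of embodiment 5)."""
    text = transcribe(voice_blob)
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS messages (mark TEXT, voice BLOB, text TEXT)"
        )
        conn.execute(
            "INSERT INTO messages (mark, voice, text) VALUES (?, ?, ?)",
            (device_mark, voice_blob, text),
        )
    return text
```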
Example 2
In order to improve the security of the group chat and prevent non-members from joining, this embodiment differs from embodiment 1 in that a preset password input by each of the multi-party device ends is first identified, and two cases are distinguished:
case 1: if the preset password is correct, the device end is marked, the marks of the device ends are output, and the group chat is constructed according to the marks of the device ends, so that the people in the group chat are distinguished by their marks, the distinguishing being performed by adding a specific label, for example: if the group chat is an enterprise group, the marking mode includes boss and employee; if the group chat is a learning group, the marking mode includes teacher and student, which further improves the recognizability of the members in the group chat;
case 2: if the preset password is incorrect, the input window continues to pop up, and a device end with an incorrect preset password cannot join the group chat, which greatly improves the security of the group chat and solves the problem of non-members joining the group chat.
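The sketch below illustrates the two cases of the password check and the role marking; the role labels, the example password value and the in-memory `group` dictionary are assumptions made for the example.

```python
def join_group_chat(device_id: str, entered_password: str,
                    preset_password: str, role: str, group: dict) -> bool:
    """Case 1: correct password -> mark the device end (e.g. 'boss A', 'employee a1',
    or 'teacher'/'student' for a learning group) and add it to the group chat.
    Case 2: incorrect password -> refuse, so the caller pops up the input window again."""
    if entered_password != preset_password:
        return False  # case 2: device end is not allowed into the group chat
    group[device_id] = f"{role} {device_id}"  # case 1: output the mark of the device end
    return True

# Example: building the group used in embodiment 3 (password value is hypothetical).
group: dict = {}
for dev, role in [("A", "boss"), ("a1", "employee"), ("a2", "employee"), ("a3", "employee")]:
    join_group_chat(dev, "1234", "1234", role, group)
print(group)  # {'A': 'boss A', 'a1': 'employee a1', 'a2': 'employee a2', 'a3': 'employee a3'}
```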
Example 3
In order to improve the pertinence of the voice data conversion, this embodiment differs from embodiment 2 in that the voice data output by a preselected marking device end and the character data converted therefrom are extracted from the memory; key data information of the preselected marking device end is then identified from the extracted character data to form a key title; and the voice data output by the other marking device ends after the key title and before the next key title appears, together with the character data converted therefrom, are then extracted to form key character data;
and the key character data and the key title are integrated, specifically, the key character data are screened according to the key title to obtain valuable character data, and the valuable character data, the corresponding voice data and the device end marks are supplemented into the display frame of the group chat in a mutually corresponding manner.
In addition, the key data information of the preselected marking device end includes key character information, modal particle information and keyword extraction information.
In specific use of this embodiment, taking an enterprise conference as an example, a communication group chat is constructed by means of password input. Assume that the device end set in the group chat is S = (A, a1, a2, a3), which after marking becomes S = (boss A, employee a1, employee a2, employee a3), and that "boss A" is set as the preselected mark. When boss A sends "What questions do you still have about the above?" in the group chat, the interrogative modal word "what" is obtained, so "What questions do you still have about the above?" is determined as a key title; the voice data subsequently output by employees a1, a2 and a3, namely "Question 1: how to improve daily work efficiency", "Question 2: unknown" and "Question 3: how to realize mutual supervision among employees during work", are determined as key character data; the character data that do not conform to the key title, namely "Question 2: unknown", are culled, and the remaining "Question 1", "Question 3" and the key title "What questions do you still have about the above?" are integrated and displayed through the display frame, as shown in fig. 5, wherein:
boss A: "What questions do you still have about the above?";
employee a1: "Question 1: how to improve daily work efficiency";
employee a3: "Question 3: how to realize mutual supervision among employees during work".
Therefore, the characters converted from the voice data of the multi-party communication are integrated through the key titles and the key character data, and the key titles are determined in a preselected marking mode, so that the problem of insufficient pertinence of voice data conversion in the prior art is solved, and the efficiency of later manual screening after the arrangement is greatly improved.
Example 4
In order to improve the accuracy of extracting the key data information, this embodiment differs from embodiment 3 in that a weighted extraction algorithm is adopted for extracting the key data information, the steps of the algorithm being as follows:
sentence segmentation and punctuation are carried out according to the pauses between sounds and the tone of the sounds in the voice data, wherein the punctuation marks include periods, question marks and exclamation marks;
the word frequency, word length, part-of-speech, position and dictionary factors of the character data of the preselected marking device end are quantized using weighting coefficients, and weight calculation is performed after quantization to obtain the total weight of each word;
and the words are sorted in descending order of their total weights to obtain a keyword list, and the key data information is acquired through the keyword list.
Specifically, the total weight of each word is calculated by the following formula:
W(w) = a1·f_tf(w) + a2·f_len(w) + a3·f_pos(w) + a4·f_loc(w) + a5·f_dict(w)
wherein W(w) is the factor total weight of word w in the character data; a1 is the word frequency factor ratio and f_tf(w) is the word frequency factor; a2 is the word length factor ratio and f_len(w) is the word length factor; a3 is the part-of-speech factor ratio and f_pos(w) is the part-of-speech factor; a4 is the position factor ratio and f_loc(w) is the position factor; a5 is the dictionary factor ratio and f_dict(w) is the dictionary factor; and a1 + a2 + a3 + a4 + a5 = 1.
In this embodiment, the ratios are determined by reverse reasoning on a large-scale corpus, preferably with fuzzy processing, and are assigned according to the importance of each factor to the result: a1 is assigned 0.4, a2 is assigned 0.2, a3 and a4 are each assigned 0.15, and a5 is assigned 0.1. A candidate keyword table A is then obtained through the weight calculation, and the generation principle of the primary candidate keywords is as follows:
words whose part of speech is not one of the specified parts of speech (noun, verb, adjective, idiom), and words which do not appear in the title sentence, the head of a paragraph or the tail of a paragraph and whose word frequency is 1, are filtered out.
If the total word number of the article is Total and the number of keywords to be extracted is k, then k should satisfy:
k = Total × 35% when Total × 35% < 20; when Total × 35% ≥ 20, the k keywords extracted through the above two steps are used as the primary candidate keywords, so that the accuracy of the key data information obtained through the weight calculation is greatly improved.
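The following sketch applies the stated filtering rule and keyword count; the `Word` fields, the part-of-speech tags and the behaviour when Total × 35% ≥ 20 (capped at 20 here) are assumptions made for illustration, since the original text only states the < 20 branch explicitly.

```python
from dataclasses import dataclass
from typing import List

ALLOWED_POS = {"noun", "verb", "adjective", "idiom"}

@dataclass
class Word:
    text: str
    pos: str            # part of speech
    freq: int           # word frequency in the article
    in_title: bool      # appears in the title sentence
    in_para_edge: bool  # appears at the head or tail of a paragraph
    weight: float       # total weight W(w) from the weighted calculation

def primary_candidates(words: List[Word], total_words: int) -> List[Word]:
    # Filter: wrong part of speech, or frequency-1 words outside title/paragraph edges.
    kept = [w for w in words
            if w.pos in ALLOWED_POS
            and not (w.freq == 1 and not w.in_title and not w.in_para_edge)]
    k = int(total_words * 0.35)
    if k >= 20:
        k = 20  # assumed cap for the >= 20 branch
    kept.sort(key=lambda w: w.weight, reverse=True)
    return kept[:k]
```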
Example 5
In order to improve the robustness of voice conversion in a data-scarce environment, this embodiment differs from embodiment 1; referring to fig. 2 and fig. 3:
the character conversion comprises the following specific steps:
firstly, extracting audio data output by an equipment end, and then training the audio data by using a Gaussian mixture learning algorithm;
decomposing the source audio output voice with a harmonic-plus-noise model, correcting the decomposed model by using the average fundamental frequency ratio to obtain corrected harmonic amplitude and phase parameters, performing feature extraction on the harmonic amplitude and phase parameters to obtain line spectral frequency (LSF) parameters, mapping the line spectral frequency parameters by using the Gaussian mixture model (a sketch of this mapping step follows this list), and fusing the mapped line spectral frequency parameter features;
and performing mixed output by using the corrected harmonic amplitude and phase parameters, and then extracting the text data of the source audio output voice.
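One common way to realize the "mapping the line spectral frequency parameters by using the Gaussian mixture model" step is joint-density GMM regression, sketched below. This is a generic illustration of that technique, not a verbatim description of the patented correction and fusion steps; it assumes the mixture model was fitted with full covariances on stacked [source; target] LSF vectors (either GaussianMixture or BayesianGaussianMixture exposes the same weights_/means_/covariances_ attributes).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def map_lsf(gmm: GaussianMixture, x: np.ndarray, d: int) -> np.ndarray:
    """Joint-density GMM regression: given a GMM fitted on z = [x; y]
    (source and target LSF vectors of dimension d each, covariance_type='full'),
    predict the target LSF vector for a source LSF vector x."""
    # responsibilities of each mixture component given the source part of z
    resp = np.zeros(gmm.n_components)
    for m in range(gmm.n_components):
        mu_x = gmm.means_[m][:d]
        cov_xx = gmm.covariances_[m][:d, :d]
        diff = x - mu_x
        resp[m] = gmm.weights_[m] * np.exp(
            -0.5 * diff @ np.linalg.solve(cov_xx, diff)
        ) / np.sqrt(np.linalg.det(2 * np.pi * cov_xx))
    resp /= resp.sum()
    # conditional expectation of the target part for each component
    y_hat = np.zeros(d)
    for m in range(gmm.n_components):
        mu_x, mu_y = gmm.means_[m][:d], gmm.means_[m][d:]
        cov_xx = gmm.covariances_[m][:d, :d]
        cov_yx = gmm.covariances_[m][d:, :d]
        y_hat += resp[m] * (mu_y + cov_yx @ np.linalg.solve(cov_xx, x - mu_x))
    return y_hat
```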
In addition, the Gaussian mixture learning algorithm comprises the following steps:
firstly, training on the source audio output voice and the target audio output voice, and decomposing the corresponding harmonic-plus-noise models;
calculating the average fundamental frequency ratio of the fundamental frequency tracks of the two output voices, and simultaneously performing feature extraction on the harmonic amplitude and phase parameters of the two output voices to obtain the corresponding line spectral frequency parameters;
and performing dynamic time warping on the obtained line spectral frequency parameters, and obtaining the Gaussian mixture model by using a variational Bayes estimation algorithm (a training sketch follows these steps).
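As a concrete stand-in for these steps, the sketch below aligns source and target line spectral frequency sequences with a basic dynamic-time-warping path and fits scikit-learn's BayesianGaussianMixture (a variational Bayes GMM) on the stacked vectors. The 32-component initialization follows the text below; the DTW implementation, frame dimensions and solver settings are assumptions of the sketch.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

def dtw_path(src: np.ndarray, tgt: np.ndarray):
    """Basic dynamic time warping between two LSF sequences (frames x dim);
    returns the list of aligned (source_frame, target_frame) index pairs."""
    n, m = len(src), len(tgt)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(src[i - 1] - tgt[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # backtrack the optimal warping path
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def train_vb_gmm(src_lsf: np.ndarray, tgt_lsf: np.ndarray) -> BayesianGaussianMixture:
    """Stack time-aligned source/target LSF vectors into extended vectors z = [x, y]
    and estimate the GMM by variational Bayes (32 components initialized)."""
    pairs = dtw_path(src_lsf, tgt_lsf)
    z = np.array([np.concatenate([src_lsf[i], tgt_lsf[j]]) for i, j in pairs])
    vb_gmm = BayesianGaussianMixture(n_components=32, covariance_type="full",
                                     max_iter=500, random_state=0)
    return vb_gmm.fit(z)
```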
Specifically, when in use, the Gaussian mixture model adopts a VB-GMM algorithm: the line spectral frequency features x of the source audio output voice and the line spectral frequency features y of the target audio output voice are first combined into an extended vector z = [x, y], and the Gaussian mixture model of z is then estimated by the variational Bayes algorithm.
Referring to fig. 6, the horizontal axis is the number of mixture components and the vertical axis is the logarithmic distortion (unit: dB). Specifically: at point (1), the conversion error decreases as the data amount increases, which shows that the more sufficient the data amount, the more sufficient the model training and the better the conversion effect; at point (2), when the data amount is small (training data of fewer than 500 frames), the VB-GMM performs better than the standard GMM shown by the solid line (the error distortion is about 0.5 dB lower), the performance gap between the two decreases as the data amount increases, and when the data amount is relatively sufficient (training data of more than 5000 frames) the two tend to be balanced (the error distortion differs by about 0.23 dB), which is consistent with the theory above that when the data amount tends to infinity the VB-GMM estimation result approximately equals the maximum likelihood estimation result; at point (3), with about 3000 frames of training data, the performance of both reaches a local low point, because the correlation between the 3000 frames of training data and the test data is strong, so the conversion effect is very good.
It is worth noting that the markers of points (1)-(6) refer to the optimal number of mixture components of the two models under a given amount of training data (the standard GMM and the VB-GMM both adopt the same optimal number of mixtures), and this optimal value is obtained automatically by the VB-GMM algorithm (for different data amounts, 32 mixture components are initialized, and the finally obtained number of mixtures is the self-optimization result of the VB-GMM algorithm), so that the "overfitting" problem is avoided and the robustness of voice conversion in a data-scarce environment is improved.
Further, the calculation formula of the variational Bayes estimation algorithm is as follows:
ln p(X) = ln ∫ p(X | Y) p(Y) dY
wherein ln p(X) is the logarithmic marginal density; X is the observed audio variable; Y is the text variable of the source audio output voice; p(Y | X) is the posterior probability of Y for a given X; and p(Y) is the prior probability of Y. Specifically, the marginal density p(X) is estimated by integrating the product of the likelihood p(X | Y) and the prior probability p(Y) over all possible values of Y.
The foregoing shows and describes the general principles, essential features, and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, and the preferred embodiments of the present invention are described in the above embodiments and the description, and are not intended to limit the present invention. The scope of the invention is defined by the appended claims and equivalents thereof.