CN105957517A - Voice data structured conversion method and system based on open source API - Google Patents

Voice data structured conversion method and system based on open source API

Info

Publication number
CN105957517A
CN105957517A (application CN201610286831.9A)
Authority
CN
China
Prior art keywords
data
speech
voice
api
speech data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610286831.9A
Other languages
Chinese (zh)
Inventor
许爱东
郭晓斌
黄文琦
陈华军
李果
蒋屹新
袁小凯
蒙家晓
张福铮
黄建理
杜金燃
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China South Power Grid International Co ltd
Power Grid Technology Research Center of China Southern Power Grid Co Ltd
Original Assignee
China South Power Grid International Co ltd
Power Grid Technology Research Center of China Southern Power Grid Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China South Power Grid International Co ltd, Power Grid Technology Research Center of China Southern Power Grid Co Ltd filed Critical China South Power Grid International Co ltd
Priority to CN201610286831.9A
Publication of CN105957517A
Legal status: Pending


Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 — Speech recognition
    • G10L15/02 — Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/04 — Segmentation; Word boundary detection
    • G10L15/22 — Procedures used during a speech recognition process, e.g. man-machine dialogue

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a method and system for the structured conversion of voice data based on open-source APIs. The method comprises the following steps: extracting voice data from a data source; performing segmentation, fragmentation, and non-voice filtering on the voice data to obtain its feature text data; performing speech recognition on the feature text data using an open-source speech recognition API (application programming interface) to obtain the frequency, loudness, emotion information, and recognition sequence of the voice; and fusing the frequency, loudness, emotion information, and recognition sequence and applying structured text marking to generate structured voice data. This scheme converts voice data into structured form, facilitating its storage and management; moreover, it offers high working efficiency and effective emotion analysis, improving the accuracy of the structured conversion.

Description

Method and system for structured conversion of speech data based on open-source APIs
Technical field
The present invention relates to the technical field of data processing, and in particular to a method and system for the structured conversion of speech data based on open-source APIs.
Background technology
In recent years, with the rise of the concept of big data, the volume of unstructured data — chiefly electronic documents, mail, forms, audio, video, and graphic images — has grown rapidly, and speech data within unstructured data stands out in particular for its practicality and pace of development. As the quantity of unstructured data grows, its storage and management problems become increasingly acute and severely impair the efficiency of data processing. Structured data in relational databases grows far more slowly and does not suffer comparable storage and management problems. In practical applications, therefore, and especially in the technical field of speech data processing, converting unstructured data into structured data before storing and managing it has become an effective way of solving the above problems.
The prior art offers two main methods for converting speech data into structured data: manual coding analysis and software coding analysis. Manual coding analysis can perform semantic analysis with accurate, context-aware word segmentation, places relatively low demands on the form of the data, and can even analyze colloquial writing; however, it is relatively slow and ill-suited to analyzing larger volumes of data. Software coding analysis has relatively low semantic accuracy, involves comparatively cumbersome steps, lacks adequate sentiment analysis, and leaves processing efficiency to be improved.
In summary, existing methods for the structured conversion of speech data are relatively slow and relatively inaccurate.
Summary of the invention
Based on this, to address the relatively slow processing speed and relatively low accuracy of existing methods for the structured conversion of speech data, it is necessary to provide a method and system for the structured conversion of speech data based on open-source APIs.
A method for the structured conversion of speech data based on open-source APIs, characterized by comprising the following steps:
extracting the speech data from a data source;
performing segmentation, fragmentation, and non-voice filtering on the speech data to obtain feature text data of the speech data;
performing speech recognition on the feature text data using an open-source speech recognition API to obtain the frequency, loudness, emotion information, and recognition sequence of the voice, wherein the emotion information is information representing an emotion category;
fusing the frequency, loudness, emotion information, and recognition sequence of the voice and applying structured text marking to generate structured speech data.
With the above method, the speech data extracted from the data source is segmented, fragmented, and filtered of non-voice content to obtain its feature text data; the open-source speech recognition API performs recognition on the feature text data, yielding the frequency, loudness, emotion information, and recognition sequence of the voice; these are then fused and marked as structured text to generate the structured speech data. This scheme achieves the structured conversion of the speech data in the data source, facilitating its storage and management; moreover, the method of the present invention works efficiently and provides effective sentiment analysis, further improving the accuracy of the structured conversion.
A system for the structured conversion of speech data based on open-source APIs, comprising:
an extraction module for extracting the speech data from a data source;
a preprocessing module for performing segmentation, fragmentation, and non-voice filtering on the speech data to obtain feature text data of the speech data;
a speech recognition module for performing speech recognition on the feature text data using an open-source speech recognition API to obtain the frequency, loudness, emotion information, and recognition sequence of the voice;
a conversion module for fusing the frequency, loudness, emotion information, and recognition sequence of the voice and applying structured text marking to generate structured speech data.
With the above system, the preprocessing module segments, fragments, and filters non-voice content from the speech data extracted from the data source to obtain its feature text data; the speech recognition module performs recognition on the feature text data using the open-source speech recognition API, yielding the frequency, loudness, emotion information, and recognition sequence of the voice; and the conversion module fuses these and applies structured text marking to generate the structured speech data. This scheme achieves the structured conversion of the speech data in the data source, facilitating its storage and management; moreover, the system of the present invention works efficiently and provides effective sentiment analysis, further improving the accuracy of the structured conversion.
Brief description of the drawings
Fig. 1 is a flow chart of the method for structured conversion of speech data based on open-source APIs according to one embodiment of the present invention;
Fig. 2 is a flow chart of the preprocessing steps of the method for structured conversion of speech data based on open-source APIs according to another embodiment of the present invention;
Fig. 3 is a flow chart of the speech recognition steps of the method for structured conversion of speech data based on open-source APIs according to another embodiment of the present invention;
Fig. 4 is a flow chart of the conversion steps of the method for structured conversion of speech data based on open-source APIs according to another embodiment of the present invention;
Fig. 5 is a structural diagram of the system for structured conversion of speech data based on open-source APIs according to one embodiment of the present invention;
Fig. 6 is a structural diagram of the preprocessing module of the system for structured conversion of speech data based on open-source APIs according to another embodiment of the present invention;
Fig. 7 is a structural diagram of the speech recognition module of the system for structured conversion of speech data based on open-source APIs according to another embodiment of the present invention;
Fig. 8 is a structural diagram of the structuring module of the system for structured conversion of speech data based on open-source APIs according to another embodiment of the present invention;
Fig. 9 is a structural diagram of the conversion module of the system for structured conversion of speech data based on open-source APIs according to another embodiment of the present invention.
Detailed description of the invention
To further illustrate the technical means adopted by the present invention and the effects achieved, the technical scheme of the present invention is described clearly and completely below with reference to the accompanying drawings and preferred embodiments.
As shown in Fig. 1, which is a flow chart of the method for structured conversion of speech data based on open-source APIs according to one embodiment of the present invention, the method comprises the following steps:
Step S101: extract the speech data from the data source.
In this step, extracting the speech data from the data source prevents non-speech data from interfering with the speech data during conversion, improving the accuracy of the method for structured conversion of speech data based on open-source APIs of the present invention. In practical applications, step S101 extracts the audio information from a data source that may contain multiple data types — for example, extracting the audio track from a segment of video.
To extract an audio track from a data source, the APIs provided by the Windows system may be used. Taking the methods provided by winmm.h as an example, the required references are first included with the following statements:
#include <Windows.h>
#include "mmsystem.h"
#pragma comment(lib, "winmm.lib")
Then the functions are called according to the functional requirements, for example:
waveInOpen: open the specified audio input device for recording;
waveInPrepareHeader: prepare a buffer for the audio input device;
waveInStart: start recording;
waveInClose: close the audio input device.
Since the most widely used languages in China are Mandarin and Cantonese, the language information contained in the data source preferably comprises Mandarin or Cantonese. In practical applications, it may also comprise the languages of other ethnic groups, such as Mongolian, Hui, or Zhuang.
Step S102: perform segmentation, fragmentation, and non-voice filtering on the speech data to obtain the feature text data of the speech data.
In this step, segmenting, fragmenting, and filtering the speech data further reduces the impact of non-speech data on the speech data during conversion, improving the accuracy of the method for structured conversion of speech data based on open-source APIs of the present invention.
Step S103: perform speech recognition on the feature text data using an open-source speech recognition API (Application Programming Interface) to obtain the frequency, loudness, emotion information, and recognition sequence of the voice.
The emotion information of the voice in this step includes, but is not limited to, fundamental frequency, duration, energy, and spectrum. Performing recognition on the feature text data with an open-source speech recognition API to obtain the frequency, loudness, emotion information, and recognition sequence of the voice simplifies the recognition steps and improves the efficiency of the speech data conversion.
The open-source speech recognition API in this step may be the Baidu speech recognition interface, the Google speech recognition interface, the Microsoft speech recognition interface, or the iFlytek speech recognition interface, among others. Taking the Baidu speech recognition interface as an example, under a Python environment the statement
import baidu_oauth
includes the Baidu speech recognition library file; the statements
asr_server = 'http://vop.baidu.com/server_api'
baidu_oauth_url = 'https://openapi.baidu.com/oauth/2.0/token'
connect to Baidu speech and perform authorization; and the statement
data_dict = {'format': 'wav', 'rate': 8000, 'channel': 1, 'cuid': mac_address, 'token': access_token, 'lan': 'zh', 'speech': speech_base64, 'len': speech_length}
specifies and establishes the data structure, which may contain information such as the file type, bit rate, frequency range, and MAC address. The above takes only the use of the Baidu speech recognition interface under Python as an example; other recognition interfaces and other programming languages involve similar steps, which are not repeated here.
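To make the above statements concrete, the following is a minimal sketch of one complete recognition request in Python. It assumes only the endpoints and request fields quoted above; the OAuth client-credentials step, the JSON request body, the key/secret placeholders, and the file name are illustrative assumptions rather than part of the patent, and error handling is omitted.

import base64
import json
import urllib.request

ASR_SERVER = 'http://vop.baidu.com/server_api'
OAUTH_URL = 'https://openapi.baidu.com/oauth/2.0/token'

def get_access_token(api_key, secret_key):
    # Client-credentials grant against the OAuth endpoint quoted above.
    url = (f'{OAUTH_URL}?grant_type=client_credentials'
           f'&client_id={api_key}&client_secret={secret_key}')
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read())['access_token']

def recognize(wav_bytes, token, mac_address):
    # Build the request dictionary with the fields listed in the description.
    data_dict = {
        'format': 'wav', 'rate': 8000, 'channel': 1,
        'cuid': mac_address, 'token': token, 'lan': 'zh',
        'speech': base64.b64encode(wav_bytes).decode('ascii'),
        'len': len(wav_bytes),
    }
    req = urllib.request.Request(ASR_SERVER,
                                 data=json.dumps(data_dict).encode('utf-8'),
                                 headers={'Content-Type': 'application/json'})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())  # the recognized text is in the response

# Hypothetical usage:
# token = get_access_token('MY_API_KEY', 'MY_SECRET_KEY')
# result = recognize(open('sentence.wav', 'rb').read(), token, '00:11:22:33:44:55')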
Step S104: fuse the frequency, loudness, emotion information, and recognition sequence of the voice and apply structured text marking to generate the structured speech data.
The structured speech data in this step may be a file in XML format; that is, it is returned to the user in the Extensible Markup Language XML according to the user's needs.
With the above method for structured conversion of speech data based on open-source APIs, the speech data extracted from the data source is segmented, fragmented, and filtered of non-voice content to obtain its feature text data; the open-source speech recognition API performs recognition on the feature text data, yielding the frequency, loudness, emotion information, and recognition sequence of the voice; these are then fused and marked as structured text to generate the structured speech data. This scheme achieves the structured conversion of the speech data in the data source, facilitating its storage and management; moreover, the method of the present invention works efficiently and provides effective sentiment analysis, further improving the accuracy of the structured conversion.
As shown in Fig. 2, which is a flow chart of the preprocessing steps of the method for structured conversion of speech data based on open-source APIs according to another embodiment of the present invention, step S102 of performing segmentation, fragmentation, and non-voice filtering on the speech data to obtain its feature text data may further comprise:
Step S1021: segment and fragment the speech data to generate at least one sentence file, wherein the sentence files include speech sentence files and non-voice sentence files.
In this step, segmentation and fragmentation may be performed by locating the silent points of the speech data. For example, taking a window of 50 frames with 200 sampling points per frame as the silence threshold, a point past this threshold is considered to lie at a silent position; once a silent position is found, the speech data is cut at that position, and the resulting fragmented sentence files are annotated with information such as duration and timestamp and saved in PCM format, generating at least one sentence file.
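As an illustration of this step, the sketch below segments an 8 kHz, 16-bit mono PCM file at silent positions. The 50-frame window with 200 sampling points per frame follows the description above; the concrete energy threshold, the polarity of the silence test (low mean amplitude treated as silence), the sampling rate, and the output layout are assumptions.

import struct
import time

FRAME_SAMPLES = 200        # sampling points per frame, as in the description
WINDOW_FRAMES = 50         # frames per silence-detection window
SILENCE_THRESHOLD = 500.0  # assumed mean absolute amplitude below which a window is silent
SAMPLE_RATE = 8000         # assumed sampling rate of the sentence files

def split_on_silence(pcm_path):
    # Read raw 16-bit little-endian PCM samples.
    with open(pcm_path, 'rb') as f:
        raw = f.read()
    samples = struct.unpack(f'<{len(raw) // 2}h', raw)
    window = FRAME_SAMPLES * WINDOW_FRAMES
    fragments, start = [], 0
    for pos in range(0, len(samples) - window, window):
        chunk = samples[pos:pos + window]
        if sum(abs(s) for s in chunk) / window < SILENCE_THRESHOLD:
            # A silent window closes the current sentence fragment.
            if pos > start:
                fragments.append({'samples': samples[start:pos],
                                  'duration_s': (pos - start) / SAMPLE_RATE,
                                  'timestamp': time.time()})
            start = pos + window
    if start < len(samples):
        fragments.append({'samples': samples[start:],
                          'duration_s': (len(samples) - start) / SAMPLE_RATE,
                          'timestamp': time.time()})
    return fragments  # each fragment can then be written out as a PCM sentence file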
Step S1022: filter non-voice information from the sentence files to obtain the speech sentence files corresponding to the speech data.
Specifically, since step S1021 performs no content-level processing on the segmented and fragmented sentence files, a sentence file may contain interference such as non-voice noise or the speaker's swallowing sounds. These extraneous sounds cause misjudgments in speech recognition, reduce recognition accuracy, and add to the effective workload and burden of the system.
For effective non-voice filtering, a classification model may be trained to distinguish voice from non-voice; after the non-voice information is identified, it is deleted, leaving the effective voice information — that is, the speech sentence files corresponding to the speech data.
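A minimal sketch of such a voice/non-voice classification model follows; the library (scikit-learn), the SVM model, and the frame-level energy and zero-crossing-rate features are assumptions, since the patent does not name a concrete model.

import numpy as np
from sklearn.svm import SVC

def fragment_features(samples, frame_len=200):
    # Summarize a sentence fragment by frame-energy and zero-crossing statistics.
    x = np.asarray(samples, dtype=np.float64)
    frames = x[:len(x) - len(x) % frame_len].reshape(-1, frame_len)
    energy = (frames ** 2).mean(axis=1)
    zcr = (np.diff(np.sign(frames), axis=1) != 0).mean(axis=1)
    return np.array([energy.mean(), energy.std(), zcr.mean(), zcr.std()])

def train_voice_filter(fragments, labels):
    # labels: 1 for speech, 0 for non-voice (noise, swallowing sounds, ...).
    X = np.stack([fragment_features(f) for f in fragments])
    return SVC(kernel='rbf').fit(X, labels)

def keep_speech(model, fragments):
    # Delete fragments classified as non-voice, keeping the speech sentence files.
    X = np.stack([fragment_features(f) for f in fragments])
    return [f for f, y in zip(fragments, model.predict(X)) if y == 1]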
Step S1023: perform speech feature extraction on the speech sentence files to obtain the feature text data of the speech data.
In this step, as much audio data as possible is converted into text data to reduce the volume of data transmitted and thereby save bandwidth resources.
In practical applications, this step may use mel-cepstral coefficient extraction. Its principle is that the human ear has different auditory sensitivities to different sound waves: speech signals from 200 Hz to 5000 Hz have the greatest effect on the intelligibility of speech. When frequency components of unequal loudness are present, the louder components affect the perception of the quieter ones, making them hard to discern — that is to say, low tones easily mask high tones, while high tones mask low tones with more difficulty. Band-pass filters can therefore be arranged across this frequency band from low to high, densely at first and then sparsely, according to the critical bandwidth; the input speech signal is filtered, and the output signal energy of each band-pass filter serves as a basic feature of the signal which, after further processing, can be used as the speech feature. This is also a main reason why step S103 needs to extract frequency and loudness.
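The following sketch illustrates mel-cepstral feature extraction along these lines; librosa is an assumed library choice, since the patent names only the technique itself.

import librosa

def extract_mfcc(path, n_mfcc=13):
    # Load at an assumed 8 kHz rate to match the PCM sentence files.
    y, sr = librosa.load(path, sr=8000)
    # The mel filterbank is spaced densely at low frequencies and sparsely at
    # high ones, mirroring the critical-bandwidth arrangement described above.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T  # one n_mfcc-dimensional feature vector per frame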
As shown in Fig. 3, which is a flow chart of the speech recognition steps of the method for structured conversion of speech data based on open-source APIs according to another embodiment of the present invention, the step of performing speech recognition on the feature text data to obtain the frequency, loudness, emotion information, and recognition sequence of the voice comprises:
Step S1031: perform recognition on the feature text data using the open-source speech recognition API and extract the emotion context information contained in the feature text data, wherein the emotion context information includes prior emotion context information and spatio-temporal context information.
For the prior emotion information in this step, fuzzy inference may be used to infer and analyze the emotional state from the background noise; the spatio-temporal context information analyzes the speech intonation in the sentence fragment immediately before or after each fragment to infer the speaker's emotion in the current fragment.
Step S1032: extract, in real time, the affective features contained in the feature text data, wherein the affective features include morpheme interval information and morpheme duration information.
In this embodiment, the morpheme interval is the time interval between each two Chinese characters spoken by the speaker, and the morpheme duration is the time taken by the speaker to speak a single character; both kinds of information well reflect the speaker's affective characteristics.
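As a hypothetical illustration, these two features can be computed from per-character timestamps of the kind a recognizer may return; the input format of (character, start, end) tuples is an assumption.

def morpheme_features(char_times):
    # char_times: list of (character, start_s, end_s) tuples in spoken order.
    durations = [end - start for _, start, end in char_times]
    intervals = [char_times[i + 1][1] - char_times[i][2]
                 for i in range(len(char_times) - 1)]
    return {
        'mean_morpheme_duration': sum(durations) / len(durations),
        'mean_morpheme_interval': sum(intervals) / len(intervals) if intervals else 0.0,
    }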
Step S1033: perform structured sparse representation on the emotion context information and the affective features respectively to obtain the emotion information corresponding to the feature text data.
In one embodiment of the method for structured conversion of speech data based on open-source APIs of the present invention, step S1033 of performing structured sparse representation on the affective features may comprise the following steps:
embedding a nonlinear classification discrimination model into the dictionary optimization solution of the structured sparse representation;
optimizing the dictionary optimization solution of the sparse representation using a supervised learning method;
performing structured sparse representation on the affective features according to the optimized dictionary optimization solution of the sparse representation.
In one embodiment, let $X = \{x_1, x_2, \ldots, x_n\}$ be a set of $n$ affective feature vectors and $Y = \{y_1, y_2, \ldots, y_n\}$ a set of $n$ emotion context vectors. The kernel-based nonlinear discriminative sparse representation criterion is established as follows:

$$\min_{D,\,\theta,\,\alpha} \sum_{j=1}^{g} \sum_{i=1}^{n_j} \left( C\big(f(\alpha_i,\theta),\, y_i\big) + \lambda_0 \lVert x_i - D\alpha_i \rVert_2^2 + \lambda_1 \lVert \alpha_i \rVert_1 + \lambda_2 \lVert \alpha_{G_j} \rVert_2 + \lambda_3 \lVert \theta \rVert_2^2 \right)$$

where $D$ is the sparse representation dictionary; $\alpha_i = \{\alpha_1, \alpha_2, \ldots, \alpha_m\}$ is the set of $m$ sparse representations of the affective features; $g$ is the number of feature groups; $n_j$ is the number of affective features in the $j$-th group; $\theta$ is the kernel discrimination parameter; $f(\alpha, \theta)$ maps $\alpha$ to a high-dimensional space, being the nonlinear classification function over the sparse code $\alpha$ established with the kernel function $K$ (a Gaussian kernel may be taken, and the kernel parameters can be obtained by training); $C(f(\alpha,\theta), y_i)$ is a loss function designed according to the Fisher criterion, balancing the global objectives that the within-class scatter of same-class $\alpha$ be as small as possible and the between-class scatter of different-class $\alpha$ be as large as possible; and $\lambda_0, \lambda_1, \lambda_2, \lambda_3$ are penalty factors.
The iterative optimization may proceed as follows: with the current $D$ and $\theta$, solve the kernel-based nonlinear discriminative sparse representation criterion for the sparse codes $\alpha$ of the labeled affective features; then establish the partial differential equations of the sparse representation constraint with respect to $D$ and $\theta$, and update the dictionary $D$ and the discriminative classification parameter $\theta$ by gradient descent; iterate until convergence, thereby obtaining the dictionary after solution optimization.
From the optimized dictionary, the affective feature vector $X$ and the emotion context vector $Y$ can be derived, and an estimation algorithm can be designed that uses both vectors to estimate the emotion.
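The sketch below illustrates the alternating structure of this optimization — solve for sparse codes with the dictionary fixed, then take a gradient step on the dictionary — for the reconstruction and L1 terms only; the discriminative loss C, the kernel classifier f(α, θ), the group penalty, and the θ update are omitted, and the step sizes and iteration counts are assumptions.

import numpy as np

def sparse_code(X, D, lam1=0.1, n_iter=100, lr=0.01):
    # ISTA-style solve of min_A ||X - D A||_2^2 + lam1 ||A||_1 with D fixed.
    A = np.zeros((D.shape[1], X.shape[1]))
    for _ in range(n_iter):
        A = A - lr * (D.T @ (D @ A - X))                         # gradient step
        A = np.sign(A) * np.maximum(np.abs(A) - lr * lam1, 0.0)  # soft threshold
    return A

def learn_dictionary(X, n_atoms=64, n_outer=20, lr=0.01):
    # X: one affective feature vector per column.
    rng = np.random.default_rng(0)
    D = rng.standard_normal((X.shape[0], n_atoms))
    D /= np.linalg.norm(D, axis=0)
    for _ in range(n_outer):
        A = sparse_code(X, D)                 # codes with the dictionary fixed
        D = D - lr * (D @ A - X) @ A.T        # gradient-descent update of D
        D /= np.maximum(np.linalg.norm(D, axis=0), 1e-8)  # renormalize atoms
    return D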
In step S103, after the frequency and loudness of the voice are obtained, a Gaussian mixture model can be trained to further characterize the speaker. With the trained model, it is also possible to judge whether different voices come from the same person according to characteristics such as the intonation, frequency, and loudness of a speech segment.
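A minimal sketch of that check follows, assuming scikit-learn's GaussianMixture over frame-level features such as the mel-cepstral coefficients above; the component count and decision threshold are illustrative assumptions to be tuned on real data.

from sklearn.mixture import GaussianMixture

def fit_speaker_model(feature_frames, n_components=8):
    # feature_frames: array of shape (n_frames, n_features) for one speaker.
    return GaussianMixture(n_components=n_components).fit(feature_frames)

def same_speaker(model, feature_frames, threshold=-50.0):
    # A higher average log-likelihood means the new voice fits the model better.
    return model.score(feature_frames) > threshold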
As shown in Fig. 4, which is a flow chart of the conversion steps of the method according to another embodiment of the present invention, step S104 of fusing the frequency, loudness, emotion information, and recognition sequence of the voice and applying structured text marking to generate the structured speech data comprises:
Step S1041: collect and align the frequency, loudness, emotion information, and recognition sequence of the voice, and sort them according to the start and end times contained in the recognition sequence;
Step S1042: mark the sorted recognition sequence according to the structured format to generate the structured speech data, wherein the marks include gender marks, timbre marks, punctuation marks, and timestamp marks.
In the above embodiment, the structured speech data is finally generated and sent to the client in the form of an XML message, which may include content such as the source file name, gender mark, speech duration mark, and voice/non-voice mark. In particular, to ensure recognition accuracy, the open-source speech recognition library used in the present invention is periodically collected and updated from network text; it can therefore include popular vocabulary, adapt better to speech recognition in various situations, and help improve the accuracy of speech recognition.
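A sketch of such an XML message is given below; the tag and attribute names are hypothetical, since the patent specifies only that the message carries items such as the source file name, gender mark, duration mark, and voice/non-voice mark.

import xml.etree.ElementTree as ET

def build_structured_xml(source_file, segments):
    # segments: fused recognition results sorted by start time, each carrying
    # the frequency, loudness, emotion information, and marked text.
    root = ET.Element('speech', {'source': source_file})
    for seg in sorted(segments, key=lambda s: s['start']):
        elem = ET.SubElement(root, 'segment', {
            'start': str(seg['start']), 'end': str(seg['end']),
            'gender': seg.get('gender', 'unknown'),
            'timbre': seg.get('timbre', ''),
            'emotion': seg.get('emotion', ''),
            'voice': str(seg.get('is_voice', True)),
        })
        elem.text = seg['text']  # punctuated recognition result
    return ET.tostring(root, encoding='unicode')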
As shown in Fig. 5, which is a structural diagram of the system for structured conversion of speech data based on open-source APIs according to one embodiment of the present invention, the system comprises:
an extraction module 101 for extracting the speech data from the data source.
In the extraction module 101, extracting the speech data from the data source prevents non-speech data from interfering with the speech data during conversion, improving the accuracy of the system for structured conversion of speech data based on open-source APIs of the present invention. In practical applications, the extraction module 101 can extract the audio information from a data source that may contain multiple data types — for example, extracting the audio track from a segment of video.
To extract an audio track from a data source, the APIs provided by the Windows system may be used. Taking the methods provided by winmm.h as an example, the required references are first included with the following statements:
#include <Windows.h>
#include "mmsystem.h"
#pragma comment(lib, "winmm.lib")
Then the functions are called according to the functional requirements, for example:
waveInOpen: open the specified audio input device for recording;
waveInPrepareHeader: prepare a buffer for the audio input device;
waveInStart: start recording;
waveInClose: close the audio input device.
Since the most widely used languages in China are Mandarin and Cantonese, the language information contained in the data source preferably comprises Mandarin or Cantonese. In practical applications, it may also comprise the languages of other ethnic groups, such as Mongolian, Hui, or Uyghur.
a preprocessing module 102 for performing segmentation, fragmentation, and non-voice filtering on the speech data to obtain the feature text data of the speech data.
In the preprocessing module 102, segmenting, fragmenting, and filtering the speech data further reduces the impact of non-speech data on the speech data during conversion, improving the accuracy of the system for structured conversion of speech data based on open-source APIs of the present invention.
a speech recognition module 103 for performing speech recognition on the feature text data using the open-source speech recognition API to obtain the frequency, loudness, emotion information, and recognition sequence of the voice.
The emotion information of the voice in the speech recognition module 103 includes, but is not limited to, fundamental frequency, duration, energy, and spectrum. Performing recognition on the feature text data with the open-source speech recognition API to obtain the frequency, loudness, emotion information, and recognition sequence of the voice simplifies the recognition steps and improves the efficiency of the speech data conversion.
The open-source speech recognition API used in the above speech recognition module 103 may be the Baidu speech recognition interface, the Google speech recognition interface, the Microsoft speech recognition interface, or the iFlytek speech recognition interface, among others. Taking the Baidu speech recognition interface as an example, under a Python environment the statement
import baidu_oauth
includes the Baidu speech recognition library file; the statements
asr_server = 'http://vop.baidu.com/server_api'
baidu_oauth_url = 'https://openapi.baidu.com/oauth/2.0/token'
connect to Baidu speech and perform authorization; and the statement
data_dict = {'format': 'wav', 'rate': 8000, 'channel': 1, 'cuid': mac_address, 'token': access_token, 'lan': 'zh', 'speech': speech_base64, 'len': speech_length}
specifies and establishes the data structure, which may contain information such as the file type, bit rate, frequency range, and MAC address. The above takes only the use of the Baidu speech recognition interface under Python as an example; other recognition interfaces and other programming languages involve similar steps, which are not repeated here.
a conversion module 104 for fusing the frequency, loudness, emotion information, and recognition sequence of the voice and applying structured text marking to generate the structured speech data.
The structured speech data in the above conversion module 104 may be a file in XML format; that is, it is returned to the user in the Extensible Markup Language XML according to the user's needs.
With the above system for structured conversion of speech data based on open-source APIs, the preprocessing module segments, fragments, and filters non-voice content from the speech data extracted from the data source to obtain its feature text data; the speech recognition module performs recognition on the feature text data using the open-source speech recognition interface, yielding the frequency, loudness, emotion information, and recognition sequence of the voice; and the conversion module fuses these and applies structured text marking to generate the structured speech data. This scheme achieves the structured conversion of the speech data in the data source, facilitating its storage and management; moreover, the system of the present invention works efficiently and provides effective sentiment analysis, further improving the accuracy of the structured conversion.
As shown in Fig. 6, which is a structural diagram of the preprocessing module of the system for structured conversion of speech data based on open-source APIs according to another embodiment of the present invention, the preprocessing module 102 comprises:
a segmentation module 1021 for segmenting and fragmenting the speech data to generate at least one sentence file, wherein the sentence files include speech sentence files and non-voice sentence files.
In the above segmentation module 1021, segmentation and fragmentation may be performed by locating the silent points of the speech data. For example, taking a window of 50 frames with 200 sampling points per frame as the silence threshold, a point past this threshold is considered to lie at a silent position; once a silent position is found, the speech data is cut at that position, and the resulting fragmented sentence files are annotated with information such as duration and timestamp and saved in PCM format, generating at least one sentence file.
a non-voice filtering module 1022 for filtering non-voice information from the sentence files to obtain the speech sentence files corresponding to the speech data.
Specifically, since the segmentation module 1021 performs no content-level processing on the segmented and fragmented sentence files, a sentence file may contain interference such as non-voice noise or the speaker's swallowing sounds. These extraneous sounds cause misjudgments in speech recognition, reduce recognition accuracy, and add to the effective workload and burden of the system.
For effective non-voice filtering, a classification model may be trained to distinguish voice from non-voice; after the non-voice information is identified, it is deleted, leaving the effective voice information — that is, the speech sentence files corresponding to the speech data.
a feature extraction module 1023 for performing speech feature extraction on the speech sentence files to obtain the feature text data of the speech data.
In the above feature extraction module 1023, as much audio data as possible is converted into text data to reduce the volume of data transmitted and thereby save bandwidth resources.
In practical applications, mel-cepstral coefficient extraction may be used for speech feature extraction. Its principle is that the human ear has different auditory sensitivities to different sound waves: speech signals from 200 Hz to 5000 Hz have the greatest effect on the intelligibility of speech. When frequency components of unequal loudness are present, the louder components affect the perception of the quieter ones, making them hard to discern — that is to say, low tones easily mask high tones, while high tones mask low tones with more difficulty. Band-pass filters can therefore be arranged across this frequency band from low to high, densely at first and then sparsely, according to the critical bandwidth; the input speech signal is filtered, and the output signal energy of each band-pass filter serves as a basic feature of the signal which, after further processing, can be used as the speech feature. This is also a main reason why the speech recognition module 103 needs to extract frequency and loudness.
As shown in Fig. 7, which is a structural diagram of the speech recognition module of the system for structured conversion of speech data based on open-source APIs according to another embodiment of the present invention, the speech recognition module 103 comprises:
an emotion information extraction module 1031 for performing recognition on the feature text data using the open-source speech recognition API and extracting the emotion context information contained in the feature text data, wherein the emotion context information includes prior emotion context information and spatio-temporal context information.
For the prior emotion information in the above emotion information extraction module 1031, fuzzy inference may be used to infer and analyze the emotional state from the background noise; the spatio-temporal context information analyzes the speech intonation in the sentence fragment immediately before or after each fragment to infer the speaker's emotion in the current fragment.
an affective feature extraction module 1032 for extracting, in real time, the affective features contained in the feature text data, wherein the affective features include morpheme interval information and morpheme duration information.
In this embodiment, the morpheme interval is the time interval between each two Chinese characters spoken by the speaker, and the morpheme duration is the time taken by the speaker to speak a single character; both kinds of information well reflect the speaker's affective characteristics.
a structuring module 1033 for performing structured sparse representation on the emotion context information and the affective features respectively to obtain the emotion information corresponding to the feature text data.
As shown in Fig. 8, which is a structural diagram of the structuring module of the system for structured conversion of speech data based on open-source APIs according to another embodiment of the present invention, the structuring module 1033 comprises:
an embedding module 10331 for embedding the nonlinear classification discrimination model into the dictionary optimization solution of the structured sparse representation;
an optimization module 10332 for optimizing the dictionary optimization solution of the sparse representation using a supervised learning method;
a sparsification module 10333 for performing structured sparse representation on the affective features according to the optimized dictionary optimization solution of the sparse representation.
In one embodiment, let $X = \{x_1, x_2, \ldots, x_n\}$ be a set of $n$ affective feature vectors and $Y = \{y_1, y_2, \ldots, y_n\}$ a set of $n$ emotion context vectors. The kernel-based nonlinear discriminative sparse representation criterion is established as follows:

$$\min_{D,\,\theta,\,\alpha} \sum_{j=1}^{g} \sum_{i=1}^{n_j} \left( C\big(f(\alpha_i,\theta),\, y_i\big) + \lambda_0 \lVert x_i - D\alpha_i \rVert_2^2 + \lambda_1 \lVert \alpha_i \rVert_1 + \lambda_2 \lVert \alpha_{G_j} \rVert_2 + \lambda_3 \lVert \theta \rVert_2^2 \right)$$

where $D$ is the sparse representation dictionary; $\alpha_i = \{\alpha_1, \alpha_2, \ldots, \alpha_m\}$ is the set of $m$ sparse representations of the affective features; $g$ is the number of feature groups; $n_j$ is the number of affective features in the $j$-th group; $\theta$ is the kernel discrimination parameter; $f(\alpha, \theta)$ maps $\alpha$ to a high-dimensional space, being the nonlinear classification function over the sparse code $\alpha$ established with the kernel function $K$ (a Gaussian kernel may be taken, and the kernel parameters can be obtained by training); $C(f(\alpha,\theta), y_i)$ is a loss function designed according to the Fisher criterion, balancing the global objectives that the within-class scatter of same-class $\alpha$ be as small as possible and the between-class scatter of different-class $\alpha$ be as large as possible; and $\lambda_0, \lambda_1, \lambda_2, \lambda_3$ are penalty factors.
The iterative optimization may proceed as follows: with the current $D$ and $\theta$, solve the kernel-based nonlinear discriminative sparse representation criterion for the sparse codes $\alpha$ of the labeled affective features; then establish the partial differential equations of the sparse representation constraint with respect to $D$ and $\theta$, and update the dictionary $D$ and the discriminative classification parameter $\theta$ by gradient descent; iterate until convergence, thereby obtaining the dictionary after solution optimization.
From the optimized dictionary, the affective feature vector $X$ and the emotion context vector $Y$ can be derived, and an estimation algorithm can be designed that uses both vectors to estimate the emotion.
In the speech recognition module 103, after the frequency and loudness of the voice are obtained, a Gaussian mixture model can be trained to further characterize the speaker. With the trained model, it is also possible to judge whether different voices come from the same person according to characteristics such as the intonation, frequency, and loudness of a speech segment.
As shown in Fig. 9, which is a structural diagram of the conversion module of the system for structured conversion of speech data based on open-source APIs according to another embodiment of the present invention, the conversion module 104 comprises:
a sorting module 1041 for collecting and aligning the frequency, loudness, emotion information, and recognition sequence of the voice, and sorting them according to the start and end times contained in the recognition sequence;
a marking module 1042 for marking the sorted recognition sequence according to the structured format to generate the structured speech data, wherein the marks include gender marks, timbre marks, punctuation marks, and timestamp marks.
In the above embodiment, the structured speech data is finally generated and sent to the client in the form of an XML message, which may include content such as the source file name, gender mark, speech duration mark, and voice/non-voice mark. In particular, to ensure recognition accuracy, the open-source speech recognition library used in the present invention is periodically collected and updated from network text; it can therefore include popular vocabulary, adapt better to speech recognition in various situations, and help improve the accuracy of speech recognition.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments have been described; nevertheless, as long as a combination of these technical features involves no contradiction, it should be considered within the scope recorded in this specification.
The above embodiments express only several implementations of the present invention, and their descriptions are relatively specific and detailed, but they should not therefore be construed as limiting the scope of the patent. It should be pointed out that those of ordinary skill in the art may make a number of variations and improvements without departing from the concept of the present invention, all of which fall within the protection scope of the present invention. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A method for the structured conversion of speech data based on open-source APIs, characterized by comprising the following steps:
extracting the speech data from a data source;
performing segmentation, fragmentation, and non-voice filtering on the speech data to obtain feature text data of the speech data;
performing speech recognition on the feature text data using an open-source speech recognition API to obtain the frequency, loudness, emotion information, and recognition sequence of the voice;
fusing the frequency, loudness, emotion information, and recognition sequence of the voice and applying structured text marking to generate structured speech data.

2. The method for the structured conversion of speech data based on open-source APIs according to claim 1, characterized in that the step of performing segmentation, fragmentation, and non-voice filtering on the speech data to obtain the feature text data of the speech data comprises:
segmenting and fragmenting the speech data to generate at least one sentence file, wherein the sentence files include speech sentence files and non-voice sentence files;
filtering non-voice information from the sentence files to obtain the speech sentence files corresponding to the speech data;
performing speech feature extraction on the speech sentence files to obtain the feature text data of the speech data.

3. The method for the structured conversion of speech data based on open-source APIs according to claim 1, characterized in that the step of performing speech recognition on the feature text data using the open-source speech recognition API comprises:
performing recognition on the feature text data using the open-source speech recognition API and extracting the emotion context information contained in the feature text data, wherein the emotion context information includes prior emotion context information and spatio-temporal context information;
extracting, in real time, the affective features contained in the feature text data, wherein the affective features include morpheme interval information and morpheme duration information;
performing structured sparse representation on the emotion context information and the affective features respectively to obtain the emotion information corresponding to the feature text data.

4. The method for the structured conversion of speech data based on open-source APIs according to claim 3, characterized in that the step of performing structured sparse representation on the affective features comprises:
embedding a nonlinear classification discrimination model into the dictionary optimization solution of the structured sparse representation;
optimizing the dictionary optimization solution of the sparse representation using a supervised learning method;
performing structured sparse representation on the affective features according to the optimized dictionary optimization solution of the sparse representation.

5. The method for the structured conversion of speech data based on open-source APIs according to claim 1, characterized in that the step of fusing the frequency, loudness, emotion information, and recognition sequence of the voice and applying structured text marking to generate the structured speech data comprises:
collecting and aligning the frequency, loudness, emotion information, and recognition sequence of the voice, and sorting them according to the start and end times contained in the recognition sequence;
marking the sorted recognition sequence according to the structured format to generate the structured speech data, wherein the marks include gender marks, timbre marks, punctuation marks, and timestamp marks.

6. A system for the structured conversion of speech data based on open-source APIs, characterized by comprising:
an extraction module for extracting the speech data from a data source;
a preprocessing module for performing segmentation, fragmentation, and non-voice filtering on the speech data to obtain feature text data of the speech data;
a speech recognition module for performing speech recognition on the feature text data using an open-source speech recognition API to obtain the frequency, loudness, emotion information, and recognition sequence of the voice;
a conversion module for fusing the frequency, loudness, emotion information, and recognition sequence of the voice and applying structured text marking to generate structured speech data.

7. The system for the structured conversion of speech data based on open-source APIs according to claim 6, characterized in that the preprocessing module comprises:
a segmentation module for segmenting and fragmenting the speech data to generate at least one sentence file, wherein the sentence files include speech sentence files and non-voice sentence files;
a non-voice filtering module for filtering non-voice information from the sentence files to obtain the speech sentence files corresponding to the speech data;
a feature extraction module for performing speech feature extraction on the speech sentence files to obtain the feature text data of the speech data.

8. The system for the structured conversion of speech data based on open-source APIs according to claim 6, characterized in that the speech recognition module comprises:
an emotion information extraction module for performing recognition on the feature text data using the open-source speech recognition API and extracting the emotion context information contained in the feature text data, wherein the emotion context information includes prior emotion context information and spatio-temporal context information;
an affective feature extraction module for extracting, in real time, the affective features contained in the feature text data, wherein the affective features include morpheme interval information and morpheme duration information;
a structuring module for performing structured sparse representation on the emotion context information and the affective features respectively to obtain the emotion information corresponding to the feature text data.

9. The system for the structured conversion of speech data based on open-source APIs according to claim 7, characterized in that the structuring module comprises:
an embedding module for embedding a nonlinear classification discrimination model into the dictionary optimization solution of the structured sparse representation;
an optimization module for optimizing the dictionary optimization solution of the sparse representation using a supervised learning method;
a sparsification module for performing structured sparse representation on the affective features according to the optimized dictionary optimization solution of the sparse representation.

10. The system for the structured conversion of speech data based on open-source APIs according to claim 6, characterized in that the conversion module comprises:
a sorting module for collecting and aligning the frequency, loudness, emotion information, and recognition sequence of the voice, and sorting them according to the start and end times contained in the recognition sequence;
a marking module for marking the sorted recognition sequence according to the structured format to generate the structured speech data, wherein the marks include gender marks, timbre marks, punctuation marks, and timestamp marks.
CN201610286831.9A 2016-04-29 2016-04-29 Voice data structured conversion method and system based on open source API Pending CN105957517A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610286831.9A CN105957517A (en) 2016-04-29 2016-04-29 Voice data structured conversion method and system based on open source API

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610286831.9A CN105957517A (en) 2016-04-29 2016-04-29 Voice data structured conversion method and system based on open source API

Publications (1)

Publication Number Publication Date
CN105957517A true CN105957517A (en) 2016-09-21

Family

ID=56913436

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610286831.9A Pending CN105957517A (en) 2016-04-29 2016-04-29 Voice data structured conversion method and system based on open source API

Country Status (1)

Country Link
CN (1) CN105957517A (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101685634A (en) * 2008-09-27 2010-03-31 上海盛淘智能科技有限公司 Children speech emotion recognition method
CN101847406A (en) * 2010-05-18 2010-09-29 中国农业大学 Speech recognition query method and system
US20130297297A1 (en) * 2012-05-07 2013-11-07 Erhan Guven System and method for classification of emotion in human speech
CN103123619A (en) * 2012-12-04 2013-05-29 江苏大学 Visual speech multi-mode collaborative analysis method based on emotion context and system
CN103700370A (en) * 2013-12-04 2014-04-02 北京中科模识科技有限公司 Broadcast television voice recognition method and system
CN104050963A (en) * 2014-06-23 2014-09-17 东南大学 Continuous speech emotion prediction algorithm based on emotion data field

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
STRONGLEG: "Implementing speech recognition with the Baidu API — in Python" (使用百度API实现语音识别——in python), Sina Blog *
Chinese Association for Artificial Intelligence: "Progress of Artificial Intelligence in China" (《中国人工智能进展》), Beijing University of Posts and Telecommunications Press, 31 December 2007 *
Zhao Li et al.: "Research on the analysis and recognition of emotional features in speech signals" (语音信号中的情感特征分析和识别的研究), Acta Electronica Sinica (《电子学报》) *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106791913A (en) * 2016-12-30 2017-05-31 深圳市九洲电器有限公司 Digital television program simultaneous interpretation output intent and system
CN108319888A (en) * 2017-01-17 2018-07-24 阿里巴巴集团控股有限公司 The recognition methods of video type and device, terminal
CN108319888B (en) * 2017-01-17 2023-04-07 阿里巴巴集团控股有限公司 Video type identification method and device and computer terminal
WO2018171257A1 (en) * 2017-03-21 2018-09-27 Beijing Didi Infinity Technology And Development Co., Ltd. Systems and methods for speech information processing
CN109074803A (en) * 2017-03-21 2018-12-21 北京嘀嘀无限科技发展有限公司 Speech information processing system and method
CN109074803B (en) * 2017-03-21 2022-10-18 北京嘀嘀无限科技发展有限公司 Voice information processing system and method
CN108899031A (en) * 2018-07-17 2018-11-27 广西师范学院 Strong language audio recognition method based on cloud computing
WO2021259073A1 (en) * 2020-06-26 2021-12-30 International Business Machines Corporation System for voice-to-text tagging for rich transcription of human speech
GB2611684A (en) * 2020-06-26 2023-04-12 Ibm System for voice-to-text tagging for rich transcription of human speech
US11817100B2 (en) 2020-06-26 2023-11-14 International Business Machines Corporation System for voice-to-text tagging for rich transcription of human speech

Similar Documents

Publication Publication Date Title
CN103700370B Broadcast television voice recognition method and system
WO2021073116A1 (en) Method and apparatus for generating legal document, device and storage medium
Gupta et al. The AT&T spoken language understanding system
CN101930735B (en) Speech emotion recognition equipment and speech emotion recognition method
CN105957517A (en) Voice data structured conversion method and system based on open source API
CN112735383A (en) Voice signal processing method, device, equipment and storage medium
CN111105785B (en) Text prosody boundary recognition method and device
CN109256150A (en) Speech emotion recognition system and method based on machine learning
CN101685634A (en) Children speech emotion recognition method
CN108877769B (en) Method and device for identifying dialect type
CN111177350A (en) Method, device and system for forming dialect of intelligent voice robot
Chittaragi et al. Automatic text-independent Kannada dialect identification system
KR20200119410A (en) System and Method for Recognizing Emotions from Korean Dialogues based on Global and Local Contextual Information
CN110209812A (en) File classification method and device
CN112151015A (en) Keyword detection method and device, electronic equipment and storage medium
CN111081219A (en) End-to-end voice intention recognition method
CN117349427A (en) Artificial intelligence multi-mode content generation system for public opinion event coping
Koolagudi et al. Dravidian language classification from speech signal using spectral and prosodic features
Ling An acoustic model for English speech recognition based on deep learning
Harsha et al. Lexical ambiguity in natural language processing applications
WO2024077906A1 (en) Speech text generation method and apparatus, and training method and apparatus for speech text generation model
Yue English spoken stress recognition based on natural language processing and endpoint detection algorithm
Zahariev et al. An approach to speech ambiguities eliminating using semantically-acoustical analysis
CN112150103B (en) Schedule setting method, schedule setting device and storage medium
CN114707515A (en) Method and device for judging dialect, electronic equipment and storage medium

Legal Events

Code Description
C06 / PB01: Publication
C10 / SE01: Entry into substantive examination (entry into force of request for substantive examination)
RJ01: Rejection of invention patent application after publication (application publication date: 20160921)