CN105957517A - Voice data structured conversion method and system based on open source API - Google Patents

Voice data structured conversion method and system based on open source API

Info

Publication number
CN105957517A
CN105957517A (application CN201610286831.9A)
Authority
CN
China
Prior art keywords
data
speech
voice
api
speech data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610286831.9A
Other languages
Chinese (zh)
Inventor
许爱东
郭晓斌
黄文琦
陈华军
李果
蒋屹新
袁小凯
蒙家晓
张福铮
黄建理
杜金燃
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China South Power Grid International Co ltd
Power Grid Technology Research Center of China Southern Power Grid Co Ltd
Original Assignee
China South Power Grid International Co ltd
Power Grid Technology Research Center of China Southern Power Grid Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China South Power Grid International Co ltd, Power Grid Technology Research Center of China Southern Power Grid Co Ltd filed Critical China South Power Grid International Co ltd
Priority to CN201610286831.9A
Publication of CN105957517A
Legal status: Pending


Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 — Speech recognition
    • G10L15/02 — Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/04 — Segmentation; Word boundary detection
    • G10L15/22 — Procedures used during a speech recognition process, e.g. man-machine dialogue

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a method and system for the structured conversion of voice data based on open-source APIs. The method comprises the following steps: extracting voice data from a data source; performing segmentation, fragmentation, and non-voice filtering on the voice data to obtain its feature text data; performing speech recognition on the feature text data using an open-source speech recognition API (application programming interface) to obtain the frequency, loudness, emotion information, and recognition sequence of the voice; and fusing the frequency, loudness, emotion information, and recognition sequence and applying structured text marking to generate structured voice data. This scheme converts voice data into structured form, facilitating its storage and management; moreover, it offers high working efficiency and effective emotion analysis, improving the accuracy of the structured conversion.

Description

Method and system for structured conversion of speech data based on open-source APIs
Technical field
The present invention relates to the technical field of data processing, and in particular to a method and system for the structured conversion of speech data based on open-source APIs.
Background technology
In recent years, with the rise of the concept of big data, the volume of unstructured data — chiefly electronic documents, mail, forms, audio, video, and graphic images — has grown rapidly, and speech data within unstructured data stands out in particular for its practicality and pace of development. As the quantity of unstructured data grows, its storage and management problems become increasingly acute and severely impair the efficiency of data processing. Structured data in relational databases grows far more slowly and does not suffer comparable storage and management problems. In practical applications, therefore, and especially in the technical field of speech data processing, converting unstructured data into structured data before storing and managing it has become an effective way of solving the above problems.
The prior art offers two main methods for converting speech data into structured data: manual coding analysis and software coding analysis. Manual coding analysis can perform semantic analysis with accurate, context-aware word segmentation, places relatively low demands on the form of the data, and can even analyze colloquial writing; however, it is relatively slow and ill-suited to analyzing larger volumes of data. Software coding analysis has relatively low semantic accuracy, involves comparatively cumbersome steps, lacks adequate sentiment analysis, and leaves processing efficiency to be improved.
In summary, existing methods for the structured conversion of speech data are relatively slow and relatively inaccurate.
Summary of the invention
Based on this, to address the relatively slow processing speed and relatively low accuracy of existing methods for the structured conversion of speech data, it is necessary to provide a method and system for the structured conversion of speech data based on open-source APIs.
A method for the structured conversion of speech data based on open-source APIs, characterized by comprising the following steps:
extracting the speech data from a data source;
performing segmentation, fragmentation, and non-voice filtering on the speech data to obtain feature text data of the speech data;
performing speech recognition on the feature text data using an open-source speech recognition API to obtain the frequency, loudness, emotion information, and recognition sequence of the voice, wherein the emotion information is information representing an emotion category;
fusing the frequency, loudness, emotion information, and recognition sequence of the voice and applying structured text marking to generate structured speech data.
With the above method, the speech data extracted from the data source is segmented, fragmented, and filtered of non-voice content to obtain its feature text data; the open-source speech recognition API performs recognition on the feature text data, yielding the frequency, loudness, emotion information, and recognition sequence of the voice; these are then fused and marked as structured text to generate the structured speech data. This scheme achieves the structured conversion of the speech data in the data source, facilitating its storage and management; moreover, the method of the present invention works efficiently and provides effective sentiment analysis, further improving the accuracy of the structured conversion.
A system for the structured conversion of speech data based on open-source APIs, comprising:
an extraction module for extracting the speech data from a data source;
a preprocessing module for performing segmentation, fragmentation, and non-voice filtering on the speech data to obtain feature text data of the speech data;
a speech recognition module for performing speech recognition on the feature text data using an open-source speech recognition API to obtain the frequency, loudness, emotion information, and recognition sequence of the voice;
a conversion module for fusing the frequency, loudness, emotion information, and recognition sequence of the voice and applying structured text marking to generate structured speech data.
With the above system, the preprocessing module segments, fragments, and filters non-voice content from the speech data extracted from the data source to obtain its feature text data; the speech recognition module performs recognition on the feature text data using the open-source speech recognition API, yielding the frequency, loudness, emotion information, and recognition sequence of the voice; and the conversion module fuses these and applies structured text marking to generate the structured speech data. This scheme achieves the structured conversion of the speech data in the data source, facilitating its storage and management; moreover, the system of the present invention works efficiently and provides effective sentiment analysis, further improving the accuracy of the structured conversion.
Brief description of the drawings
Fig. 1 is a flow chart of the method for structured conversion of speech data based on open-source APIs according to one embodiment of the present invention;
Fig. 2 is a flow chart of the preprocessing steps of the method for structured conversion of speech data based on open-source APIs according to another embodiment of the present invention;
Fig. 3 is a flow chart of the speech recognition steps of the method for structured conversion of speech data based on open-source APIs according to another embodiment of the present invention;
Fig. 4 is a flow chart of the conversion steps of the method for structured conversion of speech data based on open-source APIs according to another embodiment of the present invention;
Fig. 5 is a structural diagram of the system for structured conversion of speech data based on open-source APIs according to one embodiment of the present invention;
Fig. 6 is a structural diagram of the preprocessing module of the system for structured conversion of speech data based on open-source APIs according to another embodiment of the present invention;
Fig. 7 is a structural diagram of the speech recognition module of the system for structured conversion of speech data based on open-source APIs according to another embodiment of the present invention;
Fig. 8 is a structural diagram of the structuring module of the system for structured conversion of speech data based on open-source APIs according to another embodiment of the present invention;
Fig. 9 is a structural diagram of the conversion module of the system for structured conversion of speech data based on open-source APIs according to another embodiment of the present invention.
Detailed description of the invention
To further illustrate the technical means adopted by the present invention and the effects achieved, the technical scheme of the present invention is described clearly and completely below with reference to the accompanying drawings and preferred embodiments.
As shown in Fig. 1, which is a flow chart of the method for structured conversion of speech data based on open-source APIs according to one embodiment of the present invention, the method comprises the following steps:
Step S101: extract the speech data from the data source.
In this step, extracting the speech data from the data source prevents non-speech data from interfering with the speech data during conversion, improving the accuracy of the method for structured conversion of speech data based on open-source APIs of the present invention. In practical applications, step S101 extracts the audio information from a data source that may contain multiple data types — for example, extracting the audio track from a segment of video.
To extract an audio track from a data source, the APIs provided by the Windows system may be used. Taking the methods provided by winmm.h as an example, the required references are first included with the following statements:
#include <Windows.h>
#include "mmsystem.h"
#pragma comment(lib, "winmm.lib")
Then the functions are called according to the functional requirements, for example:
waveInOpen: open the specified audio input device for recording;
waveInPrepareHeader: prepare a buffer for the audio input device;
waveInStart: start recording;
waveInClose: close the audio input device.
Since the most widely used languages in China are Mandarin and Cantonese, the language information contained in the data source preferably comprises Mandarin or Cantonese. In practical applications, it may also comprise the languages of other ethnic groups, such as Mongolian, Hui, or Zhuang.
Step S102: perform segmentation, fragmentation, and non-voice filtering on the speech data to obtain the feature text data of the speech data.
In this step, segmenting, fragmenting, and filtering the speech data further reduces the impact of non-speech data on the speech data during conversion, improving the accuracy of the method for structured conversion of speech data based on open-source APIs of the present invention.
Step S103: perform speech recognition on the feature text data using an open-source speech recognition API (Application Programming Interface) to obtain the frequency, loudness, emotion information, and recognition sequence of the voice.
The emotion information of the voice in this step includes, but is not limited to, fundamental frequency, duration, energy, and spectrum. Performing recognition on the feature text data with an open-source speech recognition API to obtain the frequency, loudness, emotion information, and recognition sequence of the voice simplifies the recognition steps and improves the efficiency of the speech data conversion.
The open-source speech recognition API in this step may be the Baidu speech recognition interface, the Google speech recognition interface, the Microsoft speech recognition interface, or the iFlytek speech recognition interface, among others. Taking the Baidu speech recognition interface as an example, under a Python environment the statement
import baidu_oauth
includes the Baidu speech recognition library file; the statements
asr_server = 'http://vop.baidu.com/server_api'
baidu_oauth_url = 'https://openapi.baidu.com/oauth/2.0/token'
connect to Baidu speech and perform authorization; and the statement
data_dict = {'format': 'wav', 'rate': 8000, 'channel': 1, 'cuid': mac_address, 'token': access_token, 'lan': 'zh', 'speech': speech_base64, 'len': speech_length}
specifies and establishes the data structure, which may contain information such as the file type, bit rate, frequency range, and MAC address. The above takes only the use of the Baidu speech recognition interface under Python as an example; other recognition interfaces and other programming languages involve similar steps, which are not repeated here.
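To make the above statements concrete, the following is a minimal sketch of one complete recognition request in Python. It assumes only the endpoints and request fields quoted above; the OAuth client-credentials step, the JSON request body, the key/secret placeholders, and the file name are illustrative assumptions rather than part of the patent, and error handling is omitted.

import base64
import json
import urllib.request

ASR_SERVER = 'http://vop.baidu.com/server_api'
OAUTH_URL = 'https://openapi.baidu.com/oauth/2.0/token'

def get_access_token(api_key, secret_key):
    # Client-credentials grant against the OAuth endpoint quoted above.
    url = (f'{OAUTH_URL}?grant_type=client_credentials'
           f'&client_id={api_key}&client_secret={secret_key}')
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read())['access_token']

def recognize(wav_bytes, token, mac_address):
    # Build the request dictionary with the fields listed in the description.
    data_dict = {
        'format': 'wav', 'rate': 8000, 'channel': 1,
        'cuid': mac_address, 'token': token, 'lan': 'zh',
        'speech': base64.b64encode(wav_bytes).decode('ascii'),
        'len': len(wav_bytes),
    }
    req = urllib.request.Request(ASR_SERVER,
                                 data=json.dumps(data_dict).encode('utf-8'),
                                 headers={'Content-Type': 'application/json'})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())  # the recognized text is in the response

# Hypothetical usage:
# token = get_access_token('MY_API_KEY', 'MY_SECRET_KEY')
# result = recognize(open('sentence.wav', 'rb').read(), token, '00:11:22:33:44:55')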
Step S104: fuse the frequency, loudness, emotion information, and recognition sequence of the voice and apply structured text marking to generate the structured speech data.
The structured speech data in this step may be a file in XML format; that is, it is returned to the user in the Extensible Markup Language XML according to the user's needs.
With the above method for structured conversion of speech data based on open-source APIs, the speech data extracted from the data source is segmented, fragmented, and filtered of non-voice content to obtain its feature text data; the open-source speech recognition API performs recognition on the feature text data, yielding the frequency, loudness, emotion information, and recognition sequence of the voice; these are then fused and marked as structured text to generate the structured speech data. This scheme achieves the structured conversion of the speech data in the data source, facilitating its storage and management; moreover, the method of the present invention works efficiently and provides effective sentiment analysis, further improving the accuracy of the structured conversion.
As shown in Fig. 2, which is a flow chart of the preprocessing steps of the method for structured conversion of speech data based on open-source APIs according to another embodiment of the present invention, step S102 of performing segmentation, fragmentation, and non-voice filtering on the speech data to obtain its feature text data may further comprise:
Step S1021: segment and fragment the speech data to generate at least one sentence file, wherein the sentence files include speech sentence files and non-voice sentence files.
In this step, segmentation and fragmentation may be performed by locating the silent points of the speech data. For example, taking a window of 50 frames with 200 sampling points per frame as the silence threshold, a point past this threshold is considered to lie at a silent position; once a silent position is found, the speech data is cut at that position, and the resulting fragmented sentence files are annotated with information such as duration and timestamp and saved in PCM format, generating at least one sentence file.
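As an illustration of this step, the sketch below segments an 8 kHz, 16-bit mono PCM file at silent positions. The 50-frame window with 200 sampling points per frame follows the description above; the concrete energy threshold, the polarity of the silence test (low mean amplitude treated as silence), the sampling rate, and the output layout are assumptions.

import struct
import time

FRAME_SAMPLES = 200        # sampling points per frame, as in the description
WINDOW_FRAMES = 50         # frames per silence-detection window
SILENCE_THRESHOLD = 500.0  # assumed mean absolute amplitude below which a window is silent
SAMPLE_RATE = 8000         # assumed sampling rate of the sentence files

def split_on_silence(pcm_path):
    # Read raw 16-bit little-endian PCM samples.
    with open(pcm_path, 'rb') as f:
        raw = f.read()
    samples = struct.unpack(f'<{len(raw) // 2}h', raw)
    window = FRAME_SAMPLES * WINDOW_FRAMES
    fragments, start = [], 0
    for pos in range(0, len(samples) - window, window):
        chunk = samples[pos:pos + window]
        if sum(abs(s) for s in chunk) / window < SILENCE_THRESHOLD:
            # A silent window closes the current sentence fragment.
            if pos > start:
                fragments.append({'samples': samples[start:pos],
                                  'duration_s': (pos - start) / SAMPLE_RATE,
                                  'timestamp': time.time()})
            start = pos + window
    if start < len(samples):
        fragments.append({'samples': samples[start:],
                          'duration_s': (len(samples) - start) / SAMPLE_RATE,
                          'timestamp': time.time()})
    return fragments  # each fragment can then be written out as a PCM sentence file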
Step S1022: filter non-voice information from the sentence files to obtain the speech sentence files corresponding to the speech data.
Specifically, since step S1021 performs no content-level processing on the segmented and fragmented sentence files, a sentence file may contain interference such as non-voice noise or the speaker's swallowing sounds. These extraneous sounds cause misjudgments in speech recognition, reduce recognition accuracy, and add to the effective workload and burden of the system.
For effective non-voice filtering, a classification model may be trained to distinguish voice from non-voice; after the non-voice information is identified, it is deleted, leaving the effective voice information — that is, the speech sentence files corresponding to the speech data.
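A minimal sketch of such a voice/non-voice classification model follows; the library (scikit-learn), the SVM model, and the frame-level energy and zero-crossing-rate features are assumptions, since the patent does not name a concrete model.

import numpy as np
from sklearn.svm import SVC

def fragment_features(samples, frame_len=200):
    # Summarize a sentence fragment by frame-energy and zero-crossing statistics.
    x = np.asarray(samples, dtype=np.float64)
    frames = x[:len(x) - len(x) % frame_len].reshape(-1, frame_len)
    energy = (frames ** 2).mean(axis=1)
    zcr = (np.diff(np.sign(frames), axis=1) != 0).mean(axis=1)
    return np.array([energy.mean(), energy.std(), zcr.mean(), zcr.std()])

def train_voice_filter(fragments, labels):
    # labels: 1 for speech, 0 for non-voice (noise, swallowing sounds, ...).
    X = np.stack([fragment_features(f) for f in fragments])
    return SVC(kernel='rbf').fit(X, labels)

def keep_speech(model, fragments):
    # Delete fragments classified as non-voice, keeping the speech sentence files.
    X = np.stack([fragment_features(f) for f in fragments])
    return [f for f, y in zip(fragments, model.predict(X)) if y == 1]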
Step S1023: perform speech feature extraction on the speech sentence files to obtain the feature text data of the speech data.
In this step, as much audio data as possible is converted into text data to reduce the volume of data transmitted and thereby save bandwidth resources.
In practical applications, this step may use mel-cepstral coefficient extraction. Its principle is that the human ear has different auditory sensitivities to different sound waves: speech signals from 200 Hz to 5000 Hz have the greatest effect on the intelligibility of speech. When frequency components of unequal loudness are present, the louder components affect the perception of the quieter ones, making them hard to discern — that is to say, low tones easily mask high tones, while high tones mask low tones with more difficulty. Band-pass filters can therefore be arranged across this frequency band from low to high, densely at first and then sparsely, according to the critical bandwidth; the input speech signal is filtered, and the output signal energy of each band-pass filter serves as a basic feature of the signal which, after further processing, can be used as the speech feature. This is also a main reason why step S103 needs to extract frequency and loudness.
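The following sketch illustrates mel-cepstral feature extraction along these lines; librosa is an assumed library choice, since the patent names only the technique itself.

import librosa

def extract_mfcc(path, n_mfcc=13):
    # Load at an assumed 8 kHz rate to match the PCM sentence files.
    y, sr = librosa.load(path, sr=8000)
    # The mel filterbank is spaced densely at low frequencies and sparsely at
    # high ones, mirroring the critical-bandwidth arrangement described above.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T  # one n_mfcc-dimensional feature vector per frame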
As shown in Fig. 3, which is a flow chart of the speech recognition steps of the method for structured conversion of speech data based on open-source APIs according to another embodiment of the present invention, the step of performing speech recognition on the feature text data to obtain the frequency, loudness, emotion information, and recognition sequence of the voice comprises:
Step S1031: perform recognition on the feature text data using the open-source speech recognition API and extract the emotion context information contained in the feature text data, wherein the emotion context information includes prior emotion context information and spatio-temporal context information.
For the prior emotion information in this step, fuzzy inference may be used to infer and analyze the emotional state from the background noise; the spatio-temporal context information analyzes the speech intonation in the sentence fragment immediately before or after each fragment to infer the speaker's emotion in the current fragment.
Step S1032: extract, in real time, the affective features contained in the feature text data, wherein the affective features include morpheme interval information and morpheme duration information.
In this embodiment, the morpheme interval is the time interval between each two Chinese characters spoken by the speaker, and the morpheme duration is the time taken by the speaker to speak a single character; both kinds of information well reflect the speaker's affective characteristics.
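As a hypothetical illustration, these two features can be computed from per-character timestamps of the kind a recognizer may return; the input format of (character, start, end) tuples is an assumption.

def morpheme_features(char_times):
    # char_times: list of (character, start_s, end_s) tuples in spoken order.
    durations = [end - start for _, start, end in char_times]
    intervals = [char_times[i + 1][1] - char_times[i][2]
                 for i in range(len(char_times) - 1)]
    return {
        'mean_morpheme_duration': sum(durations) / len(durations),
        'mean_morpheme_interval': sum(intervals) / len(intervals) if intervals else 0.0,
    }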
Step S1033: perform structured sparse representation on the emotion context information and the affective features respectively to obtain the emotion information corresponding to the feature text data.
In one embodiment of the method for structured conversion of speech data based on open-source APIs of the present invention, step S1033 of performing structured sparse representation on the affective features may comprise the following steps:
embedding a nonlinear classification discrimination model into the dictionary optimization solution of the structured sparse representation;
optimizing the dictionary optimization solution of the sparse representation using a supervised learning method;
performing structured sparse representation on the affective features according to the optimized dictionary optimization solution of the sparse representation.
In one embodiment, let $X = \{x_1, x_2, \ldots, x_n\}$ be a set of $n$ affective feature vectors and $Y = \{y_1, y_2, \ldots, y_n\}$ a set of $n$ emotion context vectors. The kernel-based nonlinear discriminative sparse representation criterion is established as follows:

$$\min_{D,\,\theta,\,\alpha} \sum_{j=1}^{g} \sum_{i=1}^{n_j} \left( C\big(f(\alpha_i,\theta),\, y_i\big) + \lambda_0 \lVert x_i - D\alpha_i \rVert_2^2 + \lambda_1 \lVert \alpha_i \rVert_1 + \lambda_2 \lVert \alpha_{G_j} \rVert_2 + \lambda_3 \lVert \theta \rVert_2^2 \right)$$

where $D$ is the sparse representation dictionary; $\alpha_i = \{\alpha_1, \alpha_2, \ldots, \alpha_m\}$ is the set of $m$ sparse representations of the affective features; $g$ is the number of feature groups; $n_j$ is the number of affective features in the $j$-th group; $\theta$ is the kernel discrimination parameter; $f(\alpha, \theta)$ maps $\alpha$ to a high-dimensional space, being the nonlinear classification function over the sparse code $\alpha$ established with the kernel function $K$ (a Gaussian kernel may be taken, and the kernel parameters can be obtained by training); $C(f(\alpha,\theta), y_i)$ is a loss function designed according to the Fisher criterion, balancing the global objectives that the within-class scatter of same-class $\alpha$ be as small as possible and the between-class scatter of different-class $\alpha$ be as large as possible; and $\lambda_0, \lambda_1, \lambda_2, \lambda_3$ are penalty factors.
The iterative optimization may proceed as follows: with the current $D$ and $\theta$, solve the kernel-based nonlinear discriminative sparse representation criterion for the sparse codes $\alpha$ of the labeled affective features; then establish the partial differential equations of the sparse representation constraint with respect to $D$ and $\theta$, and update the dictionary $D$ and the discriminative classification parameter $\theta$ by gradient descent; iterate until convergence, thereby obtaining the dictionary after solution optimization.
From the optimized dictionary, the affective feature vector $X$ and the emotion context vector $Y$ can be derived, and an estimation algorithm can be designed that uses both vectors to estimate the emotion.
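The sketch below illustrates the alternating structure of this optimization — solve for sparse codes with the dictionary fixed, then take a gradient step on the dictionary — for the reconstruction and L1 terms only; the discriminative loss C, the kernel classifier f(α, θ), the group penalty, and the θ update are omitted, and the step sizes and iteration counts are assumptions.

import numpy as np

def sparse_code(X, D, lam1=0.1, n_iter=100, lr=0.01):
    # ISTA-style solve of min_A ||X - D A||_2^2 + lam1 ||A||_1 with D fixed.
    A = np.zeros((D.shape[1], X.shape[1]))
    for _ in range(n_iter):
        A = A - lr * (D.T @ (D @ A - X))                         # gradient step
        A = np.sign(A) * np.maximum(np.abs(A) - lr * lam1, 0.0)  # soft threshold
    return A

def learn_dictionary(X, n_atoms=64, n_outer=20, lr=0.01):
    # X: one affective feature vector per column.
    rng = np.random.default_rng(0)
    D = rng.standard_normal((X.shape[0], n_atoms))
    D /= np.linalg.norm(D, axis=0)
    for _ in range(n_outer):
        A = sparse_code(X, D)                 # codes with the dictionary fixed
        D = D - lr * (D @ A - X) @ A.T        # gradient-descent update of D
        D /= np.maximum(np.linalg.norm(D, axis=0), 1e-8)  # renormalize atoms
    return D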
In step S103, after the frequency and loudness of the voice are obtained, a Gaussian mixture model can be trained to further characterize the speaker. With the trained model, it is also possible to judge whether different voices come from the same person according to characteristics such as the intonation, frequency, and loudness of a speech segment.
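A minimal sketch of that check follows, assuming scikit-learn's GaussianMixture over frame-level features such as the mel-cepstral coefficients above; the component count and decision threshold are illustrative assumptions to be tuned on real data.

from sklearn.mixture import GaussianMixture

def fit_speaker_model(feature_frames, n_components=8):
    # feature_frames: array of shape (n_frames, n_features) for one speaker.
    return GaussianMixture(n_components=n_components).fit(feature_frames)

def same_speaker(model, feature_frames, threshold=-50.0):
    # A higher average log-likelihood means the new voice fits the model better.
    return model.score(feature_frames) > threshold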
As shown in Fig. 4, which is a flow chart of the conversion steps of the method according to another embodiment of the present invention, step S104 of fusing the frequency, loudness, emotion information, and recognition sequence of the voice and applying structured text marking to generate the structured speech data comprises:
Step S1041: collect and align the frequency, loudness, emotion information, and recognition sequence of the voice, and sort them according to the start and end times contained in the recognition sequence;
Step S1042: mark the sorted recognition sequence according to the structured format to generate the structured speech data, wherein the marks include gender marks, timbre marks, punctuation marks, and timestamp marks.
In the above embodiment, the structured speech data is finally generated and sent to the client in the form of an XML message, which may include content such as the source file name, gender mark, speech duration mark, and voice/non-voice mark. In particular, to ensure recognition accuracy, the open-source speech recognition library used in the present invention is periodically collected and updated from network text; it can therefore include popular vocabulary, adapt better to speech recognition in various situations, and help improve the accuracy of speech recognition.
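A sketch of such an XML message is given below; the tag and attribute names are hypothetical, since the patent specifies only that the message carries items such as the source file name, gender mark, duration mark, and voice/non-voice mark.

import xml.etree.ElementTree as ET

def build_structured_xml(source_file, segments):
    # segments: fused recognition results sorted by start time, each carrying
    # the frequency, loudness, emotion information, and marked text.
    root = ET.Element('speech', {'source': source_file})
    for seg in sorted(segments, key=lambda s: s['start']):
        elem = ET.SubElement(root, 'segment', {
            'start': str(seg['start']), 'end': str(seg['end']),
            'gender': seg.get('gender', 'unknown'),
            'timbre': seg.get('timbre', ''),
            'emotion': seg.get('emotion', ''),
            'voice': str(seg.get('is_voice', True)),
        })
        elem.text = seg['text']  # punctuated recognition result
    return ET.tostring(root, encoding='unicode')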
As shown in Fig. 5, which is a structural diagram of the system for structured conversion of speech data based on open-source APIs according to one embodiment of the present invention, the system comprises:
an extraction module 101 for extracting the speech data from the data source.
In the extraction module 101, extracting the speech data from the data source prevents non-speech data from interfering with the speech data during conversion, improving the accuracy of the system for structured conversion of speech data based on open-source APIs of the present invention. In practical applications, the extraction module 101 can extract the audio information from a data source that may contain multiple data types — for example, extracting the audio track from a segment of video.
To extract an audio track from a data source, the APIs provided by the Windows system may be used. Taking the methods provided by winmm.h as an example, the required references are first included with the following statements:
#include <Windows.h>
#include "mmsystem.h"
#pragma comment(lib, "winmm.lib")
Then the functions are called according to the functional requirements, for example:
waveInOpen: open the specified audio input device for recording;
waveInPrepareHeader: prepare a buffer for the audio input device;
waveInStart: start recording;
waveInClose: close the audio input device.
Since the most widely used languages in China are Mandarin and Cantonese, the language information contained in the data source preferably comprises Mandarin or Cantonese. In practical applications, it may also comprise the languages of other ethnic groups, such as Mongolian, Hui, or Uyghur.
a preprocessing module 102 for performing segmentation, fragmentation, and non-voice filtering on the speech data to obtain the feature text data of the speech data.
In the preprocessing module 102, segmenting, fragmenting, and filtering the speech data further reduces the impact of non-speech data on the speech data during conversion, improving the accuracy of the system for structured conversion of speech data based on open-source APIs of the present invention.
a speech recognition module 103 for performing speech recognition on the feature text data using the open-source speech recognition API to obtain the frequency, loudness, emotion information, and recognition sequence of the voice.
The emotion information of the voice in the speech recognition module 103 includes, but is not limited to, fundamental frequency, duration, energy, and spectrum. Performing recognition on the feature text data with the open-source speech recognition API to obtain the frequency, loudness, emotion information, and recognition sequence of the voice simplifies the recognition steps and improves the efficiency of the speech data conversion.
The open-source speech recognition API used in the above speech recognition module 103 may be the Baidu speech recognition interface, the Google speech recognition interface, the Microsoft speech recognition interface, or the iFlytek speech recognition interface, among others. Taking the Baidu speech recognition interface as an example, under a Python environment the statement
import baidu_oauth
includes the Baidu speech recognition library file; the statements
asr_server = 'http://vop.baidu.com/server_api'
baidu_oauth_url = 'https://openapi.baidu.com/oauth/2.0/token'
connect to Baidu speech and perform authorization; and the statement
data_dict = {'format': 'wav', 'rate': 8000, 'channel': 1, 'cuid': mac_address, 'token': access_token, 'lan': 'zh', 'speech': speech_base64, 'len': speech_length}
specifies and establishes the data structure, which may contain information such as the file type, bit rate, frequency range, and MAC address. The above takes only the use of the Baidu speech recognition interface under Python as an example; other recognition interfaces and other programming languages involve similar steps, which are not repeated here.
a conversion module 104 for fusing the frequency, loudness, emotion information, and recognition sequence of the voice and applying structured text marking to generate the structured speech data.
The structured speech data in the above conversion module 104 may be a file in XML format; that is, it is returned to the user in the Extensible Markup Language XML according to the user's needs.
With the above system for structured conversion of speech data based on open-source APIs, the preprocessing module segments, fragments, and filters non-voice content from the speech data extracted from the data source to obtain its feature text data; the speech recognition module performs recognition on the feature text data using the open-source speech recognition interface, yielding the frequency, loudness, emotion information, and recognition sequence of the voice; and the conversion module fuses these and applies structured text marking to generate the structured speech data. This scheme achieves the structured conversion of the speech data in the data source, facilitating its storage and management; moreover, the system of the present invention works efficiently and provides effective sentiment analysis, further improving the accuracy of the structured conversion.
As shown in Fig. 6, which is a structural diagram of the preprocessing module of the system for structured conversion of speech data based on open-source APIs according to another embodiment of the present invention, the preprocessing module 102 comprises:
a segmentation module 1021 for segmenting and fragmenting the speech data to generate at least one sentence file, wherein the sentence files include speech sentence files and non-voice sentence files.
In the above segmentation module 1021, segmentation and fragmentation may be performed by locating the silent points of the speech data. For example, taking a window of 50 frames with 200 sampling points per frame as the silence threshold, a point past this threshold is considered to lie at a silent position; once a silent position is found, the speech data is cut at that position, and the resulting fragmented sentence files are annotated with information such as duration and timestamp and saved in PCM format, generating at least one sentence file.
a non-voice filtering module 1022 for filtering non-voice information from the sentence files to obtain the speech sentence files corresponding to the speech data.
Specifically, since the segmentation module 1021 performs no content-level processing on the segmented and fragmented sentence files, a sentence file may contain interference such as non-voice noise or the speaker's swallowing sounds. These extraneous sounds cause misjudgments in speech recognition, reduce recognition accuracy, and add to the effective workload and burden of the system.
For effective non-voice filtering, a classification model may be trained to distinguish voice from non-voice; after the non-voice information is identified, it is deleted, leaving the effective voice information — that is, the speech sentence files corresponding to the speech data.
a feature extraction module 1023 for performing speech feature extraction on the speech sentence files to obtain the feature text data of the speech data.
In the above feature extraction module 1023, as much audio data as possible is converted into text data to reduce the volume of data transmitted and thereby save bandwidth resources.
In practical applications, mel-cepstral coefficient extraction may be used for speech feature extraction. Its principle is that the human ear has different auditory sensitivities to different sound waves: speech signals from 200 Hz to 5000 Hz have the greatest effect on the intelligibility of speech. When frequency components of unequal loudness are present, the louder components affect the perception of the quieter ones, making them hard to discern — that is to say, low tones easily mask high tones, while high tones mask low tones with more difficulty. Band-pass filters can therefore be arranged across this frequency band from low to high, densely at first and then sparsely, according to the critical bandwidth; the input speech signal is filtered, and the output signal energy of each band-pass filter serves as a basic feature of the signal which, after further processing, can be used as the speech feature. This is also a main reason why the speech recognition module 103 needs to extract frequency and loudness.
As shown in Fig. 7, which is a structural diagram of the speech recognition module of the system for structured conversion of speech data based on open-source APIs according to another embodiment of the present invention, the speech recognition module 103 comprises:
an emotion information extraction module 1031 for performing recognition on the feature text data using the open-source speech recognition API and extracting the emotion context information contained in the feature text data, wherein the emotion context information includes prior emotion context information and spatio-temporal context information.
For the prior emotion information in the above emotion information extraction module 1031, fuzzy inference may be used to infer and analyze the emotional state from the background noise; the spatio-temporal context information analyzes the speech intonation in the sentence fragment immediately before or after each fragment to infer the speaker's emotion in the current fragment.
an affective feature extraction module 1032 for extracting, in real time, the affective features contained in the feature text data, wherein the affective features include morpheme interval information and morpheme duration information.
In this embodiment, the morpheme interval is the time interval between each two Chinese characters spoken by the speaker, and the morpheme duration is the time taken by the speaker to speak a single character; both kinds of information well reflect the speaker's affective characteristics.
a structuring module 1033 for performing structured sparse representation on the emotion context information and the affective features respectively to obtain the emotion information corresponding to the feature text data.
As shown in Fig. 8, which is a structural diagram of the structuring module of the system for structured conversion of speech data based on open-source APIs according to another embodiment of the present invention, the structuring module 1033 comprises:
an embedding module 10331 for embedding the nonlinear classification discrimination model into the dictionary optimization solution of the structured sparse representation;
an optimization module 10332 for optimizing the dictionary optimization solution of the sparse representation using a supervised learning method;
a sparsification module 10333 for performing structured sparse representation on the affective features according to the optimized dictionary optimization solution of the sparse representation.
In one embodiment, let $X = \{x_1, x_2, \ldots, x_n\}$ be a set of $n$ affective feature vectors and $Y = \{y_1, y_2, \ldots, y_n\}$ a set of $n$ emotion context vectors. The kernel-based nonlinear discriminative sparse representation criterion is established as follows:

$$\min_{D,\,\theta,\,\alpha} \sum_{j=1}^{g} \sum_{i=1}^{n_j} \left( C\big(f(\alpha_i,\theta),\, y_i\big) + \lambda_0 \lVert x_i - D\alpha_i \rVert_2^2 + \lambda_1 \lVert \alpha_i \rVert_1 + \lambda_2 \lVert \alpha_{G_j} \rVert_2 + \lambda_3 \lVert \theta \rVert_2^2 \right)$$

where $D$ is the sparse representation dictionary; $\alpha_i = \{\alpha_1, \alpha_2, \ldots, \alpha_m\}$ is the set of $m$ sparse representations of the affective features; $g$ is the number of feature groups; $n_j$ is the number of affective features in the $j$-th group; $\theta$ is the kernel discrimination parameter; $f(\alpha, \theta)$ maps $\alpha$ to a high-dimensional space, being the nonlinear classification function over the sparse code $\alpha$ established with the kernel function $K$ (a Gaussian kernel may be taken, and the kernel parameters can be obtained by training); $C(f(\alpha,\theta), y_i)$ is a loss function designed according to the Fisher criterion, balancing the global objectives that the within-class scatter of same-class $\alpha$ be as small as possible and the between-class scatter of different-class $\alpha$ be as large as possible; and $\lambda_0, \lambda_1, \lambda_2, \lambda_3$ are penalty factors.
The iterative optimization may proceed as follows: with the current $D$ and $\theta$, solve the kernel-based nonlinear discriminative sparse representation criterion for the sparse codes $\alpha$ of the labeled affective features; then establish the partial differential equations of the sparse representation constraint with respect to $D$ and $\theta$, and update the dictionary $D$ and the discriminative classification parameter $\theta$ by gradient descent; iterate until convergence, thereby obtaining the dictionary after solution optimization.
From the optimized dictionary, the affective feature vector $X$ and the emotion context vector $Y$ can be derived, and an estimation algorithm can be designed that uses both vectors to estimate the emotion.
In the speech recognition module 103, after the frequency and loudness of the voice are obtained, a Gaussian mixture model can be trained to further characterize the speaker. With the trained model, it is also possible to judge whether different voices come from the same person according to characteristics such as the intonation, frequency, and loudness of a speech segment.
As shown in Fig. 9, which is a structural diagram of the conversion module of the system for structured conversion of speech data based on open-source APIs according to another embodiment of the present invention, the conversion module 104 comprises:
a sorting module 1041 for collecting and aligning the frequency, loudness, emotion information, and recognition sequence of the voice, and sorting them according to the start and end times contained in the recognition sequence;
a marking module 1042 for marking the sorted recognition sequence according to the structured format to generate the structured speech data, wherein the marks include gender marks, timbre marks, punctuation marks, and timestamp marks.
In the above embodiment, the structured speech data is finally generated and sent to the client in the form of an XML message, which may include content such as the source file name, gender mark, speech duration mark, and voice/non-voice mark. In particular, to ensure recognition accuracy, the open-source speech recognition library used in the present invention is periodically collected and updated from network text; it can therefore include popular vocabulary, adapt better to speech recognition in various situations, and help improve the accuracy of speech recognition.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments have been described; nevertheless, as long as a combination of these technical features involves no contradiction, it should be considered within the scope recorded in this specification.
The above embodiments express only several implementations of the present invention, and their descriptions are relatively specific and detailed, but they should not therefore be construed as limiting the scope of the patent. It should be pointed out that those of ordinary skill in the art may make a number of variations and improvements without departing from the concept of the present invention, all of which fall within the protection scope of the present invention. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A method for the structured conversion of speech data based on open-source APIs, characterized by comprising the following steps:
extracting the speech data from a data source;
performing segmentation, fragmentation, and non-voice filtering on the speech data to obtain feature text data of the speech data;
performing speech recognition on the feature text data using an open-source speech recognition API to obtain the frequency, loudness, emotion information, and recognition sequence of the voice;
fusing the frequency, loudness, emotion information, and recognition sequence of the voice and applying structured text marking to generate structured speech data.

2. The method for the structured conversion of speech data based on open-source APIs according to claim 1, characterized in that the step of performing segmentation, fragmentation, and non-voice filtering on the speech data to obtain the feature text data of the speech data comprises:
segmenting and fragmenting the speech data to generate at least one sentence file, wherein the sentence files include speech sentence files and non-voice sentence files;
filtering non-voice information from the sentence files to obtain the speech sentence files corresponding to the speech data;
performing speech feature extraction on the speech sentence files to obtain the feature text data of the speech data.

3. The method for the structured conversion of speech data based on open-source APIs according to claim 1, characterized in that the step of performing speech recognition on the feature text data using the open-source speech recognition API comprises:
performing recognition on the feature text data using the open-source speech recognition API and extracting the emotion context information contained in the feature text data, wherein the emotion context information includes prior emotion context information and spatio-temporal context information;
extracting, in real time, the affective features contained in the feature text data, wherein the affective features include morpheme interval information and morpheme duration information;
performing structured sparse representation on the emotion context information and the affective features respectively to obtain the emotion information corresponding to the feature text data.

4. The method for the structured conversion of speech data based on open-source APIs according to claim 3, characterized in that the step of performing structured sparse representation on the affective features comprises:
embedding a nonlinear classification discrimination model into the dictionary optimization solution of the structured sparse representation;
optimizing the dictionary optimization solution of the sparse representation using a supervised learning method;
performing structured sparse representation on the affective features according to the optimized dictionary optimization solution of the sparse representation.

5. The method for the structured conversion of speech data based on open-source APIs according to claim 1, characterized in that the step of fusing the frequency, loudness, emotion information, and recognition sequence of the voice and applying structured text marking to generate the structured speech data comprises:
collecting and aligning the frequency, loudness, emotion information, and recognition sequence of the voice, and sorting them according to the start and end times contained in the recognition sequence;
marking the sorted recognition sequence according to the structured format to generate the structured speech data, wherein the marks include gender marks, timbre marks, punctuation marks, and timestamp marks.

6. A system for the structured conversion of speech data based on open-source APIs, characterized by comprising:
an extraction module for extracting the speech data from a data source;
a preprocessing module for performing segmentation, fragmentation, and non-voice filtering on the speech data to obtain feature text data of the speech data;
a speech recognition module for performing speech recognition on the feature text data using an open-source speech recognition API to obtain the frequency, loudness, emotion information, and recognition sequence of the voice;
a conversion module for fusing the frequency, loudness, emotion information, and recognition sequence of the voice and applying structured text marking to generate structured speech data.

7. The system for the structured conversion of speech data based on open-source APIs according to claim 6, characterized in that the preprocessing module comprises:
a segmentation module for segmenting and fragmenting the speech data to generate at least one sentence file, wherein the sentence files include speech sentence files and non-voice sentence files;
a non-voice filtering module for filtering non-voice information from the sentence files to obtain the speech sentence files corresponding to the speech data;
a feature extraction module for performing speech feature extraction on the speech sentence files to obtain the feature text data of the speech data.

8. The system for the structured conversion of speech data based on open-source APIs according to claim 6, characterized in that the speech recognition module comprises:
an emotion information extraction module for performing recognition on the feature text data using the open-source speech recognition API and extracting the emotion context information contained in the feature text data, wherein the emotion context information includes prior emotion context information and spatio-temporal context information;
an affective feature extraction module for extracting, in real time, the affective features contained in the feature text data, wherein the affective features include morpheme interval information and morpheme duration information;
a structuring module for performing structured sparse representation on the emotion context information and the affective features respectively to obtain the emotion information corresponding to the feature text data.

9. The system for the structured conversion of speech data based on open-source APIs according to claim 7, characterized in that the structuring module comprises:
an embedding module for embedding a nonlinear classification discrimination model into the dictionary optimization solution of the structured sparse representation;
an optimization module for optimizing the dictionary optimization solution of the sparse representation using a supervised learning method;
a sparsification module for performing structured sparse representation on the affective features according to the optimized dictionary optimization solution of the sparse representation.

10. The system for the structured conversion of speech data based on open-source APIs according to claim 6, characterized in that the conversion module comprises:
a sorting module for collecting and aligning the frequency, loudness, emotion information, and recognition sequence of the voice, and sorting them according to the start and end times contained in the recognition sequence;
a marking module for marking the sorted recognition sequence according to the structured format to generate the structured speech data, wherein the marks include gender marks, timbre marks, punctuation marks, and timestamp marks.
CN201610286831.9A 2016-04-29 2016-04-29 Voice data structured conversion method and system based on open source API Pending CN105957517A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610286831.9A CN105957517A (en) 2016-04-29 2016-04-29 Voice data structured conversion method and system based on open source API

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610286831.9A CN105957517A (en) 2016-04-29 2016-04-29 Voice data structured conversion method and system based on open source API

Publications (1)

Publication Number Publication Date
CN105957517A true CN105957517A (en) 2016-09-21

Family

ID=56913436

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610286831.9A Pending CN105957517A (en) 2016-04-29 2016-04-29 Voice data structured conversion method and system based on open source API

Country Status (1)

Country Link
CN (1) CN105957517A (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101685634A (en) * 2008-09-27 2010-03-31 上海盛淘智能科技有限公司 Children speech emotion recognition method
CN101847406A (en) * 2010-05-18 2010-09-29 中国农业大学 Speech recognition query method and system
US20130297297A1 (en) * 2012-05-07 2013-11-07 Erhan Guven System and method for classification of emotion in human speech
CN103123619A (en) * 2012-12-04 2013-05-29 江苏大学 Visual speech multi-mode collaborative analysis method based on emotion context and system
CN103700370A (en) * 2013-12-04 2014-04-02 北京中科模识科技有限公司 Broadcast television voice recognition method and system
CN104050963A (en) * 2014-06-23 2014-09-17 东南大学 Continuous speech emotion prediction algorithm based on emotion data field

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
STRONGLEG: "Implementing speech recognition with the Baidu API — in Python" (使用百度API实现语音识别——in python), Sina Blog *
Chinese Association for Artificial Intelligence: "Progress of Artificial Intelligence in China" (《中国人工智能进展》), Beijing University of Posts and Telecommunications Press, 31 December 2007 *
Zhao Li et al.: "Research on the analysis and recognition of emotional features in speech signals" (语音信号中的情感特征分析和识别的研究), Acta Electronica Sinica (《电子学报》) *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106791913A (en) * 2016-12-30 2017-05-31 深圳市九洲电器有限公司 Digital television program simultaneous interpretation output intent and system
CN108319888A (en) * 2017-01-17 2018-07-24 阿里巴巴集团控股有限公司 The recognition methods of video type and device, terminal
CN108319888B (en) * 2017-01-17 2023-04-07 阿里巴巴集团控股有限公司 Video type identification method and device and computer terminal
WO2018171257A1 (en) * 2017-03-21 2018-09-27 Beijing Didi Infinity Technology And Development Co., Ltd. Systems and methods for speech information processing
CN109074803A (en) * 2017-03-21 2018-12-21 北京嘀嘀无限科技发展有限公司 Speech information processing system and method
CN109074803B (en) * 2017-03-21 2022-10-18 北京嘀嘀无限科技发展有限公司 Voice information processing system and method
CN108899031A (en) * 2018-07-17 2018-11-27 广西师范学院 Strong language audio recognition method based on cloud computing
WO2021259073A1 (en) * 2020-06-26 2021-12-30 International Business Machines Corporation System for voice-to-text tagging for rich transcription of human speech
GB2611684A (en) * 2020-06-26 2023-04-12 Ibm System for voice-to-text tagging for rich transcription of human speech
US11817100B2 (en) 2020-06-26 2023-11-14 International Business Machines Corporation System for voice-to-text tagging for rich transcription of human speech

Similar Documents

Publication Publication Date Title
CN103700370B Broadcast television voice recognition method and system
WO2021073116A1 (en) Method and apparatus for generating legal document, device and storage medium
Gupta et al. The AT&T spoken language understanding system
CN101930735B (en) Speech emotion recognition equipment and speech emotion recognition method
CN105957517A (en) Voice data structured conversion method and system based on open source API
CN112735383A (en) Voice signal processing method, device, equipment and storage medium
CN111105785B (en) Text prosody boundary recognition method and device
CN109256150A (en) Speech emotion recognition system and method based on machine learning
CN101685634A (en) Children speech emotion recognition method
CN108877769B (en) Method and device for identifying dialect type
CN111177350A (en) Method, device and system for forming dialect of intelligent voice robot
Chittaragi et al. Automatic text-independent Kannada dialect identification system
KR20200119410A (en) System and Method for Recognizing Emotions from Korean Dialogues based on Global and Local Contextual Information
CN110209812A (en) File classification method and device
CN112151015A (en) Keyword detection method and device, electronic equipment and storage medium
CN111081219A (en) End-to-end voice intention recognition method
CN117349427A (en) Artificial intelligence multi-mode content generation system for public opinion event coping
Koolagudi et al. Dravidian language classification from speech signal using spectral and prosodic features
Ling An acoustic model for English speech recognition based on deep learning
Harsha et al. Lexical ambiguity in natural language processing applications
WO2024077906A1 (en) Speech text generation method and apparatus, and training method and apparatus for speech text generation model
Yue English spoken stress recognition based on natural language processing and endpoint detection algorithm
Zahariev et al. An approach to speech ambiguities eliminating using semantically-acoustical analysis
CN112150103B (en) Schedule setting method, schedule setting device and storage medium
CN114707515A (en) Method and device for judging dialect, electronic equipment and storage medium

Legal Events

Code Description
C06 / PB01: Publication
C10 / SE01: Entry into substantive examination (entry into force of request for substantive examination)
RJ01: Rejection of invention patent application after publication (application publication date: 20160921)