CN105957517A - Voice data structured conversion method and system based on open source API - Google Patents
Voice data structured conversion method and system based on open source API
- Publication number
- CN105957517A (application CN201610286831.9A)
- Authority
- CN
- China
- Prior art keywords
- data
- speech
- voice
- api
- speech data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000000034 method Methods 0.000 title claims abstract description 66
- 238000006243 chemical reaction Methods 0.000 title claims abstract description 39
- 230000008451 emotion Effects 0.000 claims abstract description 68
- 238000013467 fragmentation Methods 0.000 claims abstract description 24
- 238000006062 fragmentation reaction Methods 0.000 claims abstract description 24
- 238000012545 processing Methods 0.000 claims abstract description 23
- 238000001914 filtration Methods 0.000 claims abstract description 22
- 238000000605 extraction Methods 0.000 claims description 23
- 239000000284 extract Substances 0.000 claims description 17
- 230000008569 process Effects 0.000 claims description 16
- 238000005457 optimization Methods 0.000 claims description 11
- 230000008859 change Effects 0.000 claims description 4
- 238000003780 insertion Methods 0.000 claims description 4
- 230000037431 insertion Effects 0.000 claims description 4
- 239000000203 mixture Substances 0.000 claims description 4
- 238000004458 analytical method Methods 0.000 abstract description 9
- 230000011218 segmentation Effects 0.000 abstract 1
- 230000006870 function Effects 0.000 description 17
- 239000013598 vector Substances 0.000 description 10
- 239000012634 fragment Substances 0.000 description 6
- 238000007726 management method Methods 0.000 description 6
- 238000013461 design Methods 0.000 description 4
- 238000005516 engineering process Methods 0.000 description 3
- 238000013475 authorization Methods 0.000 description 2
- 230000005540 biological transmission Effects 0.000 description 2
- 238000007405 data analysis Methods 0.000 description 2
- 238000013501 data transformation Methods 0.000 description 2
- 238000011478 gradient descent method Methods 0.000 description 2
- 230000008676 import Effects 0.000 description 2
- 230000000452 restraining effect Effects 0.000 description 2
- 230000035945 sensitivity Effects 0.000 description 2
- 238000001228 spectrum Methods 0.000 description 2
- 238000012549 training Methods 0.000 description 2
- 238000011161 development Methods 0.000 description 1
- 230000000694 effects Effects 0.000 description 1
- 238000012546 transfer Methods 0.000 description 1
- 230000009466 transformation Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/04—Segmentation; Word boundary detection
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Machine Translation (AREA)
Abstract
The invention relates to a voice data structured conversion method and system based on an open-source API. The method comprises the following steps: extracting voice data from a data source; performing segmentation, fragmentation and non-voice filtering on the voice data to obtain feature text data of the voice data; performing voice recognition on the feature text data using an open-source voice recognition API (application programming interface) to obtain the frequency, loudness, emotion information and voice recognition sequence of the voice; and fusing the frequency, loudness, emotion information and voice recognition sequence and applying structured text tagging to generate structured voice data. This technical scheme converts voice data into structured form, which facilitates its storage and management; furthermore, the method offers high working efficiency and a good emotion analysis function, improving the accuracy of voice data structured conversion based on an open-source API.
Description
Technical field
The present invention relates to the technical field of data processing, and in particular to a voice data structured conversion method and system based on an open-source API.
Background art
In recent years, with the popularity of the big data concept, the quantity of unstructured data consisting mainly of electronic documents, mail, forms, audio, video and graphic images has grown rapidly; the practicality and growth rate of voice data within unstructured data are especially prominent. With this rapid growth, the storage and management of unstructured data has become an increasingly serious problem that severely affects the efficiency of data processing. Structured data in relational databases grows far more slowly and does not suffer from comparable storage and management problems. Therefore, in practical applications, and especially in the field of voice data processing, converting unstructured data into structured data before storing and managing it has become an effective way to solve the above problems.
In the prior art, there are two main methods for converting voice data into structured data: manual coding analysis and software coding analysis. Manual coding analysis can perform semantic analysis based on accurate, context-aware word segmentation, places relatively low requirements on the data format, and can also analyze colloquial text; however, it is slow and ill-suited to analyzing large volumes of data. Software coding analysis has relatively low semantic accuracy, involves comparatively complicated steps, lacks an adequate emotion analysis function, and its processing efficiency needs improvement.
In summary, existing voice data structured conversion methods are relatively slow and of relatively low accuracy.
Summary of the invention
Based on this, in view of the technical problems that existing voice data structured conversion methods are relatively slow and of relatively low accuracy, it is necessary to provide a voice data structured conversion method and system based on an open-source API.
A voice data structured conversion method based on an open-source API comprises the following steps:
extracting voice data from a data source;
performing segmentation, fragmentation and non-voice filtering on the voice data to obtain feature text data of the voice data;
performing voice recognition on the feature text data using an open-source voice recognition API to obtain the frequency, loudness, emotion information and voice recognition sequence of the voice, wherein the emotion information is information representing an emotion category;
fusing the frequency, loudness, emotion information and voice recognition sequence of the voice and applying structured text tagging to generate structured voice data.
In the above voice data structured conversion method based on an open-source API, the voice data extracted from the data source is segmented, fragmented and filtered of non-voice content to obtain the feature text data of the voice data; the feature text data is processed with an open-source voice recognition API to obtain the frequency, loudness, emotion information and voice recognition sequence of the voice; and these are fused and tagged as structured text to generate structured voice data. This technical scheme achieves structured conversion of the voice data in the data source and facilitates its storage and management; moreover, the method offers high working efficiency and a good emotion analysis function, which further improves its accuracy.
A voice data structured conversion system based on an open-source API comprises:
an extraction module for extracting voice data from a data source;
a preprocessing module for performing segmentation, fragmentation and non-voice filtering on the voice data to obtain feature text data of the voice data;
a voice recognition module for performing voice recognition on the feature text data using an open-source voice recognition API to obtain the frequency, loudness, emotion information and voice recognition sequence of the voice;
a conversion module for fusing the frequency, loudness, emotion information and voice recognition sequence of the voice and applying structured text tagging to generate structured voice data.
In the above voice data structured conversion system based on an open-source API, the preprocessing module segments, fragments and filters the voice data extracted from the data source to obtain its feature text data; the voice recognition module processes the feature text data with an open-source voice recognition API to obtain the frequency, loudness, emotion information and voice recognition sequence of the voice; and the conversion module fuses these and applies structured text tagging to generate structured voice data. This technical scheme achieves structured conversion of the voice data in the data source and facilitates its storage and management; moreover, the system offers high working efficiency and a good emotion analysis function, which further improves its accuracy.
Brief description of the drawings
Fig. 1 is a flow chart of a voice data structured conversion method based on an open-source API according to one embodiment of the present invention;
Fig. 2 is a flow chart of the preprocessing method of a voice data structured conversion method based on an open-source API according to another embodiment of the present invention;
Fig. 3 is a flow chart of the voice recognition method of a voice data structured conversion method based on an open-source API according to another embodiment of the present invention;
Fig. 4 is a flow chart of the conversion method of a voice data structured conversion method based on an open-source API according to another embodiment of the present invention;
Fig. 5 is a structural diagram of a voice data structured conversion system based on an open-source API according to one embodiment of the present invention;
Fig. 6 is a structural diagram of the preprocessing module of a voice data structured conversion system based on an open-source API according to another embodiment of the present invention;
Fig. 7 is a structural diagram of the voice recognition module of a voice data structured conversion system based on an open-source API according to another embodiment of the present invention;
Fig. 8 is a structural diagram of the structurization module of a voice data structured conversion system based on an open-source API according to another embodiment of the present invention;
Fig. 9 is a structural diagram of the conversion module of a voice data structured conversion system based on an open-source API according to another embodiment of the present invention.
Detailed description of the invention
To further explain the technical means adopted by the present invention and the effects obtained, the technical scheme of the present invention is described clearly and completely below with reference to the accompanying drawings and preferred embodiments.
As shown in Fig. 1, which is a flow chart of a voice data structured conversion method based on an open-source API according to one embodiment of the present invention, the method comprises the following steps:
Step S101: extract the voice data from the data source.
In this step, extracting the voice data from the data source prevents non-voice data from interfering with the voice data during conversion and improves the accuracy of the voice data structured conversion method based on an open-source API of the present invention. In practical applications, step S101 extracts audio information from a data source that may contain multiple data types; for example, it extracts the audio track portion of a video.
The audio track can be extracted from the data source using APIs provided by the Windows system. Taking the methods provided by winmm.h as an example, the reference objects must first be included with the following statements:
#include <Windows.h>
#include "mmsystem.h"
#pragma comment(lib, "winmm.lib")
Then, functions are called according to the functional requirements, for example:
waveInOpen: open the specified audio input device for recording;
waveInPrepareHeader: prepare a buffer for the audio input device;
waveInStart: start recording;
waveInClose: close the audio input device.
Since the most widely used languages in China are Mandarin and Cantonese, the language information contained in the data source preferably comprises Mandarin or Cantonese. In practical applications, it may also comprise the languages of other ethnic groups, such as Mongolian, Hui and Zhuang.
Step S102: perform segmentation, fragmentation and non-voice filtering on the voice data to obtain the feature text data of the voice data.
In this step, segmenting, fragmenting and filtering the voice data further reduces the impact of non-voice data on the voice data during conversion and improves the accuracy of the voice data structured conversion method based on an open-source API of the present invention.
Step S103: perform voice recognition on the feature text data using an open-source voice recognition API (Application Programming Interface) to obtain the frequency, loudness, emotion information and voice recognition sequence of the voice.
The emotion information of the voice in this step includes, but is not limited to, information such as fundamental frequency, duration, energy and spectrum. Using an open-source voice recognition API to perform voice recognition on the feature text data and obtain the frequency, loudness, emotion information and voice recognition sequence simplifies the voice recognition steps and improves the working efficiency of the voice data conversion.
The open-source voice recognition API in this step can be the Baidu speech recognition interface, Google speech recognition interface, Microsoft speech recognition interface or iFLYTEK speech recognition interface, among others. Taking the Baidu speech recognition interface as an example, in a Python environment the statement:
import baidu_oauth
includes the Baidu speech recognition library file. The statements:
asr_server = 'http://vop.baidu.com/server_api'
baidu_oauth_url = 'https://openapi.baidu.com/oauth/2.0/token'
connect to Baidu speech for authorization and authentication, and the statement:
data_dict = {'format': 'wav', 'rate': 8000, 'channel': 1, 'cuid': mac_address,
             'token': access_token, 'lan': 'zh', 'speech': speech_base64,
             'len': speech_length}
specifies and establishes the data structure, which can contain information such as the file type, bit rate, frequency range and MAC address. The above only takes the Baidu speech recognition interface under a Python environment as an example; other speech recognition interfaces or other programming languages involve similar steps, which are not repeated here.
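For the reader's reference, a minimal sketch of issuing such a call is given below, using only the endpoints and parameter names from the statements above; the credential values (API_KEY, SECRET_KEY), the audio file name and the cuid value are placeholder assumptions, not part of the original disclosure:

import base64
import json
import urllib.request

# Hypothetical credentials issued by the platform; placeholders only.
API_KEY = 'your_api_key'
SECRET_KEY = 'your_secret_key'

# Step 1: obtain an access token from the OAuth endpoint named above.
token_url = ('https://openapi.baidu.com/oauth/2.0/token'
             '?grant_type=client_credentials'
             f'&client_id={API_KEY}&client_secret={SECRET_KEY}')
access_token = json.load(urllib.request.urlopen(token_url))['access_token']

# Step 2: read and base64-encode one speech fragment (8 kHz mono WAV,
# matching the 'format', 'rate' and 'channel' fields of data_dict).
with open('sentence.wav', 'rb') as f:
    speech_bytes = f.read()

data_dict = {'format': 'wav', 'rate': 8000, 'channel': 1, 'cuid': 'mac-address',
             'token': access_token, 'lan': 'zh',
             'speech': base64.b64encode(speech_bytes).decode('ascii'),
             'len': len(speech_bytes)}

# Step 3: POST the JSON payload to the recognition endpoint and print the result.
req = urllib.request.Request('http://vop.baidu.com/server_api',
                             data=json.dumps(data_dict).encode('utf-8'),
                             headers={'Content-Type': 'application/json'})
print(json.load(urllib.request.urlopen(req)))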
Step S104: fuse the frequency, loudness, emotion information and voice recognition sequence of the voice and apply structured text tagging to generate structured voice data.
The structured voice data in this step can be a file in XML format, i.e. it is returned to the user in the Extensible Markup Language (XML) according to the user's needs.
In the above voice data structured conversion method based on an open-source API, the voice data extracted from the data source is segmented, fragmented and filtered of non-voice content to obtain the feature text data of the voice data; the feature text data is processed with an open-source voice recognition API to obtain the frequency, loudness, emotion information and voice recognition sequence of the voice; and these are fused and tagged as structured text to generate structured voice data. This technical scheme achieves structured conversion of the voice data in the data source and facilitates its storage and management; moreover, the method offers high working efficiency and a good emotion analysis function, which further improves its accuracy.
As shown in Fig. 2, which is a flow chart of the preprocessing method of a voice data structured conversion method based on an open-source API according to another embodiment of the present invention, in this embodiment step S102 of performing segmentation, fragmentation and non-voice filtering on the voice data to obtain its feature text data can further comprise:
Step S1021: perform segmentation and fragmentation on the voice data to generate at least one sentence file, wherein the sentence files include voice sentence files and non-voice sentence files.
In this step, segmentation and fragmentation can be performed by locating the silent points of the voice data. For example, with 50 frames of 200 samples each as the silence threshold, a point exceeding this threshold is considered to be a silent position; after a silent position is found, the voice data is cut at that position, then information such as duration and timestamp is added to the resulting fragmented sentence files, which are saved in PCM format, generating at least one sentence file. A minimal sketch of such silence-based cutting is given below.
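The following Python sketch (Python matching the other examples in this description) illustrates this silence-point cutting; the input is assumed to be 16-bit mono PCM already loaded as a NumPy array, and the energy threshold and helper names are illustrative assumptions rather than values fixed by this embodiment:

import numpy as np

FRAME_LEN = 200         # samples per frame, as in the example above
WINDOW = 50             # number of consecutive silent frames marking a cut
ENERGY_THRESHOLD = 1e5  # assumed frame-energy level counted as silence

def cut_at_silence(samples: np.ndarray) -> list:
    """Split a PCM sample array into sentence fragments at silent runs."""
    n_frames = len(samples) // FRAME_LEN
    frames = samples[:n_frames * FRAME_LEN].reshape(n_frames, FRAME_LEN)
    energy = (frames.astype(np.float64) ** 2).sum(axis=1)
    silent = energy < ENERGY_THRESHOLD

    fragments, start = [], 0
    for i in range(n_frames - WINDOW):
        # A full window of silent frames marks a cutting position.
        if silent[i:i + WINDOW].all() and i * FRAME_LEN > start:
            fragments.append(samples[start:i * FRAME_LEN])
            start = (i + WINDOW) * FRAME_LEN
    fragments.append(samples[start:])
    return fragments

Each returned fragment would then be stamped with its duration and timestamp and saved in PCM format as a sentence file.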
Step S1022: perform non-voice information filtering on the sentence files to obtain the voice sentence files corresponding to the voice data.
Specifically, since step S1021 does not process the content of the sentence files obtained by segmentation and fragmentation, the sentence files may contain interfering information such as non-voice noise and swallowing sounds of the speaker. These extraneous sounds can cause misjudgments in voice recognition, reduce recognition accuracy, and increase the effective workload and burden of the system.
For effective non-voice filtering, a classification model can be trained to distinguish voice from non-voice; after non-voice information is identified it is deleted, leaving the effective voice information, i.e. the voice sentence files corresponding to the voice data. A sketch of such a classifier appears below.
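The original text does not specify the classification model; as one plausible realization, the sketch below trains a support-vector classifier on averaged MFCC features using scikit-learn and librosa, with all file names and labels purely illustrative:

import librosa
import numpy as np
from sklearn.svm import SVC

def mfcc_features(path: str) -> np.ndarray:
    """Average MFCC vector of one sentence file, used as its feature."""
    y, sr = librosa.load(path, sr=8000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)

# Labeled examples: 1 = voice, 0 = non-voice (noise, swallowing sounds, etc.).
train_files = [('speech1.wav', 1), ('speech2.wav', 1), ('noise1.wav', 0)]
X = np.array([mfcc_features(p) for p, _ in train_files])
y = np.array([label for _, label in train_files])
model = SVC(kernel='rbf').fit(X, y)

def is_voice(path: str) -> bool:
    """Keep a sentence file only if the classifier labels it as voice."""
    return bool(model.predict(mfcc_features(path).reshape(1, -1))[0])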
Step S1023: perform voice feature extraction on the voice sentence files to obtain the feature text data of the voice data.
In this step, as much audio data as possible is converted into text data to reduce the volume of transmitted data, thereby saving bandwidth resources.
In practical applications, this step can use the mel-frequency cepstral coefficient (MFCC) extraction method for voice feature extraction. Its principle is that the human ear has different auditory sensitivities to sound waves of different frequencies: voice signals from 200 Hz to 5000 Hz have the greatest influence on the intelligibility of speech. When two frequency components of different loudness are present, the louder one affects the perception of the quieter one, making it hard to perceive; that is, low tones easily mask high tones, while high tones mask low tones with more difficulty. Therefore, a bank of band-pass filters can be arranged in this frequency band from low to high, dense at the low end and sparse at the high end according to the critical bandwidth, and the input voice signal is filtered through them. The output signal energy of each band-pass filter serves as a basic feature of the signal and, after further processing, as a voice feature. This is also a major reason why frequency and loudness extraction is needed in step S103.
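As a minimal sketch of this step, MFCCs can be computed with librosa, whose mel filter bank follows the dense-to-sparse arrangement described above; the sample rate and coefficient count here are illustrative assumptions (note that for 8 kHz audio the upper band edge is capped at the 4 kHz Nyquist frequency rather than the 5000 Hz cited above):

import librosa

# Load one voice sentence file (8 kHz assumed, matching the PCM fragments above).
y, sr = librosa.load('sentence.wav', sr=8000)

# Mel-spaced band-pass filter bank energies, then the cepstral coefficients;
# fmin/fmax bound the band the text identifies as most relevant to clarity.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, fmin=200, fmax=4000)

print(mfcc.shape)  # (13, number_of_frames): one coefficient vector per frame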
As shown in Fig. 3, which is a flow chart of the voice recognition method of a voice data structured conversion method based on an open-source API according to another embodiment of the present invention, the step of performing voice recognition on the feature text data to obtain the frequency, loudness, emotion information and voice recognition sequence of the voice comprises:
Step S1031: use the open-source voice recognition API to process the feature text data and extract the emotion context information it contains, wherein the emotion context information includes prior emotion context information and spatio-temporal context information.
For the prior emotion information in this step, fuzzy inference can be used to infer and analyze the emotional state from the background noise; the spatio-temporal context information can analyze the speech intonation information in the preceding or following sentence fragment of each sentence fragment to infer the emotion information of the speaker in the current sentence fragment.
Step S1032: extract in real time the emotion features contained in the feature text data, wherein the emotion features include morpheme interval information and morpheme duration information.
In this embodiment, the morpheme interval is the time interval between every two Chinese characters spoken by the speaker, and the morpheme duration is the time taken by the speaker to say a single character; both kinds of information reflect the emotion features of the speaker to a certain extent.
Step S1033: perform structured sparse representation on the emotion context information and the emotion features respectively to obtain the emotion information corresponding to the feature text data.
In one embodiment of the voice data structured conversion method based on an open-source API of the present invention, step S1033 of performing structured sparse representation on the emotion features may comprise the following steps:
embedding a nonlinear classification discrimination model into the dictionary optimization solution of the structured sparse representation;
optimizing the dictionary optimization solution of the sparse representation using a supervised learning method;
performing structured sparse representation on the emotion features according to the optimized dictionary optimization solution of the sparse representation.
In one embodiment, let X = {x1, x2, ..., xn} be a set of n emotion feature vectors and Y = {y1, y2, ..., yn} a set of n emotion context vectors, and establish a kernel-based nonlinear discriminative sparse representation criterion (the formula appears only as an image in the original publication and is not reproduced here).
In this criterion, D is the sparse representation dictionary; αi = {α1, α2, ..., αm} is the set of m emotion feature sparse representations; G is the number of feature groups and nj the number of emotion features in the j-th group; θ is the kernel discrimination parameter. f(α, θ) maps α to a high-dimensional space and is a nonlinear classification function of the sparse code α built with a kernel function K; the kernel can be Gaussian, with its parameter obtained by training. C(f(α, θ), yi) is the loss function, designed for global balance according to the Fisher criterion that the within-class scatter of same-class α be as small as possible and the between-class scatter of different-class α as large as possible. λ0, λ1, λ2, λ3 are penalty factors.
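Since the publication renders the criterion only as an image, the following is a hedged sketch of the general form such a kernel-based discriminative group-sparse criterion usually takes, assembled purely from the terms named above (reconstruction against D, the loss C, an l1 sparsity term, a group term over the G groups, and a penalty on θ); it is an assumption for the reader's orientation, not the patent's exact formula:

\min_{D,\theta,\{\alpha_i\}} \sum_{i=1}^{n} \Big[ \lVert x_i - D\alpha_i \rVert_2^2
    + \lambda_0\, C\big(f(\alpha_i,\theta),\, y_i\big)
    + \lambda_1 \lVert \alpha_i \rVert_1
    + \lambda_2 \sum_{j=1}^{G} \sqrt{n_j}\, \lVert \alpha_{i,(j)} \rVert_2 \Big]
    + \lambda_3 \lVert \theta \rVert_2^2

where \alpha_{i,(j)} denotes the coefficients of \alpha_i belonging to the j-th feature group.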
The iterative optimization process can be as follows: using the current D and θ, solve the sparse codes α of the labeled emotion features under the established kernel-based nonlinear discriminative sparse representation criterion; then set up the partial differential equations of the sparse representation constraint equation with respect to D and θ, and use the gradient descent method to solve for and update the dictionary D and the discriminative classification parameter θ; iteration continues until convergence, yielding the optimized dictionary.
From the optimized dictionary, the emotion feature vector X and the emotion context vector Y can be derived, and an estimation algorithm is designed that uses these two kinds of vectors for the estimation.
In step S103, after the frequency and loudness of the voice are obtained, a Gaussian mixture model can be trained to further characterize the features of the speaker; with the trained model, characteristics such as the intonation, frequency and loudness of a segment of voice can also be used to judge whether different voices come from the same person. A sketch of this idea follows.
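As one plausible realization of this check (the text names the model but no library), the sketch below scores utterances against a per-speaker Gaussian mixture model with scikit-learn; the component count and decision threshold are illustrative assumptions that would be calibrated on held-out data in practice:

import numpy as np
from sklearn.mixture import GaussianMixture

def train_speaker_model(features: np.ndarray) -> GaussianMixture:
    """Fit a GMM to one speaker's frame-level features (e.g. MFCC, loudness)."""
    return GaussianMixture(n_components=8, covariance_type='diag').fit(features)

def same_speaker(model: GaussianMixture, features: np.ndarray,
                 threshold: float = -50.0) -> bool:
    """Judge whether a new voice segment matches the modeled speaker by
    comparing its average per-frame log-likelihood with a threshold."""
    return model.score(features) > threshold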
As shown in Fig. 4, which is a flow chart of the conversion method of a voice data structured conversion method based on an open-source API according to another embodiment of the present invention, step S104 of fusing the frequency, loudness, emotion information and voice recognition sequence of the voice and applying structured text tagging to generate structured voice data comprises:
Step S1041: collect and align the frequency, loudness, emotion information and voice recognition sequence of the voice, and sort them according to the start and end times contained in the voice recognition sequence;
Step S1042: tag the sorted voice recognition sequence according to the structured format to generate the structured voice data, wherein the tags include gender tags, timbre tags, punctuation tags and timestamp tags.
In the above embodiment, the structured voice data is finally generated and sent to the client in the form of an XML message, which can include contents such as the source file name, gender tag, voice duration tag, and voice/non-voice tag. In particular, to ensure the accuracy of voice recognition, the open-source voice recognition library used in the present invention is periodically collected and updated from network text, so it can include popular vocabulary, adapt better to voice recognition in various situations, and help improve recognition accuracy. A sketch of assembling such a message appears below.
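The following is a minimal sketch of steps S1041 and S1042, sorting recognized fragments by start time and emitting an XML message with tags of the kinds listed above; the element and attribute names are illustrative assumptions, since the text does not fix a schema:

import xml.etree.ElementTree as ET

# Recognized fragments as produced by the earlier steps; values are examples.
fragments = [
    {'start': 3.2, 'end': 5.0, 'text': '...', 'gender': 'female', 'emotion': 'calm'},
    {'start': 0.0, 'end': 3.1, 'text': '...', 'gender': 'female', 'emotion': 'happy'},
]

# Step S1041: sort by the start and end times carried in the recognition sequence.
fragments.sort(key=lambda f: (f['start'], f['end']))

# Step S1042: tag each fragment according to the structured format.
root = ET.Element('speech', attrib={'source_file': 'input.wav'})
for frag in fragments:
    sent = ET.SubElement(root, 'sentence', attrib={
        'timestamp': f"{frag['start']:.1f}-{frag['end']:.1f}",  # timestamp tag
        'gender': frag['gender'],                                # gender tag
        'emotion': frag['emotion'],
    })
    sent.text = frag['text']

print(ET.tostring(root, encoding='unicode'))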
As shown in Fig. 5, which is a structural diagram of a voice data structured conversion system based on an open-source API according to one embodiment of the present invention, the system comprises:
an extraction module 101 for extracting the voice data from a data source.
In the extraction module 101, extracting the voice data from the data source prevents non-voice data from interfering with the voice data during conversion and improves the accuracy of the voice data structured conversion system based on an open-source API of the present invention. In practical applications, the extraction module 101 can extract audio information from a data source that may contain multiple data types, for example extracting the audio track portion of a video.
The audio track can be extracted from the data source using APIs provided by the Windows system. Taking the methods provided by winmm.h as an example, the reference objects must first be included with the following statements:
#include <Windows.h>
#include "mmsystem.h"
#pragma comment(lib, "winmm.lib")
Then, functions are called according to the functional requirements, for example:
waveInOpen: open the specified audio input device for recording;
waveInPrepareHeader: prepare a buffer for the audio input device;
waveInStart: start recording;
waveInClose: close the audio input device.
Since the most widely used languages in China are Mandarin and Cantonese, the language information contained in the data source preferably comprises Mandarin or Cantonese. In practical applications, it may also comprise the languages of other ethnic groups, such as Mongolian, Hui and Uyghur.
A preprocessing module 102 for performing segmentation, fragmentation and non-voice filtering on the voice data to obtain the feature text data of the voice data.
In the preprocessing module 102, segmenting, fragmenting and filtering the voice data further reduces the impact of non-voice data on the voice data during conversion and improves the accuracy of the voice data structured conversion system based on an open-source API of the present invention.
A voice recognition module 103 for performing voice recognition on the feature text data using an open-source voice recognition API to obtain the frequency, loudness, emotion information and voice recognition sequence of the voice.
The emotion information of the voice in the voice recognition module 103 includes, but is not limited to, information such as fundamental frequency, duration, energy and spectrum. Using an open-source voice recognition API to perform voice recognition on the feature text data and obtain the frequency, loudness, emotion information and voice recognition sequence simplifies the voice recognition steps and improves the working efficiency of the voice data conversion.
The open-source voice recognition API in the voice recognition module 103 can be the Baidu speech recognition interface, Google speech recognition interface, Microsoft speech recognition interface or iFLYTEK speech recognition interface, among others. Taking the Baidu speech recognition interface as an example, in a Python environment the statement:
import baidu_oauth
includes the Baidu speech recognition library file. The statements:
asr_server = 'http://vop.baidu.com/server_api'
baidu_oauth_url = 'https://openapi.baidu.com/oauth/2.0/token'
connect to Baidu speech for authorization and authentication, and the statement:
data_dict = {'format': 'wav', 'rate': 8000, 'channel': 1, 'cuid': mac_address,
             'token': access_token, 'lan': 'zh', 'speech': speech_base64,
             'len': speech_length}
specifies and establishes the data structure, which can contain information such as the file type, bit rate, frequency range and MAC address. The above only takes the Baidu speech recognition interface under a Python environment as an example; other speech recognition interfaces or other programming languages involve similar steps, which are not repeated here.
A conversion module 104 for fusing the frequency, loudness, emotion information and voice recognition sequence of the voice and applying structured text tagging to generate structured voice data.
The structured voice data in the conversion module 104 can be a file in XML format, i.e. it is returned to the user in the Extensible Markup Language (XML) according to the user's needs.
In the above voice data structured conversion system based on an open-source API, the preprocessing module segments, fragments and filters the voice data extracted from the data source to obtain its feature text data; the voice recognition module processes the feature text data with the open-source voice recognition interface to obtain the frequency, loudness, emotion information and voice recognition sequence of the voice; and the conversion module fuses these and applies structured text tagging to generate structured voice data. This technical scheme achieves structured conversion of the voice data in the data source and facilitates its storage and management; moreover, the system offers high working efficiency and a good emotion analysis function, which further improves its accuracy.
As shown in Fig. 6, which is a structural diagram of the preprocessing module of a voice data structured conversion system based on an open-source API according to another embodiment of the present invention, the preprocessing module 102 comprises:
a segmentation module 1021 for performing segmentation and fragmentation on the voice data to generate at least one sentence file, wherein the sentence files include voice sentence files and non-voice sentence files.
In the segmentation module 1021, segmentation and fragmentation can be performed by locating the silent points of the voice data. For example, with 50 frames of 200 samples each as the silence threshold, a point exceeding this threshold is considered to be a silent position; after a silent position is found, the voice data is cut at that position, then information such as duration and timestamp is added to the resulting fragmented sentence files, which are saved in PCM format, generating at least one sentence file.
A non-voice filtering module 1022 for performing non-voice filtering on the sentence files to obtain the voice sentence files corresponding to the voice data.
Specifically, since the segmentation module 1021 does not process the content of the sentence files obtained by segmentation and fragmentation, the sentence files may contain interfering information such as non-voice noise and swallowing sounds of the speaker. These extraneous sounds can cause misjudgments in voice recognition, reduce recognition accuracy, and increase the effective workload and burden of the system.
For effective non-voice filtering, a classification model can be trained to distinguish voice from non-voice; after non-voice information is identified it is deleted, leaving the effective voice information, i.e. the voice sentence files corresponding to the voice data.
A feature extraction module 1023 for performing voice feature extraction on the voice sentence files to obtain the feature text data of the voice data.
In the feature extraction module 1023, as much audio data as possible is converted into text data to reduce the volume of transmitted data, thereby saving bandwidth resources.
In practical applications, the mel-frequency cepstral coefficient extraction method can be used for voice feature extraction. Its principle is that the human ear has different auditory sensitivities to sound waves of different frequencies: voice signals from 200 Hz to 5000 Hz have the greatest influence on the intelligibility of speech. When two frequency components of different loudness are present, the louder one affects the perception of the quieter one, making it hard to perceive; that is, low tones easily mask high tones, while high tones mask low tones with more difficulty. Therefore, a bank of band-pass filters can be arranged in this frequency band from low to high, dense at the low end and sparse at the high end according to the critical bandwidth, and the input voice signal is filtered through them. The output signal energy of each band-pass filter serves as a basic feature of the signal and, after further processing, as a voice feature. This is also a major reason why the voice recognition module 103 needs to perform frequency and loudness extraction.
As shown in Fig. 7, which is a structural diagram of the voice recognition module of a voice data structured conversion system based on an open-source API according to another embodiment of the present invention, the voice recognition module 103 comprises:
an emotion information extraction module 1031 for using the open-source voice recognition API to process the feature text data and extract the emotion context information it contains, wherein the emotion context information includes prior emotion context information and spatio-temporal context information.
For the prior emotion information in the emotion information extraction module 1031, fuzzy inference can be used to infer and analyze the emotional state from the background noise; the spatio-temporal context information can analyze the speech intonation information in the preceding or following sentence fragment of each sentence fragment to infer the emotion information of the speaker in the current sentence fragment.
An emotion feature extraction module 1032 for extracting in real time the emotion features contained in the feature text data, wherein the emotion features include morpheme interval information and morpheme duration information.
In this embodiment, the morpheme interval is the time interval between every two Chinese characters spoken by the speaker, and the morpheme duration is the time taken by the speaker to say a single character; both kinds of information reflect the emotion features of the speaker to a certain extent.
A structurization module 1033 for performing structured sparse representation on the emotion context information and the emotion features respectively to obtain the emotion information corresponding to the feature text data.
As shown in Fig. 8, which is a structural diagram of the structurization module of a voice data structured conversion system based on an open-source API according to another embodiment of the present invention, the structurization module 1033 comprises:
an embedding module 10331 for embedding a nonlinear classification discrimination model into the dictionary optimization solution of the structured sparse representation;
an optimization module 10332 for optimizing the dictionary optimization solution of the sparse representation using a supervised learning method;
a sparse module 10333 for performing structured sparse representation on the emotion features according to the optimized dictionary optimization solution of the sparse representation.
In one embodiment, as in the method embodiment above, let X = {x1, x2, ..., xn} be a set of n emotion feature vectors and Y = {y1, y2, ..., yn} a set of n emotion context vectors, and establish the same kernel-based nonlinear discriminative sparse representation criterion (again given only as an image in the original publication, sketched earlier in this description), with D the sparse representation dictionary, αi = {α1, α2, ..., αm} the set of m emotion feature sparse representations, G the number of feature groups, nj the number of emotion features in the j-th group, θ the kernel discrimination parameter, f(α, θ) the kernel-based nonlinear classification function mapping α to a high-dimensional space (the kernel can be Gaussian, its parameter obtained by training), C(f(α, θ), yi) the loss function designed according to the Fisher criterion that within-class scatter of same-class α be as small as possible and between-class scatter of different-class α as large as possible, and λ0, λ1, λ2, λ3 the penalty factors.
The iterative optimization process can be as in the method embodiment: using the current D and θ, solve the sparse codes α of the labeled emotion features under the established criterion; then set up the partial differential equations of the sparse representation constraint equation with respect to D and θ, and use the gradient descent method to solve for and update the dictionary D and the discriminative classification parameter θ; iteration continues until convergence, yielding the optimized dictionary. From the optimized dictionary, the emotion feature vector X and the emotion context vector Y can be derived, and an estimation algorithm is designed that uses these two kinds of vectors for the estimation.
In the voice recognition module 103, after the frequency and loudness of the voice are obtained, a Gaussian mixture model can be trained to further characterize the features of the speaker; with the trained model, characteristics such as the intonation, frequency and loudness of a segment of voice can also be used to judge whether different voices come from the same person.
As shown in Fig. 9, which is a structural diagram of the conversion module of a voice data structured conversion system based on an open-source API according to another embodiment of the present invention, the conversion module 104 comprises:
a sorting module 1041 for collecting and aligning the frequency, loudness, emotion information and voice recognition sequence of the voice, and sorting them according to the start and end times contained in the voice recognition sequence;
a tagging module 1042 for tagging the sorted voice recognition sequence according to the structured format to generate the structured voice data, wherein the tags include gender tags, timbre tags, punctuation tags and timestamp tags.
In the above embodiment, the structured voice data is finally generated and sent to the client in the form of an XML message, which can include contents such as the source file name, gender tag, voice duration tag, and voice/non-voice tag. In particular, to ensure the accuracy of voice recognition, the open-source voice recognition library used in the present invention is periodically collected and updated from network text, so it can include popular vocabulary, adapt better to voice recognition in various situations, and help improve recognition accuracy.
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in the combination of these technical features, the combination should be considered to be within the scope of this specification.
The above embodiments express only several embodiments of the present invention, and their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the patent. It should be pointed out that those of ordinary skill in the art can make several variations and improvements without departing from the inventive concept, and these all fall within the protection scope of the present invention. Therefore, the protection scope of the patent of the present invention shall be subject to the appended claims.
Claims (10)
1. A voice data structured conversion method based on an open-source API, characterized by comprising the following steps:
extracting voice data from a data source;
performing segmentation, fragmentation and non-voice filtering on the voice data to obtain feature text data of the voice data;
performing voice recognition on the feature text data using an open-source voice recognition API to obtain the frequency, loudness, emotion information and voice recognition sequence of the voice;
fusing the frequency, loudness, emotion information and voice recognition sequence of the voice and applying structured text tagging to generate structured voice data.
2. The voice data structured conversion method based on an open-source API according to claim 1, characterized in that the step of performing segmentation, fragmentation and non-voice filtering on the voice data to obtain the feature text data of the voice data comprises:
performing segmentation and fragmentation on the voice data to generate at least one sentence file, wherein the sentence files include voice sentence files and non-voice sentence files;
performing non-voice information filtering on the sentence files to obtain the voice sentence files corresponding to the voice data;
performing voice feature extraction on the voice sentence files to obtain the feature text data of the voice data.
3. The voice data structured conversion method based on an open-source API according to claim 1, characterized in that the step of performing voice recognition on the feature text data using the open-source voice recognition API comprises:
using the open-source voice recognition API to process the feature text data and extract the emotion context information contained in the feature text data, wherein the emotion context information includes prior emotion context information and spatio-temporal context information;
extracting in real time the emotion features contained in the feature text data, wherein the emotion features include morpheme interval information and morpheme duration information;
performing structured sparse representation on the emotion context information and the emotion features respectively to obtain the emotion information corresponding to the feature text data.
4. The voice data structured conversion method based on an open-source API according to claim 3, characterized in that the step of performing structured sparse representation on the emotion features comprises:
embedding a nonlinear classification discrimination model into the dictionary optimization solution of the structured sparse representation;
optimizing the dictionary optimization solution of the sparse representation using a supervised learning method;
performing structured sparse representation on the emotion features according to the optimized dictionary optimization solution of the sparse representation.
5. The voice data structured conversion method based on an open-source API according to claim 1, characterized in that the step of fusing the frequency, loudness, emotion information and voice recognition sequence of the voice and applying structured text tagging to generate the structured voice data comprises:
collecting and aligning the frequency, loudness, emotion information and voice recognition sequence of the voice, and sorting them according to the start and end times contained in the voice recognition sequence;
tagging the sorted voice recognition sequence according to the structured format to generate the structured voice data, wherein the tags include gender tags, timbre tags, punctuation tags and timestamp tags.
6. A voice data structured conversion system based on an open-source API, characterized by comprising:
an extraction module for extracting voice data from a data source;
a preprocessing module for performing segmentation, fragmentation and non-voice filtering on the voice data to obtain feature text data of the voice data;
a voice recognition module for performing voice recognition on the feature text data using an open-source voice recognition API to obtain the frequency, loudness, emotion information and voice recognition sequence of the voice;
a conversion module for fusing the frequency, loudness, emotion information and voice recognition sequence of the voice and applying structured text tagging to generate structured voice data.
7. The voice data structured conversion system based on an open-source API according to claim 6, characterized in that the preprocessing module comprises:
a segmentation module for performing segmentation and fragmentation on the voice data to generate at least one sentence file, wherein the sentence files include voice sentence files and non-voice sentence files;
a non-voice filtering module for performing non-voice filtering on the sentence files to obtain the voice sentence files corresponding to the voice data;
a feature extraction module for performing voice feature extraction on the voice sentence files to obtain the feature text data of the voice data.
8. The voice data structured conversion system based on an open-source API according to claim 6, characterized in that the voice recognition module comprises:
an emotion information extraction module for using the open-source voice recognition API to process the feature text data and extract the emotion context information contained in the feature text data, wherein the emotion context information includes prior emotion context information and spatio-temporal context information;
an emotion feature extraction module for extracting in real time the emotion features contained in the feature text data, wherein the emotion features include morpheme interval information and morpheme duration information;
a structurization module for performing structured sparse representation on the emotion context information and the emotion features respectively to obtain the emotion information corresponding to the feature text data.
9. The voice data structured conversion system based on an open-source API according to claim 7, characterized in that the structurization module comprises:
an embedding module for embedding a nonlinear classification discrimination model into the dictionary optimization solution of the structured sparse representation;
an optimization module for optimizing the dictionary optimization solution of the sparse representation using a supervised learning method;
a sparse module for performing structured sparse representation on the emotion features according to the optimized dictionary optimization solution of the sparse representation.
10. The voice data structured conversion system based on an open-source API according to claim 6, characterized in that the conversion module comprises:
a sorting module for collecting and aligning the frequency, loudness, emotion information and voice recognition sequence of the voice, and sorting them according to the start and end times contained in the voice recognition sequence;
a tagging module for tagging the sorted voice recognition sequence according to the structured format to generate the structured voice data, wherein the tags include gender tags, timbre tags, punctuation tags and timestamp tags.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610286831.9A CN105957517A (en) | 2016-04-29 | 2016-04-29 | Voice data structured conversion method and system based on open source API |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610286831.9A CN105957517A (en) | 2016-04-29 | 2016-04-29 | Voice data structured conversion method and system based on open source API |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105957517A true CN105957517A (en) | 2016-09-21 |
Family
ID=56913436
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610286831.9A Pending CN105957517A (en) | 2016-04-29 | 2016-04-29 | Voice data structured conversion method and system based on open source API |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105957517A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106791913A (en) * | 2016-12-30 | 2017-05-31 | 深圳市九洲电器有限公司 | Digital television program simultaneous interpretation output intent and system |
CN108319888A (en) * | 2017-01-17 | 2018-07-24 | 阿里巴巴集团控股有限公司 | The recognition methods of video type and device, terminal |
WO2018171257A1 (en) * | 2017-03-21 | 2018-09-27 | Beijing Didi Infinity Technology And Development Co., Ltd. | Systems and methods for speech information processing |
CN108899031A (en) * | 2018-07-17 | 2018-11-27 | 广西师范学院 | Strong language audio recognition method based on cloud computing |
WO2021259073A1 (en) * | 2020-06-26 | 2021-12-30 | International Business Machines Corporation | System for voice-to-text tagging for rich transcription of human speech |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101685634A (en) * | 2008-09-27 | 2010-03-31 | 上海盛淘智能科技有限公司 | Children speech emotion recognition method |
CN101847406A (en) * | 2010-05-18 | 2010-09-29 | 中国农业大学 | Speech recognition query method and system |
CN103123619A (en) * | 2012-12-04 | 2013-05-29 | 江苏大学 | Visual speech multi-mode collaborative analysis method based on emotion context and system |
US20130297297A1 (en) * | 2012-05-07 | 2013-11-07 | Erhan Guven | System and method for classification of emotion in human speech |
CN103700370A (en) * | 2013-12-04 | 2014-04-02 | 北京中科模识科技有限公司 | Broadcast television voice recognition method and system |
CN104050963A (en) * | 2014-06-23 | 2014-09-17 | 东南大学 | Continuous speech emotion prediction algorithm based on emotion data field |
-
2016
- 2016-04-29 CN CN201610286831.9A patent/CN105957517A/en active Pending
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101685634A (en) * | 2008-09-27 | 2010-03-31 | 上海盛淘智能科技有限公司 | Children speech emotion recognition method |
CN101847406A (en) * | 2010-05-18 | 2010-09-29 | 中国农业大学 | Speech recognition query method and system |
US20130297297A1 (en) * | 2012-05-07 | 2013-11-07 | Erhan Guven | System and method for classification of emotion in human speech |
CN103123619A (en) * | 2012-12-04 | 2013-05-29 | 江苏大学 | Visual speech multi-mode collaborative analysis method based on emotion context and system |
CN103700370A (en) * | 2013-12-04 | 2014-04-02 | 北京中科模识科技有限公司 | Broadcast television voice recognition method and system |
CN104050963A (en) * | 2014-06-23 | 2014-09-17 | 东南大学 | Continuous speech emotion prediction algorithm based on emotion data field |
Non-Patent Citations (3)
Title |
---|
STRONGLEG: "Using the Baidu API to implement speech recognition in python", Sina Blog *
China Association for Artificial Intelligence: "Progress in Artificial Intelligence in China", 31 December 2007, Beijing University of Posts and Telecommunications Press *
Zhao Li et al.: "Research on analysis and recognition of emotional features in speech signals", Acta Electronica Sinica *
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106791913A (en) * | 2016-12-30 | 2017-05-31 | 深圳市九洲电器有限公司 | Digital television program simultaneous interpretation output intent and system |
CN108319888A (en) * | 2017-01-17 | 2018-07-24 | 阿里巴巴集团控股有限公司 | The recognition methods of video type and device, terminal |
CN108319888B (en) * | 2017-01-17 | 2023-04-07 | 阿里巴巴集团控股有限公司 | Video type identification method and device and computer terminal |
WO2018171257A1 (en) * | 2017-03-21 | 2018-09-27 | Beijing Didi Infinity Technology And Development Co., Ltd. | Systems and methods for speech information processing |
CN109074803A (en) * | 2017-03-21 | 2018-12-21 | 北京嘀嘀无限科技发展有限公司 | Speech information processing system and method |
CN109074803B (en) * | 2017-03-21 | 2022-10-18 | 北京嘀嘀无限科技发展有限公司 | Voice information processing system and method |
CN108899031A (en) * | 2018-07-17 | 2018-11-27 | 广西师范学院 | Strong language audio recognition method based on cloud computing |
WO2021259073A1 (en) * | 2020-06-26 | 2021-12-30 | International Business Machines Corporation | System for voice-to-text tagging for rich transcription of human speech |
GB2611684A (en) * | 2020-06-26 | 2023-04-12 | Ibm | System for voice-to-text tagging for rich transcription of human speech |
US11817100B2 (en) | 2020-06-26 | 2023-11-14 | International Business Machines Corporation | System for voice-to-text tagging for rich transcription of human speech |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103700370B (en) | A kind of radio and television speech recognition system method and system | |
WO2021073116A1 (en) | Method and apparatus for generating legal document, device and storage medium | |
Gupta et al. | The AT&T spoken language understanding system | |
CN101930735B (en) | Speech emotion recognition equipment and speech emotion recognition method | |
CN105957517A (en) | Voice data structured conversion method and system based on open source API | |
CN112735383A (en) | Voice signal processing method, device, equipment and storage medium | |
CN111105785B (en) | Text prosody boundary recognition method and device | |
CN109256150A (en) | Speech emotion recognition system and method based on machine learning | |
CN101685634A (en) | Children speech emotion recognition method | |
CN108877769B (en) | Method and device for identifying dialect type | |
CN111177350A (en) | Method, device and system for forming dialect of intelligent voice robot | |
Chittaragi et al. | Automatic text-independent Kannada dialect identification system | |
KR20200119410A (en) | System and Method for Recognizing Emotions from Korean Dialogues based on Global and Local Contextual Information | |
CN110209812A (en) | File classification method and device | |
CN112151015A (en) | Keyword detection method and device, electronic equipment and storage medium | |
CN111081219A (en) | End-to-end voice intention recognition method | |
CN117349427A (en) | Artificial intelligence multi-mode content generation system for public opinion event coping | |
Koolagudi et al. | Dravidian language classification from speech signal using spectral and prosodic features | |
Ling | An acoustic model for English speech recognition based on deep learning | |
Harsha et al. | Lexical ambiguity in natural language processing applications | |
WO2024077906A1 (en) | Speech text generation method and apparatus, and training method and apparatus for speech text generation model | |
Yue | English spoken stress recognition based on natural language processing and endpoint detection algorithm | |
Zahariev et al. | An approach to speech ambiguities eliminating using semantically-acoustical analysis | |
CN112150103B (en) | Schedule setting method, schedule setting device and storage medium | |
CN114707515A (en) | Method and device for judging dialect, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20160921 |