CN108320734A - Audio signal processing method and device, storage medium, electronic equipment - Google Patents
- Publication number
- CN108320734A (Application number: CN201711479955.XA)
- Authority
- CN
- China
- Prior art keywords
- text
- voice data
- feature
- user
- sample
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
  - G10—MUSICAL INSTRUMENTS; ACOUSTICS
    - G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
      - G10L15/00—Speech recognition
        - G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
          - G10L2015/027—Syllables being the recognition units
        - G10L15/08—Speech classification or search
          - G10L15/10—Speech classification or search using distance or distortion measures between unknown speech and reference templates
          - G10L15/14—Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
            - G10L15/142—Hidden Markov Models [HMMs]
            - G10L15/144—Training of HMMs
        - G10L15/26—Speech to text systems
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Probability & Statistics with Applications (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Machine Translation (AREA)
Abstract
The present disclosure provides an audio signal processing method and apparatus, a storage medium, and an electronic device. The method includes: collecting voice data of a user to be tested, the voice data including first voice data in which the user to be tested reads a specified text aloud, and/or second voice data in which the user to be tested recites the specified text; extracting acoustic features and/or text features from the voice data, the acoustic features indicating the pronunciation characteristics of the user to be tested, and the text features indicating the ability of the user to be tested at the level of semantic expression; and taking the acoustic features and/or the text features as input and determining the language features of the user to be tested after processing by a pre-built speech classification model. The scheme can determine the language features of a user to be tested through speech processing technology, and the process is simple and convenient.
Description
Technical field
The present disclosure relates to the field of speech processing, and in particular to an audio signal processing method and apparatus, a storage medium, and an electronic device.
Background technology
Speech, as an analog signal carrying specific information, has become an important means of obtaining and spreading information in social life. In general, a speech signal contains extremely rich information, for example text content or semantics, voiceprint features, language or dialect, and mood. Speech signal processing is the task of extracting useful speech information from a complex speech environment.

In practical applications, speech signal processing can extract personalized information of a user for identity recognition, for example identifying different speakers in a segment of dialogue. Alternatively, speech signal processing can normalize away the differences between users, extract their common information, and classify speakers, for example by gender or by language.
Invention content
A general object of the present disclosure is to provide an audio signal processing method and apparatus, a storage medium, and an electronic device that can determine the language features of a user to be tested through speech processing technology.

To achieve the above object, the present disclosure provides an audio signal processing method, the method including:

collecting voice data of a user to be tested, the voice data including first voice data in which the user to be tested reads a specified text aloud, and/or second voice data in which the user to be tested recites the specified text;

extracting acoustic features and/or text features from the voice data, the acoustic features indicating the pronunciation characteristics of the user to be tested, and the text features indicating the ability of the user to be tested at the level of semantic expression;

taking the acoustic features and/or the text features as input, and determining the language features of the user to be tested after processing by a pre-built speech classification model.

Optionally, the acoustic features include a pause feature and/or a fundamental frequency feature;

the pause feature includes at least one of the following: the ratio between the total pause duration of the voice data and the duration of the voice data, the number of pauses in the voice data shorter than a first preset duration T1, the number of pauses in the voice data longer than a second preset duration T2, and the total number of pauses in the voice data, where T1 < T2;

the fundamental frequency feature includes at least one of the following: the mean fundamental frequency of the voice data, the fundamental frequency variance of the voice data, the maximum fundamental frequency of the voice data, and the minimum fundamental frequency of the voice data.

Optionally, the text features include a text similarity, and extracting the text features of the voice data includes:

performing speech recognition on the voice data to obtain a converted text, and calculating the text similarity between the converted text and the specified text.

Optionally, the method further includes:

judging whether the text recognition rate of the voice data exceeds a preset threshold;

if the text recognition rate of the voice data exceeds the preset threshold, executing the step of extracting the text similarity of the voice data.

Optionally, the text features further include:

a sentence dispersion of the specified text and a sentence dispersion of the converted text, the sentence dispersion of the specified text indicating the variance of the distances between the sentence vectors of the specified text and the document vector of the specified text, and the sentence dispersion of the converted text indicating the variance of the distances between the sentence vectors of the converted text and the document vector of the converted text;

and/or

a perplexity (PPL) difference, indicating the difference between the PPL value of the specified text and the PPL value of the converted text.

Optionally, the speech classification model is built in the following manner:

collecting sample voice data of sample users, the sample voice data including first sample voice data in which a sample user reads the specified text aloud, and/or second sample voice data in which the sample user recites the specified text, the sample users including users with normal language features and users with abnormal language features;

extracting acoustic features and/or text features from the sample voice data;

determining the topological structure of the speech classification model;

training the speech classification model using the topological structure and the acoustic features and/or text features of the sample voice data, until the language features output by the speech classification model are consistent with the language features of the sample users.
The present disclosure provides a speech signal processing apparatus, the apparatus including:

a voice data collection module, configured to collect voice data of a user to be tested, the voice data including first voice data in which the user to be tested reads a specified text aloud, and/or second voice data in which the user to be tested recites the specified text;

a feature extraction module, configured to extract acoustic features and/or text features from the voice data, the acoustic features indicating the pronunciation characteristics of the user to be tested, and the text features indicating the ability of the user to be tested at the level of semantic expression;

a language feature determining module, configured to take the acoustic features and/or the text features as input and determine the language features of the user to be tested after processing by a pre-built speech classification model.

Optionally, the acoustic features include a pause feature and/or a fundamental frequency feature;

the pause feature includes at least one of the following: the ratio between the total pause duration of the voice data and the duration of the voice data, the number of pauses in the voice data shorter than a first preset duration T1, the number of pauses in the voice data longer than a second preset duration T2, and the total number of pauses in the voice data, where T1 < T2;

the fundamental frequency feature includes at least one of the following: the mean fundamental frequency of the voice data, the fundamental frequency variance of the voice data, the maximum fundamental frequency of the voice data, and the minimum fundamental frequency of the voice data.

Optionally, the text features include a text similarity, and

the feature extraction module is configured to perform speech recognition on the voice data to obtain a converted text, and to calculate the text similarity between the converted text and the specified text.

Optionally, the apparatus further includes:

a recognition rate judging module, configured to judge whether the text recognition rate of the voice data exceeds a preset threshold;

the feature extraction module being configured to extract the text similarity of the voice data when the text recognition rate of the voice data exceeds the preset threshold.

Optionally, the text features further include:

a sentence dispersion of the specified text and a sentence dispersion of the converted text, the sentence dispersion of the specified text indicating the variance of the distances between the sentence vectors of the specified text and the document vector of the specified text, and the sentence dispersion of the converted text indicating the variance of the distances between the sentence vectors of the converted text and the document vector of the converted text;

and/or

a perplexity (PPL) difference, indicating the difference between the PPL value of the specified text and the PPL value of the converted text.

Optionally, the apparatus further includes:

a sample voice data collection module, configured to collect sample voice data of sample users, the sample voice data including first sample voice data in which a sample user reads the specified text aloud, and/or second sample voice data in which the sample user recites the specified text, the sample users including users with normal language features and users with abnormal language features;

a sample feature extraction module, configured to extract acoustic features and/or text features from the sample voice data;

a topological structure determining module, configured to determine the topological structure of the speech classification model;

a model training module, configured to train the speech classification model using the topological structure and the acoustic features and/or text features of the sample voice data, until the language features output by the speech classification model are consistent with the language features of the sample users.

The present disclosure provides a storage device storing a plurality of instructions, the instructions being loaded by a processor to execute the steps of the above audio signal processing method.

The present disclosure provides an electronic device, the electronic device including:

the above storage device; and

a processor, configured to execute the instructions in the storage device.

According to the scheme of the present disclosure, first voice data in which a user to be tested reads a specified text aloud and/or second voice data in which the user to be tested recites the specified text can be collected. On this basis, acoustic features indicating the pronunciation characteristics of the user to be tested and/or text features indicating the semantic expression ability of the user to be tested can be extracted, and with the acoustic features and/or text features as model input, the language features of the user to be tested can be determined after processing by the model. The scheme is simple and convenient to implement, the processing is time-saving and labor-saving, and it places no requirement of professional skill on personnel.

Other features and advantages of the present disclosure will be described in detail in the following detailed description.
Description of the drawings
The accompanying drawings are provided for a further understanding of the present disclosure and constitute a part of the specification. Together with the following detailed description, they serve to explain the present disclosure, but do not constitute a limitation of the present disclosure. In the drawings:

Fig. 1 is a schematic flowchart of the audio signal processing method of the scheme of the present disclosure;

Fig. 2 is a schematic flowchart of building the speech classification model in the scheme of the present disclosure;

Fig. 3 is a schematic composition diagram of the speech signal processing apparatus of the scheme of the present disclosure;

Fig. 4 is a schematic structural diagram of an electronic device for speech signal processing according to the scheme of the present disclosure.
Specific implementation mode
The specific implementations of the present disclosure are described in detail below with reference to the accompanying drawings. It should be understood that the specific implementations described here are only used to describe and explain the present disclosure and are not intended to limit the present disclosure.
Referring to Fig. 1, a schematic flowchart of the audio signal processing method of the present disclosure is shown. The method may include the following steps.

S101: collect voice data of a user to be tested, the voice data including first voice data in which the user to be tested reads a specified text aloud, and/or second voice data in which the user to be tested recites the specified text.

When speech signal processing is performed according to the scheme of the present disclosure, the voice data of the user to be tested can be collected first. As an example, the voice data of the user to be tested can be collected through the microphone of an intelligent terminal, where the intelligent terminal may be an everyday electronic device such as a mobile phone, a PC, a tablet computer, or a smart speaker; alternatively, the intelligent terminal may be dedicated equipment. The scheme of the present disclosure places no specific limitation on this.

As an example, the specified text in the scheme of the present disclosure may be a straightforward, easily understood passage; alternatively, multiple texts may be stored in a database, and one passage may be chosen at random as the specified text when needed. It should be understood that the database may belong to the intelligent terminal or to a server, in which case the intelligent terminal reads the specified text from the server when needed; the scheme of the present disclosure places no specific limitation on the number of specified texts, the way they are stored, or the way they are obtained.

In the scheme of the present disclosure, the language features of the user to be tested can be predicted through speech signal processing; for example, the language features may reflect whether memory is declining, whether thinking is clear, and so on.

The scheme can be applied to any scenario in which the language features of a user need to be predicted. For example, a certain post or job may require a user with good memory and clear logic; a prediction can be made based on the scheme of the present disclosure and the post can then be matched according to the prediction result. As another example, the language feature level of family members can be predicted on a daily basis, for example predicting the language feature level of a child and arranging targeted training according to the prediction result, or predicting the language feature level of middle-aged and elderly people and arranging training for the prevention of senile dementia according to the prediction result. The scheme of the present disclosure places no specific limitation on the application scenario.

As an example, users can be divided into two types: users with normal language features, who usually have good memory and clear thinking; and users with abnormal language features, who usually show memory decline, disorganized thinking, and similar conditions. Correspondingly, the scheme of the present disclosure can collect first voice data in which the user to be tested reads the specified text aloud, and use it to identify whether the user to be tested shows disorganized thinking, misreading, and similar conditions; and/or collect second voice data in which the user to be tested recites the specified text, and use it to identify whether the user to be tested shows memory decline, forgetfulness, and similar conditions.

When both kinds of voice data are collected, the user to be tested can first be allowed to become familiar with the specified text, after which the first voice data is collected. In addition, after the first voice data has been collected, an interval, for example 30 seconds, can be left before the user to be tested is asked to recite the specified text, at which point the second voice data is collected. The scheme of the present disclosure places no specific limitation on the process of collecting the voice data.
S102: extract acoustic features and/or text features from the voice data, the acoustic features indicating the pronunciation characteristics of the user to be tested, and the text features indicating the ability of the user to be tested at the level of semantic expression.

As an example, the scheme of the present disclosure can extract features from the voice data in at least the following three ways, which are explained separately below.
1. Extracting acoustic features from the voice data

In general, users with different language features have different pronunciation characteristics. The scheme of the present disclosure can extract acoustic features from the voice data and analyze the language features accordingly. As an example, the acoustic features can be embodied as a pause feature and/or a fundamental frequency feature, which are explained separately below.

(1) Pause feature

As an example, a speech endpoint detection tool can be used to detect silence in the voice data. Usually, a silent segment corresponds to a place where the user paused. From the endpoint values obtained by the detection, the positions of the pauses in the voice data and the duration of each pause can be determined, and the pause feature of the voice data can be obtained accordingly.

As an example, the pause feature can be embodied as at least one of the following (a sketch of these computations is given after this list):

(a) The ratio between the total pause duration of the voice data and the duration of the voice data. Specifically, the total pause duration in the voice data can be counted first, and then the ratio between the total pause duration and the duration of the voice data can be calculated.

(b) The number of pauses in the voice data shorter than a first preset duration T1. In practical applications, different people speak differently: some speak quickly and others slowly. The scheme of the present disclosure can extract pauses shorter than T1 to determine whether there are pauses in the voice data that are merely caused by a slow speaking rate.

(c) The number of pauses in the voice data longer than a second preset duration T2. As an example, the scheme of the present disclosure can also extract pauses longer than T2 to determine whether there are long pauses in the voice data. Such long pauses may be caused by the user's speaking habits, or by disorganized and unclear thinking; in general, the latter reason is somewhat more likely.

It should be understood that T1 < T2; for example, T1 may be 0.5 s and T2 may be 2 s, the specific values depending on the practical application requirements, which the scheme of the present disclosure does not limit.

(d) The total number of pauses in the voice data, that is, all pauses occurring in the voice data, which may include the number of pauses shorter than T1, the number of pauses with duration in the interval [T1, T2], and the number of pauses longer than T2.
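For illustration only, the pause features above could be computed as in the following minimal Python sketch. It assumes a speech endpoint detection step has already produced the detected speech segments as (start, end) times in seconds; the function name, the dictionary keys, and the default thresholds T1 = 0.5 s and T2 = 2 s are assumptions made here, not values prescribed by the disclosure.

```python
from typing import List, Tuple

def pause_features(speech_segments: List[Tuple[float, float]],
                   total_duration: float,
                   t1: float = 0.5,
                   t2: float = 2.0) -> dict:
    """Compute the pause features described above from speech endpoint detection results.

    speech_segments: (start, end) times of detected speech, in seconds, sorted by start time.
    total_duration:  length of the whole recording, in seconds.
    """
    # Gaps between consecutive speech segments (and before/after them) are treated as pauses.
    pauses = []
    prev_end = 0.0
    for start, end in speech_segments:
        if start > prev_end:
            pauses.append(start - prev_end)
        prev_end = end
    if total_duration > prev_end:
        pauses.append(total_duration - prev_end)

    total_pause = sum(pauses)
    return {
        "pause_ratio": total_pause / total_duration if total_duration else 0.0,  # (a)
        "num_short_pauses": sum(1 for p in pauses if p < t1),   # (b) pauses shorter than T1
        "num_long_pauses": sum(1 for p in pauses if p > t2),    # (c) pauses longer than T2
        "num_pauses": len(pauses),                               # (d) total number of pauses
    }
```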
(2) Fundamental frequency feature

In general, depending on whether the vocal cords vibrate during pronunciation, a speech signal can be divided into unvoiced and voiced sounds. Voiced sounds carry most of the energy in speech and show obvious periodicity in the time domain; unvoiced sounds resemble white noise and have no obvious periodicity. When a voiced sound is produced, the airflow passing through the glottis makes the vocal cords vibrate in a relaxation oscillation, producing a quasi-periodic excitation pulse train. The frequency of this vocal cord vibration is called the fundamental frequency, abbreviated F0.

The fundamental frequency is generally related to a person's vocal cords and pronunciation habits, and can reflect personal characteristics to a certain extent. In practical applications, users with different language features may have different fundamental frequency characteristics; for example, users with abnormal language features usually speak with a relatively low fundamental frequency, their pronunciation is more monotonous, and the relative variation of the fundamental frequency is smaller. The fundamental frequency feature of the voice data can therefore be extracted.

As an example, the voice data can be divided into frames to obtain multiple speech data frames, the fundamental frequency of each frame can be extracted, and then, based on the per-frame fundamental frequencies, at least one of the following fundamental frequency features can be obtained: the mean fundamental frequency of the voice data, the fundamental frequency variance of the voice data, the maximum fundamental frequency of the voice data, and the minimum fundamental frequency of the voice data. For example, the fundamental frequency can be estimated with the autocorrelation function method and the mean and variance computed from the per-frame values; the scheme of the present disclosure places no specific limitation on this.
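As a purely illustrative sketch, the fundamental frequency statistics could be gathered from per-frame F0 estimates as follows. The per-frame estimation itself (for example with an autocorrelation-based pitch tracker) is assumed to have been done already, and the convention that unvoiced frames are marked with 0 is an assumption for this example.

```python
import numpy as np

def f0_features(frame_f0: np.ndarray) -> dict:
    """Compute fundamental-frequency statistics from per-frame F0 estimates.

    frame_f0: array of per-frame F0 values in Hz; 0 marks unvoiced frames (assumed convention).
    """
    voiced = frame_f0[frame_f0 > 0]           # keep only voiced frames
    if voiced.size == 0:
        return {"f0_mean": 0.0, "f0_var": 0.0, "f0_max": 0.0, "f0_min": 0.0}
    return {
        "f0_mean": float(np.mean(voiced)),    # mean fundamental frequency
        "f0_var": float(np.var(voiced)),      # fundamental frequency variance
        "f0_max": float(np.max(voiced)),      # maximum fundamental frequency
        "f0_min": float(np.min(voiced)),      # minimum fundamental frequency
    }
```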
It should be understood that the above acoustic features can be embodied as acoustic features extracted from the first voice data and/or acoustic features extracted from the second voice data.
2. Extracting text features from the voice data

In general, users with different language features have different abilities of semantic expression, which can be reflected in the text features extracted from the voice data, as explained below.

As an example, speech recognition can be performed on the voice data to obtain a converted text, and the text similarity between the converted text and the specified text can then be calculated. For example, a first text similarity can be calculated between the specified text and a first converted text obtained from the first voice data; and/or a second text similarity can be calculated between the specified text and a second converted text obtained from the second voice data.

A user with abnormal language features may misread the specified text, or insert unrelated words or sentences, when reading it aloud; the scheme of the present disclosure can extract the first text similarity and use it to identify whether the user to be tested shows these conditions. In addition, a user with abnormal language features may be prone to forgetting when reciting the specified text; the scheme of the present disclosure can extract the second text similarity and use it to identify whether the user to be tested shows this condition.
For example, the text similarity in the scheme of the present disclosure can be embodied as at least one of the following three forms:

(1) The text similarity can be embodied as the topic similarity of the texts. For example, a topic model can first be used to mine the topics of the converted text and the specified text, and the topic similarity can then be calculated. As an example, LSA (Latent Semantic Analysis) can be used to mine the text topics, and the similarity can be calculated with the cosine distance. The scheme of the present disclosure places no specific limitation on the methods used to mine the text topics and calculate the similarity.
(2) The text similarity can be embodied as the content similarity of the texts. For example, the converted text and the specified text can each be segmented into words, and the word vectors of the words can be combined by weighted summation to obtain a document vector for each text; the content similarity between the document vector of the converted text and the document vector of the specified text can then be calculated. As an example, the similarity can be calculated with the cosine distance; the scheme of the present disclosure places no specific limitation on this.
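A minimal sketch of this document-vector content similarity is given below. It assumes pretrained word vectors are available as a dictionary and that a tokenizer suitable for the language of the texts is supplied; a plain average is used here as the weighting. All of these are illustrative assumptions rather than requirements of the disclosure.

```python
import numpy as np
from typing import Callable, Dict, List

def document_vector(text: str,
                    tokenize: Callable[[str], List[str]],
                    word_vectors: Dict[str, np.ndarray],
                    dim: int) -> np.ndarray:
    """Average the word vectors of a text to obtain a single document vector."""
    vecs = [word_vectors[w] for w in tokenize(text) if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def content_similarity(converted: str, specified: str,
                       tokenize: Callable[[str], List[str]],
                       word_vectors: Dict[str, np.ndarray],
                       dim: int = 300) -> float:
    """Cosine similarity between the document vectors of the two texts."""
    a = document_vector(converted, tokenize, word_vectors, dim)
    b = document_vector(specified, tokenize, word_vectors, dim)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0
```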
(3) The text similarity can be embodied as the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) value of the texts.

The ROUGE value is a common metric in the field of natural language understanding. In the scheme of the present disclosure it can be used to indicate the degree of agreement between the specified text and the converted text: by matching the strings that the converted text and the specified text have in common, it expresses how much of the content the user to be tested was able to say correctly. As an example, the ROUGE value can be embodied as the following indicators:

precision, indicating the proportion of n-grams in the converted text that are correct, which reflects the correctness of the reading or recitation;

recall, indicating the proportion of n-grams in the specified text that were correctly read or recited, that is, how much of the specified text was correctly recalled;

the F value (F-measure), a combined indicator of precision and recall, which in its balanced form can be calculated by the following formula:

F = 2 × P × R / (P + R)

where P denotes the precision and R denotes the recall.

If the converted text is identical to the specified text, each of these indicators is 1; if the converted text is only partly identical to the specified text, each indicator lies in (0, 1). The process of calculating the ROUGE values can be implemented with reference to the related art and is not described in detail here.
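For concreteness, a simple ROUGE-n style computation over pre-tokenized texts might look like the sketch below; the unigram default and the tokenization are assumptions made for illustration.

```python
from collections import Counter
from typing import List

def ngrams(tokens: List[str], n: int) -> Counter:
    """Count the n-grams of a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(converted: List[str], specified: List[str], n: int = 1) -> dict:
    """Precision, recall and F value of n-gram overlap between the two texts."""
    conv, spec = ngrams(converted, n), ngrams(specified, n)
    overlap = sum((conv & spec).values())           # n-grams occurring in both texts
    precision = overlap / max(sum(conv.values()), 1)
    recall = overlap / max(sum(spec.values()), 1)
    f_value = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f": f_value}
```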
As an example, before the text similarity is extracted, it can first be judged whether the text recognition rate of the voice data exceeds a preset threshold. If it exceeds the preset threshold, the speech-to-text conversion is sufficiently accurate, and the text similarity calculated from it is accordingly more reliable. For example, the preset threshold may be set to 95%, the specific value depending on the practical application requirements, which the scheme of the present disclosure does not limit. It should be understood that the text recognition rate judgment in the scheme of the present disclosure may judge the text recognition rate of the first voice data and/or the text recognition rate of the second voice data; in addition, the two kinds of voice data may use the same preset threshold or different preset thresholds, and the scheme of the present disclosure places no specific limitation on this.
As an example, the text features in the scheme of the present disclosure may also include at least one of the following:

(1) The sentence dispersion of the specified text and the sentence dispersion of the converted text

As introduced above, a user with abnormal language features may misread the text or insert unrelated words or sentences. The scheme of the present disclosure can extract the sentence dispersion and use the degree to which the sentences scatter to identify whether the user to be tested shows these conditions.

As an example, the sentence dispersion can be embodied as the sentence dispersion of the specified text and the sentence dispersion of the converted text. The sentence dispersion of the specified text indicates the variance of the distances between the sentence vectors of the specified text and the document vector of the specified text; the sentence dispersion of the converted text indicates the variance of the distances between the sentence vectors of the converted text and the document vector of the converted text.

For example, the specified text can be segmented into words, and the word vectors of the words can be combined by weighted summation to obtain the sentence vector of each sentence in the specified text and the document vector of the specified text; the variance of the distances between each sentence vector and the document vector is then calculated as the sentence dispersion of the specified text. The sentence dispersion of the converted text can be calculated in the same way and is not described in detail here. Specifically, the sentence dispersion of the converted text can be embodied as the sentence dispersion of the first converted text and/or the sentence dispersion of the second converted text.
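The sketch below shows one way such a dispersion might be computed. The embedding helper (for example the document_vector function sketched earlier) and the approximation of the document vector as the mean of the sentence vectors are assumptions made for illustration.

```python
import numpy as np
from typing import Callable, List

def sentence_dispersion(sentences: List[str],
                        embed: Callable[[str], np.ndarray]) -> float:
    """Variance of the distances between sentence vectors and the document vector.

    sentences: the text already split into sentences.
    embed:     maps a piece of text to a vector (e.g. the document_vector helper above).
    """
    sent_vecs = np.stack([embed(s) for s in sentences])
    doc_vec = sent_vecs.mean(axis=0)                # document vector approximated as the mean
    distances = np.linalg.norm(sent_vecs - doc_vec, axis=1)
    return float(np.var(distances))                 # dispersion = variance of the distances
```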
(2) The perplexity (PPL) difference

PPL is an important indicator in the field of language modeling and mainly reflects whether a text sequence is reasonable. For a user with abnormal language features, the converted text may contain incoherent passages. The scheme of the present disclosure can calculate the PPL difference between the PPL value of the specified text and the PPL value of the converted text, and use it to identify whether the user to be tested shows this condition. The process of extracting the PPL values of the specified text and the converted text can be implemented with reference to the related art and is not detailed here. Specifically, the PPL difference may be the difference between the PPL value of the specified text and the PPL value of the first converted text, and/or the difference between the PPL value of the specified text and the PPL value of the second converted text.
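As a minimal illustration, the PPL difference could be computed as below, assuming some language model object exposes a per-token log-probability method; the lm.log_prob interface is a placeholder assumption, not an API named in the disclosure.

```python
import math
from typing import List

def perplexity(tokens: List[str], lm) -> float:
    """Perplexity of a token sequence under a language model.

    lm is assumed (for illustration) to expose log_prob(token, history),
    returning the natural-log probability of the token given its history.
    """
    log_prob_sum = 0.0
    for i, tok in enumerate(tokens):
        log_prob_sum += lm.log_prob(tok, tokens[:i])
    return math.exp(-log_prob_sum / max(len(tokens), 1))

def ppl_difference(specified_tokens: List[str], converted_tokens: List[str], lm) -> float:
    """PPL difference between the specified text and the converted text."""
    return perplexity(specified_tokens, lm) - perplexity(converted_tokens, lm)
```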
3. Extracting both acoustic features and text features from the voice data

That is, the above acoustic features and text features can also be combined to identify the language features of the user to be tested. Reference can be made to the descriptions above, and the details are not repeated here.
S103: take the acoustic features and/or the text features as input, and determine the language features of the user to be tested after processing by the pre-built speech classification model.

After the acoustic features and/or text features have been extracted from the voice data of the user to be tested, the pre-built speech classification model can be used to process them and output the language features of the user to be tested.

As an example, a single model prediction can be made for a user to be tested; alternatively, multiple model predictions can be made, and the language features of the user to be tested can be determined from the mean of the prediction results, or the language feature that occurs most often among the prediction results can be taken as the language feature of the user to be tested. The specific choice depends on the practical application requirements, and the scheme of the present disclosure places no specific limitation on the number of predictions, the way the language features are determined, and so on.
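For illustration, the two aggregation options just described (the mean of predicted probabilities, or the most frequent predicted label) could be sketched as follows; the function names and the label convention are assumptions made for this example.

```python
from collections import Counter
from typing import List

def aggregate_by_majority(predicted_labels: List[int]) -> int:
    """Return the language-feature label that occurs most often across several predictions."""
    return Counter(predicted_labels).most_common(1)[0][0]

def aggregate_by_mean(predicted_probs: List[float], threshold: float = 0.5) -> int:
    """Average the predicted probabilities and threshold the mean.

    predicted_probs: per-prediction probability of 'normal language features'
    (the 0 = normal / 1 = abnormal convention here is an assumption for illustration).
    """
    mean_prob = sum(predicted_probs) / max(len(predicted_probs), 1)
    return 0 if mean_prob >= threshold else 1
```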
From the above description, it can be seen that the scheme of the present disclosure is simple and convenient to implement, the processing is time-saving and labor-saving, and it places no requirement of professional skill on personnel. As an example, when the scheme of the present disclosure predicts the language feature level of middle-aged and elderly people, the language features determined by the model are not used to replace conventional hospital examinations, but can assist conventional examinations in making a judgment. During model prediction, only the voice data of the user to be tested needs to be recorded; the specific processing does not act directly on the user to be tested and has no influence on the physiological functions of the user to be tested.
The process of building the speech classification model in the scheme of the present disclosure is explained below. For details, refer to the flowchart shown in Fig. 2, which may include the following steps.

S201: collect sample voice data of sample users, the sample voice data including first sample voice data in which a sample user reads the specified text aloud, and/or second sample voice data in which the sample user recites the specified text, the sample users including users with normal language features and users with abnormal language features.

When model training is carried out, sample voice data of a large number of sample users can be collected. The sample voice data can be embodied as first sample voice data of reading the specified text aloud and/or second sample voice data of reciting the specified text, as introduced above at S101, and is not described in detail here.

As an example, the sample users may include users with normal language features and users with abnormal language features. For example, the age range of the sample users can be kept as similar as possible, which helps reduce the influence on classification accuracy of the physiological characteristics caused by age differences.

S202: extract acoustic features and/or text features from the sample voice data.

The specific implementation can refer to the introduction at S102 above and is not detailed here.
S203: determine the topological structure of the speech classification model.

As an example, the topological structure in the scheme of the present disclosure can be embodied as a CNN (Convolutional Neural Network), an RNN (Recurrent Neural Network), a DNN (Deep Neural Network), or the like; the scheme of the present disclosure places no specific limitation on this.

As an example, the neural network may include an input layer, hidden layers, and an output layer. The input layer may be the feature vector obtained by concatenating the acoustic features and/or text features; there may be one or more hidden layers, the number of nodes per layer may be set between 16 and 32, and sigmoid may be used as the activation function; the output layer may include 2 output nodes representing users with normal language features and users with abnormal language features respectively, for example using "0" to indicate a user with normal language features and "1" to indicate a user with abnormal language features, or the output layer may include 1 output node indicating the probability that the user to be tested is identified as a user with normal language features. The scheme of the present disclosure places no limitation on the specific form of each layer of the neural network.
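Purely as an illustration of such a topology, a small fully connected network of this shape could be declared as in the sketch below; PyTorch is chosen only for concreteness, and the feature dimension and hidden sizes are assumed placeholders within the 16-32 range mentioned above rather than values fixed by the disclosure.

```python
import torch.nn as nn

def build_classifier(feature_dim: int, hidden_dim: int = 32) -> nn.Module:
    """A small DNN topology: concatenated features -> hidden layers -> 2-class output."""
    return nn.Sequential(
        nn.Linear(feature_dim, hidden_dim),  # input layer -> first hidden layer
        nn.Sigmoid(),                        # sigmoid activation, as described above
        nn.Linear(hidden_dim, 16),           # second hidden layer (16-32 nodes per layer)
        nn.Sigmoid(),
        nn.Linear(16, 2),                    # 2 output nodes: normal vs. abnormal language features
    )
```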
S204: train the speech classification model using the topological structure and the acoustic features and/or text features of the sample voice data, until the language features output by the speech classification model are consistent with the language features of the sample users.

After the topological structure of the model has been determined and the acoustic features and/or text features of the sample voice data have been extracted, model training can be carried out. As an example, the training process may adopt the cross-entropy criterion and update and optimize the model parameters with the common stochastic gradient descent method, ensuring that when training is completed, the language features predicted by the model are consistent with the language features the sample users actually have. Here, the language features output by the speech classification model being consistent with the language features of the sample users may mean that the predicted language features are identical to the language features of the sample users, or that the accuracy of the predicted language features reaches a preset value, for example 90%; the scheme of the present disclosure places no specific limitation on this.
It should be understood that the speech classification model of the scheme of the present disclosure is mainly based on the characteristics of users with normal language features and users with abnormal language features in pronunciation and/or semantic expression; through statistical analysis and model training, it obtains the classification rules of different language features, and then determines the language features of a general user, that is, the user to be tested, according to these classification rules.
Referring to Fig. 3, a schematic composition diagram of the speech signal processing apparatus of the present disclosure is shown. The apparatus may include:

a voice data collection module 301, configured to collect voice data of a user to be tested, the voice data including first voice data in which the user to be tested reads a specified text aloud, and/or second voice data in which the user to be tested recites the specified text;

a feature extraction module 302, configured to extract acoustic features and/or text features from the voice data, the acoustic features indicating the pronunciation characteristics of the user to be tested, and the text features indicating the ability of the user to be tested at the level of semantic expression;

a language feature determining module 303, configured to take the acoustic features and/or the text features as input and determine the language features of the user to be tested after processing by a pre-built speech classification model.
Optionally, the acoustic features include a pause feature and/or a fundamental frequency feature;

the pause feature includes at least one of the following: the ratio between the total pause duration of the voice data and the duration of the voice data, the number of pauses in the voice data shorter than a first preset duration T1, the number of pauses in the voice data longer than a second preset duration T2, and the total number of pauses in the voice data, where T1 < T2;

the fundamental frequency feature includes at least one of the following: the mean fundamental frequency of the voice data, the fundamental frequency variance of the voice data, the maximum fundamental frequency of the voice data, and the minimum fundamental frequency of the voice data.
Optionally, the text features include a text similarity, and

the feature extraction module is configured to perform speech recognition on the voice data to obtain a converted text, and to calculate the text similarity between the converted text and the specified text.

Optionally, the apparatus further includes:

a recognition rate judging module, configured to judge whether the text recognition rate of the voice data exceeds a preset threshold;

the feature extraction module being configured to extract the text similarity of the voice data when the text recognition rate of the voice data exceeds the preset threshold.
Optionally, the text features further include:

a sentence dispersion of the specified text and a sentence dispersion of the converted text, the sentence dispersion of the specified text indicating the variance of the distances between the sentence vectors of the specified text and the document vector of the specified text, and the sentence dispersion of the converted text indicating the variance of the distances between the sentence vectors of the converted text and the document vector of the converted text;

and/or

a perplexity (PPL) difference, indicating the difference between the PPL value of the specified text and the PPL value of the converted text.
Optionally, the apparatus further includes:

a sample voice data collection module, configured to collect sample voice data of sample users, the sample voice data including first sample voice data in which a sample user reads the specified text aloud, and/or second sample voice data in which the sample user recites the specified text, the sample users including users with normal language features and users with abnormal language features;

a sample feature extraction module, configured to extract acoustic features and/or text features from the sample voice data;

a topological structure determining module, configured to determine the topological structure of the speech classification model;

a model training module, configured to train the speech classification model using the topological structure and the acoustic features and/or text features of the sample voice data, until the language features output by the speech classification model are consistent with the language features of the sample users.
With regard to the apparatus in the above embodiment, the specific manner in which each module performs its operations has been described in detail in the embodiments of the related method and will not be elaborated here.
Referring to Fig. 4, a schematic structural diagram of an electronic device 400 for speech signal processing according to the present disclosure is shown. Referring to Fig. 4, the electronic device 400 includes a processing component 401, which further includes one or more processors, and storage device resources represented by a storage medium 402 for storing instructions executable by the processing component 401, such as an application program. The application program stored in the storage medium 402 may include one or more modules, each corresponding to a set of instructions. In addition, the processing component 401 is configured to execute the instructions so as to perform the above audio signal processing method.

The electronic device 400 may also include a power supply component 403 configured to perform power management of the electronic device 400, a wired or wireless network interface 404 configured to connect the electronic device 400 to a network, and an input/output (I/O) interface 405. The electronic device 400 may operate based on an operating system stored in the storage medium 402, such as Windows ServerTM, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM, or the like.
The preferred embodiments of the present disclosure have been described in detail above with reference to the accompanying drawings. However, the present disclosure is not limited to the specific details in the above embodiments; within the scope of the technical concept of the present disclosure, various simple variations can be made to the technical solution of the present disclosure, and these simple variations all belong to the protection scope of the present disclosure.

It should be further noted that the specific technical features described in the above specific embodiments can be combined in any suitable manner provided there is no contradiction; to avoid unnecessary repetition, the present disclosure does not separately describe the various possible combinations.

In addition, the various different embodiments of the present disclosure can also be combined arbitrarily, and as long as such combinations do not depart from the idea of the present disclosure, they should likewise be regarded as content disclosed by the present disclosure.
Claims (14)
1. An audio signal processing method, characterized in that the method comprises:

collecting voice data of a user to be tested, the voice data comprising first voice data in which the user to be tested reads a specified text aloud, and/or second voice data in which the user to be tested recites the specified text;

extracting acoustic features and/or text features from the voice data, the acoustic features indicating the pronunciation characteristics of the user to be tested, and the text features indicating the ability of the user to be tested at the level of semantic expression;

taking the acoustic features and/or the text features as input, and determining the language features of the user to be tested after processing by a pre-built speech classification model.

2. The method according to claim 1, characterized in that the acoustic features comprise a pause feature and/or a fundamental frequency feature;

the pause feature comprises at least one of the following: the ratio between the total pause duration of the voice data and the duration of the voice data, the number of pauses in the voice data shorter than a first preset duration T1, the number of pauses in the voice data longer than a second preset duration T2, and the total number of pauses in the voice data, where T1 < T2;

the fundamental frequency feature comprises at least one of the following: the mean fundamental frequency of the voice data, the fundamental frequency variance of the voice data, the maximum fundamental frequency of the voice data, and the minimum fundamental frequency of the voice data.

3. The method according to claim 1, characterized in that the text features comprise a text similarity, and extracting the text features of the voice data comprises:

performing speech recognition on the voice data to obtain a converted text, and calculating the text similarity between the converted text and the specified text.

4. The method according to claim 3, characterized in that the method further comprises:

judging whether the text recognition rate of the voice data exceeds a preset threshold;

if the text recognition rate of the voice data exceeds the preset threshold, executing the step of extracting the text similarity of the voice data.

5. The method according to claim 3 or 4, characterized in that the text features further comprise:

a sentence dispersion of the specified text and a sentence dispersion of the converted text, the sentence dispersion of the specified text indicating the variance of the distances between the sentence vectors of the specified text and the document vector of the specified text, and the sentence dispersion of the converted text indicating the variance of the distances between the sentence vectors of the converted text and the document vector of the converted text;

and/or

a perplexity (PPL) difference, indicating the difference between the PPL value of the specified text and the PPL value of the converted text.

6. The method according to claim 1, characterized in that the speech classification model is built in the following manner:

collecting sample voice data of sample users, the sample voice data comprising first sample voice data in which a sample user reads the specified text aloud, and/or second sample voice data in which the sample user recites the specified text, the sample users comprising users with normal language features and users with abnormal language features;

extracting acoustic features and/or text features from the sample voice data;

determining the topological structure of the speech classification model;

training the speech classification model using the topological structure and the acoustic features and/or text features of the sample voice data, until the language features output by the speech classification model are consistent with the language features of the sample users.
7. A speech signal processing apparatus, characterized in that the apparatus comprises:

a voice data collection module, configured to collect voice data of a user to be tested, the voice data comprising first voice data in which the user to be tested reads a specified text aloud, and/or second voice data in which the user to be tested recites the specified text;

a feature extraction module, configured to extract acoustic features and/or text features from the voice data, the acoustic features indicating the pronunciation characteristics of the user to be tested, and the text features indicating the ability of the user to be tested at the level of semantic expression;

a language feature determining module, configured to take the acoustic features and/or the text features as input and determine the language features of the user to be tested after processing by a pre-built speech classification model.

8. The apparatus according to claim 7, characterized in that the acoustic features comprise a pause feature and/or a fundamental frequency feature;

the pause feature comprises at least one of the following: the ratio between the total pause duration of the voice data and the duration of the voice data, the number of pauses in the voice data shorter than a first preset duration T1, the number of pauses in the voice data longer than a second preset duration T2, and the total number of pauses in the voice data, where T1 < T2;

the fundamental frequency feature comprises at least one of the following: the mean fundamental frequency of the voice data, the fundamental frequency variance of the voice data, the maximum fundamental frequency of the voice data, and the minimum fundamental frequency of the voice data.

9. The apparatus according to claim 7, characterized in that the text features comprise a text similarity, and

the feature extraction module is configured to perform speech recognition on the voice data to obtain a converted text, and to calculate the text similarity between the converted text and the specified text.

10. The apparatus according to claim 9, characterized in that the apparatus further comprises:

a recognition rate judging module, configured to judge whether the text recognition rate of the voice data exceeds a preset threshold;

the feature extraction module being configured to extract the text similarity of the voice data when the text recognition rate of the voice data exceeds the preset threshold.

11. The apparatus according to claim 9 or 10, characterized in that the text features further comprise:

a sentence dispersion of the specified text and a sentence dispersion of the converted text, the sentence dispersion of the specified text indicating the variance of the distances between the sentence vectors of the specified text and the document vector of the specified text, and the sentence dispersion of the converted text indicating the variance of the distances between the sentence vectors of the converted text and the document vector of the converted text;

and/or

a perplexity (PPL) difference, indicating the difference between the PPL value of the specified text and the PPL value of the converted text.

12. The apparatus according to claim 7, characterized in that the apparatus further comprises:

a sample voice data collection module, configured to collect sample voice data of sample users, the sample voice data comprising first sample voice data in which a sample user reads the specified text aloud, and/or second sample voice data in which the sample user recites the specified text, the sample users comprising users with normal language features and users with abnormal language features;

a sample feature extraction module, configured to extract acoustic features and/or text features from the sample voice data;

a topological structure determining module, configured to determine the topological structure of the speech classification model;

a model training module, configured to train the speech classification model using the topological structure and the acoustic features and/or text features of the sample voice data, until the language features output by the speech classification model are consistent with the language features of the sample users.

13. A storage device storing a plurality of instructions, characterized in that the instructions are loaded by a processor to execute the steps of the method according to any one of claims 1 to 6.

14. An electronic device, characterized in that the electronic device comprises:

the storage device according to claim 13; and

a processor, configured to execute the instructions in the storage device.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711479955.XA CN108320734A (en) | 2017-12-29 | 2017-12-29 | Audio signal processing method and device, storage medium, electronic equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108320734A true CN108320734A (en) | 2018-07-24 |
Family
ID=62893510
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711479955.XA Pending CN108320734A (en) | 2017-12-29 | 2017-12-29 | Audio signal processing method and device, storage medium, electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108320734A (en) |
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100298649A1 (en) * | 2007-11-02 | 2010-11-25 | Siegbert Warkentin | System and methods for assessment of the aging brain and its brain disease induced brain dysfunctions by speech analysis |
CN101739868A (en) * | 2008-11-19 | 2010-06-16 | 中国科学院自动化研究所 | Automatic evaluation and diagnosis method of text reading level for oral test |
US20100174533A1 (en) * | 2009-01-06 | 2010-07-08 | Regents Of The University Of Minnesota | Automatic measurement of speech fluency |
JP2011255106A (en) * | 2010-06-11 | 2011-12-22 | Nagoya Institute Of Technology | Cognitive dysfunction danger computing device, cognitive dysfunction danger computing system, and program |
US20120065977A1 (en) * | 2010-09-09 | 2012-03-15 | Rosetta Stone, Ltd. | System and Method for Teaching Non-Lexical Speech Effects |
CN103251386A (en) * | 2011-12-20 | 2013-08-21 | 台达电子工业股份有限公司 | Apparatus and method for voice assisted medical diagnosis |
CN105845134A (en) * | 2016-06-14 | 2016-08-10 | 科大讯飞股份有限公司 | Spoken language evaluation method through freely read topics and spoken language evaluation system thereof |
CN107145739A (en) * | 2017-05-07 | 2017-09-08 | 黄石德龙自动化科技有限公司 | Method of discrimination based on Alzheimer disease under a variety of data |
CN107316638A (en) * | 2017-06-28 | 2017-11-03 | 北京粉笔未来科技有限公司 | A kind of poem recites evaluating method and system, a kind of terminal and storage medium |
Non-Patent Citations (1)
Title |
---|
邓杰娟: "Research on an Aphasia Auxiliary Diagnosis and Rehabilitation Treatment System Based on Speech Recognition Technology", China Master's Theses Full-text Database, Engineering Science and Technology II *
Cited By (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109147780B (en) * | 2018-08-15 | 2023-03-03 | 重庆柚瓣家科技有限公司 | Voice recognition method and system under free chat scene |
CN109147780A (en) * | 2018-08-15 | 2019-01-04 | 重庆柚瓣家科技有限公司 | Audio recognition method and system under free chat scenario |
CN109741732B (en) * | 2018-08-30 | 2022-06-21 | 京东方科技集团股份有限公司 | Named entity recognition method, named entity recognition device, equipment and medium |
CN109741732A (en) * | 2018-08-30 | 2019-05-10 | 京东方科技集团股份有限公司 | Name entity recognition method, name entity recognition device, equipment and medium |
US11514891B2 (en) | 2018-08-30 | 2022-11-29 | Beijing Boe Technology Development Co., Ltd. | Named entity recognition method, named entity recognition equipment and medium |
WO2020043123A1 (en) * | 2018-08-30 | 2020-03-05 | 京东方科技集团股份有限公司 | Named-entity recognition method, named-entity recognition apparatus and device, and medium |
CN111354366A (en) * | 2018-12-20 | 2020-06-30 | 沈阳新松机器人自动化股份有限公司 | Abnormal sound detection method and abnormal sound detection device |
CN111354366B (en) * | 2018-12-20 | 2023-06-16 | 沈阳新松机器人自动化股份有限公司 | Abnormal sound detection method and abnormal sound detection device |
CN109657186A (en) * | 2018-12-27 | 2019-04-19 | 广州势必可赢网络科技有限公司 | A kind of demographic method, system and relevant apparatus |
CN109754822A (en) * | 2019-01-22 | 2019-05-14 | 平安科技(深圳)有限公司 | The method and apparatus for establishing Alzheimer's disease detection model |
WO2020151155A1 (en) * | 2019-01-22 | 2020-07-30 | 平安科技(深圳)有限公司 | Method and device for building alzheimer's disease detection model |
CN110797016A (en) * | 2019-02-26 | 2020-02-14 | 北京嘀嘀无限科技发展有限公司 | Voice recognition method and device, electronic equipment and storage medium |
CN110600015A (en) * | 2019-09-18 | 2019-12-20 | 北京声智科技有限公司 | Voice dense classification method and related device |
CN112825248A (en) * | 2019-11-19 | 2021-05-21 | 阿里巴巴集团控股有限公司 | Voice processing method, model training method, interface display method and equipment |
CN111583907B (en) * | 2020-04-15 | 2023-08-15 | 北京小米松果电子有限公司 | Information processing method, device and storage medium |
CN111583907A (en) * | 2020-04-15 | 2020-08-25 | 北京小米松果电子有限公司 | Information processing method, device and storage medium |
CN112420028A (en) * | 2020-12-03 | 2021-02-26 | 上海欣方智能系统有限公司 | System and method for performing semantic recognition on voice signal |
CN112420028B (en) * | 2020-12-03 | 2024-03-19 | 上海欣方智能系统有限公司 | System and method for carrying out semantic recognition on voice signals |
CN114596960B (en) * | 2022-03-01 | 2023-08-08 | 中山大学 | Alzheimer's disease risk prediction method based on neural network and natural dialogue |
CN114596960A (en) * | 2022-03-01 | 2022-06-07 | 中山大学 | Alzheimer's disease risk estimation method based on neural network and natural conversation |
CN115116431B (en) * | 2022-08-29 | 2022-11-18 | 深圳市星范儿文化科技有限公司 | Audio generation method, device, equipment and storage medium based on intelligent reading kiosk |
CN115116431A (en) * | 2022-08-29 | 2022-09-27 | 深圳市星范儿文化科技有限公司 | Audio generation method, device and equipment based on intelligent reading kiosk and storage medium |
CN115188369A (en) * | 2022-09-09 | 2022-10-14 | 北京探境科技有限公司 | Voice recognition rate testing method, system, chip, electronic device and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108320734A (en) | Audio signal processing method and device, storage medium, electronic equipment | |
CN111179975B (en) | Voice endpoint detection method for emotion recognition, electronic device and storage medium | |
US11037553B2 (en) | Learning-type interactive device | |
CN102194454B (en) | Equipment and method for detecting key word in continuous speech | |
CN109686383B (en) | Voice analysis method, device and storage medium | |
Alon et al. | Contextual speech recognition with difficult negative training examples | |
CN111243569B (en) | Emotional voice automatic generation method and device based on generation type confrontation network | |
US11810471B2 (en) | Computer implemented method and apparatus for recognition of speech patterns and feedback | |
CN108899033B (en) | Method and device for determining speaker characteristics | |
Levitan et al. | Combining Acoustic-Prosodic, Lexical, and Phonotactic Features for Automatic Deception Detection. | |
Yağanoğlu | Real time wearable speech recognition system for deaf persons | |
Devi et al. | Speaker emotion recognition based on speech features and classification techniques | |
KR101988165B1 (en) | Method and system for improving the accuracy of speech recognition technology based on text data analysis for deaf students | |
CN109872714A (en) | A kind of method, electronic equipment and storage medium improving accuracy of speech recognition | |
CN110853669B (en) | Audio identification method, device and equipment | |
CN110689895A (en) | Voice verification method and device, electronic equipment and readable storage medium | |
CN114783464A (en) | Cognitive detection method and related device, electronic equipment and storage medium | |
CN108269574A (en) | Voice signal processing method and device, storage medium and electronic equipment | |
KR20210071713A (en) | Speech Skill Feedback System | |
Gupta et al. | Implicit language identification system based on random forest and support vector machine for speech | |
CN115512692B (en) | Voice recognition method, device, equipment and storage medium | |
CN116052655A (en) | Audio processing method, device, electronic equipment and readable storage medium | |
US11961510B2 (en) | Information processing apparatus, keyword detecting apparatus, and information processing method | |
CN113593523A (en) | Speech detection method and device based on artificial intelligence and electronic equipment | |
CN113808577A (en) | Intelligent extraction method and device of voice abstract, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20180724 |