CN108320734A - Audio signal processing method and device, storage medium, electronic equipment - Google Patents

Audio signal processing method and device, storage medium, electronic equipment

Info

Publication number
CN108320734A
CN108320734A (application CN201711479955.XA)
Authority
CN
China
Prior art keywords
text
voice data
feature
user
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711479955.XA
Other languages
Chinese (zh)
Inventor
孔常青
乔玉平
高建清
鹿晓亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Iflytek Anhui Medical Information Technology Co Ltd
Original Assignee
Iflytek Anhui Medical Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Iflytek Anhui Medical Information Technology Co Ltd
Priority to CN201711479955.XA
Publication of CN108320734A
Legal status: Pending


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/027 - Syllables being the recognition units
    • G10L15/08 - Speech classification or search
    • G10L15/10 - Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G10L15/14 - Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142 - Hidden Markov Models [HMMs]
    • G10L15/144 - Training of HMMs
    • G10L15/26 - Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

The present disclosure provides an audio signal processing method and apparatus, a storage medium, and an electronic device. The method includes: collecting voice data of a user under test, the voice data including first voice data in which the user reads a specified text aloud, and/or second voice data in which the user recites the specified text; extracting acoustic features and/or text features of the voice data, the acoustic features indicating the pronunciation characteristics of the user, and the text features indicating the user's ability at the level of semantic expression; and feeding the acoustic features and/or the text features as input into a pre-built speech classification model, whose output determines the language characteristics of the user. This scheme determines the language characteristics of a user under test through speech processing technology, and the process is simple and convenient.

Description

Audio signal processing method and device, storage medium, electronic equipment
Technical field
The present disclosure relates to the field of speech processing, and in particular to an audio signal processing method and apparatus, a storage medium, and an electronic device.
Background technology
Speech, an analog signal carrying specific information, has become an important means of obtaining and spreading information in social life. A speech signal generally carries extremely rich information, such as textual content or semantics, voiceprint characteristics, language or dialect, and emotion; speech signal processing is the extraction of the useful speech information from a complex acoustic environment.
In practical applications, speech signal processing can extract user-specific information for identity recognition, for example distinguishing the speakers in a conversation; alternatively, it can normalize away the differences between users, extract their common information, and classify the speakers, for example by gender or by language.
Summary of the invention
An object of the present disclosure is to provide an audio signal processing method and apparatus, a storage medium, and an electronic device capable of determining the language characteristics of a user under test through speech processing technology.
To achieve the above object, the present disclosure provides an audio signal processing method, the method including:
collecting voice data of a user under test, the voice data including first voice data in which the user reads a specified text aloud, and/or second voice data in which the user recites the specified text;
extracting acoustic features and/or text features of the voice data, the acoustic features indicating the pronunciation characteristics of the user, and the text features indicating the user's ability at the level of semantic expression;
feeding the acoustic features and/or the text features as input into a pre-built speech classification model, and determining the language characteristics of the user from the model output.
Optionally, the acoustic features include pause features and/or fundamental frequency features;
the pause features include at least one of the following: the ratio of the total pause duration of the voice data to the duration of the voice data; the number of pauses in the voice data shorter than a first preset duration T1; the number of pauses in the voice data longer than a second preset duration T2; and the total number of pauses in the voice data, where T1 < T2;
the fundamental frequency features include at least one of the following: the mean fundamental frequency of the voice data, the variance of the fundamental frequency of the voice data, the maximum fundamental frequency of the voice data, and the minimum fundamental frequency of the voice data.
Optionally, the text features include a text similarity, and extracting the text features of the voice data includes:
performing speech recognition on the voice data to obtain a converted text, and calculating the text similarity between the converted text and the specified text.
Optionally, the method further includes:
judging whether the text recognition rate of the voice data exceeds a preset threshold;
and if the text recognition rate of the voice data exceeds the preset threshold, executing the step of extracting the text similarity of the voice data.
Optionally, the text features further include:
the sentence dispersion of the specified text and the sentence dispersion of the converted text, the sentence dispersion of the specified text representing the variance of the distances between the sentence vectors of the specified text and the document vector of the specified text, and the sentence dispersion of the converted text representing the variance of the distances between the sentence vectors of the converted text and the document vector of the converted text;
and/or
a perplexity (PPL) difference, representing the difference between the PPL value of the specified text and the PPL value of the converted text.
Optionally, the speech classification model is built in the following way:
collecting sample voice data of sample users, the sample voice data including first sample voice data in which a sample user reads the specified text aloud, and/or second sample voice data in which the sample user recites the specified text, the sample users including users with normal language characteristics and users with abnormal language characteristics;
extracting the acoustic features and/or text features of the sample voice data;
determining the topology of the speech classification model;
and training the speech classification model using the topology and the acoustic features and/or text features of the sample voice data, until the language characteristics output by the speech classification model agree with the language characteristics that the sample users actually have.
The present disclosure provides an audio signal processing apparatus, the apparatus including:
a voice data collection module, for collecting voice data of a user under test, the voice data including first voice data in which the user reads a specified text aloud, and/or second voice data in which the user recites the specified text;
a feature extraction module, for extracting the acoustic features and/or text features of the voice data, the acoustic features indicating the pronunciation characteristics of the user, and the text features indicating the user's ability at the level of semantic expression;
a language characteristic determination module, for feeding the acoustic features and/or the text features as input into a pre-built speech classification model and determining the language characteristics of the user.
Optionally, the acoustic features include pause features and/or fundamental frequency features;
the pause features include at least one of the following: the ratio of the total pause duration of the voice data to the duration of the voice data; the number of pauses in the voice data shorter than a first preset duration T1; the number of pauses in the voice data longer than a second preset duration T2; and the total number of pauses in the voice data, where T1 < T2;
the fundamental frequency features include at least one of the following: the mean fundamental frequency of the voice data, the variance of the fundamental frequency of the voice data, the maximum fundamental frequency of the voice data, and the minimum fundamental frequency of the voice data.
Optionally, the text features include a text similarity,
and the feature extraction module is configured to perform speech recognition on the voice data to obtain a converted text, and to calculate the text similarity between the converted text and the specified text.
Optionally, the apparatus further includes:
a recognition rate judgment module, for judging whether the text recognition rate of the voice data exceeds a preset threshold;
the feature extraction module being configured to extract the text similarity of the voice data when the text recognition rate of the voice data exceeds the preset threshold.
Optionally, the text features further include:
the sentence dispersion of the specified text and the sentence dispersion of the converted text, the sentence dispersion of the specified text representing the variance of the distances between the sentence vectors of the specified text and the document vector of the specified text, and the sentence dispersion of the converted text representing the variance of the distances between the sentence vectors of the converted text and the document vector of the converted text;
and/or
a perplexity (PPL) difference, representing the difference between the PPL value of the specified text and the PPL value of the converted text.
Optionally, the apparatus further includes:
a sample voice data collection module, for collecting sample voice data of sample users, the sample voice data including first sample voice data in which a sample user reads the specified text aloud, and/or second sample voice data in which the sample user recites the specified text, the sample users including users with normal language characteristics and users with abnormal language characteristics;
a sample feature extraction module, for extracting the acoustic features and/or text features of the sample voice data;
a topology determination module, for determining the topology of the speech classification model;
a model training module, for training the speech classification model using the topology and the acoustic features and/or text features of the sample voice data, until the language characteristics output by the speech classification model agree with the language characteristics that the sample users actually have.
The present disclosure provides a storage device storing a plurality of instructions, the instructions being loaded by a processor to execute the steps of the above audio signal processing method.
The present disclosure provides an electronic device, the electronic device including:
the above storage device; and
a processor, for executing the instructions in the storage device.
With the scheme of the present disclosure, first voice data in which a user under test reads a specified text aloud and/or second voice data in which the user recites the specified text can be collected; from these, acoustic features indicating the user's pronunciation characteristics and/or text features indicating the user's semantic expression ability can be extracted; and after the acoustic features and/or text features are fed into the model, the language characteristics of the user can be determined. The scheme is simple and convenient to implement, its processing saves time and effort, and it requires no professional skills from the personnel involved.
Other features and advantages of the present disclosure are described in detail in the detailed description that follows.
Description of the drawings
The accompanying drawings provide a further understanding of the present disclosure and form a part of the specification; together with the detailed description below, they serve to explain the disclosure but do not limit it. In the drawings:
Fig. 1 is a flow diagram of the audio signal processing method of the present disclosure;
Fig. 2 is a flow diagram of building the speech classification model in the present disclosure;
Fig. 3 is a schematic diagram of the composition of the audio signal processing apparatus of the present disclosure;
Fig. 4 is a structural diagram of an electronic device for speech signal processing according to the present disclosure.
Detailed description
Specific embodiments of the present disclosure are described in detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described here only serve to describe and explain the disclosure and do not limit it.
Referring to Fig. 1, a flow diagram of the audio signal processing method of the present disclosure is shown. The method may include the following steps:
S101: collect voice data of a user under test, the voice data including first voice data in which the user reads a specified text aloud, and/or second voice data in which the user recites the specified text.
In the scheme of the present disclosure, the voice data of the user under test is collected first. As an example, the voice data may be collected through the microphone of a smart terminal; the smart terminal may be an everyday electronic device such as a mobile phone, a PC, a tablet, or a smart speaker, or it may be dedicated equipment. The present disclosure places no specific limitation on this.
As an example, the specified text may be a short, easily understood passage; alternatively, several texts may be stored in a database and a passage chosen from them at random when needed. It should be understood that the database may belong to the smart terminal or to a server, with the smart terminal fetching the specified text from the server when needed; the present disclosure places no specific limitation on the number of specified texts or on how they are stored and obtained.
In the scheme of the present disclosure, the language characteristics of the user under test can be predicted through speech signal processing; for example, the language characteristics may reflect whether memory is failing or whether the user's thinking is clear and well organized.
The scheme applies to any scenario that requires predicting a user's language characteristics. For example, if a post or task requires a user with good memory and clear logic, a prediction can be made with the present scheme and post matching carried out according to the result. As another example, family members can be given routine assessments of their language level: a child's language level can be predicted and targeted training arranged according to the result, or the language level of middle-aged and elderly users can be predicted and dementia-prevention training arranged accordingly. The present disclosure places no specific limitation on the application scenario.
As an example, users can be divided into two types: users with normal language characteristics, who usually have good memory and clear thinking; and users with abnormal language characteristics, who usually show memory decline, disordered thinking, and similar conditions. Correspondingly, the scheme can collect first voice data in which the user under test reads the specified text aloud, to identify disordered thinking, misreading, and similar conditions; and/or collect second voice data in which the user recites the specified text, to identify memory decline, forgetfulness, and similar conditions.
When both kinds of voice data are collected, the user under test can first be allowed to familiarize themselves with the specified text, after which the first voice data is collected; then, after an interval such as 30 seconds, the user is asked to recite the specified text and the second voice data is collected. The present disclosure places no specific limitation on the process of collecting the voice data.
S102: extract the acoustic features and/or text features of the voice data, the acoustic features indicating the pronunciation characteristics of the user under test, and the text features indicating the user's ability at the level of semantic expression.
As an example, the features of the voice data can be extracted in at least the following three ways, explained in turn below.
1. Extracting the acoustic features of the voice data
In general, users with different language characteristics have different pronunciation characteristics, so the present scheme can extract acoustic features from the voice data and analyze the language characteristics accordingly. As an example, the acoustic features may take the form of pause features and/or fundamental frequency features, explained separately below.
(1) Pause features
As an example, a speech endpoint detection tool can be used to detect the silences in the voice data; usually, the silent stretches are exactly where the user pauses. From the detected endpoints, the positions and durations of the pauses in the voice data can be determined, and the pause features obtained accordingly.
As an example, the pause features may include at least one of the following:
(a) The ratio of the total pause duration of the voice data to the duration of the voice data
Specifically, the total pause duration in the voice data can be counted first, and the ratio between the total pause duration and the duration of the voice data then calculated.
(b) The number of pauses in the voice data shorter than a first preset duration T1
In practice, people speak differently: some speak quickly, others slowly. The scheme can count the pauses shorter than T1 to determine whether the voice data contains pauses caused merely by a slow speaking rate.
(c) The number of pauses in the voice data longer than a second preset duration T2
As an example, the scheme can also count the pauses longer than T2 to determine whether the voice data contains long pauses. Such long pauses may be caused by the user's speaking habits, or by disordered or unclear thinking; in general, the latter is the more likely cause.
It should be understood that T1 < T2; for example, T1 may be 0.5 s and T2 may be 2 s, with the specific values depending on practical requirements. The present disclosure places no limitation on this.
(d) The total number of pauses in the voice data
That is, all the pauses occurring in the voice data, which may include the pauses shorter than T1, the pauses whose duration lies in the interval [T1, T2], and the pauses longer than T2, as illustrated in the sketch below.
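The statistics (a) to (d) can be computed directly from the output of an endpoint detector. The following is a minimal sketch, assuming voice activity detection has already produced a list of (start, end) speech segments in seconds; the function name, default thresholds, and example values are illustrative only, not prescribed by the scheme.

```python
def pause_features(segments, total_duration, t1=0.5, t2=2.0):
    """Pause statistics (a)-(d) from VAD speech segments given as (start, end) seconds."""
    # Pauses are the gaps between consecutive speech segments.
    pauses = [nxt[0] - cur[1] for cur, nxt in zip(segments, segments[1:])]
    pauses = [p for p in pauses if p > 0]
    return {
        "pause_ratio": sum(pauses) / total_duration,       # (a)
        "short_pauses": sum(1 for p in pauses if p < t1),  # (b) duration < T1
        "long_pauses": sum(1 for p in pauses if p > t2),   # (c) duration > T2
        "total_pauses": len(pauses),                       # (d)
    }

# Example: three speech segments in a 12-second recording.
print(pause_features([(0.0, 3.1), (3.4, 7.0), (9.5, 11.8)], 12.0))
```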
(2) Fundamental frequency features
In general, a speech signal can be divided into unvoiced and voiced sounds according to whether the vocal cords vibrate during pronunciation. Voiced sounds carry most of the energy in speech and show clear periodicity in the time domain, while unvoiced sounds resemble white noise and show no clear periodicity. When a voiced sound is produced, the airflow through the glottis sets the vocal cords into relaxation oscillation, producing a quasi-periodic train of excitation pulses; the frequency of this vocal-cord vibration is called the fundamental frequency, or F0 for short.
The fundamental frequency generally depends on a person's vocal cords, pronunciation habits, and so on, and can reflect personal characteristics to some extent. In practice, users with different language characteristics may show different fundamental frequency features; for example, users with abnormal language characteristics usually speak with a lower fundamental frequency and a more monotonous pronunciation, so the fundamental frequency varies relatively little. The fundamental frequency features of the voice data can therefore be extracted.
As an example, the voice data can be divided into frames, the fundamental frequency of each frame extracted, and, from the per-frame values, at least one of the following fundamental frequency features obtained: the mean fundamental frequency of the voice data, the variance of the fundamental frequency of the voice data, the maximum fundamental frequency of the voice data, and the minimum fundamental frequency of the voice data. For example, the per-frame fundamental frequency can be estimated with the autocorrelation method; the present disclosure places no specific limitation on this.
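A minimal sketch of this step follows: it frames the signal, estimates a per-frame fundamental frequency with the autocorrelation method, and summarizes the voiced frames. The frame length, search range, and voicing threshold are illustrative assumptions, and at least one voiced frame is assumed to be present.

```python
import numpy as np

def f0_features(signal, sr, frame_len=0.04, fmin=50.0, fmax=400.0):
    """Per-frame F0 by autocorrelation, summarized over voiced frames."""
    n = int(frame_len * sr)
    lo, hi = int(sr / fmax), int(sr / fmin)                   # candidate pitch lags
    f0s = []
    for start in range(0, len(signal) - n, n):
        frame = signal[start:start + n] * np.hanning(n)
        ac = np.correlate(frame, frame, mode="full")[n - 1:]  # lags 0 .. n-1
        if ac[0] <= 0:                                        # (near-)silent frame
            continue
        lag = lo + int(np.argmax(ac[lo:hi]))
        if ac[lag] / ac[0] > 0.3:                             # crude voicing decision
            f0s.append(sr / lag)
    f0s = np.asarray(f0s)                                     # assumes some voiced frames
    return {"f0_mean": float(f0s.mean()), "f0_var": float(f0s.var()),
            "f0_max": float(f0s.max()), "f0_min": float(f0s.min())}
```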
It should be understood that the above acoustic features may be the acoustic features extracted from the first voice data, and/or the acoustic features extracted from the second voice data.
2. Extracting the text features of the voice data
In general, users with different language characteristics differ in their semantic expression ability, which is reflected in the text features extracted from the voice data, as explained below.
As an example, speech recognition can be performed on the voice data to obtain a converted text, and the text similarity between the converted text and the specified text then calculated. For example, a first text similarity can be calculated between the specified text and the first converted text obtained from the first voice data; and/or a second text similarity can be calculated between the specified text and the second converted text obtained from the second voice data.
A user with abnormal language characteristics may misread the specified text, or insert unrelated words or sentences, when reading it aloud; the scheme can extract the first text similarity to identify such conditions. Likewise, such a user may easily forget parts of the specified text when reciting it; the scheme can extract the second text similarity to identify that.
For example, the text similarity in the present scheme may take at least one of the following three forms:
(1) The text similarity may take the form of a topic similarity between the texts
For example, a topic model can first be used to extract the topics of the converted text and the specified text, and the topic similarity then calculated. As an example, the text topics can be extracted with LSA (Latent Semantic Analysis) and the similarity calculated with the cosine distance. The present disclosure places no specific limitation on the methods of extracting the text topics or calculating the similarity.
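As a sketch of this variant, assuming scikit-learn is available, an LSA space can be fitted on a background corpus and the two texts compared in it. Here `corpus` and `n_topics` are illustrative assumptions (the corpus must contain at least `n_topics` documents), and Chinese text would first need word segmentation so the vectorizer sees space-separated tokens.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

def topic_similarity(spec_text, conv_text, corpus, n_topics=20):
    """LSA topic similarity between the specified text and the converted text."""
    tfidf = TfidfVectorizer()
    lsa = TruncatedSVD(n_components=n_topics).fit(tfidf.fit_transform(corpus))
    a, b = lsa.transform(tfidf.transform([spec_text, conv_text]))
    return float(cosine_similarity([a], [b])[0, 0])
```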
(2) The text similarity may take the form of a content similarity between the texts
For example, the converted text and the specified text can each be segmented into words, and the word vectors of the words combined by weighted summation to obtain a document vector for each text; the content similarity between the document vector of the converted text and the document vector of the specified text is then calculated. As an example, the similarity can be calculated with the cosine distance; the present disclosure places no specific limitation on this.
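The content similarity can be sketched as follows, assuming `wv` is some pre-trained token-to-vector mapping (for example a word2vec lookup table) and that both texts are already word-segmented into token lists; the optional per-word weights stand in for the weighted summation described above.

```python
import numpy as np

def doc_vector(tokens, wv, weights=None):
    """Weighted sum of word vectors, averaged into a document vector."""
    vecs = [np.asarray(wv[t]) * (weights.get(t, 1.0) if weights else 1.0)
            for t in tokens if t in wv]
    return np.mean(vecs, axis=0)

def content_similarity(spec_tokens, conv_tokens, wv):
    """Cosine similarity between the document vectors of the two texts."""
    a, b = doc_vector(spec_tokens, wv), doc_vector(conv_tokens, wv)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```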
(3) The text similarity may take the form of the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) value of the texts
The ROUGE value is a common metric in the field of natural language understanding; in the present scheme it can indicate the degree of agreement between the specified text and the converted text. By matching the strings that occur in both the converted text and the specified text, it expresses how much of the content the user under test was able to say correctly. As an example, the ROUGE value may take the form of the following metrics:
Precision, indicating the proportion of the n-grams in the converted text that are correct, which reflects the correctness of the reading or recitation;
Recall, indicating the proportion of the n-grams in the specified text that were correctly read or recited, i.e. how much of the specified text was correctly recalled;
F value (F-measure), a combined metric of precision and recall, which can be calculated as:
F = 2PR / (P + R), where P is the precision and R is the recall.
If the converted text is identical to the specified text, each metric has the value 1; if the converted text partially coincides with the specified text, each metric lies in (0, 1). The calculation of ROUGE values can follow the related art and is not detailed here.
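The three metrics can be computed from n-gram counts as in the sketch below; it treats both texts as token lists and is a simplified illustration of ROUGE-n rather than a full ROUGE implementation.

```python
from collections import Counter

def rouge_n(spec_tokens, conv_tokens, n=1):
    """Precision, recall and F value over the n-grams shared by the two texts."""
    ngrams = lambda seq: Counter(tuple(seq[i:i + n])
                                 for i in range(len(seq) - n + 1))
    ref, hyp = ngrams(spec_tokens), ngrams(conv_tokens)
    overlap = sum((ref & hyp).values())       # clipped count of common n-grams
    p = overlap / max(sum(hyp.values()), 1)   # correctness of what was said
    r = overlap / max(sum(ref.values()), 1)   # how much of the text was recalled
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```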
As an example, before the text similarity is extracted, it can first be judged whether the text recognition rate of the voice data exceeds a preset threshold. If it does, the accuracy of the speech-to-text conversion is high, and the text similarity calculated from it is correspondingly accurate. For example, the threshold may be set to 95%, with the specific value depending on practical requirements; the present disclosure places no limitation on this. It should be understood that this judgment may be applied to the text recognition rate of the first voice data and/or that of the second voice data; the two kinds of voice data may use the same preset threshold or different preset thresholds, and the present disclosure places no specific limitation on this.
As an example, the text features in the present scheme may further include at least one of the following:
(1) The sentence dispersion of the specified text and the sentence dispersion of the converted text
As introduced above, a user with abnormal language characteristics may misread the text or insert unrelated words or sentences; the scheme can extract the sentence dispersion and use the scatter of the sentences to identify such conditions.
As an example, the sentence dispersion may take the form of the sentence dispersion of the specified text and the sentence dispersion of the converted text. The sentence dispersion of the specified text represents the variance of the distances between the sentence vectors of the specified text and the document vector of the specified text; the sentence dispersion of the converted text represents the variance of the distances between the sentence vectors of the converted text and the document vector of the converted text.
For example, the specified text can be segmented into words and the word vectors combined by weighted summation to obtain the sentence vector of each sentence of the specified text and the document vector of the specified text; the variance of the distances between each sentence vector and the document vector is then calculated as the sentence dispersion of the specified text. The sentence dispersion of the converted text can be calculated in the same way and is not detailed here; specifically, it may take the form of the sentence dispersion of the first converted text and/or the sentence dispersion of the second converted text.
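Given sentence vectors and a document vector built as in the content-similarity sketch above, the dispersion itself reduces to a variance of distances, as in this minimal illustration.

```python
import numpy as np

def sentence_dispersion(sentence_vecs, doc_vec):
    """Variance of the distances between each sentence vector and the document vector."""
    dists = [np.linalg.norm(np.asarray(v) - np.asarray(doc_vec))
             for v in sentence_vecs]
    return float(np.var(dists))
```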
(2) The PPL (perplexity) difference
PPL is an important metric in the field of language modeling, mainly used to reflect whether a text sequence is plausible. For a user with abnormal language characteristics, the converted text may be incoherent; the scheme can calculate the PPL difference between the PPL value of the specified text and the PPL value of the converted text and identify such conditions accordingly.
The extraction of the PPL values of the specified text and the converted text can follow the related art and is not detailed here. Specifically, the PPL difference may be the difference between the PPL value of the specified text and the PPL value of the first converted text, and/or the difference between the PPL value of the specified text and the PPL value of the second converted text.
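As a sketch, assuming a language model exposes a `log_prob(history, token)` callable returning log-probabilities in nats (a hypothetical interface, not a specific library API), the PPL difference can be computed as follows.

```python
import math

def perplexity(tokens, log_prob):
    """Perplexity of a token sequence under the language model."""
    nll = -sum(log_prob(tokens[:i], tokens[i]) for i in range(len(tokens)))
    return math.exp(nll / max(len(tokens), 1))

def ppl_difference(spec_tokens, conv_tokens, log_prob):
    # A positive difference means the converted text reads as less fluent
    # than the specified text under the language model.
    return perplexity(conv_tokens, log_prob) - perplexity(spec_tokens, log_prob)
```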
3. Extracting both the acoustic features and the text features of the voice data
That is, the acoustic features and the text features described above can also be combined to identify the language characteristics of the user under test; see the introductions above, which are not repeated here.
S103: feed the acoustic features and/or the text features as input into the pre-built speech classification model, and determine the language characteristics of the user under test from the model output.
After the acoustic features and/or text features have been extracted from the voice data of the user under test, the pre-built speech classification model can be used for processing, and the language characteristics of the user output.
As an example, a single model prediction can be made for a user under test; alternatively, several predictions can be made, and the user's language characteristics determined from the mean of the results, or the language characteristic occurring most often among the results taken as the user's language characteristic. The choice depends on practical requirements; the present disclosure places no specific limitation on the number of predictions or on how the language characteristics are determined.
As the description above shows, the present scheme is simple and convenient to implement, its processing saves time and effort, and it requires no professional skills from the personnel involved. As an example, when the scheme predicts the language level of middle-aged and elderly users, the language characteristics determined by the model are not meant to replace conventional hospital examinations but can assist in their judgment; the model prediction only requires the user under test to record voice data, and the processing itself does not act directly on the user and has no effect whatsoever on the user's physical functions.
The process of building the speech classification model in the present scheme is explained below. Referring to the flow chart shown in Fig. 2, it may include the following steps:
S201: collect sample voice data of sample users, the sample voice data including first sample voice data in which a sample user reads the specified text aloud, and/or second sample voice data in which the sample user recites the specified text, the sample users including users with normal language characteristics and users with abnormal language characteristics.
For model training, the sample voice data of a large number of sample users can be collected. As introduced at S101 above, the sample voice data may take the form of first sample voice data of reading the specified text aloud, and/or second sample voice data of reciting the specified text; this is not repeated here.
As an example, the sample users may include users with normal language characteristics and users with abnormal language characteristics. For example, the age groups of the sample users can be kept as similar as possible, which helps reduce the influence on classification accuracy of the physiological differences caused by age.
S202: extract the acoustic features and/or text features of the sample voice data.
The specific process is as introduced at S102 above and is not detailed here.
S203: determine the topology of the speech classification model.
As an example, the topology in the present scheme may take the form of a CNN (Convolutional Neural Network), an RNN (Recurrent Neural Network), a DNN (Deep Neural Network), or the like; the present disclosure places no specific limitation on this.
As an example, the neural network may include an input layer, hidden layers, and an output layer. The input layer may take the feature vector obtained by splicing the acoustic features and/or the text features; there may be one or more hidden layers, each with 16 to 32 nodes and sigmoid as the activation function; and the output layer may contain 2 output nodes representing users with normal language characteristics and users with abnormal language characteristics respectively, for example with '0' indicating a user with normal language characteristics and '1' a user with abnormal language characteristics, or it may contain 1 output node indicating the probability that the user under test is identified as a user with normal language characteristics. The present disclosure places no limitation on the specific form of each layer of the neural network.
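A minimal PyTorch sketch of the DNN variant just described follows; the class name and layer sizes are illustrative assumptions, with the two output nodes standing for users with normal and abnormal language characteristics.

```python
import torch.nn as nn

class SpeechClassifier(nn.Module):
    """Spliced acoustic/text feature vector in, 2-way language classification out."""
    def __init__(self, feat_dim, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden),  # one hidden layer of 16-32 units
            nn.Sigmoid(),                 # sigmoid activation, as described
            nn.Linear(hidden, 2),         # node 0: normal, node 1: abnormal
        )

    def forward(self, x):
        return self.net(x)
```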
S204: train the speech classification model using the topology and the acoustic features and/or text features of the sample voice data, until the language characteristics output by the speech classification model agree with the language characteristics that the sample users actually have.
After the topology of the model has been determined and the acoustic features and/or text features of the sample voice data extracted, model training can proceed. As an example, the training may use the cross-entropy criterion and update and optimize the model parameters with ordinary stochastic gradient descent, ensuring that when training is complete, the language characteristics predicted by the model agree with those the sample users actually have. Here, agreement between the output of the speech classification model and the language characteristics of the sample users may mean that the predicted language characteristics are identical to the actual ones, or that the accuracy of the predictions reaches a preset value such as 90%; the present disclosure places no specific limitation on this.
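Continuing the sketch above, the cross-entropy criterion and stochastic gradient descent mentioned here take their usual form; `train_loader`, the feature dimension, and the hyperparameters are illustrative assumptions.

```python
import torch

model = SpeechClassifier(feat_dim=16)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(100):
    for feats, labels in train_loader:    # labels: 0 = normal, 1 = abnormal
        optimizer.zero_grad()
        loss = criterion(model(feats), labels)
        loss.backward()
        optimizer.step()
```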
It should be understood that the speech classification model of the present scheme is mainly based on the pronunciation and/or semantic expression characteristics of users with normal language characteristics and users with abnormal language characteristics; through statistical analysis and model training, it obtains classification rules for the different language characteristics, and then determines the language characteristics of an ordinary user, i.e. the user under test, according to those rules.
Referring to Fig. 3, a schematic diagram of the composition of the audio signal processing apparatus of the present disclosure is shown. The apparatus may include:
a voice data collection module 301, for collecting voice data of a user under test, the voice data including first voice data in which the user reads a specified text aloud, and/or second voice data in which the user recites the specified text;
a feature extraction module 302, for extracting the acoustic features and/or text features of the voice data, the acoustic features indicating the pronunciation characteristics of the user, and the text features indicating the user's ability at the level of semantic expression;
a language characteristic determination module 303, for feeding the acoustic features and/or the text features as input into the pre-built speech classification model and determining the language characteristics of the user.
Optionally, the acoustic features include pause features and/or fundamental frequency features;
the pause features include at least one of the following: the ratio of the total pause duration of the voice data to the duration of the voice data; the number of pauses in the voice data shorter than a first preset duration T1; the number of pauses in the voice data longer than a second preset duration T2; and the total number of pauses in the voice data, where T1 < T2;
the fundamental frequency features include at least one of the following: the mean fundamental frequency of the voice data, the variance of the fundamental frequency of the voice data, the maximum fundamental frequency of the voice data, and the minimum fundamental frequency of the voice data.
Optionally, the text features include a text similarity,
and the feature extraction module is configured to perform speech recognition on the voice data to obtain a converted text, and to calculate the text similarity between the converted text and the specified text.
Optionally, the apparatus further includes:
a recognition rate judgment module, for judging whether the text recognition rate of the voice data exceeds a preset threshold;
the feature extraction module being configured to extract the text similarity of the voice data when the text recognition rate of the voice data exceeds the preset threshold.
Optionally, the text features further include:
the sentence dispersion of the specified text and the sentence dispersion of the converted text, the sentence dispersion of the specified text representing the variance of the distances between the sentence vectors of the specified text and the document vector of the specified text, and the sentence dispersion of the converted text representing the variance of the distances between the sentence vectors of the converted text and the document vector of the converted text;
and/or
a perplexity (PPL) difference, representing the difference between the PPL value of the specified text and the PPL value of the converted text.
Optionally, the apparatus further includes:
a sample voice data collection module, for collecting sample voice data of sample users, the sample voice data including first sample voice data in which a sample user reads the specified text aloud, and/or second sample voice data in which the sample user recites the specified text, the sample users including users with normal language characteristics and users with abnormal language characteristics;
a sample feature extraction module, for extracting the acoustic features and/or text features of the sample voice data;
a topology determination module, for determining the topology of the speech classification model;
a model training module, for training the speech classification model using the topology and the acoustic features and/or text features of the sample voice data, until the language characteristics output by the speech classification model agree with the language characteristics that the sample users actually have.
Regarding the apparatus in the above embodiment, the specific way each module performs its operations has been described in detail in the embodiments of the method, and is not elaborated here.
Referring to Fig. 4, a structural diagram of an electronic device 400 for speech signal processing according to the present disclosure is shown. Referring to Fig. 4, the electronic device 400 includes a processing component 401, which further includes one or more processors, and storage resources, represented by a storage medium 402, for storing instructions executable by the processing component 401, such as an application program. The application program stored in the storage medium 402 may include one or more modules, each corresponding to a set of instructions. The processing component 401 is configured to execute the instructions so as to perform the above audio signal processing method.
The electronic device 400 may also include a power component 403 configured to manage the power supply of the electronic device 400, a wired or wireless network interface 404 configured to connect the electronic device 400 to a network, and an input/output (I/O) interface 405. The electronic device 400 can operate based on an operating system stored in the storage medium 402, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or similar.
The preferred embodiments of the present disclosure have been described in detail above with reference to the accompanying drawings; however, the disclosure is not limited to the specific details of the above embodiments. Within the scope of the technical concept of the disclosure, many simple variations can be made to the technical scheme of the disclosure, and these simple variations all belong to the protection scope of the disclosure.
It should further be noted that the specific technical features described in the above embodiments can, where not contradictory, be combined in any suitable manner; to avoid unnecessary repetition, the disclosure does not separately describe the various possible combinations.
In addition, the various embodiments of the disclosure can also be combined arbitrarily; as long as a combination does not depart from the idea of the disclosure, it should likewise be regarded as content disclosed by the disclosure.

Claims (14)

1. An audio signal processing method, characterized in that the method includes:
collecting voice data of a user under test, the voice data including first voice data in which the user reads a specified text aloud, and/or second voice data in which the user recites the specified text;
extracting acoustic features and/or text features of the voice data, the acoustic features indicating the pronunciation characteristics of the user, and the text features indicating the user's ability at the level of semantic expression;
feeding the acoustic features and/or the text features as input into a pre-built speech classification model, and determining the language characteristics of the user from the model output.
2. The method according to claim 1, characterized in that the acoustic features include pause features and/or fundamental frequency features;
the pause features include at least one of the following: the ratio of the total pause duration of the voice data to the duration of the voice data; the number of pauses in the voice data shorter than a first preset duration T1; the number of pauses in the voice data longer than a second preset duration T2; and the total number of pauses in the voice data, where T1 < T2;
the fundamental frequency features include at least one of the following: the mean fundamental frequency of the voice data, the variance of the fundamental frequency of the voice data, the maximum fundamental frequency of the voice data, and the minimum fundamental frequency of the voice data.
3. The method according to claim 1, characterized in that the text features include a text similarity, and extracting the text features of the voice data includes:
performing speech recognition on the voice data to obtain a converted text, and calculating the text similarity between the converted text and the specified text.
4. The method according to claim 3, characterized in that the method further includes:
judging whether the text recognition rate of the voice data exceeds a preset threshold;
and if the text recognition rate of the voice data exceeds the preset threshold, executing the step of extracting the text similarity of the voice data.
5. The method according to claim 3 or 4, characterized in that the text features further include:
the sentence dispersion of the specified text and the sentence dispersion of the converted text, the sentence dispersion of the specified text representing the variance of the distances between the sentence vectors of the specified text and the document vector of the specified text, and the sentence dispersion of the converted text representing the variance of the distances between the sentence vectors of the converted text and the document vector of the converted text;
and/or
a perplexity (PPL) difference, representing the difference between the PPL value of the specified text and the PPL value of the converted text.
6. The method according to claim 1, characterized in that the speech classification model is built in the following way:
collecting sample voice data of sample users, the sample voice data including first sample voice data in which a sample user reads the specified text aloud, and/or second sample voice data in which the sample user recites the specified text, the sample users including users with normal language characteristics and users with abnormal language characteristics;
extracting the acoustic features and/or text features of the sample voice data;
determining the topology of the speech classification model;
and training the speech classification model using the topology and the acoustic features and/or text features of the sample voice data, until the language characteristics output by the speech classification model agree with the language characteristics that the sample users actually have.
7. An audio signal processing apparatus, characterized in that the apparatus includes:
a voice data collection module, for collecting voice data of a user under test, the voice data including first voice data in which the user reads a specified text aloud, and/or second voice data in which the user recites the specified text;
a feature extraction module, for extracting the acoustic features and/or text features of the voice data, the acoustic features indicating the pronunciation characteristics of the user, and the text features indicating the user's ability at the level of semantic expression;
a language characteristic determination module, for feeding the acoustic features and/or the text features as input into a pre-built speech classification model and determining the language characteristics of the user.
8. The apparatus according to claim 7, characterized in that the acoustic features include pause features and/or fundamental frequency features;
the pause features include at least one of the following: the ratio of the total pause duration of the voice data to the duration of the voice data; the number of pauses in the voice data shorter than a first preset duration T1; the number of pauses in the voice data longer than a second preset duration T2; and the total number of pauses in the voice data, where T1 < T2;
the fundamental frequency features include at least one of the following: the mean fundamental frequency of the voice data, the variance of the fundamental frequency of the voice data, the maximum fundamental frequency of the voice data, and the minimum fundamental frequency of the voice data.
9. The apparatus according to claim 7, characterized in that the text features include a text similarity,
and the feature extraction module is configured to perform speech recognition on the voice data to obtain a converted text, and to calculate the text similarity between the converted text and the specified text.
10. The apparatus according to claim 9, characterized in that the apparatus further includes:
a recognition rate judgment module, for judging whether the text recognition rate of the voice data exceeds a preset threshold;
the feature extraction module being configured to extract the text similarity of the voice data when the text recognition rate of the voice data exceeds the preset threshold.
11. The apparatus according to claim 9 or 10, characterized in that the text features further include:
the sentence dispersion of the specified text and the sentence dispersion of the converted text, the sentence dispersion of the specified text representing the variance of the distances between the sentence vectors of the specified text and the document vector of the specified text, and the sentence dispersion of the converted text representing the variance of the distances between the sentence vectors of the converted text and the document vector of the converted text;
and/or
a perplexity (PPL) difference, representing the difference between the PPL value of the specified text and the PPL value of the converted text.
12. The apparatus according to claim 7, characterized in that the apparatus further includes:
a sample voice data collection module, for collecting sample voice data of sample users, the sample voice data including first sample voice data in which a sample user reads the specified text aloud, and/or second sample voice data in which the sample user recites the specified text, the sample users including users with normal language characteristics and users with abnormal language characteristics;
a sample feature extraction module, for extracting the acoustic features and/or text features of the sample voice data;
a topology determination module, for determining the topology of the speech classification model;
a model training module, for training the speech classification model using the topology and the acoustic features and/or text features of the sample voice data, until the language characteristics output by the speech classification model agree with the language characteristics that the sample users actually have.
13. A storage device storing a plurality of instructions, characterized in that the instructions are loaded by a processor to execute the steps of the method according to any one of claims 1 to 6.
14. An electronic device, characterized in that the electronic device includes:
the storage device according to claim 13; and
a processor, for executing the instructions in the storage device.
CN201711479955.XA 2017-12-29 2017-12-29 Audio signal processing method and device, storage medium, electronic equipment Pending CN108320734A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711479955.XA CN108320734A (en) 2017-12-29 2017-12-29 Audio signal processing method and device, storage medium, electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711479955.XA CN108320734A (en) 2017-12-29 2017-12-29 Audio signal processing method and device, storage medium, electronic equipment

Publications (1)

Publication Number Publication Date
CN108320734A 2018-07-24

Family

ID=62893510

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711479955.XA Pending CN108320734A (en) 2017-12-29 2017-12-29 Audio signal processing method and device, storage medium, electronic equipment

Country Status (1)

Country Link
CN (1) CN108320734A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100298649A1 (en) * 2007-11-02 2010-11-25 Siegbert Warkentin System and methods for assessment of the aging brain and its brain disease induced brain dysfunctions by speech analysis
CN101739868A (en) * 2008-11-19 2010-06-16 中国科学院自动化研究所 Automatic evaluation and diagnosis method of text reading level for oral test
US20100174533A1 (en) * 2009-01-06 2010-07-08 Regents Of The University Of Minnesota Automatic measurement of speech fluency
JP2011255106A (en) * 2010-06-11 2011-12-22 Nagoya Institute Of Technology Cognitive dysfunction danger computing device, cognitive dysfunction danger computing system, and program
US20120065977A1 (en) * 2010-09-09 2012-03-15 Rosetta Stone, Ltd. System and Method for Teaching Non-Lexical Speech Effects
CN103251386A (en) * 2011-12-20 2013-08-21 台达电子工业股份有限公司 Apparatus and method for voice assisted medical diagnosis
CN105845134A (en) * 2016-06-14 2016-08-10 科大讯飞股份有限公司 Spoken language evaluation method through freely read topics and spoken language evaluation system thereof
CN107145739A (en) * 2017-05-07 2017-09-08 黄石德龙自动化科技有限公司 Method of discrimination based on Alzheimer disease under a variety of data
CN107316638A (en) * 2017-06-28 2017-11-03 北京粉笔未来科技有限公司 A kind of poem recites evaluating method and system, a kind of terminal and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
邓杰娟 (DENG Jiejuan), "Research on an Aphasia Auxiliary Diagnosis and Rehabilitation Treatment System Based on Speech Recognition Technology", China Excellent Master's Theses Full-text Database, Engineering Science and Technology II *

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109147780B (en) * 2018-08-15 2023-03-03 重庆柚瓣家科技有限公司 Voice recognition method and system under free chat scene
CN109147780A (en) * 2018-08-15 2019-01-04 重庆柚瓣家科技有限公司 Audio recognition method and system under free chat scenario
CN109741732B (en) * 2018-08-30 2022-06-21 京东方科技集团股份有限公司 Named entity recognition method, named entity recognition device, equipment and medium
CN109741732A (en) * 2018-08-30 2019-05-10 京东方科技集团股份有限公司 Name entity recognition method, name entity recognition device, equipment and medium
US11514891B2 (en) 2018-08-30 2022-11-29 Beijing Boe Technology Development Co., Ltd. Named entity recognition method, named entity recognition equipment and medium
WO2020043123A1 (en) * 2018-08-30 2020-03-05 京东方科技集团股份有限公司 Named-entity recognition method, named-entity recognition apparatus and device, and medium
CN111354366A (en) * 2018-12-20 2020-06-30 沈阳新松机器人自动化股份有限公司 Abnormal sound detection method and abnormal sound detection device
CN111354366B (en) * 2018-12-20 2023-06-16 沈阳新松机器人自动化股份有限公司 Abnormal sound detection method and abnormal sound detection device
CN109657186A (en) * 2018-12-27 2019-04-19 广州势必可赢网络科技有限公司 A kind of demographic method, system and relevant apparatus
CN109754822A (en) * 2019-01-22 2019-05-14 平安科技(深圳)有限公司 The method and apparatus for establishing Alzheimer's disease detection model
WO2020151155A1 (en) * 2019-01-22 2020-07-30 平安科技(深圳)有限公司 Method and device for building alzheimer's disease detection model
CN110797016A (en) * 2019-02-26 2020-02-14 北京嘀嘀无限科技发展有限公司 Voice recognition method and device, electronic equipment and storage medium
CN110600015A (en) * 2019-09-18 2019-12-20 北京声智科技有限公司 Voice dense classification method and related device
CN112825248A (en) * 2019-11-19 2021-05-21 阿里巴巴集团控股有限公司 Voice processing method, model training method, interface display method and equipment
CN111583907B (en) * 2020-04-15 2023-08-15 北京小米松果电子有限公司 Information processing method, device and storage medium
CN111583907A (en) * 2020-04-15 2020-08-25 北京小米松果电子有限公司 Information processing method, device and storage medium
CN112420028A (en) * 2020-12-03 2021-02-26 上海欣方智能系统有限公司 System and method for performing semantic recognition on voice signal
CN112420028B (en) * 2020-12-03 2024-03-19 上海欣方智能系统有限公司 System and method for carrying out semantic recognition on voice signals
CN114596960B (en) * 2022-03-01 2023-08-08 中山大学 Alzheimer's disease risk prediction method based on neural network and natural dialogue
CN114596960A (en) * 2022-03-01 2022-06-07 中山大学 Alzheimer's disease risk estimation method based on neural network and natural conversation
CN115116431B (en) * 2022-08-29 2022-11-18 深圳市星范儿文化科技有限公司 Audio generation method, device, equipment and storage medium based on intelligent reading kiosk
CN115116431A (en) * 2022-08-29 2022-09-27 深圳市星范儿文化科技有限公司 Audio generation method, device and equipment based on intelligent reading kiosk and storage medium
CN115188369A (en) * 2022-09-09 2022-10-14 北京探境科技有限公司 Voice recognition rate testing method, system, chip, electronic device and storage medium

Similar Documents

Publication Publication Date Title
CN108320734A (en) Audio signal processing method and device, storage medium, electronic equipment
CN111179975B (en) Voice endpoint detection method for emotion recognition, electronic device and storage medium
US11037553B2 (en) Learning-type interactive device
CN102194454B (en) Equipment and method for detecting key word in continuous speech
CN109686383B (en) Voice analysis method, device and storage medium
Alon et al. Contextual speech recognition with difficult negative training examples
CN111243569B (en) Emotional voice automatic generation method and device based on generation type confrontation network
US11810471B2 (en) Computer implemented method and apparatus for recognition of speech patterns and feedback
CN108899033B (en) Method and device for determining speaker characteristics
Levitan et al. Combining Acoustic-Prosodic, Lexical, and Phonotactic Features for Automatic Deception Detection.
Yağanoğlu Real time wearable speech recognition system for deaf persons
Devi et al. Speaker emotion recognition based on speech features and classification techniques
KR101988165B1 (en) Method and system for improving the accuracy of speech recognition technology based on text data analysis for deaf students
CN109872714A (en) A kind of method, electronic equipment and storage medium improving accuracy of speech recognition
CN110853669B (en) Audio identification method, device and equipment
CN110689895A (en) Voice verification method and device, electronic equipment and readable storage medium
CN114783464A (en) Cognitive detection method and related device, electronic equipment and storage medium
CN108269574A (en) Voice signal processing method and device, storage medium and electronic equipment
KR20210071713A (en) Speech Skill Feedback System
Gupta et al. Implicit language identification system based on random forest and support vector machine for speech
CN115512692B (en) Voice recognition method, device, equipment and storage medium
CN116052655A (en) Audio processing method, device, electronic equipment and readable storage medium
US11961510B2 (en) Information processing apparatus, keyword detecting apparatus, and information processing method
CN113593523A (en) Speech detection method and device based on artificial intelligence and electronic equipment
CN113808577A (en) Intelligent extraction method and device of voice abstract, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180724