CN112116909A - Voice recognition method, device and system

Voice recognition method, device and system

Info

Publication number
CN112116909A
CN112116909A (application CN201910538919.9A)
Authority
CN
China
Prior art keywords
target
voice
voice signal
speech
information
Prior art date
Legal status
Pending
Application number
CN201910538919.9A
Other languages
Chinese (zh)
Inventor
董勤波
周洪伟
陈展
Current Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Original Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Hangzhou Hikvision Digital Technology Co Ltd
Priority to CN201910538919.9A
Publication of CN112116909A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/28 Constructional details of speech recognition systems
    • G10L 15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L 25/24 Speech or voice analysis techniques where the extracted parameters are the cepstrum
    • G10L 2015/223 Execution procedure of a spoken command
    • G10L 2015/225 Feedback of the input speech

Abstract

The embodiments of the present application provide a speech recognition method, apparatus, and system. A target speech recognition engine corresponding to a target area identifier is determined according to the target area identifier corresponding to the target voice signal to be recognized, and the target voice signal is recognized by that engine to obtain a recognition result. Because the corresponding speech recognition engine can be accurately determined from the area identifier, an accurate speech recognition result is obtained and the accuracy of speech recognition is improved.

Description

Voice recognition method, device and system
Technical Field
The embodiment of the application relates to the technical field of audio processing, in particular to a voice recognition method, device and system.
Background
Speech recognition technology, also known as Automatic Speech Recognition (ASR), aims to convert the vocabulary content of human speech into computer-readable input, such as keystrokes, binary codes, or character sequences.
In the related art, the same speech recognition engine is used to recognize speech content for all users. However, because accents differ greatly between users from different regions, the accuracy of speech recognition in this manner is low.
Disclosure of Invention
In order to overcome the problems in the related art, embodiments of the present application provide a speech recognition method, apparatus, and system, so as to improve accuracy of speech recognition.
According to a first aspect of embodiments of the present application, there is provided a speech recognition method, the method including:
determining a target voice recognition engine corresponding to a target area identifier according to the target area identifier corresponding to a target voice signal to be recognized;
and recognizing the target voice signal by using the target speech recognition engine to obtain a recognition result.
According to a second aspect of embodiments of the present application, there is provided a speech recognition apparatus, the apparatus comprising:
a region determining module, configured to determine, according to an area identifier corresponding to a target voice signal to be recognized, a target area to which a target user belongs, wherein the target user is the user who spoke the target voice signal;
and a content recognition module, configured to recognize the content of the target voice signal by using a target speech recognition engine corresponding to the target area.
According to a third aspect of embodiments of the present application, there is provided a speech recognition system comprising a microphone, a display, and a processor:
the microphone is used for collecting voice, converting the voice into a voice signal and sending the voice signal to the processor;
the processor is used for determining a target voice recognition engine corresponding to a target area identifier according to the target area identifier corresponding to a target voice signal to be recognized, and recognizing the target voice signal by using the target voice recognition engine to obtain a recognition result;
the display is used for displaying the identification result.
The technical scheme provided by the embodiment of the application can have the following beneficial effects:
according to the method and the device, the target voice recognition engine corresponding to the target area identifier is determined according to the target area identifier corresponding to the target voice signal to be recognized, the target voice recognition engine is used for recognizing the target voice signal to obtain the recognition result, the corresponding voice recognition engine can be accurately determined according to the area identifier, the accurate voice recognition result is obtained based on the determined voice recognition engine, and the accuracy of voice recognition is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the specification.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present specification and together with the description, serve to explain the principles of the specification.
Fig. 1 is a flowchart illustrating a speech recognition method according to an embodiment of the present application.
Fig. 2 is a functional block diagram of a speech recognition apparatus according to an embodiment of the present application.
Fig. 3 is a hardware configuration diagram of a speech recognition device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present application; rather, they are merely examples of apparatus and methods consistent with certain aspects of the embodiments of the application, as recited in the appended claims.
The terminology used in the embodiments of the present application is for the purpose of describing particular embodiments of the present application only and is not intended to be limiting of embodiments of the present application. As used in the examples of this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in the embodiments of the present application to describe various information, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, the first information may also be referred to as second information, and similarly, the second information may also be referred to as first information, without departing from the scope of the embodiments of the present application. The word "if" as used herein may be interpreted as "upon", "when", or "in response to determining", depending on the context.
In some application scenarios, it is often desirable to convert human speech into computer-readable input information, such as converting speech into text. At this time, the converted information, such as text, can be obtained by using the speech recognition method provided in the embodiment of the present application.
For example, in an exemplary application scenario, a user inputs a voice into a mobile phone, the mobile phone transmits the voice to a server having a voice recognition function, and the server converts the voice into text information by using the voice recognition method provided in the embodiment of the present application, and sends the text information to the mobile phone.
In other embodiments, the terminal that receives the voice and transmits the voice to the server may also be an in-vehicle device, a smart speaker, or the like.
To address the low accuracy that results when, as in the related art, the same speech recognition engine is used to recognize speech content for all users, the embodiments of the present application determine the corresponding speech recognition engine based on the area identifier of the voice signal and use it to recognize the speech. Since the area identifier is related to the accent or dialect of the user's speech, the corresponding speech recognition engine can be accurately determined, and the user's speech can be accurately recognized by that engine.
For example, suppose user A speaks with a Zhejiang accent and user B speaks with a Sichuan accent. The related art recognizes the voice signals of user A and user B with the same speech recognition engine. According to the speech recognition method provided by the embodiments of the present application, the speech of user A is recognized by the speech recognition engine corresponding to Zhejiang, according to the area identifier corresponding to user A's voice signal, and the speech of user B is recognized by the speech recognition engine corresponding to Sichuan, according to the area identifier corresponding to user B's voice signal. Compared with the related art, the speech recognition method provided by the embodiments of the present application therefore yields a more accurate recognition result.
The following describes a speech recognition method provided in an embodiment of the present application with reference to embodiments.
Fig. 1 is a flowchart illustrating a speech recognition method according to an embodiment of the present application. As shown in fig. 1, the method may include:
s101, determining a target voice recognition engine corresponding to a target area identifier according to the target area identifier corresponding to the target voice signal to be recognized.
And S102, identifying the target voice signal by using the target voice identification engine to obtain an identification result.
In step S101, the area identifier is used to indicate the area to which the user belongs. Here, the "area to which the user belongs" refers to the area to which the accent or dialect spoken by the user belongs, not to the area of the user's current location or of the user's household registration. That is, the area identifier indicates the area of the accent or dialect the user speaks, regardless of where the user currently is or where the user is registered.
For example, suppose user C is currently located in Yunnan and has a household registration in Shanghai, but speaks with a Cantonese accent. The area identifier corresponding to user C's speech is then Guangdong.
In the embodiment of the present application, the area identifier may be represented by using any information capable of uniquely identifying the area.
In one example, the zone identity may be represented by a zone name. Such as "Shanghai", "Zhejiang", "Sichuan", etc.
In one example, the area identifier may be represented by a region abbreviation, such as "Hu" (Shanghai), "Zhe" (Zhejiang), and "Chuan" (Sichuan).
In the embodiment of the application, each area identifier corresponds to one speech recognition engine, and different area identifiers correspond to different speech recognition engines.
For example, each province in China is taken as a region, and each province corresponds to a region identifier and a speech recognition engine. Each speech recognition engine is adapted to the accent or dialect of the region.
For example, Zhejiang corresponds to speech recognition engine 1, Sichuan corresponds to speech recognition engine 2, and so on; thus, each province corresponds to one area identifier and one speech recognition engine.
In another example, all regions within the usage range of a language may be treated as one area, and those regions share one area identifier. For example, the Cantonese-speaking regions correspond to area identifier a and speech recognition engine 1, the Southern-Min-speaking regions correspond to area identifier b and speech recognition engine 2, and so on, so that all regions within the usage range of each language correspond to one speech recognition engine.
Through step S101, the target speech recognition engine for recognizing the target voice signal can be accurately determined according to the target area identifier, so that the content of the target voice signal, i.e. the recognition result of the target voice signal, can be accurately recognized.
In step S102, since the recognition result is obtained by using the target speech recognition engine corresponding to the target area identifier, the accuracy of the recognition result is higher than that of the speech recognition result in the related art.
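To make the dispatch in steps S101-S102 concrete, the following is a minimal sketch in Python; it is not part of the original disclosure, and the class name, the engine table, and the recognize() interface are all illustrative assumptions.

```python
# Minimal sketch of steps S101-S102. All names here are illustrative
# assumptions, not identifiers from the patent.

class SpeechRecognitionEngine:
    """One recognition engine per area identifier."""

    def __init__(self, area_id: str):
        self.area_id = area_id  # e.g. "Zhejiang", "Sichuan"

    def recognize(self, voice_signal: bytes) -> str:
        # A real engine would run acoustic/language models trained on
        # speech from this area; only the interface is sketched here.
        raise NotImplementedError

# Different area identifiers map to different engines (S101).
ENGINES = {
    "Zhejiang": SpeechRecognitionEngine("Zhejiang"),
    "Sichuan": SpeechRecognitionEngine("Sichuan"),
}

def recognize_speech(target_voice_signal: bytes, target_area_id: str) -> str:
    engine = ENGINES[target_area_id]               # S101: select target engine
    return engine.recognize(target_voice_signal)   # S102: recognize the signal
```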
In an exemplary implementation process, the obtaining manner of the target area identifier may include:
and determining the target area identification according to the voice characteristic information of the target voice signal.
In this embodiment, the speech feature information is extracted from the target speech signal. The voice characteristic information can accurately reflect the voice characteristics of people, such as regional accents, dialects and the like, so that the target area identification corresponding to the target voice signal can be accurately determined according to the voice characteristics. The voice characteristics of the person do not change due to the change of the geographical position of the user, so that the target area identification can be accurately determined through the voice characteristic information, and a foundation is laid for accurately selecting the voice recognition engine.
Speech feature information is typically represented by Mel-scale Frequency Cepstral Coefficients (MFCC). The MFCC can reflect the speech characteristics of the speaker, which include the speaker's accent information.
A person produces sound through the vocal tract, whose shape determines what sound is produced. The shape of the vocal tract is determined by the tongue, teeth, and so on. If the shape of the vocal tract is known accurately, the phonemes produced can be described accurately. The shape of the vocal tract manifests itself in the envelope of the short-time power spectrum of speech, and MFCC is a feature that accurately describes this envelope.
In one example, the target region identification may be recognized from speech feature information of the target speech signal using a trained region information recognition model. The region information identification model may be a deep learning network model, such as a convolutional neural network model.
When the region information recognition model is trained, the voice audio data can be divided into a plurality of data sets by region, where each data set corresponds to one region and each region corresponds to one output node of the region information recognition model. For example, if each province is treated as one region (represented by a region identifier), 34 provinces give 34 regions; the region information recognition model then has 35 output nodes, where the first 34 output nodes each correspond to one region and the 35th output node indicates that no suitable region was found for the voice audio data to be classified.
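As an illustration only, a region information recognition model with 35 output nodes could be sketched as below; the patent specifies only a deep learning network (e.g. a convolutional neural network), so the fully connected layers, their sizes, and the 13-dimensional MFCC input are assumptions.

```python
import torch
import torch.nn as nn

NUM_REGIONS = 34               # e.g. one region per province
NUM_OUTPUTS = NUM_REGIONS + 1  # 35th node: no suitable region found
MFCC_DIM = 13                  # assumed MFCC dimensionality

# Hypothetical region information recognition model: maps one MFCC
# vector to scores over 34 region identifiers plus a fallback node.
region_model = nn.Sequential(
    nn.Linear(MFCC_DIM, 128),
    nn.ReLU(),
    nn.Linear(128, 128),
    nn.ReLU(),
    nn.Linear(128, NUM_OUTPUTS),
)

mfcc_frames = torch.randn(20, MFCC_DIM)  # e.g. 20 frames of one utterance
logits = region_model(mfcc_frames)       # shape: (20, 35)
per_frame_region = logits.argmax(dim=1)  # one region index per frame
```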
In an exemplary implementation process, the obtaining manner of the target area identifier may include:
and acquiring the target area identification from the input information of the target user.
For example, in one example, after the user inputs a target voice signal to be recognized through the mobile phone, the interface of the mobile phone may display a text input box, the user enters a target area identifier in the box, such as the word "Hu" or "Shanghai", and the mobile phone then sends the entered text to the server that recognizes the speech.
In another example, after a user inputs a target voice signal to be recognized through a mobile phone, an interface of the mobile phone may display a region identifier list, region identifiers of all regions are displayed in the region identifier list, the user may select a target region identifier in the region identifier list by clicking or the like, and then the mobile phone sends the target region identifier selected by the user to a server for recognizing voice.
In an exemplary implementation process, the obtaining manner of the target area identifier may include:
and acquiring a target area identifier from local storage information of a target terminal receiving the target voice signal.
For example, in one example, a speech recognition APP (application) may be installed in a mobile phone, and the speech recognition APP can send a speech signal spoken by a user to a remote server for speech recognition, and then receive a recognition result returned by the server. In the speech recognition APP, an operation option "save area id" may be set, and a user may input and save the area id through the operation option. Thereafter, each time the user speaks a voice signal through the voice recognition APP, the voice recognition APP sends the voice signal to the server together with the stored region identification.
In an exemplary implementation process, the obtaining manner of the target area identifier may include:
and acquiring a target area identifier according to the positioning information of the target terminal for receiving the target voice signal.
Generally, a user is active in the area to which he or she belongs. In this case, the area where the user is located can be determined from the positioning information of the target terminal that receives the user's target voice signal, and the identifier of that area is used as the target area identifier.
In an exemplary implementation process, the obtaining manner of the target area identifier may include:
and acquiring a target area identifier according to the number attribution of the target terminal for receiving the target voice signal.
Generally, a user registers a mobile phone number in his or her home area, so the user's area can be inferred from the home location of the target terminal's number, and the identifier of that area can be used as the target area identifier.
It should be noted that, in an application scenario, any one or more of the above ways of obtaining the target area identifier may be used.
In one exemplary implementation, determining the target area identifier according to the speech feature information of the target speech signal includes: inputting the voice characteristic information of the target voice signal into a trained regional information recognition model, and recognizing a target region identifier by the regional information recognition model according to the input voice characteristic information, wherein the target region identifier is used for indicating a region to which a target user speaking the target voice signal belongs;
determining a target speech recognition engine corresponding to the target area identification, comprising: selecting a target speech recognition engine corresponding to the target area identification from the trained speech recognition engines of all the areas;
utilizing the target speech recognition engine to recognize the target voice signal to obtain a recognition result, comprising: inputting the target voice signal into the target speech recognition engine, and performing, by the target speech recognition engine, speech recognition on the input target voice signal to obtain the recognition result.
It should be noted that the region identifier corresponding to a speech recognition engine is the same as the region identifier corresponding to the matching output node of the region information recognition model. For example, when one province is defined as one region, the region identifier corresponding to a speech recognition engine indicates a province, and so does the region identifier corresponding to an output node of the region information recognition model. For instance, Beijing corresponds to output node 1 of the region information recognition model and to speech recognition engine A1, so the region identifier of engine A1 and of output node 1 are both Beijing; Shanghai corresponds to output node 2 and to speech recognition engine A2, so the region identifier of engine A2 and of output node 2 are both Shanghai; Guangdong corresponds to output node 3 and to speech recognition engine A3, so the region identifier of engine A3 and of output node 3 are both Guangdong; and so on. Here, Beijing, Shanghai, and Guangdong are region identifiers.
In this example, the target speech recognition engine is the speech recognition engine corresponding to the target region identifier, and the target region identifier is recognized from the speech feature information with high accuracy; consequently, the target speech recognition engine determined from the target region identifier is determined with high accuracy, and the recognition result of the target voice signal obtained by that engine is highly accurate.
The speech recognition engine may be a deep learning network model.
When the speech recognition engines are trained, the training data are classified by the regions to which they belong and a training data set is established for each region; the speech recognition engine corresponding to each region is then trained with that region's training data set.
In one exemplary implementation, the speech characteristic information is determined by:
framing the target voice signal according to a set frame length m and a set frame shift n to obtain at least one first audio frame;
windowing each first audio frame according to a preset window function to obtain a second audio frame;
performing pre-emphasis operation on each second audio frame to obtain a third audio frame, wherein the pre-emphasis operation is used for increasing the weight of the high-frequency characteristics of the second audio frames;
and extracting the voice characteristic information of the target voice signal from each third audio frame.
Here, the frame length refers to the duration of each frame of the voice signal. The frame shift refers to the time difference between the start positions of two adjacent frames.
This example divides the entire non-stationary target speech signal into a plurality of approximately stationary audio frames, so that speech features can be extracted from stationary signals.
A speech signal is non-stationary macroscopically but stationary microscopically; it has short-time stationarity (a speech signal can be considered approximately unchanged over a span of 10 ms to 30 ms). Because feature extraction requires a Fourier transform, whose input must be a stationary signal, the whole speech signal is divided into frames.
Framing cuts the entire speech signal into at least one speech segment. The duration of each resulting audio frame is generally not less than 20 ms, and the frame shift is usually about 1/2 of the frame length. The overlapping region between two adjacent audio frames prevents excessive change between them.
After framing, discontinuities occur at the beginning and end of each audio frame, and the more frames are framed, the greater the error from the original speech signal. The purpose of windowing is to make the framed speech signal continuous so that each audio frame exhibits the characteristics of a periodic function.
The windowing operation is a multiplication of the signal of each audio frame obtained after the framing with a window function.
Pre-emphasis multiplies the signal of each audio frame, in the frequency domain, by a coefficient that is positively correlated with frequency: the higher the frequency, the larger the coefficient, and the lower the frequency, the smaller the coefficient. This raises the amplitude of the high-frequency components.
By pre-emphasis, the effects caused by vocal cords and lips during the vocalization process can be eliminated to compensate for the high frequency portions of the speech signal that are suppressed by the vocalization system. Pre-emphasis can also highlight formants at high frequencies.
The purpose of extracting the voice feature is to obtain valid data in a target voice signal to be recognized, and the extracted feature is an MFCC.
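The framing, windowing, and pre-emphasis steps described above can be sketched as follows. This is an illustrative sketch, not the patent's implementation: the 25 ms / 10 ms frame parameters and the Hamming window are common defaults, and pre-emphasis is implemented as the usual first-order high-pass filter standing in for the frequency-domain weighting described in the text.

```python
import numpy as np

def frame_window_preemphasize(signal: np.ndarray, sample_rate: int,
                              frame_len_ms: float = 25.0,
                              frame_shift_ms: float = 10.0) -> np.ndarray:
    """Frame length m and frame shift n are given in ms; values assumed."""
    m = int(sample_rate * frame_len_ms / 1000)    # samples per frame
    n = int(sample_rate * frame_shift_ms / 1000)  # samples per shift
    n_frames = 1 + max(0, (len(signal) - m) // n)
    # First audio frames: split the signal with overlap between frames.
    frames = np.stack([signal[i * n:i * n + m] for i in range(n_frames)])
    # Second audio frames: apply a preset window function (Hamming).
    frames = frames * np.hamming(m)
    # Third audio frames: pre-emphasis boosts the high-frequency weight,
    # here via the common filter y[k] = x[k] - 0.97 * x[k-1] (assumed).
    frames = np.concatenate(
        [frames[:, :1], frames[:, 1:] - 0.97 * frames[:, :-1]], axis=1)
    return frames

frames = frame_window_preemphasize(np.random.randn(16000), sample_rate=16000)
```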
In an exemplary implementation, extracting speech feature information of the target speech signal from each third audio frame includes:
performing Fast Fourier Transform (FFT) on each third audio frame to obtain a first frequency spectrum;
performing triangular filtering on each first frequency spectrum to obtain a second frequency spectrum;
determining MFCCs corresponding to the second frequency spectrums according to the corresponding relation between the preset Mel cepstrum coefficients MFCCs and the frequency spectrums; each determined MFCC is determined as speech characteristic information.
The characteristics of a signal are usually difficult to see in the time domain, so the signal is usually transformed into an energy distribution in the frequency domain for observation; different energy distributions can represent the characteristics of different voices. This example converts the time-domain signal into a frequency-domain signal by FFT in order to extract speech features.
Triangular filtering refers to filtering the spectrum using a Mel-filter bank.
The Mel filter bank is an analog filter bank with the frequency-selective characteristics of the human ear. One reason the human ear can pick speech signals out of noisy background sound is that the basilar membrane of the inner ear modulates incoming signals: for different frequencies, signals within the corresponding critical bandwidth cause vibrations at different locations on the basilar membrane. A bank of band-pass filters can therefore be used to mimic human hearing and thereby reduce the effect of noise on speech. The critical bandwidth changes with frequency, consistent with the growth of the perceived Mel frequency: below 1000 Hz it is approximately linearly distributed, at about 100 Hz; above 1000 Hz it grows logarithmically. The frequency relationship is shown in the following equation (1):
Mel(f)=1127ln(1+f/700) (1)
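For example, at f = 1000 Hz, Mel(1000) = 1127 × ln(1 + 1000/700) ≈ 1127 × 0.887 ≈ 1000; the constant 1127 is chosen so that 1000 Hz corresponds to approximately 1000 Mel.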
According to the division of the critical bands, the frequency domain can be divided into a series of triangular filters, called the Mel-frequency filter bank, in which each triangular filter has an equal span on the Mel scale. The bandwidth of the filter bank covers 0 to 1/2 of the sampling rate. The frequency response of the i-th filter is shown in the following equations (2), (3), (4) and (5):
Hi(k) = 0, k < f[i-1] (2)
Hi(k) = (k - f[i-1]) / (f[i] - f[i-1]), f[i-1] ≤ k ≤ f[i] (3)
Hi(k) = (f[i+1] - k) / (f[i+1] - f[i]), f[i] ≤ k ≤ f[i+1] (4)
Hi(k) = 0, f[i+1] ≤ k (5)
wherein f [ i ] is the center frequency of the triangular filter, and satisfies the following formula (6):
Mel(f[i+1]) - Mel(f[i]) = Mel(f[i]) - Mel(f[i-1]) (6)
the MFCC parameters fully utilize the human auditory principle and the decorrelation characteristics of cepstrum, and the mel frequency cepstrum has the capability of compensating the convolution channel distortion.
Then cosine transform (DCT) is carried out, and the MFCC characteristics are obtained by taking the former N-dimensional characteristics
The correspondence between the mel-frequency cepstral coefficient MFCC and the frequency spectrum is as the following formula (7):
yt = DCT(log(Mel(FFT(xt)))) (7)
where xt is the signal of the t-th audio frame and yt is the corresponding MFCC value.
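Combining equations (1) to (7), a sketch of the spectrum-to-MFCC computation might look as follows; the filter count and the number of retained dimensions N are common defaults, not values fixed by the patent, and the frames variable is assumed to come from a framing step such as the one sketched earlier.

```python
import numpy as np
from scipy.fft import dct

def mel(f):
    return 1127.0 * np.log(1.0 + f / 700.0)    # equation (1)

def inv_mel(m):
    return 700.0 * (np.exp(m / 1127.0) - 1.0)  # inverse of equation (1)

def mfcc_from_frames(frames: np.ndarray, sample_rate: int = 16000,
                     n_filters: int = 26, n_mfcc: int = 13) -> np.ndarray:
    """Sketch of equation (7): yt = DCT(log(Mel(FFT(xt))))."""
    n_fft = frames.shape[1]
    spectrum = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2  # first spectrum
    # Triangular Mel filter bank per equations (2)-(6): center
    # frequencies f[i] equally spaced on the Mel scale, covering
    # 0 to 1/2 of the sampling rate.
    centers = inv_mel(np.linspace(0.0, mel(sample_rate / 2.0),
                                  n_filters + 2))
    bins = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    fb = np.zeros((n_filters, len(bins)))
    for i in range(1, n_filters + 1):
        rising = (bins - centers[i - 1]) / (centers[i] - centers[i - 1])
        falling = (centers[i + 1] - bins) / (centers[i + 1] - centers[i])
        fb[i - 1] = np.clip(np.minimum(rising, falling), 0.0, None)
    mel_energies = spectrum @ fb.T              # second spectrum
    log_mel = np.log(mel_energies + 1e-10)      # avoid log(0)
    return dct(log_mel, type=2, norm='ortho', axis=1)[:, :n_mfcc]

mfccs = mfcc_from_frames(frames)  # frames from the framing sketch above
```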
In an exemplary implementation, the region information recognition model recognizing the target region identifier according to the input speech feature information may include:
the region information recognition model recognizes, for each input MFCC, a reference region to which the target user belongs;
one reference region meeting a preset requirement is selected from all the reference regions, wherein the reference region meeting the preset requirement is identical to at least one of the remaining reference regions;
and the region identifier of the selected reference region is determined as the target region identifier.
In one example, the reference region meeting the preset requirement may be the reference region that occurs most frequently among all the reference regions.
For example, assume that 20 audio frames are obtained after framing the target voice signal, the region identification result corresponding to the MFCCs of 15 of those frames is region identifier A, and the result corresponding to the MFCCs of the other 5 frames is region identifier B; the most frequent result, region identifier A, is then taken as the target region identifier.
Each audio frame obtained by framing corresponds to one MFCC, and the region information recognition model recognizes one region identifier per MFCC, so a plurality of audio frames yields a plurality of region identification results. Smoothing these per-frame results after they are obtained improves the accuracy of the region identification decision, as sketched below.
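A minimal sketch of the majority-vote smoothing, under the assumption that the per-frame results are given as a list of region identifiers:

```python
from collections import Counter

def smooth_region_votes(per_frame_region_ids):
    """Pick the most frequent per-frame reference region as the
    target area identifier (the 'preset requirement' in one example)."""
    most_common_id, _ = Counter(per_frame_region_ids).most_common(1)[0]
    return most_common_id

# 15 frames vote for region A, 5 frames for region B -> A is selected.
assert smooth_region_votes(["A"] * 15 + ["B"] * 5) == "A"
```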
In one exemplary implementation, the target speech signal is obtained by:
inputting an initial voice signal to be recognized into a trained voice endpoint detection (Voice Activity Detection, VAD) model, so that the VAD model locates the voice starting point and the voice ending point in the input initial voice signal, and the non-voice signal is removed from the initial voice signal to obtain the target voice signal.
The main task of voice endpoint detection is to accurately locate the start point and end point of speech in the audio, that is, to remove the non-speech parts, such as silence and noise. This saves communication bandwidth and reduces the computation of the back-end model, and on embedded devices in particular it reduces power consumption. In a noisy environment, a well-performing VAD can remove the noise from the speech, reduce the back-end model's processing of noisy speech, and improve the accuracy of speech recognition.
In one example, the speech endpoint detection model may employ a deep learning based neural network model. The voice endpoint detection model has two output nodes, one output node is used for outputting voice signals, and the other output node is used for outputting non-voice signals.
In other examples, voice endpoint detection may also employ conventional VAD schemes, such as VAD based on energy, zero-crossing rate, or Gaussian models, as sketched below.
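For illustration, a crude energy-based VAD of the conventional kind mentioned above could look like this; the threshold heuristic is an assumption, and real systems (including the trained model described in the patent) are considerably more robust.

```python
import numpy as np

def energy_vad(frames: np.ndarray, threshold_ratio: float = 0.1) -> np.ndarray:
    """Keep only frames whose short-time energy exceeds a fraction of
    the maximum frame energy (assumed heuristic)."""
    energies = np.sum(frames ** 2, axis=1)        # short-time energy
    is_speech = energies > threshold_ratio * energies.max()
    return frames[is_speech]                      # target voice frames

speech_frames = energy_vad(frames)  # frames from the framing sketch above
```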
In an exemplary procedure, before step S101, the method may further include:
receiving a target voice signal sent by a target terminal;
after step S102, the method may further include:
and returning the identification result of the target voice signal to the target terminal so that the target terminal can inform the user of the identification result.
For example, in one example, the target terminal may notify the user of the recognition result by displaying a text corresponding to the target voice signal. In another example, the target terminal may also notify the user of the recognition result by playing a standard voice corresponding to the target voice signal. The embodiment is not limited to the form of the recognition result and the way in which the target terminal notifies the user of the recognition result.
In one exemplary process, the speech recognition method may further include:
receiving input information of the target user, sent by a target terminal, wherein the input information comprises the area identifier; or
receiving local storage information of the target terminal, sent by the target terminal, wherein the storage information comprises the area identifier; or
receiving positioning information of the target terminal, sent by the target terminal; or
receiving the number of the target terminal, sent by the target terminal.
The server may acquire the target area identification through the above information received from the target terminal. The server may obtain the target area identifier by using any one of the aforementioned target area identifier obtaining manners.
In the embodiment shown in fig. 1, the target speech recognition engine corresponding to the target area identifier is determined according to the target area identifier corresponding to the target voice signal to be recognized, and the target voice signal is recognized by that engine to obtain a recognition result. Because the corresponding speech recognition engine can be accurately determined from the area identifier, an accurate speech recognition result is obtained and the accuracy of speech recognition is improved.
Based on the method embodiment, the embodiment of the application also provides corresponding device, equipment and storage medium embodiments.
Fig. 2 is a functional block diagram of a speech recognition apparatus according to an embodiment of the present application. As shown in fig. 2, in this embodiment, the apparatus may include:
the region determining module 210 is configured to determine, according to a target region identifier corresponding to a target speech signal to be recognized, a target speech recognition engine corresponding to the target region identifier;
and the content identification module 220 is configured to identify the target speech signal by using a target speech recognition engine to obtain a recognition result.
In an exemplary implementation process, the obtaining manner of the target area identifier may include:
determining the target area identifier according to the voice feature information of the target voice signal; or
acquiring the target area identifier from the input information of the target user; or
acquiring the target area identifier from local storage information of a target terminal receiving the target voice signal; or
acquiring the target area identifier according to the positioning information of the target terminal receiving the target voice signal; or
acquiring the target area identifier according to the number attribution of the target terminal receiving the target voice signal.
In an exemplary implementation, the determining the target area identifier according to the speech feature information of the target speech signal includes:
inputting the voice characteristic information of the target voice signal into a trained regional information recognition model, and recognizing a target region identifier by the regional information recognition model according to the input voice characteristic information, wherein the target region identifier is used for indicating a region to which a target user speaking the target voice signal belongs;
the region determining module 210 is specifically configured to:
selecting a target speech recognition engine corresponding to the target area identification from the trained speech recognition engines of all the areas;
the content identification module 220 is specifically configured to:
and inputting the target voice signal into the target speech recognition engine, and performing, by the target speech recognition engine, speech recognition on the input target voice signal to obtain a recognition result.
In one exemplary implementation, the speech characteristic information is determined by:
framing the target voice signal according to a set frame length m and a set frame shift n to obtain at least one first audio frame;
windowing each first audio frame according to a preset window function to obtain a second audio frame;
performing pre-emphasis operation on each second audio frame to obtain a third audio frame, wherein the pre-emphasis operation is used for increasing the weight of the high-frequency characteristics of the second audio frames;
and extracting the voice characteristic information of the target voice signal from each third audio frame.
In an exemplary implementation, extracting speech feature information of the target speech signal from each third audio frame includes:
performing fast Fourier transform on each third audio frame to obtain a first frequency spectrum;
performing triangular filtering on each first frequency spectrum to obtain a second frequency spectrum;
determining MFCCs corresponding to the second frequency spectrums according to the corresponding relation between the preset Mel cepstrum coefficients MFCCs and the frequency spectrums; and determining each determined MFCC as the voice characteristic information.
In one exemplary implementation, the target speech signal is obtained by:
inputting an initial voice signal to be recognized into a trained voice endpoint detection model, so that a voice starting point and a voice ending point are located from the input initial voice signal by the voice endpoint detection model, and removing a non-voice signal from the initial voice signal to obtain the target voice signal.
In an exemplary implementation process, the region information recognition model recognizes a target region to which a target user belongs according to input speech feature information, including:
the region information identification model identifies a reference region to which the target user belongs according to each MFCC input;
selecting, from all the reference regions, one reference region that meets a preset requirement, wherein the reference region meeting the preset requirement is identical to at least one of the remaining reference regions;
and determining the selected reference area as the target area.
In one exemplary implementation, the speech recognition apparatus may further include:
the signal receiving module is used for receiving the target voice signal sent by a target terminal;
and the result sending module is used for returning the identification result of the target voice signal to the target terminal so that the target terminal can inform a user of the identification result.
In one exemplary implementation, the speech recognition apparatus may further include:
the input information receiving module is used for receiving the input information of the target user sent by a target terminal, wherein the input information comprises the area identifier; or
the storage information receiving module is used for receiving the local storage information of the target terminal sent by the target terminal, wherein the storage information comprises the area identifier; or
the positioning information receiving module is used for receiving the positioning information of the target terminal sent by the target terminal; or
the number receiving module is used for receiving the number of the target terminal sent by the target terminal.
The embodiment of the application also provides voice recognition equipment. Fig. 3 is a hardware configuration diagram of a speech recognition device according to an embodiment of the present application. As shown in fig. 3, the voice recognition apparatus includes: an internal bus 301, and a memory 302, a processor 303, and an external interface 304, which are connected through the internal bus, wherein,
the processor 303 is configured to read the machine-readable instructions in the memory 302 and execute the instructions to implement the following operations:
determining a target voice recognition engine corresponding to a target area identifier according to the target area identifier corresponding to a target voice signal to be recognized;
and identifying the target voice signal by using a target voice identification engine to obtain an identification result.
In an exemplary implementation process, the obtaining manner of the target area identifier includes:
determining the target area identifier according to the voice feature information of the target voice signal; or
acquiring the target area identifier from the input information of the target user; or
acquiring the target area identifier from local storage information of a target terminal receiving the target voice signal; or
acquiring the target area identifier according to the positioning information of the target terminal receiving the target voice signal; or
acquiring the target area identifier according to the number attribution of the target terminal receiving the target voice signal.
In an exemplary implementation, the determining the target area identifier according to the speech feature information of the target speech signal includes:
inputting the voice characteristic information of the target voice signal into a trained regional information recognition model, and recognizing a target region identifier by the regional information recognition model according to the input voice characteristic information, wherein the target region identifier is used for indicating a region to which a target user speaking the target voice signal belongs;
the determining a target speech recognition engine corresponding to the target area identification includes:
selecting a target speech recognition engine corresponding to the target area identification from the trained speech recognition engines of all the areas;
the recognizing the target voice signal by using the target voice recognition engine to obtain a recognition result comprises the following steps:
and inputting the target voice signal into the target speech recognition engine, and performing, by the target speech recognition engine, speech recognition on the input target voice signal to obtain a recognition result.
In one exemplary implementation, the speech characteristic information is determined by:
framing the target voice signal according to a set frame length m and a set frame shift n to obtain at least one first audio frame;
windowing each first audio frame according to a preset window function to obtain a second audio frame;
performing pre-emphasis operation on each second audio frame to obtain a third audio frame, wherein the pre-emphasis operation is used for increasing the weight of the high-frequency characteristics of the second audio frames;
and extracting the voice characteristic information of the target voice signal from each third audio frame.
In an exemplary implementation, the extracting the speech feature information of the target speech signal from each third audio frame includes:
performing fast Fourier transform on each third audio frame to obtain a first frequency spectrum;
performing triangular filtering on each first frequency spectrum to obtain a second frequency spectrum;
determining MFCCs corresponding to the second frequency spectrums according to the corresponding relation between the preset Mel cepstrum coefficients MFCCs and the frequency spectrums; and determining each determined MFCC as the voice characteristic information.
In an exemplary implementation process, the region information recognition model recognizing the target region to which the target user belongs according to the input speech feature information includes:
the region information recognition model recognizing, for each input MFCC, a reference region to which the target user belongs;
selecting, from all the reference regions, one reference region that meets a preset requirement, wherein the reference region meeting the preset requirement is identical to at least one of the remaining reference regions;
and determining the selected reference region as the target region.
In one exemplary implementation, the target speech signal is obtained by:
inputting an initial voice signal to be recognized into a trained voice endpoint detection model, so that a voice starting point and a voice ending point are located from the input initial voice signal by the voice endpoint detection model, and removing a non-voice signal from the initial voice signal to obtain the target voice signal.
In an exemplary implementation, the processor 303 may further execute the instructions to perform the following operations:
receiving the target voice signal sent by a target terminal;
and returning the identification result of the target voice signal to the target terminal so that the target terminal can inform a user of the identification result.
In an exemplary implementation, the processor 303 may further execute the instructions to perform the following operations:
receiving input information of the target user, sent by a target terminal, wherein the input information comprises the area identifier; or
receiving local storage information of the target terminal, sent by the target terminal, wherein the storage information comprises the area identifier; or
receiving positioning information of the target terminal, sent by the target terminal; or
receiving the number of the target terminal, sent by the target terminal.
Embodiments of the present application further provide a computer-readable storage medium, on which a computer program is stored, where the program, when executed by a processor, implements the following operations:
determining a target voice recognition engine corresponding to a target area identifier according to the target area identifier corresponding to a target voice signal to be recognized;
and identifying the target voice signal by using a target voice identification engine to obtain an identification result.
In an exemplary implementation process, the obtaining manner of the target area identifier includes:
determining the target area identifier according to the voice feature information of the target voice signal; or
acquiring the target area identifier from the input information of the target user; or
acquiring the target area identifier from local storage information of a target terminal receiving the target voice signal; or
acquiring the target area identifier according to the positioning information of the target terminal receiving the target voice signal; or
acquiring the target area identifier according to the number attribution of the target terminal receiving the target voice signal.
In an exemplary implementation, the determining the target area identifier according to the speech feature information of the target speech signal includes:
inputting the voice characteristic information of the target voice signal into a trained regional information recognition model, and recognizing a target region identifier by the regional information recognition model according to the input voice characteristic information, wherein the target region identifier is used for indicating a region to which a target user speaking the target voice signal belongs;
the determining a target speech recognition engine corresponding to the target area identification includes:
selecting a target speech recognition engine corresponding to the target area identification from the trained speech recognition engines of all the areas;
the recognizing the target voice signal by using the target voice recognition engine to obtain a recognition result comprises the following steps:
and inputting the target voice signal into the target speech recognition engine, and performing, by the target speech recognition engine, speech recognition on the input target voice signal to obtain a recognition result.
In one exemplary implementation, the speech characteristic information is determined by:
framing the target voice signal according to a set frame length m and a set frame shift n to obtain at least one first audio frame;
windowing each first audio frame according to a preset window function to obtain a second audio frame;
performing pre-emphasis operation on each second audio frame to obtain a third audio frame, wherein the pre-emphasis operation is used for increasing the weight of the high-frequency characteristics of the second audio frames;
and extracting the voice characteristic information of the target voice signal from each third audio frame.
In an exemplary implementation, the extracting the speech feature information of the target speech signal from each third audio frame includes:
performing fast Fourier transform on each third audio frame to obtain a first frequency spectrum;
performing triangular filtering on each first frequency spectrum to obtain a second frequency spectrum;
determining MFCCs corresponding to the second frequency spectrums according to the corresponding relation between the preset Mel cepstrum coefficients MFCCs and the frequency spectrums; and determining each determined MFCC as the voice characteristic information.
In an exemplary implementation process, the region information recognition model recognizes a target region to which a target user belongs according to input speech feature information, including:
the region information identification model identifies a reference region to which the target user belongs according to each MFCC input;
selecting, from all the reference regions, one reference region that meets a preset requirement, wherein the reference region meeting the preset requirement is identical to at least one of the remaining reference regions;
and determining the selected reference area as the target area.
In one exemplary implementation, the target speech signal is obtained by:
inputting an initial voice signal to be recognized into a trained voice endpoint detection model, so that a voice starting point and a voice ending point are located from the input initial voice signal by the voice endpoint detection model, and removing a non-voice signal from the initial voice signal to obtain the target voice signal.
In one exemplary implementation, the program when executed by the processor further performs the following:
receiving the target voice signal sent by a target terminal;
and returning the identification result of the target voice signal to the target terminal so that the target terminal can inform a user of the identification result.
In one exemplary implementation, the program when executed by the processor further performs the following:
receiving input information of the target user, sent by a target terminal, wherein the input information comprises the area identifier; or
receiving local storage information of the target terminal, sent by the target terminal, wherein the storage information comprises the area identifier; or
receiving positioning information of the target terminal, sent by the target terminal; or
receiving the number of the target terminal, sent by the target terminal.
An embodiment of the present application further provides a speech recognition system, which includes a microphone, a display, and a processor, wherein:
the microphone is used for collecting voice, converting the voice into a voice signal and sending the voice signal to the processor;
the processor is used for determining a target voice recognition engine corresponding to a target area identifier according to the target area identifier corresponding to a target voice signal to be recognized, and recognizing the target voice signal by using the target voice recognition engine to obtain a recognition result;
the display is used for displaying the identification result.
In one exemplary implementation,
the display is further used for displaying area identifiers so that a user can select the target area identifier from the displayed area identifiers;
the processor is specifically configured to determine, according to the target area identifier selected by the user, a target speech recognition engine corresponding to the target area identifier.
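Tying the system together, one simple realization of the processor's role is a registry holding one trained engine per area identifier and dispatching each signal to the matching engine. The structure below is an assumption for illustration; the engine objects and their recognize() method are placeholders.

```python
class SpeechRecognitionService:
    def __init__(self, engines):
        # engines: mapping from area identifier to a trained engine,
        # e.g. {"area_01": engine_a, "area_02": engine_b}.
        self._engines = engines

    def recognize(self, target_area_id, target_voice_signal):
        # Select the target speech recognition engine by area identifier,
        # then recognize the target voice signal with it.
        target_engine = self._engines[target_area_id]
        return target_engine.recognize(target_voice_signal)
```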
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the corresponding parts of the method embodiments for relevant details. The apparatus embodiments described above are merely illustrative: the modules described as separate parts may or may not be physically separate, and the parts shown as modules may or may not be physical modules; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution in this specification. One of ordinary skill in the art can understand and implement this without inventive effort.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
Other embodiments of the present description will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This specification is intended to cover any variations, uses, or adaptations of the specification following, in general, the principles of the specification and including such departures from the present disclosure as come within known or customary practice within the art to which the specification pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the specification being indicated by the following claims.
It will be understood that the present description is not limited to the precise arrangements described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the present description is limited only by the appended claims.
The above description is only a preferred embodiment of the present disclosure, and should not be taken as limiting the present disclosure, and any modifications, equivalents, improvements, etc. made within the spirit and principle of the present disclosure should be included in the scope of the present disclosure.

Claims (12)

1. A method of speech recognition, the method comprising:
determining a target voice recognition engine corresponding to a target area identifier according to the target area identifier corresponding to a target voice signal to be recognized;
and recognizing the target voice signal by using the target voice recognition engine to obtain a recognition result.
2. The method of claim 1, wherein the target area identifier is obtained in a manner comprising:
determining the target area identifier according to the voice characteristic information of the target voice signal; or
acquiring the target area identifier from input information of the target user; or
acquiring the target area identifier from local storage information of a target terminal receiving the target voice signal; or
acquiring the target area identifier according to positioning information of the target terminal receiving the target voice signal; or
and acquiring the target area identifier according to the number attribution of the target terminal receiving the target voice signal.
3. The method of claim 2, wherein determining the target area identifier according to the voice characteristic information of the target voice signal comprises:
inputting the voice characteristic information of the target voice signal into a trained region information recognition model, so that the region information recognition model recognizes the target area identifier according to the input voice characteristic information, wherein the target area identifier indicates the region to which the target user who utters the target voice signal belongs;
the determining a target speech recognition engine corresponding to the target area identification includes:
selecting a target speech recognition engine corresponding to the target area identification from the trained speech recognition engines of all the areas;
the recognizing the target voice signal by using the target voice recognition engine to obtain a recognition result comprises the following steps:
and inputting the target voice signal to the target voice recognition engine, so that the target voice recognition engine performs voice recognition on the input target voice signal to obtain the recognition result.
4. The method of claim 2, wherein the speech characteristic information is determined by:
framing the target voice signal according to a set frame length m and a set frame shift n to obtain at least one first audio frame;
windowing each first audio frame according to a preset window function to obtain a second audio frame;
performing a pre-emphasis operation on each second audio frame to obtain a third audio frame, wherein the pre-emphasis operation increases the weight of the high-frequency components of the second audio frame;
and extracting the voice characteristic information of the target voice signal from each third audio frame.
5. The method according to claim 4, wherein said extracting the speech feature information of the target speech signal from each third audio frame comprises:
performing a fast Fourier transform on each third audio frame to obtain a first frequency spectrum;
performing triangular filtering on each first frequency spectrum to obtain a second frequency spectrum;
determining the Mel-frequency cepstral coefficients (MFCCs) corresponding to each second frequency spectrum according to a preset correspondence between MFCCs and frequency spectra;
and determining each determined MFCC as the voice characteristic information.
6. The method of claim 5, wherein the recognizing, by the region information recognition model, a target region to which the target user belongs according to the input speech feature information comprises:
the region information recognition model identifies, for each input MFCC, a reference region to which the target user belongs;
selecting, from all the reference regions, a reference region that meets a preset requirement, wherein the reference region meeting the preset requirement is identical to at least one of the remaining reference regions;
and determining the selected reference region as the target region.
7. The method according to any one of claims 1 to 6, wherein the target speech signal is obtained by:
inputting an initial voice signal to be recognized into a trained voice endpoint detection model, so that the voice endpoint detection model locates a voice start point and a voice end point in the input initial voice signal and removes the non-voice signal from the initial voice signal to obtain the target voice signal.
8. The method of claim 1, further comprising:
receiving the target voice signal sent by a target terminal;
and returning the recognition result of the target voice signal to the target terminal so that the target terminal can notify the user of the recognition result.
9. The method of claim 1, further comprising:
receiving input information of the target user sent by a target terminal, wherein the input information comprises the area identifier; or
receiving local storage information of the target terminal sent by the target terminal, wherein the storage information comprises the area identifier; or
receiving positioning information of the target terminal sent by the target terminal; or
and receiving the number of the target terminal sent by the target terminal.
10. A speech recognition apparatus, characterized in that the apparatus comprises:
a region determining module, configured to determine, according to a target area identifier corresponding to a target voice signal to be recognized, a target speech recognition engine corresponding to the target area identifier;
and a content recognition module, configured to recognize the target voice signal by using the target speech recognition engine to obtain a recognition result.
11. A speech recognition system, comprising a microphone, a display, and a processor, wherein:
the microphone is used for collecting voice, converting the voice into a voice signal and sending the voice signal to the processor; the processor is used for determining a target voice recognition engine corresponding to a target area identifier according to the target area identifier corresponding to a target voice signal to be recognized, and recognizing the target voice signal by using the target voice recognition engine to obtain a recognition result;
the display is used for displaying the recognition result.
12. The system of claim 11, wherein:
the display is further used for displaying area identifiers so that a user can select the target area identifier from the displayed area identifiers; the processor is specifically configured to determine, according to the target area identifier selected by the user, the target speech recognition engine corresponding to the target area identifier.
CN201910538919.9A 2019-06-20 2019-06-20 Voice recognition method, device and system Pending CN112116909A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910538919.9A CN112116909A (en) 2019-06-20 2019-06-20 Voice recognition method, device and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910538919.9A CN112116909A (en) 2019-06-20 2019-06-20 Voice recognition method, device and system

Publications (1)

Publication Number Publication Date
CN112116909A (en) 2020-12-22

Family

ID=73796707

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910538919.9A Pending CN112116909A (en) 2019-06-20 2019-06-20 Voice recognition method, device and system

Country Status (1)

Country Link
CN (1) CN112116909A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103456303A (en) * 2013-08-08 2013-12-18 四川长虹电器股份有限公司 Method for controlling voice and intelligent air-conditionier system
CN106251859A (en) * 2016-07-22 2016-12-21 百度在线网络技术(北京)有限公司 Voice recognition processing method and apparatus
CN106997762A (en) * 2017-03-08 2017-08-01 广东美的制冷设备有限公司 The sound control method and device of household electrical appliance
CN107274885A (en) * 2017-05-31 2017-10-20 广东欧珀移动通信有限公司 Audio recognition method and Related product
CN109817220A (en) * 2017-11-17 2019-05-28 阿里巴巴集团控股有限公司 Audio recognition method, apparatus and system
CN108922515A (en) * 2018-05-31 2018-11-30 平安科技(深圳)有限公司 Speech model training method, audio recognition method, device, equipment and medium
CN109036437A (en) * 2018-08-14 2018-12-18 平安科技(深圳)有限公司 Accents recognition method, apparatus, computer installation and computer readable storage medium
CN108877784A (en) * 2018-09-05 2018-11-23 河海大学 A kind of robust speech recognition methods based on accents recognition
CN109767338A (en) * 2018-11-30 2019-05-17 平安科技(深圳)有限公司 Processing method, device, equipment and the readable storage medium storing program for executing of enterogastritis reimbursement process

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113572908A (en) * 2021-06-16 2021-10-29 云茂互联智能科技(厦门)有限公司 Method, device and system for reducing noise in VoIP (Voice over Internet protocol) call
CN113593599A (en) * 2021-09-02 2021-11-02 北京云蝶智学科技有限公司 Method for removing noise signal in voice signal
CN114464179A (en) * 2022-01-28 2022-05-10 达闼机器人股份有限公司 Voice interaction method, system, device, equipment and storage medium
CN114464179B (en) * 2022-01-28 2024-03-19 达闼机器人股份有限公司 Voice interaction method, system, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
KR100636317B1 (en) Distributed Speech Recognition System and method
US11056097B2 (en) Method and system for generating advanced feature discrimination vectors for use in speech recognition
Shrawankar et al. Techniques for feature extraction in speech recognition system: A comparative study
CN111816218A (en) Voice endpoint detection method, device, equipment and storage medium
JP5230103B2 (en) Method and system for generating training data for an automatic speech recognizer
CN109147796B (en) Speech recognition method, device, computer equipment and computer readable storage medium
CN102543073B (en) Shanghai dialect phonetic recognition information processing method
CN111128213B (en) Noise suppression method and system for processing in different frequency bands
CN108877784B (en) Robust speech recognition method based on accent recognition
CN112397083A (en) Voice processing method and related device
CN112116909A (en) Voice recognition method, device and system
KR100639968B1 (en) Apparatus for speech recognition and method therefor
Yang et al. BaNa: A noise resilient fundamental frequency detection algorithm for speech and music
CN110268471B (en) Method and apparatus for ASR with embedded noise reduction
CN110428853A (en) Voice activity detection method, Voice activity detection device and electronic equipment
CN111489763B (en) GMM model-based speaker recognition self-adaption method in complex environment
CN111145763A (en) GRU-based voice recognition method and system in audio
Garg et al. A comparative study of noise reduction techniques for automatic speech recognition systems
Venturini et al. On speech features fusion, α-integration Gaussian modeling and multi-style training for noise robust speaker classification
CN112466276A (en) Speech synthesis system training method and device and readable storage medium
Nijhawan et al. A new design approach for speaker recognition using MFCC and VAD
Kaminski et al. Automatic speaker recognition using a unique personal feature vector and Gaussian Mixture Models
CN112927680B (en) Voiceprint effective voice recognition method and device based on telephone channel
Boril et al. Data-driven design of front-end filter bank for Lombard speech recognition
CN112002307B (en) Voice recognition method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination