CN113963694A - Voice recognition method, voice recognition device, electronic equipment and storage medium
- Publication number: CN113963694A (application number CN202010700307.8A)
- Authority: CN (China)
- Prior art keywords: voice data, voice, target, sub, recognition
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L15/22 — Speech recognition; procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L21/0272 — Speech enhancement (e.g. noise reduction or echo cancellation); voice signal separating
- G10L25/63 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, specially adapted for estimating an emotional state
Abstract
The application discloses a voice recognition method, a voice recognition device, an electronic device and a storage medium. The method comprises: acquiring a plurality of voice data in a current voice recognition scene, wherein the plurality of voice data comprise voice data of a plurality of objects collected by a plurality of voice collectors located at different positions in the current voice recognition scene; generating, based on the plurality of voice data, target voice data associated with each object of the plurality of objects, wherein the target voice data associated with each object is derived from at least two of the plurality of voice data; and generating a voice recognition result based on the plurality of target voice data, and outputting the voice recognition result. In this way, the voice data does not need to be analyzed manually, the amount of computation for voice data analysis is reduced, and the accuracy of the voice analysis result is ensured.
Description
Technical Field
The present application relates to the field of speech processing technologies, and in particular, to a speech recognition method, a speech recognition apparatus, an electronic device, and a storage medium.
Background
With the rapid development of speech recognition technology, speech recognition has become an important way of man-machine interaction. Common speech recognition approaches in the related art include manual analysis and multi-engine recognition.
When manual analysis is used, for example in a conference, the voice information of the participants is collected and the voice information corresponding to each participant is then analyzed manually; this suffers from a large amount of data-analysis computation and/or inaccurate analysis results. When multi-engine recognition is used, the collected voice information is fed into a plurality of speech recognition engines, the confidence of each engine's recognition result is obtained, and the result with the highest confidence is taken as the final recognition result; this approach has relatively low recognition performance. Therefore, the manual analysis approach in the related art involves heavy data-analysis computation or inaccurate analysis results, while the multi-engine recognition approach has low recognition performance.
Disclosure of Invention
The application aims to provide a voice recognition method, a voice recognition device, an electronic device and a storage medium, so as to solve the problems that the manual analysis approach in the related art involves heavy data-analysis computation or inaccurate analysis results, and that the multi-engine recognition approach has low recognition performance.
The technical scheme of the application is realized as follows:
the application provides a voice recognition method, which comprises the following steps:
acquiring a plurality of voice data under a current voice recognition scene; the voice data comprise voice data of a plurality of objects under the current voice recognition scene, which are collected by a plurality of voice collectors; the plurality of voice collectors are positioned at different positions in the current voice recognition scene;
generating target speech data associated with each object of a plurality of objects based on the plurality of speech data; the target voice data associated with each object is derived from at least two voice data of the plurality of voice data;
and generating a voice recognition result based on the target voice data, and outputting the voice recognition result.
Optionally, the generating target voice data associated with each object in the plurality of objects based on the plurality of voice data includes:
segmenting each voice data in the plurality of voice data to obtain each sub-voice data set after each voice data is segmented; wherein each sub-voice data set comprises a plurality of segments of voice data;
acquiring a plurality of voiceprint characteristics associated with each sub-voice data set;
and generating target voice data associated with each object based on each sub-voice data set and a plurality of voiceprint features associated with each sub-voice data set.
Optionally, the generating target voice data associated with each object based on each sub-voice data set and a plurality of voiceprint features associated with each sub-voice data set includes:
determining a plurality of sub-voice data with the same voiceprint characteristic and the same time stamp in a plurality of sub-voice data sets;
determining target sub-voice data from the plurality of sub-voice data to obtain a plurality of target sub-voice data associated with the same voiceprint characteristic;
and generating target voice data associated with each object based on the target sub-voice data and the time stamps corresponding to the target sub-voice data.
Optionally, the target sub-speech data is speech data having a maximum amplitude among the plurality of sub-speech data.
Optionally, before generating a speech recognition result based on the target speech data, the method further includes:
acquiring first position information of the plurality of voice collectors in the current voice recognition scene;
correspondingly, the generating a speech recognition result based on a plurality of the target speech data further includes:
determining second position information of each object in the plurality of objects in the current voice recognition scene based on the first position information and a plurality of the target voice data;
and generating a voice recognition result based on the plurality of second position information and the plurality of target voice data.
Optionally, the generating a speech recognition result based on the plurality of second location information and the plurality of target speech data includes:
performing voice emotion recognition on the target voice data to obtain a plurality of first recognition results;
acquiring a plurality of target text data associated with a plurality of target voice data;
performing semantic recognition on the target text data to obtain a plurality of second recognition results;
generating the voice recognition result based on the plurality of first recognition results, the plurality of second recognition results, the plurality of target text data, and the plurality of second position information.
Optionally, the generating the speech recognition result based on the plurality of first recognition results, the plurality of second recognition results, the plurality of target text data, and the plurality of second location information includes:
extracting feature information of the target text data associated with each object based on the first recognition result associated with each object and the second recognition result associated with each object;
generating an azimuth voice map based on the feature information associated with each object and the second position information associated with each object; the voice recognition result comprises the azimuth voice map.
The application provides a speech recognition device, the speech recognition device includes:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring a plurality of voice data in a current voice recognition scene; the voice data comprise voice data of a plurality of objects under the current voice recognition scene, which are collected by a plurality of voice collectors; the plurality of voice collectors are positioned at different positions in the current voice recognition scene;
a first processing unit configured to generate target voice data associated with each of a plurality of objects based on the plurality of voice data; the target voice data associated with each object is derived from at least two voice data of the plurality of voice data;
and the second processing unit is used for generating a voice recognition result based on the target voice data and outputting the voice recognition result.
The application provides an electronic device, the electronic device includes:
a memory for storing executable instructions;
a processor for executing the executable instructions stored in the memory to implement the speech recognition method as described above.
The present application provides a computer storage medium storing one or more programs executable by one or more processors to implement a speech recognition method as described above.
The application provides a voice recognition method, a voice recognition device, an electronic device and a storage medium. A plurality of voice data are acquired in a current voice recognition scene, the plurality of voice data comprising voice data of a plurality of objects collected by a plurality of voice collectors located at different positions in the current voice recognition scene; target voice data associated with each object of the plurality of objects are generated based on the plurality of voice data, the target voice data associated with each object being derived from at least two of the plurality of voice data; and a voice recognition result is generated based on the plurality of target voice data and output. That is to say, the application generates the target voice data associated with each of the plurality of objects from the voice data collected by voice collectors located at different positions in the current voice recognition scene, and then generates and outputs a voice recognition result based on the target voice data. In this way, the voice data does not need to be analyzed manually, intelligent analysis of the voice data is realized, the amount of computation for voice data analysis is reduced, the recognition performance is improved, and the accuracy of the voice analysis result is ensured.
Drawings
Fig. 1 is a schematic flowchart of a speech recognition method according to an embodiment of the present application;
FIG. 2 is a schematic flow chart of another speech recognition method according to an embodiment of the present application;
fig. 3 (a) to (c) are schematic diagrams illustrating a process of segmenting a plurality of voice data according to an embodiment of the present application;
fig. 4 is a schematic flowchart of another speech recognition method according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a speech recognition processing apparatus according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
It should be appreciated that reference throughout this specification to "an embodiment of the present application" or "an embodiment described previously" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present application. Thus, the appearances of the phrases "in the embodiments of the present application" or "in the embodiments" in various places throughout this specification are not necessarily all referring to the same embodiments. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the embodiments of the present application, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application. The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
An embodiment of the present application provides a speech recognition method, which is applied to an electronic device, and as shown in fig. 1, the method includes the following steps:
Step 101, acquiring a plurality of voice data in a current voice recognition scene.
Wherein the plurality of voice data comprise voice data of a plurality of objects in the current voice recognition scene, collected by a plurality of voice collectors; the plurality of voice collectors are located at different positions in the current voice recognition scene.
In the embodiment of the application, the current voice recognition scene can be understood as any scene in which voice data can be acquired; for example, it may be a scene requiring real-time on-screen captions, such as input-method dictation, a conference and/or a court trial; it may also be a scene with lower real-time requirements, such as subtitling recorded audio/video, customer-service voice quality inspection, or User Generated Content (UGC) voice content review, which is not specifically limited in the present application.
In the embodiment of the application, the electronic device can acquire the voice data of a plurality of objects in the current voice recognition scene through a plurality of set voice collectors. It should be noted that the multiple voice collectors are located at different positions in the current voice recognition scene, and the same voice collector can collect voice data of multiple objects in the current voice recognition scene, and different voice collectors can collect voice data of the same object in the current voice recognition scene.
In practical application, the position of a voice collector can be understood as a position set in advance by a user so that higher-quality voice data can be collected in the current voice recognition scene. Meanwhile, the distance between every two of the plurality of voice collectors satisfies a preset distance. The preset distance may be a distance corresponding to the model of the voice collector, or a distance set in advance by the user, before the electronic device acquires the plurality of voice data; in either case, the preset distance is chosen so that the voice collectors capture better voice signals. For example, if a user sets up four voice collectors of the same model, one voice collector may be placed in each of the four directions, namely east, south, west and north, with a distance of two meters between every two of the four voice collectors, and the position of each voice collector is then marked and stored in the electronic device.
Illustratively, four voice collectors of the same model, denoted A, B, C and D, are installed on the conference table of a certain conference room, and four participants take part in the conference. Voice collector A is located in the due-east direction of the conference table, voice collector B in the due-south direction, voice collector C in the due-west direction and voice collector D in the due-north direction, and the distance between every two of the voice collectors A, B, C and D is two meters. When the participants hold a conference in the conference room, the voice data of one participant can be collected simultaneously by voice collector A and voice collector B, voice collector A can also collect the voice data of two different participants at the same time, and the voice data collected by each voice collector can be stored in the electronic device.
Step 102, generating target voice data associated with each object of the plurality of objects based on the plurality of voice data.
Wherein the target voice data associated with each object is derived from at least two of the plurality of voice data.
In the embodiment of the present application, the target voice data may be understood as voice data associated with the same object of a plurality of objects.
In the embodiment of the application, after acquiring a plurality of voice data in a current voice recognition scene, the electronic device acquires voice data associated with the same object in a plurality of objects based on at least two voice data in the plurality of voice data, and further generates target voice data associated with each object in the plurality of objects.
Step 103, generating a voice recognition result based on the plurality of target voice data, and outputting the voice recognition result.
In the embodiment of the present application, the voice recognition result may be understood as a result obtained by performing a voice recognition method on the target voice data.
In practical application, after generating the target voice data associated with each of the plurality of objects, the electronic device generates a voice recognition result from the plurality of target voice data by a voice recognition method and outputs the voice recognition result. Voice recognition methods include methods based on linguistics and acoustics, stochastic model methods, probabilistic grammar analysis, and the like. In the embodiment of the application, a stochastic model method is preferably used to generate the voice recognition result from the plurality of target voice data.
The voice recognition method provided by the application acquires a plurality of voice data in a current voice recognition scene, the plurality of voice data comprising voice data of a plurality of objects collected by a plurality of voice collectors located at different positions in the current voice recognition scene; generates, based on the plurality of voice data, target voice data associated with each object of the plurality of objects, the target voice data associated with each object being derived from at least two of the plurality of voice data; and generates and outputs a voice recognition result based on the plurality of target voice data. That is to say, in the embodiment of the present application, the target voice data associated with each object is generated from the voice data collected by voice collectors located at different positions in the current voice recognition scene, and the voice recognition result is then generated and output based on the target voice data; therefore, the voice data does not need to be analyzed manually, the amount of computation for voice data analysis is reduced, and the accuracy of the voice analysis result is ensured.
Based on the foregoing embodiments, an embodiment of the present application provides a speech recognition method applied to an electronic device, and as shown in fig. 2, the method includes the following steps:
Step 201, acquiring a plurality of voice data in a current voice recognition scene.
Wherein the plurality of voice data comprise voice data of a plurality of objects in the current voice recognition scene, collected by a plurality of voice collectors; the plurality of voice collectors are located at different positions in the current voice recognition scene.
Step 202, segmenting each voice data in the plurality of voice data to obtain each sub-voice data set after each voice data is segmented.
Wherein each sub-voice data set comprises a plurality of segments of voice data.
In the embodiment of the application, the electronic device divides each piece of voice data in the plurality of pieces of voice data to obtain each sub-voice data set including a plurality of pieces of voice data after each piece of voice data is divided.
In practical application, step 202 of segmenting each voice data in the plurality of voice data to obtain each sub-voice data set may be implemented by the following steps:
step S1, obtaining a break time between sentences in each of the plurality of voice data, and if the break time is greater than a preset break time, segmenting each of the plurality of voice data to obtain a first sub-voice data set after each voice data is segmented.
Wherein the first sub-speech data set comprises a plurality of segments of speech data.
In the embodiment of the present application, the first sub-speech data set may be understood as a speech data set obtained after each speech data is segmented.
In this embodiment of the application, the preset break time may be a break time pre-stored in the electronic device for the voice data segmentation mode before each of the plurality of voice data is segmented, or a break time set by the user in real time after the electronic device determines that it has entered the voice data segmentation state. It can be understood that whether the break time is pre-stored or set in real time, it is chosen so that better voice data are obtained.
In the embodiment of the application, after acquiring a plurality of voice data in a current voice recognition scene, the electronic device acquires inter-sentence break time in each voice data in the plurality of voice data, compares the inter-sentence break time in each voice data with preset break time, and if the inter-sentence break time is greater than the preset break time, segments the inter-sentence break position in each voice data in the plurality of voice data to obtain a first sub-voice data set comprising a plurality of pieces of voice data after each voice data is segmented.
In practical application, as shown in (a) and (b) in fig. 3, if the electronic device acquires a plurality of pieces of voice data in a current voice recognition scene, as shown in (a) in fig. 3, an inter-sentence break time in each piece of voice data is acquired, and the inter-sentence break time in each piece of voice data is compared with a preset break time, for example, the preset break time may be set to 2 seconds, and if the inter-sentence break time exceeds 2 seconds, each piece of voice data in the plurality of pieces of voice data is divided at the inter-sentence break, as shown in (b) in fig. 3, and a first sub-voice data set including a plurality of pieces of voice data after each piece of voice data is divided is obtained.
Step S2, subdividing each segment of voice data in the first sub-voice data set obtained by dividing each piece of voice data according to a preset time interval, to obtain each sub-voice data set obtained by dividing each piece of voice data.
Wherein the first sub-speech data set comprises each sub-speech data set.
In this embodiment of the application, the preset time interval may be a time interval pre-stored in the electronic device for segmenting the first sub-voice data set, or a time interval set by the user in real time after the electronic device determines that it has entered the state of segmenting the first sub-voice data set. It can be understood that whether the time interval is pre-stored or set in real time, it is chosen so that better voice data are obtained.
In the embodiment of the application, after the electronic device obtains the first sub-voice data set, each piece of voice data in the first sub-voice data set after each voice data division is divided again according to a preset time interval, so as to obtain each sub-voice data set after each voice data division.
In practical applications, as shown in (b) and (c) of fig. 3, the electronic device obtains the first sub-voice data set, as shown in (b) of fig. 3, and re-divides each piece of voice data in the first sub-voice data set after each voice data division into a preset time interval, for example, the preset time interval may be set to 10-30ms, and divides each piece of voice data in the first sub-voice data set into voice data of a smaller time period, as shown in (c) of fig. 3, and obtains each sub-voice data set after each voice data division including a plurality of pieces of voice data. It should be noted that each piece of speech data in the first sub-speech data set is segmented again to obtain speech data in a smaller time period, so that a speech recognition result of each object can be obtained more accurately.
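The following sketch (not part of the patent text) illustrates the two-stage segmentation just described: first splitting at inter-sentence pauses longer than a preset break time, then re-dividing each resulting segment into short fixed-length frames. The energy-based pause detector, the 2-second and 20 ms defaults, and the function names are illustrative assumptions.

```python
import numpy as np

def split_on_pauses(signal, sr, energy_thresh=1e-4, min_pause_s=2.0):
    """Split a mono signal at pauses whose low-energy run exceeds min_pause_s."""
    hop = int(0.01 * sr)                                   # 10 ms analysis hop
    energy = np.array([np.mean(signal[i:i + hop] ** 2)
                       for i in range(0, len(signal) - hop, hop)])
    silent = energy < energy_thresh
    segments, start, pause_len = [], 0, 0
    for idx, is_silent in enumerate(silent):
        pause_len = pause_len + 1 if is_silent else 0
        if pause_len * hop / sr >= min_pause_s:            # pause long enough: cut here
            end = (idx - pause_len + 1) * hop
            if end > start:
                segments.append((start, end))
            start = (idx + 1) * hop
            pause_len = 0
    if start < len(signal):
        segments.append((start, len(signal)))
    return segments                                        # first sub-voice data set

def frame_segments(signal, segments, sr, frame_s=0.02):
    """Re-divide each pause-delimited segment into short frames (10-30 ms)."""
    frame = int(frame_s * sr)
    return [signal[s + i:s + i + frame]
            for s, e in segments
            for i in range(0, e - s - frame + 1, frame)]
```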
Step 203, acquiring a plurality of voiceprint features associated with each sub-voice data set.
In the embodiment of the present application, a voiceprint feature can be understood as a feature that differs from object to object and can therefore be used for identity recognition; voiceprint features are specific to an object and remain stable.
In the embodiment of the application, the electronic device obtains a plurality of voiceprint features associated with each sub-voice data set, and determines the number of objects in the current voice recognition scene based on the plurality of voiceprint features associated with each sub-voice data set. It should be noted that the same object has the same voiceprint features, and the electronic device may determine the number of objects in the current speech recognition scene based on the number of voiceprint features.
Step 204, generating target voice data associated with each object based on each sub-voice data set and the plurality of voiceprint features associated with each sub-voice data set.
In this embodiment of the application, the step 204 of generating the target voice data associated with each object based on each sub-voice data set and the multiple voiceprint features associated with each sub-voice data set may be implemented by the following steps:
step A1, determining a plurality of sub-voice data with the same voiceprint feature and the same time stamp in a plurality of sub-voice data sets.
In the embodiment of the present application, "the same voiceprint feature" reflects the fact that the same object always produces the same voiceprint feature; the plurality of sub-voice data are the voice data selected from the plurality of sub-voice data sets that have the same voiceprint feature and the same timestamp; that is, the plurality of sub-voice data contain the voice of the same object captured at the same time point or within the same time period.
In the embodiment of the application, after acquiring the plurality of voiceprint features associated with each sub-voice data set, the electronic device determines, in the plurality of sub-voice data sets, a plurality of sub-voice data having the same voiceprint feature and the same timestamp.
Step A2, determining target sub-voice data from the plurality of sub-voice data to obtain a plurality of target sub-voice data associated with the same voiceprint feature.
Wherein the target sub-voice data is voice data having the largest amplitude among the plurality of sub-voice data.
In the embodiment of the present application, the amplitude is the maximum difference between the sound pressure and the static pressure, where the sound pressure is the pressure increment produced by the alternating compressions and rarefactions of a sound wave propagating in air. In practical application, a voice collector collects voice data by converting these pressure fluctuations in the air into fluctuations of an electrical signal.
In the embodiment of the present application, the electronic device determines the target sub-voice data with the maximum amplitude from the plurality of sub-voice data to obtain the plurality of target sub-voice data associated with the same voiceprint feature, and may implement the following steps:
the electronic equipment acquires each amplitude of a plurality of pieces of sub-voice data with the same voiceprint feature and the same timestamp, compares each amplitude of the plurality of pieces of sub-voice data, obtains sub-voice data with the maximum amplitude from the plurality of pieces of sub-voice data as target sub-voice data, and further obtains a plurality of pieces of target sub-voice data associated with the same voiceprint feature based on the target sub-voice data. If at least two pieces of sub-speech data having the same amplitude exist in the plurality of pieces of sub-speech data, the sub-speech data having any amplitude is acquired from the at least two pieces of sub-speech data having the same amplitude as the target sub-speech data, and the plurality of pieces of target sub-speech data associated with the same voiceprint feature are obtained based on the target sub-speech data.
For example, the electronic device obtains sub-voice data T1, T2 and T3 having the same voiceprint feature and the same timestamp, obtains their amplitudes A1, A2 and A3 respectively, and compares A1, A2 and A3, giving one of the following two cases. Case one: A3 > A2 > A1, indicating that the amplitude A3 of the sub-voice data T3 is the maximum amplitude, so T3 is the target sub-voice data. Case two: A3 = A2 > A1, indicating that the amplitudes of T3 and T2 are both maximal; sub-voice data of either amplitude is then taken as the target sub-voice data, for example T2. It should be noted that this example is given only to help understand the solution and does not fully represent the specific implementation claimed in the present application.
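A minimal sketch of the amplitude-based selection in steps A1-A2 might look as follows; the candidate representation and function name are assumptions, and ties are broken by keeping the first maximum, which is one of the arbitrary choices the description allows.

```python
import numpy as np

def pick_target_sub_voice(candidates):
    """candidates: list of (collector_id, samples) sharing one voiceprint and timestamp.
    Returns the candidate with the largest peak amplitude; ties keep the first one
    seen, matching the 'any of the equal-amplitude segments' rule above."""
    amplitudes = [np.max(np.abs(samples)) for _, samples in candidates]
    return candidates[int(np.argmax(amplitudes))]
```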
Step a3, generating target voice data associated with each object based on the plurality of target sub voice data and the time stamps corresponding to the plurality of target sub voice data.
In the embodiment of the application, the electronic device obtains a plurality of target sub-voice data associated with the same voiceprint feature and a plurality of timestamps corresponding to the target sub-voice data, and generates target voice data associated with each object.
Step 205, generating a voice recognition result based on the plurality of target voice data, and outputting the voice recognition result.
In this embodiment of the application, step 205 may be implemented by the following steps:
and step B1, acquiring first position information of the plurality of voice collectors in the current voice recognition scene.
In the embodiment of the present application, the first position information can be understood as information on the positions at which the voice collectors are set in the current voice recognition scene. The first position information may be position information of the plurality of voice collectors pre-stored in the electronic device, or position information of the plurality of voice collectors acquired by the electronic device in real time. It should be noted that whether it is pre-stored or acquired in real time, the first position information is whatever position information of the plurality of voice collectors is actually obtained.
And step B2, determining second position information of each object in the plurality of objects in the current voice recognition scene based on the first position information and the plurality of target voice data.
The second position information can be understood as the position of each object in the current voice recognition scene relative to the first position information of the plurality of voice collectors.
In the embodiment of the application, after the electronic device obtains the first position information of the plurality of voice collectors in the current voice recognition scene, the electronic device determines the second position information of each object in the plurality of objects in the current voice recognition scene relative to the first position information based on the first position information and the plurality of target voice data.
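The description does not specify how the second position information is computed from the first position information and the target voice data; one plausible heuristic, shown purely as an illustration, is to weight each collector's known position by how strongly it captured the object's speech.

```python
import numpy as np

def estimate_object_position(collector_positions, collector_amplitudes):
    """collector_positions: dict {collector_id: (x, y)} (first position information).
    collector_amplitudes: dict {collector_id: mean amplitude of the object's speech
    captured by that collector}. Returns an (x, y) estimate (second position
    information) as an amplitude-weighted centroid -- an illustrative heuristic,
    not a method specified by the patent."""
    ids = list(collector_positions)
    pos = np.array([collector_positions[i] for i in ids], dtype=float)
    w = np.array([collector_amplitudes.get(i, 0.0) for i in ids], dtype=float)
    if w.sum() == 0:
        return tuple(pos.mean(axis=0))          # fall back to the geometric centre
    return tuple((pos * w[:, None]).sum(axis=0) / w.sum())
```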
And step B3, performing voice emotion recognition on the target voice data to obtain a plurality of first recognition results.
In this embodiment, the first recognition result may be understood as a result obtained by performing speech emotion recognition on the target speech data.
In the embodiment of the application, the electronic equipment extracts the emotional voice feature of each voice data in the target voice data, and performs emotional recognition on each voice data based on the emotional voice feature to obtain a first recognition result.
In practical applications, each voice data needs to be preprocessed before emotion voice features are extracted from it. The preprocessing of the voice data includes pre-emphasis, short-time analysis, framing, windowing and endpoint detection.
- Pre-emphasis: the higher the frequency in the voice data, the smaller the corresponding component; pre-emphasis boosts the high-frequency part of the spectrum so that the spectrum of the signal becomes flat for spectral analysis or vocal tract parameter analysis.
- Short-time analysis: voice data is time-varying as a whole and is a non-stationary process, but its characteristics remain essentially stable over a short time range, e.g. 10-30 ms; that is, speech is short-time stationary.
- Framing: to perform short-time analysis, the voice data is divided at a preset time interval, such as 10-30 ms, and each resulting segment is called a frame; to make the transition between frames smooth and maintain continuity, overlapping segments may also be used.
- Windowing: the speech signal s(n) is multiplied by a window function w(n) to form the windowed speech data s_w(n) = s(n)w(n), where the window length, i.e. the number of sample points, corresponds to one frame.
- Endpoint detection: the start point and end point of the speech are accurately located in a segment of voice data, ensuring that effective voice data is separated from useless noise.
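A compact sketch of this preprocessing chain, with commonly used (not patent-specified) parameter values, might be:

```python
import numpy as np

def preprocess(signal, sr, alpha=0.97, frame_s=0.025, hop_s=0.01, energy_thresh=1e-4):
    """Pre-emphasis, framing, Hamming windowing and a crude energy-based endpoint
    check, following the steps listed above; alpha, frame and hop lengths and the
    energy threshold are common defaults, not values fixed by the patent."""
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])   # pre-emphasis
    frame, hop = int(frame_s * sr), int(hop_s * sr)
    window = np.hamming(frame)
    frames = [emphasized[i:i + frame] * window                # s_w(n) = s(n) * w(n)
              for i in range(0, len(emphasized) - frame + 1, hop)]
    # endpoint detection: keep only frames whose energy exceeds a noise threshold
    return [f for f in frames if np.mean(f ** 2) > energy_thresh]
```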
In the embodiment of the application, extracting emotion voice features means converting the voice waveform into a parametric representation at a relatively minimal data rate. Emotion voice feature extraction algorithms include Mel-Frequency Cepstral Coefficients (MFCC), Linear Prediction Cepstral Coefficients (LPCC), Line Spectral Frequencies (LSF), the Discrete Wavelet Transform (DWT), Perceptual Linear Prediction (PLP) and the like; of course, they also include other algorithms such as Linear Predictive Coding (LPC) coefficients, and the voice feature extraction algorithm is not specifically limited in the present application.
Speech emotion recognition essentially performs classification and pattern recognition on the characteristic parameters of emotional speech.
in the embodiment of the application, the electronic device extracts emotion voice features in each voice data in the target voice data based on an LPC algorithm, and performs emotion classification on each voice data in the target voice data by using an emotion classification algorithm based on a training library to obtain a plurality of first recognition results. The emotion classification algorithm includes an Artificial Neural Network (ANN), a Hidden Markov Model (HMM), a Support Vector Machine (SVM), and the like, and certainly, the emotion recognition algorithm also includes other components such as a Decision Tree (DT), and the emotion classification algorithm is not specifically limited in the present application.
A speech emotion data set is an important basis for studying speech emotion recognition. Speech emotion data sets include the Belfast English emotion database, the Berlin Emo-DB emotion database, the CASIA Chinese emotion database and the ACCorpus series of Chinese emotion databases, and of course also include others such as the German FAU Aibo children's emotion database; the speech emotion data set is not specifically limited in the present application.
In the embodiment of the application, the electronic device extracts emotion voice features of each voice data in the target voice data by using an LPC algorithm, classifies voice emotion by using an HMM algorithm, and uses a CASIA Chinese emotion database in a training library to obtain a plurality of first recognition results.
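As an illustrative sketch of this feature-extraction-plus-classification stage, the following uses MFCC features and an SVM classifier (both listed above as options) via librosa and scikit-learn; the patent's preferred combination is LPC features with an HMM classifier, and the corpus handling shown here is an assumption.

```python
import numpy as np
import librosa
from sklearn.svm import SVC

def emotion_features(samples, sr):
    """Mean and standard deviation of 13 MFCCs per utterance -- one of the feature
    types listed above (the preferred pipeline in the text uses LPC features)."""
    mfcc = librosa.feature.mfcc(y=samples.astype(float), sr=sr, n_mfcc=13)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

def train_emotion_classifier(utterances, labels, sr):
    """utterances: list of 1-D sample arrays from a labelled emotion corpus
    (e.g. the CASIA Chinese emotion database mentioned above); labels: one emotion
    class per utterance. Uses an SVM, one of the classifiers listed above, instead
    of the preferred HMM for brevity."""
    X = np.stack([emotion_features(u, sr) for u in utterances])
    clf = SVC(kernel="rbf", probability=True)
    clf.fit(X, labels)
    return clf
```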
And step B4, acquiring a plurality of target text data related to the plurality of target voice data.
In the embodiment of the application, after the electronic equipment acquires the target voice data, the target voice data are converted into the target text data.
And step B5, performing semantic recognition on the target text data to obtain a plurality of second recognition results.
In the embodiment of the application, semantic recognition can be understood as automatically segmenting target text data, further sorting the structure of the target text data and even understanding the meaning of the target text data; the second recognition result may be understood as a result obtained by performing semantic recognition on the target text data.
In the embodiment of the application, the electronic device performs semantic recognition on the target text data to obtain a plurality of second recognition results. Semantic recognition methods include word segmentation methods based on string matching, such as the maximum matching method; the semantic recognition method is not specifically limited in the present application.
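A minimal sketch of the maximum matching method named above (forward variant, with an assumed in-memory vocabulary) could be:

```python
def forward_maximum_matching(text, vocabulary, max_word_len=6):
    """Forward maximum matching word segmentation -- the string-matching-based
    method mentioned above. vocabulary is a set of known words; characters that
    match no vocabulary entry fall through as single-character tokens."""
    tokens, i = [], 0
    while i < len(text):
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += length
                break
    return tokens
```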
Step B6, generating a speech recognition result based on the plurality of first recognition results, the plurality of second recognition results, the plurality of target text data, and the plurality of second position information.
In this embodiment, the step B6 of generating a speech recognition result based on the plurality of first recognition results, the plurality of second recognition results, the plurality of target text data, and the plurality of second position information may be implemented by:
and step C1, extracting the characteristic information of the target text data associated with each object based on the first identification result associated with each object and the second identification result associated with each object.
In the embodiment of the present application, the feature information may be understood as information of target text data associated with each object in a current speech recognition scenario. For example, the feature information may be information that a certain word appears frequently in the target text data associated with each object; the feature information may also be information with high topic relevance in the current speech recognition scenario, and the present application is not particularly limited.
In an actual application scenario, for example, in a conference, the electronic device calculates the occurrence frequency of words contained in the target text data associated with each object, and compares the occurrence frequency of each word to obtain a word with a higher frequency as feature information.
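The word-frequency selection described in this example can be sketched as follows; the stop-word filtering and the number of keywords kept are illustrative assumptions.

```python
from collections import Counter

def extract_keywords(tokens, stopwords=frozenset(), top_k=5):
    """Count how often each word occurs in an object's target text data and keep
    the most frequent ones as that object's feature information; top_k and the
    stop-word list are illustrative choices, not values from the patent."""
    counts = Counter(t for t in tokens if t not in stopwords and len(t) > 1)
    return [word for word, _ in counts.most_common(top_k)]
```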
In the embodiment of the application, the electronic device extracts feature information of the target text data associated with each object based on the first recognition result associated with each object and the second recognition result associated with each object.
Step C2, generating an azimuth voice map based on the feature information associated with each object and the second position information associated with each object; the voice recognition result comprises the azimuth voice map.
In an embodiment of the present application, the azimuth voice map includes the plurality of objects, the feature information associated with each object, and the second position information associated with each object.
In the embodiment of the application, the electronic device extracts feature information of the target text data associated with each object, and generates a voice recognition result including an azimuth voice map based on the feature information associated with each object and the second position information associated with each object.
In an actual application scenario, for example a conference, the electronic device extracts the feature information of the target text data associated with each object and integrates the feature information across the plurality of objects as the feature information of the conference, that is, the subject of the conference; the azimuth voice map of the conference is then generated based on this feature information and the second position information associated with each object.
In other embodiments of the present application, the electronic device may output the azimuth voice map and display it through a display module.
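As a sketch of the data behind the azimuth voice map, the per-object keywords and second position information might be assembled into a serializable structure such as the one below; the field names and JSON encoding are assumptions, since the patent does not prescribe a storage or rendering format.

```python
import json

def build_azimuth_voice_map(objects):
    """objects: list of dicts like
    {"object_id": "speaker-1", "keywords": [...], "position": (x, y)}.
    Returns a JSON-serialisable structure for the azimuth voice map; the actual
    rendering (e.g. plotting keywords at each seat around the table) is left to
    the display module and is not prescribed by the patent."""
    return json.dumps(
        {o["object_id"]: {"keywords": o["keywords"],
                          "position": list(o["position"])}
         for o in objects},
        ensure_ascii=False, indent=2)
```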
It should be noted that, for the descriptions of the same steps and the same contents in this embodiment as those in other embodiments, reference may be made to the descriptions in other embodiments, which are not described herein again.
Based on the foregoing embodiments, the present application provides a speech recognition method applied to an electronic device, and as shown in fig. 4, the method includes the following steps:
Step 301, acquiring four voice data in a current voice recognition scene.
Wherein the four voice data comprise voice data of a plurality of objects in the current voice recognition scene, collected by four voice collectors; the four voice collectors are located at different positions in the current voice recognition scene.
In the embodiment of the application, the electronic device comprises four voice collectors located in the east, south, west and north directions of the current voice recognition scene, with a distance of half a meter between every two of the four voice collectors; each voice collector collects the voice data of the plurality of objects in the current voice recognition scene, yielding four voice data.
Wherein the first sub-speech data set comprises a plurality of segments of speech data.
In the embodiment of the application, the electronic device clusters the extracted voiceprint features using a Gaussian Mixture Model (GMM), determines the number of objects present in each sub-voice data according to the differences between voiceprint features, and finally performs voice concatenation to obtain the voice data of each of the plurality of objects; that is, voice data in which several objects speak at the same time within the same voice data is separated into the sub-voice data of each individual object.
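A minimal sketch of the GMM-based clustering mentioned here, using scikit-learn, is shown below; the per-segment voiceprint embeddings are assumed to be available, and selecting the number of speakers by BIC is an illustrative choice not stated in the text.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def cluster_voiceprints(embeddings, max_speakers=8):
    """embeddings: one voiceprint feature vector per sub-voice segment.
    Fits GMMs with 1..max_speakers components and keeps the model with the lowest
    BIC; the component count is taken as the number of objects and the component
    assignment as each segment's speaker label."""
    X = np.stack(embeddings)
    best_model, best_bic = None, np.inf
    for k in range(1, min(max_speakers, len(X)) + 1):
        gmm = GaussianMixture(n_components=k, covariance_type="diag",
                              random_state=0).fit(X)
        bic = gmm.bic(X)
        if bic < best_bic:
            best_model, best_bic = gmm, bic
    return best_model.n_components, best_model.predict(X)
```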
Step 304, determining a plurality of sub-voice data with the same voiceprint feature and the same timestamp from the plurality of sub-voice data sets, acquiring the voice data with the maximum amplitude from the plurality of sub-voice data, and determining it as the target sub-voice data.
Step 307, performing voice emotion recognition on the plurality of target voice data to obtain a plurality of first recognition results.
In the embodiment of the application, the electronic device performs voice emotion recognition on the plurality of target voice data: the voice data is preprocessed by pre-emphasis, framing, windowing and endpoint detection, voice features are extracted by the LPC algorithm, emotion classification is performed on the plurality of target voice data by the HMM algorithm with the CASIA Chinese emotion database as the training library, and a plurality of first recognition results are finally obtained.
Step 308, acquiring a plurality of target text data associated with the plurality of target voice data, and performing semantic recognition on the plurality of target text data to obtain a plurality of second recognition results.
Step 310, generating an azimuth voice map based on the keyword information associated with each object and the second position information associated with each object.
Wherein the voice recognition result comprises an azimuth voice map.
Step 311, outputting and exporting the azimuth voice map.
As can be seen from the above, in the embodiment of the present application, the position information of each of the plurality of objects, the target voice data of each object, and the keywords extracted from the target voice data of each object are associated with one another, so that no manual analysis is required, the amount of computation for voice data analysis is reduced, and the accuracy of the voice recognition result is ensured.
It should be noted that, for the descriptions of the same steps and the same contents in this embodiment as those in other embodiments, reference may be made to the descriptions in other embodiments, which are not described herein again.
Based on the foregoing embodiments, an embodiment of the present application provides a speech recognition apparatus, which can be applied to a speech recognition method provided in the embodiments corresponding to fig. 1-2, and as shown in fig. 5, the speech recognition apparatus 5 includes:
an obtaining unit 51, configured to obtain a plurality of voice data in a current voice recognition scene; the voice data comprises voice data of a plurality of objects under the current voice recognition scene, which are collected by a plurality of voice collectors; the voice collectors are positioned at different positions in the current voice recognition scene;
a first processing unit 52 for generating target voice data associated with each of the plurality of objects based on the plurality of voice data; the target voice data associated with each object is derived from at least two voice data of the plurality of voice data;
a second processing unit 53 for generating a voice recognition result based on the plurality of target voice data and outputting the voice recognition result.
In other embodiments of the present application, the first processing unit 52 is further configured to segment each voice data in the plurality of voice data to obtain each sub-voice data set after each voice data is segmented, each sub-voice data set comprising a plurality of segments of voice data; acquire a plurality of voiceprint features associated with each sub-voice data set; and generate target voice data associated with each object based on each sub-voice data set and the plurality of voiceprint features associated with each sub-voice data set.
In other embodiments of the present application, the first processing unit 52 is further configured to determine, in the plurality of sub-voice data sets, a plurality of sub-voice data having the same voiceprint feature and the same timestamp; determine target sub-voice data from the plurality of sub-voice data to obtain a plurality of target sub-voice data associated with the same voiceprint feature; and generate target voice data associated with each object based on the plurality of target sub-voice data and the timestamps corresponding to the plurality of target sub-voice data.
In other embodiments of the present application, the first processing unit 52 is further configured to determine the target sub-speech data as the speech data with the largest amplitude in the plurality of sub-speech data.
In other embodiments of the present application, the second processing unit 53 is further configured to obtain first position information of a plurality of voice collectors in a current voice recognition scene; determining second position information of each object in the plurality of objects in the current voice recognition scene based on the first position information and the plurality of target voice data; and generating a voice recognition result based on the plurality of second position information and the plurality of target voice data.
In other embodiments of the present application, the second processing unit 53 is further configured to perform speech emotion recognition on a plurality of target speech data to obtain a plurality of first recognition results; acquiring a plurality of target text data associated with a plurality of target voice data; performing semantic recognition on the target text data to obtain a plurality of second recognition results; based on the plurality of first recognition results, the plurality of second recognition results, the plurality of target text data, and the plurality of second position information, a voice recognition result is generated.
In other embodiments of the present application, the second processing unit 53 is further configured to extract feature information of the target text data associated with each object based on the first recognition result associated with each object and the second recognition result associated with each object; and generate an azimuth voice map based on the feature information associated with each object and the second position information associated with each object; the voice recognition result comprises the azimuth voice map.
Based on the foregoing embodiments, the present application provides another speech recognition apparatus, which can be applied to the speech recognition method provided by the embodiment corresponding to fig. 3, and as shown in fig. 6, the speech recognition processing apparatus 6 in fig. 6 corresponds to the speech recognition apparatus 5 in fig. 5, wherein the obtaining unit 51 in the speech recognition apparatus 5 includes a speech acquiring unit 61 in the speech recognition processing apparatus 6, and the first processing unit 52 in the speech recognition apparatus 5 includes a speech data dividing unit 62, a human voice separating unit 63, and a voiceprint recognition engine unit 64 in the speech recognition processing apparatus 6; the second processing unit 53 in the speech recognition apparatus 5 includes a speech data selection and object position determination unit 65, a speech emotion recognition unit 66, a speech recognition unit 67, a character conversion unit 68, and a generation output unit 69 in the speech recognition processing apparatus 6.
Based on the foregoing embodiments, an embodiment of the present application provides an electronic device, which may be applied to the speech recognition method provided in the embodiments corresponding to figs. 1-2. As shown in fig. 7, the electronic device 7 (which corresponds to the speech recognition apparatus 5 in fig. 5) includes a memory 71 and a processor 72, wherein the processor 72 is configured to execute a speech recognition program stored in the memory 71, and the electronic device 7 implements the following steps through the processor 72:
acquiring a plurality of voice data in a current voice recognition scene; the plurality of voice data comprise voice data of a plurality of objects in the current voice recognition scene, collected by a plurality of voice collectors; the plurality of voice collectors are located at different positions in the current voice recognition scene;
generating target voice data associated with each object of the plurality of objects based on the plurality of voice data; the target voice data associated with each object is derived from at least two voice data of the plurality of voice data;
generating a speech recognition result based on the plurality of target voice data, and outputting the speech recognition result.
In other embodiments of the present application, the processor 72 is configured to execute the speech recognition program stored in the memory 71 to implement the following steps:
dividing each voice data in the plurality of voice data to obtain each sub-voice data set after each voice data is divided; each sub-voice data set comprises a plurality of sections of voice data;
acquiring a plurality of voiceprint characteristics associated with each sub-voice data set;
and generating the target voice data associated with each object based on each sub-voice data set and a plurality of voiceprint characteristics associated with each sub-voice data set.
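For illustration only, the following Python sketch shows one way the dividing and voiceprint-extraction steps above could be organized. All names (split_into_segments, build_sub_voice_sets, voiceprint_of) are hypothetical; the voiceprint function is passed in as a callable because the disclosure does not fix a particular voiceprint algorithm, and in practice it would be a speaker-embedding model.

```python
from typing import Callable, Dict, List, Tuple

def split_into_segments(samples: List[float], segment_len: int) -> List[Tuple[int, List[float]]]:
    """Divide one collector's voice data into timestamped sub-voice data.

    The timestamp is simply the segment index, so segments cut from different
    collectors at the same moment share the same timestamp.
    """
    return [(i // segment_len, samples[i:i + segment_len])
            for i in range(0, len(samples), segment_len)]

def build_sub_voice_sets(
    streams: Dict[int, List[float]],              # collector_id -> raw samples from that collector
    voiceprint_of: Callable[[List[float]], str],  # stand-in for a real speaker-embedding model
    segment_len: int = 1600,                      # e.g. 100 ms of audio at 16 kHz
) -> Dict[int, List[dict]]:
    """Return, per collector, its sub-voice data set with a voiceprint feature per segment."""
    return {
        collector_id: [
            {"timestamp": ts, "segment": seg, "voiceprint": voiceprint_of(seg)}
            for ts, seg in split_into_segments(samples, segment_len)
        ]
        for collector_id, samples in streams.items()
    }
```

Any function that maps a segment to a stable speaker label fits the voiceprint_of interface, for example a GMM-UBM, i-vector, or neural speaker-embedding model followed by clustering.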
In other embodiments of the present application, the processor 72 is configured to execute the speech recognition program stored in the memory 71 to implement the following steps:
determining, in the plurality of sub-voice data sets, a plurality of sub-voice data which have the same voiceprint characteristics and the same timestamp;
determining target sub-voice data from the plurality of sub-voice data to obtain a plurality of target sub-voice data associated with the same voiceprint feature;
and generating target voice data associated with each object based on the target sub-voice data and the time stamps corresponding to the target sub-voice data.
In other embodiments of the present application, the processor 72 is configured to execute the speech recognition program stored in the memory 71 to implement the following steps:
the target sub voice data is voice data having the largest amplitude among the plurality of sub voice data.
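A minimal sketch of the selection rule just described, under the assumption that sub-voice data records are dictionaries with "timestamp", "segment", and "voiceprint" keys (as in the previous sketch); the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

def peak_amplitude(segment: List[float]) -> float:
    """Amplitude criterion used to rank sub-voice data captured for the same object and timestamp."""
    return max((abs(s) for s in segment), default=0.0)

def merge_target_voice_data(sub_voice_sets: Dict[int, List[dict]]) -> Dict[str, List[float]]:
    """Keep the largest-amplitude sub-voice data per (voiceprint, timestamp), then stitch by time.

    Returns voiceprint -> the target voice data (concatenated samples) for that object.
    """
    # Group every collector's sub-voice data by (voiceprint, timestamp).
    groups: Dict[tuple, List[List[float]]] = defaultdict(list)
    for records in sub_voice_sets.values():
        for rec in records:
            groups[(rec["voiceprint"], rec["timestamp"])].append(rec["segment"])

    # The target sub-voice data for each group is the segment with the largest amplitude.
    per_object: Dict[str, Dict[int, List[float]]] = defaultdict(dict)
    for (voiceprint, ts), segments in groups.items():
        per_object[voiceprint][ts] = max(segments, key=peak_amplitude)

    # Concatenate each object's target sub-voice data in timestamp order.
    return {vp: [sample for ts in sorted(ts_map) for sample in ts_map[ts]]
            for vp, ts_map in per_object.items()}
```

Picking the loudest capture per timestamp is a proxy for picking the collector closest to the speaker, which is why the merged target voice data tends to have the best signal quality.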
In other embodiments of the present application, the processor 72 is configured to execute the speech recognition program stored in the memory 71 to implement the following steps:
acquiring first position information of a plurality of voice collectors in a current voice recognition scene;
accordingly, generating a speech recognition result based on the plurality of target speech data further comprises:
determining second position information of each object in the plurality of objects in the current voice recognition scene based on the first position information and the plurality of target voice data;
and generating a voice recognition result based on the plurality of second position information and the plurality of target voice data.
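The disclosure does not spell out the positioning computation at this point, so the sketch below uses one plausible heuristic, an amplitude-weighted centroid of the collector positions, purely as an assumption; a time-difference-of-arrival solver could replace it behind the same interface. All names are hypothetical.

```python
from typing import Dict, Tuple

def estimate_object_position(
    collector_positions: Dict[int, Tuple[float, float]],  # first position information: collector_id -> (x, y)
    loudness_per_collector: Dict[int, float],             # how strongly each collector hears this object
) -> Tuple[float, float]:
    """Place the object at the amplitude-weighted centroid of the collectors that hear it.

    Collectors closer to the speaker record the speaker more loudly, so weighting their
    positions by loudness pulls the estimate toward the true speaker position.
    """
    total = sum(loudness_per_collector.values()) or 1.0
    x = sum(collector_positions[c][0] * w for c, w in loudness_per_collector.items()) / total
    y = sum(collector_positions[c][1] * w for c, w in loudness_per_collector.items()) / total
    return (x, y)
```

For example, estimate_object_position({1: (0, 0), 2: (4, 0)}, {1: 3.0, 2: 1.0}) places the object at (1.0, 0.0), closer to the louder collector.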
In other embodiments of the present application, the processor 72 is configured to execute the speech recognition program stored in the memory 71 to implement the following steps:
performing voice emotion recognition on the plurality of target voice data to obtain a plurality of first recognition results;
acquiring a plurality of target text data associated with a plurality of target voice data;
performing semantic recognition on the target text data to obtain a plurality of second recognition results;
generating the voice recognition result based on the plurality of first recognition results, the plurality of second recognition results, the plurality of target text data, and the plurality of second position information.
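A minimal sketch of how the four per-object inputs above could be assembled; the emotion and semantic recognizers are passed in as callables because the disclosure leaves the concrete models open, and all names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

def build_recognition_result(
    target_voice: Dict[str, List[float]],              # object -> target voice data
    target_text: Dict[str, str],                       # object -> target text data (e.g. an ASR transcript)
    object_position: Dict[str, Tuple[float, float]],   # object -> second position information
    recognize_emotion: Callable[[List[float]], str],   # stand-in for a speech-emotion model
    recognize_semantics: Callable[[str], str],         # stand-in for a semantic/intent model
) -> Dict[str, dict]:
    """Bundle first recognition result, second recognition result, text, and position per object."""
    return {
        obj: {
            "emotion": recognize_emotion(voice),                          # first recognition result
            "semantics": recognize_semantics(target_text.get(obj, "")),  # second recognition result
            "text": target_text.get(obj, ""),
            "position": object_position.get(obj),
        }
        for obj, voice in target_voice.items()
    }
```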
In other embodiments of the present application, the processor 72 is configured to execute the speech recognition program stored in the memory 71 to implement the following steps:
extracting feature information of the target text data associated with each object based on the first recognition result associated with each object and the second recognition result associated with each object;
generating an azimuth voice map based on the feature information associated with each object and the second position information associated with each object; the speech recognition result comprises the azimuth voice map.
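To illustrate these last two steps, a sketch of the feature extraction and of the azimuth voice map itself, here represented as a simple mapping from each object's position to its extracted feature information; the truncation rule and the names are assumptions, not part of the disclosure.

```python
from typing import Dict, Tuple

def extract_feature_info(text: str, emotion: str, semantics: str) -> str:
    """Hypothetical feature extraction: a short text summary tagged with emotion and semantics."""
    summary = text if len(text) <= 40 else text[:37] + "..."
    return f"[{emotion}/{semantics}] {summary}"

def build_azimuth_voice_map(per_object_result: Dict[str, dict]) -> Dict[Tuple[float, float], str]:
    """Map each object's second position information to its feature information.

    `per_object_result` holds, per object, a bundle of "emotion", "semantics",
    "text", and "position" entries as produced in the previous sketch.
    """
    return {
        entry["position"]: extract_feature_info(entry["text"], entry["emotion"], entry["semantics"])
        for entry in per_object_result.values()
        if entry.get("position") is not None and entry.get("text")
    }
```

Rendering this mapping onto a floor plan of the voice recognition scene yields the azimuth voice map included in the speech recognition result.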
Based on the foregoing embodiments, an embodiment of the present application provides a computer storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the following steps:
acquiring a plurality of voice data in a current voice recognition scene; the plurality of voice data comprise voice data of a plurality of objects in the current voice recognition scene, collected by a plurality of voice collectors; the plurality of voice collectors are located at different positions in the current voice recognition scene;
generating target voice data associated with each object of the plurality of objects based on the plurality of voice data; the target voice data associated with each object is derived from at least two voice data of the plurality of voice data;
generating a speech recognition result based on the plurality of target voice data, and outputting the speech recognition result.
In other embodiments of the present application, the one or more programs are executable by the one or more processors to perform the steps of:
dividing each voice data in the plurality of voice data to obtain each sub-voice data set after each voice data is divided; each sub-voice data set comprises a plurality of sections of voice data;
acquiring a plurality of voiceprint characteristics associated with each sub-voice data set;
and generating the target voice data associated with each object based on each sub-voice data set and a plurality of voiceprint characteristics associated with each sub-voice data set.
In other embodiments of the present application, the one or more programs are executable by the one or more processors to perform the steps of:
determining, in the plurality of sub-voice data sets, a plurality of sub-voice data which have the same voiceprint characteristics and the same timestamp;
determining target sub-voice data from the plurality of sub-voice data to obtain a plurality of target sub-voice data associated with the same voiceprint feature;
and generating target voice data associated with each object based on the target sub-voice data and the time stamps corresponding to the target sub-voice data.
In other embodiments of the present application, the one or more programs are executable by the one or more processors to perform the steps of:
the target sub voice data is voice data having the largest amplitude among the plurality of sub voice data.
In other embodiments of the present application, the one or more programs are executable by the one or more processors to perform the steps of:
acquiring first position information of a plurality of voice collectors in a current voice recognition scene;
accordingly, generating a speech recognition result based on the plurality of target speech data further comprises:
determining second position information of each object in the plurality of objects in the current voice recognition scene based on the first position information and the plurality of target voice data;
and generating a voice recognition result based on the plurality of second position information and the plurality of target voice data.
In other embodiments of the present application, the one or more programs are executable by the one or more processors to perform the steps of:
performing voice emotion recognition on the plurality of target voice data to obtain a plurality of first recognition results;
acquiring a plurality of target text data associated with a plurality of target voice data;
performing semantic recognition on the target text data to obtain a plurality of second recognition results;
generating the voice recognition result based on the plurality of first recognition results, the plurality of second recognition results, the plurality of target text data, and the plurality of second position information.
In other embodiments of the present application, the one or more programs are executable by the one or more processors to perform the steps of:
extracting feature information of the target text data associated with each object based on the first recognition result associated with each object and the second recognition result associated with each object;
generating an azimuth voice map based on the feature information associated with each object and the second position information associated with each object; the speech recognition result comprises the azimuth voice map.
It should be noted that, for steps and contents in this embodiment that are the same as those in other embodiments, reference may be made to the descriptions in the other embodiments, which are not repeated here.
The computer storage medium/Memory may be a Read Only Memory (ROM), a Programmable Read Only Memory (PROM), an Erasable Programmable Read Only Memory (EPROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Ferroelectric Random Access Memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disc, or a Compact Disc Read-Only Memory (CD-ROM); it may also be any of various terminals that include one of the above-mentioned memories or any combination thereof, such as a mobile phone, a computer, a tablet device, or a personal digital assistant.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division of the units is only a logical functional division, and there may be other divisions in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted or not implemented. In addition, the coupling, direct coupling, or communication connection between the components shown or discussed may be implemented through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical, or in other forms.
The units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units; that is, they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, all functional units in the embodiments of the present application may be integrated into one processing module, or each unit may serve as a separate unit, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware, or in the form of hardware plus a software functional unit. Those of ordinary skill in the art will understand that all or part of the steps for implementing the above method embodiments may be performed by hardware related to program instructions; the program may be stored in a computer-readable storage medium and, when executed, performs the steps of the method embodiments; and the aforementioned storage medium includes various media capable of storing program code, such as a removable memory device, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The methods disclosed in the several method embodiments provided in the present application may be combined arbitrarily without conflict to obtain new method embodiments.
Features disclosed in several of the product embodiments provided in the present application may be combined in any combination to yield new product embodiments without conflict.
The features disclosed in the several method or apparatus embodiments provided in the present application may be combined arbitrarily, without conflict, to arrive at new method embodiments or apparatus embodiments.
The above description covers only specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any changes or substitutions that a person skilled in the art could readily conceive of within the technical scope disclosed in the present application shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (10)
1. A method of speech recognition, the method comprising:
acquiring a plurality of voice data in a current voice recognition scene; the plurality of voice data comprise voice data of a plurality of objects in the current voice recognition scene, which are collected by a plurality of voice collectors; the plurality of voice collectors are located at different positions in the current voice recognition scene;
generating target speech data associated with each object of a plurality of objects based on the plurality of speech data; the target voice data associated with each object is derived from at least two voice data of the plurality of voice data;
and generating a voice recognition result based on the target voice data, and outputting the voice recognition result.
2. The method of claim 1, wherein generating target speech data associated with each of a plurality of objects based on the plurality of speech data comprises:
dividing each voice data in the plurality of voice data to obtain each sub-voice data set after each voice data is divided; each sub voice data set comprises a plurality of sections of voice data;
acquiring a plurality of voiceprint characteristics associated with each sub-voice data set;
and generating target voice data associated with each object based on each sub voice data set and a plurality of voiceprint characteristics associated with each sub voice data set.
3. The method according to claim 2, wherein generating the target speech data associated with each object based on the each sub-speech data set and a plurality of voiceprint features associated with the each sub-speech data set comprises:
determining a plurality of sub-voice data with the same voiceprint characteristic and the same time stamp in a plurality of sub-voice data sets;
determining target sub-voice data from the plurality of sub-voice data to obtain a plurality of target sub-voice data associated with the same voiceprint characteristic;
and generating target voice data associated with each object based on the target sub-voice data and the time stamps corresponding to the target sub-voice data.
4. The method according to claim 3, wherein the target sub voice data is voice data having a maximum amplitude among the plurality of sub voice data.
5. The method of any of claims 1-3, wherein prior to generating a speech recognition result based on the target speech data, the method further comprises:
acquiring first position information of the plurality of voice collectors in the current voice recognition scene;
correspondingly, the generating a speech recognition result based on a plurality of the target speech data further includes:
determining second position information of each object in the plurality of objects in the current voice recognition scene based on the first position information and a plurality of the target voice data;
and generating a voice recognition result based on the plurality of second position information and the plurality of target voice data.
6. The method according to claim 5, wherein generating a speech recognition result based on the plurality of second location information and the plurality of target speech data comprises:
performing voice emotion recognition on the target voice data to obtain a plurality of first recognition results;
acquiring a plurality of target text data associated with a plurality of target voice data;
performing semantic recognition on the target text data to obtain a plurality of second recognition results;
generating the voice recognition result based on the plurality of first recognition results, the plurality of second recognition results, the plurality of target text data, and the plurality of second position information.
7. The method according to claim 6, wherein the generating the speech recognition result based on the plurality of first recognition results, the plurality of second recognition results, the plurality of target text data, and the plurality of second position information comprises:
extracting feature information of the target text data associated with each object based on the first recognition result associated with each object and the second recognition result associated with each object;
generating an azimuth voice map based on the feature information associated with each object and the second position information associated with each object; wherein the speech recognition result comprises the azimuth voice map.
8. A speech recognition apparatus, characterized in that the apparatus comprises:
an acquisition unit, configured to acquire a plurality of voice data in a current voice recognition scene; the plurality of voice data comprise voice data of a plurality of objects in the current voice recognition scene, which are collected by a plurality of voice collectors; the plurality of voice collectors are located at different positions in the current voice recognition scene;
a first processing unit configured to generate target voice data associated with each of a plurality of objects based on the plurality of voice data; the target voice data associated with each object is derived from at least two voice data of the plurality of voice data;
and a second processing unit, configured to generate a voice recognition result based on the target voice data and output the voice recognition result.
9. An electronic device, characterized in that the electronic device comprises:
a memory for storing executable instructions;
a processor for executing executable instructions stored in the memory to implement the speech recognition method of any of claims 1 to 7.
10. A computer storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement a speech recognition method as claimed in any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010700307.8A CN113963694B (en) | 2020-07-20 | Voice recognition method, voice recognition device, electronic equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010700307.8A CN113963694B (en) | 2020-07-20 | Voice recognition method, voice recognition device, electronic equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113963694A true CN113963694A (en) | 2022-01-21 |
CN113963694B CN113963694B (en) | 2024-11-15 |
Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2008309856A (en) * | 2007-06-12 | 2008-12-25 | Yamaha Corp | Speech recognition device and conference system |
JP2013011744A (en) * | 2011-06-29 | 2013-01-17 | Mizuho Information & Research Institute Inc | Minutes creation system, minutes creation method and minutes creation program |
JP2013222347A (en) * | 2012-04-17 | 2013-10-28 | Canon Inc | Minute book generation device and minute book generation method |
US20170270930A1 (en) * | 2014-08-04 | 2017-09-21 | Flagler Llc | Voice tallying system |
CN108922538A (en) * | 2018-05-29 | 2018-11-30 | 平安科技(深圳)有限公司 | Conferencing information recording method, device, computer equipment and storage medium |
JP2019061594A (en) * | 2017-09-28 | 2019-04-18 | 株式会社野村総合研究所 | Conference support system and conference support program |
CN109767757A (en) * | 2019-01-16 | 2019-05-17 | 平安科技(深圳)有限公司 | A kind of minutes generation method and device |
CN109803059A (en) * | 2018-12-17 | 2019-05-24 | 百度在线网络技术(北京)有限公司 | Audio-frequency processing method and device |
CN209232405U (en) * | 2018-11-01 | 2019-08-09 | 信利光电股份有限公司 | A kind of minutes device |
CN110178178A (en) * | 2016-09-14 | 2019-08-27 | 纽昂斯通讯有限公司 | Microphone selection and multiple talkers segmentation with environment automatic speech recognition (ASR) |
CN110265032A (en) * | 2019-06-05 | 2019-09-20 | 平安科技(深圳)有限公司 | Conferencing data analysis and processing method, device, computer equipment and storage medium |
CN110335612A (en) * | 2019-07-11 | 2019-10-15 | 招商局金融科技有限公司 | Minutes generation method, device and storage medium based on speech recognition |
CN110858476A (en) * | 2018-08-24 | 2020-03-03 | 北京紫冬认知科技有限公司 | Sound collection method and device based on microphone array |
US20200211561A1 (en) * | 2018-12-31 | 2020-07-02 | HED Technologies Sari | Systems and methods for voice identification and analysis |
Non-Patent Citations (2)
Title |
---|
BALASUNDARAM, K.: "Speech Document Summarization using Neural Network", 2019 4th International Conference on Information Technology Research (ICITR), 31 December 2019 (2019-12-31) * |
YANG HONGZHEN (杨鸿珍): "Design of Communication Video Conference Based on Intelligent Voice" (基于智能语音的通信视频会议设计), Digital Technology and Application (数字技术与应用), 30 November 2019 (2019-11-30) * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109741732B (en) | Named entity recognition method, named entity recognition device, equipment and medium | |
Tiwari | MFCC and its applications in speaker recognition | |
Zhou et al. | Efficient audio stream segmentation via the combined T/sup 2/statistic and Bayesian information criterion | |
TWI396184B (en) | A method for speech recognition on all languages and for inputing words using speech recognition | |
Dhanalakshmi et al. | Classification of audio signals using AANN and GMM | |
CN111145786A (en) | Speech emotion recognition method and device, server and computer readable storage medium | |
KR101616112B1 (en) | Speaker separation system and method using voice feature vectors | |
JPH0990974A (en) | Signal processor | |
Pao et al. | Combining acoustic features for improved emotion recognition in mandarin speech | |
Venter et al. | Automatic detection of African elephant (Loxodonta africana) infrasonic vocalisations from recordings | |
Hasija et al. | Recognition of children Punjabi speech using tonal non-tonal classifier | |
Rudresh et al. | Performance analysis of speech digit recognition using cepstrum and vector quantization | |
Dave et al. | Speech recognition: A review | |
CN114303186A (en) | System and method for adapting human speaker embedding in speech synthesis | |
Kamble et al. | Emotion recognition for instantaneous Marathi spoken words | |
Arsikere et al. | Computationally-efficient endpointing features for natural spoken interaction with personal-assistant systems | |
Unnibhavi et al. | LPC based speech recognition for Kannada vowels | |
Deiv et al. | Automatic gender identification for hindi speech recognition | |
CN117935789A (en) | Speech recognition method, system, equipment and storage medium | |
CN113963694B (en) | Voice recognition method, voice recognition device, electronic equipment and storage medium | |
CN113963694A (en) | Voice recognition method, voice recognition device, electronic equipment and storage medium | |
KR20090061566A (en) | Microphone array based speech recognition system and target speech extraction method of the system | |
CN113990325A (en) | Streaming voice recognition method and device, electronic equipment and storage medium | |
Razak et al. | Towards automatic recognition of emotion in speech | |
Benati et al. | Spoken term detection based on acoustic speech segmentation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant |