CN111128131B - Voice recognition method and device, electronic equipment and computer readable storage medium - Google Patents

Voice recognition method and device, electronic equipment and computer readable storage medium

Info

Publication number
CN111128131B
CN111128131B (application number CN201911297403.6A)
Authority
CN
China
Prior art keywords
voice
candidate
speech
candidate regions
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911297403.6A
Other languages
Chinese (zh)
Other versions
CN111128131A (en)
Inventor
王超
冯大航
陈孝良
常乐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing SoundAI Technology Co Ltd
Original Assignee
Beijing SoundAI Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing SoundAI Technology Co Ltd
Priority to CN201911297403.6A
Publication of CN111128131A
Application granted
Publication of CN111128131B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/04 Segmentation; Word boundary detection
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the disclosure discloses a voice recognition method, a voice recognition device, electronic equipment and a computer-readable storage medium. The voice recognition method comprises the following steps: acquiring an input voice file; dividing the voice file into a plurality of voice frames; extracting voice feature points in the voice frames to generate a voice feature map; extracting a plurality of candidate regions from the voice feature map; and identifying, from the plurality of candidate regions, one or more candidate regions including the target speech. By extracting a plurality of candidate regions from the feature map of the voice frames and identifying the regions that include the target speech from among those candidate regions, the method solves the technical problem in the prior art that the position of the target speech within the voice file cannot be determined.

Description

Voice recognition method and device, electronic equipment and computer readable storage medium
Technical Field
The present disclosure relates to the field of speech recognition, and in particular, to a speech recognition method, apparatus, electronic device, and computer-readable storage medium.
Background
As a means of human-machine interaction, voice recognition technology is significant for freeing up human hands. With the advent of various smart speakers, voice interaction has become a new entry point to the internet, and more and more smart devices adopt voice wake-up as a bridge for communication between people and devices, so voice wake-up (keyword spotting, KWS) technology is becoming increasingly important.
The voice wake-up technical route has gone through several generations of iteration, roughly three: the first generation used template matching, comparing features of the input speech against those of a template speech and deciding whether to wake up based on the comparison result; the second generation combined methods from ASR, adopting an HMM-GMM model whose recognition result distinguishes keywords from non-keywords; the third generation adopts neural network schemes, which can directly identify end to end whether the speech contains the wake-up word.
However, the above technical solutions can only detect whether a segment of speech contains a specific wake-up word; they cannot detect at which position within that segment the wake-up word is located.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
In a first aspect, an embodiment of the present disclosure provides a speech recognition method, including:
acquiring an input voice file;
dividing the voice file into a plurality of voice frames;
extracting voice feature points in the voice frames to generate a voice feature map;
extracting a plurality of candidate regions from the voice feature map;
one or more candidate regions including the target speech are identified from the plurality of candidate regions.
Further, the dividing the voice file into a plurality of voice frames includes:
acquiring the length M of a voice frame and the moving interval N of the voice frame;
and extracting a plurality of voice frames by taking the head of the voice file as a starting point, wherein the length of each voice frame is M, and the starting points of two adjacent voice frames are separated by N.
Further, the extracting of the speech feature points in the plurality of speech frames to generate a speech feature map includes:
carrying out short-time Fourier transform on each of the plurality of voice frames to obtain a plurality of frequency characteristic points;
and forming a voice feature map from the plurality of frequency feature points according to the sequence of the voice frames.
Further, the extracting a plurality of candidate regions from the speech feature map includes:
performing initial segmentation on the voice feature map so as to divide the voice feature map into a plurality of initial regions;
combining the plurality of initial regions according to the similarity to generate a plurality of intermediate regions;
and taking the plurality of intermediate regions as initial regions to continue to be merged until the intermediate regions cannot be merged, so as to obtain a plurality of candidate regions.
Further, the identifying one or more candidate regions including the target speech from the plurality of candidate regions includes:
inputting the voice feature map into a convolutional neural network to generate a voice feature vector;
mapping the candidate frames of the candidate regions to the voice feature vector according to the positions of the candidate frames in the voice feature map to obtain the feature vector of one or more candidate regions;
identifying one or more candidate regions comprising target speech according to the feature vectors of the one or more candidate regions;
and determining the position of one or more candidate regions comprising target voice on the voice feature map according to the feature vectors of the one or more candidate regions.
Further, the identifying one or more candidate regions including the target speech according to the feature vectors of the one or more candidate regions includes:
inputting the feature vectors of the one or more candidate regions into a fully-connected layer to generate one or more fixed-length feature vectors;
inputting the one or more fixed-length feature vectors into a first function to calculate the probability that the one or more fixed-length feature vectors are feature vectors of target speech;
and determining one or more candidate regions corresponding to one or more fixed-length feature vectors with the probability greater than a first threshold value as candidate regions comprising the target voice.
Further, the determining, according to the feature vectors of the one or more candidate regions, the position of the one or more candidate regions including the target speech on the speech feature map includes:
inputting the feature vectors of the one or more candidate regions into a fully-connected layer to generate one or more fixed-length feature vectors;
inputting the one or more fixed-length feature vectors into a second function to calculate adjustment parameters of the one or more candidate regions;
and calculating the positions of the one or more candidate regions comprising the target voice according to the adjusting parameters.
Further, the one or more candidate regions are represented on the speech feature map by start point coordinates and end point coordinates.
In a second aspect, an embodiment of the present disclosure provides a speech recognition apparatus, including:
the voice acquisition module is used for acquiring an input voice file;
the framing module is used for dividing the voice file into a plurality of voice frames;
the characteristic diagram extraction module is used for extracting voice characteristic points in the voice frames to generate a voice characteristic diagram;
a candidate region extraction module, configured to extract a plurality of candidate regions from the speech feature map;
and the target speech recognition module is used for recognizing one or more candidate areas comprising the target speech from the plurality of candidate areas.
Further, the framing module further includes:
a framing parameter obtaining module, configured to obtain a speech frame length M and a speech frame moving interval N;
and the voice frame extraction module is used for extracting a plurality of voice frames by taking the head of the voice file as a starting point, wherein the length of each voice frame is M, and the starting points of two adjacent voice frames are separated by N.
Further, the feature map extraction module further includes:
a frequency characteristic point obtaining module, configured to perform short-time fourier transform on each of the multiple speech frames to obtain multiple frequency characteristic points;
and the voice feature map combining module is used for combining the plurality of frequency feature points into a voice feature map according to the sequence of the voice frames.
Further, the candidate region extraction module further includes:
the initial region segmentation module is used for initially segmenting the voice feature map into a plurality of initial regions;
the region merging module is used for merging the plurality of initial regions according to the similarity so as to generate a plurality of intermediate regions; and taking the plurality of intermediate regions as initial regions to continue to be merged until the intermediate regions cannot be merged, so as to obtain a plurality of candidate regions.
Further, the target speech recognition module includes:
the voice feature vector generation module is used for inputting the voice feature map into a convolutional neural network to generate a voice feature vector;
a candidate region feature vector obtaining module, configured to map the candidate frames of the multiple candidate regions to the speech feature vector according to their positions in the speech feature map to obtain feature vectors of one or more candidate regions;
the target candidate area identification module is used for identifying one or more candidate areas comprising target voice according to the feature vectors of the one or more candidate areas;
and the candidate area position determining module is used for determining the positions of one or more candidate areas comprising the target voice on the voice feature map according to the feature vectors of the one or more candidate areas.
Further, the target candidate area identification module is further configured to:
inputting the feature vectors of the one or more candidate regions into a fully-connected layer to generate one or more fixed-length feature vectors;
inputting the one or more fixed-length feature vectors into a first function to calculate the probability that the one or more fixed-length feature vectors are feature vectors of target speech;
and determining one or more candidate regions corresponding to one or more fixed-length feature vectors with the probability greater than a first threshold value as candidate regions comprising the target voice.
Further, the candidate area location determining module is further configured to:
inputting the feature vectors of the one or more candidate regions into a fully-connected layer to generate one or more fixed-length feature vectors;
inputting the one or more fixed-length feature vectors into a second function to calculate adjustment parameters of the one or more candidate regions;
and calculating the positions of the one or more candidate regions comprising the target voice according to the adjusting parameters.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including: at least one processor; and
a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the speech recognition method of any of the preceding first aspects.
In a fourth aspect, the disclosed embodiments provide a non-transitory computer-readable storage medium, wherein the non-transitory computer-readable storage medium stores computer instructions for causing a computer to execute the speech recognition method of any one of the foregoing first aspects.
The embodiment of the disclosure discloses a voice recognition method, a voice recognition device, electronic equipment and a computer-readable storage medium. The voice recognition method comprises the following steps: acquiring an input voice file; dividing the voice file into a plurality of voice frames; extracting voice feature points in the voice frames to generate a voice feature map; extracting a plurality of candidate regions from the voice feature map; and identifying, from the plurality of candidate regions, one or more candidate regions including the target speech. By extracting a plurality of candidate regions from the feature map of the voice frames and identifying the regions that include the target speech from among those candidate regions, the method solves the technical problem in the prior art that the position of the target speech within the voice file cannot be determined.
The foregoing is a summary of the present disclosure, provided to promote a clear understanding of its technical means; the present disclosure may be embodied in other specific forms without departing from its spirit or essential attributes.
Drawings
The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale.
Fig. 1 is a schematic view of an application scenario of an embodiment of the present disclosure;
fig. 2 is a schematic flow chart of a speech recognition method according to an embodiment of the present disclosure;
FIG. 3 is a diagram illustrating a plurality of divided speech frames in a speech file according to an embodiment of the present disclosure;
fig. 4 is a flowchart illustrating a specific example of step S203 in the speech recognition method according to the embodiment of the disclosure;
Figs. 5a-5b are schematic diagrams illustrating the generation of a speech feature map in a speech recognition method according to an embodiment of the present disclosure;
fig. 6 is a flowchart illustrating a specific example of step S204 in the speech recognition method according to the embodiment of the disclosure;
FIG. 7 is a schematic diagram of a process for recognizing a target speech provided by an embodiment of the present disclosure;
fig. 8 is a schematic structural diagram of an embodiment of a speech recognition apparatus provided in an embodiment of the present disclosure;
fig. 9 is a schematic structural diagram of an electronic device provided according to an embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It is noted that references to "a", "an", and "the" modifications in this disclosure are intended to be illustrative rather than limiting, and that those skilled in the art will recognize that "one or more" may be used unless the context clearly dictates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
Fig. 1 is a schematic view of an application scenario of the embodiment of the present disclosure. As shown in fig. 1, a user 101 inputs voice to a terminal device 102. The terminal device 102 may be any terminal device capable of receiving natural language input, such as a smart phone, a smart speaker or a smart home appliance, and is connected through a network to a voice recognition device 103, which may be a computer, a smart terminal or the like; the network over which the terminal device 102 communicates with the voice recognition device 103 may be a wireless network, such as a 5G network or a Wi-Fi network, or a wired network, such as an optical fiber network. In this application scenario, the user 101 speaks, the terminal device 102 collects the voice and sends it to the voice recognition device 103, and if the voice recognition device 103 recognizes a target voice (i.e., a wake-up word), the terminal device 102 executes the function corresponding to that target voice.
It will be appreciated that the speech recognition device 103 and the terminal device 102 may be arranged together, i.e. the terminal device 102 may incorporate speech recognition functionality, such that a user's speech input may be recognized directly in the terminal device 102. After the voice is recognized, the terminal device 102 may perform a function related to the voice according to the voice.
Fig. 2 is a flowchart of an embodiment of a speech recognition method provided in this disclosure, where the speech recognition method provided in this embodiment may be executed by a speech recognition apparatus, the speech recognition apparatus may be implemented as software, or implemented as a combination of software and hardware, and the speech recognition apparatus may be integrated in some device in a speech recognition system, such as a speech recognition server or a speech recognition terminal device. As shown in fig. 2, the method comprises the steps of:
step S201, acquiring an input voice file;
In the present disclosure, the input voice file is acquired from an audio source. Optionally, the audio source in this step is an audio acquisition device, such as a microphone of any form, which acquires the voice from the environment and converts it into a voice file; the converted voice file is then obtained from the audio acquisition device. Typically, as shown in fig. 1, the terminal device 102 includes an audio acquisition device, such as a microphone, through which the voice in the environment where the terminal device is located can be collected.
Optionally, the audio source in this step is a storage space for storing the voice file. The storage space may be a local storage space or a remote storage space, and in this optional embodiment, acquiring the input voice file requires first acquiring an address of the storage space, and then acquiring the voice file from the storage space.
Step S202, dividing the voice file into a plurality of voice frames;
In this step, the plurality of speech frames may be connected end to end without overlap, or may partially overlap one another.
Optionally, the dividing the voice file into a plurality of voice frames includes:
acquiring the length M of a voice frame and the moving interval N of the voice frame;
and extracting a plurality of voice frames by taking the head of the voice file as a starting point, wherein the length of each voice frame is M, and the starting points of two adjacent voice frames are separated by N.
Fig. 3 shows an example of a plurality of speech frames. As shown in FIG. 3, A and F are the two endpoints of the voice file, A representing its beginning and F its end. In this example, the framing operation divides the voice file into 3 speech frames, AB, CD and EF, where AB and CD have an overlapping region CB, and CD and EF have an overlapping region ED. Since the length of a speech file is generally expressed in milliseconds (ms), and an input speech signal is generally regarded as short-time stationary within 10 ms to 30 ms, in one example the speech frame length is 25 ms and the speech frame moving interval is 10 ms. Following the example in fig. 3, a voice file with AB = CD = EF = 25 ms, AC = CE = 10 ms and AF = 45 ms is divided into 3 speech frames of length 25 ms with a moving interval of 10 ms.
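For concreteness, the following minimal sketch (not part of the patent; the function name frame_signal and the numpy implementation are illustrative assumptions) shows one way to perform the framing described above, reproducing the 3-frame example of fig. 3 with a frame length M of 25 ms and a moving interval N of 10 ms.

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=25.0, hop_ms=10.0):
    """Split a 1-D waveform into overlapping frames of length M (frame_ms) with hop N (hop_ms)."""
    frame_len = int(sample_rate * frame_ms / 1000)   # M in samples
    hop_len = int(sample_rate * hop_ms / 1000)       # N in samples
    num_frames = 1 + max(0, len(signal) - frame_len) // hop_len
    return np.stack([signal[i * hop_len: i * hop_len + frame_len]
                     for i in range(num_frames)])

# 45 ms of audio at 16 kHz -> 3 frames of 25 ms spaced 10 ms apart, matching fig. 3.
audio = np.zeros(int(16000 * 0.045))                 # 720 samples
print(frame_signal(audio, 16000).shape)              # (3, 400)
```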
Step S203, extracting voice feature points in the voice frames to generate a voice feature map;
in this step, the features of each of the divided speech frames are extracted to form feature points, which may include information such as frequency and amplitude of speech in the speech frame.
Optionally, the extracting the speech feature points in the plurality of speech frames to generate a speech feature map includes:
step S401, carrying out short-time Fourier transform on each of the plurality of voice frames to obtain a plurality of frequency characteristic points;
step S402, forming a voice feature map from the plurality of frequency feature points according to the sequence of the voice frames.
Each of the plurality of voice frames represents a speech signal, which can be rendered as a set of pixel points on an XY two-dimensional image. The audio signal itself is a one-dimensional array whose length is determined by the audio duration and the sampling rate; for example, a sampling rate of 16 kHz means 16,000 points are sampled per second, so if the speech signal is 10 s long, the voice file contains 160,000 values, each value being the amplitude of the speech. The speech signal is converted from a time-domain signal to a frequency-domain signal using a short-time Fourier transform. The transform requires a parameter H that determines how many points the short-time Fourier transform uses; if H is 512, a 512-point transform is performed on each speech frame. Because the Fourier transform is symmetric, H/2+1 points are kept when H is even and (H+1)/2 points when H is odd; for example, 257 values are finally obtained when H is 512. The points obtained for one speech frame form a one-dimensional column vector, which becomes one column of the voice feature map. Performing this short-time Fourier transform on every speech frame yields a plurality of column vectors, and arranging them according to the sequence of the speech frames produces the voice feature map: the abscissa of the feature map on the XY two-dimensional image represents the sequence of the speech frames, i.e. time; the ordinate represents the frequency of the feature point; and the gray value of a feature point represents its amplitude. The above process is shown in figs. 5a-5b. Fig. 5a is a schematic diagram of performing a short-time Fourier transform on one speech frame: the time-domain signal image is converted into a frequency-domain image and then into a grayscale map in which height represents frequency, each small square represents a feature point, and the gray value represents the amplitude of that point. Fig. 5b shows the voice feature map generated after all speech frames have been short-time Fourier transformed: the abscissa represents time (the order of the speech frames in the voice file), the ordinate represents the frequency of a feature point, and the grayscale of a feature point represents its amplitude.
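As an illustration only, the following numpy sketch shows one way to build such a feature map from the framed signal: an H-point short-time Fourier transform per frame (H/2 + 1 = 257 frequency points for H = 512), with the resulting column vectors arranged in frame order. The Hann window and the function name are assumptions, not requirements of the patent.

```python
import numpy as np

def speech_feature_map(frames, n_fft=512):
    """frames: (num_frames, frame_len) -> feature map of shape (n_fft // 2 + 1, num_frames)."""
    windowed = frames * np.hanning(frames.shape[1])    # window each frame to reduce spectral leakage
    spectrum = np.fft.rfft(windowed, n=n_fft, axis=1)  # (num_frames, n_fft // 2 + 1) complex values
    magnitude = np.abs(spectrum)                       # amplitude of each frequency feature point
    return magnitude.T                                 # x-axis: frame order (time); y-axis: frequency
```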
Step S204, extracting a plurality of candidate areas from the voice feature map;
A column of feature points in the voice feature map represents the sampling points of one speech frame. When a candidate region is extracted, the height of the region therefore need not be considered: it is simply the height of the voice feature map, so a candidate region can be represented by two coordinates in the time direction, namely its start coordinate and its end coordinate. Illustratively, a coordinate may be the sequence number of a speech frame; for example, (3,6) may represent a candidate region containing the feature points of all speech frames between speech frame 3 and speech frame 6.
Optionally, the extracting a plurality of candidate regions from the speech feature map includes:
step S601, the voice feature map is initially divided into a plurality of initial areas;
step S602, merging the plurality of initial regions according to the similarity to generate a plurality of intermediate regions;
step S603, continuing to merge the plurality of intermediate regions as initial regions until the intermediate regions cannot be merged, so as to obtain a plurality of candidate regions.
Illustratively, in step S601 the voice feature map is segmented into a plurality of initial regions using a graph-based image segmentation algorithm. In step S602, the initial regions are merged according to their similarity to generate intermediate regions; the similarity may be determined, for example, from the similarity of the gray values of feature points at the same height, or from the difference between the weighted averages of the feature points in two regions, and similar, adjacent initial regions are merged to form intermediate regions. Then, in step S603, the intermediate regions are treated as initial regions and merging continues until the similarity condition is no longer met, yielding a plurality of candidate regions among which the similarity is small.
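The patent does not fix a particular merging algorithm, so the following simplified one-dimensional sketch should be read only as an assumed analogue of steps S601-S603: the feature map is first cut into small time spans, and adjacent spans are repeatedly merged as long as some pair still meets a similarity threshold, leaving candidate regions whose mutual similarity is small.

```python
import numpy as np

def merge_candidate_regions(feature_map, init_width=4, sim_threshold=0.9):
    """feature_map: (freq_bins, num_frames). Returns candidate (start_frame, end_frame) spans."""
    num_frames = feature_map.shape[1]
    # Initial segmentation into small, equal time spans (a stand-in for the graph-based segmentation).
    regions = [(s, min(s + init_width, num_frames)) for s in range(0, num_frames, init_width)]

    def similarity(a, b):
        # Cosine similarity of the mean column vectors of two spans (one possible similarity measure).
        va = feature_map[:, a[0]:a[1]].mean(axis=1)
        vb = feature_map[:, b[0]:b[1]].mean(axis=1)
        return float(va @ vb) / (np.linalg.norm(va) * np.linalg.norm(vb) + 1e-8)

    merged = True
    while merged and len(regions) > 1:
        merged = False
        for i in range(len(regions) - 1):
            if similarity(regions[i], regions[i + 1]) >= sim_threshold:
                regions[i:i + 2] = [(regions[i][0], regions[i + 1][1])]  # merge the adjacent pair
                merged = True
                break
    return regions
```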
In step S205, one or more candidate regions including the target speech are identified from the plurality of candidate regions.
In this step, it is identified whether each of the plurality of candidate regions includes the target speech and a position of each candidate region on the speech feature map, respectively.
Optionally, the identifying one or more candidate regions including the target speech from the plurality of candidate regions includes:
inputting the voice feature map into a convolutional neural network to generate a voice feature vector;
mapping the candidate frames of the candidate regions to the voice feature vector according to the positions of the candidate frames in the voice feature map to obtain the feature vector of one or more candidate regions;
identifying one or more candidate regions comprising target speech according to the feature vectors of the one or more candidate regions;
and determining the position of one or more candidate regions comprising target voice on the voice feature map according to the feature vectors of the one or more candidate regions.
Fig. 7 is a diagram illustrating recognition of a target voice. As shown in fig. 7, a voice feature map 701 is input into a convolutional neural network 703, a neural network trained in advance for extracting voice features. After the multi-layer convolution calculation of the convolutional neural network 703, a feature vector map 704 is obtained, in which the feature points carry the position information of the corresponding feature points in the voice feature map. The candidate frames of the candidate regions in 702 are then mapped into this feature map; because the feature points in the feature vector map carry position information and the candidate frames themselves also carry position information, the same position information can be used to determine in which candidate region or regions each feature point of the feature vector map lies, and thus the feature vectors of one or more candidate regions can be obtained. Finally, the feature vectors of the one or more candidate regions are used to determine which candidate regions include the target speech and where the target speech is located in the voice file.
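As a rough illustration of this mapping (assumed, written with PyTorch, and not the patent's exact network), each candidate region's time span can be projected onto the convolutional feature map using the network's downsampling factor and pooled into one fixed-length vector per region:

```python
import torch

def region_feature_vectors(feature_map, regions, downsample=4):
    """feature_map: (channels, freq, time') output of the CNN; regions: [(start, end)] in frame indices."""
    vectors = []
    for start, end in regions:
        t0 = start // downsample                       # project frame indices onto the feature map
        t1 = max(end // downsample, t0 + 1)            # keep at least one time step per region
        roi = feature_map[:, :, t0:t1]                 # slice the candidate span along the time axis
        vectors.append(roi.mean(dim=(1, 2)))           # pool to a fixed-length vector per region
    return torch.stack(vectors)                        # (num_regions, channels)
```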
Optionally, the identifying one or more candidate regions including the target speech according to the feature vectors of the one or more candidate regions includes:
inputting the feature vectors of the one or more candidate regions into a fully-connected layer to generate one or more fixed-length feature vectors;
inputting the one or more fixed-length feature vectors into a first function to calculate the probability that the one or more fixed-length feature vectors are feature vectors of target speech;
and determining one or more candidate regions corresponding to one or more fixed-length feature vectors with the probability greater than a first threshold value as candidate regions comprising the target voice.
In the above alternative embodiment, a fully-connected layer plus a decision function is used to determine whether the candidate region represented by a feature vector includes the target speech. The feature vector of the candidate region is converted into a fixed-length feature vector by the fully-connected layer and then input into a first function, illustratively a softmax function, which converts a feature vector into a set of numbers that sum to 1, each number representing one class. If there are 9 target-speech classes, plus one class for containing no target speech, for 10 classes in total, the feature vector of the candidate region is converted by the fully-connected layer into a feature vector of length 10; softmax then computes the value corresponding to each element, giving 10 values whose sum is 1, and the class corresponding to the largest of the 10 values is the target-speech class of that candidate region's feature vector (one of the 9 target speeches, or none).
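A minimal sketch of such a classification head, assuming PyTorch, a 256-dimensional candidate-region vector, the 9-plus-1 class setup of the example above, and an illustrative first threshold of 0.5:

```python
import torch
import torch.nn as nn

num_classes = 10                                  # 9 target-speech classes plus one "no target speech" class
cls_head = nn.Linear(256, num_classes)            # fully-connected layer; 256 is an assumed vector size

region_vectors = torch.randn(5, 256)              # feature vectors of 5 candidate regions
probs = torch.softmax(cls_head(region_vectors), dim=1)   # (5, 10); each row sums to 1
keep = probs[:, :9].max(dim=1).values > 0.5       # regions whose best target-speech probability exceeds the threshold
```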
Optionally, the determining, according to the feature vectors of the one or more candidate regions, the position of the one or more candidate regions including the target speech on the speech feature map includes:
inputting the feature vectors of the one or more candidate regions into a fully-connected layer to generate one or more fixed-length feature vectors;
inputting the one or more fixed-length feature vectors into a second function to calculate adjustment parameters of the one or more candidate regions;
and calculating the positions of the one or more candidate regions comprising the target voice according to the adjusting parameters.
In the above alternative embodiment, the fully-connected layer plus a regression function is used to determine the position of a candidate region on the voice feature map. The feature vector of the candidate region is converted into a fixed-length feature vector by the fully-connected layer, and the fixed-length feature vector is then input into a second function that calculates adjustment parameters for the candidate region. Specifically, the adjustment parameters may include offset values for the coordinate points; adding the offsets to the coordinates of the candidate region gives its corrected position on the voice feature map.
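A corresponding sketch of the position-regression head, again assuming PyTorch and a 256-dimensional fixed-length vector; predicting one offset each for the start and end coordinates is an assumed convention:

```python
import torch
import torch.nn as nn

reg_head = nn.Linear(256, 2)                       # second function realized as a linear regressor: (start offset, end offset)

region_vectors = torch.randn(5, 256)               # fixed-length vectors of 5 candidate regions
offsets = reg_head(region_vectors)                 # (5, 2) adjustment parameters
starts = torch.tensor([3., 8., 20., 31., 40.])     # original start coordinates (frame indices)
ends = torch.tensor([6., 15., 27., 38., 52.])      # original end coordinates
adjusted = torch.stack([starts + offsets[:, 0], ends + offsets[:, 1]], dim=1)  # corrected (start, end) positions
```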
The classification process that judges whether a candidate region contains the target speech and the regression of the candidate region's position can use the same fully-connected layer, so parameters can be shared to reduce the training difficulty of the network. When classifying the candidate regions and regressing their positions, the two tasks can be trained simultaneously using a weighted sum of the two loss functions, so that one set of parameters yields both the classification result and the position regression result for the target speech.
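A sketch of such joint training with a shared fully-connected trunk and a weighted sum of the two losses; the particular loss functions and the weight lambda_reg are assumptions, not taken from the patent.

```python
import torch
import torch.nn as nn

trunk = nn.Linear(256, 256)                        # fully-connected layer shared by both tasks
cls_head = nn.Linear(256, 10)                      # classification branch (9 target classes + background)
reg_head = nn.Linear(256, 2)                       # position-regression branch (start/end offsets)
cls_loss_fn = nn.CrossEntropyLoss()
reg_loss_fn = nn.SmoothL1Loss()
lambda_reg = 1.0                                   # weight of the regression loss in the combined objective

def joint_loss(region_vectors, cls_labels, reg_targets):
    shared = torch.relu(trunk(region_vectors))     # one shared forward pass feeds both heads
    cls_loss = cls_loss_fn(cls_head(shared), cls_labels)
    reg_loss = reg_loss_fn(reg_head(shared), reg_targets)
    return cls_loss + lambda_reg * reg_loss        # train classification and regression simultaneously
```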
The disclosed embodiments disclose a speech recognition method, apparatus, electronic device and computer readable storage medium. The voice recognition method comprises the following steps: acquiring an input voice file; dividing the voice file into a plurality of voice frames; extracting voice feature points in the voice frames to generate a voice feature map; extracting a plurality of candidate regions from the voice feature map; and identifying, from the plurality of candidate regions, one or more candidate regions including the target speech. By extracting a plurality of candidate regions from the feature map of the voice frames and identifying the regions that include the target speech from among those candidate regions, the method solves the technical problem in the prior art that the position of the target speech within the voice file cannot be determined.
Although the steps in the above method embodiments are described in the above order, it should be clear to those skilled in the art that the steps in the embodiments of the present disclosure are not necessarily performed in that order and may also be performed in other orders, such as reverse, parallel or interleaved; furthermore, on the basis of the above steps, those skilled in the art may add other steps, and these obvious modifications or equivalents are also included within the protection scope of the present disclosure and are not described again here.
Fig. 8 is a schematic structural diagram of an embodiment of a speech recognition apparatus according to an embodiment of the present disclosure. As shown in fig. 8, the apparatus 800 includes: a voice acquisition module 801, a framing module 802, a feature map extraction module 803, a candidate region extraction module 804 and a target voice recognition module 805; wherein:
a voice acquiring module 801, configured to acquire an input voice file;
a framing module 802, configured to divide the voice file into a plurality of voice frames;
a feature map extracting module 803, configured to extract voice feature points in the multiple voice frames to generate a voice feature map;
a candidate region extraction module 804, configured to extract a plurality of candidate regions from the speech feature map;
a target speech recognition module 805 configured to identify one or more candidate regions including the target speech from the plurality of candidate regions.
Further, the framing module 802 further includes:
a framing parameter obtaining module, configured to obtain a speech frame length M and a speech frame moving interval N;
and the voice frame extraction module is used for extracting a plurality of voice frames by taking the head of the voice file as a starting point, wherein the length of each voice frame is M, and the starting points of two adjacent voice frames are separated by N.
Further, the feature map extraction module 803 further includes:
a frequency characteristic point obtaining module, configured to perform short-time fourier transform on each of the multiple speech frames to obtain multiple frequency characteristic points;
and the voice feature map combining module is used for combining the plurality of frequency feature points into a voice feature map according to the sequence of the voice frames.
Further, the candidate region extraction module 804 further includes:
the initial region segmentation module is used for initially segmenting the voice feature map into a plurality of initial regions;
the region merging module is used for merging the plurality of initial regions according to the similarity so as to generate a plurality of intermediate regions; and taking the plurality of intermediate regions as initial regions to continue to be merged until the intermediate regions cannot be merged, so as to obtain a plurality of candidate regions.
Further, the target speech recognition module 805 includes:
the voice feature vector generation module is used for inputting the voice feature map into a convolutional neural network to generate a voice feature vector;
a candidate region feature vector obtaining module, configured to map the candidate frames of the multiple candidate regions to the speech feature vector according to positions of the candidate frames in the speech feature map, so as to obtain feature vectors of one or more candidate regions;
the target candidate area identification module is used for identifying one or more candidate areas comprising target voice according to the feature vectors of the one or more candidate areas;
and the candidate area position determining module is used for determining the positions of one or more candidate areas comprising the target voice on the voice feature map according to the feature vectors of the one or more candidate areas.
Further, the target candidate area identification module is further configured to:
inputting the feature vectors of the one or more candidate regions into a fully-connected layer to generate one or more fixed-length feature vectors;
inputting the one or more fixed-length feature vectors into a first function to calculate the probability that the one or more fixed-length feature vectors are feature vectors of target speech;
and determining one or more candidate regions corresponding to one or more fixed-length feature vectors with the probability greater than a first threshold value as candidate regions comprising the target voice.
Further, the candidate region location determining module is further configured to:
inputting the feature vectors of the one or more candidate regions into a fully-connected layer to generate one or more fixed-length feature vectors;
inputting the one or more fixed-length feature vectors into a second function to calculate adjustment parameters of the one or more candidate regions;
and calculating the positions of the one or more candidate regions comprising the target voice according to the adjusting parameters.
The apparatus shown in fig. 8 can perform the method of the embodiments shown in fig. 1-7; for details not described here, refer to the related description of the embodiments shown in fig. 1-7. The implementation process and technical effects of this technical solution are likewise described in the embodiments shown in fig. 1-7 and are not repeated here.
Referring now to FIG. 9, shown is a block diagram of an electronic device 900 suitable for use in implementing embodiments of the present disclosure. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a fixed terminal such as a digital TV, a desktop computer, and the like. The electronic device shown in fig. 9 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 9, the electronic device 900 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 901 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)902 or a program loaded from a storage means 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data necessary for the operation of the electronic apparatus 900 are also stored. The processing apparatus 901, the ROM 902, and the RAM 903 are connected to each other through a bus 904. An input/output (I/O) interface 905 is also connected to bus 904.
Generally, the following devices may be connected to the I/O interface 905: input devices 906 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 907 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 908 including, for example, magnetic tape, hard disk, etc.; and a communication device 909. The communication device 909 may allow the electronic apparatus 900 to perform wireless or wired communication with other apparatuses to exchange data. While fig. 9 illustrates an electronic device 900 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication device 909, or installed from the storage device 908, or installed from the ROM 902. The computer program, when executed by the processing device 901, performs the above-described functions defined in the methods of the embodiments of the present disclosure.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, the clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with digital data communication in any form or medium (e.g., a communications network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be embodied in the electronic device; or may be separate and not incorporated into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquiring an input voice file; dividing the voice file into a plurality of voice frames; extracting voice feature points in the voice frames to generate a voice feature map; extracting a plurality of candidate regions from the voice feature map; one or more candidate regions including the target speech are identified from the plurality of candidate regions.
Computer program code for carrying out operations for the present disclosure may be written in any combination of one or more programming languages, including but not limited to an object oriented programming language such as Java, Smalltalk, C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. In some cases, the name of a unit does not constitute a limitation on the unit itself.
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing description is only exemplary of the preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure herein is not limited to technical solutions formed by the particular combination of features described above, but also encompasses other technical solutions formed by any combination of the above features or their equivalents without departing from the spirit of the disclosure, for example technical solutions formed by substituting the above features with technical features having similar functions disclosed in (but not limited to) the present disclosure.

Claims (10)

1. A speech recognition method comprising:
acquiring an input voice file;
dividing the voice file into a plurality of voice frames;
extracting voice feature points in the voice frames to generate a voice feature map;
extracting a plurality of candidate regions from the voice feature map;
identifying one or more candidate regions including a target speech from the plurality of candidate regions;
wherein identifying one or more candidate regions including the target speech from the plurality of candidate regions comprises:
inputting the voice feature map into a convolutional neural network to generate a voice feature vector;
mapping the candidate frames of the candidate regions to the voice feature vector according to the positions of the candidate frames in the voice feature map to obtain the feature vector of one or more candidate regions;
identifying one or more candidate regions comprising target speech according to the feature vectors of the one or more candidate regions;
and determining the position of one or more candidate regions comprising the target voice on the voice feature map according to the feature vectors of the one or more candidate regions.
2. The speech recognition method of claim 1, wherein the dividing the speech file into a plurality of speech frames comprises:
acquiring the length M of a voice frame and the moving interval N of the voice frame;
and extracting a plurality of voice frames by taking the head of the voice file as a starting point, wherein the length of each voice frame is M, and the starting points of two adjacent voice frames are separated by N.
3. The speech recognition method of claim 1, wherein the extracting speech feature points in the plurality of speech frames to generate a speech feature map comprises:
performing short-time Fourier transform on each of the plurality of voice frames to obtain a plurality of frequency characteristic points;
and forming a voice characteristic diagram by the plurality of frequency characteristic points according to the sequence of the voice frames.
4. The speech recognition method of claim 1, wherein said extracting a plurality of candidate regions from the speech feature map comprises:
performing initial segmentation on the voice feature map to segment the voice feature map into a plurality of initial areas;
combining the plurality of initial regions according to the similarity to generate a plurality of intermediate regions;
and taking the plurality of intermediate regions as initial regions to continue to be merged until the intermediate regions cannot be merged, so as to obtain a plurality of candidate regions.
5. The speech recognition method of claim 1, wherein the identifying one or more candidate regions comprising target speech based on the feature vectors of the one or more candidate regions comprises:
inputting the feature vectors of the one or more candidate regions into a fully-connected layer to generate one or more fixed-length feature vectors;
inputting the one or more fixed-length feature vectors into a first function to calculate the probability that the one or more fixed-length feature vectors are feature vectors of target speech;
and determining one or more candidate regions corresponding to one or more fixed-length feature vectors with the probability greater than a first threshold value as candidate regions comprising the target voice.
6. The speech recognition method of claim 1, wherein the determining the positions of one or more candidate regions comprising target speech on the speech feature map based on the feature vectors of the one or more candidate regions comprises:
inputting the feature vectors of the one or more candidate regions into a fully-connected layer to generate one or more fixed-length feature vectors;
inputting the one or more fixed-length feature vectors into a second function to calculate adjustment parameters of the one or more candidate regions;
and calculating the positions of the one or more candidate regions comprising the target speech according to the adjustment parameters.
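Claim 6 mirrors the bounding-box regression used in detection networks: the "second function" outputs adjustment parameters that refine each candidate region. The sketch below applies a centre-shift / width-rescale parameterisation on the time axis; that (dx, dw) form is borrowed from Fast R-CNN-style regressors and is only an assumption here.

import numpy as np

def adjust_regions(boxes: np.ndarray, params: np.ndarray) -> np.ndarray:
    # boxes:  (num_regions, 2) start/end coordinates on the feature map's time axis
    # params: (num_regions, 2) predicted adjustment parameters (dx, dw)
    start, end = boxes[:, 0], boxes[:, 1]
    width = end - start
    centre = start + 0.5 * width
    new_centre = centre + params[:, 0] * width     # shift the region centre
    new_width = width * np.exp(params[:, 1])       # rescale the region width
    return np.stack([new_centre - 0.5 * new_width,
                     new_centre + 0.5 * new_width], axis=1)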
7. The speech recognition method of claim 1, wherein:
the one or more candidate regions are represented on the speech feature map by start point coordinates and end point coordinates.
8. A speech recognition apparatus comprising:
the voice acquisition module is used for acquiring an input voice file;
the framing module is used for dividing the voice file into a plurality of voice frames;
the feature map extraction module is used for extracting voice feature points in the plurality of voice frames to generate a voice feature map;
a candidate region extraction module, configured to extract a plurality of candidate regions from the speech feature map;
a target speech recognition module for identifying, from the plurality of candidate regions, one or more candidate regions that include target speech;
wherein the target speech recognition module comprises:
the voice feature vector generation module is used for inputting the voice feature map into a convolutional neural network to generate a voice feature vector;
a candidate region feature vector obtaining module, configured to map the candidate frames of the multiple candidate regions to the speech feature vector according to positions of the candidate frames in the speech feature map, so as to obtain feature vectors of one or more candidate regions;
the target candidate region identification module is used for identifying one or more candidate regions comprising target speech according to the feature vectors of the one or more candidate regions;
and the candidate region position determining module is used for determining the positions of the one or more candidate regions comprising the target speech on the voice feature map according to the feature vectors of the one or more candidate regions.
9. An electronic device, comprising:
a memory for storing computer readable instructions; and
a processor for executing the computer readable instructions, such that the processor, when executing, implements the speech recognition method according to any one of claims 1-7.
10. A non-transitory computer-readable storage medium storing computer-readable instructions that, when executed by a computer, cause the computer to perform the speech recognition method of any one of claims 1-7.
CN201911297403.6A 2019-12-17 2019-12-17 Voice recognition method and device, electronic equipment and computer readable storage medium Active CN111128131B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911297403.6A CN111128131B (en) 2019-12-17 2019-12-17 Voice recognition method and device, electronic equipment and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN111128131A CN111128131A (en) 2020-05-08
CN111128131B (en) 2022-07-01

Family

ID=70499284

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911297403.6A Active CN111128131B (en) 2019-12-17 2019-12-17 Voice recognition method and device, electronic equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN111128131B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111598092A (en) * 2020-05-25 2020-08-28 北京达佳互联信息技术有限公司 Method for determining target area in image, method and device for identifying target
CN113066193B (en) * 2021-03-31 2021-11-05 泰瑞数创科技(北京)有限公司 Method for enhancing reality on live-action three-dimensional map

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130090926A1 (en) * 2011-09-16 2013-04-11 Qualcomm Incorporated Mobile device context information using speech detection
IL263655B2 (en) * 2016-06-14 2023-03-01 Netzer Omry Automatic speech recognition

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1327575A (en) * 1999-10-21 2001-12-19 卡西欧计算机株式会社 Speaker recognition using spectrogram correlation
CN104637497A (en) * 2015-01-16 2015-05-20 南京工程学院 Speech spectrum characteristic extracting method facing speech emotion identification
CN106920545A (en) * 2017-03-21 2017-07-04 百度在线网络技术(北京)有限公司 Speech Feature Extraction and device based on artificial intelligence
CN107578775A (en) * 2017-09-07 2018-01-12 四川大学 A kind of multitask method of speech classification based on deep neural network
CN107945791A (en) * 2017-12-05 2018-04-20 华南理工大学 A kind of audio recognition method based on deep learning target detection
CN108053836A (en) * 2018-01-18 2018-05-18 成都嗨翻屋文化传播有限公司 A kind of audio automation mask method based on deep learning
CN109147798A (en) * 2018-07-27 2019-01-04 北京三快在线科技有限公司 Audio recognition method, device, electronic equipment and readable storage medium storing program for executing
CN110246490A (en) * 2019-06-26 2019-09-17 合肥讯飞数码科技有限公司 Voice keyword detection method and relevant apparatus

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Speech recognition of two-character Chinese words based on the Fourier transform of spectrograms (语谱图傅里叶变换的二字汉语词汇语音识别); Pan Di et al.; Modern Electronics Technique (《现代电子技术》); 2017-08-15 (No. 16); full text *

Also Published As

Publication number Publication date
CN111128131A (en) 2020-05-08

Similar Documents

Publication Publication Date Title
CN112183120B (en) Speech translation method, device, equipment and storage medium
CN109993150B (en) Method and device for identifying age
CN110413812B (en) Neural network model training method and device, electronic equipment and storage medium
CN110826567B (en) Optical character recognition method, device, equipment and storage medium
CN110427915B (en) Method and apparatus for outputting information
CN112883968B (en) Image character recognition method, device, medium and electronic equipment
CN111128131B (en) Voice recognition method and device, electronic equipment and computer readable storage medium
CN110136715A (en) Audio recognition method and device
CN110110666A (en) Object detection method and device
CN111326146A (en) Method and device for acquiring voice awakening template, electronic equipment and computer readable storage medium
CN111312223B (en) Training method and device of voice segmentation model and electronic equipment
CN111312224B (en) Training method and device of voice segmentation model and electronic equipment
CN113470646B (en) Voice awakening method, device and equipment
CN113610034B (en) Method and device for identifying character entities in video, storage medium and electronic equipment
CN116913258B (en) Speech signal recognition method, device, electronic equipment and computer readable medium
CN113327628B (en) Audio processing method, device, readable medium and electronic equipment
CN113902838A (en) Animation generation method, animation generation device, storage medium and electronic equipment
KR102220964B1 (en) Method and device for audio recognition
CN112752118A (en) Video generation method, device, equipment and storage medium
WO2023143107A1 (en) Character recognition method and apparatus, device, and medium
CN113593527B (en) Method and device for generating acoustic features, training voice model and recognizing voice
CN111460214B (en) Classification model training method, audio classification method, device, medium and equipment
CN111899747B (en) Method and apparatus for synthesizing audio
CN114004229A (en) Text recognition method and device, readable medium and electronic equipment
CN112070022A (en) Face image recognition method and device, electronic equipment and computer readable medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant