CN115129923A - Voice search method, device and storage medium - Google Patents

Voice search method, device and storage medium

Info

Publication number
CN115129923A
CN115129923A CN202210536526.6A CN202210536526A
Authority
CN
China
Prior art keywords
audio data
audio
voice
feature
voice search
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210536526.6A
Other languages
Chinese (zh)
Other versions
CN115129923B (en)
Inventor
王星
玄建永
刘镇亿
高海宽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Honor Device Co Ltd
Original Assignee
Honor Device Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Honor Device Co Ltd filed Critical Honor Device Co Ltd
Priority to CN202210536526.6A priority Critical patent/CN115129923B/en
Publication of CN115129923A publication Critical patent/CN115129923A/en
Application granted granted Critical
Publication of CN115129923B publication Critical patent/CN115129923B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/68Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/686Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title or artist information, time, location or usage information, user ratings
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/61Indexing; Data structures therefor; Storage structures
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum

Abstract

The application provides a voice search method, a voice search device, and a storage medium. The method is applied to an electronic device: the audio features of the voice stream to be searched are searched directly based on the audio features of the keyword to be searched, that is, voice is used to search voice. Because the whole search process does not need to convert voice into text, the search time is greatly shortened and the search efficiency is improved. In addition, in the process of searching voice with voice, the searched voice segments that meet the preset condition are marked with time points, so that the playing position of the audio data the user needs to listen to and verify can be quickly located according to the time points, improving the working efficiency of the user.

Description

Voice search method, device and storage medium
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a speech search method, device, and storage medium.
Background
With the emergence of applications such as smart speakers and voice assistants, ordinary people can now talk to machines by voice, much like in science-fiction scenes. Voice keyword detection is an important technology for realizing human-computer voice interaction and is widely used in various intelligent devices and voice retrieval systems. Voice keyword detection can be divided into two types: keyword spotting, which is applied to device wake-up and device control; and spoken term detection, which is applied to voice document retrieval. For a voice document retrieval scenario, such as retrieving key information from a conference recording, the current retrieval method is to convert the continuous voice stream into text in some form, and then search the text for the required key information.
Although this approach can realize the search for key information in a continuous voice stream, it is time-consuming and labor-intensive for very large voice files. Moreover, due to limitations of voice recognition technology, the converted text contains substantial errors, which results in poor accuracy of the voice document retrieval result.
Disclosure of Invention
In order to solve the above technical problem, the present application provides a voice search method, a device and a storage medium, which aim to directly search a continuous voice stream based on a voice keyword picked up by a microphone, thereby improving the retrieval efficiency.
In a first aspect, the present application provides a voice search method applied to an electronic device. The method comprises the following steps: acquiring first audio data and second audio data, wherein the first audio data is the continuous voice stream data to be searched and the second audio data is the audio data of the keyword to be searched; extracting first audio features of the first audio data, and constructing an index table according to the first audio features and the time points of the first audio features in the first audio data; searching the first audio data for a matching first audio feature according to a second audio feature of the second audio data; when a matching first audio feature is found, determining a voice segment containing the matching first audio feature and the time point of the voice segment according to the matching first audio feature and the time point of the matching first audio feature recorded in the index table; and taking the voice segment with the determined time point as the final voice search result.
Therefore, based on the audio features of the keyword to be searched, the audio features of the continuous voice stream data to be searched are searched directly, that is, voice is used to search voice. Because the whole search process does not need to convert voice into text, the search time is greatly shortened and the search efficiency is improved.
In addition, in the process of searching voice with voice, the searched voice segments that meet the preset condition are marked with time points, so that the playing position of the audio data the user needs to listen to and verify can be quickly located according to the time points, improving the working efficiency of the user.
According to the first aspect, extracting the first audio features of the first audio data, and constructing the index table according to the first audio features and the time points of the first audio features in the first audio data, includes: extracting a first audio feature of each frame of audio data in the first audio data; for the first audio feature of each frame of audio data, constructing a mapping relation between the first audio feature and its time point in the first audio data; and constructing the index table according to the mapping relations of all the first audio features extracted from the first audio data. In this way, by constructing the mapping relation between the first audio feature of each frame of audio data and that frame's time point in the whole first audio data, and then constructing the index table corresponding to the first audio data from these mapping relations, the playing position of the final voice search result in the first audio data can be located quickly and accurately according to the index table.
According to the first aspect, or any implementation manner of the first aspect above, extracting the first audio feature of each frame of audio data in the first audio data includes: performing pre-emphasis processing on the first audio data; performing framing processing on the pre-emphasized first audio data; performing windowing processing on the framed first audio data; performing time-frequency conversion processing on the windowed first audio data; and determining the Mel-frequency cepstral coefficient (MFCC) audio feature of each frame of audio data in the first audio data after the time-frequency conversion processing, to obtain the first audio feature of each frame of audio data in the first audio data.
Illustratively, performing pre-emphasis processing on the first audio data boosts the high-frequency part of the first audio data and flattens the spectrum of the corresponding signal, so that the same signal-to-noise ratio can be obtained over the whole frequency band from low frequency to high frequency.
Illustratively, by framing the pre-emphasized first audio data, a number of sampling points can be grouped into one observation unit with the frame as the unit, enabling fast processing in small batches while ensuring accuracy.
Illustratively, the continuity of the left and right ends of each frame is increased by windowing the first audio data after the frame division processing.
Illustratively, performing time-frequency conversion processing on the windowed first audio data converts the first audio data from the time domain into the frequency domain, from which the amplitude spectrum and power spectrum of the first audio data can be obtained, so that the voice search can be performed on the spectrum.
Illustratively, by converting the frequency metric into a mel metric closer to the auditory mechanism of human ears, the search result can be ensured to be more accurate.
According to the first aspect, or any implementation manner of the first aspect, the determining an audio feature of a mel-frequency cepstrum coefficient of each frame of audio data in the first audio data after the time-frequency conversion processing includes: determining an energy spectrum corresponding to an amplitude spectrum by using a formula for converting frequency into Mel scale, wherein the amplitude spectrum is obtained by performing time-frequency conversion on first audio data; passing the energy spectrum through a Mel filter bank, and calculating logarithmic energy output by the Mel filter bank, wherein the Mel filter bank comprises M triangular filters, and M is an integer greater than 0; discrete cosine transform is carried out on the logarithmic energy to obtain the Mel frequency cepstrum coefficient audio features of each frame of audio data in the first audio data after time-frequency conversion processing.
According to the first aspect, or any one of the above implementation manners of the first aspect, the second audio feature of the second audio data is also a Mel-frequency cepstral coefficient audio feature; searching the first audio data for a matching first audio feature based on the second audio feature of the second audio data includes: determining the cumulative distance between the Mel-frequency cepstral coefficient audio feature of each frame of audio data of the second audio data and the Mel-frequency cepstral coefficient audio feature of each frame of audio data of the first audio data by using a dynamic time warping algorithm; and screening the cumulative distances according to a preset condition, and taking the first audio features corresponding to the screened cumulative distances as the audio features matching the second audio feature.
According to the first aspect, or any implementation manner of the first aspect above, the audio feature of the first audio data and the audio feature of the second audio data are pronunciation features. Since the audio features used for the voice search mainly reflect pronunciation, the voice search is not limited to the scenario where the person recording the second audio data is also the speaker in the first audio data, and search scenarios in which the first audio data contains voice information from different groups of people can also be handled.
According to the first aspect, or any implementation manner of the first aspect, a first control for importing the first audio data and a second control for recording the second audio data are displayed in a display interface of the electronic device; acquiring the first audio data and the second audio data includes: in response to a click operation on the first control, displaying a first audio data selection list, wherein selection controls corresponding to the selectable audio data are displayed in the first audio data selection list; in response to a click operation on any selection control, acquiring the audio data corresponding to that selection control to obtain the first audio data; in response to a click operation on the second control, recording audio data through an audio acquisition device; and, while the audio acquisition device is recording, stopping the recording when another click operation on the second control is received, and taking the audio data recorded between the two click operations as the second audio data. In this way, by providing controls operated by the user, such as the first control, the second control and the selection controls, the user can conveniently search different first audio data for the second audio data according to business requirements.
According to the first aspect, or any implementation manner of the first aspect, after taking the speech segment with the determined time point as a final speech search result, the method further includes: displaying a voice search list in a display interface of the electronic equipment; and displaying the voice search result in the voice search list. In this way, all searched voice search results are displayed in the voice search list, so that a user can conveniently know how many voice fragments possibly matched with the second audio data exist in the first audio data.
According to the first aspect, or any implementation manner of the first aspect, a listening trial control corresponding to each voice search result is displayed in a voice search list; after displaying the voice search results in the voice search list, the method further comprises: and responding to the clicking operation of any one audition control, determining the playing position of the first audio data according to the time point of the voice search result corresponding to the audition control, and playing the first audio data from the playing position. Therefore, the audition control corresponding to each voice search result is arranged in the voice search list, so that a user can conveniently listen to the searched voice search result by clicking the audition control, and whether the voice search result comprises the keyword required to be searched by the user is further determined.
In a second aspect, the present application provides an electronic device. The electronic device includes: a memory and a processor, the memory and the processor coupled; the memory stores program instructions that, when executed by the processor, cause the electronic device to perform the instructions of the first aspect or any possible implementation of the first aspect.
In a third aspect, the present application provides a computer readable medium for storing a computer program comprising instructions for performing the method of the first aspect or any possible implementation manner of the first aspect.
In a fourth aspect, the present application provides a computer program comprising instructions for carrying out the method of the first aspect or any possible implementation manner of the first aspect.
In a fifth aspect, the present application provides a chip comprising a processing circuit, a transceiver pin. Wherein the transceiver pin and the processing circuit are in communication with each other via an internal connection path, and the processing circuit is configured to perform the method of the first aspect or any one of the possible implementations of the first aspect to control the receiving pin to receive signals and to control the sending pin to send signals.
Drawings
Fig. 1 is a schematic hardware structure diagram of an electronic device according to an embodiment of the present disclosure;
fig. 2 is a schematic flowchart of a voice search method according to an embodiment of the present application;
FIG. 3 is a diagram illustrating an energy spectrum and an energy value output by a Mel filter according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a display page including an application implementing a voice search function according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a start page of a voice search application according to an embodiment of the present application;
fig. 6 is a schematic diagram of a page after a voice search application provided in an embodiment of the present application is started;
FIG. 7 is a schematic diagram of a page for selecting first audio data according to an embodiment of the present application;
fig. 8 is a schematic diagram of a page after first audio data and second audio data are acquired according to an embodiment of the present application;
fig. 9 is a schematic diagram of another page after the first audio data and the second audio data are acquired according to an embodiment of the present application;
fig. 10 is a schematic diagram of a page displaying a voice search result according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, of the embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The term "and/or" herein is merely an association describing an associated object, meaning that three relationships may exist, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone.
The terms "first" and "second," and the like in the description and in the claims of the embodiments of the present application, are used for distinguishing between different objects and not for describing a particular order of the objects. For example, the first target object and the second target object, etc. are specific sequences for distinguishing different target objects, rather than describing target objects.
In the embodiments of the present application, the words "exemplary" or "such as" are used herein to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" or "e.g.," is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, use of the word "exemplary" or "such as" is intended to present concepts related in a concrete fashion.
In the description of the embodiments of the present application, the meaning of "a plurality" means two or more unless otherwise specified. For example, a plurality of processing units refers to two or more processing units; the plurality of systems refers to two or more systems.
The embodiment of the present application provides an electronic device, and the electronic device provided in the embodiment of the present application may be a mobile phone, a tablet computer, a personal digital assistant (PDA for short), an intelligent wearable device, an intelligent home device, and the like, which are not listed here one by one.
In addition, the embodiment of the present application does not limit the specific form of the electronic device. For convenience of description, an electronic device is taken as an example of a mobile phone, and a hardware structure of the mobile phone is described with reference to fig. 1.
Referring to fig. 1, the mobile phone 100 may include: a processor 110, an external memory interface 120, an internal memory 121, a Universal Serial Bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a sensor module 180, a button 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a Subscriber Identity Module (SIM) card interface 195, and the like.
In addition, it should be noted that, in practical applications, the audio module 170 may include, for example, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, and the like.
Specifically, in the voice search scheme provided by the present application, the audio data of the keyword to be searched may be picked up by the microphone 170C, or may be picked up by other external voice input devices, for example, an earphone accessed through the earphone interface 170D, or an external microphone accessed through the USB interface 130, which is not illustrated here, and the present application is not limited to this.
It should be understood that the above description is only an example for better understanding of the technical solution of the present embodiment, and is not intended to limit the present embodiment. In particular to practical applications, the voice input device includes but is not limited to the above examples, and in other alternative implementations, may be any device capable of acquiring audio data of keywords to be searched.
In addition, it should be noted that, in practical applications, the sensor module 180 may include, for example, a pressure sensor, a gyroscope sensor, an air pressure sensor, a magnetic sensor, an acceleration sensor, a distance sensor, a proximity light sensor, a fingerprint sensor, a temperature sensor, a touch sensor, an ambient light sensor, a bone conduction sensor, and the like.
In addition, in practical applications, the keys 190 may include, for example, a power key (power on key), a home key (home key), a volume key, and the like. The keys 190 may be mechanical keys. Or may be touch keys. The cellular phone 100 may receive a key input, and generate a key signal input related to user setting and function control of the cellular phone 100.
In addition, it should be noted that, in practical applications, the processor 110 may include one or more processing units, such as: the processor 110 may include an Application Processor (AP), a modem processor, a Graphics Processor (GPU), an Image Signal Processor (ISP), a controller, a memory, a video codec, a Digital Signal Processor (DSP), a baseband processor, and/or a neural-Network Processing Unit (NPU), among others.
It is to be appreciated that in particular implementations, the various processing units may be stand-alone devices or may be integrated into one or more processors.
Further, in some implementations, the controller can be a neural hub and a command center of the cell phone 100. The controller can generate an operation control signal according to the instruction operation code and the timing signal to complete the control of instruction fetching and instruction execution.
In addition, memory in the processor 110 is used primarily for storing instructions and data. In some implementations, the memory in the processor 110 is a cache memory.
Furthermore, it will be appreciated that in a practical application scenario, executable program code, including instructions, that trigger the handset 100 to implement various functional applications and data processing is stored in the internal memory 121.
In addition, it should be noted that, in the voice search scheme provided by the present application, the continuous voice stream data to be searched, that is, the audio data that may include the voice keyword to be searched, may be stored in the internal memory 121 in advance, for example, so that when the voice search is required, the continuous voice stream data to be searched may be directly acquired from the internal memory 121.
For example, in another implementation manner, the continuous voice stream data to be searched may also be stored in an external memory, and for such a scenario, it is necessary to first implement communication between the external memory and the mobile phone 100 through the external memory interface 120, so that when voice search is needed, the accessed external memory can be accessed through the external memory interface 120, and then the continuous voice stream data to be searched is acquired.
It should be understood that the above description is only an example for better understanding of the technical solution of the present embodiment, and is not to be taken as the only limitation of the present embodiment. In particular, in practical applications, the manner of acquiring the continuous voice stream data to be searched includes, but is not limited to, the above example, and in other alternative implementations, the continuous voice stream data may be acquired in any available manner, for example, acquired from other devices through a bluetooth transmission manner, or acquired from a network or other devices through a network transmission manner.
The hardware architecture of the handset 100 is described herein, it being understood that the handset 100 shown in fig. 1 is merely an example, and in particular implementations, the handset 100 may have more or fewer components than shown, may combine two or more components, or may have a different configuration of components. The various components shown in fig. 1 may be implemented in hardware, software, or a combination of hardware and software, including one or more signal processing and/or application specific integrated circuits.
In order to better understand the technical solution of implementing voice search by the electronic device with the above structure, the following describes in detail a voice search method applied to the electronic device provided in the present application with reference to the accompanying drawings.
It should be understood that the following are implementation details provided only for ease of understanding and are not necessary to practice the present solution.
Illustratively, referring to fig. 2, the specific implementation steps of the voice search method provided in the embodiment of the present application include:
step S101, first audio data and second audio data are acquired.
Specifically, the first audio data is the continuous voice stream data to be searched, and the second audio data is the audio data corresponding to the keyword, key phrase, or key sentence to be searched.
For example, in some implementations, the first audio data may be, for example, a sound recording file imported from a voice library, and the second audio data may be, for example, audio data of a user picked up by a voice input device.
It is understood that, in some implementations, the voice library may be located in the internal memory described in the above embodiments, in the external memory, or in a network device; the voice input device may be, for example, a microphone provided on the electronic device, or an external device such as an earphone connected through the earphone interface or a microphone connected through the USB interface, which are not listed here one by one, and the present application is not limited thereto.
Step S102, extracting the first audio features of the first audio data, and constructing an index table according to the first audio features and the time points of the first audio features in the first audio data.
It should be noted that, in this embodiment, the first audio data is usually voice data of long duration. Therefore, to ensure the accuracy of the search result, when extracting the first audio features of the first audio data, the first audio feature of each frame of audio data needs to be extracted, so that the subsequent matching search can be performed frame by frame, which effectively ensures the accuracy of the search result.
Further, after the first audio feature of each frame of audio data is extracted from the first audio data, the mapping relation between the first audio feature of each frame and its time point in the first audio data is constructed, and the complete index table corresponding to the first audio data is then obtained from the mapping relations of all the first audio features extracted from the first audio data. In this way, the playing position, within the first audio data, of the voice segment matching the second audio data can be located quickly and accurately according to the index table, which records the mapping between the first audio feature and the time point of each frame of audio data.
In addition, it should be noted that, in the present embodiment, both the first audio features extracted from the first audio data and the second audio features extracted from the second audio data are pronunciation features. That is, the voice search scheme provided by this embodiment does not rely on voiceprint feature information. Therefore, regardless of whether the first audio data and the second audio data come from the same user, since the pronunciation of the same content is substantially the same, matching by pronunciation features allows the required content to be found in the first audio data according to second audio data recorded by different users.
In addition, the manner of extracting the first audio feature from the first audio data may be any voice recognition method capable of extracting pronunciation features, and the present application is not limited thereto.
For convenience of understanding, this embodiment describes the process of extracting the first audio features from the first audio data by taking the case where the first audio feature extracted from the first audio data is a Mel-frequency cepstral coefficient audio feature as an example.
For example, in some implementations, when extracting a mel-frequency cepstrum coefficient audio feature (a first audio feature) from first audio data, a series of preprocessing operations may be performed on the first audio data, for example, pre-emphasis processing, framing processing, windowing processing, and time-frequency conversion processing are sequentially performed on the first audio data, and then the mel-frequency cepstrum coefficient audio feature of each frame of audio data is extracted from the preprocessed first audio data, so that the first audio feature of each frame of audio data in the first audio data may be obtained.
For example, in some implementations, the pre-emphasis processing of the first audio data may be, for example, passing all speech signals of the first audio data through a high-pass filter, and the corresponding function formula is, for example, the following formula (1):
Q”(n)=Q'(n)-μQ'(n-1) (1)
where n = 1, 2, 3, ..., N; Q'(n) is the speech signal at the n-th sampling point; Q''(n) is the pre-emphasized speech signal at the n-th sampling point; and μ takes a value between 0.9 and 1, typically 0.97.
It should be noted that pre-emphasis processing boosts the high-frequency part of the first audio data and flattens the spectrum of the corresponding signal, so that the same signal-to-noise ratio can be obtained over the whole frequency band from low frequency to high frequency. At the same time, pre-emphasis counteracts the effect of the vocal cords and lips during sound production, compensating the high-frequency part of the speech signal that is suppressed by the vocal system and thus highlighting the high-frequency formants.
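As an aid to understanding only (this code is not part of the patent text), the pre-emphasis step of formula (1) can be sketched in Python/NumPy as follows; the function name and the default coefficient of 0.97 are illustrative assumptions consistent with the typical value mentioned above.

```python
import numpy as np

def pre_emphasis(signal: np.ndarray, mu: float = 0.97) -> np.ndarray:
    """Apply formula (1): Q''(n) = Q'(n) - mu * Q'(n-1)."""
    emphasized = np.empty_like(signal, dtype=np.float64)
    emphasized[0] = signal[0]                       # first sample has no predecessor
    emphasized[1:] = signal[1:] - mu * signal[:-1]  # boost high-frequency content
    return emphasized
```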
For example, in some implementations, the framing processing performed on the pre-emphasized first audio data groups a number of sampling points, for example N of them, into one observation unit with the frame as the unit, which ensures accuracy while enabling fast processing in small batches.
For example, in the framing processing of the pre-emphasized first audio data, 256 samples or 512 samples may be selected to be grouped into an observation unit, which covers about 20ms to 30 ms.
Further, in order to avoid too large a change between two adjacent frames, an overlapping region is kept between adjacent frames when the pre-emphasized first audio data is framed, which also preserves the framing result used by double-threshold-based audio endpoint detection.
Illustratively, the overlapping region may contain, for example, M sample points. Typically, M is about 1/2, or 1/3, of N.
In addition, it should be noted that the sampling frequency of the voice signal used for voice recognition is typically 8 kHz or 16 kHz. For 8 kHz, if the frame length is 256 sampling points, the corresponding time length is 256/8000 × 1000 = 32 ms.
For example, in some implementations, the windowing of the framed first audio data may be performed by multiplying each frame by a Hamming window to increase the continuity of the left and right ends of each frame, thereby reducing the side lobes and the risk of spectral leakage after time-frequency conversion.
For example, assuming that the framed speech signal is Q''(n), the windowed first audio data is obtained by multiplying by a Hamming window W(n): G(n) = Q''(n) × W(n).
Illustratively, the Hamming window W(n) may take the following form, as in formula (2):
W(n) = (1 - a) - a × cos(2πn / (N - 1))    (2)
where 0 ≤ n ≤ N - 1, a is a parameter, different values of a generate different Hamming windows, and a is generally taken as 0.46.
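A minimal sketch of the framing and Hamming-windowing steps, again for illustration only; the frame length of 256 samples, the hop of 128 samples (about 1/2 overlap) and the parameter a = 0.46 are example values drawn from the description above, not requirements of the patent.

```python
import numpy as np

def frame_and_window(signal: np.ndarray, frame_len: int = 256, hop: int = 128,
                     a: float = 0.46) -> np.ndarray:
    """Split the signal into overlapping frames and multiply each frame by a Hamming window.

    Assumes len(signal) >= frame_len.
    """
    n = np.arange(frame_len)
    window = (1.0 - a) - a * np.cos(2.0 * np.pi * n / (frame_len - 1))  # formula (2)
    num_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] for i in range(num_frames)])
    return frames * window  # G(n) = Q''(n) * W(n), frame by frame
```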
In addition, it should be noted that the speech signal in the first audio data is normally obtained in the time domain, while different frequency components of the speech signal, such as the amplitude spectrum and the power spectrum, can be analyzed better in the frequency domain. Therefore, the windowed first audio data also needs to undergo time-frequency conversion processing, i.e., conversion from the time domain to the frequency domain, so that interference, noise and jitter in the speech signal can be discovered more easily and the amplitude spectrum used for determining the first audio feature of each frame of audio data can be obtained.
For example, the time-frequency conversion processing performed on the windowed first audio data may apply a discrete Fourier transform to each frame of audio data, so as to convert each frame of audio data from the time domain to the frequency domain.
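For illustration, the time-frequency conversion of each windowed frame can be sketched as follows; the FFT size of 512 points is an assumed example value.

```python
import numpy as np

def power_spectrum(frames: np.ndarray, n_fft: int = 512) -> np.ndarray:
    """Per-frame DFT magnitude squared, i.e. |Xa(k)|^2 used later in formula (5)."""
    spectrum = np.fft.rfft(frames, n=n_fft, axis=-1)  # time domain -> frequency domain
    return np.abs(spectrum) ** 2
```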
For example, after the preprocessing of the first audio data is completed, the mel-frequency cepstrum coefficient audio features of each frame of audio data in the first audio data after the time-frequency conversion processing can be determined, so as to obtain the first audio features of each frame of audio data in the first audio data.
Furthermore, it should be noted that Mel-frequency cepstral coefficients (MFCCs) are cepstral parameters extracted in the Mel-scale spectral domain, where the Mel scale describes the nonlinear characteristic of human-ear frequency perception. According to research on the human auditory mechanism, the human ear has different sensitivities to sound waves of different frequencies. Speech signals from 200 Hz to 5000 Hz have the largest influence on the intelligibility of speech. When two sounds of different loudness act on the human ear, the frequency components with higher loudness affect the perception of the frequency components with lower loudness and make them less noticeable; this is called the masking effect. Since lower-frequency sounds travel a greater distance up the basilar membrane inside the cochlea than higher-frequency sounds, bass sounds generally tend to mask treble sounds, while it is more difficult for treble sounds to mask bass sounds. The critical bandwidth of sound masking at low frequencies is also smaller than at higher frequencies. Therefore, in order to ensure the accuracy of the extracted first audio features, in this embodiment, after the energy spectrum corresponding to the amplitude spectrum is determined by using the formula for converting frequency into the Mel scale, the energy spectrum is filtered by a Mel filter bank; the signal energy output by the Mel filter bank is then taken as the basic feature of the signal, and this basic feature is further processed to obtain the Mel-frequency cepstral coefficient audio feature of each frame of audio data.
Illustratively, in some implementations, the mel filter bank is embodied as a filter bank including M filters, and the filter employed is typically a triangular filter.
In addition, it should be further noted that, in some implementations, the value of M is greater than 0, and may generally be 26 or 32, which is not limited in this application.
Illustratively, regarding the formula for converting from frequency to mel scale, for example, the following formula (3) may be mentioned:
F(f) = 2595 × lg(1 + f / 700)    (3)
where f is the value to be converted from the amplitude spectrum to the energy spectrum, and F(f) is the energy value corresponding to f.
Exemplarily, after the energy spectrum corresponding to the magnitude spectrum is determined by the formula (3), the filtering processing of the energy spectrum by the mel filter bank means that the energy spectrum passes through the mel filter bank and the logarithmic energy output by the mel filter bank is calculated.
Illustratively, regarding the processing of the energy spectrum by the triangular filter in the mel-filter bank, the following formula (4) is specifically used:
Hm(k) = 0,                                     k < f(m-1)
Hm(k) = (k - f(m-1)) / (f(m) - f(m-1)),        f(m-1) ≤ k ≤ f(m)
Hm(k) = (f(m+1) - k) / (f(m+1) - f(m)),        f(m) ≤ k ≤ f(m+1)
Hm(k) = 0,                                     k > f(m+1)          (4)
where f(m) denotes the center frequency of the m-th triangular filter, the relationship between f(m) in formula (4) and F(f) in formula (3) is F(f) = f(m), and k indexes the energy values in the energy spectrum.
In addition, it should be noted that the correspondence relationship between each triangular filter in the mel filter bank and the energy value in the energy spectrum may be as shown in fig. 3.
Referring to fig. 3, for example, the spacing between adjacent f(m) values decreases as m decreases and increases as m increases; that is, the filters are narrow at low frequencies and wide at high frequencies, so the part of the signal toward higher frequencies is attenuated more. Therefore, by processing the energy spectrum with the Mel filter bank, the spectrum can be smoothed and the effect of higher harmonics (i.e., high-frequency components) can be eliminated, thereby highlighting the formants of the original speech. As a result, the tone or pitch of a given voice segment in the first audio data is not reflected in the final MFCC parameters, so the voice search accuracy is not affected by differences in tone.
Illustratively, in some implementations, the manner in which the log energy of the mel filter bank output is calculated may be, for example, by the following equation (5):
s(m) = ln( Σk |Xa(k)|² × Hm(k) ),   m = 1, 2, …, M    (5)
where s(m) is the final output logarithmic energy, |Xa(k)|² is the square of the amplitude (i.e., the power) at spectral bin k, and Hm(k) is the frequency response of the m-th triangular filter in the Mel filter bank.
In addition, it should be noted that, after the signal energy, that is, the logarithmic energy, output by the mel filter bank is used as the basic feature of the signal, the further processing on the basic feature is, for example, performing discrete cosine transform on the logarithmic energy, so that the mel-frequency cepstrum coefficient audio feature of each frame of audio data in the first audio data after the time-frequency conversion processing can be obtained.
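Putting the above steps together, the following sketch illustrates, under assumed parameter values (26 triangular filters, 13 output coefficients, an 8 kHz sampling rate and a 512-point FFT), how formulas (3), (4) and (5) and the discrete cosine transform could be combined to obtain per-frame MFCC features; it is an illustrative approximation, not the patent's reference implementation, and all function names are assumptions.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)         # formula (3)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_from_power(power: np.ndarray, sample_rate: int = 8000, n_fft: int = 512,
                    num_filters: int = 26, num_ceps: int = 13) -> np.ndarray:
    """Compute per-frame MFCC features from a (num_frames, n_fft//2 + 1) power spectrum."""
    # Triangular filters with center frequencies equally spaced on the Mel scale (formula (4)).
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), num_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((num_filters, n_fft // 2 + 1))
    for m in range(1, num_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fbank[m - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    # Logarithmic energy output by the Mel filter bank (formula (5)).
    log_energy = np.log(np.maximum(power @ fbank.T, 1e-10))
    # Discrete cosine transform yields the MFCC audio feature of each frame.
    return dct(log_energy, type=2, axis=-1, norm='ortho')[:, :num_ceps]
```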
Thus, after the above series of operations on the first audio data, the first audio feature of each frame of audio data can be extracted from the first audio data.
It should be understood that the above description is only an example for better understanding of the technical solution of the present embodiment, and is not to be taken as the only limitation of the present embodiment.
Further, after the first audio feature of each frame of audio data in the first audio data is obtained, the mapping relation between the first audio feature of each frame and its time point in the first audio data is constructed, and the complete index table corresponding to the first audio data is then obtained from the mapping relations of all the first audio features extracted from the first audio data.
The form of the index table may be, for example, as shown in table 1.
Table 1 Index table

First audio feature      Time points in the first audio data
first audio feature a    …
first audio feature b    00:45:00, 00:47:00, 00:50:10, 03:15:10, 03:20:00
…                        …
As can be seen from table 1, the time point of each first audio feature in the first audio data is recorded in the index table.
It should be understood that the above description is only an example for better understanding of the technical solution of the present embodiment, and is not intended to limit the present embodiment. In other implementations, the information recorded in the index table may also include the frame numbers in addition to the first audio features and the time points, as long as the time point of the first audio feature corresponding to each frame of audio data is clearly recorded.
In addition, it should be noted that, in practical applications, a user may perform multiple searches for different keywords on the same first audio data, that is, search for a matching first audio feature from the same first audio data according to a second audio feature of different second audio data. Therefore, the index table generated for the first time can be stored, so that the index table does not need to be reconstructed in subsequent searching.
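A minimal sketch of the index table described above, mapping each per-frame first audio feature to its time point in the first audio data; the function name, hop size and sampling rate used to derive the time points are assumed example values.

```python
def build_index_table(features, hop: int = 128, sample_rate: int = 8000):
    """Map every per-frame first audio feature to its time point (in seconds)."""
    index_table = []
    for frame_idx, feature in enumerate(features):
        time_point = frame_idx * hop / sample_rate  # position of this frame in the voice stream
        index_table.append((time_point, feature))
    return index_table
```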
Step S103, searching the first audio data for the matched first audio characteristic according to the second audio characteristic of the second audio data.
It should be noted that the second audio data may be recorded in real time during the voice search, or may be recorded in advance. Considering that in practical applications a user may need to search different first audio data for the same second audio data, the second audio feature of the second audio data may also be stored after being extracted during the voice search process, so that after the first audio data is changed, if the same second audio data needs to be searched for again, it is not necessary to re-record the second audio data or re-extract the second audio feature.
It should be understood that the above description is only an example for better understanding of the technical solution of the present embodiment, and is not to be taken as the only limitation of the present embodiment.
Furthermore, it should be noted that, in order to ensure that the second audio features extracted from the second audio data have the same characteristics as the first audio features extracted from the first audio data, i.e. they have comparability, the manner of extracting the second audio features from the second audio data needs to be the same as the manner of extracting the first audio features from the first audio data.
Still taking the example in which the first audio feature extracted from the first audio data is a Mel-frequency cepstral coefficient audio feature, the second audio feature extracted from the second audio data also needs to be a Mel-frequency cepstral coefficient audio feature. For the way of extracting the Mel-frequency cepstral coefficient audio features from the second audio data, reference may be made to the description of extracting the Mel-frequency cepstral coefficient audio features from the first audio data given in step S102, and details are not repeated here.
Regarding the manner adopted in step S103 when searching for the first audio feature according to the second audio feature, that is, searching voice with voice, in this embodiment it may be implemented, for example, by using a Dynamic Time Warping (DTW) algorithm.
Understandably, since the DTW algorithm calculates the cumulative distance between two features, the first audio feature corresponding to the shortest cumulative distance is the searched audio feature matching the second audio feature.
It can be understood that, in order to reduce misjudgment, the preset condition may be set to retain, for example, the 10 or 20 first audio features with the shortest cumulative distances; that is, the first audio features finally determined as matches can be chosen according to the user's requirements.
For example, when the preset condition is the 10 shortest cumulative distances, the cumulative distances may be screened according to the preset condition, and the first audio features corresponding to the screened cumulative distances are taken as the audio features matching the second audio feature.
For better understanding of the feature matching search implemented by the DTW algorithm, the following description is made with reference to an example.
For example, it is assumed that N frames of first audio features are included in the first audio data, M frames of second audio features are included in the second audio data, and the first audio features of each frame of audio data in the first audio data and the second audio features of each frame of audio data in the second audio data can be converted into a vector value.
Accordingly, the first audio data comprises N frame vectors {T(1), T(2), T(3), …, T(N)}, where T(n) is the first audio feature vector of the n-th frame; the second audio data comprises M frame vectors {R(1), R(2), R(3), …, R(M)}, where R(m) is the second audio feature vector of the m-th frame.
Next, the cumulative distance between the Mel-frequency cepstral coefficient audio feature of each frame of the second audio data and the Mel-frequency cepstral coefficient audio feature of each frame of the first audio data, i.e., the cumulative distance between each R(m) and T(n), is calculated based on the following formula (6).
D(N, M) = d(T(N), R(M)) + min{ D(N-1, M), D(N-1, M-1), D(N, M-1) }    (6)

where D(N, M) is the cumulative distance between the N-th first audio feature vector and the M-th second audio feature vector, and the local distance is d(T(N), R(M)) = ||T(N) - R(M)||².
For example, in the initial case, D(1, 1) is set to 0, and the recursion of formula (6) is then repeated until D(N, M) is obtained, which corresponds to the optimal path. Understandably, the optimal path corresponds to the minimum cumulative distance Dmin(N, M).
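An illustrative sketch of the DTW recursion in formula (6), using the squared Euclidean distance between per-frame MFCC vectors as the local distance d(·, ·); the function name and the boundary initialization with an extra row and column are implementation assumptions.

```python
import numpy as np

def dtw_distance(T: np.ndarray, R: np.ndarray) -> float:
    """Cumulative DTW distance between first-audio frames T (N x d) and keyword frames R (M x d)."""
    N, M = len(T), len(R)
    D = np.full((N + 1, M + 1), np.inf)
    D[0, 0] = 0.0
    for n in range(1, N + 1):
        for m in range(1, M + 1):
            cost = np.sum((T[n - 1] - R[m - 1]) ** 2)  # d(T(n), R(m)) = ||T(n) - R(m)||^2
            D[n, m] = cost + min(D[n - 1, m], D[n - 1, m - 1], D[n, m - 1])  # formula (6)
    return D[N, M]
```

In practice, the second audio features R would be compared against successive windows of the first audio features T, and the windows with the shortest cumulative distances (for example, the 10 smallest, per the preset condition discussed above) would be retained as matches.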
It should be understood that the above description is only an example for better understanding of the technical solution of the present embodiment, and is not to be taken as the only limitation of the present embodiment.
Step S104, when the matched first audio features are searched, determining the voice segments containing the matched first audio features and the time points of the voice segments according to the matched first audio features and the time points of the matched first audio features recorded in the index table.
Illustratively, still taking the mapping relations between the first audio features and their time points recorded in Table 1 as an example, assume that the matching first audio feature found is first audio feature b in Table 1. According to the time points of first audio feature b recorded in Table 1, first audio feature b occurs at 5 time points in the first audio data: 00:45:00, 00:47:00, 00:50:10, 03:15:10 and 03:20:00. Considering the continuity of time and the length of a voice segment, several occurrences whose time points are close together can be grouped into one voice segment. For example, the audio data containing the occurrences of first audio feature b at 00:45:00, 00:47:00 and 00:50:10 is taken as one voice segment, i.e., the start time point of this voice segment is 00:45:00 and the end time point is 00:50:10.
Correspondingly, the audio data containing the occurrences of first audio feature b at 03:15:10 and 03:20:00 is taken as another voice segment, i.e., the start time of this voice segment is 03:15:10 and the end time is 03:20:00.
For example, in practical applications, in order to avoid losing important information just before and after a voice segment, the start time of the voice segment may be set to a preset length of time, such as 30 s or 1 minute, before the first occurrence of the matching first audio feature, and the end time of the voice segment may be set to a preset length of time, such as 30 s or 1 minute, after the last occurrence of the matching first audio feature.
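A sketch of grouping the matched time points into voice segments as described above; the 10-minute grouping gap and the 30-second padding are illustrative assumptions chosen only to reproduce the Table 1 example.

```python
def group_into_segments(time_points, max_gap: float = 600.0, padding: float = 30.0):
    """Cluster matched time points (seconds, sorted ascending) into (start, end) voice segments."""
    segments = []
    start = prev = time_points[0]
    for t in time_points[1:]:
        if t - prev > max_gap:  # too far from the previous occurrence: close the current segment
            segments.append((max(start - padding, 0.0), prev + padding))
            start = t
        prev = t
    segments.append((max(start - padding, 0.0), prev + padding))
    return segments

# For the Table 1 example (00:45:00 ... 03:20:00 expressed in seconds), this yields two
# segments: roughly (00:44:30, 00:50:40) and (03:14:40, 03:20:30).
```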
It should be understood that the above description is only an example for better understanding of the technical solution of the present embodiment, and is not to be taken as the only limitation of the present embodiment.
And step S105, taking the voice segment with the determined time point as a final voice search result.
For example, in some implementations, in order to facilitate the user to view and listen to the searched voice search result on trial, a voice search list may be displayed in a display interface of the electronic device, and the voice search result searched in the manner of the voice search voice may be displayed in the voice search list.
Further, when displaying the voice search results in the voice search list, it may be set that only a preset number of voice search results are displayed, such as only 10, or only 20, as needed.
Furthermore, it can be understood that if the searched voice search result is larger than the set display number, the voice search result with the highest matching degree can be selected, for example, when matching is performed by using the DTW algorithm, 10 or 20 with the smallest distance can be selected for display.
It should be understood that the above description is only an example for better understanding of the technical solution of the present embodiment, and is not to be taken as the only limitation of the present embodiment.
Thus, the voice search method provided by this embodiment searches the audio features of the continuous voice stream data to be searched based on the audio features of the audio data of the keyword to be searched, that is, voice is used to search voice. Since the whole search process does not need to convert voice into text, the search time is greatly shortened and the retrieval efficiency is improved.
In addition, in the process of searching voice with voice, the searched voice segments that meet the preset condition are marked with time points, so that the playing position of the audio data the user needs to listen to and verify can be quickly located according to the time points, improving the working efficiency of the user.
The voice search method provided by the present application has been described with reference to fig. 2 and 3, and a scenario in which voice search is implemented by using the voice search method provided in the foregoing embodiment when the electronic device is taken as a mobile phone with reference to fig. 4 to 10 is described in detail below.
Referring to fig. 4, exemplarily, a plurality of installed Applications (APPs) are displayed in the main page 10a of the mobile phone 100, such as a voice search APP (an Application for performing voice search in the present embodiment), a clock APP, a calendar APP, a gallery APP, a memo APP, a file management APP, a calculator APP, a setting APP, a recorder APP, a camera APP, an address book APP, a telephone APP, an information APP, an email APP, a music APP, a video APP, a weather APP, a browser APP, and the like.
It should be understood that the above description is only an example for better understanding of the technical solution of the present embodiment, and is not to be taken as the only limitation of the present embodiment.
Continuing to refer to fig. 4, for example, after the user clicks the icon control 10a-1 corresponding to the voice search APP, the mobile phone 100 starts the voice search APP in response to the operation behavior of the user, and the display page may display a start page 10b as shown in fig. 5.
Referring to FIG. 5, for example, the launch page 10b may include one or more controls, such as a control 10b-1 for canceling the launch of the voice search APP and a control 10b-2 for agreeing to launch the voice search APP.
Continuing with FIG. 5, for example, in order to provide a better user experience, the start page 10b may also display the permissions required by the voice search APP and instructions for its use.
It should be understood that the above description is only an example for better understanding of the technical solution of the present embodiment, and is not to be taken as the only limitation of the present embodiment.
Continuing to refer to fig. 5, for example, when the user clicks the control 10b-1, the mobile phone 100, in response to the operation behavior of the user, exits from the start page 10b and returns to the main page 10 a; when the user clicks the control 10b-2, the mobile phone 100 jumps from the start page 10b to a page for acquiring the first audio data and the second audio data described in the above embodiment and performing voice search in response to the operation behavior of the user.
For example, in some implementations, the page that jumps to may display, for example, a first control for importing first audio data and a second control for recording second audio data.
Accordingly, based on the page, the operation of acquiring the first audio data may be, for example: and when the clicking operation of the first control is received, responding to the clicking operation of the first control, and displaying a first audio data selection list.
Understandably, the selectable audio data and the selection control corresponding to each selectable audio data are displayed in the first audio data selection list, so that the user can select the audio data to be searched through the corresponding selection control. That is, when a click operation on any selection control in the first audio data selection list is received, the audio data corresponding to that selection control is acquired in response to the click operation, and the first audio data is thus obtained.
Further, based on this page, the operation of acquiring the second audio data may be, for example: when a click operation on the second control is received, audio data is recorded through the audio acquisition equipment in response to the click operation on the second control. Then, while the audio acquisition equipment is recording the audio data, the recording is stopped once a click operation on the second control is received again, and the audio data recorded between the two click operations is taken as the second audio data.
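As a minimal sketch of this single-control record/stop toggle (a Python illustration in which the audio acquisition equipment is simulated, so the captured duration stands in for real audio samples), the logic could be expressed as follows:

```python
import time

class SecondAudioRecorder:
    """Minimal sketch of the single-control record/stop toggle described above.

    The audio acquisition equipment is simulated here; a real implementation
    would start and stop the device microphone instead.
    """

    def __init__(self):
        self.recording = False
        self.start_time = None
        self.second_audio_data = None   # placeholder for the recorded samples

    def on_second_control_clicked(self):
        if not self.recording:
            # First click: start recording.
            self.recording = True
            self.start_time = time.monotonic()
        else:
            # Second click: stop recording; the audio captured between the two
            # clicks becomes the second audio data (the keyword to search for).
            duration = time.monotonic() - self.start_time
            self.second_audio_data = {"duration_s": round(duration, 2)}
            self.recording = False

recorder = SecondAudioRecorder()
recorder.on_second_control_clicked()   # user clicks the second control: start
time.sleep(0.1)                        # user speaks the keyword
recorder.on_second_control_clicked()   # user clicks again: stop, ~0.1 s captured
print(recorder.second_audio_data)
```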
In order to better understand the technical solution of the present embodiment, the following description takes as a specific example the case where the start page 10b jumps to the page 10c shown in fig. 6, which is used for acquiring the first audio data and the second audio data and performing the voice search.
Referring to fig. 6, the page 10c may illustratively include one or more controls, such as a control 10c-1 for exiting the page 10c, a control 10c-2 for importing the first audio data (that is, the first control described above), a control 10c-3 for displaying a preview of the imported first audio data, a control 10c-4 for recording the second audio data (that is, the second control described above), a control 10c-5 for displaying a preview of the recorded second audio data, controls 10c-6 and 10c-7 for setting the maximum display number of voice search results, and a control 10c-8 for starting the voice search operation.
It should be understood that the above description is only an example for better understanding of the technical solution of the present embodiment, and is not to be taken as the only limitation of the present embodiment.
Continuing to refer to fig. 6, for example, after the user clicks the control 10c-1, the mobile phone 100 exits the page 10c and returns to the main page 10a in response to the operation behavior of the user; when the user clicks the control 10c-2, the mobile phone 100 may jump to the page 10d shown in fig. 7 in response to the operation behavior of the user.
Referring to FIG. 7, for example, the page 10d may include one or more controls, such as a control 10d-1 for exiting the page 10d, a first audio data selection list 10d-2 for displaying the selectable first audio data, and a control 10d-3 for agreeing to the selected first audio data.
Continuing with FIG. 7, for example, assuming that there are 6 selectable pieces of first audio data, namely audio data A, audio data B, audio data C, audio data D, audio data E and audio data F shown in FIG. 7, these 6 pieces of audio data are displayed in the first audio data selection list 10d-2, together with a selection control for selecting each piece of audio data, such as a selection control 10d-21 for selecting audio data A, a selection control 10d-22 for selecting audio data B, a selection control 10d-23 for selecting audio data C, a selection control 10d-24 for selecting audio data D, a selection control 10d-25 for selecting audio data E, and a selection control 10d-26 for selecting audio data F.
For example, in some implementations, the selectable audio data displayed in the first audio data selection list 10d-2 may be displayed in descending chronological order, as shown in fig. 7; it may also be displayed in descending or ascending order of the duration of the audio data, or arranged according to the recording location of the audio data.
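As a small illustrative sketch (the entry fields name, recorded_at, duration_s, and location are assumptions made for this example, not fields defined by the embodiment), the different orderings simply correspond to different sort keys over the selectable entries:

```python
# Illustrative ordering of the selectable audio entries.
entries = [
    {"name": "audio data A", "recorded_at": "2022-05-17 10:30", "duration_s": 95, "location": "office"},
    {"name": "audio data B", "recorded_at": "2022-05-16 09:12", "duration_s": 40, "location": "home"},
    {"name": "audio data C", "recorded_at": "2022-05-15 20:05", "duration_s": 310, "location": "office"},
]

by_time_desc = sorted(entries, key=lambda e: e["recorded_at"], reverse=True)  # newest first, as in fig. 7
by_duration_asc = sorted(entries, key=lambda e: e["duration_s"])              # shortest first
by_location = sorted(entries, key=lambda e: e["location"])                    # grouped by recording place
print([e["name"] for e in by_time_desc])
```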
It should be understood that the above description is only an example for better understanding of the technical solution of the present embodiment, and is not to be taken as the only limitation of the present embodiment.
Continuing to refer to fig. 7, for example, after the user clicks the control 10d-1, the mobile phone 100 exits the page 10d and returns to the page 10c in response to the operation behavior of the user; when the user clicks any selection control in the first audio data selection list 10d-2, for example, the control 10d-21 in fig. 7, and then clicks the control 10d-3, the mobile phone 100 takes audio data A as the first audio data in response to the operation behavior of the user, and jumps back to the page 10c.
Illustratively, when jumping back to the page 10c, an audio preview of the first audio data (the selected audio data A) is displayed in the control 10c-3, as shown in FIG. 8.
Continuing with fig. 6, for example, after the user clicks the control 10c-4, the mobile phone 100 responds to the operation behavior of the user, and if no headset is plugged into the mobile phone 100 at this time, the user's voice is picked up through the microphone of the mobile phone 100.
Referring to fig. 8, for example, during recording of the user's voice, the control 10c-5 displays the duration of the recording and shows an audio preview of the recorded audio data.
Continuing with fig. 8, for example, assuming that the user clicks the control 10c-4 again when the control 10c-5 displays "00:00:23", that is, at 23 s, the mobile phone 100 stops recording the user's voice in response to the operation behavior of the user, and takes the recorded 23 s of audio data as the second audio data.
It should be understood that the above description is only an example for better understanding of the technical solution of the present embodiment, and is not to be taken as the only limitation of the present embodiment.
Continuing with fig. 8, illustratively, the control 10c-6 is selected by default, i.e. the voice search displays at most 10 voice search results; if the user needs to modify this setting and clicks the control 10c-7, the mobile phone 100 determines, in response to the operation behavior of the user, that the voice search displays at most 20 voice search results.
It should be understood that the above description is only an example for better understanding of the technical solution of the present embodiment, and is not to be taken as the only limitation of the present embodiment. In practical applications, the start control for recording the second audio data and the end control for stopping recording the second audio data may be two different controls, such as the controls 10c-9 and 10c-10 shown in fig. 9.
Referring to fig. 9, for example, when the user clicks the control 10c-9, the mobile phone 100 responds to the operation behavior of the user, and if no headset is plugged into the mobile phone 100 at this time, the user's voice is picked up through the microphone of the mobile phone 100.
Referring to fig. 9, for example, during recording of the user's voice, the control 10c-5 displays the duration of the recording and shows an audio preview of the recorded audio data.
Continuing with fig. 9, for example, assuming that the user clicks the control 10c-10 when the control 10c-5 displays "00:00:23", that is, at 23 s, the mobile phone 100 stops recording the user's voice in response to the operation behavior of the user, and takes the recorded 23 s of audio data as the second audio data.
In addition, regarding the way of setting the maximum display number for the present voice search, in addition to the fixed options shown in fig. 8, a user setting entry may be provided, for example, the control 10c-11 shown in fig. 9. Thus, after the user clicks the control 10c-11, the mobile phone 100 pops up a numeric input keyboard in response to the operation behavior of the user, so that the user can enter the maximum number of results to be displayed as required.
It should be understood that the above description is only an example for better understanding of the technical solution of the present embodiment, and is not intended to limit the present embodiment.
Continuing with fig. 9, exemplarily, after the user clicks the control 10c-8, the mobile phone 100 responds to the operation behavior of the user and, according to the voice search method provided in the above embodiment, extracts the first audio feature from the first audio data, extracts the second audio feature from the second audio data, and searches the first audio feature according to the second audio feature, thereby finding the voice segments meeting the preset requirement and obtaining voice search results that can be displayed to the user. That is, when voice search results that can be displayed to the user are found, a search result list 10c-12 is displayed in the page 10c.
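As a hedged end-to-end sketch of the flow triggered by the control 10c-8 (not the embodiment's exact implementation: librosa is assumed for loading audio and computing MFCC features, the file names in the commented usage line are hypothetical, and the sliding-window step, MFCC parameters, and screening threshold are illustrative), the search could look roughly like this:

```python
import numpy as np
import librosa

def mfcc_frames(path, sr=16000, n_mfcc=13, hop=160):
    """Load an audio file and return one MFCC vector per frame plus the frame duration in seconds."""
    y, _ = librosa.load(path, sr=sr)
    feats = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, hop_length=hop).T  # (num_frames, n_mfcc)
    return feats, hop / sr

def dtw_cumulative_distance(a, b):
    """Classic dynamic time warping cumulative distance between two feature sequences."""
    n, m = len(a), len(b)
    d = np.full((n + 1, m + 1), np.inf)
    d[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            d[i, j] = cost + min(d[i - 1, j], d[i, j - 1], d[i - 1, j - 1])
    return d[n, m]

def voice_search(first_path, second_path, threshold=250.0, step=10):
    long_feats, frame_s = mfcc_frames(first_path)    # first audio features, indexed by frame
    key_feats, _ = mfcc_frames(second_path)          # second audio features (spoken keyword)
    win = len(key_feats)
    results = []
    # Naive sliding-window matching; a real implementation would be heavily optimized.
    for start in range(0, len(long_feats) - win + 1, step):
        dist = dtw_cumulative_distance(key_feats, long_feats[start:start + win])
        if dist < threshold:                          # preset screening condition
            results.append({"time_s": start * frame_s, "cumulative_distance": dist})
    return sorted(results, key=lambda r: r["cumulative_distance"])   # best matches first

# Hypothetical usage: results = voice_search("long_recording.wav", "spoken_keyword.wav")
```

Each result carries the time point at which the matched voice segment begins in the first audio data, which is what the search result list 10c-12 displays and what the audition controls later use to determine the playing position.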
Continuing with FIG. 9, by way of example, one or more controls may be included in the search result list 10c-12, such as a control 10c-13 for scrolling the content displayed in the search result list 10c-12, and an audition control corresponding to each voice search result.
For example, assuming that the maximum number of voice search results allowed to be displayed, as set by the control 10c-11, is 15, but only 5 voice search results satisfying the condition are found by the voice search method given in the above embodiment, then only these 5 voice search results are displayed in the search result list 10c-12; if 25 voice search results are found by the voice search method given in the above embodiment, then only the best 15 voice search results are displayed in the search result list 10c-12 according to the setting.
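A minimal sketch of this capping behavior (assuming each result carries a cumulative_distance field as in the search-flow sketch above, where a smaller value means a better match):

```python
# Show at most the configured maximum number of results, best matches first.
def results_to_display(results, max_display):
    ranked = sorted(results, key=lambda r: r["cumulative_distance"])
    return ranked[:max_display]

found = [{"time_s": t, "cumulative_distance": d}
         for t, d in [(10, 80), (55, 95), (2700, 60), (90, 120), (400, 70)]]
print(len(results_to_display(found, 15)))   # 5 found, 15 allowed -> all 5 are shown
```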
For example, assuming that only 5 voice search results are finally displayed in the search result list 10c-12, as shown in fig. 9, each voice search result has a corresponding audition control, such as an audition control 10c-14 for listening to the first voice search result, an audition control 10c-15 for listening to the second voice search result, an audition control 10c-16 for listening to the third voice search result, an audition control 10c-17 for listening to the fourth voice search result, and an audition control 10c-18 for listening to the fifth voice search result.
It should be understood that the above description is only an example for better understanding of the technical solution of the present embodiment, and is not to be taken as the only limitation of the present embodiment.
Continuing to refer to fig. 9, for example, after the user clicks any one of the audition controls displayed in the search result list 10c-12, such as the audition control 10c-14, the mobile phone 100 responds to the operation behavior of the user, that is, responds to the click operation on that audition control; at this time, according to the time point of the voice search result corresponding to the audition control 10c-14, such as "00:45:00", the mobile phone 100 determines the playing position of the first audio data, and plays the first audio data from the playing position "00:45:00".
Furthermore, it should be noted that, in some implementations, in order to avoid missing the transition content of adjacent voice segments, the playing position of the first audio data may be set tens of seconds or even minutes before the time point of the voice search result. For example, when the time point of the auditioned voice search result is "00:45:00", the playing position may be "00:44:00".
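A minimal sketch of this lead-in offset follows; the 60-second margin mirrors the "00:45:00" to "00:44:00" example above, and the position is clamped so that it never falls before the start of the audio:

```python
def playback_position(match_time_s, lead_in_s=60):
    # Start playback a little before the matched time point, but not before 0.
    return max(0, match_time_s - lead_in_s)

def to_hms(seconds):
    h, rem = divmod(int(seconds), 3600)
    m, s = divmod(rem, 60)
    return f"{h:02d}:{m:02d}:{s:02d}"

print(to_hms(playback_position(45 * 60)))   # prints "00:44:00"
```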
It should be understood that the above description is only an example for better understanding of the technical solution of the present embodiment, and is not to be taken as the only limitation of the present embodiment.
The foregoing describes a usage scenario in which the electronic device performs voice search by using the voice search method provided in the above embodiment. It should be understood that the above description is only an example for better understanding of the technical solution of the embodiment, and is not to be taken as the only limitation to the embodiment.
Furthermore, it is understood that the electronic device comprises corresponding hardware and/or software modules for performing the respective functions in order to implement the above-described functions. The present application can be implemented in hardware, or in a combination of hardware and computer software, in connection with the exemplary algorithm steps described in the embodiments disclosed herein. Whether a function is performed by hardware or by computer software driving hardware depends upon the particular application and the design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application in conjunction with the embodiments, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In addition, it should be noted that, in an actual application scenario, the voice search method provided by the foregoing embodiments and implemented by the electronic device may also be executed by a chip system included in the electronic device, where the chip system may include a processor. The chip system may be coupled to a memory, so that when the chip system runs, it calls a computer program stored in the memory to implement the steps performed by the electronic device. The processor in the chip system may be an application processor or a processor other than an application processor.
In addition, an embodiment of the present application further provides a computer-readable storage medium, where computer instructions are stored in the computer-readable storage medium, and when the computer instructions run on an electronic device, the electronic device is caused to execute the relevant method steps to implement the voice search method in the foregoing embodiments.
In addition, an embodiment of the present application further provides a computer program product, which, when running on an electronic device, causes the electronic device to execute the relevant steps described above, so as to implement the voice search method in the foregoing embodiment.
Additionally, embodiments of the present application also provide a chip (which may also be a component or a module) that may include one or more processing circuits and one or more transceiver pins, wherein the transceiver pins and the processing circuits communicate with each other through an internal connection path, and the processing circuits execute the relevant method steps to implement the voice search method in the foregoing embodiments, so as to control the receiving pin to receive signals and control the sending pin to send signals.
In addition, as can be seen from the above description, the electronic device, the computer-readable storage medium, the computer program product, or the chip provided in the embodiments of the present application are all configured to execute the corresponding method provided above; therefore, for the beneficial effects they achieve, reference may be made to the beneficial effects of the corresponding method provided above, which are not repeated herein.
The above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and these modifications or substitutions do not depart from the scope of the technical solutions of the embodiments of the present application.

Claims (11)

1. A voice search method, applied to an electronic device, wherein the method comprises the following steps:
acquiring first audio data and second audio data, wherein the first audio data are continuous voice stream data to be searched, and the second audio data are audio data of keywords to be searched;
extracting a first audio feature of the first audio data, and constructing an index table according to the first audio feature and a time point of the first audio feature in the first audio data;
searching the first audio data for a matched first audio feature according to a second audio feature of the second audio data;
when the matched first audio features are found, determining a voice segment containing the matched first audio features and the time point of the voice segment according to the matched first audio features and the time points of the matched first audio features recorded in the index table;
and taking the voice segment with the determined time point as a final voice search result.
2. The method of claim 1, wherein the extracting a first audio feature of the first audio data, and constructing an index table according to the first audio feature and a time point of the first audio feature in the first audio data, comprises:
extracting a first audio feature of each frame of audio data in the first audio data;
for the first audio feature of each frame of audio data, constructing a mapping relationship between the first audio feature and the time point of the first audio feature in the first audio data;
and constructing the index table according to the mapping relationships of all the first audio features extracted from the first audio data.
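As a hedged illustration of the index table described in claim 2 (one possible realization only: the 10 ms frame hop is an assumption for this sketch, not a value from the claims), each frame's time point in the first audio data can simply be associated with that frame's first audio feature:

```python
import numpy as np

def build_index_table(frame_features, hop_s=0.010):
    """frame_features: array of shape (num_frames, feature_dim), one first audio feature per frame."""
    index_table = {}
    for frame_idx, feature in enumerate(frame_features):
        time_point_s = round(frame_idx * hop_s, 3)   # time point of this feature in the first audio data
        index_table[time_point_s] = feature
    return index_table

# Example: 5 frames of 13-dimensional features.
table = build_index_table(np.zeros((5, 13)))
print(sorted(table.keys()))   # [0.0, 0.01, 0.02, 0.03, 0.04]
```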
3. The method of claim 2, wherein extracting the first audio feature of each frame of audio data in the first audio data comprises:
performing pre-emphasis processing on the first audio data;
performing frame division processing on the pre-emphasized first audio data;
windowing the first audio data subjected to framing processing;
performing time-frequency conversion processing on the first audio data subjected to windowing processing;
and determining Mel frequency cepstrum coefficient audio features of each frame of audio data in the first audio data after time-frequency conversion processing, to obtain the first audio feature of each frame of audio data in the first audio data.
4. The method as claimed in claim 3, wherein the determining the Mel frequency cepstrum coefficient audio features of each frame of audio data in the first audio data after time-frequency conversion processing comprises:
determining an energy spectrum corresponding to an amplitude spectrum by using a formula for converting frequency into Mel scale, wherein the amplitude spectrum is obtained by performing time-frequency conversion on the first audio data;
passing the energy spectrum through a Mel filter bank, and calculating logarithmic energy output by the Mel filter bank, wherein the Mel filter bank comprises M triangular filters, and M is an integer greater than 0;
and performing discrete cosine transform on the logarithmic energy to obtain the Mel frequency cepstrum coefficient audio features of each frame of audio data in the first audio data after time-frequency conversion processing.
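The following is a hedged from-scratch sketch of the steps listed in claims 3 and 4 (pre-emphasis, framing, windowing, time-frequency conversion, Mel filter bank, logarithmic energy, and discrete cosine transform). The parameter values (0.97 pre-emphasis coefficient, 25 ms frames, 10 ms hop, 26 triangular filters, 13 coefficients) are common defaults assumed here rather than values from the claims, and librosa is used only to build the triangular Mel filter bank:

```python
import numpy as np
import librosa                      # used here only for the Mel filter bank
from scipy.fftpack import dct

def mfcc_from_scratch(y, sr=16000, frame_ms=25, hop_ms=10, n_fft=512, n_mels=26, n_mfcc=13):
    # 1. Pre-emphasis processing.
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])
    # 2. Frame division (the signal is assumed to be at least one frame long).
    frame_len = int(sr * frame_ms / 1000)
    hop_len = int(sr * hop_ms / 1000)
    num_frames = 1 + (len(y) - frame_len) // hop_len
    frames = np.stack([y[i * hop_len: i * hop_len + frame_len] for i in range(num_frames)])
    # 3. Windowing (Hamming window).
    frames = frames * np.hamming(frame_len)
    # 4. Time-frequency conversion: amplitude spectrum, then energy (power) spectrum.
    amplitude = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))
    energy = (amplitude ** 2) / n_fft
    # 5. M triangular Mel filters and logarithmic energy output.
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)   # shape (n_mels, n_fft // 2 + 1)
    log_energy = np.log(energy @ mel_fb.T + 1e-10)
    # 6. Discrete cosine transform -> one row of MFCCs per frame.
    return dct(log_energy, type=2, axis=1, norm="ortho")[:, :n_mfcc]

# Example on one second of random noise (illustrative input only).
features = mfcc_from_scratch(np.random.randn(16000).astype(np.float32))
print(features.shape)   # (num_frames, 13)
```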
5. The method of claim 1, wherein the second audio feature of the second audio data is a mel-frequency cepstrum coefficient audio feature;
the searching for the matching first audio feature from the first audio data according to the second audio feature of the second audio data comprises:
determining a cumulative distance between the Mel frequency cepstrum coefficient audio feature of each frame of audio data of the second audio data and the Mel frequency cepstrum coefficient audio feature of each frame of audio data of the first audio data by using a dynamic time warping algorithm;
and screening the cumulative distances according to a preset condition, and taking the first audio feature corresponding to the screened cumulative distance as the audio feature matched with the second audio feature.
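A full dynamic time warping recursion is sketched earlier in this document (the search-flow sketch accompanying fig. 9); the screening step of claim 5 can be illustrated separately as follows, with the distance values and the threshold being purely illustrative numbers rather than values from the claims:

```python
# Keep only candidates whose DTW cumulative distance satisfies the preset
# condition (here, below a threshold), then rank the surviving matches.
candidates = [
    {"time_s": 120.0, "cumulative_distance": 180.4},
    {"time_s": 245.5, "cumulative_distance": 96.1},
    {"time_s": 2700.0, "cumulative_distance": 88.7},
]
THRESHOLD = 100.0   # preset screening condition

matches = [c for c in candidates if c["cumulative_distance"] < THRESHOLD]
matches.sort(key=lambda c: c["cumulative_distance"])   # best match first
print(matches)
```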
6. The method of any of claims 1 to 5, wherein the audio feature of the first audio data and the audio feature of the second audio data are vocal features.
7. The method according to any one of claims 1 to 5, wherein a first control for importing the first audio data and a second control for recording the second audio data are displayed in a display interface of the electronic device;
the acquiring of the first audio data and the second audio data includes:
responding to the clicking operation of the first control, displaying a first audio data selection list, wherein selection controls corresponding to the selectable audio data are displayed in the first audio data selection list;
responding to the click operation of any selection control, and acquiring audio data corresponding to the selection control to obtain the first audio data;
responding to the click operation of the second control, and recording audio data through audio acquisition equipment;
and in the process of recording the audio data by the audio acquisition equipment, stopping recording the audio data after a click operation on the second control is received again, and taking the audio data recorded between the two click operations as the second audio data.
8. The method according to any one of claims 1 to 5, wherein after the voice segment with the determined time point is taken as the final voice search result, the method further comprises:
displaying a voice search list in a display interface of the electronic equipment;
and displaying the voice search result in the voice search list.
9. The method of claim 8, wherein the audition control corresponding to each voice search result is displayed in the voice search list;
after displaying the voice search results in the voice search listing, the method further comprises:
and responding to the clicking operation of any one audition control, determining the playing position of the first audio data according to the time point of the voice search result corresponding to the audition control, and playing the first audio data from the playing position.
10. An electronic device, characterized in that the electronic device comprises: a memory and a processor, the memory and the processor coupled; the memory stores program instructions that, when executed by the processor, cause the electronic device to perform the voice search method of any of claims 1 to 9.
11. A computer-readable storage medium, characterized by comprising a computer program which, when run on an electronic device, causes the electronic device to execute the voice search method according to any one of claims 1 to 9.
CN202210536526.6A 2022-05-17 2022-05-17 Voice searching method, device and storage medium Active CN115129923B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210536526.6A CN115129923B (en) 2022-05-17 2022-05-17 Voice searching method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210536526.6A CN115129923B (en) 2022-05-17 2022-05-17 Voice searching method, device and storage medium

Publications (2)

Publication Number Publication Date
CN115129923A true CN115129923A (en) 2022-09-30
CN115129923B CN115129923B (en) 2023-10-20

Family

ID=83376328

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210536526.6A Active CN115129923B (en) 2022-05-17 2022-05-17 Voice searching method, device and storage medium

Country Status (1)

Country Link
CN (1) CN115129923B (en)

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101281534A (en) * 2008-05-28 2008-10-08 叶睿智 Method for searching multimedia resource based on audio content retrieval
WO2017144007A1 (en) * 2016-02-25 2017-08-31 深圳创维数字技术有限公司 Method and system for audio recognition based on empirical mode decomposition
CN107436871A (en) * 2016-05-25 2017-12-05 北京搜狗科技发展有限公司 A kind of data search method, device and electronic equipment
CN109147818A (en) * 2018-10-30 2019-01-04 Oppo广东移动通信有限公司 Acoustic feature extracting method, device, storage medium and terminal device
CN110556126A (en) * 2019-09-16 2019-12-10 平安科技(深圳)有限公司 Voice recognition method and device and computer equipment
US20200043502A1 (en) * 2017-11-30 2020-02-06 Tencent Technology (Shenzhen) Company Limited Information processing method and device, multimedia device and storage medium
CN112035696A (en) * 2020-09-09 2020-12-04 兰州理工大学 Voice retrieval method and system based on audio fingerprints
CN112883235A (en) * 2021-03-11 2021-06-01 深圳市一览网络股份有限公司 Video content searching method and device, computer equipment and storage medium
CN113112992A (en) * 2019-12-24 2021-07-13 中国移动通信集团有限公司 Voice recognition method and device, storage medium and server
CN113160852A (en) * 2021-04-16 2021-07-23 平安科技(深圳)有限公司 Voice emotion recognition method, device, equipment and storage medium
CN113192535A (en) * 2021-04-16 2021-07-30 中国科学院声学研究所 Voice keyword retrieval method, system and electronic device
CN113593523A (en) * 2021-01-20 2021-11-02 腾讯科技(深圳)有限公司 Speech detection method and device based on artificial intelligence and electronic equipment
CN113628612A (en) * 2020-05-07 2021-11-09 北京三星通信技术研究有限公司 Voice recognition method and device, electronic equipment and computer readable storage medium
CN114121004A (en) * 2021-11-23 2022-03-01 北京百瑞互联技术有限公司 Speech recognition method, system, medium, and apparatus based on deep learning

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101281534A (en) * 2008-05-28 2008-10-08 叶睿智 Method for searching multimedia resource based on audio content retrieval
WO2017144007A1 (en) * 2016-02-25 2017-08-31 深圳创维数字技术有限公司 Method and system for audio recognition based on empirical mode decomposition
CN107436871A (en) * 2016-05-25 2017-12-05 北京搜狗科技发展有限公司 A kind of data search method, device and electronic equipment
US20200043502A1 (en) * 2017-11-30 2020-02-06 Tencent Technology (Shenzhen) Company Limited Information processing method and device, multimedia device and storage medium
CN109147818A (en) * 2018-10-30 2019-01-04 Oppo广东移动通信有限公司 Acoustic feature extracting method, device, storage medium and terminal device
CN110556126A (en) * 2019-09-16 2019-12-10 平安科技(深圳)有限公司 Voice recognition method and device and computer equipment
CN113112992A (en) * 2019-12-24 2021-07-13 中国移动通信集团有限公司 Voice recognition method and device, storage medium and server
CN113628612A (en) * 2020-05-07 2021-11-09 北京三星通信技术研究有限公司 Voice recognition method and device, electronic equipment and computer readable storage medium
CN112035696A (en) * 2020-09-09 2020-12-04 兰州理工大学 Voice retrieval method and system based on audio fingerprints
CN113593523A (en) * 2021-01-20 2021-11-02 腾讯科技(深圳)有限公司 Speech detection method and device based on artificial intelligence and electronic equipment
CN112883235A (en) * 2021-03-11 2021-06-01 深圳市一览网络股份有限公司 Video content searching method and device, computer equipment and storage medium
CN113160852A (en) * 2021-04-16 2021-07-23 平安科技(深圳)有限公司 Voice emotion recognition method, device, equipment and storage medium
CN113192535A (en) * 2021-04-16 2021-07-30 中国科学院声学研究所 Voice keyword retrieval method, system and electronic device
CN114121004A (en) * 2021-11-23 2022-03-01 北京百瑞互联技术有限公司 Speech recognition method, system, medium, and apparatus based on deep learning

Also Published As

Publication number Publication date
CN115129923B (en) 2023-10-20

Similar Documents

Publication Publication Date Title
CN107731223B (en) Voice activity detection method, related device and equipment
CN111402855B (en) Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN110310623B (en) Sample generation method, model training method, device, medium, and electronic apparatus
CN111369971B (en) Speech synthesis method, device, storage medium and electronic equipment
CN110097890B (en) Voice processing method and device for voice processing
CN105118522B (en) Noise detection method and device
CN108922525B (en) Voice processing method, device, storage medium and electronic equipment
US20210335364A1 (en) Computer program, server, terminal, and speech signal processing method
US11715480B2 (en) Context-based speech enhancement
CN111583944A (en) Sound changing method and device
CN111640411B (en) Audio synthesis method, device and computer readable storage medium
US9058384B2 (en) System and method for identification of highly-variable vocalizations
CN110688518A (en) Rhythm point determining method, device, equipment and storage medium
CN108831456A (en) It is a kind of by speech recognition to the method, apparatus and system of video marker
CN110176243B (en) Speech enhancement method, model training method, device and computer equipment
CN104851423B (en) Sound information processing method and device
CN113223542B (en) Audio conversion method and device, storage medium and electronic equipment
CN115129923B (en) Voice searching method, device and storage medium
CN109064720B (en) Position prompting method and device, storage medium and electronic equipment
CN108538309B (en) Singing voice detection method
CN113113040B (en) Audio processing method and device, terminal and storage medium
CN112259077B (en) Speech recognition method, device, terminal and storage medium
CN111345016A (en) Start control method and start control system of intelligent terminal
CN114999440A (en) Avatar generation method, apparatus, device, storage medium, and program product
US9251782B2 (en) System and method for concatenate speech samples within an optimal crossing point

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant