CN110085214A - Audio starting point detection method and device - Google Patents

Audio starting point detection method and device

Info

Publication number
CN110085214A
CN110085214A (application CN201910151671.0A; granted publication CN110085214B)
Authority
CN
China
Prior art keywords
audio
frequency
determining
starting point
spectrum parameters
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910151671.0A
Other languages
Chinese (zh)
Other versions
CN110085214B (en)
Inventor
李为
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Douyin Vision Co Ltd
Douyin Vision Beijing Co Ltd
Original Assignee
Beijing ByteDance Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing ByteDance Network Technology Co Ltd filed Critical Beijing ByteDance Network Technology Co Ltd
Priority to CN201910151671.0A priority Critical patent/CN110085214B/en
Publication of CN110085214A publication Critical patent/CN110085214A/en
Application granted granted Critical
Publication of CN110085214B publication Critical patent/CN110085214B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/04: Segmentation; Word boundary detection
    • G10L15/05: Word boundary detection
    • G10L19/00: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02: Speech or audio signals analysis-synthesis techniques using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78: Detection of presence or absence of voice signals
    • G10L25/87: Detection of discrete points within a voice signal

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Telephone Function (AREA)

Abstract

The present disclosure discloses an audio starting point detection method and apparatus, an electronic device, and a computer-readable storage medium. The audio starting point detection method includes: for each frequency band, determining a mean value of the speech spectrum parameters according to the speech spectrum parameters of the current frequency band and the speech spectrum parameters of a predetermined number of frequency bands chosen from the remaining frequency bands, where the remaining frequency bands are all frequency bands located before the current frequency band in time order; and determining one or more starting point positions of notes and syllables in the audio according to the speech spectrum parameters and the mean value corresponding to each frequency band. Because the embodiments of the present disclosure refer to the speech spectrum parameters of multiple frequency bands when determining the starting point positions, the determined mean value of the speech spectrum parameters is more accurate, which mitigates the signal offset phenomenon in the curve formed from the speech spectrum parameters and the mean value, so that the starting points of notes and syllables in the audio can be detected accurately, reducing false detections and missed detections.

Description

Audio starting point detection method and device
Technical Field
The present disclosure relates to the field of audio processing technologies, and in particular, to a method and an apparatus for detecting an audio starting point, an electronic device, and a computer-readable storage medium.
Background
Audio starting point (onset) detection is an information extraction algorithm applied to audio signals, with the goal of accurately detecting the locations where notes and syllables begin. A note refers to a unit of a music signal; a syllable (phone) applies specifically to voice and human-voice signals. Audio starting point detection has many important uses and application prospects in the field of signal processing, for example: automatic segmentation and labeling of human voice and music audio, information extraction, segmented compression, and interactive entertainment. Figs. 1a and 1b illustrate starting point detection, where Fig. 1a shows an audio signal and Fig. 1b shows the detected starting point positions.
In the prior art, a speech spectrum parameter curve corresponding to the audio signal is usually calculated, local maximum points of the curve are determined, the speech spectrum parameter corresponding to each such point is compared with a set threshold, and if the parameter is greater than the threshold, the position corresponding to that point is determined to be a starting point position.
However, the above algorithm is mainly suitable for audio signals with clear boundaries and a relatively simple, pronounced rhythm (e.g., fast-tempo music with clear note boundaries). For audio signals with a complex but weakly pronounced rhythm (e.g., music mixed from multiple instruments, slow-tempo music, and human voice), the above detection algorithm cannot accurately detect the boundaries, and false detections and missed detections occur frequently.
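The prior-art procedure described above can be sketched as follows (a minimal illustration; the function and variable names are assumptions, not taken from the patent): find the local maximum points of a spectral-parameter curve and keep those whose value exceeds a set threshold.

```python
def detect_onsets_prior_art(curve, threshold):
    """curve[i] is the speech spectrum parameter of frame i; returns candidate
    starting point indices: local maxima whose value exceeds the threshold."""
    onsets = []
    for i in range(1, len(curve) - 1):
        # a local maximum point of the curve ...
        if curve[i] > curve[i - 1] and curve[i] > curve[i + 1]:
            # ... counts as a starting point only if it exceeds the set threshold
            if curve[i] > threshold:
                onsets.append(i)
    return onsets
```

On audio with sharp note boundaries the raw curve peaks strongly at onsets, so this works; on slowly varying signals the curve drifts, and the fixed threshold then causes the false and missed detections that the disclosure targets.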
Disclosure of Invention
In a first aspect, an embodiment of the present disclosure provides an audio starting point detection method, including:
determining the speech spectrum parameters corresponding to each frequency band according to the frequency domain signals corresponding to the audio signal of the audio;
for each frequency band, determining a mean value of the speech spectrum parameters according to the speech spectrum parameters of the current frequency band and the speech spectrum parameters of a preset number of frequency bands selected from the remaining frequency bands, and taking the mean value as the mean value of the current frequency band, wherein the remaining frequency bands are all frequency bands located before the current frequency band in time order;
and determining the starting point positions according to the speech spectrum parameters and the mean value corresponding to each frequency band.
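The mean-value step of the first aspect can be sketched as below. This is an illustrative reading of the claim: the "preset number" of preceding bands is chosen here as the nearest ones, and all names are hypothetical.

```python
def band_means(params, preset_number):
    """params lists the speech spectrum parameters in time order; the mean for
    band i combines its own parameter with those of up to `preset_number`
    bands located before it (here: the nearest preceding bands)."""
    means = []
    for i, p in enumerate(params):
        chosen = params[max(0, i - preset_number):i]  # only earlier bands
        means.append((p + sum(chosen)) / (1 + len(chosen)))
    return means
```

Restricting the choice to bands before the current one keeps the computation causal, matching the real-time motivation stated later in the description.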
Further, the determining the locations of one or more starting points of notes and syllables in the audio according to the speech spectrum parameters and the mean value corresponding to each frequency band includes:
calculating the difference value between the speech spectrum parameter of each frequency band and its mean value;
and determining one or more starting point positions of the notes and syllables in the audio according to the difference values of the frequency bands.
Further, the determining one or more starting point positions of the notes and syllables in the audio according to the difference values of the frequency bands includes:
drawing a speech spectrum parameter curve according to the difference values of the frequency bands;
and determining local highest points according to the speech spectrum parameter curve, and determining one or more starting point positions of notes and syllables in the audio according to the difference values corresponding to the local highest points.
Further, the determining the speech spectrum parameters corresponding to each frequency band according to the frequency domain signal corresponding to the audio signal of the audio includes:
segmenting the audio signal into a plurality of sub-audio signals, and converting each sub-audio signal into a frequency domain signal, wherein each sub-audio signal corresponds to one frequency band;
and determining the speech spectrum parameters corresponding to each frequency band.
In a second aspect, an embodiment of the present disclosure provides an audio starting point detecting apparatus, including:
the parameter determining module is configured to determine the speech spectrum parameters corresponding to each frequency band according to the frequency domain signals corresponding to the audio signal of the audio;
the mean value determining module is configured to, for each frequency band, determine a mean value of the speech spectrum parameters according to the speech spectrum parameters of the current frequency band and the speech spectrum parameters of a preset number of frequency bands selected from the remaining frequency bands, and take the mean value as the mean value of the current frequency band, wherein the remaining frequency bands are all frequency bands located before the current frequency band in time order;
and the starting point determining module is configured to determine one or more starting point positions of notes and syllables in the audio according to the speech spectrum parameters and the mean value corresponding to each frequency band.
Further, the starting point determining module includes:
a difference value calculating unit configured to calculate the difference value between the speech spectrum parameter of each frequency band and its mean value;
and a starting point determining unit configured to determine one or more starting point positions of the notes and syllables in the audio according to the difference values of the frequency bands.
Further, the starting point determining unit is specifically configured to: draw a speech spectrum parameter curve according to the difference values of the frequency bands; determine local highest points according to the speech spectrum parameter curve, and determine one or more starting point positions of notes and syllables in the audio according to the difference values corresponding to the local highest points.
Further, the parameter determining module is specifically configured to: segment the audio signal into a plurality of sub-audio signals, and convert each sub-audio signal into a frequency domain signal, wherein each sub-audio signal corresponds to one frequency band; and determine the speech spectrum parameters corresponding to each frequency band.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the audio starting point detection method of any one of the foregoing first aspects.
In a fourth aspect, an embodiment of the present disclosure provides a non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute any one of the audio starting point detection methods of the foregoing first aspect.
In the method, the speech spectrum parameters corresponding to each frequency band are determined according to the frequency domain signals corresponding to the audio signal of the audio. For each frequency band, a mean value of the speech spectrum parameters is determined according to the speech spectrum parameters of the current frequency band and the speech spectrum parameters of a preset number of frequency bands selected from the remaining frequency bands, and the mean value is taken as the mean value of the current frequency band, where the remaining frequency bands are all frequency bands located before the current frequency band in time order. One or more starting point positions of notes and syllables in the audio are then determined according to the speech spectrum parameters and the mean value corresponding to each frequency band. Because the speech spectrum parameters of multiple frequency bands are referred to when determining the starting point positions, the determined mean value of the speech spectrum parameters is more accurate, which mitigates the signal offset phenomenon in the curve formed from the speech spectrum parameters and the mean value, so that the starting points of notes and syllables in the audio can be detected accurately, and false detections and missed detections are reduced.
The foregoing is a summary of the present disclosure, and for the purposes of promoting a clear understanding of the technical means of the present disclosure, the present disclosure may be embodied in other specific forms without departing from the spirit or essential attributes thereof.
Drawings
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings in the following description show only some embodiments of the present disclosure; other drawings can be obtained by those skilled in the art from these drawings without creative effort.
FIG. 1a is a schematic diagram of an audio signal provided by the prior art;
FIG. 1b is a diagram illustrating a detection result of an audio start point according to the prior art;
fig. 2a is a flowchart of an audio starting point detection method according to an embodiment of the disclosure;
fig. 2b is a schematic diagram of an audio signal in an audio starting point detection method according to an embodiment of the disclosure;
fig. 2c is a speech frequency spectrum diagram of an audio signal in an audio starting point detection method according to an embodiment of the disclosure;
fig. 3a is a flowchart of an audio starting point detection method according to a second embodiment of the disclosure;
FIG. 3b is a graph of the speech spectral parameter composition in the audio starting point detection method provided by the prior art;
fig. 3c is a graph of a difference component in the audio starting point detection method according to the second embodiment of the disclosure;
fig. 4 is a schematic structural diagram of an audio starting point detection apparatus according to a third embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of an electronic device according to a fourth embodiment of the present disclosure.
Detailed Description
The embodiments of the present disclosure are described below with specific examples, and other advantages and effects of the present disclosure will be readily apparent to those skilled in the art from the disclosure in the specification. It is to be understood that the described embodiments are merely illustrative of some, and not restrictive, of the embodiments of the disclosure. The disclosure may be embodied or carried out in various other specific embodiments, and various modifications and changes may be made in the details within the description without departing from the spirit of the disclosure. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.
It is noted that various aspects of the embodiments are described below within the scope of the appended claims. It should be apparent that the aspects described herein may be embodied in a wide variety of forms and that any specific structure and/or function described herein is merely illustrative. Based on the disclosure, one skilled in the art should appreciate that one aspect described herein may be implemented independently of any other aspects and that two or more of these aspects may be combined in various ways. For example, an apparatus may be implemented and/or a method practiced using any number of the aspects set forth herein. Additionally, such an apparatus may be implemented and/or such a method may be practiced using other structure and/or functionality in addition to one or more of the aspects set forth herein.
It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present disclosure, and the drawings only show the components related to the present disclosure rather than the number, shape and size of the components in actual implementation, and the type, amount and ratio of the components in actual implementation may be changed arbitrarily, and the layout of the components may be more complicated.
In addition, in the following description, specific details are provided to facilitate a thorough understanding of the examples. However, it will be understood by those skilled in the art that the aspects may be practiced without these specific details.
Example one
Fig. 2a is a flowchart of an audio starting point detection method according to an embodiment of the present disclosure. The audio starting point detection method of this embodiment may be executed by an audio starting point detection apparatus, which may be implemented as software, or as a combination of software and hardware, and may be integrated in a device in an audio starting point detection system, such as an audio starting point detection server or terminal device. The embodiment can be applied to scenes involving audio with a complex but weakly pronounced rhythm (such as music mixed from multiple instruments, slow-tempo music, and human voice). As shown in fig. 2a, the method comprises the following steps:
step S21: and determining the voice spectrum parameters corresponding to each frequency band according to the frequency domain signals corresponding to the audio signals of the audio.
The audio signal may be a piece of music or voice, and the corresponding frequency domain signal is obtained by converting the audio signal of the time domain into the frequency domain.
The speech spectral parameters can be determined according to the spectral amplitude and phase.
In an optional embodiment, step S21 specifically includes:
step S211: the audio signal of the audio is segmented into a plurality of sub audio signals, and each sub audio signal is converted into a frequency domain signal, wherein each sub audio signal corresponds to one frequency band.
Step S212: and determining first voice spectrum parameters corresponding to each frequency band.
Specifically, the audio signal is a one-dimensional discrete-time sequence, which can be expressed as X = {x1, x2, …, xN}, where N is the total number of discrete sample points. Although the audio signal is aperiodic over long durations, it approximately exhibits a stationary (approximately periodic) characteristic over a short-time range (the short time is usually defined as 10-40 ms), so the audio signal can be divided into short-time speech segments of equal length, i.e., sub-audio signals, for analysis. For example, as shown in fig. 2b, for an audio signal with a sampling rate of 16000 Hz, 512 sample points can be selected as one sub-audio signal, which corresponds to a speech length of 32 ms.
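The segmentation into equal-length sub-audio signals can be illustrated as follows (a sketch; a trailing partial segment is simply dropped here, which the patent does not specify):

```python
def split_into_sub_signals(samples, frame_len):
    """Cut the discrete signal X = {x1 ... xN} into equal-length short-time
    segments (sub-audio signals); a trailing partial segment is discarded."""
    n = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n)]

# e.g. at a 16000 Hz sampling rate, 512 samples per segment = 512/16000 = 0.032 s
```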
Here, a Fourier transform may be used to convert the time-domain audio signal into the frequency domain. The frequency information that changes over time is called a spectrogram; as shown in fig. 2c, the energy changes of the sub-audio signals in different frequency bands can be clearly seen, and at the starting point positions the frequency spectrum shows obvious step changes.
Wherein the corresponding frequency domain signal can be expressed as X_n(k) = Σ_{l=0}^{L-1} x(nL + l) · e^(-j2πkl/L), where n denotes the nth sub-audio signal, L denotes the length of the sub-audio signal, and k denotes the kth frequency band.
Accordingly, when the audio signal is divided into a plurality of sub-audio signals, the first speech spectrum parameter may specifically be a combined weighting of the spectral amplitudes and phases of different sub-audio signals, calculated, for example, by a complex-domain formula of the form Γ_n(k) = sqrt(|X_n(k)|² + |X_{n-1}(k)|² - 2|X_n(k)||X_{n-1}(k)|cos(φ″_n(k))), where |X_n(k)| is the amplitude of the kth frequency band, φ″_n(k) = φ′_n(k) - φ′_{n-1}(k) is the second-order phase difference of the kth frequency band, φ′_n(k) = φ_n(k) - φ_{n-1}(k) is the first-order phase difference of the kth frequency band, and φ_n(k) is the phase of the kth frequency band. The second-order difference of the phase is adopted in this embodiment to better represent the starting point information.
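A sketch of this computation follows, using a plain DFT and a standard complex-domain combination of amplitude and second-order phase difference; since the original weighting formula is not reproduced here, this particular combination is an assumption, and all names are illustrative.

```python
import cmath
import math

def dft(frame):
    """X_n(k) for one sub-audio signal of length L (naive DFT, for illustration)."""
    L = len(frame)
    return [sum(frame[l] * cmath.exp(-2j * math.pi * k * l / L) for l in range(L))
            for k in range(L)]

def spectral_parameter(X_prev2, X_prev, X_cur, k):
    """Combine amplitude and second-order phase difference for frequency band k,
    given the spectra of three consecutive sub-audio signals."""
    p0, p1, p2 = (cmath.phase(X_prev2[k]), cmath.phase(X_prev[k]),
                  cmath.phase(X_cur[k]))
    d2 = (p2 - p1) - (p1 - p0)          # second-order phase difference
    a_cur, a_prev = abs(X_cur[k]), abs(X_prev[k])
    # complex-domain deviation: zero for a steady tone, large at an onset
    return math.sqrt(max(0.0, a_cur ** 2 + a_prev ** 2
                         - 2.0 * a_cur * a_prev * math.cos(d2)))
```

For a steady sinusoid whose phase repeats from frame to frame the parameter is zero, while a sudden amplitude or phase change yields a large value, which is why the second-order phase difference helps mark starting points.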
Step S22: for each frequency band, determining a mean value of the speech spectrum parameters according to the speech spectrum parameters of the current frequency band and the speech spectrum parameters of a preset number of frequency bands selected from the remaining frequency bands, and taking the mean value as the mean value of the current frequency band, wherein the remaining frequency bands are all frequency bands located before the current frequency band in time order.
The preset number can be set by the user.
In order to ensure the real-time performance of the starting point detection, the remaining frequency bands are all frequency bands located before the current frequency band in time order.
Specifically, in determining the speech spectrum parameters of each frequency band, first any frequency band is selected as the current frequency band; then a mean value of the speech spectrum parameters is determined according to the speech spectrum parameters of the current frequency band and the speech spectrum parameters of a preset number of frequency bands selected from the remaining frequency bands, and this mean value is taken as the mean value of the current frequency band; then any frequency band is selected from the remaining frequency bands as the new current frequency band, and the above operations are repeated until the speech spectrum parameters and mean values of all frequency bands have been determined.
Step S23: determining one or more starting point positions of notes and syllables in the audio according to the speech spectrum parameters and the mean value corresponding to each frequency band.
In this embodiment, the speech spectrum parameters corresponding to each frequency band are determined according to the frequency domain signals corresponding to the audio signal of the audio. For each frequency band, a mean value of the speech spectrum parameters is determined according to the speech spectrum parameters of the current frequency band and the speech spectrum parameters of a preset number of frequency bands selected from the remaining frequency bands, and the mean value is taken as the mean value of the current frequency band, where the remaining frequency bands are all frequency bands located before the current frequency band in time order. One or more starting point positions of notes and syllables in the audio are then determined according to the speech spectrum parameters and the mean value corresponding to each frequency band. Because the speech spectrum parameters of multiple frequency bands are referred to when determining the starting point positions, the determined mean value is more accurate, which mitigates the signal offset phenomenon in the curve formed from the speech spectrum parameters and the mean value, so that the starting points of notes and syllables in the audio can be detected accurately, and false detections and missed detections are reduced.
Example two
Fig. 3a is a flowchart of an audio starting point detection method according to a second embodiment of the present disclosure. In this embodiment, based on the above embodiment, the step of determining the starting point positions according to the speech spectrum parameters and the mean value corresponding to each frequency band is further optimized. This embodiment may be applied to scenes involving audio with a complex but weakly pronounced rhythm (for example, music mixed from multiple instruments, slow-tempo music, and human voice). As shown in fig. 3a, the method specifically includes:
step S31: and determining the voice spectrum parameters corresponding to each frequency band according to the frequency domain signals corresponding to the audio signals of the audio.
Step S32: and determining the mean value of the voice spectrum parameters according to the voice spectrum parameters of the current frequency band and the voice spectrum parameters of a preset number of frequency bands selected from the residual frequency bands, and taking the mean value as the mean value of the current frequency band, wherein the residual frequency bands are all frequency bands positioned before the current frequency band according to time sequence.
Step S33: and calculating the difference value between the speech spectrum parameter of each frequency and the average value.
Step S34: one or more start point positions of the notes and syllables in the audio are determined based on the difference between the frequencies.
In an alternative embodiment, step S34 includes:
step S341: and drawing a voice spectrum parameter curve according to the difference value of each frequency.
Step S342: and determining a local highest point according to the voice spectrum parameter curve, and determining one or more starting point positions of notes and syllables in the audio frequency according to a difference value corresponding to the local highest point.
Specifically, the voice spectrum parameter curve drawn according to the difference between the frequencies can reduce the signal offset phenomenon in the curve formed by the voice spectrum parameters in the prior art, as shown in fig. 3b, which is a curve formed by the voice spectrum parameters in the prior art, as shown in fig. 3c, which is a curve formed by the voice spectrum parameters in the present scheme.
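Steps S341 and S342 can be sketched together as follows (illustrative only; the function name and the use of a fixed threshold on the local highest points are assumptions, not from the patent):

```python
def starting_points_from_differences(params, means, threshold):
    """Build the difference curve between each band's speech spectrum parameter
    and its mean, then keep local highest points above a threshold as the
    starting point positions."""
    diffs = [p - m for p, m in zip(params, means)]
    return [i for i in range(1, len(diffs) - 1)
            if diffs[i] > diffs[i - 1] and diffs[i] > diffs[i + 1]
            and diffs[i] > threshold]
```

Because the mean tracks the slow drift of the signal, subtracting it flattens the baseline of the curve (as in fig. 3c versus fig. 3b), so a single threshold can work across the whole recording.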
In this embodiment of the present disclosure, the speech spectrum parameters corresponding to multiple frequency bands are referred to when determining the starting point positions of the notes and syllables in the audio, so that the determined mean value of the speech spectrum parameters is more accurate. The starting point positions are then determined according to the difference between the speech spectrum parameters and the mean value, which mitigates the signal offset phenomenon in the resulting curve, so that the starting points of the audio can be detected accurately and false detections and missed detections are reduced.
EXAMPLE III
Fig. 4 is a schematic structural diagram of an audio starting point detection apparatus according to a third embodiment of the present disclosure. The audio starting point detection apparatus may be implemented as software, or as a combination of software and hardware, and may be integrated in a device in an audio starting point detection system, such as an audio starting point detection server or terminal device. The embodiment can be applied to scenes involving audio with a complex but weakly pronounced rhythm (such as music mixed from multiple instruments, slow-tempo music, and human voice). As shown in fig. 4, the apparatus includes: a parameter determining module 41, a mean value determining module 42 and a starting point determining module 43; wherein,
the parameter determining module 41 is configured to determine the speech spectrum parameters corresponding to each frequency band according to the frequency domain signals corresponding to the audio signal of the audio;
the mean value determining module 42 is configured to, for each frequency band, determine a mean value of the speech spectrum parameters according to the speech spectrum parameters of the current frequency band and the speech spectrum parameters of a preset number of frequency bands selected from the remaining frequency bands, and take the mean value as the mean value of the current frequency band, wherein the remaining frequency bands are all frequency bands located before the current frequency band in time order;
the starting point determining module 43 is configured to determine one or more starting point positions of the notes and syllables in the audio according to the speech spectrum parameters and the mean value corresponding to each frequency band.
Further, the starting point determining module 43 includes: a difference value calculating unit 431 and a starting point determining unit 432; wherein,
the difference value calculating unit 431 is configured to calculate the difference value between the speech spectrum parameter of each frequency band and its mean value;
the starting point determining unit 432 is configured to determine one or more starting point positions of the notes and syllables in the audio according to the difference values of the frequency bands.
Further, the starting point determining unit 432 is specifically configured to: draw a speech spectrum parameter curve according to the difference values of the frequency bands; determine local highest points according to the speech spectrum parameter curve, and determine one or more starting point positions of notes and syllables in the audio according to the difference values corresponding to the local highest points.
Further, the parameter determining module 41 is specifically configured to: segment the audio signal into a plurality of sub-audio signals, and convert each sub-audio signal into a frequency domain signal, wherein each sub-audio signal corresponds to one frequency band; and determine the speech spectrum parameters corresponding to each frequency band.
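The three modules can be composed as in the sketch below. The class and method names are illustrative, and short-time energy stands in for the actual speech spectrum parameter of the disclosure.

```python
class AudioStartingPointDetector:
    """Parameter determining, mean value determining and starting point
    determining modules composed into one illustrative detector."""

    def __init__(self, preset_number, threshold):
        self.preset_number = preset_number
        self.threshold = threshold

    def determine_parameters(self, sub_signals):
        # parameter determining module: short-time energy as a simple stand-in
        return [sum(s * s for s in frame) for frame in sub_signals]

    def determine_means(self, params):
        # mean value determining module: current band plus nearest preceding bands
        return [(params[i] + sum(params[max(0, i - self.preset_number):i]))
                / (1 + min(i, self.preset_number))
                for i in range(len(params))]

    def determine_starting_points(self, sub_signals):
        # starting point determining module: local maxima of the difference curve
        params = self.determine_parameters(sub_signals)
        means = self.determine_means(params)
        diffs = [p - m for p, m in zip(params, means)]
        return [i for i in range(1, len(diffs) - 1)
                if diffs[i] > diffs[i - 1] and diffs[i] > diffs[i + 1]
                and diffs[i] > self.threshold]
```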
For detailed descriptions of the working principle, the implemented technical effect, and the like of the embodiment of the audio starting point detection apparatus, reference may be made to the related descriptions in the foregoing embodiment of the audio starting point detection method, and further description is omitted here.
Example four
Referring now to FIG. 5, shown is a schematic diagram of an electronic device suitable for use in implementing embodiments of the present disclosure. The electronic devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., car navigation terminals), and the like, and fixed terminals such as digital TVs, desktop computers, and the like. The electronic device shown in FIG. 5 is only an example, and should not impose any limitation on the functions and the scope of use of the embodiments of the present disclosure.
As shown in FIG. 5, the electronic device may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 501 that may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 502 or a program loaded from a storage device 508 into a random access memory (RAM) 503. The RAM 503 also stores various programs and data necessary for the operation of the electronic device. The processing device 501, the ROM 502, and the RAM 503 are connected to each other through a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.
Generally, the following devices may be connected to the I/O interface 505: input devices 506 including, for example, a touch screen, touch pad, keyboard, mouse, image sensor, microphone, accelerometer, gyroscope, etc.; output devices 507 including, for example, a Liquid Crystal Display (LCD), speakers, vibrators, and the like; storage devices 508 including, for example, magnetic tape, hard disk, etc.; and a communication device 509. The communication means 509 may allow the electronic device to communicate with other devices wirelessly or by wire to exchange data. While fig. 5 illustrates an electronic device having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication device 509, or installed from the storage device 508, or installed from the ROM 502. The computer program, when executed by the processing device 501, performs the above-described functions defined in the methods of the embodiments of the present disclosure.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: determining voice frequency spectrum parameters corresponding to each frequency band according to the frequency domain signals corresponding to the audio signals; determining the mean value of the voice spectrum parameters according to the voice spectrum parameters of the current frequency band and the voice spectrum parameters of a preset number of frequency bands selected from the rest frequency bands, and taking the mean value as the mean value of the current frequency band; and determining the position of the starting point according to the voice spectrum parameters and the mean value corresponding to each frequency band.
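A minimal sketch of the three program steps above, per-band spectrum parameters, a running mean over the preceding bands, and a detection value derived from their difference, is given below. The positive-difference (spectral-flux-style) reduction across frequencies is an assumption, since the disclosure does not specify how the per-frequency difference values are combined into one value per band:

```python
import numpy as np

def onset_curve(spectra, n_prev=3):
    """For each frame/band (row of `spectra`), average its spectrum
    parameters with those of up to n_prev preceding frames, then sum
    the positive per-frequency differences between the frame and that
    mean. n_prev (the "preset number") and the positive-difference
    reduction are assumptions."""
    spectra = np.asarray(spectra, dtype=float)
    curve = np.zeros(len(spectra))
    for i in range(len(spectra)):
        lo = max(0, i - n_prev)
        mean = spectra[lo:i + 1].mean(axis=0)  # mean of current frame's band
        curve[i] = np.maximum(spectra[i] - mean, 0.0).sum()
    return curve
```

Peaks of the resulting curve would then correspond to candidate starting point positions.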
Computer program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. The name of a unit does not in some cases constitute a limitation of the unit itself; for example, the starting point determining module may also be described as a "module for determining one or more starting point positions of notes and syllables in the audio".
The foregoing description is merely a description of the preferred embodiments of the present disclosure and of the technical principles employed. Those skilled in the art will appreciate that the scope of the disclosure is not limited to technical solutions formed by the particular combination of the features described above, and also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the spirit of the disclosure, for example, technical solutions formed by replacing the above features with (but not limited to) features having similar functions disclosed in the present disclosure.

Claims (9)

1. An audio starting point detection method, comprising:
determining voice spectrum parameters corresponding to each frequency band according to frequency domain signals corresponding to audio signals of the audio;
determining the mean value of the voice spectrum parameters according to the voice spectrum parameters of the current frequency band and the voice spectrum parameters of a preset number of frequency bands selected from the rest frequency bands aiming at each frequency band, and taking the mean value as the mean value of the current frequency band, wherein the rest frequency bands are all frequency bands positioned in front of the current frequency band according to time sequence;
and determining one or more starting point positions of notes and syllables in the audio according to the voice frequency spectrum parameters and the mean value corresponding to each frequency band.
2. The audio starting point detection method according to claim 1, wherein the determining one or more starting point positions of notes and syllables in the audio according to the voice spectrum parameters and the mean value corresponding to each frequency band comprises:
calculating the difference value between the voice spectrum parameter of each frequency and the mean value;
and determining one or more starting point positions of the notes and the syllables in the audio according to the difference values of the frequencies.
3. The audio starting point detection method according to claim 2, wherein the determining one or more starting point positions of the notes and the syllables in the audio according to the difference values of the frequencies comprises:
drawing a voice spectrum parameter curve according to the difference value of each frequency;
and determining a local highest point according to the voice spectrum parameter curve, and determining one or more starting point positions of notes and syllables in the audio according to the difference value corresponding to the local highest point.
4. The audio starting point detection method according to any one of claims 1 to 3, wherein the determining the voice spectrum parameters corresponding to each frequency band according to the frequency domain signal corresponding to the audio signal of the audio comprises:
segmenting the audio signal of the audio into a plurality of sub audio signals, and converting each sub audio signal into a frequency domain signal, wherein each sub audio signal corresponds to one frequency band;
and determining the voice spectrum parameters corresponding to each frequency band.
5. An audio starting point detecting apparatus, comprising:
the parameter determining module is used for determining the voice spectrum parameters corresponding to each frequency band according to the frequency domain signals corresponding to the audio signals of the audio;
the mean value determining module is used for determining the mean value of the voice spectrum parameters according to the voice spectrum parameters of the current frequency band and the voice spectrum parameters of a preset number of frequency bands selected from the rest frequency bands aiming at each frequency band, and taking the mean value as the mean value of the current frequency band, wherein the rest frequency bands are all frequency bands positioned before the current frequency band according to time sequence;
and the starting point determining module is used for determining one or more starting point positions of notes and syllables in the audio according to the voice spectrum parameters and the mean value corresponding to each frequency band.
6. The audio starting point detecting device according to claim 5, wherein the starting point determining module comprises:
the difference value calculating unit is used for calculating the difference value between the voice spectrum parameter of each frequency and the mean value;
and the starting point determining unit is used for determining one or more starting point positions of the notes and the syllables in the audio according to the difference values of the frequencies.
7. The audio starting point detecting device according to claim 6, wherein the starting point determining unit is specifically configured to: draw a voice spectrum parameter curve according to the difference value of each frequency; and determine a local highest point according to the voice spectrum parameter curve, and determine one or more starting point positions of notes and syllables in the audio according to the difference value corresponding to the local highest point.
8. An electronic device, comprising:
a memory for storing non-transitory computer readable instructions; and
a processor for executing the computer readable instructions, such that the processor, when executing, implements the audio starting point detection method according to any one of claims 1-4.
9. A computer-readable storage medium storing non-transitory computer-readable instructions which, when executed by a computer, cause the computer to perform the audio starting point detection method according to any one of claims 1-4.
CN201910151671.0A 2019-02-28 2019-02-28 Audio starting point detection method and device Active CN110085214B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910151671.0A CN110085214B (en) 2019-02-28 2019-02-28 Audio starting point detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910151671.0A CN110085214B (en) 2019-02-28 2019-02-28 Audio starting point detection method and device

Publications (2)

Publication Number Publication Date
CN110085214A true CN110085214A (en) 2019-08-02
CN110085214B CN110085214B (en) 2021-07-20

Family

ID=67413135

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910151671.0A Active CN110085214B (en) 2019-02-28 2019-02-28 Audio starting point detection method and device

Country Status (1)

Country Link
CN (1) CN110085214B (en)


Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6240381B1 (en) * 1998-02-17 2001-05-29 Fonix Corporation Apparatus and methods for detecting onset of a signal
CN1773605A (en) * 2004-11-12 2006-05-17 中国科学院声学研究所 Sound end detecting method for sound identifying system
US20120051549A1 (en) * 2009-01-30 2012-03-01 Frederik Nagel Apparatus, method and computer program for manipulating an audio signal comprising a transient event
CN103632682A (en) * 2013-11-20 2014-03-12 安徽科大讯飞信息科技股份有限公司 Audio feature detection method
CN103886871A (en) * 2014-01-28 2014-06-25 华为技术有限公司 Detection method of speech endpoint and device thereof
US9177559B2 (en) * 2012-04-24 2015-11-03 Tom Stephenson Method and apparatus for analyzing animal vocalizations, extracting identification characteristics, and using databases of these characteristics for identifying the species of vocalizing animals
CN106340292A (en) * 2016-09-08 2017-01-18 河海大学 Voice enhancement method based on continuous noise estimation
CN108766418A (en) * 2018-05-24 2018-11-06 百度在线网络技术(北京)有限公司 Sound end recognition methods, device and equipment
CN109256146A (en) * 2018-10-30 2019-01-22 腾讯音乐娱乐科技(深圳)有限公司 Audio-frequency detection, device and storage medium


Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
E. T. McAdams: "The detection of the onset of electrode-electrolyte interface impedance nonlinearity: A theoretical study", IEEE Transactions on Biomedical Engineering *
J. P. Bello et al.: "A tutorial on onset detection in music signals", IEEE Transactions on Speech and Audio Processing *
K. P. Bello et al.: "On the use of phase and energy for musical onset detection in the complex domain", IEEE Signal Processing Letters *
Zhang Qiuyue: "Detection of musical note onsets based on phase features", China Master's Theses Full-text Database (Information Science and Technology) *
Gui Wenming: "Research on note onset detection algorithms", China Doctoral Dissertations Full-text Database (Information Science and Technology) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020173488A1 (en) * 2019-02-28 2020-09-03 北京字节跳动网络技术有限公司 Audio starting point detection method and apparatus
US12119023B2 (en) 2019-02-28 2024-10-15 Beijing Bytedance Network Technology Co., Ltd. Audio onset detection method and apparatus
CN112309432A (en) * 2020-10-27 2021-02-02 暨南大学 Note starting point detection method based on data driving

Also Published As

Publication number Publication date
CN110085214B (en) 2021-07-20

Similar Documents

Publication Publication Date Title
CN107731223B (en) Voice activity detection method, related device and equipment
CN110070884B (en) Audio starting point detection method and device
CN109670074B (en) Rhythm point identification method and device, electronic equipment and storage medium
CN110265064B (en) Audio frequency crackle detection method, device and storage medium
CN110070885B (en) Audio starting point detection method and device
CN111309962B (en) Method and device for extracting audio clips and electronic equipment
CN111739544B (en) Voice processing method, device, electronic equipment and storage medium
CN112153460A (en) Video dubbing method and device, electronic equipment and storage medium
CN112562633B (en) Singing synthesis method and device, electronic equipment and storage medium
CN110085214B (en) Audio starting point detection method and device
CN111540344B (en) Acoustic network model training method and device and electronic equipment
CN108597527B (en) Multi-channel audio processing method, device, computer-readable storage medium and terminal
CN111429942B (en) Audio data processing method and device, electronic equipment and storage medium
CN107274892A (en) Method for distinguishing speek person and device
CN113496706B (en) Audio processing method, device, electronic equipment and storage medium
CN107622775B (en) Method for splicing songs containing noise and related products
CN113674752B (en) Noise reduction method and device for audio signal, readable medium and electronic equipment
CN113035223A (en) Audio processing method, device, equipment and storage medium
CN111489739A (en) Phoneme recognition method and device and computer readable storage medium
CN115273822A (en) Audio processing method, device, electronic equipment and medium
CN109378012B (en) Noise reduction method and system for recording audio by single-channel voice equipment
CN115985333A (en) Audio signal alignment method and device, storage medium and electronic equipment
CN109495786B (en) Pre-configuration method and device of video processing parameter information and electronic equipment
CN113497970A (en) Video processing method and device, electronic equipment and storage medium
US20240282329A1 (en) Method and apparatus for separating audio signal, device, storage medium, and program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: 100041 B-0035, 2 floor, 3 building, 30 Shixing street, Shijingshan District, Beijing.

Patentee after: Tiktok vision (Beijing) Co.,Ltd.

Address before: 100041 B-0035, 2 floor, 3 building, 30 Shixing street, Shijingshan District, Beijing.

Patentee before: BEIJING BYTEDANCE NETWORK TECHNOLOGY Co.,Ltd.

Address after: 100041 B-0035, 2 floor, 3 building, 30 Shixing street, Shijingshan District, Beijing.

Patentee after: Douyin Vision Co.,Ltd.

Address before: 100041 B-0035, 2 floor, 3 building, 30 Shixing street, Shijingshan District, Beijing.

Patentee before: Tiktok vision (Beijing) Co.,Ltd.