CN114979798B

CN114979798B - Playing speed control method and electronic equipment

Info

Publication number: CN114979798B
Application number: CN202210425500.4A
Authority: CN
Inventors: 程戈
Original assignee: Vivo Mobile Communication Co Ltd
Current assignee: Vivo Mobile Communication Co Ltd
Priority date: 2022-04-21
Filing date: 2022-04-21
Publication date: 2024-03-22
Anticipated expiration: 2042-04-21
Also published as: WO2023202522A1; CN114979798A

Abstract

The application discloses a play speed control method and electronic equipment, and belongs to the technical field of audio. The specific scheme comprises the following steps: acquiring a target streaming media; dividing the voice frame of the target streaming media subjected to short-time framing according to a preset frame length to obtain a plurality of long-time frames; respectively determining a long-time spectrum energy difference characteristic value of each long-time frame, and determining a first stream media fragment and a second stream media fragment according to the long-time spectrum energy difference characteristic value; outputting the first stream media fragments in the target stream media according to the first speed, and outputting the second stream media fragments in the target stream media according to the second speed; the first stream media segment is stream media data containing voice information, the second stream media segment is stream media data not containing voice information, and the first speed is smaller than the second speed.

Description

Playing speed control method and electronic equipment

Technical Field

The application belongs to the technical field of audio, and particularly relates to a play speed control method and electronic equipment.

Background

With the development of short video platforms, video content is becoming increasingly rich and varied. High-quality video content is the foundation of a short video platform, but many short video creators produce videos with padded content and a slow pace, so that users of the platform easily become bored and switch to another video or close the platform.

In the related art, when a user of a short video platform finds the video clip currently being played tedious, the user can speed up playback by dragging the playing progress bar in the video interface. However, with this approach it is difficult for the user to judge how far to drag, and the user usually needs to drag the progress bar repeatedly to locate the content to be viewed. As a result, the user easily misses key content, and the complexity of the user's operation is increased.

Disclosure of Invention

The embodiment of the application aims to provide a play speed control method and electronic equipment, which can solve the problem that the way of controlling play speed in the related art not only easily causes a user to miss key content but also increases the complexity of the user's operation.

In a first aspect, an embodiment of the present application provides a play speed control method, where the method includes: acquiring a target streaming media; dividing the voice frame of the target streaming media subjected to short-time framing according to a preset frame length to obtain a plurality of long-time frames; respectively determining a long-time spectrum energy difference characteristic value of each long-time frame, and determining a first stream media fragment and a second stream media fragment according to the long-time spectrum energy difference characteristic value; outputting the first stream media fragments in the target stream media according to the first speed, and outputting the second stream media fragments in the target stream media according to the second speed; the first stream media segment is stream media data containing voice information, the second stream media segment is stream media data not containing voice information, and the first speed is smaller than the second speed.

In a second aspect, an embodiment of the present application provides a play speed control device, including: an acquisition module, a processing module and an output module; the acquisition module is used for acquiring the target streaming media; the processing module is used for dividing the voice frame of the target streaming media subjected to short-time framing processing according to a preset frame length to obtain a plurality of long-time frames; the processing module is further used for respectively determining long-time spectrum energy difference characteristic values of each long-time frame and determining a first stream media fragment and a second stream media fragment according to the long-time spectrum energy difference characteristic values; the output module is used for outputting the first streaming media fragments in the target streaming media according to the first speed and outputting the second streaming media fragments in the target streaming media according to the second speed; the first stream media segment is stream media data containing voice information, the second stream media segment is stream media data not containing voice information, and the first speed is smaller than the second speed.

In a third aspect, embodiments of the present application provide an electronic device comprising a processor and a memory storing a program or instructions executable on the processor, which when executed by the processor, implement the steps of the method as described in the first aspect.

In a fourth aspect, embodiments of the present application provide a readable storage medium having stored thereon a program or instructions which when executed by a processor implement the steps of the method according to the first aspect.

In a fifth aspect, embodiments of the present application provide a chip, where the chip includes a processor and a communication interface, where the communication interface is coupled to the processor, and where the processor is configured to execute a program or instructions to implement a method according to the first aspect.

In a sixth aspect, embodiments of the present application provide a computer program product stored in a storage medium, the program product being executable by at least one processor to implement the method according to the first aspect.

In the embodiment of the application, the target streaming media can be acquired; dividing the voice frame of the target streaming media subjected to short-time framing according to a preset frame length to obtain a plurality of long-time frames; respectively determining a long-time spectrum energy difference characteristic value of each long-time frame, and determining a first stream media fragment and a second stream media fragment according to the long-time spectrum energy difference characteristic value; outputting the first stream media fragments in the target stream media according to the first speed, and outputting the second stream media fragments in the target stream media according to the second speed; the first stream media segment is stream media data containing voice information, the second stream media segment is stream media data not containing voice information, and the first speed is smaller than the second speed. According to the scheme, the voice frame of the target streaming media subjected to short-time framing processing can be divided into a plurality of long-time frames, the first streaming media fragment and the second streaming media fragment in the target streaming media are determined by analyzing the long-time spectrum energy difference characteristic values of the long-time frames, the first streaming media fragment is output at a first speed, and the second streaming media fragment is output at a second speed. On the one hand, because long-time features are smoother and more stable than short-time features, analyzing the long-time spectrum energy difference characteristic values of the long-time frames can improve the accuracy of the analysis result. On the other hand, the first speed is smaller than the second speed, that is, the playing speed of the second streaming media fragment is larger than the playing speed of the first streaming media fragment, so that the time wasted in the playing process of the streaming media fragment which does not contain voice information can be reduced, and the user is prevented from missing key contents in the second streaming media fragment. In addition, in the outputting process of the target streaming media, the user does not need to input any information, so that the complexity of the user operation is reduced.

Drawings

Fig. 1 is a flow chart of a play speed control method provided in an embodiment of the present application;

Fig. 2 is a first interface schematic diagram of a play speed control method provided in an embodiment of the present application;

Fig. 3 is a second interface schematic diagram of a play speed control method provided in an embodiment of the present application;

Fig. 4 is a third interface schematic diagram of a play speed control method provided in an embodiment of the present application;

Fig. 5 is a schematic structural diagram of a play speed control device provided in an embodiment of the present application;

Fig. 6 is a schematic structural diagram of an electronic device provided in an embodiment of the present application;

Fig. 7 is a schematic hardware diagram of an electronic device provided in an embodiment of the present application.

Detailed Description

Technical solutions in the embodiments of the present application will be clearly described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application are within the scope of the protection of the present application.

The terms first, second and the like in the description and in the claims, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged, as appropriate, such that embodiments of the present application may be implemented in sequences other than those illustrated or described herein, and that the objects identified by "first," "second," etc. are generally of a type and not limited to the number of objects, e.g., the first object may be one or more. Furthermore, in the description and claims, "and/or" means at least one of the connected objects, and the character "/", generally means that the associated object is an "or" relationship.

The following describes in detail the play speed control method provided in the embodiment of the present application through specific embodiments and application scenarios thereof with reference to the accompanying drawings.

The execution main body of the play speed control method provided in the embodiment of the present application may be an electronic device or a functional module or a functional entity capable of implementing the play speed control method in the electronic device, where the electronic device mentioned in the embodiment of the present application includes, but is not limited to, a mobile phone, a tablet computer, a camera, a wearable device, etc., and the play speed control method provided in the embodiment of the present application is described below by taking the electronic device as an execution main body as an example.

As shown in fig. 1, the embodiment of the present application provides a play speed control method, which may include steps 101 to 104:

Step 101, obtaining target streaming media.

Optionally, the target streaming media is streaming media including audio data, for example, a voice message, a recording, music, an audiobook, and the like; it may also be a video with audio.

For example, in the case where the target streaming media is a voice message, the electronic device acquiring the target streaming media may include: in the case of displaying a chat interface with other contacts, the user may make a click input on a target voice message received by the electronic device, and the electronic device may obtain the target voice message in response to the click input. In the case that the target streaming media is an audio video, the electronic device obtaining the target streaming media may include: if the user wants to watch the target video, a click input can be performed on the target video, and the electronic device can respond to the click input to display a video playing interface of the target video and acquire audio data in the target video.

Optionally, before the target streaming media is acquired, the electronic device may receive a first input of the user while displaying the play speed setting interface; in response to the first input, a first speed and a second speed are determined. The first speed is less than the second speed.

Specifically, a piece of audio may include a first streaming media segment and a second streaming media segment, where the first streaming media segment is streaming media data containing voice information, and the second streaming media segment is streaming media data not containing voice information, and for example, the second streaming media segment may be a noise segment or a blank sound segment. In the process of playing audio, the user wants to acquire the voice information in the first streaming media segment, and the second streaming media segment can be skipped, so that the user can set the playing speed of the first streaming media segment, namely the first speed, and the playing speed of the second streaming media segment, namely the second speed, respectively.

For example, as shown in fig. 2, if the user wants to adjust the audio playing speed of the electronic device, the electronic device may be triggered to display a playing speed setting interface, where the playing speed setting interface may include two playing speed setting options, namely, "first streaming media segment playing speed adjustment" and "second streaming media segment playing speed adjustment", respectively, where the "first streaming media segment playing speed adjustment" corresponds to the switch 111 and the "second streaming media segment playing speed adjustment" corresponds to the switch 112. If the user wants to adjust the playing speed of the first streaming media segment, a click input may be performed on the switch 111, and the electronic device may control the switch 111 to be in an on state in response to the click input, and cancel displaying the playing speed setting interface, and display the playing speed adjusting interface of the first streaming media segment. If the user wants to adjust the playing speed of the second streaming media segment, a click input may be performed on the switch 112, and the electronic device may control the switch 112 to be in an on state in response to the click input, and cancel displaying the playing speed setting interface, and display the playing speed adjusting interface of the second streaming media segment.

As shown in fig. 3, the first streaming media segment playing speed adjustment interface includes a playing interface 121 of a pre-stored first video, an acceleration adjustment control 122, a deceleration adjustment control 123, and a confirmation control 124. In the case that the first video is in a playing state, the user may adjust the playing speed of the first video by clicking the acceleration adjustment control 122 or the deceleration adjustment control 123. When the speech rate has been adjusted to one that is comfortable to the ear, the user may make a click input on the confirmation control 124, and the electronic device may determine the adjusted final playing speed as the first speed in response to the click input.

As shown in fig. 4, the second streaming media segment playing speed adjustment interface includes a plurality of multiple speed adjustment controls and a confirmation control 131, where a user may perform click input on any one of the multiple speed adjustment controls according to his own demand for playing speed, the electronic device may respond to the click input, highlight the control that the user clicks input, and then the user may perform click input on the confirmation control 131, and the electronic device may respond to the click input, and determine the second speed according to the multiple speed value corresponding to the highlighted control.

It should be noted that the first input may include a plurality of sub-inputs, for example, may include an input triggering the electronic device to display the first streaming media segment playing speed adjustment interface or the second streaming media segment playing speed adjustment interface, and may further include a touch input to a control in the first streaming media segment playing speed adjustment interface or the second streaming media segment playing speed adjustment interface.

Based on the scheme, the first speed and the second speed can be determined according to the first input, so that a user can carry out self-defined adjustment on the two playing speeds according to the self-demand, and the playing demands of different users on the diversified speeds are met.

Step 102, dividing the voice frame of the target streaming media subjected to short-time framing according to a preset frame length to obtain a plurality of long-time frames.

Optionally, the electronic device may detect whether a human voice signal exists in the target streaming media through a voice activity detection (VAD) algorithm. However, the VAD algorithm in the related art may misjudge part of the noise in a non-speech segment as a voice signal. Because long-time features are smoother and more stable than short-time features, the electronic device may use a long-time window to re-divide the voice frames of the target streaming media that have undergone short-time framing processing, and analyze the voice features after the re-division.

Illustratively, if the target streaming media includes 100 short-time frames and the length of the long-time window is 10 frames, long-time frames 1-91 can be obtained after the 100 short-time frames are re-divided, where long-time frame 1 includes short-time frames 1-10, long-time frame 2 includes short-time frames 2-11, and so on.
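The re-division in this example can be sketched as a sliding window over the short-time frame indices. The following is a minimal sketch, not the applicant's implementation; the function name `long_time_frames` and the `hop` parameter are hypothetical, and the hop of one short-time frame is an assumption inferred from the example (frames 1-10, then 2-11).

```python
def long_time_frames(num_short_frames, window_len, hop=1):
    """Return (first, last) 1-based inclusive short-frame indices of each long-time frame.

    hop=1 is an assumption taken from the example in the text, where
    consecutive long-time frames are shifted by one short-time frame.
    """
    if window_len <= 0 or hop <= 0:
        raise ValueError("window_len and hop must be positive")
    frames = []
    start = 1
    # Keep sliding while the whole window still fits inside the short-frame sequence.
    while start + window_len - 1 <= num_short_frames:
        frames.append((start, start + window_len - 1))
        start += hop
    return frames


frames = long_time_frames(100, 10)
print(len(frames))                       # 91
print(frames[0], frames[1], frames[-1])  # (1, 10) (2, 11) (91, 100)
```

With 100 short-time frames and a 10-frame window, the last window that fits starts at frame 91, which gives the 91 long-time frames stated above.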

Step 103, determining a long-time spectrum energy difference characteristic value of each long-time frame respectively, and determining a first stream media fragment and a second stream media fragment according to the long-time spectrum energy difference characteristic value.

Optionally, after the speech frame is re-segmented, the electronic device may determine a long-term spectral energy difference feature value of each long-term frame in the target streaming media, respectively.

Specifically, the electronic device may determine the long-time spectrum energy difference characteristic value (LTD) of the l-th long-time frame based on the N-order long-time spectral envelope (LTE) of the l-th frame, where x(n) represents a speech segment containing noise, X(k, l+j) represents the amplitude spectrum of the speech at frequency k in frame l+j, and N(k) represents the amplitude spectrum of the background noise at frequency k; NFFT represents the number of sample points in the fast Fourier transform (FFT).

According to the formula, when calculating the N-order long-time spectral envelope, the electronic device applies the long-time principle by adding a long-time window with a length of 2N+1 frames on top of the short-time amplitude spectrum for analysis. The long-time window enlarges the difference between the LTE and the noise amplitude spectrum, so that the voice and the noise in the target streaming media can be detected more accurately.
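The expressions for the envelope and the characteristic value are not reproduced in this text, so the sketch below rests on an assumption: it uses the commonly published long-term spectral divergence form, in which the envelope is the per-frequency maximum of X(k, l+j) over j = -N..N, and the characteristic value is 10·log10 of the mean over the NFFT frequency points of LTE²(k, l) / N²(k). Whether the application uses exactly this form is not confirmed by the source. The function names are hypothetical, and clipping the window at the ends of the signal is also an assumption.

```python
import math


def long_term_envelope(spectra, l, order):
    """Per-frequency maximum of the amplitude spectra over frames l-order .. l+order.

    spectra[j][k] is the amplitude of frame j at frequency point k.
    The window is clipped at the signal boundaries (an assumption).
    """
    lo = max(0, l - order)
    hi = min(len(spectra) - 1, l + order)
    num_bins = len(spectra[0])
    return [max(spectra[j][k] for j in range(lo, hi + 1)) for k in range(num_bins)]


def long_term_divergence(spectra, noise, l, order):
    """Assumed form: 10*log10( mean_k( LTE(k, l)^2 / N(k)^2 ) ), in dB.

    noise[k] must be strictly positive and the envelope must not be all zero.
    """
    lte = long_term_envelope(spectra, l, order)
    ratio = sum((e * e) / (n * n) for e, n in zip(lte, noise)) / len(noise)
    return 10.0 * math.log10(ratio)


# Toy data: flat noise-like frames with one stronger frame in the middle.
noise = [1.0, 1.0, 1.0, 1.0]
spectra = [
    [1.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
    [1.0, 10.0, 10.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
]
for l in range(len(spectra)):
    print(l, round(long_term_divergence(spectra, noise, l, order=1), 2))
```

Because the window spans 2N+1 frames, the strong frame raises the value for its neighbours as well, which illustrates the smoothing effect of the long-time window. How the value is then compared against the first threshold follows the rule stated in the text and is not part of this sketch.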

Optionally, the determining, by the electronic device, the first streaming media segment and the second streaming media segment according to the long-term spectrum energy difference feature value may specifically include: determining the target long-time frame as the first streaming media fragment under the condition that the long-time spectrum energy difference characteristic value of the target long-time frame is smaller than a first threshold value; determining the target long-time frame as the second streaming media fragment under the condition that the long-time spectrum energy difference characteristic value of the target long-time frame is larger than the first threshold value; wherein the target long time frame is any one of the plurality of long time frames, and the first threshold is related to a noise estimation value and a signal-to-noise ratio.

Optionally, the first threshold is determined from $E_h(k)$ and SNR, where $E_h(k)$ represents the noise estimate obtained at the current time, i.e., the latest noise estimate, and SNR represents the signal-to-noise ratio.

Based on the above scheme, since whether the target long-time frame is the first stream media segment or the second stream media segment can be determined according to the long-time spectrum energy difference characteristic value of the target long-time frame, the first stream media segment and the second stream media segment in the target stream media can be determined based on the long-time spectrum energy difference characteristic value, thereby providing a basis for playing different stream media segments according to different speeds.

Optionally, the determining, by the electronic device, the first streaming media segment and the second streaming media segment according to the long-term spectrum energy difference feature value may specifically include: determining the number of frames containing fundamental tones in a target long-time frame under the condition that the long-time spectrum energy difference characteristic value of the target long-time frame is smaller than a first threshold value; determining the target long-term frame as the first streaming media segment in the case that the ratio of the number of frames containing the fundamental tone to the total number of frames of the target long-term frame is greater than a second threshold; determining the target long-term frame as the second streaming media segment in the case that the ratio of the number of frames containing the fundamental tone to the total number of frames of the target long-term frame is smaller than the second threshold; wherein the target long time frame is any one of the plurality of long time frames.

Specifically, burst noise such as tapping on a keyboard or bumping the microphone has LTD characteristics similar to those of speech and therefore often causes misjudgment, so a pitch proportion feature may be introduced to assist the judgment. The pitch frequency is the vibration frequency of the vocal cords in speech. By performing pitch detection on the target long-time frame, the electronic device can determine the number of frames containing pitch, $M_{pitch}$. The electronic device may then further determine the ratio $\theta_{pitch}$ of $M_{pitch}$ to the total number of frames $M$ of the target long-time frame. In the case that $\theta_{pitch}$ is greater than a second threshold $\theta_v$, the electronic device may determine the target long-time frame as the first streaming media segment; in the case that $\theta_{pitch}$ is less than the second threshold $\theta_v$, the electronic device may determine the target long-time frame as the second streaming media segment.

Based on the above scheme, since whether the target long-time frame is the first stream media segment or the second stream media segment can be judged according to the ratio of the frame number of the fundamental tone contained in the target long-time frame to the total frame number of the target long-time frame, on one hand, the first stream media segment and the second stream media segment in the target stream media can be determined based on the frame number of the fundamental tone, thereby providing a basis for playing different stream media segments according to different speeds; on the other hand, the accuracy of the judgment result can be improved.
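The two-stage decision described above (first threshold on the characteristic value, then the pitch proportion) can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the string labels and the per-frame boolean `pitch_flags` input are hypothetical, pitch detection itself is outside the sketch, and treating a ratio exactly equal to the second threshold as non-speech is an assumption because the text does not cover that case.

```python
def classify_long_frame(ltd_value, first_threshold, pitch_flags, second_threshold):
    """Return "first" (contains voice) or "second" (no voice) for one long-time frame.

    pitch_flags holds one boolean per short-time frame inside the long-time
    frame, True when a pitch detector (not shown) found a fundamental tone.
    """
    # Stage 1: only frames whose characteristic value is below the first
    # threshold are speech candidates, as stated in the text.
    if not ltd_value < first_threshold:
        return "second"
    if not pitch_flags:
        return "second"
    # Stage 2: the pitch proportion M_pitch / M filters out burst noise.
    ratio = sum(1 for flag in pitch_flags if flag) / len(pitch_flags)
    # A ratio equal to the threshold is treated as non-speech (assumption).
    return "first" if ratio > second_threshold else "second"


voiced = [True] * 7 + [False] * 3   # 7 of 10 frames contain pitch
burst = [True] * 1 + [False] * 9    # e.g. a keyboard tap: 1 of 10 frames
print(classify_long_frame(2.0, 5.0, voiced, 0.5))  # first
print(classify_long_frame(2.0, 5.0, burst, 0.5))   # second
print(classify_long_frame(8.0, 5.0, voiced, 0.5))  # second
```

The threshold values 5.0 and 0.5 are placeholders chosen only to exercise each branch.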

Optionally, the determining, by the electronic device, the first streaming media segment and the second streaming media segment according to the long-term spectrum energy difference feature value may specifically include: under the condition that the long-time spectrum energy difference characteristic value of the target long-time frame is smaller than a first threshold value, converting the time domain voice signal of the target long-time frame into a frequency domain energy signal to obtain a target streaming media fragment; determining average energy values and maximum energy values of a plurality of frequency domain energy sampling points in the target streaming media segment; determining the target streaming media segment as the first streaming media segment under the condition that the difference between the maximum energy value and the average energy value is in a first preset range; determining the target streaming media segment as the second streaming media segment in the case that the difference between the maximum energy value and the average energy value is not within the first preset range; wherein the target long time frame is any one of the plurality of long time frames.

Noise comes in many varieties, of which the two most common are low-frequency noise and background noise. The signal energy of low-frequency noise is concentrated in the low-frequency part (0-1000 Hz), with almost none in the middle- and high-frequency parts. The signal energy of background noise is spread across the whole frequency domain, with energy at every frequency point and a relatively uniform distribution. The spectral characteristics of a speech signal differ from those of both noise types: signal energy is usually present across the whole spectrum, but its distribution over the frequency points is irregular. For low-frequency noise, since the signal energy is basically concentrated in the low-frequency band, when frequency points are uniformly selected over the whole frequency band at the same sampling point and the average energy and the maximum energy are calculated, the difference between the two is usually large. For background noise, the distribution over the whole frequency band is relatively even, so the signal energy differs little between frequency points; when the average energy and the maximum energy are calculated over uniformly selected frequency points in the whole frequency band, the difference between the two is usually small. For a speech signal, since its energy follows neither of these patterns, the difference between the average energy and the maximum energy typically lies within a range that is larger than the difference for background noise and smaller than the difference for low-frequency noise. Thus, using the spectral energy characteristics of low-frequency noise, background noise, and the speech signal, it is possible to determine whether a streaming media segment is speech or noise.

Specifically, the electronic device may use an audio tool to convert the time-domain speech signal of a target long-time frame whose long-time spectrum energy difference characteristic value is smaller than the first threshold into a frequency-domain energy signal, obtaining the target streaming media segment. After the conversion, the electronic device may display the energy of the different frequency points, where a darker color represents a larger frequency-point energy and a lighter color represents a smaller frequency-point energy. Thereafter, frequency-domain energy sampling points may be selected from a target streaming media segment with a preset time length and a preset frequency-domain range. For example, with a time length of $t_0$ and a preset frequency-domain range of $(0, f_0)$, 40 frequency-domain energy sampling points are selected from the target streaming media segment: in the time domain, taking $0.1t_0$ as the starting time point, 5 points are uniformly selected, $t = (0.1t_0, 0.3t_0, 0.5t_0, 0.7t_0, 0.9t_0)$; in the frequency domain, starting from an initial frequency, 8 frequency points are uniformly selected. Then, for each time-domain sampling point $t$, the average energy value $E_t^{mean}$ and the maximum energy value $E_t^{max}$ of the corresponding 8 frequency points are calculated.

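After the average energy value $E_t^{mean}$ and the maximum energy value $E_t^{max}$ at each of the 5 time-domain sampling points are obtained, the 5 average energy values and the 5 maximum energy values are each averaged again to obtain the average energy value $E^{mean}$ and the maximum energy value $E^{max}$ of all frequency-domain energy sampling points:

Average energy value: $E^{mean} = \frac{1}{5}\sum_{t} E_t^{mean}$

Maximum energy value: $E^{max} = \frac{1}{5}\sum_{t} E_t^{max}$

where each sum runs over the 5 time-domain sampling points.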

The electronic device may then calculate the difference between the maximum energy value $E^{max}$ and the average energy value $E^{mean}$, and judge whether the difference is within a first preset range (a, b):

if $E^{max} - E^{mean} \ge b$, the electronic device may determine the target streaming media segment as a low-frequency noise signal; that is, in the case that the difference between the maximum energy value and the average energy value is not within the first preset range, the electronic device may determine the target streaming media segment as the second streaming media segment;

if $E^{max} - E^{mean} \le a$, the electronic device may determine the target streaming media segment as a background noise signal; that is, in the case that the difference between the maximum energy value and the average energy value is not within the first preset range, the electronic device may determine the target streaming media segment as the second streaming media segment;

if $a < E^{max} - E^{mean} < b$, the electronic device may determine the target streaming media segment as a speech signal; that is, in the case that the difference between the maximum energy value and the average energy value is within the first preset range, the electronic device may determine the target streaming media segment as the first streaming media segment.
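The frequency-domain check above can be sketched on a grid of sampled energies (5 time points by 8 frequency points in the example). This is a minimal sketch under stated assumptions: the function name, the string labels, the toy energy values and the bounds a = 1 and b = 20 are all hypothetical, and how the energies are obtained from the audio is outside the sketch.

```python
def classify_by_energy_spread(energy_grid, a, b):
    """Classify a segment from energy_grid[i][j], the energy at the i-th
    time-domain sampling point and the j-th frequency point.

    Returns (label, difference), where label is "first" (speech) when the
    difference between the averaged maximum and the averaged mean lies
    inside the open range (a, b), and "second" (noise) otherwise.
    """
    # Per time point: mean and maximum over the sampled frequency points.
    per_time_mean = [sum(row) / len(row) for row in energy_grid]
    per_time_max = [max(row) for row in energy_grid]
    # Average the per-time values again over all time points.
    e_mean = sum(per_time_mean) / len(per_time_mean)
    e_max = sum(per_time_max) / len(per_time_max)
    diff = e_max - e_mean
    return ("first" if a < diff < b else "second"), diff


# Toy grids, 5 time points x 8 frequency points each (made-up values).
uniform_noise = [[1.0] * 8 for _ in range(5)]          # energy spread evenly
low_band_noise = [[40.0] + [0.0] * 7 for _ in range(5)]  # energy in one low bin
speech_like = [[8.0, 2.0, 6.0, 1.0, 4.0, 3.0, 5.0, 3.0] for _ in range(5)]

for name, grid in [("uniform", uniform_noise),
                   ("low-band", low_band_noise),
                   ("speech-like", speech_like)]:
    print(name, classify_by_energy_spread(grid, a=1.0, b=20.0))
```

In this toy data the evenly spread grid gives a difference of 0, the low-band grid gives 35, and the irregular grid gives 4, so only the last one falls inside the assumed range.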

Based on the above scheme, the time domain voice signal of the target long time frame can be converted into the frequency domain energy signal under the condition that the characteristic value of the long time spectrum energy difference of the target long time frame is smaller than the first threshold value, so as to obtain the target streaming media fragment, the average energy value and the maximum energy value of a plurality of frequency domain energy sampling points in the target streaming media fragment are determined, and whether the target streaming media fragment is the first streaming media fragment or the second streaming media fragment is determined according to the difference between the maximum energy value and the average energy value, so that on one hand, the first streaming media fragment and the second streaming media fragment in the target streaming media can be determined based on the frequency domain energy, and a basis is provided for playing different streaming media fragments according to different speeds; on the other hand, the accuracy of the judgment result can be improved.

Step 104, outputting the first stream media fragment in the target stream media according to the first speed, and outputting the second stream media fragment in the target stream media according to the second speed.

Specifically, the electronic device may output a first streaming media segment of the target streaming media at a first speed and output a second streaming media segment of the target streaming media at a second speed. That is, in the process of playing the target streaming media, the electronic device may switch the playing speed of different streaming media segments.
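As a rough illustration of the effect of switching speeds, the following Python sketch computes how long a sequence of already classified segments takes to play. The tuple layout `(duration_seconds, contains_voice)` and the function name are assumptions for illustration only; the embodiment does not prescribe a data structure.

```python
def playback_duration(segments, first_speed, second_speed):
    """Total wall-clock playing time for classified segments.

    segments: iterable of (duration_seconds, contains_voice) pairs (assumed layout).
    first_speed applies to segments with voice, second_speed to the others.
    """
    if not 0 < first_speed < second_speed:
        raise ValueError("expected 0 < first_speed < second_speed")
    total = 0.0
    for duration, contains_voice in segments:
        speed = first_speed if contains_voice else second_speed
        total += duration / speed
    return total
```

For example, with 14 seconds of speech at speed 1.0 and 6 seconds without speech at speed 2.0, the 20 seconds of media would play in 17 seconds.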

In the embodiment of the application, the voice frames of the target streaming media that have undergone short-time framing can be divided into a plurality of long-time frames, the first streaming media segment and the second streaming media segment in the target streaming media are determined by analyzing the long-time spectrum energy difference characteristic value of each long-time frame, and the first streaming media segment is output at the first speed while the second streaming media segment is output at the second speed. On the one hand, because long-time features are smoother and more stable than short-time features, analyzing the long-time spectrum energy difference characteristic values of the long-time frames can improve the accuracy of the analysis result. On the other hand, because the first speed is smaller than the second speed, that is, the second streaming media segment is played faster than the first streaming media segment, less time is spent on streaming media segments that contain no voice information, while the user is kept from missing key content in the first streaming media segment, which contains the voice information. In addition, the user does not need to provide any input while the target streaming media is being output, which reduces the complexity of user operation.

The play speed control method provided by the embodiment of the application may be executed by a play speed control device. In this embodiment of the present application, the play speed control device is described by taking, as an example, the case in which the play speed control device executes the play speed control method.

As shown in fig. 5, the embodiment of the present application further provides a play speed control device 500, including: an acquisition module 501, a processing module 502 and an output module 503. The acquisition module 501 is configured to acquire a target streaming media. The processing module 502 is configured to divide the voice frames of the target streaming media that have undergone short-time framing according to a preset frame length, so as to obtain a plurality of long-time frames. The processing module 502 is further configured to determine a long-time spectrum energy difference characteristic value of each long-time frame, and determine a first streaming media segment and a second streaming media segment according to the long-time spectrum energy difference characteristic values. The output module 503 is configured to output the first streaming media segment in the target streaming media at a first speed, and output the second streaming media segment in the target streaming media at a second speed. The first streaming media segment is streaming media data containing voice information, the second streaming media segment is streaming media data not containing voice information, and the first speed is smaller than the second speed.

Optionally, the processing module 502 is specifically configured to: determining the target long-time frame as the first streaming media fragment under the condition that the long-time spectrum energy difference characteristic value of the target long-time frame is smaller than a first threshold value; determining the target long-time frame as the second streaming media fragment under the condition that the long-time spectrum energy difference characteristic value of the target long-time frame is larger than the first threshold value; wherein the target long time frame is any one of the plurality of long time frames, and the first threshold is related to a noise estimation value and a signal-to-noise ratio.
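The threshold comparison can be sketched in Python as follows. The sketch assumes that the long-time spectrum energy difference characteristic values have already been computed, one per long-time frame; the function names are hypothetical, and treating a value exactly equal to the first threshold as the second segment is an assumption, since the text only covers the smaller-than and larger-than cases.

```python
from itertools import groupby


def label_long_frames(feature_values, first_threshold):
    """Label each long-time frame 'first' (speech) or 'second' (no speech)."""
    # Assumption: a value equal to the threshold is treated as 'second'.
    return ["first" if value < first_threshold else "second"
            for value in feature_values]


def merge_runs(labels):
    """Merge consecutive frames with the same label into (label, start, end) runs.

    `end` is exclusive, so a run covers frame indices start .. end - 1.
    """
    runs = []
    start = 0
    for label, group in groupby(labels):
        length = len(list(group))
        runs.append((label, start, start + length))
        start += length
    return runs
```

Merging adjacent frames with the same label is an illustrative convenience: it yields contiguous stretches that a player could output at a single speed.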

Optionally, the processing module 502 is specifically configured to: determining the number of frames containing fundamental tones in a target long-time frame under the condition that the long-time spectrum energy difference characteristic value of the target long-time frame is smaller than a first threshold value; determining the target long-term frame as the first streaming media segment in the case that the ratio of the number of frames containing the fundamental tone to the total number of frames of the target long-term frame is greater than a second threshold; determining the target long-term frame as the second streaming media segment in the case that the ratio of the number of frames containing the fundamental tone to the total number of frames of the target long-term frame is smaller than the second threshold; wherein the target long time frame is any one of the plurality of long time frames.
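The pitch-based refinement reduces to a ratio test, sketched below in Python. Pitch detection itself is outside the sketch: `has_pitch_flags` is an assumed input holding one boolean per short-time frame inside the target long-time frame, and the function name is hypothetical. A ratio exactly equal to the second threshold is treated as the second segment here, which is an assumption because the text does not cover the equal case.

```python
def classify_by_pitch_ratio(has_pitch_flags, second_threshold):
    """Return 'first' if enough frames contain a fundamental tone, else 'second'."""
    if not has_pitch_flags:
        raise ValueError("the long-time frame must contain at least one frame")
    pitched = sum(1 for flag in has_pitch_flags if flag)
    ratio = pitched / len(has_pitch_flags)
    return "first" if ratio > second_threshold else "second"
```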

Optionally, the processing module 502 is specifically configured to: under the condition that the long-time spectrum energy difference characteristic value of the target long-time frame is smaller than a first threshold value, converting the time domain voice signal of the target long-time frame into a frequency domain energy signal to obtain a target streaming media fragment; determining average energy values and maximum energy values of a plurality of frequency domain energy sampling points in the target streaming media segment; determining the target streaming media segment as the first streaming media segment under the condition that the difference between the maximum energy value and the average energy value is in a first preset range; determining the target streaming media segment as the second streaming media segment in the case that the difference between the maximum energy value and the average energy value is not within the first preset range; wherein the target long time frame is any one of the plurality of long time frames.

Optionally, with continued reference to fig. 5, the device 500 further includes a receiving module 504; the receiving module 504 is configured to receive a first input from a user when the play speed setting interface is displayed; the processing module 502 is further configured to determine the first speed and the second speed in response to the first input.

The play speed control device in the embodiment of the present application may be an electronic device, or may be a component in an electronic device, for example, an integrated circuit or a chip. The electronic device may be a terminal, or may be a device other than a terminal. By way of example, the electronic device may be a mobile phone, a tablet computer, a notebook computer, a palmtop computer, a vehicle-mounted electronic device, a mobile Internet device (MID), an augmented reality (AR)/virtual reality (VR) device, a robot, a wearable device, an ultra-mobile personal computer (UMPC), a netbook, or a personal digital assistant (PDA), and may also be a server, a network attached storage (NAS) device, a personal computer (PC), a television (TV), a teller machine, or a self-service machine, which is not specifically limited in the embodiments of the present application.

The play speed control device in the embodiment of the present application may be a device having an operating system. The operating system may be an Android operating system, an iOS operating system, or other possible operating systems, which are not specifically limited in the embodiments of the present application.

The play speed control device provided in the embodiment of the present application can implement each process implemented by the embodiments of the methods of fig. 1 to fig. 4, and in order to avoid repetition, a detailed description is omitted here.

Optionally, as shown in fig. 6, the embodiment of the present application further provides an electronic device 600, including a processor 601 and a memory 602, where the memory 602 stores a program or an instruction that can be executed on the processor 601. When executed by the processor 601, the program or the instruction implements each step of the above embodiment of the play speed control method and achieves the same technical effects; to avoid repetition, details are not described here again.

The electronic device in the embodiment of the application includes the mobile electronic device and the non-mobile electronic device described above.

Fig. 7 is a schematic hardware structure of an electronic device implementing an embodiment of the present application.

The electronic device 1000 includes, but is not limited to: radio frequency unit 1001, network module 1002, audio output unit 1003, input unit 1004, sensor 1005, display unit 1006, user input unit 1007, interface unit 1008, memory 1009, and processor 1010.

Those skilled in the art will appreciate that the electronic device 1000 may also include a power source (e.g., a battery) for powering the various components. The power source may be logically connected to the processor 1010 through a power management system, so that functions such as charge management, discharge management, and power consumption management are performed by the power management system. The electronic device structure shown in fig. 7 does not constitute a limitation of the electronic device; the electronic device may include more or fewer components than shown, may combine certain components, or may have a different arrangement of components, which is not described in detail here.

The processor 1010 is configured to acquire a target streaming media. The processor 1010 is further configured to divide the voice frames of the target streaming media that have undergone short-time framing according to a preset frame length, so as to obtain a plurality of long-time frames. The processor 1010 is further configured to determine a long-time spectrum energy difference characteristic value of each of the long-time frames, and determine a first streaming media segment and a second streaming media segment according to the long-time spectrum energy difference characteristic values. The audio output unit 1003 or the display unit 1006 is configured to output the first streaming media segment in the target streaming media at a first speed, and output the second streaming media segment in the target streaming media at a second speed. The first streaming media segment is streaming media data containing voice information, the second streaming media segment is streaming media data not containing voice information, and the first speed is smaller than the second speed.

Optionally, the processor 1010 is specifically configured to: determining the target long-time frame as the first streaming media fragment under the condition that the long-time spectrum energy difference characteristic value of the target long-time frame is smaller than a first threshold value; determining the target long-time frame as the second streaming media fragment under the condition that the long-time spectrum energy difference characteristic value of the target long-time frame is larger than the first threshold value; wherein the target long time frame is any one of the plurality of long time frames, and the first threshold is related to a noise estimation value and a signal-to-noise ratio.

In the embodiment of the application, whether the target long-time frame is the first streaming media segment or the second streaming media segment can be determined according to the long-time spectrum energy difference characteristic value of the target long-time frame. The first streaming media segment and the second streaming media segment in the target streaming media can therefore be determined based on the long-time spectrum energy difference characteristic value, which provides a basis for playing different streaming media segments at different speeds.

Optionally, the processor 1010 is specifically configured to: determining the number of frames containing fundamental tones in a target long-time frame under the condition that the long-time spectrum energy difference characteristic value of the target long-time frame is smaller than a first threshold value; determining the target long-term frame as the first streaming media segment in the case that the ratio of the number of frames containing the fundamental tone to the total number of frames of the target long-term frame is greater than a second threshold; determining the target long-term frame as the second streaming media segment in the case that the ratio of the number of frames containing the fundamental tone to the total number of frames of the target long-term frame is smaller than the second threshold; wherein the target long time frame is any one of the plurality of long time frames.

In the embodiment of the present application, since whether the target long-time frame is the first streaming media segment or the second streaming media segment can be determined according to the ratio of the number of frames containing the fundamental tone in the target long-time frame to the total number of frames of the target long-time frame, on one hand, the first streaming media segment and the second streaming media segment in the target streaming media can be determined based on the number of frames of the fundamental tone, thereby providing a basis for playing different streaming media segments at different speeds; on the other hand, the accuracy of the judgment result can be improved.

Optionally, the processor 1010 is specifically configured to: under the condition that the long-time spectrum energy difference characteristic value of the target long-time frame is smaller than a first threshold value, converting the time domain voice signal of the target long-time frame into a frequency domain energy signal to obtain a target streaming media fragment; determining average energy values and maximum energy values of a plurality of frequency domain energy sampling points in the target streaming media segment; determining the target streaming media segment as the first streaming media segment under the condition that the difference between the maximum energy value and the average energy value is in a first preset range; determining the target streaming media segment as the second streaming media segment in the case that the difference between the maximum energy value and the average energy value is not within the first preset range; wherein the target long time frame is any one of the plurality of long time frames.

In the embodiment of the present application, when the long-time spectrum energy difference characteristic value of the target long-time frame is smaller than the first threshold, the time-domain voice signal of the target long-time frame is converted into a frequency-domain energy signal to obtain the target streaming media segment; the average energy value and the maximum energy value of the plurality of frequency-domain energy sampling points in the target streaming media segment are determined, and whether the target streaming media segment is the first streaming media segment or the second streaming media segment is determined according to the difference between the maximum energy value and the average energy value. On the one hand, the first streaming media segment and the second streaming media segment in the target streaming media can thus be determined based on frequency-domain energy, which provides a basis for playing different streaming media segments at different speeds; on the other hand, the accuracy of the judgment result can be improved.

Optionally, a user input unit 1007 is configured to receive a first input of a user in a case where a play speed setting interface is displayed; the processor 1010 is further configured to determine the first speed and the second speed in response to the first input.

In the embodiment of the application, since the first speed and the second speed can be determined according to the first input, the user can perform self-defined adjustment on the two playing speeds according to the self-demand, thereby meeting the playing demands of different users for diversified speeds.
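A minimal Python sketch of how user-chosen speeds might be held and checked is given below. The class name `PlaySpeedSettings` and its validation rules are assumptions for illustration; the only constraint taken from the text is that the first speed is smaller than the second speed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlaySpeedSettings:
    """Hypothetical container for the two speeds chosen through the first input."""
    first_speed: float   # speed for segments containing voice information
    second_speed: float  # speed for segments without voice information

    def __post_init__(self):
        if self.first_speed <= 0 or self.second_speed <= 0:
            raise ValueError("speeds must be positive")
        if not self.first_speed < self.second_speed:
            raise ValueError("first_speed must be smaller than second_speed")
```

Rejecting an invalid pair at construction time keeps later playback logic free of repeated checks; a real setting interface could instead clamp or reorder the values.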

It should be understood that in the embodiment of the present application, the input unit 1004 may include a graphics processor (Graphics Processing Unit, GPU) 10041 and a microphone 10042, and the graphics processor 10041 processes image data of still pictures or videos obtained by an image capturing device (such as a camera) in a video capturing mode or an image capturing mode. The display unit 1006 may include a display panel 10061, and the display panel 10061 may be configured in the form of a liquid crystal display, an organic light emitting diode, or the like. The user input unit 1007 includes at least one of a touch panel 10071 and other input devices 10072. The touch panel 10071 is also referred to as a touch screen. The touch panel 10071 can include two portions, a touch detection device and a touch controller. Other input devices 10072 may include, but are not limited to, a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and so forth, which are not described in detail herein.

The memory 1009 may be used to store software programs as well as various data. The memory 1009 may mainly include a first memory area storing programs or instructions and a second memory area storing data, wherein the first memory area may store an operating system, and application programs or instructions required for at least one function (such as a sound playing function or an image playing function). Further, the memory 1009 may include volatile memory or nonvolatile memory, or the memory 1009 may include both volatile and nonvolatile memory. The nonvolatile memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash memory. The volatile memory may be a Random Access Memory (RAM), a Static RAM (SRAM), a Dynamic RAM (DRAM), a Synchronous DRAM (SDRAM), a Double Data Rate SDRAM (DDR SDRAM), an Enhanced SDRAM (ESDRAM), a Synchlink DRAM (SLDRAM), or a Direct Rambus RAM (DRRAM). The memory 1009 in embodiments of the present application includes, but is not limited to, these and any other suitable types of memory.

The processor 1010 may include one or more processing units; optionally, the processor 1010 integrates an application processor that primarily processes operations involving an operating system, user interface, application programs, and the like, and a modem processor that primarily processes wireless communication signals, such as a baseband processor. It will be appreciated that the modem processor described above may not be integrated into the processor 1010.

The embodiment of the present application further provides a readable storage medium, where a program or an instruction is stored on the readable storage medium, and when the program or the instruction is executed by a processor, the program or the instruction realizes each process of the embodiment of the play speed control method, and the same technical effects can be achieved, so that repetition is avoided, and no further description is provided herein.

The processor is the processor in the electronic device described in the above embodiment. The readable storage medium includes a computer-readable storage medium, such as a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

The embodiment of the application further provides a chip. The chip includes a processor and a communication interface, the communication interface is coupled with the processor, and the processor is used for running a program or an instruction to implement each process of the above embodiment of the play speed control method and achieve the same technical effects; to avoid repetition, details are not described here again.

It should be understood that the chip referred to in the embodiments of the present application may also be called a system-level chip, a system chip, a chip system, or a system-on-chip.

The embodiments of the present application provide a computer program product stored in a storage medium, where the program product is executed by at least one processor to implement the respective processes of the embodiments of the playing speed control method, and achieve the same technical effects, and are not repeated herein.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element. Furthermore, it should be noted that the scope of the methods and apparatus in the embodiments of the present application is not limited to performing the functions in the order shown or discussed, but may also include performing the functions in a substantially simultaneous manner or in an opposite order depending on the functions involved; for example, the described methods may be performed in an order different from that described, and various steps may also be added, omitted, or combined. Additionally, features described with reference to certain examples may be combined in other examples.

From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the related art in the form of a computer software product stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk), including several instructions for causing a terminal (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method described in the embodiments of the present application.

The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above-described embodiments, which are merely illustrative and not restrictive, and many forms may be made by those of ordinary skill in the art without departing from the spirit of the present application and the scope of the claims, which are also within the protection of the present application.

Claims

1. A play speed control method, comprising:

acquiring a target streaming media;

dividing the voice frame of the target streaming media subjected to short-time framing according to a preset frame length to obtain a plurality of long-time frames;

respectively determining a long-time spectrum energy difference characteristic value of each long-time frame, and determining a first stream media fragment and a second stream media fragment according to the long-time spectrum energy difference characteristic value;

outputting the first stream media fragments in the target stream media according to the first speed, and outputting the second stream media fragments in the target stream media according to the second speed;

the first streaming media segment is streaming media data containing voice information, the second streaming media segment is streaming media data not containing voice information, and the first speed is smaller than the second speed;

the determining the first stream media segment and the second stream media segment according to the long-term spectrum energy difference characteristic value comprises the following steps:

under the condition that the long-time spectrum energy difference characteristic value of the target long-time frame is smaller than a first threshold value, converting the time domain voice signal of the target long-time frame into a frequency domain energy signal to obtain a target streaming media fragment;

determining average energy values and maximum energy values of a plurality of frequency domain energy sampling points in the target streaming media segment;

determining the target streaming media segment as the first streaming media segment under the condition that the difference between the maximum energy value and the average energy value is in a first preset range;

determining the target streaming media segment as the second streaming media segment in the case that the difference between the maximum energy value and the average energy value is not within the first preset range;

wherein the target long time frame is any one of the plurality of long time frames.

2. The play speed control method according to claim 1, wherein before the target streaming media is acquired, the method further comprises:

receiving a first input of a user under the condition that a play speed setting interface is displayed;

the first speed and the second speed are determined in response to the first input.

3. A play speed control apparatus, comprising: an acquisition module, a processing module and an output module;

the acquisition module is used for acquiring the target streaming media;

the processing module is used for dividing the voice frame of the target streaming media subjected to short-time framing processing according to a preset frame length to obtain a plurality of long-time frames;

the processing module is further used for respectively determining long-time spectrum energy difference characteristic values of each long-time frame and determining a first stream media fragment and a second stream media fragment according to the long-time spectrum energy difference characteristic values;

the output module is used for outputting the first streaming media fragments in the target streaming media according to the first speed and outputting the second streaming media fragments in the target streaming media according to the second speed;

the processing module is specifically configured to: under the condition that the long-time spectrum energy difference characteristic value of the target long-time frame is smaller than a first threshold value, converting the time domain voice signal of the target long-time frame into a frequency domain energy signal to obtain a target streaming media fragment; determining average energy values and maximum energy values of a plurality of frequency domain energy sampling points in the target streaming media segment; determining the target streaming media segment as the first streaming media segment under the condition that the difference between the maximum energy value and the average energy value is in a first preset range; determining the target streaming media segment as the second streaming media segment in the case that the difference between the maximum energy value and the average energy value is not within the first preset range; wherein the target long time frame is any one of the plurality of long time frames.

4. The play speed control apparatus according to claim 3, wherein the apparatus further comprises a receiving module;

the receiving module is used for receiving a first input of a user under the condition of displaying a playing speed setting interface before the target streaming media is acquired;

the processing module is further configured to determine the first speed and the second speed in response to the first input.

5. An electronic device comprising a processor and a memory storing a program or instructions executable on the processor, which when executed by the processor, implements the play speed control method of any one of claims 1-2.

6. A readable storage medium, wherein a program or instructions are stored on the readable storage medium, and the program or instructions, when executed by a processor, implement the play speed control method according to any one of claims 1-2.