CN106357929A

CN106357929A - Previewing method based on audio file and mobile terminal

Info

Publication number: CN106357929A
Application number: CN201610991972.0A
Authority: CN
Inventors: 李光宇
Original assignee: Nubia Technology Co Ltd
Current assignee: Nubia Technology Co Ltd
Priority date: 2016-11-10
Filing date: 2016-11-10
Publication date: 2017-01-25

Abstract

The invention discloses a previewing method based on an audio file and a mobile terminal. The mobile terminal comprises a segmentation module, an acquiring module and a previewing module, wherein the segmentation module is used for segmenting the audio file according to a blank part of the audio file or in an equal-division manner so as to acquire voice segments; the acquiring module is used for determining visual description information of each voice segment according to the content of the voice segment; and the previewing module is used for displaying the visual description information of each voice segment as an identifier of the corresponding voice segment. With the technical scheme provided by the invention, quick retrieval, trial listening and quick editing of voice can be realized, the functions of terminal products are extended, and the efficiency of audio searching is improved.

Description

Preview method based on voice file and mobile terminal

Technical Field

The invention relates to the technical field of mobile terminals, in particular to a preview method based on a voice file and a mobile terminal.

Background

The existing mobile terminal can conveniently make recordings, but when a recording file with a long recording time is played back, useful content still has to be searched for by dragging a slider, and blank content in the middle cannot be effectively skipped. The efficiency of screening for useful content can therefore become very low: the user has to drag the slider several times and listen at each position to find the desired content. If the user wants to clip some useful parts, specialized tools are usually required and the process takes a long time.

Disclosure of Invention

The invention mainly aims to provide a previewing method based on a voice file and a mobile terminal, and aims to solve the problem that previewing of the voice file is difficult in the prior art.

In order to achieve the above object, the present invention provides a mobile terminal, including:

the segmentation module is used for segmenting the voice file to obtain voice fragments according to the blank part of the voice file or according to an equal division mode;

the acquisition module is used for determining the visual description information of the voice fragment according to the content of the voice fragment;

and the preview module is used for displaying the visual description information of each voice segment as the identifier of the corresponding voice segment.

Optionally, the segmentation module includes:

a blank segmentation unit, configured to segment the voice file into voice segments according to a blank portion included in the voice file and reaching a set duration; or,

and the unit segmentation unit is used for segmenting the voice file into voice segments according to a set time interval or a set single voice segment size.

Optionally, in a case that the segmentation module adopts the blank segmentation unit, the preview module is further configured to hide the identification of the voice segment corresponding to the blank part.

Optionally, the obtaining module includes:

the character acquisition unit is used for converting the voice fragments into corresponding character strings; extracting a character string abstract from a character string corresponding to the voice fragment, and taking the character string abstract as visual description information of the voice fragment; or,

and the image acquisition unit is used for extracting an image frame from a video file corresponding to the voice segment in the audio and video file as the visual description information of the voice segment under the condition that the voice file is from the audio and video file.

Optionally, the visual description information further includes: the time starting and ending position of the voice segment in the voice file.

Optionally, the apparatus further includes:

a processing module, configured to receive and respond to an operation instruction for the identification of each voice segment, where the operation instruction includes: selected, unselected, deleted, sorted, or played.

Optionally, the processing module is further configured to:

and receiving a saving instruction aiming at the identifications of all the displayed voice segments, and generating a voice clip file for all the displayed voice segments based on the saving instruction.

Optionally, the processing module is specifically configured to, when the operation instruction is playing, if the voice file is from an audio/video file, play the video file corresponding to the voice clip in the audio/video file together with the voice clip.
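As a rough illustration of how a processing module might respond to these operation instructions, the Python sketch below keeps the displayed identifiers in an ordered list. The names `Clip`, `ClipBoard` and `move` are hypothetical and do not come from the invention. Audio is modelled as a plain list of samples, "sorted" is modelled as moving one identifier to a new position, and "played" is omitted because it depends on the platform's audio interface. The save step simply concatenates the displayed segments in their current order, which is only one assumed way of generating the clip file.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Clip:
    """One displayed voice segment (hypothetical structure)."""
    label: str               # visual description shown as the identifier
    samples: List[float]     # audio data of the segment
    selected: bool = False   # toggled by select / deselect


class ClipBoard:
    """Holds the displayed identifiers and responds to operation instructions."""

    def __init__(self, clips: Iterable[Clip]) -> None:
        self.clips: List[Clip] = list(clips)

    def select(self, index: int) -> None:
        self.clips[index].selected = True

    def deselect(self, index: int) -> None:
        self.clips[index].selected = False

    def delete(self, index: int) -> None:
        # A deleted identifier is no longer displayed, so it is not saved.
        del self.clips[index]

    def move(self, src: int, dst: int) -> None:
        # Reorder: take the identifier at src and re-insert it at dst.
        self.clips.insert(dst, self.clips.pop(src))

    def save(self) -> List[float]:
        # Concatenate all displayed segments in their current order.
        merged: List[float] = []
        for clip in self.clips:
            merged.extend(clip.samples)
        return merged


board = ClipBoard([
    Clip("intro", [1.0, 2.0]),
    Clip("pause", [3.0]),
    Clip("summary", [4.0, 5.0]),
])
board.delete(1)      # drop "pause"
board.move(1, 0)     # put "summary" before "intro"
board.select(0)
clip_file = board.save()
```

In this sketch the saved result depends only on which identifiers remain displayed and on their order, which matches the saving instruction described above; whether selection should also filter the saved segments is left open here.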

In addition, in order to achieve the above object, the present invention further provides a preview method based on a voice file, including:

segmenting the voice file to obtain voice fragments according to the blank part of the voice file or according to an equal division mode;

determining the visual description information of the voice fragment according to the content of the voice fragment;

and displaying the visual description information of each voice fragment as the identification of the corresponding voice fragment.

Optionally, the segmenting the voice file according to the blank part of the voice file or according to an equal division manner to obtain a voice segment includes:

dividing the voice file into voice segments according to a blank part which reaches a set time length and is contained in the voice file; or,

and dividing the voice file into voice segments according to a set time interval or a set single voice segment size.

Optionally, in a case that the voice file is divided into voice segments according to a blank portion included in the voice file and reaching the set duration, the method further includes:

and hiding the identification of the voice segment corresponding to the blank part.

Optionally, the determining the visual description information of the voice segment according to the content of the voice segment includes:

converting the voice segments into corresponding character strings; extracting a character string abstract from a character string corresponding to the voice fragment, and taking the character string abstract as visual description information of the voice fragment; or,

and under the condition that the voice file is from an audio/video file, extracting an image frame from a video file corresponding to the voice segment in the audio/video file as visual description information of the voice segment.

Optionally, the visual description information further includes: the time starting and ending position of the voice segment in the voice file.
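To make the text-based visual description concrete, the following Python sketch builds an identifier label from a transcript and the segment's start and end times. It is a minimal sketch under stated assumptions: the transcript is taken as already produced by some speech-to-text step (not shown), the "character string abstract" is approximated by simple truncation because the source does not specify how the abstract is extracted, and the function names `format_time`, `summarize` and `build_label` are hypothetical.

```python
def format_time(seconds: float) -> str:
    """Format a position in seconds as MM:SS (fractions are dropped)."""
    total = int(seconds)
    return f"{total // 60:02d}:{total % 60:02d}"


def summarize(text: str, max_chars: int = 20) -> str:
    """Assumed abstract: collapse whitespace and keep the first max_chars characters."""
    text = " ".join(text.split())
    if len(text) <= max_chars:
        return text
    return text[:max_chars].rstrip() + "..."


def build_label(transcript: str, start_s: float, end_s: float,
                max_chars: int = 20) -> str:
    """Combine the time range and the text abstract into one identifier."""
    time_range = f"{format_time(start_s)}-{format_time(end_s)}"
    return f"[{time_range}] {summarize(transcript, max_chars)}"


label = build_label(
    "Please send the quarterly report to the team by Friday", 65, 83.4
)
```

A label of this kind could be shown in a list entry or bubble; any other summarization technique could replace the truncation step without changing the rest of the sketch.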

Optionally, the method further includes:

receiving and responding to an operation instruction for the identification of each voice segment, where the operation instruction includes: selected, unselected, deleted, sorted, or played.

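Optionally, the method further includes:

receiving a saving instruction aiming at the identifications of all the displayed voice segments, and generating a voice clip file for all the displayed voice segments based on the saving instruction.

Optionally, when the operation instruction is playing, if the voice file is from an audio/video file, the video file corresponding to the voice clip in the audio/video file is played together with the voice clip.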

The voice-file-based preview method and mobile terminal provided by the invention can split a voice file into segments and eliminate blank segments. Through a specific trigger mode and a human-computer interaction interface, the visual description information of the voice file content is presented on the screen as bubbles, so that a user can quickly find segments of interest for trial listening, select and deselect them by tapping, rearrange and combine them by sorting, and finally generate a sound clip from the rearranged segments. This realizes quick retrieval, trial listening and quick editing of voice, extends the functions of terminal products, and improves the efficiency of audio searching.

Drawings

Fig. 1 is a schematic hardware configuration diagram of an alternative mobile terminal implementing various embodiments of the present invention;

Fig. 2 is a diagram of a wireless communication system for the mobile terminal shown in Fig. 1;

Fig. 3 is a schematic diagram illustrating a situation in which a mobile terminal according to embodiments of the present invention is held by a user;

Fig. 4 is a schematic structural diagram of a mobile terminal according to a first embodiment of the present invention;

Fig. 5 is a schematic structural diagram of another mobile terminal according to the first embodiment of the present invention;

Fig. 6 is a schematic structural diagram of a mobile terminal according to a second embodiment of the present invention;

Fig. 7 is a schematic structural diagram of a mobile terminal according to a third embodiment of the present invention;

Fig. 8 is a flowchart illustrating a preview method based on a voice file according to a fourth embodiment of the present invention;

Fig. 9 is a flowchart of a preview method based on voice files according to a fifth embodiment of the present invention;

Fig. 10 is a flowchart of a preview method based on voice files according to a sixth embodiment of the present invention;

Fig. 11 is a diagram illustrating a preview effect of a voice file according to a seventh embodiment of the present invention.

The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.

Detailed Description

It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

A mobile terminal implementing various embodiments of the present invention will now be described with reference to the accompanying drawings. In the following description, suffixes such as "module", "component", or "unit" used to denote elements are used only for facilitating the explanation of the present invention, and have no specific meaning in themselves. Thus, "module" and "component" may be used interchangeably.

The mobile terminal may be implemented in various forms. For example, the terminal described in the present invention may include a mobile terminal such as a mobile phone, a smart phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a navigation device, and the like, and a stationary terminal such as a digital TV, a desktop computer, and the like. In the following, it is assumed that the terminal is a mobile terminal. However, it will be understood by those skilled in the art that the configuration according to the embodiment of the present invention can also be applied to a fixed type terminal, except for elements used specifically for mobile purposes.

Fig. 1 is a schematic hardware structure of an alternative mobile terminal implementing various embodiments of the present invention.

The mobile terminal 100 may include a wireless communication unit 110, an A/V (audio/video) input unit 120, a user input unit 130, a sensing unit 140, an output unit 150, a memory 160, an interface unit 170, a controller 180, and a power supply unit 190, etc. Fig. 1 illustrates a mobile terminal having various components, but it is to be understood that not all illustrated components are required to be implemented. More or fewer components may alternatively be implemented. Elements of the mobile terminal will be described in detail below.

The wireless communication unit 110 typically includes one or more components that allow radio communication between the mobile terminal 100 and a wireless communication system or network. For example, the wireless communication unit may include at least one of a broadcast receiving module 111, a mobile communication module 112, a wireless internet module 113, a short-range communication module 114, and a location information module 115.

The broadcast receiving module 111 receives a broadcast signal and/or broadcast associated information from an external broadcast management server via a broadcast channel. The broadcast channel may include a satellite channel and/or a terrestrial channel. The broadcast management server may be a server that generates and transmits a broadcast signal and/or broadcast associated information or a server that receives a previously generated broadcast signal and/or broadcast associated information and transmits it to a terminal. The broadcast signal may include a TV broadcast signal, a radio broadcast signal, a data broadcast signal, and the like. Also, the broadcast signal may further include a broadcast signal combined with a TV or radio broadcast signal. The broadcast associated information may also be provided via a mobile communication network, and in this case, the broadcast associated information may be received by the mobile communication module 112. The broadcast signal may exist in various forms, for example, it may exist in the form of an Electronic Program Guide (EPG) of Digital Multimedia Broadcasting (DMB), an Electronic Service Guide (ESG) of digital video broadcasting-handheld (DVB-H), and the like. The broadcast receiving module 111 may receive a signal broadcast by using various types of broadcasting systems. In particular, the broadcast receiving module 111 may receive digital broadcasts by using a digital broadcasting system such as digital multimedia broadcasting-terrestrial (DMB-T), digital multimedia broadcasting-satellite (DMB-S), digital video broadcasting-handheld (DVB-H), the media forward link only (MediaFLO®) data broadcasting system, integrated services digital broadcasting-terrestrial (ISDB-T), and the like. The broadcast receiving module 111 may be constructed to be suitable for various broadcasting systems that provide broadcast signals as well as the above-mentioned digital broadcasting systems.

The broadcast signal and/or broadcast associated information received via the broadcast receiving module 111 may be stored in the memory 160 (or other type of storage medium).

The mobile communication module 112 transmits and/or receives radio signals to and/or from at least one of a base station (e.g., access point, node B, etc.), an external terminal, and a server. Such radio signals may include voice call signals, video call signals, or various types of data transmitted and/or received according to text and/or multimedia messages.

The wireless internet module 113 supports wireless internet access of the mobile terminal. The module may be internally or externally coupled to the terminal. The wireless internet access technology to which the module relates may include WLAN (wireless LAN) (Wi-Fi), Wibro (wireless broadband), Wimax (worldwide interoperability for microwave access), HSDPA (high speed downlink packet access), and the like.

The short-range communication module 114 is a module for supporting short-range communication. Some examples of short-range communication technologies include Bluetooth™, Radio Frequency Identification (RFID), Infrared Data Association (IrDA), Ultra Wideband (UWB), ZigBee™, and so on.

The location information module 115 is a module for checking or acquiring location information of the mobile terminal. A typical example of the location information module is a GPS (global positioning system). According to the current technology, the GPS module 115 calculates distance information and accurate time information from three or more satellites and applies triangulation to the calculated information, thereby accurately calculating three-dimensional current location information according to longitude, latitude, and altitude. Currently, a method for calculating position and time information uses three satellites and corrects an error of the calculated position and time information by using another satellite. In addition, the GPS module 115 can calculate speed information by continuously calculating current position information in real time.

The A/V input unit 120 is used to receive an audio or video signal. The A/V input unit 120 may include a camera 121 and a microphone 122. The camera 121 processes image data of still pictures or video obtained by an image capturing apparatus in a video capturing mode or an image capturing mode. The processed image frames may be displayed on the display unit 151. The image frames processed by the camera 121 may be stored in the memory 160 (or other storage medium) or transmitted via the wireless communication unit 110, and two or more cameras 121 may be provided according to the construction of the mobile terminal. The microphone 122 may receive sounds (audio data) in a phone call mode, a recording mode, a voice recognition mode, or the like, and can process such sounds into audio data. In the phone call mode, the processed audio (voice) data may be converted into a format that can be transmitted to a mobile communication base station via the mobile communication module 112. The microphone 122 may implement various types of noise cancellation (or suppression) algorithms to cancel (or suppress) noise or interference generated in the course of receiving and transmitting audio signals.

The user input unit 130 may generate key input data according to a command input by a user to control various operations of the mobile terminal. The user input unit 130 allows a user to input various types of information, and may include a keyboard, dome sheet, touch pad (e.g., a touch-sensitive member that detects changes in resistance, pressure, capacitance, and the like due to being touched), scroll wheel, joystick, and the like. In particular, when the touch pad is superimposed on the display unit 151 in the form of a layer, a touch screen may be formed.

The sensing unit 140 detects a current state of the mobile terminal 100 (e.g., an open or closed state of the mobile terminal 100), a position of the mobile terminal 100, presence or absence of contact (i.e., touch input) by a user with the mobile terminal 100, an orientation of the mobile terminal 100, acceleration or deceleration movement and direction of the mobile terminal 100, and the like, and generates a command or signal for controlling an operation of the mobile terminal 100. For example, when the mobile terminal 100 is implemented as a slide-type mobile phone, the sensing unit 140 may sense whether the slide-type phone is opened or closed. In addition, the sensing unit 140 can detect whether the power supply unit 190 supplies power or whether the interface unit 170 is coupled with an external device. The sensing unit 140 may include a proximity sensor 141 and the like.

The interface unit 170 serves as an interface through which at least one external device is connected to the mobile terminal 100. For example, the external device may include a wired or wireless headset port, an external power supply (or battery charger) port, a wired or wireless data port, a memory card port, a port for connecting a device having an identification module, an audio input/output (I/O) port, a video I/O port, an earphone port, and the like. The identification module may store various information for authenticating a user using the mobile terminal 100 and may include a User Identity Module (UIM), a Subscriber Identity Module (SIM), a Universal Subscriber Identity Module (USIM), and the like. In addition, a device having an identification module (hereinafter, referred to as an "identification device") may take the form of a smart card, and thus, the identification device may be connected with the mobile terminal 100 via a port or other connection means. The interface unit 170 may be used to receive input (e.g., data information, power, etc.) from an external device and transmit the received input to one or more elements within the mobile terminal 100 or may be used to transmit data between the mobile terminal and the external device.

In addition, when the mobile terminal 100 is connected with an external cradle, the interface unit 170 may serve as a path through which power is supplied from the cradle to the mobile terminal 100 or may serve as a path through which various command signals input from the cradle are transmitted to the mobile terminal. Various command signals or power input from the cradle may be used as signals for recognizing whether the mobile terminal is accurately mounted on the cradle.

The output unit 150 is configured to provide output signals (e.g., audio signals, video signals, alarm signals, vibration signals, etc.) in a visual, audio, and/or tactile manner. The output unit 150 may include a display unit 151, an audio output module 152, an alarm unit 153, and the like.

The display unit 151 may display information processed in the mobile terminal 100. For example, when the mobile terminal 100 is in a phone call mode, the display unit 151 may display a User Interface (UI) or a Graphical User Interface (GUI) related to a call or other communication (e.g., text messaging, multimedia file downloading, etc.). When the mobile terminal 100 is in a video call mode or an image capturing mode, the display unit 151 may display a captured image and/or a received image, a UI or GUI showing a video or an image and related functions, and the like.

Meanwhile, when the display unit 151 and the touch pad are overlapped with each other in the form of a layer to form a touch screen, the display unit 151 may serve as an input device and an output device. The display unit 151 may include at least one of a Liquid Crystal Display (LCD), a thin film transistor LCD (TFT-LCD), an Organic Light Emitting Diode (OLED) display, a flexible display, a three-dimensional (3D) display, and the like. Some of these displays may be configured to be transparent to allow a user to view from the outside, which may be referred to as transparent displays, and a typical transparent display may be, for example, a TOLED (transparent organic light emitting diode) display or the like. Depending on the particular desired implementation, the mobile terminal 100 may include two or more display units (or other display devices), for example, the mobile terminal may include an external display unit (not shown) and an internal display unit (not shown). The touch screen may be used to detect a touch input pressure as well as a touch input position and a touch input area.

The audio output module 152 may convert audio data received by the wireless communication unit 110 or stored in the memory 160 into an audio signal and output as sound when the mobile terminal is in a call signal reception mode, a call mode, a recording mode, a voice recognition mode, a broadcast reception mode, or the like. Also, the audio output module 152 may provide audio output related to a specific function performed by the mobile terminal 100 (e.g., a call signal reception sound, a message reception sound, etc.). The audio output module 152 may include a speaker, a buzzer, and the like.

The alarm unit 153 may provide an output to notify the mobile terminal 100 of the occurrence of an event. Typical events may include call reception, message reception, key signal input, touch input, and the like. In addition to audio or video output, the alarm unit 153 may provide output in different ways to notify the occurrence of an event. For example, the alarm unit 153 may provide an output in the form of vibration, and when a call, a message, or some other incoming communication is received, the alarm unit 153 may provide a tactile output (i.e., vibration) to inform the user thereof. By providing such a tactile output, the user can recognize the occurrence of various events even when the user's mobile phone is in the user's pocket. The alarm unit 153 may also provide an output notifying the occurrence of an event via the display unit 151 or the audio output module 152.

The memory 160 may store software programs and the like for processing and controlling operations performed by the controller 180, or may temporarily store data (e.g., a phonebook, messages, still images, videos, and the like) that has been or will be output. Also, the memory 160 may store data regarding various ways of vibration and audio signals output when a touch is applied to the touch screen.

The memory 160 may include at least one type of storage medium including a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. Also, the mobile terminal 100 may cooperate with a network storage device that performs a storage function of the memory 160 through a network connection.

The controller 180 generally controls the overall operation of the mobile terminal. For example, the controller 180 performs control and processing related to voice calls, data communications, video calls, and the like. In addition, the controller 180 may include a multimedia module 181 for reproducing (or playing back) multimedia data, and the multimedia module 181 may be constructed within the controller 180 or may be constructed separately from the controller 180. The controller 180 may perform a pattern recognition process to recognize a handwriting input or a picture drawing input performed on the touch screen as a character or an image.

The power supply unit 190 receives external power or internal power and provides appropriate power required to operate various elements and components under the control of the controller 180.

The various embodiments described herein may be implemented in a computer-readable medium using, for example, computer software, hardware, or any combination thereof. For a hardware implementation, the embodiments described herein may be implemented using at least one of an Application Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a processor, a controller, a microcontroller, a microprocessor, an electronic unit designed to perform the functions described herein, and in some cases, such embodiments may be implemented in the controller 180. For a software implementation, the implementation such as a process or a function may be implemented with a separate software module that allows performing at least one function or operation. The software codes may be implemented by software applications (or programs) written in any suitable programming language, which may be stored in the memory 160 and executed by the controller 180.

Up to this point, mobile terminals have been described in terms of their functionality. Hereinafter, a slide-type mobile terminal among various types of mobile terminals, such as a folder-type, bar-type, swing-type, slide-type mobile terminal, and the like, will be described as an example for the sake of brevity. Accordingly, the present invention can be applied to any type of mobile terminal, and is not limited to a slide type mobile terminal.

The mobile terminal 100 as shown in fig. 1 may be configured to operate with communication systems such as wired and wireless communication systems and satellite-based communication systems that transmit data via frames or packets.

A communication system in which a mobile terminal according to the present invention is operable will now be described with reference to fig. 2.

Such communication systems may use different air interfaces and/or physical layers. For example, the air interface used by the communication system includes, for example, Frequency Division Multiple Access (FDMA), Time Division Multiple Access (TDMA), Code Division Multiple Access (CDMA), and Universal Mobile Telecommunications System (UMTS) (in particular, Long Term Evolution (LTE)), global system for mobile communications (GSM), and the like. By way of non-limiting example, the following description relates to a CDMA communication system, but such teachings are equally applicable to other types of systems.

Referring to fig. 2, the CDMA wireless communication system may include a plurality of mobile terminals 100, a plurality of Base Stations (BSs) 270, Base Station Controllers (BSCs) 275, and a Mobile Switching Center (MSC) 280. The MSC 280 is configured to interface with a Public Switched Telephone Network (PSTN) 290. The MSC 280 is also configured to interface with a BSC 275, which may be coupled to the base station 270 via a backhaul. The backhaul may be constructed according to any of several known interfaces including, for example, E1/T1, ATM, IP, PPP, Frame Relay, HDSL, ADSL, or xDSL. It will be understood that a system as shown in fig. 2 may include multiple BSCs 275.

Each BS 270 may serve one or more sectors (or regions), each sector being covered by an omnidirectional antenna or by an antenna pointing in a particular direction radially away from the BS 270. Alternatively, each sector may be covered by two or more antennas for diversity reception. Each BS 270 may be configured to support multiple frequency allocations, with each frequency allocation having a particular frequency spectrum (e.g., 1.25 MHz, 5 MHz, etc.).

The intersection of a sector and a frequency allocation may be referred to as a CDMA channel. The BS 270 may also be referred to as a Base Transceiver Subsystem (BTS) or other equivalent terminology. In such a case, the term "base station" may be used to refer collectively to a single BSC 275 and at least one BS 270. The base stations may also be referred to as "cell sites". Alternatively, the individual sectors of a particular BS 270 may be referred to as a plurality of cell sites.

As shown in fig. 2, a Broadcast Transmitter (BT) 295 transmits a broadcast signal to the mobile terminal 100 operating within the system. A broadcast receiving module 111 as shown in fig. 1 is provided at the mobile terminal 100 to receive a broadcast signal transmitted by the BT 295. In fig. 2, several Global Positioning System (GPS) satellites 300 are shown. The satellites 300 assist in locating at least one of the plurality of mobile terminals 100.

In fig. 2, a plurality of satellites 300 are depicted, but it is understood that useful positioning information may be obtained with any number of satellites. The GPS module 115 as shown in fig. 1 is generally configured to cooperate with satellites 300 to obtain desired positioning information. Other techniques that can track the location of the mobile terminal may be used instead of or in addition to GPS tracking techniques. In addition, at least one GPS satellite 300 may selectively or additionally process satellite DMB transmission.

As a typical operation of the wireless communication system, the BS 270 receives reverse link signals from various mobile terminals 100. The mobile terminal 100 is generally engaged in conversations, messaging, and other types of communications. Each reverse link signal received by a particular base station 270 is processed within the particular BS 270. The obtained data is forwarded to the associated BSC 275. The BSC provides call resource allocation and mobility management functions including coordination of soft handoff procedures between BSs 270. The BSCs 275 also route the received data to the MSC 280, which provides additional routing services for interfacing with the PSTN 290. Similarly, the PSTN 290 interfaces with the MSC 280, the MSC interfaces with the BSCs 275, and the BSCs 275 accordingly control the BS 270 to transmit forward link signals to the mobile terminal 100.

Taking a mobile phone as an example of the mobile terminal, fig. 3 shows a user holding the mobile terminal.

Based on the above-described mobile terminal hardware structure and communication system, various embodiments of the present invention are proposed.

As shown in fig. 4 to 5, a first embodiment of the present invention provides a mobile terminal, including:

1) a segmenting module 401, configured to segment the voice file according to a blank portion of the voice file or according to an equal division manner to obtain a voice segment;

2) an obtaining module 402, configured to determine, according to content of a voice segment, visual description information of the voice segment;

3) a preview module 403, configured to display the visual description information of each voice segment as an identifier of the corresponding voice segment. For example, the individual voice segments are identified in the form of a list or bubbles.

Optionally, as shown in fig. 4, the segmentation module 401 includes:

a blank segmenting unit 41, configured to segment the voice file into voice segments according to blank portions that are contained in the voice file and reach a set duration. For example, a blank portion is identified based on the audio time domain waveform, and whether it is used as a basis for segmentation is determined by whether it reaches the set duration. In the case where the segmentation module 401 employs the blank segmenting unit 41, the preview module 403 is further configured to hide the identifier of the voice segment corresponding to the blank portion.

Alternatively, as shown in fig. 5, the segmentation module 401 includes:

a unit segmenting unit 42, configured to segment the voice file into voice segments according to a set time interval or a set size of a single voice segment.

The embodiment of the invention uses audio time domain waveform analysis to split the voice file into segments, can eliminate blank segments, and presents the visual description information of each voice segment's content on the screen as the identifier of that segment, so that a user can quickly find a voice segment of interest for trial listening, which improves the efficiency of audio searching.

As shown in fig. 6, a second embodiment of the present invention provides a mobile terminal, including:

2) a character acquisition module 402-a for converting the voice segments into corresponding character strings; extracting a character string abstract from a character string corresponding to the voice fragment, and taking the character string abstract as visual description information of the voice fragment; it can be considered that, in the embodiment of the present invention, the character obtaining module 402-a is a specific implementation of the obtaining module 402 in the first embodiment.

3) a preview module 403, configured to display the visual description information of each voice segment as an identifier of the corresponding voice segment.

Optionally, the apparatus further includes:

4) a processing module 404, configured to receive and respond to an operation instruction for the identifier of each voice segment, where the operation instruction includes: select, unselect, delete, sort, or play.

Optionally, the processing module 404 is further configured to:

In a case that the operation instruction is play, optionally, the processing module 404 is specifically configured to: if the voice file is from an audio and video file, play the video file corresponding to the voice segment in the audio and video file together with the voice segment.

The embodiment of the invention uses audio time domain waveform analysis to split the voice file into segments and can eliminate blank segments. In combination with voice recognition technology, the visual description information of each voice segment's content is presented on the screen to identify the voice segment, so that a user can quickly find a voice segment of interest for trial listening, select and unselect segments by clicking, and rearrange and combine segments by sorting them, finally generating the rearranged and recombined voice segments. This realizes quick retrieval, trial listening, and quick editing of voice, adds to the functions of terminal products, and improves the efficiency of audio searching.

As shown in fig. 7, a third embodiment of the present invention provides a mobile terminal, including:

2) the image acquisition module 402-b is used for extracting an image frame from a video file corresponding to a voice segment in the audio and video file as visual description information of the voice segment under the condition that the voice file is from the audio and video file; it can be considered that, in the embodiment of the present invention, the image acquisition module 402-b is a specific implementation of the acquisition module 402 of the first embodiment.

Optionally, the apparatus further includes:

Optionally, the processing module 404 is further configured to:

The embodiment of the invention uses audio time domain waveform analysis to split the voice file into segments and can eliminate blank segments. Image frames from the video file corresponding to each voice segment are presented on the screen as the identifier of that segment, so that a user can quickly find a voice segment of interest for trial listening, select and unselect segments by clicking, and rearrange and combine segments by sorting them, finally generating the rearranged and recombined voice segments. This realizes quick retrieval, trial listening, and quick editing of voice, adds to the functions of terminal products, and improves the efficiency of audio searching.
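The source does not say which image frame is taken from the video that corresponds to a voice segment. As a minimal sketch, assuming a constant frame rate and assuming the frame displayed at the midpoint of the segment is used, the frame index could be computed as follows. The function name `representative_frame_index` and the midpoint choice are illustrative assumptions, not part of the embodiment.

```python
import math


def representative_frame_index(start_ms: int, end_ms: int, fps: float) -> int:
    """Return the index of the video frame shown at the midpoint of a segment.

    Assumptions (not from the source): constant frame rate, frame 0 at
    time 0, and the midpoint frame is taken as the representative image.
    """
    if end_ms < start_ms or fps <= 0:
        raise ValueError("expected start_ms <= end_ms and fps > 0")
    mid_ms = (start_ms + end_ms) / 2
    return math.floor(mid_ms * fps / 1000)
```

A decoder would then be asked for the frame at that index; the decoding itself depends on the video library in use and is left out here.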

As shown in fig. 8, a fourth embodiment of the present invention provides a preview method based on a voice file, including:

Step S101, segmenting the voice file according to a blank portion of the voice file or in an equal division manner to obtain voice segments;

Step S102, determining the visual description information of each voice segment according to the content of the voice segment;

Step S103, displaying the visual description information of each voice segment as the identifier of the corresponding voice segment. For example, the individual voice segments are identified in the form of a list or bubbles.

The purpose of the visual description information is to let the user see directly on the terminal screen the text information of a voice segment, or the image frame information of the video segment corresponding to the voice segment, so that the user can search and preview conveniently and quickly.

Optionally, the segmentation in step S101 is performed in one of the following two ways:

First way: dividing the voice file into voice segments according to blank portions that are contained in the voice file and reach a set duration;

Specifically, a blank portion is identified based on the audio time domain waveform of the voice file, and whether the blank portion is used as a basis for segmentation is determined by whether it reaches the set duration. A blank portion is a part of the audio time domain waveform that falls outside a set audio amplitude range. For example, the audio amplitude range may be set according to the intensity characteristics of the human voice: sounds below the minimum of the range are likely to be irrelevant background sounds, and sounds above the maximum of the range are likely to be interfering noise. Preferably, in addition to sound intensity, corresponding ranges may be set according to the timbre and pitch of the human voice, so as to determine the blank portion more precisely.
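The first way can be illustrated with a minimal sketch. It assumes normalized samples in the range -1.0 to 1.0, classifies short frames by their peak amplitude, and cuts only at blank runs that reach the set duration. The function name `split_on_blank`, the default thresholds, and the 20 ms frame length are assumptions made for illustration; the source does not specify them.

```python
import math
from typing import List, Sequence, Tuple


def split_on_blank(
    samples: Sequence[float],
    sample_rate: int,
    low: float = 0.02,        # assumed lower bound of the voice amplitude range
    high: float = 0.95,       # assumed upper bound of the voice amplitude range
    min_blank_s: float = 0.5, # set duration a blank must reach to cause a cut
    frame_s: float = 0.02,    # assumed analysis frame length
) -> List[Tuple[float, float]]:
    """Return (start_s, end_s) of the voice segments between long blanks."""
    frame_len = max(1, round(sample_rate * frame_s))
    # A frame is blank when its peak amplitude lies outside [low, high].
    blank = []
    for i in range(0, len(samples), frame_len):
        peak = max(abs(x) for x in samples[i:i + frame_len])
        blank.append(not (low <= peak <= high))

    # Only blank runs that reach the set duration become cut points;
    # shorter pauses stay inside the surrounding voice segment.
    min_blank_frames = max(1, math.ceil(min_blank_s * sample_rate / frame_len))
    is_cut = [False] * len(blank)
    i = 0
    while i < len(blank):
        if not blank[i]:
            i += 1
            continue
        j = i
        while j < len(blank) and blank[j]:
            j += 1
        if j - i >= min_blank_frames:
            for k in range(i, j):
                is_cut[k] = True
        i = j

    # Collect the runs of frames that were not cut.
    runs, start = [], None
    for idx, cut in enumerate(is_cut):
        if not cut and start is None:
            start = idx
        elif cut and start is not None:
            runs.append((start, idx))
            start = None
    if start is not None:
        runs.append((start, len(is_cut)))

    total_s = len(samples) / sample_rate
    return [
        (s * frame_len / sample_rate, min(e * frame_len / sample_rate, total_s))
        for s, e in runs
    ]
```

Because the long blank runs are simply left out of the result, the identifiers of the blank portions never need to be shown. Judging by timbre or pitch, as the paragraph above suggests, would need spectral analysis and is not covered by this sketch.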

Second way: dividing the voice file into voice segments according to a set time interval or a set size of a single voice segment.

Specifically, the user may manually set the time interval or the size of a single voice segment. For example, when the user operates on the voice file in a specific way (such as long-pressing the voice file), a setting interface is triggered and displayed, the user manually sets the time interval or the size of a single voice segment on the setting interface, and the background divides the voice file based on the information set by the user. Alternatively, the user sets the time interval or the size of a single voice segment in advance, and when the user performs the specific operation (such as long-pressing the voice file), the voice file is divided directly according to the preset information. In addition to the first way of segmenting the voice file according to blank portions, the embodiment of the invention thus provides a user-defined way of segmentation, so that the user can segment the voice file coarsely or finely according to how well the user remembers its content, and correspondingly fewer or more items of visual description information are displayed for the user to search the voice.
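The second way reduces to cutting a total length into fixed steps. The sketch below works on integers so that the same helper serves a time interval (in milliseconds) or a segment size (in bytes). The name `split_by_step` and the choice to let the last segment be shorter are assumptions; the source does not say how a remainder is handled.

```python
from typing import List, Tuple


def split_by_step(total: int, step: int) -> List[Tuple[int, int]]:
    """Cut the range [0, total) into consecutive pieces of length `step`.

    `total` and `step` share one unit (milliseconds or bytes). The last
    piece is shorter when `total` is not a multiple of `step`.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    return [(start, min(start + step, total)) for start in range(0, total, step)]
```

For instance, a 10-second recording with a user-set interval of 4 seconds gives `split_by_step(10_000, 4_000)`, that is, two 4-second segments followed by one 2-second segment. A smaller step yields finer segmentation and therefore more identifiers on the screen.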

As shown in fig. 9, a fifth embodiment of the present invention provides a preview method based on a voice file, including:

step S201, segmenting the voice file to obtain voice fragments according to the blank part of the voice file or according to an equal division mode;

specifically, step S201 includes:

Optionally, under the condition that the voice file is divided into voice segments according to a blank portion which reaches a set time length and is contained in the voice file, the identifier of the voice segment corresponding to the blank portion is hidden.

Step S202, determining the visual description information of the voice segment according to the content of the voice segment;

specifically, step S202 includes:

Optionally, the visual description information further includes the start and end time positions of the voice segment in the voice file. In addition to visual information about the content of the voice segment, the embodiment of the invention thus provides the user with the start time and end time of the voice segment as an auxiliary reference. For example, the user can find a desired voice segment more quickly based on a roughly remembered moment at which the speech occurred. Subsequently, when the user clicks the voice segment to play it, the portion of the voice file within that start and end time range is played.

Step S203, the visual description information of each voice segment is displayed as the identifier of the corresponding voice segment.
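The visual description information used in steps S202 and S203, a character string summary together with the start and end time positions, could be modeled as below. This is only a sketch: the source does not state how the summary is extracted from the recognized text, so `summarize` simply truncates it, and the names `SegmentLabel`, `summarize`, and `format_ms` are hypothetical.

```python
from dataclasses import dataclass


def format_ms(ms: int) -> str:
    """Format a millisecond offset as MM:SS."""
    minutes, seconds = divmod(ms // 1000, 60)
    return f"{minutes:02d}:{seconds:02d}"


def summarize(text: str, max_chars: int = 12) -> str:
    """Assumed summary rule: collapse whitespace and truncate the text."""
    text = " ".join(text.split())
    if len(text) <= max_chars:
        return text
    return text[:max_chars].rstrip() + "..."


@dataclass(frozen=True)
class SegmentLabel:
    summary: str   # character string summary of the recognized text
    start_ms: int  # start position of the segment in the voice file
    end_ms: int    # end position of the segment in the voice file

    def render(self) -> str:
        """Text shown as the identifier of the voice segment."""
        return f"[{format_ms(self.start_ms)}-{format_ms(self.end_ms)}] {self.summary}"
```

A list or bubble view would display `render()` for each segment, and the stored `start_ms` and `end_ms` give the playback range when the identifier is clicked.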

The embodiment of the invention uses audio time domain waveform analysis to split the voice file into segments and can eliminate blank segments. Character string information corresponding to the content of each voice segment, or image frames from the video file corresponding to the voice segment, are presented on the screen as the identifier of that segment, so that a user can quickly find a voice segment of interest for trial listening, select and unselect segments by clicking, and rearrange and combine segments by sorting them, finally generating the rearranged and recombined voice segments. This realizes quick retrieval, trial listening, and quick editing of voice, adds to the functions of terminal products, and improves the efficiency of audio searching.

As shown in fig. 10, a sixth embodiment of the present invention provides a preview method based on a voice file, including:

step S301, segmenting the voice file to obtain voice fragments according to the blank part of the voice file or according to an equal division mode;

specifically, step S301 includes:

Step S302, determining the visual description information of the voice segment according to the content of the voice segment;

specifically, step S302 includes:

Step S303, the visual description information of each voice segment is displayed as the identifier of the corresponding voice segment.

Step S304, receiving and responding to an operation instruction aiming at the identification of each voice segment, wherein the operation instruction comprises: selected, unselected, deleted, sorted, or played.

Optionally, the step S304 further includes:

In a case that the operation instruction is play, optionally, if the voice file is from an audio and video file, the video file corresponding to the voice segment in the audio and video file is played together with the voice segment.
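Step S304 can be pictured as a small dispatcher over the list of segment identifiers. The sketch below is an assumption about one possible structure: the class names `Segment` and `SegmentBoard`, the operation strings, and the interpretation of "sort" as moving one identifier to a new position are all illustrative, and "play" only returns the time range that a player would use.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class Segment:
    seg_id: int
    start_ms: int
    end_ms: int
    selected: bool = False


class SegmentBoard:
    """Holds the displayed identifiers and applies operation instructions."""

    def __init__(self, segments: Iterable[Segment]) -> None:
        self.segments: List[Segment] = list(segments)

    def _index(self, seg_id: int) -> int:
        for i, seg in enumerate(self.segments):
            if seg.seg_id == seg_id:
                return i
        raise KeyError(seg_id)

    def handle(
        self, op: str, seg_id: int, position: Optional[int] = None
    ) -> Optional[Tuple[int, int]]:
        i = self._index(seg_id)
        seg = self.segments[i]
        if op == "select":
            seg.selected = True
        elif op == "unselect":
            seg.selected = False
        elif op == "delete":
            del self.segments[i]
        elif op == "sort":
            # Assumed meaning: move this identifier to a new list position.
            if position is None:
                raise ValueError("sort needs a target position")
            self.segments.insert(position, self.segments.pop(i))
        elif op == "play":
            # The caller plays this range of the voice file (and, for an
            # audio and video file, the matching video range as well).
            return (seg.start_ms, seg.end_ms)
        else:
            raise ValueError(f"unknown operation: {op}")
        return None
```

Keeping the order and the selected flags in one place means the final recombined file can be produced from `segments` alone, without consulting the user interface.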

A seventh embodiment of the invention provides an application example of previewing based on a voice file. In this application example, interaction means such as a touch screen or a pressure-sensitive screen are combined with voice recognition technology and a recording to provide, on a human-computer interaction interface, a method for quickly previewing a voice file and for freely editing and combining its segments, in which blanks can be skipped efficiently. When a user long-presses or force-presses a voice file, the background analyzes the audio time domain waveform, removes the waveform at the blank positions, and splits the waveform into a plurality of voice segments at those blank positions. The voice assistant acquires the audio stream of each voice segment, performs a voice-to-text operation on all the voice segments, and records the time point of each voice segment. On the human-computer interaction interface, the voice segments are turned into a plurality of bubbles distributed on the screen; a text summary of the corresponding voice segment is displayed in each bubble, and the start and stop times of the voice segment are displayed beside the bubble. In this bubble-based preliminary screening mode, the user can click the bubble of a voice segment to preview it by trial listening, and can also freely arrange and combine the bubbles, each of which has a selected state and an unselected state; finally, a new, preliminarily edited audio file is spliced together.

The implementation steps of the application example are as follows:

Step a, voice file segmentation.

When the user meets the trigger condition in a particular way, such as a force press or a long press, the system analyzes the waveform of the audio in the time domain and performs a preliminary segmentation according to the blank segments.

Step b, converting the voice of each voice segment into text.

The voice assistant uploads the information of all voice segments to the cloud and waits for the CallBack() function to return the character string for each voice segment, while the background records the start and stop times of all voice segments within the whole voice file.

Step c, presenting the voice segments on the human-computer interaction interface.

A corresponding number of bubbles is generated on the human-computer interaction interface according to the number of voice segments, the character string corresponding to the content of each voice segment, and the start and stop times of each voice segment. As shown in fig. 11, a character string summary is displayed in each bubble, and the start time is marked beside each bubble. When the user clicks a segment, a preview of that voice segment is played according to its start and stop time information. The user can also slide the bubbles within the screen area, arrange them freely, and select and unselect bubbles to combine them, so that a preliminarily screened clip file is finally generated.
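The last part of step c, generating a clip file from the selected and rearranged bubbles, amounts to concatenating the corresponding sample ranges in the user's chosen order. The following is a minimal sketch that assumes the audio is available as an in-memory sequence of samples; the name `splice_segments` is hypothetical, and encoding the result into a file format is left out.

```python
from typing import List, Sequence, Tuple


def splice_segments(
    samples: Sequence[float],
    sample_rate: int,
    ordered_ranges: Sequence[Tuple[float, float]],
) -> List[float]:
    """Concatenate the given (start_s, end_s) ranges in the order supplied.

    `ordered_ranges` is assumed to hold only the selected bubbles, already
    arranged as the user left them on the screen.
    """
    clip: List[float] = []
    for start_s, end_s in ordered_ranges:
        first = max(0, round(start_s * sample_rate))
        last = min(len(samples), round(end_s * sample_rate))
        clip.extend(samples[first:last])
    return clip
```

Since the ranges are applied in the order given, moving a bubble ahead of another changes the order of the audio in the clip, and an unselected bubble is left out simply by not passing its range.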

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.

The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims

1. A mobile terminal, characterized in that the mobile terminal comprises:

2. The mobile terminal of claim 1, wherein the segmentation module comprises:

3. The mobile terminal of claim 1, wherein the obtaining module comprises:

4. The mobile terminal of claim 3, wherein the visual descriptive information further comprises: the time starting and ending position of the voice segment in the voice file.

5. The mobile terminal of any of claims 1-4, wherein the apparatus further comprises:

a processing module, configured to receive and respond to an operation instruction for the identifier of each voice segment, where the operation instruction includes: selecting, unselecting, deleting, sorting or playing;

6. A preview method based on voice files is characterized by comprising the following steps:

7. The method for previewing based on the voice file according to claim 6, wherein said segmenting the voice file according to the blank portion of the voice file or in an equal division manner to obtain the voice segment comprises:

8. The method for previewing based on voice file according to claim 6, wherein said determining the visual description information of said voice segment according to the content of said voice segment comprises:

9. The method for previewing based on a voice file according to claim 8, wherein said visual description information further comprises: the time starting and ending position of the voice segment in the voice file.

10. The method for previewing based on the voice file according to any one of claims 6 to 9, further comprising:

receiving and responding to an operation instruction for the identifier of each of the voice segments, the operation instruction comprising: selecting, unselecting, deleting, sorting or playing;