WO2019183904A1 - Method for automatically identifying different human voices in audio

Method for automatically identifying different human voices in audio

Info

Publication number
WO2019183904A1
Authority
WO
WIPO (PCT)
Application number
PCT/CN2018/081184
Other languages
English (en)
French (fr)
Inventor
武晓芳 (Wu Xiaofang)
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority to CN201880072788.3A (published as CN111328418A)
Priority to PCT/CN2018/081184
Publication of WO2019183904A1


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction

Definitions

  • The present application relates to the field of communications technologies, and in particular, to a method for processing audio in a terminal and to a terminal.
  • In the prior art, the Recorder application can clip recorded files only by time. As shown in FIG. 1, interface 101 is the play interface of a recording file ("new recording 2") in the terminal. After the user clicks the edit button 102 on this interface, the terminal displays the editing interface 103 of the recording file, on which the user can clip part of the recorded content by time by dragging the handles 104 and 105 over the recording file.
  • That is, the terminal's way of editing a recording file is too limited: it cannot satisfy the user's processing requirements for recording files in different scenarios, which degrades the user experience.
  • The method and terminal for processing audio in a terminal provided by the present application can automatically extract the audio of different voices in a piece of audio, which helps improve the user experience.
  • In a first aspect, the method provided by the present application includes: the terminal detects a first operation on a first interface; in response to the first operation, the terminal automatically identifies the different voices to which the audio content in a first file belongs, where the first file is a file containing audio; and the terminal displays a second interface, where the different voices to which the audio content in the first file belongs have different marks in the second interface.
  • The first file is a file containing audio, and may be a pure audio file, a video file, a 3D image file, or a hologram file.
  • the microphone array can be used to locate the sound source of different sounds in the first file.
  • That is, a corresponding algorithm can be used to identify the sound source to which each sound in the first file corresponded at the time of recording. In general, a speaker's position relative to the recording device is relatively fixed, so sounds from different sound sources correspond to different people.
  • In this way, the terminal can locate the position of each sound source in the first file and determine how many people's voices the file contains. Voiceprint recognition technology can then be combined to determine the audio content corresponding to each person.
  • Thus, the terminal can automatically identify the voices of different people in the first file and separately mark the audio content corresponding to different people. The user can then quickly locate a specific person's audio, which improves the user's efficiency and the user experience.
  • That the terminal detects the first operation on the first interface is specifically: the terminal detects an operation of clicking a function button for automatically recognizing voices on the play interface or the editing interface of the first file, or an operation of selecting a menu option for automatically recognizing voices.
  • For example, the first interface may be the play interface or the editing interface (e.g., the interface shown in FIG. 3E) of the first file in an audio application in the terminal, and the interface includes a function button or menu option of "automatically recognize vocal".
  • The first operation is then the user's operation of clicking the function button or selecting the menu option.
  • an audio application refers to an application that can process a file including audio.
  • Alternatively, that the terminal detects the first operation on the first interface is: when the terminal has enabled the function of automatically recognizing voices, the terminal detects an operation of opening the first file on an interface of the audio application.
  • the first interface is a playlist interface of the audio application, or a play interface or an editing interface of the first file.
  • Before the terminal detects the operation of opening the first file on the interface of the audio application, the method further includes: detecting, on a system settings interface of the terminal, an operation of enabling the terminal's function of automatically recognizing voices; or detecting, on a settings interface of the audio application, an operation of enabling the audio application's function of automatically recognizing voices.
  • That is, the user may enable the function in advance through the system settings interface of the terminal's operating system, or the terminal may turn on its "automatically recognize vocal" function by default. Then, when the terminal detects that an audio-related application (for example, an audio application or a recording application) performs an operation of processing audio, the terminal can automatically recognize the voices in the file containing the audio.
  • Alternatively, the user may enable the "automatically recognize vocal" function in advance for a certain type of application (such as audio and video applications) or for a specific application (for example, a "Recorder" application) through the corresponding settings interface, or the terminal may enable the function for a type of application or a specific application by default. Then, when the terminal detects that such an application performs an operation of processing audio, the terminal can automatically recognize the voices in the file containing the audio.
  • Alternatively, that the terminal detects the first operation on the first interface is specifically: the terminal detects a recording instruction input by the user on an interface of a recording application; the first file is the file generated by the recording application during recording; and the second interface is the interface that the recording application displays during recording or after recording is completed.
  • the first operation may also be an operation in which the user performs recording through the recording application.
  • the first file is a file generated when the recording application is recorded in real time.
  • the recording application is an application that can record files including audio.
  • That the different voices to which the audio content in the first file belongs have different marks in the second interface includes: the segments of the time axis corresponding to the different voices to which the audio content in the first file belongs have different marks.
  • For example, the segments of the time axis corresponding to the different voices may have different colors.
  • Alternatively, the segments of the time axis corresponding to the different voices may have different avatar marks.
  • the terminal can receive the user's selection to play the complete audio of a certain person selected by the user.
  • A person's complete audio is all the audio in the first file that contains that person's voice, including the parts that coincide with other people's voices. The terminal may also automatically play each person's complete audio in the order in which the different people's voices appear in the first file. This embodiment of the present application does not limit this.
  • In a possible design, the method further includes: the terminal detects a second operation; in response to the second operation, the terminal generates a second file, where the second file contains all the audio content of a preset voice in the first file; and the terminal displays a third interface, where the third interface displays the second file.
  • the second operation is an operation in which the user selects “generate a personal recording file”.
  • The second operation may be, for example, an operation in which the user, on the play interface or the editing interface of the first file, clicks a function button of "generate a personal recording file" or selects a corresponding menu option.
  • For portions where voices coincide, the terminal may directly splice the audio of the coincident portion together with the audio clips in which each person speaks alone.
  • In this case, the user can distinguish by ear the voice that he or she needs to listen to. That is, the second file contains all of one person's audio content in the first file, and may also contain part of another person's audio.
  • For example, the audio of the coincident portion includes both A's voice and B's voice.
  • The user himself/herself decides whether to focus on A's voice or B's voice.
  • Alternatively, the terminal may perform voice separation on the audio of the coincident portion based on the sound source localization technology and/or the voiceprint recognition technology, and edit the separated audio content together with the rest of the corresponding person's audio content. In this case, the second file contains only one person's audio content from the first file.
  • In a second aspect, a terminal is provided, including: a detecting unit, configured to detect a first operation on a first interface; a processing unit, configured to automatically identify, in response to the first operation, the different voices to which the audio content in a first file belongs, where the first file is a file containing audio; and a display unit, configured to display a second interface, where the different voices to which the audio content in the first file belongs have different marks in the second interface.
  • The detecting unit is specifically configured to detect an operation of clicking a function button for automatically recognizing voices on the play interface or the editing interface of the first file, or an operation of selecting a menu option for automatically recognizing voices.
  • The detecting unit is specifically configured to detect, when the terminal has enabled the function of automatically recognizing voices, an operation of opening the first file on the interface of the audio application.
  • The detecting unit is further configured to: before detecting the operation of opening the first file on the interface of the audio application, detect, on a system settings interface of the terminal, an operation of enabling the terminal's function of automatically recognizing voices; or detect, on a settings interface of the audio application, an operation of enabling the audio application's function of automatically recognizing voices.
  • The detecting unit is specifically configured to detect a recording instruction input by the user on the interface of the recording application; the first file is the file generated by the recording application during recording; and the second interface is the interface that the recording application displays during recording or after recording is completed.
  • That the different voices to which the audio content in the first file belongs have different marks in the second interface includes: the segments of the time axis corresponding to the different voices to which the audio content in the first file belongs have different marks.
  • For example, the segments of the time axis corresponding to the different voices may have different colors.
  • Alternatively, the segments of the time axis corresponding to the different voices may have different avatar marks.
  • The detecting unit is further configured to detect a second operation; the processing unit is further configured to generate a second file in response to the second operation, where the second file contains all the audio content of a preset voice in the first file; and the display unit is further configured to display a third interface, where the third interface displays the second file.
  • the third interface is a play interface or an edit interface of the second file.
  • In a third aspect, a terminal is provided, including a processor, a memory, and a touch screen, where the memory and the touch screen are coupled to the processor, the memory is configured to store computer program code, the computer program code includes computer instructions, and the processor reads the computer instructions from the memory to perform the method described in any of the possible designs of the first aspect.
  • In a fourth aspect, a computer storage medium is provided, including computer instructions that, when run on a terminal, cause the terminal to perform the method described in any of the possible designs of the first aspect.
  • In a fifth aspect, a computer program product is provided that, when run on a computer, causes the computer to perform the method described in any of the possible designs of the first aspect.
  • FIG. 1 is a diagram showing an example of an interface of a recorder application of a terminal in the prior art
  • FIG. 2 is a schematic structural diagram 1 of a terminal provided by the present application.
  • FIG. 3A is a schematic diagram 1 of an interface example of a terminal provided by the present application.
  • FIG. 3B is a schematic diagram 2 of an interface example of a terminal provided by the present application.
  • FIG. 3C is a schematic diagram 3 of an interface example of a terminal provided by the present application.
  • FIG. 3D is a schematic diagram 4 of an interface example of a terminal provided by the present application.
  • FIG. 3E is a schematic diagram 5 of an interface example of a terminal provided by the present application.
  • FIG. 3F is a schematic diagram 6 of an interface example of a terminal provided by the present application.
  • FIG. 3G is a schematic diagram 7 of an interface example of a terminal provided by the present application.
  • FIG. 3H is a schematic diagram 8 of an interface example of a terminal provided by the present application.
  • FIG. 3I is a schematic diagram 9 of an interface example of a terminal provided by the present application.
  • FIG. 3J is a schematic diagram 10 of an interface example of a terminal provided by the present application.
  • FIG. 3K is a schematic diagram 11 of an interface example of a terminal provided by the present application.
  • FIG. 3L is a schematic diagram 12 of an interface example of a terminal provided by the present application.
  • FIG. 4 is a schematic flowchart 1 of a method for processing audio in a terminal according to the present application
  • FIG. 5 is a schematic diagram of a first file in a terminal according to the present application.
  • FIG. 6 is a schematic flowchart 2 of a method for processing audio in a terminal according to the present application.
  • FIG. 7 is a schematic structural diagram 2 of a terminal provided by the present application.
  • FIG. 8 is a schematic structural diagram 3 of a terminal provided by the present application.
  • In some scenarios, an audio file may contain the speech of multiple people, while the user may need to focus on listening to only one of the speakers.
  • For example, a recording of a meeting may contain speeches from a leader and from multiple employees, while the user needs to focus only on the leader's opinions or work arrangements.
  • At present, the user can only listen through the entire recording file, or manually drag the progress bar upon hearing an employee's speech and try to skip it. Clearly, the longer the audio file, the lower the user's efficiency, and the user experience is extremely poor.
  • To address this, the embodiment of the present application provides a method for processing audio in a terminal, which can identify the voices of multiple people in an audio file by combining sound source localization technology and/or voiceprint recognition technology. In this way, the user can listen to a specific person's audio content in a targeted manner, which improves the user experience.
  • The terminal in the present application may be a mobile phone (such as the mobile phone 100 shown in FIG. 2), a tablet computer, a personal computer (PC), a personal digital assistant (PDA), a smart watch, a netbook, a wearable electronic device, an augmented reality (AR) device, a virtual reality (VR) device, or the like, on which applications can be installed and application icons displayed; the present application does not impose any special restriction on the specific form of the terminal.
  • the mobile phone 100 is used as an example of the terminal.
  • As shown in FIG. 2, the mobile phone 100 may specifically include: a processor 101, a radio frequency (RF) circuit 102, a memory 103, a touch screen 104, a Bluetooth device 105, one or more sensors 106, a Wireless Fidelity (WI-FI) device 107, a positioning device 108, an audio circuit 109, a peripheral interface 110, and a power supply device 111. These components can communicate over one or more communication buses or signal lines (not shown in FIG. 2). It will be understood by those skilled in the art that the hardware structure shown in FIG. 2 does not constitute a limitation on the mobile phone: the mobile phone 100 may include more or fewer components than those illustrated, some components may be combined, or the components may be arranged differently.
  • The processor 101 is the control center of the mobile phone 100. It connects the various parts of the mobile phone 100 through various interfaces and lines, and performs the various functions of the mobile phone 100 and processes data by running or executing applications stored in the memory 103 and invoking data stored in the memory 103.
  • processor 101 can include one or more processing units.
  • the radio frequency circuit 102 can be used to receive and transmit wireless signals during transmission or reception of information or calls.
  • Generally, the radio frequency circuit 102 receives downlink data from a base station and delivers it to the processor 101 for processing, and sends uplink data to the base station.
  • radio frequency circuits include, but are not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier, a duplexer, and the like.
  • the radio frequency circuit 102 can also communicate with other devices through wireless communication.
  • The wireless communication can use any communication standard or protocol, including but not limited to the Global System for Mobile Communications, General Packet Radio Service, Code Division Multiple Access, Wideband Code Division Multiple Access, Long Term Evolution, e-mail, short message service, and the like.
  • the memory 103 is used to store applications and data, and the processor 101 executes various functions and data processing of the mobile phone 100 by running applications and data stored in the memory 103.
  • The memory 103 mainly includes a program storage area and a data storage area. The program storage area can store an operating system and applications required by at least one function (such as a sound playing function or an image playing function); the data storage area can store data created during the use of the mobile phone 100 (such as audio data or a phone book).
  • The memory 103 may include a high-speed random access memory (RAM), and may also include a non-volatile memory such as a magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
  • The memory 103 can store various operating systems, for example, the iOS operating system developed by Apple Inc. or the Android operating system developed by Google Inc.
  • the above memory 103 may be independent and connected to the processor 101 via the above communication bus; the memory 103 may also be integrated with the processor 101.
  • the touch screen 104 may specifically include a touch panel 104-1 and a display 104-2.
  • The touch panel 104-1 can collect touch events performed by the user of the mobile phone 100 on or near it (for example, operations performed by the user on or near the touch panel 104-1 with a finger, a stylus, or any other suitable object) and send the collected touch information to another device (for example, the processor 101).
  • A touch event performed by the user near the touch panel 104-1 may be referred to as a hover touch; a hover touch means that the user does not need to directly touch the touch panel in order to select, move, or drag a target (e.g., an icon), but only needs to be near the device to perform the desired function.
  • the touch panel 104-1 can be implemented in various types such as resistive, capacitive, infrared, and surface acoustic waves.
  • The display (also referred to as a display screen) 104-2 can be used to display information entered by the user or provided to the user, as well as the various menus of the mobile phone 100.
  • the display 104-2 can be configured in the form of a liquid crystal display, an organic light emitting diode, or the like.
  • the touchpad 104-1 can be overlaid on the display 104-2, and when the touchpad 104-1 detects a touch event on or near it, it is transmitted to the processor 101 to determine the type of touch event, and then the processor 101 may provide a corresponding visual output on display 104-2 depending on the type of touch event.
  • Although in FIG. 2 the touch panel 104-1 and the display 104-2 serve as two separate components to implement the input and output functions of the mobile phone 100, in some embodiments the touch panel 104-1 may be integrated with the display screen 104-2 to implement the input and output functions of the mobile phone 100. It should be understood that the touch screen 104 is formed by stacking multiple layers of material; only the touch panel (layer) and the display screen (layer) are described in this embodiment of the present application, and the other layers are not described.
  • Optionally, the touch panel 104-1 may cover the entire front surface of the mobile phone 100, and the display screen 104-2 may also cover the entire front surface of the mobile phone 100, so that a bezel-less front structure can be achieved.
  • the mobile phone 100 can also have a fingerprint recognition function.
  • The fingerprint collection device 112 can be configured on the back of the mobile phone 100 (e.g., below the rear camera) or on the front of the mobile phone 100 (e.g., below the touch screen 104).
  • the fingerprint collection device 112 can be configured in the touch screen 104 to implement the fingerprint recognition function, that is, the fingerprint collection device 112 can be integrated with the touch screen 104 to implement the fingerprint recognition function of the mobile phone 100.
  • The fingerprint collection device 112 disposed in the touch screen 104 may be a part of the touch screen 104, or may be disposed in the touch screen 104 in another manner.
  • the main component of the fingerprint collection device 112 in the embodiment of the present application is a fingerprint sensor, which can employ any type of sensing technology, including but not limited to optical, capacitive, piezoelectric or ultrasonic sensing technologies.
  • the mobile phone 100 may also include a Bluetooth device 105 for enabling data exchange between the handset 100 and other short-range devices (eg, mobile phones, smart watches, etc.).
  • the Bluetooth device in the embodiment of the present application may be an integrated circuit or a Bluetooth chip or the like.
  • the handset 100 can also include at least one type of sensor 106, such as a light sensor, motion sensor, and other sensors.
  • the light sensor may include an ambient light sensor and a proximity sensor, wherein the ambient light sensor may adjust the brightness of the display of the touch screen 104 according to the brightness of the ambient light, and the proximity sensor may turn off the power of the display when the mobile phone 100 moves to the ear.
  • As one type of motion sensor, the accelerometer sensor can detect the magnitude of acceleration in all directions (usually on three axes) and, when stationary, can detect the magnitude and direction of gravity. It can be used in applications that recognize the attitude of the mobile phone (such as switching between landscape and portrait screens, related games, and magnetometer attitude calibration) and in vibration-recognition-related functions (such as a pedometer or tapping).
  • The mobile phone 100 can also be configured with other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor; details are not described herein again.
  • The WI-FI device 107 is configured to provide the mobile phone 100 with network access complying with WI-FI-related standard protocols. The mobile phone 100 can access a WI-FI access point through the WI-FI device 107, helping the user send and receive e-mails, browse web pages, access streaming media, and so on; it provides the user with wireless broadband Internet access.
  • the WI-FI device 107 can also function as a WI-FI wireless access point, and can provide WI-FI network access for other devices.
  • The positioning device 108 is configured to provide a geographic location for the mobile phone 100. It can be understood that the positioning device 108 may specifically be a receiver of a positioning system such as the Global Positioning System (GPS), the BeiDou satellite navigation system, or the Russian GLONASS. After receiving the geographic location sent by the positioning system, the positioning device 108 sends the information to the processor 101 for processing, or sends it to the memory 103 for storage. In some other embodiments, the positioning device 108 can also be a receiver of an Assisted Global Positioning System (AGPS), which assists the positioning device 108 in performing ranging and positioning services by acting as an assistance server.
  • In this case, the assistance server communicates with the positioning device 108 (i.e., a GPS receiver) of a device such as the mobile phone 100 over a wireless communication network to provide positioning assistance.
  • The positioning device 108 can also use WI-FI access point-based positioning technology. Since each WI-FI access point has a globally unique Media Access Control (MAC) address, the device can scan and collect the broadcast signals of surrounding WI-FI access points when WI-FI is turned on, and thus obtain the MAC addresses broadcast by the access points. The device sends data identifying the access points (such as the MAC addresses) to a location server through the wireless communication network; the location server retrieves the geographic location of each access point, computes the device's geographic location by combining the strengths of the WI-FI broadcast signals, and sends the result to the positioning device 108 of the device.
  • the audio circuit 109, the speaker 113, and the microphone 114 can provide an audio interface between the user and the handset 100.
  • On one hand, the audio circuit 109 can convert received audio data into an electrical signal and transmit it to the speaker 113, which converts it into a sound signal for output. On the other hand, the microphone 114 converts a collected sound signal into an electrical signal, which the audio circuit 109 receives and converts into audio data; the audio data is then output to the RF circuit 102 to be sent to, for example, another mobile phone, or output to the memory 103 for further processing.
  • the terminal includes two or more microphones 114 to form a microphone array.
  • the microphone array can be used to process received speech signals, suppress noise, and improve call quality.
  • The microphone array can also be used to achieve sound source localization of speech signals, based on the differences in the time at which a speech signal reaches the respective microphones 114, so as to distinguish different human voices.
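  • For illustration only, the following Python sketch estimates such a time difference of arrival (TDOA) between two microphones using GCC-PHAT, one common TDOA method; the application does not prescribe any particular algorithm, and the function name, sampling rate, and test signal here are assumptions.

    import numpy as np

    def gcc_phat_tdoa(sig_a, sig_b, fs):
        """Estimate the time difference of arrival (seconds) of the same sound
        at two microphones, via phase-transform-weighted cross-correlation."""
        n = 2 * max(len(sig_a), len(sig_b))        # zero-pad to avoid circular wrap-around
        spec_a = np.fft.rfft(sig_a, n=n)
        spec_b = np.fft.rfft(sig_b, n=n)
        cross = spec_a * np.conj(spec_b)
        cross /= np.abs(cross) + 1e-12             # PHAT weighting: keep phase only
        corr = np.fft.irfft(cross, n=n)
        corr = np.concatenate((corr[-n // 2:], corr[:n // 2]))   # center zero lag
        lag = int(np.argmax(np.abs(corr))) - n // 2
        return lag / fs

    # Toy check: the same burst reaches microphone B five samples later.
    fs = 16000
    burst = np.random.randn(1024)
    mic_b = np.concatenate((np.zeros(5), burst))[:1024]
    print(gcc_phat_tdoa(burst, mic_b, fs))         # about -5/fs: microphone A heard it first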
  • The peripheral interface 110 is used to provide various interfaces for external input/output devices (such as a keyboard, a mouse, an external display, an external memory, or a subscriber identity module card). For example, it connects to a mouse through a Universal Serial Bus (USB) interface, and connects, through metal contacts in the card slot, to a Subscriber Identity Module (SIM) card provided by a telecommunications carrier. The peripheral interface 110 can be used to couple the external input/output peripherals described above to the processor 101 and the memory 103.
  • the mobile phone 100 may further include a power supply device 111 (such as a battery and a power management chip) that supplies power to the various components.
  • The battery may be logically connected to the processor 101 through the power management chip, so that functions such as charging management, discharging management, and power consumption management are implemented through the power supply device 111.
  • the mobile phone 100 may further include a camera (front camera and/or rear camera), a flash, a micro projection device, a near field communication (NFC) device, and the like, and details are not described herein.
  • the technical solution provided by the embodiment of the present application can be applied to a process in which a terminal processes a file (for example, an audio file or a video file) that includes audio.
  • The following uses the process in which the "Recorder" application in the terminal processes a recording file as an example to describe the solution provided by the embodiments of the present application.
  • FIG. 3A to FIG. 3G show an example of the terminal interfaces involved in the process in which the "Recorder" application processes a recording file according to the technical solution provided by the embodiment of the present application.
  • the user can enter the main interface of the “recorder” application by clicking the icon 301 of the “recorder” application.
  • the main interface of the "Recorder” application the user can view all of the recorded files in the "Recorder” application by clicking on the "Record File” button 302.
  • As shown in FIG. 3C, the terminal displays the list of recording files included in the "Recorder".
  • the user can enter the play interface of the recording file by clicking the "new recording 1" button 303.
  • FIG. 3D shows the playback interface of "new recording 1".
  • the interface includes an "Edit” button 304.
  • the user can enter the editing interface for the "New Record 1" file by clicking the "Edit” button 304.
  • FIG. 3E shows the editing interface of "new recording 1".
  • The editing interface includes an "automatically recognize vocal" button 305. By clicking this button, the user can have the terminal use voiceprint recognition technology and/or sound source localization technology to automatically identify the multiple human voices contained in the recording file and distinguish the recorded content corresponding to the different voices.
  • For the specific implementation process of the voiceprint recognition technology and/or the sound source localization technology, refer to the description in step S102; details are not described herein.
  • the editing interface also includes function buttons such as “transfer text”, “share” and “delete”.
  • the "transfer text” button enables the terminal to convert the voice signal in the recorded file into text information by using, for example, voice recognition technology.
  • the "Share” button can be used to forward the recorded file, for example, via SMS, email, WeChat, etc.
  • the Delete button is used to remove the recording file from the Sound Recorder application.
  • the embodiment of the present application does not limit the interface content and the specific interface form of the editing interface.
  • As shown in FIG. 3F, the terminal displays the interface after automatically recognizing the different voices in the recording file "new recording 1".
  • The progress bar of the recording file in this interface is divided into several parts, where 306 corresponds to one person's recorded content (denoted A's recorded content) and 307 corresponds to another person's recorded content (denoted B's recorded content).
  • 308 corresponds to recorded content of both A and B (that is, A and B speak at the same time during this part of the recording). Illustratively, 306, 307, and 308 can be marked with different colors, or other annotations can be used.
  • the embodiment of the present application does not limit the manner in which the terminal labels the voices of different people.
  • The user can manually adjust playback by dragging the dot on the progress bar. Since the progress bar clearly indicates which period of time corresponds to which person's recorded content, the user can quickly and accurately play a specific person's recorded content by dragging the dot on the progress bar.
  • FIG. 3G is a schematic diagram of the user dragging the progress bar to play A's recorded content.
  • FIG. 3H is a schematic diagram of the user dragging the progress bar to play B's recorded content.
  • It should be noted that, because 308 is recorded content corresponding to both A and B, the content of 308 is played both when A's recorded content is played and when B's recorded content is played.
  • the terminal can also automatically play a complete recording content of a certain person according to the user's selection, that is, the user does not need to manually drag the progress bar to play.
  • the specific manner of playing the personal recording content of the terminal is not limited in the embodiment of the present application.
  • Further, the user may click the "Generate Personal Recording File" button 309, so that the terminal generates, from the recording file "new recording 1", different recording files according to different people's voices.
  • generating different recording files according to different people's voices may be to generate a new recording file or replace the original recording file.
  • An option 310 for the user to select whose recording file to generate is shown in FIG. 3J.
  • FIG. 3K shows the playback interface of the "recording file of A in new recording 1" generated by the terminal from "new recording 1".
  • It can be seen that A's recording file includes the recorded content corresponding to 306 and 308 in "new recording 1".
  • the interface also includes a "pause” button, a "headphone mode” button, a "tag” button, a "turn text” button, and a "share” button. That is to say, the terminal can process the newly generated personal recording file in the same way as the original recording file.
  • the interface content and interface form of the interface are not limited in this embodiment of the present application.
  • The user can also switch from the playback interface of the "recording file of A in new recording 1" to the playback interface of the "recording file of B in new recording 1" by, for example, sliding to the right.
  • FIG. 3L shows the playback interface of the "recording file of B in new recording 1" generated by the terminal from "new recording 1". It can be seen that B's recording file includes the recorded content corresponding to 307 and 308 in "new recording 1".
  • FIG. 4 is a flowchart of a method for processing audio in a terminal according to an embodiment of the present application; the method specifically includes:
  • S101. The terminal detects a first operation on a first interface.
  • The first interface may be the play interface or the editing interface (e.g., the interface shown in FIG. 3E) of the first file in an audio application in the terminal, and the interface includes a function button or menu option of "automatically recognize vocal"; the first operation is then the user's operation of clicking the function button or selecting the menu option.
  • an audio application refers to an application that can process a file including audio.
  • The first file is a file containing audio, and may be a pure audio file, a video file, a 3D image file, or a hologram file.
  • Alternatively, the user may enable the function in advance through the system settings interface of the terminal's operating system, or the terminal may turn on its "automatically recognize vocal" function by default. Then, when the terminal detects that an audio-related application (for example, an audio application or a recording application) performs an operation of processing audio, the terminal can automatically recognize the voices in the file containing the audio.
  • the recording application is an application that can record files including audio.
  • the first operation may be an operation in which the user opens the first file through the audio application.
  • In this case, the first interface is a playlist interface of the audio application, or a play interface or an editing interface displayed before the first file is opened.
  • the first operation may also be an operation in which the user performs recording through the recording application.
  • the first file is a file generated when the recording application is recorded in real time
  • In this case, the first interface is the interface presented by the recording application before the user inputs the recording instruction, and the user can input the recording instruction on this interface.
  • The first file is a file containing audio, and may be a pure audio file, a video file, a 3D image file, or a hologram file.
  • Alternatively, the user may enable the "automatically recognize vocal" function in advance for a certain type of application (such as audio and video applications) or for a specific application (for example, a "Recorder" application) through the corresponding settings interface, or the terminal may enable the function for a type of application or a specific application by default. Then, when the terminal detects that such an application performs an operation of processing audio, the terminal can automatically recognize the voices in the file containing the audio.
  • The interface content and interface form of the first interface are not limited in this embodiment of the present application, nor is the operation mode of the first operation.
  • S102. In response to the first operation, the terminal identifies the audio content of the different human voices included in the first file.
  • The first file is a file containing audio, and may be, for example, an audio file or a video file. Audio files may specifically include recording files, music files, and the like.
  • Specifically, the terminal may use voiceprint recognition technology and/or sound source localization technology to automatically identify the voices of the multiple people included in the first file and distinguish the audio content corresponding to the different voices.
  • In daily life, when a mobile phone is used for communication, it is disturbed by noise and reverberation, and the signal collected by the microphone is not a pure speech signal. To enhance the speech signal and improve call quality, mobile phones usually use microphone array technology. Microphone array technology forms an array of multiple microphones arranged according to certain rules. When speech and environment information is collected by the multiple microphones, the microphone array can, by adjusting the filter coefficients of each channel, effectively form a beam directed at the target sound source in a specific direction, enhancing the signal within the beam and suppressing signals outside it, thereby extracting the sound source and suppressing noise at the same time. It is also because the mobile phone has a microphone array that the array can be used to locate the sound sources of the different sounds in the first file.
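  • As a concrete, simplified instance of the beam steering described above (the application does not prescribe an algorithm), the following sketch shows a delay-and-sum beamformer with integer-sample steering delays; the names and parameters are illustrative.

    import numpy as np

    def delay_and_sum(channels, steer_delays):
        """Crude delay-and-sum beamformer. channels: equal-length 1-D arrays,
        one per microphone; steer_delays: per-channel delay in samples that
        compensates the target direction's propagation delay, so that sound
        from the steered direction adds coherently and other sound does not."""
        out = np.zeros(len(channels[0]))
        for ch, d in zip(channels, steer_delays):
            out += np.roll(ch, -d)                 # integer-sample alignment only
        return out / len(channels)

    # e.g. if the target reaches channel 1 three samples after channel 0:
    # enhanced = delay_and_sum([ch0, ch1], steer_delays=[0, 3])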
  • That is, a corresponding algorithm can be used to identify the sound source to which each sound in the first file corresponded at the time of recording. In general, a speaker's position relative to the recording device is relatively fixed, so sounds from different sound sources correspond to different people.
  • The sound source localization technology includes three types: localization based on high-resolution spectral estimation, localization based on steerable beamforming, and localization based on time difference of arrival (TDOA).
  • the positioning of the sound source is described herein based on the time difference of arrival technique.
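  • For a pair of microphones with spacing d, a far-field source direction follows from the measured time difference τ as θ = arcsin(c·τ/d). A one-function sketch with illustrative numbers (the spacing and sampling rate are assumptions, not from the application):

    import numpy as np

    def doa_angle_deg(tdoa_s, mic_spacing_m, c=343.0):
        """Far-field direction of arrival, in degrees from broadside,
        clamped to [-1, 1] to tolerate TDOA measurement noise."""
        return float(np.degrees(np.arcsin(np.clip(c * tdoa_s / mic_spacing_m, -1.0, 1.0))))

    print(doa_angle_deg(5 / 16000, 0.15))          # ~45.6 degrees for 15 cm spacing at 16 kHz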
  • In this way, the terminal can locate the position of each sound source in the first file and determine how many people's voices the file contains. Voiceprint recognition technology can then be combined to determine the audio content corresponding to each person.
  • The voiceprint recognition technology refers to a technique that distinguishes a speaker's identity (i.e., identifies different people's voices) through voice parameters, reflected in the speech waveform, of the speaker's physiological and behavioral characteristics. Specifically, if a person's voiceprint template is stored in the mobile phone, the mobile phone can compare that voiceprint template with the audio in the first file to confirm the audio content corresponding to the template.
  • If no voiceprint template is stored, the mobile phone may, based on a certain segment of audio in the first file (some or all of the audio attributed to one sound source by sound source localization), extract the voiceprint features of that audio and create a new voiceprint template. The remaining audio in the first file is then compared with the voiceprint template to confirm the audio content corresponding to the newly created voiceprint template.
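  • The application does not specify which voiceprint features are used. Purely to illustrate the enroll-then-scan flow above, the following sketch stands in with a mean log-magnitude spectrum as the "voiceprint template" and cosine similarity as the comparison; real systems use far richer features (e.g., MFCCs or learned embeddings), and the 0.9 threshold below is an arbitrary assumption.

    import numpy as np

    def spectral_template(audio, frame=512):
        """Crude stand-in for a voiceprint template: the mean log-magnitude
        spectrum over fixed-length frames of the enrollment audio."""
        frames = [audio[i:i + frame] for i in range(0, len(audio) - frame + 1, frame)]
        specs = [np.log(np.abs(np.fft.rfft(f)) + 1e-9) for f in frames]
        return np.mean(specs, axis=0)

    def voiceprint_similarity(template, audio, frame=512):
        """Cosine similarity between an enrolled template and a candidate segment."""
        probe = spectral_template(audio, frame)
        return float(np.dot(template, probe) /
                     (np.linalg.norm(template) * np.linalg.norm(probe) + 1e-12))

    # Enroll on a segment attributed to one sound source, then scan the rest:
    # template = spectral_template(segment_of_speaker_a)
    # same_person = voiceprint_similarity(template, candidate) > 0.9   # illustrative threshold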
  • Considering that multiple people's voices in the first file may overlap during a certain period of time, the mobile phone also needs to be able to recognize the portions of audio in which multiple voices coincide.
  • As shown in FIG. 5, the MN part is the audio content corresponding to A, the PQ part is the audio content corresponding to B, and the PN part is the audio content in which A and B speak simultaneously.
  • The mobile phone can determine that the PN part is where A's and B's voices coincide as follows: at point M, A starts to speak, and the mobile phone determines the position of that sound source by sound source localization; at point P, while A is still speaking, B starts to speak, and the mobile phone now determines two sound source positions, namely A's and B's, so the simultaneous-speech portion starts at point P; from point N on, the mobile phone determines only B's sound source position, so B speaks alone from point N. Therefore, the PN portion corresponds to the part where A's and B's voices coincide. It should be noted that the overlapping portions of more than two people's voices can also be identified by this method; details are not repeated here.
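  • The bookkeeping just described can be sketched as follows, under the assumption (not stated in the application) that the localization stage already yields the set of active speakers for each short frame:

    def label_intervals(active_per_frame, frame_s=0.02):
        """Collapse per-frame speaker sets into labeled time intervals, so that
        runs of {'A'} become A-only segments and runs of {'A', 'B'} become
        overlap segments such as the PN part in FIG. 5. Silent frames are
        represented by empty sets and skipped."""
        intervals, start, current = [], 0.0, None
        for i, speakers in enumerate(active_per_frame + [None]):   # sentinel closes last run
            label = frozenset(speakers) if speakers is not None else None
            if label != current:
                if current:                        # close the previous non-silent run
                    intervals.append((start, i * frame_s, set(current)))
                start, current = i * frame_s, label
        return intervals

    frames = [{'A'}] * 3 + [{'A', 'B'}] * 2 + [{'B'}] * 3          # toy M..P..N..Q timeline
    print(label_intervals(frames))
    # e.g. [(0.0, 0.06, {'A'}), (0.06, 0.1, {'A', 'B'}), (0.1, 0.16, {'B'})]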
  • Alternatively, the mobile phone can determine that the PN part is the coincident part of A's and B's voices as follows: by sound source localization and/or voiceprint recognition, the mobile phone determines that the MP part is A's audio content and the NQ part is B's audio content. If the mobile phone does not recognize the audio content corresponding to the PN portion, it can infer from the continuity of the audio content before and after that the PN portion corresponds to the audio content of both A and B, that is, the audio of the portion where A's and B's voices overlap.
  • It should be noted that the sound source localization technology alone may be used to determine the audio content corresponding to each voice, the voiceprint recognition technology alone may be used, or the two technologies may be combined; this is not limited in this embodiment of the present application.
  • S103. The terminal displays a second interface.
  • The second interface is a play interface or an editing interface (for example, the interface shown in FIG. 3F) of the first file in the audio application, or an interface displayed after the first file has been played or edited in the audio application.
  • On the second interface, the audio content of the different voices in the first file is separately marked.
  • The marking may, for example, display the audio corresponding to different voices in different colors on the progress bar of the first file, where portions in which multiple people's voices overlap may also be identified by their own colors. Alternatively, when the user drags the dot on the progress bar, the avatar of the person corresponding to the dot's current position may be displayed.
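  • One possible way to realize such marks, sketched under assumptions (the Span structure, colors, and avatar text are illustrative, not from the application), is to map each labeled interval to a span that the play interface renders on the progress bar:

    from dataclasses import dataclass

    @dataclass
    class Span:
        start_s: float
        end_s: float
        color: str                                 # one color per voice; overlaps get their own
        avatar: str                                # or an avatar mark, as described above

    PALETTE = {frozenset({'A'}): '#4c8bf5',
               frozenset({'B'}): '#f5a623',
               frozenset({'A', 'B'}): '#9b59b6'}   # distinct color for the coincident portion

    def spans_for_progress_bar(intervals):
        """intervals: (start_s, end_s, speaker_set) triples, e.g. from label_intervals()."""
        return [Span(s, e, PALETTE[frozenset(who)], '+'.join(sorted(who)))
                for s, e, who in intervals]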
  • the first file is a file generated when the recording application is recorded in real time
  • the second interface may be a play interface or an editing interface of the first file.
  • The second interface may also be the interface displayed after the recording application finishes recording the first file, and the user can directly play the first file on that interface.
  • This embodiment of the present application does not limit the interface form or interface content of the second interface, or the manner of marking different voices.
  • the terminal can receive the user's selection to play the complete audio of a certain person selected by the user.
  • A person's complete audio is all the audio in the first file that contains that person's voice, including the parts that coincide with other people's voices. The terminal may also automatically play each person's complete audio in the order in which the different people's voices appear in the first file. This embodiment of the present application does not limit this.
  • Thus, the terminal can automatically identify the voices of different people in the first file and separately mark the audio content corresponding to different people. The user can then quickly locate a specific person's audio, which improves the user's efficiency and the user experience.
  • Further, the terminal may generate new audio files that each contain only one person's voice. That is, after step S102, the method provided by this embodiment of the present application further includes:
  • S201. The terminal receives a second operation.
  • the second operation is an operation in which the user selects “generate a personal recording file”.
  • The second operation may be, for example, an operation in which the user, on the play interface or the editing interface of the first file, clicks a function button of "generate a personal recording file" or selects a corresponding menu option.
  • the second operation may also be an operation of clicking a function button of “generating a personal recording file” or selecting a menu option on the second interface.
  • This embodiment of the present application does not limit the interface on which the second operation is received or the specific manner of the second operation.
  • S202. In response to the second operation, the terminal generates a second file.
  • the second file contains all the audio content of one person in the first file.
  • the second file may include only all the audio content of one person, and the second file may also include part of the audio content of another person in addition to all the audio content of one person. This embodiment of the present application does not limit this.
  • In a specific implementation, the first file may be copied, and the copied first file re-edited. Specifically, the terminal may edit all the audio content corresponding to one person in the first file into one file.
  • For example, the "recording file of A in new recording 1" is obtained by editing the audio content corresponding to A in "new recording 1" into one recording file, specifically including the audio portion in which A speaks alone (306) and the audio portion in which A and B speak simultaneously (308).
  • Similarly, the "recording file of B in new recording 1" is obtained by editing the audio content corresponding to B in "new recording 1" into one recording file, specifically including the audio portion in which B speaks alone (307) and the audio portion in which A and B speak simultaneously (308).
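  • The editing described above amounts to cutting out every interval attributed to one person (solo plus coincident, i.e., 306 and 308 for A) and joining them in order. A sketch under the assumption that labeled intervals and the raw samples are already available (the WAV writing step is likewise an illustrative assumption):

    import numpy as np
    import wave

    def personal_track(samples, fs, intervals, person='A'):
        """Join every segment in which `person` speaks, alone or overlapping,
        into one waveform (like the "recording file of A in new recording 1").
        intervals: (start_s, end_s, speaker_set) triples."""
        parts = [samples[int(s * fs):int(e * fs)]
                 for s, e, who in intervals if person in who]
        return np.concatenate(parts) if parts else np.zeros(0, dtype=samples.dtype)

    def write_wav(path, samples, fs):
        """Store the personal recording as 16-bit mono WAV; assumes float
        samples in [-1, 1]."""
        with wave.open(path, 'wb') as w:
            w.setnchannels(1)
            w.setsampwidth(2)
            w.setframerate(fs)
            w.writeframes(np.int16(np.clip(samples, -1, 1) * 32767).tobytes())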
  • For portions where voices coincide, the terminal may directly splice the audio of the coincident portion together with the audio clips in which each person speaks alone.
  • In this case, the user can distinguish by ear the voice that he or she needs to listen to. That is, the second file contains all of one person's audio content in the first file, and may also contain part of another person's audio.
  • For example, the audio of the coincident portion includes both A's voice and B's voice.
  • The user himself/herself decides whether to focus on A's voice or B's voice.
  • Alternatively, the terminal may perform voice separation on the audio of the coincident portion based on the sound source localization technology and/or the voiceprint recognition technology, and edit the separated audio content together with the rest of the corresponding person's audio content. In this case, the second file contains only one person's audio content from the first file.
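  • The application leaves the separation method open. Purely as a toy illustration of the idea, the following sketch keeps, for one speaker, only the frequency bins in which that speaker's enrolled spectral template (see the earlier voiceprint sketch) dominates the other's; practical separation (e.g., beamforming toward the localized position, or neural source separation) is far more sophisticated.

    import numpy as np

    def mask_separate(mix, template_a, template_b, frame=512):
        """Toy binary-mask 'separation' of a coincident portion: keep a bin for
        speaker A wherever A's enrolled log-spectrum exceeds B's. Heavily
        simplified; one fixed mask value per frequency bin."""
        mask = (template_a >= template_b).astype(float)
        out = np.zeros(len(mix))
        for i in range(0, len(mix) - frame + 1, frame):    # non-overlapping frames
            spec = np.fft.rfft(mix[i:i + frame])
            out[i:i + frame] = np.fft.irfft(spec * mask, n=frame)
        return out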
  • It should be noted that the terminal may generate only one person's personal audio file (for example, A's recording file) according to the user's selection, or the terminal may automatically generate a personal audio file for each person included in the first file, producing multiple personal audio files such as A's recording file and B's recording file. This embodiment of the present application does not limit this.
  • S203. The terminal displays a third interface.
  • the third interface is a play interface of the second file (for example, the interface shown in FIG. 3K or the interface shown in FIG. 3L).
  • the third interface may further include a “pause” button, a “headphone mode” button, a “tag” button, a “turn text” button, and a “share” button.
  • the "turn text” button enables the terminal to convert the voice signal in the second file into text information by using, for example, voice recognition technology.
  • the "Share” button can be used to forward the second file, for example, by SMS, email, WeChat, etc.
  • the "Delete” button is used to delete the second file from the current application or memory. The embodiment of the present application does not limit the interface content and the specific interface form included in the third interface.
  • To implement the above functions, the terminal and the like include corresponding hardware structures and/or software modules for performing each function.
  • Those skilled in the art should be aware that the units and algorithm steps of the examples described in the embodiments disclosed herein can be implemented by hardware or by a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the particular application and the design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each particular application, but such implementation should not be considered beyond the scope of the embodiments of the invention.
  • The embodiment of the present application may divide the terminal and the like into function modules according to the foregoing method examples. For example, each function module may correspond to one function, or two or more functions may be integrated into one processing module.
  • the above integrated modules can be implemented in the form of hardware or in the form of software functional modules. It should be noted that the division of the module in the embodiment of the present invention is schematic, and is only a logical function division, and the actual implementation may have another division manner.
  • FIG. 7 shows a possible structural diagram of the terminal involved in the above embodiment.
  • the terminal 1000 includes a detecting unit 1001, a processing unit 1002, and a display unit 1003.
  • the detecting unit 1001 is configured to support the terminal to perform step S101 in FIG. 4, step S201 in FIG. 6, and/or other processes for the techniques described herein.
  • The processing unit 1002 is configured to support the terminal in performing step S102 in FIG. 4, step S202 in FIG. 6, and/or other processes for the techniques described herein.
  • the display unit 1003 is configured to support the terminal to perform step S103 in FIG. 4, step S203 in FIG. 6, and display the terminal interface in FIGS. 3A through 3L, and/or other processes for the techniques described herein.
  • the terminal 1000 may further include a communication unit for the terminal to interact with other devices.
  • the specific functions that can be implemented by the foregoing functional units include, but are not limited to, the functions corresponding to the method steps described in the foregoing examples.
  • the terminal 1000 may further include a storage unit for storing the first file and the second file in the terminal, and program codes, data, and the like in the terminal.
  • In the case of integrated units, the above detecting unit 1001 and processing unit 1002 may be integrated together as a processing module of the terminal.
  • The above communication unit may be a communication module of the terminal, such as an RF circuit, a Wi-Fi module, or a Bluetooth module.
  • The above display unit 1003 may be the display of the terminal.
  • The above storage unit may be a storage module of the terminal.
  • FIG. 8 is a schematic diagram of a possible structure of the terminal involved in the above embodiments.
  • The terminal 1100 includes a processing module 1101, a storage module 1102, and a communication module 1103.
  • The processing module 1101 is configured to control and manage the actions of the terminal.
  • The storage module 1102 is configured to store the program code and data of the terminal.
  • The communication module 1103 is configured to communicate with other terminals.
  • The processing module 1101 may be a processor or a controller, for example, a central processing unit (CPU), a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or execute the various exemplary logical blocks, modules, and circuits described in connection with the present disclosure.
  • The processor may also be a combination implementing computing functions, for example, a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
  • The communication module 1103 may be a transceiver, a transceiver circuit, a communication interface, or the like.
  • The storage module 1102 may be a memory.
  • When the processing module 1101 is a processor (such as the processor 101 shown in FIG. 2), the communication module 1103 is an RF transceiver circuit (such as the RF circuit 102 shown in FIG. 2), and the storage module 1102 is a memory (such as the memory 103 shown in FIG. 2), the terminal provided by the embodiments of the present application may be the terminal 100 shown in FIG. 2.
  • The communication module 1103 may include not only an RF circuit but also a Wi-Fi module and a Bluetooth module. Communication modules such as the RF circuit, the Wi-Fi module, and the Bluetooth module may be collectively referred to as a communication interface. The above processor, communication interface, and memory may be coupled together through a bus.
  • In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus, and method may be implemented in other manners.
  • The device embodiments described above are merely illustrative.
  • The division of the modules or units is only a logical function division, and there may be other division manners in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed.
  • The mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, devices, or units, and may be in electrical, mechanical, or other forms.
  • The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • Each functional unit in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • The above integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
  • If the integrated unit is implemented in the form of a software functional unit and sold or used as a standalone product, it may be stored in a computer-readable storage medium.
  • The computer software product stored in such a medium includes a number of instructions for causing a computer device (which may be a personal computer, a server, or a network device) or a processor to perform all or part of the steps of the methods described in the embodiments of the present application.
  • The foregoing storage medium includes various media that can store program code, such as a flash memory, a removable hard disk, a read-only memory, a random access memory, a magnetic disk, or an optical disc.

Abstract

The method for automatically identifying different human voices in audio provided by the present application relates to the field of communications technologies, can automatically identify the different human voices in audio, and helps to improve the user experience. The method specifically includes: a terminal detects a first operation on a first interface; in response to the detected first operation, the terminal automatically identifies the human voices to which the audio content in a first file belongs; and the terminal displays a second interface, where the different human voices to which the audio content in the first file belongs have different marks in the second interface.

Description

Method for automatically identifying different human voices in audio

Technical Field

The present application relates to the field of communications technologies, and in particular, to a method for audio processing in a terminal and to a terminal.

Background

Generally, a user often uses the "Recorder" application of a mobile phone to record an important conversation, meeting, call, or the like, producing a recording file. Afterwards, the user can play the recording file repeatedly to listen to its important content and avoid missing key information.

At present, the "Recorder" application can clip a recording file by time. FIG. 1 shows a play interface 101 of a recording file ("New Recording 2") in a terminal. On this interface, after the user taps the edit button 102, the terminal displays an editing interface 103 of the recording file. On this interface, the user can clip part of the recorded content by time by dragging 104 and 105.

In the prior art, the manner in which a terminal edits a recording file is too limited to meet the user's needs for processing recording files in different scenarios, which affects the user experience.
Summary

The present application provides a method for audio processing in a terminal and a terminal, which can automatically extract the audio of the different human voices in audio and help to improve the user experience.

According to a first aspect, the method provided by the present application includes: a terminal detects a first operation on a first interface; in response to the first operation, the terminal automatically identifies different human voices to which the audio content in a first file belongs, where the first file is a file containing audio; and the terminal displays a second interface, where the different human voices to which the audio content in the first file belongs have different marks in the second interface.

The first file is a file containing audio, and may be a pure audio file, a video file, a 3D image file, a holographic image file, or the like.

For example, a microphone array may be used to locate the sound sources of the different sounds in the first file. In other words, for a first file recorded with a microphone array, a corresponding algorithm can identify the sound source to which each sound in the first file corresponded at recording time. In a typical recording setting, each speaker's position relative to the recording device is relatively fixed, so the sounds corresponding to different sound sources correspond to different people.

For the first file, the terminal can locate the position of each sound source in the first file and thereby determine how many people's voices the recording file contains. Voiceprint recognition technology can then be combined to determine the audio content corresponding to each person.

It can be seen that, with the method provided by the embodiments of the present application, for a first file containing audio, the terminal can automatically identify the voices of the different people in the first file and mark the audio content corresponding to each person separately. In this way, the user can quickly locate the audio position of a specific person, which improves the user's work efficiency and the user experience.
In a possible design, that the terminal detects the first operation on the first interface is specifically: the terminal detects, on a play interface or an editing interface of the first file, an operation of tapping a function button for automatically identifying human voices, or an operation of selecting a menu option for automatically identifying human voices.

In some examples, the first interface may be a play interface or an editing interface of the first file in an audio application in the terminal (for example, the interface shown in FIG. 3E), and that interface contains an "Automatically identify human voices" function button or menu option; the first operation is then the user's operation of tapping the function button or selecting the menu option. An audio application is an application that can process files containing audio.

In a possible design, that the terminal detects the first operation on the first interface is specifically: when the terminal has enabled the function of automatically identifying human voices, the terminal detects, on an interface of an audio application, an operation of opening the first file.

In this case, the first interface is a playlist interface of the audio application, a play interface or an editing interface of the first file, or the like.

In a possible design, before the terminal detects, on the interface of the audio application, the operation of opening the first file, the method further includes: detecting, on a system settings interface of the terminal, an operation of enabling the terminal's function of automatically identifying human voices; or detecting, on a settings interface of the audio application, an operation of enabling the audio application's function of automatically identifying human voices.

In some examples, the user may enable the terminal's "Automatically identify human voices" function in advance through the system settings interface of the terminal's operating system, or the terminal may enable the function by default. Then, when the terminal detects that an audio-related application (for example, an audio application or a recording application) performs an audio-processing operation, it can automatically identify human voices in files containing audio.

In some examples, the user may enable the "Automatically identify human voices" function for a category of applications (for example, audio and video applications) or for a specific application (for example, the "Recorder" application) in advance through the corresponding settings interface, or the terminal may enable the function for that category or application by default. Then, when the terminal detects that such an application performs an audio-processing operation, it can automatically identify human voices in files containing audio.

In a possible design, that the terminal detects the first operation on the first interface is specifically: the terminal detects, on an interface of a recording application, a recording instruction input by the user; the first file is a file generated by the recording application during recording; and the second interface is an interface displayed by the recording application during recording or after recording is completed.

For example, the first operation may also be the user's operation of recording through a recording application. In this case, the first file is a file generated by the recording application during real-time recording. An audio application is an application that can process files containing audio; a recording application is an application that can record files containing audio.
In a possible design, that the different human voices to which the audio content in the first file belongs have different marks in the second interface includes: the time axes corresponding to the different human voices to which the audio content in the first file belongs have different marks.

In a possible design, that the time axes corresponding to the different human voices to which the audio content in the first file belongs have different marks includes: the time axes corresponding to the different human voices to which the audio content in the first file belongs have marks of different colors.

In a possible design, that the time axes corresponding to the different human voices to which the audio content in the first file belongs have different marks includes: the time axes corresponding to the different human voices to which the audio content in the first file belongs have marks of different avatars.

For example, when displaying the second interface, the terminal may receive the user's selection and play the complete audio of the person selected by the user. A person's complete audio is all the audio in the first file that contains that person's voice, including the parts that overlap with other people's voices. The terminal may also automatically play each person's complete audio in the order in which the different people's voices appear in the first file. This is not limited in the embodiments of the present application.
In a possible design, after the terminal automatically identifies the different human voices to which the audio content in the first file belongs, the method further includes: the terminal detects a second operation; in response to the second operation, the terminal generates a second file, where the second file contains all the audio content of one preset human voice in the first file; and the terminal displays a third interface on which the second file is displayed.

The second operation is the user's operation of selecting "generate personal recording file". For example, the second operation may be the user's operation, on the play interface or editing interface of the first file, of tapping a "Generate personal recording file" function button or selecting a corresponding menu option.

It should be noted that, during clipping, for the audio in the first file in which multiple people's voices overlap, in some examples the terminal may directly clip the overlapping audio together with the audio in which each person speaks alone. During playback, the user can identify by ear the voice he or she needs to listen to. In other words, in addition to all of one person's audio content from the first file, the second file may also contain part of another person's audio.

For example, on the play interface shown in FIG. 3K, when the terminal plays the recording of part 308, the audio of this part includes both A's voice and B's voice, and the user decides whether to listen to A's voice or B's voice.

For the audio in the first file in which multiple people's voices overlap, the terminal may alternatively separate the human voices in the overlapping audio based on sound source localization technology and/or voiceprint recognition technology, and edit the separated audio content together with the rest of the corresponding person's audio content. In other words, the second file then contains only one person's audio content from the first file.
According to a second aspect, a terminal includes: a detecting unit, configured to detect a first operation on a first interface; a processing unit, configured to automatically identify, in response to the first operation, different human voices to which the audio content in a first file belongs, where the first file is a file containing audio; and a display unit, configured to display a second interface, where the different human voices to which the audio content in the first file belongs have different marks in the second interface.

In a possible design, the detecting unit is specifically configured to detect, on a play interface or an editing interface of the first file, an operation of tapping a function button for automatically identifying human voices, or an operation of selecting a menu option for automatically identifying human voices.

In a possible design, the detecting unit is specifically configured to detect, on an interface of an audio application, an operation of opening the first file when the terminal has enabled the function of automatically identifying human voices.

In a possible design, the detecting unit is further configured to, before the operation of opening the first file is detected on the interface of the audio application: detect, on a system settings interface of the terminal, an operation of enabling the terminal's function of automatically identifying human voices; or detect, on a settings interface of the audio application, an operation of enabling the audio application's function of automatically identifying human voices.

In a possible design, the detecting unit is specifically configured to detect, on an interface of a recording application, a recording instruction input by the user; the first file is a file generated by the recording application during recording; and the second interface is an interface displayed by the recording application during recording or after recording is completed.

In a possible design, that the different human voices to which the audio content in the first file belongs have different marks in the second interface includes: the time axes corresponding to the different human voices to which the audio content in the first file belongs have different marks.

In a possible design, that the time axes corresponding to the different human voices to which the audio content in the first file belongs have different marks includes: the time axes corresponding to the different human voices have marks of different colors.

In a possible design, that the time axes corresponding to the different human voices to which the audio content in the first file belongs have different marks includes: the time axes corresponding to the different human voices have marks of different avatars.

In a possible design, the detecting unit is further configured to detect a second operation; the processing unit is further configured to generate, in response to the second operation, a second file, where the second file contains all the audio content of one preset human voice in the first file; and the display unit is further configured to display a third interface on which the second file is displayed.

In a possible design, the third interface is a play interface or an editing interface of the second file.
According to a third aspect, a terminal includes a processor, a memory, and a touchscreen, where the memory and the touchscreen are coupled to the processor, the memory is configured to store computer program code, and the computer program code includes computer instructions; when the processor reads the computer instructions from the memory, the terminal performs the method in any one of the possible designs of the first aspect.

According to a fourth aspect, a computer storage medium includes computer instructions that, when run on a terminal, cause the terminal to perform the method in any one of the possible designs of the first aspect.

According to a fifth aspect, a computer program product, when run on a computer, causes the computer to perform the method in any one of the possible designs of the first aspect.
Brief Description of the Drawings

FIG. 1 is an example diagram of an interface of a recorder application of a terminal in the prior art;
FIG. 2 is a first schematic structural diagram of a terminal according to the present application;
FIG. 3A is a first schematic diagram of an example interface of a terminal according to the present application;
FIG. 3B is a second schematic diagram of an example interface of a terminal according to the present application;
FIG. 3C is a third schematic diagram of an example interface of a terminal according to the present application;
FIG. 3D is a fourth schematic diagram of an example interface of a terminal according to the present application;
FIG. 3E is a fifth schematic diagram of an example interface of a terminal according to the present application;
FIG. 3F is a sixth schematic diagram of an example interface of a terminal according to the present application;
FIG. 3G is a seventh schematic diagram of an example interface of a terminal according to the present application;
FIG. 3H is an eighth schematic diagram of an example interface of a terminal according to the present application;
FIG. 3I is a ninth schematic diagram of an example interface of a terminal according to the present application;
FIG. 3J is a tenth schematic diagram of an example interface of a terminal according to the present application;
FIG. 3K is an eleventh schematic diagram of an example interface of a terminal according to the present application;
FIG. 3L is a twelfth schematic diagram of an example interface of a terminal according to the present application;
FIG. 4 is a first schematic flowchart of a method for processing audio in a terminal according to the present application;
FIG. 5 is a schematic diagram of a first file in a terminal according to the present application;
FIG. 6 is a second schematic flowchart of a method for processing audio in a terminal according to the present application;
FIG. 7 is a second schematic structural diagram of a terminal according to the present application;
FIG. 8 is a third schematic structural diagram of a terminal according to the present application.
Detailed Description

Consider the following scenario: an audio file may contain the speech of multiple people, yet the user may only need to focus on listening to one of them. For example, a meeting recording may contain the speech of a leader and several employees, while the user may need to focus on the leader's opinions or work arrangements. In the prior art, the user can only listen to the entire recording file, or, upon hearing an employee speak, manually drag the progress bar bit by bit to try to skip that speech. Evidently, the longer the audio file, the lower the user's work efficiency, and the user experience is very poor.

For this reason, the embodiments of the present application provide a method for processing audio in a terminal, which can combine sound source localization technology and/or voiceprint recognition technology to identify the voices of the multiple people in an audio file. This helps the user listen to the audio content of a specific person in a targeted manner, improving the user experience.

The technical solutions provided by the embodiments of the present application are described below with reference to the accompanying drawings.

For example, the terminal in the present application may be a mobile phone (such as the mobile phone 100 shown in FIG. 2), a tablet computer, a personal computer (PC), a personal digital assistant (PDA), a smartwatch, a netbook, a wearable electronic device, an augmented reality (AR) device, a virtual reality (VR) device, or another device on which applications can be installed and application icons can be displayed. The present application places no particular restriction on the specific form of the terminal.
As shown in FIG. 2, taking the mobile phone 100 as an example of the terminal, the mobile phone 100 may specifically include components such as a processor 101, a radio frequency (RF) circuit 102, a memory 103, a touchscreen 104, a Bluetooth apparatus 105, one or more sensors 106, a wireless fidelity (Wi-Fi) apparatus 107, a positioning apparatus 108, an audio circuit 109, a peripheral interface 110, and a power supply apparatus 111. These components may communicate through one or more communication buses or signal lines (not shown in FIG. 2). A person skilled in the art can understand that the hardware structure shown in FIG. 2 does not constitute a limitation on the mobile phone; the mobile phone 100 may include more or fewer components than shown, may combine certain components, or may use a different arrangement of components.

The components of the mobile phone 100 are described in detail below with reference to FIG. 2:

The processor 101 is the control center of the mobile phone 100. It connects the parts of the mobile phone 100 through various interfaces and lines, and performs the various functions of the mobile phone 100 and processes data by running or executing the applications stored in the memory 103 and calling the data stored in the memory 103. In some embodiments, the processor 101 may include one or more processing units.

The radio frequency circuit 102 may be configured to receive and send wireless signals during the sending and receiving of information or during a call. In particular, the radio frequency circuit 102 may receive downlink data from a base station and deliver it to the processor 101 for processing, and may send uplink data to the base station. Generally, the radio frequency circuit includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low-noise amplifier, a duplexer, and the like. In addition, the radio frequency circuit 102 may also communicate with other devices through wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to the Global System for Mobile Communications, General Packet Radio Service, Code Division Multiple Access, Wideband Code Division Multiple Access, Long Term Evolution, email, and the Short Message Service.

The memory 103 is configured to store applications and data, and the processor 101 performs the various functions of the mobile phone 100 and processes data by running the applications and data stored in the memory 103. The memory 103 mainly includes a program storage area and a data storage area, where the program storage area may store the operating system and the applications required by at least one function (for example, a sound playing function and an image playing function), and the data storage area may store data created during the use of the mobile phone 100 (for example, audio data and a phone book). In addition, the memory 103 may include a high-speed random access memory (RAM), and may also include a non-volatile memory, such as a magnetic disk storage device, a flash memory device, or another solid-state storage device. The memory 103 may store various operating systems, for example, the iOS operating system developed by Apple Inc. and the Android operating system developed by Google Inc. The memory 103 may be independent and connected to the processor 101 through the above communication bus, or may be integrated with the processor 101.
The touchscreen 104 may specifically include a touchpad 104-1 and a display 104-2.

The touchpad 104-1 can collect touch events performed on or near it by the user of the mobile phone 100 (for example, operations performed by the user on or near the touchpad 104-1 with a finger, a stylus, or any other suitable object), and send the collected touch information to another component (for example, the processor 101). A touch event performed by the user near the touchpad 104-1 may be called floating touch; floating touch means that the user does not need to directly touch the touchpad to select, move, or drag a target (for example, an icon), but only needs to be near the device to perform the desired function. In addition, the touchpad 104-1 may be implemented with multiple types of technology, such as resistive, capacitive, infrared, and surface acoustic wave.

The display (also called a display screen) 104-2 may be configured to display information input by the user or information provided to the user, as well as the various menus of the mobile phone 100. The display 104-2 may be configured in the form of a liquid crystal display, an organic light-emitting diode, or the like. The touchpad 104-1 may cover the display 104-2; when the touchpad 104-1 detects a touch event on or near it, it transmits the event to the processor 101 to determine the type of the touch event, and the processor 101 can then provide a corresponding visual output on the display 104-2 according to the type of the touch event. Although in FIG. 2 the touchpad 104-1 and the display screen 104-2 implement the input and output functions of the mobile phone 100 as two independent components, in some embodiments the touchpad 104-1 and the display screen 104-2 may be integrated to implement the input and output functions of the mobile phone 100. It can be understood that the touchscreen 104 is formed by stacking multiple layers of material; only the touchpad (layer) and the display screen (layer) are shown in the embodiments of the present application, and the other layers are not described. In addition, the touchpad 104-1 may be arranged on the front of the mobile phone 100 in the form of a full panel, and the display screen 104-2 may also be arranged on the front of the mobile phone 100 in the form of a full panel, so that a bezel-less structure can be achieved on the front of the mobile phone.

In addition, the mobile phone 100 may also have a fingerprint recognition function. For example, a fingerprint reader 112 may be arranged on the back of the mobile phone 100 (for example, below the rear camera), or on the front of the mobile phone 100 (for example, below the touchscreen 104). For another example, a fingerprint collection device 112 may be arranged in the touchscreen 104 to implement the fingerprint recognition function; that is, the fingerprint collection device 112 may be integrated with the touchscreen 104 to implement the fingerprint recognition function of the mobile phone 100. In this case, the fingerprint collection device 112 is arranged in the touchscreen 104, and may be part of the touchscreen 104 or arranged in the touchscreen 104 in another manner. The main component of the fingerprint collection device 112 in the embodiments of the present application is a fingerprint sensor, which may use any type of sensing technology, including but not limited to optical, capacitive, piezoelectric, or ultrasonic sensing technology.

The mobile phone 100 may also include a Bluetooth apparatus 105, configured to implement data exchange between the mobile phone 100 and other short-range devices (for example, mobile phones and smartwatches). The Bluetooth apparatus in the embodiments of the present application may be an integrated circuit, a Bluetooth chip, or the like.

The mobile phone 100 may also include at least one sensor 106, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor and a proximity sensor, where the ambient light sensor can adjust the brightness of the display of the touchscreen 104 according to the brightness of the ambient light, and the proximity sensor can turn off the power of the display when the mobile phone 100 is moved to the ear. As one type of motion sensor, an accelerometer sensor can detect the magnitude of acceleration in various directions (generally three axes), can detect the magnitude and direction of gravity when stationary, and can be used in applications that recognize the posture of the mobile phone (such as switching between landscape and portrait, related games, and magnetometer posture calibration) and in vibration-recognition related functions (such as a pedometer and tapping). As for the gyroscope, barometer, hygrometer, thermometer, infrared sensor, and other sensors that may also be configured in the mobile phone 100, details are not described here.
The Wi-Fi apparatus 107 is configured to provide the mobile phone 100 with network access complying with Wi-Fi related standard protocols. The mobile phone 100 can access a Wi-Fi access point through the Wi-Fi apparatus 107, thereby helping the user send and receive emails, browse web pages, access streaming media, and so on; it provides the user with wireless broadband Internet access. In some other embodiments, the Wi-Fi apparatus 107 can also serve as a Wi-Fi wireless access point and provide Wi-Fi network access to other devices.

The positioning apparatus 108 is configured to provide the mobile phone 100 with a geographic location. It can be understood that the positioning apparatus 108 may specifically be a receiver of a positioning system such as the Global Positioning System (GPS), the BeiDou Navigation Satellite System, or the Russian GLONASS. After receiving the geographic location sent by the positioning system, the positioning apparatus 108 sends the information to the processor 101 for processing, or sends it to the memory 103 for storage. In some other embodiments, the positioning apparatus 108 may also be a receiver of an Assisted Global Positioning System (AGPS); the AGPS system, acting as an assistance server, assists the positioning apparatus 108 in completing ranging and positioning services, in which case the assistance server provides positioning assistance by communicating over a wireless communication network with the positioning apparatus 108 (that is, the GPS receiver) of a device such as the mobile phone 100. In some other embodiments, the positioning apparatus 108 may also be based on Wi-Fi access point positioning technology. Because every Wi-Fi access point has a globally unique media access control (MAC) address, a device with Wi-Fi enabled can scan and collect the broadcast signals of the surrounding Wi-Fi access points and thus obtain the MAC addresses broadcast by those access points; the device sends the data that can identify the Wi-Fi access points (for example, the MAC addresses) to a location server over the wireless communication network, and the location server retrieves the geographic location of each Wi-Fi access point, calculates the geographic location of the device in combination with the strength of the Wi-Fi broadcast signals, and sends it to the positioning apparatus 108 of the device.

The audio circuit 109, a speaker 113, and a microphone 114 can provide an audio interface between the user and the mobile phone 100. The audio circuit 109 can transmit the electrical signal converted from received audio data to the speaker 113, and the speaker 113 converts it into a sound signal for output; on the other hand, the microphone 114 converts a collected sound signal into an electrical signal, which the audio circuit 109 receives and converts into audio data; the audio data is then output to the RF circuit 102 to be sent to, for example, another mobile phone, or output to the memory 103 for further processing.

It should be noted that, in the embodiments of the present application, the terminal contains two or more microphones 114, which form a microphone array. The microphone array can be used to process received voice signals, suppressing noise and improving call quality. The microphone array can also be used to locate the sound source of a voice signal based on the time differences with which the voice signal reaches the individual microphones 114, so as to distinguish different human voices.
The peripheral interface 110 is configured to provide various interfaces for external input/output devices (for example, a keyboard, a mouse, an external display, an external memory, and a subscriber identity module card). For example, it connects to a mouse through a universal serial bus (USB) interface, and connects, through metal contacts in the card slot, to a subscriber identity module (SIM) card provided by a telecommunications operator. The peripheral interface 110 may be used to couple the aforementioned external input/output peripherals to the processor 101 and the memory 103.

The mobile phone 100 may also include a power supply apparatus 111 (for example, a battery and a power management chip) that supplies power to the components. The battery may be logically connected to the processor 101 through the power management chip, so that functions such as charging management, discharging management, and power consumption management are implemented through the power supply apparatus 111.

Although not shown in FIG. 2, the mobile phone 100 may also include a camera (a front camera and/or a rear camera), a flash, a micro projection apparatus, a near field communication (NFC) apparatus, and the like, which are not described here.

The methods in the following embodiments can all be implemented in the mobile phone 100 having the above hardware structure.
The technical solutions provided by the embodiments of the present application can be applied to a terminal's processing of files containing audio (for example, audio files and video files). Here, the processing of a recording file by the "Recorder" application in a terminal is used as an example to describe the solutions provided by the embodiments of the present application.

Refer to FIGS. 3A to 3L for examples of the terminal interfaces involved when the technical solutions provided by the embodiments of the present application are applied to the "Recorder" application's processing of a recording file.

Specifically, FIG. 3A shows the home screen interface of the terminal, on which the user can tap the icon 301 of the "Recorder" application to enter the main interface of the "Recorder" application. As shown in FIG. 3B, on the main interface of the "Recorder" application, the user can tap the "Recording files" button 302 to view all the recording files in the "Recorder" application.

As shown in FIG. 3C, the terminal displays the list of recording files contained in the "Recorder". The user can tap the "New Recording 1" button 303 to enter the play interface of this recording file. FIG. 3D shows the play interface of "New Recording 1". This interface includes an "Edit" button 304, and the user can tap the "Edit" button 304 to enter the editing interface of the "New Recording 1" file.

FIG. 3E shows the editing interface of "New Recording 1". The editing interface contains an "Automatically identify human voices" button 305. By tapping this button, the user causes the terminal to use, for example, voiceprint recognition technology and/or sound source localization technology to automatically identify the multiple human voices contained in the recording file and to distinguish the recording content corresponding to the different voices. For the specific implementation process, refer to the description of step S102; details are not repeated here.

It should be noted that the editing interface also contains function buttons such as "Convert to text", "Share", and "Delete". The "Convert to text" button enables the terminal to convert the voice signals in the recording file into text information using, for example, speech recognition technology. The "Share" button can be used to forward the recording file, for example, by SMS, email, or WeChat. The "Delete" button is used to delete the recording file from the "Recorder" application. The embodiments of the present application do not limit the interface content or the specific interface form of the editing interface.
FIG. 3F shows the interface on which the terminal has automatically identified the different human voices in the "New Recording 1" recording file. On this interface, the progress bar of the recording file is divided into several parts: 306 corresponds to one person's recording content (denoted as A's recording content), 307 corresponds to another person's recording content (denoted as B's recording content), and 308 corresponds to the recording content of both A and B (that is, A and B speak at the same time during that part). For example, 306, 307, and 308 may be marked with different colors, or another marking manner may be used. The embodiments of the present application do not limit the manner in which the terminal marks different people's voices.

It should be noted that the figure only takes "New Recording 1" containing two people's voices as an example for description; the embodiments of the present application also cover the case in which the voices of more than two people can be identified in one recording.

As shown in FIGS. 3G and 3H, the user manually adjusts the played recording content by dragging the dot on the progress bar. Because the progress bar clearly marks which part of the time corresponds to which person's recording content, the user can, by dragging the dot on the progress bar, quickly and accurately play the recording content of a specific person. For example, FIG. 3G is a schematic diagram of the user dragging the progress bar to play A's recording content, and FIG. 3H is a schematic diagram of the user dragging the progress bar to play B's recording content.

It should be noted that, because 308 corresponds to the recording content of both A and B, the recording content of 308 is played both when A's recording content is played and when B's recording content is played.

It should also be noted that the terminal may also automatically play a person's complete recording content according to the user's selection, without requiring the user to manually drag the progress bar. The embodiments of the present application do not limit the specific manner in which the terminal plays an individual's recording content.
Optionally, as shown in FIG. 3I, the user can tap the "Generate personal recording file" button 309 to generate, from the "New Recording 1" recording file, different recording files according to the different people's voices. As shown in FIG. 3J, generating different recording files according to the different people's voices may mean generating new recording files, or may mean replacing the original recording file. FIG. 3J shows the user selecting the option 310 of generating new recording files.

FIG. 3K shows the play interface of "A's recording file from New Recording 1" generated by the terminal from "New Recording 1". As can be seen, A's recording file includes the recording content corresponding to 306 and 308 in "New Recording 1". Optionally, this interface also includes a "Pause" button, an "Earphone mode" button, a "Tag" button, a "Convert to text" button, and a "Share" button. That is, the terminal can process a newly generated personal recording file in the same way as the original recording file. The embodiments of the present application do not limit the interface content or interface form of this interface.

Optionally, on this interface, the user can also switch from the play interface of "A's recording file from New Recording 1" to the play interface of "B's recording file from New Recording 1" by, for example, swiping right. FIG. 3L shows the play interface of "B's recording file from New Recording 1" generated by the terminal from "New Recording 1". As can be seen, B's recording file includes the recording content corresponding to 307 and 308 in "New Recording 1".
FIG. 4 is a flowchart of a method for audio processing in a terminal according to an embodiment of the present application. The method specifically includes the following steps.

S101. The terminal detects a first operation on a first interface.

In some examples, the first interface may be a play interface or an editing interface of the first file in an audio application in the terminal (for example, the interface shown in FIG. 3E), and that interface contains an "Automatically identify human voices" function button or menu option; the first operation is then the user's operation of tapping the function button or selecting the menu option. An audio application is an application that can process files containing audio. The first file is a file containing audio, and may be a pure audio file, a video file, a 3D image file, a holographic image file, or the like.

In some examples, the user may enable the terminal's "Automatically identify human voices" function in advance through the system settings interface of the terminal's operating system, or the terminal may enable the function by default. Then, when the terminal detects that an audio-related application (for example, an audio application or a recording application) performs an audio-processing operation, it can automatically identify human voices in files containing audio. An audio application is an application that can process files containing audio; a recording application is an application that can record files containing audio. For example, the first operation may be the user's operation of opening the first file through an audio application; in this case, the first interface is the playlist interface of the audio application, or a play interface or an editing interface displayed before the first file is opened, or the like. The first operation may also be the user's operation of recording through a recording application; in this case, the first file is a file generated by the recording application during real-time recording, and the first interface is the interface presented by the recording application before the user inputs the recording instruction, on which the user can input the recording instruction. The first file is a file containing audio, and may be a pure audio file, a video file, a 3D image file, a holographic image file, or the like.

In some examples, the user may enable the "Automatically identify human voices" function for a category of applications (for example, audio and video applications) or for a specific application (for example, the "Recorder" application) in advance through the corresponding settings interface, or the terminal may enable the function for that category or application by default. Then, when the terminal detects that such an application performs an audio-processing operation, it can automatically identify human voices in files containing audio. For descriptions of the first operation and the first interface, refer to the previous paragraph; details are not repeated here.

It should be noted that the embodiments of the present application do not limit the interface content or the interface form of the first interface, nor the operation form of the first operation.
S102. In response to the first operation, the terminal identifies the audio content of the different human voices contained in the first file.

The first file is a file containing audio, and may be, for example, an audio file or a video file. Specific audio files may include recording files, music files, and the like.

Specifically, the terminal may use voiceprint recognition technology and/or sound source localization technology to automatically identify the voices of the multiple people contained in the first file and to distinguish the audio content corresponding to the different voices.

The implementation process of the terminal's automatic identification of human voices is described by taking the mobile phone 100 in FIG. 2 as an example.

In daily life, communication with a mobile phone is disturbed by noise and reverberation, so the signal collected by a microphone is not a pure voice signal. To enhance the voice signal and improve call quality, mobile phones usually adopt microphone array technology, in which multiple microphones are arranged into an array according to a certain pattern. When voice and environmental information is collected by multiple microphones, the microphone array can, by adjusting the filter coefficients of each channel, effectively form a beam pointing at the target sound source in a specific direction, enhance the signals inside the beam, and suppress the signals outside it, thereby extracting the sound source and suppressing noise at the same time. Precisely because a mobile phone has a microphone array, that array can be used to locate the sound sources of the different sounds in the first file. In other words, for a first file recorded with a microphone array, a corresponding algorithm can identify the sound source to which each sound in the first file corresponded at recording time. In a typical recording setting, each speaker's position relative to the recording device is relatively fixed, so the sounds corresponding to different sound sources correspond to different people.
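To make the beamforming idea above concrete, the following is a minimal sketch of its simplest variant, a delay-and-sum beamformer for a linear array. The array geometry, the far-field model, and the steering angle are illustrative assumptions made for the example; a handset would adapt per-channel filter coefficients as the text describes, rather than use fixed delays.

```python
# Minimal delay-and-sum beamformer sketch for a linear microphone array.
# Geometry, sampling rate, and steering angle are illustrative assumptions.
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def delay_and_sum(channels: np.ndarray, mic_x: np.ndarray,
                  angle_rad: float, fs: int) -> np.ndarray:
    """channels: (n_mics, n_samples); mic_x: mic positions on one axis (m).

    Far-field model: a source at angle_rad reaches mic m earlier by
    mic_x[m] * cos(angle_rad) / c relative to the array origin; delaying
    each channel by that lead time-aligns the target signal.
    """
    n_mics, n_samples = channels.shape
    leads = mic_x * np.cos(angle_rad) / SPEED_OF_SOUND  # seconds
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
    out = np.zeros(n_samples)
    for m in range(n_mics):
        spec = np.fft.rfft(channels[m])
        # Fractional-sample delay applied in the frequency domain.
        spec *= np.exp(-2j * np.pi * freqs * leads[m])
        out += np.fft.irfft(spec, n=n_samples)
    # Signals arriving from angle_rad add coherently; others are attenuated.
    return out / n_mics

# Example geometry (assumed): 4 mics spaced 2 cm apart.
# mic_x = np.arange(4) * 0.02
```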
Specifically, sound source localization technologies fall into three categories: high-resolution spectral estimation, steerable beamforming, and time difference of arrival (TDOA). As an example, localization based on TDOA is described here. First, multiple microphones (a microphone array) receive the sound signal; then correlation methods are used to compute multiple time delays; and finally the known positions of the microphones in the array are combined with the computed delays to perform localization and determine the position of the sound source. For the first file, the terminal can locate the position of each sound source in the first file and thereby determine how many people's voices the recording file contains. Voiceprint recognition technology can then be combined to determine the audio content corresponding to each person.
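As a hedged illustration of the "correlation methods" mentioned above, the sketch below implements generalized cross-correlation with phase transform (GCC-PHAT), a classic estimator of the delay between two microphone channels. A full localizer would estimate such delays for several microphone pairs and combine them with the known array geometry to solve for the source position; the parameter names here are illustrative.

```python
# GCC-PHAT time-delay estimation between two microphone channels.
# One common correlation method for the TDOA step; peak picking without
# sub-sample interpolation is a simplification.
import numpy as np

def gcc_phat(sig: np.ndarray, ref: np.ndarray, fs: int,
             max_tau: float | None = None) -> float:
    n = sig.size + ref.size
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)
    # Phase transform: keep only phase information, which sharpens the
    # correlation peak under reverberation.
    cross /= np.abs(cross) + 1e-12
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = int(np.argmax(np.abs(cc))) - max_shift
    return shift / fs  # delay of `sig` relative to `ref`, in seconds
```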
Voiceprint recognition technology identifies the speaker's identity (that is, distinguishes different people's voices) through voice parameters in the audio waveform that reflect the speaker's physiological and behavioral characteristics. Specifically, if a person's voiceprint template is stored in the mobile phone, the mobile phone can compare that template with the audio in the first file to identify the audio content corresponding to the template. If, after comparison, no audio content in the first file corresponds to a stored voiceprint template, or no voiceprint template is stored in the mobile phone, the mobile phone can extract voiceprint features from a segment of audio in the first file (which may be some or all of the audio used for sound source localization) and build a new voiceprint template. The remaining audio in the first file is then compared against this template to identify the audio content corresponding to the newly built voiceprint template.
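As a rough sketch of the template comparison just described, the code below scores an audio segment against a stored voiceprint template using averaged MFCC features and cosine similarity. The librosa-based feature extraction and the acceptance threshold are assumptions made for the example; production voiceprint systems use far richer speaker models, such as i-vectors or neural embeddings.

```python
# Toy voiceprint comparison: the "template" is an averaged MFCC vector and
# the score is cosine similarity. The threshold is an illustrative
# assumption; real systems use i-vectors or neural speaker embeddings.
import numpy as np
import librosa

def voiceprint(samples: np.ndarray, fs: int) -> np.ndarray:
    mfcc = librosa.feature.mfcc(y=samples, sr=fs, n_mfcc=20)
    return mfcc.mean(axis=1)  # one fixed-length vector per segment

def matches_template(segment: np.ndarray, fs: int,
                     template: np.ndarray, threshold: float = 0.85) -> bool:
    probe = voiceprint(segment, fs)
    score = float(np.dot(probe, template) /
                  (np.linalg.norm(probe) * np.linalg.norm(template) + 1e-12))
    return score >= threshold

# If no stored template matches, a new one can be enrolled from a segment
# already attributed to a single sound source:
#   new_template = voiceprint(localized_segment, fs)
```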
Considering that multiple people's voices in the first file may overlap during a certain period, the mobile phone also needs to be able to identify the audio in the first file in which multiple voices overlap.

For example, as shown in FIG. 5, on the time axis corresponding to the first file, the MN part is the audio content corresponding to A, the PQ part is the audio content corresponding to B, and the PN part is the audio content in which A and B speak at the same time.

Optionally, the mobile phone may determine that the PN part is where A's and B's voices overlap as follows: at point M, A starts speaking, and the mobile phone can determine A's sound source position through sound source localization technology. At point P, A is speaking and B also starts speaking, so the mobile phone can determine two sound source positions through sound source localization technology, namely A's sound source position and B's sound source position. Therefore, it can be determined that the part starting from point P is where A and B speak at the same time, until, from point N onward, the mobile phone determines only B's sound source position, so from point N onward B speaks alone. Therefore, it can be determined that the PN part corresponds to the overlap of A's and B's voices. It should be noted that this method can also identify parts in which more than two people's voices overlap; details are not repeated here.

Optionally, the mobile phone may also determine that the PN part is the overlap of A's and B's voices as follows: through sound source localization technology and/or voiceprint recognition technology, the mobile phone determines that the MP part is A's audio content and the NQ part is B's audio content. If the mobile phone cannot identify whose audio content the PN part corresponds to, it can infer, from the continuity of the preceding and following audio content, that the PN part corresponds to the audio content of both A and B, that is, the audio in which A's and B's voices overlap.
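Either determination strategy reduces to a labeling pass over per-frame localization results. Assuming an upstream localizer has produced, for each analysis frame, the set of currently active source identifiers (an assumption for this sketch, along with the 10 ms frame hop), the code below groups consecutive frames into single-speaker and overlapped segments, such as the PN part in FIG. 5.

```python
# Group per-frame active-source sets into labeled segments; segments with
# more than one active source are overlaps (e.g., the PN part in FIG. 5).
# The upstream localizer and the 10 ms frame hop are assumptions.
FRAME_HOP_S = 0.010

def label_segments(frames: list[frozenset[str]]) -> list[tuple[float, float, frozenset[str]]]:
    """frames[i] holds the IDs of sources active in frame i.

    Returns (start_s, end_s, sources) tuples; len(sources) > 1 means overlap.
    """
    segments: list[tuple[float, float, frozenset[str]]] = []
    start = 0
    for i in range(1, len(frames) + 1):
        if i == len(frames) or frames[i] != frames[start]:
            if frames[start]:  # skip silence
                segments.append((start * FRAME_HOP_S, i * FRAME_HOP_S,
                                 frames[start]))
            start = i
    return segments

# Example: A alone, then A and B together, then B alone.
frames = [frozenset({"A"})] * 3 + [frozenset({"A", "B"})] * 2 + [frozenset({"B"})] * 3
print(label_segments(frames))
# -> [(0.0, 0.03, frozenset({'A'})), (0.03, 0.05, frozenset({'A', 'B'})),
#     (0.05, 0.08, frozenset({'B'}))]  (set ordering may vary)
```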
It should also be noted that the embodiments of the present application may use sound source localization technology to determine the audio content corresponding to each voice, may use voiceprint recognition technology to do so, or may combine sound source localization technology with voiceprint recognition technology to do so; this is not limited in the embodiments of the present application.

S103. The terminal displays a second interface.
In some embodiments of the present invention, the second interface is a play interface or an editing interface of the first file in an audio application (for example, the interface shown in FIG. 3F), or may be an interface displayed after the playing or editing of the first file in the audio application is completed. On this interface, the audio content of the different human voices in the first file is marked separately. The marking manner may be, for example, displaying the audio corresponding to the different human voices in different colors on the progress bar of the first file, where the parts in which multiple people overlap may also be marked with a distinct color. It may also be that, when the user drags the dot on the progress bar, different avatars are displayed to indicate whose voice the dot's current position corresponds to.
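The colored progress bar described above can be derived mechanically from such labeled segments: each segment becomes one drawing instruction whose color is keyed to its set of speakers. In the hedged sketch below, the palette, the overlap color, and the segment format (carried over from the earlier label_segments sketch) are assumptions; the actual drawing would be done by the platform's UI toolkit.

```python
# Map labeled segments to colored progress-bar marks for the second
# interface. Colors and the segment format are illustrative assumptions;
# rendering itself belongs to the UI toolkit.
PALETTE = ["#4285f4", "#ea4335", "#fbbc05", "#34a853"]
OVERLAP_COLOR = "#9e9e9e"

def timeline_marks(segments, duration_s):
    colors = {}  # stable color per single speaker
    marks = []
    for start, end, sources in segments:
        if len(sources) == 1:
            speaker = next(iter(sources))
            color = colors.setdefault(speaker,
                                      PALETTE[len(colors) % len(PALETTE)])
        else:
            color = OVERLAP_COLOR  # e.g., part 308 where A and B overlap
        marks.append({
            "from": start / duration_s,  # normalized 0..1 for the widget
            "to": end / duration_s,
            "color": color,
            "speakers": sorted(sources),
        })
    return marks
```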
In some embodiments of the present invention, the first file is a file generated by a recording application during real-time recording, and the second interface may be a play interface or an editing interface of the first file. The second interface may also be the interface shown by the recording application when recording of the first file is completed, and the user can play the first file directly from this interface. The embodiments of the present application do not limit the interface form or the interface content of the second interface, or the manner of marking the different human voices.

For example, when displaying the second interface, the terminal may receive the user's selection and play the complete audio of the person selected by the user. A person's complete audio is all the audio in the first file that contains that person's voice, including the parts that overlap with other people's voices. The terminal may also automatically play each person's complete audio in the order in which the different people's voices appear in the first file. This is not limited in the embodiments of the present application.

It can be seen that, with the method provided by the embodiments of the present application, for a first file containing audio, the terminal can automatically identify the voices of the different people in the first file and mark the audio content corresponding to each person separately. In this way, the user can quickly locate the audio position of a specific person, which improves the user's work efficiency and the user experience.
Further, as shown in FIG. 6, after the terminal identifies the audio content of the different human voices contained in the first file, the terminal can generate multiple new audio files, each containing only one individual's voice. That is, after step S102, the method provided by the embodiments of the present application further includes the following steps.

S201. The terminal receives a second operation.

The second operation is the user's operation of selecting "generate personal recording file". For example, the second operation may be the user's operation, on the play interface or editing interface of the first file, of tapping a "Generate personal recording file" function button or selecting a corresponding menu option.

It should be noted that, if the terminal has also performed step S103 before step S201, the second operation may also be the user's operation, on the second interface, of tapping the "Generate personal recording file" function button or selecting the corresponding menu option. The embodiments of the present application do not limit the interface on which the second operation is received or the specific operation form of the second operation.
S202. In response to the second operation, the terminal generates a second file.

The second file contains all of one person's audio content from the first file. Specifically, the second file may contain only one person's complete audio content, or may contain, in addition to one person's complete audio content, part of another person's audio content. This is not limited in the embodiments of the present application.

For example, after the terminal identifies the audio content of the different human voices, it can copy the first file and re-edit the copy. Specifically, the terminal can clip all the audio content corresponding to one person in the first file into a single file.

For example, as shown in FIG. 3K, "A's recording file from New Recording 1" clips the audio content corresponding to A in "New Recording 1" into one recording file, specifically including the audio part in which A speaks alone (306) and the audio part in which A and B speak at the same time (308). As shown in FIG. 3L, "B's recording file from New Recording 1" clips the audio content corresponding to B in "New Recording 1" into one recording file, specifically including the audio part in which B speaks alone (307) and the audio part in which A and B speak at the same time (308).
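A personal recording file of this kind can be assembled by concatenating the samples of every segment in which the chosen speaker is active, overlapped segments included. The sketch below does this for PCM samples held in a NumPy array; the segment format follows the earlier sketches, and copying overlapped audio as-is corresponds to the first handling approach described in the next paragraph.

```python
# Build a "personal recording file" by concatenating all segments in which
# the chosen speaker is active (overlapped segments are copied as-is here).
# PCM samples in a NumPy array and the segment format are assumptions
# carried over from the earlier sketches.
import numpy as np

def personal_track(samples: np.ndarray, fs: int,
                   segments, speaker: str) -> np.ndarray:
    parts = [
        samples[int(start * fs):int(end * fs)]
        for start, end, sources in segments
        if speaker in sources
    ]
    if not parts:
        return np.zeros(0, dtype=samples.dtype)
    return np.concatenate(parts)

# For "New Recording 1", personal_track(..., speaker="A") would join parts
# 306 and 308, and personal_track(..., speaker="B") parts 307 and 308.
```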
It should be noted that, during clipping, for the audio in the first file in which multiple people's voices overlap, in some examples the terminal may directly clip the overlapping audio together with the audio in which each person speaks alone. During playback, the user can identify by ear the voice he or she needs to listen to. In other words, in addition to all of one person's audio content from the first file, the second file may also contain part of another person's audio.

For example, on the play interface shown in FIG. 3K, when the terminal plays the recording of part 308, the audio of this part includes both A's voice and B's voice, and the user decides whether to listen to A's voice or B's voice.

For the audio in the first file in which multiple people's voices overlap, the terminal may alternatively separate the human voices in the overlapping audio based on sound source localization technology and/or voiceprint recognition technology, and edit the separated audio content together with the rest of the corresponding person's audio content. In other words, the second file then contains only one person's audio content from the first file.

For example, on the play interface shown in FIG. 3K, when the terminal plays the recording of part 308, the audio of this part includes only A's voice.

It should also be noted that, in this step, the terminal may, according to the user's selection, generate only one person's personal audio file (for example, A's recording file), or the terminal may automatically generate a personal audio file for every person contained in the first file, that is, generate multiple personal audio files (for example, A's recording file and B's recording file). This is not limited in the embodiments of the present application.
S203. The terminal displays a third interface.

The third interface is a play interface of the second file (for example, the interface shown in FIG. 3K or the interface shown in FIG. 3L). Optionally, the third interface may also include a "Pause" button, an "Earphone mode" button, a "Tag" button, a "Convert to text" button, and a "Share" button. The "Convert to text" button enables the terminal to convert the voice signals in the second file into text information using, for example, speech recognition technology. The "Share" button can be used to forward the second file, for example, by SMS, email, or WeChat. The "Delete" button is used to delete the second file from the current application or from the memory. The embodiments of the present application do not limit the interface content or the specific interface form of the third interface.
It can be understood that, to implement the above functions, the terminal and the like include corresponding hardware structures and/or software modules for performing each function. A person skilled in the art should easily realize that, in combination with the units and algorithm steps of the examples described in the embodiments disclosed herein, the embodiments of the present application can be implemented by hardware or by a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the specific application and the design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of the embodiments of the present invention.

The embodiments of the present application may divide the terminal and the like into function modules according to the above method examples. For example, each function module may be divided corresponding to each function, or two or more functions may be integrated into one processing module. The integrated module may be implemented in the form of hardware or in the form of a software function module. It should be noted that the division of modules in the embodiments of the present invention is schematic and is only a logical function division; there may be other division manners in actual implementation.
In the case where each function module is divided corresponding to each function, FIG. 7 shows a possible schematic structural diagram of the terminal involved in the above embodiments. As shown in FIG. 7, the terminal 1000 includes a detecting unit 1001, a processing unit 1002, and a display unit 1003.

The detecting unit 1001 is configured to support the terminal in performing step S101 in FIG. 4 and step S201 in FIG. 6, and/or other processes of the techniques described herein. The processing unit 1002 is configured to support the terminal in performing step S102 in FIG. 4 and step S202 in FIG. 6, and/or other processes of the techniques described herein. The display unit 1003 is configured to support the terminal in performing step S103 in FIG. 4 and step S203 in FIG. 6, displaying the terminal interfaces in FIGS. 3A to 3L, and/or other processes of the techniques described herein.

All relevant content of the steps involved in the above method embodiments can be cited in the function descriptions of the corresponding function modules; details are not repeated here.

Of course, the terminal 1000 may also include a communication unit for the terminal to interact with other devices. Moreover, the functions that the above function units can specifically implement include, but are not limited to, the functions corresponding to the method steps described in the above examples; for detailed descriptions of the other units of the terminal 1000, refer to the detailed descriptions of the method steps to which they correspond, which are not repeated here in the embodiments of the present application. The terminal 1000 may also include a storage unit for storing the first file and the second file in the terminal, as well as the program code, data, and the like in the terminal.
In the case of integrated units, the above detecting unit 1001 and processing unit 1002 may be integrated together as a processing module of the terminal. The above communication unit may be a communication module of the terminal, such as an RF circuit, a Wi-Fi module, or a Bluetooth module. The above display unit 1003 may be the display of the terminal. The above storage unit may be a storage module of the terminal.

FIG. 8 shows a possible schematic structural diagram of the terminal involved in the above embodiments. The terminal 1100 includes a processing module 1101, a storage module 1102, and a communication module 1103. The processing module 1101 is configured to control and manage the actions of the terminal. The storage module 1102 is configured to store the program code and data of the terminal. The communication module 1103 is configured to communicate with other terminals. The processing module 1101 may be a processor or a controller, for example, a central processing unit (CPU), a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or execute the various exemplary logical blocks, modules, and circuits described in connection with the disclosure of the present invention. The processor may also be a combination implementing computing functions, for example, a combination of one or more microprocessors, or a combination of a DSP and a microprocessor. The communication module 1103 may be a transceiver, a transceiver circuit, a communication interface, or the like. The storage module 1102 may be a memory.

When the processing module 1101 is a processor (such as the processor 101 shown in FIG. 2), the communication module 1103 is an RF transceiver circuit (such as the radio frequency circuit 102 shown in FIG. 2), and the storage module 1102 is a memory (such as the memory 103 shown in FIG. 2), the terminal provided by the embodiments of the present application may be the terminal 100 shown in FIG. 2. The communication module 1103 may include not only the RF circuit but also the Wi-Fi module and the Bluetooth module. Communication modules such as the RF circuit, the Wi-Fi module, and the Bluetooth module may be collectively referred to as a communication interface. The above processor, communication interface, and memory may be coupled together through a bus.
Through the description of the above implementations, a person skilled in the art can clearly understand that, for convenience and brevity of description, only the division of the above function modules is used as an example for illustration. In practical applications, the above functions can be allocated to different function modules as needed; that is, the internal structure of the apparatus is divided into different function modules to complete all or part of the functions described above. For the specific working processes of the systems, apparatuses, and units described above, refer to the corresponding processes in the foregoing method embodiments; details are not repeated here.

In the several embodiments provided in the present application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely schematic; for example, the division of the modules or units is only a logical function division, and there may be other division manners in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be in electrical, mechanical, or other forms.

The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.

In addition, the function units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software function unit.

If the integrated unit is implemented in the form of a software function unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of the present application essentially, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform all or part of the steps of the methods described in the embodiments of the present application. The foregoing storage medium includes various media that can store program code, such as a flash memory, a removable hard disk, a read-only memory, a random access memory, a magnetic disk, or an optical disc.

The above descriptions are only specific implementations of the present application, but the protection scope of the present application is not limited thereto. Any variation or replacement within the technical scope disclosed in the present application shall be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (23)

  1. A method for audio processing in a terminal, wherein the method comprises:
    detecting, by a terminal, a first operation on a first interface;
    in response to the first operation, automatically identifying, by the terminal, different human voices to which audio content in a first file belongs, wherein the first file is a file containing audio; and
    displaying, by the terminal, a second interface, wherein the different human voices to which the audio content in the first file belongs have different marks in the second interface.
  2. The method according to claim 1, wherein the detecting, by the terminal, of the first operation on the first interface is specifically:
    detecting, by the terminal, on a play interface or an editing interface of the first file, an operation of tapping a function button for automatically identifying human voices, or an operation of selecting a menu option for automatically identifying human voices.
  3. The method according to claim 1, wherein the detecting, by the terminal, of the first operation on the first interface is specifically:
    when the terminal has enabled the function of automatically identifying human voices, detecting, by the terminal, on an interface of an audio application, an operation of opening the first file.
  4. The method according to claim 3, wherein, before the terminal detects, on the interface of the audio application, the operation of opening the first file, the method further comprises:
    detecting, on a system settings interface of the terminal, an operation of enabling the function of the terminal of automatically identifying human voices;
    or, detecting, on a settings interface of the audio application, an operation of enabling the function of the audio application of automatically identifying human voices.
  5. The method according to claim 1, wherein the detecting, by the terminal, of the first operation on the first interface is specifically:
    detecting, by the terminal, on an interface of a recording application, a recording instruction input by a user, wherein the first file is a file generated by the recording application during recording, and the second interface is an interface displayed by the recording application during recording or after recording is completed.
  6. The method according to any one of claims 1 to 5, wherein that the different human voices to which the audio content in the first file belongs have different marks in the second interface comprises:
    time axes corresponding to the different human voices to which the audio content in the first file belongs have different marks.
  7. The method according to claim 6, wherein that the time axes corresponding to the different human voices to which the audio content in the first file belongs have different marks comprises:
    the time axes corresponding to the different human voices to which the audio content in the first file belongs have marks of different colors.
  8. The method according to claim 6, wherein that the time axes corresponding to the different human voices to which the audio content in the first file belongs have different marks comprises:
    the time axes corresponding to the different human voices to which the audio content in the first file belongs have marks of different avatars.
  9. The method according to any one of claims 1 to 8, wherein, after the terminal automatically identifies the different human voices to which the audio content in the first file belongs, the method further comprises:
    detecting, by the terminal, a second operation;
    in response to the second operation, generating, by the terminal, a second file, wherein the second file contains all the audio content of one preset human voice in the first file; and
    displaying, by the terminal, a third interface, wherein the second file is displayed on the third interface.
  10. The method according to claim 9, wherein the third interface is a play interface or an editing interface of the second file.
  11. A terminal, comprising:
    a detecting unit, configured to detect a first operation on a first interface;
    a processing unit, configured to automatically identify, in response to the first operation, different human voices to which audio content in a first file belongs, wherein the first file is a file containing audio; and
    a display unit, configured to display a second interface, wherein the different human voices to which the audio content in the first file belongs have different marks in the second interface.
  12. The terminal according to claim 11, wherein the detecting unit is specifically configured to detect, on a play interface or an editing interface of the first file, an operation of tapping a function button for automatically identifying human voices, or an operation of selecting a menu option for automatically identifying human voices.
  13. The terminal according to claim 11, wherein the detecting unit is specifically configured to detect, on an interface of an audio application, an operation of opening the first file when the terminal has enabled the function of automatically identifying human voices.
  14. The terminal according to claim 13, wherein the detecting unit is further configured to, before the operation of opening the first file is detected on the interface of the audio application:
    detect, on a system settings interface of the terminal, an operation of enabling the function of the terminal of automatically identifying human voices;
    or, detect, on a settings interface of the audio application, an operation of enabling the function of the audio application of automatically identifying human voices.
  15. The terminal according to claim 11, wherein the detecting unit is specifically configured to detect, on an interface of a recording application, a recording instruction input by a user, wherein the first file is a file generated by the recording application during recording, and the second interface is an interface displayed by the recording application during recording or after recording is completed.
  16. The terminal according to any one of claims 11 to 15, wherein that the different human voices to which the audio content in the first file belongs have different marks in the second interface comprises:
    time axes corresponding to the different human voices to which the audio content in the first file belongs have different marks.
  17. The terminal according to claim 16, wherein that the time axes corresponding to the different human voices to which the audio content in the first file belongs have different marks comprises:
    the time axes corresponding to the different human voices to which the audio content in the first file belongs have marks of different colors.
  18. The terminal according to claim 16, wherein that the time axes corresponding to the different human voices to which the audio content in the first file belongs have different marks comprises:
    the time axes corresponding to the different human voices to which the audio content in the first file belongs have marks of different avatars.
  19. The terminal according to any one of claims 11 to 18, wherein the detecting unit is further configured to detect a second operation;
    the processing unit is further configured to generate, in response to the second operation, a second file, wherein the second file contains all the audio content of one preset human voice in the first file; and
    the display unit is further configured to display a third interface, wherein the second file is displayed on the third interface.
  20. The terminal according to claim 19, wherein the third interface is a play interface or an editing interface of the second file.
  21. A terminal, comprising a processor, a memory, and a touchscreen, wherein the memory and the touchscreen are coupled to the processor, the memory is configured to store computer program code, and the computer program code comprises computer instructions; when the processor reads the computer instructions from the memory, the terminal performs the method according to any one of claims 1 to 10.
  22. A computer storage medium, comprising computer instructions that, when run on a terminal, cause the terminal to perform the method according to any one of claims 1 to 10.
  23. A computer program product, which, when run on a computer, causes the computer to perform the method according to any one of claims 1 to 10.
PCT/CN2018/081184 2018-03-29 2018-03-29 Method for automatically identifying different human voices in audio WO2019183904A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201880072788.3A priority Critical patent/CN111328418A/zh Method for automatically identifying different human voices in audio
PCT/CN2018/081184 priority patent/WO2019183904A1/zh Method for automatically identifying different human voices in audio

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/081184 WO2019183904A1 (zh) 2018-03-29 2018-03-29 自动识别音频中不同人声的方法

Publications (1)

Publication Number Publication Date
WO2019183904A1 true WO2019183904A1 (zh) 2019-10-03

Family

ID=68062514

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/081184 Method for automatically identifying different human voices in audio

Country Status (2)

Country Link
CN (1) CN111328418A (zh)
WO (1) WO2019183904A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114464198A (zh) * 2021-11-30 2022-05-10 中国人民解放军战略支援部队信息工程大学 Visualized human voice separation system, method, and apparatus

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102982800A (zh) * 2012-11-08 2013-03-20 鸿富锦精密工业(深圳)有限公司 Electronic device with audio/video file processing function and audio/video file processing method
CN103530432A (zh) * 2013-09-24 2014-01-22 华南理工大学 Conference recorder with speech extraction function and speech extraction method
CN106024009A (zh) * 2016-04-29 2016-10-12 北京小米移动软件有限公司 Audio processing method and apparatus
CN106448683A (zh) * 2016-09-30 2017-02-22 珠海市魅族科技有限公司 Method and apparatus for viewing recordings in a multimedia file

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20160026317A (ko) * 2014-08-29 2016-03-09 삼성전자주식회사 Method and apparatus for voice recording
US10516782B2 (en) * 2015-02-03 2019-12-24 Dolby Laboratories Licensing Corporation Conference searching and playback of search results
CN105280183B (zh) * 2015-09-10 2017-06-20 百度在线网络技术(北京)有限公司 Voice interaction method and system
CN105262878B (zh) * 2015-11-20 2019-03-05 Oppo广东移动通信有限公司 Processing method for automatic call recording and mobile terminal
CN106448722B (zh) * 2016-09-14 2019-01-18 讯飞智元信息科技有限公司 Recording method, apparatus, and system
CN106357932A (zh) * 2016-11-22 2017-01-25 奇酷互联网络科技(深圳)有限公司 Call information recording method and mobile terminal
CN107342097A (zh) * 2017-07-13 2017-11-10 广东小天才科技有限公司 Recording method, recording apparatus, intelligent terminal, and computer-readable storage medium
CN107481743A (zh) * 2017-08-07 2017-12-15 捷开通讯(深圳)有限公司 Mobile terminal, memory, and method for editing recording files

Also Published As

Publication number Publication date
CN111328418A (zh) 2020-06-23

Similar Documents

Publication Publication Date Title
US10869146B2 (en) Portable terminal, hearing aid, and method of indicating positions of sound sources in the portable terminal
US11509973B2 (en) Method and apparatus for synthesizing video
CN108446022B (zh) 用户装置及其控制方法
CN108538320B (zh) 录音控制方法和装置、可读存储介质、终端
WO2016169465A1 (zh) 一种显示弹幕信息的方法、装置和系统
CN110168487B (zh) 一种触摸控制方法及装置
KR20160026317A (ko) 음성 녹음 방법 및 장치
US9444927B2 (en) Methods for voice management, and related devices
WO2017181365A1 (zh) 一种耳机声道控制方法、相关设备及系统
US20200258517A1 (en) Electronic device for providing graphic data based on voice and operating method thereof
US20150025882A1 (en) Method for operating conversation service based on messenger, user interface and electronic device using the same
CN109257498B (zh) 一种声音处理方法及移动终端
WO2021104160A1 (zh) 编辑方法及电子设备
KR102135370B1 (ko) 이동 단말기 및 이동 단말기의 제어방법
CN106506437B (zh) 一种音频数据处理方法,及设备
WO2017215661A1 (zh) 一种场景音效的控制方法、及电子设备
CN110798327B (zh) 消息处理方法、设备及存储介质
CN109194998A (zh) 数据传输方法、装置、电子设备及计算机可读介质
CN111369994B (zh) 语音处理方法及电子设备
WO2019183904A1 (zh) Method for automatically identifying different human voices in audio
CN109144461B (zh) 发声控制方法、装置、电子装置及计算机可读介质
CN108958631B (zh) 屏幕发声控制方法、装置以及电子装置
CN110741619B (zh) 一种显示备注信息的方法及终端
CN111052050A (zh) 一种输入信息的方法及终端
KR20150089787A (ko) 이동 단말기 및 그것의 제어방법

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18912289

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18912289

Country of ref document: EP

Kind code of ref document: A1