CN117972134A - Tone color recommendation method, electronic device, and computer storage medium

Info

Publication number: CN117972134A
Application number: CN202211303152.XA
Authority: CN
Inventors: 朱星宇; 耿杰; 赵伟; 张跃; 张�成; 还超
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Technologies Co Ltd
Priority date: 2022-10-24
Filing date: 2022-10-24
Publication date: 2024-05-03

Abstract

The application provides a timbre recommendation method, an electronic device, and a computer storage medium. The method comprises: displaying a first interface, wherein the first interface is used for playing first multimedia data; in response to a first operation on an audio recommendation control for the first interface, acquiring a first timbre, wherein the first timbre is the timbre of the human voice in the first multimedia data, or the timbre, among a plurality of timbres, that is most similar to the timbre of the human voice in the first multimedia data; and displaying first information, the first information indicating the first timbre. The application can proactively recommend to the user a timbre that meets the user's needs, which reduces user operations and improves the user experience.

Description

Tone color recommendation method, electronic device, and computer storage medium

Technical Field

The present application relates to the field of computer technologies, and in particular, to a tone color recommendation method, an electronic device, and a computer storage medium.

Background

Text-to-speech (TTS) technology is widely used in the voice assistants of electronic devices, so that users can use voice assistants with different timbres. A timbre market application of an electronic device can provide a plurality of different timbres for users to select and set as the timbre of the voice assistant. However, users often need to actively search for timbres and audition them multiple times to find one they like, which makes the operation cumbersome.

Disclosure of Invention

The application discloses a timbre recommendation method, an electronic device, and a computer storage medium, which can proactively recommend to users timbres that meet their needs, reducing user operations and improving the user experience.

In a first aspect, the present application provides a timbre recommendation method applied to an electronic device. The method includes: displaying a first interface, wherein the first interface is used for playing first multimedia data; in response to a first operation on an audio recommendation control for the first interface, acquiring a first timbre, wherein the first timbre is the timbre of the human voice in the first multimedia data, or the timbre, among a plurality of timbres, that is most similar to the timbre of the human voice in the first multimedia data; and displaying first information, the first information indicating the first timbre.

In some examples, the first multimedia data is audio or video.

In some examples, the first interface is a user interface of a first application, and the first control is either a control of the first application or not a control of the first application, for example, a floating window.

In this method, the electronic device can respond to a user operation by acquiring a timbre according to the first multimedia data currently being played and recommending it to the user. The timbre is acquired with high accuracy, and the user can quickly obtain the desired timbre without manually searching through a large number of timbres and auditioning them many times. This reduces user operations, makes the product functions richer and more intelligent, and improves the user experience.

In one possible implementation, the first timbre is a timbre in a timbre market application of the electronic device, and/or the plurality of timbres are timbres in the timbre market application.

In this method, the electronic device can proactively recommend to the user a first timbre in the timbre market that meets the user's needs. The product functions are richer and more intelligent, which increases the usage of the timbre market and improves product usability.

In one possible implementation, the method further includes: receiving a second operation on the first information; and displaying a second interface, wherein the second interface includes information about the first timbre.

In some examples, the second interface is a user interface of the timbre market application.

In this method, the user can view detailed information about the first timbre through the first information, which makes it easier to decide whether to use the first timbre and improves the user experience.

In one possible implementation, the method further includes: in response to a third operation on an add control of the second interface, setting the first timbre as a timbre usable by at least one of the following functions: voice assistant, navigation voice, incoming call reminder, information alert tone, calendar alert tone, alarm alert tone, dial alert tone, short message alert tone, and time alert tone.

In one possible implementation, the setting the first timbre as a timbre usable by at least one of the following functions may be replaced by: setting the first timbre as the timbre of at least one of the following functions.

In one possible implementation, after the setting the first timbre as a timbre usable by at least one of the following functions, the method further includes: in response to an operation on an interface of a first function, setting the first timbre as the timbre of the first function, the first function being at least one of: voice assistant, navigation voice, incoming call reminder, information alert tone, calendar alert tone, alarm alert tone, dial alert tone, short message alert tone, and time alert tone.

In this method, the first timbre recommended by the electronic device can be assigned to functions in multiple ways and used in multiple scenarios, which meets the user's personalized needs and improves the user experience.

In one possible implementation, after the displaying the first information, the method further includes: when no operation on the first information is received within a first duration, displaying a third interface, wherein the third interface is used for playing the first multimedia data, does not include the first information, and includes second information; and displaying the first information in response to a fourth operation on the second information in the third interface.

In some examples, the second information is less than the first information.

In one possible implementation, after the displaying the second interface, the method further includes: in response to an operation, displaying a third interface, wherein the third interface is used for playing the first multimedia data, does not include the first information, and includes second information; and displaying the first information in response to a fourth operation on the second information in the third interface.

In this method, if the user does not operate the first information within the first duration, or after the user has viewed the information about the first timbre (that is, the second interface) through the first information, the electronic device can stop displaying the first information and display the second information instead. The second information can be used to view the first information again. This prevents the first information from interfering with the user's viewing of the first multimedia data. If the user later wants to view the first timbre again, the user does not need to manually move the playback position of the first multimedia data back to the time point at which the first operation was performed and then perform the first operation again. This reduces user operations, lets the user quickly view a previously recommended timbre, and further improves the user experience and the usage of the timbre market.
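The display transitions described above can be sketched as a small state machine. This is only an illustrative sketch: the state names, the event names (`timeout`, `tap`, `back`), and the `next_view` function are hypothetical and do not come from the application. In particular, modelling the operation that leaves the second interface as a `back` event is an assumption.

```python
from enum import Enum


class View(Enum):
    FIRST_INFO = "first information shown over playback"
    SECOND_INTERFACE = "second interface with timbre details"
    SECOND_INFO = "third interface showing only the second information"


# Hypothetical event names; the application only speaks of "operations".
TRANSITIONS = {
    # No operation on the first information within the first duration.
    (View.FIRST_INFO, "timeout"): View.SECOND_INFO,
    # Second operation on the first information.
    (View.FIRST_INFO, "tap"): View.SECOND_INTERFACE,
    # Assumed way of leaving the second interface.
    (View.SECOND_INTERFACE, "back"): View.SECOND_INFO,
    # Fourth operation on the second information.
    (View.SECOND_INFO, "tap"): View.FIRST_INFO,
}


def next_view(current, event):
    """Return the next view; unknown events leave the view unchanged."""
    return TRANSITIONS.get((current, event), current)
```

Keeping the transitions in a table makes it easy to see that the first information is always reachable again from the second information without replaying the first operation.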

In one possible implementation, the second information is displayed at the position of a first time point in a playback progress bar of the first multimedia data in the third interface, where the first time point is the time point at which the electronic device receives the first operation.

In this method, the second information can be displayed at the position in the playback progress bar corresponding to the time point at which the first operation was received, so that the user can readily tell that the timbre corresponding to the second information is the first timbre and decide whether to operate the second information. This matches the user's usage habits and further improves the user experience.

In one possible implementation, the first time point is any time point between the electronic device receiving the first operation and the electronic device displaying the first information, for example, the time point at which the first information is displayed.
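As a minimal sketch of placing the second information on the progress bar, the first time point can be mapped linearly to a horizontal pixel offset. The function name `marker_offset_px`, the pixel-based layout, and the clamping of out-of-range values are all assumptions made for illustration; the application does not specify how the position is computed.

```python
def marker_offset_px(first_time_point_s, media_duration_s, bar_width_px):
    """Map a time point to a horizontal offset on the playback progress bar.

    Assumes a linear progress bar; the result is clamped to the bar's extent.
    """
    if media_duration_s <= 0:
        raise ValueError("media duration must be positive")
    fraction = first_time_point_s / media_duration_s
    fraction = min(max(fraction, 0.0), 1.0)  # keep the marker on the bar
    return round(fraction * bar_width_px)
```

For example, with a 120-second video and a 400-pixel bar, a first operation received at 30 seconds places the marker a quarter of the way along the bar.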

In one possible implementation, the acquiring the first timbre includes: acquiring the first timbre according to first audio, wherein the first audio is the audio data from a second time point to a third time point in the first multimedia data, the second time point is later than or equal to the start time point of the first multimedia data, and the third time point is earlier than or equal to the end time point of the first multimedia data.

In one possible implementation, the second time point is the time point at which the electronic device receives the first operation, and the third time point is later than the second time point by a second duration.

In this method, the first audio used to acquire the first timbre starts at the time point at which the electronic device receives the first operation and ends a preset second duration after that start point, rather than covering the audio of the entire first multimedia data. The first timbre is therefore acquired more accurately and better matches the user's needs, which further improves the user experience and the usage of the timbre market.
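The selection of the first audio can be illustrated with a short sketch. The function name `first_audio_window` and its parameters are hypothetical. Clamping the third time point to the end of the media when the second duration would run past it is an assumption made here so that the third time point stays earlier than or equal to the end time point.

```python
def first_audio_window(operation_time_s, second_duration_s, media_duration_s):
    """Return (second_time_point, third_time_point) in seconds.

    The window starts when the first operation is received and lasts for the
    second duration, clamped (by assumption) to the end of the media.
    """
    if not 0.0 <= operation_time_s <= media_duration_s:
        raise ValueError("operation time must lie within the media")
    if second_duration_s <= 0:
        raise ValueError("second duration must be positive")
    second_time_point = operation_time_s
    third_time_point = min(operation_time_s + second_duration_s, media_duration_s)
    return second_time_point, third_time_point
```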

In one possible implementation, the acquiring the first timbre according to the first audio includes: acquiring a voice interval according to the first audio, wherein the voice interval includes the audio data in the first audio that belongs to the human voice; and when the length of the voice interval is greater than or equal to a preset length, acquiring the first timbre according to the voice interval.

In one possible implementation, the method further includes: when the length of the voice interval is less than the preset length, displaying third information, wherein the third information indicates that the timbre recommendation has failed.

In this method, the electronic device can first acquire the audio data in the audio that belongs to the human voice (that is, the voice interval), and acquire the first timbre according to the voice interval when its length is greater than or equal to the preset length; otherwise, it prompts the user that the timbre recommendation has failed. This prevents noise in the audio, such as ambient sound, from affecting the timbre acquisition, improves the accuracy of the first timbre, and further improves the user experience and the usage of the timbre market.
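The length check on the voice interval can be sketched as follows. The application does not describe the voice detector at this point, so a simple frame-energy threshold is used here purely as a stand-in. The function names, the threshold, and the returned status strings are hypothetical.

```python
import math


def voiced_frame_flags(samples, frame_len, energy_threshold):
    """Flag each full frame whose RMS energy reaches the threshold.

    Stand-in detector (assumption): a real system would use a dedicated
    human voice detection model rather than raw energy.
    """
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        flags.append(rms >= energy_threshold)
    return flags


def check_voice_interval(samples, sample_rate, frame_len,
                         energy_threshold, preset_length_s):
    """Return (status, voiced_seconds) for the first audio."""
    flags = voiced_frame_flags(samples, frame_len, energy_threshold)
    voiced_seconds = sum(flags) * frame_len / sample_rate
    if voiced_seconds >= preset_length_s:
        return "acquire_timbre", voiced_seconds
    # Corresponds to displaying the third information.
    return "recommendation_failed", voiced_seconds
```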

In one possible implementation, the first timbre is the timbre of the human voice in the first multimedia data, and the acquiring the first timbre according to the voice interval includes: extracting a first watermark from the voice interval; and acquiring the first timbre corresponding to the first watermark.

In some examples, the watermarks corresponding to different timbres are different, and any one watermark may be used to identify the corresponding timbre.

In some examples, the acquiring the first timbre corresponding to the first watermark includes: determining the first timbre corresponding to the first watermark from a plurality of timbres of the timbre market application.

In one possible implementation, the first timbre is the timbre, among the plurality of timbres, that is most similar to the timbre of the human voice in the first multimedia data, and the acquiring the first timbre according to the voice interval includes: extracting a first voiceprint feature of the human voice from the voice interval; obtaining the similarity between the audio feature of a second timbre and the first voiceprint feature, wherein the second timbre is any one of the plurality of timbres; and acquiring, as the first timbre, the timbre among the plurality of timbres whose audio feature has the highest similarity to the first voiceprint feature.

In one possible implementation, the extracting the first voiceprint feature of the human voice from the voice interval includes: when watermark extraction from the voice interval fails, extracting the first voiceprint feature from the voice interval.

In this method, the electronic device may first try to extract, from the voice interval, a watermark that uniquely identifies a timbre. If the extraction succeeds, the first timbre corresponding to the watermark is obtained directly; if it fails, the timbre among the plurality of timbres that is most similar to the timbre of the human voice in the first multimedia data is obtained as the first timbre. In other words, the timbre of the human voice in the first multimedia data is used when it can be obtained, and a similar timbre is used when it cannot. This improves the fault tolerance of the function and the usability of the product, and further improves the user experience and the usage of the timbre market.
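The watermark-first lookup with a voiceprint fallback can be sketched as below. The application does not name a similarity measure, so cosine similarity is assumed here. The function names, the dictionary-based watermark index, and the plain-list feature vectors are hypothetical stand-ins for whatever the timbre market actually stores.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def pick_first_timbre(watermark, voiceprint, watermark_index, market_features):
    """Return the recommended timbre name.

    watermark: extracted watermark, or None if extraction failed.
    watermark_index: maps watermark -> timbre name.
    market_features: maps timbre name -> audio feature vector.
    """
    # Exact match: the watermark identifies the timbre directly.
    if watermark is not None and watermark in watermark_index:
        return watermark_index[watermark]
    # Fallback (also used, by assumption, for an unknown watermark):
    # the timbre whose feature is most similar to the voiceprint.
    return max(market_features,
               key=lambda name: cosine_similarity(voiceprint, market_features[name]))
```

The fallback scans every timbre in the market, which is adequate for a sketch; a large catalogue would more likely use an approximate nearest-neighbour index.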

In a second aspect, an embodiment of the present application provides an electronic device, including a transceiver, a processor, and a memory. The memory is configured to store a computer program, and the processor invokes the computer program to perform the timbre recommendation method provided in the first aspect or any implementation of the first aspect of the embodiments of the present application.

In a third aspect, an embodiment of the present application provides a computer storage medium storing a computer program, where the computer program, when executed by a processor, performs the timbre recommendation method provided in the first aspect or any implementation of the first aspect of the embodiments of the present application.

In a fourth aspect, an embodiment of the present application provides a computer program product which, when run on an electronic device, causes the electronic device to perform the timbre recommendation method provided in the first aspect or any implementation of the first aspect of the embodiments of the present application.

In a fifth aspect, an embodiment of the present application provides an electronic device that includes an apparatus for performing the method described in any implementation of the present application. The electronic device is, for example, a chip.

Drawings

The drawings used in the present application are described below.

Fig. 1 is a schematic diagram of a hardware structure of an electronic device provided by the present application;

Fig. 2 is a schematic diagram of a software architecture of an electronic device provided by the present application;

Fig. 3 is a schematic flowchart of a timbre recommendation method provided by the present application;

Fig. 4 is a schematic diagram of a human voice detection process provided by the present application;

Fig. 5 is a schematic diagram of a voiceprint feature extraction process provided by the present application;

Figs. 6-13 are schematic diagrams of some user interface embodiments provided by the present application.

Detailed Description

The technical solutions in the embodiments of the present application are described below with reference to the accompanying drawings. In the description of the embodiments of the present application, unless otherwise indicated, "/" means "or"; for example, A/B may represent A or B. The term "and/or" merely describes an association between associated objects and indicates that three relations may exist; for example, A and/or B may indicate three cases: A exists alone, A and B both exist, and B exists alone. In addition, in the description of the embodiments of the present application, "plural" means two or more.

The terms "first," "second," and the like are used below for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features referred to. Thus, a feature defined with "first" or "second" may explicitly or implicitly include one or more such features. In the description of the embodiments of the application, unless otherwise indicated, "a plurality" means two or more.

The timbre of the voice assistant of an electronic device, the alert tone/ringtone of the incoming call reminder, or the alert tone/ringtone of other functions can be changed according to the user's preference. The user can obtain a preferred timbre/ringtone in the following two ways:

In the first way, the timbre market application of the electronic device can provide a plurality of different timbres for the user to select and set as the timbre of the voice assistant. However, when using the timbre market, the user often needs to actively search for timbres and audition them multiple times. For example, the user enters a keyword in the search box of the timbre market and triggers a search, the timbre market returns a plurality of timbres related to the keyword, and the user then listens to these timbres one by one to judge whether each meets the requirement. This operation is cumbersome. Moreover, a keyword in text form often cannot effectively describe the timbre the user likes, so the timbres returned by the timbre market are inaccurate and may not meet the user's needs. As a result, the user obtains the desired timbre inefficiently, the usage of the timbre market tends to be low, and product usability is poor.

In the second way, the electronic device may use a microphone to collect the surrounding audio signal, identify the music/song to which the audio signal belongs based on beat information of preset music, and then provide the music/song to the user to set as the ringtone of the incoming call reminder. However, the signal collected by the microphone is usually of low quality (for example, it contains considerable noise), so the music/song may fail to be identified or may be misidentified. Moreover, this approach can identify only the music/song and cannot identify a timbre, so it cannot meet the user's needs.

The application provides a timbre recommendation method applied to an electronic device. When playing audio and/or video (hereinafter referred to as audio/video), the electronic device can, in response to an operation, acquire the timbre of the human voice in the audio/video, or a timbre similar to it, and recommend the acquired timbre to the user. Because the recommended timbre is determined according to the operation input by the user and the audio/video the user is currently watching, it meets the user's needs. Because the audio signal in the audio/video played by the electronic device is of high quality, the accuracy of timbre identification is improved. The user therefore obtains the desired timbre much more efficiently, and the user experience is improved. The timbre may be a timbre in the timbre market of the electronic device; that is, the electronic device can proactively recommend to the user a timbre in the timbre market that meets the user's needs, which increases the usage of the timbre market and improves product usability.

The timbre market can maintain watermarks for a plurality of timbres. The watermark of any timbre identifies that timbre, and the watermarks of different timbres may be different. For any timbre, its watermark can be extracted from audio in that timbre.

In the present application, the electronic device may be a mobile phone, a tablet computer, a handheld computer, a desktop computer, a laptop computer, an ultra-mobile personal computer (UMPC), a netbook, a cellular phone, a personal digital assistant (PDA), a smart home device such as a smart television or a smart camera, a wearable device such as a smart band, a smart watch, or smart glasses, an augmented reality (AR), virtual reality (VR), mixed reality (MR), or extended reality (XR) device, a vehicle-mounted device, or a smart city device.

Fig. 1 exemplarily shows a hardware configuration diagram of an electronic device 100.

The electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charge management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a sensor module 180, keys 190, a motor 191, an indicator 192, a camera 193, a display 194, a subscriber identity module (SIM) card interface 195, and the like. The sensor module 180 may include a pressure sensor 180A, a gyro sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, and the like.

It should be understood that the illustrated structure of the embodiment of the present application does not constitute a specific limitation on the electronic device 100. In other embodiments of the application, electronic device 100 may include more or fewer components than shown, or certain components may be combined, or certain components may be split, or different arrangements of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.

The processor 110 may include one or more processing units. For example, the processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU). The different processing units may be separate devices or may be integrated in one or more processors.

The controller can generate operation control signals according to the instruction operation codes and the time sequence signals to finish the control of instruction fetching and instruction execution.

A memory may also be provided in the processor 110 for storing instructions and data. In one implementation, the memory in the processor 110 is a cache memory. The memory may hold instructions or data that the processor 110 has just used or uses repeatedly. If the processor 110 needs to use the instructions or data again, it can fetch them directly from this memory. This avoids repeated accesses and reduces the waiting time of the processor 110, thereby improving system efficiency.

In one implementation, the processor 110 may include one or more interfaces. The interfaces may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (SIM) interface, and/or a universal serial bus (USB) interface, among others.

The I2C interface is a bidirectional synchronous serial bus comprising a serial data line (SDA) and a serial clock line (SCL). In one embodiment, the processor 110 may contain multiple sets of I2C buses. The processor 110 may be coupled to the touch sensor 180K, a charger, a flash, the camera 193, and the like through different I2C bus interfaces. For example, the processor 110 may be coupled to the touch sensor 180K through an I2C interface, so that the processor 110 communicates with the touch sensor 180K through the I2C bus interface to implement the touch function of the electronic device 100.

The I2S interface may be used for audio communication. In one embodiment, the processor 110 may contain multiple sets of I2S buses. The processor 110 may be coupled to the audio module 170 via an I2S bus to enable communication between the processor 110 and the audio module 170. In one embodiment, the audio module 170 may transmit an audio signal to the wireless communication module 160 through the I2S interface, so as to implement the function of answering a call through a Bluetooth headset.

The PCM interface may also be used for audio communication, sampling, quantizing, and encoding analog signals. In one embodiment, the audio module 170 and the wireless communication module 160 may be coupled through a PCM bus interface. In one embodiment, the audio module 170 may also transmit audio signals to the wireless communication module 160 through the PCM interface to implement the function of answering a call through a Bluetooth headset. Both the I2S interface and the PCM interface may be used for audio communication.
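As general background on what sampling, quantizing, and encoding mean for PCM audio, the following sketch samples a sine wave, quantizes each sample to a signed 16-bit integer, and encodes the result as little-endian bytes. The 16-bit little-endian format and the function names are assumptions chosen for illustration; the application does not specify the PCM format used by the electronic device 100.

```python
import math
import struct


def sample_sine(freq_hz, sample_rate_hz, count):
    """Sample a unit-amplitude sine wave at discrete time steps."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate_hz)
            for n in range(count)]


def encode_pcm16(samples):
    """Quantize samples in [-1.0, 1.0] to 16-bit integers and pack them."""
    quantized = []
    for x in samples:
        clipped = max(-1.0, min(1.0, x))         # guard against overshoot
        quantized.append(round(clipped * 32767))  # quantize to 16 bits
    return struct.pack("<%dh" % len(quantized), *quantized)
```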

The UART interface is a universal serial data bus for asynchronous communication. The bus may be a bidirectional communication bus. It converts the data to be transmitted between serial and parallel forms. In one embodiment, a UART interface is typically used to connect the processor 110 with the wireless communication module 160. For example, the processor 110 communicates with a Bluetooth module in the wireless communication module 160 through a UART interface to implement a Bluetooth function. In one embodiment, the audio module 170 may transmit an audio signal to the wireless communication module 160 through a UART interface, so as to implement the function of playing music through a Bluetooth headset.
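The conversion between parallel and serial data can be illustrated with the common 8N1 framing (one start bit, eight data bits sent least significant bit first, no parity, one stop bit). The choice of 8N1 and the function names are assumptions for illustration; the application does not state the UART configuration.

```python
def uart_frame_8n1(byte):
    """Serialize one byte into a 10-bit 8N1 frame (parallel to serial)."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("byte must be in the range 0..255")
    data_bits = [(byte >> i) & 1 for i in range(8)]  # least significant bit first
    return [0] + data_bits + [1]  # start bit, data bits, stop bit


def uart_unframe_8n1(bits):
    """Recover the byte from a 10-bit 8N1 frame (serial to parallel)."""
    if len(bits) != 10 or bits[0] != 0 or bits[9] != 1:
        raise ValueError("invalid 8N1 frame")
    return sum(bit << i for i, bit in enumerate(bits[1:9]))
```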

The MIPI interface may be used to connect the processor 110 to peripheral devices such as the display 194 and the camera 193. The MIPI interface includes a camera serial interface (CSI), a display serial interface (DSI), and the like. In one embodiment, the processor 110 and the camera 193 communicate through a CSI interface to implement the photographing function of the electronic device 100. The processor 110 and the display 194 communicate via a DSI interface to implement the display function of the electronic device 100.

The GPIO interface may be configured by software. The GPIO interface may be configured as a control signal or as a data signal. In one embodiment, a GPIO interface may be used to connect processor 110 with camera 193, display 194, wireless communication module 160, audio module 170, sensor module 180, and the like. The GPIO interface may also be configured as an I2C interface, an I2S interface, a UART interface, an MIPI interface, etc.

The USB interface 130 is an interface conforming to the USB standard specification, and may specifically be a Mini USB interface, a Micro USB interface, a USB Type-C interface, or the like. The USB interface 130 may be used to connect a charger to charge the electronic device 100, to transfer data between the electronic device 100 and a peripheral device, or to connect a headset and play audio through the headset. The interface may also be used to connect other electronic devices, such as AR devices.

It should be understood that the interface connection relationships between the modules illustrated in the embodiments of the present application are only illustrative and do not limit the structure of the electronic device 100. In other embodiments of the present application, the electronic device 100 may also use an interface connection manner different from those in the above embodiments, or a combination of multiple interface connection manners.

The charge management module 140 is configured to receive a charging input from a charger. The charger may be a wireless charger or a wired charger. In some wired charging embodiments, the charge management module 140 may receive the charging input of a wired charger through the USB interface 130. In some wireless charging embodiments, the charge management module 140 may receive a wireless charging input through a wireless charging coil of the electronic device 100. While charging the battery 142, the charge management module 140 may also supply power to the electronic device 100 through the power management module 141.

The power management module 141 is used to connect the battery 142 and the charge management module 140 to the processor 110. The power management module 141 receives input from the battery 142 and/or the charge management module 140 to power the processor 110, the internal memory 121, the display 194, the camera 193, the wireless communication module 160, and the like. The power management module 141 may also be configured to monitor parameters such as battery capacity, battery cycle count, and battery health status (leakage, impedance). In another embodiment, the power management module 141 may also be disposed in the processor 110. In another embodiment, the power management module 141 and the charge management module 140 may also be provided in the same device.

The wireless communication function of the electronic device 100 may be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, a modem processor, a baseband processor, and the like.

The antennas 1 and 2 are used for transmitting and receiving electromagnetic wave signals. Each antenna in the electronic device 100 may be used to cover a single or multiple communication bands. Different antennas may also be multiplexed to improve the utilization of the antennas. For example: the antenna 1 may be multiplexed into a diversity antenna of a wireless local area network. In another embodiment, the antenna may be used in conjunction with a tuning switch.

The mobile communication module 150 may provide wireless communication solutions, including 2G/3G/4G/5G, applied to the electronic device 100. The mobile communication module 150 may include at least one filter, switch, power amplifier, low noise amplifier (LNA), and the like. The mobile communication module 150 may receive electromagnetic waves from the antenna 1, perform processing such as filtering and amplification on the received electromagnetic waves, and transmit the processed electromagnetic waves to the modem processor for demodulation. The mobile communication module 150 can also amplify the signal modulated by the modem processor and convert it into electromagnetic waves radiated through the antenna 1. In one embodiment, at least some of the functional modules of the mobile communication module 150 may be disposed in the processor 110. In one embodiment, at least some of the functional modules of the mobile communication module 150 may be provided in the same device as at least some of the modules of the processor 110.

The modem processor may include a modulator and a demodulator. The modulator is used for modulating the low-frequency baseband signal to be transmitted into a medium-high frequency signal. The demodulator is used for demodulating the received electromagnetic wave signal into a low-frequency baseband signal. The demodulator then transmits the demodulated low-frequency baseband signal to the baseband processor for processing. The low-frequency baseband signal is processed by the baseband processor and then transferred to the application processor. The application processor outputs sound signals through an audio device (including but not limited to the speaker 170A, the receiver 170B, etc.), or displays images or video through the display screen 194. In one embodiment, the modem processor may be a stand-alone device. In another embodiment, the modem processor may be independent of the processor 110 and provided in the same device as the mobile communication module 150 or another functional module.

The wireless communication module 160 may provide solutions for wireless communication applied to the electronic device 100, including wireless local area network (WLAN) (e.g., a wireless fidelity (Wi-Fi) network), Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), infrared (IR), etc. The wireless communication module 160 may be one or more devices that integrate at least one communication processing module. The wireless communication module 160 receives electromagnetic waves via the antenna 2, modulates and filters the electromagnetic wave signals, and transmits the processed signals to the processor 110. The wireless communication module 160 may also receive a signal to be transmitted from the processor 110, frequency-modulate and amplify it, and convert it into electromagnetic waves for radiation via the antenna 2.

In one embodiment, the antenna 1 of the electronic device 100 is coupled to the mobile communication module 150, and the antenna 2 is coupled to the wireless communication module 160, such that the electronic device 100 may communicate with a network and other devices via wireless communication technology. The wireless communication technology may include global system for mobile communications (GSM), general packet radio service (GPRS), code division multiple access (CDMA), wideband code division multiple access (WCDMA), time-division code division multiple access (TD-SCDMA), long term evolution (LTE), BT, GNSS, WLAN, NFC, FM, and/or IR technologies, among others. The GNSS may include a global positioning system (GPS), a global navigation satellite system (GLONASS), a BeiDou navigation satellite system (BDS), a quasi-zenith satellite system (QZSS), and/or satellite based augmentation systems (SBAS).

The electronic device 100 implements display functions through a GPU, a display screen 194, an application processor, and the like. The GPU is a microprocessor for image processing, and is connected to the display 194 and the application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. Processor 110 may include one or more GPUs that execute program instructions to generate or change display information.

The display screen 194 is used to display images, videos, and the like. The display screen 194 includes a display panel. The display panel may employ a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini LED, a Micro LED, a Micro OLED, a quantum dot light-emitting diode (QLED), or the like. In one embodiment, the electronic device 100 may include 1 or N display screens 194, N being a positive integer greater than 1.

The electronic device 100 may implement photographing functions through an ISP, a camera 193, a video codec, a GPU, a display screen 194, an application processor, and the like.

The ISP is used to process data fed back by the camera 193. For example, when photographing, the shutter is opened, light is transmitted to the camera photosensitive element through the lens, the optical signal is converted into an electric signal, and the camera photosensitive element transmits the electric signal to the ISP for processing and is converted into an image visible to naked eyes. ISP can also perform algorithm optimization on noise and brightness of the image. The ISP can also optimize parameters such as exposure, color temperature and the like of a shooting scene. In one embodiment, the ISP may be provided in the camera 193.

The camera 193 is used to capture still images or video. The object generates an optical image through the lens and projects the optical image onto the photosensitive element. The photosensitive element may be a charge coupled device (charge coupled device, CCD) or a Complementary Metal Oxide Semiconductor (CMOS) phototransistor. The photosensitive element converts the optical signal into an electrical signal, which is then transferred to the ISP to be converted into a digital image signal. The ISP outputs the digital image signal to the DSP for processing. The DSP converts the digital image signal into an image signal in a standard RGB, YUV, or the like format. In one embodiment, the electronic device 100 may include 1 or N cameras 193, N being a positive integer greater than 1.

The digital signal processor is used for processing digital signals, and can process other digital signals besides digital image signals. For example, when the electronic device 100 selects a frequency bin, the digital signal processor is used to perform a Fourier transform on the frequency bin energy, or the like.

Video codecs are used to compress or decompress digital video. The electronic device 100 may support one or more video codecs. In this way, the electronic device 100 may play or record video in a variety of encoding formats, such as moving picture experts group (MPEG) 1, MPEG2, MPEG3, MPEG4, etc.

The NPU is a neural-network (NN) computing processor, and can rapidly process input information by referencing a biological neural network structure, for example, referencing a transmission mode between human brain neurons, and can also continuously perform self-learning. Applications such as intelligent awareness of the electronic device 100 may be implemented through the NPU, for example: image recognition, face recognition, speech recognition, text understanding, etc.

The external memory interface 120 may be used to connect an external memory card, such as a Micro SD card, to enable expansion of the memory capabilities of the electronic device 100. The external memory card communicates with the processor 110 through an external memory interface 120 to implement data storage functions. For example, files such as music, video, etc. are stored in an external memory card.

The internal memory 121 may be used to store computer executable program code including instructions. The internal memory 121 may include a storage program area and a storage data area. The storage program area may store the operating system, an application program required for at least one function (such as a sound playing function, an image playing function, etc.), and the like. The storage data area may store data created during use of the electronic device 100 (e.g., audio data, phonebook, etc.), and so on. In addition, the internal memory 121 may include a high-speed random access memory, and may further include a nonvolatile memory such as at least one magnetic disk storage device, a flash memory device, a universal flash storage (UFS), and the like. The processor 110 performs various functional applications of the electronic device 100 and data processing by executing instructions stored in the internal memory 121 and/or instructions stored in a memory provided in the processor.

The electronic device 100 may implement audio functions, such as music playing and recording, through an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, an application processor, and the like.

The audio module 170 is used to convert digital audio information into an analog audio signal output and also to convert an analog audio input into a digital audio signal. The audio module 170 may also be used to encode and decode audio signals. In one embodiment, the audio module 170 may be disposed in the processor 110, or a portion of the functional modules of the audio module 170 may be disposed in the processor 110.

The speaker 170A, also referred to as a "loudspeaker", is used to convert audio electrical signals into sound signals. A user may listen to music or to a hands-free call through the speaker 170A of the electronic device 100.

The receiver 170B, also referred to as an "earpiece", is used to convert audio electrical signals into sound signals. When the electronic device 100 answers a telephone call or plays a voice message, the voice may be heard by placing the receiver 170B close to the human ear.

The microphone 170C, also referred to as a "mic" or "mouthpiece", is used to convert sound signals into electrical signals. When making a call or sending voice information, the user can speak with the mouth close to the microphone 170C to input a sound signal to the microphone 170C. The electronic device 100 may be provided with at least one microphone 170C. In another embodiment, the electronic device 100 may be provided with two microphones 170C, which may implement a noise reduction function in addition to collecting sound signals. In another embodiment, the electronic device 100 may be provided with three, four, or more microphones 170C to collect sound signals, reduce noise, identify sound sources, implement directional recording, etc.

The earphone interface 170D is used to connect a wired earphone. The earphone interface 170D may be the USB interface 130, a 3.5 mm open mobile terminal platform (OMTP) standard interface, or a cellular telecommunications industry association of the USA (CTIA) standard interface.

The pressure sensor 180A is used to sense a pressure signal, and may convert the pressure signal into an electrical signal. In one embodiment, the pressure sensor 180A may be disposed on the display screen 194. There are various types of pressure sensor 180A, such as a resistive pressure sensor, an inductive pressure sensor, a capacitive pressure sensor, and the like. The capacitive pressure sensor may comprise at least two parallel plates with conductive material. The capacitance between the electrodes changes when a force is applied to the pressure sensor 180A. The electronic device 100 determines the strength of the pressure from the change in capacitance. When a touch operation is applied to the display screen 194, the electronic device 100 detects the touch operation intensity according to the pressure sensor 180A. The electronic device 100 may also calculate the location of the touch based on the detection signal of the pressure sensor 180A. In one embodiment, touch operations that act on the same touch location, but with different touch operation intensities, may correspond to different operation instructions. For example, when a touch operation whose intensity is less than a first pressure threshold acts on the short message application icon, an instruction for viewing the short message is executed; when a touch operation whose intensity is greater than or equal to the first pressure threshold acts on the short message application icon, an instruction for creating a new short message is executed.
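
The mapping from touch intensity to operation instruction described above can be sketched as follows. This is only an illustrative sketch: the function name, the instruction strings, and the numeric threshold are hypothetical, since the description names a "first pressure threshold" without giving a value or an interface.

```python
# Hypothetical value; the description only names a "first pressure threshold".
FIRST_PRESSURE_THRESHOLD = 0.5


def sms_icon_instruction(touch_intensity: float,
                         threshold: float = FIRST_PRESSURE_THRESHOLD) -> str:
    """Map the touch intensity on the short message icon to an instruction.

    Intensity below the threshold selects "view"; intensity greater than or
    equal to the threshold selects "create", as in the description.
    """
    if touch_intensity < threshold:
        return "view_short_message"
    return "create_short_message"


if __name__ == "__main__":
    print(sms_icon_instruction(0.2))
    print(sms_icon_instruction(0.5))
```

The same touch location thus yields two different instructions depending only on the measured intensity.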

The gyro sensor 180B may be used to determine a motion posture of the electronic device 100. In one embodiment, the angular velocity of the electronic device 100 about three axes (i.e., x, y, and z axes) may be determined by the gyro sensor 180B. The gyro sensor 180B may be used for photographing anti-shake. For example, when the shutter is pressed, the gyro sensor 180B detects the shake angle of the electronic device 100, calculates the distance to be compensated by the lens module according to the angle, and makes the lens counteract the shake of the electronic device 100 through reverse motion, so as to realize anti-shake. The gyro sensor 180B may also be used in navigation and motion-sensing game scenarios.

The air pressure sensor 180C is used to measure air pressure. In one embodiment, the electronic device 100 calculates altitude from the barometric pressure value measured by the barometric pressure sensor 180C, aiding in positioning and navigation.

The magnetic sensor 180D includes a Hall sensor. The electronic device 100 may detect the opening and closing of a flip holster using the magnetic sensor 180D. In one embodiment, when the electronic device 100 is a flip phone, the electronic device 100 may detect the opening and closing of the flip according to the magnetic sensor 180D. Features such as automatic unlocking upon flip opening may then be set according to the detected open or closed state of the holster or of the flip.

The acceleration sensor 180E may detect the magnitude of acceleration of the electronic device 100 in various directions (typically three axes). The magnitude and direction of gravity may be detected when the electronic device 100 is stationary. The acceleration sensor 180E may also be used to recognize the posture of the electronic device, and is applied in landscape/portrait switching, pedometers, and other applications.

The distance sensor 180F is used to measure distance. The electronic device 100 may measure the distance by infrared or laser. In one embodiment, the electronic device 100 may range using the distance sensor 180F to achieve fast focus.

The proximity light sensor 180G may include, for example, a light-emitting diode (LED) and a light detector, such as a photodiode. The light-emitting diode may be an infrared light-emitting diode. The electronic device 100 emits infrared light outward through the light-emitting diode. The electronic device 100 detects infrared reflected light from nearby objects using the photodiode. When sufficient reflected light is detected, it may be determined that there is an object in the vicinity of the electronic device 100. When insufficient reflected light is detected, the electronic device 100 may determine that there is no object in the vicinity of the electronic device 100. The electronic device 100 can use the proximity light sensor 180G to detect that the user is holding the electronic device 100 close to the ear, so as to automatically turn off the screen to save power. The proximity light sensor 180G may also be used in holster mode and pocket mode to automatically unlock and lock the screen.

The ambient light sensor 180L is used to sense ambient light level. The electronic device 100 may adaptively adjust the brightness of the display 194 based on the perceived ambient light level. The ambient light sensor 180L may also be used to automatically adjust white balance when taking a photograph. Ambient light sensor 180L may also cooperate with proximity light sensor 180G to detect whether electronic device 100 is in a pocket to prevent false touches.

The fingerprint sensor 180H is used to collect a fingerprint. The electronic device 100 may use the collected fingerprint features to implement fingerprint unlocking, application lock access, fingerprint-triggered photographing, fingerprint-based call answering, etc.

The temperature sensor 180J is used to detect temperature. In one embodiment, the electronic device 100 executes a temperature processing strategy using the temperature detected by the temperature sensor 180J. For example, when the temperature reported by the temperature sensor 180J exceeds a threshold, the electronic device 100 reduces the performance of a processor located near the temperature sensor 180J in order to reduce power consumption and implement thermal protection. In another embodiment, when the temperature is below another threshold, the electronic device 100 heats the battery 142 to avoid an abnormal shutdown of the electronic device 100 caused by low temperature. In another embodiment, when the temperature is lower than a further threshold, the electronic device 100 boosts the output voltage of the battery 142 to avoid an abnormal shutdown caused by low temperature.
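
A minimal sketch of such a temperature processing strategy is given below. It assumes, for illustration only, that the three embodiments are combined in one device; the threshold values and action names are hypothetical, since the description gives no concrete values or interfaces.

```python
from typing import List

# Hypothetical thresholds in degrees Celsius; the description gives no values.
HIGH_TEMP_THRESHOLD = 45.0          # "a threshold"
LOW_TEMP_THRESHOLD = 0.0            # "another threshold"
FURTHER_LOW_TEMP_THRESHOLD = -10.0  # "a further threshold"


def thermal_actions(temperature: float) -> List[str]:
    """Return the protective actions to take for a reported temperature."""
    actions: List[str] = []
    if temperature > HIGH_TEMP_THRESHOLD:
        # Thermal protection: lower processor performance to cut power use.
        actions.append("reduce_processor_performance")
    if temperature < LOW_TEMP_THRESHOLD:
        # Avoid abnormal shutdown at low temperature by heating the battery.
        actions.append("heat_battery")
    if temperature < FURTHER_LOW_TEMP_THRESHOLD:
        # At even lower temperature, also boost the battery output voltage.
        actions.append("boost_battery_output_voltage")
    return actions


if __name__ == "__main__":
    for t in (50.0, 20.0, -20.0):
        print(t, thermal_actions(t))
```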

The touch sensor 180K is also referred to as a "touch device". The touch sensor 180K may be disposed on the display screen 194, and the touch sensor 180K and the display screen 194 form a touch screen, which is also called a "touch-control screen". The touch sensor 180K is used to detect a touch operation acting on or near it. The touch sensor may communicate the detected touch operation to the application processor to determine the touch event type. Visual output related to touch operations may be provided through the display screen 194. In another embodiment, the touch sensor 180K may also be disposed on the surface of the electronic device 100 at a location different from that of the display screen 194.

The bone conduction sensor 180M may acquire a vibration signal. In one embodiment, the bone conduction sensor 180M may acquire a vibration signal of the bone vibrated by the human vocal part. The bone conduction sensor 180M may also contact the human pulse to receive a blood pressure pulsation signal. In one embodiment, the bone conduction sensor 180M may also be provided in a headset to form a bone conduction headset. The audio module 170 may parse out a voice signal based on the vibration signal of the bone vibrated by the vocal part, acquired by the bone conduction sensor 180M, so as to implement a voice function. The application processor may parse out heart rate information based on the blood pressure pulsation signal acquired by the bone conduction sensor 180M, so as to implement a heart rate detection function.

The keys 190 include a power key, a volume key, etc. The keys 190 may be mechanical keys or touch keys. The electronic device 100 may receive key inputs and generate key signal inputs related to user settings and function control of the electronic device 100.

The motor 191 may generate a vibration cue. The motor 191 may be used for incoming call vibration alerting as well as for touch vibration feedback. For example, touch operations acting on different applications (e.g., photographing, audio playing, etc.) may correspond to different vibration feedback effects. Touch operations acting on different areas of the display screen 194 may also correspond to different vibration feedback effects of the motor 191. Different application scenarios (such as time reminders, receiving information, alarm clocks, games, etc.) may also correspond to different vibration feedback effects. The touch vibration feedback effect may also support customization.

The indicator 192 may be an indicator light, and may be used to indicate a charging state, a change in battery level, a message, a missed call, a notification, etc.

The SIM card interface 195 is used to connect a SIM card. The SIM card may be inserted into the SIM card interface 195, or removed from the SIM card interface 195, to achieve contact with and separation from the electronic device 100. The electronic device 100 may support 1 or N SIM card interfaces, N being a positive integer greater than 1. The SIM card interface 195 may support Nano SIM cards, Micro SIM cards, and the like. Multiple cards may be inserted into the same SIM card interface 195 simultaneously. The types of the plurality of cards may be the same or different. The SIM card interface 195 may also be compatible with different types of SIM cards. The SIM card interface 195 may also be compatible with external memory cards. The electronic device 100 interacts with the network through the SIM card to realize functions such as calls and data communication. In one implementation, the electronic device 100 employs an eSIM, namely an embedded SIM card. The eSIM card can be embedded in the electronic device 100 and cannot be separated from the electronic device 100.

The software system of the electronic device 100 may employ a layered architecture, an event-driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture. For example, the software system of the layered architecture may be an Android system, an operating system (OS), or another software system. The embodiment of the application uses the Android system with a layered architecture as an example to illustrate the software structure of the electronic device 100.

Fig. 2 schematically shows a software architecture of the electronic device 100.

The layered architecture divides the software into several layers, each with a distinct role and division of labor. The layers communicate with each other through software interfaces. In one embodiment, the Android system is divided into four layers, which are, from top to bottom, an application layer, an application framework layer, the Android runtime and system libraries, and a kernel layer.

The application layer may include a series of application packages.

As shown in FIG. 2, the application packages may include applications such as calendar, call, settings, voice assistant, short message, navigation, video, music, and tone market. The applications in the present application may be replaced by other forms of software such as applets and atomic services. Multiple pieces of software in the present application may be integrated together; for example, the tone market may be integrated into settings as a function of settings.

The application framework layer provides an application programming interface (application programming interface, API) and programming framework for the application of the application layer. The application framework layer includes a number of predefined functions.

As shown in FIG. 2, the application framework layer may include a window manager, a content provider, a view system, a telephony manager, a resource manager, a notification manager, and the like.

The window manager is used for managing window programs. The window manager can acquire the size of the display screen, determine whether a status bar exists, lock the screen, capture a screenshot, and the like.

The content provider is used to store and retrieve data and make such data accessible to applications. The data may include video, images, audio, calls made and received, browsing history and bookmarks, phonebooks, etc.

The view system includes visual controls, such as controls to display text, controls to display pictures, and the like. The view system may be used to build applications. The display interface may be composed of one or more views. For example, a display interface including a text message notification icon may include a view displaying text and a view displaying a picture.

The telephony manager is used to provide the communication functions of the electronic device 100, for example, management of call status (including connected, hung up, etc.).

The resource manager provides various resources for the application program, such as localization strings, icons, pictures, layout files, video files, and the like.

The notification manager allows an application to display notification information in the status bar. It can be used to convey notification-type messages, which can automatically disappear after a short stay without requiring user interaction. For example, the notification manager is used to notify that a download is complete, to give message alerts, etc. The notification manager may also present notifications in the form of a chart or scroll-bar text in the status bar at the top of the system, such as a notification of an application running in the background, or notifications that appear on the screen in the form of a dialog window. For example, text information is prompted in the status bar, a prompt tone is emitted, the electronic device vibrates, or an indicator light blinks.

In one embodiment, the application framework layer further includes a tone recommendation module. The tone recommendation module may process audio/video played by the electronic device and acquire the tone color of a human voice in the audio/video, or a tone color similar to the tone color of a human voice in the audio/video, where the acquired tone color may be recommended to a user.

The Android runtime includes a core library and virtual machines. The Android runtime is responsible for scheduling and management of the Android system.

The core library consists of two parts: one part is the functions that the Java language needs to call, and the other part is the core library of Android.

The application layer and the application framework layer run in a virtual machine. The virtual machine executes the Java files of the application layer and the application framework layer as binary files. The virtual machine is used to perform functions such as object life cycle management, stack management, thread management, security and exception management, and garbage collection.

The system library may include a plurality of functional modules, for example: a surface manager, media libraries, a three-dimensional graphics processing library (e.g., OpenGL ES), a 2D graphics engine (e.g., SGL), etc.

The surface manager is used to manage the display subsystem and provides a fusion of 2D and 3D layers for multiple applications.

Media libraries support playback and recording of a variety of commonly used audio and video formats, as well as still image files, and the like. The media libraries may support a variety of audio and video encoding formats, such as MPEG4, H.264, MP3, AAC, AMR, JPG, PNG, etc.

The three-dimensional graphic processing library is used for realizing three-dimensional graphic drawing, image rendering, synthesis, layer processing and the like.

The 2D graphics engine is a drawing engine for 2D drawing.

The kernel layer is a layer between hardware and software. The kernel layer includes at least a display driver, a camera driver, an audio driver, and a sensor driver.

The workflow of the electronic device 100 software and hardware is illustrated below in connection with a video playback scenario.

When the touch sensor 180K receives a touch operation, a corresponding hardware interrupt is issued to the kernel layer. The kernel layer processes the touch operation into an original input event (including information such as touch coordinates and the time stamp of the touch operation). The original input event is stored at the kernel layer. The application framework layer acquires the original input event from the kernel layer and identifies the control corresponding to the input event. Take as an example a touch operation that is a tap, where the control corresponding to the tap is the play control of video 1 in the video application: the video application calls an interface of the application framework layer, which in turn calls the display driver and the audio driver of the kernel layer, so that the image of video 1 is displayed through the display screen 194 and the audio of video 1 is played through the audio module 170.

Next, a tone color recommendation method provided by the embodiment of the application is described.

Referring to fig. 3, fig. 3 is a flowchart illustrating a tone color recommendation method according to an embodiment of the present application. The method may be applied to the electronic device 100 shown in fig. 1 or the electronic device 100 shown in fig. 2. The method may include, but is not limited to, the following steps:

S101: the electronic device displays a playing interface (which may be simply referred to as a first interface) of the first audio and video.

In one embodiment, the first interface may include a first control for turning on or off an audio recommendation function for recommending a tone color of a human voice in the first audio-video or a tone color similar to a tone color of a human voice in the first audio-video.

In one embodiment, the first interface is a user interface of a first application, such as a user interface of a music application or a user interface of a video application. In some examples, the first control is a control of the first application, such as a button of the first application. In other examples, the first control is independent of the first application, such as a floating window. The present application does not limit the display form of the first control.

S102: the electronic device determines whether a tone color recommending function is on.

S102 is an optional step.

In one embodiment, when the tone color recommendation function is turned on, the electronic device may perform a tone color recommendation process based on the first audio/video, for example, performing S103-S112. In some examples, the electronic device may turn on the tone color recommendation function in response to a first operation for the first control in the first interface; for example, the electronic device turns off the tone color recommendation function by default and, upon receiving the first operation, determines that the tone color recommendation function is on. In other examples, before S101, the electronic device may turn on the tone color recommendation function in response to an operation on a control for turning on the tone color recommendation function in a setting interface, for example, a setting interface of the electronic device system or a setting interface of the first application; this may apply, for example, to a case where the duration of the first audio/video is short. In other examples, the electronic device may turn on the tone color recommendation function by default, in which case S102 may not be executed, for example, when the duration of the first audio/video is short, which is not limited by the present application.

In another embodiment, the electronic device may not perform the timbre recommendation process when the timbre recommendation function is turned off.

S103: the electronic device extracts a human voice interval in the first audio.

In an embodiment, the first audio is part or all of audio included in the first audio and video, and the first audio may be audio between a first time point and a second time point in the first audio and video, where the first time point is earlier than the second time point, the first time point is later than or equal to a start time point of the first audio and video, and the second time point is earlier than or equal to a stop time point of the first audio and video. In some examples, the first time point is a time point when the electronic device receives the first operation, that is, when the electronic device plays the audio and video at the first time point in the first audio and video, the first operation is received. In some examples, the second point in time is a point in time after the first point in time by a preset first time period.
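
The time window of the first audio described above can be sketched as follows. This is an illustrative sketch only: the function and parameter names are hypothetical, times are assumed to be in seconds, and clamping the second time point to the stop time point is an assumption made so that the stated constraints always hold.

```python
from typing import Tuple


def first_audio_window(operation_time: float,
                       first_duration: float,
                       start_time: float,
                       stop_time: float) -> Tuple[float, float]:
    """Return (first_time_point, second_time_point) of the first audio.

    The first time point is the playback position at which the first
    operation is received; the second time point lies a preset first
    duration later, but not beyond the stop time point (assumed clamping).
    """
    first_point = max(operation_time, start_time)
    second_point = min(first_point + first_duration, stop_time)
    if not first_point < second_point:
        raise ValueError("no audio left after the first time point")
    return first_point, second_point


if __name__ == "__main__":
    print(first_audio_window(30.0, 10.0, 0.0, 120.0))
    print(first_audio_window(115.0, 10.0, 0.0, 120.0))
```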

In one embodiment, the human voice interval may include the audio signals in the first audio that belong to human voice. In one embodiment, the electronic device may identify whether each frame of audio signal in the first audio belongs to human voice, for example, by using a voice activity detection (VAD) model. The VAD model may be a model for detecting human voice. In some examples, the structure of the VAD model is implemented based on long short-term memory (LSTM). For example, as shown in fig. 4, the VAD model may sequentially include, from the input side to the output side, a feature extraction (Feature Extractor) module, LSTM1, LSTM2, and LSTM3. The input of the VAD model may be any frame of audio signal (Audio) in the first audio. The frame of audio signal may be input to the feature extraction module for processing, the audio feature of the frame output by the feature extraction module may be input to LSTM1 for processing, the output of LSTM1 may be input to LSTM2 for processing, the output of LSTM2 may be input to LSTM3 for processing, and the output of LSTM3 (i.e., the output of the VAD model) may be a label (Label) whose value is 0 or 1. An output of 1 may represent that the frame of audio signal input to the VAD model belongs to human voice, and an output of 0 may represent that the frame does not belong to human voice. In some examples, after using the VAD model to identify whether each audio signal in the first audio belongs to human voice, the electronic device may collect all audio signals in the first audio that are determined to belong to human voice, and these audio signals may form the human voice interval. It is understood that the human voice interval is the set of human-voice audio signals in the first audio.
In another embodiment, the electronic device may also identify whether multiple frames of audio signals in the first audio belong to human voice at once. In some examples, all of the first audio is directly input to the VAD model. In other examples, part of the frames in the first audio may be input to the VAD model in a first pass, and the remaining frames in the first audio may be input to the VAD model in a second pass.
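The way per-frame labels are gathered into a human voice interval can be sketched as follows. The sketch is hedged: `energy_vad` is a simple energy threshold used as a stand-in for the LSTM-based VAD model of fig. 4, not that model itself, and the function names, the threshold, and the 20 ms frame duration are assumptions for illustration only.

```python
from typing import Callable, List, Sequence

Frame = Sequence[float]


def energy_vad(frame: Frame, threshold: float = 0.01) -> int:
    """Stand-in for the VAD model: return label 1 (voice) or 0 (not voice).

    Assumption: a mean-energy threshold replaces the LSTM-based model.
    """
    if not frame:
        return 0
    energy = sum(x * x for x in frame) / len(frame)
    return 1 if energy >= threshold else 0


def extract_voice_interval(
    frames: Sequence[Frame], vad: Callable[[Frame], int] = energy_vad
) -> List[Frame]:
    """Collect every frame labelled 1; together they form the voice interval."""
    return [frame for frame in frames if vad(frame) == 1]


def interval_length_s(voice_frames: Sequence[Frame], frame_duration_s: float = 0.02) -> float:
    """Length of the voice interval, assuming fixed-duration frames."""
    return len(voice_frames) * frame_duration_s
```

The resulting length is the quantity that a step such as S104 would compare against the preset length.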

In an embodiment, when the tone color recommendation function is turned on, the electronic device may extract a voice section in the first audio, for example, the electronic device obtains the audio (i.e., the first audio) between the current playing time point and the time point after the preset first duration in response to the first operation for the first control in the first interface, and extracts all the voices (i.e., the voice section) in the audio.

S104: the electronic device determines whether the length of the voice section is greater than or equal to a preset length.

In one embodiment, when the length of the voice section is less than the preset length, the electronic device may prompt the user that the timbre recommendation fails, for example, S105 is performed. In another embodiment, when the length of the human voice section is greater than or equal to the preset length, the electronic device may identify a tone color of the human voice in the human voice section or a tone color similar to a tone color of the human voice in the human voice section, for example, perform S106 to S112.

S105: the electronic device displays prompt information indicating that the tone color recommendation has failed.

S105 is an optional step. In other examples, the electronic device may play a notification of failure of the tone color recommendation, which is not limited in the present application.

S106: the electronic device performs a process of extracting the watermark based on the human voice interval.

S107: the electronic device determines whether a watermark in the vocal section is extracted.

In one embodiment, the process of extracting the watermark may include: the electronic device first performs feature extraction on the human voice interval, and then performs correlation detection between the extracted audio features and a preset watermark signal to obtain a decision statistic. When the decision statistic is greater than or equal to a preset threshold, the electronic device extracts the watermark (which may be referred to as a first watermark) in the human voice interval. When the decision statistic is less than the preset threshold, the electronic device may determine that the watermark in the human voice interval cannot be extracted, which may also be understood as no valid watermark being extracted, or as the extraction of the watermark from the human voice interval having failed. The decision statistic may indicate the degree of correlation between the audio features and the watermark signal: the greater the decision statistic, the higher the degree of correlation. The decision statistic is, for example, a correlation coefficient. The watermark signal is, for example, a signal of a specific frequency and/or frequency band; optionally, the watermark signal is related to the watermarks of the timbres in the timbre market. Without being limited thereto, in another embodiment, the electronic device may directly extract the watermark in the human voice interval. The application does not limit the specific implementation of the watermark extraction process.
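The correlation detection described above can be illustrated with a short sketch. It assumes that the decision statistic is a Pearson correlation coefficient between an audio feature vector and a preset watermark signal of the same length; the function names and the threshold value 0.8 are hypothetical and are not taken from the application.

```python
import math
from typing import Sequence, Tuple


def correlation_coefficient(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("sequences must have the same length (at least 2)")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant signal carries no correlation information
    return cov / math.sqrt(var_a * var_b)


def detect_watermark(
    audio_features: Sequence[float],
    watermark_signal: Sequence[float],
    threshold: float = 0.8,  # hypothetical preset threshold
) -> Tuple[bool, float]:
    """Return (extracted, decision_statistic).

    The watermark counts as extracted when the statistic reaches the threshold.
    """
    statistic = correlation_coefficient(audio_features, watermark_signal)
    return statistic >= threshold, statistic
```

A larger statistic means a stronger correlation, so raising the threshold trades missed watermarks for fewer false detections.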

In one embodiment, when the electronic device extracts the first watermark of the vocal section, the electronic device may acquire the first tone corresponding to the first watermark, for example, S108 is performed. In another embodiment, when the electronic device determines that the watermark in the human voice section cannot be extracted, the electronic device may acquire a tone color similar to the tone color of the human voice in the human voice section, for example, perform S110 to S112.

S108: the electronic device obtains a first tone corresponding to a first watermark in the vocal interval.

In one embodiment, the electronic device may find a first timbre corresponding to the first watermark from a plurality of timbres maintained in a timbre market.

S109: the electronic device displays first information indicating a first tone color.

In one embodiment, the first information may include a link of the first tone color, which may be used to jump to a detail interface of the first tone color, for example, a detail interface of the first tone color in a tone color market. An example of the jump procedure may be seen in S113.

In one embodiment, the electronic device may display the first information on a playing interface of the first audio and video, where the playing interface of the first audio and video is a user interface of the first application. In some examples, the first information is a display control of the first application, for example, a prompt box of the first application. In other examples, the first information is independent of the first application, for example, a floating window. The display manner of the first information is not limited by the present application.

In one embodiment, after the electronic device obtains the first tone, the first information may be displayed from the current time until a preset second duration has elapsed, after which the electronic device may cancel displaying the first information.

S110: the electronic device extracts a first voiceprint feature of a voice in a voice section.

In one embodiment, the electronic device may use a voiceprint model to extract the voiceprint feature (speaker embedding, SP) (e.g., a high-dimensional vector) of the human voice in the human voice interval, where the voiceprint model may be a model trained using deep learning techniques for extracting voiceprint features from an audio signal. In some examples, the structure of the voiceprint model may be implemented based on convolution (Conv), batch normalization (BN), and an activation function (e.g., a rectified linear unit (ReLU)). For example, as shown in fig. 5, the voiceprint model may include, in order from the input side to the output side, a feature extraction module, a multi-layer processing module 1, a multi-layer processing module 2, and a multi-layer processing module 3, where any of the multi-layer processing modules may implement Conv, BN, and ReLU processing. The input of the voiceprint model may be the audio signal (Audios) of the human voice interval. The audio signal may be input to the feature extraction module for processing, the audio feature of the audio signal output by the feature extraction module may be input to the multi-layer processing module 1 for processing, the output of the multi-layer processing module 1 may be input to the multi-layer processing module 2 for processing, the output of the multi-layer processing module 2 may be input to the multi-layer processing module 3 for processing, and the output of the multi-layer processing module 3 (i.e., the output of the voiceprint model) may be the voiceprint feature of the audio signal.

S111: the electronic device acquires a second tone color with the highest similarity to the first voiceprint feature.

In one embodiment, the electronic device may calculate a similarity of the first voiceprint feature and audio features of a plurality of timbres maintained by the timbre market and determine a corresponding second timbre of the plurality of timbres that has a highest similarity.
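A minimal sketch of this selection follows. The application does not specify the similarity measure, so cosine similarity is used here purely as an assumption; the function names and the dictionary mapping timbre names to feature vectors are likewise hypothetical.

```python
import math
from typing import Dict, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar_timbre(
    first_voiceprint: Sequence[float],
    market_features: Dict[str, Sequence[float]],
) -> Tuple[str, float]:
    """Return (timbre name, similarity) for the best match in the market."""
    if not market_features:
        raise ValueError("the timbre market holds no timbres")
    best = max(
        market_features,
        key=lambda name: cosine_similarity(first_voiceprint, market_features[name]),
    )
    return best, cosine_similarity(first_voiceprint, market_features[best])
```

With many timbres, the market features could be precomputed once so that only the first voiceprint feature has to be extracted at recommendation time.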

S112: the electronic device displays second information indicative of a second tone.

S112 and S109 are similar, and specific reference is made to the description of S109.

S113: the electronic device displays information of a first tone in response to an operation on the first information.

S113 is an optional step.

In one embodiment, the electronic device may display a second interface in response to an operation on the first information, the second interface may include information of the first tone color, such as, but not limited to, including: the name of the first tone, a control for playing audio of the first tone (which may be understood as listening to audio of the first tone), text of the audio of the first tone, a purchase price of the first tone, a control for purchasing the first tone, a control for downloading the audio of the first tone, a control for collecting the first tone, a control for setting a function that can use the first tone, and the like. It is understood that the user can browse the information of the first tone color through the first information, purchase the first tone color, download the first tone color, set a function of applying the first tone color, and the like.

S114: the electronic device sets the first tone color to a tone color usable by the first function in response to the operation.

S114 is an optional step.

In one embodiment, the electronic device may set the first tone color to a tone color usable by the first function (which may also be referred to as adding the first tone color to the first function) in response to an operation acting on the second interface, the first function including, for example and without limitation: voice assistants, navigational voices, incoming call alerts, information cues, calendar cues, alarm cues, dial cues, short message cues, time cues, etc. In one embodiment, the electronic device may set the first timbre to the timbre used by the first function in response to an operation of the user interface acting on the first function, e.g., the electronic device determines the first timbre selected by the user from among a plurality of timbres usable by the first function to be the timbre used by the first function.

In another embodiment, the electronic device may set the first tone color to the tone color used by the first function in response to an operation acting on the second interface, which is not limited by the present application.
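The two embodiments above (making a tone color usable by a function versus making it the tone color the function actually uses) can be modeled with a small data structure. This is only a sketch under assumptions: the class names `FunctionTimbres` and `TimbreSettings` and their methods are hypothetical and do not come from the application.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional


@dataclass
class FunctionTimbres:
    usable: List[str] = field(default_factory=list)  # tone colors the function may use
    in_use: Optional[str] = None                      # tone color currently used


class TimbreSettings:
    """Per-function tone color settings (e.g. voice assistant, navigation voice)."""

    def __init__(self, functions: Iterable[str]) -> None:
        self._functions: Dict[str, FunctionTimbres] = {
            name: FunctionTimbres() for name in functions
        }

    def get(self, function: str) -> FunctionTimbres:
        return self._functions[function]

    def add_usable(self, function: str, timbre: str) -> None:
        """Set the tone color as usable by the function (add it to the function)."""
        entry = self._functions[function]
        if timbre not in entry.usable:
            entry.usable.append(timbre)

    def set_in_use(self, function: str, timbre: str) -> None:
        """Set the tone color as the one used by the function.

        Assumption: setting it directly also makes it usable.
        """
        self.add_usable(function, timbre)
        self._functions[function].in_use = timbre
```

Keeping the usable list separate from the in-use value lets a function offer several purchased tone colors while only one is active.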

S115: the electronic device displays the first identifier in the playing interface of the first audio and video.

S115 is an optional step.

In one embodiment, after displaying the first information indicating the first tone, the electronic device may continue to play the first audio and video. At this time, a third interface may be displayed, and the electronic device may display the first identifier in the third interface, where the third interface does not include the first information. In some examples, the first interface and the third interface are both used to play the first audio and video, but at different playing time points, e.g., the playing time point of the first interface is earlier than that of the third interface. In some examples, after acquiring the first tone, the electronic device may display the first information in the playing interface of the first audio and video from the current time until a preset second duration has elapsed, after which the electronic device may cancel displaying the first information and display the first identifier in the playing interface of the first audio and video.

In one embodiment, the first identifier may be located at the position of a third time point in the playing progress bar of the first audio and video, where the third time point is related to the time point at which the first tone is acquired. In some examples, the third time point is any time point in the first audio, for example, the first time point or the second time point. In other examples, the third time point is the time point at which the electronic device receives the first operation for the first control in the first interface, which may also be understood as the time point at which the tone recommendation function is turned on. In other examples, the third time point is the time point at which the electronic device displays the first information indicating the first tone. The specific third time point is not limited, but it is later than the starting time point of the first audio and video and earlier than the time point at which the electronic device displays the first information indicating the first tone.

The order of S115, S113, and S114 is not limited. For example, after S113, the electronic device may return to displaying the playing interface of the first audio and video in response to an operation on the second interface, and then display the first identifier in the playing interface of the first audio and video.

S116: the electronic device displays first information in response to an operation acting on the first identifier.

S116 is an optional step.

In one embodiment, the electronic device may display first information indicating the first tone color in the third interface in response to the operation for the first identification in the third interface, and the display manner of the first information may be described in S109.

Not limited to the embodiment shown in fig. 3, in another embodiment, when the electronic device determines that the watermark in the voice section cannot be extracted, the electronic device may also prompt the user for failure of tone color recommendation, for example, S110-S112 are not performed, but S105 is performed.
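The branching among S104 to S112, including the variant just described, can be sketched as a single function. This is a hedged illustration only: the function name, the step labels returned as strings, the `fall_back_to_similarity` flag, and the treatment of a watermark with no matching tone color as an extraction failure are assumptions made here, not details stated in the application.

```python
from typing import Callable, Dict, Optional, Tuple


def recommend_timbre(
    voice_length_s: float,
    preset_length_s: float,
    watermark: Optional[str],
    watermark_to_timbre: Dict[str, str],
    find_similar_timbre: Callable[[], str],
    fall_back_to_similarity: bool = True,
) -> Tuple[str, Optional[str]]:
    """Return (display step, tone color) for one recommendation attempt.

    watermark is None when S106-S107 could not extract a watermark.
    """
    # S104: the voice interval must reach the preset length.
    if voice_length_s < preset_length_s:
        return ("S105", None)  # prompt that the recommendation failed
    # S108: look up the tone color that corresponds to the first watermark.
    # Assumption: an unknown watermark is handled like a failed extraction.
    if watermark is not None and watermark in watermark_to_timbre:
        return ("S109", watermark_to_timbre[watermark])
    # S110-S112: fall back to the most similar tone color, unless the
    # alternative embodiment (fail instead of falling back) is selected.
    if fall_back_to_similarity:
        return ("S112", find_similar_timbre())
    return ("S105", None)
```

Passing the similarity search as a callable means the voiceprint extraction only runs when the watermark path has failed.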

The operations in the present application may be, but are not limited to, touch operations, gestures, voice inputs, brain waves, etc., and the present application does not limit the specific form of the operations.

In the method shown in fig. 3, when the user finds that a favorite tone color appears in the first audio and video played by the electronic device, the user can perform the first operation. In response to the first operation, the electronic device can acquire the tone color of the voice in the first audio and video, or a tone color similar to it, and recommend that tone color to the user. The user can browse, purchase, download, and use the recommended tone color, and the recommended tone color can be a tone color in the tone color market. The user can therefore obtain a favorite tone color conveniently and quickly without manually searching the tone color market (corresponding to the mode one), which reduces operations. Compared with finding a tone color through keyword search (corresponding to the mode one), the recommended tone color is more accurate, user requirements are effectively met, and the usage rate of the tone color market is also improved.

In addition, if the user does not operate the first information within the second duration, or after the user has viewed the information of the first tone through the first information, the electronic device can cancel displaying the first information and display the first identifier, and the user can view the first information again by operating the first identifier. This prevents the first information from interfering with the user's viewing of the first audio and video. Even if the user later wants to view the information of the first tone again, the user does not need to manually adjust the playing time point of the first audio and video back to the time point at which the first operation was previously performed and then perform the first operation again. This reduces operations and improves the user experience, thereby further improving the usage rate of the tone market.

An application scenario according to the embodiments of the present application, and user interface embodiments in that scenario, are described below. The following embodiments are described by taking as an example the case where the tone color recommendation function of the electronic device 100 is turned off by default.

FIG. 6 illustrates a schematic diagram of one embodiment of a user interface.

As shown in fig. 6 (a), the electronic device 100 may display a user interface 610 of the video application. The user interface 610 may include a play box 611 (in which a title 612 is displayed, the title 612 including the characters "video 1"), a play control 613, a video progress bar 614, a play progress bar 615, duration information 616, and a control 617. The play box 611 may be used to play a video named "video 1" (hereinafter referred to simply as video 1). The play control 613 may be used to pause or play the currently playing video 1. The video progress bar 614 may indicate the overall progress of video 1, and the play progress bar 615 may indicate the progress of the video that has currently been played. The duration information 616 may include the total duration of video 1 (corresponding to the video progress bar 614) and the currently played duration (corresponding to the play progress bar 615), for example, "11:28/48:52", where "48:52" may indicate that the total duration of video 1 is 48 minutes and 52 seconds, and "11:28" may indicate that the currently played duration is 11 minutes and 28 seconds, which may also indicate that the current playing time point is 11 minutes and 28 seconds in video 1, that is, the image displayed in the play box 611 is the image at 11 minutes and 28 seconds in video 1. When the electronic device 100 plays video 1 normally, the currently played duration in the duration information 616 increases, and the play progress bar 615 lengthens accordingly. The characters "timbre recommendation" may be displayed next to the control 617. The control 617 may be used to trigger the turning on or off of the tone color recommendation function, and the control 617 shown in the user interface 610 may indicate that the tone color recommendation function is turned off. In some examples, the electronic device 100 displaying the user interface 610 may be the electronic device 100 performing S101 shown in fig. 3, where video 1 is the first audio and video described in fig. 3 and the control 617 is the first control described in fig. 3.
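The relationship between the duration information 616 and the play progress bar 615 can be illustrated with a short sketch. It assumes the "MM:SS/MM:SS" text format of the example "11:28/48:52"; the function names are hypothetical.

```python
from typing import Tuple


def parse_mmss(text: str) -> int:
    """Convert 'MM:SS' to seconds."""
    minutes, seconds = text.split(":")
    return int(minutes) * 60 + int(seconds)


def parse_duration_info(info: str) -> Tuple[int, int]:
    """Split 'played/total' duration text into (played_s, total_s)."""
    played, total = info.split("/")
    return parse_mmss(played), parse_mmss(total)


def progress_fraction(info: str) -> float:
    """Fraction of the video progress bar covered by the play progress bar."""
    played_s, total_s = parse_duration_info(info)
    if total_s <= 0:
        raise ValueError("total duration must be positive")
    return played_s / total_s
```

For "11:28/48:52" the parsed values are 688 seconds played out of 2932 seconds in total, so the play progress bar covers a little under a quarter of the video progress bar.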

In one implementation, the electronic device 100 may turn on the tone color recommendation function in response to an operation acting on the control 617 (e.g., a touch operation, such as a click operation). In some examples, when the electronic device 100 then performs S102 shown in fig. 3, it may determine that the tone color recommendation function has been turned on. Moreover, in response to the operation on the control 617, the electronic device 100 may acquire the tone color of the voice in audio 1, or a tone color similar to the tone color of the voice in audio 1, where audio 1 is the audio in video 1 from the currently played time point (11 minutes and 28 seconds) to the time point after a preset first duration (for example, the first duration is 1 minute and that time point is 12 minutes and 28 seconds). The acquired tone color may be referred to as tone color 1. In some examples, the acquisition process may be the electronic device 100 performing S102-S108 and S110-S111 shown in fig. 3, where audio 1 is the first audio described in fig. 3, 11 minutes and 28 seconds is the first time point described in fig. 3, 1 minute is the first duration, 12 minutes and 28 seconds is the second time point described in fig. 3, and tone color 1 is the first tone color or the second tone color described in fig. 3. After turning on the tone color recommendation function, the electronic device 100 may display the user interface 620 shown in fig. 6 (B); fig. 6 (B) is illustrated by taking as an example the case where the electronic device 100 is acquiring tone color 1. As shown in fig. 6 (B), the user interface 620 is similar to the user interface 610, except that the control 617 in the user interface 620 may indicate that the tone color recommendation function is turned on, and the characters "detecting" may be displayed next to the control 617 in the user interface 620, which may indicate that the electronic device 100 is currently performing the tone color recommendation process.

In one embodiment, after the embodiment shown in fig. 6, when the electronic device 100 determines that neither the tone color of the voice in audio 1 nor a tone color similar to it can be acquired, the electronic device 100 may prompt the user that the tone color recommendation has failed and may turn off the tone color recommendation function. For example, when the electronic device 100 performs S104 shown in fig. 3 and determines that the length of the voice section is less than the preset length, it may perform S105. At this time, the electronic device 100 may display the user interface 710 shown in fig. 7. The user interface 710 is similar to the user interface 610 shown in fig. 6 (a), and the control 617 in both indicates that the tone color recommendation function is turned off. The difference is that the characters "too short voice, failed recommendation" may be displayed next to the control 617 in the user interface 710. In some examples, these characters may be the prompt information of the tone color recommendation failure described in fig. 3.

In one embodiment, after the embodiment shown in fig. 6, when the electronic device 100 obtains the tone color 1, information indicating the tone color 1 may be displayed, specifically, referring to the user interface 810 shown in fig. 8, the user interface 810 is similar to the user interface 610 shown in fig. 6 (a), except that the control 617 in the user interface 810 may indicate that the tone color recommending function is turned on, and the user interface 810 further includes a control 811, the control 811 may include a character "for you to match to the tone color 1", and the control 811 may be displayed at a position 812 of a first time point (11 minutes 28 seconds) in the video progress bar 614. In some examples, displaying user interface 810 by electronic device 100 may perform S109 or S112 shown in fig. 3 for electronic device 100, where control 811 is the first information or the second information described in fig. 3.

In one embodiment, the tone color recommendation function may be turned off when the control 811 is displayed by the electronic device 100 or after the control 811 is displayed, and a specific interface example may be seen in fig. 9.

In one embodiment, after the embodiment shown in fig. 8, the electronic device 100 may cancel displaying the control 811. For example, the electronic device 100 may cancel displaying the control 811 if no operation on the control 811 is received within a second duration after the user interface 810 shown in fig. 8 is displayed; see the user interface 910 shown in fig. 9. The user interface 910 is similar to the user interface 610 shown in fig. 6 (a), and the control 617 in both indicates that the tone color recommendation function is turned off. The difference is that the duration information 616 in the user interface 910 is "14:28/48:52", which may indicate that the currently played duration is 14 minutes and 28 seconds, 3 minutes later than the played duration of 11 minutes and 28 seconds in the user interface 610. In addition, the user interface 910 includes a control 911. The control 911 may be associated with the control 811, and may be displayed at the position 912 of the first time point (11 minutes 28 seconds) in the video progress bar 614. In some examples, the electronic device 100 displaying the user interface 910 may be the electronic device 100 performing S115 shown in fig. 3, where the control 911 is the first identifier described in fig. 3. In some examples, the electronic device 100 may display the user interface 810 (where the control 811 is the first information) from the time point when tone color 1 is acquired (i.e., 11 minutes 28 seconds) to the time point after the second duration (i.e., 14 minutes 28 seconds, the second duration being 3 minutes), and if the electronic device 100 does not receive an operation on the control 811 during this period, the electronic device 100 may display the user interface 910 (where the control 911 is the first identifier) at the time point after the second duration. It will be appreciated that after tone color 1 is successfully recommended, if the user does not operate the control 811 for recommending tone color 1 within a period of time, the electronic device 100 may stop displaying the control 811, turn off the tone color recommendation function, and mark the previously recommended tone color 1 with the control 911 on the play progress bar for subsequent use by the user.
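The time-based switch from the first information (control 811) to the first identifier (control 911) can be sketched as follows. The sketch assumes that no operation on the control 811 is received during the window; the function name and the returned strings are hypothetical.

```python
def element_on_play_interface(
    now_s: float, info_shown_at_s: float, second_duration_s: float
) -> str:
    """Return which element the play interface shows at playback time now_s.

    Assumption: the user does not operate the first information in the window.
    """
    if now_s < info_shown_at_s:
        raise ValueError("now_s must not precede the display of the first information")
    if now_s - info_shown_at_s < second_duration_s:
        return "first information"  # e.g. control 811
    return "first identifier"       # e.g. control 911
```

With the values of the example (first information shown at 688 s, second duration of 180 s), the first identifier replaces the first information at 868 s, that is, at 14 minutes 28 seconds.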

In one implementation, electronic device 100 may display control 811 indicating tone 1 in response to an operation (e.g., a touch operation, such as a click operation) with respect to control 911 in user interface 910 shown in fig. 9, and in particular with respect to user interface 1010 shown in fig. 10. The user interface 1010 is similar to the user interface 910 shown in fig. 9, except that the user interface 1010 does not include the control 911 in the user interface 910, but includes the control 811 in the user interface 810 shown in fig. 8, the control 811 in the user interface 1010 being displayed in the display position of the control 911 in the user interface 910, i.e., in the position 912 at the first point in time (11 minutes 28 seconds) in the video progress bar 614. In some examples, displaying user interface 1010 by electronic device 100 may perform S116 shown in fig. 3 for electronic device 100, where control 911 is the first identification described in fig. 3 and control 811 is the first information described in fig. 3.

In one implementation, electronic device 100 may display detailed information for tone 1 in response to an operation (e.g., a touch operation, such as a click operation) with respect to control 811 in user interface 810 shown in fig. 8, and in particular with reference to user interface 1110 shown in fig. 11. The user interface 1110 may be a user interface for a tone market, and the user interface 1110 may include a name 1111 (including the characters "tone 1" and "popularity man"), a listening test control 1112, price information 1113, a collection control 1114, a purchase control 1115, an add control 1116, and a recommendation list 1117 for other tones. Among other things, listening control 1112 may be used to play some or all of tone color 1 audio. Price information 1113 includes the character "5 points," which may characterize the purchase price of timbre 1 as 5 points, and purchase control 1115 includes the character "purchase," which may be used to purchase timbre 1, which may be set to a timbre available for use by the functionality of the electronic device after timbre 1 purchase. The collection control 1114 may be used to collect timbre 1. The add control 1116 may be used to set tone color 1 as a tone color usable by the functionality of the electronic device 100 (also referred to as adding tone color 1 to the functionality of the electronic device 100). The recommendation list 1117 may include information of a plurality of timbres, such as information 1117A of "timbre 2" (including purchase price "4 points") and information 1117B of "timbre 3" (including purchase price "5 points").

Without being limited to the above-described embodiment, in another embodiment, the electronic device 100 may also display the user interface 1110 shown in fig. 11 in response to an operation (for example, the operation is a touch operation, for example, a click operation) with respect to the control 811 in the user interface 1010 shown in fig. 10.

Without being limited to the above embodiment, in another embodiment, after displaying the user interface 610 shown in fig. 6 (a), the electronic device 100 may not display the user interface 620 shown in fig. 6 (B) (which includes information indicating that the electronic device 100 is currently performing the tone color recommendation process), but may instead directly display the result of the tone color recommendation, for example, the user interface 710 shown in fig. 7 (which includes information indicating that the tone color recommendation failed) or the user interface 810 shown in fig. 8 (which includes information indicating that the tone color recommendation succeeded), which is not limited by the present application.

Without limitation to the above embodiment, in another embodiment, the control 811 or the control 911 shown in fig. 8 to 10 may also be displayed at the position of other time points (for example, the second time point) of the audio 1 in the video progress bar 614, which is not limited by the present application. In one embodiment, after the embodiment shown in fig. 11, the electronic device 100 may return to the playing interface for displaying the video 1, and mark the previously recommended tone color 1 (in this case, the tone color 1 already viewed by the user) on the playing progress bar of the playing interface, for example, to display the user interface 910 shown in fig. 9.

In one implementation, the electronic device 100 may display information of the functions to which tone color 1 can be added in response to an operation (e.g., a touch operation, such as a click operation) on the add control 1116 in the user interface 1110 shown in fig. 11; see the user interface 1210 shown in fig. 12. The user interface 1210 is similar to the user interface 1110 shown in fig. 11, except that the purchase control 1115 in the user interface 1210 includes the characters "purchased", which may indicate that tone color 1 has been purchased, and the user interface 1210 does not include the recommendation list 1117 but includes a window 1211. The window 1211 may include a title 1211A (including the characters "please select add location"), a plurality of controls for the functions to which tone color 1 can be added, a determination control 1212, and a cancel control 1213. The plurality of controls include, for example: a control 1211B (including the characters "add to voice assistant", indicating the voice assistant function), a control 1211C (including the characters "add to navigation voice", indicating the navigation voice function), and a control 1211D (including the characters "add to incoming call alert", indicating the incoming call alert function). Any of these controls may be used to select or deselect the function indicated by the control. For example, the control 1211B in the user interface 1210 indicates that the voice assistant function is selected, and the control 1211C and the control 1211D indicate that the navigation voice and incoming call alert functions are not selected. The determination control 1212 may be used to add tone color 1 to the selected function, and the cancel control 1213 may be used to cancel adding tone color 1 to a function of the electronic device 100.

Not limited to the embodiment shown in fig. 12, in another embodiment, the plurality of controls capable of adding the function of tone color 1 may further include a control indicating at least one of the following functions: information alert tones, calendar alert tones, alarm clock alert tones, dial alert tones, short message alert tones, time alert tones, etc.

In one implementation, the electronic device 100 may add tone color 1 to the selected function "voice assistant" in response to an operation (e.g., a touch operation, such as a click operation) with respect to the determination control 1212 in the user interface 1210 shown in fig. 12, particularly with reference to the user interface 1310 shown in fig. 13. The user interface 1310 may be a voice assistant user interface, the user interface 1310 may include a tone list 1311, the tone list 1311 may include a plurality of preset tone options, such as a "female" option 1312 and a "child" option 1313, where any one of the options may correspond to a selection control that may be used to select the tone indicated by the corresponding option to be the tone of the voice assistant, or to cancel the selection, such as the selection control 1312B corresponding to the option 1312 being in a selected state, to indicate that the "female" tone indicated by the currently selected option 1312 is the tone of the voice assistant. The user interface 1310 also includes a control 1314, a list of timbres 1315, and a control 1316, where the control 1314 may be used to add a custom timbre, such as by a microphone. Tone list 1315 may include at least one tone added from the tone market, such as option 1315A indicating tone 1. Control 1316 may be used to open a user interface for a tone market. In some examples, option 1315A in user interface 1310 is in an unselected state, electronic device 100 may set tone 1 indicated by option 1315A to the tone of the voice assistant in response to an operation for option 1315A (e.g., the operation is a touch operation, such as a click operation), at which point option 1315A may be in a selected state, and optionally, selection control 1312B described above changes to an unselected state. 
The application is not limited to the above embodiment. In another embodiment, the electronic device 100 may also set tone color 1 as the tone color of the selected function "voice assistant" in response to an operation (e.g., a touch operation, such as a click operation) on the determination control 1212 in the user interface 1210 shown in fig. 12, which is not limited by the present application.
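
The selection behavior described for the user interface 1310 can be illustrated with a minimal Python sketch. The class and method names (ToneOption, VoiceAssistantTones, add_from_market, select, current) are hypothetical and are not part of the application. The sketch also assumes that only one tone can be selected at a time, whereas the description states only that the previously selected control optionally changes to an unselected state.

```python
from dataclasses import dataclass, field


@dataclass
class ToneOption:
    """One entry in a tone list (hypothetical model of options such as 1312 or 1315A)."""
    name: str
    selected: bool = False


@dataclass
class VoiceAssistantTones:
    """Hypothetical model of the two tone lists on the voice assistant interface."""
    preset: list                                 # e.g. the "female" and "child" options
    added: list = field(default_factory=list)    # tone colors added from the tone market

    def _options(self):
        return self.preset + self.added

    def add_from_market(self, name, select=False):
        """Add a tone color; select=True models the variant that sets it immediately."""
        self.added.append(ToneOption(name))
        if select:
            self.select(name)

    def select(self, name):
        """Select one option and deselect all others (single-selection assumption)."""
        if all(option.name != name for option in self._options()):
            raise KeyError(name)
        for option in self._options():
            option.selected = option.name == name

    def current(self):
        """Return the name of the selected tone color, or None if nothing is selected."""
        return next((o.name for o in self._options() if o.selected), None)


tones = VoiceAssistantTones(preset=[ToneOption("female", selected=True), ToneOption("child")])
tones.add_from_market("tone color 1")   # added to the list, not yet selected
before = tones.current()                # "female"
tones.select("tone color 1")            # the user taps the new option
after = tones.current()                 # "tone color 1"
```

Passing select=True to add_from_market corresponds to the alternative embodiment in which the tone color is set as the voice assistant tone as soon as the determination control is operated.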

The method provided by the embodiments of the present application may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, the method may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, a network device, a user device, or another programmable apparatus. The computer instructions may be stored in a computer storage medium or transmitted from one computer storage medium to another computer storage medium, for example, from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or a wireless manner (e.g., infrared, radio, microwave). The computer storage medium may be any usable medium that can be accessed by a computer, or a data storage device, such as a server or data center, that integrates one or more usable media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., digital video disc (DVD)), or a semiconductor medium (e.g., solid state drive (SSD)).

The above embodiments are merely intended to illustrate the technical solutions of the present application, not to limit them. Although the present application is described with reference to the foregoing embodiments, those skilled in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced with equivalents, and that such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present application.

Claims

1. A tone color recommendation method, applied to an electronic device, comprising:

displaying a first interface, wherein the first interface is used for playing first multimedia data;

in response to a first operation on an audio recommendation control of the first interface, acquiring a first tone, wherein the first tone is the tone of the voice in the first multimedia data, or the tone, among a plurality of tones, having the maximum similarity to the tone of the voice in the first multimedia data;

displaying first information, the first information indicating the first timbre.

2. The method of claim 1, wherein the first timbre is a timbre in a timbre market application of the electronic device, and/or the plurality of timbres are timbres in the timbre market application.

3. The method of claim 1 or 2, wherein the method further comprises:

receiving a second operation for the first information;

and displaying a second interface, wherein the second interface comprises the information of the first tone.

4. The method of claim 3, wherein the method further comprises:

in response to a third operation on an add control of the second interface, setting the first timbre as a timbre usable by at least one of the following functions: voice assistant, navigation voice, incoming call reminder, information alert tone, calendar alert tone, alarm alert tone, dial alert tone, short message alert tone, and time alert tone.

5. The method of any of claims 1-4, wherein after displaying the first information, the method further comprises:

when no operation for the first information is received within a first duration, displaying a third interface, wherein the third interface is used for playing the first multimedia data, the third interface does not comprise the first information, and the third interface comprises second information;

displaying the first information in response to a fourth operation for the second information in the third interface.

6. The method of claim 5, wherein the second information is displayed, in a playback progress bar of the first multimedia data in the third interface, at the position corresponding to a first point in time, the first point in time being the point in time at which the electronic device receives the first operation.

7. The method of any one of claims 1-6, wherein acquiring the first tone comprises:

acquiring the first tone according to first audio, wherein the first audio is audio data from a second time point to a third time point in the first multimedia data, the second time point is later than or equal to a starting time point of the first multimedia data, and the third time point is earlier than or equal to an end time point of the first multimedia data.

8. The method of claim 7, wherein the second time point is the time point at which the electronic device receives the first operation, and the third time point is later than the second time point by a second duration.

9. The method of claim 7 or 8, wherein acquiring the first tone according to the first audio comprises:

acquiring a voice interval according to the first audio, wherein the voice interval comprises audio data belonging to voice in the first audio;

when the length of the voice interval is greater than or equal to a preset length, acquiring the first tone according to the voice interval.

10. The method of claim 9, wherein the method further comprises:

when the length of the voice interval is less than the preset length, displaying third information, wherein the third information indicates that tone recommendation has failed.

11. The method of claim 9 or 10, wherein the first tone is the tone of the voice in the first multimedia data, and acquiring the first tone according to the voice interval comprises:

extracting a first watermark from the voice interval;

acquiring the first tone corresponding to the first watermark.

12. The method of claim 9 or 10, wherein the first tone is the tone, among the plurality of tones, having the maximum similarity to the tone of the voice in the first multimedia data, and acquiring the first tone according to the voice interval comprises:

extracting a first voiceprint feature of the voice from the voice interval;

obtaining a similarity between an audio feature of a second tone and the first voiceprint feature, wherein the second tone is any one of the plurality of tones;

acquiring, as the first tone, the tone among the plurality of tones whose audio feature has the highest similarity to the first voiceprint feature.

13. The method of claim 12, wherein extracting the first voiceprint feature of the voice from the voice interval comprises:

when extraction of a watermark from the voice interval fails, extracting the first voiceprint feature from the voice interval.

14. An electronic device comprising a transceiver, a processor and a memory, the memory being configured to store a computer program, and the processor being configured to invoke the computer program to perform the method of any of claims 1-13.

15. A computer storage medium, characterized in that it stores a computer program which, when executed by a processor, implements the method of any of claims 1-13.
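
For illustration only, the processing chain recited in claims 7 to 13 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of the application: the function names (clip_window, cosine_similarity, recommend_tone), the watermark lookup table, and the use of cosine similarity as the similarity measure are all hypothetical, since the claims do not specify how similarity is computed. Watermark extraction and voiceprint extraction are not implemented here; their results are passed in as inputs.

```python
import math


def clip_window(operation_time, duration, media_start, media_end):
    """Claims 7-8: the window starts when the first operation is received and
    lasts `duration`, clipped so that it stays inside the multimedia data."""
    second = max(media_start, operation_time)
    third = min(media_end, second + duration)
    return second, third


def cosine_similarity(a, b):
    """Assumed similarity measure between two feature vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recommend_tone(voice_length, preset_length, watermark, watermark_table,
                   voiceprint, catalog):
    """Return (tone_name, reason); tone_name is None when recommendation fails.

    watermark       -- extracted watermark, or None if extraction failed
    watermark_table -- hypothetical mapping from watermark to tone name
    voiceprint      -- voiceprint feature vector of the voice interval
    catalog         -- hypothetical mapping from tone name to audio feature vector
    """
    # Claim 10: a voice interval shorter than the preset length means failure.
    if voice_length < preset_length:
        return None, "voice interval too short"
    # Claim 11: a recognized watermark identifies the tone directly.
    if watermark is not None and watermark in watermark_table:
        return watermark_table[watermark], "watermark"
    # Claims 12-13: otherwise pick the catalog tone most similar to the voiceprint.
    if not catalog:
        return None, "no candidate tones"
    best = max(catalog, key=lambda name: cosine_similarity(catalog[name], voiceprint))
    return best, "voiceprint"


window = clip_window(operation_time=95.0, duration=10.0, media_start=0.0, media_end=100.0)
catalog = {"tone A": [0.9, 0.1], "tone B": [0.0, 1.0]}
result = recommend_tone(3.0, 2.0, None, {}, [1.0, 0.0], catalog)
# window is (95.0, 100.0); result is ("tone A", "voiceprint")
```

Checking the watermark first and falling back to the voiceprint only when no watermark is found follows the order set out in claim 13.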