CN112214636A - Audio file recommendation method and device, electronic equipment and readable storage medium


Info

Publication number
CN112214636A
Authority
CN
China
Prior art keywords
audio
tag
event
video file
video
Prior art date
Legal status
Pending
Application number
CN202011005042.6A
Other languages
Chinese (zh)
Inventor
徐致欣
许浩维
刘永祥
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/60 Information retrieval of audio data; database and file system structures therefor
    • G06F 16/68 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/686 Retrieval using information manually generated, e.g. tags, keywords, comments, title or artist information, time, location or usage information, user ratings
    • G06F 16/63 Querying
    • G06F 16/632 Query formulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Mathematical Physics (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The application is applicable to the technical field of data processing, and provides an audio file recommendation method and apparatus, an electronic device, and a readable storage medium. The method comprises the following steps: acquiring an event tag of the shooting event corresponding to a video file; parsing the video file to obtain a content tag corresponding to the video file; and selecting, in a music library, target audio associated with the content tag and/or the event tag, and generating audio recommendation information for the video file based on the target audio. In this technical scheme, the target audio is retrieved through both the event tag and the content tag. This improves the accuracy of target audio selection, enables accurate pushing alongside personalized recommendation of audio files, improves user experience, and also strengthens the relevance between the recommended audio files and the video file.

Description

Audio file recommendation method and device, electronic equipment and readable storage medium
Technical Field
The present application belongs to the technical field of data processing, and in particular, to a method and an apparatus for recommending an audio file, an electronic device, and a readable storage medium.
Background
With the continuous development of multimedia technology, short videos and micro videos have become popular, and users can shoot video files with electronic devices in daily life to record their lives. To make a video file more interesting and enjoyable, a user can add background music to it, so how an electronic device can recommend background music that matches the video file becomes a problem that urgently needs to be solved.
In the conventional audio file recommendation technology, a plurality of different music categories are preset, such as "rock", "ballad", and "pop"; a plurality of candidate audios are fixedly associated with each category, and the candidates associated with a category are recommended according to the category selected by the user. However, this method cannot adjust the candidate audio according to the video file, its degree of personalized recommendation is low, and accurate recommendation of audio files cannot be achieved.
Disclosure of Invention
The embodiment of the application provides a recommendation method and device for audio files, electronic equipment and a readable storage medium, and can solve the problems that the existing recommendation technology for audio files cannot adjust candidate audio according to video files, the personalized recommendation degree is low, and accurate recommendation of audio files cannot be realized.
In a first aspect, an embodiment of the present application provides a method for recommending an audio file, including:
acquiring an event label of a shooting event corresponding to a video file;
analyzing the video file to obtain a content tag corresponding to the video file;
and selecting target audio associated with the content tag and/or the event tag in a music library, and generating audio recommendation information of the video file based on the target audio.
The embodiment of the application has the following beneficial effects: when an audio file associated with a video file needs to be generated, an event tag corresponding to the shooting event during which the video file was shot is acquired, a content tag is obtained based on the shot content of the video file, associated target audio is extracted from a music library through the event tag and the content tag, and audio recommendation information associated with the video file is generated, thereby achieving personalized recommendation. Compared with the existing audio file recommendation technology, the embodiment of the application generates a content tag corresponding to the video content in addition to the event tag related to the shooting event, so that the target audio can be retrieved through both types of tags. This improves the accuracy of target audio selection, enables accurate pushing alongside personalized recommendation of audio files, improves user experience, and also strengthens the relevance between the recommended audio files and the video file.
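By way of illustration only, the overall flow of the three steps above can be sketched in a few lines of Python. The data shapes used here (tag sets and a list of tagged songs) and the intersection-based matching are simplifying assumptions for readability, not the claimed scheme itself:

```python
# Minimal sketch of the claimed flow; tag sets and the library record
# shape {"title": str, "tags": set} are illustrative assumptions.
def recommend_audio(event_tags, content_tags, music_library):
    query = set(event_tags) | set(content_tags)
    # Target audio: any existing audio associated with at least one tag.
    targets = [song for song in music_library if query & song["tags"]]
    # Audio recommendation information: here simply the matched titles.
    return [song["title"] for song in targets]

library = [
    {"title": "Beach Day", "tags": {"beach", "summer"}},
    {"title": "City Nights", "tags": {"city", "night"}},
]
print(recommend_audio({"sanya", "summer"}, {"beach", "person"}, library))
# -> ['Beach Day']
```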
In one possible implementation form of the first aspect, the event tag includes a shooting location; the selecting the target audio associated with the content tag and/or the event tag in the music library comprises:
acquiring text information of each existing audio in the music library, and taking the existing audio containing the shooting place in the text information as the target audio; and/or
Acquiring a video clip associated with the shooting place, and taking the dubbing audio of the video clip as the target audio; and/or
And determining a historical event associated with the shooting place, and taking an audio file corresponding to the historical event as the target audio.
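The three strategies can be sketched as follows; the lookup tables CLIP_DUBBING and HISTORY_AUDIO are invented placeholders for whatever data sources (film or TV clip metadata, historical-event associations) an implementation would actually consult:

```python
# Hypothetical place-to-audio tables; real data sources would differ.
CLIP_DUBBING = {"Sanya": ["dubbing_of_sanya_clip"]}        # strategy 2
HISTORY_AUDIO = {"Xi'an": ["song_about_ancient_capital"]}  # strategy 3

def audio_for_place(place, music_library):
    targets = []
    # Strategy 1: text information (e.g. lyrics or title) mentions the place.
    targets += [s["title"] for s in music_library
                if place in s.get("lyrics", "") or place in s["title"]]
    # Strategy 2: dubbing audio of a video clip associated with the place.
    targets += CLIP_DUBBING.get(place, [])
    # Strategy 3: audio corresponding to a historical event at the place.
    targets += HISTORY_AUDIO.get(place, [])
    return targets
```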
In one possible implementation of the first aspect, the event tag comprises a shooting date; the selecting the target audio associated with the content tag and/or the event tag in the music library comprises:
and if the shooting date is any preset specific date, taking the audio file associated with the specific date as the target audio.
In a possible implementation manner of the first aspect, before taking the audio file associated with the specific date as the target audio, the method further includes:
acquiring user information of the currently logged-in user account;
determining the specific date based on the user information.
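As a sketch, and assuming that the user information carries a birthday field and that a fixed holiday table exists (both assumptions are illustrative only):

```python
from datetime import date

def special_dates(user_info):
    # Preset holidays plus dates derived from the user information.
    dates = {date(2021, 1, 1): "new_year_song"}      # example holiday entry
    if "birthday" in user_info:                      # assumed field name
        dates[user_info["birthday"]] = "birthday_song"
    return dates

def audio_for_date(shooting_date, user_info):
    # Returns the associated audio if the shooting date is a specific date.
    return special_dates(user_info).get(shooting_date)

print(audio_for_date(date(2021, 1, 1), {"birthday": date(1990, 5, 4)}))
# -> 'new_year_song'
```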
In one possible implementation manner of the first aspect, the selecting, within the music library, target audio associated with the content tag and/or the event tag includes:
generating an initial tag sequence corresponding to the video file from the content tag and the event tag;
generating a mapping tag sequence corresponding to the initial tag sequence based on a preset tag mapping algorithm;
configuring an associated weight value for each tag in the mapping tag sequence;
calculating the matching degree of each existing audio in the music library based on the weight value of each tag in the mapping tag sequence;
and determining the target audio from the existing audio based on the matching degree.
In one possible implementation manner of the first aspect, the determining the target audio from the existing audio based on the matching degree includes:
selecting the existing audio with the matching degree larger than a preset matching threshold as a candidate audio;
determining user characteristics of a user account based on user information of a currently logged-in user account;
and if any candidate audio is matched with the user characteristics, taking the candidate audio as the target audio.
In a possible implementation manner of the first aspect, the parsing the video file to obtain the content tag corresponding to the video file includes:
determining the picture tags contained in each video image frame of the video file;
counting the number of occurrences of each picture tag across all the video image frames of the video file, and sorting the picture tags in descending order of occurrence count to obtain a picture tag sequence;
and selecting the first N picture tags in the picture tag sequence as the content tags corresponding to the video file, where N is a positive integer.
In a possible implementation manner of the first aspect, before the selecting, in the music library, the target audio associated with the content tag and/or the event tag, the method further includes:
if authorization information of the currently logged-in user account is stored, acquiring an operation record of the user account;
generating user characteristics of the user account based on the operation record;
and extracting the existing music matched with the user characteristics from a database to generate the music library.
In a second aspect, an embodiment of the present application provides an apparatus for recommending an audio file, including:
the event tag acquisition unit is used for acquiring an event tag of a shooting event corresponding to the video file;
the content tag acquisition unit is used for analyzing the video file to obtain a content tag corresponding to the video file;
and the audio recommendation information generating unit is used for selecting target audio associated with the content tag and/or the event tag in the music library and generating audio recommendation information of the video file based on the target audio.
In a third aspect, an embodiment of the present application provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the audio file recommendation method according to any one of the above first aspects when executing the computer program.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the audio file recommendation method according to any one of the above first aspects.
In a fifth aspect, an embodiment of the present application provides a computer program product, which, when run on an electronic device, causes the electronic device to execute the method for recommending an audio file according to any one of the above first aspects.
In a sixth aspect, an embodiment of the present application provides a chip system, which includes a processor, where the processor is coupled with a memory, and the processor executes a computer program stored in the memory to implement the method for recommending an audio file according to any one of the first aspect.
It is understood that the beneficial effects of the second to sixth aspects can be seen from the description of the first aspect, and are not described herein again.
Drawings
Fig. 1 is a schematic structural diagram of an electronic device provided in an embodiment of the present application;
fig. 2 is a block diagram of a software structure of an electronic device according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an editing interface provided by an embodiment of the present application;
FIG. 4 is a flowchart illustrating an implementation of a method for recommending audio files according to an embodiment of the present application;
FIG. 5 is a flowchart illustrating a video file selection operation according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a video image frame provided by an embodiment of the present application;
FIG. 7 is a markup diagram of a content tag provided by an embodiment of the present application;
fig. 8 is a flowchart illustrating the implementation details of S402 in the audio file recommendation method according to an embodiment of the present application;
FIG. 9 is a schematic diagram illustrating the division of video frame segments according to an embodiment of the present application;
FIG. 10 is a schematic diagram illustrating a process for generating a content tag according to an embodiment of the present application;
FIG. 11 is a schematic diagram of audio recommendation information provided by an embodiment of the present application;
fig. 12 is a schematic diagram of acquisition of a shooting location according to an embodiment of the present application;
fig. 13 is a flowchart of an implementation of S403 in a method for recommending an audio file according to another embodiment of the present application;
fig. 14 is a flowchart of an implementation of S403 in a method for recommending an audio file according to another embodiment of the present application;
fig. 15 is a flowchart illustrating a specific implementation of S403 in a method for recommending an audio file according to another embodiment of the present application;
FIG. 16 is a schematic diagram illustrating the generation of a mapping tag sequence according to an embodiment of the present application;
FIG. 17 is a flowchart illustrating an implementation of a method for recommending audio files according to an embodiment of the present application;
FIG. 18 is a schematic diagram illustrating the acquisition of authorization information provided by an embodiment of the present application;
fig. 19 is a flowchart illustrating an interaction between an electronic device and a cloud server according to an embodiment of the present application;
FIG. 20 is a schematic diagram illustrating the generation of a content tag according to an embodiment of the present application;
fig. 21 is a block diagram illustrating an exemplary audio file recommendation apparatus according to an embodiment of the present disclosure;
fig. 22 is a schematic diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when", "upon", "in response to determining", or "in response to detecting". Similarly, the phrase "if it is determined" or "if [a described condition or event] is detected" may be interpreted contextually to mean "upon determining", "in response to determining", "upon detecting [the described condition or event]", or "in response to detecting [the described condition or event]".
Furthermore, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used for distinguishing between descriptions and not necessarily for describing or implying relative importance.
Reference throughout this specification to "one embodiment" or "some embodiments," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," or the like, in various places throughout this specification are not necessarily all referring to the same embodiment, but rather mean "one or more but not all embodiments" unless specifically stated otherwise. The terms "comprising," "including," "having," and variations thereof mean "including, but not limited to," unless expressly specified otherwise.
The audio file recommendation method provided by the embodiment of the application can be applied to electronic devices such as a mobile phone, a tablet personal computer, a wearable device, a vehicle-mounted device, an Augmented Reality (AR)/Virtual Reality (VR) device, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a Personal Digital Assistant (PDA), and the like, and the embodiment of the application does not limit the specific types of the electronic devices at all.
For example, the electronic device may be a station (ST) in a WLAN, a cellular phone, a cordless phone, a Session Initiation Protocol (SIP) phone, a wireless local loop (WLL) station, a personal digital assistant (PDA) device, a handheld device with wireless communication capability, a computing device or another processing device connected to a wireless modem, a computer, a laptop computer, a handheld communication device, a handheld computing device, and/or another device for communicating on a wireless system, as well as a mobile terminal in a next-generation communication system, such as a mobile terminal in a 5G network or in a future evolved public land mobile network (PLMN), and so on.
Fig. 1 shows a schematic structural diagram of an electronic device 100.
The electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a Universal Serial Bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a sensor module 180, a key 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a Subscriber Identification Module (SIM) card interface 195, and the like. The sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, and the like.
It is to be understood that the illustrated structure of the embodiment of the present invention does not specifically limit the electronic device 100. In other embodiments of the present application, electronic device 100 may include more or fewer components than shown, or some components may be combined, some components may be split, or a different arrangement of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
Processor 110 may include one or more processing units, such as: the processor 110 may include an Application Processor (AP), a modem processor, a Graphics Processing Unit (GPU), an Image Signal Processor (ISP), a controller, a video codec, a Digital Signal Processor (DSP), a baseband processor, and/or a neural-Network Processing Unit (NPU), etc. The different processing units may be separate devices or may be integrated into one or more processors.
The controller can generate an operation control signal according to the instruction operation code and the timing signal to complete the control of instruction fetching and instruction execution.
A memory may also be provided in the processor 110 for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache memory. The memory may hold instructions or data that the processor 110 has just used or uses cyclically. If the processor 110 needs to reuse the instructions or data, they can be called directly from the memory, which avoids repeated accesses, reduces the waiting time of the processor 110, and thereby improves system efficiency.
In some embodiments, processor 110 may include one or more interfaces. The interface may include an integrated circuit (I2C) interface, an integrated circuit built-in audio (I2S) interface, a Pulse Code Modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a Mobile Industry Processor Interface (MIPI), a general-purpose input/output (GPIO) interface, a Subscriber Identity Module (SIM) interface, and/or a Universal Serial Bus (USB) interface, etc.
The I2C interface is a bi-directional synchronous serial bus that includes a serial data line (SDA) and a Serial Clock Line (SCL). In some embodiments, processor 110 may include multiple sets of I2C buses. The processor 110 may be coupled to the touch sensor 180K, the charger, the flash, the camera 193, etc. through different I2C bus interfaces, respectively. For example: the processor 110 may be coupled to the touch sensor 180K via an I2C interface, such that the processor 110 and the touch sensor 180K communicate via an I2C bus interface to implement the touch functionality of the electronic device 100.
The I2S interface may be used for audio communication. In some embodiments, processor 110 may include multiple sets of I2S buses. The processor 110 may be coupled to the audio module 170 via an I2S bus to enable communication between the processor 110 and the audio module 170. In some embodiments, the audio module 170 may communicate audio signals to the wireless communication module 160 via the I2S interface, enabling answering of calls via a bluetooth headset.
The PCM interface may also be used for audio communication, sampling, quantizing and encoding analog signals. In some embodiments, the audio module 170 and the wireless communication module 160 may be coupled by a PCM bus interface. In some embodiments, the audio module 170 may also transmit audio signals to the wireless communication module 160 through the PCM interface, so as to implement a function of answering a call through a bluetooth headset. Both the I2S interface and the PCM interface may be used for audio communication.
The UART interface is a universal serial data bus used for asynchronous communications. The bus may be a bidirectional communication bus. It converts the data to be transmitted between serial communication and parallel communication. In some embodiments, a UART interface is generally used to connect the processor 110 with the wireless communication module 160. For example: the processor 110 communicates with a bluetooth module in the wireless communication module 160 through a UART interface to implement a bluetooth function. In some embodiments, the audio module 170 may transmit the audio signal to the wireless communication module 160 through a UART interface, so as to realize the function of playing music through a bluetooth headset.
MIPI interfaces may be used to connect processor 110 with peripheral devices such as display screen 194, camera 193, and the like. The MIPI interface includes a Camera Serial Interface (CSI), a Display Serial Interface (DSI), and the like. In some embodiments, processor 110 and camera 193 communicate through a CSI interface to implement the capture functionality of electronic device 100. The processor 110 and the display screen 194 communicate through the DSI interface to implement the display function of the electronic device 100.
The GPIO interface may be configured by software. The GPIO interface may be configured as a control signal and may also be configured as a data signal. In some embodiments, a GPIO interface may be used to connect the processor 110 with the camera 193, the display 194, the wireless communication module 160, the audio module 170, the sensor module 180, and the like. The GPIO interface may also be configured as an I2C interface, an I2S interface, a UART interface, a MIPI interface, and the like.
The USB interface 130 is an interface conforming to the USB standard specification, and may specifically be a Mini USB interface, a Micro USB interface, a USB Type C interface, or the like. The USB interface 130 may be used to connect a charger to charge the electronic device 100, and may also be used to transmit data between the electronic device 100 and a peripheral device. And the earphone can also be used for connecting an earphone and playing audio through the earphone. The interface may also be used to connect other electronic devices, such as AR devices and the like.
It should be understood that the connection relationship between the modules according to the embodiment of the present invention is only illustrative, and is not limited to the structure of the electronic device 100. In other embodiments of the present application, the electronic device 100 may also adopt different interface connection manners or a combination of multiple interface connection manners in the above embodiments.
The charging management module 140 is configured to receive charging input from a charger. The charger may be a wireless charger or a wired charger. In some wired charging embodiments, the charging management module 140 may receive charging input from a wired charger via the USB interface 130. In some wireless charging embodiments, the charging management module 140 may receive a wireless charging input through a wireless charging coil of the electronic device 100. The charging management module 140 may also supply power to the electronic device through the power management module 141 while charging the battery 142.
The power management module 141 is used to connect the battery 142, the charging management module 140 and the processor 110. The power management module 141 receives input from the battery 142 and/or the charge management module 140, and supplies power to the processor 110, the internal memory 121, the display 194, the camera 193, the wireless communication module 160, and the like. The power management module 141 may also be used to monitor parameters such as battery capacity, battery cycle count, battery state of health (leakage, impedance), etc. In some other embodiments, the power management module 141 may also be disposed in the processor 110. In other embodiments, the power management module 141 and the charging management module 140 may be disposed in the same device.
The wireless communication function of the electronic device 100 may be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, a modem processor, a baseband processor, and the like.
The antennas 1 and 2 are used for transmitting and receiving electromagnetic wave signals. Each antenna in the electronic device 100 may be used to cover a single or multiple communication bands. Different antennas can also be multiplexed to improve the utilization of the antennas. For example: the antenna 1 may be multiplexed as a diversity antenna of a wireless local area network. In other embodiments, the antenna may be used in conjunction with a tuning switch.
The mobile communication module 150 may provide a solution including 2G/3G/4G/5G wireless communication applied to the electronic device 100. The mobile communication module 150 may include at least one filter, a switch, a power amplifier, a Low Noise Amplifier (LNA), and the like. The mobile communication module 150 may receive the electromagnetic wave from the antenna 1, filter, amplify, etc. the received electromagnetic wave, and transmit the electromagnetic wave to the modem processor for demodulation. The mobile communication module 150 may also amplify the signal modulated by the modem processor, and convert the signal into electromagnetic wave through the antenna 1 to radiate the electromagnetic wave. In some embodiments, at least some of the functional modules of the mobile communication module 150 may be disposed in the processor 110. In some embodiments, at least some of the functional modules of the mobile communication module 150 may be disposed in the same device as at least some of the modules of the processor 110.
The modem processor may include a modulator and a demodulator. The modulator is used for modulating a low-frequency baseband signal to be transmitted into a medium-high frequency signal. The demodulator is used for demodulating the received electromagnetic wave signal into a low-frequency baseband signal. The demodulator then passes the demodulated low frequency baseband signal to a baseband processor for processing. The low frequency baseband signal is processed by the baseband processor and then transferred to the application processor. The application processor outputs a sound signal through an audio device (not limited to the speaker 170A, the receiver 170B, etc.) or displays an image or video through the display screen 194. In some embodiments, the modem processor may be a stand-alone device. In other embodiments, the modem processor may be provided in the same device as the mobile communication module 150 or other functional modules, independent of the processor 110.
The wireless communication module 160 may provide a solution for wireless communication applied to the electronic device 100, including Wireless Local Area Networks (WLANs) (e.g., wireless fidelity (Wi-Fi) networks), bluetooth (bluetooth, BT), Global Navigation Satellite System (GNSS), Frequency Modulation (FM), Near Field Communication (NFC), Infrared (IR), and the like. The wireless communication module 160 may be one or more devices integrating at least one communication processing module. The wireless communication module 160 receives electromagnetic waves via the antenna 2, performs frequency modulation and filtering processing on electromagnetic wave signals, and transmits the processed signals to the processor 110. The wireless communication module 160 may also receive a signal to be transmitted from the processor 110, perform frequency modulation and amplification on the signal, and convert the signal into electromagnetic waves through the antenna 2 to radiate the electromagnetic waves.
In some embodiments, antenna 1 of the electronic device 100 is coupled to the mobile communication module 150 and antenna 2 is coupled to the wireless communication module 160, so that the electronic device 100 can communicate with networks and other devices through wireless communication technologies. The wireless communication technologies may include global system for mobile communications (GSM), general packet radio service (GPRS), code division multiple access (CDMA), wideband code division multiple access (WCDMA), time-division code division multiple access (TD-SCDMA), long term evolution (LTE), BT, GNSS, WLAN, NFC, FM, and/or IR technologies, etc. The GNSS may include the global positioning system (GPS), the global navigation satellite system (GLONASS), the BeiDou navigation satellite system (BDS), the quasi-zenith satellite system (QZSS), and/or satellite based augmentation systems (SBAS).
The electronic device 100 implements display functions via the GPU, the display screen 194, and the application processor. The GPU is a microprocessor for image processing, and is connected to the display screen 194 and an application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. The processor 110 may include one or more GPUs that execute program instructions to generate or alter display information.
The display screen 194 is used to display images, video, and the like. The display screen 194 includes a display panel. The display panel may adopt a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini LED, a Micro LED, a Micro OLED, a quantum dot light-emitting diode (QLED), and the like. In some embodiments, the electronic device 100 may include 1 or N display screens 194, with N being a positive integer greater than 1. The display screen 194 may include a touch panel as well as other input devices.
The electronic device 100 may implement a shooting function through the ISP, the camera 193, the video codec, the GPU, the display 194, the application processor, and the like.
The ISP is used to process the data fed back by the camera 193. For example, when a photo is taken, the shutter is opened, light is transmitted to the camera photosensitive element through the lens, the optical signal is converted into an electrical signal, and the camera photosensitive element transmits the electrical signal to the ISP for processing and converting into an image visible to naked eyes. The ISP can also carry out algorithm optimization on the noise, brightness and skin color of the image. The ISP can also optimize parameters such as exposure, color temperature and the like of a shooting scene. In some embodiments, the ISP may be provided in camera 193.
The camera 193 is used to capture still images or video. The object generates an optical image through the lens and projects the optical image to the photosensitive element. The photosensitive element may be a Charge Coupled Device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The light sensing element converts the optical signal into an electrical signal, which is then passed to the ISP where it is converted into a digital image signal. And the ISP outputs the digital image signal to the DSP for processing. The DSP converts the digital image signal into image signal in standard RGB, YUV and other formats. In some embodiments, the electronic device 100 may include 1 or N cameras 193, N being a positive integer greater than 1.
The digital signal processor is used for processing digital signals, and can process digital image signals and other digital signals. For example, when the electronic device 100 selects a frequency bin, the digital signal processor is used to perform fourier transform or the like on the frequency bin energy.
Video codecs are used to compress or decompress digital video. The electronic device 100 may support one or more video codecs. In this way, the electronic device 100 may play or record video in a variety of encoding formats, such as: moving Picture Experts Group (MPEG) 1, MPEG2, MPEG3, MPEG4, and the like.
The NPU is a neural-network (NN) computing processor that processes input information quickly by using a biological neural network structure, for example, by using a transfer mode between neurons of a human brain, and can also learn by itself continuously. Applications such as intelligent recognition of the electronic device 100 can be realized through the NPU, for example: image recognition, face recognition, speech recognition, text understanding, and the like.
The external memory interface 120 may be used to connect an external memory card, such as a Micro SD card, to extend the memory capability of the electronic device 100. The external memory card communicates with the processor 110 through the external memory interface 120 to implement a data storage function. For example, files such as music, video, etc. are saved in an external memory card.
The internal memory 121 may be used to store computer-executable program code, which includes instructions. The internal memory 121 may include a program storage area and a data storage area. The storage program area may store an operating system, an application program (such as a sound playing function, an image playing function, etc.) required by at least one function, and the like. The storage data area may store data (such as audio data, phone book, etc.) created during use of the electronic device 100, and the like. In addition, the internal memory 121 may include a high-speed random access memory, and may further include a nonvolatile memory, such as at least one magnetic disk storage device, a flash memory device, a universal flash memory (UFS), and the like. The processor 110 executes various functional applications of the electronic device 100 and data processing by executing instructions stored in the internal memory 121 and/or instructions stored in a memory provided in the processor.
The electronic device 100 may implement audio functions via the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the headphone interface 170D, and the application processor. Such as music playing, recording, etc.
The audio module 170 is used to convert digital audio information into an analog audio signal output and also to convert an analog audio input into a digital audio signal. The audio module 170 may also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be disposed in the processor 110, or some functional modules of the audio module 170 may be disposed in the processor 110.
The speaker 170A, also called a "horn", is used to convert the audio electrical signal into an acoustic signal. The electronic apparatus 100 can listen to music through the speaker 170A or listen to a handsfree call.
The receiver 170B, also called "earpiece", is used to convert the electrical audio signal into an acoustic signal. When the electronic apparatus 100 receives a call or voice information, it can receive voice by placing the receiver 170B close to the ear of the person.
The microphone 170C, also referred to as a "mic", is used to convert sound signals into electrical signals. When making a call or sending voice information, the user can input a sound signal to the microphone 170C by speaking with the mouth close to the microphone 170C. The electronic device 100 may be provided with at least one microphone 170C. In other embodiments, the electronic device 100 may be provided with two microphones 170C to achieve a noise reduction function in addition to collecting sound signals. In other embodiments, the electronic device 100 may further include three, four, or more microphones 170C to collect sound signals, reduce noise, identify sound sources, perform directional recording, and so on.
The headphone interface 170D is used to connect a wired headphone. The headset interface 170D may be the USB interface 130, or may be a 3.5mm open mobile electronic device platform (OMTP) standard interface, a cellular telecommunications industry association (cellular telecommunications industry association of the USA, CTIA) standard interface.
The pressure sensor 180A is used for sensing a pressure signal, and converting the pressure signal into an electrical signal. In some embodiments, the pressure sensor 180A may be disposed on the display screen 194. The pressure sensor 180A can be of a wide variety, such as a resistive pressure sensor, an inductive pressure sensor, a capacitive pressure sensor, and the like. The capacitive pressure sensor may be a sensor comprising at least two parallel plates having an electrically conductive material. When a force acts on the pressure sensor 180A, the capacitance between the electrodes changes. The electronic device 100 determines the strength of the pressure from the change in capacitance. When a touch operation is applied to the display screen 194, the electronic apparatus 100 detects the intensity of the touch operation according to the pressure sensor 180A. The electronic apparatus 100 may also calculate the touched position from the detection signal of the pressure sensor 180A. In some embodiments, the touch operations that are applied to the same touch position but different touch operation intensities may correspond to different operation instructions. For example: and when the touch operation with the touch operation intensity smaller than the first pressure threshold value acts on the short message application icon, executing an instruction for viewing the short message. And when the touch operation with the touch operation intensity larger than or equal to the first pressure threshold value acts on the short message application icon, executing an instruction of newly building the short message.
The gyro sensor 180B may be used to determine the motion attitude of the electronic device 100. In some embodiments, the angular velocity of electronic device 100 about three axes (i.e., the x, y, and z axes) may be determined by gyroscope sensor 180B. The gyro sensor 180B may be used for photographing anti-shake. For example, when the shutter is pressed, the gyro sensor 180B detects a shake angle of the electronic device 100, calculates a distance to be compensated for by the lens module according to the shake angle, and allows the lens to counteract the shake of the electronic device 100 through a reverse movement, thereby achieving anti-shake. The gyroscope sensor 180B may also be used for navigation, somatosensory gaming scenes.
The air pressure sensor 180C is used to measure air pressure. In some embodiments, electronic device 100 calculates altitude, aiding in positioning and navigation, from barometric pressure values measured by barometric pressure sensor 180C.
The magnetic sensor 180D includes a Hall sensor. The electronic device 100 may detect the opening and closing of a flip holster using the magnetic sensor 180D. In some embodiments, when the electronic device 100 is a flip phone, the electronic device 100 may detect the opening and closing of the flip according to the magnetic sensor 180D. Features such as automatic unlocking upon flip opening can then be set according to the detected opening and closing state of the holster or of the flip cover.
The acceleration sensor 180E may detect the magnitude of acceleration of the electronic device 100 in various directions (typically three axes). The magnitude and direction of gravity can be detected when the electronic device 100 is stationary. The method can also be used for recognizing the posture of the electronic equipment, and is applied to horizontal and vertical screen switching, pedometers and other applications.
A distance sensor 180F for measuring a distance. The electronic device 100 may measure the distance by infrared or laser. In some embodiments, taking a picture of a scene, electronic device 100 may utilize range sensor 180F to range for fast focus.
The proximity light sensor 180G may include, for example, a light-emitting diode (LED) and a light detector such as a photodiode. The light-emitting diode may be an infrared light-emitting diode. The electronic device 100 emits infrared light outward through the light-emitting diode and uses the photodiode to detect infrared light reflected from nearby objects. When sufficient reflected light is detected, it can be determined that there is an object near the electronic device 100; when insufficient reflected light is detected, the electronic device 100 may determine that there is no object nearby. The electronic device 100 can use the proximity light sensor 180G to detect that the user is holding the electronic device 100 close to the ear for a call, so as to automatically turn off the screen and save power. The proximity light sensor 180G may also be used in holster mode and pocket mode to automatically unlock and lock the screen.
The ambient light sensor 180L is used to sense the ambient light level. Electronic device 100 may adaptively adjust the brightness of display screen 194 based on the perceived ambient light level. The ambient light sensor 180L may also be used to automatically adjust the white balance when taking a picture. The ambient light sensor 180L may also cooperate with the proximity light sensor 180G to detect whether the electronic device 100 is in a pocket to prevent accidental touches.
The fingerprint sensor 180H is used to collect a fingerprint. The electronic device 100 can utilize the collected fingerprint characteristics to unlock the fingerprint, access the application lock, photograph the fingerprint, answer an incoming call with the fingerprint, and so on.
The temperature sensor 180J is used to detect temperature. In some embodiments, the electronic device 100 implements a temperature processing strategy using the temperature detected by the temperature sensor 180J. For example, when the temperature reported by the temperature sensor 180J exceeds a threshold, the electronic device 100 reduces the performance of a processor located near the temperature sensor 180J, so as to reduce power consumption and implement thermal protection. In other embodiments, the electronic device 100 heats the battery 142 when the temperature is below another threshold, to avoid an abnormal shutdown caused by low temperature. In other embodiments, when the temperature is lower than a further threshold, the electronic device 100 boosts the output voltage of the battery 142 to avoid an abnormal shutdown caused by low temperature.
The touch sensor 180K is also called a "touch device". The touch sensor 180K may be disposed on the display screen 194, and the touch sensor 180K and the display screen 194 form a touch screen, which is also called a "touch screen". The touch sensor 180K is used to detect a touch operation applied thereto or nearby. The touch sensor can communicate the detected touch operation to the application processor to determine the touch event type. Visual output associated with the touch operation may be provided through the display screen 194. In other embodiments, the touch sensor 180K may be disposed on a surface of the electronic device 100, different from the position of the display screen 194.
The bone conduction sensor 180M may acquire a vibration signal. In some embodiments, the bone conduction sensor 180M may acquire a vibration signal of the human vocal part vibrating the bone mass. The bone conduction sensor 180M may also contact the human pulse to receive the blood pressure pulsation signal. In some embodiments, the bone conduction sensor 180M may also be disposed in a headset, integrated into a bone conduction headset. The audio module 170 may analyze a voice signal based on the vibration signal of the bone mass vibrated by the sound part acquired by the bone conduction sensor 180M, so as to implement a voice function. The application processor can analyze heart rate information based on the blood pressure beating signal acquired by the bone conduction sensor 180M, so as to realize the heart rate detection function.
The keys 190 include a power-on key, a volume key, and the like. The keys 190 may be mechanical keys. Or may be touch keys. The electronic apparatus 100 may receive a key input, and generate a key signal input related to user setting and function control of the electronic apparatus 100.
The motor 191 may generate a vibration cue. The motor 191 may be used for incoming call vibration cues, as well as for touch vibration feedback. For example, touch operations applied to different applications (e.g., photographing, audio playing, etc.) may correspond to different vibration feedback effects. The motor 191 may also respond to different vibration feedback effects for touch operations applied to different areas of the display screen 194. Different application scenes (such as time reminding, receiving information, alarm clock, game and the like) can also correspond to different vibration feedback effects. The touch vibration feedback effect may also support customization.
Indicator 192 may be an indicator light that may be used to indicate a state of charge, a change in charge, or a message, missed call, notification, etc.
The SIM card interface 195 is used to connect a SIM card. The SIM card can be brought into and out of contact with the electronic device 100 by being inserted into or pulled out of the SIM card interface 195. The electronic device 100 may support 1 or N SIM card interfaces, N being a positive integer greater than 1. The SIM card interface 195 may support a Nano SIM card, a Micro SIM card, a standard SIM card, etc. Multiple cards can be inserted into the same SIM card interface 195 at the same time; the types of the cards may be the same or different. The SIM card interface 195 may also be compatible with different types of SIM cards, as well as with external memory cards. The electronic device 100 interacts with the network through the SIM card to implement functions such as calls and data communication. In some embodiments, the electronic device 100 employs an eSIM, i.e., an embedded SIM card; the eSIM card can be embedded in the electronic device 100 and cannot be separated from it.
The software system of the electronic device 100 may employ a layered architecture, an event-driven architecture, a micro-core architecture, a micro-service architecture, or a cloud architecture. The embodiment of the present invention uses an Android system with a layered architecture as an example to exemplarily illustrate a software structure of the electronic device 100.
Fig. 2 is a block diagram of a software structure of an electronic device according to an embodiment of the present application.
The layered architecture divides the software into several layers, each layer having a clear role and division of labor. The layers communicate with each other through a software interface. In some embodiments, the Android system is divided into four layers, an application layer, an application framework layer, an Android runtime (Android runtime) and system library, and a kernel layer from top to bottom.
The application layer may include a series of application packages.
As shown in fig. 2, the application package may include applications such as camera, gallery, calendar, phone call, map, navigation, WLAN, bluetooth, music, video, short message, etc.
The application framework layer provides an Application Programming Interface (API) and a programming framework for the application program of the application layer. The application framework layer includes a number of predefined functions.
As shown in FIG. 2, the application framework layers may include a window manager, content provider, view system, phone manager, resource manager, notification manager, and the like.
The window manager is used for managing window programs. The window manager can obtain the size of the display screen, judge whether a status bar exists, lock the screen, intercept the screen and the like.
The content provider is used to store and retrieve data and make it accessible to applications. The data may include video, images, audio, calls made and received, browsing history and bookmarks, phone books, etc.
The view system includes visual controls such as controls to display text, controls to display pictures, and the like. The view system may be used to build applications. The display interface may be composed of one or more views. For example, the display interface including the short message notification icon may include a view for displaying text and a view for displaying pictures.
The phone manager is used to provide communication functions of the electronic device. Such as management of call status (including on, off, etc.).
The resource manager provides various resources for the application, such as localized strings, icons, pictures, layout files, video files, and the like.
The notification manager enables the application to display notification information in the status bar, can be used to convey notification-type messages, can disappear automatically after a short dwell, and does not require user interaction. Such as a notification manager used to inform download completion, message alerts, etc. The notification manager may also be a notification that appears in the form of a chart or scroll bar text at the top status bar of the system, such as a notification of a background running application, or a notification that appears on the screen in the form of a dialog window. For example, prompting text information in the status bar, sounding a prompt tone, vibrating the electronic device, flashing an indicator light, etc.
The Android runtime comprises a core library and a virtual machine, and is responsible for scheduling and managing the Android system.
The core library comprises two parts: one part contains the functions that the Java language needs to call, and the other part is the core library of Android.
The application layer and the application framework layer run in the virtual machine. The virtual machine executes the Java files of the application layer and the application framework layer as binary files, and performs functions such as object life-cycle management, stack management, thread management, security and exception management, and garbage collection.
The system library may include a plurality of functional modules. For example: surface managers (surface managers), Media Libraries (Media Libraries), three-dimensional graphics processing Libraries (e.g., OpenGL ES), 2D graphics engines (e.g., SGL), and the like.
The surface manager is used to manage the display subsystem and provide fusion of 2D and 3D layers for multiple applications.
The media library supports a variety of commonly used audio, video format playback and recording, and still image files, among others. The media library may support a variety of audio-video encoding formats, such as MPEG4, h.264, MP3, AAC, AMR, JPG, PNG, and the like.
The three-dimensional graphic processing library is used for realizing three-dimensional graphic drawing, image rendering, synthesis, layer processing and the like.
The 2D graphics engine is a drawing engine for 2D drawing.
The kernel layer is a layer between hardware and software. The inner core layer at least comprises a display driver, a camera driver, an audio driver and a sensor driver.
The following describes exemplary workflow of the software and hardware of the electronic device 100 in connection with capturing a photo scene.
When the touch sensor 180K receives a touch operation, a corresponding hardware interrupt is issued to the kernel layer. The kernel layer processes the touch operation into a raw input event (including touch coordinates, a time stamp of the touch operation, and other information). The raw input event is stored at the kernel layer. The application framework layer acquires the raw input event from the kernel layer and identifies the control corresponding to the input event. Taking a touch click operation whose corresponding control is the camera application icon as an example: the camera application calls an interface of the application framework layer to start the camera application, which in turn starts the camera driver by calling the kernel layer, and a still image or video is captured through the camera 193.
The first embodiment is as follows:
the electronic device can output an editing interface for a video file, receive a user's editing operations in the editing interface, and modify the video file based on those operations. Exemplarily, fig. 3 shows a schematic diagram of an editing interface provided by an embodiment of the present application. Referring to fig. 3 (a), the editing interface includes a filter setting control 301, a subtitle setting control 302, and a background music setting control 303; according to the editing requirement, the user may click the corresponding control so that the electronic device displays the corresponding setting or editing interface. For example, when background music needs to be configured for a video file, the electronic device receives the user's click operation on the background music setting control 303 and switches to the background music setting page shown in (b) of fig. 3. The setting page of the background music includes a plurality of music categories, namely "popular recommendation", "popular music", "rock music", "ballad music", and "pure music"; each music category is fixedly associated with at least one candidate music, and the electronic device may receive a selection instruction initiated by the user and expand the music recommendation list corresponding to the music category, as shown in (c) of fig. 3. The music classification does not change with the content of the video file, and the candidate music associated with each classification is relatively fixed, which reduces the degree of personalization of the music recommendation. Moreover, this recommendation method requires the user to manually determine the music classification before the recommendation list of associated candidate music is displayed, so the operation is cumbersome and the efficiency with which the user selects background music is reduced.
Therefore, in order to overcome the above drawbacks of existing audio file recommendation techniques, the present application provides a recommendation method for an audio file, detailed as follows: referring to fig. 4, the execution subject of the audio file recommendation method is an electronic device, which may be a smart phone, a tablet computer, a smart game console, or any device configured with a display module. Fig. 4 shows a flowchart of an implementation of a recommendation method for an audio file according to an embodiment of the present application, detailed as follows:
in S401, an event tag of a shooting event corresponding to the video file is acquired.
In this embodiment, the electronic device may receive a recommendation request for background music, where the recommendation request includes a file identifier of a video file that needs to configure the background music. The electronic device may obtain a file identifier of the video file by analyzing the recommendation request, and obtain the video file and a shooting record of a shooting event associated with the video file based on the file identifier.
In one possible implementation, the electronic device may be configured with an editing application for video files. Illustratively, fig. 5 shows a flowchart of a selecting operation of a video file according to an embodiment of the present application. Referring to (a) of fig. 5, an application icon, i.e., an icon 501, of an editing application of a video file may be displayed in a main interface of the electronic device. When the electronic device detects that the user clicks the icon 501, the electronic device starts the editing application and outputs a video file selection interface, as shown in fig. 5 (b). The electronic device may scan the memory, obtain candidate videos included in the electronic device, and display thumbnails of all the candidate videos in the video file selection interface. The electronic device may receive a selection operation initiated by a user, identify that a candidate video selected by the user is a target video, that is, a video file mentioned above, and switch to an interface for editing the video file, as shown in fig. 5 (c).
In a possible implementation manner, when the electronic device starts an editing application, the electronic device may prompt a user to input login information, and obtain a video database associated with a user account based on the login information, where a plurality of candidate videos corresponding to the user account are stored in the video database, and each candidate video may be stored in a local memory of the electronic device, may also be stored in a cloud server, and of course, may also be stored in an external storage device. By establishing the video database associated with the user account, all candidate videos do not need to be stored in a local memory of the electronic equipment, the storage pressure of the memory is reduced, and the number of editable video files can be increased.
In this embodiment, the shooting event specifically refers to the event in which the video file was shot, and the event tag is a tag related to that shooting event. The event tag specifically describes information related to the process of shooting the video file and is unrelated to the content of the video file. Event tags include, but are not limited to: the shooting location, the shooting time, the user who performed the shooting operation, the shooting device, and the like.
In a possible implementation manner, the event tag may be encapsulated in a data packet corresponding to the video file when the video file is generated, for example, the data packet corresponding to the video file includes at least two parts, which are a header part and a content part, respectively, where the header part is specifically used to store a video identifier, shooting event related information, and the like, and the content part is specifically used to store video data, that is, each video image frame and a corresponding audio track. In this case, the electronic device may parse the video file, obtain the content of the header portion within the video file, obtain information related to the shooting event, and generate an event tag based on the information.
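The header-parsing step above can be illustrated with a minimal sketch. The two-part packet layout follows the paragraph above, but the concrete encoding (a 4-byte length prefix followed by a JSON header) and the field names are assumptions made purely for illustration; the patent does not prescribe a container format.

```python
import json

def read_event_tags(packet: bytes) -> dict:
    # Assumed layout: 4-byte big-endian header length, a JSON header part,
    # then the content part (video image frames and audio track).
    header_len = int.from_bytes(packet[:4], "big")
    header = json.loads(packet[4:4 + header_len])
    # Keep only shooting-event-related fields; the content part is not read.
    return {k: v for k, v in header.items()
            if k in ("location", "time", "user", "device")}

header = json.dumps({"location": "Guangzhou", "time": "2019-07-05T14:30"}).encode()
packet = len(header).to_bytes(4, "big") + header + b"<video data>"
print(read_event_tags(packet))  # {'location': 'Guangzhou', 'time': '2019-07-05T14:30'}
```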
In one possible implementation, the electronic device may be used to shoot the video file described above. In this case, the electronic device may be configured with a positioning module, which may specifically be a global positioning system (GPS) module, and a clock module. The electronic device can acquire position information through the positioning module and time information through the clock module; when the start of recording is detected, the position information and time information at the recording start moment can be acquired and encapsulated in the data packet of the currently recorded video file, so as to record the information related to the shooting event. If the electronic device is logged in to a user account when recording the video file, the account identifier of the user account can be added to the data packet, so that the account is recorded as the shooting user of the video file.
In one possible implementation, if no information about the shooting event is recorded in the video file, the electronic device may output a blank tag or configure a preset default tag.
In one possible implementation manner, if information about the shooting event is not recorded in the video file, the electronic device parses the video file and configures a corresponding event tag based on the video content. For example, the electronic device can acquire the background area image of each video image frame in the video file and determine the shooting location corresponding to the video file by identifying shooting objects such as buildings or natural landscapes contained in the background area image. Fig. 6 shows a schematic diagram of a video image frame provided by an embodiment of the present application. Referring to fig. 6, the video file does not record information about the shooting location, but the background area image of one of its video image frames contains the landmark building "Guangzhou Tower", so the shooting location can be identified as "Guangzhou", and more specifically as "Guangzhou Tower".
In one possible implementation, the electronic device may configure the event label with a corresponding whitelist. The method comprises the steps that electronic equipment obtains information of a shooting event related to a video file, obtains a plurality of candidate labels related to the shooting event, then judges whether the candidate labels are in a preset white list or not, and if yes, the candidate labels are used as event labels; otherwise, if the candidate tag is not in the white list, the candidate tag is regarded as an invalid tag. Since not all tags related to the shooting event contribute to the selection of the audio file, in order to reduce the matching operation between the invalid event tags and the audio file, a corresponding white list may be configured, the invalid tags may be filtered, and the valid event tags may be retained.
In a possible implementation manner, the white list may be specifically generated according to music tags associated with each candidate audio in the music library. The candidate audio in the music library may be configured with associated music tags for the audio according to lyrics, song names, melody characteristics, and the like included in the audio. The electronic equipment can acquire the music labels associated with the candidate audios in the music library and sort all the music labels to generate the white list, so that the candidate labels associated with the audios can be selected from all the candidate labels corresponding to the shooting events to serve as event labels, and the accuracy of the event labels is improved.
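A minimal sketch of this whitelist construction and filtering follows, assuming the music library exposes the music tags associated with each candidate audio; the data layout is illustrative only.

```python
def build_whitelist(music_library):
    # Collect every music tag that appears on at least one candidate audio.
    return {tag for audio in music_library for tag in audio["tags"]}

def filter_event_tags(candidate_tags, whitelist):
    # Keep candidate tags that some audio could match; the rest are invalid.
    return [tag for tag in candidate_tags if tag in whitelist]

music_library = [
    {"name": "Song A", "tags": {"beach", "summer"}},
    {"name": "Song B", "tags": {"night", "city"}},
]
whitelist = build_whitelist(music_library)
print(filter_event_tags(["beach", "14:30", "summer"], whitelist))  # ['beach', 'summer']
```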
In S402, the video file is analyzed to obtain a content tag corresponding to the video file.
In this embodiment, the electronic device may parse the video file and determine the content tags related to the shot content of the video file. Exemplarily, fig. 7 shows a labeling diagram of content tags provided by an embodiment of the present application. Referring to fig. 7, the electronic device may parse the video file, obtain each video image frame contained in it, and identify the shooting objects in each frame; for example, the picture shown in fig. 7 contains shooting objects such as children, trees, and balloons, and each shooting object is used as a content tag. The weather or time period corresponding to a video image frame, for example daytime, may also be determined from its color information. Thus, by identifying each video image frame in the video file, the content tags corresponding to the video file can be obtained.
In one possible implementation, the electronic device may import each video image frame of the video file into a trained multiple convolutional neural network, which outputs the content tags corresponding to the video image frame. The multiple convolutional neural network comprises a plurality of convolution layers and a fully connected layer. Each convolution layer is configured with a corresponding convolution kernel; the video image frame is convolved by the kernels, the corresponding feature vector is output, and the feature vector output by the last convolution layer is imported into the fully connected layer, where the matching probability with each existing tag preconfigured in the fully connected layer is calculated. Existing tags whose matching probability is greater than a preset probability threshold are selected as the content tags of the video image frame. The multiple convolutional neural network can be obtained by training a neural network with a large number of pictures and associated training tags, which improves the accuracy of content tag identification. The existing tags configured in the fully connected layer may be obtained based on the music tags of each candidate audio in the music library (the manner of obtaining the music tags is described above and not repeated here), or may be set independently, which is not limited here.
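A minimal PyTorch sketch of such a multi-label tagging network follows: stacked convolution layers producing a feature vector, a fully connected layer over the preconfigured existing tags, and a probability threshold. The layer sizes, the tag vocabulary, and the 0.5 threshold are illustrative assumptions, not the trained network described above.

```python
import torch
import torch.nn as nn

EXISTING_TAGS = ["child", "beach", "sea water", "sky", "dog"]  # assumed vocabulary

class FrameTagger(nn.Module):
    def __init__(self, num_tags=len(EXISTING_TAGS)):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),      # global pooling -> fixed-size feature vector
        )
        self.fc = nn.Linear(32, num_tags)  # one matching probability per existing tag

    def forward(self, frame):
        feat = self.features(frame).flatten(1)
        return torch.sigmoid(self.fc(feat))  # independent per-tag probabilities

model = FrameTagger().eval()
with torch.no_grad():
    probs = model(torch.rand(1, 3, 224, 224))[0]   # one video image frame
content_tags = [t for t, p in zip(EXISTING_TAGS, probs) if p > 0.5]
print(content_tags)
```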
Further, as another embodiment of the present application, fig. 8 shows a flowchart of a specific implementation of S402 in the audio file recommendation method provided in an embodiment of the present application. Referring to fig. 8, with respect to the embodiment shown in fig. 4, S402 in this embodiment specifically includes S4021 to S4023, which is specifically described as follows:
further, the analyzing the video file to obtain the content tag corresponding to the video file includes:
in S4021, a picture tag included in each video image frame of the video file is determined.
In this embodiment, the electronic device may parse the video file to obtain a plurality of video image frames, perform label identification on the captured pictures in each video image frame, and determine picture labels included in each video image frame.
In a possible implementation manner, the electronic device may identify the similarity between a plurality of adjacent video image frames; if the similarity between them is greater than a preset similarity threshold, the frames are identified as belonging to the same video picture segment, and a corresponding picture tag is configured for that segment. In order to improve the accuracy of picture tag identification, for example to identify some dynamic-type tags, while reducing the number of tag identification operations, the electronic device may segment the video file, divide a plurality of video image frames with high picture correlation into the same video picture segment, identify the picture tags contained in that segment, and use those picture tags as the picture tags of every video image frame in the segment. Illustratively, fig. 9 shows a schematic diagram of the division of video picture segments provided by an embodiment of the present application. Referring to fig. 9, the electronic device obtains a video file with a duration of 60 seconds and identifies that the similarity between the video image frames of the first to fourth seconds is greater than the similarity threshold; those frames are therefore divided into the same video picture segment, and the picture tag corresponding to that segment is identified. In the same manner, the video image frames of the 25th to 28th seconds are divided into one video picture segment and those of the 57th to 60th seconds into another, and the picture tags corresponding to each segment are determined.
In a possible implementation manner, the electronic device may instead divide the video file into a plurality of video picture segments based on a preset time interval, identify the picture tags contained in each segment, and use the picture tags of a segment as the picture tags corresponding to all video image frames in that segment. For example, with 4 s as the time interval, the video file is divided into a plurality of video picture segments each 4 s long.
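Both segmentation strategies can be sketched as follows. Here frame_similarity() is a stand-in for whatever frame-comparison measure is used (for example a histogram distance); the 0.9 threshold, 60 fps frame rate, and 4 s interval follow the examples in the text.

```python
def segment_by_similarity(frames, frame_similarity, threshold=0.9):
    # Group consecutive frames whose pairwise similarity exceeds the threshold.
    segments, current = [], [frames[0]]
    for prev, cur in zip(frames, frames[1:]):
        if frame_similarity(prev, cur) > threshold:
            current.append(cur)        # still the same video picture segment
        else:
            segments.append(current)   # similarity dropped: start a new segment
            current = [cur]
    segments.append(current)
    return segments

def segment_by_interval(frames, fps=60, interval_s=4):
    # Fixed-length segmentation: e.g. 240 frames per 4 s segment at 60 fps.
    step = fps * interval_s
    return [frames[i:i + step] for i in range(0, len(frames), step)]
```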
In one possible implementation manner, the electronic device determines the picture tags included in the video image frames, and may specifically perform tag identification by using one or a combination of two or more of the following detection algorithms: face detection algorithms (face recognition detection and face attribute detection), scene recognition algorithms, object detection algorithms, and the like.
For example, the electronic device may determine whether a video image frame includes a face region through a face detection algorithm, and if so, locate the face region, determine a plurality of key feature points based on the face region, and determine object attributes (such as gender, age, and the like) of a shooting object corresponding to the face based on feature information corresponding to the plurality of key feature points. Similarly, the electronic device may further use the identified face region as a reference point to obtain a human body region, so as to determine object attributes such as height and stature of the shooting object. Optionally, after obtaining the human body region, the electronic device may perform human body tracking in the plurality of video image frames, so as to determine posture change information of the photographic subject in the plurality of video image frames, and thus determine a motion type (such as running, jumping, sitting, etc.) of the photographic subject according to the posture change information.
For example, the electronic device may have previously stored standard models of different identifiable objects. The electronic equipment can judge whether an image area matched with the standard model of any recognizable object exists in the video image frame, if so, the shooting object corresponding to the image area is recognized as the recognizable object, and therefore object recognition is achieved.
In a possible implementation manner, after obtaining the picture tags, the electronic device may import the plurality of picture tags of a video image frame into a semantic perception model and generate a description sentence corresponding to the video image frame. Because the picture tags are mutually independent and unrelated, the electronic device can import the picture tags belonging to the same video image frame into the semantic perception model so that a plurality of isolated picture tags are integrated into a coherent, meaningful sentence; this describes the picture content of the video image frame more accurately and improves the accuracy of subsequent music matching. For example, continuing to refer to fig. 9, the picture tags corresponding to the first to fourth seconds of the video file are: blue sky, beach, sea water, dog, and children. The electronic device may import these picture tags into the semantic perception model and output the corresponding description sentence: "Children and puppies play on the beach by the sea under the blue sky." The electronic device can then extract from the description sentence supplementary tags that were not identified among the picture tags, improving the completeness and accuracy of tag identification.
In S4022, the number of occurrences of each picture tag in all the video image frames of the video file is counted, and the picture tags are sorted in descending order of the number of occurrences to obtain a picture tag sequence.
In this embodiment, after obtaining the plurality of picture tags, the electronic device may perform cluster analysis on all the picture tags and count the number of occurrences of each picture tag in all video image frames of the video file. If the video file is divided into a plurality of video picture segments, the number of video image frames contained in a segment is used as the occurrence count of the picture tags of that segment. For example, if the duration of a video picture segment is 4 s and the capture frame rate of the video file is 60 fps, the segment corresponds to 240 video image frames; that is, the occurrence count of the picture tags corresponding to that segment is 240.
In a possible implementation manner, the clustering algorithm of the electronic device may identify a plurality of synonymous or nearly synonymous picture tags as the same tag, that is, generate a cluster tag, and superimpose the occurrence counts of the picture tags belonging to that tag as the occurrence count of the cluster tag. For example, if one video image frame yields the picture tag "rose" and another yields the picture tag "red rose", the contents of the two tags are similar and they can be identified as the same tag: the two picture tags are clustered into the cluster tag "rose", and the occurrence count of "rose" is added to that of "red rose" to give the occurrence count of the cluster tag. By clustering the picture tags, different tags with similar or identical content can be consolidated, reducing the number of picture tags without reducing the accuracy of content identification of the video file.
In this embodiment, the electronic device may arrange the picture tags in descending order of occurrence count to generate a picture tag sequence: the earlier a picture tag appears in the sequence, the more frequently it occurs; conversely, the later it appears, the fewer its occurrences. Optionally, if two or more picture tags have the same occurrence count, their order in the sequence may be determined by the average area each tag occupies in the video image frames: the larger the average area, the earlier the position.
In S4023, the first N picture tags in the picture tag sequence are selected as the content tags corresponding to the video file, where N is a positive integer.
In this embodiment, after obtaining the picture tag sequence, the electronic device may extract the first N picture tags as the content tags of the video file and treat the remaining picture tags as invalid tags. Picture tags ranked lower occur less often, represent the content of the video file poorly, and correlate weakly with the background music; such poorly representative picture tags need not be used to determine the audio files to be recommended and are therefore identified as invalid tags. Conversely, picture tags with many occurrences represent the video file content well and can be used as tags for determining the audio files to be recommended.
In a possible implementation manner, the value of N may be a fixed or default value, or may be dynamically adjusted according to the number of picture tags contained in the video file. For example, if the preset tag proportion is 30% and the video file yields 100 identified tags, the top 100 × 30% = 30 picture tags are selected as the content tags of the video file.
Illustratively, fig. 10 shows a schematic diagram of the generation flow of content tags provided by an embodiment of the present application. Referring to fig. 10, the electronic device identifies the picture tags corresponding to each video image frame in the video file, such as "blue sky" and "sea water", and then counts the occurrences of each picture tag across the entire video file; for example, if 12 video image frames in the video file contain the picture tag "child", the occurrence count of "child" is 12. The occurrence counts of the other picture tags are determined in the same way, yielding the picture tag sequence shown in the figure. If the preset N is 5, the first 5 picture tags (namely: child, beach, sea water, sky, and dog) are selected as the content tags of the video file.
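A minimal sketch of S4021 to S4023 as just illustrated: count how often each picture tag occurs across the video image frames, sort in descending order, and keep the first N. The synonym table used for clustering is an illustrative assumption, and the tie-breaking by average area described above is omitted for brevity.

```python
from collections import Counter

SYNONYMS = {"red rose": "rose"}  # cluster near-synonymous tags into one cluster tag

def select_content_tags(frame_tags, n=5):
    counts = Counter()
    for tags in frame_tags:                          # tags recognized in one frame
        counts.update(SYNONYMS.get(t, t) for t in tags)
    sequence = [t for t, _ in counts.most_common()]  # the picture tag sequence
    return sequence[:n]                              # first N become content tags

frame_tags = [["child", "beach"], ["child", "sea water"], ["child", "sky", "dog"]]
print(select_content_tags(frame_tags, n=3))  # ['child', 'beach', 'sea water']
```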
In the embodiment of the application, the video file is analyzed, the picture labels contained in each video image frame are respectively identified, and the occurrence times of each picture label are counted, so that the picture label with high relevance to the recommended audio file can be selected as the content label, the accuracy of selecting the audio file can be ensured, and the influence of invalid noise on the recommendation process is reduced.
In S403, a target audio associated with the content tag and/or the event tag is selected from the music library, and audio recommendation information of the video file is generated based on the target audio.
In this embodiment, the electronic device obtains two types of tags for the video file: an event tag related to the shooting event and a content tag related to the shot content. When a user selects background music for a video file, two considerations usually apply: the music should match the current shooting mood or match the shot content, and the content the user intends to express by shooting the video file cannot always be expressed accurately by the picture content alone. For example, when a user visits a famous natural scenic spot, the animals and plants it contains may be quite common ones; the shot content then cannot be accurately expressed through the picture content of the video file alone, that is, the scenic spot cannot be identified, so the audio file cannot be accurately recommended. For this reason, the electronic device not only identifies the picture content of the video file but also acquires the tags related to the shooting event, determining event tags from the date, the location, the shooting object, and the like; through the event tags together with the content tags, the electronic device can accurately determine the shot content of the video file and thus recommend audio files more relevant to it. In the scenic-spot example, the shooting location corresponding to the video file can be determined through the event tag, and the specific scenic spot can be determined from the shooting location together with the animals and plants in the current picture, thereby overcoming the one-sidedness and inaccuracy of determining the shot content from the picture content alone.
In this embodiment, the music library may be stored in the memory of the electronic device or in the cloud. The electronic device may obtain the audio information of each candidate audio in the music library, match each piece of audio information against the content tag and/or the event tag, select the candidate audio matching the content tag and/or the event tag as the target audio, that is, the audio to be recommended to the user, and generate audio recommendation information containing the target audio. Fig. 11 exemplarily shows a schematic diagram of the audio recommendation information provided by an embodiment of the present application. Referring to fig. 11 (a), the audio recommendation information may provide a cover image associated with each audio file, so that in addition to identifying an audio file by its file name, the user can learn more audio-related information from the corresponding cover image. Referring to fig. 11 (b), the audio recommendation information may specifically be an audio recommendation list, whose recommendation order is determined by the correlation between each target audio and the video file: a target audio with higher correlation to the video file is ranked nearer the front of the list, and one with lower correlation nearer the back. The audio recommendation list may display the song name of each target audio, and may also display other information related to the target audio, such as the singer, the song cover, or the composer.
In one possible implementation, each candidate audio in the music library is associated with a corresponding audio tag. The audio tag can be determined according to the information of the song name, the content of the song lyrics, the music style and the like of the audio file. When the target music is obtained, whether one or more tags in the audio tags associated with the audio file exist in a tag group formed by the event tags and the content tags can be judged, and if yes, the candidate audio is identified as the target audio; otherwise, the candidate audio is identified as not the target audio.
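A minimal sketch of this tag-group check, with illustrative data: a candidate audio is identified as a target audio when any of its associated audio tags appears in the group formed by the event tags and content tags.

```python
def select_targets(candidates, event_tags, content_tags):
    # Tag group formed by the event tags and content tags of the video file.
    tag_group = set(event_tags) | set(content_tags)
    return [a for a in candidates if tag_group & set(a["tags"])]

candidates = [
    {"name": "Song A", "tags": {"beach", "easy"}},
    {"name": "Song B", "tags": {"city", "night"}},
]
targets = select_targets(candidates, ["beach"], ["child", "sea water"])
print([a["name"] for a in targets])  # ['Song A']
```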
Further, as another embodiment of the present application, the selecting the target audio associated with the content tag and/or the event tag in the music library may include at least the following two ways:
mode 1: target audio associated with the event tag is retrieved from the music library. The mode of selecting the target audio frequency can be subdivided into two processes according to different label types of the event labels. Wherein, the process 1 is a selection process adopted when the label type of the event label is a location type, namely the event label is a shooting location; the process 2 is a selection process adopted when the tag type of the event tag is a time type, that is, the event tag is specifically a shooting time.
It should be noted that the shooting location in the event tag may be obtained through the positioning module of the electronic device when the video file is shot. The electronic device can acquire the Point of Interest (POI) of the shooting location by calling a third-party map application and determining the POI from the position information acquired by the positioning module together with the third-party positioning system.
In one possible implementation manner, fig. 12 shows a schematic diagram of acquiring a shooting location provided by an embodiment of the present application. Referring to fig. 12, the electronic device may obtain outdoor location information through its built-in positioning module and indoor location information through network devices at the scene of the shooting location, generate the location coordinates of the shooting location based on both, and use the location coordinates to determine, through the server corresponding to a third-party application, the reverse-geocoded address and the scene attribute of the coordinates. In this way the shooting location of the video file and its corresponding scene attribute are obtained, for example whether the shooting location is a park, a shopping mall, or an office building, so that the shooting location is understood more fully.
Mode 2: and calculating the matching degree between the existing audio in the music library and the event label and the content label, and selecting the target audio from the existing audio based on the matching degree.
Specific implementations of the above two modes are set forth in detail below:
mode 1: determining a target audio based on the event tag:
process 1: referring to fig. 13, fig. 13 is a flowchart illustrating implementation of S403 in a method for recommending an audio file according to another embodiment of the present application. Referring to fig. 13, compared with the embodiment shown in fig. 4, the event tag described above in this embodiment includes a shooting location; selecting the target audio associated with the content tag and/or the event tag in the music library specifically includes: one or more than two of S1301 to S1303 are specifically described as follows:
in S1301, text information of each existing audio in the music library is acquired, and existing audio whose text information includes the shooting location is used as the target audio.
In this embodiment, each existing audio in the music library is associated with corresponding text information, which may specifically include the song name, lyrics, singer, composer, and lyricist of the existing audio. The electronic device can judge whether the text information contains the shooting location at which the video file was shot and, if so, use the existing audio whose text information contains the shooting location as the target audio to be recommended. The shooting location of a video file is often highly correlated with the content theme of the video file and contributes greatly to the recommendation of background music; if the shooting location appears directly in the text information of some existing audio, the recommendation priority of that existing audio is accordingly high, so it can be used as the target audio.
For example, suppose the shooting location corresponding to the video file is the Yangtze River, and the music library contains the existing audio "My Chinese Heart", whose lyrics include the phrase "Yangtze River"; because the text information of this existing audio contains the shooting location, it can be used as the target audio to be recommended.
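A minimal sketch of S1301, using the example above and assuming each existing audio carries its text information as plain strings; the field names are illustrative.

```python
def audios_mentioning_location(music_library, location):
    # Scan the text information (song name, lyrics, singer, ...) of each
    # existing audio for the shooting location string.
    targets = []
    for audio in music_library:
        text = " ".join(audio.get(field, "")
                        for field in ("title", "lyrics", "singer"))
        if location in text:
            targets.append(audio)
    return targets

library = [{"title": "My Chinese Heart",
            "lyrics": "... Yangtze River, Great Wall, Yellow Mountain ..."}]
print([a["title"] for a in audios_mentioning_location(library, "Yangtze River")])
# ['My Chinese Heart']
```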
In S1302, a video clip associated with the shooting location is acquired, and the audio of the soundtrack of the video clip is taken as the target audio.
In this embodiment, the music library may contain the soundtrack audio of different existing movie and television dramas, including but not limited to: movies, TV dramas, animated dramas, animated movies, short videos, and the like, and the video segment may be a segment of such a drama. For a well-known movie or television drama, the soundtrack audio of a classic scene often becomes the "exclusive soundtrack" of the place associated with that scene; that is, some shooting locations are strongly correlated with the soundtrack of a drama, and soundtrack audio strongly correlated with the shooting location can then be used as the target audio to be recommended.
For example, in the classic TV drama "Shanghai Beach", most of the classic scenes were shot at the Bund in Shanghai. If the shooting location of a video file for which an audio file needs to be recommended is also the Bund in Shanghai, the soundtrack audio of the drama "Shanghai Beach", such as the song "Shanghai Beach", can be used as the target audio to be recommended.
For another example, the classic anime "Slam Dunk" tells the story of a high-school basketball team in Kamakura, Japan, so the drama has a strong correlation with Kamakura. If the shooting location corresponding to the video file is Kamakura, Japan, the soundtrack music of "Slam Dunk", such as "Till the End of the World", can be taken as the target audio to be recommended.
For another example, the classic movie "Harry Potter" has a classic scene in which Harry enters the magical world through Platform 9¾, which was shot at King's Cross Station in London, England; that is, there is a strong correlation between King's Cross Station (more broadly, London) and "Harry Potter". If the video file was shot in London, and in particular at King's Cross Station, the soundtrack of the movie "Harry Potter" can be used as the target audio to be recommended.
In a possible implementation manner, the electronic device may store a correspondence between movie and television dramas and shooting locations; table 1 shows such a correspondence provided by an embodiment of the present application. As shown in table 1, the correspondence records the name of each drama and its associated soundtrack music. The soundtrack music can be stored in the music library, and each drama is associated with a corresponding shooting location. After the shooting location in the event tag of the video file is identified, it can be judged whether that shooting location appears in the pre-stored correspondence; if so, the soundtrack music of the drama corresponding to the shooting location is used as the soundtrack music to be recommended. It should be noted that one shooting location may correspond to a plurality of dramas.
Shooting location | Drama name | Soundtrack name
Kamakura, Japan | "Slam Dunk" | "Till the End of the World"
King's Cross Station, London | "Harry Potter" movies | "Harry Potter" theme
The Bund, Shanghai | "Shanghai Beach" | "Shanghai Beach"
TABLE 1
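A minimal sketch of the Table 1 lookup, with the table hard-coded for illustration; one shooting location may map to the soundtracks of several dramas.

```python
LOCATION_SOUNDTRACKS = {  # shooting location -> [(drama, soundtrack), ...]
    "Kamakura, Japan": [("Slam Dunk", "Till the End of the World")],
    "King's Cross Station, London": [("Harry Potter", "Harry Potter theme")],
    "The Bund, Shanghai": [("Shanghai Beach", "Shanghai Beach")],
}

def soundtracks_for(location):
    # Return the soundtrack names associated with the shooting location, if any.
    return [track for _, track in LOCATION_SOUNDTRACKS.get(location, [])]

print(soundtracks_for("Kamakura, Japan"))  # ['Till the End of the World']
```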
In S1303, a historical event associated with the shooting location is determined, and an audio file corresponding to the historical event is used as the target audio.
In this embodiment, the shooting location may be related not only to movie and television dramas but also to certain historical events, and the electronic device may acquire from the music library an audio file corresponding to a historical event as the target audio to be recommended, according to the event type, the year of occurrence, and the like of the associated historical event. For example, if a historical event occurred between the 1960s and the 1990s, existing audio created or released in that period can be selected from the music library as the audio file corresponding to the historical event; for another example, if a historical event is a wartime event, existing music related to war can be selected from the music library as the audio file corresponding to the historical event.
For example, Pearl Harbor in the United States has a strong correlation with the historical event of the air raid on Pearl Harbor. Since that historical event is a war event, existing music whose audio keywords relate to war and explosions, or to "disaster" and "sadness", can be selected from the music library as the target audio to be recommended.
In one possible implementation, the historical events related to a place may also include the residence of historical celebrities, that is, a place may be the former residence of a historical celebrity. In this case, the electronic device may take the audio files associated with the historical celebrity as the audio files corresponding to the location.
In the embodiment of the application, the shooting place of the video file is obtained, and the audio file related to the shooting place is obtained in the modes of audio texts such as lyrics, song names and the like, existing movie and television dramas, historical events and the like to serve as the target audio to be recommended, so that the diversity of the recommended audio searching mode and the accuracy of the related audio searching can be improved.
Process 2: Referring to fig. 14, fig. 14 is a flowchart illustrating the implementation of S403 in a method for recommending an audio file according to another embodiment of the present application. As shown in fig. 14, compared with the embodiment shown in fig. 4, the event tag described in this embodiment includes a shooting date; selecting the target audio associated with the content tag and/or the event tag in the music library specifically includes one or more of S1401 to S1403, described as follows:
in S1403, if the shooting date is any one of preset specific dates, the audio file associated with the specific date is taken as the target audio.
In this embodiment, the electronic device may store a plurality of specific dates in advance, where the specific dates include, but are not limited to: public holidays or dates of special significance. For example, a specific date may be a public holiday such as Christmas, Children's Day, or the Mid-Autumn Festival; it may also be November 11, a date with special meanings such as Shopping Day or Singles' Day. For another example, the shooting date may correspond to a weekend or a particular day of the week, and an audio file whose lyrics mention that day, for example lyrics containing "Monday", can be used as the audio file associated with a Monday shooting date. The electronic device can configure associated audio files for each specific date and establish a correspondence between specific dates and audio files. Each specific date may be associated with one or more audio files.
Illustratively, table 2 shows a correspondence table between specific dates and audio files provided by an embodiment of the present application. Referring to table 2, dates with special significance, such as public holidays, may be associated with corresponding audio files, and one specific date may be associated with one or more audio files; for example, the Lunar New Year corresponds to two audio files.
Specific date | Associated songs
Christmas | "Merry Christmas"
Lunar New Year | "Gong Xi Fa Cai", "Good Year"
National Day | "Good Days"
TABLE 2
In this embodiment, after the shooting date of the video file is determined, the electronic device may determine the audio file associated with the shooting date by querying the correspondence, and use the audio file as the target audio to be recommended.
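A minimal sketch of the S1403 lookup, keyed by (month, day) for recurring dates; entries such as the Lunar New Year would additionally require a lunar-calendar conversion, which is omitted here.

```python
import datetime

SPECIFIC_DATES = {           # (month, day) -> associated songs
    (12, 25): ["Merry Christmas"],
    (10, 1): ["Good Days"],
}

def date_targets(shooting_date: datetime.date):
    # Return the target audio associated with the shooting date, if any.
    return SPECIFIC_DATES.get((shooting_date.month, shooting_date.day), [])

print(date_targets(datetime.date(2019, 12, 25)))  # ['Merry Christmas']
```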
Further, as another embodiment of the present application, before S1403, S1401 and S1402 may be further included, which is specifically described as follows:
in S1401, user information of the currently logged-in user account is acquired.
In this embodiment, some specific dates are private, that is, they differ from user to user; such personal specific dates are not universal and need to be determined in combination with the user information of the user to whom the electronic device belongs, for example a birthday, a wedding anniversary, or the anniversary of meeting one's partner. Based on this, before determining the specific date, the electronic device may first acquire the user information pre-registered under the user account to which the electronic device is currently logged in.
In a possible implementation manner, the user information may include a birth date, a wedding date, a birthday date of a parent, a birthday date of a spouse, and the like.
In this embodiment, the currently logged-in user account is specifically a user account that the electronic device logs in when the electronic device edits the video file. The user account may be a user account that the video file editing application program logs in, or may be a user account that has logged in an equipment system of the electronic device.
In S1402, the specific date is determined based on the user information.
In this embodiment, the electronic device may store the date types of a plurality of specific dates and determine each specific date according to the user information and its date type.
For example, if the date type of a specific date is the birthday type, the electronic device may obtain the date of birth of the currently logged-in user from the user information and determine the yearly birthday, that is, the specific date, based on the date of birth.
In this embodiment, each specific date related to the user's private life may be associated with a corresponding audio file; the audio file may be set by the system by default or configured manually by the user. For example, if a specific date is the user's birthday, "Happy Birthday" may be used as the audio file associated with that date; alternatively, a song the user likes may be set as the audio file associated with the birthday, for example the user setting "Youth" as that audio file.
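A minimal sketch of S1401 and S1402 for such private specific dates: derive the yearly occurrence of each date from the user information of the logged-in account. The user-record fields and the default songs are illustrative assumptions.

```python
import datetime

def private_specific_dates(user_info, year):
    # Map each private specific date in the given year to its associated song.
    dates = {}
    if "birth_date" in user_info:
        b = user_info["birth_date"]
        dates[datetime.date(year, b.month, b.day)] = user_info.get(
            "birthday_song", "Happy Birthday")
    if "wedding_date" in user_info:
        w = user_info["wedding_date"]
        dates[datetime.date(year, w.month, w.day)] = "anniversary song"
    return dates

user = {"birth_date": datetime.date(1990, 7, 5), "birthday_song": "Youth"}
print(private_specific_dates(user, 2020))  # {datetime.date(2020, 7, 5): 'Youth'}
```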
In the embodiment of the application, whether the shooting date of the video file is in the preset specific date or not is judged by determining the shooting date of the video file, and if yes, the audio file associated with the specific date is used as the target audio to be recommended, so that the diversity of target audio selection modes and the accuracy of audio searching related to the video file are improved.
Mode 2: and calculating the matching degree between the existing audio in the music library and the event label and the content label, and selecting the target audio from the existing audio based on the matching degree. Referring to fig. 15, fig. 15 is a flowchart illustrating a specific implementation of S403 in a method for recommending an audio file according to another embodiment of the present application. Referring to fig. 15, compared to the embodiment shown in fig. 4, the selecting, in the music library, the target audio associated with the content tag and/or the event tag specifically includes: s1501 to S1504, are specifically described as follows:
in S1501, an initial tag sequence corresponding to the video file is generated from the content tag and the event tag.
In this embodiment, the electronic device may package a content tag related to the video content and an event tag related to the shooting event, and generate an initial tag sequence corresponding to the video file. The initial sequence of tags includes a content tag and an event tag.
In one possible implementation, the electronic device may cluster the content tags and the event tags to generate the initial tag sequence. The clustering may specifically merge tags with the same or similar meanings among the content tags and event tags, and generate the initial tag sequence based on the merged tags. For example, if the event tag corresponding to the shooting event includes the shooting date, say September 20, and analyzing the video image frames of the video file yields a content tag of "daytime", the event tag "September 20" and the content tag "daytime" can be merged into a single tag "daytime of September 20", thereby reducing the number of tags and avoiding invalid matching operations.
For example, table 3 shows the initial tag sequence provided by an embodiment of the present application. Referring to table 3, there are two event tags, the shooting location and the shooting date, specifically "beach" and "2019/07/05 14:30". The electronic device may determine the tag type corresponding to each content tag according to the type of the shot content; for example, the content tag "child" has the tag type "person". It then counts the proportion of all video image frames in which each tag type appears, that is, the tag proportion: for example, the tag proportion of the content tag "child" is 25%, meaning that the video image frames containing the shot content "child" make up 25% of all video image frames, so if the video file contains 100 frames, 25 of them contain "child".
(Table 3 is reproduced as an image in the original publication; it lists each tag of the initial tag sequence together with its tag type and tag proportion.)
TABLE 3
In S1502, based on a preset tag mapping algorithm, a mapping tag sequence corresponding to the initial tag sequence is generated; each tag in the mapping tag sequence is configured with an associated weight value.
In this embodiment, a plurality of existing tags are stored in the tag library, and each existing tag can be configured with a corresponding weight value according to its correlation with background music: if the correlation between an existing tag and background music is high, its preconfigured weight value is large; conversely, if the correlation is low, the preconfigured weight value is small. It should be noted that the tag library may be stored in the memory of the electronic device or in the cloud server.
In a possible implementation manner, if the tag library is stored in a cloud server, the electronic device may send the generated initial tag sequence to the cloud server over a communication link. After receiving the initial tag sequence, the cloud server may map each tag in it based on the stored tag library, determine the existing tags in the tag library associated with each tag of the initial tag sequence, and generate the mapping tag sequence. The cloud server can feed the generated mapping tag sequence back to the electronic device, or, after determination of the target audio is completed, feed the determined target audio back so that the electronic device generates the audio recommendation information upon receiving it.
In this embodiment, the electronic device may map each tag in the initial tag sequence with each existing tag in the tag library, that is, determine an existing tag associated with each tag in the initial tag sequence, and generate a mapping tag sequence corresponding to the video file based on each existing tag after mapping, where a tag included in the mapping tag sequence is any existing tag in the tag library. Since the existing tags in the tag library have been configured with corresponding weight values in advance, the weight values associated with the existing tags can be encapsulated in the mapping tag sequence when the mapping tag sequence is generated.
In a possible implementation manner, the granularity of the tags in the initial tag sequence identified by the electronic device may be fine, while the granularity of the existing tags in the tag library is coarse; in this case, tag mapping clusters the tags in the initial tag sequence and screens out invalid classification tags.
Illustratively, fig. 16 shows a schematic diagram of generating a mapping tag sequence according to an embodiment of the present application. Referring to fig. 16, the tags contained in the initial tag sequence are those shown in table 3. For example, for the event tag "beach" in the initial tag sequence, the associated tag type is "shooting location" and the tag content is "beach"; after being mapped into the mapping tag sequence by the preset tag mapping algorithm, its tag content remains "beach", while its tag type is clustered from the fine-grained "shooting location" into "scene". The weight value associated with each tag in the mapping tag sequence is then determined based on the preset tag type: for example, a scene-type tag has an associated weight value of 0.8 in the mapping tag sequence, while a weather-type tag has an associated weight value of 0.2. If any tag in the initial tag sequence has no associated tag type in the tag mapping algorithm, that initial tag is identified as an invalid tag; for example, for "puppy-object" in fig. 16, no associated mapping tag can be determined by the tag mapping algorithm, so its tag type is configured as ignored, and no weight value is configured for the ignored mapping tag (or its weight value is set to 0).
In S1503, based on the weight value of each tag in the mapping tag sequence, the matching degree of each existing audio in the music library is calculated.
In this embodiment, after determining the weight value of each tag in the mapping tag sequence, the electronic device may judge whether each existing audio in the music library contains any tag in the mapping tag sequence. Specifically, an existing audio may be associated with a plurality of audio tags as well as with text information (such as the song name and lyrics); the electronic device judges whether any tag in the mapping tag sequence appears among the audio tags or in the text information, and if so, calculates the matching degree of the existing audio based on the weight values of the contained tags.
For example, if the audio tags associated with an existing audio in the music library are "pop music", "beach", "blue sky", and "easy", it contains two tags of the mapping tag sequence, namely "beach-scene" and "blue sky-weather"; the matching degree of the existing audio is then calculated from the weight values associated with these two tags, 0.8 and 0.2 respectively, for example by superimposing them: 0.8 + 0.2 = 1.0.
Optionally, when calculating the matching degree, the weight value of each mapping tag may additionally be scaled by the proportion of video image frames in the video file in which that tag appears, and the matching degree of the existing audio calculated from the scaled weight values.
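A minimal sketch of the matching-degree calculation in S1503, covering both variants: plain superposition of weight values, and weight values scaled by the frame proportion of each mapping tag. The weights follow the worked example above; the proportions are illustrative.

```python
def matching_degree(audio_tags, mapping_tags, use_proportion=False):
    # mapping_tags: tag -> (weight value, frame proportion in the video file)
    score = 0.0
    for tag, (weight, proportion) in mapping_tags.items():
        if tag in audio_tags:
            score += weight * (proportion if use_proportion else 1.0)
    return score

mapping_tags = {"beach": (0.8, 0.30), "blue sky": (0.2, 0.60)}
audio_tags = {"pop music", "beach", "blue sky", "easy"}
print(matching_degree(audio_tags, mapping_tags))        # 0.8 + 0.2 = 1.0
print(matching_degree(audio_tags, mapping_tags, True))  # 0.8*0.3 + 0.2*0.6 = 0.36
```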
In S1504, the target audio is determined from the existing audio based on the matching degree.
In this embodiment, after calculating the matching degree between each existing audio and the mapping tag sequence, the electronic device may determine the target audio from the existing audio according to the matching degree. The matching degree can represent the correlation degree between the existing audio and video contents, and if the matching degree is higher, the correlation degree between the existing audio and video contents is higher; conversely, the lower the matching degree, the smaller the correlation degree between the existing audio and video contents.
In a possible implementation manner, the electronic device may select an existing audio with a matching degree greater than a preset matching threshold as a target audio; or sequencing the existing audios according to the sequence of the matching degrees from large to small, and selecting the first M existing audios as target audios, wherein M is any positive integer.
Further, as another embodiment of the present application, the above-mentioned manner of selecting the target audio may specifically include:
s1, selecting the existing audio with the matching degree larger than a preset matching threshold value as a candidate audio.
S2, determining the user characteristics of the user account based on the user information of the currently logged user account.
And S3, if any candidate audio is matched with the user characteristics, taking the candidate audio as the target audio.
In this embodiment, the electronic device may determine the target audio according to the user's preference in addition to selecting it by matching degree. In this case, the electronic device may screen all existing audio in the music library according to the calculated matching degrees and select the existing audio whose matching degree is greater than a preset matching threshold as candidate audio. Because the matching degree determines the degree of correlation between an existing audio and the video file, candidate audio strongly correlated with the video file can be screened out through the matching degree.
Then, the electronic device may obtain the user information of the currently logged-in user account. The user information is specifically used to determine the user's music preferences.
In one possible implementation, the electronic device needs to obtain the user's authorization before obtaining the user information. The electronic device may output an authorization prompt box and, if it receives an authorization-granting operation fed back by the user through the prompt box, obtain the user information from the user account.
In this embodiment, the user information may specifically include play records and pre-configured preference information. The electronic device may determine the user's music preference from the audio files corresponding to the play records, or directly from the user's pre-configured information, and then determine the user characteristics of the user account based on the music preference of the user account.
In this embodiment, the electronic device may determine whether each selected candidate audio matches the user characteristics, for example, whether the music type of the candidate audio is consistent with the music type recorded in the user characteristics; if so, the matching candidate audio is taken as the target audio to be recommended.
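Putting S1 to S3 together, a hedged sketch (the "music_type" field and the preference format are assumptions made for the example):

```python
# Sketch: threshold filtering (S1) followed by a user-characteristic check (S3).

def select_target_audio(scored_audios, preferred_types, threshold=0.5):
    candidates = [a for a, degree in scored_audios if degree > threshold]  # S1
    # S3: keep candidates whose music type matches the user characteristics (S2)
    return [a for a in candidates if a["music_type"] in preferred_types]

audios = [({"name": "track A", "music_type": "pop"}, 0.9),
          ({"name": "track B", "music_type": "rock"}, 0.8)]
print(select_target_audio(audios, preferred_types={"pop"}))
# [{'name': 'track A', 'music_type': 'pop'}]
```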
In the embodiment of the application, the accuracy of target audio selection can be improved by acquiring the user characteristics of the user account and selecting the target audio according to the matching degree and the user characteristics.
In the embodiment of the application, the identified event tags and content tags are mapped to obtain the mapping tag sequence, the matching degree between the mapping tag sequence and each existing audio is calculated, and the target audio is selected based on the matching degree. In this way, target audio associated with the video content can be selected, improving the accuracy of target audio selection.
The above is a specific description of two implementations of selecting the target audio.
In a possible implementation manner, the electronic device may determine the target audio to be recommended using either mode 1 or mode 2 alone, or using both modes together.
If the electronic device determines the target audio using both modes, corresponding decision priorities can be configured for the two modes.
In a possible implementation manner, the electronic device may first obtain, in mode 1, the target audio matching the shooting location or the shooting date from the music library, and then determine the target audio in mode 2, that is, by calculating the matching degree of each existing audio from the event tag and the content tag. In other words, mode 1 has a higher decision priority than mode 2. The target audio already selected in mode 1 can be temporarily removed from the music library, so that it is not identified again when the mode 2 determination is made.
Illustratively, the music library contains track 1, track 2 and track 3. If the lyrics of track 1 include the shooting location of the video file, the electronic device identifies track 1 as a target audio when performing the mode 1 determination. Before the mode 2 determination is performed, track 1 may be removed from the music library, so that only track 2 and track 3 remain when mode 2 is executed, avoiding unnecessary determination operations.
In a possible implementation manner, the electronic device may instead first determine the target audio in mode 2, by calculating the matching degree of each existing audio from the event tag and the content tag, and then obtain, in mode 1, the target audio matching the shooting location or the shooting date from the music library. In other words, mode 2 has a higher decision priority than mode 1. The target audio already selected in mode 2 can be temporarily removed from the music library, so that it is not identified again when the mode 1 determination is made.
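The two orderings can be expressed as one helper that runs the higher-priority mode first and hides its picks from the second pass; the mode implementations below are stand-ins for the location/date match (mode 1) and the matching-degree calculation (mode 2), not the patent's actual algorithms:

```python
# Sketch: combine mode 1 and mode 2 under a configurable decision priority.

def select_with_priority(music_library, first_mode, second_mode):
    targets = first_mode(music_library)
    # Temporarily remove the already-selected tracks so the second mode
    # does not evaluate (or return) them again.
    remaining = [audio for audio in music_library if audio not in targets]
    return targets + second_mode(remaining)

library = ["track 1", "track 2", "track 3"]
mode_1 = lambda lib: [a for a in lib if a == "track 1"]  # e.g. lyrics mention the location
mode_2 = lambda lib: [a for a in lib if a == "track 3"]  # e.g. high matching degree
print(select_with_priority(library, mode_1, mode_2))      # ['track 1', 'track 3']
```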
Further, as another embodiment of the present application, before performing S403 the electronic device may generate a music library associated with the currently used user account. Fig. 17 shows a flowchart of an implementation of a recommendation method for an audio file according to an embodiment of the present application. As shown in fig. 17, compared with the embodiment shown in fig. 4, the embodiment of the present application further includes, before S403, steps S1701-S1703, which are specifically described as follows:
In S1701, if authorization information of the currently logged-in user account is stored, an operation record of the user account is acquired.
In this embodiment, the electronic device may obtain the authorization information of the currently logged-in user account. Exemplarily, fig. 18 shows a schematic diagram of obtaining authorization information provided in an embodiment of the present application. Referring to fig. 18, when the user starts a video editing application, the electronic device may output an authorization acquisition popup containing two controls: a grant-authorization control 181 and a deny-authorization control 182. If the electronic device receives a click operation on the grant-authorization control 181, it determines that the user agrees to the acquisition of the privacy information of the user account, such as the user information and the operation records, and generates and stores the authorization information; conversely, if it receives a click operation on the deny-authorization control 182, it determines that the user refuses the acquisition of the privacy information of the user account, and the above authorization information is not generated.
In this embodiment, the operation records of the user account may be stored in a local memory or in the cloud server. When the electronic device detects that authorization information of the user account is stored, it can obtain the operation records of the user account based on the authorization information. The operation records of the user account include, but are not limited to: audio play records, music video browsing records, music website browsing records, and editing records of video files (an editing record of a video file may include a record of the selection operation for its background audio), and so on.
In S1702, user characteristics of the user account are generated based on the operation record.
In this embodiment, after obtaining a plurality of operation records of a user account, the electronic device may determine, according to the plurality of operation records, a user characteristic corresponding to the user account, where the user characteristic is specifically used to indicate a preference of the user account for music.
In a possible implementation manner, the electronic device may determine the played audio files associated with the operation records, identify the music classification of each audio file, and then count the number of plays for each preset music classification; the music classification that is played most often, or more often than a preset threshold, is selected as the music classification preferred by the user account, thereby obtaining the user characteristics.
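A sketch of this counting step, assuming each operation record carries a record type and a music classification (the field names are invented):

```python
# Sketch: derive the user characteristics (preferred music classifications)
# from the play counts in the operation records.
from collections import Counter

def user_characteristics(operation_records, min_plays=5):
    counts = Counter(record["music_class"] for record in operation_records
                     if record["type"] == "audio_play")
    preferred = {cls for cls, n in counts.items() if n >= min_plays}
    if not preferred and counts:
        # Fall back to the single most-played classification.
        preferred = {counts.most_common(1)[0][0]}
    return preferred

records = [{"type": "audio_play", "music_class": "pop"}] * 6
print(user_characteristics(records))  # {'pop'}
```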
In S1703, existing music matching the user characteristics is extracted from the database, and the music library is generated.
In this embodiment, a large number of audio files may be stored in the database, and each audio file may be associated with a corresponding music classification. Based on the identified user characteristics, the electronic device can determine the music classifications preferred by the user, select from the database the audio files whose music classification matches the user characteristics as the existing audio, and generate the music library associated with the user account from all the selected existing audio.
In a possible implementation manner, if the database is stored in the cloud server, the electronic device may send the user characteristics to the cloud server; after receiving the user characteristics, the cloud server may select the audio files matching them from the database stored in the cloud as the existing audio, and create the music library associated with the user account on the cloud server based on all the existing audio.
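Whether the filtering runs locally or on the cloud server, the core of S1703 is a classification match; a minimal sketch with an invented database layout:

```python
# Sketch: build the per-account music library by keeping only the audio
# files whose music classification matches the user characteristics.

def build_music_library(database, preferred_classes):
    return [audio for audio in database
            if audio["music_class"] in preferred_classes]

database = [{"title": "song A", "music_class": "pop"},
            {"title": "song B", "music_class": "jazz"}]
print(build_music_library(database, {"pop"}))
# [{'title': 'song A', 'music_class': 'pop'}]
```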
In the embodiment of the application, by obtaining the operation records of the currently logged-in user account, determining the user's music preferences, and generating the music library corresponding to the user account, the accuracy of selecting the recommended audio file can be improved.
In one possible implementation, if the music library is stored in the cloud server, the operation of selecting the target audio may be performed by the cloud server. In this case, the electronic device can send the identified event tags and content tags to the cloud server, and the cloud server performs the identification according to the event tags and content tags fed back by the electronic device. Exemplarily, fig. 19 shows a flowchart of the interaction between the electronic device and the cloud server according to an embodiment of the present application. Referring to fig. 19, the interaction proceeds as follows:
1. The electronic device may first obtain an event tag of the video file, where the event tag may contain a shooting location and a shooting date.
2. Meanwhile, the electronic device may parse the video file, determine the content tags contained in each video image frame of the video file, and generate, based on a semantic understanding algorithm and the plurality of identified content tags, a description language segment expressing the content of the video file.
3. All the tags are aggregated according to the identified event tags and content tags (or according to the description language segment obtained through semantic understanding).
4. The aggregated tags are sent to the cloud server, so that the target audio can be determined by the cloud server.
5. After obtaining the aggregated tags, the cloud server may search for matching target audio in the music library (i.e., a music service application). The search can proceed in two ways: if the user is logged in and privacy authorization has been obtained, the user's operation records can be retrieved to determine the user characteristics (i.e., the user's preferences), and big-data recommendation is performed based on the user characteristics and the tags to obtain the target audio; if the user is not logged in or privacy authorization has not been obtained, a big-data search is performed directly on the aggregated tags to obtain the target audio.
6. The cloud server sends the selected target audio to the electronic device.
7. The electronic device generates the audio recommendation information from the returned target audio.
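From the device side, steps 3 to 7 reduce to one request/response exchange; the endpoint URL and the JSON shape below are hypothetical, invented only to make the sketch self-contained:

```python
# Sketch: send the aggregated tags to the cloud server (steps 3-4) and read
# back the target audio it selected (steps 5-7).
import json
import urllib.request

def request_target_audio(event_tags, content_tags,
                         url="https://cloud.example.com/recommend"):  # hypothetical endpoint
    payload = json.dumps({"tags": event_tags + content_tags}).encode("utf-8")
    req = urllib.request.Request(url, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        # Assumed response shape: {"target_audio": [...]}
        return json.load(resp).get("target_audio", [])
```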
In particular, for steps 2 to 4 above, the identification may proceed as shown in fig. 20, which is a schematic diagram of generating a content tag according to an embodiment of the present application. The electronic device transmits the video file to an image signal processing unit (ISP), which analyzes each video image frame of the video file and determines the content tags it contains through an action recognition algorithm, a scene recognition algorithm, a face detection algorithm and a face attribute algorithm. The action recognition algorithm additionally takes in the currently recognized video image frame together with a number of cached related video image frames, performs action detection, and determines the tags related to actions. The identified content tags are uniformly fed back to the camera module driver, which aggregates them and feeds them to the camera module application package. The camera module application package can also obtain the event tags from the camera module, generate an initial tag sequence from the content tags and the event tags, and send the initial tag sequence to the gallery application package. The gallery application package maps the initial tag sequence to obtain the mapping tag sequence and feeds it back to the music application of the cloud server, where the matching degree of each existing audio is calculated to determine the target audio.
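The frame-by-frame part of that pipeline can be sketched as follows; the recognizers are stand-in stubs, and the example tags (including the location) are invented:

```python
# Sketch: run the recognition algorithms on every video image frame,
# aggregate the content tags, merge in the event tags to form the initial
# tag sequence, then map it onto the mapping tag sequence.

def analyze_video(frames, recognizers, event_tags, tag_mapping):
    content_tags = []
    for frame in frames:
        for recognize in recognizers:          # scene, action, face, ...
            content_tags.extend(recognize(frame))
    initial_sequence = event_tags + content_tags
    # Gallery-side mapping of raw tags onto the library's tag vocabulary.
    return [tag_mapping.get(tag, tag) for tag in initial_sequence]

scene = lambda frame: ["beach"] if "sand" in frame else []
print(analyze_video(["sand sea"], [scene], ["Sanya"], {"beach": "beach-scene"}))
# ['Sanya', 'beach-scene']
```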
As can be seen from the above, the audio file recommendation method provided in the embodiment of the application can, when an audio file associated with a video file needs to be generated, acquire the event tag of the shooting event in which the video file was shot, obtain the content tag based on the shot content of the video file, extract the associated target audio from the music library through the event tag and the content tag, and generate the audio recommendation information associated with the video file, thereby achieving personalized recommendation. Compared with the existing audio file recommendation technology, the method generates the corresponding content tags according to the video content in addition to the event tags related to the shooting event, so that the target audio can be searched through the two types of tags. This improves the accuracy of target audio selection, enables accurate pushing while recommending audio files in a personalized manner, improves the user experience, and also strengthens the relevance between the recommended audio files and the video file.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
Example two:
fig. 21 is a block diagram showing a configuration of an audio file recommendation apparatus according to an embodiment of the present application, and only a part related to the embodiment of the present application is shown for convenience of description.
Referring to fig. 21, the audio file recommendation apparatus includes:
an event tag acquiring unit 211 configured to acquire an event tag of a shooting event corresponding to the video file;
a content tag obtaining unit 212, configured to parse the video file to obtain a content tag corresponding to the video file;
and an audio recommendation information generating unit 213, configured to select a target audio associated with the content tag and/or the event tag in the music library, and generate audio recommendation information of the video file based on the target audio.
Optionally, the event tag contains a shooting location; the audio recommendation information generating unit 213 includes:
the audio text matching unit is used for acquiring the text information of each existing audio in the music library and taking the existing audio containing the shooting place in the text information as the target audio; and/or
The video clip matching unit is used for acquiring a video clip associated with the shooting place and taking the dubbing audio of the video clip as the target audio; and/or
And the historical event matching unit is used for determining the historical events related to the shooting places and taking the audio files corresponding to the historical events as the target audio.
Optionally, the event tag contains a shooting date; the audio recommendation information generating unit 213 is specifically configured to:
and if the shooting date is any preset specific date, taking the audio file associated with the specific date as the target audio.
Optionally, the apparatus for recommending an audio file further comprises:
the user information acquisition unit is used for acquiring the user information of the currently logged user account;
a specific date determination unit for determining the specific date based on the user information.
Optionally, the audio recommendation information generating unit 213 includes:
an initial tag sequence generating unit, configured to generate an initial tag sequence corresponding to the video file from the content tag and the event tag;
a mapping tag sequence generating unit, configured to generate a mapping tag sequence corresponding to the initial tag sequence based on a preset tag mapping algorithm, and to configure an associated weight value for each tag in the mapping tag sequence;
a matching degree calculating unit, configured to calculate the matching degree of each existing audio in the music library based on the weight value of each tag in the mapping tag sequence;
and a matching degree identification unit, configured to determine the target audio from the existing audio based on the matching degree.
Optionally, the matching degree identifying unit includes:
the candidate audio selecting unit is used for selecting the existing audio with the matching degree larger than a preset matching threshold as the candidate audio;
the system comprises a user characteristic acquisition unit, a user characteristic acquisition unit and a user characteristic acquisition unit, wherein the user characteristic acquisition unit is used for determining the user characteristics of a user account based on the user information of the currently logged user account;
and the user characteristic matching unit is used for taking any candidate audio as the target audio if the candidate audio is matched with the user characteristic.
Optionally, the content tag obtaining unit 212 includes:
the picture label determining unit is used for determining a picture label contained in each video image frame of the video file;
the occurrence frequency counting unit is used for respectively counting the occurrence frequency of each picture label in all the video image frames of the video file, and sorting the picture labels in descending order of occurrence frequency to obtain a picture label sequence;
a content tag selection unit, configured to select the first N picture tags in the picture tags as content tags corresponding to the video file; and N is a positive integer.
Optionally, the apparatus for recommending an audio file further comprises:
the operation record acquisition unit is used for acquiring the operation record of the user account if the authorization information of the currently logged user account is stored;
the operation record analysis unit is used for generating user characteristics of the user account based on the operation record;
and the music library generating unit is used for extracting the existing music matched with the user characteristics from the database and generating the music library.
Therefore, the audio file recommendation apparatus provided in the embodiment of the present application can acquire the event tag of the shooting event corresponding to the video file, parse the video file to obtain the content tag corresponding to the video file, select the target audio associated with the content tag and/or the event tag in the music library, and generate the audio recommendation information of the video file based on the target audio. Compared with the existing audio file recommendation technology, the apparatus generates content tags according to the video content in addition to the event tags related to the shooting event, so that the target audio can be searched through the two types of tags. This improves the accuracy of target audio selection, enables accurate pushing while recommending audio files in a personalized manner, improves the user experience, and also strengthens the relevance between the recommended audio files and the video file.
Fig. 22 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 22, the electronic device 22 of this embodiment includes: at least one processor 220 (only one is shown in fig. 22), a memory 221, and a computer program 222 stored in the memory 221 and executable on the at least one processor 220, where the processor 220, when executing the computer program 222, implements the steps in any of the audio file recommendation method embodiments described above.
The electronic device 22 may be a computing device such as a desktop computer, a notebook, a palm computer, or a cloud server. The electronic device may include, but is not limited to, the processor 220 and the memory 221. Those skilled in the art will appreciate that fig. 22 is merely an example of the electronic device 22 and does not constitute a limitation of it; the device may include more or fewer components than those shown, combine some components, or use different components, such as input/output devices and network access devices.
The processor 220 may be a Central Processing Unit (CPU), or another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor may be a microprocessor, or any conventional processor.
The memory 221 may be an internal storage unit of the electronic device 22 in some embodiments, such as a hard disk or memory of the electronic device 22. In other embodiments, the memory 221 may also be an external storage device of the electronic device 22, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card provided on the electronic device 22. Further, the memory 221 may include both an internal storage unit and an external storage device of the electronic device 22. The memory 221 is used for storing the operating system, application programs, a boot loader (BootLoader), data, and other programs, such as the program code of the computer program. The memory 221 may also be used to temporarily store data that has been output or is to be output.
It should be noted that, for the information interaction, execution process, and other contents between the above-mentioned devices/units, the specific functions and technical effects thereof are based on the same concept as those of the embodiment of the method of the present application, and specific reference may be made to the part of the embodiment of the method, which is not described herein again.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
An embodiment of the present application further provides an electronic device, including: at least one processor, a memory, and a computer program stored in the memory and executable on the at least one processor, the processor implementing the steps of any of the various method embodiments described above when executing the computer program.
The embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program implements the steps in the above-mentioned method embodiments.
The embodiments of the present application further provide a computer program product which, when run on a mobile terminal, enables the mobile terminal to implement the steps in the above method embodiments.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, all or part of the processes in the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium and which, when executed by a processor, implements the steps of the method embodiments described above. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, etc. The computer-readable medium may include at least: any entity or device capable of carrying the computer program code to a photographing apparatus/electronic device, a recording medium, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium, such as a USB flash drive, a removable hard disk, a magnetic disk, or an optical disk. In certain jurisdictions, computer-readable media may not include electrical carrier signals or telecommunications signals, in accordance with legislation and patent practice.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/network device and method may be implemented in other ways. For example, the above-described apparatus/network device embodiments are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implementing, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (11)

1. A method for recommending an audio file, comprising:
acquiring an event label of a shooting event corresponding to a video file;
analyzing the video file to obtain a content tag corresponding to the video file;
and selecting target audio associated with the content tag and/or the event tag in a music library, and generating audio recommendation information of the video file based on the target audio.
2. The recommendation method according to claim 1, wherein the event tag contains a shooting location; the selecting the target audio associated with the content tag and/or the event tag in the music library comprises:
acquiring text information of each existing audio in the music library, and taking the existing audio containing the shooting place in the text information as the target audio; and/or
Acquiring a video clip associated with the shooting place, and taking the dubbing audio of the video clip as the target audio; and/or
And determining a historical event associated with the shooting place, and taking an audio file corresponding to the historical event as the target audio.
3. The recommendation method according to claim 1, wherein the event tag contains a shooting date; the selecting the target audio associated with the content tag and/or the event tag in the music library comprises:
and if the shooting date is any preset specific date, taking the audio file associated with the specific date as the target audio.
4. The recommendation method according to claim 3, wherein before the step of taking the audio file associated with the specific date as the target audio if the shooting date is any one of preset specific dates, the method further comprises:
acquiring user information of a currently logged user account;
determining the particular date based on the user information.
5. The recommendation method according to claim 1, wherein the selecting the target audio associated with the content tag and/or the event tag in the music library comprises:
generating an initial tag sequence corresponding to the video file from the content tag and the event tag;
generating a mapping tag sequence corresponding to the initial tag sequence based on a preset tag mapping algorithm, and configuring an associated weight value for each tag in the mapping tag sequence;
calculating the matching degree of each existing audio in the music library based on the weight value of each tag in the mapping tag sequence;
and determining the target audio from the existing audio based on the matching degree.
6. The recommendation method according to claim 5, wherein the determining the target audio from the existing audio based on the matching degree comprises:
selecting the existing audio with the matching degree larger than a preset matching threshold as a candidate audio;
determining user characteristics of a user account based on user information of a currently logged-in user account;
and if any candidate audio is matched with the user characteristics, taking the candidate audio as the target audio.
7. The recommendation method according to any one of claims 1 to 6, wherein the parsing the video file to obtain the content tag corresponding to the video file comprises:
determining a picture label contained in each video image frame of the video file;
respectively counting the occurrence times of each picture label in all the video image frames of the video file, and sorting the picture labels in descending order of occurrence times to obtain a picture label sequence;
selecting the first N picture labels in the picture labels as content labels corresponding to the video files; and N is a positive integer.
8. The recommendation method according to any one of claims 1-6, further comprising, before said selecting the target audio associated with the content tag and/or the event tag in the music library:
if the authorization information of the currently logged user account is stored, acquiring an operation record of the user account;
generating user characteristics of the user account based on the operation record;
and extracting the existing music matched with the user characteristics from a database to generate the music library.
9. An apparatus for recommending audio files, comprising:
the event tag acquisition unit is used for acquiring an event tag of a shooting event corresponding to the video file;
the content tag acquisition unit is used for analyzing the video file to obtain a content tag corresponding to the video file;
and the audio recommendation information generating unit is used for selecting target audio associated with the content tag and/or the event tag in the music library and generating audio recommendation information of the video file based on the target audio.
10. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method of any of claims 1 to 8 when executing the computer program.
11. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 8.
CN202011005042.6A 2020-09-21 2020-09-21 Audio file recommendation method and device, electronic equipment and readable storage medium Pending CN112214636A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011005042.6A CN112214636A (en) 2020-09-21 2020-09-21 Audio file recommendation method and device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011005042.6A CN112214636A (en) 2020-09-21 2020-09-21 Audio file recommendation method and device, electronic equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN112214636A true CN112214636A (en) 2021-01-12

Family

ID=74050061

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011005042.6A Pending CN112214636A (en) 2020-09-21 2020-09-21 Audio file recommendation method and device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN112214636A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113329263A (en) * 2021-05-28 2021-08-31 努比亚技术有限公司 Game video collection manufacturing method and device and computer readable storage medium
CN113326235A (en) * 2021-06-30 2021-08-31 重庆五洲世纪文化传媒有限公司 Parent-child recording system
CN113569136A (en) * 2021-07-02 2021-10-29 北京达佳互联信息技术有限公司 Video recommendation method and device, electronic equipment and storage medium
CN113836343A (en) * 2021-09-14 2021-12-24 深圳Tcl新技术有限公司 Audio recommendation method and device, electronic equipment and storage medium
CN115695860A (en) * 2021-07-21 2023-02-03 华为技术有限公司 Method for recommending video clip, electronic device and server
EP4141704A1 (en) * 2021-08-31 2023-03-01 Beijing Dajia Internet Information Technology Co., Ltd. Method and apparatus for music generation, electronic device, storage medium
CN116048765A (en) * 2023-03-17 2023-05-02 荣耀终端有限公司 Task processing method, sample data processing method and electronic equipment
CN116112779A (en) * 2023-02-23 2023-05-12 上海哔哩哔哩科技有限公司 Recommendation of shooting effect, device, storage medium and electronic equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080228689A1 (en) * 2007-03-12 2008-09-18 Microsoft Corporation Content recommendations
CN102640149A (en) * 2009-12-04 2012-08-15 索尼计算机娱乐公司 Music recommendation system, information processing device, and information processing method
CN109063163A (en) * 2018-08-14 2018-12-21 腾讯科技(深圳)有限公司 A kind of method, apparatus, terminal device and medium that music is recommended
CN110958386A (en) * 2019-11-12 2020-04-03 北京达佳互联信息技术有限公司 Video synthesis method and device, electronic equipment and computer-readable storage medium
CN111198958A (en) * 2018-11-19 2020-05-26 Tcl集团股份有限公司 Method, device and terminal for matching background music
CN111435369A (en) * 2019-01-14 2020-07-21 腾讯科技(深圳)有限公司 Music recommendation method, device, terminal and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080228689A1 (en) * 2007-03-12 2008-09-18 Microsoft Corporation Content recommendations
CN102640149A (en) * 2009-12-04 2012-08-15 索尼计算机娱乐公司 Music recommendation system, information processing device, and information processing method
CN109063163A (en) * 2018-08-14 2018-12-21 腾讯科技(深圳)有限公司 A kind of method, apparatus, terminal device and medium that music is recommended
CN111198958A (en) * 2018-11-19 2020-05-26 Tcl集团股份有限公司 Method, device and terminal for matching background music
CN111435369A (en) * 2019-01-14 2020-07-21 腾讯科技(深圳)有限公司 Music recommendation method, device, terminal and storage medium
CN110958386A (en) * 2019-11-12 2020-04-03 北京达佳互联信息技术有限公司 Video synthesis method and device, electronic equipment and computer-readable storage medium

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113329263A (en) * 2021-05-28 2021-08-31 努比亚技术有限公司 Game video collection manufacturing method and device and computer readable storage medium
CN113329263B (en) * 2021-05-28 2023-10-17 努比亚技术有限公司 Game video highlight production method, equipment and computer readable storage medium
CN113326235A (en) * 2021-06-30 2021-08-31 重庆五洲世纪文化传媒有限公司 Parent-child recording system
CN113569136A (en) * 2021-07-02 2021-10-29 北京达佳互联信息技术有限公司 Video recommendation method and device, electronic equipment and storage medium
CN113569136B (en) * 2021-07-02 2024-03-05 北京达佳互联信息技术有限公司 Video recommendation method and device, electronic equipment and storage medium
CN115695860A (en) * 2021-07-21 2023-02-03 华为技术有限公司 Method for recommending video clip, electronic device and server
EP4141704A1 (en) * 2021-08-31 2023-03-01 Beijing Dajia Internet Information Technology Co., Ltd. Method and apparatus for music generation, electronic device, storage medium
CN113836343A (en) * 2021-09-14 2021-12-24 深圳Tcl新技术有限公司 Audio recommendation method and device, electronic equipment and storage medium
CN116112779A (en) * 2023-02-23 2023-05-12 上海哔哩哔哩科技有限公司 Recommendation of shooting effect, device, storage medium and electronic equipment
CN116048765A (en) * 2023-03-17 2023-05-02 荣耀终端有限公司 Task processing method, sample data processing method and electronic equipment
CN116048765B (en) * 2023-03-17 2023-09-01 荣耀终端有限公司 Task processing method, sample data processing method and electronic equipment

Similar Documents

Publication Publication Date Title
CN110286976B (en) Interface display method, device, terminal and storage medium
WO2021129688A1 (en) Display method and related product
WO2020078299A1 (en) Method for processing video file, and electronic device
CN110134316B (en) Model training method, emotion recognition method, and related device and equipment
CN112214636A (en) Audio file recommendation method and device, electronic equipment and readable storage medium
WO2020151387A1 (en) Recommendation method based on user exercise state, and electronic device
CN113794801B (en) Method and device for processing geo-fence
CN111669515B (en) Video generation method and related device
WO2021258814A1 (en) Video synthesis method and apparatus, electronic device, and storage medium
CN112580400B (en) Image optimization method and electronic equipment
CN111625670A (en) Picture grouping method and device
CN111881315A (en) Image information input method, electronic device, and computer-readable storage medium
CN113742460B (en) Method and device for generating virtual roles
CN112740148A (en) Method for inputting information into input box and electronic equipment
WO2023179490A1 (en) Application recommendation method and an electronic device
CN114444000A (en) Page layout file generation method and device, electronic equipment and readable storage medium
CN114756785A (en) Page display method and device, electronic equipment and readable storage medium
CN115437601B (en) Image ordering method, electronic device, program product and medium
CN115525783B (en) Picture display method and electronic equipment
CN113507406B (en) Message management method and related equipment
CN115734032A (en) Video editing method, electronic device and storage medium
CN115809362A (en) Content recommendation method and electronic equipment
CN114077713A (en) Content recommendation method, electronic device and server
WO2024114785A1 (en) Image processing method, electronic device, and system
WO2022179271A1 (en) Search result feedback method and device, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination