WO2021004481A1 - Media files recommending method and device - Google Patents

Media files recommending method and device

Info

Publication number
WO2021004481A1
WO2021004481A1 (PCT/CN2020/100858; CN2020100858W)
Authority
WO
WIPO (PCT)
Prior art keywords
information
user
media file
emotional
electronic device
Prior art date
Application number
PCT/CN2020/100858
Other languages
French (fr)
Chinese (zh)
Inventor
王家凯
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Publication of WO2021004481A1 publication Critical patent/WO2021004481A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
    • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L25/45 Speech or voice analysis techniques characterised by the type of analysis window
    • G10L25/48 Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques specially adapted for particular use for comparison or discrimination for estimating an emotional state

Definitions

  • This application relates to the field of terminal technology, and in particular to a method and device for recommending media files.
  • With the application and popularization of smart terminal devices, smart voice devices play an increasingly important role in human-computer interaction. Enabling smart voice devices to recognize the emotions expressed in human voice information and to recommend data and services to users based on voice emotion is an important direction of artificial intelligence research today.
  • The current recommendation scheme based on the user's voice emotion uses an algorithm that extracts emotional features based on Mel-frequency cepstral coefficients (MFCC): the user's voice emotional features are extracted from the prosodic and sound-quality features in the voice information, a database is queried according to the correspondence between voice emotional features and emotion types, and data or services of the same or a similar emotion type are recommended to the user, for example, a sad song or a funny movie recommended according to the user's voice emotion.
  • However, this matching method only supports coarse-grained emotion matching, that is, data recommendation at the level of whole multimedia files. When the user wants the most exciting part of a multimedia file, for example when the user enters the voice message "I want to watch the funniest clips of movie XXX" or "I want to watch scary movie clips", no recommendation can be made for the user, or the recommendation accuracy is low, and the user experience is poor.
  • The present application provides a method and device for recommending media files, which solves the problems of low recommendation accuracy and poor user experience in prior-art recommendation schemes based on the user's voice emotion.
  • In a first aspect, a media file recommendation method is provided, which is applied to an electronic device.
  • The method includes: receiving a voice signal and converting the voice signal into text information; obtaining, according to the text information, a user intent and the slot information included in the user intent, where the slot information includes emotional information and timing information; and querying a media file library according to the user intent and the slot information to obtain the media file corresponding to the user intent and the slot information.
  • In the embodiments of this application, the electronic device queries the media file library according to the user intent and slot information contained in the user's voice information, and uses the timing information and emotional information to match the user with the multimedia file closest to the user's needs and emotional needs, so that the user's emotional needs can be accurately identified, fine-grained data can be intelligently recommended to the user, and the user experience is improved.
  • In a possible design, the media file library stores a first mapping relationship between multiple user intents, slot information, and multiple media file identifiers; querying the media file library according to the user intent and slot information to obtain the media file corresponding to the user intent and slot information includes: obtaining, according to the first mapping relationship, the media file corresponding to the user intent and slot information.
  • In this possible implementation, the electronic device queries the media file library according to the mapping relationship between user intents and slot information, so that the most suitable media file can be matched and recommended according to the different emotional needs of different users, thereby improving the accuracy and flexibility of intelligent recommendation and the user experience.
  • In a possible design, before obtaining the user intent and slot information from the text information, the method further includes: determining whether the text information contains a user intent; if it is determined that the text information does not contain a user intent, obtaining the emotional feature vector of the voice signal through the Mel-frequency cepstral coefficient (MFCC) algorithm; and querying the media file library according to the emotional feature vector to obtain the media file corresponding to the emotional feature vector, where the media file library stores a second mapping relationship between multiple emotional feature vectors and multiple pieces of emotional information, and each type of emotional information corresponds to multiple media files.
  • In this possible implementation, if the user's voice information does not contain an explicit user intent, the electronic device can still match media files according to the emotional information corresponding to the voice emotional features extracted from the user's voice information, thereby improving the flexibility of intelligent recommendation and the user experience.
  • In a possible design, before receiving the voice signal, the method further includes: obtaining the emotional information of users' comments on multiple media files; determining whether the emotional information is fine-grained emotional information or coarse-grained emotional information; if the emotional information is fine-grained emotional information, obtaining the slots in the fine-grained emotional information and establishing the first mapping relationship in the media file library; and if the emotional information is coarse-grained emotional information, obtaining the emotional feature vector according to the emotional information, obtaining the emotional information of the media file, and establishing the second mapping relationship for the media file.
  • In this possible implementation, the electronic device can extract emotional information and timing information from massive amounts of users' multimedia comment information to establish the mapping relationships, thereby generating the multimedia file library and improving the accuracy of intelligent recommendation.
  • In a possible design, converting the voice signal into text information includes: converting the voice signal into text information through automatic speech recognition (ASR).
  • In this possible implementation, the electronic device can recognize the text information included in the user's voice information through automatic speech recognition technology, thereby improving the accuracy of intelligent recommendation.
  • In a possible design, obtaining the user intent from the text information includes: obtaining the user intent from the text information through natural language understanding (NLU) technology.
  • In this possible implementation, the electronic device can recognize the user intent included in the user's voice information through natural language understanding technology and match recommendations according to the user intent, thereby improving the accuracy of intelligent recommendation.
  • In a second aspect, an electronic device is provided, which includes a processor and a memory connected to the processor.
  • The memory is used for storing instructions.
  • When the processor executes the instructions, the electronic device is caused to perform: receiving a voice signal and converting the voice signal into text information; obtaining, according to the text information, a user intent and the slot information included in the user intent, where the slot information includes emotional information and timing information; and querying the media file library according to the user intent and slot information to obtain the media file corresponding to the user intent and slot information.
  • In a possible design, the media file library stores a first mapping relationship between multiple user intents, slot information, and multiple media file identifiers; the electronic device is specifically configured to perform: obtaining, according to the first mapping relationship, the media file corresponding to the user intent and slot information.
  • In a possible design, the electronic device is further configured to perform: determining whether the text information contains a user intent; if it is determined that the text information does not contain a user intent, obtaining the emotional feature vector of the voice signal through the Mel-frequency cepstral coefficient (MFCC) algorithm; and querying the media file library according to the emotional feature vector to obtain the media file corresponding to the emotional feature vector, where the media file library stores a second mapping relationship between multiple emotional feature vectors and multiple emotional tags.
  • In a possible design, the electronic device is further configured to perform: obtaining the emotional information of users' comments on multiple media files; determining whether the emotional information is fine-grained emotional information or coarse-grained emotional information; if the emotional information is fine-grained emotional information, obtaining the slots in the fine-grained emotional information and establishing the first mapping relationship in the media file library; and if the emotional information is coarse-grained emotional information, obtaining the emotional feature vector according to the emotional tag, obtaining the emotional tag of the media file, and establishing the second mapping relationship for the media file.
  • In a possible design, converting the voice signal into text information includes: converting the voice signal into text information through automatic speech recognition (ASR).
  • In a possible design, obtaining the user intent from the text information includes: obtaining the user intent from the text information through natural language understanding (NLU) technology.
  • In another aspect, a chip system is provided, which is applied to an electronic device; the chip system includes one or more interface circuits and one or more processors; the interface circuits and the processors are interconnected by wires; and the interface circuit is configured to receive a signal from the memory of the electronic device and send the signal to the processor.
  • The signal includes the computer instructions stored in the memory; when the processor executes the computer instructions, the electronic device performs the method of the first aspect and any of its possible designs.
  • In another aspect, a readable storage medium is provided, which stores instructions.
  • When the instructions run on an electronic device, the electronic device is caused to perform the method of the first aspect and any of its possible designs.
  • In another aspect, a computer program product is provided.
  • When the computer program product runs on a computer, the computer is caused to perform the method of the first aspect and any of its possible designs.
  • It can be understood that any of the electronic device, chip system, readable storage medium, and computer program product for media file recommendation provided above is used to execute the corresponding method provided above; therefore, for the beneficial effects that they can achieve, reference may be made to the beneficial effects of the corresponding method provided above, which are not repeated here.
  • FIG. 1 is a schematic diagram of the hardware structure of an electronic device provided by an embodiment of this application.
  • FIG. 2 is a software system architecture diagram of an electronic device provided by an embodiment of this application.
  • FIG. 3 is a schematic flowchart of a method for recommending media files according to an embodiment of this application.
  • FIG. 4 is a schematic diagram of a process for extracting emotional feature vectors provided by an embodiment of this application.
  • FIG. 5 is a schematic flowchart of establishing a media file library in a method for recommending media files according to an embodiment of this application.
  • FIG. 6 is a schematic structural diagram of a chip system provided by an embodiment of this application.
  • Intelligent voice device: an electronic device that can receive the user's voice information, output voice information, and interact with the user by voice.
  • Automatic speech recognition (ASR): a technology that converts human voice signals into text.
  • Natural language understanding (NLU) technology: a technology that recognizes the text content and intent in human natural language, that is, a technology that allows computers to "understand" natural language, so that natural language can be used for communication between human and machine. It covers a wide range of fields, including sentence detection, word segmentation, part-of-speech tagging, syntactic analysis, text classification/clustering, text sentiment analysis, information extraction/automatic summarization, machine translation, automatic question answering, text generation, and many others.
  • Slot: a concept in human-machine dialogue. A slot is the definition of a piece of key information identified in the user's voice information, that is, the information needed to turn the user intent into an explicit user instruction; one slot corresponds to one kind of information that needs to be obtained in handling a task.
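For illustration only (not part of the patent text), the intent and slots extracted from an utterance could be represented with a small structure such as the one below; the field names and values are assumptions chosen for readability, not terms defined by the application.

```python
# Hypothetical representation of a parsed utterance; all names are illustrative.
parsed_utterance = {
    "intent": "play_media_segment",        # the classified user need
    "slots": {
        "macro_tag": "movie",              # media data attribute
        "title": "The King of Comedy",     # explicit media file name
        "emotion": "funny",                # emotional information (emotional tag)
        "timing": "clip",                  # timing information (timing tag)
    },
}
```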
  • The embodiments of the present application provide a media file recommendation method, which can be applied to electronic devices that include smart voice devices, such as voice assistants, smart speakers, smart phones, tablet computers, computers, wearable electronic devices, and smart robots.
  • The electronic device can intelligently recognize the emotions and recommendation needs expressed in the user's voice information and recommend fine-grained data, such as segment-level media files, to the user, which improves the accuracy of data recommendation and thereby the user's experience.
  • FIG. 1 is a schematic diagram of a possible structure of an electronic device 100 provided by an embodiment of this application.
  • the electronic device 100 may include: a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, and a battery 142, antenna 1, antenna 2, mobile communication module 150, wireless communication module 160, audio module 170, speaker 170A, receiver 170B, microphone 170C, earphone jack 170D, sensor module 180, buttons 190, motor 191, indicator 192, camera 193, a display screen 194, and a subscriber identification module (SIM) card interface 195, etc.
  • The aforementioned sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, and other sensors.
  • the structure illustrated in this embodiment does not constitute a specific limitation on the electronic device 100.
  • the electronic device 100 may include more or fewer components than shown, or combine certain components, or split certain components, or arrange different components.
  • the illustrated components can be implemented in hardware, software, or a combination of software and hardware.
  • The processor 110 may include one or more processing units.
  • For example, the processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc.
  • The different processing units may be independent devices or may be integrated in one or more processors.
  • The controller may be the nerve center and command center of the electronic device 100.
  • The controller can generate an operation control signal according to the instruction operation code and the timing signal, to complete the control of instruction fetching and instruction execution.
  • A memory may also be provided in the processor 110 to store instructions and data.
  • In some embodiments, the memory in the processor 110 is a cache.
  • The cache can store instructions or data that the processor 110 has just used or uses cyclically. If the processor 110 needs to use the instructions or data again, it can call them directly from the cache. This avoids repeated accesses, reduces the waiting time of the processor 110, and improves system efficiency.
  • The processor 110 may include one or more interfaces.
  • The interfaces may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (SIM) interface, and/or a universal serial bus (USB) interface, etc.
  • The interface connection relationship between the modules illustrated in this embodiment is merely a schematic description and does not constitute a structural limitation of the electronic device 100.
  • In other embodiments, the electronic device 100 may also adopt an interface connection mode different from that in the foregoing embodiment, or a combination of multiple interface connection modes.
  • the external memory interface 120 may be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the electronic device 100.
  • the external memory card communicates with the processor 110 through the external memory interface 120 to realize the data storage function. For example, save music, video and other files in an external memory card.
  • the internal memory 121 may be used to store computer executable program code, where the executable program code includes instructions.
  • the processor 110 executes various functional applications and data processing of the electronic device 100 by running instructions stored in the internal memory 121.
  • the processor 110 may execute instructions stored in the internal memory 121, and the internal memory 121 may include a storage program area and a storage data area.
  • the storage program area can store an operating system, at least one application program (such as a sound playback function, an image playback function, etc.) required by at least one function.
  • the data storage area can store data (such as audio data, phone book, etc.) created during the use of the electronic device 100.
  • the internal memory 121 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, a universal flash storage (UFS), etc.
  • the electronic device 100 can implement audio functions through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the earphone interface 170D, and the application processor. For example, music playback, recording, etc.
  • the audio module 170 is used to convert digital audio information into an analog audio signal for output, and is also used to convert an analog audio input into a digital audio signal.
  • the audio module 170 can also be used to encode and decode audio signals.
  • the audio module 170 may be provided in the processor 110, or part of the functional modules of the audio module 170 may be provided in the processor 110.
  • The speaker 170A, also called a "loudspeaker", is used to convert an audio electrical signal into a sound signal.
  • The electronic device 100 can play music or take a hands-free call through the speaker 170A.
  • The receiver 170B, also called an "earpiece", is used to convert an audio electrical signal into a sound signal.
  • When the electronic device 100 answers a call or a voice message, the receiver 170B can be brought close to the human ear to receive the voice.
  • The microphone 170C, also called a "mic" or "mike", is used to convert a sound signal into an electrical signal.
  • When making a sound, the user can speak close to the microphone 170C to input the sound signal into the microphone 170C.
  • The electronic device 100 may be provided with at least one microphone 170C. In other embodiments, the electronic device 100 may be provided with two microphones 170C, which can implement a noise reduction function in addition to collecting sound signals. In still other embodiments, the electronic device 100 may be provided with three, four, or more microphones 170C to collect sound signals, reduce noise, identify sound sources, implement directional recording functions, and so on.
  • the software system of the electronic device 100 may adopt a layered architecture, an event-driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture.
  • the embodiment of the present invention takes a layered Android system as an example to illustrate the software structure of the electronic device 100 by way of example.
  • FIG. 2 is a software structure block diagram of an electronic device 100 according to an embodiment of the present invention.
  • The layered architecture divides the software into several layers, and each layer has a clear role and division of labor. The layers communicate with each other through software interfaces.
  • The Android system is divided into four layers, which are, from top to bottom, the application layer, the application framework layer, the Android runtime and system libraries, and the kernel layer.
  • the application layer can include a series of application packages.
  • the application package can include applications such as camera, gallery, calendar, call, map, navigation, WLAN, Bluetooth, music, video, SMS and voice assistant.
  • the application framework layer provides application programming interfaces (application programming interface, API) and programming frameworks for applications in the application layer.
  • the application framework layer includes some predefined functions.
  • the application framework layer can include a window manager, a content provider, a view system, a phone manager, a resource manager, and a notification manager.
  • the window manager is used to manage window programs.
  • the window manager can obtain the size of the display, determine whether there is a status bar, lock the screen, take a screenshot, etc.
  • the content provider is used to store and retrieve data and make these data accessible to applications.
  • the data may include video, image, audio, phone calls made and received, browsing history and bookmarks, phone book, etc.
  • the view system includes visual controls, such as controls that display text and controls that display pictures.
  • the view system can be used to build applications.
  • the display interface can be composed of one or more views.
  • a display interface that includes a short message notification icon may include a view that displays text and a view that displays pictures.
  • the phone manager is used to provide the communication function of the electronic device 100. For example, the management of the call status (including connecting, hanging up, etc.).
  • the resource manager provides various resources for the application, such as localized strings, icons, pictures, layout files, video files, etc.
  • The notification manager enables an application to display notification information in the status bar. It can be used to convey notification-type messages, which can disappear automatically after a short stay without user interaction.
  • For example, the notification manager is used to notify of download completion, provide message reminders, and so on.
  • The notification manager may also display notifications in the status bar at the top of the system in the form of a chart or scroll-bar text, for example notifications of applications running in the background, or notifications that appear on the screen in the form of a dialog window. For example, text information is prompted in the status bar, a prompt tone is played, the electronic device vibrates, or an indicator light flashes.
  • The Android runtime includes a core library and a virtual machine. The Android runtime is responsible for the scheduling and management of the Android system.
  • The core library consists of two parts: one part is the functions that the Java language needs to call, and the other part is the core library of Android.
  • the application layer and the application framework layer run in a virtual machine.
  • the virtual machine executes the java files of the application layer and the application framework layer as binary files.
  • the virtual machine is used to perform functions such as object life cycle management, stack management, thread management, security and exception management, and garbage collection.
  • the system library can include multiple functional modules. For example: surface manager (surface manager), media library (Media Libraries), three-dimensional graphics processing library (for example: OpenGL ES), 2D graphics engine (for example: SGL), etc.
  • the surface manager is used to manage the display subsystem and provides a combination of 2D and 3D layers for multiple applications.
  • the media library supports playback and recording of a variety of commonly used audio and video formats, as well as still image files.
  • the media library can support multiple audio and video encoding formats, such as: MPEG4, H.264, MP3, AAC, AMR, JPG, PNG, etc.
  • the 3D graphics processing library is used to realize 3D graphics drawing, image rendering, synthesis, and layer processing.
  • the 2D graphics engine is a drawing engine for 2D drawing.
  • the kernel layer is the layer between hardware and software.
  • the kernel layer contains at least display driver, camera driver, audio driver, and sensor driver.
  • An embodiment of the present application provides a media file recommendation method. As shown in FIG. 3, the method may include steps 301 to 303:
  • 301. The electronic device receives a voice signal and converts the voice signal into text information.
  • Specifically, the voice signal produced when the user speaks is received, and ASR technology is used to convert the voice signal into corresponding text information.
  • The process by which ASR technology converts voice information into text information may include: voice signal preprocessing and feature extraction; acoustic model and pattern matching; and language model and language processing.
  • First, a speech recognition unit is selected from words (sentences), syllables, or phonemes, and voice features are extracted from the voice information; then, the extracted voice features are matched and compared against the pre-established acoustic model (pattern) to obtain the best recognition result; finally, the language model is matched, that is, the grammar network formed by the recognized voice commands or the language model built by statistical methods is matched, and language processing such as grammatical and semantic analysis is performed to generate the text information corresponding to the voice information.
  • For example, the electronic device converts a received piece of the user's voice audio into the text message: "I want to listen to a sad song."
  • In the above method, the electronic device 100 may obtain the user's voice signal through the microphone 170C and send the voice signal to the processor 110 for processing.
  • The audio module 170 in the processor 110 may process the voice signal.
  • The system can instruct the voice assistant program at the application layer through commands, invoke the related programs of the application framework layer and the related functions of the core library, process the voice signal, and convert it into text information.
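As a concrete but non-normative illustration of step 301, the sketch below captures audio from a microphone and converts it to text using the third-party SpeechRecognition package; the package choice, the Google Web Speech backend, and the language code are assumptions, not part of this application, which only requires that some ASR engine turn the voice signal into text.

```python
# Minimal sketch of "receive a voice signal and convert it to text" (step 301).
# Assumes the third-party SpeechRecognition package (plus PyAudio for the
# microphone); any on-device or cloud ASR engine could be substituted.
import speech_recognition as sr

def capture_and_transcribe(language="en-US"):
    recognizer = sr.Recognizer()
    with sr.Microphone() as source:              # receive the voice signal
        recognizer.adjust_for_ambient_noise(source)
        audio = recognizer.listen(source)
    try:
        # Convert the voice signal into text information (ASR).
        return recognizer.recognize_google(audio, language=language)
    except sr.UnknownValueError:
        return ""                                # speech was unintelligible

# Example: capture_and_transcribe() -> "I want to listen to a sad song"
```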
  • 302. The electronic device obtains, according to the text information, the user intent and the slot information included in the user intent.
  • The user intent is the user's need, that is, information indicating what task the user wants the electronic device to complete.
  • Specifically, the user intent may be an intent keyword included in the above text information, where the intent keyword can be used to classify the user's need into a certain type.
  • The intent keywords may include: media data attributes, emotional intents, media data file names, keywords related to the media data, and so on.
  • Media data attributes: for example, music, movies, variety shows, dramas, fine arts, literary works, photos, and so on.
  • Emotional intents: for example, happy, sad, scary, and so on.
  • The media data file name identifies the media data explicitly required by the user, such as "Jane Eyre" or "The King of Comedy".
  • Keywords related to the media data: for example, a segment of a certain character in a film, a segment of a specific plot in a film, a passage with a certain rhythm in a piece of music, or a description of a plot in a literary work, which can locate the user's need.
  • The slot information may include: timing information and emotional information.
  • The timing information may be a part of the content of a media file marked with a tag, and may also be called a timing tag; it corresponds to a part of the content of the media file and may be a specific time point of the media file or a timing segment, for example, the 12th minute of a movie, or the second to the third minute of a certain piece of music.
  • The emotional information may be the emotion type of a media file marked with a tag, and may also be called an emotional tag; specifically, it may include: beautiful, happy, sad, scary, cheerful, exciting, and so on.
  • For example, if the text information obtained in step 301 is "I want to listen to a sad song", the intent keyword that can be extracted is: song, and the slot information extracted is: sad.
  • For another example, if the obtained text information is "I want to see a funny plot in Charlotte's Troubles", the intent keyword that can be extracted is: the movie "Charlotte's Troubles"; the timing tag in the extracted slot information is: a plot; and the emotional tag in the extracted slot information is: funny.
  • The user intent in the text information can be obtained through NLU technology. Specifically, NLU technology can use deep learning techniques and neural network algorithms to identify all the words and phrases included in the text information, perform semantic understanding of the text, and determine the user intent. The specific implementation process of this technology is not described in detail in the embodiments of this application.
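The application leaves the NLU implementation open (deep learning and neural network algorithms are mentioned but not specified). Purely as an illustration of the shape of the output, that is, an intent keyword plus emotion and timing slots, the toy rule-based stand-in below uses keyword lists that are assumptions of this sketch, not vocabularies defined by the application.

```python
# A toy, rule-based stand-in for the NLU step: extract an intent keyword and
# the emotion/timing slots from the transcribed text. A real system would use
# deep-learning NLU; the vocabularies below are illustrative only.
MACRO_TAGS = ["song", "music", "movie", "drama", "variety show", "photo"]
EMOTION_TAGS = ["sad", "happy", "funny", "scary", "touching", "cheerful"]
TIMING_WORDS = ["clip", "segment", "plot", "episode", "scene"]

def parse_intent_and_slots(text: str):
    lowered = text.lower()
    intent = next((t for t in MACRO_TAGS if t in lowered), None)
    slots = {
        "emotion": next((t for t in EMOTION_TAGS if t in lowered), None),
        "timing": next((t for t in TIMING_WORDS if t in lowered), None),
    }
    return intent, slots

# parse_intent_and_slots("I want to watch a scary movie clip")
# -> ("movie", {"emotion": "scary", "timing": "clip"})
```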
  • In some embodiments, if the text information does not contain a clear user intent, the electronic device uses the MFCC algorithm to extract the user's voice emotional features, and the emotional label matched by those features serves as the user's emotional demand.
  • Specifically, the obtained user voice information is processed by the MFCC algorithm to obtain an MFCC feature vector representing the emotional features of the user's voice; the MFCC feature vector is matched against the preset pairs of MFCC feature vectors and emotional labels, and the matched emotional label is used in the following steps.
  • The specific process by which the MFCC algorithm extracts the user's voice emotional features may be as shown in FIG. 4, and includes: analog-to-digital conversion, pre-emphasis, framing and windowing, Fourier transform, Mel filtering, cepstrum, and energy and difference processing. This process generates the MFCC feature vector.
  • Analog-to-digital conversion: the input analog signal is converted into a digital signal.
  • Pre-emphasis: the digital signal is passed through a high-pass filter. The purpose is to boost the high-frequency part and flatten the frequency spectrum of the signal, so that the spectrum can be computed with the same signal-to-noise ratio over the entire frequency band from low to high frequency. Pre-emphasis also eliminates the effects of the vocal cords and lips during voice production, compensates for the high-frequency part of the voice signal suppressed by the articulatory system, and highlights the high-frequency formants.
  • Framing and windowing: N sampling points are gathered into one observation unit, called a frame. Each frame is multiplied by a Hamming window (the Hamming window selects a period of the signal) to increase the continuity between the left and right ends of the frame.
  • Fourier transform: after multiplication by the Hamming window, a fast Fourier transform is performed on each framed and windowed signal to obtain the spectrum of each frame, that is, the energy distribution over the frequency domain; the modulus square of the spectrum of the speech signal is then taken to obtain the power spectrum of the speech signal.
  • Mel filtering: the spectrum is smoothed by Mel filtering and the effect of harmonics is eliminated, highlighting the formants of the original voice. As a result, the pitch of a piece of speech does not appear in the MFCC parameters; in other words, a speech recognition system using MFCC features is not affected by the pitch of the input speech. In addition, the amount of computation is reduced.
  • Cepstrum: the logarithm of the signal's Fourier spectrum is taken and an inverse Fourier transform is then performed. In this step, the logarithmic energy output by each filter bank can be obtained.
  • Energy and difference processing: the standard cepstral parameters (MFCC) reflect only the static characteristics of the speech, and the dynamic characteristics of the speech can be described by the difference spectrum of these static features. Experiments show that combining dynamic and static features can effectively improve the recognition performance of the system.
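The following is a minimal, self-contained sketch of the pipeline described above (pre-emphasis, framing, Hamming window, FFT power spectrum, Mel filter bank, log-energy plus DCT cepstrum, and first-order difference). The sample rate, frame sizes, filter counts, and the final averaging step are illustrative assumptions, not values specified by this application.

```python
# Sketch of the MFCC feature-extraction pipeline using NumPy/SciPy only.
import numpy as np
from scipy.fftpack import dct

def mfcc(signal, sample_rate=8000, frame_len=200, frame_step=80,
         n_filters=26, n_ceps=13, nfft=256):
    # Pre-emphasis: first-order high-pass filter boosts the high-frequency part.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])

    # Framing: gather N sampling points into one observation unit (a frame),
    # then multiply each frame by a Hamming window.
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // frame_step)
    idx = (np.arange(frame_len)[None, :] +
           np.arange(n_frames)[:, None] * frame_step)
    frames = emphasized[idx] * np.hamming(frame_len)

    # Fast Fourier transform and power spectrum of each frame.
    power = (np.abs(np.fft.rfft(frames, nfft)) ** 2) / nfft

    # Triangular Mel filter bank between 0 Hz and the Nyquist frequency.
    def hz_to_mel(hz): return 2595 * np.log10(1 + hz / 700.0)
    def mel_to_hz(mel): return 700 * (10 ** (mel / 2595.0) - 1)
    mel_points = np.linspace(0, hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fbank[m - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)

    # Cepstrum: log filter-bank energies followed by a DCT.
    log_energy = np.log(power @ fbank.T + 1e-10)
    ceps = dct(log_energy, type=2, axis=1, norm='ortho')[:, :n_ceps]

    # Difference processing: first-order delta captures dynamic characteristics.
    delta = np.diff(ceps, axis=0, prepend=ceps[:1])
    return np.hstack([ceps, delta])

# One way to obtain a fixed-length emotional feature vector for an utterance
# is to average over frames: feature = mfcc(audio_samples).mean(axis=0)
```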
  • In other embodiments, the above user intent may also include other keywords indicating the user's recommendation needs, for example, information indicating a certain plot, information indicating a certain line, information indicating a certain actor, or text describing a certain plot.
  • For example, if the user's voice information obtained by the electronic device is "I want to watch a clip of the heroine crying in The King of Comedy", the user's demand features that can be extracted include the keyword: the heroine crying; the explicit media data information: the movie "The King of Comedy"; and the timing information: clip.
  • The electronic device can use the above keywords as the emotional need for the query.
  • In the above method, the system can instruct the voice assistant program at the application layer through commands, and invoke the related programs of the application framework layer and the related functions of the core library, to recognize and process the text information and extract the intent and slot information it contains according to a certain algorithm.
  • 303. The electronic device queries the media file library according to the user intent and the slot information, and obtains the media file corresponding to the user intent and the slot information.
  • The media file library is pre-established by the electronic device or is obtained by the electronic device through a cloud service.
  • The cloud device that provides the electronic device with data processing and data storage services may specifically be a server.
  • The media file library stores a first mapping relationship between multiple user intents, slot information, and multiple media file identifiers.
  • Specifically, it may be a fine-grained media file library containing emotional tags and timing tags.
  • It includes massive media files such as music, movies, dramas, fine arts, literary works, and photos.
  • Each media file can contain a macro tag, such as music, movie, drama, art, literature, or photo, and can also include a specific multimedia file name, such as "The King of Comedy".
  • The macro tag can correspond to the user's intent keyword.
  • Each multimedia file can also contain emotional tags, for example, happy, sad, scary, and so on.
  • Each multimedia file may also include at least one piece of timing information, for example, the third minute to the fourth minute, the last ten minutes, and so on.
  • The slot information representing the emotional intent may form a correspondence with the slot information representing the timing. For example, the emotional tag corresponding to the third to the fourth minute of a movie is sad, and the emotional tag corresponding to the last ten minutes of the movie is happy, and so on.
  • Querying the media file library according to the intent and the slots to obtain the media file corresponding to the intent and the slots means obtaining, according to the first mapping relationship, the media file corresponding to the intent and the slots; that is, the media file library is queried according to the emotional tags, timing information, macro tags, and so on obtained in the above steps, and the multimedia data that best matches the emotional need is taken as the matching data.
  • The specific matching process may be as follows: first, the corresponding database is queried based on the macro tag, for example movies, or the movie "The King of Comedy"; then the media file library is queried according to the emotional tag, and the movie segment whose emotional tag is closest to the required emotional tag is matched, that is, the timing tag closest to the emotional tag is found, and the media data segment corresponding to that timing tag is the media file matched by the electronic device for the user.
  • In some embodiments, an emotional tag may also include a corresponding recommendation value. The recommendation value may be a quantified emotional value representing the data segment corresponding to the emotional tag of the multimedia data, and it may be used to calculate the matching degree. The recommendation value can be represented by the amount of data annotated with the emotional tag, the amount of search data, or the user score; the higher the recommendation value, the higher the matching degree of the emotional tag, and the lower the recommendation value, the lower the matching degree of the emotional tag.
  • For example, if the emotional tags of multiple timing segments of a certain movie are all funny, then when matching the funniest timing segment for the user, the timing segment whose emotional tag has the highest recommendation value among those segments is taken as the data with the highest matching degree.
  • In some embodiments, the media file library may be a file library of fine-grained multimedia tags, containing not only tags related to emotion types but also other tags that users may want recommended, for example, a segment of a certain character in a film, a segment of a specific plot in a film, a passage with a certain rhythm in a piece of music, a description of a plot in a literary work, and so on.
  • After the multimedia data with the highest matching degree is obtained, it is recommended to the user by sending voice information or by directly sending the multimedia data to the user.
  • For example, the electronic device obtains the user's voice information "I want to watch the most touching episode in The King of Comedy", processes the voice information, and recognizes the following emotional need: the emotional tag is: most touching; the slot information representing the timing is: an episode; and the extracted multimedia file name is: the movie "The King of Comedy".
  • The electronic device queries the multimedia file library according to the above emotional need, searches the timing segments of the movie "The King of Comedy" whose emotional tags are touching, sad, or tearful, and selects the data with the highest recommendation value in the query results, such as the plot segment from the 40th to the 50th minute of the movie; the electronic device then recommends that multimedia data to the user as the matching data.
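As a non-normative sketch of step 303, the structures and function below show one way the first mapping relationship (macro tag, title, emotional tag, timing tag, recommendation value) could be held in memory and queried, with the highest recommendation value used to break ties among matching segments. All names and the sample entries are assumptions of this sketch.

```python
# Illustrative in-memory form of the first mapping relationship and the
# matching step: filter by macro tag / title / emotional tag, then return the
# timing segment with the highest recommendation value. Sample data is made up.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class SegmentEntry:
    macro_tag: str          # e.g. "movie"
    title: str              # media file identifier, e.g. "The King of Comedy"
    emotion_tag: str        # e.g. "touching"
    timing_tag: str         # e.g. "40:00-50:00"
    recommend_value: float  # quantified emotional value used for matching

MEDIA_LIBRARY: List[SegmentEntry] = [
    SegmentEntry("movie", "The King of Comedy", "touching", "40:00-50:00", 0.92),
    SegmentEntry("movie", "The King of Comedy", "funny",    "03:00-04:00", 0.81),
]

def query_library(intent: str, emotion: Optional[str],
                  title: Optional[str] = None) -> Optional[SegmentEntry]:
    candidates = [
        e for e in MEDIA_LIBRARY
        if e.macro_tag == intent
        and (title is None or e.title == title)
        and (emotion is None or e.emotion_tag == emotion)
    ]
    # The highest recommendation value wins when several segments match.
    return max(candidates, key=lambda e: e.recommend_value, default=None)

# query_library("movie", "touching", "The King of Comedy")
# -> the 40:00-50:00 segment, which is then recommended to the user
```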
  • In some embodiments, the user's voice information acquired by the electronic device contains only a macro tag.
  • For example, the user's voice information acquired by the electronic device is "I want to watch a movie", and the macro tag is movie.
  • The electronic device queries the media file library according to the macro tag, matches movies, and recommends those with higher recommendation values to the user.
  • In some embodiments, the user's voice information acquired by the electronic device contains only a macro tag and an emotional tag.
  • For example, the user's voice information acquired by the electronic device is "I want to watch a horror movie"; the macro tag is movie and the emotional tag is horror. The electronic device queries the media file library according to the macro tag and the emotional tag, matches horror movies, and selects those with higher recommendation values to recommend to the user.
  • In some embodiments, if the text information does not contain a user intent, the emotional feature vector corresponding to the voice signal is obtained through the Mel-frequency cepstral coefficient (MFCC) algorithm.
  • The media file library is then queried according to the emotional feature vector, and media files under the emotional tags corresponding to similar emotional features are selected from the media file library for recommendation.
  • The media file library stores a second mapping relationship between multiple emotional feature vectors and multiple emotional tags; each emotional tag corresponds to at least one media file, and any media file matched by the emotional tag can be recommended to the user as guaranteed (fallback) recommendation data.
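A minimal sketch of this fallback path is shown below, assuming the utterance-level feature is the frame-averaged MFCC vector from the earlier sketch and that preset (feature vector, emotional label) pairs are compared by cosine similarity; the similarity measure, the stored vectors, and the file names are assumptions, not details specified by the application.

```python
# Fallback matching when the text contains no explicit user intent: compare the
# utterance's emotional feature vector against preset (vector, label) pairs and
# recommend a media file stored under the best-matching label. Data is made up,
# with vectors truncated to three dimensions for readability.
import numpy as np

SECOND_MAPPING = {                       # emotional label -> preset feature vector
    "sad":   np.array([0.1, -0.4, 0.7]),
    "happy": np.array([0.6,  0.2, -0.1]),
}
LABEL_TO_FILES = {                       # each label maps to fallback media files
    "sad":   ["sad_song_01.mp3"],
    "happy": ["cheerful_song_07.mp3"],
}

def recommend_by_emotion_vector(feature: np.ndarray) -> str:
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-10))
    label = max(SECOND_MAPPING, key=lambda k: cosine(feature, SECOND_MAPPING[k]))
    return LABEL_TO_FILES[label][0]      # any file under the label can be returned

# recommend_by_emotion_vector(np.array([0.2, -0.3, 0.8]))  -> "sad_song_01.mp3"
```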
  • In the above method, the voice assistant program can invoke the related programs of the application framework layer and the related functions of the core library, and use a certain matching algorithm to obtain the corresponding media file according to the extracted user intent and slots.
  • In some embodiments, the process by which the electronic device establishes the multimedia file library may be as shown in FIG. 5, and includes:
  • Obtaining a large amount of comment information on multimedia data, that is, the emotional information of users' comments on multiple media files. The comments can come from various channels on the Internet, for example, comments on multimedia files in user comments on forums, post bars, news sites, applications, and other websites, and can also include the user comment areas and bullet-screen message areas of movie and video websites. Specifically, web crawler technology can be used to crawl comments from the Internet, and a large amount of comment information about multimedia files, such as comments about music, movies, dramas, fine arts, literary works, or images, can be obtained according to a comment extraction model.
  • The comment information can be macroscopic, that is, coarse-grained emotional information, for example, "This music is very sad"; it can also be comment information at the level of timing segments, that is, fine-grained emotional information, for example, "The keynote of the movie is comedy, but the last 15 minutes are still very touching", or "The film is relatively plain overall, but the plot from the 30th to the 40th minute is notable".
  • If the emotional information is fine-grained emotional information, the first mapping relationship of the media file is established according to the obtained slot information; if the emotional information is coarse-grained emotional information, the emotional feature vector is obtained according to the emotional tag, and the second mapping relationship of the media file is established between the tag and the emotional feature vector.
  • Specifically, the obtained multimedia comment information is labeled, by manual labeling or by a rule-matching algorithm, to obtain the comment keywords in the multimedia comment information, such as emotional tags, timing tags, macro tags, or other keywords indicating recommendation needs, and the corresponding relationships among them.
  • It is determined whether the multimedia comment information is fine-grained emotional comment information or coarse-grained emotional comment information; if it is fine-grained emotional comment information, the slots in the fine-grained emotional comment information are obtained, and the first mapping relationship between the multimedia file and the slot information is established in the media file library; the slots may specifically include timing tags, emotional tags, and so on.
  • If the emotional information is coarse-grained emotional comment information, the emotional feature vector is obtained according to the emotional tag, the emotional tag of the media file is obtained, and the second mapping relationship of the media file is established.
  • For example, the correspondence between the timing tags and the emotional tags of a certain movie is stored, and the multimedia data is stored correspondingly.
  • A certain timing-information and emotional-tag pair may be: the emotional tag corresponding to the third to the fourth minute is sad, and the emotional tag corresponding to the last ten minutes is happy.
  • In addition, the emotional tags or macro tags attached to the multimedia on each platform, for example comedy, tragedy, fast song, sad song, and so on, are directly saved into the media file library and recommended as guaranteed (fallback) data.
  • Then data storage is carried out and the multimedia file library is established, so that information query and data matching can be performed in the multimedia file library according to the emotional needs in the user's voice information.
  • In some embodiments, the multimedia file library may be continuously updated: new comment information about the multimedia files is continuously acquired, and the emotional tags about timing segments included in the comment information are extracted, so as to refine and enrich the resources of the multimedia file library.
  • For example, mapping relationships such as the emotional tag and plot description of each frame of a movie can be established at a fine granularity, so as to match and recommend the timing segments of multimedia files for users more accurately.
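A rough sketch of the library-building flow in FIG. 5 is given below, assuming comments have already been crawled: a comment is treated as fine-grained when a timing expression can be found in it, in which case a first-mapping entry (title, timing tag, emotional tag) is created; otherwise only a coarse emotional tag is stored for the whole file. The regular expression and keyword list are assumptions of this sketch.

```python
# Illustrative classification of crawled comments into fine-grained entries
# (timing tag + emotional tag, for the first mapping relationship) and
# coarse-grained whole-file tags (for the second mapping relationship).
import re

EMOTIONS = ["sad", "funny", "scary", "touching", "cheerful"]
TIMING_RE = re.compile(r"(\d+)\s*(?:-|to)\s*(\d+)\s*minutes?|last\s+(\d+)\s*minutes?")

first_mapping = []   # (title, timing_tag, emotion_tag) entries
coarse_tags = {}     # title -> emotion_tag (guaranteed/fallback recommendations)

def ingest_comment(title: str, comment: str) -> None:
    text = comment.lower()
    emotion = next((e for e in EMOTIONS if e in text), None)
    if emotion is None:
        return                                   # no usable emotional information
    timing = TIMING_RE.search(text)
    if timing:                                   # fine-grained: segment-level comment
        first_mapping.append((title, timing.group(0), emotion))
    else:                                        # coarse-grained: whole-file comment
        coarse_tags[title] = emotion

ingest_comment("Some Movie", "The keynote is comedy, but the last 15 minutes are touching")
ingest_comment("Some Song", "This music is very sad")
```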
  • An embodiment of the present application also provides an electronic device, which may include a memory and one or more processors, the memory being coupled to the processors.
  • The memory is used to store computer program code, and the computer program code includes computer instructions.
  • When the processor executes the computer instructions, the electronic device can perform the functions or steps in the foregoing method embodiments.
  • the chip system includes at least one processor 601 and at least one interface circuit 602.
  • the processor 601 and the interface circuit 602 may be interconnected by wires.
  • the interface circuit 602 may be used to receive signals from other devices (such as the memory of an electronic device).
  • the interface circuit 602 may be used to send signals to other devices (for example, the processor 601).
  • the interface circuit 602 can read an instruction stored in the memory, and send the instruction to the processor 601.
  • When the processor 601 executes the instructions, the electronic device can be made to perform the functions or steps performed by the electronic device in the foregoing embodiments.
  • the chip system may also include other discrete devices, which are not specifically limited in the embodiment of the present application.
  • The embodiments of the present application also provide a computer storage medium, which includes computer instructions; when the computer instructions run on the above-mentioned electronic device, the electronic device is caused to perform the functions or steps in the above method embodiments.
  • The embodiments of the present application also provide a computer program product; when the computer program product runs on a computer, the computer is caused to perform the functions or steps in the foregoing method embodiments.
  • the disclosed device and method may be implemented in other ways.
  • the device embodiments described above are merely illustrative.
  • The division of the modules or units is only a logical function division.
  • In actual implementation, there may be other division manners; for example, multiple units or components may be combined or integrated into another device, or some features may be omitted or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate parts may or may not be physically separate.
  • The parts displayed as units may be one physical unit or multiple physical units, that is, they may be located in one place or distributed across multiple different places. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • each unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
  • the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a readable storage medium.
  • Based on this understanding, the technical solutions of the embodiments of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solutions, can be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for enabling a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application.
  • The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or other media that can store program code.

Abstract

Provided are a media file recommendation method and device, relating to the technical field of terminals, which can be applied to an intelligent voice apparatus and can solve the problem of low accuracy and poor user experience in prior-art recommendation solutions based on the user's voice emotion. The specific solution includes: an electronic apparatus (100) receives the user's voice signal and converts the voice signal into text information; acquires, according to the text information, the user intention and the slot information included in the user intention, where the slot information can include emotion information and timing information; and queries the media file library according to the user intention and the slot information to obtain the media file corresponding to the user intention and the slot information.

Description

一种媒体文件推荐方法及装置Method and device for recommending media files
本申请要求于2019年07月08日提交国家知识产权局、申请号为201910609618.0、申请名称为“一种媒体文件推荐方法及装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。This application claims the priority of a Chinese patent application filed with the State Intellectual Property Office on July 8, 2019, the application number is 201910609618.0, and the application name is "a media file recommendation method and device", the entire content of which is incorporated herein by reference Applying.
技术领域Technical field
本申请涉及终端技术领域,尤其涉及一种媒体文件推荐方法及装置。This application relates to the field of terminal technology, and in particular to a method and device for recommending media files.
背景技术Background technique
随着智能终端设备的应用和普及,智能语音设备在人机交互中,起到越来越重要的角色,而想要使得智能语音设备识别人类语音信息中所表达的情感,并能基于语音情感为用户推荐数据和服务,是如今人工智能研究的重要方向。With the application and popularization of smart terminal devices, smart voice devices play an increasingly important role in human-computer interaction, and it is desirable to make smart voice devices recognize the emotions expressed in human voice information and be based on voice emotions Recommending data and services for users is an important direction of artificial intelligence research today.
目前的基于用户语音情感的推荐方案,是基于梅尔频率倒谱系数(Mel-Frequency Cepstral Coefficients,MFCC)提取情感特征的算法,根据语音信息中的韵律特征和音质特征,提取用户的语音情感特征,根据语音情感特征和情感类型的对应关系查询数据库,为用户推荐相同或相近情感类型的数据或服务。例如,根据用户语音情感为用户推荐一首悲伤的歌曲,一部搞笑的影片等。The current recommendation scheme based on user's voice emotion is an algorithm that extracts emotional features based on Mel-Frequency Cepstral Coefficients (MFCC), and extracts the user's voice emotional features based on the prosodic features and sound quality features in the voice information , Query the database according to the corresponding relationship between voice emotion characteristics and emotion types, and recommend data or services of the same or similar emotion types for users. For example, recommend a sad song, a funny movie, etc. to the user according to the user's voice and emotion.
但是,这种匹配方法只支持粗粒度的情感匹配,也就是基于多媒体文件级别的数据推荐,但是用户想要了解一个多媒体文件最精彩的部分,例如用户输入语音信息“我想要看XXX影片最搞笑的片段”,“想看惊悚的电影片段”时,并不能为用户进行推荐,或者其推荐的准确性较低,用户体验差。However, this matching method only supports coarse-grained emotional matching, which is based on multimedia file-level data recommendation, but the user wants to know the most exciting part of a multimedia file, for example, the user enters the voice message "I want to watch XXX movies the most "Funny clips" and "want to watch scary movie clips" cannot be recommended for users, or the recommendation accuracy is low, and the user experience is poor.
Summary of the invention
This application provides a media file recommendation method and device, which solve the problems of low recommendation accuracy and poor user experience in the existing recommendation solution based on the emotion of a user's voice.
To achieve the foregoing objective, this application adopts the following technical solutions:
According to a first aspect, a media file recommendation method is provided and applied to an electronic device. The method includes: receiving a voice signal and converting the voice signal into text information; obtaining, according to the text information, a user intention and slot information included in the user intention, where the slot information includes emotion information and timing information; and querying a media file library according to the user intention and the slot information to obtain a media file corresponding to the user intention and the slot information.
In the embodiments of this application, the electronic device queries the media file library according to the user intention and slot information contained in the user's voice information, and matches, based on the timing information and the emotion information, the multimedia file that is closest to the user's needs and emotional needs. In this way, the user's emotional needs can be accurately identified, fine-grained data can be intelligently recommended to the user, and the user experience is improved.
In a possible design, the media file library stores a first mapping relationship between multiple user intentions and slot information and multiple media file identifiers, and querying the media file library according to the user intention and the slot information to obtain the media file corresponding to the user intention and the slot information includes: obtaining, according to the first mapping relationship, the media file corresponding to the user intention and the slot information. In the foregoing possible implementation, the electronic device queries the media file library according to the mapping relationship between user intentions and slot information, so that the most suitable media file can be matched and recommended according to the different emotional needs of different users, thereby improving the accuracy and flexibility of intelligent recommendation and the user experience.
In a possible design, before the user intention and the slot information are obtained from the text information, the method further includes: determining whether the text information contains a user intention; if it is determined that the text information does not contain a user intention, obtaining an emotional feature vector of the voice signal by using the Mel-frequency cepstral coefficient (MFCC) algorithm; and querying the media file library according to the emotional feature vector to obtain a media file corresponding to the emotional feature vector, where the media file library stores a second mapping relationship between multiple emotional feature vectors and multiple pieces of emotion information, and each piece of emotion information corresponds to multiple media files. In the foregoing possible implementation, if the user's voice information does not contain a clear user intention, the electronic device can match media files according to the corresponding emotion information based on the voice emotional features extracted from the user's voice information, thereby improving the flexibility of intelligent recommendation and the user experience.
In a possible design, before the voice signal is received, the method further includes: obtaining emotion information from users' comments on multiple media files; determining whether the emotion information is fine-grained emotion information or coarse-grained emotion information; if the emotion information is fine-grained emotion information, obtaining the slots in the fine-grained emotion information and establishing the first mapping relationship in the media file library; and if the emotion information is coarse-grained emotion information, obtaining an emotional feature vector according to the emotion information, obtaining the emotion information of the media file, and establishing the second mapping relationship for the media file. In the foregoing possible implementation, the electronic device can extract emotion information and timing information from massive user comments on multimedia files and establish mapping relationships, thereby generating a multimedia file library that supports accurate intelligent recommendation.
In a possible design, converting the voice signal into text information includes: converting the voice signal into text information through automatic speech recognition (ASR). In the foregoing possible implementation, the electronic device can recognize the text information included in the user's voice information through automatic speech recognition technology, thereby improving the accuracy of intelligent recommendation.
In a possible design, obtaining the user intention from the text information includes: obtaining the user intention from the text information through natural language understanding (NLU) technology. The electronic device can recognize the user intention included in the user's voice information through natural language understanding technology and match recommendations according to the user intention, thereby improving the accuracy of intelligent recommendation.
According to a second aspect, an electronic device is provided. The electronic device includes a processor and a memory connected to the processor. The memory is configured to store instructions, and when the instructions are executed by the processor, the electronic device is enabled to perform: receiving a voice signal and converting the voice signal into text information; obtaining, according to the text information, a user intention and slot information included in the user intention, where the slot information includes emotion information and timing information; and querying a media file library according to the user intention and the slot information to obtain a media file corresponding to the user intention and the slot information.
In a possible design, the media file library stores a first mapping relationship between multiple user intentions and slot information and multiple media file identifiers, and the electronic device is specifically configured to perform: obtaining, according to the first mapping relationship, the media file corresponding to the user intention and the slot information.
In a possible design, the electronic device is further configured to perform: determining whether the text information contains a user intention; if it is determined that the text information does not contain a user intention, obtaining an emotional feature vector of the voice signal by using the Mel-frequency cepstral coefficient (MFCC) algorithm; and querying the media file library according to the emotional feature vector to obtain a media file corresponding to the emotional feature vector, where the media file library stores a second mapping relationship between multiple emotional feature vectors and multiple emotion tags.
In a possible design, the electronic device is further configured to perform: obtaining emotion information from users' comments on multiple media files; determining whether the emotion information is fine-grained emotion information or coarse-grained emotion information; if the emotion information is fine-grained emotion information, obtaining the slots in the fine-grained emotion information and establishing the first mapping relationship in the media file library; and if the emotion information is coarse-grained emotion information, obtaining an emotional feature vector according to the emotion tag, obtaining the emotion tag of the media file, and establishing the second mapping relationship for the media file.
In a possible design, converting the voice signal into text information includes: converting the voice signal into text information through automatic speech recognition (ASR).
In a possible design, obtaining the user intention from the text information includes: obtaining the user intention from the text information through natural language understanding (NLU) technology.
According to a third aspect, a chip system is provided and applied to an electronic device. The chip system includes one or more interface circuits and one or more processors, where the interface circuits and the processors are interconnected by lines. The interface circuit is configured to receive a signal from a memory of the electronic device and send the signal to the processor, where the signal includes computer instructions stored in the memory. When the processor executes the computer instructions, the electronic device performs the method according to the first aspect and any one of its possible designs.
According to a fourth aspect, a readable storage medium is provided. The readable storage medium stores instructions, and when the instructions are run on an electronic device, the electronic device is enabled to perform the method according to the first aspect and any one of its possible designs.
According to a fifth aspect, a computer program product is provided. When the computer program product runs on a computer, the computer is enabled to perform the method according to the first aspect and any one of its possible designs.
It can be understood that any of the electronic devices, systems, readable storage media, and computer program products for media file recommendation provided above are used to perform the corresponding methods provided above. Therefore, for the beneficial effects that they can achieve, refer to the beneficial effects of the corresponding methods provided above. Details are not repeated here.
Description of the drawings
FIG. 1 is a schematic diagram of the hardware structure of an electronic device according to an embodiment of this application;
FIG. 2 is a software system architecture diagram of an electronic device according to an embodiment of this application;
FIG. 3 is a schematic flowchart of a media file recommendation method according to an embodiment of this application;
FIG. 4 is a schematic flowchart of extracting an emotional feature vector according to an embodiment of this application;
FIG. 5 is a schematic flowchart of establishing a media file library in a media file recommendation method according to an embodiment of this application;
FIG. 6 is a schematic structural diagram of a chip system according to an embodiment of this application.
Detailed description of embodiments
Before the method embodiments of this application are introduced, the technologies involved in the embodiments of this application are described as follows:
Intelligent voice device: an electronic device that can receive a user's voice information, output voice information, and interact with the user by voice.
Automatic speech recognition (ASR) technology: a technology that converts human voice information into text information. Its goal is to enable a computer to "dictate" continuous speech spoken by different people, which is commonly known as a "speech dictation machine"; it is a technology that converts "sound" into "text".
Natural language understanding (NLU) technology: a technology that recognizes the text content and intention in human natural language, that is, a technology that enables a computer to "understand" natural language and thus communicate with humans in natural language, realizing natural-language communication between humans and machines. It covers a very wide range of fields, including sentence detection, word segmentation, part-of-speech tagging, syntactic analysis, text classification/clustering, text perspective, information extraction/automatic summarization, machine translation, automatic question answering, text generation, and other fields.
Slot: a concept in human-machine dialogue. A slot is the definition of a piece of key information identified in the user's voice information, that is, information required to transform the user intention into a clear user instruction. One slot corresponds to one piece of information that needs to be obtained in the handling of one matter.
The embodiments of this application provide a media file recommendation method. The method can be applied to an electronic device including an intelligent voice apparatus, such as a voice assistant, a smart speaker, a smartphone, a tablet computer, a computer, a wearable electronic device, or an intelligent robot. Through this method, the electronic device can intelligently recognize the emotion and recommendation needs expressed in the user's voice information and recommend fine-grained data to the user, for example, segment-level media files, to improve the accuracy of data recommendation and thereby improve the user experience.
The following describes the implementations of the embodiments of this application in detail with reference to the accompanying drawings. FIG. 1 is a schematic diagram of a possible structure of an electronic device 100 according to an embodiment of this application. As shown in FIG. 1, the electronic device 100 may include: a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, a headset jack 170D, a sensor module 180, a button 190, a motor 191, an indicator 192, a camera 193, a display 194, a subscriber identification module (SIM) card interface 195, and the like.
The sensor module 180 may include sensors such as a pressure sensor 180A, a gyroscope sensor 180B, a barometric pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, and a bone conduction sensor 180M.
It can be understood that the structure illustrated in this embodiment does not constitute a specific limitation on the electronic device 100. In other embodiments, the electronic device 100 may include more or fewer components than shown, or some components may be combined, or some components may be split, or the components may be arranged differently. The illustrated components may be implemented by hardware, software, or a combination of software and hardware.
The processor 110 may include one or more processing units. For example, the processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU). The different processing units may be independent devices or may be integrated into one or more processors.
The controller may be the nerve center and command center of the electronic device 100. The controller may generate an operation control signal according to an instruction operation code and a timing signal to complete the control of fetching and executing instructions.
A memory may also be provided in the processor 110 to store instructions and data. In some embodiments, the memory in the processor 110 is a cache. The memory may store instructions or data that the processor 110 has just used or uses cyclically. If the processor 110 needs to use the instructions or data again, they can be directly called from the memory. This avoids repeated access, reduces the waiting time of the processor 110, and therefore improves system efficiency.
In some embodiments, the processor 110 may include one or more interfaces. The interfaces may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (SIM) interface, a universal serial bus (USB) interface, and/or the like.
It can be understood that the interface connection relationship between the modules illustrated in this embodiment is merely a schematic description and does not constitute a structural limitation on the electronic device 100. In other embodiments, the electronic device 100 may alternatively use an interface connection manner different from that in the foregoing embodiments, or a combination of multiple interface connection manners.
The external memory interface 120 may be used to connect an external memory card, for example, a Micro SD card, to expand the storage capacity of the electronic device 100. The external memory card communicates with the processor 110 through the external memory interface 120 to implement a data storage function, for example, saving files such as music and videos in the external memory card.
The internal memory 121 may be used to store computer-executable program code, where the executable program code includes instructions. The processor 110 runs the instructions stored in the internal memory 121 to perform various functional applications and data processing of the electronic device 100. For example, in the embodiments of this application, the processor 110 may execute the instructions stored in the internal memory 121, and the internal memory 121 may include a program storage area and a data storage area.
The program storage area may store an operating system and an application program required by at least one function (for example, a sound playback function or an image playback function). The data storage area may store data (for example, audio data and a phone book) created during use of the electronic device 100. In addition, the internal memory 121 may include a high-speed random access memory, and may further include a non-volatile memory, for example, at least one magnetic disk storage device, a flash memory device, or a universal flash storage (UFS).
The electronic device 100 may implement audio functions, for example, music playback and recording, through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the headset jack 170D, the application processor, and the like.
The audio module 170 is configured to convert digital audio information into an analog audio signal for output, and is also configured to convert an analog audio input into a digital audio signal. The audio module 170 may also be configured to encode and decode audio signals. In some embodiments, the audio module 170 may be disposed in the processor 110, or some functional modules of the audio module 170 may be disposed in the processor 110.
The speaker 170A, also called a "loudspeaker", is configured to convert an audio electrical signal into a sound signal. The electronic device 100 can play music or answer a hands-free call through the speaker 170A.
The receiver 170B, also called an "earpiece", is configured to convert an audio electrical signal into a sound signal. When the electronic device 100 answers a call or receives a voice message, the voice can be heard by placing the receiver 170B close to the ear.
The microphone 170C, also called a "mike" or "mic", is configured to convert a sound signal into an electrical signal. When making a call or sending a voice message, the user can speak with the mouth close to the microphone 170C to input a sound signal into the microphone 170C. The electronic device 100 may be provided with at least one microphone 170C. In other embodiments, the electronic device 100 may be provided with two microphones 170C, which can implement a noise reduction function in addition to collecting sound signals. In still other embodiments, the electronic device 100 may be provided with three, four, or more microphones 170C to collect sound signals, reduce noise, identify sound sources, implement directional recording, and so on.
The software system of the electronic device 100 may use a layered architecture, an event-driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture. In the embodiments of the present invention, the Android system with a layered architecture is used as an example to illustrate the software structure of the electronic device 100.
FIG. 2 is a block diagram of the software structure of the electronic device 100 according to an embodiment of the present invention.
The layered architecture divides the software into several layers, and each layer has a clear role and division of labor. The layers communicate with each other through software interfaces. In some embodiments, the Android system is divided into four layers, which are, from top to bottom, the application layer, the application framework layer, the Android runtime and system library, and the kernel layer.
The application layer may include a series of application packages.
As shown in FIG. 2, the application packages may include applications such as Camera, Gallery, Calendar, Phone, Maps, Navigation, WLAN, Bluetooth, Music, Video, Messages, and Voice Assistant.
The application framework layer provides an application programming interface (API) and a programming framework for the applications in the application layer. The application framework layer includes some predefined functions.
As shown in FIG. 2, the application framework layer may include a window manager, a content provider, a view system, a phone manager, a resource manager, a notification manager, and the like.
The window manager is used to manage window programs. The window manager can obtain the display size, determine whether there is a status bar, lock the screen, capture the screen, and so on.
The content provider is used to store and retrieve data and make the data accessible to applications. The data may include videos, images, audio, calls made and received, browsing history and bookmarks, a phone book, and so on.
The view system includes visual controls, such as controls for displaying text and controls for displaying pictures. The view system can be used to build applications. A display interface may consist of one or more views. For example, a display interface including an SMS notification icon may include a view for displaying text and a view for displaying pictures.
The phone manager is used to provide the communication functions of the electronic device 100, for example, management of call states (including connected, hung up, and so on).
The resource manager provides various resources for applications, such as localized strings, icons, pictures, layout files, and video files.
The notification manager enables an application to display notification information in the status bar. It can be used to convey notification-type messages and can automatically disappear after a short stay without user interaction. For example, the notification manager is used to notify download completion, provide message reminders, and so on. The notification manager may also present notifications in the status bar at the top of the system in the form of a chart or scroll-bar text, for example, notifications of applications running in the background, or present notifications on the screen in the form of a dialog window. For example, text information is prompted in the status bar, a prompt tone is played, the electronic device vibrates, or the indicator light blinks.
The Android runtime includes core libraries and a virtual machine. The Android runtime is responsible for the scheduling and management of the Android system.
The core libraries consist of two parts: one part is the functions that the Java language needs to call, and the other part is the core libraries of Android.
The application layer and the application framework layer run in the virtual machine. The virtual machine executes the Java files of the application layer and the application framework layer as binary files. The virtual machine is used to perform functions such as object lifecycle management, stack management, thread management, security and exception management, and garbage collection.
The system library may include multiple functional modules, for example, a surface manager, media libraries, a three-dimensional graphics processing library (for example, OpenGL ES), and a 2D graphics engine (for example, SGL).
The surface manager is used to manage the display subsystem and provides the fusion of 2D and 3D layers for multiple applications.
The media libraries support playback and recording of a variety of commonly used audio and video formats, as well as still image files. The media libraries can support multiple audio and video encoding formats, such as MPEG-4, H.264, MP3, AAC, AMR, JPG, and PNG.
The three-dimensional graphics processing library is used to implement three-dimensional graphics drawing, image rendering, composition, layer processing, and so on.
The 2D graphics engine is a drawing engine for 2D drawing.
The kernel layer is the layer between hardware and software. The kernel layer includes at least a display driver, a camera driver, an audio driver, and a sensor driver.
An embodiment of this application provides a media file recommendation method. As shown in FIG. 3, the method may include steps 301 to 303:
301: The electronic device receives a voice signal and converts the voice signal into text information.
That is, the voice signal produced when the user speaks is received, and ASR technology is used to convert the voice signal into corresponding text information.
The process in which ASR technology converts voice information into text information may include: voice signal preprocessing and feature extraction; acoustic model and pattern matching; and language model and language processing. First, one of words (sentences), syllables, or phonemes is selected as the speech recognition unit, and voice features are extracted from the voice information. Then, the extracted voice features are matched against and compared with a pre-established acoustic model (pattern) to obtain the best recognition result. Finally, matching is performed through a language model, that is, against a grammar network formed from recognized voice commands or a language model built with statistical methods, and language processing such as syntactic and semantic analysis is performed, thereby generating the text information corresponding to the voice information.
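As an illustration only, the following minimal Python sketch mirrors the three stages described above (feature extraction, acoustic-model matching, and language-model processing). The toy feature, template, and bigram structures are hypothetical stand-ins introduced for this sketch and are not the ASR models actually used by the electronic device.
```python
import numpy as np

def extract_features(signal, frame_len=400, hop=160):
    """Stage 1 (stand-in): frame the signal and compute per-frame log energy instead of full features."""
    frames = [signal[i:i + frame_len] for i in range(0, len(signal) - frame_len, hop)]
    return np.array([np.log(np.sum(f ** 2) + 1e-10) for f in frames])

def acoustic_match(features, unit_templates):
    """Stage 2 (stand-in): map each frame to the recognition unit whose template value is closest."""
    return [min(unit_templates, key=lambda u: abs(unit_templates[u] - e)) for e in features]

def language_process(units, bigram_lm):
    """Stage 3 (stand-in): keep only transitions the toy bigram model allows and join them into text."""
    kept = units[:1]
    for u in units[1:]:
        if u != kept[-1] and u in bigram_lm.get(kept[-1], set()):
            kept.append(u)
    return " ".join(kept)

# Toy usage with placeholder audio and hypothetical models.
audio = np.random.randn(16000)                       # 1 s of placeholder 16 kHz audio
templates = {"i": 4.0, "want": 5.5, "music": 7.0}    # hypothetical acoustic templates
lm = {"i": {"want"}, "want": {"music"}}              # hypothetical bigram constraints
print(language_process(acoustic_match(extract_features(audio), templates), lm))
```
In practice each stage would be backed by trained acoustic and language models; the sketch only shows how the stages pass results to one another.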
For example, the electronic device converts a received piece of user voice audio into the text information "I want to listen to a sad song".
With reference to the foregoing architecture of the electronic device 100, in the above method, the electronic device 100 may obtain the user's voice signal through the microphone 170C and send the voice signal to the processor 110 for processing. The audio module 170 in the processor 110 may process the voice signal. Specifically, the system may instruct, through a command, the voice assistant program at the application layer to call the related programs of the application framework layer and the related functions of the core libraries to process the voice signal and convert it into text information.
302: Obtain, according to the text information, the user intention and the slot information included in the user intention.
The user intention is the user's need, that is, information indicating what task the user wants the electronic device to complete.
The user intention may be an intention keyword contained in the foregoing text information, where the intention keyword can be used to classify the user's need into a certain type. In the embodiments related to this application, the intention keywords may include: a media data attribute, an emotional intention, a media data file name, a keyword related to media data, and the like. Specifically, a media data attribute, such as music, movie, variety show, drama, fine art, literary work, or photo, classifies the user's need by media data type. An emotional intention, for example, happy, sad, or scary, classifies the user's need according to the predefined emotion types of the media data. A media data file name, for example, "Jane Eyre" or "The King of Comedy", identifies the specific media data the user requires. A keyword related to media data, for example, a clip of a certain character in a film, a clip of a specific plot point in a film, a segment of a certain rhythm in a piece of music, or the description of a certain plot point in a literary work, can locate the user's need.
The slot information may include timing information and emotion information. The timing information may be a label annotating part of the content of a media file, and may also be referred to as a timing tag. It corresponds to a part of the content of the media file and may be a specific moment of the media file or a time segment, for example, the 12:05 mark of a movie, or the second to third minute of a piece of music.
The emotion information may be a label annotating the emotion type of a media file, and may also be referred to as an emotion tag. It may specifically include: beautiful, happy, sad, scary, cheerful, exciting, and so on.
There is a correspondence between the emotion tag and the timing tag. For example, the 12:05 mark of a certain movie is beautiful, and the second to third minute of a certain piece of music is cheerful.
For example, if the text information obtained in step 301 is "I want to listen to a sad song", the extracted intention keyword is "song" and the extracted slot information is "sad". For another example, if the text information obtained in step 301 is "I want to watch a funny scene in Charlotte's Troubles", the extracted intention keyword is the movie "Charlotte's Troubles", the timing tag in the extracted slot information is "a scene", and the emotion tag in the extracted slot information is "funny".
Specifically, the user intention in the text information can be obtained through NLU technology, which may be implemented by recognizing all the characters and words included in the text information through deep learning technology and neural network algorithms, performing semantic understanding of the text, and determining the user intention. The specific implementation of this technology is not described in detail in the embodiments of this application.
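Purely as an illustration of the extraction in step 302, the following rule-based sketch pulls an intention keyword and the emotion and timing slots out of recognized text. The keyword tables and the function parse_request are hypothetical examples introduced for this sketch; a deployed system would rely on the NLU model described above.
```python
import re

# Hypothetical keyword tables; a deployed system would use a trained NLU model instead.
MEDIA_TYPES = {"song": "music", "movie": "movie", "clip": "movie"}
EMOTION_WORDS = {"sad", "funny", "scary", "touching", "cheerful"}
TIMING_WORDS = {"clip", "scene", "segment", "episode"}

def parse_request(text: str) -> dict:
    """Return the first matching intention keyword and the emotion/timing slot values found in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    intent = next((MEDIA_TYPES[t] for t in tokens if t in MEDIA_TYPES), None)
    emotion = next((t for t in tokens if t in EMOTION_WORDS), None)
    timing = next((t for t in tokens if t in TIMING_WORDS), None)
    return {"intent": intent, "slots": {"emotion": emotion, "timing": timing}}

# e.g. parse_request("I want to watch a funny clip")
# -> {"intent": "movie", "slots": {"emotion": "funny", "timing": "clip"}}
```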
Further, if no explicit emotional need of the user, for example, at least one of the aforementioned intention keyword, emotion tag, or timing information, is extracted through the foregoing process, the electronic device extracts the user's voice emotional features by using the MFCC algorithm and uses the emotion tag matched by these features as the user's intended emotional need. Specifically, the obtained user voice information is processed by the MFCC algorithm to obtain an MFCC feature vector representing the emotional characteristics of the user's voice. The MFCC feature vector can then be matched against preset pairs of MFCC feature vectors and emotion tags, and the matched emotion tag is used as the user intention for the following steps.
The specific process in which the MFCC algorithm extracts the user's voice emotional features may be as shown in FIG. 4 and includes the following processing steps: analog-to-digital conversion, pre-emphasis, framing and windowing, Fourier transform, Mel filtering, cepstrum computation, and energy and differential computation, thereby generating the MFCC feature vector.
First, analog-to-digital conversion converts the input analog signal into a digital signal. Pre-emphasis passes the digital signal through a high-pass filter to boost the high-frequency part, so that the frequency spectrum of the signal becomes flat and remains so across the entire band from low to high frequencies, and the spectrum can be obtained with the same signal-to-noise ratio. At the same time, pre-emphasis compensates for the effects of the vocal cords and lips during speech production and for the high-frequency part of the voice signal that is suppressed by the articulatory system, and it highlights the high-frequency formants.
Framing and windowing gathers N sampling points into one observation unit, called a frame. Each frame is multiplied by a Hamming window (the Hamming window specifies one period of a signal) to increase the continuity between the left and right ends of the frame.
Because the characteristics of a signal are usually difficult to see from its representation in the time domain, the signal is usually converted into an energy distribution in the frequency domain for observation; different energy distributions represent the characteristics of different voices. Therefore, after multiplication by the Hamming window, each frame must further undergo a fast Fourier transform to obtain its energy distribution over the spectrum. A fast Fourier transform is performed on each framed and windowed signal to obtain the spectrum of each frame, and the modulus of the spectrum of the voice signal is squared to obtain the power spectrum of the voice signal.
Mel filtering smooths the spectrum, eliminates the effect of harmonics, and highlights the formants of the original voice. Therefore, the tone or pitch of a piece of speech is not reflected in the MFCC parameters; in other words, a speech recognition system based on MFCC is not affected by differences in the pitch of the input speech. In addition, the amount of computation can be reduced.
Cepstrum processing applies an inverse Fourier transform to the logarithm of the Fourier transform spectrum of the signal. In this step, the logarithmic energy output by each filter bank can be obtained.
As for energy and differential processing, the standard cepstral parameters (MFCC) reflect only the static characteristics of the speech parameters, and the dynamic characteristics of the speech can be described by the differential spectrum of these static features. Experiments show that combining dynamic and static features can effectively improve the recognition performance of the system.
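The steps above can be summarized in the following illustrative NumPy/SciPy sketch. The frame length, hop size, FFT size, and filter-bank parameters are typical values assumed for the sketch rather than values mandated by this application.
```python
import numpy as np
from scipy.fftpack import dct

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_fft=512, n_mels=26, n_ceps=13):
    """Illustrative MFCC extraction following the steps above; parameters are typical, not mandated."""
    # Pre-emphasis: first-order high-pass filter to boost the high-frequency part.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Framing and Hamming windowing.
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    frames = np.stack([emphasized[i * hop: i * hop + frame_len] for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    # Fast Fourier transform and power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2 / n_fft
    # Triangular Mel filter bank.
    mel_max = 2595 * np.log10(1 + (sr / 2) / 700)
    hz = 700 * (10 ** (np.linspace(0, mel_max, n_mels + 2) / 2595) - 1)
    bins = np.floor((n_fft + 1) * hz / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # Log energy of each filter output, then DCT to obtain the cepstral coefficients.
    ceps = dct(np.log(power @ fbank.T + 1e-10), type=2, axis=1, norm="ortho")[:, :n_ceps]
    # First-order difference (delta) captures the dynamic characteristics.
    delta = np.diff(ceps, axis=0, prepend=ceps[:1])
    return np.hstack([ceps, delta])

# e.g. features = mfcc(np.random.randn(16000))  # shape: (number of frames, 2 * n_ceps)
```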
Further, in another possible embodiment, the foregoing user intention may also include other keywords indicating the user's recommendation needs, for example, information indicating a certain plot point, information indicating a certain line of dialogue, information indicating a certain actor, or text indicating a certain scene. For example, if the voice information obtained by the voice device from the user is "I want to watch the clip in The King of Comedy where the heroine cries", the extracted need features include the keyword "the heroine cries", the explicit media data information is the movie "The King of Comedy", and the timing information is "clip". The electronic device can then use the foregoing keyword as the emotional need for the query.
With reference to the foregoing software architecture of the electronic device, in the above method, the system may instruct, through a command, the voice assistant program at the application layer to call the related programs of the application framework layer and the related functions of the core libraries to recognize and process the text information, and to extract the intention and slot information it contains according to a certain algorithm.
303: The electronic device queries the media file library according to the user intention and the slot information to obtain the media file corresponding to the user intention and the slot information.
Data is queried from the media file library according to the extracted user intention and slot information, where the media file library is pre-established by the electronic device or obtained by the electronic device through a cloud service. The cloud service may be provided by a cloud device, specifically a server, capable of providing data processing and data storage services to the electronic device.
The media file library stores a first mapping relationship between multiple user intentions and slot information and multiple media file identifiers, and may specifically be a fine-grained media file library containing emotion tags and timing tags. For example, it includes massive media files such as music, movies, dramas, fine art, literary works, and photos. Each media file may contain a macro tag, for example, music, movie, drama, fine art, literary work, or photo, and may also include a specific multimedia file name, for example, "The King of Comedy"; the macro tag may correspond to the user's intention keyword. Each multimedia file may further contain emotion tags, for example, happy, sad, or scary. In addition, each multimedia file may further include at least one piece of timing information, for example, the third minute to the fourth minute, or the last ten minutes. The slot information representing the emotional intention may form a correspondence with the slot information representing the timing; for example, the emotion tag corresponding to the third minute to the fourth minute of a movie is sad, and the emotion tag corresponding to the last ten minutes of the movie is joyful.
The specific process of establishing the media file library will be described in detail below and is not repeated here.
The media file library is queried according to the intention and the slots to obtain the media file corresponding to the intention and the slots, that is, the media file corresponding to the intention and the slots is obtained according to the first mapping relationship. In other words, the emotion tag, timing information, macro tag, and so on obtained in the foregoing steps are used to query the media file library, and the multimedia data with the highest degree of match to the emotional need is used as the matching data.
The specific matching process may be as follows: first, the corresponding database is found according to the macro tag, for example, movies, or the movie "The King of Comedy"; then the media file library is queried, according to the emotion tag, for the movie segments of that movie corresponding to the emotion tag, and the timing tag closest to the emotion tag is matched. The media data segment corresponding to that timing tag is the media file that the electronic device matches for the user.
Further, the emotion tag may also include a corresponding recommendation value, which may be a quantified emotion value of the data segment corresponding to the emotion tag of the multimedia data and can be used for matching-degree calculation. For example, the recommendation value may be represented by the amount of data annotated with the emotion tag, the search volume of the data, or user ratings. A higher recommendation value indicates a higher matching degree of the emotion tag, and a lower recommendation value indicates a lower matching degree. For example, if the emotion tags of multiple time segments of a certain film are all "funny", then, when matching the funniest time segment for the user, the determination can be made according to the recommendation values corresponding to the emotion tags of the multiple time segments, and the time segment with the highest recommendation value is used as the data with the highest matching degree.
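A minimal sketch of how such a first mapping relationship and recommendation values might be represented and queried is shown below. The segment entries and scores are invented placeholders for illustration, not data from an actual media file library.
```python
from dataclasses import dataclass

@dataclass
class Segment:
    media_id: str       # media file identifier
    start_s: int        # timing slot: segment start, in seconds
    end_s: int          # timing slot: segment end, in seconds
    emotion: str        # emotion tag of this segment
    score: float        # recommendation value (e.g. annotation count, search volume, or rating)

# Hypothetical first mapping relationship: (macro tag or file name, emotion tag) -> candidate segments.
LIBRARY = {
    ("The King of Comedy", "touching"): [
        Segment("king_of_comedy", 2400, 3000, "touching", 9.1),
        Segment("king_of_comedy", 1200, 1320, "touching", 7.4),
    ],
}

def recommend(intent: str, emotion: str):
    """Return the matching segment with the highest recommendation value, if any."""
    candidates = LIBRARY.get((intent, emotion), [])
    return max(candidates, key=lambda s: s.score, default=None)

# recommend("The King of Comedy", "touching") -> the minute-40-to-50 segment (score 9.1)
```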
The media file library may be a file library with fine-grained multimedia tags, which may include not only tags related to emotion types but also other tags for possible user recommendation needs, for example, a clip of a certain character in a film, a clip of a specific plot point in a film, a segment of a certain rhythm in a piece of music, or the description of a certain plot point in a literary work.
The multimedia data with the highest matching degree described above is recommended to the user, either by sending voice information or by directly sending the multimedia data to the user.
For example, the electronic device obtains the user's voice information "I want to watch the most touching scene in The King of Comedy", processes the voice information, and recognizes the following emotional need: the emotion tag is "most touching", the slot information representing timing is "a scene", and the extracted multimedia file name is the movie "The King of Comedy". The electronic device queries the multimedia file library according to this emotional need for time segments of the movie "The King of Comedy" whose emotion tags are touching, sad, tear-jerking, or the like, and selects from the query results the data with the highest recommendation value, for example, the plot segment from minute 40 to minute 50 of the movie. The electronic device then uses this multimedia data as the matching data and recommends it to the user.
In another possible embodiment, the user's voice information obtained by the electronic device contains only a macro tag. For example, if the user's voice information obtained by the electronic device is "I want to watch a movie", the macro tag is "movie", and the electronic device queries the media file library according to the macro tag and, after matching movies, recommends those with higher recommendation values to the user.
In another possible embodiment, the user's voice information obtained by the electronic device contains only a macro tag and an emotion tag. For example, if the user's voice information obtained by the electronic device is "I want to watch a horror movie", the macro tag is "movie" and the emotion tag is "scary", and the electronic device queries the media file library according to the macro tag and the emotion tag and, after matching horror movies, selects those with higher recommendation values to recommend to the user.
In another possible embodiment, if the electronic device cannot make a judgment and recommendation based on the semantics expressed in the user's voice, that is, it is determined that the text information does not contain a user intention, the emotional feature vector corresponding to the voice signal is obtained through the Mel-frequency cepstral coefficient (MFCC) algorithm. The media file library is queried according to the emotional feature vector, and a media file under the emotion tag corresponding to a similar emotional feature is selected from the media file library for recommendation. The media file library stores a second mapping relationship between multiple emotional feature vectors and multiple emotion tags, each emotion tag corresponds to at least one media file, and any media file matched by that emotion tag can be recommended to the user as fallback recommendation data.
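The fallback matching against the second mapping relationship could, for example, be sketched as a nearest-neighbor lookup over stored emotion feature vectors, as below. The reference vectors and file lists are hypothetical placeholders, and cosine similarity is only one possible similarity measure, not necessarily the one used in practice.
```python
import numpy as np

# Hypothetical second mapping relationship: emotion tag -> reference emotional feature vector,
# plus emotion tag -> candidate media files under that tag.
EMOTION_VECTORS = {"sad": np.array([0.1, 0.8, 0.2]), "happy": np.array([0.9, 0.2, 0.4])}
EMOTION_FILES = {"sad": ["sad_song_01"], "happy": ["upbeat_song_07"]}

def fallback_recommend(feature_vec: np.ndarray) -> list:
    """Pick the emotion tag whose reference vector is most similar and return its candidate files."""
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-10)
    best_tag = max(EMOTION_VECTORS, key=lambda t: cos(feature_vec, EMOTION_VECTORS[t]))
    return EMOTION_FILES[best_tag]

# e.g. fallback_recommend(np.array([0.2, 0.7, 0.1]))  # -> ["sad_song_01"]
```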
结合前述的电子设备的软件架构,上述方法可以为系统通过语音助手程序,调用应用程序框架层的相关程序和核心库的相关函数,根据提取到的用户意图和槽位,通过一定的匹配算法,得到对应的媒体文件。Combined with the aforementioned software architecture of the electronic device, the above method can call the related programs of the application framework layer and related functions of the core library through the voice assistant program, and use a certain matching algorithm according to the extracted user intentions and slots. Get the corresponding media file.
进一步的,在上述实施例的303中,电子设备建立多媒体文件库的过程可以为如图5,包括:Further, in 303 of the foregoing embodiment, the process of establishing the multimedia file library by the electronic device may be as shown in FIG. 5, including:
501:获取多媒体评论信息。501: Obtain multimedia comment information.
获取海量的关于多媒体数据的评论信息,可以为用户评论多个媒体文件的情感信息。可以通过互联网上的各种渠道,例如,论坛、贴吧、新闻和应用程序等各类网站的用户评论中关于多媒体文件的评论,还可以包括影视频网站的用户评论区、弹幕留言区等。具体可以为利用网络爬虫技术从互联网抓取评论,根据评论提取模型,获取到海量的关于多媒体文件的评论信息,例如关于音乐、电影、戏剧、美术、文学作品或者图像等的评论信息。Obtain a large amount of comment information on multimedia data, and can comment on emotional information of multiple media files for users. It can be through various channels on the Internet, for example, comments on multimedia files in user comments on various websites such as forums, post bars, news, and applications, and can also include user comment areas and bullet screen message areas on movie and video websites. Specifically, it can use web crawler technology to grab comments from the Internet, and obtain a large amount of comment information about multimedia files according to the comment extraction model, such as comment information about music, movies, dramas, fine arts, literary works, or images.
确定情感信息为细粒度情感信息或是粗粒度情感信息。其中,粗粒度情感信息表示该评论信息可以为宏观的,例如,“这个音乐很伤感”;也可以为关于时序片段级的评论信息即为细粒度情感信息,例如,“影片基调是喜剧,但是最后15分钟还是很煽情”,“影片整体比较平淡,但是影片30-40分钟的剧情好惊悚啊”。Determine whether the emotional information is fine-grained emotional information or coarse-grained emotional information. Among them, the coarse-grained emotional information indicates that the review information can be macroscopic, for example, "This music is very sad"; it can also be the review information about the sequence segment level, that is, the fine-grained emotional information, for example, "The keynote of the movie is comedy, but The last 15 minutes is still very sensational", "The overall film is relatively plain, but the plot of the film is terrifying for 30-40 minutes."
502: If the emotion information is fine-grained emotion information, establish the first mapping relationship of the media file according to the obtained slot information; if the emotion information is coarse-grained emotion information, obtain an emotion feature vector according to the emotion tag, and establish the second mapping relationship of the media file according to the emotion tag and the emotion feature vector.
Specifically, the obtained multimedia comment information is annotated, which may be done manually or by a rule-matching algorithm, so as to obtain the comment keywords in the multimedia comment information and the correspondences among them, for example, emotion tags, time-sequence tags, macro tags or other keywords indicating a recommendation need.
First, it is determined whether the multimedia comment information is fine-grained emotion comment information or coarse-grained emotion comment information. If it is fine-grained emotion comment information, the slots in the fine-grained emotion comment information are obtained, and the first mapping relationship between the multimedia file and the slot information is established in the media file library; the slots may specifically include time-sequence tags, emotion tags and the like.
If the emotion information is coarse-grained emotion comment information, an emotion feature vector is obtained according to the emotion tag, the emotion tag of the media file is obtained, and the second mapping relationship of the media file is established.
For example, the correspondence between the time-sequence tags and the emotion tags of a certain movie is stored in association with that multimedia data. For instance, one pair of time-sequence information and emotion tags may be: the emotion tag corresponding to the third to fourth minute is "sad", and the emotion tag corresponding to the last ten minutes is "happy".
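A possible in-memory shape for such a first mapping, filled with the example above, is sketched below; the identifier movie_42 and the segment encoding are assumptions made for illustration, not the storage schema of the embodiment:

```python
# Hypothetical first mapping: media file identifier -> time-segment/emotion-tag pairs.
first_mapping = {
    "movie_42": [
        {"start_min": 3, "end_min": 4, "emotion_tag": "sad"},
        {"last_minutes": 10, "emotion_tag": "happy"},
    ],
}

def segments_with_tag(media_id: str, emotion_tag: str) -> list:
    """Look up the time segments of a media file that carry a given emotion tag."""
    return [seg for seg in first_mapping.get(media_id, [])
            if seg["emotion_tag"] == emotion_tag]

print(segments_with_tag("movie_42", "sad"))
# [{'start_min': 3, 'end_min': 4, 'emotion_tag': 'sad'}]
```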
Further, if no emotion information associated with time-sequence information is captured, the emotion tags or macro tags about the multimedia on each platform, for example comedy, tragedy, fast song, sad song and the like, are directly saved to the media file library and can be recommended as fallback data.
A multimedia file library is established according to the multimedia files and their mapping relationships.
Data storage is performed according to the generated massive first mapping relationships and second mapping relationships of multimedia files, and the multimedia file library is established, so that information query and data matching can be performed on the multimedia file library according to the emotional needs in the user's voice information.
Further, the multimedia file library may be continuously updated: new comment information about multimedia files is constantly obtained, and the emotion tags about time-sequence segments included in the comment information are extracted, so that the resources of the multimedia file library become more fine-grained and richer. For example, a certain movie may be refined down to mapping relationships such as an emotion tag and a plot description for every frame of the movie, so that time-sequence segments of multimedia files can be matched and recommended to users more accurately.
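As a purely illustrative sketch of such an incremental update, the snippet below mines a segment-level emotion tag from a new comment with a toy regular-expression rule and appends it to the hypothetical first_mapping structure used above; a production system would instead rely on its comment extraction model:

```python
import re

def update_library(first_mapping: dict, media_id: str, comments: list) -> None:
    """Append segment-level emotion tags mined from new comments (toy rule only)."""
    for comment in comments:
        m = re.search(r"last\s+(\d+)\s+minutes?\s+(?:are|is|feel)?\s*(\w+)",
                      comment, flags=re.IGNORECASE)
        if m:
            first_mapping.setdefault(media_id, []).append(
                {"last_minutes": int(m.group(1)), "emotion_tag": m.group(2).lower()}
            )

library = {}
update_library(library, "movie_42", ["The last 15 minutes are thrilling"])
print(library)  # {'movie_42': [{'last_minutes': 15, 'emotion_tag': 'thrilling'}]}
```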
Other embodiments of the present application provide an electronic device, which may include a memory and one or more processors, the memory being coupled to the processor. The memory is configured to store computer program code, and the computer program code includes computer instructions. When the processor executes the computer instructions, the electronic device can perform the functions or steps in the foregoing method embodiments.
An embodiment of the present application further provides a chip system. As shown in FIG. 6, the chip system includes at least one processor 601 and at least one interface circuit 602. The processor 601 and the interface circuit 602 may be interconnected by wires. For example, the interface circuit 602 may be configured to receive signals from another apparatus (for example, the memory of the electronic device). For another example, the interface circuit 602 may be configured to send signals to another apparatus (for example, the processor 601). Exemplarily, the interface circuit 602 may read instructions stored in the memory and send the instructions to the processor 601. When the instructions are executed by the processor 601, the electronic device may be caused to perform the functions or steps performed by the electronic device in the foregoing embodiments. Of course, the chip system may further include other discrete devices, which is not specifically limited in the embodiments of the present application.
An embodiment of the present application further provides a computer storage medium. The computer storage medium includes computer instructions that, when run on the foregoing electronic device, cause the electronic device to perform the functions or steps in the foregoing method embodiments.
An embodiment of the present application further provides a computer program product that, when run on a computer, causes the computer to perform the functions or steps in the foregoing method embodiments.
Through the description of the foregoing implementations, a person skilled in the art can clearly understand that, for convenience and brevity of description, only the division of the foregoing functional modules is used as an example. In practical applications, the foregoing functions may be allocated to different functional modules as required, that is, the internal structure of the apparatus is divided into different functional modules to complete all or part of the functions described above.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. For example, the division of the modules or units is only a logical function division, and there may be other division manners in actual implementation; for example, multiple units or components may be combined or integrated into another apparatus, or some features may be omitted or not performed. In addition, the mutual couplings, direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses or units, and may be in electrical, mechanical or other forms.
The units described as separate components may or may not be physically separate, and the components displayed as units may be one physical unit or multiple physical units, that is, they may be located in one place or distributed to multiple different places. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a readable storage medium. Based on such an understanding, the technical solutions of the embodiments of this application essentially, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a device (which may be a single-chip microcomputer, a chip or the like) or a processor to perform all or part of the steps of the methods described in the embodiments of this application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.
The foregoing content is merely specific implementations of this application, but the protection scope of this application is not limited thereto. Any variation or replacement within the technical scope disclosed in this application shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims (15)

  1. A media file recommendation method, characterized in that the method is applied to an electronic device and comprises:
    receiving a voice signal, and converting the voice signal into text information;
    obtaining, according to the text information, a user intent and slot information comprised in the user intent, wherein the slot information comprises emotion information and time-sequence information; and
    querying a media file library according to the user intent and the slot information to obtain a media file corresponding to the user intent and the slot information.
  2. The method according to claim 1, characterized in that the media file library stores a first mapping relationship among multiple user intents, slot information and multiple media file identifiers; and
    the querying a media file library according to the user intent and the slot information to obtain a media file corresponding to the user intent and the slot information comprises:
    obtaining, according to the first mapping relationship, the media file corresponding to the user intent and the slot information.
  3. The method according to claim 1 or 2, characterized in that, before the obtaining of the user intent and the slot information in the text information, the method further comprises:
    determining whether the text information contains the user intent;
    if it is determined that the text information does not contain the user intent, obtaining an emotion feature vector of the voice signal through the Mel-frequency cepstral coefficient (MFCC) algorithm; and
    querying the media file library according to the emotion feature vector to obtain a media file corresponding to the emotion feature vector, wherein the media file library stores a second mapping relationship between multiple emotion feature vectors and multiple pieces of emotion information, and each piece of emotion information corresponds to multiple media files.
  4. The method according to claim 3, characterized in that, before the receiving a voice signal, the method further comprises:
    obtaining emotion information from users' comments on multiple media files;
    determining whether the emotion information is fine-grained emotion information or coarse-grained emotion information;
    if the emotion information is the fine-grained emotion information, obtaining slots in the fine-grained emotion information, and establishing the first mapping relationship in the media file library; and
    if the emotion information is the coarse-grained emotion information, obtaining an emotion feature vector according to the emotion information, obtaining the emotion information of the media file, and establishing the second mapping relationship of the media file.
  5. The method according to claim 1, characterized in that the converting the voice signal into text information comprises:
    converting the voice signal into the text information through automatic speech recognition (ASR).
  6. The method according to claim 1, characterized in that the obtaining a user intent in the text information comprises:
    obtaining the user intent in the text information through natural language understanding (NLU) technology.
  7. An electronic device, characterized in that the electronic device comprises a processor and a memory connected to the processor, the memory being configured to store instructions that, when executed by the processor, cause the electronic device to perform:
    receiving a voice signal, and converting the voice signal into text information;
    obtaining, according to the text information, a user intent and slot information comprised in the user intent, wherein the slot information comprises emotion information and time-sequence information; and
    querying a media file library according to the user intent and the slot information to obtain a media file corresponding to the user intent and the slot information.
  8. The electronic device according to claim 7, characterized in that the media file library stores a first mapping relationship among multiple user intents, slot information and multiple media file identifiers; and
    the electronic device is specifically configured to perform:
    obtaining, according to the first mapping relationship, the media file corresponding to the user intent and the slot information.
  9. The electronic device according to claim 7 or 8, characterized in that the electronic device is further configured to perform:
    determining whether the text information contains the user intent;
    if it is determined that the text information does not contain the user intent, obtaining an emotion feature vector of the voice signal through the Mel-frequency cepstral coefficient (MFCC) algorithm; and
    querying the media file library according to the emotion feature vector to obtain a media file corresponding to the emotion feature vector, wherein the media file library stores a second mapping relationship between multiple emotion feature vectors and multiple emotion tags.
  10. The electronic device according to claim 9, characterized in that the electronic device is further configured to perform:
    obtaining emotion information from users' comments on multiple media files;
    determining whether the emotion information is fine-grained emotion information or coarse-grained emotion information;
    if the emotion information is the fine-grained emotion information, obtaining slots in the fine-grained emotion information, and establishing the first mapping relationship in the media file library; and
    if the emotion information is the coarse-grained emotion information, obtaining an emotion feature vector according to an emotion tag, obtaining the emotion tag of the media file, and establishing the second mapping relationship of the media file.
  11. The electronic device according to claim 7, characterized in that the converting the voice signal into text information comprises:
    converting the voice signal into the text information through automatic speech recognition (ASR).
  12. The electronic device according to claim 7, characterized in that the obtaining a user intent in the text information comprises:
    obtaining the user intent in the text information through natural language understanding (NLU) technology.
  13. A chip system, characterized in that the chip system is applied to an electronic device; the chip system comprises one or more interface circuits and one or more processors; the interface circuit and the processor are interconnected by wires; the interface circuit is configured to receive a signal from a memory of the electronic device and to send the signal to the processor, the signal comprising computer instructions stored in the memory; and when the processor executes the computer instructions, the electronic device performs the media file recommendation method according to any one of claims 1-6.
  14. A readable storage medium, characterized in that the readable storage medium stores instructions that, when run on an electronic device, cause the electronic device to perform the media file recommendation method according to any one of claims 1-6.
  15. A computer program product, characterized in that, when the computer program product runs on a computer, the computer is caused to perform the media file recommendation method according to any one of claims 1-6.
PCT/CN2020/100858 2019-07-08 2020-07-08 Media files recommending method and device WO2021004481A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910609618.0 2019-07-08
CN201910609618.0A CN110473546B (en) 2019-07-08 2019-07-08 Media file recommendation method and device

Publications (1)

Publication Number Publication Date
WO2021004481A1 true WO2021004481A1 (en) 2021-01-14

Family

ID=68506827

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/100858 WO2021004481A1 (en) 2019-07-08 2020-07-08 Media files recommending method and device

Country Status (2)

Country Link
CN (1) CN110473546B (en)
WO (1) WO2021004481A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113140138A (en) * 2021-04-25 2021-07-20 新东方教育科技集团有限公司 Interactive teaching method, device, storage medium and electronic equipment
CN113297934A (en) * 2021-05-11 2021-08-24 国家计算机网络与信息安全管理中心 Multi-mode video behavior analysis method for detecting internet violent harmful scene
CN113903342A (en) * 2021-10-29 2022-01-07 镁佳(北京)科技有限公司 Voice recognition error correction method and device
CN116108373A (en) * 2023-04-17 2023-05-12 京东科技信息技术有限公司 Bill data classifying and labeling system, electronic equipment and storage medium

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110473546B (en) * 2019-07-08 2022-05-31 华为技术有限公司 Media file recommendation method and device
CN112948662A (en) * 2019-12-10 2021-06-11 北京搜狗科技发展有限公司 Recommendation method and device and recommendation device
CN111666377A (en) * 2020-06-03 2020-09-15 贵州航天云网科技有限公司 Talent portrait construction method and system based on big data modeling
CN113808619B (en) * 2021-08-13 2023-10-20 北京百度网讯科技有限公司 Voice emotion recognition method and device and electronic equipment
CN116416993A (en) * 2021-12-30 2023-07-11 华为技术有限公司 Voice recognition method and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040193421A1 (en) * 2003-03-25 2004-09-30 International Business Machines Corporation Synthetically generated speech responses including prosodic characteristics of speech inputs
CN107071542A (en) * 2017-04-18 2017-08-18 百度在线网络技术(北京)有限公司 Video segment player method and device
CN107222757A (en) * 2017-07-05 2017-09-29 深圳创维数字技术有限公司 A kind of voice search method, set top box, storage medium, server and system
CN108804609A (en) * 2018-05-30 2018-11-13 平安科技(深圳)有限公司 Song recommendations method and apparatus
CN109189978A (en) * 2018-08-27 2019-01-11 广州酷狗计算机科技有限公司 The method, apparatus and storage medium of audio search are carried out based on speech message
CN110473546A (en) * 2019-07-08 2019-11-19 华为技术有限公司 A kind of media file recommendation method and device

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103970791B (en) * 2013-02-01 2018-01-23 华为技术有限公司 A kind of method, apparatus for recommending video from video library
US9378741B2 (en) * 2013-03-12 2016-06-28 Microsoft Technology Licensing, Llc Search results using intonation nuances
US9788777B1 (en) * 2013-08-12 2017-10-17 The Neilsen Company (US), LLC Methods and apparatus to identify a mood of media
CN106302987A (en) * 2016-07-28 2017-01-04 乐视控股(北京)有限公司 A kind of audio frequency recommends method and apparatus
CN106570496B (en) * 2016-11-22 2019-10-01 上海智臻智能网络科技股份有限公司 Emotion identification method and apparatus and intelligent interactive method and equipment
US10558701B2 (en) * 2017-02-08 2020-02-11 International Business Machines Corporation Method and system to recommend images in a social application
CN107562850A (en) * 2017-08-28 2018-01-09 百度在线网络技术(北京)有限公司 Music recommends method, apparatus, equipment and storage medium
CN108197115B (en) * 2018-01-26 2022-04-22 上海智臻智能网络科技股份有限公司 Intelligent interaction method and device, computer equipment and computer readable storage medium
CN109740154B (en) * 2018-12-26 2021-10-26 西安电子科技大学 Online comment fine-grained emotion analysis method based on multi-task learning

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040193421A1 (en) * 2003-03-25 2004-09-30 International Business Machines Corporation Synthetically generated speech responses including prosodic characteristics of speech inputs
CN107071542A (en) * 2017-04-18 2017-08-18 百度在线网络技术(北京)有限公司 Video segment player method and device
CN107222757A (en) * 2017-07-05 2017-09-29 深圳创维数字技术有限公司 A kind of voice search method, set top box, storage medium, server and system
CN108804609A (en) * 2018-05-30 2018-11-13 平安科技(深圳)有限公司 Song recommendations method and apparatus
CN109189978A (en) * 2018-08-27 2019-01-11 广州酷狗计算机科技有限公司 The method, apparatus and storage medium of audio search are carried out based on speech message
CN110473546A (en) * 2019-07-08 2019-11-19 华为技术有限公司 A kind of media file recommendation method and device

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113140138A (en) * 2021-04-25 2021-07-20 新东方教育科技集团有限公司 Interactive teaching method, device, storage medium and electronic equipment
CN113297934A (en) * 2021-05-11 2021-08-24 国家计算机网络与信息安全管理中心 Multi-mode video behavior analysis method for detecting internet violent harmful scene
CN113297934B (en) * 2021-05-11 2024-03-29 国家计算机网络与信息安全管理中心 Multi-mode video behavior analysis method for detecting Internet violence harmful scene
CN113903342A (en) * 2021-10-29 2022-01-07 镁佳(北京)科技有限公司 Voice recognition error correction method and device
CN113903342B (en) * 2021-10-29 2022-09-13 镁佳(北京)科技有限公司 Voice recognition error correction method and device
CN116108373A (en) * 2023-04-17 2023-05-12 京东科技信息技术有限公司 Bill data classifying and labeling system, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN110473546B (en) 2022-05-31
CN110473546A (en) 2019-11-19

Similar Documents

Publication Publication Date Title
WO2021004481A1 (en) Media files recommending method and device
US10811005B2 (en) Adapting voice input processing based on voice input characteristics
CN110288985B (en) Voice data processing method and device, electronic equipment and storage medium
US20200126566A1 (en) Method and apparatus for voice interaction
JP2021103328A (en) Voice conversion method, device, and electronic apparatus
WO2020177190A1 (en) Processing method, apparatus and device
CN110097870B (en) Voice processing method, device, equipment and storage medium
WO2019242414A1 (en) Voice processing method and apparatus, storage medium, and electronic device
WO2019199742A1 (en) Continuous detection of words and related user experience
US10699706B1 (en) Systems and methods for device communications
CN109543021B (en) Intelligent robot-oriented story data processing method and system
WO2019228138A1 (en) Music playback method and apparatus, storage medium, and electronic device
CN112927674B (en) Voice style migration method and device, readable medium and electronic equipment
CN109710799B (en) Voice interaction method, medium, device and computing equipment
WO2020173211A1 (en) Method and apparatus for triggering special image effects and hardware device
CN111640434A (en) Method and apparatus for controlling voice device
EP3550449A1 (en) Search method and electronic device using the method
CN110706707A (en) Method, apparatus, device and computer-readable storage medium for voice interaction
CN111460231A (en) Electronic device, search method for electronic device, and medium
CN110889008B (en) Music recommendation method and device, computing device and storage medium
US11238865B2 (en) Function performance based on input intonation
CN113611316A (en) Man-machine interaction method, device, equipment and storage medium
WO2019228140A1 (en) Instruction execution method and apparatus, storage medium, and electronic device
CN115798459B (en) Audio processing method and device, storage medium and electronic equipment
CN114974213A (en) Audio processing method, electronic device and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20836195

Country of ref document: EP

Kind code of ref document: A1

122 Ep: pct application non-entry in european phase

Ref document number: 20836195

Country of ref document: EP

Kind code of ref document: A1