WO2021004481A1 - Media files recommending method and device - Google Patents

Media files recommending method and device

Info

Publication number
WO2021004481A1
WO2021004481A1 (PCT/CN2020/100858; CN2020100858W)
Authority
WO
WIPO (PCT)
Prior art keywords
information
user
media file
emotional
electronic device
Prior art date
Application number
PCT/CN2020/100858
Other languages
French (fr)
Chinese (zh)
Inventor
王家凯
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Publication of WO2021004481A1 publication Critical patent/WO2021004481A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
    • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L25/45 Speech or voice analysis techniques characterised by the type of analysis window
    • G10L25/48 Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques specially adapted for particular use for comparison or discrimination for estimating an emotional state

Definitions

  • This application relates to the field of terminal technology, and in particular to a method and device for recommending media files.
  • With the application and popularization of smart terminal devices, smart voice devices play an increasingly important role in human-computer interaction. Enabling smart voice devices to recognize the emotions expressed in human voice information and to recommend data and services to users based on voice emotion is an important direction of artificial intelligence research today.
  • The current recommendation scheme based on the user's voice emotion uses an algorithm that extracts emotional features based on Mel-frequency cepstral coefficients (MFCC): the user's voice emotional features are extracted from the prosodic and sound-quality features in the voice information, a database is queried according to the correspondence between voice emotional features and emotion types, and data or services of the same or a similar emotion type are recommended to the user, for example, a sad song or a funny movie recommended according to the user's voice emotion.
  • However, this matching method only supports coarse-grained emotion matching, that is, data recommendation at the level of whole multimedia files. When the user wants the most exciting part of a multimedia file, for example when the user enters the voice message "I want to watch the funniest clips of movie XXX" or "I want to watch scary movie clips", no recommendation can be made for the user, or the recommendation accuracy is low, and the user experience is poor.
  • The present application provides a method and device for recommending media files, which solves the problems of low recommendation accuracy and poor user experience in prior-art recommendation schemes based on the user's voice emotion.
  • In a first aspect, a media file recommendation method is provided, which is applied to an electronic device.
  • The method includes: receiving a voice signal and converting the voice signal into text information; obtaining, according to the text information, a user intent and the slot information included in the user intent, where the slot information includes emotional information and timing information; and querying a media file library according to the user intent and the slot information to obtain the media file corresponding to the user intent and the slot information.
  • In the embodiments of this application, the electronic device queries the media file library according to the user intent and slot information contained in the user's voice information, and uses the timing information and emotional information to match the user with the multimedia file closest to the user's needs and emotional needs, so that the user's emotional needs can be accurately identified, fine-grained data can be intelligently recommended to the user, and the user experience is improved.
  • In a possible design, the media file library stores a first mapping relationship between multiple user intents, slot information, and multiple media file identifiers; querying the media file library according to the user intent and slot information to obtain the media file corresponding to the user intent and slot information includes: obtaining, according to the first mapping relationship, the media file corresponding to the user intent and slot information.
  • In this possible implementation, the electronic device queries the media file library according to the mapping relationship between user intents and slot information, so that the most suitable media file can be matched and recommended according to the different emotional needs of different users, thereby improving the accuracy and flexibility of intelligent recommendation and the user experience.
  • In a possible design, before obtaining the user intent and slot information from the text information, the method further includes: determining whether the text information contains a user intent; if it is determined that the text information does not contain a user intent, obtaining the emotional feature vector of the voice signal through the Mel-frequency cepstral coefficient (MFCC) algorithm; and querying the media file library according to the emotional feature vector to obtain the media file corresponding to the emotional feature vector, where the media file library stores a second mapping relationship between multiple emotional feature vectors and multiple pieces of emotional information, and each type of emotional information corresponds to multiple media files.
  • In this possible implementation, if the user's voice information does not contain an explicit user intent, the electronic device can still match media files according to the emotional information corresponding to the voice emotional features extracted from the user's voice information, thereby improving the flexibility of intelligent recommendation and the user experience.
  • In a possible design, before receiving the voice signal, the method further includes: obtaining the emotional information of users' comments on multiple media files; determining whether the emotional information is fine-grained emotional information or coarse-grained emotional information; if the emotional information is fine-grained emotional information, obtaining the slots in the fine-grained emotional information and establishing the first mapping relationship in the media file library; and if the emotional information is coarse-grained emotional information, obtaining the emotional feature vector according to the emotional information, obtaining the emotional information of the media file, and establishing the second mapping relationship for the media file.
  • In this possible implementation, the electronic device can extract emotional information and timing information from massive amounts of users' multimedia comment information to establish the mapping relationships, thereby generating the multimedia file library and improving the accuracy of intelligent recommendation.
  • In a possible design, converting the voice signal into text information includes: converting the voice signal into text information through automatic speech recognition (ASR).
  • In this possible implementation, the electronic device can recognize the text information included in the user's voice information through automatic speech recognition technology, thereby improving the accuracy of intelligent recommendation.
  • In a possible design, obtaining the user intent from the text information includes: obtaining the user intent from the text information through natural language understanding (NLU) technology.
  • In this possible implementation, the electronic device can recognize the user intent included in the user's voice information through natural language understanding technology and match recommendations according to the user intent, thereby improving the accuracy of intelligent recommendation.
  • In a second aspect, an electronic device is provided, which includes a processor and a memory connected to the processor.
  • The memory is used for storing instructions.
  • When the processor executes the instructions, the electronic device is caused to perform: receiving a voice signal and converting the voice signal into text information; obtaining, according to the text information, a user intent and the slot information included in the user intent, where the slot information includes emotional information and timing information; and querying the media file library according to the user intent and slot information to obtain the media file corresponding to the user intent and slot information.
  • In a possible design, the media file library stores a first mapping relationship between multiple user intents, slot information, and multiple media file identifiers; the electronic device is specifically configured to perform: obtaining, according to the first mapping relationship, the media file corresponding to the user intent and slot information.
  • In a possible design, the electronic device is further configured to perform: determining whether the text information contains a user intent; if it is determined that the text information does not contain a user intent, obtaining the emotional feature vector of the voice signal through the Mel-frequency cepstral coefficient (MFCC) algorithm; and querying the media file library according to the emotional feature vector to obtain the media file corresponding to the emotional feature vector, where the media file library stores a second mapping relationship between multiple emotional feature vectors and multiple emotional tags.
  • In a possible design, the electronic device is further configured to perform: obtaining the emotional information of users' comments on multiple media files; determining whether the emotional information is fine-grained emotional information or coarse-grained emotional information; if the emotional information is fine-grained emotional information, obtaining the slots in the fine-grained emotional information and establishing the first mapping relationship in the media file library; and if the emotional information is coarse-grained emotional information, obtaining the emotional feature vector according to the emotional tag, obtaining the emotional tag of the media file, and establishing the second mapping relationship for the media file.
  • In a possible design, converting the voice signal into text information includes: converting the voice signal into text information through automatic speech recognition (ASR).
  • In a possible design, obtaining the user intent from the text information includes: obtaining the user intent from the text information through natural language understanding (NLU) technology.
  • In another aspect, a chip system is provided, which is applied to an electronic device; the chip system includes one or more interface circuits and one or more processors; the interface circuits and the processors are interconnected by wires; and the interface circuit is configured to receive a signal from the memory of the electronic device and send the signal to the processor.
  • The signal includes the computer instructions stored in the memory; when the processor executes the computer instructions, the electronic device performs the method of the first aspect and any of its possible designs.
  • In another aspect, a readable storage medium is provided, which stores instructions.
  • When the instructions run on an electronic device, the electronic device is caused to perform the method of the first aspect and any of its possible designs.
  • In another aspect, a computer program product is provided.
  • When the computer program product runs on a computer, the computer is caused to perform the method of the first aspect and any of its possible designs.
  • It can be understood that any of the electronic device, chip system, readable storage medium, and computer program product for media file recommendation provided above is used to execute the corresponding method provided above; therefore, for the beneficial effects that they can achieve, reference may be made to the beneficial effects of the corresponding method provided above, which are not repeated here.
  • FIG. 1 is a schematic diagram of the hardware structure of an electronic device provided by an embodiment of this application.
  • FIG. 2 is a software system architecture diagram of an electronic device provided by an embodiment of this application.
  • FIG. 3 is a schematic flowchart of a method for recommending media files according to an embodiment of this application.
  • FIG. 4 is a schematic diagram of a process for extracting emotional feature vectors provided by an embodiment of this application.
  • FIG. 5 is a schematic flowchart of establishing a media file library in a method for recommending media files according to an embodiment of this application.
  • FIG. 6 is a schematic structural diagram of a chip system provided by an embodiment of this application.
  • Intelligent voice device: an electronic device that can receive the user's voice information, output voice information, and interact with the user by voice.
  • Automatic speech recognition (ASR): a technology that converts human voice signals into text.
  • Natural language understanding (NLU) technology: a technology that recognizes the text content and intent in human natural language, that is, a technology that allows computers to "understand" natural language, so that natural language can be used for communication between human and machine. It covers a wide range of fields, including sentence detection, word segmentation, part-of-speech tagging, syntactic analysis, text classification/clustering, text sentiment analysis, information extraction/automatic summarization, machine translation, automatic question answering, text generation, and many others.
  • Slot: a concept in human-machine dialogue. A slot is the definition of a piece of key information identified in the user's voice information, that is, the information needed to turn the user intent into an explicit user instruction; one slot corresponds to one kind of information that needs to be obtained in handling a task.
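For illustration only (not part of the patent text), the intent and slots extracted from an utterance could be represented with a small structure such as the one below; the field names and values are assumptions chosen for readability, not terms defined by the application.

```python
# Hypothetical representation of a parsed utterance; all names are illustrative.
parsed_utterance = {
    "intent": "play_media_segment",        # the classified user need
    "slots": {
        "macro_tag": "movie",              # media data attribute
        "title": "The King of Comedy",     # explicit media file name
        "emotion": "funny",                # emotional information (emotional tag)
        "timing": "clip",                  # timing information (timing tag)
    },
}
```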
  • The embodiments of the present application provide a media file recommendation method, which can be applied to electronic devices that include smart voice devices, such as voice assistants, smart speakers, smart phones, tablet computers, computers, wearable electronic devices, and smart robots.
  • The electronic device can intelligently recognize the emotions and recommendation needs expressed in the user's voice information and recommend fine-grained data, such as segment-level media files, to the user, which improves the accuracy of data recommendation and thereby the user's experience.
  • FIG. 1 is a schematic diagram of a possible structure of an electronic device 100 provided by an embodiment of this application.
  • the electronic device 100 may include: a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, and a battery 142, antenna 1, antenna 2, mobile communication module 150, wireless communication module 160, audio module 170, speaker 170A, receiver 170B, microphone 170C, earphone jack 170D, sensor module 180, buttons 190, motor 191, indicator 192, camera 193, a display screen 194, and a subscriber identification module (SIM) card interface 195, etc.
  • The aforementioned sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, and other sensors.
  • the structure illustrated in this embodiment does not constitute a specific limitation on the electronic device 100.
  • the electronic device 100 may include more or fewer components than shown, or combine certain components, or split certain components, or arrange different components.
  • the illustrated components can be implemented in hardware, software, or a combination of software and hardware.
  • The processor 110 may include one or more processing units.
  • For example, the processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc.
  • The different processing units may be independent devices or may be integrated in one or more processors.
  • The controller may be the nerve center and command center of the electronic device 100.
  • The controller can generate an operation control signal according to the instruction operation code and the timing signal, to complete the control of instruction fetching and instruction execution.
  • A memory may also be provided in the processor 110 to store instructions and data.
  • In some embodiments, the memory in the processor 110 is a cache.
  • The cache can store instructions or data that the processor 110 has just used or uses cyclically. If the processor 110 needs to use the instructions or data again, it can call them directly from the cache. This avoids repeated accesses, reduces the waiting time of the processor 110, and improves system efficiency.
  • The processor 110 may include one or more interfaces.
  • The interfaces may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (SIM) interface, and/or a universal serial bus (USB) interface, etc.
  • The interface connection relationship between the modules illustrated in this embodiment is merely a schematic description and does not constitute a structural limitation of the electronic device 100.
  • In other embodiments, the electronic device 100 may also adopt an interface connection mode different from that in the foregoing embodiment, or a combination of multiple interface connection modes.
  • the external memory interface 120 may be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the electronic device 100.
  • the external memory card communicates with the processor 110 through the external memory interface 120 to realize the data storage function. For example, save music, video and other files in an external memory card.
  • the internal memory 121 may be used to store computer executable program code, where the executable program code includes instructions.
  • the processor 110 executes various functional applications and data processing of the electronic device 100 by running instructions stored in the internal memory 121.
  • the processor 110 may execute instructions stored in the internal memory 121, and the internal memory 121 may include a storage program area and a storage data area.
  • the storage program area can store an operating system, at least one application program (such as a sound playback function, an image playback function, etc.) required by at least one function.
  • the data storage area can store data (such as audio data, phone book, etc.) created during the use of the electronic device 100.
  • the internal memory 121 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, a universal flash storage (UFS), etc.
  • the electronic device 100 can implement audio functions through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the earphone interface 170D, and the application processor. For example, music playback, recording, etc.
  • the audio module 170 is used to convert digital audio information into an analog audio signal for output, and is also used to convert an analog audio input into a digital audio signal.
  • the audio module 170 can also be used to encode and decode audio signals.
  • the audio module 170 may be provided in the processor 110, or part of the functional modules of the audio module 170 may be provided in the processor 110.
  • The speaker 170A, also called a "loudspeaker", is used to convert an audio electrical signal into a sound signal.
  • The electronic device 100 can play music or take a hands-free call through the speaker 170A.
  • The receiver 170B, also called an "earpiece", is used to convert an audio electrical signal into a sound signal.
  • When the electronic device 100 answers a call or a voice message, the receiver 170B can be brought close to the human ear to receive the voice.
  • The microphone 170C, also called a "mic" or "mike", is used to convert a sound signal into an electrical signal.
  • When making a sound, the user can speak close to the microphone 170C to input the sound signal into the microphone 170C.
  • The electronic device 100 may be provided with at least one microphone 170C. In other embodiments, the electronic device 100 may be provided with two microphones 170C, which can implement a noise reduction function in addition to collecting sound signals. In still other embodiments, the electronic device 100 may be provided with three, four, or more microphones 170C to collect sound signals, reduce noise, identify sound sources, implement directional recording functions, and so on.
  • the software system of the electronic device 100 may adopt a layered architecture, an event-driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture.
  • the embodiment of the present invention takes a layered Android system as an example to illustrate the software structure of the electronic device 100 by way of example.
  • FIG. 2 is a software structure block diagram of an electronic device 100 according to an embodiment of the present invention.
  • The layered architecture divides the software into several layers, and each layer has a clear role and division of labor. The layers communicate with each other through software interfaces.
  • The Android system is divided into four layers, which are, from top to bottom, the application layer, the application framework layer, the Android runtime and system libraries, and the kernel layer.
  • the application layer can include a series of application packages.
  • the application package can include applications such as camera, gallery, calendar, call, map, navigation, WLAN, Bluetooth, music, video, SMS and voice assistant.
  • the application framework layer provides application programming interfaces (application programming interface, API) and programming frameworks for applications in the application layer.
  • the application framework layer includes some predefined functions.
  • the application framework layer can include a window manager, a content provider, a view system, a phone manager, a resource manager, and a notification manager.
  • the window manager is used to manage window programs.
  • the window manager can obtain the size of the display, determine whether there is a status bar, lock the screen, take a screenshot, etc.
  • the content provider is used to store and retrieve data and make these data accessible to applications.
  • the data may include video, image, audio, phone calls made and received, browsing history and bookmarks, phone book, etc.
  • the view system includes visual controls, such as controls that display text and controls that display pictures.
  • the view system can be used to build applications.
  • the display interface can be composed of one or more views.
  • a display interface that includes a short message notification icon may include a view that displays text and a view that displays pictures.
  • the phone manager is used to provide the communication function of the electronic device 100. For example, the management of the call status (including connecting, hanging up, etc.).
  • the resource manager provides various resources for the application, such as localized strings, icons, pictures, layout files, video files, etc.
  • The notification manager enables an application to display notification information in the status bar. It can be used to convey notification-type messages, which can disappear automatically after a short stay without user interaction.
  • For example, the notification manager is used to notify of download completion, provide message reminders, and so on.
  • The notification manager may also display notifications in the status bar at the top of the system in the form of a chart or scroll-bar text, for example notifications of applications running in the background, or notifications that appear on the screen in the form of a dialog window. For example, text information is prompted in the status bar, a prompt tone is played, the electronic device vibrates, or an indicator light flashes.
  • The Android runtime includes a core library and a virtual machine. The Android runtime is responsible for the scheduling and management of the Android system.
  • The core library consists of two parts: one part is the functions that the Java language needs to call, and the other part is the core library of Android.
  • the application layer and the application framework layer run in a virtual machine.
  • the virtual machine executes the java files of the application layer and the application framework layer as binary files.
  • the virtual machine is used to perform functions such as object life cycle management, stack management, thread management, security and exception management, and garbage collection.
  • the system library can include multiple functional modules. For example: surface manager (surface manager), media library (Media Libraries), three-dimensional graphics processing library (for example: OpenGL ES), 2D graphics engine (for example: SGL), etc.
  • the surface manager is used to manage the display subsystem and provides a combination of 2D and 3D layers for multiple applications.
  • the media library supports playback and recording of a variety of commonly used audio and video formats, as well as still image files.
  • the media library can support multiple audio and video encoding formats, such as: MPEG4, H.264, MP3, AAC, AMR, JPG, PNG, etc.
  • the 3D graphics processing library is used to realize 3D graphics drawing, image rendering, synthesis, and layer processing.
  • the 2D graphics engine is a drawing engine for 2D drawing.
  • the kernel layer is the layer between hardware and software.
  • the kernel layer contains at least display driver, camera driver, audio driver, and sensor driver.
  • An embodiment of the present application provides a media file recommendation method. As shown in FIG. 3, the method may include steps 301 to 303:
  • 301. The electronic device receives a voice signal and converts the voice signal into text information.
  • Specifically, the voice signal produced when the user speaks is received, and ASR technology is used to convert the voice signal into corresponding text information.
  • The process by which ASR technology converts voice information into text information may include: voice signal preprocessing and feature extraction; acoustic model and pattern matching; and language model and language processing.
  • First, a speech recognition unit is selected from words (sentences), syllables, or phonemes, and voice features are extracted from the voice information; then, the extracted voice features are matched and compared against the pre-established acoustic model (pattern) to obtain the best recognition result; finally, the language model is matched, that is, the grammar network formed by the recognized voice commands or the language model built by statistical methods is matched, and language processing such as grammatical and semantic analysis is performed to generate the text information corresponding to the voice information.
  • For example, the electronic device converts a received piece of the user's voice audio into the text message: "I want to listen to a sad song."
  • In the above method, the electronic device 100 may obtain the user's voice signal through the microphone 170C and send the voice signal to the processor 110 for processing.
  • The audio module 170 in the processor 110 may process the voice signal.
  • The system can instruct the voice assistant program at the application layer through commands, invoke the related programs of the application framework layer and the related functions of the core library, process the voice signal, and convert it into text information.
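As a concrete but non-normative illustration of step 301, the sketch below captures audio from a microphone and converts it to text using the third-party SpeechRecognition package; the package choice, the Google Web Speech backend, and the language code are assumptions, not part of this application, which only requires that some ASR engine turn the voice signal into text.

```python
# Minimal sketch of "receive a voice signal and convert it to text" (step 301).
# Assumes the third-party SpeechRecognition package (plus PyAudio for the
# microphone); any on-device or cloud ASR engine could be substituted.
import speech_recognition as sr

def capture_and_transcribe(language="en-US"):
    recognizer = sr.Recognizer()
    with sr.Microphone() as source:              # receive the voice signal
        recognizer.adjust_for_ambient_noise(source)
        audio = recognizer.listen(source)
    try:
        # Convert the voice signal into text information (ASR).
        return recognizer.recognize_google(audio, language=language)
    except sr.UnknownValueError:
        return ""                                # speech was unintelligible

# Example: capture_and_transcribe() -> "I want to listen to a sad song"
```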
  • 302. The electronic device obtains, according to the text information, the user intent and the slot information included in the user intent.
  • The user intent is the user's need, that is, information indicating what task the user wants the electronic device to complete.
  • Specifically, the user intent may be an intent keyword included in the above text information, where the intent keyword can be used to classify the user's need into a certain type.
  • The intent keywords may include: media data attributes, emotional intents, media data file names, keywords related to the media data, and so on.
  • Media data attributes: for example, music, movies, variety shows, dramas, fine arts, literary works, photos, and so on.
  • Emotional intents: for example, happy, sad, scary, and so on.
  • The media data file name identifies the media data explicitly required by the user, such as "Jane Eyre" or "The King of Comedy".
  • Keywords related to the media data: for example, a segment of a certain character in a film, a segment of a specific plot in a film, a passage with a certain rhythm in a piece of music, or a description of a plot in a literary work, which can locate the user's need.
  • The slot information may include: timing information and emotional information.
  • The timing information may be a part of the content of a media file marked with a tag, and may also be called a timing tag; it corresponds to a part of the content of the media file and may be a specific time point of the media file or a timing segment, for example, the 12th minute of a movie, or the second to the third minute of a certain piece of music.
  • The emotional information may be the emotion type of a media file marked with a tag, and may also be called an emotional tag; specifically, it may include: beautiful, happy, sad, scary, cheerful, exciting, and so on.
  • For example, if the text information obtained in step 301 is "I want to listen to a sad song", the intent keyword that can be extracted is: song, and the slot information extracted is: sad.
  • For another example, if the obtained text information is "I want to see a funny plot in Charlotte's Troubles", the intent keyword that can be extracted is: the movie "Charlotte's Troubles"; the timing tag in the extracted slot information is: a plot; and the emotional tag in the extracted slot information is: funny.
  • The user intent in the text information can be obtained through NLU technology. Specifically, NLU technology can use deep learning techniques and neural network algorithms to identify all the words and phrases included in the text information, perform semantic understanding of the text, and determine the user intent. The specific implementation process of this technology is not described in detail in the embodiments of this application.
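The application leaves the NLU implementation open (deep learning and neural network algorithms are mentioned but not specified). Purely as an illustration of the shape of the output, that is, an intent keyword plus emotion and timing slots, the toy rule-based stand-in below uses keyword lists that are assumptions of this sketch, not vocabularies defined by the application.

```python
# A toy, rule-based stand-in for the NLU step: extract an intent keyword and
# the emotion/timing slots from the transcribed text. A real system would use
# deep-learning NLU; the vocabularies below are illustrative only.
MACRO_TAGS = ["song", "music", "movie", "drama", "variety show", "photo"]
EMOTION_TAGS = ["sad", "happy", "funny", "scary", "touching", "cheerful"]
TIMING_WORDS = ["clip", "segment", "plot", "episode", "scene"]

def parse_intent_and_slots(text: str):
    lowered = text.lower()
    intent = next((t for t in MACRO_TAGS if t in lowered), None)
    slots = {
        "emotion": next((t for t in EMOTION_TAGS if t in lowered), None),
        "timing": next((t for t in TIMING_WORDS if t in lowered), None),
    }
    return intent, slots

# parse_intent_and_slots("I want to watch a scary movie clip")
# -> ("movie", {"emotion": "scary", "timing": "clip"})
```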
  • In some embodiments, if the text information does not contain a clear user intent, the electronic device uses the MFCC algorithm to extract the user's voice emotional features, and the emotional label matched by those features serves as the user's emotional demand.
  • Specifically, the obtained user voice information is processed by the MFCC algorithm to obtain an MFCC feature vector representing the emotional features of the user's voice; the MFCC feature vector is matched against the preset pairs of MFCC feature vectors and emotional labels, and the matched emotional label is used in the following steps.
  • The specific process by which the MFCC algorithm extracts the user's voice emotional features may be as shown in FIG. 4, and includes: analog-to-digital conversion, pre-emphasis, framing and windowing, Fourier transform, Mel filtering, cepstrum, and energy and difference processing. This process generates the MFCC feature vector.
  • Analog-to-digital conversion: the input analog signal is converted into a digital signal.
  • Pre-emphasis: the digital signal is passed through a high-pass filter. The purpose is to boost the high-frequency part and flatten the frequency spectrum of the signal, so that the spectrum can be computed with the same signal-to-noise ratio over the entire frequency band from low to high frequency. Pre-emphasis also eliminates the effects of the vocal cords and lips during voice production, compensates for the high-frequency part of the voice signal suppressed by the articulatory system, and highlights the high-frequency formants.
  • Framing and windowing: N sampling points are gathered into one observation unit, called a frame. Each frame is multiplied by a Hamming window (the Hamming window selects a period of the signal) to increase the continuity between the left and right ends of the frame.
  • Fourier transform: after multiplication by the Hamming window, a fast Fourier transform is performed on each framed and windowed signal to obtain the spectrum of each frame, that is, the energy distribution over the frequency domain; the modulus square of the spectrum of the speech signal is then taken to obtain the power spectrum of the speech signal.
  • Mel filtering: the spectrum is smoothed by Mel filtering and the effect of harmonics is eliminated, highlighting the formants of the original voice. As a result, the pitch of a piece of speech does not appear in the MFCC parameters; in other words, a speech recognition system using MFCC features is not affected by the pitch of the input speech. In addition, the amount of computation is reduced.
  • Cepstrum: the logarithm of the signal's Fourier spectrum is taken and an inverse Fourier transform is then performed. In this step, the logarithmic energy output by each filter bank can be obtained.
  • Energy and difference processing: the standard cepstral parameters (MFCC) reflect only the static characteristics of the speech, and the dynamic characteristics of the speech can be described by the difference spectrum of these static features. Experiments show that combining dynamic and static features can effectively improve the recognition performance of the system.
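The following is a minimal, self-contained sketch of the pipeline described above (pre-emphasis, framing, Hamming window, FFT power spectrum, Mel filter bank, log-energy plus DCT cepstrum, and first-order difference). The sample rate, frame sizes, filter counts, and the final averaging step are illustrative assumptions, not values specified by this application.

```python
# Sketch of the MFCC feature-extraction pipeline using NumPy/SciPy only.
import numpy as np
from scipy.fftpack import dct

def mfcc(signal, sample_rate=8000, frame_len=200, frame_step=80,
         n_filters=26, n_ceps=13, nfft=256):
    # Pre-emphasis: first-order high-pass filter boosts the high-frequency part.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])

    # Framing: gather N sampling points into one observation unit (a frame),
    # then multiply each frame by a Hamming window.
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // frame_step)
    idx = (np.arange(frame_len)[None, :] +
           np.arange(n_frames)[:, None] * frame_step)
    frames = emphasized[idx] * np.hamming(frame_len)

    # Fast Fourier transform and power spectrum of each frame.
    power = (np.abs(np.fft.rfft(frames, nfft)) ** 2) / nfft

    # Triangular Mel filter bank between 0 Hz and the Nyquist frequency.
    def hz_to_mel(hz): return 2595 * np.log10(1 + hz / 700.0)
    def mel_to_hz(mel): return 700 * (10 ** (mel / 2595.0) - 1)
    mel_points = np.linspace(0, hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fbank[m - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)

    # Cepstrum: log filter-bank energies followed by a DCT.
    log_energy = np.log(power @ fbank.T + 1e-10)
    ceps = dct(log_energy, type=2, axis=1, norm='ortho')[:, :n_ceps]

    # Difference processing: first-order delta captures dynamic characteristics.
    delta = np.diff(ceps, axis=0, prepend=ceps[:1])
    return np.hstack([ceps, delta])

# One way to obtain a fixed-length emotional feature vector for an utterance
# is to average over frames: feature = mfcc(audio_samples).mean(axis=0)
```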
  • In other embodiments, the above user intent may also include other keywords indicating the user's recommendation needs, for example, information indicating a certain plot, information indicating a certain line, information indicating a certain actor, or text describing a certain plot.
  • For example, if the user's voice information obtained by the electronic device is "I want to watch a clip of the heroine crying in The King of Comedy", the user's demand features that can be extracted include the keyword: the heroine crying; the explicit media data information: the movie "The King of Comedy"; and the timing information: clip.
  • The electronic device can use the above keywords as the emotional need for the query.
  • In the above method, the system can instruct the voice assistant program at the application layer through commands, and invoke the related programs of the application framework layer and the related functions of the core library, to recognize and process the text information and extract the intent and slot information it contains according to a certain algorithm.
  • 303. The electronic device queries the media file library according to the user intent and the slot information, and obtains the media file corresponding to the user intent and the slot information.
  • The media file library is pre-established by the electronic device or is obtained by the electronic device through a cloud service.
  • The cloud device that provides the electronic device with data processing and data storage services may specifically be a server.
  • The media file library stores a first mapping relationship between multiple user intents, slot information, and multiple media file identifiers.
  • Specifically, it may be a fine-grained media file library containing emotional tags and timing tags.
  • It includes massive media files such as music, movies, dramas, fine arts, literary works, and photos.
  • Each media file can contain a macro tag, such as music, movie, drama, art, literature, or photo, and can also include a specific multimedia file name, such as "The King of Comedy".
  • The macro tag can correspond to the user's intent keyword.
  • Each multimedia file can also contain emotional tags, for example, happy, sad, scary, and so on.
  • Each multimedia file may also include at least one piece of timing information, for example, the third minute to the fourth minute, the last ten minutes, and so on.
  • The slot information representing the emotional intent may form a correspondence with the slot information representing the timing. For example, the emotional tag corresponding to the third to the fourth minute of a movie is sad, and the emotional tag corresponding to the last ten minutes of the movie is happy, and so on.
  • Querying the media file library according to the intent and the slots to obtain the media file corresponding to the intent and the slots means obtaining, according to the first mapping relationship, the media file corresponding to the intent and the slots; that is, the media file library is queried according to the emotional tags, timing information, macro tags, and so on obtained in the above steps, and the multimedia data that best matches the emotional need is taken as the matching data.
  • The specific matching process may be as follows: first, the corresponding database is queried based on the macro tag, for example movies, or the movie "The King of Comedy"; then the media file library is queried according to the emotional tag, and the movie segment whose emotional tag is closest to the required emotional tag is matched, that is, the timing tag closest to the emotional tag is found, and the media data segment corresponding to that timing tag is the media file matched by the electronic device for the user.
  • In some embodiments, an emotional tag may also include a corresponding recommendation value. The recommendation value may be a quantified emotional value representing the data segment corresponding to the emotional tag of the multimedia data, and it may be used to calculate the matching degree. The recommendation value can be represented by the amount of data annotated with the emotional tag, the amount of search data, or the user score; the higher the recommendation value, the higher the matching degree of the emotional tag, and the lower the recommendation value, the lower the matching degree of the emotional tag.
  • For example, if the emotional tags of multiple timing segments of a certain movie are all funny, then when matching the funniest timing segment for the user, the timing segment whose emotional tag has the highest recommendation value among those segments is taken as the data with the highest matching degree.
  • In some embodiments, the media file library may be a file library of fine-grained multimedia tags, containing not only tags related to emotion types but also other tags that users may want recommended, for example, a segment of a certain character in a film, a segment of a specific plot in a film, a passage with a certain rhythm in a piece of music, a description of a plot in a literary work, and so on.
  • After the multimedia data with the highest matching degree is obtained, it is recommended to the user by sending voice information or by directly sending the multimedia data to the user.
  • For example, the electronic device obtains the user's voice information "I want to watch the most touching episode in The King of Comedy", processes the voice information, and recognizes the following emotional need: the emotional tag is: most touching; the slot information representing the timing is: an episode; and the extracted multimedia file name is: the movie "The King of Comedy".
  • The electronic device queries the multimedia file library according to the above emotional need, searches the timing segments of the movie "The King of Comedy" whose emotional tags are touching, sad, or tearful, and selects the data with the highest recommendation value in the query results, such as the plot segment from the 40th to the 50th minute of the movie; the electronic device then recommends that multimedia data to the user as the matching data.
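As a non-normative sketch of step 303, the structures and function below show one way the first mapping relationship (macro tag, title, emotional tag, timing tag, recommendation value) could be held in memory and queried, with the highest recommendation value used to break ties among matching segments. All names and the sample entries are assumptions of this sketch.

```python
# Illustrative in-memory form of the first mapping relationship and the
# matching step: filter by macro tag / title / emotional tag, then return the
# timing segment with the highest recommendation value. Sample data is made up.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class SegmentEntry:
    macro_tag: str          # e.g. "movie"
    title: str              # media file identifier, e.g. "The King of Comedy"
    emotion_tag: str        # e.g. "touching"
    timing_tag: str         # e.g. "40:00-50:00"
    recommend_value: float  # quantified emotional value used for matching

MEDIA_LIBRARY: List[SegmentEntry] = [
    SegmentEntry("movie", "The King of Comedy", "touching", "40:00-50:00", 0.92),
    SegmentEntry("movie", "The King of Comedy", "funny",    "03:00-04:00", 0.81),
]

def query_library(intent: str, emotion: Optional[str],
                  title: Optional[str] = None) -> Optional[SegmentEntry]:
    candidates = [
        e for e in MEDIA_LIBRARY
        if e.macro_tag == intent
        and (title is None or e.title == title)
        and (emotion is None or e.emotion_tag == emotion)
    ]
    # The highest recommendation value wins when several segments match.
    return max(candidates, key=lambda e: e.recommend_value, default=None)

# query_library("movie", "touching", "The King of Comedy")
# -> the 40:00-50:00 segment, which is then recommended to the user
```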
  • In some embodiments, the user's voice information acquired by the electronic device contains only a macro tag.
  • For example, the user's voice information acquired by the electronic device is "I want to watch a movie", and the macro tag is movie.
  • The electronic device queries the media file library according to the macro tag, matches movies, and recommends those with higher recommendation values to the user.
  • In some embodiments, the user's voice information acquired by the electronic device contains only a macro tag and an emotional tag.
  • For example, the user's voice information acquired by the electronic device is "I want to watch a horror movie"; the macro tag is movie and the emotional tag is horror. The electronic device queries the media file library according to the macro tag and the emotional tag, matches horror movies, and selects those with higher recommendation values to recommend to the user.
  • In some embodiments, if the text information does not contain a user intent, the emotional feature vector corresponding to the voice signal is obtained through the Mel-frequency cepstral coefficient (MFCC) algorithm.
  • The media file library is then queried according to the emotional feature vector, and media files under the emotional tags corresponding to similar emotional features are selected from the media file library for recommendation.
  • The media file library stores a second mapping relationship between multiple emotional feature vectors and multiple emotional tags; each emotional tag corresponds to at least one media file, and any media file matched by the emotional tag can be recommended to the user as guaranteed (fallback) recommendation data.
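A minimal sketch of this fallback path is shown below, assuming the utterance-level feature is the frame-averaged MFCC vector from the earlier sketch and that preset (feature vector, emotional label) pairs are compared by cosine similarity; the similarity measure, the stored vectors, and the file names are assumptions, not details specified by the application.

```python
# Fallback matching when the text contains no explicit user intent: compare the
# utterance's emotional feature vector against preset (vector, label) pairs and
# recommend a media file stored under the best-matching label. Data is made up,
# with vectors truncated to three dimensions for readability.
import numpy as np

SECOND_MAPPING = {                       # emotional label -> preset feature vector
    "sad":   np.array([0.1, -0.4, 0.7]),
    "happy": np.array([0.6,  0.2, -0.1]),
}
LABEL_TO_FILES = {                       # each label maps to fallback media files
    "sad":   ["sad_song_01.mp3"],
    "happy": ["cheerful_song_07.mp3"],
}

def recommend_by_emotion_vector(feature: np.ndarray) -> str:
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-10))
    label = max(SECOND_MAPPING, key=lambda k: cosine(feature, SECOND_MAPPING[k]))
    return LABEL_TO_FILES[label][0]      # any file under the label can be returned

# recommend_by_emotion_vector(np.array([0.2, -0.3, 0.8]))  -> "sad_song_01.mp3"
```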
  • In the above method, the voice assistant program can invoke the related programs of the application framework layer and the related functions of the core library, and use a certain matching algorithm to obtain the corresponding media file according to the extracted user intent and slots.
  • In some embodiments, the process by which the electronic device establishes the multimedia file library may be as shown in FIG. 5, and includes:
  • Obtaining a large amount of comment information on multimedia data, that is, the emotional information of users' comments on multiple media files. The comments can come from various channels on the Internet, for example, comments on multimedia files in user comments on forums, post bars, news sites, applications, and other websites, and can also include the user comment areas and bullet-screen message areas of movie and video websites. Specifically, web crawler technology can be used to crawl comments from the Internet, and a large amount of comment information about multimedia files, such as comments about music, movies, dramas, fine arts, literary works, or images, can be obtained according to a comment extraction model.
  • The comment information can be macroscopic, that is, coarse-grained emotional information, for example, "This music is very sad"; it can also be comment information at the level of timing segments, that is, fine-grained emotional information, for example, "The keynote of the movie is comedy, but the last 15 minutes are still very touching", or "The film is relatively plain overall, but the plot from the 30th to the 40th minute is notable".
  • If the emotional information is fine-grained emotional information, the first mapping relationship of the media file is established according to the obtained slot information; if the emotional information is coarse-grained emotional information, the emotional feature vector is obtained according to the emotional tag, and the second mapping relationship of the media file is established between the tag and the emotional feature vector.
  • Specifically, the obtained multimedia comment information is labeled, by manual labeling or by a rule-matching algorithm, to obtain the comment keywords in the multimedia comment information, such as emotional tags, timing tags, macro tags, or other keywords indicating recommendation needs, and the corresponding relationships among them.
  • It is determined whether the multimedia comment information is fine-grained emotional comment information or coarse-grained emotional comment information; if it is fine-grained emotional comment information, the slots in the fine-grained emotional comment information are obtained, and the first mapping relationship between the multimedia file and the slot information is established in the media file library; the slots may specifically include timing tags, emotional tags, and so on.
  • If the emotional information is coarse-grained emotional comment information, the emotional feature vector is obtained according to the emotional tag, the emotional tag of the media file is obtained, and the second mapping relationship of the media file is established.
  • For example, the correspondence between the timing tags and the emotional tags of a certain movie is stored, and the multimedia data is stored correspondingly.
  • A certain timing-information and emotional-tag pair may be: the emotional tag corresponding to the third to the fourth minute is sad, and the emotional tag corresponding to the last ten minutes is happy.
  • In addition, the emotional tags or macro tags attached to the multimedia on each platform, for example comedy, tragedy, fast song, sad song, and so on, are directly saved into the media file library and recommended as guaranteed (fallback) data.
  • Then data storage is carried out and the multimedia file library is established, so that information query and data matching can be performed in the multimedia file library according to the emotional needs in the user's voice information.
  • In some embodiments, the multimedia file library may be continuously updated: new comment information about the multimedia files is continuously acquired, and the emotional tags about timing segments included in the comment information are extracted, so as to refine and enrich the resources of the multimedia file library.
  • For example, mapping relationships such as the emotional tag and plot description of each frame of a movie can be established at a fine granularity, so as to match and recommend the timing segments of multimedia files for users more accurately.
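A rough sketch of the library-building flow in FIG. 5 is given below, assuming comments have already been crawled: a comment is treated as fine-grained when a timing expression can be found in it, in which case a first-mapping entry (title, timing tag, emotional tag) is created; otherwise only a coarse emotional tag is stored for the whole file. The regular expression and keyword list are assumptions of this sketch.

```python
# Illustrative classification of crawled comments into fine-grained entries
# (timing tag + emotional tag, for the first mapping relationship) and
# coarse-grained whole-file tags (for the second mapping relationship).
import re

EMOTIONS = ["sad", "funny", "scary", "touching", "cheerful"]
TIMING_RE = re.compile(r"(\d+)\s*(?:-|to)\s*(\d+)\s*minutes?|last\s+(\d+)\s*minutes?")

first_mapping = []   # (title, timing_tag, emotion_tag) entries
coarse_tags = {}     # title -> emotion_tag (guaranteed/fallback recommendations)

def ingest_comment(title: str, comment: str) -> None:
    text = comment.lower()
    emotion = next((e for e in EMOTIONS if e in text), None)
    if emotion is None:
        return                                   # no usable emotional information
    timing = TIMING_RE.search(text)
    if timing:                                   # fine-grained: segment-level comment
        first_mapping.append((title, timing.group(0), emotion))
    else:                                        # coarse-grained: whole-file comment
        coarse_tags[title] = emotion

ingest_comment("Some Movie", "The keynote is comedy, but the last 15 minutes are touching")
ingest_comment("Some Song", "This music is very sad")
```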
  • An embodiment of the present application also provides an electronic device, which may include a memory and one or more processors, the memory being coupled to the processors.
  • The memory is used to store computer program code, and the computer program code includes computer instructions.
  • When the processor executes the computer instructions, the electronic device can perform the functions or steps in the foregoing method embodiments.
  • the chip system includes at least one processor 601 and at least one interface circuit 602.
  • the processor 601 and the interface circuit 602 may be interconnected by wires.
  • the interface circuit 602 may be used to receive signals from other devices (such as the memory of an electronic device).
  • the interface circuit 602 may be used to send signals to other devices (for example, the processor 601).
  • the interface circuit 602 can read an instruction stored in the memory, and send the instruction to the processor 601.
  • When the processor 601 executes the instructions, the electronic device can be made to perform the functions or steps performed by the electronic device in the foregoing embodiments.
  • the chip system may also include other discrete devices, which are not specifically limited in the embodiment of the present application.
  • The embodiments of the present application also provide a computer storage medium, which includes computer instructions; when the computer instructions run on the above-mentioned electronic device, the electronic device is caused to perform the functions or steps in the above method embodiments.
  • The embodiments of the present application also provide a computer program product; when the computer program product runs on a computer, the computer is caused to perform the functions or steps in the foregoing method embodiments.
  • the disclosed device and method may be implemented in other ways.
  • the device embodiments described above are merely illustrative.
  • The division of the modules or units is only a logical function division.
  • In actual implementation, there may be other division manners; for example, multiple units or components may be combined or integrated into another device, or some features may be omitted or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate parts may or may not be physically separate.
  • The parts displayed as units may be one physical unit or multiple physical units, that is, they may be located in one place or distributed across multiple different places. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • each unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
  • the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a readable storage medium.
  • Based on this understanding, the technical solutions of the embodiments of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solutions, can be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for enabling a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application.
  • The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or other media that can store program code.

Abstract

Provided are a media file recommendation method and device, relating to the technical field of terminals, which can be applied to an intelligent voice apparatus and can solve the problem of low accuracy and poor user experience in prior-art recommendation solutions based on the user's voice emotion. The specific solution includes: an electronic apparatus (100) receives the user's voice signal and converts the voice signal into text information; acquires, according to the text information, the user intention and the slot information included in the user intention, where the slot information can include emotion information and timing information; and queries the media file library according to the user intention and the slot information to obtain the media file corresponding to the user intention and the slot information.

Description

一种媒体文件推荐方法及装置Method and device for recommending media files
本申请要求于2019年07月08日提交国家知识产权局、申请号为201910609618.0、申请名称为“一种媒体文件推荐方法及装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。This application claims the priority of a Chinese patent application filed with the State Intellectual Property Office on July 8, 2019, the application number is 201910609618.0, and the application name is "a media file recommendation method and device", the entire content of which is incorporated herein by reference Applying.
技术领域Technical field
本申请涉及终端技术领域,尤其涉及一种媒体文件推荐方法及装置。This application relates to the field of terminal technology, and in particular to a method and device for recommending media files.
背景技术Background technique
随着智能终端设备的应用和普及,智能语音设备在人机交互中,起到越来越重要的角色,而想要使得智能语音设备识别人类语音信息中所表达的情感,并能基于语音情感为用户推荐数据和服务,是如今人工智能研究的重要方向。With the application and popularization of smart terminal devices, smart voice devices play an increasingly important role in human-computer interaction, and it is desirable to make smart voice devices recognize the emotions expressed in human voice information and be based on voice emotions Recommending data and services for users is an important direction of artificial intelligence research today.
目前的基于用户语音情感的推荐方案,是基于梅尔频率倒谱系数(Mel-Frequency Cepstral Coefficients,MFCC)提取情感特征的算法,根据语音信息中的韵律特征和音质特征,提取用户的语音情感特征,根据语音情感特征和情感类型的对应关系查询数据库,为用户推荐相同或相近情感类型的数据或服务。例如,根据用户语音情感为用户推荐一首悲伤的歌曲,一部搞笑的影片等。The current recommendation scheme based on user's voice emotion is an algorithm that extracts emotional features based on Mel-Frequency Cepstral Coefficients (MFCC), and extracts the user's voice emotional features based on the prosodic features and sound quality features in the voice information , Query the database according to the corresponding relationship between voice emotion characteristics and emotion types, and recommend data or services of the same or similar emotion types for users. For example, recommend a sad song, a funny movie, etc. to the user according to the user's voice and emotion.
但是,这种匹配方法只支持粗粒度的情感匹配,也就是基于多媒体文件级别的数据推荐,但是用户想要了解一个多媒体文件最精彩的部分,例如用户输入语音信息“我想要看XXX影片最搞笑的片段”,“想看惊悚的电影片段”时,并不能为用户进行推荐,或者其推荐的准确性较低,用户体验差。However, this matching method only supports coarse-grained emotional matching, which is based on multimedia file-level data recommendation, but the user wants to know the most exciting part of a multimedia file, for example, the user enters the voice message "I want to watch XXX movies the most "Funny clips" and "want to watch scary movie clips" cannot be recommended for users, or the recommendation accuracy is low, and the user experience is poor.
Summary of the invention
This application provides a media file recommendation method and device, which solve the problems of low recommendation accuracy and poor user experience in the existing recommendation solution based on the emotion of a user's voice.
To achieve the foregoing objective, this application adopts the following technical solutions:
According to a first aspect, a media file recommendation method is provided and applied to an electronic device. The method includes: receiving a voice signal and converting the voice signal into text information; obtaining, according to the text information, a user intention and slot information included in the user intention, where the slot information includes emotion information and timing information; and querying a media file library according to the user intention and the slot information to obtain a media file corresponding to the user intention and the slot information.
In the embodiments of this application, the electronic device queries the media file library according to the user intention and slot information contained in the user's voice information, and matches, based on the timing information and the emotion information, the multimedia file that is closest to the user's needs and emotional needs. In this way, the user's emotional needs can be accurately identified, fine-grained data can be intelligently recommended to the user, and the user experience is improved.
In a possible design, the media file library stores a first mapping relationship between multiple user intentions and slot information and multiple media file identifiers, and querying the media file library according to the user intention and the slot information to obtain the media file corresponding to the user intention and the slot information includes: obtaining, according to the first mapping relationship, the media file corresponding to the user intention and the slot information. In the foregoing possible implementation, the electronic device queries the media file library according to the mapping relationship between user intentions and slot information, so that the most suitable media file can be matched and recommended according to the different emotional needs of different users, thereby improving the accuracy and flexibility of intelligent recommendation and the user experience.
In a possible design, before the user intention and the slot information are obtained from the text information, the method further includes: determining whether the text information contains a user intention; if it is determined that the text information does not contain a user intention, obtaining an emotional feature vector of the voice signal by using the Mel-frequency cepstral coefficient (MFCC) algorithm; and querying the media file library according to the emotional feature vector to obtain a media file corresponding to the emotional feature vector, where the media file library stores a second mapping relationship between multiple emotional feature vectors and multiple pieces of emotion information, and each piece of emotion information corresponds to multiple media files. In the foregoing possible implementation, if the user's voice information does not contain a clear user intention, the electronic device can match media files according to the corresponding emotion information based on the voice emotional features extracted from the user's voice information, thereby improving the flexibility of intelligent recommendation and the user experience.
In a possible design, before the voice signal is received, the method further includes: obtaining emotion information from users' comments on multiple media files; determining whether the emotion information is fine-grained emotion information or coarse-grained emotion information; if the emotion information is fine-grained emotion information, obtaining the slots in the fine-grained emotion information and establishing the first mapping relationship in the media file library; and if the emotion information is coarse-grained emotion information, obtaining an emotional feature vector according to the emotion information, obtaining the emotion information of the media file, and establishing the second mapping relationship for the media file. In the foregoing possible implementation, the electronic device can extract emotion information and timing information from massive user comments on multimedia files and establish mapping relationships, thereby generating a multimedia file library that supports accurate intelligent recommendation.
In a possible design, converting the voice signal into text information includes: converting the voice signal into text information through automatic speech recognition (ASR). In the foregoing possible implementation, the electronic device can recognize the text information included in the user's voice information through automatic speech recognition technology, thereby improving the accuracy of intelligent recommendation.
In a possible design, obtaining the user intention from the text information includes: obtaining the user intention from the text information through natural language understanding (NLU) technology. The electronic device can recognize the user intention included in the user's voice information through natural language understanding technology and match recommendations according to the user intention, thereby improving the accuracy of intelligent recommendation.
According to a second aspect, an electronic device is provided. The electronic device includes a processor and a memory connected to the processor. The memory is configured to store instructions, and when the instructions are executed by the processor, the electronic device is enabled to perform: receiving a voice signal and converting the voice signal into text information; obtaining, according to the text information, a user intention and slot information included in the user intention, where the slot information includes emotion information and timing information; and querying a media file library according to the user intention and the slot information to obtain a media file corresponding to the user intention and the slot information.
In a possible design, the media file library stores a first mapping relationship between multiple user intentions and slot information and multiple media file identifiers, and the electronic device is specifically configured to perform: obtaining, according to the first mapping relationship, the media file corresponding to the user intention and the slot information.
In a possible design, the electronic device is further configured to perform: determining whether the text information contains a user intention; if it is determined that the text information does not contain a user intention, obtaining an emotional feature vector of the voice signal by using the Mel-frequency cepstral coefficient (MFCC) algorithm; and querying the media file library according to the emotional feature vector to obtain a media file corresponding to the emotional feature vector, where the media file library stores a second mapping relationship between multiple emotional feature vectors and multiple emotion tags.
In a possible design, the electronic device is further configured to perform: obtaining emotion information from users' comments on multiple media files; determining whether the emotion information is fine-grained emotion information or coarse-grained emotion information; if the emotion information is fine-grained emotion information, obtaining the slots in the fine-grained emotion information and establishing the first mapping relationship in the media file library; and if the emotion information is coarse-grained emotion information, obtaining an emotional feature vector according to the emotion tag, obtaining the emotion tag of the media file, and establishing the second mapping relationship for the media file.
In a possible design, converting the voice signal into text information includes: converting the voice signal into text information through automatic speech recognition (ASR).
In a possible design, obtaining the user intention from the text information includes: obtaining the user intention from the text information through natural language understanding (NLU) technology.
According to a third aspect, a chip system is provided and applied to an electronic device. The chip system includes one or more interface circuits and one or more processors, where the interface circuits and the processors are interconnected by lines. The interface circuit is configured to receive a signal from a memory of the electronic device and send the signal to the processor, where the signal includes computer instructions stored in the memory. When the processor executes the computer instructions, the electronic device performs the method according to the first aspect and any one of its possible designs.
According to a fourth aspect, a readable storage medium is provided. The readable storage medium stores instructions, and when the instructions are run on an electronic device, the electronic device is enabled to perform the method according to the first aspect and any one of its possible designs.
According to a fifth aspect, a computer program product is provided. When the computer program product runs on a computer, the computer is enabled to perform the method according to the first aspect and any one of its possible designs.
It can be understood that any of the electronic devices, systems, readable storage media, and computer program products for media file recommendation provided above are used to perform the corresponding methods provided above. Therefore, for the beneficial effects that they can achieve, refer to the beneficial effects of the corresponding methods provided above. Details are not repeated here.
Description of the drawings
FIG. 1 is a schematic diagram of the hardware structure of an electronic device according to an embodiment of this application;
FIG. 2 is a software system architecture diagram of an electronic device according to an embodiment of this application;
FIG. 3 is a schematic flowchart of a media file recommendation method according to an embodiment of this application;
FIG. 4 is a schematic flowchart of extracting an emotional feature vector according to an embodiment of this application;
FIG. 5 is a schematic flowchart of establishing a media file library in a media file recommendation method according to an embodiment of this application;
FIG. 6 is a schematic structural diagram of a chip system according to an embodiment of this application.
Detailed description of embodiments
Before the method embodiments of this application are introduced, the technologies involved in the embodiments of this application are described as follows:
Intelligent voice device: an electronic device that can receive a user's voice information, output voice information, and interact with the user by voice.
Automatic speech recognition (ASR) technology: a technology that converts human voice information into text information. Its goal is to enable a computer to "dictate" continuous speech spoken by different people, which is commonly known as a "speech dictation machine"; it is a technology that converts "sound" into "text".
Natural language understanding (NLU) technology: a technology that recognizes the text content and intention in human natural language, that is, a technology that enables a computer to "understand" natural language and thus communicate with humans in natural language, realizing natural-language communication between humans and machines. It covers a very wide range of fields, including sentence detection, word segmentation, part-of-speech tagging, syntactic analysis, text classification/clustering, text perspective, information extraction/automatic summarization, machine translation, automatic question answering, text generation, and other fields.
Slot: a concept in human-machine dialogue. A slot is the definition of a piece of key information identified in the user's voice information, that is, information required to transform the user intention into a clear user instruction. One slot corresponds to one piece of information that needs to be obtained in the handling of one matter.
The embodiments of this application provide a media file recommendation method. The method can be applied to an electronic device including an intelligent voice apparatus, such as a voice assistant, a smart speaker, a smartphone, a tablet computer, a computer, a wearable electronic device, or an intelligent robot. Through this method, the electronic device can intelligently recognize the emotion and recommendation needs expressed in the user's voice information and recommend fine-grained data to the user, for example, segment-level media files, to improve the accuracy of data recommendation and thereby improve the user experience.
The following describes the implementations of the embodiments of this application in detail with reference to the accompanying drawings. FIG. 1 is a schematic diagram of a possible structure of an electronic device 100 according to an embodiment of this application. As shown in FIG. 1, the electronic device 100 may include: a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, a headset jack 170D, a sensor module 180, a button 190, a motor 191, an indicator 192, a camera 193, a display 194, a subscriber identification module (SIM) card interface 195, and the like.
The sensor module 180 may include sensors such as a pressure sensor 180A, a gyroscope sensor 180B, a barometric pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, and a bone conduction sensor 180M.
It can be understood that the structure illustrated in this embodiment does not constitute a specific limitation on the electronic device 100. In other embodiments, the electronic device 100 may include more or fewer components than shown, or some components may be combined, or some components may be split, or the components may be arranged differently. The illustrated components may be implemented by hardware, software, or a combination of software and hardware.
The processor 110 may include one or more processing units. For example, the processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU). The different processing units may be independent devices or may be integrated into one or more processors.
The controller may be the nerve center and command center of the electronic device 100. The controller may generate an operation control signal according to an instruction operation code and a timing signal to complete the control of fetching and executing instructions.
A memory may also be provided in the processor 110 to store instructions and data. In some embodiments, the memory in the processor 110 is a cache. The memory may store instructions or data that the processor 110 has just used or uses cyclically. If the processor 110 needs to use the instructions or data again, they can be directly called from the memory. This avoids repeated access, reduces the waiting time of the processor 110, and therefore improves system efficiency.
In some embodiments, the processor 110 may include one or more interfaces. The interfaces may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (SIM) interface, a universal serial bus (USB) interface, and/or the like.
It can be understood that the interface connection relationship between the modules illustrated in this embodiment is merely a schematic description and does not constitute a structural limitation on the electronic device 100. In other embodiments, the electronic device 100 may alternatively use an interface connection manner different from that in the foregoing embodiments, or a combination of multiple interface connection manners.
The external memory interface 120 may be used to connect an external memory card, for example, a Micro SD card, to expand the storage capacity of the electronic device 100. The external memory card communicates with the processor 110 through the external memory interface 120 to implement a data storage function, for example, saving files such as music and videos in the external memory card.
The internal memory 121 may be used to store computer-executable program code, where the executable program code includes instructions. The processor 110 runs the instructions stored in the internal memory 121 to perform various functional applications and data processing of the electronic device 100. For example, in the embodiments of this application, the processor 110 may execute the instructions stored in the internal memory 121, and the internal memory 121 may include a program storage area and a data storage area.
The program storage area may store an operating system and an application program required by at least one function (for example, a sound playback function or an image playback function). The data storage area may store data (for example, audio data and a phone book) created during use of the electronic device 100. In addition, the internal memory 121 may include a high-speed random access memory, and may further include a non-volatile memory, for example, at least one magnetic disk storage device, a flash memory device, or a universal flash storage (UFS).
The electronic device 100 may implement audio functions, for example, music playback and recording, through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the headset jack 170D, the application processor, and the like.
The audio module 170 is configured to convert digital audio information into an analog audio signal for output, and is also configured to convert an analog audio input into a digital audio signal. The audio module 170 may also be configured to encode and decode audio signals. In some embodiments, the audio module 170 may be disposed in the processor 110, or some functional modules of the audio module 170 may be disposed in the processor 110.
The speaker 170A, also called a "loudspeaker", is configured to convert an audio electrical signal into a sound signal. The electronic device 100 can play music or answer a hands-free call through the speaker 170A.
The receiver 170B, also called an "earpiece", is configured to convert an audio electrical signal into a sound signal. When the electronic device 100 answers a call or receives a voice message, the voice can be heard by placing the receiver 170B close to the ear.
The microphone 170C, also called a "mike" or "mic", is configured to convert a sound signal into an electrical signal. When making a call or sending a voice message, the user can speak with the mouth close to the microphone 170C to input a sound signal into the microphone 170C. The electronic device 100 may be provided with at least one microphone 170C. In other embodiments, the electronic device 100 may be provided with two microphones 170C, which can implement a noise reduction function in addition to collecting sound signals. In still other embodiments, the electronic device 100 may be provided with three, four, or more microphones 170C to collect sound signals, reduce noise, identify sound sources, implement directional recording, and so on.
The software system of the electronic device 100 may use a layered architecture, an event-driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture. In the embodiments of the present invention, the Android system with a layered architecture is used as an example to illustrate the software structure of the electronic device 100.
FIG. 2 is a block diagram of the software structure of the electronic device 100 according to an embodiment of the present invention.
The layered architecture divides the software into several layers, and each layer has a clear role and division of labor. The layers communicate with each other through software interfaces. In some embodiments, the Android system is divided into four layers, which are, from top to bottom, the application layer, the application framework layer, the Android runtime and system library, and the kernel layer.
The application layer may include a series of application packages.
As shown in FIG. 2, the application packages may include applications such as Camera, Gallery, Calendar, Phone, Maps, Navigation, WLAN, Bluetooth, Music, Video, Messages, and Voice Assistant.
The application framework layer provides an application programming interface (API) and a programming framework for the applications in the application layer. The application framework layer includes some predefined functions.
As shown in FIG. 2, the application framework layer may include a window manager, a content provider, a view system, a phone manager, a resource manager, a notification manager, and the like.
The window manager is used to manage window programs. The window manager can obtain the display size, determine whether there is a status bar, lock the screen, capture the screen, and so on.
The content provider is used to store and retrieve data and make the data accessible to applications. The data may include videos, images, audio, calls made and received, browsing history and bookmarks, a phone book, and so on.
The view system includes visual controls, such as controls for displaying text and controls for displaying pictures. The view system can be used to build applications. A display interface may consist of one or more views. For example, a display interface including an SMS notification icon may include a view for displaying text and a view for displaying pictures.
The phone manager is used to provide the communication functions of the electronic device 100, for example, management of call states (including connected, hung up, and so on).
The resource manager provides various resources for applications, such as localized strings, icons, pictures, layout files, and video files.
The notification manager enables an application to display notification information in the status bar. It can be used to convey notification-type messages and can automatically disappear after a short stay without user interaction. For example, the notification manager is used to notify download completion, provide message reminders, and so on. The notification manager may also present notifications in the status bar at the top of the system in the form of a chart or scroll-bar text, for example, notifications of applications running in the background, or present notifications on the screen in the form of a dialog window. For example, text information is prompted in the status bar, a prompt tone is played, the electronic device vibrates, or the indicator light blinks.
The Android runtime includes core libraries and a virtual machine. The Android runtime is responsible for the scheduling and management of the Android system.
The core libraries consist of two parts: one part is the functions that the Java language needs to call, and the other part is the core libraries of Android.
The application layer and the application framework layer run in the virtual machine. The virtual machine executes the Java files of the application layer and the application framework layer as binary files. The virtual machine is used to perform functions such as object lifecycle management, stack management, thread management, security and exception management, and garbage collection.
The system library may include multiple functional modules, for example, a surface manager, media libraries, a three-dimensional graphics processing library (for example, OpenGL ES), and a 2D graphics engine (for example, SGL).
The surface manager is used to manage the display subsystem and provides the fusion of 2D and 3D layers for multiple applications.
The media libraries support playback and recording of a variety of commonly used audio and video formats, as well as still image files. The media libraries can support multiple audio and video encoding formats, such as MPEG-4, H.264, MP3, AAC, AMR, JPG, and PNG.
The three-dimensional graphics processing library is used to implement three-dimensional graphics drawing, image rendering, composition, layer processing, and so on.
The 2D graphics engine is a drawing engine for 2D drawing.
The kernel layer is the layer between hardware and software. The kernel layer includes at least a display driver, a camera driver, an audio driver, and a sensor driver.
An embodiment of this application provides a media file recommendation method. As shown in FIG. 3, the method may include steps 301 to 303:
301: The electronic device receives a voice signal and converts the voice signal into text information.
That is, the voice signal produced when the user speaks is received, and ASR technology is used to convert the voice signal into corresponding text information.
The process in which ASR technology converts voice information into text information may include: voice signal preprocessing and feature extraction; acoustic model and pattern matching; and language model and language processing. First, one of words (sentences), syllables, or phonemes is selected as the speech recognition unit, and voice features are extracted from the voice information. Then, the extracted voice features are matched against and compared with a pre-established acoustic model (pattern) to obtain the best recognition result. Finally, matching is performed through a language model, that is, against a grammar network formed from recognized voice commands or a language model built with statistical methods, and language processing such as syntactic and semantic analysis is performed, thereby generating the text information corresponding to the voice information.
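As an illustration only, the following minimal Python sketch mirrors the three stages described above (feature extraction, acoustic-model matching, and language-model processing). The toy feature, template, and bigram structures are hypothetical stand-ins introduced for this sketch and are not the ASR models actually used by the electronic device.
```python
import numpy as np

def extract_features(signal, frame_len=400, hop=160):
    """Stage 1 (stand-in): frame the signal and compute per-frame log energy instead of full features."""
    frames = [signal[i:i + frame_len] for i in range(0, len(signal) - frame_len, hop)]
    return np.array([np.log(np.sum(f ** 2) + 1e-10) for f in frames])

def acoustic_match(features, unit_templates):
    """Stage 2 (stand-in): map each frame to the recognition unit whose template value is closest."""
    return [min(unit_templates, key=lambda u: abs(unit_templates[u] - e)) for e in features]

def language_process(units, bigram_lm):
    """Stage 3 (stand-in): keep only transitions the toy bigram model allows and join them into text."""
    kept = units[:1]
    for u in units[1:]:
        if u != kept[-1] and u in bigram_lm.get(kept[-1], set()):
            kept.append(u)
    return " ".join(kept)

# Toy usage with placeholder audio and hypothetical models.
audio = np.random.randn(16000)                       # 1 s of placeholder 16 kHz audio
templates = {"i": 4.0, "want": 5.5, "music": 7.0}    # hypothetical acoustic templates
lm = {"i": {"want"}, "want": {"music"}}              # hypothetical bigram constraints
print(language_process(acoustic_match(extract_features(audio), templates), lm))
```
In practice each stage would be backed by trained acoustic and language models; the sketch only shows how the stages pass results to one another.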
For example, the electronic device converts a received piece of user voice audio into the text information "I want to listen to a sad song".
With reference to the foregoing architecture of the electronic device 100, in the above method, the electronic device 100 may obtain the user's voice signal through the microphone 170C and send the voice signal to the processor 110 for processing. The audio module 170 in the processor 110 may process the voice signal. Specifically, the system may instruct, through a command, the voice assistant program at the application layer to call the related programs of the application framework layer and the related functions of the core libraries to process the voice signal and convert it into text information.
302: Obtain, according to the text information, the user intention and the slot information included in the user intention.
The user intention is the user's need, that is, information indicating what task the user wants the electronic device to complete.
The user intention may be an intention keyword contained in the foregoing text information, where the intention keyword can be used to classify the user's need into a certain type. In the embodiments related to this application, the intention keywords may include: a media data attribute, an emotional intention, a media data file name, a keyword related to media data, and the like. Specifically, a media data attribute, such as music, movie, variety show, drama, fine art, literary work, or photo, classifies the user's need by media data type. An emotional intention, for example, happy, sad, or scary, classifies the user's need according to the predefined emotion types of the media data. A media data file name, for example, "Jane Eyre" or "The King of Comedy", identifies the specific media data the user requires. A keyword related to media data, for example, a clip of a certain character in a film, a clip of a specific plot point in a film, a segment of a certain rhythm in a piece of music, or the description of a certain plot point in a literary work, can locate the user's need.
The slot information may include timing information and emotion information. The timing information may be a label annotating part of the content of a media file, and may also be referred to as a timing tag. It corresponds to a part of the content of the media file and may be a specific moment of the media file or a time segment, for example, the 12:05 mark of a movie, or the second to third minute of a piece of music.
The emotion information may be a label annotating the emotion type of a media file, and may also be referred to as an emotion tag. It may specifically include: beautiful, happy, sad, scary, cheerful, exciting, and so on.
There is a correspondence between the emotion tag and the timing tag. For example, the 12:05 mark of a certain movie is beautiful, and the second to third minute of a certain piece of music is cheerful.
For example, if the text information obtained in step 301 is "I want to listen to a sad song", the extracted intention keyword is "song" and the extracted slot information is "sad". For another example, if the text information obtained in step 301 is "I want to watch a funny scene in Charlotte's Troubles", the extracted intention keyword is the movie "Charlotte's Troubles", the timing tag in the extracted slot information is "a scene", and the emotion tag in the extracted slot information is "funny".
Specifically, the user intention in the text information can be obtained through NLU technology, which may be implemented by recognizing all the characters and words included in the text information through deep learning technology and neural network algorithms, performing semantic understanding of the text, and determining the user intention. The specific implementation of this technology is not described in detail in the embodiments of this application.
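Purely as an illustration of the extraction in step 302, the following rule-based sketch pulls an intention keyword and the emotion and timing slots out of recognized text. The keyword tables and the function parse_request are hypothetical examples introduced for this sketch; a deployed system would rely on the NLU model described above.
```python
import re

# Hypothetical keyword tables; a deployed system would use a trained NLU model instead.
MEDIA_TYPES = {"song": "music", "movie": "movie", "clip": "movie"}
EMOTION_WORDS = {"sad", "funny", "scary", "touching", "cheerful"}
TIMING_WORDS = {"clip", "scene", "segment", "episode"}

def parse_request(text: str) -> dict:
    """Return the first matching intention keyword and the emotion/timing slot values found in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    intent = next((MEDIA_TYPES[t] for t in tokens if t in MEDIA_TYPES), None)
    emotion = next((t for t in tokens if t in EMOTION_WORDS), None)
    timing = next((t for t in tokens if t in TIMING_WORDS), None)
    return {"intent": intent, "slots": {"emotion": emotion, "timing": timing}}

# e.g. parse_request("I want to watch a funny clip")
# -> {"intent": "movie", "slots": {"emotion": "funny", "timing": "clip"}}
```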
Further, if no explicit emotional need of the user, for example, at least one of the aforementioned intention keyword, emotion tag, or timing information, is extracted through the foregoing process, the electronic device extracts the user's voice emotional features by using the MFCC algorithm and uses the emotion tag matched by these features as the user's intended emotional need. Specifically, the obtained user voice information is processed by the MFCC algorithm to obtain an MFCC feature vector representing the emotional characteristics of the user's voice. The MFCC feature vector can then be matched against preset pairs of MFCC feature vectors and emotion tags, and the matched emotion tag is used as the user intention for the following steps.
The specific process in which the MFCC algorithm extracts the user's voice emotional features may be as shown in FIG. 4 and includes the following processing steps: analog-to-digital conversion, pre-emphasis, framing and windowing, Fourier transform, Mel filtering, cepstrum computation, and energy and differential computation, thereby generating the MFCC feature vector.
First, analog-to-digital conversion converts the input analog signal into a digital signal. Pre-emphasis passes the digital signal through a high-pass filter to boost the high-frequency part, so that the frequency spectrum of the signal becomes flat and remains so across the entire band from low to high frequencies, and the spectrum can be obtained with the same signal-to-noise ratio. At the same time, pre-emphasis compensates for the effects of the vocal cords and lips during speech production and for the high-frequency part of the voice signal that is suppressed by the articulatory system, and it highlights the high-frequency formants.
Framing and windowing gathers N sampling points into one observation unit, called a frame. Each frame is multiplied by a Hamming window (the Hamming window specifies one period of a signal) to increase the continuity between the left and right ends of the frame.
Because the characteristics of a signal are usually difficult to see from its representation in the time domain, the signal is usually converted into an energy distribution in the frequency domain for observation; different energy distributions represent the characteristics of different voices. Therefore, after multiplication by the Hamming window, each frame must further undergo a fast Fourier transform to obtain its energy distribution over the spectrum. A fast Fourier transform is performed on each framed and windowed signal to obtain the spectrum of each frame, and the modulus of the spectrum of the voice signal is squared to obtain the power spectrum of the voice signal.
Mel filtering smooths the spectrum, eliminates the effect of harmonics, and highlights the formants of the original voice. Therefore, the tone or pitch of a piece of speech is not reflected in the MFCC parameters; in other words, a speech recognition system based on MFCC is not affected by differences in the pitch of the input speech. In addition, the amount of computation can be reduced.
Cepstrum processing applies an inverse Fourier transform to the logarithm of the Fourier transform spectrum of the signal. In this step, the logarithmic energy output by each filter bank can be obtained.
As for energy and differential processing, the standard cepstral parameters (MFCC) reflect only the static characteristics of the speech parameters, and the dynamic characteristics of the speech can be described by the differential spectrum of these static features. Experiments show that combining dynamic and static features can effectively improve the recognition performance of the system.
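The steps above can be summarized in the following illustrative NumPy/SciPy sketch. The frame length, hop size, FFT size, and filter-bank parameters are typical values assumed for the sketch rather than values mandated by this application.
```python
import numpy as np
from scipy.fftpack import dct

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_fft=512, n_mels=26, n_ceps=13):
    """Illustrative MFCC extraction following the steps above; parameters are typical, not mandated."""
    # Pre-emphasis: first-order high-pass filter to boost the high-frequency part.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Framing and Hamming windowing.
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    frames = np.stack([emphasized[i * hop: i * hop + frame_len] for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    # Fast Fourier transform and power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2 / n_fft
    # Triangular Mel filter bank.
    mel_max = 2595 * np.log10(1 + (sr / 2) / 700)
    hz = 700 * (10 ** (np.linspace(0, mel_max, n_mels + 2) / 2595) - 1)
    bins = np.floor((n_fft + 1) * hz / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # Log energy of each filter output, then DCT to obtain the cepstral coefficients.
    ceps = dct(np.log(power @ fbank.T + 1e-10), type=2, axis=1, norm="ortho")[:, :n_ceps]
    # First-order difference (delta) captures the dynamic characteristics.
    delta = np.diff(ceps, axis=0, prepend=ceps[:1])
    return np.hstack([ceps, delta])

# e.g. features = mfcc(np.random.randn(16000))  # shape: (number of frames, 2 * n_ceps)
```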
Further, in another possible embodiment, the foregoing user intention may also include other keywords indicating the user's recommendation needs, for example, information indicating a certain plot point, information indicating a certain line of dialogue, information indicating a certain actor, or text indicating a certain scene. For example, if the voice information obtained by the voice device from the user is "I want to watch the clip in The King of Comedy where the heroine cries", the extracted need features include the keyword "the heroine cries", the explicit media data information is the movie "The King of Comedy", and the timing information is "clip". The electronic device can then use the foregoing keyword as the emotional need for the query.
With reference to the foregoing software architecture of the electronic device, in the above method, the system may instruct, through a command, the voice assistant program at the application layer to call the related programs of the application framework layer and the related functions of the core libraries to recognize and process the text information, and to extract the intention and slot information it contains according to a certain algorithm.
303: The electronic device queries the media file library according to the user intention and the slot information to obtain the media file corresponding to the user intention and the slot information.
Data is queried from the media file library according to the extracted user intention and slot information, where the media file library is pre-established by the electronic device or obtained by the electronic device through a cloud service. The cloud service may be provided by a cloud device, specifically a server, capable of providing data processing and data storage services to the electronic device.
The media file library stores a first mapping relationship between multiple user intentions and slot information and multiple media file identifiers, and may specifically be a fine-grained media file library containing emotion tags and timing tags. For example, it includes massive media files such as music, movies, dramas, fine art, literary works, and photos. Each media file may contain a macro tag, for example, music, movie, drama, fine art, literary work, or photo, and may also include a specific multimedia file name, for example, "The King of Comedy"; the macro tag may correspond to the user's intention keyword. Each multimedia file may further contain emotion tags, for example, happy, sad, or scary. In addition, each multimedia file may further include at least one piece of timing information, for example, the third minute to the fourth minute, or the last ten minutes. The slot information representing the emotional intention may form a correspondence with the slot information representing the timing; for example, the emotion tag corresponding to the third minute to the fourth minute of a movie is sad, and the emotion tag corresponding to the last ten minutes of the movie is joyful.
The specific process of establishing the media file library will be described in detail below and is not repeated here.
The media file library is queried according to the intention and the slots to obtain the media file corresponding to the intention and the slots, that is, the media file corresponding to the intention and the slots is obtained according to the first mapping relationship. In other words, the emotion tag, timing information, macro tag, and so on obtained in the foregoing steps are used to query the media file library, and the multimedia data with the highest degree of match to the emotional need is used as the matching data.
The specific matching process may be as follows: first, the corresponding database is found according to the macro tag, for example, movies, or the movie "The King of Comedy"; then the media file library is queried, according to the emotion tag, for the movie segments of that movie corresponding to the emotion tag, and the timing tag closest to the emotion tag is matched. The media data segment corresponding to that timing tag is the media file that the electronic device matches for the user.
Further, the emotion tag may also include a corresponding recommendation value, which may be a quantified emotion value of the data segment corresponding to the emotion tag of the multimedia data and can be used for matching-degree calculation. For example, the recommendation value may be represented by the amount of data annotated with the emotion tag, the search volume of the data, or user ratings. A higher recommendation value indicates a higher matching degree of the emotion tag, and a lower recommendation value indicates a lower matching degree. For example, if the emotion tags of multiple time segments of a certain film are all "funny", then, when matching the funniest time segment for the user, the determination can be made according to the recommendation values corresponding to the emotion tags of the multiple time segments, and the time segment with the highest recommendation value is used as the data with the highest matching degree.
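A minimal sketch of how such a first mapping relationship and recommendation values might be represented and queried is shown below. The segment entries and scores are invented placeholders for illustration, not data from an actual media file library.
```python
from dataclasses import dataclass

@dataclass
class Segment:
    media_id: str       # media file identifier
    start_s: int        # timing slot: segment start, in seconds
    end_s: int          # timing slot: segment end, in seconds
    emotion: str        # emotion tag of this segment
    score: float        # recommendation value (e.g. annotation count, search volume, or rating)

# Hypothetical first mapping relationship: (macro tag or file name, emotion tag) -> candidate segments.
LIBRARY = {
    ("The King of Comedy", "touching"): [
        Segment("king_of_comedy", 2400, 3000, "touching", 9.1),
        Segment("king_of_comedy", 1200, 1320, "touching", 7.4),
    ],
}

def recommend(intent: str, emotion: str):
    """Return the matching segment with the highest recommendation value, if any."""
    candidates = LIBRARY.get((intent, emotion), [])
    return max(candidates, key=lambda s: s.score, default=None)

# recommend("The King of Comedy", "touching") -> the minute-40-to-50 segment (score 9.1)
```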
The media file library may be a file library with fine-grained multimedia tags, which may include not only tags related to emotion types but also other tags for possible user recommendation needs, for example, a clip of a certain character in a film, a clip of a specific plot point in a film, a segment of a certain rhythm in a piece of music, or the description of a certain plot point in a literary work.
The multimedia data with the highest matching degree described above is recommended to the user, either by sending voice information or by directly sending the multimedia data to the user.
For example, the electronic device obtains the user's voice information "I want to watch the most touching scene in The King of Comedy", processes the voice information, and recognizes the following emotional need: the emotion tag is "most touching", the slot information representing timing is "a scene", and the extracted multimedia file name is the movie "The King of Comedy". The electronic device queries the multimedia file library according to this emotional need for time segments of the movie "The King of Comedy" whose emotion tags are touching, sad, tear-jerking, or the like, and selects from the query results the data with the highest recommendation value, for example, the plot segment from minute 40 to minute 50 of the movie. The electronic device then uses this multimedia data as the matching data and recommends it to the user.
In another possible embodiment, the user's voice information obtained by the electronic device contains only a macro tag. For example, if the user's voice information obtained by the electronic device is "I want to watch a movie", the macro tag is "movie", and the electronic device queries the media file library according to the macro tag and, after matching movies, recommends those with higher recommendation values to the user.
In another possible embodiment, the user's voice information obtained by the electronic device contains only a macro tag and an emotion tag. For example, if the user's voice information obtained by the electronic device is "I want to watch a horror movie", the macro tag is "movie" and the emotion tag is "scary", and the electronic device queries the media file library according to the macro tag and the emotion tag and, after matching horror movies, selects those with higher recommendation values to recommend to the user.
In another possible embodiment, if the electronic device cannot make a judgment and recommendation based on the semantics expressed in the user's voice, that is, it is determined that the text information does not contain a user intention, the emotional feature vector corresponding to the voice signal is obtained through the Mel-frequency cepstral coefficient (MFCC) algorithm. The media file library is queried according to the emotional feature vector, and a media file under the emotion tag corresponding to a similar emotional feature is selected from the media file library for recommendation. The media file library stores a second mapping relationship between multiple emotional feature vectors and multiple emotion tags, each emotion tag corresponds to at least one media file, and any media file matched by that emotion tag can be recommended to the user as fallback recommendation data.
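The fallback matching against the second mapping relationship could, for example, be sketched as a nearest-neighbor lookup over stored emotion feature vectors, as below. The reference vectors and file lists are hypothetical placeholders, and cosine similarity is only one possible similarity measure, not necessarily the one used in practice.
```python
import numpy as np

# Hypothetical second mapping relationship: emotion tag -> reference emotional feature vector,
# plus emotion tag -> candidate media files under that tag.
EMOTION_VECTORS = {"sad": np.array([0.1, 0.8, 0.2]), "happy": np.array([0.9, 0.2, 0.4])}
EMOTION_FILES = {"sad": ["sad_song_01"], "happy": ["upbeat_song_07"]}

def fallback_recommend(feature_vec: np.ndarray) -> list:
    """Pick the emotion tag whose reference vector is most similar and return its candidate files."""
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-10)
    best_tag = max(EMOTION_VECTORS, key=lambda t: cos(feature_vec, EMOTION_VECTORS[t]))
    return EMOTION_FILES[best_tag]

# e.g. fallback_recommend(np.array([0.2, 0.7, 0.1]))  # -> ["sad_song_01"]
```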
结合前述的电子设备的软件架构,上述方法可以为系统通过语音助手程序,调用应用程序框架层的相关程序和核心库的相关函数,根据提取到的用户意图和槽位,通过一定的匹配算法,得到对应的媒体文件。Combined with the aforementioned software architecture of the electronic device, the above method can call the related programs of the application framework layer and related functions of the core library through the voice assistant program, and use a certain matching algorithm according to the extracted user intentions and slots. Get the corresponding media file.
进一步的,在上述实施例的303中,电子设备建立多媒体文件库的过程可以为如图5,包括:Further, in 303 of the foregoing embodiment, the process of establishing the multimedia file library by the electronic device may be as shown in FIG. 5, including:
501:获取多媒体评论信息。501: Obtain multimedia comment information.
获取海量的关于多媒体数据的评论信息,可以为用户评论多个媒体文件的情感信息。可以通过互联网上的各种渠道,例如,论坛、贴吧、新闻和应用程序等各类网站的用户评论中关于多媒体文件的评论,还可以包括影视频网站的用户评论区、弹幕留言区等。具体可以为利用网络爬虫技术从互联网抓取评论,根据评论提取模型,获取到海量的关于多媒体文件的评论信息,例如关于音乐、电影、戏剧、美术、文学作品或者图像等的评论信息。Obtain a large amount of comment information on multimedia data, and can comment on emotional information of multiple media files for users. It can be through various channels on the Internet, for example, comments on multimedia files in user comments on various websites such as forums, post bars, news, and applications, and can also include user comment areas and bullet screen message areas on movie and video websites. Specifically, it can use web crawler technology to grab comments from the Internet, and obtain a large amount of comment information about multimedia files according to the comment extraction model, such as comment information about music, movies, dramas, fine arts, literary works, or images.
确定情感信息为细粒度情感信息或是粗粒度情感信息。其中,粗粒度情感信息表示该评论信息可以为宏观的,例如,“这个音乐很伤感”;也可以为关于时序片段级的评论信息即为细粒度情感信息,例如,“影片基调是喜剧,但是最后15分钟还是很煽情”,“影片整体比较平淡,但是影片30-40分钟的剧情好惊悚啊”。Determine whether the emotional information is fine-grained emotional information or coarse-grained emotional information. Among them, the coarse-grained emotional information indicates that the review information can be macroscopic, for example, "This music is very sad"; it can also be the review information about the sequence segment level, that is, the fine-grained emotional information, for example, "The keynote of the movie is comedy, but The last 15 minutes is still very sensational", "The overall film is relatively plain, but the plot of the film is terrifying for 30-40 minutes."
502: If the emotion information is fine-grained emotion information, establish the first mapping relationship of the media file according to the obtained slot information; if the emotion information is coarse-grained emotion information, obtain an emotion feature vector according to the emotion tag, and establish the second mapping relationship of the media file according to the emotion tag and the emotion feature vector.
Specifically, the obtained multimedia comment information is annotated, which may be done manually or by a rule-matching algorithm, so as to obtain the comment keywords in the multimedia comment information and the correspondences among them, for example, emotion tags, time-sequence tags, macro tags or other keywords indicating a recommendation need.
First, it is determined whether the multimedia comment information is fine-grained emotion comment information or coarse-grained emotion comment information. If it is fine-grained emotion comment information, the slots in the fine-grained emotion comment information are obtained, and the first mapping relationship between the multimedia file and the slot information is established in the media file library; the slots may specifically include time-sequence tags, emotion tags and the like.
If the emotion information is coarse-grained emotion comment information, an emotion feature vector is obtained according to the emotion tag, the emotion tag of the media file is obtained, and the second mapping relationship of the media file is established.
For example, the correspondence between the time-sequence tags and the emotion tags of a certain movie is stored in association with that multimedia data. For instance, one pair of time-sequence information and emotion tags may be: the emotion tag corresponding to the third to fourth minute is "sad", and the emotion tag corresponding to the last ten minutes is "happy".
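A possible in-memory shape for such a first mapping, filled with the example above, is sketched below; the identifier movie_42 and the segment encoding are assumptions made for illustration, not the storage schema of the embodiment:

```python
# Hypothetical first mapping: media file identifier -> time-segment/emotion-tag pairs.
first_mapping = {
    "movie_42": [
        {"start_min": 3, "end_min": 4, "emotion_tag": "sad"},
        {"last_minutes": 10, "emotion_tag": "happy"},
    ],
}

def segments_with_tag(media_id: str, emotion_tag: str) -> list:
    """Look up the time segments of a media file that carry a given emotion tag."""
    return [seg for seg in first_mapping.get(media_id, [])
            if seg["emotion_tag"] == emotion_tag]

print(segments_with_tag("movie_42", "sad"))
# [{'start_min': 3, 'end_min': 4, 'emotion_tag': 'sad'}]
```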
Further, if no emotion information associated with time-sequence information is captured, the emotion tags or macro tags about the multimedia on each platform, for example comedy, tragedy, fast song, sad song and the like, are directly saved to the media file library and can be recommended as fallback data.
A multimedia file library is established according to the multimedia files and their mapping relationships.
Data storage is performed according to the generated massive first mapping relationships and second mapping relationships of multimedia files, and the multimedia file library is established, so that information query and data matching can be performed on the multimedia file library according to the emotional needs in the user's voice information.
Further, the multimedia file library may be continuously updated: new comment information about multimedia files is constantly obtained, and the emotion tags about time-sequence segments included in the comment information are extracted, so that the resources of the multimedia file library become more fine-grained and richer. For example, a certain movie may be refined down to mapping relationships such as an emotion tag and a plot description for every frame of the movie, so that time-sequence segments of multimedia files can be matched and recommended to users more accurately.
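As a purely illustrative sketch of such an incremental update, the snippet below mines a segment-level emotion tag from a new comment with a toy regular-expression rule and appends it to the hypothetical first_mapping structure used above; a production system would instead rely on its comment extraction model:

```python
import re

def update_library(first_mapping: dict, media_id: str, comments: list) -> None:
    """Append segment-level emotion tags mined from new comments (toy rule only)."""
    for comment in comments:
        m = re.search(r"last\s+(\d+)\s+minutes?\s+(?:are|is|feel)?\s*(\w+)",
                      comment, flags=re.IGNORECASE)
        if m:
            first_mapping.setdefault(media_id, []).append(
                {"last_minutes": int(m.group(1)), "emotion_tag": m.group(2).lower()}
            )

library = {}
update_library(library, "movie_42", ["The last 15 minutes are thrilling"])
print(library)  # {'movie_42': [{'last_minutes': 15, 'emotion_tag': 'thrilling'}]}
```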
Other embodiments of the present application provide an electronic device, which may include a memory and one or more processors, the memory being coupled to the processor. The memory is configured to store computer program code, and the computer program code includes computer instructions. When the processor executes the computer instructions, the electronic device can perform the functions or steps in the foregoing method embodiments.
An embodiment of the present application further provides a chip system. As shown in FIG. 6, the chip system includes at least one processor 601 and at least one interface circuit 602. The processor 601 and the interface circuit 602 may be interconnected by wires. For example, the interface circuit 602 may be configured to receive signals from another apparatus (for example, the memory of the electronic device). For another example, the interface circuit 602 may be configured to send signals to another apparatus (for example, the processor 601). Exemplarily, the interface circuit 602 may read instructions stored in the memory and send the instructions to the processor 601. When the instructions are executed by the processor 601, the electronic device may be caused to perform the functions or steps performed by the electronic device in the foregoing embodiments. Of course, the chip system may further include other discrete devices, which is not specifically limited in the embodiments of the present application.
An embodiment of the present application further provides a computer storage medium. The computer storage medium includes computer instructions that, when run on the foregoing electronic device, cause the electronic device to perform the functions or steps in the foregoing method embodiments.
An embodiment of the present application further provides a computer program product that, when run on a computer, causes the computer to perform the functions or steps in the foregoing method embodiments.
Through the description of the foregoing implementations, a person skilled in the art can clearly understand that, for convenience and brevity of description, only the division of the foregoing functional modules is used as an example. In practical applications, the foregoing functions may be allocated to different functional modules as required, that is, the internal structure of the apparatus is divided into different functional modules to complete all or part of the functions described above.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. For example, the division of the modules or units is only a logical function division, and there may be other division manners in actual implementation; for example, multiple units or components may be combined or integrated into another apparatus, or some features may be omitted or not performed. In addition, the mutual couplings, direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses or units, and may be in electrical, mechanical or other forms.
The units described as separate components may or may not be physically separate, and the components displayed as units may be one physical unit or multiple physical units, that is, they may be located in one place or distributed to multiple different places. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a readable storage medium. Based on such an understanding, the technical solutions of the embodiments of this application essentially, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a device (which may be a single-chip microcomputer, a chip or the like) or a processor to perform all or part of the steps of the methods described in the embodiments of this application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.
The foregoing content is merely specific implementations of this application, but the protection scope of this application is not limited thereto. Any variation or replacement within the technical scope disclosed in this application shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims (15)

  1. A media file recommendation method, characterized in that the method is applied to an electronic device and comprises:
    receiving a voice signal, and converting the voice signal into text information;
    obtaining, according to the text information, a user intent and slot information comprised in the user intent, wherein the slot information comprises emotion information and time-sequence information; and
    querying a media file library according to the user intent and the slot information to obtain a media file corresponding to the user intent and the slot information.
  2. The method according to claim 1, characterized in that the media file library stores a first mapping relationship among multiple user intents, slot information and multiple media file identifiers; and
    the querying a media file library according to the user intent and the slot information to obtain a media file corresponding to the user intent and the slot information comprises:
    obtaining, according to the first mapping relationship, the media file corresponding to the user intent and the slot information.
  3. The method according to claim 1 or 2, characterized in that, before the obtaining of the user intent and the slot information in the text information, the method further comprises:
    determining whether the text information contains the user intent;
    if it is determined that the text information does not contain the user intent, obtaining an emotion feature vector of the voice signal through the Mel-frequency cepstral coefficient (MFCC) algorithm; and
    querying the media file library according to the emotion feature vector to obtain a media file corresponding to the emotion feature vector, wherein the media file library stores a second mapping relationship between multiple emotion feature vectors and multiple pieces of emotion information, and each piece of emotion information corresponds to multiple media files.
  4. The method according to claim 3, characterized in that, before the receiving a voice signal, the method further comprises:
    obtaining emotion information from users' comments on multiple media files;
    determining whether the emotion information is fine-grained emotion information or coarse-grained emotion information;
    if the emotion information is the fine-grained emotion information, obtaining slots in the fine-grained emotion information, and establishing the first mapping relationship in the media file library; and
    if the emotion information is the coarse-grained emotion information, obtaining an emotion feature vector according to the emotion information, obtaining the emotion information of the media file, and establishing the second mapping relationship of the media file.
  5. The method according to claim 1, characterized in that the converting the voice signal into text information comprises:
    converting the voice signal into the text information through automatic speech recognition (ASR).
  6. The method according to claim 1, characterized in that the obtaining a user intent in the text information comprises:
    obtaining the user intent in the text information through natural language understanding (NLU) technology.
  7. An electronic device, characterized in that the electronic device comprises a processor and a memory connected to the processor, the memory being configured to store instructions that, when executed by the processor, cause the electronic device to perform:
    receiving a voice signal, and converting the voice signal into text information;
    obtaining, according to the text information, a user intent and slot information comprised in the user intent, wherein the slot information comprises emotion information and time-sequence information; and
    querying a media file library according to the user intent and the slot information to obtain a media file corresponding to the user intent and the slot information.
  8. The electronic device according to claim 7, characterized in that the media file library stores a first mapping relationship among multiple user intents, slot information and multiple media file identifiers; and
    the electronic device is specifically configured to perform:
    obtaining, according to the first mapping relationship, the media file corresponding to the user intent and the slot information.
  9. The electronic device according to claim 7 or 8, characterized in that the electronic device is further configured to perform:
    determining whether the text information contains the user intent;
    if it is determined that the text information does not contain the user intent, obtaining an emotion feature vector of the voice signal through the Mel-frequency cepstral coefficient (MFCC) algorithm; and
    querying the media file library according to the emotion feature vector to obtain a media file corresponding to the emotion feature vector, wherein the media file library stores a second mapping relationship between multiple emotion feature vectors and multiple emotion tags.
  10. The electronic device according to claim 9, characterized in that the electronic device is further configured to perform:
    obtaining emotion information from users' comments on multiple media files;
    determining whether the emotion information is fine-grained emotion information or coarse-grained emotion information;
    if the emotion information is the fine-grained emotion information, obtaining slots in the fine-grained emotion information, and establishing the first mapping relationship in the media file library; and
    if the emotion information is the coarse-grained emotion information, obtaining an emotion feature vector according to an emotion tag, obtaining the emotion tag of the media file, and establishing the second mapping relationship of the media file.
  11. The electronic device according to claim 7, characterized in that the converting the voice signal into text information comprises:
    converting the voice signal into the text information through automatic speech recognition (ASR).
  12. The electronic device according to claim 7, characterized in that the obtaining a user intent in the text information comprises:
    obtaining the user intent in the text information through natural language understanding (NLU) technology.
  13. A chip system, characterized in that the chip system is applied to an electronic device; the chip system comprises one or more interface circuits and one or more processors; the interface circuit and the processor are interconnected by wires; the interface circuit is configured to receive a signal from a memory of the electronic device and to send the signal to the processor, the signal comprising computer instructions stored in the memory; and when the processor executes the computer instructions, the electronic device performs the media file recommendation method according to any one of claims 1-6.
  14. A readable storage medium, characterized in that the readable storage medium stores instructions that, when run on an electronic device, cause the electronic device to perform the media file recommendation method according to any one of claims 1-6.
  15. A computer program product, characterized in that, when the computer program product runs on a computer, the computer is caused to perform the media file recommendation method according to any one of claims 1-6.
PCT/CN2020/100858 2019-07-08 2020-07-08 Media files recommending method and device WO2021004481A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910609618.0 2019-07-08
CN201910609618.0A CN110473546B (en) 2019-07-08 2019-07-08 Media file recommendation method and device

Publications (1)

Publication Number Publication Date
WO2021004481A1 true WO2021004481A1 (en) 2021-01-14

Family

ID=68506827

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/100858 WO2021004481A1 (en) 2019-07-08 2020-07-08 Media files recommending method and device

Country Status (2)

Country Link
CN (1) CN110473546B (en)
WO (1) WO2021004481A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113140138A (en) * 2021-04-25 2021-07-20 新东方教育科技集团有限公司 Interactive teaching method, device, storage medium and electronic equipment
CN113297934A (en) * 2021-05-11 2021-08-24 国家计算机网络与信息安全管理中心 Multi-mode video behavior analysis method for detecting internet violent harmful scene
CN113903342A (en) * 2021-10-29 2022-01-07 镁佳(北京)科技有限公司 Voice recognition error correction method and device
CN116108373A (en) * 2023-04-17 2023-05-12 京东科技信息技术有限公司 Bill data classifying and labeling system, electronic equipment and storage medium

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110473546B (en) * 2019-07-08 2022-05-31 华为技术有限公司 Media file recommendation method and device
CN112948662A (en) * 2019-12-10 2021-06-11 北京搜狗科技发展有限公司 Recommendation method and device and recommendation device
CN111666377A (en) * 2020-06-03 2020-09-15 贵州航天云网科技有限公司 Talent portrait construction method and system based on big data modeling
CN113808619B (en) * 2021-08-13 2023-10-20 北京百度网讯科技有限公司 Voice emotion recognition method and device and electronic equipment
CN116416993A (en) * 2021-12-30 2023-07-11 华为技术有限公司 Voice recognition method and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040193421A1 (en) * 2003-03-25 2004-09-30 International Business Machines Corporation Synthetically generated speech responses including prosodic characteristics of speech inputs
CN107071542A (en) * 2017-04-18 2017-08-18 百度在线网络技术(北京)有限公司 Video segment player method and device
CN107222757A (en) * 2017-07-05 2017-09-29 深圳创维数字技术有限公司 A kind of voice search method, set top box, storage medium, server and system
CN108804609A (en) * 2018-05-30 2018-11-13 平安科技(深圳)有限公司 Song recommendations method and apparatus
CN109189978A (en) * 2018-08-27 2019-01-11 广州酷狗计算机科技有限公司 The method, apparatus and storage medium of audio search are carried out based on speech message
CN110473546A (en) * 2019-07-08 2019-11-19 华为技术有限公司 A kind of media file recommendation method and device

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103970791B (en) * 2013-02-01 2018-01-23 华为技术有限公司 A kind of method, apparatus for recommending video from video library
US9378741B2 (en) * 2013-03-12 2016-06-28 Microsoft Technology Licensing, Llc Search results using intonation nuances
US9788777B1 (en) * 2013-08-12 2017-10-17 The Neilsen Company (US), LLC Methods and apparatus to identify a mood of media
CN106302987A (en) * 2016-07-28 2017-01-04 乐视控股(北京)有限公司 A kind of audio frequency recommends method and apparatus
CN106570496B (en) * 2016-11-22 2019-10-01 上海智臻智能网络科技股份有限公司 Emotion identification method and apparatus and intelligent interactive method and equipment
US10558701B2 (en) * 2017-02-08 2020-02-11 International Business Machines Corporation Method and system to recommend images in a social application
CN107562850A (en) * 2017-08-28 2018-01-09 百度在线网络技术(北京)有限公司 Music recommends method, apparatus, equipment and storage medium
CN108197115B (en) * 2018-01-26 2022-04-22 上海智臻智能网络科技股份有限公司 Intelligent interaction method and device, computer equipment and computer readable storage medium
CN109740154B (en) * 2018-12-26 2021-10-26 西安电子科技大学 Online comment fine-grained emotion analysis method based on multi-task learning

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040193421A1 (en) * 2003-03-25 2004-09-30 International Business Machines Corporation Synthetically generated speech responses including prosodic characteristics of speech inputs
CN107071542A (en) * 2017-04-18 2017-08-18 百度在线网络技术(北京)有限公司 Video segment player method and device
CN107222757A (en) * 2017-07-05 2017-09-29 深圳创维数字技术有限公司 A kind of voice search method, set top box, storage medium, server and system
CN108804609A (en) * 2018-05-30 2018-11-13 平安科技(深圳)有限公司 Song recommendations method and apparatus
CN109189978A (en) * 2018-08-27 2019-01-11 广州酷狗计算机科技有限公司 The method, apparatus and storage medium of audio search are carried out based on speech message
CN110473546A (en) * 2019-07-08 2019-11-19 华为技术有限公司 A kind of media file recommendation method and device

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113140138A (en) * 2021-04-25 2021-07-20 新东方教育科技集团有限公司 Interactive teaching method, device, storage medium and electronic equipment
CN113297934A (en) * 2021-05-11 2021-08-24 国家计算机网络与信息安全管理中心 Multi-mode video behavior analysis method for detecting internet violent harmful scene
CN113297934B (en) * 2021-05-11 2024-03-29 国家计算机网络与信息安全管理中心 Multi-mode video behavior analysis method for detecting Internet violence harmful scene
CN113903342A (en) * 2021-10-29 2022-01-07 镁佳(北京)科技有限公司 Voice recognition error correction method and device
CN113903342B (en) * 2021-10-29 2022-09-13 镁佳(北京)科技有限公司 Voice recognition error correction method and device
CN116108373A (en) * 2023-04-17 2023-05-12 京东科技信息技术有限公司 Bill data classifying and labeling system, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN110473546B (en) 2022-05-31
CN110473546A (en) 2019-11-19

Similar Documents

Publication Publication Date Title
WO2021004481A1 (en) Media files recommending method and device
US10811005B2 (en) Adapting voice input processing based on voice input characteristics
CN110288985B (en) Voice data processing method and device, electronic equipment and storage medium
US20200126566A1 (en) Method and apparatus for voice interaction
JP2021103328A (en) Voice conversion method, device, and electronic apparatus
WO2020177190A1 (en) Processing method, apparatus and device
CN110097870B (en) Voice processing method, device, equipment and storage medium
WO2019242414A1 (en) Voice processing method and apparatus, storage medium, and electronic device
WO2019199742A1 (en) Continuous detection of words and related user experience
US10699706B1 (en) Systems and methods for device communications
CN109543021B (en) Intelligent robot-oriented story data processing method and system
WO2019228138A1 (en) Music playback method and apparatus, storage medium, and electronic device
CN112927674B (en) Voice style migration method and device, readable medium and electronic equipment
CN109710799B (en) Voice interaction method, medium, device and computing equipment
WO2020173211A1 (en) Method and apparatus for triggering special image effects and hardware device
CN111640434A (en) Method and apparatus for controlling voice device
EP3550449A1 (en) Search method and electronic device using the method
CN110706707A (en) Method, apparatus, device and computer-readable storage medium for voice interaction
CN111460231A (en) Electronic device, search method for electronic device, and medium
CN110889008B (en) Music recommendation method and device, computing device and storage medium
US11238865B2 (en) Function performance based on input intonation
CN113611316A (en) Man-machine interaction method, device, equipment and storage medium
WO2019228140A1 (en) Instruction execution method and apparatus, storage medium, and electronic device
CN115798459B (en) Audio processing method and device, storage medium and electronic equipment
CN114974213A (en) Audio processing method, electronic device and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20836195

Country of ref document: EP

Kind code of ref document: A1

122 Ep: pct application non-entry in european phase

Ref document number: 20836195

Country of ref document: EP

Kind code of ref document: A1