WO2019101099A1 - Video program identification method, device, terminal, system and storage medium - Google Patents

Video program identification method, device, terminal, system and storage medium Download PDF

Info

Publication number
WO2019101099A1
Authority
WO
WIPO (PCT)
Prior art keywords
information
video program
target
video
voiceprint feature
Prior art date
Application number
PCT/CN2018/116686
Other languages
English (en)
French (fr)
Inventor
郭恺懿
Original Assignee
腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Company Limited)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Company Limited)
Publication of WO2019101099A1 publication Critical patent/WO2019101099A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • G10L17/04Training, enrolment or model building
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • G10L17/06Decision making techniques; Pattern matching strategies
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/233Processing of audio elementary streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439Processing of audio elementary streams
    • H04N21/4394Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream

Definitions

  • the present application relates to the field of computers, and in particular, to a video program identification method, device, terminal, system, and storage medium.
  • For example, when a user starts watching a video program without knowing its information, such as the program name and cast, the electronic terminal can be used to quickly look up that information: the user only needs to open an application with a video program recognition function, and the information of the video program can be obtained through the application.
  • The technical problem to be solved by the embodiments of the present application is to provide a video program identification method, device, terminal, system and storage medium, so as to solve the problem that the prior art cannot identify non-live video programs.
  • To solve the above technical problem, one aspect of the embodiments of the present application discloses a video program identification method, executed by a server, including: receiving audio information in a video program from a terminal, the audio information including sound information; identifying target person information corresponding to the sound information; searching a video database for video programs associated with the target person information, the video database storing person information and video programs associated with the person information; and searching, among the video programs associated with the target person information, for a target video program containing target voice content information, the target voice content information including information matching the voice content information of the audio information.
  • Another aspect of the embodiments of the present application discloses a video program identification method, executed by a terminal, including: receiving an input video program identification instruction; collecting audio information in a video program according to the instruction, the audio information including voice information; sending the audio information to a server so that the server finds a target video program according to the above method; and receiving from the server and displaying information of the target video program.
  • Another aspect of the embodiments of the present application discloses a video program identification device including a processor, an input device, an output device, a memory and a communication device, which are connected to one another, wherein the memory is configured to store application program code, the communication device is configured to exchange information with external devices, and the processor is configured to invoke the program code to perform the method described above.
  • Another aspect of the embodiments of the present application discloses a terminal including a processor, an input device, an output device, a memory and a communication device, which are connected to one another, wherein the memory is configured to store application program code, the communication device is configured to exchange information with external devices, and the processor is configured to invoke the program code to perform the method described above.
  • Another aspect of the embodiments of the present application discloses a video program identification system including a terminal and a server, wherein the terminal includes the terminal described above, and the server includes the video program identification device described above.
  • Another aspect of the embodiments of the present application discloses a computer-readable storage medium storing a computer program, the computer program including program instructions which, when executed by a processor, cause the processor to perform the method described above.
  • FIG. 1 is a schematic structural diagram of a system for identifying a video program according to an embodiment of the present application
  • FIG. 2a is a schematic flow chart of a video program identification method according to an embodiment of the present application.
  • FIG. 2b is a schematic flowchart of a video program identification method according to another embodiment of the present application.
  • FIG. 3 is a schematic diagram of inputting a video program identification instruction according to an embodiment of the present application.
  • FIG. 4 is a schematic diagram of the principle of acoustic feature extraction provided by an embodiment of the present application.
  • FIG. 5 is a schematic flowchart of establishing a video database according to an embodiment of the present application.
  • FIG. 6 is a schematic diagram of the principle of establishing a voiceprint feature model provided by an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a video program identification apparatus according to an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of a video program identification apparatus according to another embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of a server provided by an embodiment of the present application.
  • FIG. 10 is a schematic structural diagram of a terminal provided by an embodiment of the present application.
  • In specific implementations, the terminals described in the embodiments of the present application include, but are not limited to, mobile phones, laptop computers, tablet computers and other portable devices having a touch-sensitive surface (for example, a touch-screen display and/or a touch pad).
  • It should also be understood that in some embodiments the device is not a portable communication device but a desktop computer having a touch-sensitive surface (for example, a touch-screen display and/or a touch pad).
  • In the following discussion, a terminal including a display and a touch-sensitive surface is described.
  • However, it should be understood that the terminal can include one or more other physical user interface devices such as a physical keyboard, a mouse and/or a joystick.
  • In the prior art, an application with a video program recognition function generally only supports recognizing video programs in live television broadcasts, because when identifying a program in a live broadcast, the audio search or recognition range can be narrowed to the limited live audio information of the current time slot, enabling fast identification of the video program.
  • A non-live (for example, on-demand) video program, however, differs from a live program broadcast at a fixed time: time information cannot be introduced as a search condition, and if identification were performed against all video programs, the volume of programs to search or identify would be enormous and the identification efficiency very low. Current applications with a video program recognition function therefore do not support recognizing non-live video programs.
  • FIG. 1 is a schematic structural diagram of the system architecture of the video program identification method according to an embodiment of the present application, i.e., a schematic structural diagram of the video program identification system.
  • the system architecture may include one or more servers, networks 1 to K, and a plurality of terminals (or devices) 11 to 1n...K1 to Kn connected to each network, where:
  • the server may include, but is not limited to, a background server, a component server, a video program identification system server, etc., and the server may communicate with a plurality of terminals via the Internet.
  • the server provides a video program identification service for the terminal to support the operation of the video program identification system.
  • the terminal (or device) can be installed and run with related clients (including, for example, a video program identification client, etc.).
  • A client is a program that corresponds to a server and provides local services to the user.
  • Here, the local service may include, but is not limited to, searching for or identifying a video program, obtaining related information of a video program, and the like.
  • the client may include: an application running locally, a function running on a web browser (also referred to as a Web App), and the like.
  • For the client, the server needs to run a corresponding server-side program to provide the corresponding services, such as video database services, data computation and decision execution.
  • the user can send the audio information in the collected video program to the server for video program identification through the video program identification client installed in the terminal, and the server returns the information of the recognized video program to the terminal.
  • The terminal in the embodiments of the present application may include, but is not limited to, any handheld electronic product based on a smart operating system, which can interact with a user through input devices such as a keyboard, a virtual keyboard, a touch pad, a touch screen or a voice-control device, such as a smartphone, a tablet computer or a personal computer.
  • The smart operating system includes, but is not limited to, any operating system that enriches device functions by providing various mobile applications to the terminal, such as Android™, iOS™ or Windows Phone™.
  • It should be noted that the system architecture of the video program identification method provided by the present application is not limited to that shown in FIG. 1.
  • FIG. 2a is a schematic flowchart of a video program identification method according to an embodiment of the present application. As shown in FIG. 2a, the method is executed by a server and includes the following steps:
  • Step S101 receiving audio information in a video program from the terminal, the audio information including sound information.
  • Step S102 identifying target person information corresponding to the sound information.
  • Step S103 searching for a video program associated with the target person information from the video database, where the video database stores the person information and the video program associated with the person information.
  • Step S104 Searching for a target video program including target voice content information in the video program associated with the target person information; the target voice content information includes information matching the voice content information of the audio information.
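  • For illustration only, the following minimal Python sketch shows the server-side flow of steps S101-S104 over toy data structures; the function names, the cosine similarity measure and the dictionary shapes are assumptions made for the sketch, not part of the patent.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity; a small epsilon guards against zero vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def identify_video_program(query_voiceprint, query_content,
                           voiceprint_models, video_db):
    """Steps S102-S104 over toy structures (all names are illustrative).

    voiceprint_models: {person: enrolled voiceprint vector}      (step S102)
    video_db: {person: {program: voice-content feature vector}}  (S103/S104)
    """
    # S102: the stored model best matching the voiceprint gives the person.
    person = max(voiceprint_models,
                 key=lambda p: cosine(query_voiceprint, voiceprint_models[p]))
    # S103: only that person's associated programs are searched further.
    candidates = video_db[person]
    # S104: the program whose voice-content features best match the query.
    return max(candidates,
               key=lambda prog: cosine(query_content, candidates[prog]))

# Toy usage with random vectors:
rng = np.random.default_rng(0)
models = {"actor_a": rng.normal(size=8), "actor_b": rng.normal(size=8)}
db = {"actor_a": {"drama_1": rng.normal(size=8)},
      "actor_b": {"movie_1": rng.normal(size=8), "movie_2": rng.normal(size=8)}}
print(identify_video_program(rng.normal(size=8), rng.normal(size=8), models, db))
```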
  • FIG. 2b is a schematic flowchart of a video program identification method according to another embodiment of the present application.
  • the interaction between the terminal and the server may include the following steps:
  • Step S200 receiving an input video program identification instruction
  • Specifically, when the user wants to identify the video program being played, identification can be started by launching the client for video program identification installed in the terminal. After that client has been launched, as shown in FIG. 3, which illustrates the input of a video program identification instruction according to an embodiment of the present application, the user can input a video program identification instruction by shaking the terminal via the "shake the TV" function in the client; at this time, the terminal receives the input video program identification instruction.
  • It can be understood that FIG. 3 is only one implementation of the embodiments of the present application; the present application does not limit the manner of inputting the video program identification instruction, which may also be input in other ways, such as tapping a virtual button, pressing a physical button, or speaking a voice command.
  • the video program being played in the embodiment of the present application may be a video program being played by an electronic device other than the terminal, such as a television, a tablet computer, or the like, or may be a video program being played by the terminal itself.
  • the video program in the embodiment of the present application includes a live video program and a non-live video program.
  • Step S202 collecting audio information in the video program according to the video program identification instruction
  • the audio information in the video program being played may be collected.
  • the audio information in the embodiment of the present application includes sound information, that is, sound information in which a person speaks.
  • the audio information in the embodiment of the present application may be a piece of audio information of a preset duration, for example, a piece of audio information of 5-10 seconds.
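  • For example, a terminal-side client might capture such a clip as sketched below; the sounddevice library and the 16 kHz mono format are assumptions, as the embodiments do not prescribe a capture API.

```python
# Hypothetical capture of a 5-second mono clip at 16 kHz using the
# `sounddevice` library (an assumption; the text names no capture API).
import sounddevice as sd

SAMPLE_RATE = 16_000   # Hz
DURATION_S = 5         # within the 5-10 s range suggested above

clip = sd.rec(int(DURATION_S * SAMPLE_RATE),
              samplerate=SAMPLE_RATE, channels=1, dtype="float32")
sd.wait()              # block until the recording finishes
# `clip` (shape: (samples, 1)) is what step S204 would send to the server.
```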
  • Step S204 Send the audio information to the server.
  • Specifically, the client for video program identification on the terminal sends the audio information to the server through a network; the server is the video program identification device.
  • Step S206 receiving audio information in the video program
  • the server receives the audio information in the video program sent by the terminal through the network.
  • Step S208: identify the target person information corresponding to the sound information.
  • Specifically, the video database in the server may store at least two voiceprint feature models, each corresponding to one piece of person information. The server may calculate the voiceprint feature of the sound information and then identify, from the voiceprint feature models stored in the video database, a target voiceprint feature model that matches the voiceprint feature; the person information corresponding to the matched target voiceprint feature model is the target person information.
  • In one embodiment of the present application, when at least two voiceprint features are calculated, identifying the target voiceprint feature model matching the voiceprint feature according to the models stored in the video database may include: determining the voiceprint feature that accounts for the largest proportion of the duration of the sound information as the first voiceprint feature, and identifying, from the voiceprint feature models stored in the video database, the model matching the first voiceprint feature as the target voiceprint feature model. The first voiceprint feature is thus the voiceprint feature with the largest share of the duration of the sound information.
  • Alternatively, for each voiceprint feature, the voiceprint feature model matching it is identified from the models stored in the video database, and the model with the highest matching degree is determined as the target voiceprint feature model.
  • Specifically, if two or more people converse in the 5-10 second audio segment collected by the terminal, the server calculates at least two voiceprint features. Taking a dialogue between two people as an example, the server may first determine which person speaks for the largest share of the segment's duration, find the voiceprint feature with the largest share, and then identify the target voiceprint feature model matching that dominant voiceprint feature from the models stored in the video database. Alternatively, the server may match both voiceprint features, check which matching degree is higher, find the voiceprint feature model with the highest matching degree, and use it as the target voiceprint feature model. This can further improve the accuracy of identifying the video program.
  • It can be understood that if two or more voiceprint features tie for the largest share of the duration of the sound information, one of them can be selected at random for matching; likewise, if two or more voiceprint feature models tie for the highest matching degree, one of them can be selected at random as the target voiceprint feature model.
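  • As a sketch of the two strategies just described (dominant-duration voiceprint versus highest matching degree), assuming voiceprints are fixed-length embeddings compared by cosine similarity (the embodiments do not fix a similarity measure):

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def match_by_dominant_speaker(voiceprints, models):
    """voiceprints: list of (speaking_duration_s, embedding) per speaker.
    Keeps only the voiceprint with the largest share of the clip duration."""
    _, dominant = max(voiceprints, key=lambda p: p[0])
    return max(models, key=lambda person: cosine(dominant, models[person]))

def match_by_best_degree(voiceprints, models):
    """Matches every detected voiceprint and keeps the single best pairing."""
    _, person = max((cosine(vp, models[person]), person)
                    for _, vp in voiceprints for person in models)
    return person

rng = np.random.default_rng(0)
models = {"actor_a": rng.normal(size=8), "actor_b": rng.normal(size=8)}
clip = [(6.5, models["actor_a"] + 0.1), (2.0, rng.normal(size=8))]
print(match_by_dominant_speaker(clip, models))  # -> "actor_a"
print(match_by_best_degree(clip, models))
```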
  • Step S210 Find a video program associated with the target person information from the video database
  • the video database stores person information and a video program associated with the person information.
  • A video program associated with a piece of person information means that the person takes part in the performance of that video program; for example, if actor A is associated with video program A, actor A starred in video program A.
  • the server may first find the target person information from a plurality of person information stored in the video database, and then find a video program associated with the target person information.
  • Step S212 searching for a target video program including target voice content information in the video program associated with the target person information;
  • the target voice content information in the embodiment of the present application includes information that matches the voice content information of the audio information.
  • The voice content information in the embodiments of the present application may include acoustic features of the voice content, and the video database stores the acoustic features of the voice content corresponding to each video program. When determining the target voice content information, the server may match the acoustic features of the voice content extracted from the audio information against the acoustic features of the voice content corresponding to the video programs associated with the target person information, and determine the successfully matched acoustic features as the acoustic features of the target voice content.
  • That is, the successfully matched acoustic features among the video programs associated with the target person information are the acoustic features of the target voice content, and the video program corresponding to the acoustic features of the target voice content is the target video program.
  • It should be noted that the voice content in the embodiments of the present application is what the people say; for example, if the video program is a TV series or a film, the voice content is the actors' dialogue.
  • After step S206, i.e., after the server receives the audio information of the video program, the method further includes a step of extracting the acoustic features of the voice content from the audio information; this step may be performed between step S206 and step S212, or within step S212.
  • In one embodiment of the present application, the acoustic features of the voice content may be extracted from the audio information as shown in FIG. 4: the stationary audio signal 401 is divided into a plurality of frames 402 through time windows, where the time interval between the starting positions of two adjacent time windows is called the "frame shift" 403, the unit determined by each time window is called a "frame", and its duration is called the frame length. Time-frequency analysis is performed on each divided frame to extract the acoustic features 404 of each frame.
  • Audio information (which can be regarded as a speech signal) can be considered stationary over short intervals and non-stationary over long ones; within a short interval, generally 10 to 30 milliseconds, it can be treated as a stationary signal.
  • The distribution of the relevant characteristic parameters of the voice content information can be considered consistent within such a short interval (10-30 ms), but it changes noticeably over longer times.
  • In digital signal processing, time-frequency analysis is performed on stationary signals to extract features. Therefore, when extracting features from the audio information, a time window of about 20 ms can be set, with the corresponding "frame shift" 403; within this window the speech signal can be considered stationary.
  • The window is then slid along the speech signal, and each window position yields a feature that characterizes the signal within that window; this gives the voice content information of the audio information, i.e., the acoustic feature sequence of the voice content.
  • This process is called acoustic feature extraction; each such feature characterizes the speech signal within its time window.
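  • A minimal numpy sketch of the framing described above (frames 402, frame shift 403) follows; the 20 ms window, 10 ms shift and the per-frame magnitude spectrum are illustrative choices, not values mandated by the text.

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=20, shift_ms=10):
    """Split a 1-D signal into overlapping, windowed frames (402/403 above)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // shift)
    frames = np.stack([signal[i * shift:i * shift + frame_len]
                       for i in range(n_frames)])
    return frames * np.hamming(frame_len)  # taper each frame before analysis

# One feature vector per frame: here simply the magnitude spectrum (404).
audio = np.random.randn(16000)                 # 1 s of toy audio at 16 kHz
features = np.abs(np.fft.rfft(frame_signal(audio, 16000), axis=1))
print(features.shape)                          # (n_frames, frame_len // 2 + 1)
```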
  • Step S214 Send the information of the target video program to the terminal.
  • Specifically, the information of the target video program may include the name of the target video program, the time when the target video program was completed, and so on.
  • In addition, the server may obtain related information of the target video program and then send this related information to the terminal.
  • The related information includes at least one of the following: profile information, person list information, tidbit information, comment information, episode information, complete video program link information, information of video programs matching the target video program, and the like.
  • The profile information may be a synopsis or summary of the target video program; the person list information may be information about the actors or performers appearing in the target video program; the tidbit information may be behind-the-scenes material from the shooting of the target video program.
  • The comment information may be comments made by users who have watched the target video program; the episode information may indicate which episode of the target video program is currently playing and the total number of episodes; the complete video program link information may link to all episodes of the target video program; and the matching video program information may be information about other video programs of a type similar to the target video program, or in which one or more of the same people appear.
  • Step S216 Receive and display information of the target video program sent by the server.
  • Specifically, after receiving the information of the target video program sent by the server (i.e., the video program identification device), the terminal prompts the user with, or directly displays, that information.
  • By implementing the embodiments of the present application, after the audio information in a video program is received, the target person information corresponding to the sound information is identified first; the video programs associated with the target person information are then looked up in the video database, which stores person information and the video programs associated with it; and the search for the target video program containing the target voice content information is confined to the video programs associated with the target person information. This improves the efficiency of video program identification and solves the prior-art problem that, with so many video programs, identification against all of them is very inefficient.
  • Compared with matching each piece of audio against the massive volume of video in an entire video library, the embodiments of the present application greatly reduce the search and recognition range, increase the speed of search and recognition, and meet users' need to identify both live and non-live video programs.
  • Further, as shown in FIG. 5, a schematic flowchart of establishing the video database according to an embodiment of the present application, before step S206 the server may further perform the following steps:
  • Step S500 collecting audio information of multiple video programs
  • Specifically, the server collects in advance the audio information of a sufficiently large number of video programs; this collected audio information serves as the key data for establishing the video database.
  • Step S502 analyzing audio information of the plurality of video programs, obtaining character information associated with each video program, and voice content information of each video program, the voice content information including acoustic characteristics of the voice content;
  • Specifically, the server may have all the collected video programs manually annotated in advance, marking the person information (i.e., person identity information) corresponding to every voice content segment in each video program, and then extract from each segment characteristic parameters such as the pitch spectrum and its envelope, the energy of pitch frames, and the occurrence frequency and trajectory of pitch formants; the extracted characteristic parameters are the acoustic features of the voice content.
  • Step S504 Establish an acoustic feature list, and store the acoustic feature list in a video database.
  • Specifically, the acoustic feature list includes the video programs associated with each piece of person information, and the acoustic features of that person's corresponding voice content in each video program. That is, the person information associated with each video program can be organized first, and the acoustic features of the voice content, composed of characteristic parameters such as the pitch spectrum and envelope, the energy of pitch frames, and the occurrence frequency and trajectory of pitch formants, are then compiled.
  • Finally, the information is organized into a mapping table keyed by person information, mapping to the list of all video programs associated with that person, with each video program in turn serving as a key mapping to the list of acoustic features of all of that person's voice content in the program; that is, the acoustic feature list is established, as shown in Table 1 (rendered as an image in the original publication).
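  • The two-level mapping just described (person information, to associated programs, to per-program acoustic feature lists) can be pictured as nested dictionaries; the names and numbers below are purely illustrative:

```python
# Illustrative shape of the acoustic feature list (Table 1): person information
# as the outer key, each associated video program as an inner key, and the
# acoustic feature sequences of that person's voice content as the values.
acoustic_feature_list = {
    "actor_a": {
        "video_program_a": [[0.12, 0.48, 0.33], [0.07, 0.51, 0.29]],
        "video_program_b": [[0.22, 0.40, 0.31]],
    },
    "actor_b": {
        "video_program_a": [[0.18, 0.44, 0.27]],
    },
}

# Step S210 then reduces to a dictionary lookup for the identified person:
programs_for_actor_a = list(acoustic_feature_list["actor_a"])
```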
  • Step S506 Perform model training using the acoustic features of the voice content to establish a plurality of voiceprint feature models.
  • each voiceprint feature model corresponds to one character information.
  • Specifically, as shown in FIG. 6, which illustrates the principle of establishing the voiceprint feature models according to an embodiment of the present application, the server may feed the established acoustic feature list into a Deep Neural Network (DNN) i-vector system (DNN-ivector) to capture speaker characteristics.
  • The main feature of the DNN-ivector system is that the previously extracted acoustic features are aligned according to certain sounding units and projected into a lower-dimensional linear space, where speaker information is then mined. The server then performs model training with this feature information and can establish a mapping table keyed by voiceprint feature model, with person information as the value.
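  • As a sketch of the resulting mapping keyed by voiceprint feature model, with a naive averaging step standing in for the DNN-ivector training (which this sketch does not reproduce):

```python
import numpy as np

def enroll_models(features_by_person):
    """Build {person: model vector} by averaging each person's labelled
    acoustic feature vectors; a placeholder for real DNN-ivector training."""
    return {person: np.mean(np.asarray(feats), axis=0)
            for person, feats in features_by_person.items()}

labelled = {"actor_a": [[0.12, 0.48, 0.33], [0.07, 0.51, 0.29]],
            "actor_b": [[0.90, 0.10, 0.05]]}
models = enroll_models(labelled)
# Mapping table keyed by the voiceprint model, with person info as the value:
model_table = {tuple(vec.round(3)): person for person, vec in models.items()}
print(model_table)
```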
  • It should be noted that the execution order of steps S504 and S506 in the embodiments of the present application is not limited: step S504 may be performed first and then step S506, step S506 may be performed first and then step S504, or steps S504 and S506 may be performed simultaneously.
  • the present application further provides a video program identification device, which is described in detail below with reference to the accompanying drawings:
  • FIG. 7 is a schematic structural diagram of a video program identification apparatus according to an embodiment of the present application.
  • As shown in FIG. 7, the video program identification apparatus 70 may include a first receiving unit 700, an identifying unit 702, a first searching unit 704 and a second searching unit 706, wherein:
  • the first receiving unit 700 is configured to receive audio information in a video program from the terminal, where the audio information includes sound information;
  • the identifying unit 702 is configured to identify target person information corresponding to the sound information
  • the first searching unit 704 is configured to search, from the video database, a video program associated with the target person information; the video database stores the person information and the video program associated with the person information;
  • the second searching unit 706 is configured to search for a target video program that includes the target voice content information in the video program associated with the target person information; the target voice content information includes information that matches the voice content information of the audio information.
  • In one embodiment, the video database stores at least two voiceprint feature models, and each voiceprint feature model corresponds to one piece of person information;
  • the identification unit 702 may include: a calculation unit and a first matching unit, where
  • a calculating unit configured to calculate a voiceprint feature of the sound information
  • the first matching unit is configured to identify a target voiceprint feature model that matches the voiceprint feature according to the voiceprint feature model stored in the video database; wherein the character information corresponding to the target voiceprint feature model is the target character information.
  • In one embodiment, when at least two voiceprint features are calculated, the first matching unit may be specifically configured to identify, from the voiceprint feature models stored in the video database, a target voiceprint feature model matching a first voiceprint feature; the first voiceprint feature is the voiceprint feature with the largest share of the duration of the sound information.
  • In another embodiment, when at least two voiceprint features are calculated, the first matching unit may be specifically configured to identify, for each voiceprint feature, the voiceprint feature model matching it from the models stored in the video database, and to determine the voiceprint feature model with the highest matching degree as the target voiceprint feature model.
  • the voice content information includes an acoustic feature of the voice content
  • the video database stores an acoustic feature of the voice content corresponding to the video program
  • The second searching unit 706 may specifically include a second matching unit configured to match the acoustic features of the voice content extracted from the audio information against the acoustic features of the voice content corresponding to the video programs associated with the target person information; to determine the successfully matched acoustic features among the video programs associated with the target person information as the acoustic features of the target voice content; and to determine the video program corresponding to the acoustic features of the target voice content as the target video program.
  • the video program identification device 70 may further include: an acquisition unit, an analysis unit, a list establishment unit, a model establishment unit, an information acquisition unit, and a first transmission unit, where
  • An acquisition unit configured to collect audio information of multiple video programs
  • An analyzing unit configured to analyze audio information of the plurality of video programs, obtain character information associated with each video program, and voice content information of each video program, the voice content information including acoustic characteristics of the voice content;
  • a list establishing unit for establishing an acoustic feature list, the acoustic feature list being stored in a video database; the acoustic feature list including a video program associated with each of the character information, and a corresponding voice of the character information in each video program The acoustic characteristics of the content.
  • The model establishing unit is configured to perform model training using the acoustic features of the voice content to establish a plurality of voiceprint feature models; each voiceprint feature model corresponds to one piece of person information.
  • An information obtaining unit, configured to obtain related information of the target video program;
  • The first sending unit is configured to send the related information to the terminal.
  • The related information includes at least one of the following:
  • profile information, person list information, tidbit information, comment information, episode information, complete video program link information, and information of video programs matching the target video program.
  • It should be noted that the video program identification apparatus 70 in the embodiments of the present application is the server (i.e., the video program identification device) in the embodiments of FIG. 1 to FIG. 6 above; for the functions of the modules in the video program identification apparatus 70, reference may be made to the specific implementations of the embodiments of FIG. 1 to FIG. 6 in the foregoing method embodiments, which are not repeated here.
  • the present application further provides another video program identification device, which will be described in detail below with reference to the accompanying drawings:
  • FIG. 8 is a schematic structural diagram of a video program identification apparatus according to another embodiment of the present application.
  • As shown in FIG. 8, the video program identification apparatus 80 may include a second receiving unit 800, an information collecting unit 802, a second sending unit 804 and a receiving and display unit 806, wherein:
  • a second receiving unit 800 configured to receive an input video program identification instruction
  • the information collecting unit 802 is configured to collect audio information in the video program according to the video program identification instruction, where the audio information includes sound information;
  • a second sending unit 804 configured to send the audio information to the server, so that the server finds the target video program according to the audio information
  • The receiving and display unit 806 is configured to receive from the server and display the information of the target video program.
  • It should be noted that the video program identification apparatus 80 in the embodiments of the present application is the terminal in the embodiments of FIG. 1 to FIG. 6 above; for the functions of the modules in the video program identification apparatus 80, reference may be made to the specific implementations of the embodiments of FIG. 1 to FIG. 6 in the foregoing method embodiments, which are not repeated here.
  • FIG. 9 is a schematic structural diagram of a server provided by an embodiment of the present application.
  • As shown in FIG. 9, the server 90 may include a processor 901, an input unit 902, an output unit 903, a memory 904 and a communication unit 905, and the processor 901, the input unit 902, the output unit 903, the memory 904 and the communication unit 905 may be connected to one another through a bus 906.
  • The memory 904 may be a high-speed RAM memory, or may be a non-volatile memory such as at least one magnetic disk memory; in the embodiments of the present application the memory 904 includes flash memory.
  • the memory 904 can optionally also be at least one storage system located remotely from the aforementioned processor 901.
  • The memory 904 is configured to store application program code, and may include an operating system, a network communication module, a user interface module and a video program identification program.
  • The communication unit 905 is configured to exchange information with external units; the processor 901 is configured to invoke the program code to perform the following steps:
  • receiving, through the communication unit 905, audio information in a video program from the terminal, the audio information including sound information;
  • identifying target person information corresponding to the sound information;
  • searching a video database for video programs associated with the target person information; the video database stores person information and video programs associated with the person information; and
  • searching, among the video programs associated with the target person information, for a target video program containing target voice content information; the target voice content information includes information matching the voice content information of the audio information.
  • Specifically, the video database stores at least two voiceprint feature models, and each voiceprint feature model corresponds to one piece of person information.
  • The processor 901 identifying the target person information corresponding to the sound information may include: calculating the voiceprint feature of the sound information; and identifying, from the voiceprint feature models stored in the video database, a target voiceprint feature model matching the voiceprint feature, the person information corresponding to the target voiceprint feature model being the target person information.
  • Specifically, when at least two voiceprint features are calculated, the processor 901 identifying the target voiceprint feature model matching the voiceprint feature from the models stored in the video database may include: identifying, from the voiceprint feature models stored in the video database, a target voiceprint feature model matching a first voiceprint feature; the first voiceprint feature is the voiceprint feature with the largest share of the duration of the sound information.
  • Alternatively, when at least two voiceprint features are calculated, the processor 901 identifying the target voiceprint feature model matching the voiceprint feature from the models stored in the video database may include:
  • identifying, for each voiceprint feature, the voiceprint feature model matching it from the models stored in the video database; and
  • determining the voiceprint feature model with the highest matching degree as the target voiceprint feature model.
  • Specifically, the voice content information includes acoustic features of the voice content;
  • the video database stores the acoustic features of the voice content corresponding to each video program;
  • the processor 901 searching, among the video programs associated with the target person information, for the target video program containing the target voice content information may include:
  • matching the acoustic features of the voice content extracted from the audio information against the acoustic features of the voice content corresponding to the video programs associated with the target person information; determining the successfully matched acoustic features as the acoustic features of the target voice content; and determining the video program corresponding to the acoustic features of the target voice content as the target video program.
  • Specifically, before receiving the audio information in the video program, the processor 901 may further perform:
  • collecting audio information of multiple video programs through the communication unit 905;
  • analyzing the audio information of the multiple video programs to obtain the person information associated with each video program and the voice content information of each video program, the voice content information including acoustic features of the voice content; and
  • establishing an acoustic feature list and storing the acoustic feature list in the video database; the acoustic feature list includes the video programs associated with each piece of person information, and the acoustic features of that person's corresponding voice content in each video program.
  • Specifically, after the processor 901 extracts the acoustic features of the voice content of each video program, it may further perform:
  • model training using the acoustic features of the voice content to establish a plurality of voiceprint feature models; each voiceprint feature model corresponds to one piece of person information.
  • Specifically, after finding, among the video programs associated with the target person information, the target video program containing the target voice content information, the processor 901 may further perform: obtaining related information of the target video program through the communication unit 905; and
  • sending the related information to the terminal through the communication unit 905.
  • The related information includes at least one of the following:
  • profile information, person list information, tidbit information, comment information, episode information, complete video program link information, and information of video programs matching the target video program.
  • It should be noted that the server 90 in the embodiments of the present application is the server in the embodiments of FIG. 1 to FIG. 6 above; for details, reference may be made to the specific implementations of the embodiments of FIG. 1 to FIG. 6 in the foregoing method embodiments, which are not repeated here.
  • the terminal 10 may include a baseband chip 100, a memory 105 (one or more computer readable storage media), a communication module 106, and a peripheral system 107. These components can communicate over one or more communication buses 104.
  • the peripheral system 107 is mainly used to implement the interaction function between the terminal 10 and the user/external environment, and mainly includes the input and output devices of the terminal 10.
  • the peripheral system 107 can include: a touch screen controller, a camera controller, an audio controller, and a sensor management module. Each controller may be coupled to a respective peripheral device (such as touch display 108, camera 109, audio circuit 1010, and sensor 1011). It should be noted that the peripheral system 107 may also include other I/O peripherals.
  • The baseband chip 100 may integrate one or more processors 101, a clock module 102, and a power management module 103.
  • the clock module 102 integrated in the baseband chip 100 is primarily used to generate the clocks required for data transfer and timing control for the processor 101.
  • the power management module 103 integrated in the baseband chip 100 is mainly used to provide a stable, high-precision voltage for the processor 101, the radio frequency module 106, and the peripheral system.
  • The communication module 106 is configured to receive and transmit radio frequency signals, and includes a Subscriber Identity Module (SIM) card 1061 and a Wireless Fidelity (Wi-Fi) module 1062; it mainly integrates the receiver and transmitter of the terminal 10.
  • the communication module 106 communicates with the communication network and other communication devices via radio frequency signals.
  • The communication module 106 may include, but is not limited to, an antenna system, a radio frequency (RF) transceiver, one or more amplifiers, a tuner, one or more oscillators, a digital signal processor, codec (CODEC) chips, a SIM card, storage media, and so on.
  • the communication module 106 can be implemented on a separate chip.
  • Memory 105 is coupled to processor 101 for storing various software programs and/or sets of instructions.
  • memory 105 can include high speed random access memory, and can also include non-volatile memory, such as one or more magnetic disk storage devices, flash memory devices, or other non-volatile solid state storage devices.
  • the memory 105 can store an operating system.
  • the memory 105 can also store a network communication program that can be used to communicate with one or more additional devices, one or more terminal devices, one or more network devices.
  • The memory 105 can also store a user interface program, which can vividly display the content of an application through a graphical operation interface, and receive the user's control operations on the application through input controls such as menus, dialog boxes and keys.
  • The memory 105 can also store one or more applications. As shown in FIG. 10, these applications may include: social applications (e.g., Facebook™), a video program identification application, map-based applications (e.g., Google Maps), browsers (e.g., Safari™, Google Chrome™), and so on.
  • processor 101 can be used to read and execute computer readable instructions. Specifically, the processor 101 can be used to invoke a program stored in the memory 105, such as the video program identification application provided by the present application, and execute instructions included in the program, including the following steps:
  • receiving an input video program identification instruction;
  • collecting audio information in the video program according to the video program identification instruction, the audio information including sound information;
  • sending the audio information to the video program identification device through the communication module 106; and
  • receiving, through the communication module 106, the information of the target video program transmitted by the video program identification device, and displaying it via the touch display 108.
  • It should be noted that the terminal 10 in the embodiments of the present application is the terminal in the embodiments of FIG. 1 to FIG. 6 above; for details, reference may be made to the specific implementations of the embodiments of FIG. 1 to FIG. 6 in the foregoing method embodiments, which are not repeated here.
  • It should be understood that the structure of the terminal 10 above is only an example provided by the embodiments of the present application; the terminal 10 may have more or fewer components than shown, may combine two or more components, or may have a different configuration or arrangement of the components.
  • By implementing the embodiments of the present application, after the audio information in a video program is received, the target person information corresponding to the sound information is identified first; the video programs associated with the target person information are then looked up in the video database, which stores person information and the video programs associated with it; and the search for the target video program containing the target voice content information is confined to the video programs associated with the target person information. This improves the efficiency of video program identification and solves the prior-art problem that, with so many video programs, identification against all of them is very inefficient. Compared with matching each piece of audio against the massive volume of video in an entire video library, the embodiments of the present application greatly reduce the search and recognition range, increase the speed of search and recognition, and meet users' need to identify both live and non-live video programs.
  • the storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), or a random access memory (RAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Business, Economics & Management (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Game Theory and Decision Science (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The present application discloses a video program identification method, device, terminal, system and storage medium. The method is executed by a server and includes: receiving audio information in a video program from a terminal, the audio information including sound information; identifying target person information corresponding to the sound information; searching a video database for video programs associated with the target person information, the video database storing person information and video programs associated with the person information; and searching, among the video programs associated with the target person information, for a target video program containing target voice content information, the target voice content information including information matching the voice content information of the audio information.

Description

Video program identification method, device, terminal, system and storage medium
This application claims priority to Chinese Patent Application No. 201711180259.9, entitled "Video program identification method, related apparatus, device and system", filed with the Chinese Patent Office on November 22, 2017.
Technical Field
The present application relates to the field of computers, and in particular to a video program identification method, device, terminal, system and storage medium.
Background of the Invention
With the development of electronic technology and Internet technology, the functions of electronic terminals (especially smart mobile terminals) have become increasingly powerful; users need only install the application installation packages they require on an electronic terminal to handle all kinds of tasks through the corresponding applications.
For example, when a user starts watching a video program without knowing its information, including the program name, cast and so on, the electronic terminal can be used to quickly look up that information: the user only needs to open an application with a video program recognition function, and the information of the video program can be obtained through that application.
Summary of the Invention
The technical problem to be solved by the embodiments of the present application is to provide a video program identification method, device, terminal, system and storage medium, so as to solve the problem that the prior art cannot identify non-live video programs.
To solve the above technical problem, one aspect of the embodiments of the present application discloses a video program identification method, executed by a server, including:
receiving audio information in a video program from a terminal, the audio information including sound information;
identifying target person information corresponding to the sound information;
searching a video database for video programs associated with the target person information, the video database storing person information and video programs associated with the person information; and
searching, among the video programs associated with the target person information, for a target video program containing target voice content information, the target voice content information including information matching the voice content information of the audio information.
Another aspect of the embodiments of the present application discloses a video program identification method, executed by a terminal, including:
receiving an input video program identification instruction;
collecting audio information in a video program according to the video program identification instruction, the audio information including voice information;
sending the audio information to a server, so that the server finds a target video program according to the above method; and
receiving from the server and displaying information of the target video program.
Another aspect of the embodiments of the present application discloses a video program identification device, including a processor, an input device, an output device, a memory and a communication device, which are connected to one another, wherein the memory is configured to store application program code, the communication device is configured to exchange information with external devices, and the processor is configured to invoke the program code to perform the method described above.
Another aspect of the embodiments of the present application discloses a terminal, including a processor, an input device, an output device, a memory and a communication device, which are connected to one another, wherein the memory is configured to store application program code, the communication device is configured to exchange information with external devices, and the processor is configured to invoke the program code to perform the method described above.
Another aspect of the embodiments of the present application discloses a video program identification system, including a terminal and a server, wherein the terminal includes the terminal described above, and the server includes the video program identification device described above.
Another aspect of the embodiments of the present application discloses a computer-readable storage medium storing a computer program, the computer program including program instructions which, when executed by a processor, cause the processor to perform the method described above.
Brief Description of the Drawings
To explain the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic structural diagram of the system architecture of a video program identification method according to an embodiment of the present application;
FIG. 2a is a schematic flowchart of a video program identification method according to an embodiment of the present application;
FIG. 2b is a schematic flowchart of a video program identification method according to another embodiment of the present application;
FIG. 3 is a schematic diagram of inputting a video program identification instruction according to an embodiment of the present application;
FIG. 4 is a schematic diagram of the principle of acoustic feature extraction according to an embodiment of the present application;
FIG. 5 is a schematic flowchart of establishing a video database according to an embodiment of the present application;
FIG. 6 is a schematic diagram of the principle of establishing voiceprint feature models according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a video program identification apparatus according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a video program identification apparatus according to another embodiment of the present application;
FIG. 9 is a schematic structural diagram of a server according to an embodiment of the present application;
FIG. 10 is a schematic structural diagram of a terminal according to an embodiment of the present application.
Implementations
The technical solutions in the embodiments of the present application are described below with reference to the drawings in the embodiments of the present application.
It should also be understood that the terms used in this specification are for the purpose of describing particular embodiments only and are not intended to limit the present application.
It should be further understood that the term "and/or" used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
In specific implementations, the terminals described in the embodiments of the present application include, but are not limited to, mobile phones, laptop computers, tablet computers and other portable devices having a touch-sensitive surface (for example, a touch-screen display and/or a touch pad). It should also be understood that in some embodiments the device is not a portable communication device but a desktop computer having a touch-sensitive surface (for example, a touch-screen display and/or a touch pad).
In the following discussion, a terminal including a display and a touch-sensitive surface is described. However, it should be understood that the terminal may include one or more other physical user interface devices such as a physical keyboard, a mouse and/or a joystick.
In the prior art, applications with a video program recognition function generally only support recognizing video programs in live television broadcasts, because when identifying a program in a live broadcast, the audio search or recognition range can be narrowed to the limited live audio information of the current time slot, enabling fast identification. Non-live (for example, on-demand) video programs, however, differ from live programs broadcast at fixed times: time information cannot be introduced as a search condition, and if identification were performed against all video programs, the volume of programs to search or identify would be enormous and the identification efficiency very low. Current applications with a video program recognition function therefore do not support recognizing non-live video programs.
How to improve the efficiency of video program identification and meet users' need to identify both live and non-live video programs is a technical problem of current concern.
To better understand the video program identification method and video program identification apparatus provided by the embodiments of the present application, the system architecture to which the video program identification method applies is first described. Referring to FIG. 1, FIG. 1 is a schematic structural diagram of the system architecture of the video program identification method according to an embodiment of the present application, i.e., a schematic structural diagram of the video program identification system provided by an embodiment of the present application. As shown in FIG. 1, the system architecture may include one or more servers, networks 1 to K, and a plurality of terminals (or devices) 11 to 1n ... K1 to Kn connected to each network, wherein:
the server may include, but is not limited to, a background server, a component server, a video program identification system server, etc., and may communicate with the plurality of terminals via the Internet. The server provides video program identification services for the terminals and supports the operation of the video program identification system. A terminal (or device) may install and run a related client (for example, a video program identification client). A client is a program that corresponds to the server and provides local services to the user. Here, the local services may include, but are not limited to: searching for or identifying video programs, obtaining related information of video programs, and so on.
Specifically, the client may include: an application running locally, a function running in a web browser (also known as a Web App), and the like. For the client, a corresponding server-side program needs to run on the server to provide the corresponding services, such as video database services, data computation and decision execution.
In the embodiments of the present application, a user may send audio information collected from a video program to the server for video program identification through the video program identification client installed in the terminal, and the server returns the information of the identified video program to the terminal.
The terminal in the embodiments of the present application may include, but is not limited to, any handheld electronic product based on a smart operating system, which can interact with a user through input devices such as a keyboard, a virtual keyboard, a touch pad, a touch screen or a voice-control device, such as a smartphone, a tablet computer or a personal computer. The smart operating system includes, but is not limited to, any operating system that enriches device functions by providing various mobile applications to the terminal, such as Android™, iOS™ or Windows Phone™.
It should be noted that the system architecture of the video program identification method provided by the present application is not limited to that shown in FIG. 1.
Based on the system architecture shown in FIG. 1, see FIG. 2a, a schematic flowchart of a video program identification method according to an embodiment of the present application. As shown in FIG. 2a, the method is executed by a server and includes the following steps:
Step S101: receive audio information in a video program from a terminal, the audio information including sound information.
Step S102: identify target person information corresponding to the sound information.
Step S103: search a video database for video programs associated with the target person information, the video database storing person information and video programs associated with the person information.
Step S104: search, among the video programs associated with the target person information, for a target video program containing target voice content information; the target voice content information includes information matching the voice content information of the audio information.
See FIG. 2b, a schematic flowchart of a video program identification method according to another embodiment of the present application, involving interaction between a terminal and a server; the method may include the following steps:
Step S200: receive an input video program identification instruction.
Specifically, when the user wants to identify the video program being played, identification can be started by launching the client for video program identification installed in the terminal. After that client has been launched, as shown in FIG. 3, a schematic diagram of inputting a video program identification instruction according to an embodiment of the present application, the user can input a video program identification instruction by shaking the terminal via the "shake the TV" function in the client; at this time, the terminal receives the input video program identification instruction.
It can be understood that FIG. 3 is only one implementation of the embodiments of the present application; the present application does not limit the manner of inputting the video program identification instruction, which may also be input in other ways, such as tapping a virtual button, pressing a physical button, or speaking a voice command.
In the embodiments of the present application, the video program being played may be a video program being played by an electronic device other than the terminal, such as a television or a tablet computer, or a video program being played by the terminal itself. The video programs in the embodiments of the present application include live video programs and non-live video programs.
Step S202: collect audio information in the video program according to the video program identification instruction.
Specifically, after receiving the video program identification instruction, the terminal's client for video program identification can collect the audio information in the video program being played. The audio information in the embodiments of the present application includes sound information, i.e., the sound of a person speaking. The audio information may be a segment of preset duration, for example a segment of 5 to 10 seconds.
Step S204: send the audio information to the server.
Specifically, the terminal's client for video program identification sends the audio information to the server through a network; the server is the video program identification device.
Step S206: receive the audio information in the video program.
Specifically, the server receives, through the network, the audio information in the video program sent by the terminal.
Step S208: identify the target person information corresponding to the sound information.
Specifically, the video database in the server may store at least two voiceprint feature models, each corresponding to one piece of person information. The server may calculate the voiceprint feature of the sound information and then identify, from the voiceprint feature models stored in the video database, a target voiceprint feature model matching the voiceprint feature; the person information corresponding to the matched target voiceprint feature model is the target person information.
In one embodiment of the present application, when at least two voiceprint features are calculated, identifying the target voiceprint feature model matching the voiceprint feature according to the models stored in the video database may include: determining the voiceprint feature that accounts for the largest proportion of the duration of the sound information as the first voiceprint feature, and identifying, from the voiceprint feature models stored in the video database, the model matching the first voiceprint feature as the target voiceprint feature model. The first voiceprint feature is thus the voiceprint feature with the largest share of the duration of the sound information.
Alternatively, for each voiceprint feature, the voiceprint feature model matching it is identified from the models stored in the video database, and the model with the highest matching degree is determined as the target voiceprint feature model.
Specifically, if two or more people converse in the 5-10 second audio segment collected by the terminal, the server calculates at least two voiceprint features. Taking a dialogue between two people as an example, the server may first determine which person speaks for the largest share of the segment's duration, find the voiceprint feature with the largest share, and then identify the target voiceprint feature model matching that dominant voiceprint feature from the models stored in the video database. Alternatively, the server may match both voiceprint features, check which matching degree is higher, find the voiceprint feature model with the highest matching degree, and use it as the target voiceprint feature model. This can further improve the accuracy of identifying the video program.
It can be understood that if two or more voiceprint features tie for the largest share of the duration of the sound information, one of them may be selected at random for matching; likewise, if two or more voiceprint feature models tie for the highest matching degree, one of them may be selected at random as the target voiceprint feature model.
Step S210: search the video database for video programs associated with the target person information.
Specifically, the video database stores person information and the video programs associated with the person information. A video program associated with a piece of person information means that the person takes part in the performance of that video program; for example, if actor A is associated with video program A, actor A starred in video program A. The server may first find the target person information among the pieces of person information stored in the video database, and then look up the video programs associated with the target person information.
Step S212: search, among the video programs associated with the target person information, for a target video program containing target voice content information.
Specifically, the target voice content information in the embodiments of the present application includes information matching the voice content information of the audio information. The voice content information may include acoustic features of the voice content, and the video database stores the acoustic features of the voice content corresponding to each video program. When determining the target voice content information, the server may match the acoustic features of the voice content extracted from the audio information against the acoustic features of the voice content corresponding to the video programs associated with the target person information; determine the successfully matched acoustic features among the video programs associated with the target person information as the acoustic features of the target voice content; and determine the video program corresponding to the acoustic features of the target voice content as the target video program.
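Purely as an illustration (the patent does not prescribe a concrete matching algorithm), the following sketch scores a query feature sequence against the stored sequences of the candidate programs by its best sliding alignment; a production system might instead use dynamic time warping or audio fingerprinting.

```python
import numpy as np

def sequence_distance(query, reference):
    """Smallest mean frame-wise distance of `query` slid along `reference`."""
    q = len(query)
    if len(reference) < q:
        return np.inf
    return min(np.linalg.norm(query - reference[i:i + q], axis=1).mean()
               for i in range(len(reference) - q + 1))

def find_target_program(query_features, candidate_programs):
    """candidate_programs: {program: stored feature sequence}, already
    restricted to programs associated with the target person (step S210)."""
    return min(candidate_programs,
               key=lambda name: sequence_distance(
                   query_features, np.asarray(candidate_programs[name])))

rng = np.random.default_rng(1)
query = rng.normal(size=(50, 13))              # ~50 frames of query features
candidates = {
    "episode_1": rng.normal(size=(500, 13)),
    "episode_2": np.vstack([rng.normal(size=(100, 13)), query,
                            rng.normal(size=(100, 13))]),
}
print(find_target_program(query, candidates))  # -> "episode_2"
```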
It should be noted that the voice content in the embodiments of the present application is what the people say; for example, if the video program is a TV series or a film, the voice content is the actors' dialogue. After step S206, i.e., after the server receives the audio information of the video program, the method further includes a step of extracting the acoustic features of the voice content from the audio information; this step may be performed between steps S206 and S212, or within step S212.
In one embodiment of the present application, the acoustic features of the voice content may be extracted from the audio information as shown in FIG. 4: the stationary audio signal 401 is divided into a plurality of frames 402 through time windows, where the time interval between the starting positions of two adjacent time windows is called the "frame shift" 403, the unit determined by each time window is called a "frame", and its duration is called the frame length. Time-frequency analysis is performed on each divided frame to extract the acoustic features 404 of each frame.
Audio information (which can be regarded as a speech signal) can be considered stationary over short intervals and non-stationary over long ones; within a short interval, generally 10 to 30 milliseconds, it can be treated as a stationary signal. The distribution of the relevant characteristic parameters of the voice content information can be considered consistent within such a short interval (10-30 ms), but changes noticeably over longer times. In digital signal processing, time-frequency analysis is applied to stationary signals to extract features. Therefore, when extracting features from the audio information, a time window of about 20 ms can be set, with the corresponding "frame shift" 403, within which the speech signal can be considered stationary. The window is then slid along the speech signal, and each window position yields a feature that characterizes the signal within that window, giving the voice content information of the audio information, i.e., the acoustic feature sequence of the voice content. This process is called acoustic feature extraction; each feature characterizes the speech signal within its time window. Through the above technique, a segment of speech can be converted into a frame-by-frame feature sequence. The extracted acoustic features can be represented by Mel-Frequency Cepstral Coefficients (MFCC) or Perceptual Linear Prediction (PLP) coefficients.
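For example, the frame-by-frame MFCC sequence can be computed with the librosa library; the library choice and the 13-coefficient, 20 ms window, 10 ms shift settings are assumptions for illustration only.

```python
import librosa

# "clip.wav" stands in for a hypothetical 5-10 s recording from the terminal.
y, sr = librosa.load("clip.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=int(0.020 * sr),       # ~20 ms frame length
                            hop_length=int(0.010 * sr))  # ~10 ms frame shift
print(mfcc.shape)  # (13, n_frames): one 13-dimensional vector per frame
```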
Step S214: send the information of the target video program to the terminal.
Specifically, the information of the target video program may include the name of the target video program, the time when the target video program was completed, and so on. The server may also obtain related information of the target video program and then send this related information to the terminal. The related information includes at least one of the following: profile information, person list information, tidbit information, comment information, episode information, complete video program link information, information of video programs matching the target video program, and the like.
The profile information may be a synopsis or summary of the target video program; the person list information may be information about the actors or performers appearing in the target video program; the tidbit information may be behind-the-scenes material from the shooting of the target video program; the comment information may be comments made by users who have watched the target video program; the episode information may indicate which episode of the target video program is currently playing and the total number of episodes; the complete video program link information may link to all episodes of the target video program; and the matching video program information may be information about other video programs of a type similar to the target video program, or in which one or more of the same people appear.
Step S216: receive and display the information of the target video program sent by the server.
Specifically, after receiving the information of the target video program sent by the server (i.e., the video program identification device), the terminal prompts the user with, or directly displays, that information.
By implementing the embodiments of the present application, after the audio information in a video program is received, the target person information corresponding to the sound information is identified first; the video programs associated with the target person information are then looked up in the video database, which stores person information and the video programs associated with it; and the search for the target video program containing the target voice content information is confined to the video programs associated with the target person information. This improves the efficiency of video program identification and solves the prior-art problem that, with so many video programs, identification against all of them is very inefficient. Compared with matching each piece of audio against the massive volume of video in an entire video library, the embodiments of the present application greatly reduce the search and recognition range, increase the speed of search and recognition, and meet users' need to identify both live and non-live video programs.
Further, as shown in FIG. 5, a schematic flowchart of establishing the video database according to an embodiment of the present application, before step S206 the server may further perform:
Step S500: collect audio information of multiple video programs.
Specifically, the server collects in advance the audio information of a sufficiently large number of video programs; this collected audio information serves as the key data for establishing the video database.
Step S502: analyze the audio information of the multiple video programs to obtain the person information associated with each video program and the voice content information of each video program, the voice content information including acoustic features of the voice content.
Specifically, the server may have all the collected video programs manually annotated in advance, marking the person information (i.e., person identity information) corresponding to every voice content segment in each video program, and then extract from each segment characteristic parameters such as the pitch spectrum and its envelope, the energy of pitch frames, and the occurrence frequency and trajectory of pitch formants; the extracted characteristic parameters are the acoustic features of the voice content.
Step S504: establish an acoustic feature list and store the acoustic feature list in the video database.
Specifically, the acoustic feature list includes the video programs associated with each piece of person information and the acoustic features of that person's corresponding voice content in each video program. That is, the person information associated with each video program can be organized first; the acoustic features of the voice content, composed of characteristic parameters such as the pitch spectrum and envelope, the energy of pitch frames, and the occurrence frequency and trajectory of pitch formants, are then compiled into lists; and finally the information is organized into a mapping table keyed by person information, mapping to the list of all video programs associated with that person, with each video program in turn serving as a key mapping to the list of acoustic features of all of that person's voice content in the program. The acoustic feature list is thus established, as shown in Table 1 below:
[Table 1: the acoustic feature list, mapping each piece of person information to its associated video programs and to the acoustic features of the corresponding voice content; rendered as an image in the original publication.]
Step S506: perform model training using the acoustic features of the voice content to establish multiple voiceprint feature models.
Specifically, each voiceprint feature model corresponds to one piece of person information. As shown in FIG. 6, a schematic diagram of the principle of establishing the voiceprint feature models according to an embodiment of the present application, the server may feed the established acoustic feature list into a Deep Neural Network (DNN) i-vector system (DNN-ivector) to capture speaker characteristics. The main feature of the DNN-ivector system is that the previously extracted acoustic features are aligned according to certain sounding units and projected into a lower-dimensional linear space, where speaker information is then mined. The server then performs model training with this feature information and can establish a mapping table keyed by voiceprint feature model, with person information as the value, as shown in Table 2:
[Table 2: the mapping table from voiceprint feature models to person information; rendered as an image in the original publication.]
It should be noted that the execution order of steps S504 and S506 in the embodiments of the present application is not limited: step S504 may be performed first and then step S506, step S506 may be performed first and then step S504, or steps S504 and S506 may be performed simultaneously.
To better implement the above solutions of the embodiments of the present application, the present application correspondingly provides a video program identification apparatus, described in detail below with reference to the drawings.
As shown in FIG. 7, a schematic structural diagram of a video program identification apparatus according to an embodiment of the present application, the video program identification apparatus 70 may include a first receiving unit 700, an identifying unit 702, a first searching unit 704 and a second searching unit 706, wherein:
the first receiving unit 700 is configured to receive audio information in a video program from a terminal, the audio information including sound information;
the identifying unit 702 is configured to identify target person information corresponding to the sound information;
the first searching unit 704 is configured to search a video database for video programs associated with the target person information, the video database storing person information and video programs associated with the person information; and
the second searching unit 706 is configured to search, among the video programs associated with the target person information, for a target video program containing target voice content information, the target voice content information including information matching the voice content information of the audio information.
在其中的一个实施例中,该视频数据库存储有至少两个声纹特征模型,每个声纹特征模型对应一个人物信息;
识别单元702可以包括:计算单元和第一匹配单元,其中,
计算单元,用于计算该声音信息的声纹特征;
第一匹配单元,用于根据该视频数据库存储的声纹特征模型识别与该声纹特征匹配的目标声纹特征模型;其中,该目标声纹特征模型对应的人物信息为该目标人物信息。
在其中的一个实施例中,在计算出至少两个声纹特征的情况下,该第一匹配单元可以具体用于根据该视频数据库存储的声纹特征模型识别与第一声纹特征匹配的目标声纹特征模型;该第一声纹特征为在该声 音信息的时长中占比最大的声纹特征。
在其中的一个实施例中,在计算出至少两个声纹特征的情况下,该第一匹配单元可以具体用于,针对每个声纹特征,根据该视频数据库存储的声纹特征模型识别出与该声纹特征匹配的声纹特征模型;将匹配度最高的声纹特征模型确定为目标声纹特征模型。
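For illustration, the two selection strategies above might look as follows, assuming the sound information has already been split into per-speaker segments with durations and voiceprint features, and that score(model, feature) returns a matching score; all of these names are assumptions.

    def pick_by_duration(segments, models, score):
        # Strategy 1: match only the voiceprint with the largest total share of
        # the sound information's duration. segments: [(duration, feature), ...]
        dominant = max(segments, key=lambda s: s[0])[1]
        return max(models, key=lambda m: score(m, dominant))

    def pick_by_best_score(segments, models, score):
        # Strategy 2: match every voiceprint feature and keep the model with
        # the highest matching score overall.
        best_model, best = None, float("-inf")
        for _, feature in segments:
            for model in models:
                s = score(model, feature)
                if s > best:
                    best_model, best = model, s
        return best_model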
In one embodiment, the speech content information includes acoustic features of the speech content, and the video database stores the acoustic features of the speech content corresponding to each video program;
the second search unit 706 may specifically include a second matching unit configured to match the acoustic features of the speech content extracted from the audio information against the acoustic features of the speech content corresponding to the video programs associated with the target person information, determine the successfully matched acoustic features among those programs as the acoustic features of the target speech content, and determine the video program corresponding to the acoustic features of the target speech content as the target video program.
In one embodiment, the video program identification apparatus 70 may further include: a collection unit, an analysis unit, a list building unit, a model building unit, an information obtaining unit, and a first sending unit, where:
the collection unit is configured to collect audio information of multiple video programs;
the analysis unit is configured to analyze the audio information of the multiple video programs to obtain the person information associated with each video program and the speech content information of each video program, the speech content information including acoustic features of the speech content;
the list building unit is configured to build an acoustic feature list and store it in the video database, the acoustic feature list including the video programs associated with each piece of person information and the acoustic features of that person's speech content in each video program;
the model building unit is configured to perform model training using the acoustic features of the speech content to build multiple voiceprint feature models, each corresponding to one piece of person information;
the information obtaining unit is configured to obtain related information about the target video program; and
the first sending unit is configured to send the related information to the terminal.
The related information includes at least one of the following:
synopsis information, cast list information, behind-the-scenes information, comment information, episode-count information, link information for the complete video program, and information on video programs matching the target video program.
It should be noted that the video program identification apparatus 70 in this embodiment of the present application is the server (i.e., the video program identification device) in the embodiments of FIG. 1 to FIG. 6 above; for the functions of the modules in the video program identification apparatus 70, reference may be made to the specific implementations of the embodiments of FIG. 1 to FIG. 6 in the foregoing method embodiments, which are not repeated here.
To facilitate better implementation of the above solutions of the embodiments of the present application, the present application further provides another video program identification apparatus, described in detail below with reference to the accompanying drawings:
FIG. 8 is a schematic structural diagram of a video program identification apparatus according to another embodiment of the present application. The video program identification apparatus 80 may include: a second receiving unit 800, an information collection unit 802, a second sending unit 804, and a receiving and display unit 806, where:
the second receiving unit 800 is configured to receive an input video program identification instruction;
the information collection unit 802 is configured to collect audio information in a video program according to the video program identification instruction, the audio information including sound information;
the second sending unit 804 is configured to send the audio information to a server, so that the server finds a target video program based on the audio information; and
the receiving and display unit 806 is configured to receive the information of the target video program from the server and display it.
It should be noted that the video program identification apparatus 80 in this embodiment of the present application is the terminal in the embodiments of FIG. 1 to FIG. 6 above; for the functions of the modules in the video program identification apparatus 80, reference may be made to the specific implementations of the embodiments of FIG. 1 to FIG. 6 in the foregoing method embodiments, which are not repeated here.
To facilitate better implementation of the above solutions of the embodiments of the present application, the present application further provides a server, described in detail below with reference to the accompanying drawings:
FIG. 9 is a schematic structural diagram of a server provided by an embodiment of the present application. The server 90 may include a processor 901, an input unit 902, an output unit 903, a memory 904, and a communication unit 905, which may be connected to one another through a bus 906. The memory 904 may be a high-speed RAM memory or a non-volatile memory, for example at least one disk memory; the memory 904 includes the flash memory of this embodiment of the present application. The memory 904 may optionally also be at least one storage system located remotely from the aforementioned processor 901. The memory 904 is configured to store application program code, which may include an operating system, a network communication module, a user interface module, and a video program identification program; the communication unit 905 is configured to exchange information with external units; and the processor 901 is configured to call the program code and perform the following steps:
receiving, through the communication unit 905, audio information in a video program from a terminal, the audio information including sound information;
identifying the target person information corresponding to the sound information;
looking up, in a video database, video programs associated with the target person information, the video database storing person information and the video programs associated with it; and
searching, among the video programs associated with the target person information, for a target video program containing target speech content information, the target speech content information including information matching the speech content information of the audio information.
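Purely for illustration, these four processor steps could be exposed as a single HTTP endpoint, as in the sketch below; the Flask framework, the route, and the stubbed helpers are all assumptions standing in for the components described above.

    from flask import Flask, request, jsonify

    app = Flask(__name__)

    def identify_person(audio_bytes):
        # Stub for voiceprint-based identification of the target person.
        return "actor a"

    def find_programs(person):
        # Stub for the video-database lookup of programs associated with the person.
        return ["program a", "program b"]

    def find_target_program(audio_bytes, programs):
        # Stub for the speech-content match confined to those programs.
        return programs[0] if programs else None

    @app.route("/identify", methods=["POST"])
    def identify():
        audio = request.files["audio"].read()   # audio information from the terminal
        person = identify_person(audio)
        programs = find_programs(person)
        target = find_target_program(audio, programs)
        return jsonify({"program": target})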
Specifically, the video database stores at least two voiceprint feature models, each corresponding to one piece of person information; the processor 901 identifying the target person information corresponding to the sound information may include:
computing the voiceprint feature of the sound information; and
identifying, based on the voiceprint feature models stored in the video database, a target voiceprint feature model matching the voiceprint feature, where the person information corresponding to the target voiceprint feature model is the target person information.
Specifically, when at least two voiceprint features are computed, the processor 901 identifying, based on the voiceprint feature models stored in the video database, a target voiceprint feature model matching the voiceprint feature may include:
identifying, based on the voiceprint feature models stored in the video database, a target voiceprint feature model matching a first voiceprint feature, the first voiceprint feature being the voiceprint feature that accounts for the largest share of the duration of the sound information.
Specifically, when at least two voiceprint features are computed, the processor 901 identifying, based on the voiceprint feature models stored in the video database, a target voiceprint feature model matching the voiceprint feature may alternatively include:
identifying, for each voiceprint feature, a voiceprint feature model matching that voiceprint feature based on the voiceprint feature models stored in the video database; and
determining the voiceprint feature model with the highest matching score as the target voiceprint feature model.
Specifically, the speech content information includes acoustic features of the speech content, and the video database stores the acoustic features of the speech content corresponding to each video program;
the processor 901 searching, among the video programs associated with the target person information, for a target video program containing target speech content information may include:
matching the acoustic features of the speech content extracted from the audio information against the acoustic features of the speech content corresponding to the video programs associated with the target person information; determining the successfully matched acoustic features among those programs as the acoustic features of the target speech content; and determining the video program corresponding to the acoustic features of the target speech content as the target video program.
Specifically, before receiving the audio information in the video program, the processor 901 may further perform:
collecting audio information of multiple video programs through the communication unit 905;
analyzing the audio information of the multiple video programs to obtain the person information associated with each video program and the speech content information of each video program, the speech content information including acoustic features of the speech content; and
building an acoustic feature list and storing it in the video database, the acoustic feature list including the video programs associated with each piece of person information and the acoustic features of that person's speech content in each video program.
Specifically, after extracting the acoustic features of the speech content of each video program, the processor 901 may further perform:
performing model training using the acoustic features of the speech content to build multiple voiceprint feature models, each corresponding to one piece of person information.
Specifically, after searching, among the video programs associated with the target person information, for a target video program containing target speech content information, the processor 901 may further perform:
obtaining related information about the target video program through the communication unit 905; and
sending the related information to the terminal through the communication unit 905.
The related information includes at least one of the following:
synopsis information, cast list information, behind-the-scenes information, comment information, episode-count information, link information for the complete video program, and information on video programs matching the target video program.
It should be noted that the server 90 in this embodiment of the present application is the server in the embodiments of FIG. 1 to FIG. 6 above; for details, reference may be made to the specific implementations of the embodiments of FIG. 1 to FIG. 6 in the foregoing method embodiments, which are not repeated here.
To facilitate better implementation of the above solutions of the embodiments of the present application, the present application further provides a terminal, described in detail below with reference to the accompanying drawings:
FIG. 10 is a schematic structural diagram of a terminal provided by an embodiment of the present application. The terminal 10 may include: a baseband chip 100, a memory 105 (one or more computer-readable storage media), a communication module 106, and a peripheral system 107. These components may communicate over one or more communication buses 104.
The peripheral system 107 mainly implements the interaction between the terminal 10 and the user or the external environment, and mainly comprises the input and output devices of the terminal 10. In a specific implementation, the peripheral system 107 may include a touchscreen controller, a camera controller, an audio controller, and a sensor management module, and each controller may be coupled to its corresponding peripheral device (such as the touch display screen 108, the camera 109, the audio circuit 1010, and the sensor 1011). It should be noted that the peripheral system 107 may also include other I/O peripherals.
The baseband chip 100 may integrate: one or more processors 101, a clock module 102, and a power management module 103. The clock module 102 integrated in the baseband chip 100 is mainly used to generate the clocks needed by the processor 101 for data transmission and timing control. The power management module 103 integrated in the baseband chip 100 is mainly used to provide stable, high-precision voltages for the processor 101, the radio frequency module (communication module 106), and the peripheral system.
The communication module 106 is used to receive and transmit radio frequency signals, includes a subscriber identification module (SIM) card 1061 and a wireless fidelity (Wi-Fi) module 1062, and mainly integrates the receiver and transmitter of the terminal 10. The communication module 106 communicates with communication networks and other communication devices through radio frequency signals. In a specific implementation, the communication module 106 may include, but is not limited to: an antenna system, a radio frequency (RF) transceiver, one or more amplifiers, a tuner, one or more oscillators, a digital signal processor, a codec (CODEC) chip, a SIM card, and a storage medium. In some embodiments, the communication module 106 may be implemented on a separate chip.
The memory 105 is coupled to the processor 101 and is used to store various software programs and/or multiple sets of instructions. In a specific implementation, the memory 105 may include high-speed random access memory and may also include non-volatile memory, such as one or more disk storage devices, flash memory devices, or other non-volatile solid-state storage devices. The memory 105 may store an operating system. The memory 105 may also store a network communication program that can be used to communicate with one or more additional devices, one or more terminal devices, and one or more network devices. The memory 105 may also store a user interface program that can vividly present the content of an application program through a graphical interface and receive the user's control operations on the application program through input controls such as menus, dialog boxes, and buttons.
The memory 105 may also store one or more application programs. As shown in FIG. 10, these application programs may include: social applications (e.g., Facebook™), a video program identification application, map applications (e.g., Google Maps), browsers (e.g., Safari™, Google Chrome™), and so on.
In the present application, the processor 101 may be used to read and execute computer-readable instructions. Specifically, the processor 101 may be used to call a program stored in the memory 105, such as the video program identification application provided by the present application, and execute the instructions it contains, including the following steps:
receiving an input video program identification instruction through the touch display screen 108, or receiving an input video program identification instruction through a vibration sensor;
collecting audio information in a video program according to the video program identification instruction, the audio information including sound information;
sending the audio information to a video program identification device through the communication module 106, so that the video program identification device identifies and finds the target video program according to the methods of the embodiments of FIG. 1 to FIG. 6 above; and
receiving, through the communication module 106, the information of the target video program sent by the video program identification device, and displaying it through the touch display screen 108.
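A minimal sketch of this terminal-side flow follows: record a few seconds of program audio and send it to an identification endpoint. The sounddevice and requests libraries, and the endpoint URL, are illustrative assumptions.

    import io

    import requests
    import sounddevice as sd
    from scipy.io import wavfile

    def identify_program(seconds=5, sr=16000,
                         url="https://api.example.com/identify"):
        audio = sd.rec(int(seconds * sr), samplerate=sr, channels=1)
        sd.wait()                          # block until recording finishes
        buf = io.BytesIO()
        wavfile.write(buf, sr, audio)      # package the recording as WAV in memory
        resp = requests.post(url, files={"audio": buf.getvalue()})
        return resp.json()                 # target video program info to display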
It should be noted that the terminal 10 in this embodiment of the present application is the terminal in the embodiments of FIG. 1 to FIG. 6 above; for details, reference may be made to the specific implementations of the embodiments of FIG. 1 to FIG. 6 in the foregoing method embodiments, which are not repeated here. The structure of the terminal 10 above is merely an example provided by this embodiment of the present application; the terminal 10 may have more or fewer components than shown, may combine two or more components, or may have a different configuration of components.
By implementing this embodiment of the present application, after the audio information in a video program is received, the target person information corresponding to the sound information is identified first; the video programs associated with that target person information are then looked up in a video database, which stores person information and the video programs associated with it; and the search for the target video program containing the target speech content information is confined to the video programs associated with the target person information. This improves the efficiency of video program identification and solves the prior-art problem that, because there are so many video programs, identification against all of them is very inefficient. Compared with matching against every audio segment of the massive number of videos in an entire video library, this embodiment greatly narrows the search scope, speeds up search and identification, and meets users' need to identify both live and non-live video programs.
Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments may be implemented by a computer program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
What is disclosed above is merely preferred embodiments of the present application and certainly cannot be used to limit the scope of the claims of the present application; equivalent changes made according to the claims of the present application therefore still fall within the scope covered by the present application.

Claims (14)

  1. A video program identification method, performed by a server, comprising:
    receiving, from a terminal, audio information in a video program, the audio information comprising sound information;
    identifying target person information corresponding to the sound information;
    looking up, in a video database, video programs associated with the target person information, the video database storing person information and video programs associated with the person information; and
    searching, among the video programs associated with the target person information, for a target video program containing target speech content information, the target speech content information comprising information matching speech content information of the audio information.
  2. The method according to claim 1, wherein the video database stores at least two voiceprint feature models, each voiceprint feature model corresponding to one piece of person information; and
    the identifying target person information corresponding to the sound information comprises:
    computing a voiceprint feature of the sound information; and
    identifying, based on the voiceprint feature models stored in the video database, a target voiceprint feature model matching the voiceprint feature, wherein person information corresponding to the target voiceprint feature model is the target person information.
  3. The method according to claim 2, wherein, when at least two voiceprint features are computed, the identifying, based on the voiceprint feature models stored in the video database, a target voiceprint feature model matching the voiceprint feature comprises:
    identifying, based on the voiceprint feature models stored in the video database, a voiceprint feature model matching a first voiceprint feature as the target voiceprint feature model, the first voiceprint feature being the voiceprint feature that accounts for the largest share of the duration of the sound information.
  4. The method according to claim 2, wherein, when at least two voiceprint features are computed, the identifying, based on the voiceprint feature models stored in the video database, a target voiceprint feature model matching the voiceprint feature comprises:
    identifying, for each voiceprint feature, a voiceprint feature model matching that voiceprint feature based on the voiceprint feature models stored in the video database; and
    determining the voiceprint feature model with the highest matching score as the target voiceprint feature model.
  5. The method according to claim 1, wherein the speech content information comprises acoustic features of speech content, and the video database stores acoustic features of speech content corresponding to video programs; and
    the searching, among the video programs associated with the target person information, for a target video program containing target speech content information comprises:
    matching acoustic features of speech content extracted from the audio information against the acoustic features of speech content corresponding to the video programs associated with the target person information;
    determining successfully matched acoustic features among the video programs associated with the target person information as acoustic features of target speech content; and
    determining a video program corresponding to the acoustic features of the target speech content as the target video program.
  6. The method according to claim 1, further comprising:
    collecting audio information of multiple video programs;
    analyzing the audio information of the multiple video programs to obtain person information associated with each video program and speech content information of each video program, the speech content information comprising acoustic features of speech content; and
    building an acoustic feature list and storing the acoustic feature list in the video database, the acoustic feature list comprising the video programs associated with each piece of person information and the acoustic features of that person's speech content in each video program.
  7. The method according to claim 6, further comprising:
    performing model training using the acoustic features of the speech content to build multiple voiceprint feature models, each voiceprint feature model corresponding to one piece of person information.
  8. The method according to any one of claims 1-7, further comprising:
    obtaining related information about the target video program and sending the related information to the terminal.
  9. The method according to claim 8, wherein the related information comprises at least one of the following:
    synopsis information, cast list information, behind-the-scenes information, comment information, episode-count information, link information for the complete video program, and information on video programs matching the target video program.
  10. A video program identification method, performed by a terminal, comprising:
    receiving an input video program identification instruction;
    collecting audio information in a video program according to the video program identification instruction;
    sending the audio information to a server, so that the server finds a target video program according to the method of any one of claims 1-9; and
    receiving, from the server, and displaying information of the target video program.
  11. A video program identification device, comprising a processor, an input device, an output device, a memory, and a communication device, which are connected to one another, wherein the memory is configured to store application program code, the communication device is configured to exchange information with external devices, and the processor is configured to call the program code to perform the method according to any one of claims 1-9.
  12. A terminal, comprising a processor, an input device, an output device, a memory, and a communication device, which are connected to one another, wherein the memory is configured to store application program code, the communication device is configured to exchange information with external devices, and the processor is configured to call the program code to perform the method according to claim 10.
  13. A video program identification system, comprising a terminal and a server, wherein the terminal comprises the terminal according to claim 12, and the server comprises the video program identification device according to claim 11.
  14. A computer-readable storage medium storing computer-readable instructions that can cause at least one processor to perform the method according to any one of claims 1 to 10.
PCT/CN2018/116686 2017-11-22 2018-11-21 Video program identification method, device, terminal, system and storage medium WO2019101099A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201711180259.9A CN108322770B (zh) 2017-11-22 2017-11-22 Video program identification method, related apparatus, device and system
CN201711180259.9 2017-11-22

Publications (1)

Publication Number Publication Date
WO2019101099A1 true WO2019101099A1 (zh) 2019-05-31

Family

ID=62891439

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/116686 WO2019101099A1 (zh) 2017-11-22 2018-11-21 Video program identification method, device, terminal, system and storage medium

Country Status (2)

Country Link
CN (1) CN108322770B (zh)
WO (1) WO2019101099A1 (zh)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108322770B (zh) 2017-11-22 2020-02-18 腾讯科技(深圳)有限公司 Video program identification method, related apparatus, device and system
CN112261436B (zh) * 2019-07-04 2024-04-02 青岛海尔多媒体有限公司 Method, apparatus and system for video playback
CN110505504B (zh) * 2019-07-18 2022-09-23 平安科技(深圳)有限公司 Video program processing method and apparatus, computer device, and storage medium
CN110996021A (zh) * 2019-11-30 2020-04-10 咪咕文化科技有限公司 Broadcast directing switching method, electronic device, and computer-readable storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101506828A (zh) * 2006-06-09 2009-08-12 索尼爱立信移动通讯股份有限公司 Media identification
US20110289098A1 (en) * 2010-05-19 2011-11-24 Google Inc. Presenting mobile content based on programming context
CN105142018A (zh) * 2015-08-12 2015-12-09 深圳Tcl数字技术有限公司 Audio-fingerprint-based program identification method and apparatus
CN105874454A (zh) * 2013-12-31 2016-08-17 谷歌公司 Methods, systems, and media for generating search results based on contextual information
CN105868684A (zh) * 2015-12-10 2016-08-17 乐视网信息技术(北京)股份有限公司 Video information acquisition method and apparatus
CN106254939A (zh) * 2016-09-30 2016-12-21 北京小米移动软件有限公司 Information prompting method and apparatus
CN108322770A (zh) * 2017-11-22 2018-07-24 腾讯科技(深圳)有限公司 Video program identification method, related apparatus, device and system


Also Published As

Publication number Publication date
CN108322770B (zh) 2020-02-18
CN108322770A (zh) 2018-07-24

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
  Ref document number: 18880595; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase
  Ref country code: DE
122 Ep: pct application non-entry in european phase
  Ref document number: 18880595; Country of ref document: EP; Kind code of ref document: A1