WO2013189156A1 - Video search system, method and video search server based on natural interaction input - Google Patents


Info

Publication number
WO2013189156A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
semantic
user
search
data
Application number
PCT/CN2012/086283
Other languages
French (fr)
Chinese (zh)
Inventor
王勇进
张瑞
张钰林
Original Assignee
海信集团有限公司 (Hisense Group Co., Ltd.)
Application filed by 海信集团有限公司 (Hisense Group Co., Ltd.)
Publication of WO2013189156A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content

Definitions

  • the present invention relates to the field of video search technology, and more particularly to a video search system and method based on natural interactive input (e.g., voice input), and a video search server.
  • Another object of the present invention is to provide a video search method based on natural interactive input, which can realize intelligent perception of a user's video target task and provide a more natural and friendly user experience.
  • Still another object of the present invention is to provide a video search server having natural language semantic analysis capabilities and intelligent video search capabilities.
  • the video search system based on natural interaction input provided by the embodiment of the present invention includes a user end and a video search server.
  • the user end includes a voice collection module and a human machine interface, and the voice collection module collects voice input of the user to generate user voice data and provide the same to the human machine interface.
  • the video search server comprises a control module, a speech recognition module, a natural language processing module, a video relational database and a video search module; the video relational database stores a video semantic space and a semantic descriptor sub-set of the video text data in the video semantic space.
  • the control module receives the user voice data provided by the human machine interface of the user end and provides it to the speech recognition module to obtain user text data, provides the user text data to the natural language processing module to obtain user text semantic analysis result data, and uses the user text semantic analysis result data to perform a pre-search in the video relational database to obtain a video pre-search result.
  • the video pre-search result includes a subset of semantic descriptions of the associated video text data that matches the user text semantic analysis result data in the video semantic space.
  • the video search module receives the user text semantic analysis result data and the video pre-search result provided by the control module, performs similarity comparisons between the semantic descriptor of the user text semantic analysis result data in the video semantic space and the semantic descriptors included in the video pre-search result, and, according to the comparison results, outputs the video final search result to the control module, which then provides it to the human machine interface for presentation to the user.
  • a video search method based on natural interactive input includes the following steps: (a) collecting the user's natural interaction input to obtain user text data; (b) performing natural language semantic analysis on the user text data to obtain user text semantic analysis result data; (c) performing a pre-search with the user text semantic analysis result data to obtain a video pre-search result, which includes the semantic descriptor sub-set, in the video semantic space, of the related video text data matching the user text semantic analysis result data; (d) projecting the user text semantic analysis result data into the video semantic space and performing similarity comparisons with the semantic descriptors included in the video pre-search result to output the video final search result; and (e) presenting the video final search result to the user.
  • a video search method based on voice input includes the following steps: (1) quantizing the video text semantic analysis result data obtained by performing natural language semantic analysis on the collected video text data, training on it based on latent semantic indexing to obtain the video semantic space, and obtaining the semantic descriptor sub-set of the collected video text data in the video semantic space; (2) collecting the user's natural interaction input to obtain user text data; (3) performing natural language semantic analysis on the user text data to obtain user text semantic analysis result data; (4) using the semantic descriptor of the user text semantic analysis result data in the video semantic space to perform similarity comparisons within the semantic descriptor sub-set of at least part of the collected video text data, and outputting the video final search result; and (5) presenting the video final search result to the user.
  • a video search server includes: a video relational database, a natural language processing module, a control module, and a video search module.
  • the video relational database stores the video semantic space and the semantic descriptor sub-set of the video text data in the video semantic space.
  • the control module provides the user text data representing the user's video requirement to the natural language processing module to obtain user text semantic analysis result data; the video search module obtains the semantic descriptor of the user text semantic analysis result data in the video semantic space, and uses that descriptor to perform similarity comparisons within the semantic descriptor sub-set of at least part of the video text data to output the video final search result to the control module.
  • the video search system and method based on natural interactive input, and the video search server, in the above embodiments of the present invention have at least one or more of the following advantages: they are guided by the user's video target task and allow the user to interact using natural language; through natural language processing technology and reasoning over a video-related knowledge base, the user can quickly obtain related videos from the database by providing only a simple description of the video content, thereby realizing intelligent perception of the user's video target task; they realize a natural and friendly human-computer interaction mode and interface, with the ability to continuously learn and upgrade; therefore, they can effectively enhance the user experience.
  • FIG. 1 is a schematic diagram of a video search system architecture based on natural interactive input (eg, voice input) according to an embodiment of the present invention.
  • FIG. 2 is a schematic diagram of the modules of the user end shown in FIG. 1.
  • FIG. 3 is a schematic diagram of the modules of the video search server shown in FIG. 1.
  • FIG. 4 is a flowchart of a video search method based on voice input according to an embodiment of the present invention.
  • FIG. 5 is a flowchart of another video search method based on voice input according to an embodiment of the present invention.
  • FIG. 1 is a schematic structural diagram of a video search system based on natural interactive input (for example, voice input) according to an embodiment of the present invention.
  • the voice input based video search system 100 of the present embodiment includes a client 10 and a video search server 30; the client 10 receives the user's voice input and generates user voice data, and the video search server 30 performs a video search according to the user voice data and returns the video final search result to the client 10 for presentation to the user.
  • one video search server 30 may serve multiple clients 10, responding to the user voice data of each client 10 separately and returning the corresponding video final search result.
  • FIG. 2 is a schematic diagram of a module of the client 10 according to an embodiment of the present invention.
  • the client 10 includes, for example, a voice collection module 11 and a human machine interface 13.
  • the voice collection module 11 collects the user's voice input and generates user voice data, which is transmitted to the video search server 30 through the human machine interface 13.
  • the tasks of the human machine interface 13 include, for example, human-computer interaction, user information recording, and user authentication.
  • two usage modes may be provided for the user, such as a public mode and a privacy mode. The video search server 30 may perform the video search with user authentication either enabled or skipped, so that the user's personal information is protected and suitable video search results can be provided to users of different ages.
  • the client 10 is, for example, a smart TV with a TV remote controller (having an Internet function), a desktop computer, a notebook computer, a smart phone, or the like. When the client 10 is a smart TV with a TV remote controller, the voice collection module 11 may be a microphone built into the TV remote controller, and the human machine interface 13 may be a Hypertext Transfer Protocol (HTTP) web service running on the smart TV (for example, on port 80), which transmits the user voice data output by the microphone to the video search server 30 for video search and may subsequently display the video final search result to the user. It can be understood that the user voice data may first be compressed before being transmitted to the video search server 30.
  • FIG. 3 is a block diagram of a video search server 30 according to an embodiment of the present invention.
  • the video search server 30 includes a control module 31, a speech recognition module 33, a natural language processing module 35, a video data collection module 36, a video relational database 37, a semantic space learning module 38, a video search module 39, and a server management module 32. It is noted here that each module in the video search server 30 can be implemented in hardware and/or software according to actual design needs; further, the video search server 30 can be a single server or a cluster of multiple servers, plus the necessary peripheral components.
  • the video search server 30 includes two parts, an online part and an offline part. The online part is mainly composed of the control module 31, the speech recognition module 33, the natural language processing module 35, and the video search module 39. The offline part is mainly composed of the video data collection module 36, the video relational database 37, and the semantic space learning module 38, and shares the natural language processing module 35 with the online part.
  • the control module 31 serves as the scheduling center of the entire video search server 30; it receives the user voice data transmitted by the client 10 (for example, over a wired or wireless network connection) and finally returns the video final search result as output to the client 10.
  • the control module 31 first verifies the identity of the user, and determines according to the authentication result whether to perform the video search and/or whether the video final search result needs to be filtered before being returned.
  • the speech recognition module 33 is for speech recognition of speech data for conversion to corresponding textual data, which is typically coupled to a speech library (not shown in Figure 3) for speech instruction matching operations.
  • the voice recognition module 33 can convert the user voice data provided by the control module 31 into user text data representing the user's video requirements and return it to the control module 31.
  • the natural language processing module 35 is adapted to perform semantic analysis on text data (such as user text data, video text data, etc.), for example, to perform Chinese semantic analysis: including word segmentation, part-of-speech tagging, named entity analysis, and the like.
  • the natural language processing module 35 can also perform semantic analysis on texts in other languages, not only Chinese but also English and so on; it only needs to be provided with semantic libraries for the different languages.
  • the natural language processing module 35 can perform semantic analysis on the user text data provided by the control module 31 to return the user text semantic analysis result data to the control module 31.
  • the user text semantic analysis result data can be understood as user text data after the operation of word segmentation, part-of-speech tagging, and the like.
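As a minimal sketch of what such semantic analysis result data might look like, the snippet below shows a hypothetical segmented and tagged representation of a user query; the tag set, tokenization, and field names are illustrative assumptions, not the patent's actual data format:

```python
# Hypothetical user text semantic analysis result: word segmentation plus
# part-of-speech / named-entity tags. Tags here are illustrative only.
user_text = "张艺谋 导演 的 战争 片"   # "war film directed by Zhang Yimou"
analysis_result = [
    ("张艺谋", "PERSON"),    # named entity: a director name
    ("导演",   "NOUN"),
    ("的",     "PARTICLE"),
    ("战争",   "NOUN"),
    ("片",     "NOUN"),
]
# Named-entity analysis lifts the person name into a structured field that
# a later classified pre-search (e.g., director name search) could use.
entities = [tok for tok, tag in analysis_result if tag == "PERSON"]
print(entities)   # → ['张艺谋']
```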
  • the video data collection module 36 is configured to collect video data and provide video text data. The video text data may be text data searched from the network (including film and television program providing partners), such as a video name, an alias, a director name, an actor name, a video production date, a video theme type (such as war film, comedy film, etc.), a video region (such as China, the United States, etc.) or language (such as Chinese, English, etc.), a video category (such as movie, TV show, etc.), and a data validity tag.
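The metadata fields listed above can be pictured as one record per video; the sketch below is a hypothetical shape for such a record (the field names are illustrative assumptions, not specified by the patent):

```python
# Hypothetical collected video text record; field names are illustrative.
video_record = {
    "name": "Example Film",
    "alias": "An Example",
    "director": "Director A",
    "actors": ["Actor A", "Actor B"],
    "production_date": "2011",
    "theme_type": "war film",     # e.g. war film, comedy film
    "region": "China",
    "language": "Chinese",
    "category": "movie",          # movie, TV show, ...
    "valid": True,                # data validity tag
}
# The text fields are what the natural language processing module would
# later segment and tag before storage in the video relational database.
searchable_text = " ".join([video_record["name"], video_record["director"],
                            video_record["theme_type"], video_record["region"]])
print(searchable_text)
```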
  • the video data collection module 36 can operate in a periodic automatic collection mode or a manually triggered collection mode.
  • the video text data provided by the video data collection module 36 is first transmitted to the natural language processing module 35 for natural language semantic analysis, forming the video text semantic analysis result data, which is stored in the video relational database 37. It can be understood that the video text data provided by the video data collection module 36 may instead be stored in the video relational database 37 first, after which the natural language processing module 35 performs the word segmentation, part-of-speech tagging, and other semantic analysis operations on the video text data stored there. The video text semantic analysis result data can be understood as the result data after word segmentation and part-of-speech tagging of the video text data.
  • the video relational database 37 serves as the data source from which the video search server 30 performs video searches, and includes data tables such as a video data table, a backup data table, a user table, and a query record table.
  • the video data table stores, for example, semantically analyzed video text data
  • the backup data table stores, for example, duplicated and culled data
  • the user table stores, for example, user data
  • the query record table, for example, saves the user's video search records.
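The four tables named above can be sketched as a minimal relational schema; the column names below are illustrative assumptions (the patent does not specify a schema), shown here with an in-memory SQLite database:

```python
import sqlite3

# Hypothetical minimal schema for the four tables of the video relational
# database; all column names are assumptions for illustration only.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE video_data   (id INTEGER PRIMARY KEY, text TEXT, semantics TEXT);
CREATE TABLE backup_data  (id INTEGER PRIMARY KEY, text TEXT);
CREATE TABLE users        (id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
CREATE TABLE query_record (id INTEGER PRIMARY KEY, user_id INTEGER, query TEXT);
""")
tables = sorted(r[0] for r in con.execute(
    "SELECT name FROM sqlite_master WHERE type='table'"))
print(tables)   # → ['backup_data', 'query_record', 'users', 'video_data']
```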
  • the semantic space learning module 38 is the main machine learning component of the voice input based video search system 100. It is primarily responsible for quantizing the video text data in the video relational database 37, analyzing and learning the main semantics of the video relational database 37 based on latent semantic indexing (LSI) to obtain the video semantic space, and finding the semantic descriptor sub-set of the collected video text data in the video semantic space (i.e., its projection set in the video semantic space), which is stored in the video relational database 37.
  • the process of establishing the video semantic space may be as follows: the semantic space learning module 38 uses the video text semantic analysis result data stored in the video relational database 37 as a training sample set, from which a vocabulary containing a large number of useful words is established. Using this vocabulary, each video text data item (i.e., each video description) can be quantized and represented by a vector; each element of the vector is the number of occurrences of one vocabulary word in that video text data, so the vector is the word-frequency vector of the video text data.
  • in the linear space to which the word-frequency vectors belong, some special directions can be calculated; the vectors representing these special directions form an orthonormal set and span a new linear space.
  • the special physical meaning of this set of vectors is that each of them represents certain words that often appear together in a specific context. Each such specific context corresponds to a semantic topic; that is, the simultaneous appearance of certain words expresses a semantic. Only the part of this set of special vectors with a very high degree of semantic discrimination is preserved, and these preserved vectors ultimately constitute the video semantic space.
  • the video text data in the video relational database 37 will find a projection in the video semantic space, i.e., a semantic descriptor of the video text data in the video semantic space.
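The construction described above — word-frequency vectors, an orthonormal set of "semantic topic" directions found by latent semantic indexing, and projections as semantic descriptors — can be sketched with a toy corpus. This is a minimal LSI illustration using a plain SVD, not the patent's actual training procedure; the corpus and the choice of k are assumptions:

```python
import numpy as np

# Toy corpus: each string is the segmented text metadata of one video.
videos = [
    "war film epic battle soldier",
    "comedy film funny romance",
    "battle soldier history war",
    "romance comedy love funny",
]

# Vocabulary and word-frequency matrix (one row per video).
vocab = sorted({w for doc in videos for w in doc.split()})
tf = np.array([[doc.split().count(w) for w in vocab] for doc in videos],
              dtype=float)

# Latent semantic indexing via SVD: keep only the k most discriminative
# orthonormal directions ("semantic topics"); they span the semantic space.
k = 2
U, s, Vt = np.linalg.svd(tf, full_matrices=False)
semantic_axes = Vt[:k]                # retained orthonormal row vectors

# The semantic descriptor of each video is its projection onto those axes.
descriptors = tf @ semantic_axes.T    # shape: (number of videos, k)
print(descriptors.shape)
```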
  • the video search module 39 is connected to the control module 31 and the video relational database 37. It receives the user text semantic analysis result data provided by the control module 31, acquires the video semantic space from the video relational database 37 (for example, information such as the coordinate axes of the semantic space), and projects the user text semantic analysis result data into the video semantic space to obtain the projection (i.e., the semantic descriptor) of the user text data in the video semantic space. Subsequently, the video search module 39 can use this semantic descriptor to perform the video search operation.
  • the video search operation of the video search module 39 in the embodiment of the present invention may proceed as follows: first, the control module 31 performs a video pre-search in the video relational database 37 using the user text semantic analysis result data (that is, the semantically analyzed user text data), for example a classified search: a video director name search, a video actor name search, a video production date search, a video theme type search, a video region or language type search, a video category search, and so on. This reduces the workload of the video search module 39 and improves search efficiency.
  • the video pre-search result includes, for example, the set of semantic descriptors, in the video semantic space, of the related video text data matching the user text data; this semantic descriptor set is provided to the video search module 39 together with the user text semantic analysis result data.
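The classified pre-search above can be sketched as filtering on structured metadata fields before any semantic-space comparison; the field names and data are hypothetical, not from the patent:

```python
# Hypothetical classified pre-search: filter the video table on structured
# fields (director, theme, region, ...) extracted from the user text
# semantic analysis result, to shrink the candidate set cheaply.
videos = [
    {"title": "Movie A", "director": "Director A", "theme": "war",    "region": "China"},
    {"title": "Movie B", "director": "Director B", "theme": "comedy", "region": "USA"},
    {"title": "Movie C", "director": "Director A", "theme": "comedy", "region": "China"},
]

def pre_search(table, **criteria):
    """Keep only the rows matching every given field exactly."""
    return [row for row in table
            if all(row.get(field) == value for field, value in criteria.items())]

candidates = pre_search(videos, director="Director A", theme="comedy")
print([row["title"] for row in candidates])   # → ['Movie C']
```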
  • the video search module 39 compares the semantic descriptor of the user text data in the video semantic space with the semantic descriptors, in the video semantic space, of the related video text data included in the video pre-search result to obtain the video final search result, which is transmitted to the control module 31 and then provided by the control module 31 to the human machine interface 13 of the client 10 for presentation to the user.
  • the similarity comparison can be realized by calculating the Euclidean distance, but the present invention is not limited thereto, and other methods for calculating the similarity between projections in the semantic space can be employed.
  • the final search result of the video here may be a list of videos sorted according to the degree of similarity.
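The Euclidean-distance ranking described above can be sketched as follows; the two-dimensional descriptors are made-up values standing in for projections in the video semantic space:

```python
import numpy as np

# Hypothetical 2-D semantic descriptors: one for the user query, one per
# pre-searched video. Smaller Euclidean distance = higher similarity.
query_desc = np.array([0.9, 0.1])
video_descs = {
    "Movie A": np.array([0.8, 0.2]),
    "Movie B": np.array([0.1, 0.9]),
    "Movie C": np.array([0.7, 0.0]),
}

# Final search result: a video list sorted by ascending distance.
ranked = sorted(video_descs,
                key=lambda name: float(np.linalg.norm(video_descs[name] - query_desc)))
print(ranked)   # → ['Movie A', 'Movie C', 'Movie B']
```

Cosine similarity or any other projection-distance measure could be substituted here, matching the statement that the invention is not limited to Euclidean distance.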
  • in the above, the semantic space search is performed over only part of the video text data's semantic descriptor sub-set in the video semantic space. Alternatively, the video pre-search may be omitted, and the semantic descriptor of the user text data in the video semantic space may be used directly to perform the semantic space search over the entire semantic descriptor sub-set in order to output the video final search result.
  • the server management module 32 is disposed in the video search server 30 as a module that is not user-oriented.
  • the voice recognition module 33 of the above embodiment of the present invention can also be integrated into the client 10 instead of the video search server 30, so that the client 10 converts the user voice data into user text data before transmitting it to the video search server 30.
  • a video search method based on voice input mainly includes steps S400-S410:
  • S400 collecting voice input of the user to generate user voice data
  • S402 Perform speech recognition on the user voice data to obtain user text data;
  • S404 Perform natural language semantic analysis on the user text data to obtain user text semantic analysis result data;
  • S406 Perform a pre-search (for example, the aforementioned classified search) using the user text semantic analysis result data to obtain a video pre-search result, where the video pre-search result includes the semantic descriptor sub-set, in the video semantic space, of the related video text data matching the user text semantic analysis result data;
  • S408 Project the user text semantic analysis result data into the video semantic space and perform similarity comparisons with the semantic descriptors included in the video pre-search result to output the video final search result (for example, a video list sorted according to similarity score); and
  • S410 Present the final search result of the video to the user.
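The steps above can be tied together in a minimal end-to-end sketch; every function below is a hypothetical stub standing in for the corresponding module, not the patent's implementation:

```python
# Stubbed pipeline for the voice-input video search method.
def collect_voice():                    # S400: collect user voice input
    return b"raw-audio-bytes"

def speech_to_text(audio):              # S402: speech recognition
    return "comedy directed by Director A"

def semantic_analysis(text):            # semantic analysis (segmentation stub)
    return text.split()

def pre_search(terms):                  # S406: classified pre-search
    return ["Movie A", "Movie C"]

def semantic_rank(terms, candidates):   # S408: semantic-space similarity
    return sorted(candidates)           # stand-in for distance-based sorting

audio = collect_voice()
text = speech_to_text(audio)
terms = semantic_analysis(text)
candidates = pre_search(terms)
final_result = semantic_rank(terms, candidates)
print(final_result)                     # S410: present to the user
```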
  • another video search method based on voice input includes, for example, steps S500-S510:
  • S500 Quantize the video text semantic analysis result data obtained by performing natural language semantic analysis on the collected video text data, and train on it based on latent semantic indexing to obtain the video semantic space and the semantic descriptor sub-set of the collected video text data in the video semantic space;
  • S502 Collect the user's voice input and convert it into user text data;
  • S504 Perform natural language semantic analysis on the user text data to obtain user text semantic analysis result data;
  • S506 Use the semantic descriptor of the user text semantic analysis result data in the video semantic space to perform similarity comparisons within the semantic descriptor sub-set of the at least partially collected video text data, and output the video final search result; this covers both performing a video pre-search (for example, the aforementioned classified search) before the semantic space search and performing the semantic space search directly without a video pre-search; and
  • S508 Present the final search result of the video to the user.
  • the natural interactive input mode is not limited to voice input; it can also be direct natural language text input, or even gesture input. Accordingly, in the video search methods of the above embodiments, the text conversion step for user voice data is then not required, and the module design of the video search system can be increased, decreased, and/or changed as appropriate to the actual situation.
  • in summary, the video search system and method based on natural interactive input (such as voice input) and the video search server provided by the embodiments of the present invention have at least one or more of the following advantages: they are guided by the user's video target task and allow the user to interact using natural language; through natural language processing technology and reasoning over a video-related knowledge base, the user can quickly obtain related videos from the database by providing only a simple description of the video content, thereby realizing intelligent perception of the user's video target task; they realize a natural and friendly human-computer interaction mode and interface, with the ability to continuously learn and upgrade; therefore, they can effectively enhance the user experience.

Landscapes

  • Engineering & Computer Science (AREA)
  • Library & Information Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to the field of video search technology, and provides a video search system and method and a video search server based on natural interaction input. The user end of the video search system receives the user's natural interaction input and provides it to the video search server for video search, where the video search server may comprise an online part and an offline part. The offline part performs semantic analysis on collected video information and establishes a video semantic space and a video relational database. The online part obtains the user's text data from the natural interaction input, performs semantic analysis on it, uses the semantic analysis result to perform a video pre-search in the relational database, and, according to the semantic descriptor of the semantic analysis result in the video semantic space, performs comparison and search within the semantic descriptor subset contained in the video pre-search result to output the video final search result to the user. The user only needs to provide a simple description of the video content to quickly obtain relevant videos from the database, thereby implementing intelligent awareness of the user's video target task.

Description

基于自然交互输入的视频搜索系统及方法和视频搜索服务器  Video search system and method based on natural interactive input and video search server
技术领域 Technical field
本发明涉及视频搜索技术领域, 特别是关于基于自然交互输入 (例如语 音输入)的视频搜索系统及方法、 以及视频搜索服务器。 背景技术  The present invention relates to the field of video search technology, and more particularly to a video search system and method based on natural interactive input (e.g., voice input), and a video search server. Background technique
随着电子信息和网络技术的发展, 具有网络接入功能的智能电视逐渐 成为电视市场的主流。 其中, 视频则是智能电视用户最主要的需求。 不同 于个人计算机外围设备的鼠标和键盘, 目前智能电视的人机交互仍然以传 统的遥控器方式为主; 然而, 大量的按钮、 复杂的使用模式和菜单、 繁琐 且令人困惑的界面元素, 随着电视的复杂化和功能的不断增强, 传统的人 机交互方式也因此变得越来越不能满足用户的需求。  With the development of electronic information and network technology, smart TVs with network access functions have gradually become the mainstream of the TV market. Among them, video is the most important requirement of smart TV users. Unlike the mouse and keyboard of personal computer peripherals, the current human-computer interaction of smart TVs is still dominated by traditional remote controls; however, a large number of buttons, complex usage patterns and menus, cumbersome and confusing interface elements, With the complication and function of TV, the traditional human-computer interaction method has become more and more unable to meet the needs of users.
近期以来, 随着语音识别技术的发展, 出现了以美国苹果 (APPLE)公司 推出的个人语音助理 (Personalized Intelligent Assistant, SIRI)为代表的产品, 其能够让用户通过自然语言与设备终端进行交互, 并能够提供例如发短信、 查天气等多项功能。 目前, SIRI 尚不能支持中文语音输入。 近年来, 国内 相关行业也开始进行基于语音等自然交互方式的研究与应用并取得了一定 的成果, 但总的来看, 基于语音等自然交互方式的产品应用仍难以满足用 户的体验要求。 发明内容 本发明的发明目的之一在于提供一种基于自然交互输入的视频搜索系 统, 能实现对用户的视频目标任务的智能感知, 提供更自然友好流畅的用 户体验。 Recently, with the development of speech recognition technology, there has been a product represented by Apple's Personalized Intelligent Assistant (SIRI), which enables users to interact with device terminals through natural language. And can provide many functions such as texting, weather checking, etc. Currently, SIRI does not support Chinese speech input. In recent years, domestic related industries have begun research and application based on natural interaction methods such as voice and achieved certain results. However, in general, product applications based on natural interaction methods such as voice are still difficult to meet user experience requirements. Summary of the invention One of the objects of the present invention is to provide a video search system based on natural interactive input, which can realize intelligent sensing of a user's video target task and provide a more natural and friendly user experience.
本发明的另一发明目的在于提供一种基于自然交互输入的视频搜索方 法, 能实现对用户的视频目标任务的智能感知, 提供更自然友好流畅的用 户体验。  Another object of the present invention is to provide a video search method based on natural interactive input, which can realize intelligent perception of a user's video target task and provide a more natural and friendly user experience.
本发明的再一发明目的在于提供一种视频搜索服务器, 具有自然语言 语义分析能力及智能的视频搜索能力。  Still another object of the present invention is to provide a video search server having natural language semantic analysis capabilities and intelligent video search capabilities.
Specifically, an embodiment of the present invention provides a video search system based on natural interaction input, comprising a client and a video search server. The client comprises a voice collection module and a human-machine interface; the voice collection module collects the user's voice input to generate user voice data and supplies it to the human-machine interface. The video search server comprises a control module, a speech recognition module, a natural language processing module, a video relational database, and a video search module; the video relational database stores a video semantic space and a set of semantic descriptors of video text data in that semantic space. The control module receives the user voice data supplied by the client's human-machine interface and passes it to the speech recognition module to obtain user text data, passes the user text data to the natural language processing module to obtain user text semantic analysis result data, and performs a pre-search in the video relational database with the user text semantic analysis result data to obtain a video pre-search result. The video pre-search result contains the semantic descriptors, in the video semantic space, of the related video text data that matches the user text semantic analysis result data. The video search module receives the user text semantic analysis result data and the video pre-search result from the control module, compares, in terms of similarity, the semantic descriptor of the user text semantic analysis result data in the video semantic space against each semantic descriptor contained in the video pre-search result, and outputs the final video search result to the control module according to the comparison; the control module then supplies the result to the human-machine interface for presentation to the user.
In addition, an embodiment of the present invention provides a video search method based on natural interaction input, comprising the steps of: (a) collecting the user's natural interaction input to obtain user text data; (b) performing natural language semantic analysis on the user text data to obtain user text semantic analysis result data; (c) performing a pre-search with the user text semantic analysis result data to obtain a video pre-search result, the video pre-search result containing the semantic descriptors, in the video semantic space, of the related video text data that matches the user text semantic analysis result data; (d) projecting the user text semantic analysis result data into the video semantic space, comparing the projection for similarity against each semantic descriptor contained in the video pre-search result, and outputting the final video search result; and (e) presenting the final video search result to the user.
Another embodiment of the present invention provides a voice-input-based video search method, comprising the steps of: (1) quantizing the video text semantic analysis result data obtained by performing natural language semantic analysis on collected video text data, training on it based on latent semantic indexing to obtain a video semantic space, and obtaining the set of semantic descriptors of the collected video text data in that semantic space; (2) collecting the user's natural interaction input to obtain user text data; (3) performing natural language semantic analysis on the user text data to obtain user text semantic analysis result data; (4) comparing the semantic descriptor of the user text semantic analysis result data in the video semantic space for similarity against the semantic descriptors of at least part of the collected video text data in that semantic space, and outputting the final video search result; and (5) presenting the final video search result to the user.
In addition, an embodiment of the present invention provides a video search server, comprising a video relational database, a natural language processing module, a control module, and a video search module. The video relational database stores a video semantic space and the set of semantic descriptors of video text data in that semantic space; the control module supplies user text data representing the user's video request to the natural language processing module to obtain user text semantic analysis result data; the video search module obtains the semantic descriptor of the user text semantic analysis result data in the video semantic space, uses that descriptor to perform similarity comparison against the semantic descriptors of at least part of the video text data in the semantic space, and outputs the final video search result to the control module.
The natural-interaction-input-based video search systems and methods and the video search servers of the above embodiments of the present invention have at least one or more of the following advantages: they are oriented toward the user's video target task and allow the user to interact in natural language; through natural language processing technology, inference is performed over a video-related knowledge base, so that the user can quickly retrieve related videos from the database by providing only a simple description of the video content, thereby achieving intelligent perception of the user's video target task. In addition, they provide a natural, friendly, and convenient mode and interface of human-computer interaction and have the ability to keep learning and upgrading, thereby effectively improving the user experience.
The above description is merely an overview of the technical solution of the present invention. In order that the technical means of the present invention may be understood more clearly and implemented in accordance with the contents of this specification, and in order that the above and other objects, features, and advantages of the present invention may become more apparent, preferred embodiments are described in detail below with reference to the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a schematic architectural diagram of a video search system based on natural interaction input (for example, voice input) according to an embodiment of the present invention.
FIG. 2 is a block diagram of the client shown in FIG. 1.
FIG. 3 is a block diagram of the video search server shown in FIG. 1.

FIG. 4 is a flowchart of a voice-input-based video search method according to an embodiment of the present invention.

FIG. 5 is a flowchart of another voice-input-based video search method according to an embodiment of the present invention.

DETAILED DESCRIPTION
To further explain the technical means and effects adopted by the present invention to achieve its intended objects, specific embodiments, methods, steps, and effects of the natural-interaction-input-based video search system and method and the video search server proposed according to the present invention are described in detail below with reference to the accompanying drawings and preferred embodiments.
The foregoing and other technical contents, features, and effects of the present invention will become apparent from the following detailed description of the preferred embodiments with reference to the accompanying drawings. Through the description of the specific embodiments, the technical means and effects adopted by the present invention to achieve its intended objects can be understood more deeply and concretely; the accompanying drawings, however, are provided for reference and illustration only and are not intended to limit the present invention.
Referring to FIG. 1, a schematic architectural diagram of a video search system based on natural interaction input (for example, voice input) according to an embodiment of the present invention: as shown in FIG. 1, the voice-input-based video search system 100 of this embodiment comprises a client 10 and a video search server 30. The client 10 receives the user's voice input and generates user voice data; the video search server 30 performs a video search according to the user voice data and returns the final video search result to the client 10 for presentation to the user. It should be noted that, in the voice-input-based video search system 100 of this embodiment, one video search server 30 may serve multiple clients 10, responding to the user voice data of each client 10 and returning the corresponding final video search result.
Referring to FIG. 2, a block diagram of the client 10 according to an embodiment of the present invention: as shown in FIG. 2, the client 10 comprises, for example, a voice collection module 11 and a human-machine interface 13. The voice collection module 11 collects the user's voice input and generates user voice data, which is transmitted to the video search server 30 through the human-machine interface 13. The tasks of the human-machine interface 13 include, for example, human-computer interaction, user information recording, and user authentication. For user authentication, two usage modes may be provided for the user, such as a public mode and a privacy mode; correspondingly, the video search server 30 may perform the video search either with user authentication enabled or with it skipped, which both protects the user's personal information and allows video search results appropriate to users of different age ranges to be provided. In this embodiment, the client 10 is, for example, an electronic product such as a smart TV with a TV remote control (having Internet access), a desktop computer, a notebook computer, or a smartphone. When the client 10 is a smart TV with a TV remote control, the voice collection module 11 may be a microphone built into the TV remote control, and the human-machine interface 13 may be a Hypertext Transfer Protocol (HTTP) web service running on the smart TV (for example, on port 80), which transmits the user voice data output by the microphone to the video search server 30 for video search and subsequently displays the final video search result for presentation to the user. It will further be understood that the user voice data may be compressed before being transmitted to the video search server 30.
Referring to FIG. 3, a block diagram of the video search server 30 according to an embodiment of the present invention: as shown in FIG. 3, the video search server 30 comprises a control module 31, a speech recognition module 33, a natural language processing module 35, a video data collection module 36, a video relational database 37, a semantic space learning module 38, a video search module 39, and a server management module 32. It is noted here that each module in the video search server 30 may be implemented in hardware and/or software according to the needs of the actual design; furthermore, the video search server 30 may consist of a single server or of a group of multiple servers, together with the necessary peripheral equipment. In addition, in this embodiment, the video search server 30 comprises an online part and an offline part: the online part mainly consists of the control module 31, the speech recognition module 33, the natural language processing module 35, and the video search module 39, while the offline part mainly consists of the video data collection module 36, the video relational database 37, and the semantic space learning module 38, sharing the natural language processing module 35 with the online part.
Specifically, the control module 31 serves as the scheduling center of the entire video search server 30: it receives the user voice data transmitted from the client 10 (for example, over a wired or wireless network connection) and finally returns the final video search result as output to the client 10. When the human-machine interface 13 of the client 10 is provided with a user authentication mechanism, the control module 31 first verifies the user's identity and determines, according to the authentication result, whether the subsequent video search should be performed and/or whether the search results need to be filtered before the final video search result is returned.
The speech recognition module 33 performs speech recognition on voice data to convert it into corresponding text data; it is typically connected to a speech library (not shown in FIG. 3) for voice command matching. In this embodiment, the speech recognition module 33 converts the user voice data supplied by the control module 31 into user text data representing the user's video request and returns it to the control module 31.
The natural language processing module 35 is adapted to perform semantic analysis on text data (for example, user text data and video text data); for instance, it can perform Chinese semantic analysis, including word segmentation, part-of-speech tagging, named entity analysis, and so on. It will be understood, of course, that the natural language processing module 35 may also perform semantic analysis on text in other languages; it is not limited to Chinese and may also handle English and so on, provided semantic libraries for the corresponding languages are supplied. In this embodiment, the natural language processing module 35 performs semantic analysis on the user text data supplied by the control module 31 and returns user text semantic analysis result data to the control module 31. Here, the user text semantic analysis result data can be understood as the user text data after operations such as word segmentation and part-of-speech tagging.
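The word segmentation step mentioned above can be illustrated with a toy example. The sketch below is a minimal forward-maximum-matching segmenter over a small hypothetical vocabulary; it is not the module's actual implementation, and the vocabulary and maximum word length are assumptions chosen for illustration only.

```python
# Minimal forward-maximum-matching segmenter, illustrating the kind of Chinese
# word segmentation the natural language processing module performs.
# VOCAB and MAX_WORD_LEN are hypothetical stand-ins for a real semantic lexicon.

VOCAB = {"科幻", "电影", "战争", "喜剧", "导演", "我想看", "美国"}
MAX_WORD_LEN = 3

def segment(text: str) -> list:
    """Greedy longest-match segmentation; unknown characters become single tokens."""
    tokens, i = [], 0
    while i < len(text):
        for length in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in VOCAB:
                tokens.append(candidate)
                i += length
                break
    return tokens

print(segment("我想看美国科幻电影"))  # → ['我想看', '美国', '科幻', '电影']
```

A production segmenter would also attach part-of-speech and named-entity labels to each token, which this sketch omits.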
The video data collection module 36 collects video data and provides video text data. The video text data may be text data about movies, TV dramas, songs, TV programs, and so on retrieved from the network (including film and television program partners), for example video description text comprising fields such as video title, alias, director name, actor names, production year, video subject genre (for example war film, comedy, etc.), video region (for example China, the United States, etc.) or language (for example Chinese, English, etc.), video category (for example movie, TV drama, etc.), as well as a data validity flag. The video data collection module 36 may operate by periodic automatic collection or by manually triggered collection. In this embodiment, the video text data provided by the video data collection module 36 is first sent to the natural language processing module 35 for natural language semantic analysis to form video text semantic analysis result data, which is then stored in the video relational database 37. It will be understood that the video text data provided by the video data collection module 36 may instead be stored in the video relational database 37 first, after which the natural language processing module 35 performs word segmentation, part-of-speech tagging, and so on (that is, semantic analysis) on the video text data stored in the video relational database 37. Here, the video text semantic analysis result data can be understood as the result of performing operations such as word segmentation and part-of-speech tagging on the video text data.
The video relational database 37 serves as the data source from which the video search server 30 performs video search; it includes data tables such as a video data table, a backup data table, a user table, and a query record table. The video data table stores, for example, the semantically analyzed video text data; the backup data table stores, for example, duplicated and culled data; the user table stores, for example, user data; and the query record table stores, for example, the users' video search records.
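As a rough illustration of this relational layout, the following SQLite sketch creates the four tables described above. The patent does not specify a schema, so every table and column name here is an assumption chosen for illustration.

```python
import sqlite3

# Sketch of the four tables the video relational database is described as holding.
# All table and column names are illustrative assumptions, not the patented schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE video_data (          -- semantically analyzed video text data
    video_id INTEGER PRIMARY KEY,
    title TEXT, director TEXT, actors TEXT,
    year INTEGER, genre TEXT, region TEXT, category TEXT,
    descriptor BLOB,               -- semantic descriptor in the video semantic space
    valid INTEGER DEFAULT 1        -- data-validity flag
);
CREATE TABLE backup_data (         -- duplicated and culled records
    video_id INTEGER, raw_text TEXT
);
CREATE TABLE users (               -- user data, incl. public/privacy mode
    user_id INTEGER PRIMARY KEY, name TEXT, mode TEXT
);
CREATE TABLE query_log (           -- per-user video search records
    user_id INTEGER, query TEXT, ts TEXT
);
""")

conn.execute("INSERT INTO video_data (title, director, year, genre) VALUES (?,?,?,?)",
             ("Example Film", "Example Director", 2012, "comedy"))
row = conn.execute("SELECT title, genre FROM video_data").fetchone()
print(row)  # → ('Example Film', 'comedy')
```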
The semantic space learning module 38 is the main machine learning component of the voice-input-based video search system 100. It is chiefly responsible for quantizing the video text data in the video relational database 37, then analyzing and learning the principal semantics in the video relational database 37 based on latent semantic indexing (LSI) to obtain a video semantic space, finding the set of semantic descriptors of the collected video text data in that semantic space (that is, the set of their projections in the space), and storing the result in the video relational database 37.
The video semantic space may be built as follows. The semantic space learning module 38 takes the semantically analyzed video text semantic analysis result data stored in the video relational database 37 as a training sample set, from which a vocabulary containing a large number of useful words is built. Using this vocabulary, each item of video text data (that is, each video description) can be quantized and ultimately represented by a vector, each element of which is the number of times a particular word occurs in that item of video text data; this vector is the word-frequency vector of the video text data. Then, using the word-frequency vectors of a large amount of video text data, a number of special directions in the linear space to which the word-frequency vectors belong can be computed by subspace machine learning; the vectors representing these special directions form an orthonormal set, which spans a new linear space. The particular physical meaning of this set of vectors is that each of them represents certain words that frequently co-occur in a specific context; each such specific context corresponds to a semantic topic, i.e., the co-occurrence of certain words expresses a semantic meaning. In general, however, only a portion of the special vectors spanning the new linear space have very high semantic discrimination, and only these are retained. The retained vectors ultimately constitute the video semantic space. The video text data in the video relational database 37 is then projected into this video semantic space; each projection is the semantic descriptor of that video text data in the space.
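The training procedure just described (word-frequency vectors, an orthonormal set of semantic directions, retention of only the most discriminative ones) is in essence a truncated singular value decomposition of the term-document matrix, which is how latent semantic indexing is commonly realized. Below is a minimal sketch with toy numbers, assuming NumPy is available; it is not the module's actual training code.

```python
import numpy as np

# Rows of A are word-frequency vectors of video descriptions
# (4 documents x 5 vocabulary terms, toy numbers).
A = np.array([
    [2, 1, 0, 0, 0],   # "war film" style description
    [1, 2, 0, 0, 0],
    [0, 0, 1, 2, 1],   # "comedy" style description
    [0, 0, 2, 1, 1],
], dtype=float)

# Truncated SVD: keep only the k directions with the highest semantic
# discrimination (largest singular values); these span the video semantic space.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
semantic_axes = Vt[:k]            # orthonormal basis of the video semantic space

def descriptor(term_freq):
    """Project a word-frequency vector onto the semantic space."""
    return semantic_axes @ term_freq

doc_descriptors = np.array([descriptor(row) for row in A])
print(doc_descriptors.shape)  # → (4, 2)
```

Each row of `doc_descriptors` is the semantic descriptor of one video description; a user query quantized over the same vocabulary is projected by the same `descriptor` function.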
The video search module 39 is connected to the control module 31 and the video relational database 37. It receives the user text semantic analysis result data supplied by the control module 31, obtains the video semantic space (for example, its coordinate axes and related information) from the video relational database 37, and projects the user text semantic analysis result data into the video semantic space to obtain the projection (that is, the semantic descriptor) of the user text data in the space. The video search module 39 can then perform the video search operation with this semantic descriptor.
In an embodiment of the present invention, the video search operation of the video search module 39 may proceed as follows. First, the control module 31 performs a video pre-search in the video relational database 37 with the user text semantic analysis result data (that is, the semantically analyzed user text data), for example a classified search over several or all of: video director name, video actor name, production year, video subject genre, video region or language, video category, and so on. This reduces the workload of the subsequent video search performed by the video search module 39 and improves search efficiency. Here, the video pre-search result contains, for example, the set of semantic descriptors, in the video semantic space, of the related video text data matching the user text data; this set is supplied to the video search module 39 together with the user text semantic analysis result data. The video search module 39 then compares, in terms of similarity, the semantic descriptor of the user text data in the video semantic space against each semantic descriptor in the set contained in the video pre-search result, obtains the final video search result, and transmits it to the control module 31, which supplies it to the human-machine interface 13 of the client 10 for presentation to the user. The similarity comparison may be implemented by computing the Euclidean distance, but the present invention is not limited thereto; any other method capable of computing the similarity between projections in a semantic space may be used. In addition, the final video search result here may be a list of videos sorted by similarity score.
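The ranking step just described can be sketched as follows: compute the Euclidean distance between the query's semantic descriptor and the descriptor of each candidate that survived the pre-search, then sort the candidates by distance (smaller means more similar). All titles and descriptor values below are illustrative assumptions, not output of a real trained semantic space.

```python
import math

# Rank candidate videos by Euclidean distance between the user query's
# semantic descriptor and each candidate's descriptor.

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank(query_descriptor, candidates):
    """candidates: {video_title: descriptor}; returns titles, most similar first."""
    return sorted(candidates,
                  key=lambda title: euclidean(query_descriptor, candidates[title]))

pre_search_results = {           # descriptors of videos surviving the pre-search
    "War Epic":   (0.9, 0.1),
    "Comedy Hit": (0.1, 0.8),
    "War Drama":  (0.7, 0.3),
}
query = (0.85, 0.15)             # projection of the user's request, e.g. "war film"

print(rank(query, pre_search_results))  # → ['War Epic', 'War Drama', 'Comedy Hit']
```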
It should be noted that embodiments of the present invention are not limited to the above semantic space search, in which the semantic descriptor of the user text data in the video semantic space is compared against the semantic descriptors of only part of the video text data in the space. In other embodiments, the video pre-search may be omitted, and the semantic descriptor of the user text data in the video semantic space may be compared directly against the semantic descriptors of all the video text data in the space to obtain the final video search result.
In addition, to provide administrators and developers with an interface for debugging, testing, deploying, and maintaining the video search server, a server management module 32 is configured in the video search server 30 as a non-user-facing module.
Furthermore, the speech recognition module 33 of the above embodiments of the present invention may instead be integrated into the client 10 rather than the video search server 30, so that the client 10 first converts the user voice data into user text data and then transmits it to the control module 31 in the video search server 30.
Several voice-input-based video search methods applicable to the above video search system 100 based on natural interaction input, for example voice input, are briefly described below.
As shown in FIG. 4, one voice-input-based video search method mainly comprises steps S400-S410:
S400: collecting the user's voice input to generate user voice data;
S402: performing speech recognition on the user voice data to obtain user text data;
S404: performing natural language semantic analysis on the user text data to obtain user text semantic analysis result data;
S406: performing a pre-search (for example, the classified search described above) with the user text semantic analysis result data to obtain a video pre-search result, the video pre-search result containing the semantic descriptors, in the video semantic space, of the related video text data matching the user text semantic analysis result data;
S408: projecting the user text semantic analysis result data into the video semantic space and comparing the projection for similarity against each semantic descriptor contained in the video pre-search result to output the final video search result (for example, a list of videos sorted by similarity score); and
S410: presenting the final video search result to the user.

As shown in FIG. 5, another voice-input-based video search method mainly comprises steps S500-S510:

S500: quantizing the video text semantic analysis result data obtained by performing natural language semantic analysis on collected video text data, training on it based on latent semantic indexing to obtain a video semantic space, and obtaining the set of semantic descriptors of the collected video text data in the video semantic space;
S502: collecting the user's voice input and converting it into user text data;
S504: performing natural language semantic analysis on the user text data to obtain user text semantic analysis result data;
S506: comparing the semantic descriptor of the user text semantic analysis result data in the video semantic space for similarity against the semantic descriptors of at least part of the collected video text data in the video semantic space to output the final video search result; more specifically, step S506 covers both cases described above, namely performing a video pre-search (for example, the classified search described above) followed by a semantic space search, and performing a semantic space search directly without a video pre-search; and
S508: presenting the final video search result to the user.
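Strung together, steps S500-S508 form a pipeline like the following sketch, in which every component is a toy stand-in (a bag-of-words "semantic space", a hard-coded transcription) rather than the patented implementation:

```python
# End-to-end sketch of steps S500-S508; all components are illustrative toys.

def build_semantic_space(video_texts):            # S500 (offline training)
    vocab = sorted({w for t in video_texts for w in t.split()})
    def descriptor(text):                          # word-frequency "projection"
        return tuple(text.split().count(w) for w in vocab)
    return descriptor, {t: descriptor(t) for t in video_texts}

def collect_user_text():                           # S502 (speech already transcribed)
    return "war film"

def analyze(text):                                 # S504: trivial "semantic analysis"
    return text

def search(query_text, descriptor, index):         # S506: similarity comparison
    q = descriptor(query_text)
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(index, key=lambda title: dist(q, index[title]))

videos = ["classic war film", "light comedy film", "war documentary"]
descriptor, index = build_semantic_space(videos)
best = search(analyze(collect_user_text()), descriptor, index)
print(best)                                        # S508: present result → 'classic war film'
```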
In addition, those skilled in the art will understand that the natural interaction input is not limited to voice input; it may also be direct natural language text input or even gesture input. Correspondingly, in the video search methods of the above embodiments, the step of converting user voice data into text is then unnecessary, and the module design of the video search system may likewise be appropriately extended, reduced, and/or modified according to the actual situation.
In summary, the video search systems and methods based on natural interaction input, for example voice input, and the video search servers provided by the embodiments of the present invention have at least one or more of the following advantages: they are oriented toward the user's video target task and allow the user to interact in natural language; through natural language processing technology, inference is performed over a video-related knowledge base, so that the user can quickly retrieve related videos from the database by providing only a simple description of the video content, thereby achieving intelligent perception of the user's video target task. In addition, they provide a natural, friendly, and convenient mode and interface of human-computer interaction and have the ability to keep learning and upgrading, thereby effectively improving the user experience.
The above are merely preferred embodiments of the present invention and do not limit the present invention in any form. Although the present invention has been disclosed above by way of preferred embodiments, they are not intended to limit it; anyone skilled in the art may, without departing from the scope of the technical solution of the present invention, use the technical contents disclosed above to make minor changes or modifications resulting in equivalent embodiments. Any simple modification, equivalent variation, or refinement of the above embodiments made according to the technical essence of the present invention without departing from the content of its technical solution still falls within the scope of the technical solution of the present invention.

INDUSTRIAL APPLICABILITY
The video search system and method based on natural interaction input (e.g. voice input) and the video search server provided by the present invention have at least one or more of the following advantages: they are oriented toward the user's video target task and allow the user to interact in natural language; through natural language processing technology and inference over a video-related knowledge base, the user need only provide a simple description of the video content to quickly retrieve the relevant videos from the database, thereby achieving intelligent perception of the user's video target task; in addition, they provide a natural, friendly and convenient mode and interface for human-machine interaction, with the ability to continuously learn and upgrade; the user experience can therefore be effectively improved.

Claims

1. A video search system based on natural interaction input, comprising:
a user terminal, comprising a voice acquisition module and a human-machine interface, the voice acquisition module acquiring the user's voice input to generate user voice data and providing it to the human-machine interface; and
a video search server, comprising a control module, a speech recognition module, a natural language processing module, a video relational database, and a video search module, the video relational database storing a video semantic space and the set of semantic descriptors of video text data in the video semantic space,
wherein the control module receives the user voice data provided by the human-machine interface of the user terminal and provides it to the speech recognition module to obtain user text data, provides the user text data to the natural language processing module to obtain user-text semantic analysis result data, and performs a pre-search in the video relational database using the user-text semantic analysis result data to obtain a video pre-search result, the video pre-search result comprising the set of semantic descriptors, in the video semantic space, of the related video text data matching the user-text semantic analysis result data; and
the video search module receives the user-text semantic analysis result data and the video pre-search result provided by the control module, compares the semantic descriptor of the user-text semantic analysis result data in the video semantic space against each semantic descriptor in the set contained in the video pre-search result for similarity, and outputs a final video search result to the control module according to the comparison result, which the control module then provides to the human-machine interface for presentation to the user.
2. The video search system based on natural interaction input according to claim 1, wherein the video search server further comprises:
a video data collection module, which collects video data to provide video text data to the natural language processing module, the natural language processing module outputting video-text semantic analysis result data to the video relational database for storage; and
a semantic space learning module, which performs training on the video-text semantic analysis result data stored in the video relational database to obtain the video semantic space, finds the semantic descriptor of each item of video text data in the video semantic space, and stores them in the video relational database.
3. A video search method based on natural interaction input, comprising the steps of:
acquiring a user's natural interaction input to obtain user text data;
performing natural language semantic analysis on the user text data to obtain user-text semantic analysis result data;
performing a pre-search using the user-text semantic analysis result data to obtain a video pre-search result, the video pre-search result comprising the set of semantic descriptors, in a video semantic space, of the related video text data matching the user-text semantic analysis result data;
projecting the user-text semantic analysis result data into the video semantic space and comparing it against each semantic descriptor in the set contained in the video pre-search result for similarity, so as to output a final video search result; and
presenting the final video search result to the user.
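A minimal sketch of the projection-and-compare step described in claim 3. The 2×4 term-to-topic matrix, the vocabulary, the candidate descriptors, and the names `project`, `cosine`, `TOPICS` are all invented for illustration; a real system would learn the semantic space from a video corpus rather than hard-code it.

```python
# Toy illustration of projecting a query into a semantic (topic) space and
# ranking candidate videos by cosine similarity. All numbers are invented.

def project(term_vec, topic_matrix):
    """Project a term-frequency vector into the topic space (one dot product per topic)."""
    return [sum(w * t for w, t in zip(row, term_vec)) for row in topic_matrix]

def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either is the zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical 2-topic space over a 4-term vocabulary.
TOPICS = [
    [0.9, 0.8, 0.1, 0.0],   # topic 0: loads heavily on the first two terms
    [0.0, 0.1, 0.9, 0.7],   # topic 1: loads heavily on the last two terms
]

# Semantic descriptors of two candidate videos, already in the topic space
# (these would come from the pre-search result in claim 3).
candidates = {"video_A": [1.5, 0.2], "video_B": [0.1, 1.4]}

query_terms = [1, 1, 0, 0]          # the user's text mentions the first two terms
q = project(query_terms, TOPICS)    # query descriptor in the semantic space

best = max(candidates, key=lambda v: cosine(q, candidates[v]))
print(best)  # video_A: its descriptor points in nearly the same direction as q
```

The two-stage design in claim 3 matters here: the (cheap) pre-search shrinks the candidate set, so the (more expensive) per-descriptor cosine comparison only runs over the survivors.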
4. The video search method based on natural interaction input according to claim 3, further comprising the steps of:
collecting video text data;
performing natural language semantic analysis on the collected video text data to obtain video-text semantic analysis result data; and
performing training on the video-text semantic analysis result data to obtain the video semantic space and to find the semantic descriptor of each item of the collected video text data in the video semantic space.
5. The video search method based on natural interaction input according to claim 3, wherein the step of performing a pre-search using the user-text semantic analysis result data to obtain a video pre-search result comprises:
performing a classified search using the user-text semantic analysis result data, the classified search comprising several or all of: a search by video director name, a search by video actor name, a search by video production year, a search by video subject type, a search by video region or language type, and a search by video category.
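A minimal sketch of the classified pre-search in claim 5, assuming the semantic analysis step has already reduced the utterance to field/value slots (director, actor, year, genre). The record schema, the sample data, and the helper name `pre_search` are invented for illustration, not taken from the patent.

```python
# Toy classified pre-search: keep only the videos matching every slot that
# semantic analysis extracted from the user's utterance. Data is invented.

videos = [
    {"title": "Film A", "director": "Zhang", "actor": "Li",   "year": 2010, "genre": "action"},
    {"title": "Film B", "director": "Wang",  "actor": "Li",   "year": 2011, "genre": "romance"},
    {"title": "Film C", "director": "Zhang", "actor": "Chen", "year": 2011, "genre": "action"},
]

def pre_search(records, **slots):
    """Return the records whose fields match every extracted slot."""
    return [r for r in records if all(r.get(k) == v for k, v in slots.items())]

# e.g. the user's request was analysed as: director "Zhang", genre "action"
hits = pre_search(videos, director="Zhang", genre="action")
print([r["title"] for r in hits])  # ['Film A', 'Film C']
```

In the full pipeline of claims 3 and 5, the descriptors of these hits would then go to the similarity-comparison stage rather than straight to the user.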
6. A video search method based on natural interaction input, comprising the steps of:
quantizing the video-text semantic analysis result data obtained by performing natural language semantic analysis on collected video text data, performing training based on latent semantic indexing to obtain a video semantic space, and obtaining the set of semantic descriptors of the collected video text data in the video semantic space;
acquiring a user's natural interaction input to obtain user text data;
performing natural language semantic analysis on the user text data to obtain user-text semantic analysis result data;
comparing the semantic descriptor of the user-text semantic analysis result data in the video semantic space against the semantic descriptors of at least part of the collected video text data in the video semantic space for similarity, so as to output a final video search result; and
presenting the final video search result to the user.
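A minimal sketch of the quantization step in claim 6: turning collected video text into a term-document count matrix, which is the standard input to latent semantic indexing. The documents, the whitespace tokenization, and the names `quantize` and `vocab` are illustrative assumptions; a real system would tokenize with its NLP module and would then factor this matrix (e.g. by truncated SVD) to obtain the low-rank video semantic space, which is not implemented here.

```python
# Toy quantization: each video's text becomes a term-frequency vector over a
# shared vocabulary. LSI would then factor the resulting matrix; that
# factorization step is deliberately omitted from this sketch.

docs = {
    "video_A": "action hero fight fight city",
    "video_B": "love story city romance",
}

# Fixed, sorted vocabulary so every document maps to the same coordinate order.
vocab = sorted({w for text in docs.values() for w in text.split()})

def quantize(text, vocabulary):
    """Term-frequency vector of a document over a fixed vocabulary."""
    words = text.split()
    return [words.count(term) for term in vocabulary]

matrix = {doc: quantize(text, vocab) for doc, text in docs.items()}
print(vocab)               # alphabetical term order
print(matrix["video_A"])   # 'fight' occurs twice, hence the 2
```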
7. A video search server, comprising:
a video relational database, which stores a video semantic space and the set of semantic descriptors of video text data in the video semantic space;
a natural language processing module;
a control module, which provides user text data representing a user's video requirement to the natural language processing module to obtain user-text semantic analysis result data; and
a video search module, which obtains the semantic descriptor of the user-text semantic analysis result data in the video semantic space, and uses this semantic descriptor to perform a similarity comparison against the semantic descriptors of at least part of the video text data in the video semantic space, so as to output a final video search result to the control module.
8. The video search server according to claim 7, wherein the control module further performs a pre-search in the video relational database using the user-text semantic analysis result data to obtain a video pre-search result, the video pre-search result comprising the set of semantic descriptors, in the video semantic space, of the related video text data matching the user-text semantic analysis result data; correspondingly, the video search module performs the similarity comparison, using the semantic descriptor corresponding to the user-text semantic analysis result data, within the set of semantic descriptors contained in the video pre-search result, so as to output the final video search result to the control module.
9. The video search server according to claim 7, further comprising:
a speech recognition module, wherein after the control module receives user voice data, it converts the user voice data via the speech recognition module into the user text data representing the user's video requirement.
10. The video search server according to claim 7, 8 or 9, further comprising:
a video data collection module, which collects video data to provide video text data to the natural language processing module, the natural language processing module outputting video-text semantic analysis result data to the video relational database for storage; and
a semantic space learning module, which quantizes the video-text semantic analysis result data stored in the video relational database, performs training based on latent semantic indexing to obtain the video semantic space, finds the semantic descriptor of each item of video text data in the video semantic space, and stores them in the video relational database.
PCT/CN2012/086283 2012-06-18 2012-12-10 Video search system, method and video search server based on natural interaction input WO2013189156A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201210199239.7 2012-06-18
CN201210199239.7A CN102750366B (en) 2012-06-18 2012-06-18 Video search system and method based on natural interactive import and video search server

Publications (1)

Publication Number Publication Date
WO2013189156A1 true WO2013189156A1 (en) 2013-12-27

Family

ID=47030551

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2012/086283 WO2013189156A1 (en) 2012-06-18 2012-12-10 Video search system, method and video search server based on natural interaction input

Country Status (2)

Country Link
CN (1) CN102750366B (en)
WO (1) WO2013189156A1 (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102750366B (en) * 2012-06-18 2015-05-27 海信集团有限公司 Video search system and method based on natural interactive import and video search server
CN103970791B (en) * 2013-02-01 2018-01-23 华为技术有限公司 A kind of method, apparatus for recommending video from video library
CN104240700B (en) * 2014-08-26 2018-09-07 智歌科技(北京)有限公司 A kind of global voice interactive method and system towards vehicle-mounted terminal equipment
US9886958B2 (en) * 2015-12-11 2018-02-06 Microsoft Technology Licensing, Llc Language and domain independent model based approach for on-screen item selection
CN106776872A (en) * 2016-11-29 2017-05-31 暴风集团股份有限公司 Defining the meaning of one's words according to voice carries out the method and system of phonetic search
CN108549655A (en) * 2018-03-09 2018-09-18 阿里巴巴集团控股有限公司 A kind of production method of films and television programs, device and equipment
CN109089133B (en) 2018-08-07 2020-08-11 北京市商汤科技开发有限公司 Video processing method and device, electronic equipment and storage medium
CN110968736B (en) * 2019-12-04 2021-02-02 深圳追一科技有限公司 Video generation method and device, electronic equipment and storage medium
CN113596602A (en) * 2021-07-28 2021-11-02 深圳创维-Rgb电子有限公司 Intelligent matching method, television and computer readable storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120131060A1 (en) * 2010-11-24 2012-05-24 Robert Heidasch Systems and methods performing semantic analysis to facilitate audio information searches
CN102750366A (en) * 2012-06-18 2012-10-24 海信集团有限公司 Video search system and method based on natural interactive import and video search server

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080059522A1 (en) * 2006-08-29 2008-03-06 International Business Machines Corporation System and method for automatically creating personal profiles for video characters
CN100565532C (en) * 2008-05-28 2009-12-02 叶睿智 A kind of multimedia resource search method based on the audio content retrieval
CN101382937B (en) * 2008-07-01 2011-03-30 深圳先进技术研究院 Multimedia resource processing method based on speech recognition and on-line teaching system thereof
CN101404035A (en) * 2008-11-21 2009-04-08 北京得意音通技术有限责任公司 Information search method based on text or voice
CN102063476B (en) * 2010-12-13 2013-07-10 百度时代网络技术(北京)有限公司 Video searching method and system
CN102262624A (en) * 2011-08-08 2011-11-30 中国科学院自动化研究所 System and method for realizing cross-language communication based on multi-mode assistance

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120131060A1 (en) * 2010-11-24 2012-05-24 Robert Heidasch Systems and methods performing semantic analysis to facilitate audio information searches
CN102750366A (en) * 2012-06-18 2012-10-24 海信集团有限公司 Video search system and method based on natural interactive import and video search server

Also Published As

Publication number Publication date
CN102750366B (en) 2015-05-27
CN102750366A (en) 2012-10-24


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 12879332; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 12879332; Country of ref document: EP; Kind code of ref document: A1)