WO2013189156A1 - Video search system, method and video search server based on natural interaction input - Google Patents


Info

Publication number
WO2013189156A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
semantic
user
search
data
Application number
PCT/CN2012/086283
Other languages
French (fr)
Chinese (zh)
Inventor
王勇进
张瑞
张钰林
Original Assignee
海信集团有限公司 (Hisense Group Co., Ltd.)
Application filed by 海信集团有限公司 (Hisense Group Co., Ltd.)
Publication of WO2013189156A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content

Definitions

  • the present invention relates to the field of video search technology, and more particularly to a video search system and method based on natural interactive input (e.g., voice input), and a video search server.
  • Another object of the present invention is to provide a video search method based on natural interactive input, which can realize intelligent perception of a user's video target task and provide a more natural and friendly user experience.
  • Still another object of the present invention is to provide a video search server having natural language semantic analysis capabilities and intelligent video search capabilities.
  • the video search system based on natural interaction input provided by the embodiment of the present invention includes a user end and a video search server.
  • the user end includes a voice collection module and a human machine interface, and the voice collection module collects voice input of the user to generate user voice data and provide the same to the human machine interface.
  • the video search server comprises a control module, a speech recognition module, a natural language processing module, a video relational database and a video search module; the video relational database stores a video semantic space and a semantic descriptor sub-set of the video text data in the video semantic space.
  • the control module receives the user voice data provided by the human machine interface of the user end and provides it to the speech recognition module to obtain user text data, provides the user text data to the natural language processing module to obtain user text semantic analysis result data, and uses the user text semantic analysis result data to perform a pre-search in the video relational database to obtain a video pre-search result.
  • the video pre-search result includes a subset of semantic descriptions of the associated video text data that matches the user text semantic analysis result data in the video semantic space.
  • the video search module receives the user text semantic analysis result data and the video pre-search result provided by the control module, performs similarity comparisons between the semantic descriptor of the user text semantic analysis result data in the video semantic space and the semantic descriptors included in the video pre-search result, and, according to the comparison results, outputs the video final search result to the control module, which then provides it to the human machine interface for presentation to the user.
  • a video search method based on natural interactive input includes the following steps: (a) collecting the user's natural interaction input to obtain user text data; (b) performing natural language semantic analysis on the user text data to obtain user text semantic analysis result data; (c) performing a pre-search with the user text semantic analysis result data to obtain a video pre-search result, which includes the semantic descriptor sub-set, in the video semantic space, of the related video text data matching the user text semantic analysis result data; (d) projecting the user text semantic analysis result data into the video semantic space and performing similarity comparisons with the semantic descriptors included in the video pre-search result to output the video final search result; and (e) presenting the video final search result to the user.
  • a video search method based on voice input includes the following steps: (1) quantizing the video text semantic analysis result data obtained by performing natural language semantic analysis on the collected video text data, training on it based on latent semantic indexing to obtain the video semantic space, and obtaining the semantic descriptor sub-set of the collected video text data in the video semantic space; (2) collecting the user's natural interaction input to obtain user text data; (3) performing natural language semantic analysis on the user text data to obtain user text semantic analysis result data; (4) using the semantic descriptor of the user text semantic analysis result data in the video semantic space to perform similarity comparisons within the semantic descriptor sub-set of at least part of the collected video text data, and outputting the video final search result; and (5) presenting the video final search result to the user.
  • a video search server includes: a video relational database, a natural language processing module, a control module, and a video search module.
  • the video relational database stores the video semantic space and the semantic descriptor sub-set of the video text data in the video semantic space.
  • the control module provides the user text data representing the user's video requirement to the natural language processing module to obtain user text semantic analysis result data; the video search module obtains the semantic descriptor of the user text semantic analysis result data in the video semantic space, and uses that descriptor to perform similarity comparisons within the semantic descriptor sub-set of at least part of the video text data to output the video final search result to the control module.
  • the video search system and method based on natural interactive input, and the video search server, in the above embodiments of the present invention have at least one or more of the following advantages: they are guided by the user's video target task and allow the user to interact using natural language; through natural language processing technology and reasoning over a video-related knowledge base, the user can quickly obtain related videos from the database by providing only a simple description of the video content, thereby realizing intelligent perception of the user's video target task; they realize a natural and friendly human-computer interaction mode and interface, with the ability to continuously learn and upgrade; therefore, they can effectively enhance the user experience.
  • FIG. 1 is a schematic diagram of a video search system architecture based on natural interactive input (eg, voice input) according to an embodiment of the present invention.
  • FIG. 2 is a schematic diagram of the modules of the user end shown in FIG. 1.
  • FIG. 3 is a schematic diagram of the modules of the video search server shown in FIG. 1.
  • FIG. 4 is a flowchart of a video search method based on voice input according to an embodiment of the present invention.
  • FIG. 5 is a flowchart of another video search method based on voice input according to an embodiment of the present invention.
  • FIG. 1 is a schematic structural diagram of a video search system based on natural interactive input (for example, voice input) according to an embodiment of the present invention.
  • the voice input based video search system 100 of the present embodiment includes a client 10 and a video search server 30; the client 10 receives the user's voice input and generates user voice data, and the video search server 30 performs a video search according to the user voice data and returns the video final search result to the client 10 for presentation to the user.
  • one video search server 30 may serve multiple clients 10, responding to the user voice data of each client 10 separately and returning the corresponding video final search result.
  • FIG. 2 is a schematic diagram of a module of the client 10 according to an embodiment of the present invention.
  • the client 10 includes, for example, a voice collection module 11 and a human machine interface 13.
  • the voice collection module 11 collects the user's voice input and generates user voice data, which is transmitted to the video search server 30 through the human machine interface 13.
  • the tasks of the human machine interface 13 include, for example, human-computer interaction, user information recording, and user authentication.
  • two usage modes may be provided for the user, such as a public mode and a privacy mode. The video search server 30 may perform the video search with user authentication either enabled or skipped, so that the user's personal information is protected and suitable video search results can be provided to users of different ages.
  • the client 10 is, for example, a smart TV with a TV remote controller (having an Internet function), a desktop computer, a notebook computer, a smart phone, or the like. When the client 10 is a smart TV with a TV remote controller, the voice collection module 11 may be a microphone built into the TV remote controller, and the human machine interface 13 may be a Hypertext Transfer Protocol (HTTP) web service running on the smart TV (for example, on port 80), which transmits the user voice data output by the microphone to the video search server 30 for video search and may subsequently display the video final search result to the user. It can be understood that the user voice data may first be compressed before being transmitted to the video search server 30.
  • FIG. 3 is a block diagram of a video search server 30 according to an embodiment of the present invention.
  • the video search server 30 includes a control module 31, a speech recognition module 33, a natural language processing module 35, a video data collection module 36, a video relational database 37, a semantic space learning module 38, a video search module 39, and a server management module 32. It is noted here that each module in the video search server 30 can be implemented in hardware and/or software according to actual design needs; further, the video search server 30 can be a single server or a cluster of multiple servers, plus the necessary peripheral components.
  • the video search server 30 includes two parts, an online part and an offline part. The online part is mainly composed of the control module 31, the speech recognition module 33, the natural language processing module 35, and the video search module 39. The offline part is mainly composed of the video data collection module 36, the video relational database 37, and the semantic space learning module 38, and shares the natural language processing module 35 with the online part.
  • the control module 31 serves as the scheduling center of the entire video search server 30; it receives the user voice data transmitted by the client 10 (for example, over a wired or wireless network connection) and finally returns the video final search result as output to the client 10.
  • the control module 31 first verifies the identity of the user, and determines according to the authentication result whether to perform the video search and/or whether the video final search result needs to be filtered before being returned.
  • the speech recognition module 33 is for speech recognition of speech data for conversion to corresponding textual data, which is typically coupled to a speech library (not shown in Figure 3) for speech instruction matching operations.
  • the voice recognition module 33 can convert the user voice data provided by the control module 31 into user text data representing the user's video requirements and return it to the control module 31.
  • the natural language processing module 35 is adapted to perform semantic analysis on text data (such as user text data, video text data, etc.), for example, to perform Chinese semantic analysis: including word segmentation, part-of-speech tagging, named entity analysis, and the like.
  • the natural language processing module 35 can also perform semantic analysis on texts in other languages, not only Chinese but also English and so on; it only needs to be provided with semantic libraries for the different languages.
  • the natural language processing module 35 can perform semantic analysis on the user text data provided by the control module 31 to return the user text semantic analysis result data to the control module 31.
  • the user text semantic analysis result data can be understood as user text data after the operation of word segmentation, part-of-speech tagging, and the like.
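As a minimal sketch of what such semantic analysis result data might look like, the snippet below shows a hypothetical segmented and tagged representation of a user query; the tag set, tokenization, and field names are illustrative assumptions, not the patent's actual data format:

```python
# Hypothetical user text semantic analysis result: word segmentation plus
# part-of-speech / named-entity tags. Tags here are illustrative only.
user_text = "张艺谋 导演 的 战争 片"   # "war film directed by Zhang Yimou"
analysis_result = [
    ("张艺谋", "PERSON"),    # named entity: a director name
    ("导演",   "NOUN"),
    ("的",     "PARTICLE"),
    ("战争",   "NOUN"),
    ("片",     "NOUN"),
]
# Named-entity analysis lifts the person name into a structured field that
# a later classified pre-search (e.g., director name search) could use.
entities = [tok for tok, tag in analysis_result if tag == "PERSON"]
print(entities)   # → ['张艺谋']
```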
  • the video data collection module 36 is configured to collect video data and provide video text data. The video text data may be text data searched from the network (including film and television program providing partners), such as a video name, an alias, a director name, an actor name, a video production date, a video theme type (such as war film, comedy film, etc.), a video region (such as China, the United States, etc.) or language (such as Chinese, English, etc.), a video category (such as movie, TV show, etc.), and a data validity tag.
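The metadata fields listed above can be pictured as one record per video; the sketch below is a hypothetical shape for such a record (the field names are illustrative assumptions, not specified by the patent):

```python
# Hypothetical collected video text record; field names are illustrative.
video_record = {
    "name": "Example Film",
    "alias": "An Example",
    "director": "Director A",
    "actors": ["Actor A", "Actor B"],
    "production_date": "2011",
    "theme_type": "war film",     # e.g. war film, comedy film
    "region": "China",
    "language": "Chinese",
    "category": "movie",          # movie, TV show, ...
    "valid": True,                # data validity tag
}
# The text fields are what the natural language processing module would
# later segment and tag before storage in the video relational database.
searchable_text = " ".join([video_record["name"], video_record["director"],
                            video_record["theme_type"], video_record["region"]])
print(searchable_text)
```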
  • the video data collection module 36 can operate in a periodic automatic collection mode or a manually triggered collection mode.
  • the video text data provided by the video data collection module 36 is first transmitted to the natural language processing module 35 for natural language semantic analysis, forming the video text semantic analysis result data, which is stored in the video relational database 37. It can be understood that the video text data provided by the video data collection module 36 may instead be stored in the video relational database 37 first, after which the natural language processing module 35 performs the word segmentation, part-of-speech tagging, and other semantic analysis operations on the video text data stored there. The video text semantic analysis result data can be understood as the result data after word segmentation and part-of-speech tagging of the video text data.
  • the video relational database 37 serves as the data source from which the video search server 30 performs video searches, and includes data tables such as a video data table, a backup data table, a user table, and a query record table.
  • the video data table stores, for example, semantically analyzed video text data
  • the backup data table stores, for example, duplicated and culled data
  • the user table stores, for example, user data
  • the query record table, for example, saves the user's video search records.
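The four tables named above can be sketched as a minimal relational schema; the column names below are illustrative assumptions (the patent does not specify a schema), shown here with an in-memory SQLite database:

```python
import sqlite3

# Hypothetical minimal schema for the four tables of the video relational
# database; all column names are assumptions for illustration only.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE video_data   (id INTEGER PRIMARY KEY, text TEXT, semantics TEXT);
CREATE TABLE backup_data  (id INTEGER PRIMARY KEY, text TEXT);
CREATE TABLE users        (id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
CREATE TABLE query_record (id INTEGER PRIMARY KEY, user_id INTEGER, query TEXT);
""")
tables = sorted(r[0] for r in con.execute(
    "SELECT name FROM sqlite_master WHERE type='table'"))
print(tables)   # → ['backup_data', 'query_record', 'users', 'video_data']
```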
  • the semantic space learning module 38 is the main machine learning component of the voice input based video search system 100. It is primarily responsible for quantizing the video text data in the video relational database 37, analyzing and learning the main semantics of the video relational database 37 based on latent semantic indexing (LSI) to obtain the video semantic space, and finding the semantic descriptor sub-set of the collected video text data in the video semantic space (i.e., its projection set in the video semantic space), which is stored in the video relational database 37.
  • the process of establishing the video semantic space may be as follows: the semantic space learning module 38 uses the video text semantic analysis result data stored in the video relational database 37 as a training sample set, from which a vocabulary containing a large number of useful words is established. Using this vocabulary, each video text data item (i.e., each video description) can be quantized and represented by a vector; each element of the vector is the number of occurrences of one vocabulary word in that video text data, so the vector is the word-frequency vector of the video text data.
  • in the linear space to which the word-frequency vectors belong, some special directions can be calculated; the vectors representing these special directions form an orthonormal set and span a new linear space.
  • the special physical meaning of this set of vectors is that each of them represents certain words that often appear together in a specific context. Each such specific context corresponds to a semantic topic; that is, the simultaneous appearance of certain words expresses a semantic. Only the part of this set of special vectors with a very high degree of semantic discrimination is preserved, and these preserved vectors ultimately constitute the video semantic space.
  • the video text data in the video relational database 37 will find a projection in the video semantic space, i.e., a semantic descriptor of the video text data in the video semantic space.
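The construction described above — word-frequency vectors, an orthonormal set of "semantic topic" directions found by latent semantic indexing, and projections as semantic descriptors — can be sketched with a toy corpus. This is a minimal LSI illustration using a plain SVD, not the patent's actual training procedure; the corpus and the choice of k are assumptions:

```python
import numpy as np

# Toy corpus: each string is the segmented text metadata of one video.
videos = [
    "war film epic battle soldier",
    "comedy film funny romance",
    "battle soldier history war",
    "romance comedy love funny",
]

# Vocabulary and word-frequency matrix (one row per video).
vocab = sorted({w for doc in videos for w in doc.split()})
tf = np.array([[doc.split().count(w) for w in vocab] for doc in videos],
              dtype=float)

# Latent semantic indexing via SVD: keep only the k most discriminative
# orthonormal directions ("semantic topics"); they span the semantic space.
k = 2
U, s, Vt = np.linalg.svd(tf, full_matrices=False)
semantic_axes = Vt[:k]                # retained orthonormal row vectors

# The semantic descriptor of each video is its projection onto those axes.
descriptors = tf @ semantic_axes.T    # shape: (number of videos, k)
print(descriptors.shape)
```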
  • the video search module 39 is connected to the control module 31 and the video relational database 37. It receives the user text semantic analysis result data provided by the control module 31, acquires the video semantic space from the video relational database 37 (for example, information such as the coordinate axes of the semantic space), and projects the user text semantic analysis result data into the video semantic space to obtain the projection (i.e., the semantic descriptor) of the user text data in the video semantic space. Subsequently, the video search module 39 can use this semantic descriptor to perform the video search operation.
  • the video search operation of the video search module 39 in the embodiment of the present invention may proceed as follows: first, the control module 31 performs a video pre-search in the video relational database 37 using the user text semantic analysis result data (that is, the semantically analyzed user text data), for example a classified search: a video director name search, a video actor name search, a video production date search, a video theme type search, a video region or language type search, a video category search, and so on. This reduces the workload of the video search module 39 and improves search efficiency.
  • the video pre-search result includes, for example, the set of semantic descriptors, in the video semantic space, of the related video text data matching the user text data; this semantic descriptor set is provided to the video search module 39 together with the user text semantic analysis result data.
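The classified pre-search above can be sketched as filtering on structured metadata fields before any semantic-space comparison; the field names and data are hypothetical, not from the patent:

```python
# Hypothetical classified pre-search: filter the video table on structured
# fields (director, theme, region, ...) extracted from the user text
# semantic analysis result, to shrink the candidate set cheaply.
videos = [
    {"title": "Movie A", "director": "Director A", "theme": "war",    "region": "China"},
    {"title": "Movie B", "director": "Director B", "theme": "comedy", "region": "USA"},
    {"title": "Movie C", "director": "Director A", "theme": "comedy", "region": "China"},
]

def pre_search(table, **criteria):
    """Keep only the rows matching every given field exactly."""
    return [row for row in table
            if all(row.get(field) == value for field, value in criteria.items())]

candidates = pre_search(videos, director="Director A", theme="comedy")
print([row["title"] for row in candidates])   # → ['Movie C']
```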
  • the video search module 39 compares the semantic descriptor of the user text data in the video semantic space with the semantic descriptors, in the video semantic space, of the related video text data included in the video pre-search result to obtain the video final search result, which is transmitted to the control module 31 and then provided by the control module 31 to the human machine interface 13 of the client 10 for presentation to the user.
  • the similarity comparison can be realized by calculating the Euclidean distance, but the present invention is not limited thereto, and other methods for calculating the similarity between projections in the semantic space can be employed.
  • the final search result of the video here may be a list of videos sorted according to the degree of similarity.
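The Euclidean-distance ranking described above can be sketched as follows; the two-dimensional descriptors are made-up values standing in for projections in the video semantic space:

```python
import numpy as np

# Hypothetical 2-D semantic descriptors: one for the user query, one per
# pre-searched video. Smaller Euclidean distance = higher similarity.
query_desc = np.array([0.9, 0.1])
video_descs = {
    "Movie A": np.array([0.8, 0.2]),
    "Movie B": np.array([0.1, 0.9]),
    "Movie C": np.array([0.7, 0.0]),
}

# Final search result: a video list sorted by ascending distance.
ranked = sorted(video_descs,
                key=lambda name: float(np.linalg.norm(video_descs[name] - query_desc)))
print(ranked)   # → ['Movie A', 'Movie C', 'Movie B']
```

Cosine similarity or any other projection-distance measure could be substituted here, matching the statement that the invention is not limited to Euclidean distance.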
  • in the above, the semantic space search is performed over only part of the video text data's semantic descriptor sub-set in the video semantic space. Alternatively, the video pre-search may be omitted, and the semantic descriptor of the user text data in the video semantic space may be used directly to perform the semantic space search over the entire semantic descriptor sub-set in order to output the video final search result.
  • the server management module 32 is disposed in the video search server 30 as a module that is not user-oriented.
  • the voice recognition module 33 of the above embodiment of the present invention can also be integrated into the client 10 instead of the video search server 30, so that the client 10 converts the user voice data into user text data before transmitting it to the video search server 30.
  • a video search method based on voice input mainly includes steps S400-S410:
  • S400 collecting voice input of the user to generate user voice data
  • S402 Perform speech recognition on the user voice data to obtain user text data;
  • S404 Perform natural language semantic analysis on the user text data to obtain user text semantic analysis result data;
  • S406 Perform a pre-search (for example, the aforementioned classified search) using the user text semantic analysis result data to obtain a video pre-search result, where the video pre-search result includes the semantic descriptor sub-set, in the video semantic space, of the related video text data matching the user text semantic analysis result data;
  • S408 Project the user text semantic analysis result data into the video semantic space and perform similarity comparisons with the semantic descriptors included in the video pre-search result to output the video final search result (for example, a video list sorted according to similarity score); and
  • S410 Present the final search result of the video to the user.
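The steps above can be tied together in a minimal end-to-end sketch; every function below is a hypothetical stub standing in for the corresponding module, not the patent's implementation:

```python
# Stubbed pipeline for the voice-input video search method.
def collect_voice():                    # S400: collect user voice input
    return b"raw-audio-bytes"

def speech_to_text(audio):              # S402: speech recognition
    return "comedy directed by Director A"

def semantic_analysis(text):            # semantic analysis (segmentation stub)
    return text.split()

def pre_search(terms):                  # S406: classified pre-search
    return ["Movie A", "Movie C"]

def semantic_rank(terms, candidates):   # S408: semantic-space similarity
    return sorted(candidates)           # stand-in for distance-based sorting

audio = collect_voice()
text = speech_to_text(audio)
terms = semantic_analysis(text)
candidates = pre_search(terms)
final_result = semantic_rank(terms, candidates)
print(final_result)                     # S410: present to the user
```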
  • another video search method based on voice input includes, for example, steps S500-S510:
  • S500 Quantize the video text semantic analysis result data obtained by performing natural language semantic analysis on the collected video text data, and train on it based on latent semantic indexing to obtain the video semantic space and the semantic descriptor sub-set of the collected video text data in the video semantic space;
  • S502 Collect the user's voice input and convert it into user text data;
  • S504 Perform natural language semantic analysis on the user text data to obtain user text semantic analysis result data;
  • S506 Use the semantic descriptor of the user text semantic analysis result data in the video semantic space to perform similarity comparisons within the semantic descriptor sub-set of the at least partially collected video text data, and output the video final search result; this covers both performing a video pre-search (for example, the aforementioned classified search) before the semantic space search and performing the semantic space search directly without a video pre-search; and
  • S508 Present the final search result of the video to the user.
  • the natural interactive input mode is not limited to voice input; it can also be direct natural language text input, or even gesture input. Accordingly, in the video search methods of the above embodiments, the text conversion step for user voice data is then not required, and the module design of the video search system can be increased, decreased, and/or changed as appropriate to the actual situation.
  • in summary, the video search system and method based on natural interactive input (such as voice input) and the video search server provided by the embodiments of the present invention have at least one or more of the following advantages: they are guided by the user's video target task and allow the user to interact using natural language; through natural language processing technology and reasoning over a video-related knowledge base, the user can quickly obtain related videos from the database by providing only a simple description of the video content, thereby realizing intelligent perception of the user's video target task; they realize a natural and friendly human-computer interaction mode and interface, with the ability to continuously learn and upgrade; therefore, they can effectively enhance the user experience.

Landscapes

  • Engineering & Computer Science (AREA)
  • Library & Information Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to the field of video search technology, and provides a video search system and method and a video search server based on natural interaction input. The user end of the video search system receives the user's natural interaction input and provides it to the video search server for video search, where the video search server may comprise an online part and an offline part. The offline part performs semantic analysis on collected video information and establishes a video semantic space and a video relational database. The online part obtains the user's text data from the natural interaction input, performs semantic analysis on it, uses the semantic analysis result to perform a video pre-search in the relational database, and, according to the semantic descriptor of the semantic analysis result in the video semantic space, performs comparison and search within the semantic descriptor subset contained in the video pre-search result to output the video final search result to the user. The user only needs to provide a simple description of the video content to quickly obtain relevant videos from the database, thereby implementing intelligent awareness of the user's video target task.

Description

基于自然交互输入的视频搜索系统及方法和视频搜索服务器  Video search system and method based on natural interactive input and video search server
技术领域 Technical field
本发明涉及视频搜索技术领域, 特别是关于基于自然交互输入 (例如语 音输入)的视频搜索系统及方法、 以及视频搜索服务器。 背景技术  The present invention relates to the field of video search technology, and more particularly to a video search system and method based on natural interactive input (e.g., voice input), and a video search server. Background technique
随着电子信息和网络技术的发展, 具有网络接入功能的智能电视逐渐 成为电视市场的主流。 其中, 视频则是智能电视用户最主要的需求。 不同 于个人计算机外围设备的鼠标和键盘, 目前智能电视的人机交互仍然以传 统的遥控器方式为主; 然而, 大量的按钮、 复杂的使用模式和菜单、 繁琐 且令人困惑的界面元素, 随着电视的复杂化和功能的不断增强, 传统的人 机交互方式也因此变得越来越不能满足用户的需求。  With the development of electronic information and network technology, smart TVs with network access functions have gradually become the mainstream of the TV market. Among them, video is the most important requirement of smart TV users. Unlike the mouse and keyboard of personal computer peripherals, the current human-computer interaction of smart TVs is still dominated by traditional remote controls; however, a large number of buttons, complex usage patterns and menus, cumbersome and confusing interface elements, With the complication and function of TV, the traditional human-computer interaction method has become more and more unable to meet the needs of users.
近期以来, 随着语音识别技术的发展, 出现了以美国苹果 (APPLE)公司 推出的个人语音助理 (Personalized Intelligent Assistant, SIRI)为代表的产品, 其能够让用户通过自然语言与设备终端进行交互, 并能够提供例如发短信、 查天气等多项功能。 目前, SIRI 尚不能支持中文语音输入。 近年来, 国内 相关行业也开始进行基于语音等自然交互方式的研究与应用并取得了一定 的成果, 但总的来看, 基于语音等自然交互方式的产品应用仍难以满足用 户的体验要求。 发明内容 本发明的发明目的之一在于提供一种基于自然交互输入的视频搜索系 统, 能实现对用户的视频目标任务的智能感知, 提供更自然友好流畅的用 户体验。 Recently, with the development of speech recognition technology, there has been a product represented by Apple's Personalized Intelligent Assistant (SIRI), which enables users to interact with device terminals through natural language. And can provide many functions such as texting, weather checking, etc. Currently, SIRI does not support Chinese speech input. In recent years, domestic related industries have begun research and application based on natural interaction methods such as voice and achieved certain results. However, in general, product applications based on natural interaction methods such as voice are still difficult to meet user experience requirements. Summary of the invention One of the objects of the present invention is to provide a video search system based on natural interactive input, which can realize intelligent sensing of a user's video target task and provide a more natural and friendly user experience.
本发明的另一发明目的在于提供一种基于自然交互输入的视频搜索方 法, 能实现对用户的视频目标任务的智能感知, 提供更自然友好流畅的用 户体验。  Another object of the present invention is to provide a video search method based on natural interactive input, which can realize intelligent perception of a user's video target task and provide a more natural and friendly user experience.
本发明的再一发明目的在于提供一种视频搜索服务器, 具有自然语言 语义分析能力及智能的视频搜索能力。  Still another object of the present invention is to provide a video search server having natural language semantic analysis capabilities and intelligent video search capabilities.
Specifically, an embodiment of the present invention provides a video search system based on natural interaction input, comprising a client and a video search server. The client comprises a voice collection module and a human-machine interface; the voice collection module collects the user's voice input to generate user voice data and supplies it to the human-machine interface. The video search server comprises a control module, a speech recognition module, a natural language processing module, a video relational database, and a video search module; the video relational database stores a video semantic space and a set of semantic descriptors of video text data in that semantic space. The control module receives the user voice data supplied by the client's human-machine interface and passes it to the speech recognition module to obtain user text data, passes the user text data to the natural language processing module to obtain user text semantic analysis result data, and performs a pre-search in the video relational database with the user text semantic analysis result data to obtain a video pre-search result. The video pre-search result contains the semantic descriptors, in the video semantic space, of the related video text data that matches the user text semantic analysis result data. The video search module receives the user text semantic analysis result data and the video pre-search result from the control module, compares, in terms of similarity, the semantic descriptor of the user text semantic analysis result data in the video semantic space against each semantic descriptor contained in the video pre-search result, and outputs the final video search result to the control module according to the comparison; the control module then supplies the result to the human-machine interface for presentation to the user.
In addition, an embodiment of the present invention provides a video search method based on natural interaction input, comprising the steps of: (a) collecting the user's natural interaction input to obtain user text data; (b) performing natural language semantic analysis on the user text data to obtain user text semantic analysis result data; (c) performing a pre-search with the user text semantic analysis result data to obtain a video pre-search result, the video pre-search result containing the semantic descriptors, in the video semantic space, of the related video text data that matches the user text semantic analysis result data; (d) projecting the user text semantic analysis result data into the video semantic space, comparing the projection for similarity against each semantic descriptor contained in the video pre-search result, and outputting the final video search result; and (e) presenting the final video search result to the user.
Another embodiment of the present invention provides a voice-input-based video search method, comprising the steps of: (1) quantizing the video text semantic analysis result data obtained by performing natural language semantic analysis on collected video text data, training on it based on latent semantic indexing to obtain a video semantic space, and obtaining the set of semantic descriptors of the collected video text data in that semantic space; (2) collecting the user's natural interaction input to obtain user text data; (3) performing natural language semantic analysis on the user text data to obtain user text semantic analysis result data; (4) comparing the semantic descriptor of the user text semantic analysis result data in the video semantic space for similarity against the semantic descriptors of at least part of the collected video text data in that semantic space, and outputting the final video search result; and (5) presenting the final video search result to the user.
In addition, an embodiment of the present invention provides a video search server, comprising a video relational database, a natural language processing module, a control module, and a video search module. The video relational database stores a video semantic space and the set of semantic descriptors of video text data in that semantic space; the control module supplies user text data representing the user's video request to the natural language processing module to obtain user text semantic analysis result data; the video search module obtains the semantic descriptor of the user text semantic analysis result data in the video semantic space, uses that descriptor to perform similarity comparison against the semantic descriptors of at least part of the video text data in the semantic space, and outputs the final video search result to the control module.
The natural-interaction-input-based video search systems and methods and the video search servers of the above embodiments of the present invention have at least one or more of the following advantages: they are oriented toward the user's video target task and allow the user to interact in natural language; through natural language processing technology, inference is performed over a video-related knowledge base, so that the user can quickly retrieve related videos from the database by providing only a simple description of the video content, thereby achieving intelligent perception of the user's video target task. In addition, they provide a natural, friendly, and convenient mode and interface of human-computer interaction and have the ability to keep learning and upgrading, thereby effectively improving the user experience.
The above description is merely an overview of the technical solution of the present invention. In order that the technical means of the present invention may be understood more clearly and implemented in accordance with the contents of this specification, and in order that the above and other objects, features, and advantages of the present invention may become more apparent, preferred embodiments are described in detail below with reference to the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a schematic architectural diagram of a video search system based on natural interaction input (for example, voice input) according to an embodiment of the present invention.
FIG. 2 is a block diagram of the client shown in FIG. 1.
FIG. 3 is a block diagram of the video search server shown in FIG. 1.

FIG. 4 is a flowchart of a voice-input-based video search method according to an embodiment of the present invention.

FIG. 5 is a flowchart of another voice-input-based video search method according to an embodiment of the present invention.

DETAILED DESCRIPTION
To further explain the technical means and effects adopted by the present invention to achieve its intended objects, specific embodiments, methods, steps, and effects of the natural-interaction-input-based video search system and method and the video search server proposed according to the present invention are described in detail below with reference to the accompanying drawings and preferred embodiments.
The foregoing and other technical contents, features, and effects of the present invention will become apparent from the following detailed description of the preferred embodiments with reference to the accompanying drawings. Through the description of the specific embodiments, the technical means and effects adopted by the present invention to achieve its intended objects can be understood more deeply and concretely; the accompanying drawings, however, are provided for reference and illustration only and are not intended to limit the present invention.
Referring to FIG. 1, a schematic architectural diagram of a video search system based on natural interaction input (for example, voice input) according to an embodiment of the present invention: as shown in FIG. 1, the voice-input-based video search system 100 of this embodiment comprises a client 10 and a video search server 30. The client 10 receives the user's voice input and generates user voice data; the video search server 30 performs a video search according to the user voice data and returns the final video search result to the client 10 for presentation to the user. It should be noted that, in the voice-input-based video search system 100 of this embodiment, one video search server 30 may serve multiple clients 10, responding to the user voice data of each client 10 and returning the corresponding final video search result.
Referring to FIG. 2, a block diagram of the client 10 according to an embodiment of the present invention: as shown in FIG. 2, the client 10 comprises, for example, a voice collection module 11 and a human-machine interface 13. The voice collection module 11 collects the user's voice input and generates user voice data, which is transmitted to the video search server 30 through the human-machine interface 13. The tasks of the human-machine interface 13 include, for example, human-computer interaction, user information recording, and user authentication. For user authentication, two usage modes may be provided for the user, such as a public mode and a privacy mode; correspondingly, the video search server 30 may perform the video search either with user authentication enabled or with it skipped, which both protects the user's personal information and allows video search results appropriate to users of different age ranges to be provided. In this embodiment, the client 10 is, for example, an electronic product such as a smart TV with a TV remote control (having Internet access), a desktop computer, a notebook computer, or a smartphone. When the client 10 is a smart TV with a TV remote control, the voice collection module 11 may be a microphone built into the TV remote control, and the human-machine interface 13 may be a Hypertext Transfer Protocol (HTTP) web service running on the smart TV (for example, on port 80), which transmits the user voice data output by the microphone to the video search server 30 for video search and subsequently displays the final video search result for presentation to the user. It will further be understood that the user voice data may be compressed before being transmitted to the video search server 30.
Referring to FIG. 3, a block diagram of the video search server 30 according to an embodiment of the present invention: as shown in FIG. 3, the video search server 30 comprises a control module 31, a speech recognition module 33, a natural language processing module 35, a video data collection module 36, a video relational database 37, a semantic space learning module 38, a video search module 39, and a server management module 32. It is noted here that each module in the video search server 30 may be implemented in hardware and/or software according to the needs of the actual design; furthermore, the video search server 30 may consist of a single server or of a group of multiple servers, together with the necessary peripheral equipment. In addition, in this embodiment, the video search server 30 comprises an online part and an offline part: the online part mainly consists of the control module 31, the speech recognition module 33, the natural language processing module 35, and the video search module 39, while the offline part mainly consists of the video data collection module 36, the video relational database 37, and the semantic space learning module 38, sharing the natural language processing module 35 with the online part.
Specifically, the control module 31 serves as the scheduling center of the entire video search server 30: it receives the user voice data transmitted from the client 10 (for example, over a wired or wireless network connection) and finally returns the final video search result as output to the client 10. When the human-machine interface 13 of the client 10 is provided with a user authentication mechanism, the control module 31 first verifies the user's identity and determines, according to the authentication result, whether the subsequent video search should be performed and/or whether the search results need to be filtered before the final video search result is returned.
The speech recognition module 33 performs speech recognition on voice data to convert it into corresponding text data; it is typically connected to a speech library (not shown in FIG. 3) for voice command matching. In this embodiment, the speech recognition module 33 converts the user voice data supplied by the control module 31 into user text data representing the user's video request and returns it to the control module 31.
The natural language processing module 35 is adapted to perform semantic analysis on text data (for example, user text data and video text data); for instance, it can perform Chinese semantic analysis, including word segmentation, part-of-speech tagging, named entity analysis, and so on. It will be understood, of course, that the natural language processing module 35 may also perform semantic analysis on text in other languages; it is not limited to Chinese and may also handle English and so on, provided semantic libraries for the corresponding languages are supplied. In this embodiment, the natural language processing module 35 performs semantic analysis on the user text data supplied by the control module 31 and returns user text semantic analysis result data to the control module 31. Here, the user text semantic analysis result data can be understood as the user text data after operations such as word segmentation and part-of-speech tagging.
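The word segmentation step mentioned above can be illustrated with a toy example. The sketch below is a minimal forward-maximum-matching segmenter over a small hypothetical vocabulary; it is not the module's actual implementation, and the vocabulary and maximum word length are assumptions chosen for illustration only.

```python
# Minimal forward-maximum-matching segmenter, illustrating the kind of Chinese
# word segmentation the natural language processing module performs.
# VOCAB and MAX_WORD_LEN are hypothetical stand-ins for a real semantic lexicon.

VOCAB = {"科幻", "电影", "战争", "喜剧", "导演", "我想看", "美国"}
MAX_WORD_LEN = 3

def segment(text: str) -> list:
    """Greedy longest-match segmentation; unknown characters become single tokens."""
    tokens, i = [], 0
    while i < len(text):
        for length in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in VOCAB:
                tokens.append(candidate)
                i += length
                break
    return tokens

print(segment("我想看美国科幻电影"))  # → ['我想看', '美国', '科幻', '电影']
```

A production segmenter would also attach part-of-speech and named-entity labels to each token, which this sketch omits.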
The video data collection module 36 collects video data and provides video text data. The video text data may be text data about movies, TV dramas, songs, TV programs, and so on retrieved from the network (including film and television program partners), for example video description text comprising fields such as video title, alias, director name, actor names, production year, video subject genre (for example war film, comedy, etc.), video region (for example China, the United States, etc.) or language (for example Chinese, English, etc.), video category (for example movie, TV drama, etc.), as well as a data validity flag. The video data collection module 36 may operate by periodic automatic collection or by manually triggered collection. In this embodiment, the video text data provided by the video data collection module 36 is first sent to the natural language processing module 35 for natural language semantic analysis to form video text semantic analysis result data, which is then stored in the video relational database 37. It will be understood that the video text data provided by the video data collection module 36 may instead be stored in the video relational database 37 first, after which the natural language processing module 35 performs word segmentation, part-of-speech tagging, and so on (that is, semantic analysis) on the video text data stored in the video relational database 37. Here, the video text semantic analysis result data can be understood as the result of performing operations such as word segmentation and part-of-speech tagging on the video text data.
The video relational database 37 serves as the data source from which the video search server 30 performs video search; it includes data tables such as a video data table, a backup data table, a user table, and a query record table. The video data table stores, for example, the semantically analyzed video text data; the backup data table stores, for example, duplicated and culled data; the user table stores, for example, user data; and the query record table stores, for example, the users' video search records.
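As a rough illustration of this relational layout, the following SQLite sketch creates the four tables described above. The patent does not specify a schema, so every table and column name here is an assumption chosen for illustration.

```python
import sqlite3

# Sketch of the four tables the video relational database is described as holding.
# All table and column names are illustrative assumptions, not the patented schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE video_data (          -- semantically analyzed video text data
    video_id INTEGER PRIMARY KEY,
    title TEXT, director TEXT, actors TEXT,
    year INTEGER, genre TEXT, region TEXT, category TEXT,
    descriptor BLOB,               -- semantic descriptor in the video semantic space
    valid INTEGER DEFAULT 1        -- data-validity flag
);
CREATE TABLE backup_data (         -- duplicated and culled records
    video_id INTEGER, raw_text TEXT
);
CREATE TABLE users (               -- user data, incl. public/privacy mode
    user_id INTEGER PRIMARY KEY, name TEXT, mode TEXT
);
CREATE TABLE query_log (           -- per-user video search records
    user_id INTEGER, query TEXT, ts TEXT
);
""")

conn.execute("INSERT INTO video_data (title, director, year, genre) VALUES (?,?,?,?)",
             ("Example Film", "Example Director", 2012, "comedy"))
row = conn.execute("SELECT title, genre FROM video_data").fetchone()
print(row)  # → ('Example Film', 'comedy')
```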
The semantic space learning module 38 is the main machine learning component of the voice-input-based video search system 100. It is chiefly responsible for quantizing the video text data in the video relational database 37, then analyzing and learning the principal semantics in the video relational database 37 based on latent semantic indexing (LSI) to obtain a video semantic space, finding the set of semantic descriptors of the collected video text data in that semantic space (that is, the set of their projections in the space), and storing the result in the video relational database 37.
The video semantic space may be built as follows. The semantic space learning module 38 takes the semantically analyzed video text semantic analysis result data stored in the video relational database 37 as a training sample set, from which a vocabulary containing a large number of useful words is built. Using this vocabulary, each item of video text data (that is, each video description) can be quantized and ultimately represented by a vector, each element of which is the number of times a particular word occurs in that item of video text data; this vector is the word-frequency vector of the video text data. Then, using the word-frequency vectors of a large amount of video text data, a number of special directions in the linear space to which the word-frequency vectors belong can be computed by subspace machine learning; the vectors representing these special directions form an orthonormal set, which spans a new linear space. The particular physical meaning of this set of vectors is that each of them represents certain words that frequently co-occur in a specific context; each such specific context corresponds to a semantic topic, i.e., the co-occurrence of certain words expresses a semantic meaning. In general, however, only a portion of the special vectors spanning the new linear space have very high semantic discrimination, and only these are retained. The retained vectors ultimately constitute the video semantic space. The video text data in the video relational database 37 is then projected into this video semantic space; each projection is the semantic descriptor of that video text data in the space.
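The training procedure just described (word-frequency vectors, an orthonormal set of semantic directions, retention of only the most discriminative ones) is in essence a truncated singular value decomposition of the term-document matrix, which is how latent semantic indexing is commonly realized. Below is a minimal sketch with toy numbers, assuming NumPy is available; it is not the module's actual training code.

```python
import numpy as np

# Rows of A are word-frequency vectors of video descriptions
# (4 documents x 5 vocabulary terms, toy numbers).
A = np.array([
    [2, 1, 0, 0, 0],   # "war film" style description
    [1, 2, 0, 0, 0],
    [0, 0, 1, 2, 1],   # "comedy" style description
    [0, 0, 2, 1, 1],
], dtype=float)

# Truncated SVD: keep only the k directions with the highest semantic
# discrimination (largest singular values); these span the video semantic space.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
semantic_axes = Vt[:k]            # orthonormal basis of the video semantic space

def descriptor(term_freq):
    """Project a word-frequency vector onto the semantic space."""
    return semantic_axes @ term_freq

doc_descriptors = np.array([descriptor(row) for row in A])
print(doc_descriptors.shape)  # → (4, 2)
```

Each row of `doc_descriptors` is the semantic descriptor of one video description; a user query quantized over the same vocabulary is projected by the same `descriptor` function.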
The video search module 39 is connected to the control module 31 and the video relational database 37. It receives the user text semantic analysis result data supplied by the control module 31, obtains the video semantic space (for example, its coordinate axes and related information) from the video relational database 37, and projects the user text semantic analysis result data into the video semantic space to obtain the projection (that is, the semantic descriptor) of the user text data in the space. The video search module 39 can then perform the video search operation with this semantic descriptor.
In an embodiment of the present invention, the video search operation of the video search module 39 may proceed as follows. First, the control module 31 performs a video pre-search in the video relational database 37 with the user text semantic analysis result data (that is, the semantically analyzed user text data), for example a classified search over several or all of: video director name, video actor name, production year, video subject genre, video region or language, video category, and so on. This reduces the workload of the subsequent video search performed by the video search module 39 and improves search efficiency. Here, the video pre-search result contains, for example, the set of semantic descriptors, in the video semantic space, of the related video text data matching the user text data; this set is supplied to the video search module 39 together with the user text semantic analysis result data. The video search module 39 then compares, in terms of similarity, the semantic descriptor of the user text data in the video semantic space against each semantic descriptor in the set contained in the video pre-search result, obtains the final video search result, and transmits it to the control module 31, which supplies it to the human-machine interface 13 of the client 10 for presentation to the user. The similarity comparison may be implemented by computing the Euclidean distance, but the present invention is not limited thereto; any other method capable of computing the similarity between projections in a semantic space may be used. In addition, the final video search result here may be a list of videos sorted by similarity score.
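The ranking step just described can be sketched as follows: compute the Euclidean distance between the query's semantic descriptor and the descriptor of each candidate that survived the pre-search, then sort the candidates by distance (smaller means more similar). All titles and descriptor values below are illustrative assumptions, not output of a real trained semantic space.

```python
import math

# Rank candidate videos by Euclidean distance between the user query's
# semantic descriptor and each candidate's descriptor.

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank(query_descriptor, candidates):
    """candidates: {video_title: descriptor}; returns titles, most similar first."""
    return sorted(candidates,
                  key=lambda title: euclidean(query_descriptor, candidates[title]))

pre_search_results = {           # descriptors of videos surviving the pre-search
    "War Epic":   (0.9, 0.1),
    "Comedy Hit": (0.1, 0.8),
    "War Drama":  (0.7, 0.3),
}
query = (0.85, 0.15)             # projection of the user's request, e.g. "war film"

print(rank(query, pre_search_results))  # → ['War Epic', 'War Drama', 'Comedy Hit']
```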
It should be noted that embodiments of the present invention are not limited to the above semantic space search, in which the semantic descriptor of the user text data in the video semantic space is compared against the semantic descriptors of only part of the video text data in the space. In other embodiments, the video pre-search may be omitted, and the semantic descriptor of the user text data in the video semantic space may be compared directly against the semantic descriptors of all the video text data in the space to obtain the final video search result.
In addition, to provide administrators and developers with an interface for debugging, testing, deploying, and maintaining the video search server, a server management module 32 is configured in the video search server 30 as a non-user-facing module.
Furthermore, the speech recognition module 33 of the above embodiments of the present invention may instead be integrated into the client 10 rather than the video search server 30, so that the client 10 first converts the user voice data into user text data and then transmits it to the control module 31 in the video search server 30.
Several voice-input-based video search methods applicable to the above video search system 100 based on natural interaction input, for example voice input, are briefly described below.
As shown in FIG. 4, one voice-input-based video search method mainly comprises steps S400-S410:
S400: collecting the user's voice input to generate user voice data;
S402: performing speech recognition on the user voice data to obtain user text data;
S404: performing natural language semantic analysis on the user text data to obtain user text semantic analysis result data;
S406: performing a pre-search (for example, the classified search described above) with the user text semantic analysis result data to obtain a video pre-search result, the video pre-search result containing the semantic descriptors, in the video semantic space, of the related video text data matching the user text semantic analysis result data;
S408: projecting the user text semantic analysis result data into the video semantic space and comparing the projection for similarity against each semantic descriptor contained in the video pre-search result to output the final video search result (for example, a list of videos sorted by similarity score); and
S410: presenting the final video search result to the user.

As shown in FIG. 5, another voice-input-based video search method mainly comprises steps S500-S510:

S500: quantizing the video text semantic analysis result data obtained by performing natural language semantic analysis on collected video text data, training on it based on latent semantic indexing to obtain a video semantic space, and obtaining the set of semantic descriptors of the collected video text data in the video semantic space;
S502: collecting the user's voice input and converting it into user text data;
S504: performing natural language semantic analysis on the user text data to obtain user text semantic analysis result data;
S506: comparing the semantic descriptor of the user text semantic analysis result data in the video semantic space for similarity against the semantic descriptors of at least part of the collected video text data in the video semantic space to output the final video search result; more specifically, step S506 covers both cases described above, namely performing a video pre-search (for example, the classified search described above) followed by a semantic space search, and performing a semantic space search directly without a video pre-search; and
S508: presenting the final video search result to the user.
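Strung together, steps S500-S508 form a pipeline like the following sketch, in which every component is a toy stand-in (a bag-of-words "semantic space", a hard-coded transcription) rather than the patented implementation:

```python
# End-to-end sketch of steps S500-S508; all components are illustrative toys.

def build_semantic_space(video_texts):            # S500 (offline training)
    vocab = sorted({w for t in video_texts for w in t.split()})
    def descriptor(text):                          # word-frequency "projection"
        return tuple(text.split().count(w) for w in vocab)
    return descriptor, {t: descriptor(t) for t in video_texts}

def collect_user_text():                           # S502 (speech already transcribed)
    return "war film"

def analyze(text):                                 # S504: trivial "semantic analysis"
    return text

def search(query_text, descriptor, index):         # S506: similarity comparison
    q = descriptor(query_text)
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(index, key=lambda title: dist(q, index[title]))

videos = ["classic war film", "light comedy film", "war documentary"]
descriptor, index = build_semantic_space(videos)
best = search(analyze(collect_user_text()), descriptor, index)
print(best)                                        # S508: present result → 'classic war film'
```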
In addition, those skilled in the art will understand that the natural interaction input is not limited to voice input; it may also be direct natural language text input or even gesture input. Correspondingly, in the video search methods of the above embodiments, the step of converting user voice data into text is then unnecessary, and the module design of the video search system may likewise be appropriately extended, reduced, and/or modified according to the actual situation.
In summary, the video search systems and methods based on natural interaction input, for example voice input, and the video search servers provided by the embodiments of the present invention have at least one or more of the following advantages: they are oriented toward the user's video target task and allow the user to interact in natural language; through natural language processing technology, inference is performed over a video-related knowledge base, so that the user can quickly retrieve related videos from the database by providing only a simple description of the video content, thereby achieving intelligent perception of the user's video target task. In addition, they provide a natural, friendly, and convenient mode and interface of human-computer interaction and have the ability to keep learning and upgrading, thereby effectively improving the user experience.
The above are merely preferred embodiments of the present invention and do not limit the present invention in any form. Although the present invention has been disclosed above by way of preferred embodiments, they are not intended to limit it; anyone skilled in the art may, without departing from the scope of the technical solution of the present invention, use the technical contents disclosed above to make minor changes or modifications resulting in equivalent embodiments. Any simple modification, equivalent variation, or refinement of the above embodiments made according to the technical essence of the present invention without departing from the content of its technical solution still falls within the scope of the technical solution of the present invention.

INDUSTRIAL APPLICABILITY
The video search system and method based on natural interaction input (e.g. voice input) and the video search server provided by the present invention have at least one or more of the following advantages: they are oriented toward the user's video target task and allow the user to interact in natural language; through natural language processing technology and inference over a video-related knowledge base, the user need only provide a simple description of the video content to quickly retrieve the relevant videos from the database, thereby achieving intelligent perception of the user's video target task; in addition, they provide a natural, friendly and convenient mode and interface for human-machine interaction, with the ability to continuously learn and upgrade; the user experience can therefore be effectively improved.

Claims

1. A video search system based on natural interaction input, comprising:
a user terminal, comprising a voice acquisition module and a human-machine interface, the voice acquisition module acquiring the user's voice input to generate user voice data and providing it to the human-machine interface; and
a video search server, comprising a control module, a speech recognition module, a natural language processing module, a video relational database, and a video search module, the video relational database storing a video semantic space and the set of semantic descriptors of video text data in the video semantic space,
wherein the control module receives the user voice data provided by the human-machine interface of the user terminal and provides it to the speech recognition module to obtain user text data, provides the user text data to the natural language processing module to obtain user-text semantic analysis result data, and performs a pre-search in the video relational database using the user-text semantic analysis result data to obtain a video pre-search result, the video pre-search result comprising the set of semantic descriptors, in the video semantic space, of the related video text data matching the user-text semantic analysis result data; and
the video search module receives the user-text semantic analysis result data and the video pre-search result provided by the control module, compares the semantic descriptor of the user-text semantic analysis result data in the video semantic space against each semantic descriptor in the set contained in the video pre-search result for similarity, and outputs a final video search result to the control module according to the comparison result, which the control module then provides to the human-machine interface for presentation to the user.
2. The video search system based on natural interaction input according to claim 1, wherein the video search server further comprises:
a video data collection module, which collects video data to provide video text data to the natural language processing module, the natural language processing module outputting video-text semantic analysis result data to the video relational database for storage; and
a semantic space learning module, which performs training on the video-text semantic analysis result data stored in the video relational database to obtain the video semantic space, finds the semantic descriptor of each item of video text data in the video semantic space, and stores them in the video relational database.
3. A video search method based on natural interaction input, comprising the steps of:
acquiring a user's natural interaction input to obtain user text data;
performing natural language semantic analysis on the user text data to obtain user-text semantic analysis result data;
performing a pre-search using the user-text semantic analysis result data to obtain a video pre-search result, the video pre-search result comprising the set of semantic descriptors, in a video semantic space, of the related video text data matching the user-text semantic analysis result data;
projecting the user-text semantic analysis result data into the video semantic space and comparing it against each semantic descriptor in the set contained in the video pre-search result for similarity, so as to output a final video search result; and
presenting the final video search result to the user.
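A minimal sketch of the projection-and-compare step described in claim 3. The 2×4 term-to-topic matrix, the vocabulary, the candidate descriptors, and the names `project`, `cosine`, `TOPICS` are all invented for illustration; a real system would learn the semantic space from a video corpus rather than hard-code it.

```python
# Toy illustration of projecting a query into a semantic (topic) space and
# ranking candidate videos by cosine similarity. All numbers are invented.

def project(term_vec, topic_matrix):
    """Project a term-frequency vector into the topic space (one dot product per topic)."""
    return [sum(w * t for w, t in zip(row, term_vec)) for row in topic_matrix]

def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either is the zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical 2-topic space over a 4-term vocabulary.
TOPICS = [
    [0.9, 0.8, 0.1, 0.0],   # topic 0: loads heavily on the first two terms
    [0.0, 0.1, 0.9, 0.7],   # topic 1: loads heavily on the last two terms
]

# Semantic descriptors of two candidate videos, already in the topic space
# (these would come from the pre-search result in claim 3).
candidates = {"video_A": [1.5, 0.2], "video_B": [0.1, 1.4]}

query_terms = [1, 1, 0, 0]          # the user's text mentions the first two terms
q = project(query_terms, TOPICS)    # query descriptor in the semantic space

best = max(candidates, key=lambda v: cosine(q, candidates[v]))
print(best)  # video_A: its descriptor points in nearly the same direction as q
```

The two-stage design in claim 3 matters here: the (cheap) pre-search shrinks the candidate set, so the (more expensive) per-descriptor cosine comparison only runs over the survivors.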
4. The video search method based on natural interaction input according to claim 3, further comprising the steps of:
collecting video text data;
performing natural language semantic analysis on the collected video text data to obtain video-text semantic analysis result data; and
performing training on the video-text semantic analysis result data to obtain the video semantic space and to find the semantic descriptor of each item of the collected video text data in the video semantic space.
5. The video search method based on natural interaction input according to claim 3, wherein the step of performing a pre-search using the user-text semantic analysis result data to obtain a video pre-search result comprises:
performing a classified search using the user-text semantic analysis result data, the classified search comprising several or all of: a search by video director name, a search by video actor name, a search by video production year, a search by video subject type, a search by video region or language type, and a search by video category.
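A minimal sketch of the classified pre-search in claim 5, assuming the semantic analysis step has already reduced the utterance to field/value slots (director, actor, year, genre). The record schema, the sample data, and the helper name `pre_search` are invented for illustration, not taken from the patent.

```python
# Toy classified pre-search: keep only the videos matching every slot that
# semantic analysis extracted from the user's utterance. Data is invented.

videos = [
    {"title": "Film A", "director": "Zhang", "actor": "Li",   "year": 2010, "genre": "action"},
    {"title": "Film B", "director": "Wang",  "actor": "Li",   "year": 2011, "genre": "romance"},
    {"title": "Film C", "director": "Zhang", "actor": "Chen", "year": 2011, "genre": "action"},
]

def pre_search(records, **slots):
    """Return the records whose fields match every extracted slot."""
    return [r for r in records if all(r.get(k) == v for k, v in slots.items())]

# e.g. the user's request was analysed as: director "Zhang", genre "action"
hits = pre_search(videos, director="Zhang", genre="action")
print([r["title"] for r in hits])  # ['Film A', 'Film C']
```

In the full pipeline of claims 3 and 5, the descriptors of these hits would then go to the similarity-comparison stage rather than straight to the user.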
6. A video search method based on natural interaction input, comprising the steps of:
quantizing the video-text semantic analysis result data obtained by performing natural language semantic analysis on collected video text data, performing training based on latent semantic indexing to obtain a video semantic space, and obtaining the set of semantic descriptors of the collected video text data in the video semantic space;
acquiring a user's natural interaction input to obtain user text data;
performing natural language semantic analysis on the user text data to obtain user-text semantic analysis result data;
comparing the semantic descriptor of the user-text semantic analysis result data in the video semantic space against the semantic descriptors of at least part of the collected video text data in the video semantic space for similarity, so as to output a final video search result; and
presenting the final video search result to the user.
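A minimal sketch of the quantization step in claim 6: turning collected video text into a term-document count matrix, which is the standard input to latent semantic indexing. The documents, the whitespace tokenization, and the names `quantize` and `vocab` are illustrative assumptions; a real system would tokenize with its NLP module and would then factor this matrix (e.g. by truncated SVD) to obtain the low-rank video semantic space, which is not implemented here.

```python
# Toy quantization: each video's text becomes a term-frequency vector over a
# shared vocabulary. LSI would then factor the resulting matrix; that
# factorization step is deliberately omitted from this sketch.

docs = {
    "video_A": "action hero fight fight city",
    "video_B": "love story city romance",
}

# Fixed, sorted vocabulary so every document maps to the same coordinate order.
vocab = sorted({w for text in docs.values() for w in text.split()})

def quantize(text, vocabulary):
    """Term-frequency vector of a document over a fixed vocabulary."""
    words = text.split()
    return [words.count(term) for term in vocabulary]

matrix = {doc: quantize(text, vocab) for doc, text in docs.items()}
print(vocab)               # alphabetical term order
print(matrix["video_A"])   # 'fight' occurs twice, hence the 2
```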
7. A video search server, comprising:
a video relational database, which stores a video semantic space and the set of semantic descriptors of video text data in the video semantic space;
a natural language processing module;
a control module, which provides user text data representing a user's video requirement to the natural language processing module to obtain user-text semantic analysis result data; and
a video search module, which obtains the semantic descriptor of the user-text semantic analysis result data in the video semantic space, and uses this semantic descriptor to perform a similarity comparison against the semantic descriptors of at least part of the video text data in the video semantic space, so as to output a final video search result to the control module.
8. The video search server according to claim 7, wherein the control module further performs a pre-search in the video relational database using the user-text semantic analysis result data to obtain a video pre-search result, the video pre-search result comprising the set of semantic descriptors, in the video semantic space, of the related video text data matching the user-text semantic analysis result data; correspondingly, the video search module performs the similarity comparison, using the semantic descriptor corresponding to the user-text semantic analysis result data, within the set of semantic descriptors contained in the video pre-search result, so as to output the final video search result to the control module.
9. The video search server according to claim 7, further comprising:
a speech recognition module, wherein after the control module receives user voice data, it converts the user voice data via the speech recognition module into the user text data representing the user's video requirement.
10. The video search server according to claim 7, 8 or 9, further comprising:
a video data collection module, which collects video data to provide video text data to the natural language processing module, the natural language processing module outputting video-text semantic analysis result data to the video relational database for storage; and
a semantic space learning module, which quantizes the video-text semantic analysis result data stored in the video relational database, performs training based on latent semantic indexing to obtain the video semantic space, finds the semantic descriptor of each item of video text data in the video semantic space, and stores them in the video relational database.
PCT/CN2012/086283 2012-06-18 2012-12-10 Video search system, method and video search server based on natural interaction input WO2013189156A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201210199239.7 2012-06-18
CN201210199239.7A CN102750366B (en) 2012-06-18 2012-06-18 Video search system and method based on natural interactive import and video search server

Publications (1)

Publication Number Publication Date
WO2013189156A1 true WO2013189156A1 (en) 2013-12-27

Family

ID=47030551

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2012/086283 WO2013189156A1 (en) 2012-06-18 2012-12-10 Video search system, method and video search server based on natural interaction input

Country Status (2)

Country Link
CN (1) CN102750366B (en)
WO (1) WO2013189156A1 (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102750366B (en) * 2012-06-18 2015-05-27 海信集团有限公司 Video search system and method based on natural interactive import and video search server
CN103970791B (en) * 2013-02-01 2018-01-23 华为技术有限公司 A kind of method, apparatus for recommending video from video library
CN104240700B (en) * 2014-08-26 2018-09-07 智歌科技(北京)有限公司 A kind of global voice interactive method and system towards vehicle-mounted terminal equipment
US9886958B2 (en) * 2015-12-11 2018-02-06 Microsoft Technology Licensing, Llc Language and domain independent model based approach for on-screen item selection
CN106776872A (en) * 2016-11-29 2017-05-31 暴风集团股份有限公司 Defining the meaning of one's words according to voice carries out the method and system of phonetic search
CN108549655A (en) * 2018-03-09 2018-09-18 阿里巴巴集团控股有限公司 A kind of production method of films and television programs, device and equipment
CN109089133B (en) 2018-08-07 2020-08-11 北京市商汤科技开发有限公司 Video processing method and device, electronic equipment and storage medium
CN110968736B (en) * 2019-12-04 2021-02-02 深圳追一科技有限公司 Video generation method and device, electronic equipment and storage medium
CN113596602A (en) * 2021-07-28 2021-11-02 深圳创维-Rgb电子有限公司 Intelligent matching method, television and computer readable storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120131060A1 (en) * 2010-11-24 2012-05-24 Robert Heidasch Systems and methods performing semantic analysis to facilitate audio information searches
CN102750366A (en) * 2012-06-18 2012-10-24 海信集团有限公司 Video search system and method based on natural interactive import and video search server

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080059522A1 (en) * 2006-08-29 2008-03-06 International Business Machines Corporation System and method for automatically creating personal profiles for video characters
CN100565532C (en) * 2008-05-28 2009-12-02 叶睿智 A kind of multimedia resource search method based on the audio content retrieval
CN101382937B (en) * 2008-07-01 2011-03-30 深圳先进技术研究院 Multimedia resource processing method based on speech recognition and on-line teaching system thereof
CN101404035A (en) * 2008-11-21 2009-04-08 北京得意音通技术有限责任公司 Information search method based on text or voice
CN102063476B (en) * 2010-12-13 2013-07-10 百度时代网络技术(北京)有限公司 Video searching method and system
CN102262624A (en) * 2011-08-08 2011-11-30 中国科学院自动化研究所 System and method for realizing cross-language communication based on multi-mode assistance

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120131060A1 (en) * 2010-11-24 2012-05-24 Robert Heidasch Systems and methods performing semantic analysis to facilitate audio information searches
CN102750366A (en) * 2012-06-18 2012-10-24 海信集团有限公司 Video search system and method based on natural interactive import and video search server

Also Published As

Publication number Publication date
CN102750366B (en) 2015-05-27
CN102750366A (en) 2012-10-24


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 12879332; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 12879332; Country of ref document: EP; Kind code of ref document: A1)