WO2018090468A1 - Method and apparatus for searching for video programs - Google Patents

Method and apparatus for searching for video programs

Info

Publication number
WO2018090468A1
WO2018090468A1 PCT/CN2016/113642 CN2016113642W WO2018090468A1 WO 2018090468 A1 WO2018090468 A1 WO 2018090468A1 CN 2016113642 W CN2016113642 W CN 2016113642W WO 2018090468 A1 WO2018090468 A1 WO 2018090468A1
Authority
WO
WIPO (PCT)
Prior art keywords
matrix
video program
index
description
vector
Prior art date
Application number
PCT/CN2016/113642
Other languages
English (en)
French (fr)
Inventor
李贤�
Original Assignee
广州视源电子科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 广州视源电子科技股份有限公司
Publication of WO2018090468A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/71 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content

Definitions

  • the present invention relates to the field of computers, and more particularly to a method and apparatus for searching for video programs.
  • the Rocchio algorithm is derived from vector space model theory.
  • the basic idea of the vector space model is to represent a text as a vector, so that subsequent processing can be carried out as vector operations in space.
  • training the Rocchio algorithm amounts to building a feature vector for each category. For a given unknown text, a vector of the text is generated, the similarity between that vector and each category feature vector is calculated, and the text is finally assigned to the most similar category.
  • the above algorithm has the following drawbacks. First, the Rocchio algorithm cannot mine the latent semantics of documents. Second, it assumes that the training data is absolutely correct: it has no mechanism for quantitatively measuring whether samples contain noise, and therefore has no resistance to erroneous data.
  • the method and apparatus for searching for video programs proposed by the embodiments of the invention can mine the latent semantics of documents and improve the accuracy and efficiency of video program search.
  • the calculated cosine similarities are sorted in descending order, and the video programs corresponding to the column vectors whose cosine similarities have rank numbers within the sorting interval are selected and provided to the user.
  • the process of constructing the index matrix from the description documents describing video programs includes: taking the word frequency with which the i-th keyword appears in the description document of the j-th video program as the value of the i-th element of the j-th column of the index matrix;
  • the process of constructing the query vector of the description entry includes: setting the keyword represented by the i-th element of the query vector to be the same as the keyword represented by the elements of the i-th row of the index matrix, and taking the word frequency with which the keyword corresponding to the i-th element appears in the description entry as the value of the i-th element of the query vector; wherein the query vector is a column vector.
  • for all description documents stored in the database that describe video programs of the same video category, adjusting the format of the entries contained in those description documents according to a standard entry format; wherein the database stores description documents of multiple video categories, one description document describes one video program, and different description documents describe different video programs;
  • constructing the query vector of the description entry is specifically:
  • constructing the query vector of the description entry according to the word frequency with which each extracted keyword appears in the description entry.
  • the index matrix is H, and the latent semantic index model obtained by performing singular value decomposition on the index matrix is H = T * S * D^T;
  • T is an orthogonal matrix, and each column of the matrix T is a left singular vector of the index matrix H;
  • S is a diagonal matrix, and the diagonal elements of the matrix S are the singular values of the index matrix H;
  • D is an orthogonal matrix, and each column of the matrix D is a right singular vector of the index matrix H;
  • the query vector is Q;
  • selecting the matrices T_K, S_K and D_K, and revising the latent semantic index model to H_K = T_K * S_K * D_K^T;
  • T_K is the matrix formed by the first K columns of the matrix T;
  • S_K is the diagonal matrix formed by the first K diagonal elements of the matrix S;
  • D_K is the matrix formed by the first K columns of the matrix D; the value of K is greater than the largest rank number contained in the sorting interval;
  • for the index matrix H_K of the revised latent semantic index model, calculating the cosine similarity between two row vectors, namely the row vector obtained by multiplying the transpose Q^T of the query vector by the matrix T_K, and the j-th row vector of the matrix obtained by multiplying the matrix D_K by the matrix S_K, and taking it as the cosine similarity between the j-th column vector of the index matrix H_K and the query vector Q.
  • the search method further includes:
  • when a description document describing a new video program is added to the database, updating the latent semantic index model corresponding to the video category to which the new video program belongs.
  • an embodiment of the present invention provides a search device for a video program, including:
  • a user information receiving module configured to receive a description entry of the video program input by the user and a video category to which the video program belongs;
  • a query vector construction module configured to select a latent semantic index model corresponding to the video category, and to construct a query vector of the description entry in the same manner in which the index matrix of the latent semantic index model was constructed; wherein the latent semantic index model is obtained by performing singular value decomposition on an index matrix constructed from the description documents describing video programs of the same video category;
  • a similarity calculation module configured to calculate, according to the latent semantic index model, the cosine similarity between each column vector of the index matrix and the query vector;
  • a video program selection module configured to sort the calculated cosine similarities in descending order, and to select the video programs corresponding to the column vectors whose cosine similarities have rank numbers within the sorting interval and provide them to the user.
  • the unit of the query vector construction module that constructs the index matrix from the description documents describing video programs is specifically configured to: take the word frequency with which the i-th keyword appears in the description document of the j-th video program as the value of the i-th element of the j-th column of the index matrix;
  • the unit of the query vector construction module that constructs the query vector of the description entry is specifically configured to: set the keyword represented by the i-th element of the query vector to be the same as the keyword represented by the elements of the i-th row of the index matrix, and take the word frequency with which the keyword corresponding to the i-th element appears in the description entry as the value of the i-th element of the query vector; wherein the query vector is a column vector.
  • the query vector construction module includes units for constructing the index matrix from the description documents describing video programs of the same video category, specifically:
  • a first format adjustment unit configured to, for all description documents stored in the database that describe video programs of the same video category, adjust the format of the entries contained in those description documents according to a standard entry format; wherein the database stores description documents of multiple video categories, one description document describes one video program, and different description documents describe different video programs;
  • a first tool calling unit configured to call a word segmentation tool;
  • a first word segmentation unit configured to use the word segmentation tool to segment the entries of all the format-adjusted description documents to obtain a first word set;
  • a first keyword extracting unit configured to extract keywords from the first word set according to a TF-IDF algorithm;
  • an index matrix construction unit configured to construct the index matrix according to the word frequency with which each extracted keyword appears in each description document; wherein the row order of the index matrix is arranged from high to low according to the total word frequency with which the keywords appear in all the description documents, and the column order of the index matrix is arranged from high to low according to the word frequency with which the keywords appear in each description document.
  • the query vector construction module further includes units for constructing the query vector of the description entry, specifically:
  • a second format adjustment unit configured to adjust the format of the description entry according to the standard entry format;
  • a second tool calling unit configured to call a word segmentation tool;
  • a second word segmentation unit configured to use the word segmentation tool to segment the format-adjusted description entry to obtain a second word set;
  • a second keyword extracting unit configured to extract keywords from the second word set according to a TF-IDF algorithm;
  • a query vector construction unit configured to construct the query vector of the description entry according to the word frequency with which each extracted keyword appears in the description entry.
  • the index matrix is H, and the latent semantic index model obtained by performing singular value decomposition on the index matrix is H = T * S * D^T;
  • T is an orthogonal matrix, and each column of the matrix T is a left singular vector of the index matrix H;
  • S is a diagonal matrix, and the diagonal elements of the matrix S are the singular values of the index matrix H;
  • D is an orthogonal matrix, and each column of the matrix D is a right singular vector of the index matrix H;
  • the query vector is Q;
  • the similarity calculation module specifically includes:
  • a model revision unit configured to select the matrices T_K, S_K and D_K and to revise the latent semantic index model to H_K = T_K * S_K * D_K^T;
  • T_K is the matrix formed by the first K columns of the matrix T;
  • S_K is the diagonal matrix formed by the first K diagonal elements of the matrix S;
  • D_K is the matrix formed by the first K columns of the matrix D; the value of K is greater than the largest rank number contained in the sorting interval;
  • a calculating unit configured to calculate, for the index matrix H_K of the revised latent semantic index model, the cosine similarity between two row vectors, namely the row vector obtained by multiplying the transpose Q^T of the query vector by the matrix T_K, and the j-th row vector of the matrix obtained by multiplying the matrix D_K by the matrix S_K, and to take it as the cosine similarity between the j-th column vector of the index matrix H_K and the query vector Q.
  • the searching device further includes:
  • a model updating module configured to update a latent semantic index model corresponding to a video category to which the new video program belongs when the database adds a description document describing the new video program.
  • the method and apparatus for searching for video programs provided by the embodiments of the present invention calculate the cosine similarity between the query vector of the video to be searched for and each column vector of the index matrix of the latent semantic index model, thereby obtaining the degree of correlation between the description entry of that video and the description document represented by each column vector of the index matrix; the higher the value, the higher the degree of correlation. The video programs corresponding to the description documents that are highly correlated with the description entry are then recommended to the user. Because the latent semantic index model is constructed (trained) from the description documents describing video programs, it can mine the latent semantics of documents and improve the accuracy of video program search.
  • in addition, selecting the latent semantic index model that corresponds to the video category input by the user for the calculation can further improve the efficiency of video program search.
  • FIG. 1 is a schematic flow chart of an embodiment of a method for searching for a video program provided by the present invention
  • FIG. 2 is a schematic structural diagram of an embodiment of a search apparatus for a video program provided by the present invention
  • FIG. 3 is a schematic structural diagram of an embodiment of a query vector construction module of a search device for a video program provided by the present invention
  • FIG. 4 is a schematic structural diagram of an embodiment of a similarity calculation module of a search device for a video program provided by the present invention.
  • FIG. 1 is a schematic flowchart of an embodiment of a video program search method provided by the present invention.
  • the search method includes steps S1 to S4, specifically:
  • wherein the latent semantic index model is obtained by performing singular value decomposition on the index matrix constructed from the description documents describing video programs of the same video category;
  • the value of the i-th element of the j-th column of the index matrix represents the word frequency with which the i-th keyword appears in the description document of the j-th video program;
  • the query vector is a column vector, the keyword represented by the i-th element of the query vector is the same as the keyword represented by the elements of the i-th row of the index matrix, and the value of the i-th element of the query vector represents the word frequency with which the keyword corresponding to the i-th element appears in the description entry;
  • the sorting interval is preferably the top 10 rank numbers.
  • in step S2 above, the process of constructing the index matrix from the description documents describing video programs of the same video category is specifically:
  • for all description documents stored in the database that describe video programs of the same video category, adjusting the format of the entries contained in those description documents according to a standard entry format; wherein the database stores description documents of multiple video categories, one description document describes one video program, and different description documents describe different video programs; the format adjustment of the entries may include, but is not limited to, converting lowercase letters in the entries to uppercase, deleting extra spaces in the entries, unifying the punctuation marks in the entries, and unifying the full-width or half-width format of the entries into a single format.
  • the word segmentation tool is called; preferably, the word segmentation tool is a jieba word segmentation tool, but is not limited to this word segmentation tool.
  • the word segmentation tool is used to segment the entries of all the format-adjusted description documents to obtain the first word set; the word segmentation tool offers several segmentation modes: in addition to the normal mode, it can further split long words to improve recall, which is especially useful for short texts, where it produces more words than the normal mode and improves the accuracy of the video programs subsequently output.
  • the value of the i-th element of the j-th column of the index matrix represents the word frequency with which the i-th keyword appears in the description document of the j-th video program. All elements in the i-th row of the index matrix represent the same keyword, and the elements of different rows represent different keywords. For example, assuming that all elements of the first row of the index matrix represent keyword A and the elements of the first column of the index matrix represent description document B, the value of the element in the first row and first column of the index matrix represents the word frequency with which keyword A appears in description document B.
  • constructing the query vector of the description entry in step S2 above is specifically:
  • adjusting the format of the description entry according to the standard entry format, for example, converting lowercase letters in the entry to uppercase, deleting extra spaces in the entry, unifying the punctuation marks in the entry, and unifying the full-width or half-width format of the entry into a single format.
  • the word segmentation tool is called; preferably, the word segmentation tool is a jieba word segmentation tool, but is not limited to this word segmentation tool.
  • the word segmentation tool is used to segment the format-adjusted description entry to obtain a second word set; the word segmentation tool offers several segmentation modes: in addition to the normal mode, it can further split long words to improve recall, which is especially useful for short texts, where it produces more words than the normal mode and improves the accuracy of the video programs subsequently output.
  • the query vector of the description entry is constructed according to the word frequency with which each extracted keyword appears in the description entry.
  • when constructing the query vector of the description entry, it is ensured that the keyword represented by the i-th element of the query vector is the same as the keyword represented by the elements of the i-th row of the index matrix of the latent semantic index model. Only then is it meaningful to compute the cosine similarity between the query vector and each column vector of the index matrix.
  • the construction of the query vector therefore follows this principle: the keyword represented by the i-th element of the query vector is the same as the keyword represented by the elements of the i-th row of the index matrix, and the value of the i-th element of the query vector represents the word frequency with which the keyword corresponding to the i-th element appears in the description entry; for example, assuming that all elements of the first row of the index matrix represent keyword A, the keyword represented by the first element of the query vector is keyword A, and the value of the first element of the query vector represents the word frequency with which keyword A appears in the description entry.
  • the index matrix is H, and the latent semantic index model obtained by performing singular value decomposition on the index matrix is H = T * S * D^T;
  • T is an orthogonal matrix, and each column of the matrix T is a left singular vector of the index matrix H;
  • S is a diagonal matrix, and the diagonal elements of the matrix S are the singular values of the index matrix H;
  • D is an orthogonal matrix, and each column of the matrix D is a right singular vector of the index matrix H;
  • the query vector is Q;
  • the specific implementation of step S3 above is as follows:
  • selecting the matrices T_K, S_K and D_K, and revising the latent semantic index model to H_K = T_K * S_K * D_K^T;
  • T_K is the matrix formed by the first K columns of the matrix T;
  • S_K is the diagonal matrix formed by the first K diagonal elements of the matrix S;
  • D_K is the matrix formed by the first K columns of the matrix D; the value of K is greater than the largest rank number contained in the sorting interval;
  • for the index matrix H_K of the revised latent semantic index model, calculating the cosine similarity between two row vectors, namely the row vector obtained by multiplying the transpose Q^T of the query vector by the matrix T_K, and the j-th row vector of the matrix obtained by multiplying the matrix D_K by the matrix S_K, and taking it as the cosine similarity between the j-th column vector of the index matrix H_K and the query vector Q.
  • the value of K here is a threshold that can be chosen according to the actual situation.
  • the decomposition uses a rank-K approximation of H, in which all singular values of the index matrix H after the K largest ones are set to zero.
  • revising the latent semantic index model in this way can improve retrieval efficiency.
  • the search method further includes:
  • when a description document describing a new video program is added to the database, updating the latent semantic index model corresponding to the video category to which the new video program belongs.
  • the video program search method provided by the embodiment of the present invention calculates the cosine similarity between the query vector of the video to be searched for and each column vector of the index matrix of the latent semantic index model, thereby obtaining the degree of correlation between the description entry of that video and the description document represented by each column vector of the index matrix; the higher the value, the higher the degree of correlation. The video programs corresponding to the description documents that are highly correlated with the description entry are then recommended to the user. Because the latent semantic index model is constructed (trained) from the description documents describing video programs, it can mine the latent semantics of documents and improve the accuracy of video program search.
  • in addition, selecting the latent semantic index model that corresponds to the video category input by the user for the calculation can further improve the efficiency of video program search.
  • referring to FIG. 2, which is a schematic structural diagram of an embodiment of a search apparatus for video programs provided by the present invention.
  • the search apparatus is capable of performing the entire flow of the video program search method provided by the foregoing embodiment, and the search apparatus includes:
  • a user information receiving module 10 configured to receive a description entry describing a video program and the video category to which the video program belongs, both input by a user;
  • a query vector construction module 20 configured to select a latent semantic index model corresponding to the video category, and to construct a query vector of the description entry in the same manner in which the index matrix of the latent semantic index model was constructed; wherein the latent semantic index model is obtained by performing singular value decomposition on an index matrix constructed from the description documents describing video programs of the same video category;
  • a similarity calculation module 30 configured to calculate, according to the latent semantic index model, the cosine similarity between each column vector of the index matrix and the query vector;
  • a video program selection module 40 configured to sort the calculated cosine similarities in descending order, and to select the video programs corresponding to the column vectors whose cosine similarities have rank numbers within the sorting interval and provide them to the user.
  • the unit of the query vector construction module that constructs the index matrix from the description documents describing video programs is specifically configured to: take the word frequency with which the i-th keyword appears in the description document of the j-th video program as the value of the i-th element of the j-th column of the index matrix;
  • the unit of the query vector construction module that constructs the query vector of the description entry is specifically configured to: set the keyword represented by the i-th element of the query vector to be the same as the keyword represented by the elements of the i-th row of the index matrix, and take the word frequency with which the keyword corresponding to the i-th element appears in the description entry as the value of the i-th element of the query vector; wherein the query vector is a column vector.
  • the query vector construction module 20 includes units for constructing the index matrix from the description documents describing video programs of the same video category, specifically:
  • a first format adjustment unit 21 configured to, for all description documents stored in the database that describe video programs of the same video category, adjust the format of the entries contained in those description documents according to a standard entry format; wherein the database stores description documents of multiple video categories, one description document describes one video program, and different description documents describe different video programs;
  • a first tool calling unit 22 configured to call a word segmentation tool
  • a first word segmentation unit 23 configured to use the word segmentation tool to perform word segmentation on the formatted words of all the description documents to obtain a first word set;
  • a first keyword extracting unit 34 configured to extract a keyword from the first word set according to a TF-IDF algorithm
  • an index matrix construction unit 25 configured to construct the index matrix according to the word frequency with which each extracted keyword appears in each description document; wherein the row order of the index matrix is arranged from high to low according to the total word frequency with which the keywords appear in all the description documents, and the column order of the index matrix is arranged from high to low according to the word frequency with which the keywords appear in each description document.
  • the query vector construction module 20 further includes a unit for constructing the query vector of the description term, specifically:
  • a second format adjustment unit 26 configured to perform format adjustment on the description entry according to a standard entry format
  • a second tool calling unit 27 configured to call a word segmentation tool
  • a second word segmentation unit 28 configured to use the word segmentation tool to segment the format-adjusted description entry to obtain a second word set
  • a second keyword extracting unit 29 configured to extract a keyword from the second word set according to a TF-IDF algorithm
  • the query vector construction unit 31 is configured to construct the query vector describing the term according to the word frequency that appears in the description entry for each extracted keyword.
  • FIG. 4 is a schematic structural diagram of an embodiment of a similarity calculation module of a search apparatus for video programs provided by the present invention, where the index matrix is H and the latent semantic index model obtained by performing singular value decomposition on the index matrix is H = T * S * D^T.
  • the similarity calculation module 30 specifically includes:
  • a calculating unit 33 configured to calculate, for the index matrix H_K of the revised latent semantic index model, the cosine similarity between two row vectors, namely the row vector obtained by multiplying the transpose Q^T of the query vector by the matrix T_K, and the j-th row vector of the matrix obtained by multiplying the matrix D_K by the matrix S_K, and to take it as the cosine similarity between the j-th column vector of the index matrix H_K and the query vector Q.
  • the searching device further includes:
  • the model update module 50 is configured to update a latent semantic index model corresponding to a video category to which the new video program belongs when the database adds a description document describing the new video program.
  • the search apparatus for video programs provided by the embodiment of the present invention calculates the cosine similarity between the query vector of the video to be searched for and each column vector of the index matrix of the latent semantic index model, thereby obtaining the degree of correlation between the description entry of that video and the description document represented by each column vector of the index matrix; the higher the value, the higher the degree of correlation. The video programs corresponding to the description documents that are highly correlated with the description entry are then recommended to the user. Because the latent semantic index model is constructed (trained) from the description documents describing video programs, it can mine the latent semantics of documents and improve the accuracy of video program search.
  • in addition, selecting the latent semantic index model that corresponds to the video category input by the user for the calculation can further improve the efficiency of video program search.
  • all or part of the flows of the above method embodiments can be implemented by a computer program instructing the relevant hardware; the program can be stored in a computer-readable storage medium and, when executed, can include the flows of the method embodiments described above.
  • the storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), or a random access memory (RAM).

Abstract

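The present invention discloses a method for searching for video programs, comprising: receiving a description entry describing a video program and the video category to which the video program belongs, both input by a user; selecting a latent semantic index model corresponding to the video category, and constructing a query vector of the description entry in the same manner in which the index matrix of the latent semantic index model was constructed; calculating, according to the latent semantic index model, the cosine similarity between each column vector of the index matrix and the query vector; and sorting the calculated cosine similarities in descending order, and providing to the user the video programs corresponding to the column vectors whose cosine similarities have rank numbers within a sorting interval. Correspondingly, the present invention also discloses an apparatus for searching for video programs. The embodiments of the present invention can mine the latent semantics of documents and improve the accuracy and efficiency of video program search.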

Description

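Method and apparatus for searching for video programs
Technical Field
The present invention relates to the field of computers, and in particular to a method and apparatus for searching for video programs.
Background
When recommending variety shows, the ContentBase (content-based) method is an important strategy: it clusters and recommends shows mainly according to the similarity of their content descriptions, grouping texts with similar content. Existing approaches mainly use the TF-IDF-based Rocchio algorithm. The Rocchio algorithm derives from vector space model theory; the basic idea of the vector space model is to represent a text as a vector, so that subsequent processing can be carried out as vector operations in space. Training the Rocchio algorithm amounts to building a feature vector for each category. For a given unknown text, a vector of the text is generated, the similarity between that vector and each category feature vector is calculated, and the text is finally assigned to the most similar category.
However, this algorithm has the following drawbacks. First, the Rocchio algorithm cannot mine the latent semantics of documents. Second, it assumes that the training data is absolutely correct: it has no mechanism for quantitatively measuring whether samples contain noise, and therefore has no resistance to erroneous data.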
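For reference, the Rocchio-style classification described in the background can be sketched as follows. This is a minimal illustration rather than the prior-art implementation: the function names are hypothetical, and raw term counts stand in for TF-IDF weights to keep the example short.

```python
import math
from collections import Counter, defaultdict


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def train_centroids(labelled_docs):
    """Build one class feature vector (the mean term vector) per category."""
    sums = defaultdict(Counter)
    counts = Counter()
    for label, tokens in labelled_docs:
        sums[label].update(tokens)  # accumulate raw term counts (assumption: no TF-IDF)
        counts[label] += 1
    return {
        label: {term: freq / counts[label] for term, freq in vec.items()}
        for label, vec in sums.items()
    }


def classify(tokens, centroids):
    """Assign a tokenized text to the category with the most similar centroid."""
    vec = Counter(tokens)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))
```

Every training sample contributes to its centroid with equal weight, which is why mislabelled samples shift the class vector unchecked, the second drawback noted above.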
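Summary of the Invention
The embodiments of the present invention propose a method and apparatus for searching for video programs that can mine the latent semantics of documents and improve the accuracy and efficiency of video program search.
A method for searching for video programs provided by an embodiment of the present invention comprises:
receiving a description entry describing a video program and the video category to which the video program belongs, both input by a user;
selecting a latent semantic index model corresponding to the video category, and constructing a query vector of the description entry in the same manner in which the index matrix of the latent semantic index model was constructed; wherein the latent semantic index model is obtained by performing singular value decomposition on an index matrix constructed from the description documents describing video programs of the same video category;
calculating, according to the latent semantic index model, the cosine similarity between each column vector of the index matrix and the query vector;
sorting the calculated cosine similarities in descending order, and providing to the user the video programs corresponding to the column vectors whose cosine similarities have rank numbers within a sorting interval.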
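The final sorting and selection step can be sketched as below. The function name and the way the sorting interval is passed (a first and last rank, defaulting to the top 10 that the document elsewhere calls preferable) are assumptions made for illustration.

```python
def select_by_rank(similarities, program_ids, first_rank=1, last_rank=10):
    """Sort similarities in descending order and return the programs whose
    rank number lies in the sorting interval [first_rank, last_rank]."""
    ranked = sorted(
        zip(similarities, program_ids),
        key=lambda pair: pair[0],
        reverse=True,
    )
    # Rank numbers are 1-based, list slices are 0-based.
    return [program_id for _, program_id in ranked[first_rank - 1:last_rank]]
```

For example, with similarities `[0.2, 0.9, 0.5, 0.7]` for programs `a` to `d`, the interval 1 to 2 selects `b` and `d`.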
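Further, the process of constructing the index matrix from the description documents describing video programs comprises: taking the word frequency with which the i-th keyword appears in the description document of the j-th video program as the value of the i-th element of the j-th column of the index matrix;
the process of constructing the query vector of the description entry comprises: setting the keyword represented by the i-th element of the query vector to be the same as the keyword represented by the elements of the i-th row of the index matrix, and taking the word frequency with which the keyword corresponding to the i-th element appears in the description entry as the value of the i-th element of the query vector; wherein the query vector is a column vector.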
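A minimal sketch of these two constructions follows, assuming the documents and the description entry have already been segmented into token lists and the keyword list is fixed. The function names and the sample keywords are hypothetical.

```python
def build_index_matrix(keywords, tokenized_docs):
    """Return H as a list of rows, where H[i][j] is the number of times
    keywords[i] appears in the description document of the j-th program."""
    return [[doc.count(keyword) for doc in tokenized_docs] for keyword in keywords]


def build_query_vector(keywords, entry_tokens):
    """Return Q, where Q[i] is the number of times keywords[i] appears in the
    description entry. Q uses the same keyword order as the rows of H."""
    return [entry_tokens.count(keyword) for keyword in keywords]


keywords = ["singing", "contest", "cooking"]
docs = [["singing", "contest", "singing"], ["cooking", "contest"]]
H = build_index_matrix(keywords, docs)                    # rows: keywords, columns: programs
Q = build_query_vector(keywords, ["cooking", "show"])     # "show" is not a keyword and is ignored
```

Sharing one keyword list between both functions is what keeps row i of the index matrix and element i of the query vector aligned.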
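Further, the process of constructing the index matrix from the description documents describing video programs of the same video category is specifically:
for all description documents stored in the database that describe video programs of the same video category, adjusting the format of the entries contained in those description documents according to a standard entry format; wherein the database stores description documents of multiple video categories, one description document describes one video program, and different description documents describe different video programs;
calling a word segmentation tool;
using the word segmentation tool to segment the entries of all the format-adjusted description documents to obtain a first word set;
extracting keywords from the first word set according to a TF-IDF algorithm;
constructing the index matrix according to the word frequency with which each extracted keyword appears in each description document; wherein the row order of the index matrix is arranged from high to low according to the total word frequency with which the keywords appear in all the description documents, and the column order of the index matrix is arranged from high to low according to the word frequency with which the keywords appear in each description document.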
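The keyword extraction and row ordering steps might look like the sketch below. The text does not say how TF-IDF scores are turned into a keyword list, so keeping the `top_n` words by their highest per-document score, the unsmoothed IDF formula, and the function names are all assumptions; the column ordering is not covered here.

```python
import math
from collections import Counter


def extract_keywords(tokenized_docs, top_n):
    """Score each word by its highest TF-IDF over all documents; keep top_n."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))          # number of documents containing each word
    best = {}
    for tokens in tokenized_docs:
        if not tokens:
            continue                          # skip empty documents to avoid dividing by zero
        for word, count in Counter(tokens).items():
            tf = count / len(tokens)
            idf = math.log(n_docs / doc_freq[word])
            best[word] = max(best.get(word, 0.0), tf * idf)
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(best, key=lambda word: (-best[word], word))[:top_n]


def order_rows_by_total_frequency(keywords, tokenized_docs):
    """Order keywords (the rows of the index matrix) by total word frequency
    over all description documents, from high to low."""
    total = Counter()
    for tokens in tokenized_docs:
        total.update(tokens)
    return sorted(keywords, key=lambda word: (-total[word], word))
```

A word that occurs in every document gets an IDF of zero under this formula and therefore ranks last, which matches the intent of filtering out terms that do not discriminate between programs.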
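Further, constructing the query vector of the description entry is specifically:
adjusting the format of the description entry according to the standard entry format;
calling a word segmentation tool;
using the word segmentation tool to segment the format-adjusted description entry to obtain a second word set;
extracting keywords from the second word set according to a TF-IDF algorithm;
constructing the query vector of the description entry according to the word frequency with which each extracted keyword appears in the description entry.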
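The format adjustment step is not spelled out in this passage; elsewhere the document gives examples such as converting lowercase to uppercase, deleting extra spaces, and unifying punctuation and full-width or half-width forms. One way to sketch that with the Python standard library is shown below. Treating Unicode NFKC normalization as the "unification" step is an assumption, and the function name is hypothetical.

```python
import re
import unicodedata


def normalize_entry(text):
    """Apply a simple standard entry format to a description entry."""
    # NFKC maps full-width letters, digits and punctuation to half-width forms.
    text = unicodedata.normalize("NFKC", text)
    # Unify letter case (lowercase -> uppercase).
    text = text.upper()
    # Collapse runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", text).strip()
```

Applying the same function to both the stored description documents and the user's description entry keeps the two sides comparable before word segmentation.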
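Further, the index matrix is H, and the latent semantic index model obtained by performing singular value decomposition on the index matrix is H = T * S * D^T; where T is an orthogonal matrix and each column of the matrix T is a left singular vector of the index matrix H; S is a diagonal matrix and the diagonal elements of the matrix S are the singular values of the index matrix H; D is an orthogonal matrix and each column of the matrix D is a right singular vector of the index matrix H; the query vector is Q;
calculating, according to the latent semantic index model, the cosine similarity between each column vector of the index matrix and the query vector is specifically:
selecting the matrices T_K, S_K and D_K, and revising the latent semantic index model to H_K = T_K * S_K * D_K^T; where T_K is the matrix formed by the first K columns of the matrix T, S_K is the diagonal matrix formed by the first K diagonal elements of the matrix S, and D_K is the matrix formed by the first K columns of the matrix D; the value of K is greater than the largest rank number contained in the sorting interval;
for the index matrix H_K of the revised latent semantic index model, calculating the cosine similarity between two row vectors, namely the row vector obtained by multiplying the transpose Q^T of the query vector by the matrix T_K, and the j-th row vector of the matrix obtained by multiplying the matrix D_K by the matrix S_K, and taking it as the cosine similarity between the j-th column vector of the index matrix H_K and the query vector Q.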
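The truncated decomposition and the similarity calculation can be sketched with NumPy as follows. The function name is hypothetical, and the tiny matrix in the usage note is a toy example, so it does not respect the requirement that K exceed the largest rank number in the sorting interval.

```python
import numpy as np


def lsi_similarities(H, Q, K):
    """Cosine similarity between Q^T * T_K and each row of D_K * S_K."""
    H = np.asarray(H, dtype=float)
    Q = np.asarray(Q, dtype=float)
    # numpy returns H = T @ diag(s) @ Dt with singular values in descending order.
    T, s, Dt = np.linalg.svd(H, full_matrices=False)
    T_K = T[:, :K]              # first K left singular vectors
    S_K = np.diag(s[:K])        # K largest singular values
    D_K = Dt[:K, :].T           # first K right singular vectors, as columns
    q_row = Q @ T_K             # row vector Q^T * T_K, length K
    doc_rows = D_K @ S_K        # row j corresponds to column j of H_K
    dots = doc_rows @ q_row
    norms = np.linalg.norm(doc_rows, axis=1) * np.linalg.norm(q_row)
    # Return 0 where a norm is zero instead of dividing by zero.
    return np.divide(dots, norms, out=np.zeros_like(dots), where=norms > 0)
```

With `H = [[2, 0], [1, 1], [0, 1]]`, `Q = [0, 0, 1]` and `K = 2`, the second program scores about 0.95 and the first about 0, so the second program ranks first. All vectors involved have length K rather than the number of keywords, which is where the efficiency gain of the revised model comes from.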
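Further, the search method also comprises:
when a description document describing a new video program is added to the database, updating the latent semantic index model corresponding to the video category to which the new video program belongs.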
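The text does not say how the update is performed. The sketch below assumes the simplest strategy, rebuilding the model of the affected category from all of its description documents. The class name and the `build_model` callable are hypothetical.

```python
class ModelRegistry:
    """Keeps one model per video category and rebuilds it when a document is added."""

    def __init__(self, build_model):
        self._build_model = build_model   # hypothetical callable: list of documents -> model
        self._documents = {}              # category -> list of description documents
        self._models = {}                 # category -> most recently built model

    def add_document(self, category, document):
        """Store the new description document and refresh only its category's model."""
        self._documents.setdefault(category, []).append(document)
        self._models[category] = self._build_model(self._documents[category])

    def model_for(self, category):
        """Return the current model for a category (KeyError if none exists)."""
        return self._models[category]
```

Because models are kept per category, adding a program to one category leaves the models of all other categories untouched.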
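Correspondingly, an embodiment of the present invention provides an apparatus for searching for video programs, comprising:
a user information receiving module configured to receive a description entry describing a video program and the video category to which the video program belongs, both input by a user;
a query vector construction module configured to select a latent semantic index model corresponding to the video category, and to construct a query vector of the description entry in the same manner in which the index matrix of the latent semantic index model was constructed; wherein the latent semantic index model is obtained by performing singular value decomposition on an index matrix constructed from the description documents describing video programs of the same video category;
a similarity calculation module configured to calculate, according to the latent semantic index model, the cosine similarity between each column vector of the index matrix and the query vector;
a video program selection module configured to sort the calculated cosine similarities in descending order, and to provide to the user the video programs corresponding to the column vectors whose cosine similarities have rank numbers within a sorting interval.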
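Further, the unit of the query vector construction module that constructs the index matrix from the description documents describing video programs is specifically configured to: take the word frequency with which the i-th keyword appears in the description document of the j-th video program as the value of the i-th element of the j-th column of the index matrix;
the unit of the query vector construction module that constructs the query vector of the description entry is specifically configured to: set the keyword represented by the i-th element of the query vector to be the same as the keyword represented by the elements of the i-th row of the index matrix, and take the word frequency with which the keyword corresponding to the i-th element appears in the description entry as the value of the i-th element of the query vector; wherein the query vector is a column vector.
Further, the query vector construction module includes units for constructing the index matrix from the description documents describing video programs of the same video category, specifically:
a first format adjustment unit configured to, for all description documents stored in the database that describe video programs of the same video category, adjust the format of the entries contained in those description documents according to a standard entry format; wherein the database stores description documents of multiple video categories, one description document describes one video program, and different description documents describe different video programs;
a first tool calling unit configured to call a word segmentation tool;
a first word segmentation unit configured to use the word segmentation tool to segment the entries of all the format-adjusted description documents to obtain a first word set;
a first keyword extracting unit configured to extract keywords from the first word set according to a TF-IDF algorithm;
an index matrix construction unit configured to construct the index matrix according to the word frequency with which each extracted keyword appears in each description document; wherein the row order of the index matrix is arranged from high to low according to the total word frequency with which the keywords appear in all the description documents, and the column order of the index matrix is arranged from high to low according to the word frequency with which the keywords appear in each description document.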
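Further, the query vector construction module also comprises units for constructing the query vector of the descriptive entry, specifically:
a second format adjustment unit, configured to adjust the format of the descriptive entry according to the standard entry format;
a second tool invoking unit, configured to invoke a word segmentation tool;
a second word segmentation unit, configured to segment, with the word segmentation tool, the format-adjusted descriptive entry to obtain a second word set;
a second keyword extraction unit, configured to extract keywords from the second word set according to the TF-IDF algorithm;
a query vector construction unit, configured to construct the query vector of the descriptive entry according to the term frequency of each extracted keyword in the descriptive entry.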
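Further, let the index matrix be $H$; the latent semantic indexing model obtained by performing singular value decomposition on the index matrix is then $H = T S D^{T}$; wherein $T$ is an orthogonal matrix, each column of which is a left singular vector of the index matrix $H$; $S$ is a diagonal matrix whose diagonal elements are the singular values of the index matrix $H$; $D$ is an orthogonal matrix, each column of which is a right singular vector of the index matrix $H$; and the query vector is $Q$;
the similarity calculation module specifically comprises:
a model revision unit, configured to select matrices $T_K$, $S_K$ and $D_K$, and to revise the latent semantic indexing model to $H_K = T_K S_K D_K^{T}$; wherein $T_K$ is the matrix formed by the first $K$ columns of $T$, $S_K$ is the diagonal matrix formed by the first $K$ diagonal elements of $S$, and $D_K$ is the matrix formed by the first $K$ columns of $D$; the value of $K$ is greater than the largest rank number contained in the ranking interval;
a calculation unit, configured to calculate, for the index matrix $H_K$ of the revised latent semantic indexing model, the cosine similarity between two row vectors, namely the row vector obtained by multiplying the transpose $Q^{T}$ of the query vector by the matrix $T_K$, and the j-th row vector of the matrix obtained by multiplying the matrix $D_K$ by the matrix $S_K$, and to take it as the cosine similarity between the j-th column vector of the index matrix $H_K$ and the query vector $Q$.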
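Further, the search apparatus also comprises:
a model update module, configured to update, when a description document describing a new video program is added to the database, the latent semantic indexing model corresponding to the video category to which the new video program belongs.
Implementing the embodiments of the present invention has the following beneficial effects:
The method and apparatus for searching for video programs provided by the embodiments of the present invention calculate the cosine similarity between the query vector of the video being searched for and each column vector of the index matrix of the latent semantic indexing model, which gives the degree of relevance between the descriptive entry of that video and the description document represented by each column vector; the higher the value, the higher the relevance. The video programs whose description documents are highly relevant to the descriptive entry are then recommended to the user. Because the latent semantic indexing model is constructed (trained) from the description documents of video programs, it can uncover the latent semantics of the documents and improve the accuracy of video program search. In addition, selecting the latent semantic indexing model that corresponds to the video category entered by the user further improves search efficiency.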
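Brief Description of the Drawings
FIG. 1 is a schematic flowchart of an embodiment of the method for searching for video programs provided by the present invention;
FIG. 2 is a schematic structural diagram of an embodiment of the apparatus for searching for video programs provided by the present invention;
FIG. 3 is a schematic structural diagram of an embodiment of the query vector construction module of the apparatus for searching for video programs provided by the present invention;
FIG. 4 is a schematic structural diagram of an embodiment of the similarity calculation module of the apparatus for searching for video programs provided by the present invention.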
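Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.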
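Referring to FIG. 1, which is a schematic flowchart of an embodiment of the method for searching for video programs provided by the present invention; the method comprises steps S1 to S4, specifically:
S1: receiving a descriptive entry, entered by a user, that describes a video program, and the video category to which the video program belongs;
S2: selecting the latent semantic indexing model corresponding to the video category, and constructing a query vector for the descriptive entry in the same manner as the index matrix of the latent semantic indexing model is constructed; wherein the latent semantic indexing model is obtained by performing singular value decomposition on an index matrix constructed from the description documents of video programs of the same video category; the value of the i-th element of the j-th column of the index matrix represents the term frequency of the i-th keyword in the description document of the j-th video program; the query vector is a column vector, the keyword represented by the i-th element of the query vector is the same as the keyword represented by the elements of the i-th row of the index matrix, and the value of the i-th element of the query vector represents the term frequency, in the descriptive entry, of the keyword corresponding to that element;
S3: calculating, according to the latent semantic indexing model, the cosine similarity between each column vector of the index matrix and the query vector;
S4: sorting the calculated cosine similarities in descending order, and providing the user with the video programs corresponding to the column vectors whose cosine similarities have rank numbers within a ranking interval.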
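As an illustration of step S4 only, the following minimal Python sketch sorts the similarities in descending order and returns the programs whose rank numbers fall within a ranking interval. The function name `select_programs`, its parameters, and the 1-based inclusive interval are assumptions made for this sketch; they do not appear in the original text.

```python
from typing import List, Sequence


def select_programs(similarities: Sequence[float],
                    program_ids: Sequence[str],
                    first_rank: int = 1,
                    last_rank: int = 10) -> List[str]:
    """Return the programs whose rank lies in [first_rank, last_rank].

    Ranks are 1-based; rank 1 is the largest cosine similarity.
    similarities[j] belongs to column j of the index matrix, and
    program_ids[j] names the video program described by that column.
    """
    if len(similarities) != len(program_ids):
        raise ValueError("one similarity per program is required")
    # Column indices ordered by similarity, largest first (ties keep column order).
    order = sorted(range(len(similarities)),
                   key=lambda j: similarities[j], reverse=True)
    return [program_ids[j] for j in order[first_rank - 1:last_rank]]
```

With the default interval the sketch returns at most the ten best-matching programs, which matches the top-10 preference mentioned in the description.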
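It should be noted that calculating the cosine similarity between the query vector of the video being searched for and each column vector of the index matrix of the latent semantic indexing model gives the degree of relevance between the descriptive entry of that video and the description document represented by each column vector; the higher the value, the higher the relevance. The video programs whose description documents are highly relevant to the descriptive entry are then recommended to the user. Because the latent semantic indexing model is constructed (trained) from the description documents of video programs, it can uncover the latent semantics of the documents and improve the accuracy of video program search. In addition, selecting the latent semantic indexing model that corresponds to the video category entered by the user further improves search efficiency. The ranking interval is generally preferably the first 10 rank numbers.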
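Further, the process in step S2 above of constructing the index matrix from the description documents of video programs of the same video category is specifically:
for all description documents stored in a database that describe video programs of the same video category, adjusting the format of the entries contained in all the description documents according to a standard entry format; wherein the database stores description documents of multiple video categories, each description document describes one video program, and different description documents describe different video programs; the format adjustment of the entries may include, but is not limited to, converting lowercase letters in the entries to uppercase, deleting redundant spaces, unifying the punctuation marks, and unifying full-width and half-width forms into a single form.
invoking a word segmentation tool; preferably, the word segmentation tool is the jieba word segmentation tool, but it is not limited to this tool.
segmenting, with the word segmentation tool, the format-adjusted entries of all the description documents to obtain a first word set; the word segmentation tool offers several segmentation modes: besides the normal mode, long words can be segmented further to improve recall; for short texts in particular, this yields more words than normal segmentation, which improves the accuracy of the video programs subsequently output.
extracting keywords from the first word set according to the TF-IDF algorithm;
constructing the index matrix according to the term frequency of each extracted keyword in each description document; wherein the rows of the index matrix are ordered from high to low by the total term frequency of the keywords across all the description documents, and the columns of the index matrix are ordered from high to low by the term frequency of the keywords in each description document.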
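The TF-IDF keyword extraction step can be sketched as follows. The text only names the TF-IDF algorithm; the selection rule used here (keep the `top_n` highest-scoring terms of each document), the unsmoothed `tf * log(N / df)` weighting, and the function name `extract_keywords` are assumptions for illustration. The input is assumed to be already segmented into tokens.

```python
import math
from collections import Counter
from typing import Dict, List


def extract_keywords(tokenized_docs: List[List[str]], top_n: int = 3) -> List[str]:
    """Collect the top_n highest TF-IDF terms of every document."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    keywords: Dict[str, None] = {}  # dict used as an insertion-ordered set
    for doc in tokenized_docs:
        counts = Counter(doc)
        scores = {
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Highest score first; ties are broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        for term in ranked[:top_n]:
            keywords.setdefault(term, None)
    return list(keywords)
```

A term that occurs in every document gets an inverse document frequency of zero under this weighting, so it is selected only when a document has fewer than `top_n` more distinctive terms.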
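It should be noted that the index matrix is constructed in advance from the description documents stored in the database, and its construction must observe the following: the value of the i-th element of the j-th column of the index matrix represents the term frequency of the i-th keyword in the description document of the j-th video program. All elements of the i-th row of the index matrix represent the same keyword, and elements of different rows represent different keywords. For example, if all elements of row 1 of the index matrix represent keyword A and the elements of column 1 represent description document B, then the value of the element in row 1, column 1 of the index matrix represents the term frequency of keyword A in description document B.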
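A minimal sketch of this layout, assuming the documents are already segmented and the keywords already extracted: rows are keywords, columns are description documents, and each element is a raw occurrence count. The helper names `order_keywords` and `build_index_matrix` are hypothetical. Rows are ordered by total term frequency as the text describes; columns are simply left in input order here.

```python
from typing import List


def order_keywords(keywords: List[str],
                   tokenized_docs: List[List[str]]) -> List[str]:
    """Order keywords by total term frequency over all documents, high to low."""
    totals = {k: sum(doc.count(k) for doc in tokenized_docs) for k in keywords}
    return sorted(keywords, key=lambda k: (-totals[k], k))


def build_index_matrix(keywords: List[str],
                       tokenized_docs: List[List[str]]) -> List[List[int]]:
    """Return H where H[i][j] is the count of keywords[i] in document j."""
    return [[doc.count(keyword) for doc in tokenized_docs]
            for keyword in keywords]
```

For example, with documents `[["war", "hero", "war"], ["war"]]` and keywords `["hero", "war"]`, the ordered rows are `["war", "hero"]` and the matrix is `[[2, 1], [1, 0]]`.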
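Further, the constructing of the query vector of the descriptive entry in step S2 above is specifically:
adjusting the format of the descriptive entry according to the standard entry format; for example, converting lowercase letters in the entry to uppercase, deleting redundant spaces, unifying the punctuation marks, and unifying full-width and half-width forms into a single form.
invoking a word segmentation tool; preferably, the word segmentation tool is the jieba word segmentation tool, but it is not limited to this tool.
segmenting, with the word segmentation tool, the format-adjusted descriptive entry to obtain a second word set; the word segmentation tool offers several segmentation modes: besides the normal mode, long words can be segmented further to improve recall; for short texts in particular, this yields more words than normal segmentation, which improves the accuracy of the video programs subsequently output.
extracting keywords from the second word set according to the TF-IDF algorithm;
constructing the query vector of the descriptive entry according to the term frequency of each extracted keyword in the descriptive entry.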
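The format adjustment and segmentation steps might look like the sketch below. It assumes that Unicode NFKC normalization is an acceptable way to unify full-width and half-width forms (NFKC also folds some other compatibility characters), and it treats jieba as an optional dependency, falling back to whitespace splitting when jieba is not installed. The function names are hypothetical.

```python
import re
import unicodedata
from typing import List

try:  # jieba is the tool named in the text; optional in this sketch
    import jieba
except ImportError:
    jieba = None


def normalize_entry(text: str) -> str:
    """Apply the listed format adjustments to one entry."""
    text = unicodedata.normalize("NFKC", text)  # full-width -> half-width forms
    text = text.upper()                         # lowercase -> uppercase
    return re.sub(r"\s+", " ", text).strip()    # drop redundant spaces


def segment(text: str) -> List[str]:
    """Split an entry into words, dropping whitespace-only tokens."""
    if jieba is not None:
        # Search mode additionally splits long words, which raises recall.
        return [t for t in jieba.cut_for_search(text) if t.strip()]
    return text.split()  # naive stand-in, not suitable for Chinese text
```

The same two functions would be applied to description documents and to the user's descriptive entry, so that both sides are segmented consistently.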
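It should be noted that, when the query vector of the descriptive entry is constructed, the keyword represented by the i-th element of the query vector must be the same as the keyword represented by the elements of the i-th row of the index matrix of the latent semantic indexing model, so that comparing the cosine similarity between the query vector and each column vector of the index matrix is meaningful.
In addition, the construction of the vector must observe the following principle: the keyword represented by the i-th element of the query vector is the same as the keyword represented by the elements of the i-th row of the index matrix, and the value of the i-th element of the query vector represents the term frequency, in the descriptive entry, of the keyword corresponding to that element; for example, if all elements of row 1 of the index matrix represent keyword A, then the element in row 1 of the query vector also represents keyword A, and its value represents the term frequency of keyword A in the descriptive entry.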
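The alignment requirement can be sketched in a few lines: the query vector is built by walking the row keywords of the index matrix in row order and counting each one in the segmented descriptive entry. The function name and the choice to ignore query words that are not row keywords are assumptions of this sketch.

```python
from typing import List


def build_query_vector(row_keywords: List[str],
                       query_tokens: List[str]) -> List[int]:
    """Return Q where Q[i] is the count of row_keywords[i] in the query.

    row_keywords must be the keywords of the index matrix rows, in row
    order, so that element i of Q and row i of the matrix refer to the
    same keyword. Query tokens that are not row keywords are ignored.
    """
    return [query_tokens.count(keyword) for keyword in row_keywords]
```

For row keywords `["war", "hero", "robot"]` and query tokens `["hero", "war", "war", "alien"]`, the result is `[2, 1, 0]`.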
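Further, let the index matrix be $H$; the latent semantic indexing model obtained by performing singular value decomposition on the index matrix is then $H = T S D^{T}$; wherein $T$ is an orthogonal matrix, each column of which is a left singular vector of the index matrix $H$; $S$ is a diagonal matrix whose diagonal elements are the singular values of the index matrix $H$; $D$ is an orthogonal matrix, each column of which is a right singular vector of the index matrix $H$; and the query vector is $Q$;
Step S3 above is specifically implemented as follows:
selecting matrices $T_K$, $S_K$ and $D_K$, and revising the latent semantic indexing model to $H_K = T_K S_K D_K^{T}$; wherein $T_K$ is the matrix formed by the first $K$ columns of $T$, $S_K$ is the diagonal matrix formed by the first $K$ diagonal elements of $S$, and $D_K$ is the matrix formed by the first $K$ columns of $D$; the value of $K$ is greater than the largest rank number contained in the ranking interval;
for the index matrix $H_K$ of the revised latent semantic indexing model, calculating the cosine similarity between two row vectors, namely the row vector obtained by multiplying the transpose $Q^{T}$ of the query vector by the matrix $T_K$, and the j-th row vector of the matrix obtained by multiplying the matrix $D_K$ by the matrix $S_K$, and taking it as the cosine similarity between the j-th column vector of the index matrix $H_K$ and the query vector $Q$.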
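One way to see why these two row vectors are compared is sketched below. It relies only on the columns of $T_K$ being orthonormal, which holds for singular vectors; the symbols $h_j$ (the j-th column of $H_K$) and $d_j$ (the j-th row of $D_K$) are introduced here for illustration and are not part of the original text.

```latex
% h_j : j-th column of H_K;  d_j : j-th row of D_K (a 1 x K row vector)
h_j = T_K S_K d_j^{T}
\quad\Longrightarrow\quad
T_K^{T} h_j = S_K d_j^{T} = (d_j S_K)^{T},
\qquad
Q^{T} h_j = (Q^{T} T_K)\,(d_j S_K)^{T}

\cos\theta_j =
\frac{(Q^{T} T_K)\,(d_j S_K)^{T}}
     {\lVert Q^{T} T_K \rVert \,\lVert d_j S_K \rVert}
```

Here $d_j S_K$ is the j-th row of $D_K S_K$. The numerator equals the inner product $Q^{T} h_j$, and $\lVert d_j S_K \rVert = \lVert h_j \rVert$, so the value differs from the cosine taken in the original keyword space only in that $Q$ is replaced by its projection onto the $K$ retained left singular vectors.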
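It should be noted that $K$ here is a chosen threshold and may be selected according to the actual situation. The decomposition uses the rank-$K$ approximation of $H$, which sets to zero all singular values of the index matrix $H$ after the $K$ largest. Revising the latent semantic indexing model in this way improves retrieval efficiency.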
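The truncation and the similarity calculation of step S3 can be sketched with NumPy as follows. This is an illustrative sketch under the assumption that a dense SVD of the index matrix is affordable; the function name `lsi_similarities` and the handling of zero-length vectors (similarity 0) are choices made here, not taken from the original text.

```python
import numpy as np


def lsi_similarities(H: np.ndarray, Q: np.ndarray, K: int) -> np.ndarray:
    """Cosine similarity between the query and each document in rank-K space.

    H is the keyword-by-document index matrix, Q the query vector with one
    element per row of H. Element j of the result belongs to column j of H.
    """
    # Thin SVD: H = T @ diag(s) @ Dt, singular values in descending order.
    T, s, Dt = np.linalg.svd(H, full_matrices=False)
    T_K = T[:, :K]           # first K left singular vectors (as columns)
    S_K = np.diag(s[:K])     # K largest singular values
    D_K = Dt[:K, :].T        # first K right singular vectors (as columns)

    q_hat = Q @ T_K          # row vector Q^T T_K, shape (K,)
    docs = D_K @ S_K         # row j represents document j, shape (n_docs, K)

    dots = docs @ q_hat
    norms = np.linalg.norm(docs, axis=1) * np.linalg.norm(q_hat)
    # Report 0 where either vector has zero length instead of dividing by zero.
    return np.divide(dots, norms, out=np.zeros_like(dots), where=norms > 0)
```

Because only $K$ components are kept per document, each query costs one product with `T_K` and one with the `n_docs x K` matrix, rather than a comparison against every full-length column of $H$.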
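Further, the search method also comprises:
when a description document describing a new video program is added to the database, updating the latent semantic indexing model corresponding to the video category to which the new video program belongs.
It should be noted that, because video programs keep being added and the description documents of newly added video programs keep being added to the database, the latent semantic indexing model needs to be updated.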
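The text does not say how the update is performed. The sketch below assumes the simplest option: one model is kept per video category, and adding a document appends a column to that category's index matrix and recomputes the truncated SVD. The class `CategoryModels`, the function `fit_model`, and the assumption that the keyword rows stay fixed are all hypothetical.

```python
from typing import Dict, List, Tuple

import numpy as np

Model = Tuple[np.ndarray, np.ndarray, np.ndarray]  # (T_K, S_K, D_K)


def fit_model(H: np.ndarray, K: int) -> Model:
    """Truncated SVD of H, keeping at most K singular triplets."""
    T, s, Dt = np.linalg.svd(H, full_matrices=False)
    return T[:, :K], np.diag(s[:K]), Dt[:K, :].T


class CategoryModels:
    """Keeps one index matrix and one LSI model per video category."""

    def __init__(self, K: int) -> None:
        self.K = K
        self.matrices: Dict[str, np.ndarray] = {}
        self.models: Dict[str, Model] = {}

    def add_document(self, category: str, column: List[float]) -> None:
        """Append a term-frequency column and refit that category only."""
        col = np.asarray(column, dtype=float).reshape(-1, 1)
        H = self.matrices.get(category)
        H = col if H is None else np.hstack([H, col])
        self.matrices[category] = H
        self.models[category] = fit_model(H, self.K)
```

Only the model of the affected category is refitted; models of other categories are left untouched. If a new document introduced keywords that are not yet rows of the matrix, the matrix itself would have to be rebuilt, which this sketch does not cover.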
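The method for searching for video programs provided by the embodiments of the present invention calculates the cosine similarity between the query vector of the video being searched for and each column vector of the index matrix of the latent semantic indexing model, which gives the degree of relevance between the descriptive entry of that video and the description document represented by each column vector; the higher the value, the higher the relevance. The video programs whose description documents are highly relevant to the descriptive entry are then recommended to the user. Because the latent semantic indexing model is constructed (trained) from the description documents of video programs, it can uncover the latent semantics of the documents and improve the accuracy of video program search. In addition, selecting the latent semantic indexing model that corresponds to the video category entered by the user further improves search efficiency.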
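Referring to FIG. 2, which is a schematic structural diagram of an embodiment of the apparatus for searching for video programs provided by the present invention. The search apparatus can carry out the entire flow of the method for searching for video programs provided by the above embodiment, and comprises:
a user information receiving module 10, configured to receive a descriptive entry, entered by a user, that describes a video program, and the video category to which the video program belongs;
a query vector construction module 20, configured to select the latent semantic indexing model corresponding to the video category, and to construct a query vector for the descriptive entry in the same manner as the index matrix of the latent semantic indexing model is constructed; wherein the latent semantic indexing model is obtained by performing singular value decomposition on an index matrix constructed from the description documents of video programs of the same video category;
a similarity calculation module 30, configured to calculate, according to the latent semantic indexing model, the cosine similarity between each column vector of the index matrix and the query vector;
a video program selection module 40, configured to sort the calculated cosine similarities in descending order, and to provide the user with the video programs corresponding to the column vectors whose cosine similarities have rank numbers within a ranking interval.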
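Further, the unit, comprised in the query vector construction module, for constructing the index matrix from the description documents of video programs is specifically configured to: take the term frequency of the i-th keyword in the description document of the j-th video program as the value of the i-th element of the j-th column of the index matrix;
the unit, comprised in the query vector construction module, for constructing the query vector of the descriptive entry is specifically configured to: set the keyword represented by the i-th element of the query vector to be the same as the keyword represented by the elements of the i-th row of the index matrix, and take the term frequency, in the descriptive entry, of the keyword corresponding to the i-th element as the value of the i-th element of the query vector; wherein the query vector is a column vector.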
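Further, referring to FIG. 3, which is a schematic structural diagram of an embodiment of the query vector construction module of the apparatus for searching for video programs provided by the present invention, the query vector construction module 20 comprises units for constructing the index matrix from the description documents of video programs of the same video category, specifically:
a first format adjustment unit 21, configured to adjust, for all description documents stored in a database that describe video programs of the same video category, the format of the entries contained in all the description documents according to a standard entry format; wherein the database stores description documents of multiple video categories, each description document describes one video program, and different description documents describe different video programs;
a first tool invoking unit 22, configured to invoke a word segmentation tool;
a first word segmentation unit 23, configured to segment, with the word segmentation tool, the format-adjusted entries of all the description documents to obtain a first word set;
a first keyword extraction unit 34, configured to extract keywords from the first word set according to the TF-IDF algorithm;
an index matrix construction unit 25, configured to construct the index matrix according to the term frequency of each extracted keyword in each description document; wherein the rows of the index matrix are ordered from high to low by the total term frequency of the keywords across all the description documents, and the columns of the index matrix are ordered from high to low by the term frequency of the keywords in each description document.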
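Further, the query vector construction module 20 also comprises units for constructing the query vector of the descriptive entry, specifically:
a second format adjustment unit 26, configured to adjust the format of the descriptive entry according to the standard entry format;
a second tool invoking unit 27, configured to invoke a word segmentation tool;
a second word segmentation unit 28, configured to segment, with the word segmentation tool, the format-adjusted descriptive entry to obtain a second word set;
a second keyword extraction unit 29, configured to extract keywords from the second word set according to the TF-IDF algorithm;
a query vector construction unit 31, configured to construct the query vector of the descriptive entry according to the term frequency of each extracted keyword in the descriptive entry.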
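Further, referring to FIG. 4, which is a schematic structural diagram of an embodiment of the similarity calculation module of the apparatus for searching for video programs provided by the present invention, let the index matrix be $H$; the latent semantic indexing model obtained by performing singular value decomposition on the index matrix is then $H = T S D^{T}$; wherein $T$ is an orthogonal matrix, each column of which is a left singular vector of the index matrix $H$; $S$ is a diagonal matrix whose diagonal elements are the singular values of the index matrix $H$; $D$ is an orthogonal matrix, each column of which is a right singular vector of the index matrix $H$; and the query vector is $Q$;
the similarity calculation module 30 specifically comprises:
a model revision unit 32, configured to select matrices $T_K$, $S_K$ and $D_K$, and to revise the latent semantic indexing model to $H_K = T_K S_K D_K^{T}$; wherein $T_K$ is the matrix formed by the first $K$ columns of $T$, $S_K$ is the diagonal matrix formed by the first $K$ diagonal elements of $S$, and $D_K$ is the matrix formed by the first $K$ columns of $D$; the value of $K$ is greater than the largest rank number contained in the ranking interval;
a calculation unit 33, configured to calculate, for the index matrix $H_K$ of the revised latent semantic indexing model, the cosine similarity between two row vectors, namely the row vector obtained by multiplying the transpose $Q^{T}$ of the query vector by the matrix $T_K$, and the j-th row vector of the matrix obtained by multiplying the matrix $D_K$ by the matrix $S_K$, and to take it as the cosine similarity between the j-th column vector of the index matrix $H_K$ and the query vector $Q$.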
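Further, the search apparatus also comprises:
a model update module 50, configured to update, when a description document describing a new video program is added to the database, the latent semantic indexing model corresponding to the video category to which the new video program belongs.
The apparatus for searching for video programs provided by the embodiments of the present invention calculates the cosine similarity between the query vector of the video being searched for and each column vector of the index matrix of the latent semantic indexing model, which gives the degree of relevance between the descriptive entry of that video and the description document represented by each column vector; the higher the value, the higher the relevance. The video programs whose description documents are highly relevant to the descriptive entry are then recommended to the user. Because the latent semantic indexing model is constructed (trained) from the description documents of video programs, it can uncover the latent semantics of the documents and improve the accuracy of video program search. In addition, selecting the latent semantic indexing model that corresponds to the video category entered by the user further improves search efficiency.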
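A person of ordinary skill in the art will understand that all or part of the flows of the above method embodiments can be carried out by a computer program instructing the relevant hardware; the program may be stored in a computer-readable storage medium and, when executed, may include the flows of the above method embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
The above are preferred embodiments of the present invention. It should be noted that a person of ordinary skill in the art may make several improvements and refinements without departing from the principle of the present invention, and such improvements and refinements are also regarded as falling within the scope of protection of the present invention.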

Claims (12)

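  1. A method for searching for video programs, characterized by comprising:
    receiving a descriptive entry, entered by a user, that describes a video program, and the video category to which the video program belongs;
    selecting the latent semantic indexing model corresponding to the video category, and constructing a query vector for the descriptive entry in the same manner as the index matrix of the latent semantic indexing model is constructed; wherein the latent semantic indexing model is obtained by performing singular value decomposition on an index matrix constructed from the description documents of video programs of the same video category;
    calculating, according to the latent semantic indexing model, the cosine similarity between each column vector of the index matrix and the query vector;
    sorting the calculated cosine similarities in descending order, and providing the user with the video programs corresponding to the column vectors whose cosine similarities have rank numbers within a ranking interval.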
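  2. The method for searching for video programs according to claim 1, characterized in that
    the process of constructing the index matrix from the description documents of video programs comprises: taking the term frequency of the i-th keyword in the description document of the j-th video program as the value of the i-th element of the j-th column of the index matrix;
    the process of constructing the query vector of the descriptive entry comprises: setting the keyword represented by the i-th element of the query vector to be the same as the keyword represented by the elements of the i-th row of the index matrix, and taking the term frequency, in the descriptive entry, of the keyword corresponding to the i-th element as the value of the i-th element of the query vector; wherein the query vector is a column vector.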
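  3. The method for searching for video programs according to claim 1 or 2, characterized in that the process of constructing the index matrix from the description documents of video programs of the same video category is specifically:
    for all description documents stored in a database that describe video programs of the same video category, adjusting the format of the entries contained in all the description documents according to a standard entry format; wherein the database stores description documents of multiple video categories, each description document describes one video program, and different description documents describe different video programs;
    invoking a word segmentation tool;
    segmenting, with the word segmentation tool, the format-adjusted entries of all the description documents to obtain a first word set;
    extracting keywords from the first word set according to the TF-IDF algorithm;
    constructing the index matrix according to the term frequency of each extracted keyword in each description document; wherein the rows of the index matrix are ordered from high to low by the total term frequency of the keywords across all the description documents, and the columns of the index matrix are ordered from high to low by the term frequency of the keywords in each description document.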
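  4. The method for searching for video programs according to claim 1 or 2, characterized in that the constructing of the query vector of the descriptive entry is specifically:
    adjusting the format of the descriptive entry according to the standard entry format;
    invoking a word segmentation tool;
    segmenting, with the word segmentation tool, the format-adjusted descriptive entry to obtain a second word set;
    extracting keywords from the second word set according to the TF-IDF algorithm;
    constructing the query vector of the descriptive entry according to the term frequency of each extracted keyword in the descriptive entry.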
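  5. The method for searching for video programs according to claim 3, characterized in that, letting the index matrix be $H$, the latent semantic indexing model obtained by performing singular value decomposition on the index matrix is $H = T S D^{T}$; wherein $T$ is an orthogonal matrix, each column of which is a left singular vector of the index matrix $H$; $S$ is a diagonal matrix whose diagonal elements are the singular values of the index matrix $H$; $D$ is an orthogonal matrix, each column of which is a right singular vector of the index matrix $H$; and the query vector is $Q$;
    the calculating, according to the latent semantic indexing model, of the cosine similarity between each column vector of the index matrix and the query vector is specifically:
    selecting matrices $T_K$, $S_K$ and $D_K$, and revising the latent semantic indexing model to $H_K = T_K S_K D_K^{T}$; wherein $T_K$ is the matrix formed by the first $K$ columns of $T$, $S_K$ is the diagonal matrix formed by the first $K$ diagonal elements of $S$, and $D_K$ is the matrix formed by the first $K$ columns of $D$; the value of $K$ is greater than the largest rank number contained in the ranking interval;
    for the index matrix $H_K$ of the revised latent semantic indexing model, calculating the cosine similarity between two row vectors, namely the row vector obtained by multiplying the transpose $Q^{T}$ of the query vector by the matrix $T_K$, and the j-th row vector of the matrix obtained by multiplying the matrix $D_K$ by the matrix $S_K$, and taking it as the cosine similarity between the j-th column vector of the index matrix $H_K$ and the query vector $Q$.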
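  6. The method for searching for video programs according to claim 1, characterized in that the search method further comprises:
    when a description document describing a new video program is added to the database, updating the latent semantic indexing model corresponding to the video category to which the new video program belongs.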
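  7. An apparatus for searching for video programs, characterized by comprising:
    a user information receiving module, configured to receive a descriptive entry, entered by a user, that describes a video program, and the video category to which the video program belongs;
    a query vector construction module, configured to select the latent semantic indexing model corresponding to the video category, and to construct a query vector for the descriptive entry in the same manner as the index matrix of the latent semantic indexing model is constructed; wherein the latent semantic indexing model is obtained by performing singular value decomposition on an index matrix constructed from the description documents of video programs of the same video category;
    a similarity calculation module, configured to calculate, according to the latent semantic indexing model, the cosine similarity between each column vector of the index matrix and the query vector;
    a video program selection module, configured to sort the calculated cosine similarities in descending order, and to provide the user with the video programs corresponding to the column vectors whose cosine similarities have rank numbers within a ranking interval.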
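  8. The apparatus for searching for video programs according to claim 7, characterized in that
    the unit, comprised in the query vector construction module, for constructing the index matrix from the description documents of video programs is specifically configured to: take the term frequency of the i-th keyword in the description document of the j-th video program as the value of the i-th element of the j-th column of the index matrix;
    the unit, comprised in the query vector construction module, for constructing the query vector of the descriptive entry is specifically configured to: set the keyword represented by the i-th element of the query vector to be the same as the keyword represented by the elements of the i-th row of the index matrix, and take the term frequency, in the descriptive entry, of the keyword corresponding to the i-th element as the value of the i-th element of the query vector; wherein the query vector is a column vector.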
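  9. The apparatus for searching for video programs according to claim 7 or 8, characterized in that the query vector construction module comprises units for constructing the index matrix from the description documents of video programs of the same video category, specifically:
    a first format adjustment unit, configured to adjust, for all description documents stored in a database that describe video programs of the same video category, the format of the entries contained in all the description documents according to a standard entry format; wherein the database stores description documents of multiple video categories, each description document describes one video program, and different description documents describe different video programs;
    a first tool invoking unit, configured to invoke a word segmentation tool;
    a first word segmentation unit, configured to segment, with the word segmentation tool, the format-adjusted entries of all the description documents to obtain a first word set;
    a first keyword extraction unit, configured to extract keywords from the first word set according to the TF-IDF algorithm;
    an index matrix construction unit, configured to construct the index matrix according to the term frequency of each extracted keyword in each description document; wherein the rows of the index matrix are ordered from high to low by the total term frequency of the keywords across all the description documents, and the columns of the index matrix are ordered from high to low by the term frequency of the keywords in each description document.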
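  10. The apparatus for searching for video programs according to claim 7 or 8, characterized in that the query vector construction module also comprises units for constructing the query vector of the descriptive entry, specifically:
    a second format adjustment unit, configured to adjust the format of the descriptive entry according to the standard entry format;
    a second tool invoking unit, configured to invoke a word segmentation tool;
    a second word segmentation unit, configured to segment, with the word segmentation tool, the format-adjusted descriptive entry to obtain a second word set;
    a second keyword extraction unit, configured to extract keywords from the second word set according to the TF-IDF algorithm;
    a query vector construction unit, configured to construct the query vector of the descriptive entry according to the term frequency of each extracted keyword in the descriptive entry.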
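  11. The apparatus for searching for video programs according to claim 9, characterized in that, letting the index matrix be $H$, the latent semantic indexing model obtained by performing singular value decomposition on the index matrix is $H = T S D^{T}$; wherein $T$ is an orthogonal matrix, each column of which is a left singular vector of the index matrix $H$; $S$ is a diagonal matrix whose diagonal elements are the singular values of the index matrix $H$; $D$ is an orthogonal matrix, each column of which is a right singular vector of the index matrix $H$; and the query vector is $Q$;
    the similarity calculation module specifically comprises:
    a model revision unit, configured to select matrices $T_K$, $S_K$ and $D_K$, and to revise the latent semantic indexing model to $H_K = T_K S_K D_K^{T}$; wherein $T_K$ is the matrix formed by the first $K$ columns of $T$, $S_K$ is the diagonal matrix formed by the first $K$ diagonal elements of $S$, and $D_K$ is the matrix formed by the first $K$ columns of $D$; the value of $K$ is greater than the largest rank number contained in the ranking interval;
    a calculation unit, configured to calculate, for the index matrix $H_K$ of the revised latent semantic indexing model, the cosine similarity between two row vectors, namely the row vector obtained by multiplying the transpose $Q^{T}$ of the query vector by the matrix $T_K$, and the j-th row vector of the matrix obtained by multiplying the matrix $D_K$ by the matrix $S_K$, and to take it as the cosine similarity between the j-th column vector of the index matrix $H_K$ and the query vector $Q$.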
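  12. The apparatus for searching for video programs according to claim 7, characterized in that the search apparatus further comprises:
    a model update module, configured to update, when a description document describing a new video program is added to the database, the latent semantic indexing model corresponding to the video category to which the new video program belongs.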
PCT/CN2016/113642 2016-11-18 2016-12-30 Method and apparatus for searching for video programs WO2018090468A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201611019485.4A CN106708929B (zh) 2016-11-18 2016-11-18 Method and apparatus for searching for video programs
CN201611019485.4 2016-11-18

Publications (1)

Publication Number Publication Date
WO2018090468A1 true WO2018090468A1 (zh) 2018-05-24

Family

ID=58939942

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/113642 WO2018090468A1 (zh) 2016-11-18 2016-12-30 Method and apparatus for searching for video programs

Country Status (2)

Country Link
CN (1) CN106708929B (zh)
WO (1) WO2018090468A1 (zh)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
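CN108416026B (zh) * 2018-03-09 2023-04-18 腾讯科技(深圳)有限公司 Index generation method, content search method, apparatus, and device
CN110555127A (zh) * 2018-03-30 2019-12-10 优酷网络技术(北京)有限公司 Method and apparatus for generating multimedia content
CN109918616B (zh) * 2019-01-23 2020-01-31 中国人民解放军32801部队 Visual media processing method based on semantic index precision enhancement
CN111177512A (zh) * 2019-12-24 2020-05-19 绍兴市上虞区理工高等研究院 Big-data-based method and apparatus for handling missing scientific and technological achievements
CN111651635B (zh) * 2020-05-28 2023-04-28 拾音智能科技有限公司 Video retrieval method based on natural language description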

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6189002B1 (en) * 1998-12-14 2001-02-13 Dolphin Search Process and system for retrieval of documents using context-relevant semantic profiles
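CN103152618A (zh) * 2011-12-07 2013-06-12 北京四达时代软件技术股份有限公司 Method and apparatus for recommending digital TV value-added service content
CN103559196A (zh) * 2013-09-23 2014-02-05 浙江大学 Video retrieval method based on multi-kernel canonical correlation analysis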

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
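JP2009213067A (ja) * 2008-03-06 2009-09-17 Toshiba Corp Program recommendation apparatus and program recommendation method
CN104657376B (zh) * 2013-11-20 2018-09-18 航天信息股份有限公司 Method and apparatus for searching for video programs based on program relationships
CN104199933B (zh) * 2014-09-04 2017-07-07 华中科技大学 Football video event detection and semantic annotation method based on multimodal information fusion
CN105653690B (zh) * 2015-12-30 2018-11-23 武汉大学 Method and system for fast retrieval of video big data constrained by abnormal-behavior early-warning information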

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
WANG, ZHENFENG: "The Research and Application of Full-text Retrieval Technology Based on Lucene", ELECTRONIC TECHNOLOGY & INFORMATION SCIENCE, CHINA MASTER'S THESES FULL-TEXT DATABASE (ELECTRONIC JOURNALS), 31 July 2015 (2015-07-31) *
WU, CHUNJINAG: "Latent Semantic Retrieval Based on Document Clustering Analysis", ELECTRONIC TECHNOLOGY & INFORMATION SCIENCE, CHINA MASTER'S THESES FULL-TEXT DATABASE (ELECTRONIC JOURNALS), 30 November 2013 (2013-11-30), pages 5 - 6 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
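CN111984851A (zh) * 2020-09-03 2020-11-24 平安国际智慧城市科技股份有限公司 Medical data search method and apparatus, electronic device, and storage medium
CN111984851B (zh) * 2020-09-03 2023-11-14 深圳平安智慧医健科技有限公司 Medical data search method and apparatus, electronic device, and storage medium
CN113094703A (zh) * 2021-03-11 2021-07-09 北京六方云信息技术有限公司 Output content filtering method and system for web intrusion detection
CN114564496A (zh) * 2022-03-01 2022-05-31 北京有竹居网络技术有限公司 Content recommendation method and apparatus
CN114564496B (zh) * 2022-03-01 2023-09-19 北京有竹居网络技术有限公司 Content recommendation method and apparatus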

Also Published As

Publication number Publication date
CN106708929A (zh) 2017-05-24
CN106708929B (zh) 2020-06-26

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16921698

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16921698

Country of ref document: EP

Kind code of ref document: A1