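WO2019136841A1 - Method, storage medium, electronic device and system for extracting live room content tags - Google Patents

Method, storage medium, electronic device and system for extracting live room content tags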

Info

Publication number
WO2019136841A1
WO2019136841A1 PCT/CN2018/081286 CN2018081286W WO2019136841A1 WO 2019136841 A1 WO2019136841 A1 WO 2019136841A1 CN 2018081286 W CN2018081286 W CN 2018081286W WO 2019136841 A1 WO2019136841 A1 WO 2019136841A1
Authority
WO
WIPO (PCT)
Prior art keywords
content
live
tag
words
word
Prior art date
Application number
PCT/CN2018/081286
Other languages
English (en)
French (fr)
Inventor
王璐
张文明
陈少杰
Original Assignee
武汉斗鱼网络科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 武汉斗鱼网络科技有限公司
Publication of WO2019136841A1 (zh)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/258 Heading extraction; Automatic titling; Numbering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/435 Processing of additional data, e.g. decrypting of additional data, reconstructing software from modules extracted from the transport stream
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47 End-user applications
    • H04N21/478 Supplemental services, e.g. displaying phone caller identification, shopping application
    • H04N21/4788 Supplemental services, e.g. displaying phone caller identification, shopping application communicating with other users, e.g. chatting

Definitions

  • The present invention relates to the field of big data recommendation technologies, and in particular to a method, a storage medium, an electronic device and a system for extracting content tags of live rooms.
  • A live room is a carrier of information.
  • Tagging a live room with labels that match its content and presentation summarizes the information it carries, which facilitates organizing and arranging the content of the live platform. How to assign accurate content tags to live rooms effectively is therefore an important problem.
  • The first existing approach uses the manually defined partitions of the live streaming website as tags.
  • The drawback is that a live room corresponds to only one partition, so the tags are not rich enough; moreover, the meaning of a partition is broad, which makes it hard to describe the characteristics of a live room.
  • The second existing approach tags live rooms manually, but because there are so many live rooms the labor cost is too high.
  • The object of the present invention is to provide a method, a storage medium, an electronic device and a system for extracting content tags of live rooms that overcome the high labor cost and poor tag diversity of conventional solutions.
  • The present invention discloses a method for extracting content tags of live rooms:
  • constructing a live vocabulary dictionary for storing vocabulary related to the content of the live platform, and segmenting the titles and bullet comments of live rooms within a preset time according to the live vocabulary dictionary;
  • The formula for calculating the relevance between a content tag and a live room within the set time uses the following symbols:
  • M denotes the live room ID;
  • L denotes the content tag;
  • wr is the set of tag-associated words under content tag L; it contains the words wr_1, wr_2, ..., wr_m, where m is the number of words in wr;
  • N(wr_i) is the number of times the word wr_i appears in the bullet-comment text of live room M;
  • w_i ranges over the set of all tag-associated words appearing in the bullet-comment text of live room M, which contains the words w_1, w_2, ..., w_n, where n is the number of words in that set;
  • N(w_i) is the total number of occurrences of w_i in the bullet-comment text of live room M;
  • R is the total number of live rooms;
  • R(wr) is the number of live rooms whose bullet-comment text contains words from the tag-associated word set wr.
  • The content tags include general tags and partition tags: a general tag is a content tag related to the live content, and a partition tag is a content tag related to keywords used in the live rooms under a partition.
  • The set time is one month.
  • The invention also discloses a storage medium on which a computer program is stored; when the computer program is executed by a processor, the method for extracting content tags of live rooms based on bullet-comment text is implemented.
  • The invention also discloses an electronic device comprising a memory and a processor, the memory storing a computer program that runs on the processor; when the processor executes the computer program, the method for extracting content tags of live rooms based on bullet-comment text is implemented.
  • The invention also discloses a system for extracting content tags of live rooms based on bullet-comment text, comprising:
  • a live vocabulary dictionary for storing vocabulary related to the content of the live platform;
  • a word segmentation module configured to segment the titles and bullet comments of live rooms within a preset time according to the live vocabulary dictionary;
  • a content tag construction module configured to perform word frequency statistics on the segmented text, extract as content tag candidate words the words whose frequency exceeds a preset value or which appear in more than a preset number of live rooms, abstract candidate words with similar meanings into one content tag, and use those candidate words as the tag-associated words under that content tag;
  • a tag relevance calculation module configured to calculate the relevance between every content tag and a live room within the set time, and to select one or more content tags as the content tags of the live room according to the relevance ranking.
  • The formula for calculating the relevance between a content tag and a live room within the set time uses the following symbols:
  • M denotes the live room ID;
  • L denotes the content tag;
  • wr is the set of tag-associated words under content tag L; it contains the words wr_1, wr_2, ..., wr_m, where m is the number of words in wr;
  • N(wr_i) is the number of times the word wr_i appears in the bullet-comment text of live room M;
  • w_i ranges over the set of all tag-associated words appearing in the bullet-comment text of live room M, which contains the words w_1, w_2, ..., w_n, where n is the number of words in that set;
  • N(w_i) is the total number of occurrences of w_i in the bullet-comment text of live room M;
  • R is the total number of live rooms;
  • R(wr) is the number of live rooms whose bullet-comment text contains words from the tag-associated word set wr.
  • The content tags include general tags and partition tags: a general tag is a content tag related to the live content, and a partition tag is a content tag related to keywords used in the live rooms under a partition.
  • The set time is one month.
  • The present invention segments the titles and bullet comments of live rooms within a preset time; performs word frequency statistics on the segmented text and extracts as content tag candidate words the words whose frequency exceeds a preset value or which appear in more than a preset number of live rooms; abstracts candidate words with similar meanings into one content tag and uses them as the tag-associated words under that content tag; and calculates the relevance between every content tag and a live room within a set time, selecting one or more content tags for the live room according to the relevance ranking. Because both the number of occurrences of a content tag and the number of live rooms in which it appears are taken into account, tag diversity is good and labor costs are saved.
  • FIG. 1 is a schematic flowchart of a method for extracting content tags of live rooms according to an embodiment of the present invention.
  • FIG. 2 is a schematic structural diagram of a system for extracting content tags of live rooms according to an embodiment of the present invention.
  • An embodiment of the present invention provides a method for extracting content tags of live rooms based on bullet-comment text, including:
  • S1: constructing a live vocabulary dictionary for storing vocabulary related to the content of the live platform, and segmenting the titles and bullet comments of live rooms within a preset time according to the live vocabulary dictionary.
  • The live vocabulary dictionary contains proper nouns related to games, anime culture (二次元) and live streaming, as well as other internet vocabulary.
  • Its main sources are Sogou's cell lexicons and words collected manually from forums and similar websites.
  • The live vocabulary dictionary is constructed so that bullet-comment text can be segmented reasonably. Because bullet-comment text contains many common internet expressions and proper nouns, a segmentation dictionary with very broad coverage is needed.
  • Content tags are created according to the live content of the platform.
  • The content tags include general tags and partition tags.
  • General tags are content tags related to the live content and do not involve specialized knowledge of any particular field.
  • Partition tags are content tags related to keywords used in the live rooms under a partition.
  • They are derived by observing the key words frequently used in room titles under the partition and refining them with specialized knowledge about that partition.
  • Both types of content tags can be generated using the following steps:
  • The live vocabulary dictionary constructed in the first step is used for word segmentation.
  • Suitable words are then selected by manual screening as candidate words for content tags.
  • S3: calculating the relevance between every content tag and a live room within the set time, and selecting one or more content tags as the content tags of the live room according to the relevance ranking.
  • M denotes the live room ID;
  • L denotes the content tag;
  • wr is the set of tag-associated words under content tag L; it contains the words wr_1, wr_2, ..., wr_m, where m is the number of words in wr;
  • N(wr_i) is the number of times the word wr_i appears in the bullet-comment text of live room M;
  • w_i ranges over the set of all tag-associated words appearing in the bullet-comment text of live room M, which contains the words w_1, w_2, ..., w_n, where n is the number of words in that set;
  • N(w_i) is the total number of occurrences of w_i in the bullet-comment text of live room M;
  • R is the total number of live rooms;
  • R(wr) is the number of live rooms whose bullet-comment text contains words from the tag-associated word set wr.
  • The content tags are sorted by relevance from highest to lowest, and the 10 tags with the highest scores are taken as the content tags of the live room.
  • The present invention segments the titles and bullet comments of live rooms within a preset time; performs word frequency statistics on the segmented text and extracts as content tag candidate words the words whose frequency exceeds a preset value or which appear in more than a preset number of live rooms; abstracts candidate words with similar meanings into one content tag and uses them as the tag-associated words under that content tag; and calculates the relevance between every content tag and a live room within a set time, selecting one or more content tags for the live room according to the relevance ranking. Because both the number of occurrences of a content tag and the number of live rooms in which it appears are taken into account, tag diversity is good and labor costs are saved.
  • An embodiment of the invention further discloses a storage medium on which a computer program is stored; when the computer program is executed by a processor, the method for extracting content tags of live rooms based on bullet-comment text is implemented.
  • An embodiment of the invention further discloses an electronic device comprising a memory and a processor, the memory storing a computer program that runs on the processor; when the processor executes the computer program, the method for extracting content tags of live rooms based on bullet-comment text is implemented.
  • An embodiment of the present invention further discloses a system for extracting content tags of live rooms based on bullet-comment text, including:
  • a live vocabulary dictionary for storing vocabulary related to the content of the live platform;
  • a word segmentation module configured to segment the titles and bullet comments of live rooms within a preset time according to the live vocabulary dictionary;
  • a content tag construction module configured to perform word frequency statistics on the segmented text, extract as content tag candidate words the words whose frequency exceeds a preset value or which appear in more than a preset number of live rooms, abstract candidate words with similar meanings into one content tag, and use those candidate words as the tag-associated words under that content tag;
  • a tag relevance calculation module configured to calculate the relevance between every content tag and a live room within the set time, and to select one or more content tags as the content tags of the live room according to the relevance ranking.
  • M denotes the live room ID;
  • L denotes the content tag;
  • wr is the set of tag-associated words under content tag L; it contains the words wr_1, wr_2, ..., wr_m, where m is the number of words in wr;
  • N(wr_i) is the number of times the word wr_i appears in the bullet-comment text of live room M;
  • w_i ranges over the set of all tag-associated words appearing in the bullet-comment text of live room M, which contains the words w_1, w_2, ..., w_n, where n is the number of words in that set;
  • N(w_i) is the total number of occurrences of w_i in the bullet-comment text of live room M;
  • R is the total number of live rooms;
  • R(wr) is the number of live rooms whose bullet-comment text contains words from the tag-associated word set wr.
  • The live vocabulary dictionary contains proper nouns related to games, anime culture (二次元) and live streaming, as well as other internet vocabulary.
  • The content tags include general tags and partition tags: a general tag is a content tag related to the live content, and a partition tag is a content tag related to keywords used in the live rooms under a partition.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Document Processing Apparatus (AREA)

Abstract

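The present invention discloses a method, a storage medium, an electronic device and a system for extracting content tags of live rooms, and relates to the field of big data recommendation technologies. The invention segments the titles and bullet comments of live rooms within a preset time according to a live vocabulary dictionary; performs word frequency statistics on the segmented text and extracts, as content tag candidate words, the words whose frequency exceeds a preset value or which appear in more than a preset number of live rooms; abstracts candidate words with similar meanings into one content tag and uses those candidate words as the tag-associated words under that content tag; and calculates the relevance between every content tag and a live room within a set time, selecting one or more content tags as the content tags of the live room according to the relevance ranking. Because both the number of occurrences of a content tag and the number of live rooms in which it appears are taken into account, tag diversity is good and labor costs are saved.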

Description

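Method, storage medium, electronic device and system for extracting live room content tags
Technical Field
The present invention relates to the field of big data recommendation technologies, and in particular to a method, a storage medium, an electronic device and a system for extracting content tags of live rooms.
Background
A live room is a carrier of information. Tagging a live room with labels that match its content and presentation summarizes the information it carries, which facilitates organizing and arranging the content of the live platform. How to assign accurate content tags to live rooms effectively is therefore an important problem.
Live room tags are generally extracted in one of the following ways. The first uses the manually defined partitions of the live streaming website as tags; the drawback is that a live room corresponds to only one partition, so the tags are not rich enough, and the meaning of a partition is broad, which makes it hard to describe the characteristics of a live room. The second tags live rooms manually, but because there are so many live rooms the labor cost is too high.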
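Summary of the Invention
In view of the defects in the prior art, the object of the present invention is to provide a method, a storage medium, an electronic device and a system for extracting content tags of live rooms that overcome the high labor cost and poor tag diversity of conventional solutions.
To achieve the above object, the technical solution adopted by the present invention is as follows. The present invention discloses a method for extracting content tags of live rooms:
constructing a live vocabulary dictionary, the live vocabulary dictionary being used to store vocabulary related to the content of the live platform; segmenting the titles and bullet comments of live rooms within a preset time according to the live vocabulary dictionary;
performing word frequency statistics on the segmented text, extracting as content tag candidate words the words whose frequency exceeds a preset value or which appear in more than a preset number of live rooms, abstracting candidate words with similar meanings into one content tag, and using those candidate words as the tag-associated words under that content tag;
calculating the relevance between every content tag and a live room within a set time, and selecting one or more content tags as the content tags of the live room according to the relevance ranking.
On the basis of the above technical solution, the relevance between a content tag and a live room within the set time is calculated by the following formula: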
Figure PCTCN2018081286-appb-000001
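where:
M denotes the live room ID, and L denotes the content tag;
wr is the set of tag-associated words under content tag L; it contains the words wr_1, wr_2, ..., wr_m, where m is the number of words in wr;
N(wr_i) is the number of times the word wr_i appears in the bullet-comment text of live room M;
w_i ranges over the set of all tag-associated words appearing in the bullet-comment text of live room M, which contains the words w_1, w_2, ..., w_n, where n is the number of words in that set;
N(w_i) is the total number of occurrences of w_i in the bullet-comment text of live room M;
R is the total number of live rooms;
R(wr) is the number of live rooms whose bullet-comment text contains words from the tag-associated word set wr.
On the basis of the above technical solution, the content tags include general tags and partition tags; a general tag is a content tag related to the live content, and a partition tag is a content tag related to keywords used in the live rooms under a partition.
On the basis of the above technical solution, the set time is one month.
The present invention also discloses a storage medium on which a computer program is stored; when the computer program is executed by a processor, the method for extracting content tags of live rooms based on bullet-comment text is implemented.
The present invention also discloses an electronic device comprising a memory and a processor, the memory storing a computer program that runs on the processor; when the processor executes the computer program, the method for extracting content tags of live rooms based on bullet-comment text is implemented.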
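The present invention also discloses a system for extracting content tags of live rooms based on bullet-comment text, comprising:
a live vocabulary dictionary, used to store vocabulary related to the content of the live platform;
a word segmentation module, used to segment the titles and bullet comments of live rooms within a preset time according to the live vocabulary dictionary;
a content tag construction module, used to perform word frequency statistics on the segmented text, extract as content tag candidate words the words whose frequency exceeds a preset value or which appear in more than a preset number of live rooms, abstract candidate words with similar meanings into one content tag, and use those candidate words as the tag-associated words under that content tag;
a tag relevance calculation module, used to calculate the relevance between every content tag and a live room within a set time, and to select one or more content tags as the content tags of the live room according to the relevance ranking.
On the basis of the above technical solution, the relevance between a content tag and a live room within the set time is calculated by the following formula: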
Figure PCTCN2018081286-appb-000002
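where:
M denotes the live room ID, and L denotes the content tag;
wr is the set of tag-associated words under content tag L; it contains the words wr_1, wr_2, ..., wr_m, where m is the number of words in wr;
N(wr_i) is the number of times the word wr_i appears in the bullet-comment text of live room M;
w_i ranges over the set of all tag-associated words appearing in the bullet-comment text of live room M, which contains the words w_1, w_2, ..., w_n, where n is the number of words in that set;
N(w_i) is the total number of occurrences of w_i in the bullet-comment text of live room M;
R is the total number of live rooms;
R(wr) is the number of live rooms whose bullet-comment text contains words from the tag-associated word set wr.
On the basis of the above technical solution, the content tags include general tags and partition tags; a general tag is a content tag related to the live content, and a partition tag is a content tag related to keywords used in the live rooms under a partition.
On the basis of the above technical solution, the set time is one month.
Compared with the prior art, the advantages of the present invention are as follows:
The present invention segments the titles and bullet comments of live rooms within a preset time according to a live vocabulary dictionary; performs word frequency statistics on the segmented text and extracts as content tag candidate words the words whose frequency exceeds a preset value or which appear in more than a preset number of live rooms; abstracts candidate words with similar meanings into one content tag and uses them as the tag-associated words under that content tag; and calculates the relevance between every content tag and a live room within a set time, selecting one or more content tags as the content tags of the live room according to the relevance ranking. Because both the number of occurrences of a content tag and the number of live rooms in which it appears are taken into account, tag diversity is good and labor costs are saved.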
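Brief Description of the Drawings
FIG. 1 is a schematic flowchart of a method for extracting content tags of live rooms according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a system for extracting content tags of live rooms according to an embodiment of the present invention.
Detailed Description
The present invention is described in further detail below with reference to the drawings and embodiments.
Referring to FIG. 1, an embodiment of the present invention provides a method for extracting content tags of live rooms based on bullet-comment text, including:
S1: constructing a live vocabulary dictionary, which is used to store vocabulary related to the content of the live platform, and segmenting the titles and bullet comments of live rooms within a preset time according to the live vocabulary dictionary.
First, a live vocabulary dictionary related to the content of the live platform is constructed. It contains proper nouns related to games, anime culture (二次元) and live streaming, as well as other internet vocabulary; its main sources are Sogou's cell lexicons and words collected manually from forums and similar websites. The dictionary is constructed so that bullet-comment text can be segmented reasonably: because bullet-comment text contains many common internet expressions and proper nouns, a segmentation dictionary with very broad coverage is needed.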
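The text does not say which segmentation algorithm is applied with the dictionary. As a minimal sketch only, the following uses forward maximum matching, a common dictionary-based technique chosen here as an assumption; the function name `segment` and the sample dictionary entries are hypothetical.

```python
def segment(text, dictionary, max_word_len=None):
    """Forward maximum matching: at each position take the longest dictionary word.

    Assumption: the patent does not name its segmentation algorithm;
    this is one simple dictionary-driven option.
    """
    if max_word_len is None:
        max_word_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest window first and shrink until a dictionary word matches.
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so the scan cannot stall.
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical dictionary entries (game names, streamer slang).
live_dictionary = {"英雄联盟", "主播", "吃鸡", "二次元"}
tokens = segment("主播玩英雄联盟", live_dictionary)
```

Without the dictionary entry for 英雄联盟, the game name would be split into single characters, which is why the dictionary needs broad coverage of proper nouns and internet slang.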
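S2: performing word frequency statistics on the segmented text, extracting as content tag candidate words the words whose frequency exceeds a preset value or which appear in more than a preset number of live rooms, abstracting candidate words with similar meanings into one content tag, and using those candidate words as the tag-associated words under that content tag.
Content tags are created according to the live content of the platform. They include general tags and partition tags. General tags are content tags related to the live content and do not involve specialized knowledge of any particular field. Partition tags are content tags related to keywords used in the live rooms under a partition; they are derived by observing the key words frequently used in room titles under the partition and refining them with specialized knowledge about that partition.
Both types of content tags can be generated using the following steps:
1) Segment the titles and bullet comments of live rooms from the past month with the live vocabulary dictionary constructed in the first step.
2) Perform word frequency statistics on the segmented text and take the words that occur frequently or that appear in many live rooms.
3) From these words, manually select suitable ones as candidate words for content tags.
4) Organize the candidate words: abstract several words with similar meanings into one content tag, and use those words as the tag-associated words under that content tag.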
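Steps 2) to 4) can be sketched as follows. This is an illustrative sketch, not the patent's implementation: the function name `candidate_words`, the threshold values and the sample data are all made up, and the manual screening and grouping of steps 3) and 4) are represented by a hand-written mapping.

```python
from collections import Counter


def candidate_words(room_tokens, min_frequency, min_rooms):
    """Return words whose total frequency or room count exceeds a threshold.

    room_tokens maps a room ID to the tokens from its titles and bullet comments.
    """
    frequency = Counter()   # total occurrences across all rooms
    room_count = Counter()  # number of distinct rooms each word appears in
    for tokens in room_tokens.values():
        frequency.update(tokens)
        room_count.update(set(tokens))
    return {
        word
        for word in frequency
        if frequency[word] > min_frequency or room_count[word] > min_rooms
    }


# Hypothetical segmented text for three rooms.
room_tokens = {
    "room_a": ["吃鸡", "吃鸡", "吃鸡", "主播"],
    "room_b": ["绝地求生", "绝地求生", "绝地求生", "主播"],
    "room_c": ["唱歌", "主播"],
}
candidates = candidate_words(room_tokens, min_frequency=2, min_rooms=2)

# Steps 3) and 4) are manual in the text: a person drops unsuitable words
# (here the generic word 主播) and groups near-synonyms under one content tag.
tag_associated_words = {"battle royale": {"吃鸡", "绝地求生"}}
```

The two counters correspond to the two criteria in the text: a word qualifies either by raw frequency or by how many live rooms it appears in.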
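S3: calculating the relevance between every content tag and a live room within the set time, and selecting one or more content tags as the content tags of the live room according to the relevance ranking.
The relevance between a content tag and a live room within the set time is calculated by the following formula: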
Figure PCTCN2018081286-appb-000003
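where:
M denotes the live room ID, and L denotes the content tag;
wr is the set of tag-associated words under content tag L; it contains the words wr_1, wr_2, ..., wr_m, where m is the number of words in wr;
N(wr_i) is the number of times the word wr_i appears in the bullet-comment text of live room M;
w_i ranges over the set of all tag-associated words appearing in the bullet-comment text of live room M, which contains the words w_1, w_2, ..., w_n, where n is the number of words in that set;
N(w_i) is the total number of occurrences of w_i in the bullet-comment text of live room M;
R is the total number of live rooms;
R(wr) is the number of live rooms whose bullet-comment text contains words from the tag-associated word set wr.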
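The formula itself survives here only as a figure reference. Going by the symbol definitions alone, one plausible reading is a TF-IDF-style score: the share of tag-associated word occurrences in room M that belong to tag L, weighted by how rare the tag's words are across rooms. The expression below is an assumption reconstructed from those definitions, not the confirmed formula; the name Score and the use of a plain logarithm are hypothetical.

```latex
\mathrm{Score}(M, L) \;=\;
\frac{\sum_{i=1}^{m} N(wr_i)}{\sum_{i=1}^{n} N(w_i)}
\times \log\frac{R}{R(wr)}
```

Under this reading, the first factor rewards tags whose words dominate the room's bullet comments, and the second factor discounts tags whose words appear in almost every room.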
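After the relevance between every content tag and the live room within the set time has been calculated, the tags are sorted from highest to lowest, and the 10 tags with the highest scores are taken as the content tags of the live room.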
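The ranking step can be sketched as below. The scoring function assumes a TF-IDF-style combination of the defined quantities (share of tag-word occurrences times the log of R over R(wr)); that form is an assumption, since the patent's formula is only available as a figure. All function names, parameter names and sample numbers are hypothetical.

```python
import math


def relevance(room_counts, associated_words, all_tag_words, total_rooms, rooms_with_tag):
    """Assumed TF-IDF-style relevance of one content tag to one live room."""
    tag_hits = sum(room_counts.get(w, 0) for w in associated_words)  # sum of N(wr_i)
    all_hits = sum(room_counts.get(w, 0) for w in all_tag_words)     # sum of N(w_i)
    if all_hits == 0 or rooms_with_tag == 0:
        return 0.0
    return (tag_hits / all_hits) * math.log(total_rooms / rooms_with_tag)


def top_tags(room_counts, tags, total_rooms, rooms_per_tag, k=10):
    """Score every tag for one room and return the k highest-scoring (tag, score) pairs."""
    all_tag_words = set().union(*tags.values())
    scores = {
        tag: relevance(room_counts, words, all_tag_words,
                       total_rooms, rooms_per_tag.get(tag, 0))
        for tag, words in tags.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]


# Hypothetical data: word counts in one room's bullet comments over the set time.
tags = {"battle royale": {"吃鸡", "绝地求生"}, "singing": {"唱歌"}}
room_counts = {"吃鸡": 6, "绝地求生": 2, "唱歌": 2, "主播": 5}
ranked = top_tags(room_counts, tags, total_rooms=100,
                  rooms_per_tag={"battle royale": 10, "singing": 50})
```

The default `k=10` mirrors the top-10 selection described in the text; words such as 主播 that belong to no tag do not affect any score.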
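The present invention segments the titles and bullet comments of live rooms within a preset time according to a live vocabulary dictionary; performs word frequency statistics on the segmented text and extracts as content tag candidate words the words whose frequency exceeds a preset value or which appear in more than a preset number of live rooms; abstracts candidate words with similar meanings into one content tag and uses them as the tag-associated words under that content tag; and calculates the relevance between every content tag and a live room within a set time, selecting one or more content tags as the content tags of the live room according to the relevance ranking. Because both the number of occurrences of a content tag and the number of live rooms in which it appears are taken into account, tag diversity is good and labor costs are saved.
An embodiment of the present invention further discloses a storage medium on which a computer program is stored; when the computer program is executed by a processor, the method for extracting content tags of live rooms based on bullet-comment text is implemented.
An embodiment of the present invention further discloses an electronic device comprising a memory and a processor, the memory storing a computer program that runs on the processor; when the processor executes the computer program, the method for extracting content tags of live rooms based on bullet-comment text is implemented.
Referring to FIG. 2, an embodiment of the present invention further discloses a system for extracting content tags of live rooms based on bullet-comment text, including:
a live vocabulary dictionary, used to store vocabulary related to the content of the live platform;
a word segmentation module, used to segment the titles and bullet comments of live rooms within a preset time according to the live vocabulary dictionary;
a content tag construction module, used to perform word frequency statistics on the segmented text, extract as content tag candidate words the words whose frequency exceeds a preset value or which appear in more than a preset number of live rooms, abstract candidate words with similar meanings into one content tag, and use those candidate words as the tag-associated words under that content tag;
a tag relevance calculation module, used to calculate the relevance between every content tag and a live room within a set time, and to select one or more content tags as the content tags of the live room according to the relevance ranking.
The relevance between a content tag and a live room within the set time is calculated by the following formula: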
Figure PCTCN2018081286-appb-000004
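where:
M denotes the live room ID, and L denotes the content tag;
wr is the set of tag-associated words under content tag L; it contains the words wr_1, wr_2, ..., wr_m, where m is the number of words in wr;
N(wr_i) is the number of times the word wr_i appears in the bullet-comment text of live room M;
w_i ranges over the set of all tag-associated words appearing in the bullet-comment text of live room M, which contains the words w_1, w_2, ..., w_n, where n is the number of words in that set;
N(w_i) is the total number of occurrences of w_i in the bullet-comment text of live room M;
R is the total number of live rooms;
R(wr) is the number of live rooms whose bullet-comment text contains words from the tag-associated word set wr.
The live vocabulary dictionary contains proper nouns related to games, anime culture (二次元) and live streaming, as well as other internet vocabulary.
The content tags include general tags and partition tags; a general tag is a content tag related to the live content, and a partition tag is a content tag related to keywords used in the live rooms under a partition.
The present invention is not limited to the above embodiments. A person of ordinary skill in the art may make improvements and refinements without departing from the principle of the present invention, and such improvements and refinements are also regarded as falling within the scope of protection of the present invention. Content not described in detail in this specification belongs to the prior art known to those skilled in the art.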

Claims (10)

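  1. A method for extracting content tags of live rooms based on bullet-comment text, characterized by:
    constructing a live vocabulary dictionary, the live vocabulary dictionary being used to store vocabulary related to the content of the live platform; segmenting the titles and bullet comments of live rooms within a preset time according to the live vocabulary dictionary;
    performing word frequency statistics on the segmented text, extracting as content tag candidate words the words whose frequency exceeds a preset value or which appear in more than a preset number of live rooms, abstracting candidate words with similar meanings into one content tag, and using those candidate words as the tag-associated words under that content tag;
    calculating the relevance between every content tag and a live room within a set time, and selecting one or more content tags as the content tags of the live room according to the relevance ranking.
  2. The method for extracting content tags of live rooms based on bullet-comment text according to claim 1, characterized in that the relevance between a content tag and a live room within the set time is calculated by the following formula: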
    Figure PCTCN2018081286-appb-100001
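    where:
    M denotes the live room ID, and L denotes the content tag;
    wr is the set of tag-associated words under content tag L; it contains the words wr_1, wr_2, ..., wr_m, where m is the number of words in wr;
    N(wr_i) is the number of times the word wr_i appears in the bullet-comment text of live room M;
    w_i ranges over the set of all tag-associated words appearing in the bullet-comment text of live room M, which contains the words w_1, w_2, ..., w_n, where n is the number of words in that set;
    N(w_i) is the total number of occurrences of w_i in the bullet-comment text of live room M;
    R is the total number of live rooms;
    R(wr) is the number of live rooms whose bullet-comment text contains words from the tag-associated word set wr.
  3. The method for extracting content tags of live rooms based on bullet-comment text according to claim 1, characterized in that the content tags include general tags and partition tags, a general tag being a content tag related to the live content and a partition tag being a content tag related to keywords used in the live rooms under a partition.
  4. The method for extracting content tags of live rooms based on bullet-comment text according to claim 1, characterized in that the set time is one month.
  5. A storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the method according to any one of claims 1 to 4.
  6. An electronic device comprising a memory and a processor, the memory storing a computer program that runs on the processor, characterized in that the processor, when executing the computer program, implements the method according to any one of claims 1 to 4.
  7. A system for extracting content tags of live rooms based on bullet-comment text, characterized by comprising:
    a live vocabulary dictionary, used to store vocabulary related to the content of the live platform;
    a word segmentation module, used to segment the titles and bullet comments of live rooms within a preset time according to the live vocabulary dictionary;
    a content tag construction module, used to perform word frequency statistics on the segmented text, extract as content tag candidate words the words whose frequency exceeds a preset value or which appear in more than a preset number of live rooms, abstract candidate words with similar meanings into one content tag, and use those candidate words as the tag-associated words under that content tag;
    a tag relevance calculation module, used to calculate the relevance between every content tag and a live room within a set time, and to select one or more content tags as the content tags of the live room according to the relevance ranking.
  8. The system for extracting content tags of live rooms based on bullet-comment text according to claim 7, characterized in that the relevance between a content tag and a live room within the set time is calculated by the following formula: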
    Figure PCTCN2018081286-appb-100002
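    where:
    M denotes the live room ID, and L denotes the content tag;
    wr is the set of tag-associated words under content tag L; it contains the words wr_1, wr_2, ..., wr_m, where m is the number of words in wr;
    N(wr_i) is the number of times the word wr_i appears in the bullet-comment text of live room M;
    w_i ranges over the set of all tag-associated words appearing in the bullet-comment text of live room M, which contains the words w_1, w_2, ..., w_n, where n is the number of words in that set;
    N(w_i) is the total number of occurrences of w_i in the bullet-comment text of live room M;
    R is the total number of live rooms;
    R(wr) is the number of live rooms whose bullet-comment text contains words from the tag-associated word set wr.
  9. The system for extracting content tags of live rooms based on bullet-comment text according to claim 7, characterized in that the content tags include general tags and partition tags, a general tag being a content tag related to the live content and a partition tag being a content tag related to keywords used in the live rooms under a partition.
  10. The system for extracting content tags of live rooms based on bullet-comment text according to claim 7, characterized in that the set time is one month.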
PCT/CN2018/081286 2018-01-09 2018-03-30 Method, storage medium, electronic device and system for extracting live room content tags WO2019136841A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810019246.1 2018-01-09
CN201810019246.1A CN108280059B (zh) 2018-01-09 2018-01-09 Method, storage medium, electronic device and system for extracting live room content tags

Publications (1)

Publication Number Publication Date
WO2019136841A1 true WO2019136841A1 (zh) 2019-07-18

Family

ID=62803367

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/081286 WO2019136841A1 (zh) 2018-01-09 2018-03-30 Method, storage medium, electronic device and system for extracting live room content tags

Country Status (2)

Country Link
CN (1) CN108280059B (zh)
WO (1) WO2019136841A1 (zh)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
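CN109034049A (zh) * 2018-07-23 2018-12-18 北京密境和风科技有限公司 Method and device for recognizing dancing videos
CN109063133B (zh) * 2018-08-02 2021-02-02 武汉斗鱼网络科技有限公司 Method, system, device and medium for adding live room tags
CN110896488B (zh) * 2018-08-23 2022-01-04 武汉斗鱼网络科技有限公司 Live room recommendation method and related device
CN109379608B (zh) * 2018-09-13 2021-07-23 武汉斗鱼网络科技有限公司 Live room recommendation method and related device
CN109255066B (zh) * 2018-09-30 2021-11-09 武汉斗鱼网络科技有限公司 Tag marking method, apparatus, server and storage medium for business objects
CN109547863B (zh) * 2018-10-22 2021-06-15 武汉斗鱼网络科技有限公司 Tag marking method, apparatus, server and storage medium
CN109919213A (zh) * 2019-02-27 2019-06-21 上海六界信息技术有限公司 Method, apparatus, device and storage medium for determining live broadcast type
CN110377843A (zh) * 2019-07-17 2019-10-25 网易(杭州)网络有限公司 Live room processing method and apparatus, electronic device, and storage medium
CN110519654B (zh) * 2019-09-11 2021-07-27 广州荔支网络技术有限公司 Tag determination method, apparatus, electronic device and storage medium
CN110688852B (zh) * 2019-09-27 2023-04-07 西安赢瑞电子有限公司 Method for storing Chinese word frequencies
CN111027321B (zh) * 2019-11-30 2023-06-30 南京森林警察学院 Police-related intelligent question-set composition method
CN112995690B (zh) * 2021-02-26 2023-07-25 广州虎牙科技有限公司 Live content category identification method, apparatus, electronic device and readable storage medium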

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110258188A1 (en) * 2010-04-16 2011-10-20 Abdalmageed Wael Semantic Segmentation and Tagging Engine
CN105893478A (zh) * 2016-03-29 2016-08-24 广州华多网络科技有限公司 Tag extraction method and device
CN106096031A (zh) * 2016-06-27 2016-11-09 武汉斗鱼网络科技有限公司 Tagged video ranking method and device
CN106453284A (zh) * 2016-09-27 2017-02-22 北京金山安全软件有限公司 Live broadcast tag updating method, apparatus and terminal device

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007087349A2 (en) * 2006-01-25 2007-08-02 Fortuna Joseph A Jr Method and system for automatic summarization and digest of celebrity news
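CN106681985A (zh) * 2016-12-13 2017-05-17 成都数联铭品科技有限公司 Multi-domain dictionary construction system based on automatic topic matching
CN106960042A (zh) * 2017-03-29 2017-07-18 中国科学技术大学苏州研究院 Online live broadcast supervision method based on bullet-comment semantic analysis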

Also Published As

Publication number Publication date
CN108280059A (zh) 2018-07-13
CN108280059B (zh) 2020-08-04

Similar Documents

Publication Publication Date Title
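WO2019136841A1 (zh) Method, storage medium, electronic device and system for extracting live room content tags
AU2016273851B2 (en) Accurate tag relevance prediction for image search
US8577882B2 (en) Method and system for searching multilingual documents
WO2019218514A1 (zh) Method, apparatus and storage medium for extracting target information from web pages
US10909427B2 (en) Method and device for classifying webpages
WO2015149533A1 (zh) Method and apparatus for word segmentation based on web page content classification
CN109740152B (zh) Method, apparatus, storage medium and computer device for determining text category
CN104881458B (zh) Method and apparatus for annotating web page topics
CN109791632B (zh) Scene segment classifier, scene classifier, and recording medium
CN107193892B (zh) Document topic determination method and apparatus
CN109299233B (zh) Text data processing method, apparatus, computer device and storage medium
JP6056610B2 (ja) Text information processing device, text information processing method, and text information processing program
US10831803B2 (en) System and method for true product word recognition
CN109582783B (zh) Hot topic detection method and apparatus
WO2021112984A1 (en) Feature and context based search result generation
CN108388556B (zh) Method and system for mining entities of the same type
CN109815337B (zh) Method and apparatus for determining article category
CN106570196B (zh) Video program search method and apparatus
Zhang et al. Integration of visual temporal information and textual distribution information for news web video event mining
CN113591476A (zh) Machine-learning-based data tag recommendation method
JP2009199302A (ja) Program, device and method for analyzing documents
CN111797224A (zh) Method, apparatus, device and storage medium for displaying patent data search results
Imran et al. Event recognition from photo collections via pagerank
CN107577667B (zh) Entity word processing method and apparatus
JP6260678B2 (ja) Information processing device, information processing method, and information processing program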

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18899680

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18899680

Country of ref document: EP

Kind code of ref document: A1