WO2019001075A1 - 一种垃圾弹幕的识别方法、装置及计算机设备 - Google Patents

一种垃圾弹幕的识别方法、装置及计算机设备 Download PDF

Info

Publication number
WO2019001075A1
WO2019001075A1 (application PCT/CN2018/082176)
Authority
WO
WIPO (PCT)
Prior art keywords
barrage
word
information
idf
probability
Prior art date
Application number
PCT/CN2018/082176
Other languages
English (en)
French (fr)
Inventor
龚灿
张文明
陈少杰
Original Assignee
武汉斗鱼网络科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 武汉斗鱼网络科技有限公司 filed Critical 武汉斗鱼网络科技有限公司
Publication of WO2019001075A1 publication Critical patent/WO2019001075A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/475End-user interface for inputting end-user data, e.g. personal identification number [PIN], preference data
    • H04N21/4756End-user interface for inputting end-user data, e.g. personal identification number [PIN], preference data for rating content, e.g. scoring a recommended movie
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/488Data services, e.g. news ticker
    • H04N21/4884Data services, e.g. news ticker for displaying subtitles

Definitions

  • The invention belongs to the technical field of garbage barrage processing for live broadcast platforms, and particularly relates to a method, a device and a computer device for identifying garbage barrages.
  • Each time a viewer sends a barrage, the barrage is sent to the live platform server, and the live platform server forwards the barrage to all viewers in the live room.
  • In order to gain profit, some abnormal users often burst a large amount of garbage barrage information into the live room, such as sending a large amount of advertising information. Such advertising harassment directly reduces user participation, which decreases the number of users on the live platform and reduces the revenue of the live platform.
  • Garbage barrages are generally identified by manually extracted rules and fuzzy keyword matching, but this recognition method is labor-intensive and its recognition accuracy is not high.
  • The present invention provides a method, a device, and a computer device for remote procedure call, which are used to solve the problem that thread scheduling and allocation cannot be performed flexibly when a remote procedure call is performed in the prior art.
  • the present invention provides a method of remote procedure call, the method comprising:
  • the invention provides a method for identifying a garbage barrage, which is applied to a live broadcast platform, and the method comprises:
  • the feature information of the barrage information is extracted to obtain the first barrage information
  • TF-IDF (term frequency-inverse document frequency)
  • the first barrage information is preprocessed to remove data that affects the naive Bayesian model identification in the first barrage information, including:
  • the first barrage information is cut into words according to the word-formation rules in the customized vocabulary of the live broadcast platform, forming the word bag model, including:
  • the filtered words are combined in a predetermined order to form the word bag model.
  • the converting the word bag model into a word vector based on a preset mapping rule includes:
  • each word of the word bag model is mapped to the corresponding dimension of the word vector, and the word bag model is converted into the word vector.
  • the TF-IDF weighting is performed on each word in the word vector, and the TF-IDF weighting value of each word is obtained, including:
  • The naive Bayesian model is used to calculate the first probability P1 that the barrage information is a garbage barrage in the case where all words appear:
  • P1 = P("garbage barrage" | a1, a2, a3, a4, a5, a6, ..., ai, ..., an) = (p("garbage barrage" | a1) * TF-IDF(a1)) * (p("garbage barrage" | a2) * TF-IDF(a2)) * ... * (p("garbage barrage" | ai) * TF-IDF(ai)) * ... * (p("garbage barrage" | an) * TF-IDF(an))
  • The TF-IDF weighting values of the words are used, according to the naive Bayesian model, to calculate the second probability P2 that the barrage information is a normal barrage in the case where all words appear:
  • P2 = P("normal barrage" | a1, a2, a3, a4, a5, a6, ..., an) = (p("normal barrage" | a1) * TF-IDF(a1)) * (p("normal barrage" | a2) * TF-IDF(a2)) * ... * (p("normal barrage" | ai) * TF-IDF(ai)) * ... * (p("normal barrage" | an) * TF-IDF(an))
  • the invention provides an identification device for a garbage barrage, which is applied to a live broadcast platform, and the device comprises:
  • an extracting unit configured to perform feature extraction on the barrage information based on the preset barrage information feature construction rules, and acquire the first barrage information;
  • a preprocessing unit configured to preprocess the first barrage information, and remove data that affects the naive Bayesian model identification in the first barrage information
  • a word-cutting unit configured to cut the preprocessed first barrage information into words according to the word-formation rules in the customized vocabulary of the live broadcast platform, to form a word bag model
  • a converting unit configured to convert the word bag model into a word vector based on a preset mapping rule
  • a weighting unit configured to perform term frequency-inverse document frequency (TF-IDF) weighting on each word in the word vector, and obtain a TF-IDF weighting value of each word;
  • an establishing unit configured to establish the naive Bayesian model and, based on the TF-IDF weighting value of each word, use the naive Bayesian model to respectively calculate, in the case where all words appear, the first probability P1 that the barrage information is a garbage barrage and the second probability P2 that the barrage information is a normal barrage;
  • the determining unit is configured to determine whether the first probability P1 is greater than the second probability P2, and if the first probability P1 is greater than the second probability P2, determine that the barrage information is a garbage barrage.
  • the present invention also provides a computer readable storage medium having stored thereon a computer program that, when executed by a processor, implements the following steps:
  • the feature information of the barrage information is extracted to obtain the first barrage information
  • the preprocessed first barrage information is cut into words to form a word bag model
  • the invention also provides a computer device for garbage barrage recognition, comprising:
  • at least one processor;
  • At least one memory communicatively coupled to the processor, wherein
  • the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the method of any of claims 1-7.
  • the invention provides a method, a device and a computer device for identifying a garbage barrage.
  • The method comprises: performing feature extraction on the barrage information based on preset barrage information feature construction rules to acquire first barrage information; preprocessing the first barrage information to remove the data in the first barrage information that affects recognition by the naive Bayesian model; cutting the preprocessed first barrage information into words according to the word-formation rules in the customized vocabulary of the live broadcast platform to form a word bag model; converting the word bag model into a word vector based on a preset mapping rule; performing TF-IDF weighting on each word in the word vector to obtain the TF-IDF weighting value of each word; establishing the naive Bayesian model and, based on the TF-IDF weighting values of the words, using the naive Bayesian model to respectively calculate, in the case where all the words in the word bag model appear, the first probability P1 that the barrage information is a garbage barrage and the second probability P2 that the barrage information is a normal barrage; and determining whether P1 is greater than P2, and if so, determining that the barrage information is a garbage barrage.
  • FIG. 1 is a schematic flow chart of a method for identifying a garbage barrage according to Embodiment 1 of the present invention;
  • FIG. 2 is a schematic structural view of an apparatus for identifying a garbage barrage according to Embodiment 2 of the present invention;
  • FIG. 3 is a schematic diagram of the overall structure of a computer device for garbage barrage identification according to Embodiment 3 of the present invention.
  • This embodiment provides a method for identifying a garbage barrage. As shown in FIG. 1, the method includes:
  • An SQL database query statement is used to extract the barrage information marked with the barrage type from the HIVE database as sample data.
  • the barrage information includes a barrage content and a barrage type.
  • the feature construction rule includes: using a specific identifier to represent a word that conforms to a certain type of feature.
  • For example, the word "naked wolf" is added to the customized vocabulary, so that "naked wolf" is segmented as a single word in the subsequent word-cutting process instead of being cut into the two words "naked" and "wolf".
  • The customized vocabulary includes some personalized words that appear at a high frequency and differ from regular words; the customized vocabulary is updated incrementally every day to improve its accuracy.
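The role of the customized vocabulary in word cutting can be sketched as follows. This is an illustrative greedy forward longest-match segmenter, not the patent's actual implementation; the vocabulary and the strings are made-up stand-ins showing how a multi-character custom word (like "naked wolf" above) survives as one token instead of being split.

```python
# Sketch: greedy forward longest-match word cutting over a custom vocabulary.
# Any substring found in the vocabulary is kept as one token; otherwise the
# segmenter falls back to single characters.

def cut_words(text, vocabulary, max_len=4):
    """Greedy forward longest-match segmentation over `vocabulary`."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocabulary:
                tokens.append(piece)
                i += size
                break
    return tokens

# Without "ab" in the vocabulary the string is cut character by character;
# with it, "ab" survives as one word.
print(cut_words("abc", set()))   # ['a', 'b', 'c']
print(cut_words("abc", {"ab"}))  # ['ab', 'c']
```

Incrementally updating the vocabulary, as the embodiment describes, simply means growing this set with newly observed high-frequency words.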
  • S102 Perform pre-processing on the first barrage information, and remove data that affects the recognition of the naive Bayesian model in the first barrage information;
  • Specifically, the first barrage information is preprocessed, and the data in the first barrage information that affects recognition by the naive Bayesian model is removed.
  • For example, the word "happy" would be cut into "open" and "heart" during word cutting, which affects the accuracy of barrage recognition. Therefore, the first barrage information is preprocessed: the data whose barrage content is empty, the punctuation marks in the barrage content, and the data whose barrage type is empty are removed.
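The preprocessing step can be sketched as below. The record layout (a dict with "content" and "type" fields) and the ASCII punctuation set are assumptions for illustration; a production system would also cover full-width punctuation.

```python
# Sketch of the preprocessing step: drop rows whose barrage content or
# barrage type is empty, and strip punctuation from the surviving content.
import string

def preprocess(records):
    cleaned = []
    for rec in records:
        content, btype = rec.get("content"), rec.get("type")
        if not content or not btype:  # remove empty-content / empty-type data
            continue
        # Strip ASCII punctuation from the barrage content.
        stripped = content.translate(str.maketrans("", "", string.punctuation))
        cleaned.append({"content": stripped, "type": btype})
    return cleaned

sample = [
    {"content": "buy now!!!", "type": "ad"},
    {"content": "", "type": "normal"},   # removed: empty content
    {"content": "hello", "type": None},  # removed: empty type
]
print(preprocess(sample))  # [{'content': 'buy now', 'type': 'ad'}]
```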
  • The first barrage information is cut into words and deduplicated according to the word-formation rules in the customized vocabulary of the live broadcast platform, forming the word bag model.
  • The words in the first barrage information that have little or no influence on recognition by the naive Bayesian model are filtered out according to the word-formation rules, obtaining the filtered words; the filtered words are combined in a predetermined order to form the word bag model.
  • Words that have little or no influence on recognition by the naive Bayesian model can be added to the custom stop-word library.
  • The word bag model converts each piece of barrage information into a collection of words. For example, if the barrage information is "the anchor of today's naked wolf is awesome", then after the sentence is cut into words, the stop words (such as "today" and the function words) are filtered out, and the remaining words are "host", "naked wolf", "play", "great".
  • The word bag model of this barrage information is then ["host", "naked wolf", "play", "great"].
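The stop-word filtering that produces the word bag model can be sketched as follows. The stop-word set is a hypothetical stand-in for the platform's custom stop-word library; "today" mirrors the example above.

```python
# Sketch of forming the word bag model: after word cutting, stop words in the
# custom stop-word library are removed and the remaining words are kept in
# their original order.

STOP_WORDS = {"today", "of", "is"}  # hypothetical custom stop-word library

def to_bag(tokens, stop_words=STOP_WORDS):
    return [t for t in tokens if t not in stop_words]

tokens = ["today", "host", "naked wolf", "of", "play", "great"]
print(to_bag(tokens))  # ['host', 'naked wolf', 'play', 'great']
```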
  • S104 Convert the word bag model into a word vector based on a preset mapping rule; perform term frequency-inverse document frequency (TF-IDF) weighting on each word in the word vector, and obtain the TF-IDF weighting value of each word;
  • Specifically, each word of the word bag model is mapped to the corresponding dimension of the word vector based on the preset mapping rule, converting the word bag model into the word vector.
  • For example, the dimension of the word vector is 400,000; that is, each piece of barrage information is represented as a 400,000-dimensional vector V, and each vector position corresponds to one word.
  • The mapping rule may be a sequential mapping or a random mapping, which is not limited herein.
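The sequential mapping can be sketched as below: each new word is assigned the next free vector position and the vector stores the word's count. The embodiment uses a fixed 400,000-dimension vector; a small dimension is used here only so the output stays readable, and the word-to-index table is an illustrative assumption.

```python
# Sketch of sequential word-to-vector mapping with a fixed dimension.

def to_word_vector(bag, index, dim):
    vec = [0] * dim
    for word in bag:
        pos = index.setdefault(word, len(index))  # sequential mapping
        vec[pos] += 1
    return vec

index = {}
v = to_word_vector(["host", "naked wolf", "play", "great"], index, dim=8)
print(v)      # [1, 1, 1, 1, 0, 0, 0, 0]
print(index)  # {'host': 0, 'naked wolf': 1, 'play': 2, 'great': 3}
```

A random mapping, which the embodiment also allows, would only change how `pos` is chosen, not the vector representation itself.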
  • Each word in the word vector is then subjected to TF-IDF weighting, and the TF-IDF weighting value of each word is obtained: the frequency TF with which each word appears in the barrage information is calculated; the inverse document frequency weighting value IDF of each word is calculated according to formula (1), IDF = log2(M); and the TF-IDF weighting value of each word is calculated according to formula (2), TF-IDF = TF * IDF.
  • The M is the quotient of the total number of barrage messages and the number of barrage messages containing each word.
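Formulas (1) and (2) can be sketched as below. TF is taken here as the word's relative frequency within one barrage (the text says only "frequency", so this reading is an assumption); IDF = log2(M) with M the total number of barrages divided by the number containing the word; the tiny corpus is illustrative.

```python
# Sketch of TF-IDF weighting per formulas (1)-(2): TF-IDF = TF * log2(M).
import math

def tf_idf(word, bag, corpus):
    tf = bag.count(word) / len(bag)
    containing = sum(1 for doc in corpus if word in doc)
    m = len(corpus) / containing  # quotient M from formula (1)
    return tf * math.log2(m)

corpus = [["spam", "link"], ["hello", "anchor"], ["spam", "spam", "buy"]]
# "spam" appears in 2 of the 3 barrages, and twice in the third one:
w = tf_idf("spam", corpus[2], corpus)
print(w)  # (2/3) * log2(3/2)
```

Note that a word occurring in every barrage gets M = 1 and hence weight 0, which matches the intuition that ubiquitous words carry no discriminative information.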
  • The naive Bayesian model is established; based on the TF-IDF weighting value of each word, the naive Bayesian model is used to respectively calculate, in the case where all words appear, the first probability P1 that the barrage information is a garbage barrage and the second probability P2 that the barrage information is a normal barrage;
  • Specifically, based on the TF-IDF weighting value of each word, the naive Bayesian model is used to respectively calculate, in the case where all the words in the word bag model appear, the first probability P1 that the barrage information is a garbage barrage and the second probability P2 that the barrage information is a normal barrage.
  • P(A|B) indicates the probability that event A occurs given that event B has occurred, and is called the conditional probability of A given B.
  • The basic solution formula is Bayes' theorem, as shown in formula (3): P(A|B) = P(B|A) * P(A) / P(B).
  • The first probability P1 that the barrage information is a garbage barrage can be calculated using formula (4):
  • P1 = P("garbage barrage" | a1, a2, a3, a4, a5, a6, ..., ai, ..., an) = (p("garbage barrage" | a1) * TF-IDF(a1)) * (p("garbage barrage" | a2) * TF-IDF(a2)) * ... * (p("garbage barrage" | ai) * TF-IDF(ai)) * ... * (p("garbage barrage" | an) * TF-IDF(an))
  • where ai is any one of the words, n is the number of words in the barrage information, and TF-IDF(ai) is the TF-IDF weighting value of that word.
  • The second probability P2 that the barrage information is a normal barrage can be calculated using formula (5):
  • P2 = P("normal barrage" | a1, a2, a3, a4, a5, a6, ..., an) = (p("normal barrage" | a1) * TF-IDF(a1)) * (p("normal barrage" | a2) * TF-IDF(a2)) * ... * (p("normal barrage" | ai) * TF-IDF(ai)) * ... * (p("normal barrage" | an) * TF-IDF(an))
  • where ai is any one of the words, n is the number of words in the barrage information, i ≤ n, and TF-IDF(ai) is the TF-IDF weighting value of that word.
  • In this step, after the first probability P1 that the barrage information is a garbage barrage and the second probability P2 that it is a normal barrage are calculated, it is determined whether the first probability P1 is greater than the second probability P2. If the first probability P1 is greater than the second probability P2, the barrage information is determined to be a garbage barrage; if the first probability P1 is less than the second probability P2, the barrage information is determined to be a normal barrage. The garbage barrage is thereby identified.
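Formulas (4)-(5) and the final comparison can be sketched as follows. The per-word conditional probabilities and TF-IDF weights below are made-up numbers standing in for values estimated from labeled training barrages.

```python
# Sketch of formulas (4)-(5) and the decision step: each word's conditional
# probability is multiplied by its TF-IDF weight, the products give P1 and
# P2, and the larger product decides the class.

def weighted_probability(cond_probs, weights):
    """P = prod_i p(class | a_i) * TF-IDF(a_i), as in formulas (4)/(5)."""
    p = 1.0
    for cp, w in zip(cond_probs, weights):
        p *= cp * w
    return p

def classify(p_garbage_given_word, p_normal_given_word, tfidf_weights):
    p1 = weighted_probability(p_garbage_given_word, tfidf_weights)
    p2 = weighted_probability(p_normal_given_word, tfidf_weights)
    return "garbage barrage" if p1 > p2 else "normal barrage"

# A two-word barrage whose words strongly indicate spam:
print(classify([0.9, 0.8], [0.1, 0.2], [1.5, 2.0]))  # garbage barrage
```

Because both products use the same TF-IDF weights, the comparison is driven by the conditional probabilities, while the weights emphasize discriminative words.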
  • the embodiment provides a device for identifying a garbage barrage.
  • The device includes: an extracting unit 21, a preprocessing unit 22, a word-cutting unit 23, a converting unit 24, a weighting unit 25, an establishing unit 26, and a determining unit 27.
  • An SQL database query statement is used to extract the barrage information marked with the barrage type from the HIVE database as sample data.
  • the barrage information includes a barrage content and a barrage type.
  • the extracting unit 21 constructs a rule based on the preset barrage information feature, and performs feature extraction on the barrage information to obtain the first barrage information.
  • the feature construction rule includes: using a specific identifier to represent a word that conforms to a certain type of feature.
  • For example, the word "naked wolf" is added to the customized vocabulary, so that "naked wolf" is segmented as a single word in the subsequent word-cutting process instead of being cut into the two words "naked" and "wolf".
  • The customized vocabulary includes some personalized words that appear at a high frequency and differ from regular words; the customized vocabulary is updated incrementally every day to improve its accuracy.
  • the pre-processing unit 22 is configured to pre-process the first barrage information, and remove data that affects the naive Bayesian model recognition in the first barrage information. .
  • When the preprocessing unit preprocesses the first barrage information, it removes the data whose barrage content is empty, the punctuation marks in the barrage content, and the data whose barrage type is empty from the first barrage information.
  • The word-cutting unit 23 is configured to cut the preprocessed first barrage information into words according to the word-formation rules in the customized vocabulary of the live broadcast platform, and to deduplicate the words, forming the word bag model.
  • The word-cutting unit 23 filters out, according to the word-formation rules, the words in the first barrage information that have little or no influence on recognition by the naive Bayesian model, obtaining the filtered words;
  • the filtered words are combined in a predetermined order to form the word bag model.
  • Words that have little or no influence on recognition by the naive Bayesian model can be added to the custom stop-word library.
  • The word bag model converts each piece of barrage information into a collection of words. For example, if the barrage information is "the anchor of today's naked wolf is awesome", then after the sentence is cut into words, the stop words (such as "today" and the function words) are filtered out, and the remaining words are "host", "naked wolf", "play", "great".
  • The word bag model of this barrage information is then ["host", "naked wolf", "play", "great"].
  • The converting unit 24 is configured to map each word of the word bag model to the corresponding dimension of the word vector based on the preset mapping rule, converting the word bag model into the word vector.
  • For example, the dimension of the word vector is 400,000; that is, each piece of barrage information is represented as a 400,000-dimensional vector V, and each vector position corresponds to one word.
  • For example, the converting unit 24 can map "host" to the position of V(0), "naked wolf" to the position of V(1), "play" to the position of V(2), and "great" to the position of V(3), so that the final word vector is (1, 1, 1, 1, 0, 0, 0, 0, ..., 0), where the ellipsis stands for the remaining 399,996 zeros (because the word vector has a fixed dimension of 400,000).
  • mapping rule may be a sequential mapping or a random mapping, which is not limited herein.
  • the weighting unit 25 is configured to perform TF-IDF weighting on each word in the word vector to obtain a TF-IDF weighting value of each word.
  • The weighting unit 25 calculates the frequency TF with which each word appears in the barrage information; calculates the inverse document frequency weighting value IDF of each word according to formula (1), IDF = log2(M); and calculates the TF-IDF weighting value of each word according to formula (2), TF-IDF = TF * IDF.
  • The M is the quotient of the total number of barrage messages and the number of barrage messages containing each word.
  • The establishing unit 26 is configured to establish the naive Bayesian model and, based on the TF-IDF weighting value of each word, use the naive Bayesian model to respectively calculate, in the case where all words appear, the first probability P1 that the barrage information is a garbage barrage and the second probability P2 that the barrage information is a normal barrage;
  • The establishing unit 26 can calculate the first probability P1 that the barrage information is a garbage barrage using formula (4):
  • P1 = P("garbage barrage" | a1, a2, a3, a4, a5, a6, ..., ai, ..., an) = (p("garbage barrage" | a1) * TF-IDF(a1)) * (p("garbage barrage" | a2) * TF-IDF(a2)) * ... * (p("garbage barrage" | ai) * TF-IDF(ai)) * ... * (p("garbage barrage" | an) * TF-IDF(an))
  • where ai is any one of the words, n is the number of words in the barrage information, and TF-IDF(ai) is the TF-IDF weighting value of that word.
  • The establishing unit 26 can calculate the second probability P2 that the barrage information is a normal barrage using formula (5):
  • P2 = P("normal barrage" | a1, a2, a3, a4, a5, a6, ..., an) = (p("normal barrage" | a1) * TF-IDF(a1)) * (p("normal barrage" | a2) * TF-IDF(a2)) * ... * (p("normal barrage" | ai) * TF-IDF(ai)) * ... * (p("normal barrage" | an) * TF-IDF(an))
  • where ai is any one of the words, n is the number of words in the barrage information, i ≤ n, and TF-IDF(ai) is the TF-IDF weighting value of that word.
  • The determining unit 27 is configured to determine whether the first probability P1 is greater than the second probability P2. If the first probability P1 is greater than the second probability P2, the determining unit 27 determines that the barrage information is a garbage barrage; if the first probability P1 is less than the second probability P2, the determining unit 27 determines that the barrage information is a normal barrage. The garbage barrage is thereby identified.
  • the embodiment further provides a computer device for identifying the garbage barrage.
  • the computer device includes: a radio frequency (RF) circuit 310, a memory 320, an input unit 330, a display unit 340, and an audio circuit. 350, WiFi module 360, processor 370, and power supply 380 and other components.
  • The structure shown in FIG. 3 does not constitute a limitation on the computer device, which may include more or fewer components than those illustrated, combine certain components, or arrange the components differently.
  • The RF circuit 310 can be used for receiving and transmitting signals, and in particular, for receiving downlink information of a base station and delivering it to the processor 370 for processing.
  • RF circuit 310 includes, but is not limited to, at least one amplifier, transceiver, coupler, Low Noise Amplifier (LNA), duplexer, and the like.
  • the memory 320 can be used to store software programs and modules, and the processor 370 executes various functional applications and data processing of the computer devices by running software programs and modules stored in the memory 320.
  • the memory 320 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application required for at least one function, and the like; the storage data area may store data created according to usage of the computer device, and the like.
  • The memory 320 can include high-speed random access memory, and can also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device.
  • the input unit 330 can be configured to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the computer device.
  • the input unit 330 may include a touch panel 331 and other input devices 332.
  • The touch panel 331 can collect the user's input operations and drive the corresponding connecting device according to a preset program.
  • The touch panel 331 then sends the collected input information to the processor 370.
  • the input unit 330 may also include other input devices 332.
  • other input devices 332 may include, but are not limited to, one or more of a touch panel, function keys (such as volume control buttons, switch buttons, etc.), trackballs, mice, joysticks, and the like.
  • the display unit 340 can be used to display information input by the user or information provided to the user as well as various menus of the computer device.
  • the display unit 340 can include a display panel 341.
  • the display panel 341 can be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED), or the like.
  • The touch panel 331 can cover the display panel 341. When the touch panel 331 detects a touch operation on or near it, it transmits the operation to the processor 370 to determine the type of the touch event, and the processor 370 then provides a corresponding visual output on the display panel 341 according to the type of the input event.
  • Although the touch panel 331 and the display panel 341 are shown in FIG. 3 as two separate components implementing the input and output functions of the computer device, in some embodiments the touch panel 331 may be integrated with the display panel 341 to implement the input and output functions of the computer device.
  • An audio circuit 350, a speaker 351, and a microphone 352 can provide an audio interface between the user and the computer device.
  • The audio circuit 350 can transmit the electrical signal converted from the received audio data to the speaker 351, and the speaker 351 converts it into a sound signal for output.
  • WiFi is a short-range wireless transmission technology.
  • Through the WiFi module 360, the computer device can help users send and receive e-mails, browse web pages and access streaming media, providing users with wireless broadband Internet access.
  • Although FIG. 3 shows the WiFi module 360, it can be understood that it is not an essential part of the computer device and may be omitted as needed without changing the essence of the invention.
  • The processor 370 is the control center of the computer device. It connects the various parts of the entire computer device using various interfaces and lines, and performs the various functions of the computer device and processes data by running or executing the software programs and/or modules stored in the memory 320 and recalling the data stored in the memory 320, thereby monitoring the computer device as a whole.
  • the processor 370 may include one or more processing units; preferably, the processor 370 may integrate an application processor, wherein the application processor mainly processes an operating system, a user interface, an application, and the like.
  • the computer device also includes a power source 380 (such as a power adapter) that powers the various components.
  • the power source can be logically coupled to the processor 370 via a power management system.
  • the method, device and computer device for identifying the garbage barrage provided by the invention can provide at least the following beneficial effects:
  • the invention provides a method, a device and a computer device for identifying a garbage barrage.
  • The method comprises: performing feature extraction on the barrage information based on preset barrage information feature construction rules to acquire first barrage information; preprocessing the first barrage information to remove the data in the first barrage information that affects recognition by the naive Bayesian model; cutting the first barrage information into words according to the word-formation rules in the customized vocabulary of the live broadcast platform to form a word bag model; converting the word bag model into a word vector based on a preset mapping rule; performing term frequency-inverse document frequency (TF-IDF) weighting on each word in the word vector to obtain the TF-IDF weighting value of each word; and establishing the naive Bayesian model and, based on the TF-IDF weighting value of each word, using the naive Bayesian model to respectively calculate, in the case where all words appear, the first probability P1 that the barrage information is a garbage barrage and the second probability P2 that the barrage information is a normal barrage.
  • Because the TF-IDF weighting values are incorporated when the naive Bayesian model is used to calculate the probability of a garbage barrage, the accuracy of the calculation is improved, and the recognition accuracy of garbage barrages is improved. This prevents a large amount of garbage barrage information, which increases user participation and safeguards the number of users of the live broadcast platform.
  • Because the feature information is extracted in advance and words with a given intent are converted into corresponding specific identifiers, the number of extracted features can be effectively reduced when extracting the feature information of the barrage information, which reduces the feature dimension of the subsequent naive Bayesian model, reduces the computational complexity, and improves the calculation efficiency.
  • modules in the devices of the embodiments can be adaptively changed and placed in one or more devices different from the embodiment.
  • the modules or units or components of the embodiments may be combined into one module or unit or component, and further they may be divided into a plurality of sub-modules or sub-units or sub-components.
  • All features disclosed in this specification (including the accompanying claims, abstract and drawings), and all processes or units of any method or device so disclosed, may be combined in any combination.
  • Each feature disclosed in this specification (including the accompanying claims, the abstract and the drawings) may be replaced by alternative features that provide the same, equivalent or similar purpose.
  • the various component embodiments of the present invention may be implemented in hardware, or in a software module running on one or more processors, or in a combination thereof.
  • In practice, a microprocessor or a digital signal processor (DSP) may be used to implement some or all of the functions of some or all of the gateways, proxy servers and systems according to the embodiments of the present invention.
  • the invention can also be implemented as a device or device program (e.g., a computer program and a computer program product) for performing some or all of the methods described herein.
  • Such a program implementing the present invention may be stored on a computer readable storage medium or may be in the form of one or more signals.
  • Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form. The program, when executed by the processor, implements the steps of: performing feature extraction on the barrage information based on preset barrage information feature construction rules to acquire the first barrage information; preprocessing the first barrage information to remove the data in the first barrage information that affects recognition by the naive Bayesian model;
  • cutting the first barrage information into words according to the word-formation rules in the customized vocabulary of the live broadcast platform to form a word bag model; converting the word bag model into a word vector based on a preset mapping rule; performing term frequency-inverse document frequency (TF-IDF) weighting on each word in the word vector to obtain the TF-IDF weighting value of each word; establishing the naive Bayesian model and, based on the TF-IDF weighting value of each word, using the naive Bayesian model to respectively calculate, in the case where all words appear, the first probability P1 that the barrage information is a garbage barrage and the second probability P2 that the barrage information is a normal barrage; and determining whether the first probability P1 is greater than the second probability P2, and if so, determining that the barrage information is a garbage barrage.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a method, a device and a computer device for identifying garbage barrages. The method comprises: performing feature extraction based on preset barrage information feature construction rules to acquire first barrage information; cutting the first barrage information into words according to the word-formation rules in the customized vocabulary of the live broadcast platform to form a word bag model; converting the word bag model into a word vector based on a preset mapping rule; performing term frequency-inverse document frequency (TF-IDF) weighting on each word in the word vector to acquire the TF-IDF weighting value of each word; establishing a naive Bayesian model and, based on the TF-IDF weighting values of the words, using the naive Bayesian model to respectively calculate, in the case where all the words appear, a first probability that the barrage information is a garbage barrage and a second probability that the barrage information is a normal barrage; and determining whether the first probability is greater than the second probability, and if the first probability is greater than the second probability, determining that the barrage information is a garbage barrage.

Description

一种垃圾弹幕的识别方法、装置及计算机设备 技术领域
本发明属于直播平台的垃圾弹幕处理技术领域,尤其涉及一种垃圾弹幕的识别方法、装置及计算机设备。
Background
With the rapid growth of the live-streaming industry, its audience keeps expanding and the content on offer grows ever richer. While watching a stream, viewers can comment and interact by sending barrages (bullet-screen comments), which greatly increases user engagement and enriches the streamed content.
In general, every barrage a viewer sends goes to the live-streaming platform's server, which forwards it to all viewers in that live room. To profit from this, some abusive users flood live rooms with bursts of spam barrages, for example large volumes of advertising. Such advertising harassment directly reduces user engagement, shrinking the platform's user base and its revenue.
In the prior art, spam barrages are generally identified by manually extracted rules and fuzzy keyword matching, which wastes manpower and yields low identification accuracy.
Summary of the invention
In view of the problems in the prior art, embodiments of the present invention provide a method, apparatus and computer device for identifying spam barrages, to solve the technical problems that prior-art identification based on manually extracted rules and fuzzy keyword matching wastes manpower and has low identification accuracy.
The present invention provides a spam-barrage identification method, applied to a live-streaming platform, the method comprising:
extracting features from the barrage information according to preset barrage-information feature-construction rules to obtain first barrage information;
preprocessing the first barrage information to remove data that would interfere with recognition by a naive Bayes model;
segmenting the preprocessed first barrage information according to the word-formation rules of the live-streaming platform's custom lexicon to form a bag-of-words model;
converting the bag-of-words model into a word vector according to a preset mapping rule;
applying term frequency-inverse document frequency (TF-IDF) weighting to each word in the word vector to obtain each word's TF-IDF weight;
establishing the naive Bayes model and, based on each word's TF-IDF weight, using it to compute, given the occurrence of all words in the bag-of-words model, a first probability P1 that the barrage is spam and a second probability P2 that it is normal;
determining whether P1 exceeds P2 and, if so, determining the barrage to be spam.
In the above solution, preprocessing the first barrage information to remove data that would interfere with naive Bayes recognition comprises:
removing records whose barrage content is empty, punctuation within the barrage content, and records whose barrage type is empty.
In the above solution, segmenting the first barrage information according to the word-formation rules of the platform's custom lexicon to form a bag-of-words model comprises:
filtering out, according to the word-formation rules, words in the first barrage information that have no influence on naive Bayes recognition, to obtain the filtered words;
combining the filtered words in a predetermined order to form the bag-of-words model.
In the above solution, converting the bag-of-words model into a word vector according to a preset mapping rule comprises:
mapping each word of the bag-of-words model onto the corresponding dimension of a word vector of preset dimensionality, thereby converting the bag-of-words model into the word vector.
In the above solution, applying TF-IDF weighting to each word in the word vector to obtain each word's TF-IDF weight comprises:
computing the frequency TF with which each word occurs in the barrage information;
computing each word's inverse-document-frequency weight as IDF = log2(M), where M is the quotient of the total number of barrages and the number of barrages containing the word;
computing each word's TF-IDF weight as TF-IDF = TF * IDF.
In the above solution, computing, based on each word's TF-IDF weight and using the naive Bayes model, the first probability P1 that the barrage is spam given the occurrence of all words comprises:
computing P1 = P("spam barrage" | a1, a2, a3, ..., ai, ..., an) = (p("spam barrage" | a1) * TF-IDF(a1)) * (p("spam barrage" | a2) * TF-IDF(a2)) * (p("spam barrage" | a3) * TF-IDF(a3)) * ... * (p("spam barrage" | ai) * TF-IDF(ai)) * ... * (p("spam barrage" | an) * TF-IDF(an)), where ai is any one of the words, n is the number of words in the barrage, and TF-IDF(ai) is that word's TF-IDF weight.
In the above solution, computing, based on each word's TF-IDF weight and using the naive Bayes model, the second probability P2 that the barrage is normal given the occurrence of all words comprises:
computing P2 = P("normal barrage" | a1, a2, a3, ..., ai, ..., an) = (p("normal barrage" | a1) * TF-IDF(a1)) * (p("normal barrage" | a2) * TF-IDF(a2)) * (p("normal barrage" | a3) * TF-IDF(a3)) * ... * (p("normal barrage" | ai) * TF-IDF(ai)) * ... * (p("normal barrage" | an) * TF-IDF(an)), where ai is any one of the words, n is the number of words in the barrage, and TF-IDF(ai) is that word's TF-IDF weight.
The present invention provides a spam-barrage identification apparatus, applied to a live-streaming platform, the apparatus comprising:
an extraction unit configured to extract features from barrage information according to preset barrage-information feature-construction rules to obtain first barrage information;
a preprocessing unit configured to preprocess the first barrage information and remove data that would interfere with naive Bayes recognition;
a segmentation unit configured to segment the preprocessed first barrage information according to the word-formation rules of the platform's custom lexicon to form a bag-of-words model;
a conversion unit configured to convert the bag-of-words model into a word vector according to a preset mapping rule;
a weighting unit configured to apply term frequency-inverse document frequency (TF-IDF) weighting to each word in the word vector to obtain each word's TF-IDF weight;
an establishing unit configured to establish the naive Bayes model and, based on each word's TF-IDF weight, use it to compute, given the occurrence of all words in the bag-of-words model, a first probability P1 that the barrage is spam and a second probability P2 that it is normal;
a judging unit configured to determine whether P1 exceeds P2 and, if so, determine the barrage to be spam.
The present invention further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the following steps:
extracting features from barrage information according to preset barrage-information feature-construction rules to obtain first barrage information;
preprocessing the first barrage information to remove data that would interfere with naive Bayes recognition;
segmenting the preprocessed first barrage information according to the word-formation rules of the live-streaming platform's custom lexicon to form a bag-of-words model;
converting the bag-of-words model into a word vector according to a preset mapping rule;
applying term frequency-inverse document frequency (TF-IDF) weighting to each word in the word vector to obtain each word's TF-IDF weight;
establishing the naive Bayes model and, based on each word's TF-IDF weight, using it to compute, given the occurrence of all words in the bag-of-words model, a first probability P1 that the barrage is spam and a second probability P2 that it is normal;
determining whether P1 exceeds P2 and, if so, determining the barrage to be spam.
The present invention further provides a computer device for spam-barrage identification, comprising:
at least one processor; and
at least one memory communicatively connected to the processor, wherein
the memory stores program instructions executable by the processor, and the processor, by invoking the program instructions, can perform the method of any one of claims 1 to 7.
The present invention thus provides a spam-barrage identification method, apparatus and computer device comprising the steps set out above: feature extraction according to preset feature-construction rules, preprocessing, segmentation into a bag-of-words model according to the platform's custom lexicon, conversion to a word vector, TF-IDF weighting, computation of the probabilities P1 (spam) and P2 (normal) with the naive Bayes model, and determination of the barrage as spam when P1 exceeds P2. In this way the effective barrage content is vectorized through the bag-of-words model, and because features are extracted from the barrage information beforehand, the word vector expresses the barrage's true meaning more accurately. The probability computed by the Bayes model is therefore more precise, which improves spam-barrage identification accuracy, prevents live rooms from being flooded with spam, raises user engagement and safeguards the platform's user base.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be construed as limiting the invention; like reference symbols denote like parts throughout. In the drawings:
Fig. 1 is a schematic flowchart of the spam-barrage identification method provided in Embodiment 1 of the present invention;
Fig. 2 is a schematic structural diagram of the spam-barrage identification apparatus of Embodiment 2;
Fig. 3 is a schematic diagram of the overall structure of the computer device for spam-barrage identification provided in Embodiment 3.
Detailed description
Exemplary embodiments of the present disclosure are described in more detail below with reference to the drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and is not limited to the embodiments set forth here; rather, these embodiments are provided so that the disclosure will be understood more thoroughly and its scope conveyed fully to those skilled in the art.
The technical solution of the present invention is further detailed below with reference to the drawings and specific embodiments.
Embodiment 1
This embodiment provides a spam-barrage identification method. As shown in Fig. 1, the method comprises:
S101: extracting features from barrage information according to preset barrage-information feature-construction rules to obtain first barrage information.
In this step, before feature extraction, barrage information whose barrage type has already been labelled is retrieved as sample data from the HIVE data warehouse using SQL query statements. The barrage information includes the barrage content and the barrage type.
Features are then extracted from the barrage information according to the preset feature-construction rules to obtain the first barrage information. The feature-construction rules include representing words that match a certain class of feature by a specific tag.
Specifically, barrages contain many terms peculiar to live-streaming platforms. "666", for instance, is not just a number: in a live-streaming context it expresses applause. Likewise "裸狼" ("naked wolf") is a term of the Werewolf party game. Cheering terms such as "666" can therefore be tagged as "喝彩" (cheer); strings such as "QQ324567865" can be tagged as "QQ联系方式" (QQ contact); and digit strings matching the pattern of a mobile number, such as "13617258349", can be tagged as "手机联系方式" (phone contact). After these special words are tagged, the tags are added to the custom dictionary. Tags can also be personalized to the platform's barrage culture as actual needs require; adding "裸狼" to the custom lexicon, for example, ensures that later segmentation cuts it as one word rather than as the two words "裸" and "狼". The custom lexicon contains high-frequency personalized terms that differ from ordinary words, and is incrementally updated on a daily schedule to improve its accuracy.
Words with a particular intent are thus converted into corresponding tags, which effectively reduces the number of distinct values seen during feature extraction, lowering the feature dimensionality of the subsequent naive Bayes model, reducing computational complexity and improving efficiency.
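As a concrete illustration of the tagging step, the snippet below sketches how such feature-construction rules might be applied before segmentation. The regular expressions and the tag set here are assumptions for illustration only; the platform's actual rules are not disclosed in the text.

```python
import re

# Hypothetical feature-construction rules (illustrative, not the platform's
# real rule set): each pattern maps a class of strings to one tag, shrinking
# the value space seen by the later naive Bayes model.
NORMALIZE_RULES = [
    (re.compile(r"QQ\d{5,11}"), "QQ联系方式"),   # QQ-number strings
    (re.compile(r"1[3-9]\d{9}"), "手机联系方式"),  # 11-digit mobile numbers
    (re.compile(r"6{3,}"), "喝彩"),               # "666..." cheering
]

def normalize_features(text: str) -> str:
    """Replace every substring matching a feature pattern with its tag."""
    for pattern, tag in NORMALIZE_RULES:
        text = pattern.sub(tag, text)
    return text
```

Applying the rules in a fixed order keeps the tagging deterministic; for example `normalize_features("加QQ324567865详聊")` yields a string containing the single tag "QQ联系方式" instead of the raw account number.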
S102: preprocessing the first barrage information to remove data that would interfere with naive Bayes recognition.
After the first barrage information is obtained, it is preprocessed to eliminate unnecessary interference with later segmentation, removing the data that would affect recognition by the naive Bayes model.
For example, if the word "开心" (happy) contains a space, segmentation will cut it into "开" and "心", harming identification accuracy. Preprocessing therefore removes records whose barrage content is empty, punctuation within the barrage content, and records whose barrage type is empty.
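The preprocessing just described can be sketched as follows. The record layout (a dict with "content" and "type" fields) and the exact punctuation set are assumptions, since the text does not specify the storage format.

```python
import re

# Whitespace, ASCII punctuation, and a few common Chinese punctuation marks;
# the exact set is an assumption for illustration.
PUNCT_RE = re.compile(r"[\s!-/:-@\[-`{-~，。！？、：；“”‘’（）…·]+")

def preprocess(barrages):
    """Drop records whose content or type is empty, and strip punctuation
    and whitespace that would mis-split words such as '开 心'."""
    cleaned = []
    for b in barrages:
        content = PUNCT_RE.sub("", b.get("content") or "")
        if content and b.get("type"):
            cleaned.append({"content": content, "type": b["type"]})
    return cleaned
```

With this sketch, a record containing "开 心!" is normalized to "开心", while records with empty content or an empty type are discarded before segmentation.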
S103: segmenting the preprocessed first barrage information according to the word-formation rules of the platform's custom lexicon to form a bag-of-words model.
In this step, after preprocessing, the first barrage information is segmented according to the word-formation rules of the platform's custom lexicon and de-duplicated to form a bag-of-words model.
Specifically, words with little or no influence on naive Bayes recognition are filtered out of the first barrage information according to the word-formation rules, and the filtered words are combined in a predetermined order to form the bag-of-words model. Words with little or no influence on recognition can be added to a custom stop-word list.
Here the bag-of-words model turns each barrage into a collection of words. For example, for the barrage "主播今天的裸狼玩的太棒了" ("the streamer played the naked-wolf round brilliantly today"), segmentation followed by filtering of the stop words "今天", "的" and "了" leaves the words "主播", "裸狼", "玩" and "太棒", so the barrage's bag-of-words model is ["主播", "裸狼", "玩", "太棒"].
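A minimal sketch of the segmentation and bag-of-words construction follows. The greedy longest-match segmenter and the tiny dictionary are illustrative stand-ins only; a production system would more likely use a full segmenter such as the open-source jieba library loaded with the platform's custom lexicon (jieba.load_userdict).

```python
# Toy custom lexicon and stop-word list, taken from the example barrage.
CUSTOM_DICT = {"主播", "裸狼", "玩", "太棒", "今天", "的", "了"}
STOP_WORDS = {"今天", "的", "了"}
MAX_WORD_LEN = max(len(w) for w in CUSTOM_DICT)

def cut_words(text):
    """Greedy longest-match segmentation against the custom lexicon, so
    '裸狼' is cut as one word rather than '裸' and '狼'."""
    words, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + MAX_WORD_LEN), i, -1):
            if text[i:j] in CUSTOM_DICT or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

def bag_of_words(text):
    """Segment, drop stop words, and de-duplicate preserving first-seen order."""
    seen, bag = set(), []
    for w in cut_words(text):
        if w not in STOP_WORDS and w not in seen:
            seen.add(w)
            bag.append(w)
    return bag
```

Running `bag_of_words` on the example barrage reproduces the bag-of-words model given in the text.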
S104: converting the bag-of-words model into a word vector according to a preset mapping rule, and applying term frequency-inverse document frequency (TF-IDF) weighting to each word in the word vector to obtain each word's TF-IDF weight.
In this step, once the bag-of-words model is obtained, each of its words is mapped onto the corresponding dimension of the word vector according to the preset mapping rule, converting the bag-of-words model into the word vector. Here the word vector has 400,000 dimensions: each barrage is represented as a 400,000-dimensional vector V, with each vector position corresponding to one word.
Taking the bag-of-words model ["主播", "裸狼", "玩", "太棒"] as an example, "主播" can be mapped to position V(0), "裸狼" to V(1), "玩" to V(2) and "太棒" to V(3), giving the word vector (1, 1, 1, 1, 0, 0, ..., 0), where the ellipsis stands for the remaining 399,996 zeros (the word vector has a fixed dimensionality of 400,000).
Note that the mapping rule may be sequential or random; it is not limited here.
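The mapping step can be sketched as below, using the 400,000-dimension figure and the sequential word-to-position assignment from the example above. The helper names are illustrative, and a production system would persist the word-to-index map across barrages.

```python
DIM = 400_000  # fixed word-vector dimensionality, as in the example

def build_index(vocab):
    """Sequential mapping rule: the i-th word goes to vector position i."""
    return {w: i for i, w in enumerate(vocab)}

def to_vector(bag, index):
    """Convert a bag-of-words model into a fixed-length count vector."""
    v = [0] * DIM
    for w in bag:
        if w in index:
            v[index[w]] += 1
    return v
```

For the example bag ["主播", "裸狼", "玩", "太棒"], this yields a vector with ones at positions 0 to 3 and zeros in the remaining 399,996 dimensions.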
TF-IDF weighting is then applied to each word in the word vector to obtain each word's TF-IDF weight.
Specifically, the frequency TF with which each word occurs in the barrage information is computed, and each word's inverse-document-frequency weight IDF is computed from formula (1):
IDF = log2(M)                        (1)
In formula (1), M is the quotient of the total number of barrages and the number of barrages containing the word.
Each word's TF-IDF weight is then computed from formula (2):
TF-IDF = TF * IDF                    (2)
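Formulas (1) and (2) translate directly into code. The sketch below assumes the corpus is held as a list of word lists, that TF is the relative frequency of the word within one barrage (the text does not define TF precisely), and that every queried word occurs in at least one barrage, since the quotient M is otherwise undefined.

```python
import math

def tf(word, bag):
    """Term frequency of `word` within one barrage's word list."""
    return bag.count(word) / len(bag)

def idf(word, corpus):
    """Formula (1): IDF = log2(M), with M = (total barrages) /
    (barrages containing the word)."""
    containing = sum(1 for bag in corpus if word in bag)
    return math.log2(len(corpus) / containing)

def tf_idf(word, bag, corpus):
    """Formula (2): TF-IDF = TF * IDF."""
    return tf(word, bag) * idf(word, corpus)
```

A word appearing in half of the barrages gets IDF = log2(2) = 1, while a word appearing in every barrage gets IDF = 0, so platform-wide filler words are automatically down-weighted.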
S105: establishing the naive Bayes model and, based on each word's TF-IDF weight, using it to compute, given the occurrence of all the words, the first probability P1 that the barrage is spam and the second probability P2 that it is normal.
In this step, once each word's TF-IDF weight is obtained, the naive Bayes model is established and used, based on those weights, to compute, given the occurrence of all words in the bag-of-words model, the first probability P1 that the barrage is spam and the second probability P2 that it is normal.
Specifically, P(A|B) denotes the probability that event A occurs given that event B has already occurred, i.e. the conditional probability of A given B. Its basic formula is formula (3):
P(A|B) = P(B|A) * P(A) / P(B)                  (3)
When A in the naive Bayes model is "spam barrage", the first probability P1 can be computed from formula (4):
P1 = P("spam barrage" | a1, a2, a3, ..., ai, ..., an) = (p("spam barrage" | a1) * TF-IDF(a1)) * (p("spam barrage" | a2) * TF-IDF(a2)) * (p("spam barrage" | a3) * TF-IDF(a3)) * ... * (p("spam barrage" | ai) * TF-IDF(ai)) * ... * (p("spam barrage" | an) * TF-IDF(an))        (4)
In formula (4), ai is any one of the words, n is the number of words in the barrage, and TF-IDF(ai) is that word's TF-IDF weight.
When A in the naive Bayes model is "normal barrage", the second probability P2 can be computed from formula (5):
P2 = P("normal barrage" | a1, a2, a3, ..., ai, ..., an) = (p("normal barrage" | a1) * TF-IDF(a1)) * (p("normal barrage" | a2) * TF-IDF(a2)) * (p("normal barrage" | a3) * TF-IDF(a3)) * ... * (p("normal barrage" | ai) * TF-IDF(ai)) * ... * (p("normal barrage" | an) * TF-IDF(an))        (5)
In formula (5), ai is any one of the words, n is the number of words in the barrage, 1 ≤ i ≤ n, and TF-IDF(ai) is that word's TF-IDF weight.
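A minimal sketch of formulas (4) and (5) follows. The text does not specify how p(class | word) is estimated, so the sketch assumes a simple Laplace-smoothed estimate from labelled samples; the function names and the sample format are illustrative.

```python
def word_class_prob(word, cls, samples):
    """p(cls | word) over labelled (bag, label) samples, with add-one
    (Laplace) smoothing over the two classes -- an assumption, since the
    text leaves the estimator unspecified."""
    with_word = [label for bag, label in samples if word in bag]
    return (sum(1 for l in with_word if l == cls) + 1) / (len(with_word) + 2)

def class_score(bag, cls, samples, weight):
    """Product form of formulas (4)/(5); `weight` maps word -> TF-IDF."""
    score = 1.0
    for w in bag:
        score *= word_class_prob(w, cls, samples) * weight.get(w, 1.0)
    return score

def is_spam(bag, samples, weight):
    """S106 decision: spam iff P1 > P2."""
    p1 = class_score(bag, "垃圾弹幕", samples, weight)  # first probability P1
    p2 = class_score(bag, "正常弹幕", samples, weight)  # second probability P2
    return p1 > p2
```

In practice the product of many small factors underflows, so an implementation would usually compare sums of logarithms instead; the plain product is kept here to mirror formulas (4) and (5).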
S106: determining whether the first probability P1 exceeds the second probability P2 and, if so, determining the barrage to be spam.
In this step, after the first probability P1 that the barrage is spam and the second probability P2 that it is normal have been computed, the two are compared: if P1 is greater than P2, the barrage is determined to be spam; if P1 is less than P2, the barrage is determined to be normal. Spam barrages are thus identified.
Embodiment 2
Corresponding to Embodiment 1, this embodiment provides a spam-barrage identification apparatus. As shown in Fig. 2, the apparatus comprises an extraction unit 21, a preprocessing unit 22, a segmentation unit 23, a conversion unit 24, a weighting unit 25, an establishing unit 26 and a judging unit 27.
Before the extraction unit 21 extracts features from barrage information, barrage information whose barrage type has already been labelled is retrieved as sample data from the HIVE data warehouse using SQL query statements. The barrage information includes the barrage content and the barrage type.
The extraction unit 21 then extracts features from the barrage information according to the preset feature-construction rules to obtain the first barrage information. The feature-construction rules include representing words that match a certain class of feature by a specific tag.
Specifically, barrages contain many terms peculiar to live-streaming platforms: "666" is not just a number but expresses applause in a live-streaming context, and "裸狼" is a Werewolf-game term. Cheering terms such as "666" can therefore be tagged as "喝彩" (cheer); strings such as "QQ324567865" can be tagged as "QQ联系方式" (QQ contact); and mobile-number-like digit strings such as "13617258349" can be tagged as "手机联系方式" (phone contact). After these special words are tagged, the tags are added to the custom dictionary. Tags can also be personalized to the platform's barrage culture as actual needs require; adding "裸狼" to the custom lexicon, for example, ensures that later segmentation cuts it as one word rather than as the two words "裸" and "狼". The custom lexicon contains high-frequency personalized terms that differ from ordinary words, and is incrementally updated on a daily schedule to improve its accuracy.
Words with a particular intent are thus converted into corresponding tags, which effectively reduces the number of distinct values seen during feature extraction, lowering the feature dimensionality of the subsequent naive Bayes model, reducing computational complexity and improving efficiency.
After the first barrage information is obtained, the preprocessing unit 22 preprocesses it, removing the data that would affect recognition by the naive Bayes model.
For example, if the word "开心" contains a space, segmentation will cut it into "开" and "心", harming identification accuracy; when preprocessing the first barrage information, the preprocessing unit 22 therefore removes records whose barrage content is empty, punctuation within the barrage content, and records whose barrage type is empty.
After the preprocessing unit 22 has preprocessed the first barrage information, the segmentation unit 23 segments it according to the word-formation rules of the platform's custom lexicon and de-duplicates it to form a bag-of-words model.
Specifically, the segmentation unit 23 filters out words with little or no influence on naive Bayes recognition according to the word-formation rules, and combines the filtered words in a predetermined order to form the bag-of-words model. Words with little or no influence on recognition can be added to a custom stop-word list.
Here the bag-of-words model turns each barrage into a collection of words. For example, for the barrage "主播今天的裸狼玩的太棒了", segmentation followed by filtering of the stop words "今天", "的" and "了" leaves the words "主播", "裸狼", "玩" and "太棒", so the barrage's bag-of-words model is ["主播", "裸狼", "玩", "太棒"].
Once the bag-of-words model is obtained, the conversion unit 24 maps each of its words onto the corresponding dimension of the word vector according to the preset mapping rule, converting the bag-of-words model into the word vector. Here the word vector has 400,000 dimensions: each barrage is represented as a 400,000-dimensional vector V, with each vector position corresponding to one word.
Taking the bag-of-words model ["主播", "裸狼", "玩", "太棒"] as an example, the conversion unit 24 can map "主播" to position V(0), "裸狼" to V(1), "玩" to V(2) and "太棒" to V(3), giving the word vector (1, 1, 1, 1, 0, 0, ..., 0), where the ellipsis stands for the remaining 399,996 zeros (the word vector has a fixed dimensionality of 400,000).
Note that the mapping rule may be sequential or random; it is not limited here.
The weighting unit 25 then applies TF-IDF weighting to each word in the word vector to obtain each word's TF-IDF weight.
Specifically, the weighting unit 25 computes the frequency TF with which each word occurs in the barrage information, and computes each word's inverse-document-frequency weight IDF from formula (1):
IDF = log2(M)                     (1)
In formula (1), M is the quotient of the total number of barrages and the number of barrages containing the word.
Each word's TF-IDF weight is then computed from formula (2):
TF-IDF = TF * IDF                     (2)
Once each word's TF-IDF weight is obtained, the establishing unit 26 establishes the naive Bayes model and, based on those weights, uses it to compute, given the occurrence of all the words, the first probability P1 that the barrage is spam and the second probability P2 that it is normal.
P(A|B) denotes the probability that event A occurs given that event B has already occurred, i.e. the conditional probability of A given B. Its basic formula is formula (3):
P(A|B) = P(B|A) * P(A) / P(B)                  (3)
When A in the naive Bayes model is "spam barrage", the establishing unit 26 can compute the first probability P1 from formula (4):
P1 = P("spam barrage" | a1, a2, a3, ..., ai, ..., an) = (p("spam barrage" | a1) * TF-IDF(a1)) * (p("spam barrage" | a2) * TF-IDF(a2)) * (p("spam barrage" | a3) * TF-IDF(a3)) * ... * (p("spam barrage" | ai) * TF-IDF(ai)) * ... * (p("spam barrage" | an) * TF-IDF(an))                     (4)
In formula (4), ai is any one of the words, n is the number of words in the barrage, and TF-IDF(ai) is that word's TF-IDF weight.
When A in the naive Bayes model is "normal barrage", the establishing unit 26 can compute the second probability P2 from formula (5):
P2 = P("normal barrage" | a1, a2, a3, ..., ai, ..., an) = (p("normal barrage" | a1) * TF-IDF(a1)) * (p("normal barrage" | a2) * TF-IDF(a2)) * (p("normal barrage" | a3) * TF-IDF(a3)) * ... * (p("normal barrage" | ai) * TF-IDF(ai)) * ... * (p("normal barrage" | an) * TF-IDF(an))                (5)
In formula (5), ai is any one of the words, n is the number of words in the barrage, 1 ≤ i ≤ n, and TF-IDF(ai) is that word's TF-IDF weight.
After the establishing unit 26 has computed the first probability P1 and the second probability P2, the judging unit 27 determines whether P1 exceeds P2: if P1 is greater than P2, it determines the barrage to be spam; if P1 is less than P2, it determines the barrage to be normal. Spam barrages are thus identified.
Embodiment 3
This embodiment further provides a computer device for spam-barrage identification. As shown in Fig. 3, the computer device comprises a radio-frequency (RF) circuit 310, a memory 320, an input unit 330, a display unit 340, an audio circuit 350, a WiFi module 360, a processor 370, a power supply 380 and other components. Those skilled in the art will understand that the structure shown in Fig. 3 does not limit the computer device, which may include more or fewer components than shown, combine certain components, or arrange components differently.
Each component of the computer device is described below with reference to Fig. 3:
The RF circuit 310 may be used to receive and send signals; in particular, it receives downlink information from a base station and passes it to the processor 370 for processing. The RF circuit 310 typically includes, without limitation, at least one amplifier, a transceiver, a coupler, a low-noise amplifier (LNA), a duplexer, and the like.
The memory 320 may store software programs and modules; the processor 370 executes the computer device's various functional applications and data processing by running the software programs and modules stored in the memory 320. The memory 320 may mainly comprise a program storage area, which may store the operating system and the application programs required by at least one function, and a data storage area, which may store data created through use of the computer device. In addition, the memory 320 may include high-speed random-access memory and may also include non-volatile memory, for example at least one magnetic-disk storage device, flash-memory device, or other non-volatile solid-state storage device.
The input unit 330 may receive entered digit or character information and generate key-signal inputs related to user settings and function control of the computer device. Specifically, the input unit 330 may include a touch panel 331 and other input devices 332. The touch panel 331 can collect the user's input operations on it and drive the corresponding connection apparatus according to a preset program, passing the collected information to the processor 370. Besides the touch panel 331, the input unit 330 may also include other input devices 332, which may include, without limitation, one or more of function keys (such as volume-control and power keys), a trackball, a mouse and a joystick.
The display unit 340 may display information entered by the user, information provided to the user, and the computer device's menus. The display unit 340 may include a display panel 341, which may optionally be configured as a liquid-crystal display (LCD), an organic light-emitting diode (OLED) display, or the like. Further, the touch panel 331 may cover the display panel 341: when the touch panel 331 detects a touch operation on or near it, it passes the operation to the processor 370 to determine the type of touch event, and the processor 370 then provides the corresponding visual output on the display panel 341 according to the event type. Although in Fig. 3 the touch panel 331 and the display panel 341 are shown as two separate components implementing the device's input and output functions, in some embodiments they may be integrated to implement both.
The audio circuit 350, loudspeaker 351 and microphone 352 may provide the audio interface between the user and the computer device: the audio circuit 350 may convert received audio data into an electrical signal and transmit it to the loudspeaker 351, which converts it into an audible sound-signal output.
WiFi is a short-range wireless transmission technology; via the WiFi module 360 the computer device can help the user send and receive e-mail, browse web pages, access streaming media and so on, providing wireless broadband Internet access. Although Fig. 3 shows the WiFi module 360, it is not an essential part of the computer device and may be omitted as needed without changing the essence of the invention.
The processor 370 is the control centre of the computer device. It connects all parts of the device through various interfaces and lines and, by running or executing the software programs and/or modules stored in the memory 320 and invoking the data stored there, performs the device's functions, processes data and monitors the device as a whole. Optionally, the processor 370 may include one or more processing units; preferably, the processor 370 may integrate an application processor, which mainly handles the operating system, user interface and application programs.
The computer device also includes a power supply 380 (such as a power adapter) supplying power to the components; preferably, the power supply may be logically connected to the processor 370 through a power-management system.
The beneficial effects that the spam-barrage identification method, apparatus and computer device of the present invention can bring are at least the following:
By the method described above — feature extraction according to preset feature-construction rules, preprocessing, segmentation into a bag-of-words model according to the platform's custom lexicon, conversion to a word vector, TF-IDF weighting, computation of the probabilities P1 (spam) and P2 (normal) with the naive Bayes model, and determination of the barrage as spam when P1 exceeds P2 — the effective barrage content is vectorized through the bag-of-words model, and because features are extracted from the barrage information beforehand, the word vector expresses the barrage's true meaning more accurately. The probability computed by the Bayes model is therefore more precise, which improves spam-barrage identification accuracy, prevents live rooms from being flooded with spam, raises user engagement and safeguards the platform's user base. In addition, because feature extraction converts words with a particular intent into corresponding tags, the number of distinct feature values is effectively reduced, lowering the feature dimensionality of the subsequent naive Bayes model, reducing computational complexity and improving efficiency.
The algorithms and displays provided here are not inherently related to any particular computer, virtual system or other apparatus. Various general-purpose systems may also be used with the teachings herein, and the structure required to construct such systems is apparent from the description above. Moreover, the present invention is not directed to any particular programming language: it should be understood that the content of the invention described here may be implemented in various programming languages, and the above description of a specific language is made to disclose the best mode of the invention.
The specification provided here sets out numerous specific details. It will be understood, however, that embodiments of the invention may be practised without these specific details. In some instances, well-known methods, structures and techniques are not shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, to streamline the disclosure and aid understanding of one or more of the various inventive aspects, the features of the invention are sometimes grouped together in a single embodiment, figure or description thereof in the above description of exemplary embodiments. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. The claims following the detailed description are therefore hereby expressly incorporated into that detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will understand that the modules in the devices of an embodiment may be changed adaptively and arranged in one or more devices different from that embodiment. The modules or units or components of an embodiment may be combined into one module or unit or component, and may furthermore be divided into multiple sub-modules or sub-units or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract and drawings) may be replaced by an alternative feature serving the same, an equivalent or a similar purpose.
Furthermore, those skilled in the art will understand that, although some embodiments herein include certain features that are included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The component embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may in practice be used to implement some or all of the functions of some or all of the components of a gateway, proxy server or system according to embodiments of the invention. The invention may also be implemented as a device or apparatus program (for example, a computer program or computer program product) for performing part or all of the methods described here. Such a program implementing the invention may be stored on a computer-readable storage medium, or may take the form of one or more signals, which may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form. When executed by a processor, the program implements the following steps: extracting features from barrage information according to preset barrage-information feature-construction rules to obtain first barrage information; preprocessing the first barrage information to remove data that would interfere with naive Bayes recognition; segmenting the first barrage information according to the word-formation rules of the live-streaming platform's custom lexicon to form a bag-of-words model; converting the bag-of-words model into a word vector according to a preset mapping rule; applying TF-IDF (term frequency-inverse document frequency) weighting to each word in the word vector to obtain each word's TF-IDF weight; establishing the naive Bayes model and, based on each word's TF-IDF weight, using it to compute, given the occurrence of all words, the first probability P1 that the barrage is spam and the second probability P2 that it is normal; determining whether P1 exceeds P2 and, if so, determining the barrage to be spam.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art may design alternative embodiments without departing from the scope of the appended claims. In the claims, no reference sign placed between parentheses shall be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim; the word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third and so on does not indicate any order; these words may be interpreted as names.
The above are only preferred embodiments of the present invention and are not intended to limit its scope of protection; any modification, equivalent substitution, improvement and the like made within the spirit and principles of the invention shall be included within its scope of protection.

Claims (10)

  1. A spam-barrage identification method, characterized in that it is applied to a live-streaming platform and comprises:
    extracting features from barrage information according to preset barrage-information feature-construction rules to obtain first barrage information;
    preprocessing the first barrage information to remove data that would interfere with recognition by a naive Bayes model;
    segmenting the preprocessed first barrage information according to the word-formation rules of the live-streaming platform's custom lexicon to form a bag-of-words model;
    converting the bag-of-words model into a word vector according to a preset mapping rule;
    applying term frequency-inverse document frequency (TF-IDF) weighting to each word in the word vector to obtain each word's TF-IDF weight;
    establishing the naive Bayes model and, based on each word's TF-IDF weight, using it to compute, given the occurrence of all words in the bag-of-words model, a first probability P1 that the barrage is spam and a second probability P2 that it is normal;
    determining whether P1 exceeds P2 and, if so, determining the barrage to be spam.
  2. The method of claim 1, characterized in that preprocessing the first barrage information to remove data that would interfere with naive Bayes recognition comprises:
    removing records whose barrage content is empty, punctuation within the barrage content, and records whose barrage type is empty.
  3. The method of claim 1, characterized in that segmenting the first barrage information according to the word-formation rules of the live-streaming platform's custom lexicon to form a bag-of-words model comprises:
    filtering out, according to the word-formation rules, words in the first barrage information that have no influence on naive Bayes recognition, to obtain the filtered words;
    combining the filtered words in a predetermined order to form the bag-of-words model.
  4. The method of claim 1, characterized in that converting the bag-of-words model into a word vector according to a preset mapping rule comprises:
    mapping each word of the bag-of-words model onto the corresponding dimension of a word vector of preset dimensionality, thereby converting the bag-of-words model into the word vector.
  5. The method of claim 1, characterized in that applying TF-IDF weighting to each word in the word vector to obtain each word's TF-IDF weight comprises:
    computing the frequency TF with which each word occurs in the barrage information;
    computing each word's inverse-document-frequency weight as IDF = log2(M), where M is the quotient of the total number of barrages and the number of barrages containing the word;
    computing each word's TF-IDF weight as TF-IDF = TF * IDF.
  6. The method of claim 1, characterized in that computing, based on each word's TF-IDF weight and using the naive Bayes model, the first probability P1 that the barrage is spam given the occurrence of all words comprises:
    computing P1 = P("spam barrage" | a1, a2, a3, ..., ai, ..., an) = (p("spam barrage" | a1) * TF-IDF(a1)) * (p("spam barrage" | a2) * TF-IDF(a2)) * (p("spam barrage" | a3) * TF-IDF(a3)) * ... * (p("spam barrage" | ai) * TF-IDF(ai)) * ... * (p("spam barrage" | an) * TF-IDF(an)), where ai is any one of the words, n is the number of words in the barrage, and TF-IDF(ai) is that word's TF-IDF weight.
  7. The method of claim 1, characterized in that computing, based on each word's TF-IDF weight and using the naive Bayes model, the second probability P2 that the barrage is normal given the occurrence of all words comprises:
    computing P2 = P("normal barrage" | a1, a2, a3, ..., ai, ..., an) = (p("normal barrage" | a1) * TF-IDF(a1)) * (p("normal barrage" | a2) * TF-IDF(a2)) * (p("normal barrage" | a3) * TF-IDF(a3)) * ... * (p("normal barrage" | ai) * TF-IDF(ai)) * ... * (p("normal barrage" | an) * TF-IDF(an)), where ai is any one of the words, n is the number of words in the barrage, and TF-IDF(ai) is that word's TF-IDF weight.
  8. A spam-barrage identification apparatus, characterized in that it is applied to a live-streaming platform and comprises:
    an extraction unit configured to extract features from barrage information according to preset barrage-information feature-construction rules to obtain first barrage information;
    a preprocessing unit configured to preprocess the first barrage information and remove data that would interfere with naive Bayes recognition;
    a segmentation unit configured to segment the preprocessed first barrage information according to the word-formation rules of the live-streaming platform's custom lexicon to form a bag-of-words model;
    a conversion unit configured to convert the bag-of-words model into a word vector according to a preset mapping rule;
    a weighting unit configured to apply term frequency-inverse document frequency (TF-IDF) weighting to each word in the word vector to obtain each word's TF-IDF weight;
    an establishing unit configured to establish the naive Bayes model and, based on each word's TF-IDF weight, use it to compute, given the occurrence of all words in the bag-of-words model, a first probability P1 that the barrage is spam and a second probability P2 that it is normal;
    a judging unit configured to determine whether P1 exceeds P2 and, if so, determine the barrage to be spam.
  9. A computer-readable storage medium storing a computer program, characterized in that the program, when executed by a processor, implements the following steps:
    extracting features from barrage information according to preset barrage-information feature-construction rules to obtain first barrage information;
    preprocessing the first barrage information to remove data that would interfere with naive Bayes recognition;
    segmenting the preprocessed first barrage information according to the word-formation rules of the live-streaming platform's custom lexicon to form a bag-of-words model;
    converting the bag-of-words model into a word vector according to a preset mapping rule;
    applying term frequency-inverse document frequency (TF-IDF) weighting to each word in the word vector to obtain each word's TF-IDF weight;
    establishing the naive Bayes model and, based on each word's TF-IDF weight, using it to compute, given the occurrence of all words in the bag-of-words model, a first probability P1 that the barrage is spam and a second probability P2 that it is normal;
    determining whether P1 exceeds P2 and, if so, determining the barrage to be spam.
  10. A computer device for spam-barrage identification, characterized in that it comprises:
    at least one processor; and
    at least one memory communicatively connected to the processor, wherein
    the memory stores program instructions executable by the processor, and the processor, by invoking the program instructions, can perform the method of any one of claims 1 to 7.
PCT/CN2018/082176 2017-06-28 2018-04-08 Method, apparatus and computer device for identifying spam barrages (一种垃圾弹幕的识别方法、装置及计算机设备) WO2019001075A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710506120.2A CN107480123B (zh) 2017-06-28 2017-06-28 一种垃圾弹幕的识别方法、装置及计算机设备
CN201710506120.2 2017-06-28

Publications (1)

Publication Number Publication Date
WO2019001075A1 true WO2019001075A1 (zh) 2019-01-03

Family

ID=60595017

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/082176 WO2019001075A1 (zh) 2017-06-28 2018-04-08 一种垃圾弹幕的识别方法、装置及计算机设备

Country Status (2)

Country Link
CN (1) CN107480123B (zh)
WO (1) WO2019001075A1 (zh)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107480123B (zh) * 2017-06-28 2020-10-16 武汉斗鱼网络科技有限公司 一种垃圾弹幕的识别方法、装置及计算机设备
CN108846431B (zh) * 2018-06-05 2021-09-28 成都信息工程大学 基于改进贝叶斯模型的视频弹幕情感分类方法
CN109145291A (zh) * 2018-07-25 2019-01-04 广州虎牙信息科技有限公司 一种弹幕关键词筛选的方法、装置、设备及存储介质
CN109062905B (zh) * 2018-09-04 2022-06-24 武汉斗鱼网络科技有限公司 一种弹幕文本价值评价方法、装置、设备及介质
CN109189889B (zh) * 2018-09-10 2021-03-12 武汉斗鱼网络科技有限公司 一种弹幕识别模型建立方法、装置、服务器及介质
CN109145308B (zh) * 2018-09-28 2022-07-12 乐山师范学院 一种基于改进朴素贝叶斯的涉密文本识别方法
CN109408639B (zh) * 2018-10-31 2022-05-31 广州虎牙科技有限公司 一种弹幕分类方法、装置、设备和存储介质
CN109511000B (zh) * 2018-11-06 2021-10-15 武汉斗鱼网络科技有限公司 弹幕类别确定方法、装置、设备及存储介质
CN109766435A (zh) * 2018-11-06 2019-05-17 武汉斗鱼网络科技有限公司 弹幕类别识别方法、装置、设备及存储介质
CN110430448B (zh) * 2019-07-31 2021-09-03 北京奇艺世纪科技有限公司 一种弹幕处理方法、装置及电子设备
CN111476321B (zh) * 2020-05-18 2022-05-17 哈尔滨工程大学 基于特征加权贝叶斯优化算法的空中飞行物识别方法
CN112527965A (zh) * 2020-12-18 2021-03-19 国家电网有限公司客户服务中心 基于专业库和闲聊库相结合的自动问答实现方法和装置
CN113673376B (zh) * 2021-08-03 2023-09-01 北京奇艺世纪科技有限公司 弹幕生成方法、装置、计算机设备和存储介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102663093A (zh) * 2012-04-10 2012-09-12 中国科学院计算机网络信息中心 不良网站检测方法及设备
US8725732B1 (en) * 2009-03-13 2014-05-13 Google Inc. Classifying text into hierarchical categories
CN104021302A (zh) * 2014-06-18 2014-09-03 北京邮电大学 一种基于贝叶斯文本分类模型的辅助挂号方法
CN106210770A (zh) * 2016-07-11 2016-12-07 北京小米移动软件有限公司 一种显示弹幕信息的方法和装置
CN107480123A (zh) * 2017-06-28 2017-12-15 武汉斗鱼网络科技有限公司 一种垃圾弹幕的识别方法、装置及计算机设备

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8725732B1 (en) * 2009-03-13 2014-05-13 Google Inc. Classifying text into hierarchical categories
CN102663093A (zh) * 2012-04-10 2012-09-12 中国科学院计算机网络信息中心 不良网站检测方法及设备
CN104021302A (zh) * 2014-06-18 2014-09-03 北京邮电大学 一种基于贝叶斯文本分类模型的辅助挂号方法
CN106210770A (zh) * 2016-07-11 2016-12-07 北京小米移动软件有限公司 一种显示弹幕信息的方法和装置
CN107480123A (zh) * 2017-06-28 2017-12-15 武汉斗鱼网络科技有限公司 一种垃圾弹幕的识别方法、装置及计算机设备

Also Published As

Publication number Publication date
CN107480123A (zh) 2017-12-15
CN107480123B (zh) 2020-10-16

Similar Documents

Publication Publication Date Title
WO2019001075A1 (zh) 一种垃圾弹幕的识别方法、装置及计算机设备
CN110418208B (zh) 一种基于人工智能的字幕确定方法和装置
US20150332154A1 (en) System and method for identifying social trends
KR102716178B1 (ko) 사용자 발화를 처리하는 시스템 및 그 시스템의 제어 방법
CN107958042B (zh) 一种目标专题的推送方法及移动终端
CN110069769B (zh) 应用标签生成方法、装置及存储设备
CN107844992A (zh) 评论信息处理方法、装置、终端设备及存储介质
CN111312233A (zh) 一种语音数据的识别方法、装置及系统
CN103853757A (zh) 网络的信息展示方法和系统、终端和信息展示处理装置
KR102707293B1 (ko) 사용자 음성 입력을 처리하는 장치
CN112382294A (zh) 语音识别方法、装置、电子设备和存储介质
CN103501487A (zh) 分类器更新方法、装置、终端、服务器及系统
CN106056350A (zh) 一种电子邮件的信息抽离方法、装置和系统
CN113312451A (zh) 文本标签确定方法和装置
CN113157984A (zh) 处理方法、终端设备及存储介质
CN116307394A (zh) 产品用户体验评分方法、装置、介质及设备
CN110337008B (zh) 视频互动调整方法、装置、设备及存储介质
CN113852835A (zh) 直播音频处理方法、装置、电子设备以及存储介质
CN111353422A (zh) 信息提取方法、装置及电子设备
CN109451447A (zh) 一种鉴别垃圾信息的方法、装置、存储介质和设备
CN113868092B (zh) 应用确定方法、应用确定装置、电子设备和可读存储介质
CN114399058B (zh) 一种模型更新的方法、相关装置、设备以及存储介质
CN114201958B (zh) 网络资源数据处理方法、系统及电子设备
CN117012202B (zh) 语音通道识别方法、装置、存储介质及电子设备
CN114676251A (zh) 分类模型确定方法、装置、设备及存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18823225

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18823225

Country of ref document: EP

Kind code of ref document: A1