CN109213988B - Barrage theme extraction method, medium, equipment and system based on N-gram model - Google Patents

Publication number
CN109213988B
CN109213988B · Application CN201710514238.XA
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710514238.XA
Other languages
Chinese (zh)
Other versions
CN109213988A (en)
Inventor
龚灿
陈少杰
张文明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Douyu Network Technology Co Ltd
Original Assignee
Wuhan Douyu Network Technology Co Ltd
Application filed by Wuhan Douyu Network Technology Co Ltd filed Critical Wuhan Douyu Network Technology Co Ltd
Priority to CN201710514238.XA priority Critical patent/CN109213988B/en
Publication of CN109213988A publication Critical patent/CN109213988A/en
Application granted granted Critical
Publication of CN109213988B publication Critical patent/CN109213988B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N 21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N 21/235 Processing of additional data, e.g. scrambling of additional data or processing content descriptors
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/435 Processing of additional data, e.g. decrypting of additional data, reconstructing software from modules extracted from the transport stream


Abstract

The invention discloses a barrage (bullet screen) theme extraction method, medium, equipment and system based on an N-gram model, relating to the field of live broadcast. The method comprises the following steps: extracting bullet screen data; extracting the features corresponding to words that express a specific intention and adding the features to a custom word bank; adding words without actual meaning to a custom stop word bank; data preprocessing: removing data whose "bullet screen content" field is empty, and removing punctuation marks in the "bullet screen content" field; representing the preprocessed bullet screen content with an N-gram model, where the N-gram model expresses that the occurrence probability of a word in a sentence is related to the previous N-1 words, N being a positive integer; and segmenting each bullet screen content into a group of word vectors, segmenting according to the word-formation rules in the custom word bank and filtering useless words according to the custom stop word bank. The method and the device can accurately extract the bullet screen theme.

Description

Bullet screen theme extraction method, medium, equipment and system based on N-gram model
Technical Field
The invention relates to the field of live broadcast, in particular to a barrage theme extraction method, medium, equipment and system based on an N-gram model.
Background
The main text content of a live broadcast platform is generally expressed as bullet screens, and the platform's bullet screen text must be extracted in order to analyze bullet screen content. At present, most traditional bullet screen text extraction schemes in the live broadcast industry adopt manual labeling, which consumes a large amount of manpower and material cost; when faced with the whole platform's users, anchors, and hundreds of millions of bullet screen items, manual processing is obviously inefficient. In addition, existing bullet screen text is represented purely with a bag-of-words model, which ignores the relation between a single word and its context, so that extraction of the bullet screen theme is inaccurate.
Disclosure of Invention
The invention aims to overcome the defects of the background technology and provides a bullet screen theme extraction method, medium, equipment and system based on an N-gram model, and the bullet screen theme can be accurately extracted.
The invention provides a barrage theme extraction method based on an N-gram model, which comprises the following steps:
preparing data: extracting bullet screen data;
constructing bullet screen characteristics: extracting the characteristics corresponding to the words representing a certain specific intention, and adding the characteristics to a custom word bank; adding words without actual meanings into a custom stop word bank;
data preprocessing: removing data with blank 'bullet screen content' field; removing punctuation marks in the 'bullet screen content' field;
and representing the bullet screen content as a word vector by adopting an N-gram model: the bullet screen content subjected to data preprocessing is represented by an N-gram model, the N-gram model represents that the occurrence probability of a certain word in a sentence is related to the previous N-1 words, and N is a positive integer; segmenting each bullet screen content into a group of word vectors, segmenting each bullet screen content according to word formation rules in a user-defined word bank, and filtering useless words according to a user-defined stop word bank.
On the basis of the technical scheme, the value of N is 2, namely, each word is related to the previous word.
On the basis of the technical scheme, in the N-gram model, a probability formula of sentence occurrence is as follows:
p = p(w1) × ∏_{i=2}^{m} p(w_i w_{i-1})
where p represents the probability value of the sentence occurring, w1 represents the word at the 1st position in the sentence, ∏ represents the product symbol, m represents the number of words in the sentence, w_i denotes the i-th word, m and i are positive integers, and p(w_i w_{i-1}) represents the probability that the words at the i-th and (i-1)-th positions occur together. The existing words are recombined according to this formula, combining two adjacent words to generate a new phrase.
On the basis of the technical scheme, the dimension of the word vector is 600,000, i.e., each bullet screen content is represented as a 600,000-dimensional vector, each position corresponds to a word, and the final bullet screen theme representation is obtained.
The invention also provides a storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described method.
The invention also provides an electronic device, which comprises a memory and a processor, wherein the memory stores a computer program running on the processor, and the processor executes the computer program to realize the method.
The invention also provides a bullet screen theme extraction system based on the N-gram model, which comprises a data preparation unit, a bullet screen characteristic construction unit, a data preprocessing unit, an N-gram model representation unit and a segmentation unit, wherein:
the data preparation unit is to: extracting bullet screen data;
the bullet screen characteristic construction unit is used for: extracting the characteristics corresponding to the words representing a certain specific intention, and adding the characteristics to a custom word bank; adding words without actual meanings into a custom stop word bank;
the data preprocessing unit is used for: removing data with empty 'barrage content' field; removing punctuation marks in the 'bullet screen content' field;
the N-gram model representation unit is used for: the bullet screen content subjected to data preprocessing is represented by an N-gram model, the N-gram model represents that the occurrence probability of a word in a sentence is related to the previous N-1 words, and N is a positive integer;
the segmentation unit is used for: segmenting each bullet screen content into a group of word vectors, segmenting each bullet screen content according to word formation rules in a user-defined word bank, and filtering useless words according to a user-defined stop word bank.
On the basis of the technical scheme, the value of N is 2, namely, each word is related to the previous word.
On the basis of the technical scheme, in the N-gram model, a probability formula of sentence occurrence is as follows:
p = p(w1) × ∏_{i=2}^{m} p(w_i w_{i-1})
where p represents the probability value of the sentence occurring, w1 represents the word at the 1st position in the sentence, ∏ represents the product symbol, m represents the number of words in the sentence, w_i denotes the i-th word, m and i are positive integers, and p(w_i w_{i-1}) represents the probability that the words at the i-th and (i-1)-th positions occur together. The existing words are recombined according to this formula, combining two adjacent words to generate a new phrase.
On the basis of the technical scheme, the dimension of the word vector is 600,000, i.e., each bullet screen content is represented as a 600,000-dimensional vector, each position corresponds to a word, and the final bullet screen theme representation is obtained.
Compared with the prior art, the invention has the following advantages:
(1) The method extracts bullet screen data; extracts the features corresponding to words expressing a specific intention and adds them to a custom word bank; adds words without actual meaning to a custom stop word bank; preprocesses the data by removing data whose "bullet screen content" field is empty and removing punctuation marks from the "bullet screen content" field; represents the preprocessed bullet screen content with an N-gram model, in which the occurrence probability of a word in a sentence is related to the previous N-1 words, N being a positive integer; and segments each bullet screen content into a group of word vectors according to the word-formation rules in the custom word bank, filtering useless words according to the custom stop word bank. Because the N-gram representation overcomes the defect that the existing bag-of-words model ignores context, the bullet screen representation is more accurate and the bullet screen theme can be accurately extracted.
(2) In the N-gram model of the invention, the value of N is 2, namely, each word has a relation with a word before the word. The invention is improved based on the original 2-gram model, and can reduce the calculation complexity.
(3) The method disclosed by the invention integrates the N-gram model and the artificial characteristic construction method, can accurately extract the main information of a single bullet screen and a single room, and realizes vectorization expression of the bullet screen theme.
(4) With the continuous development of live platform services, live platforms accumulate a lot of active users and also accumulate a large amount of text-type data. By deeply mining the text content of the live broadcast platform, the content similarity between the user and the room and the content similarity between the room and the room can be known, so that the personalized recommendation effect of the live broadcast platform is improved.
Drawings
FIG. 1 is a flowchart of a bullet screen theme extraction method based on an N-gram model in an embodiment of the present invention.
Fig. 2 is a block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the figures and the embodiments.
Referring to fig. 1, an embodiment of the present invention provides a bullet screen theme extraction method based on an N-gram model, including the following steps:
s1, data preparation: extracting bullet screen data;
s2, constructing bullet screen characteristics: extracting the characteristics corresponding to the words representing a certain specific intention, and adding the characteristics to a custom word bank; adding words without actual meanings into a custom stop word bank;
s3, preprocessing data: removing data with empty 'barrage content' field; removing punctuation marks in the 'bullet screen content' field;
s4, representing the bullet screen content as a word vector by adopting an N-gram model: the bullet screen content subjected to data preprocessing is represented by an N-gram model, the N-gram model represents that the occurrence probability of a word in a sentence is related to the previous N-1 words, and N is a positive integer; segmenting each bullet screen content into a group of word vectors, segmenting each bullet screen content according to word formation rules in a user-defined word bank, and filtering useless words according to a user-defined stop word bank.
The embodiment of the invention also provides a storage medium storing a computer program which, when executed by a processor, implements the above bullet screen theme extraction method based on the N-gram model. The storage medium includes various media capable of storing program code, such as a USB disk, a portable hard disk, a ROM (Read-Only Memory), a RAM (Random Access Memory), a magnetic disk, or an optical disk.
Referring to fig. 2, an embodiment of the present invention further provides an electronic device, which includes a memory and a processor, where the memory stores a computer program running on the processor, and the processor implements the bullet screen theme extraction method based on the N-gram model when executing the computer program.
The embodiment of the invention also provides a barrage theme extraction system based on the N-gram model, which comprises a data preparation unit, a barrage feature construction unit, a data preprocessing unit, an N-gram model representation unit and a segmentation unit, wherein:
the data preparation unit is to: extracting bullet screen data;
the bullet screen characteristic construction unit is used for: extracting the characteristics corresponding to the words representing a certain specific intention, and adding the characteristics to a custom word bank; adding words without actual meanings into a custom stop word bank;
the data preprocessing unit is used for: removing data with empty 'barrage content' field; removing punctuation marks in the 'bullet screen content' field;
the N-gram model representation unit is used for: the bullet screen content subjected to data preprocessing is represented by an N-gram model, the N-gram model represents that the occurrence probability of a certain word in a sentence is related to the previous N-1 words, and N is a positive integer;
the segmentation unit is used for: segmenting each bullet screen content into a group of word vectors, segmenting each bullet screen content according to word formation rules in a user-defined word bank, and filtering useless words according to a user-defined stop word bank.
The N-gram model is a language model commonly used in large-vocabulary continuous speech recognition; for Chinese, the model is called the Chinese Language Model (CLM). When continuous, undelimited pinyin, strokes, or digits representing letters or strokes need to be converted into a Chinese character string (i.e., a sentence), the Chinese language model can use the collocation information between adjacent words in the context to calculate the sentence with the maximum probability, thereby converting to Chinese characters automatically, without manual selection by the user, and avoiding the problem of duplicate codes in which many Chinese characters correspond to the same pinyin (or stroke string or digit string).
The N-gram model is based on the assumption that the occurrence of a word is only related to the first N-1 words, but not to any other words, and the probability of a complete sentence is the product of the probabilities of occurrence of the words. These probabilities can be obtained by counting the number of times that N words occur simultaneously directly from the corpus.
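As a concrete illustration (not part of the patent), the counts from which these probabilities are estimated can be gathered with a short Python sketch; the `corpus` here, a list of tokenized sentences, is a toy stand-in for the platform's bullet screen data:

```python
from collections import Counter

def bigram_counts(corpus):
    """Count unigrams and adjacent word pairs (bigrams) in a tokenized
    corpus; N-gram probabilities are estimated from these counts."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:          # each sentence is a list of words
        unigrams.update(sentence)
        bigrams.update(zip(sentence, sentence[1:]))
    return unigrams, bigrams

corpus = [["A", "B", "C"], ["A", "B", "D"]]
uni, bi = bigram_counts(corpus)
```

Dividing each count by the corresponding total then yields the occurrence probabilities used in the formulas below.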
In the embodiment of the invention, the value of N is 2, namely, each word is related to the previous word.
In the N-gram model, the probability formula of sentence occurrence is as follows:
p = p(w1) × ∏_{i=2}^{m} p(w_i w_{i-1})
where p represents the probability value of the sentence occurring, w1 represents the word at the 1st position in the sentence, ∏ represents the product symbol, m represents the number of words in the sentence, w_i denotes the i-th word, m and i are positive integers, and p(w_i w_{i-1}) represents the probability that the words at the i-th and (i-1)-th positions occur together. The existing words are recombined according to this formula, combining two adjacent words to generate a new phrase.
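A minimal Python sketch of this simplified computation; the probability tables here are illustrative toy values, not real corpus statistics:

```python
def sentence_prob(words, p_uni, p_joint):
    """Simplified 2-gram sentence probability:
    p = p(w1) * product over i of p(w_i w_{i-1}),
    using joint pair probabilities instead of conditional ones."""
    prob = p_uni[words[0]]
    for pair in zip(words, words[1:]):
        prob *= p_joint[pair]
    return prob

p_uni = {"A": 0.5}                              # toy probabilities
p_joint = {("A", "B"): 0.2, ("B", "C"): 0.1}
prob = sentence_prob(["A", "B", "C"], p_uni, p_joint)
```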
The dimension of the word vector is 600,000, i.e., each bullet screen content is represented as a 600,000-dimensional vector, each position corresponds to a word, and the final bullet screen theme representation is obtained.
Compared with the prior art, in which text representation purely based on the bag-of-words model ignores the relation between a single word and its context, the embodiment of the invention takes the influence of contextual text into account through the N-gram model, so the bullet screen theme is extracted more accurately; on the other hand, since using an unmodified N-gram model makes the text representation too computationally complex, the embodiment of the invention improves on the N-gram model.
The following examples are given.
Modeling data sources: the bullet screen data of the platform in the last month is taken as a data source.
Modeling:
(1) preparing data: extracting bullet screen data of the latest month, wherein the data mainly comprises a field of bullet screen content, and the data format is [ bullet screen content ];
(2) Constructing bullet screen features: bullet screens contain many proprietary words with platform characteristics, such as "water friend", which refers to a fan of an anchor.
Such proper nouns and verbs are user-defined and added to the custom dictionary. For example, after "water friend" is added to the custom dictionary, it will be segmented as one word in subsequent word cutting rather than split into the two words "water" and "friend".
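For illustration, a toy greedy longest-match segmenter shows the effect of the custom dictionary; this is a hedged stand-in for whatever segmentation tool the platform actually uses, and the function and lexicon are hypothetical:

```python
def segment(text, lexicon, max_len=4):
    """Greedy forward longest-match word cutting against a custom
    lexicon -- a toy stand-in for the patent's custom word bank."""
    words, i = [], 0
    while i < len(text):
        # try the longest substring first; fall back to a single character
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

# With the proper noun in the lexicon, it survives as one word:
print(segment("水友你好", {"水友"}))   # ['水友', '你', '好']
```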
Feature extraction: cheering words such as "666" are replaced with a "cheer" feature; digit strings such as "136XXXXXXXX" that conform to the characteristics of a mobile phone number are replaced with a "mobile phone contact" feature; "QQXXXXXXX" character strings are replaced with a "QQ contact" feature. In this way, all the accumulated platform-specific words with particular intentions are converted into corresponding features.
This feature extraction method can effectively reduce the feature dimensionality; for example, all QQ numbers are converted into the single "QQ contact" feature.
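A sketch of this replacement step using regular expressions; the exact patterns and feature token names are assumptions for illustration:

```python
import re

def extract_features(text):
    """Replace platform-specific strings with feature tokens, mirroring
    the patent's feature-construction step (patterns are illustrative)."""
    text = re.sub(r"6{3,}", "CHEER", text)                # "666", "6666", ...
    text = re.sub(r"1[3-9]\d{9}", "PHONE_CONTACT", text)  # mobile-number shape
    text = re.sub(r"[Qq]{2}\d{5,11}", "QQ_CONTACT", text) # e.g. "QQ12345678"
    return text
```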
(3) Data preprocessing:
data preprocessing is carried out on the basis of the step (2): removing data with empty 'barrage content' field; punctuation marks contained in the "bullet screen content" field are removed.
(4) And representing the bullet screen content as a word vector by adopting an N-gram model:
self-defining a word bank: based on the content of the platform, a user-defined word bank containing all specific words of the platform is manually arranged, and the accuracy of the user-defined word bank can influence the extraction accuracy of the bullet screen theme content.
Self-defining a stop word bank: stop words are words that carry little actual meaning compared with other words and are removed before content analysis. Which words count as stop words varies from person to person and from scene to scene; specifically, when judging whether a word is a stop word, the embodiment of the invention uses the dedicated stop word bank accumulated by the live broadcast platform itself.
Word cutting and bag-of-words representation of the bullet screen: the bullet screens subjected to data preprocessing are each divided into a group of word vectors. Each bullet screen is first segmented according to the word-formation rules in the custom word bank, and useless words are simultaneously filtered out according to the custom stop word bank.
For example, suppose the bullet screen content is "today A's B C's D is D" and the stop words include "today", "'s", and "is"; after word cutting, the bullet screen becomes ["A", "B", "C", "D", "D"]. Here, in order to remember the word order after the bullet screen is cut, key-value pairs are used to index the word order, as ["A": 1, "B": 2, "C": 3, "D": 4, "D": 5], where the key is a word and the value is that word's position in the sentence.
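Because a repeated word (here "D") cannot be stored twice as a dictionary key, the index is naturally a list of (word, position) pairs; a minimal sketch:

```python
def order_index(words):
    """Pair each word with its 1-based position so that word order is
    preserved even when a word (here "D") repeats."""
    return [(w, i) for i, w in enumerate(words, start=1)]
```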
The implications of the N-gram model are: the probability of occurrence of a word in a sentence is related to the first N-1 words, where N has a value of 2, i.e., each word is related to the previous word.
In the N-gram model, the probability of a sentence occurring is represented as:
p = p(w1) × ∏_{i=2}^{m} p(w_i | w_{i-1})
where p(w_i | w_{i-1}) represents the conditional probability that the i-th word occurs given that the word at the (i-1)-th position has occurred.
The probability of a sentence is calculated according to the N-gram model by computing in turn, for each word, the conditional probability p(w_i | w_{i-1}) of that word given the word it depends on.
By the Bayes formula for conditional probability:
p(w_i | w_{i-1}) = p(w_{i-1} w_i) / p(w_{i-1})
Computing p(w_i | w_{i-1}) in this way is very complex, so the embodiment of the invention innovatively simplifies the original formula to:
p = p(w1) × ∏_{i=2}^{m} p(w_i w_{i-1})
where p(w_i w_{i-1}) represents the probability that the words at the i-th and (i-1)-th positions occur together, p represents the probability value of the sentence occurring, w1 represents the word at the 1st position in the sentence, ∏ represents the product symbol, m represents the number of words in the sentence, w_i represents the i-th word, and m and i are positive integers.
That is, the embodiment of the invention replaces the conditional probability p(w_i | w_{i-1}) of the i-th word given the word at the (i-1)-th position with the simpler joint probability p(w_i w_{i-1}) that the words at the i-th and (i-1)-th positions occur together, which markedly reduces the workload and complexity of the calculation.
On the basis of step (3), the existing words are recombined according to the simplified 2-gram rule, combining two adjacent words to generate a new phrase: the word list ["A", "B", "C", "D", "D"] is converted into the phrase list ["AB", "BC", "CD", "DD"].
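The recombination step amounts to pairing each word with its successor; a minimal sketch:

```python
def to_bigrams(words):
    """Combine each pair of adjacent words into a new phrase, per the
    simplified 2-gram recombination rule."""
    return [a + b for a, b in zip(words, words[1:])]
```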
The bullet screen representation method based on the 2-gram model overcomes the defect that the bag-of-words model ignores the context, so that bullet screen representation is more accurate; in addition, improvement is made on the basis of the original 2-gram model, and the calculation complexity in practice is reduced.
Hash mapping of words: the dimension of the word vector is set to 600,000 (an empirical choice), i.e., each bullet screen content is represented as a 600,000-dimensional vector V, with each position corresponding to one word. For example, "AB" is mapped to position V(0), "BC" to V(1), "DD" to V(2), and "CD" to V(3). (In reality, words are mapped randomly over the 600,000 positions; for convenience of description, they are mapped to the first four positions here.) The value at each position is the number of times the word appears, so ["AB": 1, "BC": 1, "DD": 1, "CD": 1] becomes the word vector (1, 1, 1, 1, 0, 0, 0, 0, ..., 0), where the ellipsis omits the remaining 599,996 zeros (the word vector has a fixed length of 600,000). Once the theme representation of each bullet screen is obtained, it lays the technical groundwork for identifying spam bullet screens based on single-bullet-screen theme extraction.
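A sketch of the hash mapping, with Python's built-in hash standing in for the platform's (unspecified) random mapping and a small dimension used for the demonstration:

```python
def hash_vector(phrases, dim=600_000):
    """Map each phrase to a position in a fixed-length count vector by
    hashing; collisions simply add counts at the same position."""
    vec = [0] * dim
    for p in phrases:
        vec[hash(p) % dim] += 1
    return vec

v = hash_vector(["AB", "BC", "CD", "DD"], dim=100)  # small dim for the demo
```

Because the vector length is fixed, every bullet screen maps to a vector of the same dimension regardless of how many phrases it contains.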
It should be noted that: in the system provided in the embodiment of the present invention, when performing inter-module communication, only the division of each functional module is illustrated, and in practical applications, the above function distribution may be completed by different functional modules as needed, that is, the internal structure of the system is divided into different functional modules to complete all or part of the above described functions.
Further, the present invention is not limited to the above-mentioned embodiments, and it will be apparent to those skilled in the art that various modifications and improvements can be made without departing from the principle of the present invention, and these modifications and improvements are also considered to be within the scope of the present invention. Those not described in detail in this specification are within the skill of the art.

Claims (8)

1. A barrage theme extraction method based on an N-gram model is characterized by comprising the following steps:
preparing data: extracting bullet screen data;
constructing bullet screen characteristics: extracting characteristics corresponding to words representing characteristics of a live broadcast platform or conforming to contact information characteristics, and adding the characteristics to a custom word bank; adding words without actual meanings into a custom stop word bank;
data preprocessing: removing data with blank 'bullet screen content' field; removing punctuation marks in the 'bullet screen content' field;
and representing the bullet screen content as a word vector by adopting an N-gram model: the bullet screen content subjected to data preprocessing is represented by an N-gram model, the N-gram model represents that the occurrence probability of a word in a sentence is related to the previous N-1 words, and N is a positive integer; segmenting each bullet screen content into a group of word vectors, segmenting each bullet screen content according to word formation rules in a user-defined word bank, and filtering useless words according to a user-defined stop word bank;
in the N-gram model, a probability formula of sentence occurrence is as follows:
p = p(w1) × ∏_{i=2}^{m} p(w_i w_{i-1})
where p represents the probability value of the sentence occurring, w1 represents the word at the 1st position in the sentence, ∏ represents the product symbol, m represents the number of words in the sentence, w_i denotes the i-th word, m and i are positive integers, and p(w_i w_{i-1}) represents the probability that the words at the i-th and (i-1)-th positions occur together; the existing words are recombined according to this formula, combining two adjacent words to generate a new phrase.
2. The method for extracting bullet screen subjects based on the N-gram model according to claim 1, wherein: the value of N is 2, i.e. each word has a relationship with the word preceding it.
3. The bullet screen theme extraction method based on the N-gram model as claimed in claim 1, characterized in that: the dimension of the word vector is 600,000, i.e., each bullet screen content is represented as a 600,000-dimensional vector, each position corresponds to a word, and the final bullet screen theme representation is obtained.
4. A storage medium having a computer program stored thereon, characterized in that: the computer program, when executed by a processor, implements the method of any of claims 1 to 3.
5. An electronic device comprising a memory and a processor, the memory having stored thereon a computer program that runs on the processor, characterized in that: a processor implementing the method of any one of claims 1 to 3 when executing the computer program.
6. The utility model provides a bullet screen theme extraction system based on N-gram model which characterized in that: the system comprises a data preparation unit, a bullet screen feature construction unit, a data preprocessing unit, an N-gram model representation unit and a segmentation unit, wherein:
the data preparation unit is to: extracting bullet screen data;
the bullet screen characteristic construction unit is used for: extracting characteristics corresponding to words representing characteristics of a live broadcast platform or conforming to contact information characteristics, and adding the characteristics to a custom word bank; adding words without actual meanings into a custom stop word bank;
the data preprocessing unit is used for: removing data with empty 'barrage content' field; removing punctuation marks in the 'bullet screen content' field;
the N-gram model representation unit is used for: the bullet screen content subjected to data preprocessing is represented by an N-gram model, the N-gram model represents that the occurrence probability of a word in a sentence is related to the previous N-1 words, and N is a positive integer;
the segmentation unit is used for: segmenting each bullet screen content into a group of word vectors, segmenting each bullet screen content according to word formation rules in a user-defined word bank, and filtering useless words according to a user-defined stop word bank;
in the N-gram model, the probability of a sentence occurring is given by:

p(w1, w2, …, wm) = p(w1) · ∏_{i=2}^{m} p(wi | wi-1)

where p denotes the probability value of the sentence occurring, w1 denotes the word at the 1st position in the sentence, ∏ denotes the product symbol, m denotes the number of words in the sentence, wi denotes the i-th word, m and i are positive integers, and p(wi | wi-1) denotes the probability of the word at the i-th position occurring given the word at the (i-1)-th position; existing phrases are recombined according to this formula, and two adjacent words are combined to generate a new phrase.
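The bigram sentence probability and adjacent-word recombination of claim 6 can be sketched as follows. The toy corpus, the maximum-likelihood estimates, and the merging threshold are illustrative assumptions, not details from the patent:

```python
from collections import Counter

# Sketch of the bigram (N=2) formulation in claim 6: the sentence probability
# is p(w1) * prod_{i=2..m} p(w_i | w_{i-1}), estimated from word and word-pair
# counts over a (tiny, illustrative) corpus of segmented comments.

corpus = [["new", "anchor", "sings", "well"],
          ["new", "anchor", "plays", "well"]]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((sent[i - 1], sent[i]) for sent in corpus for i in range(1, len(sent)))
total = sum(unigrams.values())

def sentence_probability(sentence):
    """p(w1) * prod p(w_i | w_{i-1}) under maximum-likelihood estimates."""
    p = unigrams[sentence[0]] / total
    for i in range(1, len(sentence)):
        prev, cur = sentence[i - 1], sentence[i]
        p *= bigrams[(prev, cur)] / unigrams[prev]
    return p

def merge_adjacent(sentence, threshold=1.0):
    """Combine adjacent words into a new phrase when the pair always co-occurs."""
    return [f"{a}_{b}" for a, b in zip(sentence, sentence[1:])
            if bigrams[(a, b)] / unigrams[a] >= threshold]

print(sentence_probability(["new", "anchor", "sings", "well"]))  # 0.125
print(merge_adjacent(["new", "anchor", "sings", "well"]))  # ['new_anchor', 'sings_well']
```

The `merge_adjacent` helper corresponds to the claim's "combining two adjacent words to generate a new phrase": pairs that reliably co-occur (here "new anchor") are promoted to a single phrase token.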
7. The system for bullet screen theme extraction based on N-gram model of claim 6, characterized in that: the value of N is 2, i.e. each word has a relationship with the word preceding it.
8. The system for bullet screen theme extraction based on the N-gram model of claim 6, characterized in that: the word vector has 600,000 dimensions, that is, each bullet screen content is represented as a 600,000-dimensional vector in which each position corresponds to a word, yielding the final bullet screen theme representation.
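The segmentation unit of claims 6–8 (custom word bank plus custom stop word bank) can be sketched with a greedy longest-match segmenter. The dictionaries below are illustrative assumptions; a production system would more likely use a full Chinese segmenter such as jieba loaded with user dictionaries:

```python
# Sketch of the segmentation unit: split each comment using a custom word bank
# (longest match first), then drop words listed in a custom stop word bank.

custom_words = {"新主播", "联系方式", "唱歌"}  # custom word bank (illustrative)
stopwords = {"的", "了", "啊"}                 # custom stop word bank (illustrative)
max_len = max(len(w) for w in custom_words)

def segment(text):
    """Greedy longest-match segmentation against the custom word bank."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if n == 1 or piece in custom_words:
                tokens.append(piece)
                i += n
                break
    return [t for t in tokens if t not in stopwords]

print(segment("新主播唱歌的联系方式"))  # ['新主播', '唱歌', '联系方式']
```

Here "新主播唱歌的联系方式" ("new anchor singing + contact information") is split by the word bank and the meaningless particle "的" is filtered by the stop word bank, matching the unit's described behavior.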
CN201710514238.XA 2017-06-29 2017-06-29 Barrage theme extraction method, medium, equipment and system based on N-gram model Active CN109213988B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710514238.XA CN109213988B (en) 2017-06-29 2017-06-29 Barrage theme extraction method, medium, equipment and system based on N-gram model


Publications (2)

Publication Number Publication Date
CN109213988A CN109213988A (en) 2019-01-15
CN109213988B true CN109213988B (en) 2022-06-21

Family

ID=64976355

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710514238.XA Active CN109213988B (en) 2017-06-29 2017-06-29 Barrage theme extraction method, medium, equipment and system based on N-gram model

Country Status (1)

Country Link
CN (1) CN109213988B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109948152B (en) * 2019-03-06 2020-07-17 北京工商大学 L STM-based Chinese text grammar error correction model method
CN110430448B (en) * 2019-07-31 2021-09-03 北京奇艺世纪科技有限公司 Bullet screen processing method and device and electronic equipment
CN113948085B (en) 2021-12-22 2022-03-25 中国科学院自动化研究所 Speech recognition method, system, electronic device and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008098507A1 (en) * 2007-02-13 2008-08-21 Beijing Sogou Technology Development Co., Ltd. An input method of combining words intelligently, input method system and renewing method
CN101930561A (en) * 2010-05-21 2010-12-29 电子科技大学 N-Gram participle model-based reverse neural network junk mail filter device
CN103207921A (en) * 2013-04-28 2013-07-17 福州大学 Method for automatically extracting terms from Chinese electronic document
CN103246644A (en) * 2013-04-02 2013-08-14 亿赞普(北京)科技有限公司 Method and device for processing Internet public opinion information
CN105435453A (en) * 2015-12-22 2016-03-30 网易(杭州)网络有限公司 Bullet screen information processing method, device and system
CN105516820A (en) * 2015-12-10 2016-04-20 腾讯科技(深圳)有限公司 Barrage interaction method and device
CN105760507A (en) * 2016-02-23 2016-07-13 复旦大学 Cross-modal subject correlation modeling method based on deep learning
CN106055538A (en) * 2016-05-26 2016-10-26 达而观信息科技(上海)有限公司 Automatic extraction method for text labels in combination with theme model and semantic analyses


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Effective bullet screen management for live streaming platforms; Li Jinlan; 《有线电视技术》 (Cable Television Technology); 2017-03-31; pp. 105-107 *

Also Published As

Publication number Publication date
CN109213988A (en) 2019-01-15

Similar Documents

Publication Publication Date Title
WO2023065544A1 (en) Intention classification method and apparatus, electronic device, and computer-readable storage medium
CN109801630B (en) Digital conversion method, device, computer equipment and storage medium for voice recognition
CN108804423B (en) Medical text feature extraction and automatic matching method and system
CN112784696B (en) Lip language identification method, device, equipment and storage medium based on image identification
CN110347790B (en) Text duplicate checking method, device and equipment based on attention mechanism and storage medium
CN108733647B (en) Word vector generation method based on Gaussian distribution
CN109213988B (en) Barrage theme extraction method, medium, equipment and system based on N-gram model
CN113268576B (en) Deep learning-based department semantic information extraction method and device
CN110188359B (en) Text entity extraction method
CN111274804A (en) Case information extraction method based on named entity recognition
CN110826298A (en) Statement coding method used in intelligent auxiliary password-fixing system
CN111539228B (en) Vector model training method and device and similarity determining method and device
CN111597807B (en) Word segmentation data set generation method, device, equipment and storage medium thereof
CN111914825A (en) Character recognition method and device and electronic equipment
CN111160026B (en) Model training method and device, and text processing method and device
CN111046660B (en) Method and device for identifying text professional terms
CN111506726A (en) Short text clustering method and device based on part-of-speech coding and computer equipment
CN114398943B (en) Sample enhancement method and device thereof
CN113255331B (en) Text error correction method, device and storage medium
CN114782965A (en) Visual rich document information extraction method, system and medium based on layout relevance
CN110597982A (en) Short text topic clustering algorithm based on word co-occurrence network
CN113868389A (en) Data query method and device based on natural language text and computer equipment
CN111695350B (en) Word segmentation method and word segmentation device for text
CN113239245A (en) Method and device for information query, electronic equipment and readable storage medium
CN113361260A (en) Text processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant