CN109213988B - Barrage theme extraction method, medium, equipment and system based on N-gram model - Google Patents

Publication number
CN109213988B
CN109213988B · Application CN201710514238.XA
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710514238.XA
Other languages
Chinese (zh)
Other versions
CN109213988A (en)
Inventor
龚灿
陈少杰
张文明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Douyu Network Technology Co Ltd
Original Assignee
Wuhan Douyu Network Technology Co Ltd
Application filed by Wuhan Douyu Network Technology Co Ltd filed Critical Wuhan Douyu Network Technology Co Ltd
Priority to CN201710514238.XA priority Critical patent/CN109213988B/en
Publication of CN109213988A publication Critical patent/CN109213988A/en
Application granted granted Critical
Publication of CN109213988B publication Critical patent/CN109213988B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N 21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N 21/235 Processing of additional data, e.g. scrambling of additional data or processing content descriptors
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/435 Processing of additional data, e.g. decrypting of additional data, reconstructing software from modules extracted from the transport stream


Abstract

The invention discloses a barrage (bullet screen) theme extraction method, medium, equipment and system based on an N-gram model, relating to the field of live broadcast. The method comprises the following steps: extracting bullet screen data; extracting the features corresponding to words that express a specific intention and adding the features to a custom word bank; adding words without actual meaning to a custom stop word bank; data preprocessing: removing data whose "bullet screen content" field is empty, and removing punctuation marks in the "bullet screen content" field; representing the preprocessed bullet screen content with an N-gram model, where the N-gram model expresses that the occurrence probability of a word in a sentence is related to the previous N-1 words, N being a positive integer; and segmenting each bullet screen content into a group of word vectors, segmenting according to the word-formation rules in the custom word bank and filtering useless words according to the custom stop word bank. The method and the device can accurately extract the bullet screen theme.

Description

Bullet screen theme extraction method, medium, equipment and system based on N-gram model
Technical Field
The invention relates to the field of live broadcast, in particular to a barrage theme extraction method, medium, equipment and system based on an N-gram model.
Background
The main text content of a live broadcast platform is generally expressed as bullet screens, and the platform's bullet screen text must be extracted in order to analyze bullet screen content. At present, most traditional bullet screen text extraction schemes in the live broadcast industry adopt manual labeling, which consumes a large amount of manpower and material cost; when faced with the whole platform's users, anchors, and hundreds of millions of bullet screen items, manual processing is obviously inefficient. In addition, existing bullet screen text is represented purely with a bag-of-words model, which ignores the relation between a single word and its context, so that extraction of the bullet screen theme is inaccurate.
Disclosure of Invention
The invention aims to overcome the defects of the background technology and provides a bullet screen theme extraction method, medium, equipment and system based on an N-gram model, and the bullet screen theme can be accurately extracted.
The invention provides a barrage theme extraction method based on an N-gram model, which comprises the following steps:
preparing data: extracting bullet screen data;
constructing bullet screen characteristics: extracting the characteristics corresponding to the words representing a certain specific intention, and adding the characteristics to a custom word bank; adding words without actual meanings into a custom stop word bank;
data preprocessing: removing data with blank 'bullet screen content' field; removing punctuation marks in the 'bullet screen content' field;
and representing the bullet screen content as a word vector by adopting an N-gram model: the bullet screen content subjected to data preprocessing is represented by an N-gram model, the N-gram model represents that the occurrence probability of a certain word in a sentence is related to the previous N-1 words, and N is a positive integer; segmenting each bullet screen content into a group of word vectors, segmenting each bullet screen content according to word formation rules in a user-defined word bank, and filtering useless words according to a user-defined stop word bank.
On the basis of the technical scheme, the value of N is 2, namely, each word is related to the previous word.
On the basis of the technical scheme, in the N-gram model, a probability formula of sentence occurrence is as follows:
p = p(w1) × ∏_{i=2}^{m} p(w_i w_{i-1})
where p represents the probability value of the sentence occurring, w1 represents the word at the 1st position in the sentence, ∏ represents the product symbol, m represents the number of words in the sentence, w_i denotes the i-th word, m and i are positive integers, and p(w_i w_{i-1}) represents the probability that the words at the i-th and (i-1)-th positions occur together. The existing words are recombined according to this formula, combining two adjacent words to generate a new phrase.
On the basis of the technical scheme, the dimension of the word vector is 600,000, i.e., each bullet screen content is represented as a 600,000-dimensional vector, each position corresponds to a word, and the final bullet screen theme representation is obtained.
The invention also provides a storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described method.
The invention also provides an electronic device, which comprises a memory and a processor, wherein the memory stores a computer program running on the processor, and the processor executes the computer program to realize the method.
The invention also provides a bullet screen theme extraction system based on the N-gram model, which comprises a data preparation unit, a bullet screen characteristic construction unit, a data preprocessing unit, an N-gram model representation unit and a segmentation unit, wherein:
the data preparation unit is to: extracting bullet screen data;
the bullet screen characteristic construction unit is used for: extracting the characteristics corresponding to the words representing a certain specific intention, and adding the characteristics to a custom word bank; adding words without actual meanings into a custom stop word bank;
the data preprocessing unit is used for: removing data with empty 'barrage content' field; removing punctuation marks in the 'bullet screen content' field;
the N-gram model representation unit is used for: the bullet screen content subjected to data preprocessing is represented by an N-gram model, the N-gram model represents that the occurrence probability of a word in a sentence is related to the previous N-1 words, and N is a positive integer;
the segmentation unit is used for: segmenting each bullet screen content into a group of word vectors, segmenting each bullet screen content according to word formation rules in a user-defined word bank, and filtering useless words according to a user-defined stop word bank.
On the basis of the technical scheme, the value of N is 2, namely, each word is related to the previous word.
On the basis of the technical scheme, in the N-gram model, a probability formula of sentence occurrence is as follows:
p = p(w1) × ∏_{i=2}^{m} p(w_i w_{i-1})
where p represents the probability value of the sentence occurring, w1 represents the word at the 1st position in the sentence, ∏ represents the product symbol, m represents the number of words in the sentence, w_i denotes the i-th word, m and i are positive integers, and p(w_i w_{i-1}) represents the probability that the words at the i-th and (i-1)-th positions occur together. The existing words are recombined according to this formula, combining two adjacent words to generate a new phrase.
On the basis of the technical scheme, the dimension of the word vector is 600,000, i.e., each bullet screen content is represented as a 600,000-dimensional vector, each position corresponds to a word, and the final bullet screen theme representation is obtained.
Compared with the prior art, the invention has the following advantages:
(1) The method extracts bullet screen data; extracts the features corresponding to words expressing a specific intention and adds them to a custom word bank; adds words without actual meaning to a custom stop word bank; preprocesses the data by removing data whose "bullet screen content" field is empty and removing punctuation marks from the "bullet screen content" field; represents the preprocessed bullet screen content with an N-gram model, in which the occurrence probability of a word in a sentence is related to the previous N-1 words, N being a positive integer; and segments each bullet screen content into a group of word vectors according to the word-formation rules in the custom word bank, filtering useless words according to the custom stop word bank. Because the N-gram representation overcomes the defect that the existing bag-of-words model ignores context, the bullet screen representation is more accurate and the bullet screen theme can be accurately extracted.
(2) In the N-gram model of the invention, the value of N is 2, namely, each word has a relation with a word before the word. The invention is improved based on the original 2-gram model, and can reduce the calculation complexity.
(3) The method disclosed by the invention integrates the N-gram model and the artificial characteristic construction method, can accurately extract the main information of a single bullet screen and a single room, and realizes vectorization expression of the bullet screen theme.
(4) With the continuous development of live platform services, live platforms accumulate a lot of active users and also accumulate a large amount of text-type data. By deeply mining the text content of the live broadcast platform, the content similarity between the user and the room and the content similarity between the room and the room can be known, so that the personalized recommendation effect of the live broadcast platform is improved.
Drawings
FIG. 1 is a flowchart of a bullet screen theme extraction method based on an N-gram model in an embodiment of the present invention.
Fig. 2 is a block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the figures and the embodiments.
Referring to fig. 1, an embodiment of the present invention provides a bullet screen theme extraction method based on an N-gram model, including the following steps:
s1, data preparation: extracting bullet screen data;
s2, constructing bullet screen characteristics: extracting the characteristics corresponding to the words representing a certain specific intention, and adding the characteristics to a custom word bank; adding words without actual meanings into a custom stop word bank;
s3, preprocessing data: removing data with empty 'barrage content' field; removing punctuation marks in the 'bullet screen content' field;
s4, representing the bullet screen content as a word vector by adopting an N-gram model: the bullet screen content subjected to data preprocessing is represented by an N-gram model, the N-gram model represents that the occurrence probability of a word in a sentence is related to the previous N-1 words, and N is a positive integer; segmenting each bullet screen content into a group of word vectors, segmenting each bullet screen content according to word formation rules in a user-defined word bank, and filtering useless words according to a user-defined stop word bank.
The embodiment of the invention also provides a storage medium storing a computer program which, when executed by a processor, implements the above bullet screen theme extraction method based on the N-gram model. The storage medium includes various media capable of storing program code, such as a USB disk, a portable hard disk, a ROM (Read-Only Memory), a RAM (Random Access Memory), a magnetic disk, or an optical disk.
Referring to fig. 2, an embodiment of the present invention further provides an electronic device, which includes a memory and a processor, where the memory stores a computer program running on the processor, and the processor implements the bullet screen theme extraction method based on the N-gram model when executing the computer program.
The embodiment of the invention also provides a barrage theme extraction system based on the N-gram model, which comprises a data preparation unit, a barrage feature construction unit, a data preprocessing unit, an N-gram model representation unit and a segmentation unit, wherein:
the data preparation unit is to: extracting bullet screen data;
the bullet screen characteristic construction unit is used for: extracting the characteristics corresponding to the words representing a certain specific intention, and adding the characteristics to a custom word bank; adding words without actual meanings into a custom stop word bank;
the data preprocessing unit is used for: removing data with empty 'barrage content' field; removing punctuation marks in the 'bullet screen content' field;
the N-gram model representation unit is used for: the bullet screen content subjected to data preprocessing is represented by an N-gram model, the N-gram model represents that the occurrence probability of a certain word in a sentence is related to the previous N-1 words, and N is a positive integer;
the segmentation unit is used for: segmenting each bullet screen content into a group of word vectors, segmenting each bullet screen content according to word formation rules in a user-defined word bank, and filtering useless words according to a user-defined stop word bank.
The N-gram model is a language model commonly used in large-vocabulary continuous speech recognition; for Chinese, the model is called the Chinese Language Model (CLM). When continuous, undelimited pinyin, strokes, or digits representing letters or strokes need to be converted into a Chinese character string (i.e., a sentence), the Chinese language model can use the collocation information between adjacent words in the context to calculate the sentence with the maximum probability, thereby converting to Chinese characters automatically, without manual selection by the user, and avoiding the problem of duplicate codes in which many Chinese characters correspond to the same pinyin (or stroke string or digit string).
The N-gram model is based on the assumption that the occurrence of a word is only related to the first N-1 words, but not to any other words, and the probability of a complete sentence is the product of the probabilities of occurrence of the words. These probabilities can be obtained by counting the number of times that N words occur simultaneously directly from the corpus.
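As a concrete illustration (not part of the patent), the counts from which these probabilities are estimated can be gathered with a short Python sketch; the `corpus` here, a list of tokenized sentences, is a toy stand-in for the platform's bullet screen data:

```python
from collections import Counter

def bigram_counts(corpus):
    """Count unigrams and adjacent word pairs (bigrams) in a tokenized
    corpus; N-gram probabilities are estimated from these counts."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:          # each sentence is a list of words
        unigrams.update(sentence)
        bigrams.update(zip(sentence, sentence[1:]))
    return unigrams, bigrams

corpus = [["A", "B", "C"], ["A", "B", "D"]]
uni, bi = bigram_counts(corpus)
```

Dividing each count by the corresponding total then yields the occurrence probabilities used in the formulas below.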
In the embodiment of the invention, the value of N is 2, namely, each word is related to the previous word.
In the N-gram model, the probability formula of sentence occurrence is as follows:
p = p(w1) × ∏_{i=2}^{m} p(w_i w_{i-1})
where p represents the probability value of the sentence occurring, w1 represents the word at the 1st position in the sentence, ∏ represents the product symbol, m represents the number of words in the sentence, w_i denotes the i-th word, m and i are positive integers, and p(w_i w_{i-1}) represents the probability that the words at the i-th and (i-1)-th positions occur together. The existing words are recombined according to this formula, combining two adjacent words to generate a new phrase.
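A minimal Python sketch of this simplified computation; the probability tables here are illustrative toy values, not real corpus statistics:

```python
def sentence_prob(words, p_uni, p_joint):
    """Simplified 2-gram sentence probability:
    p = p(w1) * product over i of p(w_i w_{i-1}),
    using joint pair probabilities instead of conditional ones."""
    prob = p_uni[words[0]]
    for pair in zip(words, words[1:]):
        prob *= p_joint[pair]
    return prob

p_uni = {"A": 0.5}                              # toy probabilities
p_joint = {("A", "B"): 0.2, ("B", "C"): 0.1}
prob = sentence_prob(["A", "B", "C"], p_uni, p_joint)
```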
The dimension of the word vector is 600,000, i.e., each bullet screen content is represented as a 600,000-dimensional vector, each position corresponds to a word, and the final bullet screen theme representation is obtained.
Compared with the prior art, in which text representation purely based on the bag-of-words model ignores the relation between a single word and its context, the embodiment of the invention takes the influence of contextual text into account through the N-gram model, so the bullet screen theme is extracted more accurately; on the other hand, since using an unmodified N-gram model makes the text representation too computationally complex, the embodiment of the invention improves on the N-gram model.
The following examples are given.
Modeling data sources: the bullet screen data of the platform in the last month is taken as a data source.
Modeling:
(1) preparing data: extracting bullet screen data of the latest month, wherein the data mainly comprises a field of bullet screen content, and the data format is [ bullet screen content ];
(2) Constructing bullet screen features: bullet screens contain many proprietary words with platform characteristics, such as "water friend", which refers to a fan of an anchor.
Such proper nouns and verbs are user-defined and added to the custom dictionary. For example, after "water friend" is added to the custom dictionary, it will be segmented as one word in subsequent word cutting rather than split into the two words "water" and "friend".
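For illustration, a toy greedy longest-match segmenter shows the effect of the custom dictionary; this is a hedged stand-in for whatever segmentation tool the platform actually uses, and the function and lexicon are hypothetical:

```python
def segment(text, lexicon, max_len=4):
    """Greedy forward longest-match word cutting against a custom
    lexicon -- a toy stand-in for the patent's custom word bank."""
    words, i = [], 0
    while i < len(text):
        # try the longest substring first; fall back to a single character
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

# With the proper noun in the lexicon, it survives as one word:
print(segment("水友你好", {"水友"}))   # ['水友', '你', '好']
```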
Feature extraction: cheering words such as "666" are replaced with a "cheer" feature; digit strings such as "136XXXXXXXX" that conform to the characteristics of a mobile phone number are replaced with a "mobile phone contact" feature; "QQXXXXXXX" character strings are replaced with a "QQ contact" feature. In this way, all the accumulated platform-specific words with particular intentions are converted into corresponding features.
This feature extraction method can effectively reduce the feature dimensionality; for example, all QQ numbers are converted into the single "QQ contact" feature.
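A sketch of this replacement step using regular expressions; the exact patterns and feature token names are assumptions for illustration:

```python
import re

def extract_features(text):
    """Replace platform-specific strings with feature tokens, mirroring
    the patent's feature-construction step (patterns are illustrative)."""
    text = re.sub(r"6{3,}", "CHEER", text)                # "666", "6666", ...
    text = re.sub(r"1[3-9]\d{9}", "PHONE_CONTACT", text)  # mobile-number shape
    text = re.sub(r"[Qq]{2}\d{5,11}", "QQ_CONTACT", text) # e.g. "QQ12345678"
    return text
```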
(3) Data preprocessing:
data preprocessing is carried out on the basis of the step (2): removing data with empty 'barrage content' field; punctuation marks contained in the "bullet screen content" field are removed.
(4) And representing the bullet screen content as a word vector by adopting an N-gram model:
self-defining a word bank: based on the content of the platform, a user-defined word bank containing all specific words of the platform is manually arranged, and the accuracy of the user-defined word bank can influence the extraction accuracy of the bullet screen theme content.
Self-defining a stop word bank: stop words are words that carry little actual meaning compared with other words and are removed before content analysis. Which words count as stop words varies from person to person and from scene to scene; specifically, when judging whether a word is a stop word, the embodiment of the invention uses the dedicated stop word bank accumulated by the live broadcast platform itself.
Word cutting and bag-of-words representation of the bullet screen: the bullet screens subjected to data preprocessing are each divided into a group of word vectors. Each bullet screen is first segmented according to the word-formation rules in the custom word bank, and useless words are simultaneously filtered out according to the custom stop word bank.
For example, suppose the bullet screen content is "today A's B C's D is D" and the stop words include "today", "'s", and "is"; after word cutting, the bullet screen becomes ["A", "B", "C", "D", "D"]. Here, in order to remember the word order after the bullet screen is cut, key-value pairs are used to index the word order, as ["A": 1, "B": 2, "C": 3, "D": 4, "D": 5], where the key is a word and the value is that word's position in the sentence.
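Because a repeated word (here "D") cannot be stored twice as a dictionary key, the index is naturally a list of (word, position) pairs; a minimal sketch:

```python
def order_index(words):
    """Pair each word with its 1-based position so that word order is
    preserved even when a word (here "D") repeats."""
    return [(w, i) for i, w in enumerate(words, start=1)]
```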
The implications of the N-gram model are: the probability of occurrence of a word in a sentence is related to the first N-1 words, where N has a value of 2, i.e., each word is related to the previous word.
In the N-gram model, the probability of a sentence occurring is represented as:
p = p(w1) × ∏_{i=2}^{m} p(w_i | w_{i-1})
where p(w_i | w_{i-1}) represents the conditional probability that the i-th word occurs given that the word at the (i-1)-th position has occurred.
The probability of a sentence is calculated according to the N-gram model by computing in turn, for each word, the conditional probability p(w_i | w_{i-1}) of that word given the word it depends on.
By the Bayes formula for conditional probability:
p(w_i | w_{i-1}) = p(w_{i-1} w_i) / p(w_{i-1})
Computing p(w_i | w_{i-1}) in this way is very complex, so the embodiment of the invention innovatively simplifies the original formula to:
p = p(w1) × ∏_{i=2}^{m} p(w_i w_{i-1})
where p(w_i w_{i-1}) represents the probability that the words at the i-th and (i-1)-th positions occur together, p represents the probability value of the sentence occurring, w1 represents the word at the 1st position in the sentence, ∏ represents the product symbol, m represents the number of words in the sentence, w_i represents the i-th word, and m and i are positive integers.
That is, the embodiment of the invention replaces the conditional probability p(w_i | w_{i-1}) of the i-th word given the word at the (i-1)-th position with the simpler joint probability p(w_i w_{i-1}) that the words at the i-th and (i-1)-th positions occur together, which markedly reduces the workload and complexity of the calculation.
On the basis of step (3), the existing words are recombined according to the simplified 2-gram rule, combining two adjacent words to generate a new phrase: the word list ["A", "B", "C", "D", "D"] is converted into the phrase list ["AB", "BC", "CD", "DD"].
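The recombination step amounts to pairing each word with its successor; a minimal sketch:

```python
def to_bigrams(words):
    """Combine each pair of adjacent words into a new phrase, per the
    simplified 2-gram recombination rule."""
    return [a + b for a, b in zip(words, words[1:])]
```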
The bullet screen representation method based on the 2-gram model overcomes the defect that the bag-of-words model ignores the context, so that bullet screen representation is more accurate; in addition, improvement is made on the basis of the original 2-gram model, and the calculation complexity in practice is reduced.
Hash mapping of words: the dimension of the word vector is set to 600,000 (an empirical choice), i.e., each bullet screen content is represented as a 600,000-dimensional vector V, with each position corresponding to one word. For example, "AB" is mapped to position V(0), "BC" to V(1), "DD" to V(2), and "CD" to V(3). (In reality, words are mapped randomly over the 600,000 positions; for convenience of description, they are mapped to the first four positions here.) The value at each position is the number of times the word appears, so ["AB": 1, "BC": 1, "DD": 1, "CD": 1] becomes the word vector (1, 1, 1, 1, 0, 0, 0, 0, ..., 0), where the ellipsis omits the remaining 599,996 zeros (the word vector has a fixed length of 600,000). Once the theme representation of each bullet screen is obtained, it lays the technical groundwork for identifying spam bullet screens based on single-bullet-screen theme extraction.
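A sketch of the hash mapping, with Python's built-in hash standing in for the platform's (unspecified) random mapping and a small dimension used for the demonstration:

```python
def hash_vector(phrases, dim=600_000):
    """Map each phrase to a position in a fixed-length count vector by
    hashing; collisions simply add counts at the same position."""
    vec = [0] * dim
    for p in phrases:
        vec[hash(p) % dim] += 1
    return vec

v = hash_vector(["AB", "BC", "CD", "DD"], dim=100)  # small dim for the demo
```

Because the vector length is fixed, every bullet screen maps to a vector of the same dimension regardless of how many phrases it contains.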
It should be noted that: in the system provided in the embodiment of the present invention, when performing inter-module communication, only the division of each functional module is illustrated, and in practical applications, the above function distribution may be completed by different functional modules as needed, that is, the internal structure of the system is divided into different functional modules to complete all or part of the above described functions.
Further, the present invention is not limited to the above-mentioned embodiments, and it will be apparent to those skilled in the art that various modifications and improvements can be made without departing from the principle of the present invention, and these modifications and improvements are also considered to be within the scope of the present invention. Those not described in detail in this specification are within the skill of the art.

Claims (8)

1. A barrage theme extraction method based on an N-gram model is characterized by comprising the following steps:
preparing data: extracting bullet screen data;
constructing bullet screen characteristics: extracting characteristics corresponding to words representing characteristics of a live broadcast platform or conforming to contact information characteristics, and adding the characteristics to a custom word bank; adding words without actual meanings into a custom stop word bank;
data preprocessing: removing data with blank 'bullet screen content' field; removing punctuation marks in the 'bullet screen content' field;
and representing the bullet screen content as a word vector by adopting an N-gram model: the bullet screen content subjected to data preprocessing is represented by an N-gram model, the N-gram model represents that the occurrence probability of a word in a sentence is related to the previous N-1 words, and N is a positive integer; segmenting each bullet screen content into a group of word vectors, segmenting each bullet screen content according to word formation rules in a user-defined word bank, and filtering useless words according to a user-defined stop word bank;
in the N-gram model, a probability formula of sentence occurrence is as follows:
p = p(w1) × ∏_{i=2}^{m} p(w_i w_{i-1})
where p represents the probability value of the sentence occurring, w1 represents the word at the 1st position in the sentence, ∏ represents the product symbol, m represents the number of words in the sentence, w_i denotes the i-th word, m and i are positive integers, and p(w_i w_{i-1}) represents the probability that the words at the i-th and (i-1)-th positions occur together; the existing words are recombined according to this formula, combining two adjacent words to generate a new phrase.
2. The method for extracting bullet screen subjects based on the N-gram model according to claim 1, wherein: the value of N is 2, i.e. each word has a relationship with the word preceding it.
3. The bullet screen theme extraction method based on the N-gram model as claimed in claim 1, characterized in that: the dimension of the word vector is 600,000, i.e., each bullet screen content is represented as a 600,000-dimensional vector, each position corresponds to a word, and the final bullet screen theme representation is obtained.
4. A storage medium having a computer program stored thereon, characterized in that: the computer program, when executed by a processor, implements the method of any of claims 1 to 3.
5. An electronic device comprising a memory and a processor, the memory having stored thereon a computer program that runs on the processor, characterized in that: a processor implementing the method of any one of claims 1 to 3 when executing the computer program.
6. The utility model provides a bullet screen theme extraction system based on N-gram model which characterized in that: the system comprises a data preparation unit, a bullet screen feature construction unit, a data preprocessing unit, an N-gram model representation unit and a segmentation unit, wherein:
the data preparation unit is to: extracting bullet screen data;
the bullet screen characteristic construction unit is used for: extracting characteristics corresponding to words representing characteristics of a live broadcast platform or conforming to contact information characteristics, and adding the characteristics to a custom word bank; adding words without actual meanings into a custom stop word bank;
the data preprocessing unit is used for: removing data with empty 'barrage content' field; removing punctuation marks in the 'bullet screen content' field;
the N-gram model representation unit is used for: the bullet screen content subjected to data preprocessing is represented by an N-gram model, the N-gram model represents that the occurrence probability of a word in a sentence is related to the previous N-1 words, and N is a positive integer;
the segmentation unit is used for: segmenting each bullet screen content into a group of word vectors, segmenting each bullet screen content according to word formation rules in a user-defined word bank, and filtering useless words according to a user-defined stop word bank;
in the N-gram model, the probability of a sentence occurring is given by:

p(w1, w2, …, wm) = p(w1) · ∏_{i=2}^{m} p(wi | wi-1)

where p denotes the probability value of the sentence occurring, w1 denotes the word at the 1st position in the sentence, ∏ denotes the product symbol, m denotes the number of words in the sentence, wi denotes the i-th word, m and i are positive integers, and p(wi | wi-1) denotes the probability of the word at the i-th position occurring given the word at the (i-1)-th position; existing phrases are recombined according to this formula, and two adjacent words are combined to generate a new phrase.
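The bigram sentence probability and adjacent-word recombination of claim 6 can be sketched as follows. The toy corpus, the maximum-likelihood estimates, and the merging threshold are illustrative assumptions, not details from the patent:

```python
from collections import Counter

# Sketch of the bigram (N=2) formulation in claim 6: the sentence probability
# is p(w1) * prod_{i=2..m} p(w_i | w_{i-1}), estimated from word and word-pair
# counts over a (tiny, illustrative) corpus of segmented comments.

corpus = [["new", "anchor", "sings", "well"],
          ["new", "anchor", "plays", "well"]]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((sent[i - 1], sent[i]) for sent in corpus for i in range(1, len(sent)))
total = sum(unigrams.values())

def sentence_probability(sentence):
    """p(w1) * prod p(w_i | w_{i-1}) under maximum-likelihood estimates."""
    p = unigrams[sentence[0]] / total
    for i in range(1, len(sentence)):
        prev, cur = sentence[i - 1], sentence[i]
        p *= bigrams[(prev, cur)] / unigrams[prev]
    return p

def merge_adjacent(sentence, threshold=1.0):
    """Combine adjacent words into a new phrase when the pair always co-occurs."""
    return [f"{a}_{b}" for a, b in zip(sentence, sentence[1:])
            if bigrams[(a, b)] / unigrams[a] >= threshold]

print(sentence_probability(["new", "anchor", "sings", "well"]))  # 0.125
print(merge_adjacent(["new", "anchor", "sings", "well"]))  # ['new_anchor', 'sings_well']
```

The `merge_adjacent` helper corresponds to the claim's "combining two adjacent words to generate a new phrase": pairs that reliably co-occur (here "new anchor") are promoted to a single phrase token.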
7. The system for bullet screen theme extraction based on N-gram model of claim 6, characterized in that: the value of N is 2, i.e. each word has a relationship with the word preceding it.
8. The system for bullet screen theme extraction based on the N-gram model of claim 6, characterized in that: the word vector has 600,000 dimensions, that is, each bullet screen content is represented as a 600,000-dimensional vector in which each position corresponds to a word, yielding the final bullet screen theme representation.
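The segmentation unit of claims 6–8 (custom word bank plus custom stop word bank) can be sketched with a greedy longest-match segmenter. The dictionaries below are illustrative assumptions; a production system would more likely use a full Chinese segmenter such as jieba loaded with user dictionaries:

```python
# Sketch of the segmentation unit: split each comment using a custom word bank
# (longest match first), then drop words listed in a custom stop word bank.

custom_words = {"新主播", "联系方式", "唱歌"}  # custom word bank (illustrative)
stopwords = {"的", "了", "啊"}                 # custom stop word bank (illustrative)
max_len = max(len(w) for w in custom_words)

def segment(text):
    """Greedy longest-match segmentation against the custom word bank."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if n == 1 or piece in custom_words:
                tokens.append(piece)
                i += n
                break
    return [t for t in tokens if t not in stopwords]

print(segment("新主播唱歌的联系方式"))  # ['新主播', '唱歌', '联系方式']
```

Here "新主播唱歌的联系方式" ("new anchor singing + contact information") is split by the word bank and the meaningless particle "的" is filtered by the stop word bank, matching the unit's described behavior.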
CN201710514238.XA 2017-06-29 2017-06-29 Barrage theme extraction method, medium, equipment and system based on N-gram model Active CN109213988B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710514238.XA CN109213988B (en) 2017-06-29 2017-06-29 Barrage theme extraction method, medium, equipment and system based on N-gram model


Publications (2)

Publication Number Publication Date
CN109213988A CN109213988A (en) 2019-01-15
CN109213988B true CN109213988B (en) 2022-06-21

Family

ID=64976355

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710514238.XA Active CN109213988B (en) 2017-06-29 2017-06-29 Barrage theme extraction method, medium, equipment and system based on N-gram model

Country Status (1)

Country Link
CN (1) CN109213988B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109948152B (en) * 2019-03-06 2020-07-17 北京工商大学 L STM-based Chinese text grammar error correction model method
CN110430448B (en) * 2019-07-31 2021-09-03 北京奇艺世纪科技有限公司 Bullet screen processing method and device and electronic equipment
CN113948085B (en) 2021-12-22 2022-03-25 中国科学院自动化研究所 Speech recognition method, system, electronic device and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008098507A1 (en) * 2007-02-13 2008-08-21 Beijing Sogou Technology Development Co., Ltd. An input method of combining words intelligently, input method system and renewing method
CN101930561A (en) * 2010-05-21 2010-12-29 电子科技大学 N-Gram participle model-based reverse neural network junk mail filter device
CN103207921A (en) * 2013-04-28 2013-07-17 福州大学 Method for automatically extracting terms from Chinese electronic document
CN103246644A (en) * 2013-04-02 2013-08-14 亿赞普(北京)科技有限公司 Method and device for processing Internet public opinion information
CN105435453A (en) * 2015-12-22 2016-03-30 网易(杭州)网络有限公司 Bullet screen information processing method, device and system
CN105516820A (en) * 2015-12-10 2016-04-20 腾讯科技(深圳)有限公司 Barrage interaction method and device
CN105760507A (en) * 2016-02-23 2016-07-13 复旦大学 Cross-modal subject correlation modeling method based on deep learning
CN106055538A (en) * 2016-05-26 2016-10-26 达而观信息科技(上海)有限公司 Automatic extraction method for text labels in combination with theme model and semantic analyses


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Effective bullet screen management for live streaming platforms; Li Jinlan; 《有线电视技术》 (Cable Television Technology); 2017-03-31; pp. 105-107 *

Also Published As

Publication number Publication date
CN109213988A (en) 2019-01-15

Similar Documents

Publication Publication Date Title
WO2023065544A1 (en) Intention classification method and apparatus, electronic device, and computer-readable storage medium
CN109801630B (en) Digital conversion method, device, computer equipment and storage medium for voice recognition
CN108804423B (en) Medical text feature extraction and automatic matching method and system
CN112784696B (en) Lip language identification method, device, equipment and storage medium based on image identification
CN110347790B (en) Text duplicate checking method, device and equipment based on attention mechanism and storage medium
CN108733647B (en) Word vector generation method based on Gaussian distribution
CN109213988B (en) Barrage theme extraction method, medium, equipment and system based on N-gram model
CN113268576B (en) Deep learning-based department semantic information extraction method and device
CN110188359B (en) Text entity extraction method
CN111274804A (en) Case information extraction method based on named entity recognition
CN110826298A (en) Statement coding method used in intelligent auxiliary password-fixing system
CN111539228B (en) Vector model training method and device and similarity determining method and device
CN111597807B (en) Word segmentation data set generation method, device, equipment and storage medium thereof
CN111914825A (en) Character recognition method and device and electronic equipment
CN111160026B (en) Model training method and device, and text processing method and device
CN111046660B (en) Method and device for identifying text professional terms
CN111506726A (en) Short text clustering method and device based on part-of-speech coding and computer equipment
CN114398943B (en) Sample enhancement method and device thereof
CN113255331B (en) Text error correction method, device and storage medium
CN114782965A (en) Visual rich document information extraction method, system and medium based on layout relevance
CN110597982A (en) Short text topic clustering algorithm based on word co-occurrence network
CN113868389A (en) Data query method and device based on natural language text and computer equipment
CN111695350B (en) Word segmentation method and word segmentation device for text
CN113239245A (en) Method and device for information query, electronic equipment and readable storage medium
CN113361260A (en) Text processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant