CN107066450B - Instant messaging session segmentation method based on learning

Info

Publication number: CN107066450B
Application number: CN201710391483.6A
Authority: CN
Inventors: 唐积强; 马秀娟; 李传海; 毛洪亮; 吴震; 李焱余; 苏沐冉; 王秀文; 徐小磊; 张露晨; 王海平; 王峰
Original assignee: Beijing Scistor Technologies Co ltd; National Computer Network and Information Security Management Center
Current assignee: Beijing Scistor Technologies Co ltd; National Computer Network and Information Security Management Center
Priority date: 2017-05-27
Filing date: 2017-05-27
Publication date: 2020-04-10
Anticipated expiration: 2037-05-27
Also published as: CN107066450A

Abstract

The invention discloses a learning-based instant messaging session segmentation method, belonging to the field of big data analysis. Instant messaging users are grouped in pairs, and the session call detail records of each group are classified by type and sorted by time. A session is segmented as follows: two adjacent call records R1 and R2 are selected in sequence, and their time interval Δt, text content similarity Δsim and distance value F(R1, R2) are calculated; if F(R1, R2) < F, where F is a threshold, R1 and R2 belong to the same session; otherwise R1 and R2 belong to two different sessions. All user groups, and all types of call detail data within each user group, are processed in parallel with Spark, so that all sessions of all instant messaging users are finally segmented. The invention jointly considers a session time distance influence factor and a session text content distance influence factor, applies differentiated segmentation criteria to different session user groups, and addresses the accuracy and efficiency of segmenting massive instant messaging text sessions in a big data setting.

Description

Instant messaging session segmentation method based on learning

Technical Field

The invention belongs to the field of big data analysis, and relates to an instant messaging session segmentation method based on learning.

Background

As big data technologies mature and spread, more and more enterprises and related organizations try to analyze users based on their data, for example by analyzing the topics discussed in each session from the users' instant messaging data and then profiling and tagging users based on their historical session topics. Normally, a data analyst is faced with the historical call detail records exchanged by the two parties of an instant messaging conversation, and these records do not identify which session each of them belongs to. How to segment sessions from the existing instant messaging call detail data is therefore vital for analyzing the topics of a user's sessions and, in turn, for analyzing the user.

Instant messaging session segmentation has the following features and challenges: (1) instant messaging texts are generally ultra-short, so efficient and accurate segmentation is difficult to achieve using text classification or clustering on the session text content alone; (2) instant messaging sessions are time-sensitive: generally, the two parties discuss the same topic within a continuous time period, so the sending time can be used to assist session segmentation; (3) because of differences in personality, habits, identity and the like, reply intervals differ between instant messaging sessions, and even the sessions of the same communication user group can differ from one another, so sessions cannot simply be segmented with a fixed time interval threshold.

Disclosure of Invention

The invention provides an instant messaging session segmentation method based on learning, which is used for carrying out session segmentation on massive instant messaging detail data and providing data support for session theme analysis and user analysis based on session content.

The method comprises the following specific steps:

Step one, for all instant messaging session users, dividing every two users who have communicated with each other into one group;

Step two, for a given communication session user group, recording and classifying the original session call detail data;

call detail data R = (RS, RR, T, C);

RS denotes the session initiator (Record Sender), RR denotes the session receiver (Record Receiver), T denotes the sending time of call record R, and C is the text content of call record R;

Step three, sorting each type of session call detail data by sending time;

Step four, for each type of sorted session call detail data, selecting two adjacent call records R1 and R2 and calculating the time interval Δt between them;

Δt = F2(T2 - T1) = T2 - T1; T2 > T1

T1 is the sending time of call record R1; T2 is the sending time of call record R2;

Step five, calculating the text content similarity Δsim of the two adjacent call records R1 and R2;

The specific steps are as follows:

Step 501, obtaining the text content C1 of call record R1 and the text content C2 of call record R2;

Step 502, performing word segmentation and stop-word removal on the text contents C1 and C2, and using word2vec to obtain a word set and the corresponding word feature vectors.

The text content C1 yields wc1 words; the text content C2 yields wc2 words;

Step 503, calculating the text content distance F3 between the adjacent call records R1 and R2;

sim(wc1_i, wc2_j) is calculated using the cosine method; wc1_i denotes the i-th word of text content C1; wc2_j denotes the j-th word of text content C2.

Step 504, calculating the text content similarity Δsim of call record R1 and call record R2 from the text content distance F3;

Δsim = F3(C1, C2)

Step six, calculating the distance value F(R1, R2) of the adjacent call records R1 and R2 using the call record distance algorithm;

F(R1, R2) = α × Δt + β × Δsim

α is the session time distance influence factor, and β is the session text content distance influence factor;

Step seven, judging whether the distance value F(R1, R2) is smaller than a threshold F; if so, call records R1 and R2 belong to the same session; otherwise, R1 and R2 belong to two different sessions;

R1 and R2 belonging to two different sessions means that R1 is the last message of the previous session and R2 is the first message of the new session.

Step eight, for all types of session call detail data of the communication session user group, segmenting all types in parallel through Spark computation;

The session distances of each communication session user group have the following characteristic: when adjacent call records belong to the same session, their distance values are concentrated; when they do not belong to the same session, their distance values are sparsely distributed.

Step nine, for all grouped instant messaging session users, executing steps two to eight for all communication session user groups in parallel through Spark.

The invention has the advantages that:

1) The learning-based instant messaging session segmentation method applies differentiated session segmentation criteria to different session user groups.

2) The learning-based instant messaging session segmentation method jointly considers the session time distance influence factor and the session text content distance influence factor, and addresses the accuracy and efficiency of segmenting massive instant messaging text sessions in a big data setting.

Drawings

FIG. 1 is a schematic diagram illustrating a learning-based instant messaging session segmentation method according to the present invention;

FIG. 2 is a flow chart of an instant messaging session segmentation method based on learning according to the present invention;

FIG. 3 is a flowchart of the method for calculating the text content similarity of two adjacent call records R1 and R2 according to the invention.

Detailed Description of Embodiments

The following describes in detail a specific embodiment of the present invention with reference to the drawings.

The invention provides a learning-based instant messaging session segmentation method that combines two factors: the similarity of short session texts and the time interval between call records. As shown in FIG. 1, all instant messaging session users are divided into groups of two; the session call detail data (RS, RR, T, C) of each session user group are classified and sorted by time; and all user groups, together with all types of call detail data of each user group, are processed in parallel with Spark.

The specific session segmentation method is as follows. Two adjacent call records R1 and R2 are selected in sequence, and the time interval Δt and the text content similarity Δsim of the two records are calculated. A multivariate linear function fitting model is used to obtain the session time distance influence factor α and the session text content distance influence factor β, from which the distance value F(R1, R2) of the adjacent call records R1 and R2 is calculated. F(R1, R2) is then compared with the distance threshold F to judge whether the two records belong to the same session. If they do, R1 and R2 are messages of the same session; otherwise, R1 is the last message of the previous session and R2 is the first message of a new session.

As shown in fig. 2, the specific steps are as follows:

All communication participants include message senders and message receivers; every two users who communicate are divided into one group according to their communication contact, and duplicate groups are removed. Communication contact includes telephone communication, mail communication, WeChat communication, short message communication and the like.

call detail data R = (RS, RR, T, C);

All communication contacts of the communication session user group are divided into different types according to the communication mode, such as WeChat communication, short message communication and the like.
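The grouping, classification and time ordering described above can be sketched as follows. This is a minimal single-machine illustration, not the Spark implementation of the invention. The class name `CallRecord`, the function `group_and_sort` and the `channel` field (standing for the communication type) are assumptions introduced for illustration; the source defines only R = (RS, RR, T, C).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CallRecord:
    rs: str       # RS: sender ID
    rr: str       # RR: receiver ID
    t: float      # T: sending time (unit assumed to be seconds)
    c: str        # C: text content
    channel: str  # hypothetical field: communication type, e.g. "sms"


def group_and_sort(records):
    """Group records by unordered user pair and type, then sort by time."""
    groups = defaultdict(list)
    for r in records:
        # Sorting the two IDs makes (A, B) and (B, A) the same user group.
        pair = tuple(sorted((r.rs, r.rr)))
        groups[(pair, r.channel)].append(r)
    for key in groups:
        groups[key].sort(key=lambda r: r.t)
    return dict(groups)


records = [
    CallRecord("alice", "bob", 30.0, "see you", "sms"),
    CallRecord("bob", "alice", 10.0, "hello", "sms"),
    CallRecord("alice", "bob", 20.0, "hi", "wechat"),
]
grouped = group_and_sort(records)
```

Each value of `grouped` is one time-ordered list of records for a single user group and communication type, which is the unit on which adjacent records are later compared.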

Δt = F2(T2 - T1) = T2 - T1; T2 > T1

Call detail data R1 = (RS1, RR1, T1, C1); call detail data R2 = (RS2, RR2, T2, C2);

RS1, RR1, RS2 and RR2 are user IDs of the communication participant group; T1 is the sending time of call record R1; T2 is the sending time of call record R2; C1 is the text content of call record R1; C2 is the text content of call record R2;

The text content similarity algorithm is designed for instant messaging text session segmentation and is based on word2vec and the cosine distance. As shown in FIG. 3, the specific steps are as follows:

A word set and the corresponding word feature vectors are obtained using word2vec;

The text content C1 yields wc1 words; the text content C2 yields wc2 words;

Δsim = F3(C1, C2)
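The similarity computation can be illustrated with the sketch below. The exact formula for F3 is not reproduced in this text, so the sketch assumes that F3 is the mean of sim(wc1_i, wc2_j) over all word pairs; that averaging is an assumption, not the formula of the invention. The `TOY_VECTORS` table is a hypothetical stand-in for word vectors produced by a trained word2vec model, and the inputs are assumed to be already segmented with stop words removed.

```python
import math

# Hypothetical toy embeddings standing in for trained word2vec vectors.
TOY_VECTORS = {
    "lunch": [1.0, 0.0],
    "dinner": [0.8, 0.6],
    "meeting": [0.0, 1.0],
}


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def content_similarity(words1, words2, vectors):
    """Assumed F3: mean cosine similarity over all word pairs (i, j)."""
    pairs = [
        (vectors[a], vectors[b])
        for a in words1
        for b in words2
        if a in vectors and b in vectors  # skip out-of-vocabulary words
    ]
    if not pairs:
        return 0.0
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


delta_sim = content_similarity(["lunch"], ["dinner", "meeting"], TOY_VECTORS)
```

In practice the vectors would come from a word2vec model trained on a suitable corpus, and word segmentation would be done by a tokenizer appropriate to the language of the messages.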

F(R1, R2) = α × Δt + β × Δsim

α is the session time distance influence factor, and β is the session text content distance influence factor;
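The distance formula, combined with the threshold comparison of step seven (same session when F(R1, R2) < F), can be sketched as a single pass over a time-ordered record list. The function names and the numeric values of α, β and the threshold below are hypothetical and chosen only to make the example readable; in the method they are learned per user group.

```python
def record_distance(t1, t2, delta_sim, alpha, beta):
    """F(R1, R2) = alpha * delta_t + beta * delta_sim, with delta_t = T2 - T1."""
    delta_t = t2 - t1
    return alpha * delta_t + beta * delta_sim


def segment(times, sims, alpha, beta, threshold):
    """Split record indices into sessions.

    times: sending times sorted in ascending order.
    sims:  sims[i] is delta_sim between record i and record i + 1.
    """
    if not times:
        return []
    sessions = [[0]]
    for i in range(len(times) - 1):
        f = record_distance(times[i], times[i + 1], sims[i], alpha, beta)
        if f < threshold:
            sessions[-1].append(i + 1)  # same session as record i
        else:
            sessions.append([i + 1])    # record i + 1 opens a new session
    return sessions


# Hypothetical parameters and data for illustration only.
sessions = segment(
    times=[0, 20, 30, 4000, 4010],
    sims=[0.9, 0.8, 0.1, 0.7],
    alpha=0.01,
    beta=-1.0,
    threshold=5.0,
)
```

Here the long gap between the third and fourth records pushes their distance above the threshold, so the fourth record becomes the first message of a new session.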

α and β are learned as follows: a batch of call detail records of a given session user group is sampled and sorted by time; each pair of adjacent call records is manually labeled according to whether the two records belong to one session, with the distance value marked as 1 if they belong to the same session and -1 if they do not; function fitting is then performed on the labeled sample data, mainly using a multivariate linear function fitting model, to obtain the values of α and β.
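One way to read the fitting step is as an ordinary least-squares fit of the labels against Δt and Δsim. The sketch below assumes a model without an intercept, label ≈ α × Δt + β × Δsim, and solves the 2 × 2 normal equations directly. The absence of an intercept, the function name `fit_alpha_beta` and the sample values are assumptions; the text does not spell out the fitting procedure or how the fitted values relate to the threshold comparison.

```python
def fit_alpha_beta(samples):
    """Least-squares fit of label ~ alpha * delta_t + beta * delta_sim.

    samples: list of (delta_t, delta_sim, label), label being +1 or -1.
    Solves the normal equations of the two-variable linear model.
    """
    s11 = sum(dt * dt for dt, _, _ in samples)
    s12 = sum(dt * ds for dt, ds, _ in samples)
    s22 = sum(ds * ds for _, ds, _ in samples)
    b1 = sum(dt * y for dt, _, y in samples)
    b2 = sum(ds * y for _, ds, y in samples)
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("samples do not determine alpha and beta uniquely")
    alpha = (b1 * s22 - b2 * s12) / det
    beta = (s11 * b2 - s12 * b1) / det
    return alpha, beta


# Hypothetical manually labeled adjacent-record pairs:
# (delta_t in seconds, delta_sim, label)
labeled = [
    (10.0, 0.6, 1),
    (50.0, 0.0, -1),
    (30.0, 0.8, 1),
    (60.0, 0.1, -1),
]
alpha, beta = fit_alpha_beta(labeled)
```

Because each user group is fitted on its own labeled sample, different groups end up with different α and β, which is what allows the segmentation criterion to differ between groups.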

The session distances of each communication session user group have the following characteristic: when adjacent call records belong to the same session, their distance values are concentrated; when they do not belong to the same session, their distance values are sparsely distributed. The solution is computed by analyzing the distance values corresponding to different adjacent call records; the specific algorithm is an extreme-value algorithm that solves for the inflection point.
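The text does not give the inflection-point algorithm itself. If the passage is read as choosing the threshold F for a user group, one simple heuristic is to sort the observed distance values and place the threshold in the middle of the largest jump between consecutive values, on the assumption that this jump separates the concentrated same-session values from the sparse cross-session ones. The function `threshold_from_gap` and its data are hypothetical, and this largest-gap rule is a stand-in, not the algorithm of the invention.

```python
def threshold_from_gap(distances):
    """Pick a threshold at the midpoint of the largest gap in sorted values."""
    values = sorted(distances)
    if len(values) < 2:
        raise ValueError("need at least two distance values")
    gaps = [b - a for a, b in zip(values, values[1:])]
    # Index of the largest jump between consecutive sorted values.
    k = max(range(len(gaps)), key=gaps.__getitem__)
    return (values[k] + values[k + 1]) / 2.0


# Hypothetical distance values of adjacent records in one user group.
threshold = threshold_from_gap([0.2, 0.3, 0.25, 0.4, 9.0, 11.0, 14.0])
```

With these values the small distances cluster below 0.4 and the next value is 9.0, so the threshold falls between the two clusters.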

It is to be noted and understood that various modifications and improvements can be made to the invention described in detail above without departing from the spirit and scope of the invention as claimed in the appended claims. Accordingly, the scope of the claimed subject matter is not limited by any of the specific exemplary teachings provided.

Claims

1. A learning-based instant messaging session segmentation method is characterized by comprising the following specific steps:

Δt = F2(T2 - T1) = T2 - T1; T2 > T1

T1 is the sending time of call record R1; T2 is the sending time of call record R2; F2 is the sending time distance between the adjacent call records R1 and R2;

Δsim = F3(C1, C2)

C1 is the text content of call record R1, and C2 is the text content of call record R2; F3 is the text content distance between the adjacent call records R1 and R2;

F(R1, R2) = α × Δt + β × Δsim

call records R1 and R2 belonging to two different sessions means that R1 is the last message of the previous session and R2 is the first message of the new session;

2. The learning-based instant messaging session segmentation method according to claim 1, wherein in step two, the call detail data R = (RS, RR, T, C);

RS denotes the session initiator (Record Sender), RR denotes the session receiver (Record Receiver), T denotes the sending time of call record R, and C denotes the text content of call record R.

3. The learning-based instant messaging session segmentation method according to claim 1, wherein the specific steps of step five are as follows:

Step 502, performing word segmentation and stop-word removal on the text contents C1 and C2 to obtain a word set;

the text content C1 yields wc1 words; the text content C2 yields wc2 words;

sim(wc1_i, wc2_j) is calculated using the cosine method; wc1_i denotes the i-th word of text content C1; wc2_j denotes the j-th word of text content C2;

Δsim = F3(C1, C2).

4. The learning-based instant messaging session segmentation method according to claim 1, wherein the session distances of each communication session user group have the following characteristic: when adjacent call records belong to the same session, their distance values are concentrated; when they do not belong to the same session, their distance values are sparsely distributed.