CN107066450B - Instant messaging session segmentation method based on learning - Google Patents

Publication number
CN107066450B
Authority
CN
China
Prior art keywords
session
ticket
text content
distance
conversation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710391483.6A
Other languages
Chinese (zh)
Other versions
CN107066450A (en)
Inventor
唐积强
马秀娟
李传海
毛洪亮
吴震
李焱余
苏沐冉
王秀文
徐小磊
张露晨
王海平
王峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Scistor Technologies Co ltd
National Computer Network and Information Security Management Center
Original Assignee
Beijing Scistor Technologies Co ltd
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Scistor Technologies Co ltd, National Computer Network and Information Security Management Center
Priority to CN201710391483.6A
Publication of CN107066450A
Application granted
Publication of CN107066450B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Telephonic Communication Services (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a learning-based instant messaging session segmentation method, belonging to the field of big data analysis. Instant messaging users are grouped in pairs, and the message detail records (call tickets) of each group are classified and sorted by time. Sessions are segmented as follows: two adjacent records R1 and R2 are selected in order, and the time interval Δt, the text content similarity Δsim, and the distance value F(R1, R2) are calculated; if F(R1, R2) < F, where F is a threshold, records R1 and R2 belong to the same session; otherwise, R1 and R2 belong to two different sessions. All user groups, and all types of detail record data within each user group, are processed in parallel with Spark, so that all sessions of all instant messaging users are finally segmented. The invention jointly considers the session time distance influence factor and the session text content distance influence factor, allows each session user group to have its own segmentation criterion, and addresses both the accuracy and the efficiency of segmenting massive instant messaging text sessions in a big data setting.

Description

Instant messaging session segmentation method based on learning
Technical Field
The invention belongs to the field of big data analysis, and relates to an instant messaging session segmentation method based on learning.
Background
As big data technologies mature and spread, more and more enterprises and related organizations try to analyze users based on various kinds of user data, for example by analyzing the topics discussed in each session from users' instant messaging data, and then profiling and tagging users based on their historical session topics. In general, a data analyst is faced with the historical session detail record data of the two parties to an instant messaging exchange, and this detail data does not explicitly identify the session to which each record belongs. How to segment sessions from the existing instant messaging detail record data is therefore vital for analyzing the topics of users' session content and, in turn, for analyzing the users.
Instant messaging session segmentation has the following features and challenges: (1) instant messaging texts are generally ultra-short, so efficient and accurate segmentation is difficult to achieve using only text classification or clustering techniques based on session text content; (2) instant messaging sessions are time-sensitive: generally, the two parties discuss the same topic within a given continuous time period, so the sending time can be used to assist session segmentation; (3) because of differences in personality, habits, identity, and the like, reply intervals differ between instant messaging sessions, and they can vary even among sessions of the same user group, so sessions cannot be segmented simply with a fixed time-interval threshold.
Disclosure of Invention
The invention provides an instant messaging session segmentation method based on learning, which is used for carrying out session segmentation on massive instant messaging detail data and providing data support for session theme analysis and user analysis based on session content.
The method comprises the following specific steps:
Step one: for all instant messaging users, place every two users who have communication contact with each other into one group;
Step two: for a given communication session user group, record and classify the original session detail record data;
detail record data R = (RS, RR, T, C);
RS denotes the session initiator (Record Sender), RR denotes the communication session receiver (Record Receiver), T denotes the sending time of record R, and C is the text content of record R;
Step three: sort each type of session detail record data by sending time;
Step four: from each type of sorted session detail record data, select two adjacent records R1 and R2 and calculate the time interval Δt between them;
Δt = F2(T2 - T1) = T2 - T1, with T2 > T1
T1 is the sending time of record R1; T2 is the sending time of record R2;
Step five: calculate the text content similarity Δsim of the two adjacent records R1 and R2;
The specific steps are as follows:
Step 501: obtain the text content C1 of record R1 and the text content C2 of record R2, using word2vec;
Step 502: perform word segmentation and stop-word removal on text contents C1 and C2 to obtain word sets.
Text content C1 yields wc1 words; text content C2 yields wc2 words;
Step 503: calculate the text content distance F3 between the adjacent records R1 and R2;
[Equation image GDA0002390441680000021: formula for F3(C1, C2) in terms of sim(wc1_i, wc2_j)]
sim(wc1_i, wc2_j) is calculated using cosine similarity; wc1_i denotes the i-th word of text content C1; wc2_j denotes the j-th word of text content C2.
Step 504: use the text content distance F3 to obtain the text content similarity Δsim of record R1 and record R2;
Δsim = F3(C1, C2)
Step six: calculate the distance value F(R1, R2) of the adjacent records R1 and R2 using the record distance algorithm;
F(R1, R2) = α × Δt + β × Δsim
α is the session time distance influence factor, and β is the session text content distance influence factor;
Step seven: determine whether the distance value F(R1, R2) is smaller than the threshold F; if so, records R1 and R2 belong to the same session; otherwise, records R1 and R2 belong to two different sessions;
when records R1 and R2 belong to two different sessions, R1 is the last message of the previous session and R2 is the first message of the new session.
Step eight: for all types of session detail record data of the communication session user group, segment all types in parallel using Spark;
The session distances of each communication session user group have the following characteristic: if adjacent records belong to the same session, their distance values are concentrated; if they do not belong to the same session, their distance values are sparsely distributed.
Step nine: for all grouped instant messaging users, run steps two to eight for all communication session user groups in parallel using Spark.
The invention has the advantages that:
1) The learning-based instant messaging session segmentation method allows different session user groups to have differentiated session segmentation criteria.
2) The learning-based instant messaging session segmentation method jointly considers the session time distance influence factor and the session text content distance influence factor, and addresses both the accuracy and the efficiency of segmenting massive instant messaging text sessions in a big data setting.
Drawings
FIG. 1 is a schematic diagram illustrating a learning-based instant messaging session segmentation method according to the present invention;
FIG. 2 is a flow chart of an instant messaging session segmentation method based on learning according to the present invention;
FIG. 3 is a flowchart of a method for calculating the similarity of text contents recorded by two adjacent tickets R1 and R2 according to the invention.
Detailed Description of Embodiments
The following describes in detail a specific embodiment of the present invention with reference to the drawings.
The invention provides a learning-based instant messaging session segmentation method that combines two factors: the similarity of short session text content and the time interval between records. As shown in FIG. 1, all instant messaging users are grouped in pairs; the session detail record data (RS, RR, T, C) of each session user group is classified and sorted by time; and all user groups, and all types of detail record data within each user group, are processed in parallel with Spark. The specific segmentation method is as follows: select two adjacent records R1 and R2 in order, and calculate the time interval Δt and the text content similarity Δsim of the two records; use a multivariate linear function fitting model to obtain the session time distance influence factor α and the session text content distance influence factor β, and from these calculate the distance value F(R1, R2) of the adjacent records R1 and R2; then compare the distance value F(R1, R2) with the threshold F. If F(R1, R2) < F, records R1 and R2 belong to the same session; otherwise they belong to two different sessions, that is, R1 is the last message of the previous session and R2 is the first message of the new session.
As shown in fig. 2, the specific steps are as follows:
Step one: for all instant messaging users, place every two users who have communication contact with each other into one group;
All communication participants, both message senders and message receivers, are considered; the two users of each communication contact are placed in one group, and duplicate groups are removed. Communication contact includes telephone, e-mail, WeChat, short message (SMS) communication, and the like.
Step two: for a given communication session user group, record and classify the original session detail record data;
detail record data R = (RS, RR, T, C);
RS denotes the session initiator (Record Sender), RR denotes the communication session receiver (Record Receiver), T denotes the sending time of record R, and C is the text content of record R;
All communication contacts of the communication session user group are divided into types according to the communication mode, such as WeChat communication, short message communication, and so on.
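Steps one and two can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patent's implementation: the `Record` class, the `channel` field used for the communication type, the `group_records` helper, and the sample data are all hypothetical names introduced for illustration, and T is assumed to be a numeric timestamp.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Tuple


@dataclass(frozen=True)
class Record:
    """One detail record R = (RS, RR, T, C), plus a communication-type label."""
    rs: str        # RS: sender user ID
    rr: str        # RR: receiver user ID
    t: float       # T: sending time (assumed: seconds since some epoch)
    c: str         # C: text content
    channel: str   # hypothetical field: communication type, e.g. "wechat", "sms"


GroupKey = Tuple[FrozenSet[str], str]


def group_records(records: List[Record]) -> Dict[GroupKey, List[Record]]:
    """Group records by unordered user pair and by communication type."""
    groups: Dict[GroupKey, List[Record]] = defaultdict(list)
    for r in records:
        # frozenset makes (A -> B) and (B -> A) fall into the same user group,
        # which also deduplicates the pair.
        groups[(frozenset((r.rs, r.rr)), r.channel)].append(r)
    return dict(groups)


records = [
    Record("alice", "bob", 100.0, "lunch today?", "wechat"),
    Record("bob", "alice", 130.0, "sure, noon", "wechat"),
    Record("alice", "bob", 900.0, "meeting moved", "sms"),
    Record("carol", "alice", 200.0, "hello", "wechat"),
]
groups = group_records(records)
```

Keying on an unordered pair keeps both directions of a conversation together, so later steps see one time-ordered stream per user group and communication type.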
Step three: sort each type of session detail record data by sending time;
Step four: from each type of sorted session detail record data, select two adjacent records R1 and R2 and calculate the time interval Δt between them;
Δt = F2(T2 - T1) = T2 - T1, with T2 > T1
Record R1 = (RS1, RR1, T1, C1); record R2 = (RS2, RR2, T2, C2);
RS1, RR1, RS2, and RR2 are user IDs in the communication participant group; T1 is the sending time of record R1; T2 is the sending time of record R2; C1 is the text content of record R1; C2 is the text content of record R2;
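Steps three and four amount to a sort followed by a pass over adjacent pairs. The sketch below assumes records are plain tuples (RS, RR, T, C) with T as a numeric timestamp in seconds; the `adjacent_intervals` helper and the sample values are hypothetical.

```python
from typing import List, Tuple

# Each record is (RS, RR, T, C); T is assumed to be a timestamp in seconds.
Record = Tuple[str, str, float, str]


def adjacent_intervals(records: List[Record]) -> List[Tuple[Record, Record, float]]:
    """Sort by sending time and return (R1, R2, delta_t) for each adjacent pair."""
    ordered = sorted(records, key=lambda r: r[2])
    # zip(ordered, ordered[1:]) walks consecutive pairs (R1, R2) with T2 >= T1.
    return [(r1, r2, r2[2] - r1[2]) for r1, r2 in zip(ordered, ordered[1:])]


records = [
    ("bob", "alice", 130.0, "sure, noon"),
    ("alice", "bob", 100.0, "lunch today?"),
    ("alice", "bob", 4000.0, "did you see the report?"),
]
pairs = adjacent_intervals(records)
deltas = [dt for _, _, dt in pairs]
```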
Step five: calculate the text content similarity Δsim of the two adjacent records R1 and R2;
The text content distance/similarity algorithm is designed for instant messaging text session segmentation and is built on word2vec and cosine distance. As shown in FIG. 3, the specific steps are as follows:
Step 501: obtain the text content C1 of record R1 and the text content C2 of record R2, using word2vec;
a word set and the corresponding word feature vectors are obtained using word2vec;
Step 502: perform word segmentation and stop-word removal on text contents C1 and C2 to obtain word sets.
Text content C1 yields wc1 words; text content C2 yields wc2 words;
Step 503: calculate the text content distance F3 between the adjacent records R1 and R2;
[Equation image GDA0002390441680000041: formula for F3(C1, C2) in terms of sim(wc1_i, wc2_j)]
sim(wc1_i, wc2_j) is calculated using cosine similarity; wc1_i denotes the i-th word of text content C1; wc2_j denotes the j-th word of text content C2.
Step 504: use the text content distance F3 to obtain the text content similarity Δsim of record R1 and record R2;
Δsim = F3(C1, C2)
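The word-level cosine computation in step 503 can be illustrated as follows. The exact formula for F3 is not available in this text, so the sketch makes a loud assumption: it simply averages sim(wc1_i, wc2_j) over all word pairs, and it does not decide whether F3 converts that score into a distance. The toy two-dimensional vectors stand in for word2vec output, and the function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def mean_pairwise_cosine(words1: List[str], words2: List[str],
                         vectors: Dict[str, Vector]) -> float:
    """Average sim(wc1_i, wc2_j) over all word pairs that have vectors.

    ASSUMPTION: averaging is a stand-in aggregation; the source formula for
    F3 is not reproduced in the text and may differ.
    """
    sims = [cosine(vectors[w1], vectors[w2])
            for w1 in words1 if w1 in vectors
            for w2 in words2 if w2 in vectors]
    return sum(sims) / len(sims) if sims else 0.0


# Toy 2-d vectors standing in for word2vec feature vectors (hypothetical values).
toy_vectors = {
    "lunch": (1.0, 0.0),
    "noon": (1.0, 0.0),
    "report": (0.0, 1.0),
}
same_topic = mean_pairwise_cosine(["lunch"], ["noon"], toy_vectors)
other_topic = mean_pairwise_cosine(["lunch"], ["report"], toy_vectors)
```

In practice the word lists would come from step 502 (word segmentation and stop-word removal) and the vectors from a trained word2vec model.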
Step six: calculate the distance value F(R1, R2) of the adjacent records R1 and R2 using the record distance algorithm;
F(R1, R2) = α × Δt + β × Δsim
α is the session time distance influence factor, and β is the session text content distance influence factor;
The model for learning α and β is as follows: a batch of detail record data from a given session user group is sampled and sorted by time; each pair of adjacent records is manually labeled according to whether the two records belong to the same session; if they belong to the same session, the distance value is labeled 1, and if they do not, the distance value is labeled -1; function fitting is then performed on the labeled sample data, mainly using a multivariate linear function fitting model, to obtain the values of α and β.
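One way to read "multivariate linear function fitting" is an ordinary least-squares fit of the labels against Δt and Δsim with no intercept, matching the form F = α × Δt + β × Δsim. The sketch below assumes that reading; the `fit_alpha_beta` name, the unit of Δt (hours), and the labeled samples are hypothetical, and the source does not state which fitting procedure is actually used.

```python
from typing import List, Tuple


def fit_alpha_beta(samples: List[Tuple[float, float, int]]) -> Tuple[float, float]:
    """Least-squares fit of label ~ alpha * dt + beta * dsim (no intercept).

    samples: (delta_t, delta_sim, label), label +1 (same session) or -1.
    Solves the 2x2 normal equations with Cramer's rule.
    """
    stt = sum(t * t for t, _, _ in samples)
    sss = sum(s * s for _, s, _ in samples)
    sts = sum(t * s for t, s, _ in samples)
    sty = sum(t * y for t, _, y in samples)
    ssy = sum(s * y for _, s, y in samples)
    det = stt * sss - sts * sts
    if abs(det) < 1e-12:
        raise ValueError("features are collinear; cannot solve for alpha and beta")
    alpha = (sty * sss - ssy * sts) / det
    beta = (stt * ssy - sts * sty) / det
    return alpha, beta


# Hypothetical manually labeled adjacent pairs: (delta_t in hours, delta_sim, label).
labeled = [
    (0.0, 1.0, +1),
    (1.0, 0.0, -1),
    (2.0, 1.0, -1),
    (1.0, 2.0, +1),
]
alpha, beta = fit_alpha_beta(labeled)
```

With labels of +1 for same-session pairs, larger fitted values indicate same-session pairs, so the orientation of the fitted score relative to the threshold comparison in step seven has to be checked when the two are combined.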
Step seven: determine whether the distance value F(R1, R2) is smaller than the threshold F; if so, records R1 and R2 belong to the same session; otherwise, records R1 and R2 belong to two different sessions;
when records R1 and R2 belong to two different sessions, R1 is the last message of the previous session and R2 is the first message of the new session.
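Steps four to seven for a single user group can be sketched as one pass over the time-sorted records. This is a minimal sketch under stated assumptions: the content term is supplied as a callable, and the Jaccard distance on whitespace tokens used in the example is only a hypothetical stand-in for F3; the values of α, β, and the threshold are invented for the demonstration.

```python
from typing import Callable, List, Tuple

Record = Tuple[str, str, float, str]  # (RS, RR, T, C)


def segment_sessions(records: List[Record],
                     content_term: Callable[[str, str], float],
                     alpha: float, beta: float,
                     threshold: float) -> List[List[Record]]:
    """Split time-sorted records into sessions using F = alpha*dt + beta*dsim."""
    ordered = sorted(records, key=lambda r: r[2])
    sessions: List[List[Record]] = []
    for rec in ordered:
        if sessions:
            prev = sessions[-1][-1]                # R1: last record so far
            dt = rec[2] - prev[2]                  # delta_t = T2 - T1
            dsim = content_term(prev[3], rec[3])   # stand-in for F3(C1, C2)
            if alpha * dt + beta * dsim < threshold:
                sessions[-1].append(rec)           # same session as R1
                continue
        sessions.append([rec])                     # R2 opens a new session
    return sessions


def jaccard_distance(c1: str, c2: str) -> float:
    """Hypothetical stand-in content term: 1 - |A ∩ B| / |A ∪ B| on tokens."""
    a, b = set(c1.lower().split()), set(c2.lower().split())
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


records = [
    ("alice", "bob", 0.0, "lunch at noon"),
    ("bob", "alice", 60.0, "noon lunch works"),
    ("alice", "bob", 7200.0, "send the report"),
]
# alpha converts seconds to hours; beta and threshold are illustrative values.
sessions = segment_sessions(records, jaccard_distance,
                            alpha=1 / 3600, beta=1.0, threshold=1.5)
session_sizes = [len(s) for s in sessions]
```

Here the first two messages are close in time and share words, so their distance stays under the threshold; the third arrives about two hours later with unrelated content and starts a new session.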
Step eight: for all types of session detail record data of the communication session user group, segment all types in parallel using Spark;
The session distances of each communication session user group have the following characteristic: if adjacent records belong to the same session, their distance values are concentrated; if they do not belong to the same session, their distance values are sparsely distributed. The threshold is solved for by analyzing the adjacent records corresponding to the different distance values; the specific algorithm is an extremum-solving algorithm that finds the inflection point.
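The text names only "an extremum-solving algorithm that finds the inflection point" and gives no further detail. As a plainly different, simpler stand-in, the sketch below uses a largest-gap heuristic: sort the observed distance values and place the threshold in the middle of the widest gap, on the assumption that same-session distances cluster tightly while cross-session distances are spread out. The function name and the sample values are hypothetical.

```python
from typing import List


def threshold_from_largest_gap(distances: List[float]) -> float:
    """Return the midpoint of the widest gap between consecutive sorted values."""
    if len(distances) < 2:
        raise ValueError("need at least two distance values")
    ordered = sorted(distances)
    # (gap width, lower value, upper value) for each consecutive pair
    gaps = [(b - a, a, b) for a, b in zip(ordered, ordered[1:])]
    _, low, high = max(gaps, key=lambda g: g[0])
    return (low + high) / 2.0


# Hypothetical F(R1, R2) values for one user group: a dense cluster plus outliers.
distances = [0.2, 0.3, 0.25, 0.35, 2.9, 4.1, 0.28]
f_threshold = threshold_from_largest_gap(distances)
```

Because the threshold is derived from each group's own distance values, different user groups end up with different segmentation criteria, which is the differentiation the method aims for.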
Step nine: for all grouped instant messaging users, run steps two to eight for all communication session user groups in parallel using Spark.
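Steps eight and nine rely on the fact that each (user group, communication type) stream can be segmented independently of every other. The sketch below illustrates that independence with Python's standard `concurrent.futures` thread pool as a stand-in; it is not Spark code, and the `segment_all` and `split_by_gap` helpers are hypothetical. The per-group segmenter is a deliberately simple time-gap placeholder for steps two to seven.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

Record = Tuple[str, str, float, str]   # (RS, RR, T, C)
GroupKey = Tuple[str, str, str]        # (user_a, user_b, communication type)
Segmenter = Callable[[List[Record]], List[List[Record]]]


def split_by_gap(records: List[Record], max_gap: float = 600.0) -> List[List[Record]]:
    """Placeholder per-group segmenter: new session when delta_t >= max_gap."""
    ordered = sorted(records, key=lambda r: r[2])
    sessions: List[List[Record]] = []
    for rec in ordered:
        if sessions and rec[2] - sessions[-1][-1][2] < max_gap:
            sessions[-1].append(rec)
        else:
            sessions.append([rec])
    return sessions


def segment_all(groups: Dict[GroupKey, List[Record]],
                segmenter: Segmenter) -> Dict[GroupKey, List[List[Record]]]:
    """Run the segmenter on every group independently, in parallel."""
    keys = list(groups)
    with ThreadPoolExecutor() as pool:
        # pool.map preserves input order, so results line up with keys.
        results = list(pool.map(lambda k: segmenter(groups[k]), keys))
    return dict(zip(keys, results))


groups = {
    ("alice", "bob", "wechat"): [
        ("alice", "bob", 0.0, "hi"),
        ("bob", "alice", 30.0, "hello"),
        ("alice", "bob", 5000.0, "report?"),
    ],
    ("alice", "carol", "sms"): [("carol", "alice", 10.0, "ping")],
}
segmented = segment_all(groups, split_by_gap)
session_counts = {k: len(v) for k, v in segmented.items()}
```

On a cluster, the same shape would typically be expressed as a keyed grouping followed by a per-group transformation, with the segmenter from steps two to seven applied to each group.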
It is to be noted and understood that various modifications and improvements can be made to the invention described in detail above without departing from the spirit and scope of the invention as claimed in the appended claims. Accordingly, the scope of the claimed subject matter is not limited by any of the specific exemplary teachings provided.

Claims (4)

1. A learning-based instant messaging session segmentation method, characterized by comprising the following specific steps:
Step one: for all instant messaging users, place every two users who have communication contact with each other into one group;
Step two: for a given communication session user group, record and classify the original session detail record data;
Step three: sort each type of session detail record data by sending time;
Step four: from each type of sorted session detail record data, select two adjacent records R1 and R2 and calculate the time interval Δt between them;
Δt = F2(T2 - T1) = T2 - T1, with T2 > T1
T1 is the sending time of record R1; T2 is the sending time of record R2; F2 is the sending time distance between the adjacent records R1 and R2;
Step five: calculate the text content similarity Δsim of the two adjacent records R1 and R2;
Δsim = F3(C1, C2)
C1 is the text content of record R1, and C2 is the text content of record R2; F3 is the text content distance between the adjacent records R1 and R2;
Step six: calculate the distance value F(R1, R2) of the adjacent records R1 and R2 using the record distance algorithm;
F(R1, R2) = α × Δt + β × Δsim
α is the session time distance influence factor, and β is the session text content distance influence factor;
Step seven: determine whether the distance value F(R1, R2) is smaller than the threshold F; if so, records R1 and R2 belong to the same session; otherwise, records R1 and R2 belong to two different sessions;
when records R1 and R2 belong to two different sessions, R1 is the last message of the previous session and R2 is the first message of the new session;
Step eight: for all types of session detail record data of the communication session user group, segment all types in parallel using Spark;
Step nine: for all grouped instant messaging users, run steps two to eight for all communication session user groups in parallel using Spark.
2. The learning-based instant messaging session segmentation method according to claim 1, wherein in step two, the detail record data R = (RS, RR, T, C);
RS denotes the session initiator (Record Sender), RR denotes the communication session receiver (Record Receiver), T denotes the sending time of record R, and C denotes the text content of record R.
3. The learning-based instant messaging session segmentation method according to claim 1, wherein the specific steps of step five are as follows:
Step 501: obtain the text content C1 of record R1 and the text content C2 of record R2, using word2vec;
Step 502: perform word segmentation and stop-word removal on text contents C1 and C2 to obtain word sets;
text content C1 yields wc1 words; text content C2 yields wc2 words;
Step 503: calculate the text content distance F3 between the adjacent records R1 and R2;
[Equation image FDA0002390441670000021: formula for F3(C1, C2) in terms of sim(wc1_i, wc2_j)]
sim(wc1_i, wc2_j) is calculated using cosine similarity; wc1_i denotes the i-th word of text content C1; wc2_j denotes the j-th word of text content C2;
Step 504: use the text content distance F3 to obtain the text content similarity Δsim of record R1 and record R2;
Δsim = F3(C1, C2).
4. The learning-based instant messaging session segmentation method according to claim 1, wherein the session distances of each communication session user group have the following characteristic: if adjacent records belong to the same session, their distance values are concentrated; if they do not belong to the same session, their distance values are sparsely distributed.
CN201710391483.6A 2017-05-27 2017-05-27 Instant messaging session segmentation method based on learning Active CN107066450B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710391483.6A CN107066450B (en) 2017-05-27 2017-05-27 Instant messaging session segmentation method based on learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710391483.6A CN107066450B (en) 2017-05-27 2017-05-27 Instant messaging session segmentation method based on learning

Publications (2)

Publication Number Publication Date
CN107066450A (en) 2017-08-18
CN107066450B (en) 2020-04-10

Family

ID=59617598

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710391483.6A Active CN107066450B (en) 2017-05-27 2017-05-27 Instant messaging session segmentation method based on learning

Country Status (1)

Country Link
CN (1) CN107066450B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111708866B (en) * 2020-08-24 2020-12-11 北京世纪好未来教育科技有限公司 Session segmentation method and device, electronic equipment and storage medium
CN112256879B (en) * 2020-10-29 2021-07-20 贝壳找房(北京)科技有限公司 Information processing method and apparatus, electronic device, and computer-readable storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101552737A (en) * 2008-03-31 2009-10-07 国际商业机器公司 Instant message communication method and instant message communication device based on theme
CN103430578A (en) * 2010-10-27 2013-12-04 诺基亚公司 Method and apparatus for identifying conversation in multiple strings
JP5514703B2 (en) * 2010-11-29 2014-06-04 Kddi株式会社 Search delivery server, program and method for delivering related information according to search log
CN103686617B (en) * 2013-12-23 2017-08-25 百度在线网络技术(北京)有限公司 Create the method and device of instant messaging group
CN105450497A (en) * 2014-07-31 2016-03-30 国际商业机器公司 Method and device for generating clustering model and carrying out clustering based on clustering model
CN106789572B (en) * 2016-12-19 2019-09-24 重庆博琨瀚威科技有限公司 A kind of instant communicating system and instant communication method for realizing adaptive message screening

Also Published As

Publication number Publication date
CN107066450A (en) 2017-08-18

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant