CN112732903A - Evidence classification method and system in instant messaging information evidence obtaining process - Google Patents

Evidence classification method and system in instant messaging information evidence obtaining process

Info

Publication number
CN112732903A
Authority
CN
China
Prior art keywords
text
word
semantic
information
evidence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010990656.8A
Other languages
Chinese (zh)
Inventor
李炳龙
张宇
王懿
周振宇
李媛芳
孙怡峰
唐慧林
常朝稳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Henan Yunyan Technology Co ltd
Kaifeng Institute Of Science And Technology Information
Information Engineering University of PLA Strategic Support Force
Original Assignee
Henan Yunyan Technology Co ltd
Kaifeng Institute Of Science And Technology Information
Information Engineering University of PLA Strategic Support Force
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Henan Yunyan Technology Co ltd, Kaifeng Institute Of Science And Technology Information, Information Engineering University of PLA Strategic Support Force
Priority to CN202010990656.8A
Publication of CN112732903A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00 User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/04 Real-time or near real-time messaging, e.g. instant messaging [IM]
    • H04L51/043 Real-time or near real-time messaging, e.g. instant messaging [IM] using or handling presence information

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the technical field of data processing, and in particular relates to an evidence classification method and system in the instant messaging information forensics process, comprising the following steps: collecting original chat records, dividing them into a training set and a test set, preprocessing the data in each set, and filtering out non-text data; for the text data in the training set, capturing semantic-level information with a dynamic semantic representation model, generating a corresponding word vector for each word in the text, and placing the semantic words among the word vectors into a lexicon; for the text data in the test set, constructing new semantic words using a sparse constraint; forming a text vector matrix by concatenating word vectors; and extracting text features from the text vector matrix with a bidirectional gated recurrent unit model, then screening out forensics-related text information through classification. The invention applies text classification to chat records to screen out crime-related texts, and improves classification performance, forensic efficiency and accuracy by updating the text feature representation and text feature extraction.

Description

Evidence classification method and system in instant messaging information evidence obtaining process
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to an evidence classification method and system in an instant messaging information forensics process.
Background
With the rapid development of the mobile internet, smartphone use is also growing rapidly. While smartphones bring convenience, they also facilitate crime, and the number of recorded crimes involving smartphones is growing. In the past, mobile phone forensics focused mainly on geographic information data, short messages, call records and similar data stored on the phone. With the development of social software, however, the chat records it retains have become a new source of evidence. According to a survey report issued by CNNIC, the number of social media users in China had reached 89012 by March 2020. Because social software is so popular, its forensics has also become a new focus of digital forensics research.
Forensic investigators can find case-related data by analyzing chat records, but chat records are numerous and complex. At present, investigators either search manually or set filters, which is inefficient and may miss key evidence, so a text classification method is needed to screen key data out of the chat records. However, several problems remain in applying text classification to chat records: first, chat records generated by social media cannot be used directly like ordinary text, because they contain non-text information such as symbols and pictures; second, new words frequently appear in chat records and reduce classifier accuracy, and some semantic information is ignored when the text is quantified. In recent years, deep learning has succeeded in natural language processing, and many researchers have used it to address word ambiguity in text classification. The RNN introduces a memory unit that gives the network a degree of memory, so text features can be extracted in a way that better fits the sequential nature of text: the feature of a word depends not only on the current word but also on its context, so it changes as the preceding text changes and carries semantic information. However, the RNN recurrence mechanism is simple, and gradients are multiplied repeatedly across time steps during backpropagation, which can cause vanishing and exploding gradients and stall training. LSTM and GRU add a gating mechanism to the traditional RNN and thereby better overcome its shortcomings. Many models have since improved on the basic LSTM and GRU models with good results.
However, the accuracy improvement is limited by the large number of parameters in the external memory matrix.
Disclosure of Invention
Therefore, the invention provides an evidence classification method and system for the instant messaging information forensics process, which apply text classification to chat records to screen out crime-related texts, improve classification performance by updating the text feature representation and text feature extraction, and improve the efficiency and accuracy of digital forensics.
According to the design scheme provided by the invention, the evidence classification method in the instant messaging information forensics process comprises the following:
collecting original chat records, dividing them into a training set and a test set, preprocessing the data in each set, and filtering out non-text data;
for the text data in the training set, capturing semantic-level information with a dynamic semantic representation model, generating a corresponding word vector for each word in the text, and collecting the semantic words among the word vectors into a lexicon; for the text data in the test set, using a sparse constraint to construct new semantic words that are constrained by the existing semantic words in the lexicon; and, for the lexicon and the new semantic words, forming a text vector matrix by concatenating word vectors;
extracting text features from the text vector matrix with a bidirectional gated recurrent unit model according to the text context information;
and classifying the extracted text features to distinguish chat record types and screen out text information relevant to forensics.
In the evidence classification method in the instant messaging information forensics process of the invention, further, the original chat records are extracted from a social application and preprocessed: non-text data are filtered out first, traditional and simplified Chinese characters are then converted through dictionary mapping, word segmentation and part-of-speech tagging are performed with a Chinese word segmentation system, and noise data are removed using a stop-word list.
In the evidence classification method in the instant messaging information forensics process of the invention, further, the chat record preprocessing also comprises: converting internet slang through a mapping between internet slang terms and standard written words.
In the evidence classification method in the instant messaging information forensics process of the invention, further, pre-trained word embedding vectors are used in capturing semantic-level information to obtain the embedding vector of each word and assemble them into a feature matrix of the text, where each column of the text feature matrix is the feature representation of a word; cluster analysis is performed on each text, words with similar semantics are grouped into the same class, and in each clustering pass the cluster center is extracted as a semantic word of the text.
In the evidence classification method in the instant messaging information forensics process of the invention, further, in word embedding vector training, an attraction information matrix and an attribution information matrix are defined according to how suitable each of two text embedding vectors is to serve as the cluster center of the other, and both matrices are initialized to 0 before training starts; the attraction information matrix elements of the current iteration are obtained from the similarity of the text embedding vectors and the attribution information matrix elements of the previous iteration, and the attribution information matrix elements of the current iteration are obtained from the attraction information matrix elements of the previous iteration; an iteration condition is set to stop the update computation; similar words in the text are replaced with the corresponding semantic words; and a semantic word set is generated for each text in the training set and added to the lexicon.
In the evidence classification method in the instant messaging information forensics process of the invention, further, an attenuation coefficient λ is introduced into the iterative update, and each updated value is set to λ times its value from the previous iteration plus (1 - λ) times the value computed in the current iteration.
In the evidence classification method in the instant messaging information forensics process of the invention, further, word frequency, part of speech and word position are combined to perform feature fusion on the words in the semantic word set and reconstruct the word embedding vectors.
In the evidence classification method in the instant messaging information forensics process of the invention, further, a sparse constraint objective function is constructed according to the l2 norm, and the sparsely represented new semantic words are added to the lexicon, where the sparse constraint objective function is expressed as:
Figure BDA0002690771560000021
where λ is a weight parameter, k_i is the word vector corresponding to the i-th new word in the test-set text, x_i is the reconstruction vector, and K is composed of the m semantic word vectors in the lexicon.
In the evidence classification method in the instant messaging information forensics process of the invention, further, in the bidirectional gated recurrent unit model, word semantics and contextual meaning in the text data are associated by two gated recurrent units stacked in opposite directions: the forward gated recurrent unit extracts the preceding-context information of the text, and the reverse gated recurrent unit extracts the following-context information. The model input is a text matrix formed from word vectors, text context features are extracted by the hidden layer, and the model output is determined jointly by the states of the two gated recurrent units.
Further, based on the above method, the invention also provides an evidence classification system in the instant messaging information forensics process, comprising a data collection module, a text matrix concatenation module, a feature extraction module and a classification module, wherein:
the data collection module is used for collecting original chat records, dividing them into a training set and a test set, preprocessing the data in each set, and filtering out non-text data;
the text matrix concatenation module is used for capturing semantic-level information with a dynamic semantic representation model for the text data in the training set, generating a corresponding word vector for each word in the text, and collecting the semantic words among the word vectors into a lexicon; for the text data in the test set, using a sparse constraint to construct new semantic words that are constrained by the existing semantic words in the lexicon; and, for the lexicon and the new semantic words, forming a text vector matrix by concatenating word vectors;
the feature extraction module is used for extracting text features from the text vector matrix with a bidirectional gated recurrent unit model according to the text context information;
and the classification module is used for classifying the extracted text features to distinguish chat record types and screen out text information relevant to forensics.
The invention has the beneficial effects that:
The chat records are preprocessed; words are given feature representations by a dynamic semantic representation model, and the selected words are placed in a lexicon; word embedding vectors are reconstructed through a weighted combination of word attributes; and new words are sparsely represented using the trained semantic words in the lexicon, which enhances the adaptability of the model. A BGRU model extracts features from the text composed of word vectors and resolves word ambiguity according to changes in context. Applying text classification to chat records screens out crime-related texts, and updating the text feature representation and text feature extraction improves classification performance, which facilitates the screening and classification of digital evidence, serves the purpose of forensics, and gives the method good application value.
Description of the drawings:
FIG. 1 is a schematic diagram of the evidence classification process in the instant messaging information forensics process in the embodiment;
FIG. 2 is a schematic diagram of a GRU network in an embodiment;
FIG. 3 is a schematic diagram of a BGRU network structure in an embodiment;
FIG. 4 is a schematic diagram of the BGRU language model structure in the embodiment;
FIG. 5 is an illustration of the effect of different word attribute weighting methods in an embodiment;
FIG. 6 is a schematic diagram of the effect of sparse representation in the embodiment.
Detailed description of the embodiments:
In order to make the objects, technical solutions and advantages of the invention clearer, the invention is described in further detail below with reference to the accompanying drawings and technical solutions.
In order to apply text classification to chat records to screen out crime-related texts, an embodiment of the invention provides an evidence classification method in the instant messaging information forensics process, comprising the following:
collecting original chat records, dividing them into a training set and a test set, preprocessing the data in each set, and filtering out non-text data;
for the text data in the training set, capturing semantic-level information with a dynamic semantic representation model, generating a corresponding word vector for each word in the text, and collecting the semantic words among the word vectors into a lexicon; for the text data in the test set, using a sparse constraint to construct new semantic words that are constrained by the existing semantic words in the lexicon; and, for the lexicon and the new semantic words, forming a text vector matrix by concatenating word vectors;
extracting text features from the text vector matrix with a bidirectional gated recurrent unit model according to the text context information;
and classifying the extracted text features to distinguish chat record types and screen out text information relevant to forensics.
Referring to FIG. 1, the original chat records are first extracted from a social application and then preprocessed to remove non-text data. Next, a DSR model captures semantic-level information, and word vectors are reconstructed from the pre-trained word vectors in combination with text features; new words are represented through sparse representation using the word vectors that have already been trained. A BGRU model then extracts features from the text formed by the word vectors. Finally, the resulting feature vectors are input into a SoftMax regression model for classification, which distinguishes the categories of the chat records. In this way, crime-related chat texts are screened out for forensic purposes. Applying text classification to chat records in this manner improves classification performance, efficiency and accuracy by updating the text feature representation and text feature extraction, facilitates the screening and classification of digital evidence, and has good application value.
In the evidence classification method in the instant messaging information forensics process in the embodiment of the invention, further, the original chat records are extracted from a social application and preprocessed: non-text data are filtered out first, traditional and simplified Chinese characters are then converted through dictionary mapping, word segmentation and part-of-speech tagging are performed with a Chinese word segmentation system, and noise data are removed using a stop-word list.
Chat record content is noisy: the texts are short and colloquial, and they contain non-text data such as emoticons and pictures, so preprocessing is required. Non-text data such as emoticons and pictures appear as garbled characters in the chat text, so this content is filtered out first. Conversion between simplified and traditional Chinese characters can be carried out by dictionary mapping. Word segmentation and part-of-speech tagging can be performed with the Chinese word segmentation system NLPIR (ICTCLAS 2013). Partial results after word segmentation and part-of-speech tagging are shown in Table 1, and the part-of-speech correspondence is shown in Table 2. The intersection of the Harbin Institute of Technology stop-word list, the Baidu stop-word list and the Sichuan University Machine Intelligence Laboratory stop-word list can be used to remove noise data. Internet slang often appears in chat records, so the chat record preprocessing further comprises: converting internet slang through a mapping between internet slang terms and standard written words.
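[Table 1: partial results after word segmentation and part-of-speech tagging (image BDA0002690771560000041)]
[Table 2: part-of-speech correspondence (image BDA0002690771560000051)]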
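The preprocessing order described above (filter non-text data, map traditional to simplified characters, segment, convert internet slang, drop stop words) can be sketched as follows. This is only a minimal illustration: the three small dictionaries, the bracketed placeholder format for pictures, and the whitespace-based `segment` default are all assumptions made for the example. The patent itself uses NLPIR for segmentation and full stop-word lists, neither of which is reproduced here.

```python
import re

# Hypothetical toy resources; a real system would load full dictionaries.
TRAD_TO_SIMP = {"證": "证", "據": "据"}
NET_TO_STANDARD = {"gg": "哥哥", "3q": "谢谢"}
STOP_WORDS = {"的", "了", "吗"}

# Assumed export format: pictures and emoticons appear as "[...]" placeholders.
PLACEHOLDER = re.compile(r"\[[^\]]*\]")
# Keep CJK ideographs, ASCII letters and digits, and whitespace; drop the rest.
NON_TEXT = re.compile(r"[^\u4e00-\u9fffA-Za-z0-9\s]")


def preprocess(message, segment=str.split):
    """Return the cleaned token list for one chat message.

    `segment` stands in for a Chinese word segmenter; the default simply
    splits on whitespace so that the example runs without extra packages.
    """
    text = PLACEHOLDER.sub(" ", message)       # 1. filter picture/emoticon tags
    text = NON_TEXT.sub(" ", text)             #    and emoji or other symbols
    text = "".join(TRAD_TO_SIMP.get(ch, ch) for ch in text)  # 2. traditional to simplified
    tokens = segment(text)                     # 3. word segmentation
    tokens = [NET_TO_STANDARD.get(t.lower(), t) for t in tokens]  # 4. slang mapping
    return [t for t in tokens if t not in STOP_WORDS]             # 5. stop words


print(preprocess("證據 的 3Q [图片] 😀"))
```

Passing a real segmenter as `segment` is the only change needed to use the same pipeline on unsegmented chat text.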
In the evidence classification method in the instant messaging information forensics process of the invention, further, pre-trained word embedding vectors are used in capturing semantic-level information to obtain the embedding vector of each word and assemble them into a feature matrix of the text, where each column of the text feature matrix is the feature representation of a word; cluster analysis is performed on each text, words with similar semantics are grouped into the same class, and in each clustering pass the cluster center is extracted as a semantic word of the text.
The reconstruction of the word embedding vectors starts from pre-trained word vectors. Using the pre-trained word embeddings, the embedding vector w of each word is obtained and assembled into a feature matrix T of the text, where each column of the feature matrix is the feature representation of a word. Cluster analysis is then performed on each text, and words with similar semantics are grouped into the same class. In each clustering pass, the cluster center is extracted as a semantic word of the text.
In the evidence classification method in the instant messaging information forensics process in the embodiment of the invention, further, in word embedding vector training, an attraction information matrix and an attribution information matrix are defined according to how suitable each of two text embedding vectors is to serve as the cluster center of the other, and both matrices are initialized to 0 before training; the attraction information matrix elements of the current iteration are obtained from the similarity of the text embedding vectors and the attribution information matrix elements of the previous iteration, and the attribution information matrix elements of the current iteration are obtained from the attraction information matrix elements of the previous iteration; an iteration condition is set to stop the update computation; similar words in the text are replaced with the corresponding semantic words; and a semantic word set is generated for each text in the training set and added to the lexicon. Further, an attenuation coefficient λ is introduced into the iterative update, and each updated value is set to λ times its value from the previous iteration plus (1 - λ) times the value computed in the current iteration.
Assume the set of embedding vectors of a text is {w_1, w_2, ..., w_n}. Define S as the similarity matrix between samples, where s(i, j) describes the similarity between w_i and w_j; the Euclidean distance between w_i and w_j can be used as the value of s(i, j). Define R as the attraction information matrix and A as the attribution information matrix, where r(i, j) describes how suitable w_j is to serve as the cluster center of w_i, and a(i, j) describes how suitable it is for w_i to select w_j as its cluster center. The elements of R and A are initialized to 0 before training begins. The clustering algorithm is implemented by iteratively updating the attraction information matrix R and the attribution information matrix A:
First, r(i, j) is updated according to equation (1):
Figure BDA0002690771560000052
Then a(i, j) is updated according to equations (2) and (3):
Figure BDA0002690771560000061
Figure BDA0002690771560000062
Further, an attenuation coefficient λ is introduced to avoid oscillation, as shown in formulas (4) and (5). Each updated value is set to λ times its value from the previous iteration plus (1 - λ) times the value computed in the current iteration, where λ is a real number between 0 and 1.
r_{t+1}(i,k) = (1 - λ) r_{t+1}(i,k) + λ r_t(i,k)    (4)
a_{t+1}(i,k) = (1 - λ) a_{t+1}(i,k) + λ a_t(i,k)    (5)
The computation stops when the cluster centers remain unchanged over multiple iterations or the number of iterations exceeds the configured maximum. The resulting cluster centers are used as semantic words.
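The update scheme described above matches the well-known affinity propagation algorithm, in which the attraction and attribution matrices are usually called responsibility and availability. Because the images for equations (1) to (3) are not reproduced in this text, the sketch below uses the standard affinity propagation update rules as an assumption, together with the damping of formulas (4) and (5). The helper names, the use of the negative squared Euclidean distance as similarity, and the median preference on the diagonal are illustrative choices, not details taken from the patent.

```python
import numpy as np


def similarity_matrix(X, preference=None):
    """Negative squared Euclidean distance; the diagonal holds the preference."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    S = -np.sum(diff ** 2, axis=2)
    off_diag = S[~np.eye(len(X), dtype=bool)]
    np.fill_diagonal(S, np.median(off_diag) if preference is None else preference)
    return S


def affinity_propagation(S, lam=0.5, max_iter=200, stable_iter=15):
    """Return (exemplar indices, exemplar index assigned to each sample)."""
    n = S.shape[0]
    R = np.zeros((n, n))  # attraction (responsibility) matrix, initialised to 0
    A = np.zeros((n, n))  # attribution (availability) matrix, initialised to 0
    rows = np.arange(n)
    previous, unchanged = None, 0
    for _ in range(max_iter):
        # r(i,k) = s(i,k) - max over k' != k of [a(i,k') + s(i,k')]
        AS = A + S
        best = AS.argmax(axis=1)
        best_val = AS[rows, best]
        AS[rows, best] = -np.inf
        second_val = AS.max(axis=1)
        R_new = S - best_val[:, None]
        R_new[rows, best] = S[rows, best] - second_val
        R = (1 - lam) * R_new + lam * R  # damping, formula (4)

        # a(i,k) = min(0, r(k,k) + sum of max(0, r(i',k)) for i' not in {i, k})
        # a(k,k) = sum of max(0, r(i',k)) for i' != k
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, R.diagonal())
        A_new = Rp.sum(axis=0)[None, :] - Rp
        diag = A_new.diagonal().copy()
        A_new = np.minimum(A_new, 0)
        np.fill_diagonal(A_new, diag)
        A = (1 - lam) * A_new + lam * A  # damping, formula (5)

        # Stop once the set of cluster centres has stayed the same long enough.
        exemplars = np.flatnonzero((A + R).diagonal() > 0)
        same = previous is not None and np.array_equal(exemplars, previous)
        unchanged = unchanged + 1 if (same and exemplars.size) else 0
        previous = exemplars
        if unchanged >= stable_iter:
            break
    if previous.size == 0:
        return previous, np.full(n, -1)
    labels = previous[np.argmax(S[:, previous], axis=1)]
    labels[previous] = previous
    return previous, labels


# Toy "embeddings": two well-separated groups of three points each.
X = np.array([[0, 0], [1, 0], [-1, 0], [10, 0], [11, 0], [9, 0]])
exemplars, labels = affinity_propagation(similarity_matrix(X))
print(exemplars, labels)
```

In the patent's setting, the rows of `X` would be the word embedding vectors of one text, and the words at the returned exemplar indices would serve as that text's semantic words.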
Based on the above method, similar words in the text can be replaced with the corresponding semantic words, and a semantic word set is generated for each text in the training set. The set contains k semantic words {sw_1, sw_2, ..., sw_k}. The semantic word sets of the texts in the training set form a lexicon in which semantic words are updated dynamically, which further improves the adaptive capacity of the model.
To further improve the expressive power of the word embeddings, the semantic words generated from the text are processed further to fuse in other feature information. Feature fusion is performed on the words in the semantic word set using formula (6):
V(w(i)) = MTF(sw_i)    (6)
where MTF() is a feature fusion function. Unlike the traditional bag of words (BOW) model, which represents all words using statistical features only, the word embedding vector can be reconstructed by combining three features: word frequency, part of speech and word position.
Word frequency is the number of times a word occurs in the text. The more frequently a word occurs, the more important it is. It is one of the word attributes commonly used in statistical functions. The word frequency factor fre_i can be calculated with the following nonlinear function:
Figure BDA0002690771560000063
where n is the number of occurrences of the word sw_i. The nonlinear function has two advantages: first, the word frequency factor increases with the word frequency; second, once the word frequency has risen to a certain level, the value of the word frequency factor decreases, which is consistent with how language is actually used.
The part-of-speech factor quantifies the part of speech. Most semantic words that strongly influence text semantics are nouns; verbs and adjectives have relatively less influence. Because different parts of speech affect text classification differently, words can be divided into three categories according to part of speech:
Figure BDA0002690771560000071
wherein m is1,m2,m3Are parameters obtained by training.
The position of the word in the text is also of significant value in determining its importance. Different words appear in different locations of the text and have different effects on the subject matter of the text. The influence of the position can be defined according to the following formula:
Figure BDA0002690771560000072
where first_i is the position at which w_i first appears, last_i is the position at which w_i last appears, and sum_i is the total number of words in the text.
A fused feature is constructed to represent the semantic word, calculated as follows:
MTF(sw_i) = (α_1 fre_i + α_2 pos_i + α_3 loc_i) · sw_i    (7)
where fre_i is the word frequency factor, pos_i is the part-of-speech factor, loc_i is the word position factor, and α_1, α_2 and α_3 are the weights of the feature factors. Therefore, according to equation (6), {V(w(1)), V(w(2)), ..., V(w(i))} can be used to represent a text T_j with i words.
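As a small illustration of formula (7), the sketch below scales a semantic word vector by the weighted sum of its three factors and stacks the results as columns of a text matrix. The formulas for the individual factors are given only as images in the source, so the factor values are passed in directly here; the function names and the weights (0.4, 0.3, 0.3) are assumptions made for the example.

```python
import numpy as np


def mtf(sw_vec, fre, pos, loc, alphas=(0.4, 0.3, 0.3)):
    """Formula (7): scale the semantic word vector by the fused scalar weight."""
    a1, a2, a3 = alphas
    weight = a1 * fre + a2 * pos + a3 * loc
    return weight * np.asarray(sw_vec, dtype=float)


def text_matrix(word_vectors):
    """Place each reconstructed word vector in its own column."""
    return np.stack(word_vectors, axis=1)


# Three hypothetical 2-dimensional semantic word vectors with their factors.
words = [([1.0, 2.0], 0.8, 1.0, 0.5),
         ([0.5, 0.5], 0.2, 0.6, 0.9),
         ([2.0, 0.0], 0.5, 0.3, 0.1)]
T = text_matrix([mtf(vec, fre, pos, loc) for vec, fre, pos, loc in words])
print(T.shape)
```

Because the fused weight is a scalar, the reconstruction changes the length of each word vector but not its direction.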
Since the training set and the test set are randomly assigned and semantic words in the lexicon are composed of words in the training set, words extracted from the test set may not appear in the lexicon. If the words extracted from the test text do not appear in the lexicon, other semantic words will be used to sparsely represent the words. The objective function is as follows:
Figure BDA0002690771560000073
or
Figure BDA0002690771560000074
where y is the sample to be reconstructed, X is the matrix of embedding vectors, and ε and λ are both small positive constants.
Although the l1 norm implicitly selects among the regression training samples, its iterative solution is computationally expensive, so the regularization term is replaced with the l2 norm. The objective function can then be expressed as:
Figure BDA0002690771560000081
In the evidence classification method in the instant messaging information forensics process in the embodiment of the invention, further, a sparse constraint objective function is constructed according to the l2 norm, and the sparsely represented new semantic words are added to the lexicon, where the sparse constraint objective function is expressed as:
Figure BDA0002690771560000082
where λ is a weight parameter, k_i is the word vector corresponding to the i-th new word in the test-set text, x_i ∈ R^m is the reconstruction vector, and K ∈ R^(m×n) is composed of the m semantic word vectors in the lexicon, n being the dimension of the word vectors. Finally, the sparsely represented new semantic words are added to the lexicon to improve the adaptability of the model.
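One advantage of replacing the l1 term with an l2 term is that the coefficients have a closed-form solution. The objective image is not reproduced in this text, so the sketch below assumes the common form min over x of ||k_i - Kᵀx||² + λ||x||², with the rows of K holding the m lexicon vectors. That form is consistent with the symbols defined above but is an assumption, as are the function name and the default λ.

```python
import numpy as np


def represent_new_word(k_i, K, lam=0.1):
    """Closed-form minimiser of ||k_i - K.T @ x||^2 + lam * ||x||^2.

    K has shape (m, n): m semantic word vectors of dimension n.
    Returns the coefficient vector x (length m) and the reconstruction K.T @ x.
    """
    m = K.shape[0]
    x = np.linalg.solve(K @ K.T + lam * np.eye(m), K @ k_i)
    return x, K.T @ x


# Hypothetical lexicon of two 3-dimensional semantic word vectors.
K = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
new_word = np.array([0.9, 0.1, 0.0])
x, reconstruction = represent_new_word(new_word, K, lam=0.01)
print(x, reconstruction)
```

A larger λ shrinks the coefficients toward zero, trading reconstruction accuracy for stability when lexicon vectors are nearly collinear.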
As an evidence classification method in the instant messaging information evidence obtaining process in the embodiment of the invention, further, in a bidirectional gating circulation model, word semantics and context meaning information in text data are associated by two gating circulation units which are reversely superposed, text upper information is extracted by a forward gating circulation unit, and text lower information is extracted by a reverse gating circulation unit; the model input is a text matrix formed by word vectors, text context information features are extracted through a hidden layer, and the model output is jointly determined by the states of two gate control circulation units.
Comments made by different users through the self-media platform are a representation form of natural language, and the form is free but still has context dependency on the structure. According to the above information and the following information of the text, the text semantics can be understood more accurately. The RNN can mine time sequence information and context semantic information of texts, but when the RNN learns a time sequence with any length, the perception capability of the RNN on information long before is reduced along with the increase of input, and long-term dependence and gradient disappearance problems are generated, and an LSTM network improved from the RNN can solve the long-term dependence and gradient disappearance problems of the RNN. However, the LSTM model has more parameters and long training and prediction time, and compared with the LSTM model, the GRU model has fewer parameters, simple model and higher efficiency. The GRU model structure is shown in FIG. 2, ztTo refresh the door, rtTo reset the gate, xtAn input representing time t; h ist-1 is a hidden layer representing the output at time t-1; sigma is a Sigmoid function; h istThe hidden layer represents the output at the time t; the calculation of each gate in the GRU model is shown in equations (9) to (12):
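$z_t = \sigma(W_z \cdot [h_{t-1}, x_t])$ (9)
$r_t = \sigma(W_r \cdot [h_{t-1}, x_t])$ (10)
$h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t$ (11)
$\tilde{h}_t = \tanh(W \cdot [r_t * h_{t-1}, x_t])$ (12)
wherein $W_z$ is the weight matrix of the update gate, $W_r$ is the weight matrix of the reset gate, $W$ is the weight matrix of the candidate hidden state $\tilde{h}_t$, "·" denotes multiplication by a weight matrix, and "*" denotes element-wise multiplication.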
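As a minimal sketch of equations (9) to (12), the following pure-Python function computes one GRU time step. The function names (`gru_step`, `matvec`), the toy dimensions and the constant weight values are illustrative assumptions, not part of the embodiment; bias terms are omitted because the equations above do not include them.

```python
import math


def sigmoid(values):
    """Element-wise logistic function."""
    return [1.0 / (1.0 + math.exp(-v)) for v in values]


def matvec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def gru_step(h_prev, x, W_z, W_r, W):
    """One GRU time step following equations (9)-(12), without bias terms."""
    concat = h_prev + x                                   # [h_{t-1}, x_t]
    z = sigmoid(matvec(W_z, concat))                      # update gate, eq. (9)
    r = sigmoid(matvec(W_r, concat))                      # reset gate, eq. (10)
    gated = [ri * hi for ri, hi in zip(r, h_prev)] + x    # [r_t * h_{t-1}, x_t]
    h_cand = [math.tanh(v) for v in matvec(W, gated)]     # candidate state, eq. (12)
    return [(1 - zi) * hp + zi * hc                       # new hidden state, eq. (11)
            for zi, hp, hc in zip(z, h_prev, h_cand)]


# Toy example: hidden size 2, input size 3, all weights 0.1 (arbitrary values).
hidden, inputs = 2, 3
W_z = [[0.1] * (hidden + inputs) for _ in range(hidden)]
W_r = [[0.1] * (hidden + inputs) for _ in range(hidden)]
W = [[0.1] * (hidden + inputs) for _ in range(hidden)]
h1 = gru_step([0.0, 0.0], [1.0, 0.5, -0.5], W_z, W_r, W)
print(h1)
```

Because the update gate interpolates between the previous state and the candidate state, every component of the new state stays within the range spanned by those two values.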
However, the meaning of a word depends on its context, and a GRU can only capture the preceding context; a BGRU is therefore used instead of a GRU so that the following context is also introduced. The BGRU model is illustrated in FIG. 3 and can be understood as two GRUs stacked in opposite directions, so that at each time step there are two GRUs running in opposite directions. Here, $\overrightarrow{h_t}$ represents the output of the forward GRU at time t; $\overleftarrow{h_t}$ represents the output of the reverse GRU at time t; $h_t$ denotes the output of the BGRU at time t; and $x_t$ denotes the input at time t. The state calculation at each time step in the BGRU model is shown in equations (13) and (14). The output is then determined by the states of the GRUs in both directions, as shown in equation (15):
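$\overrightarrow{h_t} = \mathrm{GRU}(x_t, \overrightarrow{h_{t-1}})$ (13)
$\overleftarrow{h_t} = \mathrm{GRU}(x_t, \overleftarrow{h_{t-1}})$ (14)
$h_t = w_t \overrightarrow{h_t} + v_t \overleftarrow{h_t} + b_t$ (15)
wherein $w_t$ is the weight matrix of the forward output, $v_t$ is the weight matrix of the reverse output, and $b_t$ is the bias at time t.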
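The bidirectional pass can be sketched as follows: one GRU reads the sequence left to right, a second reads it right to left, and the two outputs are combined at each time step as a weighted sum plus a bias. This is an illustration under stated assumptions: the helper names (`gru_step`, `bgru`, `make_weights`) are hypothetical, the weights are arbitrary constants, and scalars `w`, `v` and `b` stand in for the weight matrices and bias of the combination step.

```python
import math


def gru_step(h_prev, x, weights):
    """Compact GRU step without biases; `weights` holds W_z, W_r and W."""
    def affine(matrix, vec):
        return [sum(a * b for a, b in zip(row, vec)) for row in matrix]

    z = [1 / (1 + math.exp(-s)) for s in affine(weights["W_z"], h_prev + x)]
    r = [1 / (1 + math.exp(-s)) for s in affine(weights["W_r"], h_prev + x)]
    gated = [ri * hi for ri, hi in zip(r, h_prev)] + x
    cand = [math.tanh(s) for s in affine(weights["W"], gated)]
    return [(1 - zi) * hp + zi * c for zi, hp, c in zip(z, h_prev, cand)]


def bgru(sequence, fwd_weights, bwd_weights, w=1.0, v=1.0, b=0.0, hidden=2):
    """Run a forward and a reverse GRU and combine their outputs per time step."""
    h = [0.0] * hidden
    forward = []
    for x in sequence:                 # forward pass: preceding context
        h = gru_step(h, x, fwd_weights)
        forward.append(h)
    h = [0.0] * hidden
    backward = []
    for x in reversed(sequence):       # reverse pass: following context
        h = gru_step(h, x, bwd_weights)
        backward.append(h)
    backward.reverse()                 # realign with the original time order
    # Weighted combination of both directions (scalar stand-ins for w_t, v_t, b_t).
    return [[w * f + v * g + b for f, g in zip(fs, bs)]
            for fs, bs in zip(forward, backward)]


def make_weights(value, hidden=2, inputs=2):
    """Build constant weight matrices for the toy example."""
    row = [value] * (hidden + inputs)
    return {name: [row[:] for _ in range(hidden)] for name in ("W_z", "W_r", "W")}


seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs = bgru(seq, make_weights(0.2), make_weights(-0.2))
print(len(outputs), len(outputs[0]))
```

Each output row depends on the whole sequence: the forward state summarizes the words up to that position, and the reverse state summarizes the words from that position onward.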
The word vectors output by the DSR model are spliced to form a text vector matrix, as shown in equation (16):
$T_j = V(w(1)) \oplus V(w(2)) \oplus \cdots \oplus V(w(i))$ (16)
wherein w(1), w(2), ..., w(i) represent the words of the text; V(w(1)), V(w(2)), ..., V(w(i)) represent the word vectors output by the DSR model for those words; $T_j$ represents the text vector matrix of the j-th text, spliced from i word vectors; and $\oplus$ represents the concatenation operation on word vectors.
The BGRU-based language model structure is shown in FIG. 4. The BGRU is used to extract context features from the text matrix $T_j$, calculated as in equation (15): the forward GRU extracts the preceding-context features of the comment text, calculated as in equation (13), and the reverse GRU extracts the following-context features of the comment text, calculated as in equation (14). In this embodiment, the BGRU model for text feature extraction may be built with the Keras framework. The model is composed as follows: the input layer is the text matrix formed from the DSR word-vector representations; the BGRU hidden layer has size 64; the input sequence is fed into the model from both directions, and the hidden layer extracts the preceding-context and following-context features of the text; finally, the hidden-layer outputs of the two directions are spliced, as in equation (17):
$h_{ijt} = \mathrm{BGRU}(T_{ijt})$ (17)
wherein $T_{ijt}$ represents the text matrix, composed of i word vectors, of the j-th text input at time t, and $h_{ijt}$ represents the output of the BGRU at time t.
The size of the Softmax output layer matches the number of text classes. Since a binary classification may be used, the output layer has 2 neurons, representing normal and abnormal respectively.
Text classification is performed by the Softmax function, as shown in equation (18):
$y_i = \mathrm{softmax}(w_i h_{ijt} + b_i)$ (18)
wherein $w_i$ represents the weight coefficient matrix from the feature extraction layer to the output layer, $b_i$ represents the corresponding bias, and $h_{ijt}$ represents the output vector of the feature extraction layer at time t.
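The output layer of equation (18) can be illustrated with a short pure-Python sketch: an affine map from the feature vector to two scores, followed by a softmax. The feature vector, weights and biases below are made-up values, and the function names are hypothetical; a trained model would learn the weights.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(features, weights, biases, labels=("normal", "abnormal")):
    """Apply y = softmax(w h + b); return the most probable label and all probabilities."""
    scores = [sum(w * h for w, h in zip(row, features)) + b
              for row, b in zip(weights, biases)]
    probs = softmax(scores)
    return labels[probs.index(max(probs))], probs


features = [0.2, -0.4, 0.7]                       # hypothetical BGRU feature vector
weights = [[0.5, 0.1, -0.3], [-0.5, 0.2, 0.9]]    # one row per output neuron
biases = [0.0, 0.1]
label, probs = classify(features, weights, biases)
print(label, probs)
```

With two output neurons the probabilities sum to 1, so the predicted class is simply the neuron with the larger score.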
To verify the effectiveness of the embodiments of the present disclosure, the following further explanation is provided with experimental data:
The experimental environment is as follows: an x86 platform with an Intel CPU, 16 GB of memory and a 100 GB hard disk, running the Windows 10 Home operating system. Models are built and tested with the TensorFlow-based deep learning library Keras.
Data set: the experimental data were taken from an Android smartphone used for the experiment. The phone holds chat records with 1000 WeChat friends, and the 1000 conversations include both normal chat and crime-related chat. The topics of the normal chat and of the crime-related chat are shown in Table 3. Normal chat content is labeled "normal text" and crime-related text is labeled "abnormal text"; the chat records contain 24100 short chat texts in total. A data set covering the 2 classes, normal and abnormal, is randomly extracted from the chat records; the data of each class are divided into a training set and a test set at a ratio of 4:1 and stored in CSV data files. The amount of data in each part is shown in Table 4.
[Table 3: topics of normal chat and of crime-related chat]
[Table 4: amount of data in each part of the data set]
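The per-class 4:1 split and CSV storage described above could be sketched as follows. The records are placeholders rather than the experimental data, and the function names, column names and random seed are assumptions made for illustration.

```python
import csv
import io
import random


def stratified_split(records, train_parts=4, test_parts=1, seed=0):
    """Split (text, label) records into train/test sets at train_parts:test_parts per label."""
    rng = random.Random(seed)
    by_label = {}
    for text, label in records:
        by_label.setdefault(label, []).append((text, label))
    train, test = [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        cut = len(items) * train_parts // (train_parts + test_parts)
        train.extend(items[:cut])
        test.extend(items[cut:])
    return train, test


def to_csv(rows):
    """Serialize rows to CSV text; write the returned string to a file to persist it."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["text", "label"])
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder records, not the chat data used in the experiment.
records = [(f"normal message {i}", "normal") for i in range(10)]
records += [(f"abnormal message {i}", "abnormal") for i in range(5)]
train, test = stratified_split(records)
print(len(train), len(test))
print(to_csv(test))
```

Splitting within each class keeps the normal-to-abnormal proportion the same in the training and test sets, which matters when one class is much smaller than the other.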
In addition, four data sets are used to evaluate the performance of the DSR model, as shown in Table 5. Because few data sets exist for abnormal text classification, public Chinese data sets for text sentiment polarity classification are chosen. 80% of each data set is used as the training set and the remaining text as the test set.
[Table 5: data sets used to evaluate the DSR model]
Evaluation criteria:
The classification results may be evaluated using precision, recall and the F1 value, calculated as follows:
$P = \frac{TP}{TP + FP}$
$R = \frac{TP}{TP + FN}$
$F1 = \frac{2 \times P \times R}{P + R}$
wherein TP (true positive) is the number of normal texts predicted as normal, FP (false positive) is the number of abnormal texts predicted as normal, and FN (false negative) is the number of normal texts predicted as abnormal.
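Using the TP, FP and FN definitions above, with "normal" as the positive class, the three measures can be computed as in the following sketch. The label lists are invented for illustration, and the function name is hypothetical.

```python
def precision_recall_f1(y_true, y_pred, positive="normal"):
    """Compute precision, recall and F1 for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Invented labels: 2 true positives, 1 false positive, 1 false negative.
y_true = ["normal", "normal", "normal", "abnormal", "abnormal"]
y_pred = ["normal", "normal", "abnormal", "normal", "abnormal"]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```

The guards against zero denominators return 0.0 when a class is never predicted or never present; that convention is a choice of this sketch.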
Performance analysis: the overall scheme is tested following the logic of the complete system. The analysis focuses on the following questions:
(1) The influence of different word-attribute weighting methods on classification performance.
There are many ways to quantify word weights in NLP; common ones are TF-IDF weighting and word-frequency weighting. The algorithm of this scheme combines the position, frequency and part-of-speech features of a word to quantify its weight. In the experiments, the effects of these word-weighting methods are compared. The results in FIG. 5 show that the combined weighting is more effective than the other weighting methods.
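For reference, the TF-IDF baseline named above can be sketched in a few lines. This illustrates only that baseline, not the combined position, frequency and part-of-speech weighting of the scheme; the formula tf × log(N / df) is one common variant chosen here as an assumption, and the tokenized documents are invented.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: weight} dict per tokenized document, using tf * log(N / df)."""
    n_docs = len(documents)
    doc_freq = Counter(word for doc in documents for word in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc) or 1          # avoid division by zero for empty documents
        weights.append({word: (count / total) * math.log(n_docs / doc_freq[word])
                        for word, count in counts.items()})
    return weights


# Invented, already-segmented short texts.
docs = [["meet", "at", "the", "station"],
        ["send", "the", "money"],
        ["the", "weather", "is", "nice"]]
weights = tf_idf(docs)
print(weights[1])
```

A word that occurs in every document, such as "the" here, receives weight 0, while a word confined to one document is weighted by its relative frequency in that document.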
(2) The effect of the sparse representation method for new words in the DSR model is analyzed.
As shown in FIG. 6, the sparse representation approach improves classifier performance on most data sets. The improvement on the ChnSentiCorp data set is small because the words in its training and test sets overlap heavily, so dynamic representation adds few new words. On the other data sets, the training and test sets share fewer words but the word meanings are similar, so sparse representation yields a larger improvement. F1_1 denotes the F1 score without sparse representation, and F1_2 denotes the F1 score with sparse representation.
(3) Comparison of the performance of different text feature representation methods and classification methods.
To fully verify the effectiveness of the DSR-BGRU text classification model, it is compared not only with other research on crime text classification but also with currently popular text classification models for social media. The methods are as follows:
1) DSR-BGRU: the text classification model of this embodiment, which uses DSR for text feature representation and BGRU for text feature extraction.
2) BERT: short texts are given sentence-level feature vector representations by the BERT pre-trained language model, and the resulting feature vectors are input into a Softmax regression model for classification.
3) Word2Vec-CNN: words in a large-scale corpus are trained with the word2vec tool and represented as low-dimensional vectors; a CNN (convolutional neural network) then extracts deep semantic features of the text to obtain text feature vectors that can be used for clustering.
4) Word2Vec-TF-IDF-SVM: the word segmentation results of short texts are represented as word vectors using word2vec embedding, each word vector is weighted by TF-IDF, and the text is finally classified with an SVM classifier.
5) TF-IDF-SVM: unlike other studies, short texts are classified with the word as the feature granularity; features are selected by frequency to reduce the feature dimension, and an SVM is then trained.
6) TF-IDF-SVM: starting from the textual characteristics of group chat records, word vectors are weighted using TF-IDF, reduced in dimension by a gradient dimension reduction method, and classified with an SVM (support vector machine) and similar methods, to build a classification model for group chat.
7) ARPR: the ARPR algorithm combines the PageRank algorithm with a relationship network. It uses TF-IDF to extract group-chat-related keywords of the group chat participants and measures the contribution of those keywords to the ranking of the degree of suspicion; an analytic hierarchy process then supplies the weight coefficients for aggregating the suspicion weights computed from each dimension of information, and a relationship network linked by friend relationships provides the in-degree and out-degree used to compute the corresponding PageRank weights.
[Table 6: comparison of the experimental results of the different methods]
As can be seen from the comparative experimental results in Table 6, the DSR-BGRU of this embodiment outperforms the other methods in precision, recall and F1 value.
Further, based on the above method, an embodiment of the present invention also provides an evidence classification system in an instant messaging information forensics process, comprising a data collection module, a text matrix splicing module, a feature extraction module and a classification module, wherein:
the data collection module is used for collecting original chat records, dividing them into a training set and a test set, preprocessing the data in each set, and filtering out non-text data;
the text matrix splicing module is used for: for the text data in the training set, capturing semantic-level information with a dynamic semantic representation (DSR) model, generating a corresponding word vector for each text word, and collecting the semantic words among the word vectors into a lexicon; for the text data in the test set, using a sparse constraint to establish new semantic words constrained by the existing semantic words in the lexicon; and, for the lexicon and the new semantic words, forming a text vector matrix by splicing word vectors;
the feature extraction module is used for extracting text features from the text vector matrix with the bidirectional gated recurrent unit (BGRU) model according to the text context information;
and the classification module is used for classifying the extracted text features so as to distinguish chat record types and screen out the text information relevant to forensics.
The relative steps, numerical expressions, and numerical values of the components and steps set forth in these embodiments do not limit the scope of the present invention unless specifically stated otherwise.
Based on the foregoing system, an embodiment of the present invention further provides a server, including: one or more processors; a storage device to store one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the system as described above.
Based on the above system, the embodiment of the present invention further provides a computer readable medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the above system.
The device provided by the embodiment of the present invention has the same implementation principle and technical effect as the system embodiment, and for the sake of brief description, reference may be made to the corresponding content in the system embodiment for the part where the device embodiment is not mentioned.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the system and the apparatus described above may refer to the corresponding processes in the foregoing system embodiments, and are not described herein again.
In all examples shown and described herein, any particular value should be construed as merely exemplary, and not as a limitation, and thus other examples of example embodiments may have different values.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. The above-described apparatus embodiments are merely illustrative. For example, the division into units is only a logical division, and other divisions are possible in an actual implementation; a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection between devices or units through some communication interfaces, and may be electrical, mechanical or of another form.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the system according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
Finally, it should be noted that the above embodiments are only specific embodiments of the present invention, used to illustrate its technical solutions rather than to limit them, and the protection scope of the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that anyone familiar with the art may, within the technical scope disclosed herein, still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features; such modifications, changes or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the embodiments of the present invention, and shall be covered by its protection scope. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. An evidence classification method in an instant messaging information forensics process, characterized by comprising the following:
collecting original chat records, dividing them into a training set and a test set, preprocessing the data in each set, and filtering out non-text data;
for the text data in the training set, capturing semantic-level information with a dynamic semantic representation model, generating a corresponding word vector for each text word, and putting the semantic words among the word vectors into a lexicon; for the text data in the test set, using a sparse constraint to establish new semantic words constrained by the existing semantic words in the lexicon; for the lexicon and the new semantic words, forming a text vector matrix by splicing word vectors;
extracting text features from the text vector matrix with a bidirectional gated recurrent unit model according to the text context information;
and classifying the extracted text features to distinguish chat record types and screen out the text information relevant to forensics.
2. The evidence classification method in the instant messaging information forensics process according to claim 1, wherein original chat records are extracted from social applications and preprocessed: non-text data are filtered out, conversion between traditional and simplified Chinese characters is performed by dictionary mapping, word segmentation and part-of-speech tagging are performed with a Chinese word segmentation system, and noise data are removed using a stop-word list.
3. The evidence classification method in the instant messaging information forensics process according to claim 2, wherein the chat record preprocessing further comprises: converting internet slang by mapping internet slang to standard written words.
4. The evidence classification method in the instant messaging information forensics process according to claim 1, wherein in the process of capturing semantic-level information, pre-trained word embedding vectors are used to obtain the embedding vector of each word, and these are assembled into a feature matrix of the text, wherein each column of the text feature matrix represents the features of one word; cluster analysis is performed on each text, words with similar semantics are grouped into the same class, and in each clustering process the cluster center is extracted as a semantic word of the text.
5. The evidence classification method in the instant messaging information forensics process according to claim 4, wherein in the word embedding vector training, an attraction (responsibility) matrix and an attribution (availability) matrix are defined according to how suitable each of two text embedding vectors is to serve as the cluster center of the other, and both matrices are initialized to 0 before training starts; the attraction matrix elements of the current iteration are obtained from the similarity of the text embedding vectors and the attribution matrix elements of the previous iteration, the attribution matrix elements of the current iteration are obtained from the attraction matrix elements of the previous iteration, and an iteration condition is set to stop the update calculation; similar words in the text are replaced with the corresponding semantic words, a semantic word set is generated for each text in the training set, and the semantic word set is added to the lexicon.
6. The evidence classification method in the instant messaging information forensics process according to claim 5, wherein a damping coefficient λ is introduced in the iterative update, and the value of the current iteration is set to λ times the updated value of the previous iteration plus (1-λ) times the value computed in the current iteration.
7. The evidence classification method according to claim 4, wherein word frequency, part of speech and word position are combined to perform feature fusion on the words in the semantic word set, thereby reconstructing the word embedding vectors.
8. The evidence classification method in the instant messaging information forensics process according to claim 1, characterized in that a sparse constraint objective function is constructed according to the l2 norm, and the sparsely represented new semantic words are added to the lexicon, wherein the sparse constraint objective function is expressed as:
Figure FDA0002690771550000011
wherein λ is a weight parameter, $k_i$ is the word vector corresponding to the i-th new word of the test-set text, $x_i$ is the reconstruction vector, and K is composed of the m semantic word vectors in the lexicon.
9. The evidence classification method in the instant messaging information forensics process according to claim 1, wherein in the bidirectional gated recurrent unit model, word semantics are associated with context information in the text data by two gated recurrent units stacked in opposite directions, the forward gated recurrent unit extracting the preceding-context information of the text and the reverse gated recurrent unit extracting the following-context information; the model input is a text matrix formed from word vectors, text context features are extracted by the hidden layer, and the model output is jointly determined by the states of the two gated recurrent units.
10. An evidence classification system in an instant messaging information forensics process, characterized by comprising a data collection module, a text matrix splicing module, a feature extraction module and a classification module, wherein:
the data collection module is used for collecting original chat records, dividing them into a training set and a test set, preprocessing the data in each set, and filtering out non-text data;
the text matrix splicing module is used for: for the text data in the training set, capturing semantic-level information with a dynamic semantic representation model, generating a corresponding word vector for each text word, and collecting the semantic words among the word vectors into a lexicon; for the text data in the test set, using a sparse constraint to establish new semantic words constrained by the existing semantic words in the lexicon; and, for the lexicon and the new semantic words, forming a text vector matrix by splicing word vectors;
the feature extraction module is used for extracting text features from the text vector matrix with a bidirectional gated recurrent unit model according to the text context information;
and the classification module is used for classifying the extracted text features so as to distinguish chat record types and screen out the text information relevant to forensics.
CN202010990656.8A 2020-09-19 2020-09-19 Evidence classification method and system in instant messaging information evidence obtaining process Pending CN112732903A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010990656.8A CN112732903A (en) 2020-09-19 2020-09-19 Evidence classification method and system in instant messaging information evidence obtaining process

Publications (1)

Publication Number Publication Date
CN112732903A true CN112732903A (en) 2021-04-30

Family

ID=75597214

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010990656.8A Pending CN112732903A (en) 2020-09-19 2020-09-19 Evidence classification method and system in instant messaging information evidence obtaining process

Country Status (1)

Country Link
CN (1) CN112732903A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20020080553A (en) * 2001-04-16 2002-10-26 삼성전자 주식회사 Dynamic semantic cluster method and apparatus for selectional restriction
CN109376242A (en) * 2018-10-18 2019-02-22 西安工程大学 Text classification algorithm based on Recognition with Recurrent Neural Network variant and convolutional neural networks
CN110532395A (en) * 2019-05-13 2019-12-03 南京大学 A kind of method for building up of the term vector improved model based on semantic embedding
CN111008274A (en) * 2019-12-10 2020-04-14 昆明理工大学 Case microblog viewpoint sentence identification and construction method of feature extended convolutional neural network
CN111651597A (en) * 2020-05-27 2020-09-11 福建博思软件股份有限公司 Multi-source heterogeneous commodity information classification method based on Doc2Vec and convolutional neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
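Cao Yu et al.: "BGRU: a new method for Chinese text sentiment analysis" (BGRU:中文文本情感分析的新方法), Journal of Frontiers of Computer Science and Technology (《计算机科学与探索》), pages 973 - 981 *
Wang Tianshi: "Research on text classification methods based on feature embedding representation" (基于特征嵌入表示的文本分类方法研究), China Master's Theses Full-text Database, Information Science and Technology (《中国优秀硕士学位论文全文数据库信息科技辑》), no. 08, pages 9 - 33 *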

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116542252A (en) * 2023-07-07 2023-08-04 北京营加品牌管理有限公司 Financial text checking method and system
CN116542252B (en) * 2023-07-07 2023-09-29 北京营加品牌管理有限公司 Financial text checking method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination