CN111669757B - Terminal fraud call identification method based on conversation text word vector - Google Patents


Info

Publication number
CN111669757B
Authority
CN
China
Prior art keywords
text
vector
word
fraud
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010542362.9A
Other languages
Chinese (zh)
Other versions
CN111669757A (en
Inventor
孙晓晨
宁珊
林格平
张之含
侯炜
洪永婷
倪善金
周书敏
万辛
沈亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xinxun Digital Technology Hangzhou Co ltd
National Computer Network and Information Security Management Center
Original Assignee
EB INFORMATION TECHNOLOGY Ltd
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by EB INFORMATION TECHNOLOGY Ltd and National Computer Network and Information Security Management Center
Priority to CN202010542362.9A
Publication of CN111669757A
Application granted
Publication of CN111669757B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W12/00 Security arrangements; Authentication; Protecting privacy or anonymity
    • H04W12/12 Detection or prevention of fraud
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 Reducing energy consumption in communication networks
    • Y02D30/70 Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A terminal fraud call identification method based on conversation text word vectors comprises the following steps: a user marks an incoming call in the terminal App; when the call is marked as a fraud category, the call is converted into text once the user authorizes it, the user reviews and desensitizes the text, and, with the user's further authorization, the text is uploaded to the server and stored as a text sample; word segmentation and part-of-speech tagging are performed on each text sample to obtain the syntactic dependency label and word combination vector of each segmented word, the word combination vector, part-of-speech tag and syntactic dependency label are concatenated to form the content vector of the segmented word, and the scene element label to which each segmented word belongs is calculated to obtain the semantic vector of the text sample; a fraud classification recognition model is constructed and trained with the text samples on the server, and the trained model is then pushed from the server to the App; after receiving a new call to be identified, the App uses the model to obtain the fraud-related category to which the call belongs and prompts the user. The invention belongs to the field of information technology and can accurately identify fraudulent calls based on conversation text.

Description

Terminal fraud call identification method based on conversation text word vector
Technical Field
The invention relates to a terminal fraud call identification method based on conversation text word vectors, and belongs to the field of information technology.
Background
Telecommunication fraud cases launched from overseas are increasing day by day, and mobile phone users increasingly need fraudulent calls to be filtered. However, more and more fraudulent communication is concealed, so the features of the communication behavior itself are weakened; the accuracy and recall with which a mobile phone system identifies bad calls can be further improved only by analyzing and identifying the call text.
At present, the fraud call filtering methods offered by mobile phone terminal systems on the market are still rather primitive. Mainstream manufacturers generally rely on user marking: users actively mark the category of a call and upload it to a server, forming a library of marked fraud numbers that is used to filter fraudulent numbers. The drawback of this approach is that fraudulent calls cannot be detected in real time; a number is often discovered only after a victim has already been deceived.
Therefore, how to accurately identify fraudulent calls based on the call text has become a technical problem of general concern to mobile phone manufacturers and mobile phone system developers.
Disclosure of Invention
In view of the above, the present invention provides a terminal fraud call identification method based on conversation text word vectors, which can accurately identify fraudulent calls based on conversation text.
To achieve the above object, the present invention provides a terminal fraud call identification method based on conversation text word vectors, comprising:
step one, a user marks an incoming call in a mobile phone terminal App; for an incoming call that the user marks as a fraud category, the call is extracted and converted into text after the user authorizes it, the converted text is then presented to the user for review and desensitization, and finally, after the user authorizes it, the reviewed and desensitized text is uploaded to a server and stored as a text sample;
step two, word segmentation and part-of-speech tagging are performed on each text sample in the server to obtain a syntactic dependency label for each segmented word; a word vector, a character vector, a pinyin vector and a stroke vector are then calculated for each segmented word in the text sample to form the word combination vector of each segmented word; the word combination vector, part-of-speech tag and syntactic dependency label of each segmented word are concatenated to form the content vector of each segmented word; the scene element label to which each segmented word belongs is calculated from its content vector; and finally the content vectors and scene element labels of all segmented words in the text sample are averaged to obtain the semantic vector corresponding to the text sample;
step three, a fraud classification recognition model is constructed whose input is the semantic vector corresponding to a text and whose output is the fraud-related category to which the text belongs; the model is trained using the text samples uploaded by users to the server as training samples, and the trained model is then pushed from the server to the users' mobile phone terminal App to update the model;
step four, after receiving a new call to be identified, the user's mobile phone terminal App extracts the content text of the call, performs word segmentation, and generates the part-of-speech tags, syntactic dependency labels and word combination vectors of all segmented words in the text; it then obtains the fraud-related category to which the number of the call to be identified belongs according to the fraud classification recognition model in the App, and prompts the user through an App message,
wherein, in step two, concatenating the word combination vector, part-of-speech tag and syntactic dependency label of each segmented word in the text sample to form the content vector of each segmented word, calculating the scene element label to which each segmented word belongs from its content vector, and finally averaging the content vectors and scene element labels of all segmented words in the text sample to obtain the semantic vector corresponding to the text, further comprises:
step A1, setting a plurality of scene elements;
step A2, inputting the word combination vector, part-of-speech tag and syntactic dependency label of each segmented word in the text sample into an LSTM model for encoding, to obtain the content vector corresponding to each segmented word;
step A3, using Self-Attention to calculate, from the word combination vector of each segmented word, the weighted influence factor of each segmented word relative to the other segmented words;
step A4, combining the content vector of each segmented word obtained in step A2 and the weighted influence factor of each segmented word obtained in step A3 into a new content vector for each segmented word, and then inputting the new content vector of each segmented word into a CNN model, the output of which is the scene element corresponding to each segmented word;
and step A5, inputting the new content vector and the scene element of each segmented word in the text sample into an LSTM model for encoding, combining the LSTM outputs corresponding to all segmented words in the text sample into a vector matrix, and taking the average over the second dimension of the vector matrix as the semantic vector of the text sample.
Compared with the prior art, the invention has the following beneficial effects: the invention provides a conversation text recognition method that quickly converts conversation text into numerical vectors, fuses word vectors, pinyin vectors and stroke vectors, and constructs event elements for multiple fraud scenes on the basis of part-of-speech labels. It enables targeted analysis of multiple fraud scenes from several angles, such as the event description, subsequent actions and the attitudes of both parties, while fully protecting user privacy; it resolves the semantic deviation caused by homophones written with different characters and by polyphonic characters; and it improves, to the greatest extent, the precision and recall with which users and manufacturers identify bad calls.
Drawings
FIG. 1 is a flowchart of the terminal fraud call identification method based on conversation text word vectors of the present invention.
FIG. 2 is a flowchart of the specific steps, in step two of FIG. 1, of performing word segmentation and part-of-speech tagging on each text sample to obtain the syntactic dependency label of each segmented word.
FIG. 3 is a flowchart of the specific steps, in step two of FIG. 1, of concatenating the word combination vector, part-of-speech tag and syntactic dependency label of each segmented word in a text sample to form the content vector of each segmented word, calculating the scene element label to which each segmented word belongs from its content vector, and finally averaging the content vectors and scene element labels of all segmented words in the text sample to obtain the semantic vector corresponding to the text sample.
Detailed Description
To make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings.
As shown in FIG. 1, the present invention provides a terminal fraud call identification method based on conversation text word vectors, which comprises:
step one, a user marks an incoming call in a mobile phone terminal App; for an incoming call that the user marks as a fraud category, the call is extracted and converted into text after the user authorizes it, the converted text is then presented to the user for review and desensitization, and finally, after the user authorizes it, the reviewed and desensitized text is uploaded to a server and stored as a text sample, wherein desensitization means removing sensitive information related to personal identity, such as identity card numbers, names and mobile phone numbers;
step two, word segmentation and part-of-speech tagging are performed on each text sample in the server to obtain a syntactic dependency label for each segmented word; a word vector, a character vector, a pinyin vector and a stroke vector are then calculated for each segmented word in the text sample to form the word combination vector of each segmented word; the word combination vector, part-of-speech tag and syntactic dependency label of each segmented word are concatenated to form the content vector of each segmented word; the scene element label to which each segmented word belongs is calculated from its content vector; and finally the content vectors and scene element labels of all segmented words in the text sample are averaged to obtain the semantic vector corresponding to the text sample;
step three, a fraud classification recognition model is constructed whose input is the semantic vector corresponding to a text and whose output is the fraud-related category to which the text belongs; the model is trained using the text samples uploaded by users to the server as training samples, and the trained model is then pushed from the server to the users' mobile phone terminal App to update the model;
and step four, after receiving a new call to be identified, the user's mobile phone terminal App extracts the content text of the call, performs word segmentation, and generates the part-of-speech tags, syntactic dependency labels and word combination vectors of all segmented words in the text; it then obtains the fraud category to which the number of the call to be identified belongs according to the fraud classification recognition model in the App, and prompts the user through an App message.
Step one may further comprise:
step 11, after the user installs the mobile phone terminal App, the function of marking incoming calls becomes available; when the user uses this function to mark the current incoming call as a fraud category, the App uses an HMM algorithm to extract the content of the first 60 seconds of the call and generate a content text, then removes personal identity information from the content text based on general rules, and finally presents the desensitized text in the App for the user to view;
step 12, the user views the text and may edit it to further complete the desensitization, then chooses whether to upload the desensitized text marked as a fraud category to the server; if so, the text and its fraud category mark are uploaded to the server under the user's authorization;
step 13, the text received by the server is cleaned, which comprises removing abnormal characters other than Chinese characters, English letters and digits, uniformly replacing line feeds and placeholders with spaces, and collapsing runs of multiple spaces into a single space;
and step 14, the text is cleaned again: the first 180 characters of the text are retained, and texts shorter than 15 characters are discarded.
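The desensitization and cleaning rules of steps 11 to 14 can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the patent does not list its "general rules", so the two regular expressions (18-character identity card numbers, 11-digit mobile numbers), the treatment of tabs as "placeholders", and the function names `desensitize` and `clean` are all assumptions made for the example.

```python
import re
from typing import Optional

# Hypothetical "general rules": the patent does not specify them.
ID_CARD = re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)")   # 18-character identity card number
MOBILE = re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)")     # 11-digit mobile phone number

# Anything that is not a Chinese character, an English letter, a digit or a space.
ABNORMAL = re.compile(r"[^\u4e00-\u9fffA-Za-z0-9 ]")


def desensitize(text: str) -> str:
    """Step 11 (sketch): strip identity-related numbers from the call text."""
    text = ID_CARD.sub("", text)      # remove the longer pattern first
    return MOBILE.sub("", text)


def clean(text: str, max_len: int = 180, min_len: int = 15) -> Optional[str]:
    """Steps 13 and 14 (sketch): normalise the text, truncate it, drop short texts."""
    text = re.sub(r"[\r\n\t]", " ", text)        # line feeds / tabs -> spaces
    text = ABNORMAL.sub("", text)                # drop abnormal characters
    text = re.sub(r" {2,}", " ", text).strip()   # collapse runs of spaces
    text = text[:max_len]                        # keep the first 180 characters
    return text if len(text) >= min_len else None


sample = "您好,这里是公安局!\n\n您的身份证110101199003071234涉嫌洗钱,请回电13812345678核实。"
print(clean(desensitize(sample)))
```

Whitespace is normalised before abnormal characters are removed so that line feeds become separators rather than being deleted outright; the patent does not state the order, so this ordering is a design choice of the sketch.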
As shown in FIG. 2, performing word segmentation and part-of-speech tagging on each text sample in step two to obtain the syntactic dependency label of each segmented word may further comprise:
step 21, generating a stop word dictionary based on Chinese grammar;
step 22, manually adding words common in fraud scenes as a custom dictionary;
step 23, performing word segmentation and part-of-speech tagging on the text sample using an HMM (hidden Markov model) algorithm based on a DAG (directed acyclic graph) word graph, with the custom dictionary supplied at the same time to optimize the segmentation result;
step 24, performing syntactic dependency analysis on each segmented word using a fast Offset-based algorithm and outputting the syntactic dependency label of each segmented word, as shown in the following table:
[Table images GDA0003880057270000041 and GDA0003880057270000051: syntactic dependency labels; not reproduced in the text]
and step 25, filtering stop words out of the text sample using the stop word dictionary.
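To make steps 21 to 25 more concrete, the sketch below segments a sentence over a DAG word graph with a dictionary, then filters stop words. It is a simplified illustration under stated assumptions: the dictionaries and frequencies (`BASE_DICT`, `CUSTOM_DICT`, `STOP_WORDS`) are toy values invented for the example, the best path is chosen by maximum log-probability, and the HMM stage for out-of-vocabulary words, part-of-speech tagging and dependency parsing are all omitted.

```python
import math
from typing import Dict, List

# Toy dictionaries with made-up frequencies, for illustration only.
BASE_DICT = {"公安局": 50, "公安": 30, "局": 5, "通知": 30, "您": 100, "的": 200,
             "银行卡": 40, "银行": 60, "卡": 5, "涉嫌": 20, "洗钱": 20,
             "安全": 20, "账户": 20}
CUSTOM_DICT = {"安全账户": 10}   # step 22: fraud-scene words added by hand
STOP_WORDS = {"的"}              # step 21: stop word dictionary


def build_dag(sentence: str, freq: Dict[str, int]) -> Dict[int, List[int]]:
    """For each start index, list every end index that forms a dictionary word."""
    n = len(sentence)
    dag = {}
    for i in range(n):
        ends = [j for j in range(i, n) if sentence[i:j + 1] in freq]
        dag[i] = ends or [i]     # unknown characters stand alone
    return dag


def segment(sentence: str, freq: Dict[str, int]) -> List[str]:
    """Pick the maximum log-probability path through the DAG (right to left)."""
    log_total = math.log(sum(freq.values()))
    dag = build_dag(sentence, freq)
    n = len(sentence)
    route = {n: (0.0, 0)}
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(freq.get(sentence[i:j + 1], 1)) - log_total + route[j + 1][0], j)
            for j in dag[i]
        )
    words, i = [], 0
    while i < n:
        j = route[i][1] + 1
        words.append(sentence[i:j])
        i = j
    return words


def remove_stop_words(words: List[str]) -> List[str]:
    """Step 25: drop words found in the stop word dictionary."""
    return [w for w in words if w not in STOP_WORDS]


merged = {**BASE_DICT, **CUSTOM_DICT}
print(remove_stop_words(segment("公安局通知您的银行卡涉嫌洗钱", merged)))
print(segment("转到安全账户", BASE_DICT), segment("转到安全账户", merged))
```

The last line shows the effect of the custom dictionary: without it the phrase splits into two common words, while with it the fraud-scene term is kept as a single segmented word.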
In step two, calculating the word vector, pinyin vector and stroke vector of each segmented word to form the word combination vector of each segmented word in the text sample may further comprise:
using the skip-gram method to output a word vector $C_{w0}$, a character vector $C_c$, a pinyin vector $C_p$ and a stroke vector $C_b$ for each segmented word, and then constructing the word combination vectors of each segmented word:
[Formula images GDA0003880057270000052 and GDA0003880057270000053: definitions of the word combination vectors; not reproduced in the text]
wherein the symbols shown in formula image GDA0003880057270000054 denote the several word combination vectors obtained by different combination modes, and sum represents the summation operation.
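Because the combination formulas survive only as images, the exact combination modes cannot be recovered from the text. Purely as an illustration of what "different combination modes" built from a summation could look like, the sketch below assumes two hypothetical modes: summing all four vectors element-wise, and concatenating the word vector with the sum of the three sub-word vectors. The function names and both modes are assumptions, not the patent's formulas.

```python
from typing import List

Vector = List[float]


def vec_sum(*vectors: Vector) -> Vector:
    """Element-wise sum of vectors that share the same length."""
    if len({len(v) for v in vectors}) != 1:
        raise ValueError("all vectors must have the same length")
    return [sum(xs) for xs in zip(*vectors)]


def concat(*vectors: Vector) -> Vector:
    """Concatenate vectors end to end."""
    return [x for v in vectors for x in v]


def combine_sum_all(c_w0: Vector, c_c: Vector, c_p: Vector, c_b: Vector) -> Vector:
    """Hypothetical mode 1: sum of word, character, pinyin and stroke vectors."""
    return vec_sum(c_w0, c_c, c_p, c_b)


def combine_word_with_subword_sum(c_w0: Vector, c_c: Vector, c_p: Vector, c_b: Vector) -> Vector:
    """Hypothetical mode 2: word vector concatenated with the summed sub-word vectors."""
    return concat(c_w0, vec_sum(c_c, c_p, c_b))


c_w0, c_c, c_p, c_b = [1.0, 2.0], [0.5, 0.5], [0.0, 1.0], [1.0, 0.0]
print(combine_sum_all(c_w0, c_c, c_p, c_b))
print(combine_word_with_subword_sum(c_w0, c_c, c_p, c_b))
```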
The invention uses the skip-gram model to convert words into numerical vectors. The core of the skip-gram model is a Huffman tree: for each word, traversing from the root to a leaf node predicts one word in its context. Each word is iterated N-1 times, yielding predictions for all words in its context. That is, assuming a text sample S consists of n words $w_1, \ldots, w_n$, the word $w_t$ can be used to predict the occurrence probabilities of the 2k words around it when the context window size is k.
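The context window described above can be illustrated by generating the (center word, context word) training pairs that a skip-gram model learns from. This is a minimal sketch of the pairing only; the function name `skipgram_pairs` is hypothetical, and the Huffman-tree output layer and the training itself are not shown.

```python
from typing import List, Tuple


def skipgram_pairs(words: List[str], k: int) -> List[Tuple[str, str]]:
    """Pair each word w_t with the up to 2k words within k positions of it."""
    pairs = []
    for t, center in enumerate(words):
        lo = max(0, t - k)
        hi = min(len(words), t + k + 1)
        for i in range(lo, hi):
            if i != t:                      # a word is not its own context
                pairs.append((center, words[i]))
    return pairs


sentence = ["您", "的", "账户", "被", "冻结"]
for center, context in skipgram_pairs(sentence, k=1):
    print(center, "->", context)
```

Words near the start or end of the sample have fewer than 2k context words, since the window is clipped at the boundaries.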
As shown in FIG. 3, in step two, concatenating the word combination vector, part-of-speech tag and syntactic dependency label of each segmented word in the text sample to form the content vector of each segmented word, calculating the scene element label to which each segmented word belongs from its content vector, and finally averaging the content vectors and scene element labels of all segmented words in the text sample to obtain the semantic vector corresponding to the text, may further comprise:
step A1, setting a plurality of scene elements and labeling the scene element corresponding to each segmented word in light of the specific event scene; the 12 classes of scene elements shown in the following table are labeled, and segmented words that belong to none of the 12 classes are assigned to an "other" scene element;
[Table images GDA0003880057270000055 and GDA0003880057270000061: the 12 classes of scene elements; not reproduced in the text]
step A2, inputting the word combination vector, part-of-speech tag and syntactic dependency label of each segmented word in the text sample into an LSTM model for encoding, to obtain the content vector corresponding to each segmented word;
step A3, using Self-Attention to calculate, from the word combination vector of each segmented word, the weighted influence factor of each segmented word relative to the other segmented words;
step A4, combining the content vector of each segmented word obtained in step A2 and the weighted influence factor of each segmented word obtained in step A3 into a new content vector for each segmented word, and then inputting the new content vector of each segmented word into a CNN model, the output of which is the scene element corresponding to each segmented word;
and step A5, inputting the new content vector and the scene element of each segmented word in the text sample into an LSTM model for encoding, combining the LSTM outputs corresponding to all segmented words in the text sample into a vector matrix, and taking the average over the second dimension of the vector matrix as the semantic vector of the text sample.
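Two of the operations in steps A3 and A5 are simple enough to sketch without a deep-learning framework. The example below computes scaled dot-product self-attention weights between segmented words and mean-pools per-word outputs into one vector. It rests on assumptions the patent does not confirm: no learned query/key/value projections are used, and the average is taken across the segmented words so that the result has the length of one per-word output. The LSTM and CNN stages are not shown.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention_weights(vectors: List[Vector]) -> List[List[float]]:
    """Step A3 (sketch): row i holds the weights of word i over all words.

    Uses scaled dot products of the raw vectors; real models would normally
    apply learned projections first (omitted here as a simplification).
    """
    scale = math.sqrt(len(vectors[0]))
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / scale for k in vectors])
        for q in vectors
    ]


def mean_pool(matrix: List[Vector]) -> Vector:
    """Step A5 (sketch): average the per-word outputs into one semantic vector."""
    n = len(matrix)
    return [sum(column) / n for column in zip(*matrix)]


word_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = self_attention_weights(word_vectors)
print(weights[0])
print(mean_pool([[1.0, 2.0], [3.0, 4.0]]))
```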
In step three, the fraud classification recognition model may be constructed based on a CNN model.
In step four, the fraud classification recognition model in the mobile phone terminal App works as follows:
the word combination vectors, part-of-speech tags and syntactic dependency labels of all segmented words in the call text are concatenated to form the content vector of each segmented word; the scene element label to which each segmented word belongs is calculated from its content vector; the content vectors and scene element labels of all segmented words in the call text are averaged to obtain the semantic vector corresponding to the call text; the semantic vector is input into the fraud classification recognition model in the mobile phone terminal App to obtain the fraud category to which the number of the call to be identified belongs; the recognized label is pushed to the user through an App message as a reminder; the user may choose whether to correct the label and to edit and desensitize the text again; and if the user authorizes it, the text and the label are uploaded to the server for secondary training.
It is worth mentioning that, in step two, several word combination vectors may be calculated for each segmented word in the training samples, for example
[Formula images GDA0003880057270000062 and GDA0003880057270000063: example word combination vectors; not reproduced in the text]
Then, in step three, the semantic vectors obtained from the different word combination vectors are each used to train a fraud classification recognition model, and, based on the recognition accuracy of the models corresponding to the different word combination vectors, the model with the highest recognition accuracy and its corresponding word combination vector are selected. In step four, the selected fraud classification recognition model and word combination vector are then used to calculate the fraud-related category to which the number of the call to be identified belongs.
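The selection procedure described above amounts to training one model per combination mode and keeping the most accurate one. A minimal sketch follows; `select_best_combination`, the mode names and the `dummy_evaluate` stub are hypothetical, and the accuracy numbers in the stub are placeholders that exist only to make the example runnable, not measured results.

```python
from typing import Callable, Dict, Iterable, Tuple


def select_best_combination(
    mode_names: Iterable[str],
    train_and_evaluate: Callable[[str], float],
) -> Tuple[str, Dict[str, float]]:
    """Train one model per combination mode and return the most accurate mode."""
    scores = {name: train_and_evaluate(name) for name in mode_names}
    best = max(scores, key=scores.get)
    return best, scores


def dummy_evaluate(mode: str) -> float:
    """Stand-in for real training; returns placeholder accuracies."""
    placeholder = {"mode_1": 0.5, "mode_2": 0.75, "mode_3": 0.25}
    return placeholder[mode]


best_mode, all_scores = select_best_combination(["mode_1", "mode_2", "mode_3"], dummy_evaluate)
print(best_mode, all_scores)
```

In a real system `train_and_evaluate` would build the semantic vectors with the given combination mode, train the classifier, and report accuracy on held-out samples; the selected mode must then also be used on the terminal in step four.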
The above describes only preferred embodiments of the present invention and is not to be construed as limiting it; any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (7)

1. A terminal fraud call identification method based on conversation text word vectors, characterized by comprising the following steps:
step one, a user marks an incoming call in a mobile phone terminal App; for an incoming call that the user marks as a fraud category, the call is extracted and converted into text after the user authorizes it, the converted text is then presented to the user for review and desensitization, and finally, after the user authorizes it, the reviewed and desensitized text is uploaded to a server and stored as a text sample;
step two, word segmentation and part-of-speech tagging are performed on each text sample in the server to obtain a syntactic dependency label for each segmented word; a word vector, a character vector, a pinyin vector and a stroke vector are then calculated for each segmented word in the text sample to form the word combination vector of each segmented word; the word combination vector, part-of-speech tag and syntactic dependency label of each segmented word are concatenated to form the content vector of each segmented word; the scene element label to which each segmented word belongs is calculated from its content vector; and finally the content vectors and scene element labels of all segmented words in the text sample are averaged to obtain the semantic vector corresponding to the text sample;
step three, a fraud classification recognition model is constructed whose input is the semantic vector corresponding to a text and whose output is the fraud category to which the text belongs; the model is trained using the text samples uploaded by users to the server as training samples, and the trained model is then pushed from the server to the users' mobile phone terminal App to update the model;
step four, after receiving a new call to be identified, the user's mobile phone terminal App extracts the content text of the call, performs word segmentation, and generates the part-of-speech tags, syntactic dependency labels and word combination vectors of all segmented words in the text; it then obtains the fraud category to which the number of the call to be identified belongs according to the fraud classification recognition model in the App, and prompts the user through an App message,
wherein, in step two, concatenating the word combination vector, part-of-speech tag and syntactic dependency label of each segmented word in the text sample to form the content vector of each segmented word, calculating the scene element label to which each segmented word belongs from its content vector, and finally averaging the content vectors and scene element labels of all segmented words in the text sample to obtain the semantic vector corresponding to the text, further comprises:
step A1, setting a plurality of scene elements;
step A2, inputting the word combination vector, part-of-speech tag and syntactic dependency label of each segmented word in the text sample into an LSTM model for encoding, to obtain the content vector corresponding to each segmented word;
step A3, using Self-Attention to calculate, from the word combination vector of each segmented word, the weighted influence factor of each segmented word relative to the other segmented words;
step A4, combining the content vector of each segmented word obtained in step A2 and the weighted influence factor of each segmented word obtained in step A3 into a new content vector for each segmented word, and then inputting the new content vector of each segmented word into a CNN model, the output of which is the scene element corresponding to each segmented word;
and step A5, inputting the new content vector and the scene element of each segmented word in the text sample into an LSTM model for encoding, combining the LSTM outputs corresponding to all segmented words in the text sample into a vector matrix, and taking the average over the second dimension of the vector matrix as the semantic vector of the text sample.
2. The method of claim 1, wherein step one further comprises:
step 11, after the user installs the mobile phone terminal App, the function of marking incoming calls becomes available; when the user uses this function to mark the current incoming call as a fraud category, the App uses an HMM algorithm to extract the content of the first 60 seconds of the call and generate a content text, then removes personal identity information from the content text based on general rules, and finally presents the desensitized text in the App for the user to view;
step 12, the user views the text and edits it to further complete the desensitization, then chooses whether to upload the desensitized text that the user marked as a fraud category to the server; if so, the text and its fraud category mark are uploaded to the server under the user's authorization;
step 13, the text received by the server is cleaned, which comprises removing abnormal characters other than Chinese characters, English letters and digits, uniformly replacing line feeds and placeholders with spaces, and collapsing runs of multiple spaces into a single space;
and step 14, the text is cleaned again: the first 180 characters of the text are retained, and texts shorter than 15 characters are discarded.
3. The method of claim 1, wherein performing word segmentation and part-of-speech tagging on each text sample in step two to obtain the syntactic dependency label of each segmented word further comprises:
step 21, generating a stop word dictionary based on Chinese grammar;
step 22, manually adding words common in fraud scenes as a custom dictionary;
step 23, performing word segmentation and part-of-speech tagging on the text sample using an HMM (hidden Markov model) algorithm based on a DAG (directed acyclic graph) word graph, with the custom dictionary supplied at the same time to optimize the segmentation result;
step 24, performing syntactic dependency analysis on each segmented word using a fast Offset-based algorithm and outputting the syntactic dependency label of each segmented word;
and step 25, filtering stop words out of the text sample using the stop word dictionary.
4. The method of claim 1, wherein in step two, the word vector, the pinyin vector, and the stroke vector of each participle are calculated to form a word combination vector of each participle in the text sample, and further comprising:
outputting a word vector C of each participle by using skip-Gram method w0 Word vector C c Pinyin vector C p And stroke vector C b Then, a word combination vector for each participle is constructed:
Figure FDA0003880057260000021
Figure FDA0003880057260000022
wherein,
Figure FDA0003880057260000023
are a plurality of word combination vectors obtained by different combination modes, and sum denotes the summation operation.
5. The method of claim 1, wherein in step three, the fraud classification recognition model is constructed based on a CNN model.
6. The method as claimed in claim 1, wherein in step four, the fraud classification recognition model works in the cell phone terminal App as follows:
forming a content vector for each participle in the call text by combining its word combination vector, part-of-speech tag and syntactic dependency label; calculating, from the content vector of each participle, the scene element label to which the participle belongs; averaging the content vectors and scene element labels of all participles in the call text to obtain the semantic vector corresponding to the call text; inputting that semantic vector into the fraud classification recognition model in the mobile phone terminal App to obtain the fraud category to which the call number to be recognized belongs; pushing the recognized label to the user as an App message reminder; the user then choosing whether to correct the label and whether to edit and desensitize the text again; and, if the user authorizes it, uploading the text and the label to the server for secondary training.
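The averaging step can be sketched as below. The claim does not say how the word combination vector, part-of-speech tag and dependency label are "combined", so this sketch assumes simple concatenation with one-hot encodings. The tag inventories and function names are hypothetical, and the scene element label is left out because its calculation is not specified here.

```python
# Hypothetical tag inventories; real tag sets are much larger.
POS_TAGS = ["n", "v", "other"]
DEP_LABELS = ["SBV", "VOB", "other"]


def one_hot(value, inventory):
    """Encode a categorical value as a one-hot list over the inventory."""
    return [1.0 if value == item else 0.0 for item in inventory]


def content_vector(combo_vec, pos, dep):
    """Assumption: combine by concatenating the vector with one-hot tags."""
    return list(combo_vec) + one_hot(pos, POS_TAGS) + one_hot(dep, DEP_LABELS)


def semantic_vector(content_vectors):
    """Average the content vectors of all participles, dimension by dimension."""
    n = len(content_vectors)
    return [sum(column) / n for column in zip(*content_vectors)]


tokens = [
    content_vector([1.0, 3.0], "n", "SBV"),
    content_vector([3.0, 5.0], "v", "VOB"),
]
print(semantic_vector(tokens))
# [2.0, 4.0, 0.5, 0.5, 0.0, 0.5, 0.5, 0.0]
```

Averaging yields a fixed-length vector regardless of how many participles the call text contains, which is what lets a single classifier input size serve calls of different lengths.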
7. The method as claimed in claim 1, wherein in step two a plurality of word combination vectors are calculated for each participle in the training samples; in step three the semantic vectors corresponding to the different word combination vectors are respectively input into fraud classification recognition models for training, and the fraud classification recognition model with the highest recognition accuracy, together with its corresponding word combination vector, is selected according to the recognition accuracy of the models; and in step four the selected fraud classification recognition model and word combination vector are used to calculate the fraud category to which the call number to be recognized belongs.
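The selection described in this claim amounts to training one model per word combination vector and keeping the pair with the best accuracy. A minimal sketch, assuming each trained model is available as a prediction function and that a labeled evaluation set exists per combination (the names `accuracy` and `select_best` and the data layout are hypothetical):

```python
def accuracy(predict, samples):
    """Fraction of (features, label) samples the model classifies correctly."""
    correct = sum(1 for features, label in samples if predict(features) == label)
    return correct / len(samples)


def select_best(candidates, evaluation_sets):
    """Return the name of the most accurate candidate and all scores.

    candidates: {combination_name: predict_function}
    evaluation_sets: {combination_name: [(features, label), ...]}
    """
    scores = {
        name: accuracy(predict, evaluation_sets[name])
        for name, predict in candidates.items()
    }
    best = max(scores, key=scores.get)
    return best, scores
```

The returned name identifies both the model and the word combination vector to deploy in step four, since the two are selected as a pair.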

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010542362.9A CN111669757B (en) 2020-06-15 2020-06-15 Terminal fraud call identification method based on conversation text word vector

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010542362.9A CN111669757B (en) 2020-06-15 2020-06-15 Terminal fraud call identification method based on conversation text word vector

Publications (2)

Publication Number Publication Date
CN111669757A CN111669757A (en) 2020-09-15
CN111669757B CN111669757B (en) 2023-03-14

Family

ID=72387708

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010542362.9A Active CN111669757B (en) 2020-06-15 2020-06-15 Terminal fraud call identification method based on conversation text word vector

Country Status (1)

Country Link
CN (1) CN111669757B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112734050A (en) * 2020-12-11 2021-04-30 平安科技(深圳)有限公司 Text model training method, text model recognition device, text model equipment and storage medium
CN112766903B (en) * 2021-01-18 2024-02-06 阿斯利康投资(中国)有限公司 Method, device, equipment and medium for identifying adverse event
CN112906380B (en) * 2021-02-02 2024-09-27 北京有竹居网络技术有限公司 Character recognition method and device in text, readable medium and electronic equipment
CN113254595B (en) * 2021-06-22 2021-10-22 北京沃丰时代数据科技有限公司 Chatting recognition method and device, electronic equipment and storage medium
CN114091476A (en) * 2021-11-18 2022-02-25 北京淘友天下科技发展有限公司 Dialog recognition method and device, electronic equipment and computer readable storage medium
CN114021564B (en) * 2022-01-06 2022-04-01 成都无糖信息技术有限公司 Segmentation word-taking method and system for social text
CN117891926B (en) * 2024-03-15 2024-05-14 环球数科集团有限公司 Text feature fraud early warning system based on artificial intelligence
CN118445673B (en) * 2024-07-03 2024-09-20 北京秒信科技有限公司 Telecom fraud recognition analysis system based on intelligent analysis

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106294845A (en) * 2016-08-19 2017-01-04 清华大学 The many emotions sorting technique extracted based on weight study and multiple features and device
CN108153727A (en) * 2017-12-18 2018-06-12 浙江鹏信信息科技股份有限公司 Utilize the method for semantic mining algorithm mark sales calls and the system of improvement sales calls
CN108428447A (en) * 2018-06-19 2018-08-21 科大讯飞股份有限公司 A kind of speech intention recognition methods and device
CN109388801A (en) * 2018-09-30 2019-02-26 阿里巴巴集团控股有限公司 The determination method, apparatus and electronic equipment of similar set of words
CN109451182A (en) * 2018-10-19 2019-03-08 北京邮电大学 A kind of detection method and device of fraudulent call
CN110309299A (en) * 2018-04-12 2019-10-08 腾讯科技(深圳)有限公司 Communicate anti-swindle method, apparatus, computer-readable medium and electronic equipment
CN110427608A (en) * 2019-06-24 2019-11-08 浙江大学 A kind of Chinese word vector table dendrography learning method introducing layering ideophone feature

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108566627A (en) * 2017-11-27 2018-09-21 浙江鹏信信息科技股份有限公司 A kind of method and system identifying fraud text message using deep learning

Also Published As

Publication number Publication date
CN111669757A (en) 2020-09-15

Similar Documents

Publication Publication Date Title
CN111669757B (en) Terminal fraud call identification method based on conversation text word vector
CN112804400B (en) Customer service call voice quality inspection method and device, electronic equipment and storage medium
WO2022142041A1 (en) Training method and apparatus for intent recognition model, computer device, and storage medium
CN110362822B (en) Text labeling method, device, computer equipment and storage medium for model training
CN112380853B (en) Service scene interaction method and device, terminal equipment and storage medium
CN112992125B (en) Voice recognition method and device, electronic equipment and readable storage medium
CN109670166A (en) Collection householder method, device, equipment and storage medium based on speech recognition
CN111324708A (en) Natural language processing system based on human-computer interaction
CN112235470B (en) Incoming call client follow-up method, device and equipment based on voice recognition
CN112699683A (en) Named entity identification method and device fusing neural network and rule
CN112101003B (en) Sentence text segmentation method, device and equipment and computer readable storage medium
CN113270114A (en) Voice quality inspection method and system
CN113240510A (en) Abnormal user prediction method, device, equipment and storage medium
KR101440887B1 (en) Method and apparatus of recognizing business card using image and voice information
CN116610772A (en) Data processing method, device and server
CN116741155A (en) Speech recognition method, training method, device and equipment of speech recognition model
CN110728145A (en) Method for establishing natural language understanding model based on recording conversation
CN113420549B (en) Abnormal character string identification method and device
CN111464687A (en) Strange call request processing method and device
CN115440217A (en) Voice recognition method, device, equipment and storage medium
CN114595332A (en) Text classification prediction method and device and electronic equipment
CN113936666A (en) Audio data identification method and system based on BS (browser/server) architecture and readable storage medium
CN114254088A (en) Method for constructing automatic response model and automatic response method
CN110765300B (en) Semantic analysis method based on emoji
CN111161707A (en) Method for automatically supplementing quality inspection keyword list, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: 100029 Beijing city Chaoyang District Yumin Road No. 3

Patentee after: NATIONAL COMPUTER NETWORK AND INFORMATION SECURITY MANAGEMENT CENTER

Patentee after: Xinxun Digital Technology (Hangzhou) Co.,Ltd.

Address before: 100029 Beijing city Chaoyang District Yumin Road No. 3

Patentee before: NATIONAL COMPUTER NETWORK AND INFORMATION SECURITY MANAGEMENT CENTER

Patentee before: EB Information Technology Ltd.