CN114372461A - Hidden keyword extraction method, terminal device and storage medium - Google Patents

Hidden keyword extraction method, terminal device and storage medium

Publication number
CN114372461A
Authority
CN
China
Prior art keywords
word
library
words
phrase library
phrase
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111488191.7A
Other languages
Chinese (zh)
Inventor
陈云
杜新胜
吴松洋
蔡勇恩
汤增荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Meiya Pico Information Co Ltd
Original Assignee
Xiamen Meiya Pico Information Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Meiya Pico Information Co Ltd filed Critical Xiamen Meiya Pico Information Co Ltd
Priority to CN202111488191.7A priority Critical patent/CN114372461A/en
Publication of CN114372461A publication Critical patent/CN114372461A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a hidden keyword extraction method, a terminal device and a storage medium. The method comprises the following steps: S1: setting classification dimensions and the categories contained in each classification dimension, extracting text from the forensic data according to each category in each classification dimension, and constructing a text library for each category; S2: performing word segmentation on the texts in the text library of each category, and building the segmentation results into a phrase library for each category; S3: filtering the words in each phrase library; S4: calculating the similarity between each word in the phrase library and the known keywords, and removing words whose similarity is smaller than a similarity threshold; S5: calculating the weight of each word of the phrase library in the forensic data, and removing words whose weight is smaller than a weight threshold; S6: obtaining the hidden keywords of each category from the phrase library processed by the above steps. The invention enables automatic mining of case-related hidden keywords from massive forensic data.

Description

Hidden keyword extraction method, terminal device and storage medium
Technical Field
The invention relates to the technical field of digital forensics, in particular to a hidden keyword extraction method, a terminal device and a storage medium.
Background
With the rapid development of mobile internet technology, mobile phone forensic data appears in more and more casework. Analysis based on case-related keywords often plays a key supporting role in an investigation: particularly when no explicit clues are available, keywords can often be used to uncover case clues quickly and achieve a breakthrough, greatly improving the efficiency of investigators.
The existing use of case-related keywords has two shortcomings. On the one hand, as the persons involved gain expertise and counter-investigation awareness, they often use hidden keywords to carry out illegal activities. New case-related keywords change frequently and emerge endlessly and cannot be mastered in time, so clues to related cases are hard to find, which hampers investigation and enforcement. On the other hand, illegal activities are regional in character, with different types concentrated in different areas, so a keyword library compiled by traditional analysis struggles to adapt to conditions across the whole country and to changes over time.
Disclosure of Invention
In order to solve the above problems, the present invention provides a hidden keyword extraction method, a terminal device, and a storage medium.
The specific scheme is as follows:
A hidden keyword extraction method comprises the following steps:
S1: setting classification dimensions and the categories contained in each classification dimension, extracting text from the forensic data according to each category in each classification dimension, and constructing a text library for each category;
S2: performing word segmentation on the texts in the text library of each category, and building the segmentation results into a phrase library for each category;
S3: filtering the words in each phrase library;
S4: based on the phrase library processed in step S3, for each word in the phrase library, calculating the similarity between the word and the known keywords of the category corresponding to the phrase library, and removing from the phrase library the words whose similarity is smaller than a similarity threshold;
S5: based on the phrase library processed in step S4, for each word in the phrase library, calculating the weight of the word in the forensic data with the TF-IDF algorithm, and removing from the phrase library the words whose weight is smaller than a weight threshold;
S6: obtaining the hidden keywords of each category from the phrase library processed in step S5.
Further, the classification dimensions comprise case type, area where the case occurred, registered household location of the persons involved, ethnicity of the persons involved, application type and acquisition time.
Further, the forensic data is communication data, and the source of the communication data is one or more of short messages, instant chat content, email, microblogs, forum (Tieba) posts and input-method lexicons.
Further, before the text library is constructed, step S1 further comprises preprocessing the text used to construct the text library, the preprocessing comprising: deduplication, removal of invalid data, and conversion of semi-structured or unstructured data into structured data in text format.
Further, the filtering in step S3 comprises the following steps:
S31: performing stop word filtering on the words in the phrase library, and removing the stop words from the phrase library;
S32: based on the phrase library processed in step S31, performing part-of-speech filtering on the words in the phrase library according to the part of speech of each word, and removing from the phrase library the words whose parts of speech are not needed;
S33: based on the phrase library processed in step S32, removing from the phrase library the words that appear in a whitelist lexicon constructed to store common keywords;
S34: based on the phrase library processed in step S33, fitting a Gaussian distribution to all words in the phrase library according to the word frequency of each word, extracting from the fitted distribution the words that fall within the distribution interval corresponding to hidden keywords, and removing the other words from the phrase library.
Further, the similarity in step S4 is calculated as follows: the word vector of each word is computed with the word2vec algorithm, and the distance between the word vectors of two words is used as the similarity between the two words.
Further, step S6 is specifically: taking the phrase library processed in step S5 as the phrase library of the current time period, intersecting it with the keyword library of the historical time period, and taking the words in the difference between the phrase library of the current time period and the intersection result as the hidden keywords of the category corresponding to the phrase library.
A hidden keyword extraction terminal device comprises a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method of the embodiments of the present invention.
A computer-readable storage medium stores a computer program which, when executed by a processor, implements the steps of the method of the embodiments of the present invention described above.
With the above technical solution, the invention automatically mines case-related hidden keywords from massive forensic data. Newly discovered case-related keywords are updated iteratively and the keyword library keeps accumulating, so the hidden keyword discovery method has strong self-learning capability and adapts well to future changes.
Drawings
Fig. 1 is a flowchart illustrating a first embodiment of the present invention.
Detailed Description
To further illustrate the various embodiments, the invention provides the accompanying drawings. The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate embodiments of the invention and, together with the description, serve to explain the principles of the embodiments. Those skilled in the art will appreciate still other possible embodiments and advantages of the present invention with reference to these figures.
The invention will now be further described with reference to the accompanying drawings and detailed description.
The first embodiment is as follows:
This embodiment of the invention provides a hidden keyword extraction method comprising the following steps:
S1: setting classification dimensions and the categories contained in each classification dimension, extracting text from the forensic data according to each category in each classification dimension, and constructing a text library for each category.
The classification dimensions set in this embodiment include case type, area where the case occurred, registered household location of the persons involved, ethnicity of the persons involved, application type and acquisition time. Each classification dimension contains categories; for example, the categories of the application type dimension include WeChat, QQ, email, and the like.
In this embodiment the forensic data is communication data, whose sources include short messages, instant chat content, email, microblogs, forum (Tieba) posts, input-method lexicons, and the like.
Since the extracted text may contain repeated and invalid content, and may also contain semi-structured or unstructured data such as voice, pictures and documents, this embodiment further includes, before the text library is constructed, preprocessing the text used to construct it. The preprocessing includes: deduplication, removal of invalid data (e.g., empty text, pure emoticons, links, red packets, system messages, etc.), and conversion of semi-structured or unstructured data into structured data in text format. This conversion can be implemented with existing artificial intelligence algorithms and is not described in detail here.
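The grouping and preprocessing in step S1 can be sketched as follows. This is a minimal illustration and not the embodiment's implementation: the record layout (a dict with a "text" key and one key per classification dimension), the function names and the two regular expressions for invalid messages are all assumptions, and the conversion of voice or pictures to text is left out.

```python
import re
from collections import defaultdict

# Hypothetical patterns for "invalid" messages; the text above only lists
# examples (empty text, pure emoticons, links, red packets, system messages).
URL_ONLY = re.compile(r"^\s*https?://\S+\s*$")
EMOTICON_ONLY = re.compile(r"^\s*(\[[^\[\]]+\]\s*)+$")  # e.g. "[smile][smile]"


def is_invalid(text):
    """Return True for messages that carry no usable text."""
    if not text or not text.strip():
        return True
    return bool(URL_ONLY.match(text) or EMOTICON_ONLY.match(text))


def build_text_libraries(records, dimension):
    """Group message texts by the category value of one classification dimension.

    records: iterable of dicts with a "text" key plus one key per dimension.
    Returns {category: [text, ...]} with invalid and duplicate texts dropped.
    """
    libraries = defaultdict(list)
    seen = defaultdict(set)
    for record in records:
        text = record.get("text", "")
        if is_invalid(text):
            continue
        category = record[dimension]
        normalized = text.strip()
        if normalized in seen[category]:
            continue  # deduplication within the category
        seen[category].add(normalized)
        libraries[category].append(normalized)
    return dict(libraries)
```

Calling `build_text_libraries(records, "app_type")` once per dimension would yield one text library per category of that dimension; the key name `app_type` is likewise only an example.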
S2: performing word segmentation on the texts in the text library of each category, and building the segmentation results into a phrase library for each category.
The word segmentation can be performed with existing artificial intelligence algorithms and is not described in detail here.
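As a rough illustration of step S2, the sketch below segments text with dictionary-based forward maximum matching. This is a deliberately simple stand-in chosen so the example has no dependencies; it is not the AI-based segmenter the embodiment refers to, and the vocabulary, the `max_len` parameter and the function names are assumptions.

```python
def forward_max_match(text, vocabulary, max_len=4):
    """Segment text greedily, preferring the longest vocabulary entry at each position."""
    tokens = []
    i = 0
    while i < len(text):
        match = text[i]  # fall back to a single character
        for size in range(min(max_len, len(text) - i), 1, -1):
            candidate = text[i:i + size]
            if candidate in vocabulary:
                match = candidate
                break
        tokens.append(match)
        i += len(match)
    return tokens


def build_phrase_library(texts, vocabulary):
    """Flatten the segmentation results of all texts in one category into a token list."""
    library = []
    for text in texts:
        library.extend(forward_max_match(text, vocabulary))
    return library
```

In practice a trained segmenter would replace `forward_max_match`; the phrase library is simply the collected tokens of one category's text library.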
S3: filtering the words in each phrase library.
The filtering narrows the range of candidate hidden keywords, i.e., it excludes words that are unlikely to be hidden keywords. The filtering in this embodiment comprises the following steps:
S31: performing stop word filtering on the words in the phrase library, and removing the stop words from the phrase library.
Stop words, such as "yes" and "and", may be stored in a pre-constructed stop word library.
S32: based on the phrase library processed in step S31, performing part-of-speech filtering on the words in the phrase library according to the part of speech of each word, and removing from the phrase library the words whose parts of speech are not needed.
For example, only words with parts of speech such as nouns, verbs, quantifiers and adjectives need to be retained.
S33: based on the phrase library processed in step S32, removing from the phrase library the words that appear in a whitelist lexicon constructed to store common keywords.
The whitelist lexicon can be constructed by performing word frequency statistics on historical forensic data with a similar analysis method, combined with manual supplementation and data provided by third parties.
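Steps S31 to S33 amount to three set-membership filters applied in sequence, which can be sketched as below. The input format (word and part-of-speech tag pairs) and the tag names are assumptions; the embodiment does not name a tagger or a tag set.

```python
def filter_phrase_library(tagged_words, stop_words, kept_pos, whitelist):
    """Apply stop-word, part-of-speech and whitelist filters in order.

    tagged_words: list of (word, pos_tag) pairs from a segmenter/tagger.
    kept_pos: set of tags to retain, e.g. {"n", "v", "q", "a"} (hypothetical tag names).
    """
    # S31: drop stop words
    step1 = [(w, p) for w, p in tagged_words if w not in stop_words]
    # S32: keep only the wanted parts of speech
    step2 = [(w, p) for w, p in step1 if p in kept_pos]
    # S33: drop common keywords stored in the whitelist lexicon
    return [w for w, _ in step2 if w not in whitelist]
```

Keeping the three filters as separate passes mirrors the order of the steps in the text, at the cost of iterating over the list three times.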
S34: based on the phrase library processed in step S33, fitting a Gaussian distribution to all words in the phrase library according to the word frequency of each word, extracting from the fitted distribution the words that fall within the distribution interval corresponding to hidden keywords, and removing the other words from the phrase library.
Hidden keywords in a case, that is, case-related words such as criminal argot and trade jargon in the traditional sense, usually belong to the low-frequency words of the overall general content. Verification on a massive number of segmented phrases shows that the occurrence frequency statistics of segmented phrases follow a Gaussian distribution in general content. Based on the characteristics of the Gaussian distribution, the distribution interval of hidden keywords is calculated within the corresponding group for each dimension, such as case type, case area, registered household location of the persons involved, ethnicity of the persons involved, application type and acquisition time, so that most normal keywords can be eliminated.
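One way to read step S34 is as a z-score cut on word frequencies: estimate the mean and standard deviation of the frequencies, then keep only the words whose standardized frequency lies in a chosen interval. The sketch below follows that reading. The interval bounds `z_low` and `z_high` are assumptions, since the text does not state how the distribution interval for hidden keywords is chosen.

```python
from collections import Counter
from statistics import mean, pstdev


def gaussian_interval_filter(words, z_low, z_high):
    """Keep words whose frequency z-score falls inside [z_low, z_high]."""
    counts = Counter(words)
    freqs = list(counts.values())
    mu = mean(freqs)
    sigma = pstdev(freqs)
    if sigma == 0:
        return set(counts)  # all frequencies equal: nothing to separate
    return {w for w, c in counts.items() if z_low <= (c - mu) / sigma <= z_high}
```

Because hidden keywords are described as low-frequency words, an interval below the mean (for example a negative `z_high`) would be the natural choice, tuned separately for each category group.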
S4: based on the phrase library processed in step S3, for each word in the phrase library, calculating the similarity between the word and the known keywords of the category corresponding to the phrase library, and removing from the phrase library the words whose similarity is smaller than a similarity threshold.
The similarity in this embodiment is calculated as follows: the word vector of each word is computed with the word2vec algorithm, and the distance between the word vectors of two words is used as the similarity between the two words. The known keywords of the different categories may be collected manually and stored in a known keyword library.
The similarity threshold may be set by a person skilled in the art as required and is not limited here.
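The similarity filter in step S4 can be sketched as follows, assuming the word vectors have already been trained (for example with word2vec) and are available as a plain dictionary. Cosine similarity is used here as the measure and the best match over all known keywords decides whether a word is kept; both choices are assumptions, as the text only says that the distance between word vectors serves as the similarity.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_by_similarity(candidates, known_keywords, vectors, threshold):
    """Keep candidates whose best similarity to any known keyword reaches the threshold."""
    kept = []
    for word in candidates:
        if word not in vectors:
            continue  # no vector available: cannot be scored
        best = max(
            (cosine_similarity(vectors[word], vectors[k])
             for k in known_keywords if k in vectors),
            default=0.0,
        )
        if best >= threshold:
            kept.append(word)
    return kept
```

Words without a vector are dropped in this sketch; a real pipeline might instead retrain the vectors on the current text library so that new words are covered.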
S5: based on the phrase library processed in step S4, for each word in the phrase library, calculating the weight of the word in the forensic data with the TF-IDF algorithm, and removing from the phrase library the words whose weight does not fall within the weight range.
Combined with the proportion of each word in the whole phrase library, the weight of each word is further calculated with the TF-IDF algorithm, and the result range is further narrowed by setting a weight threshold, so that staff can further analyze and judge the credibility of the keywords in light of the actual business scenario.
The weight range can be set by a person skilled in the art according to experience and experimental results and is not limited here.
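A minimal TF-IDF sketch for step S5 is given below, using the common definition tf = (count of the word in a document) / (document length) and idf = log(N / document frequency). Treating each message as one document, taking a word's highest score across documents as its weight, and modelling the weight range as a simple lower bound are assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf_weights(documents):
    """documents: list of token lists. Returns {word: highest TF-IDF across documents}."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each word once per document
    weights = {}
    for tokens in documents:
        if not tokens:
            continue
        counts = Counter(tokens)
        for word, count in counts.items():
            tf = count / len(tokens)
            idf = math.log(n_docs / doc_freq[word])
            weights[word] = max(weights.get(word, 0.0), tf * idf)
    return weights


def filter_by_weight(words, weights, threshold):
    """Keep the words whose weight reaches the threshold."""
    return [w for w in words if weights.get(w, 0.0) >= threshold]
```

With this unsmoothed idf, a word that occurs in every document gets weight 0 and is always removed, which matches the intent of discarding ubiquitous words.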
S6: obtaining the hidden keywords of each category from the phrase library processed in step S5.
Because case-related keywords change with the characteristics and methods of illegal behavior and with the counter-investigation awareness of the persons involved, they are strongly time-sensitive. Cross-comparing the case-related hidden keywords through keyword set operations based on a dynamic time window eliminates interference from the time dimension as far as possible.
In this embodiment, the phrase library processed in step S5 is taken as the phrase library of the current time period and intersected with the keyword library of the historical time period, and the words in the difference between the phrase library of the current time period and the intersection result are taken as the hidden keywords of the category corresponding to the phrase library. These are the hidden keywords that newly appear in the current time period. In the next time period, the hidden keywords of the current time period can be added to the keyword library of the historical time period before the hidden keywords of the next time period are extracted.
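The set operations of step S6 can be written directly with Python sets, as in the sketch below. The function names are illustrative, and the historical keyword library is assumed to be any iterable of words.

```python
def new_hidden_keywords(current_library, historical_keywords):
    """Words of the current period that are absent from the historical keyword library."""
    current = set(current_library)
    overlap = current & set(historical_keywords)  # intersection with history
    return current - overlap  # difference: words new in this period


def roll_window(historical_keywords, new_keywords):
    """Fold this period's hidden keywords into the history for the next period."""
    return set(historical_keywords) | set(new_keywords)
```

Subtracting the intersection gives the same result as subtracting the historical library directly; the two-step form is kept here only to follow the wording of the step.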
According to this embodiment, the communication content in the forensic data is finely divided along dimensions such as case type, case area, registered household location, ethnicity, application type and acquisition time; the grouped communication content is automatically segmented; multiple screening strategies are combined with the Gaussian distribution to greatly reduce interfering data; and weights are calculated with a corresponding algorithm. Massive forensic data can thus be processed in real time and case-related hidden keywords discovered automatically, with no manual intervention anywhere in the processing chain.
This embodiment can support all case types and communication application types, can adapt to dialects, special code words and the like, is not restricted by any particular conditions, and adapts well to the characteristics and patterns of offending in different areas. Newly discovered hidden keywords can be further incorporated into the case-related keyword library for continuous iterative updating and refinement, and used for subsequent analysis and clue mining, improving overall working efficiency.
The second embodiment is as follows:
The invention further provides a hidden keyword extraction terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method of the first embodiment of the invention.
Further, as an executable scheme, the hidden keyword extraction terminal device may be a desktop computer, a notebook, a palmtop computer, a cloud server, or another computing device. The hidden keyword extraction terminal device may include, but is not limited to, a processor and a memory. Those skilled in the art will understand that the above component structure is only an example of the hidden keyword extraction terminal device and does not limit it; the device may include more or fewer components than described above, combine some components, or use different components. For example, the hidden keyword extraction terminal device may further include an input/output device, a network access device, a bus, and the like, which is not limited in this embodiment of the present invention.
Further, as an executable solution, the Processor may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, a discrete Gate or transistor logic device, a discrete hardware component, and the like. The general processor can be a microprocessor or the processor can be any conventional processor, and the processor is a control center of the hidden keyword extraction terminal device and is connected with each part of the whole hidden keyword extraction terminal device by various interfaces and lines.
The memory can be used to store the computer program and/or modules, and the processor implements the various functions of the hidden keyword extraction terminal device by running or executing the computer program and/or modules stored in the memory and calling the data stored in the memory. The memory may mainly comprise a program storage area and a data storage area, wherein the program storage area may store an operating system and the application programs required by at least one function, and the data storage area may store data created according to the use of the mobile phone, and the like. In addition, the memory may include high speed random access memory, and may also include non-volatile memory, such as a hard disk, a memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), at least one magnetic disk storage device, a Flash memory device, or other volatile solid state storage device.
The invention also provides a computer-readable storage medium, in which a computer program is stored, which, when being executed by a processor, carries out the steps of the above-mentioned method of an embodiment of the invention.
If the modules/units integrated in the hidden keyword extraction terminal device are implemented in the form of software functional units and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, all or part of the flow of the method of the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium; when the computer program is executed by a processor, the steps of the method embodiments are implemented. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), a software distribution medium, and the like.
While the invention has been particularly shown and described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (9)

1. A hidden keyword extraction method, characterized by comprising the following steps:
S1: setting classification dimensions and the categories contained in each classification dimension, extracting text from the forensic data according to each category in each classification dimension, and constructing a text library for each category;
S2: performing word segmentation on the texts in the text library of each category, and building the segmentation results into a phrase library for each category;
S3: filtering the words in each phrase library;
S4: based on the phrase library processed in step S3, for each word in the phrase library, calculating the similarity between the word and the known keywords of the category corresponding to the phrase library, and removing from the phrase library the words whose similarity is smaller than a similarity threshold;
S5: based on the phrase library processed in step S4, for each word in the phrase library, calculating the weight of the word in the forensic data with the TF-IDF algorithm, and removing from the phrase library the words whose weight does not fall within the weight range;
S6: obtaining the hidden keywords of each category from the phrase library processed in step S5.
2. The hidden keyword extraction method according to claim 1, characterized in that: the classification dimensions comprise case type, area where the case occurred, registered household location of the persons involved, ethnicity of the persons involved, application type and acquisition time.
3. The hidden keyword extraction method according to claim 1, characterized in that: the forensic data is communication data, and the source of the communication data is one or more of short messages, instant chat content, email, microblogs, forum (Tieba) posts and input-method lexicons.
4. The hidden keyword extraction method according to claim 1, characterized in that: in step S1, before the text library is constructed, the method further comprises preprocessing the text used to construct the text library, the preprocessing comprising: deduplication, removal of invalid data, and conversion of semi-structured or unstructured data into structured data in text format.
5. The hidden keyword extraction method according to claim 1, characterized in that: the filtering in step S3 comprises the following steps:
S31: performing stop word filtering on the words in the phrase library, and removing the stop words from the phrase library;
S32: based on the phrase library processed in step S31, performing part-of-speech filtering on the words in the phrase library according to the part of speech of each word, and removing from the phrase library the words whose parts of speech are not needed;
S33: based on the phrase library processed in step S32, removing from the phrase library the words that appear in a whitelist lexicon constructed to store common keywords;
S34: based on the phrase library processed in step S33, fitting a Gaussian distribution to all words in the phrase library according to the word frequency of each word, extracting from the fitted distribution the words that fall within the distribution interval corresponding to hidden keywords, and removing the other words from the phrase library.
6. The hidden keyword extraction method according to claim 1, characterized in that: the similarity in step S4 is calculated as follows: the word vector of each word is computed with the word2vec algorithm, and the distance between the word vectors of two words is used as the similarity between the two words.
7. The hidden keyword extraction method according to claim 1, characterized in that: step S6 specifically comprises: taking the phrase library processed in step S5 as the phrase library of the current time period, intersecting it with the keyword library of the historical time period, and taking the words in the difference between the phrase library of the current time period and the intersection result as the hidden keywords of the category corresponding to the phrase library.
8. A hidden keyword extraction terminal device, characterized by comprising a processor, a memory and a computer program stored in said memory and executable on said processor, said processor implementing the steps of the method according to any one of claims 1 to 7 when executing said computer program.
9. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 7.
CN202111488191.7A 2021-12-07 2021-12-07 Hidden keyword extraction method, terminal device and storage medium Pending CN114372461A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111488191.7A CN114372461A (en) 2021-12-07 2021-12-07 Hidden keyword extraction method, terminal device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111488191.7A CN114372461A (en) 2021-12-07 2021-12-07 Hidden keyword extraction method, terminal device and storage medium

Publications (1)

Publication Number Publication Date
CN114372461A true CN114372461A (en) 2022-04-19

Family

ID=81139929

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111488191.7A Pending CN114372461A (en) 2021-12-07 2021-12-07 Hidden keyword extraction method, terminal device and storage medium

Country Status (1)

Country Link
CN (1) CN114372461A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination