CN114372461A

CN114372461A - Hidden keyword extraction method, terminal device and storage medium

Info

Publication number: CN114372461A
Application number: CN202111488191.7A
Authority: CN
Inventors: 陈云; 杜新胜; 吴松洋; 蔡勇恩; 汤增荣
Original assignee: Xiamen Meiya Pico Information Co Ltd
Current assignee: Xiamen Meiya Pico Information Co Ltd
Priority date: 2021-12-07
Filing date: 2021-12-07
Publication date: 2022-04-19

Abstract

The invention relates to a hidden keyword extraction method, a terminal device and a storage medium. The method comprises the following steps: S1: setting classification dimensions and the categories contained in each classification dimension, extracting text from the forensic data according to each category in each classification dimension, and constructing a text library for each category; S2: performing word segmentation on the texts in the text library of each category, and building the segmentation results into a phrase library for each category; S3: filtering the words in each phrase library; S4: calculating the similarity between each word in the phrase library and the known keywords, and removing words whose similarity is smaller than a similarity threshold; S5: calculating the weight of each word of the phrase library in the forensic data, and removing words whose weight is smaller than a weight threshold; S6: obtaining the hidden keywords of each category from the phrase library processed by the above steps. The invention realizes automatic mining of case-related hidden keywords in massive forensic data.

Description

Hidden keyword extraction method, terminal device and storage medium

Technical Field

The invention relates to the technical field of forensics, in particular to a hidden keyword extraction method, a terminal device and a storage medium.

Background

With the rapid development of mobile internet technology, mobile phone forensic data appear in more and more casework. Analysis based on case-related keywords can often play a key supporting role during case investigation. In particular, when no explicit clues are available, keywords can often be used to uncover case clues quickly and achieve a breakthrough in the case, greatly improving the working efficiency of investigators.

The existing use of case-related keywords has two problems. On the one hand, as the professional knowledge and counter-investigation awareness of the persons involved increase, hidden keywords are often used for illegal activities, and new case-related keywords that change frequently and emerge endlessly cannot be grasped in time, so that clues to related cases are difficult to find, which hampers investigation and enforcement work. On the other hand, illegal activities are regional in character, with different types concentrated in different areas, so that a keyword library summarized by traditional analysis is difficult to adapt to conditions across the whole country and to changes over time.

Disclosure of Invention

In order to solve the above problems, the present invention provides a hidden keyword extraction method, a terminal device, and a storage medium.

The specific scheme is as follows:

A hidden keyword extraction method comprises the following steps:

S1: setting classification dimensions and the categories contained in each classification dimension, extracting text from the forensic data according to each category in each classification dimension, and constructing a text library for each category;

S2: performing word segmentation on the texts in the text library of each category, and building the segmentation results into a phrase library for each category;

S3: filtering the words in each phrase library;

S4: based on the phrase library processed in step S3, for each word in the phrase library, calculating the similarity between the word and the known keywords in the category corresponding to the phrase library, and removing from the phrase library the words whose similarity is smaller than a similarity threshold;

S5: based on the phrase library processed in step S4, for each word in the phrase library, calculating the weight of the word in the forensic data by the TF-IDF algorithm, and removing from the phrase library the words whose weight is smaller than a weight threshold;

S6: obtaining the hidden keywords of each category from the phrase library processed in step S5.

Further, the classification dimensions comprise case type, area where the case occurred, household registration location of the persons involved, ethnic group of the persons involved, application type and acquisition time.

Further, the forensic data is communication data, and the source of the communication data is one or more of short messages, instant chat content, e-mail, microblogs, post bars and data method word banks.

Further, before the text library is constructed, step S1 further includes preprocessing the text used to construct the text library, where the preprocessing includes: deduplication, removal of invalid data, and conversion of semi-structured or unstructured data into structured data in text format.

Further, the filtering in step S3 includes the following steps:

S31: performing stop word filtering on the words in the phrase library, and removing the stop words from the phrase library;

S32: based on the phrase library processed in step S31, performing part-of-speech filtering on the words in the phrase library according to the part of speech of each word, and removing from the phrase library the words with unneeded parts of speech;

S33: based on the phrase library processed in step S32, removing from the phrase library the words that exist in a pre-constructed whitelist lexicon that stores common keywords;

S34: based on the phrase library processed in step S33, fitting a Gaussian distribution to all words in the phrase library according to the word frequency of each word, extracting from the fitted distribution the words that fall within the distribution interval corresponding to hidden keywords, and removing the other words from the phrase library.

Further, the similarity calculation method in step S4 includes: after the word vector of each word is calculated through the word2vec algorithm, the distance between the word vectors of the two words is used as the similarity between the two words.

Further, step S6 is specifically: taking the phrase library processed in step S5 as the phrase library of the current time period, intersecting it with the keyword library of the historical time period, and taking the words in the difference between the phrase library of the current time period and the intersection result as the hidden keywords of the category corresponding to the phrase library.

A hidden keyword extracting terminal device includes a processor, a memory, and a computer program stored in the memory and capable of running on the processor, where the processor executes the computer program to implement the steps of the method in the embodiment of the present invention.

A computer-readable storage medium, in which a computer program is stored, which, when being executed by a processor, carries out the steps of the method as described above for an embodiment of the invention.

With the above technical solution, the invention realizes automatic mining of case-related hidden keywords in massive forensic data. Newly discovered case-related keywords are updated iteratively and the keyword library accumulates continuously, so that the hidden keyword discovery method has strong self-learning capability and strong adaptability to future changes.

Drawings

Fig. 1 is a flowchart illustrating a first embodiment of the present invention.

Detailed Description

To further illustrate the various embodiments, the invention provides the accompanying drawings. The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate embodiments of the invention and, together with the description, serve to explain the principles of the embodiments. Those skilled in the art will appreciate still other possible embodiments and advantages of the present invention with reference to these figures.

The invention will now be further described with reference to the accompanying drawings and detailed description.

The first embodiment is as follows:

The embodiment of the invention provides a hidden keyword extraction method which, as shown in Fig. 1, comprises the following steps:

S1: setting classification dimensions and the categories contained in each classification dimension, extracting text from the forensic data according to each category in each classification dimension, and constructing a text library for each category.

The classification dimensions set in this embodiment include case type, area where the case occurred, household registration location of the persons involved, ethnic group of the persons involved, application type and acquisition time. Each classification dimension contains several categories; for example, the categories under the application type dimension include WeChat, QQ, e-mail, and the like.

The forensic data in this embodiment is communication data, and the sources of the communication data include short messages, instant chat content, e-mail, microblogs, post bars, data method word banks, and the like.

Since the extracted text may contain repeated content and invalid content, and may also contain semi-structured or unstructured data such as voice, pictures and documents, this embodiment further includes, before constructing the text library, preprocessing the text used to construct the text library. The preprocessing includes: deduplication, removal of invalid data (e.g., empty text, emoticon-only messages, links, red packets, system messages, etc.), and conversion of semi-structured or unstructured data into structured data in text format. The conversion into structured data can be realized with existing artificial intelligence algorithms, and details are not described herein.
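The grouping and preprocessing of step S1 can be sketched as follows. This is only a minimal illustration under stated assumptions: the record layout (a dict with a "text" field and one field per classification dimension), the URL pattern, and the system-message markers are hypothetical and are not specified by the embodiment; conversion of voice or pictures to text is left out.

```python
import re
from collections import defaultdict

# Hypothetical rules for "invalid" messages; real rules would be tuned per data source.
URL_ONLY = re.compile(r"^\s*https?://\S+\s*$")
SYSTEM_PREFIXES = ("[system]", "[red packet]")  # assumed markers, not from the source


def is_invalid(text):
    """Return True for empty text, bare links, and assumed system markers."""
    stripped = text.strip()
    if not stripped:
        return True
    if URL_ONLY.match(stripped):
        return True
    return stripped.lower().startswith(SYSTEM_PREFIXES)


def build_text_libraries(records, dimension):
    """Group cleaned, deduplicated texts by their category under one dimension."""
    libraries = defaultdict(list)
    seen = defaultdict(set)
    for record in records:
        text = record.get("text", "")
        category = record.get(dimension)
        if category is None or is_invalid(text):
            continue
        normalized = " ".join(text.split())  # collapse whitespace before comparing
        if normalized in seen[category]:
            continue  # deduplicate within a category
        seen[category].add(normalized)
        libraries[category].append(normalized)
    return dict(libraries)


records = [
    {"text": "meet at the old place", "application_type": "WeChat"},
    {"text": "meet at the  old place", "application_type": "WeChat"},
    {"text": "https://example.com/x", "application_type": "WeChat"},
    {"text": "", "application_type": "QQ"},
    {"text": "bring the goods tonight", "application_type": "QQ"},
]
libraries = build_text_libraries(records, "application_type")
```

Running the same function once per classification dimension would yield one set of text libraries per dimension.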

S2: performing word segmentation on the texts in the text library of each category, and building the segmentation results into a phrase library for each category.

The word segmentation can be performed with existing artificial intelligence algorithms, which are not described herein.
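Step S2 can be sketched as below. The embodiment does not name a segmenter, so the sketch takes the segmenter as a parameter and uses a simple regular-expression splitter as a stand-in; for Chinese text a dedicated word segmentation tool would be substituted. The function names are hypothetical.

```python
import re
from collections import Counter


def simple_segment(text):
    """Stand-in segmenter: lowercase runs of word characters."""
    return re.findall(r"\w+", text.lower())


def build_phrase_libraries(text_libraries, segment=simple_segment):
    """Map each category to a Counter of word -> occurrence count."""
    phrase_libraries = {}
    for category, texts in text_libraries.items():
        counts = Counter()
        for text in texts:
            counts.update(segment(text))
        phrase_libraries[category] = counts
    return phrase_libraries


text_libraries = {"WeChat": ["Meet at the old place", "the place is safe"]}
phrase_libraries = build_phrase_libraries(text_libraries)
```

Keeping the counts alongside the words is a design choice of this sketch; the later frequency-based filtering needs them.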

S3: filtering the words in each phrase library.

The filtering process is used to narrow the scope of the hidden keyword, i.e., to exclude words that are unlikely to be hidden keywords. The filtering process in this embodiment includes the steps of:

S31: performing stop word filtering on the words in the phrase library, and removing the stop words from the phrase library.

Stop words, such as "yes" and "and", may be stored in a pre-constructed stop word library.

S32: based on the phrase library processed in step S31, the parts of speech filtering processing is performed on the words in the phrase library according to the parts of speech of each word, and the words corresponding to the unnecessary parts of speech in the phrase library are removed.

For example, only words with parts of speech such as nouns, verbs, quantifiers and adjectives need to be retained.

S33: based on the phrase library processed in step S32, removing from the phrase library the words that exist in a pre-constructed whitelist lexicon that stores common keywords.

The whitelist lexicon can be constructed by performing word frequency statistics with a similar analysis method on historical forensic data, combined with manual supplementation and data provided by third parties.
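Steps S31 to S33 amount to three successive set filters, which can be sketched as follows. The part-of-speech tags ("n", "v", "q", "adj", and so on) and the example word lists are hypothetical; the tags are assumed to come from an upstream tagger that the embodiment does not specify.

```python
def filter_phrase_library(tagged_words, stop_words, keep_pos, whitelist):
    """Apply stop-word, part-of-speech, and whitelist filters in that order.

    `tagged_words` maps word -> part-of-speech tag.
    """
    # S31: drop stop words.
    remaining = {w: pos for w, pos in tagged_words.items() if w not in stop_words}
    # S32: keep only the wanted parts of speech.
    remaining = {w: pos for w, pos in remaining.items() if pos in keep_pos}
    # S33: drop common keywords found in the whitelist lexicon.
    return {w for w in remaining if w not in whitelist}


tagged = {"and": "conj", "deliver": "v", "snow": "n", "quickly": "adv", "phone": "n"}
candidates = filter_phrase_library(
    tagged,
    stop_words={"and", "yes"},
    keep_pos={"n", "v", "q", "adj"},
    whitelist={"phone"},
)
```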

S34: based on the phrase library processed in step S33, fitting a Gaussian distribution to all words in the phrase library according to the word frequency of each word, extracting from the fitted distribution the words that fall within the distribution interval corresponding to hidden keywords, and removing the other words from the phrase library.

Hidden keywords in a case, i.e., case-related words such as argot and jargon in the traditional sense, are usually low-frequency words in general content. Verification on a massive number of segmented phrases shows that the occurrence frequencies of segmented phrases in general content follow a Gaussian distribution. Based on the characteristics of the Gaussian distribution, the words are assigned to their corresponding groups by dimensions such as case type, case area, household registration location of the persons involved, ethnic group of the persons involved, application type and acquisition time, and the distribution interval of hidden keywords is calculated for each group, so that most normal keywords can be eliminated.
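One possible reading of the Gaussian-interval filter is sketched below. The embodiment does not state which quantity is fitted or how the interval is chosen, so the sketch makes two assumptions: the Gaussian is fitted to the logarithm of the word counts, and the hidden-keyword interval is given as a pair of standardized bounds `z_low` and `z_high` supplied by the analyst.

```python
import math
from statistics import mean, pstdev


def select_by_frequency_interval(word_counts, z_low, z_high):
    """Keep words whose standardized log-frequency lies in [z_low, z_high]."""
    log_freq = {w: math.log(c) for w, c in word_counts.items() if c > 0}
    if not log_freq:
        return set()
    values = list(log_freq.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return set(log_freq)  # all frequencies equal; nothing to separate
    return {w for w, v in log_freq.items() if z_low <= (v - mu) / sigma <= z_high}


counts = {"the": 1000, "hello": 100, "snow": 10, "xq7": 1}
# Assumed interval: below-average frequency, but not the extreme tail.
selected = select_by_frequency_interval(counts, z_low=-1.0, z_high=0.0)
```

In practice the bounds would be calibrated per group (case type, area, application type, and so on) against known hidden keywords.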

S4: based on the phrase library processed in step S3, for each word in the phrase library, similarity calculation is performed between the word and the known keyword in the category corresponding to the phrase library, and words with similarity smaller than a similarity threshold are removed from the phrase library.

The method for calculating the similarity in the embodiment comprises the following steps: after the word vector of each word is calculated through the word2vec algorithm, the distance between the word vectors of the two words is used as the similarity between the two words. The different categories of known keywords may be stored in the form of a library of known keywords by way of manual collection.

The similarity threshold may be set by a person skilled in the art according to requirements, and is not limited herein.
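The similarity filter of step S4 can be sketched as follows, assuming the word vectors have already been produced by a word2vec model. The embodiment only says that the distance between word vectors serves as the similarity; cosine similarity and the "best match over all known keywords" rule used here are assumptions of this sketch, as are the toy vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_by_similarity(candidates, known_keywords, vectors, threshold):
    """Keep candidates whose best similarity to a known keyword reaches the threshold."""
    kept = set()
    for word in candidates:
        if word not in vectors:
            continue  # no embedding available; treated as dissimilar in this sketch
        best = max(
            (cosine_similarity(vectors[word], vectors[k])
             for k in known_keywords if k in vectors),
            default=0.0,
        )
        if best >= threshold:
            kept.add(word)
    return kept


# Toy two-dimensional vectors standing in for trained word2vec embeddings.
vectors = {"ice": [1.0, 0.0], "snow": [0.9, 0.1], "lunch": [0.0, 1.0]}
kept = filter_by_similarity({"snow", "lunch"}, {"ice"}, vectors, threshold=0.8)
```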

S5: based on the phrase library processed in step S4, for each word in the phrase library, the weight of the word in the forensic data is calculated by the TF-IDF algorithm, and words whose weights do not conform to the weight range are removed from the phrase library.

Taking into account the proportion of each word in the whole phrase library, the weight of each word is further calculated with the TF-IDF algorithm, and the result range is further narrowed by setting a weight threshold, so that staff can further analyze and judge the credibility of the keywords in light of the actual business scenario.

The weight range can be set by a person skilled in the art according to experience and experimental results, and is not limited herein.
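A minimal TF-IDF weighting for step S5 is sketched below. The embodiment does not define the document unit or how per-document scores are combined, so this sketch treats each tokenized message as a document, uses the unsmoothed form tf * ln(N / df), and keeps each word's highest score; all three choices are assumptions.

```python
import math
from collections import Counter


def tf_idf_weights(documents):
    """Return word -> highest TF-IDF score across tokenized documents."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each word once per document
    weights = {}
    for tokens in documents:
        if not tokens:
            continue
        for word, count in Counter(tokens).items():
            tf = count / len(tokens)
            idf = math.log(n_docs / doc_freq[word])
            weights[word] = max(weights.get(word, 0.0), tf * idf)
    return weights


def filter_by_weight(candidates, weights, threshold):
    """Keep candidates whose weight reaches the threshold."""
    return {w for w in candidates if weights.get(w, 0.0) >= threshold}


documents = [["snow", "tonight", "snow"], ["lunch", "tonight"], ["tonight", "meeting"]]
weights = tf_idf_weights(documents)
kept = filter_by_weight({"snow", "tonight"}, weights, threshold=0.5)
```

A word that appears in every message, such as "tonight" here, receives a weight of zero and is removed regardless of how often it occurs.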

S6: obtaining the hidden keywords of each category from the phrase library processed in step S5.

Because case-related keywords change with the characteristics and means of illegal behavior and with the counter-reconnaissance awareness of the persons involved, the keywords are strongly time-sensitive. Cross-comparing the case-related hidden keywords by means of keyword set operations over a dynamic time window eliminates the interference of the time dimension to the greatest extent.

In this embodiment, the phrase library processed in step S5 is taken as the phrase library of the current time period and is intersected with the keyword library of the historical time period, and the words in the difference between the phrase library of the current time period and the intersection result are taken as the hidden keywords of the category corresponding to the phrase library. These are the hidden keywords that newly appear in the current time period. In the next time period, the hidden keywords of the current time period can be added to the keyword library of the historical time period before the hidden keywords of the next time period are extracted.
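The set operation described above can be sketched directly with Python sets. The function names and sample words are hypothetical; the logic follows the text: intersect the current phrase library with the historical keyword library, subtract the intersection from the current library, then fold the new keywords into the history for the next period.

```python
def extract_new_hidden_keywords(current_words, history_keywords):
    """Words of the current period that are not already in the historical library."""
    overlap = current_words & history_keywords  # intersection with history
    return current_words - overlap              # difference: newly appearing words


def advance_window(history_keywords, new_keywords):
    """Historical library to use for the next time period."""
    return history_keywords | new_keywords


history = {"ice", "goods"}
period_1 = {"ice", "snow", "candy"}
new_1 = extract_new_hidden_keywords(period_1, history)

history = advance_window(history, new_1)
period_2 = {"snow", "leaf"}
new_2 = extract_new_hidden_keywords(period_2, history)
```

Subtracting the intersection gives the same result as subtracting the historical library directly; the two-step form is kept here only to mirror the wording of the embodiment.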

In the embodiment of the invention, the communication content of the forensic data is finely partitioned by dimensions such as case type, case area, household registration location, ethnic group, application type and acquisition time, and the grouped communication content is automatically segmented into words. Combining several screening strategies with the Gaussian distribution greatly reduces interfering data, and the weight calculation is realized with the corresponding algorithm. Massive forensic data can thus be processed in real time and case-related hidden keywords discovered automatically, with no manual intervention required in the whole processing chain.

The embodiment of the invention can support all case types and communication application types, can adapt to dialects, special codes and the like, is not restricted by any conditions, and adapts well to the characteristics and patterns of crime in different areas. Newly discovered hidden keywords can further be incorporated into the case-related keyword library for continuous iterative updating and refinement, and used for subsequent study, judgment, analysis and clue mining, improving overall working efficiency.

The second embodiment is as follows:

The invention further provides a hidden keyword extraction terminal device, which comprises a memory, a processor and a computer program that is stored in the memory and can run on the processor, wherein the processor, when executing the computer program, implements the steps of the method of the first embodiment of the invention.

Further, as an executable scheme, the hidden keyword extraction terminal device may be a desktop computer, a notebook, a palm computer, a cloud server, or other computing devices. The hidden keyword extracting terminal device can include, but is not limited to, a processor and a memory. Those skilled in the art will understand that the above-mentioned component structure of the hidden keyword extraction terminal device is only an example of the hidden keyword extraction terminal device, and does not constitute a limitation on the hidden keyword extraction terminal device, and may include more or less components than the above, or combine some components, or different components, for example, the hidden keyword extraction terminal device may further include an input/output device, a network access device, a bus, and the like, which is not limited in this embodiment of the present invention.

Further, as an executable solution, the Processor may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, a discrete Gate or transistor logic device, a discrete hardware component, and the like. The general processor can be a microprocessor or the processor can be any conventional processor, and the processor is a control center of the hidden keyword extraction terminal device and is connected with each part of the whole hidden keyword extraction terminal device by various interfaces and lines.

The memory can be used for storing the computer program and/or the module, and the processor can realize various functions of the hidden keyword extraction terminal device by running or executing the computer program and/or the module stored in the memory and calling the data stored in the memory. The memory can mainly comprise a program storage area and a data storage area, wherein the program storage area can store an operating system and an application program required by at least one function; the data storage area may store data created according to the use of the mobile phone, and the like. In addition, the memory may include high speed random access memory, and may also include non-volatile memory, such as a hard disk, a memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), at least one magnetic disk storage device, a Flash memory device, or other volatile solid state storage device.

The invention also provides a computer-readable storage medium, in which a computer program is stored, which, when being executed by a processor, carries out the steps of the above-mentioned method of an embodiment of the invention.

The module/unit integrated with the hidden keyword extracting terminal device may be stored in a computer-readable storage medium if it is implemented in the form of a software functional unit and sold or used as an independent product. Based on such understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium, and when the computer program is executed by a processor, the steps of the method embodiments may be implemented. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, usb disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM), Random Access Memory (RAM), software distribution medium, and the like.

While the invention has been particularly shown and described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

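1. A hidden keyword extraction method, characterized by comprising the following steps:

S1: setting classification dimensions and the categories contained in each classification dimension, extracting text from the forensic data according to each category in each classification dimension, and constructing a text library for each category;

S2: performing word segmentation on the texts in the text library of each category, and building the segmentation results into a phrase library for each category;

S3: filtering the words in each phrase library;

S4: based on the phrase library processed in step S3, for each word in the phrase library, calculating the similarity between the word and the known keywords in the category corresponding to the phrase library, and removing from the phrase library the words whose similarity is smaller than a similarity threshold;

S5: based on the phrase library processed in step S4, for each word in the phrase library, calculating the weight of the word in the forensic data by the TF-IDF algorithm, and removing from the phrase library the words whose weight does not conform to the weight range;

S6: obtaining the hidden keywords of each category from the phrase library processed in step S5.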

2. The hidden keyword extraction method according to claim 1, characterized in that: the classification dimensions comprise case type, area where the case occurred, household registration location of the persons involved, ethnic group of the persons involved, application type and acquisition time.

3. The hidden keyword extraction method according to claim 1, characterized in that: the forensic data is communication data, and the source of the communication data is one or more of short messages, instant chat content, e-mail, microblogs, post bars and data method word banks.

4. The hidden keyword extraction method according to claim 1, characterized in that: in step S1, before the text library is constructed, the method further includes preprocessing the text used to construct the text library, where the preprocessing includes: deduplication, removal of invalid data, and conversion of semi-structured or unstructured data into structured data in text format.

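5. The hidden keyword extraction method according to claim 1, characterized in that: the filtering in step S3 includes the following steps:

S31: performing stop word filtering on the words in the phrase library, and removing the stop words from the phrase library;

S32: based on the phrase library processed in step S31, performing part-of-speech filtering on the words in the phrase library according to the part of speech of each word, and removing from the phrase library the words with unneeded parts of speech;

S33: based on the phrase library processed in step S32, removing from the phrase library the words that exist in a pre-constructed whitelist lexicon that stores common keywords;

S34: based on the phrase library processed in step S33, fitting a Gaussian distribution to all words in the phrase library according to the word frequency of each word, extracting from the fitted distribution the words that fall within the distribution interval corresponding to hidden keywords, and removing the other words from the phrase library.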

6. The hidden keyword extraction method according to claim 1, characterized in that: the similarity calculation method in step S4 includes: after the word vector of each word is calculated through the word2vec algorithm, the distance between the word vectors of the two words is used as the similarity between the two words.

7. The hidden keyword extraction method according to claim 1, characterized in that: step S6 specifically includes: taking the phrase library processed in step S5 as the phrase library of the current time period, intersecting it with the keyword library of the historical time period, and taking the words in the difference between the phrase library of the current time period and the intersection result as the hidden keywords of the category corresponding to the phrase library.

8. A hidden keyword extraction terminal device is characterized in that: comprising a processor, a memory and a computer program stored in said memory and running on said processor, said processor implementing the steps of the method according to any one of claims 1 to 7 when executing said computer program.

9. A computer-readable storage medium storing a computer program, characterized in that: the computer program when executed by a processor implementing the steps of the method as claimed in any one of claims 1 to 7.