CN114328830A - Hierarchical retrieval and multi-dimensional classification method for information clue discovery - Google Patents


Info

Publication number
CN114328830A
Authority
CN
China
Prior art keywords
classification
document
sentence
intelligence
clue
Prior art date
Legal status
Pending
Application number
CN202111601210.2A
Other languages
Chinese (zh)
Inventor
胡明昊
罗准辰
罗威
谭玉珊
叶宇铭
宋宇
周纤
毛彬
田昌海
Current Assignee
Military Science Information Research Center Of Military Academy Of Chinese Pla
Original Assignee
Military Science Information Research Center Of Military Academy Of Chinese Pla
Priority date: 2021-12-24
Filing date: 2021-12-24
Publication date: 2022-04-12
Application filed by Military Science Information Research Center Of Military Academy Of Chinese Pla
Priority to CN202111601210.2A
Publication of CN114328830A


Abstract

The invention discloses a hierarchical retrieval and multi-dimensional classification method for intelligence clue discovery, comprising the following steps: acquiring relevant internet open-source data according to an intelligence research topic and constructing a full-text search index library for the topic; performing hierarchical retrieval over the index library according to a search keyword list to obtain a multi-granularity intelligence clue screening set; and performing multi-dimensional automatic classification of the clues at different granularities according to the screening set to obtain a multi-dimensional intelligence clue compilation set. The method significantly improves the efficiency of intelligence clue screening while maintaining high recall and precision, and effectively improves the efficiency and accuracy of intelligence clue classification and compilation.

Description

Hierarchical retrieval and multi-dimensional classification method for information clue discovery
Technical Field
The invention relates to the technical fields of information retrieval and natural language processing, and in particular to a hierarchical retrieval and multi-dimensional classification method for intelligence clue discovery.
Background
In the intelligence research business model of the national defense science and technology information field, the traditional way to carry out an open-source intelligence research topic relies on the domain knowledge of field experts: intelligence clues related to the topic are first found in internet open-source information sources such as news feeds, reports, documents, and social media; the collected clues are then screened, classified, compiled, analyzed, and assessed; and intelligence products such as research reports, message interpretations, and news briefings are finally produced. However, the huge volume of information resources emerging in the big-data era poses serious challenges to this research model, mainly in two respects: 1) manual screening of intelligence clues is inefficient and prone to screening errors and information omissions; 2) manual classification and compilation of clues involves a large amount of repetitive work, consumes considerable labor and time, and cannot meet the business requirement of rapid response. A new intelligence research paradigm that applies big-data and artificial-intelligence techniques to intelligence clue discovery is therefore needed.
For the clue-screening challenge, the current common practice is to build an index over the collected data with a search engine such as Elasticsearch and perform full-text retrieval to screen and filter clues. However, such retrieval generally returns only chapter-level results, such as a particular news item or report, and experts must still read those results to extract sentence-level or even phrase-level intelligence clues, so the process remains inefficient and prone to screening omissions.
For the clue-compilation challenge, the traditional approach applies text classification or clustering models to classify and compile the intelligence clues. In methods represented by text classification, however, the classification label system is generally constructed in advance, so the model can classify the input text along only a single dimension; a civil news classification task, for example, contains only a few fixed categories such as sports, finance, and society. The professional nature of the intelligence research field requires classification models that can compile clues of different granularities along multiple dimensions: a news report may belong to the 'artificial intelligence' topic, one sentence in it may belong to the 'research and development' category, another sentence may belong to the 'viewpoint' category, and that viewpoint sentence may be further subdivided into 'expert opinion' or 'official speech'. Existing text classification methods therefore cannot meet the business requirement of intelligence clue classification and compilation.
Disclosure of Invention
Aiming at the problems in the existing intelligence research process, such as the low efficiency of clue screening and compilation and the incomplete, single-dimensional classification perspective, the invention seeks to overcome the defects of the prior art and provides a hierarchical retrieval and multi-dimensional classification method for intelligence clue discovery.
In order to achieve the above object, the present invention provides a hierarchical retrieval and multi-dimensional classification method for intelligence clue discovery, the method comprising:
step 1) acquiring related Internet open source data according to an intelligence research topic, and constructing a full-text search index library of the research topic;
step 2) according to a search keyword list, performing hierarchical retrieval based on the full-text search index library to obtain a multi-granularity intelligence clue screening set;
step 3) according to the multi-granularity intelligence clue screening set, performing multi-dimensional automatic classification of the clues at different granularities to obtain a multi-dimensional intelligence clue compilation set.
As an improvement of the above method, the step 1) specifically includes:
step 1-1) collecting relevant data from internet public information sources by data mining according to the intelligence research topic to obtain the topic-related document set D = {d_1, ..., d_i, ..., d_n}, i ∈ [1, n], where n is the total number of documents; creating an index of the document set based on the full-text search engine Elasticsearch and formulating the index fields;
step 1-2) traversing the document set D, performing data mapping for each document d_i, and importing the mapped fields into the index to obtain the full-text search index library G for the intelligence research topic.
As an improvement of the above method, the index fields of step 1-1) include: document index name, document index type, document index ID number, document publication year, brief document description, language used by the document, data source of the document, document title, document content, document crawl time, and document hyperlink.
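For illustration only, the following is a minimal sketch of how an index with these fields might be created using the Python Elasticsearch client (elasticsearch-py 8.x keyword-argument style); the index name and field types are assumptions, since the patent names only the fields themselves. The index, type, and id fields correspond to Elasticsearch's built-in _index, _type, and _id metadata and need not be declared in the mapping.

```python
from elasticsearch import Elasticsearch

# Assumed endpoint and index name; not specified in the patent.
es = Elasticsearch("http://localhost:9200")

mappings = {
    "properties": {
        "year":        {"type": "integer"},  # document publication year
        "description": {"type": "text"},     # brief description of the document
        "language":    {"type": "keyword"},  # language used by the document
        "source":      {"type": "keyword"},  # data source of the document
        "title":       {"type": "text"},     # document title
        "content":     {"type": "text"},     # document content
        "crawled":     {"type": "date"},     # document crawl time
        "url":         {"type": "keyword"},  # hyperlink to the document
    }
}

# Create the full-text search index library G of step 1-1).
es.indices.create(index="intel_topic_index", mappings=mappings)
```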
As an improvement of the above method, the step 2) specifically includes:
step 2-1) according to the search keyword list W = {w_1, ..., w_m}, where m is the number of keywords, and the retrieval time range [T_s, T_e], generating an Elasticsearch query statement q = F_query(W, T_s, T_e), where F_query() is the query statement generation function;
step 2-2) performing full-text retrieval on the index library G according to the query statement q, D* = F_retri(G, q), to obtain the chapter-level intelligence clue screening set D* = {d*_1, ..., d*_i, ..., d*_{n*}}, where n* is the number of documents in the set, n* ≤ n, and F_retri() is the full-text retrieval function (sketches of F_query, F_retri, and the sentence-level filter F_filter follow step 2-4);
step 2-3) traversing the set D* and processing each document as follows: for a document d*_i, performing paragraph and sentence segmentation to obtain its sentence set S_i = {s_1, ..., s_k}, where k is the number of sentences of d*_i; then filtering the sentence set S_i according to the search keyword list W by combining heuristic rules with hard matching, S*_i = F_filter(S_i, W), to obtain the sentence-level intelligence clue set S*_i = {s*_1, ..., s*_{k*}} of document d*_i, where k* is the number of sentences after screening, k* ≤ k, and F_filter() is the sentence-level filtering function;
step 2-4) aggregating the intelligence clue set of every document in D* to obtain the sentence-level intelligence clue screening set S* = {S*_1, ..., S*_{n*}}.
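For concreteness, a minimal sketch of the query generation function F_query(W, T_s, T_e) and the full-text retrieval function F_retri(G, q) might look as follows; the field names, the any-keyword-matches semantics, and the hit limit are assumptions rather than details disclosed by the patent.

```python
from elasticsearch import Elasticsearch

def f_query(keywords, t_s, t_e):
    # Sketch of F_query(W, Ts, Te): match any keyword in title or content,
    # restricted to the retrieval time range. Field names are assumptions.
    return {
        "bool": {
            "should": [
                {"multi_match": {"query": w, "fields": ["title", "content"]}}
                for w in keywords
            ],
            "minimum_should_match": 1,
            "filter": [{"range": {"crawled": {"gte": t_s, "lte": t_e}}}],
        }
    }

def f_retri(es, index, q, max_hits=1000):
    # Sketch of F_retri(G, q): full-text retrieval returning documents d*_i.
    resp = es.search(index=index, query=q, size=max_hits)
    return [hit["_source"] for hit in resp["hits"]["hits"]]

es = Elasticsearch("http://localhost:9200")
q = f_query(["artificial intelligence", "unmanned system"],
            "2021-01-01", "2021-12-24")
chapter_level_set = f_retri(es, "intel_topic_index", q)  # the set D*
```

Continuing from the sketch above, the sentence-level filter F_filter(S_i, W) of step 2-3) could be sketched as below; the patent does not disclose its heuristic rules, so the minimum-length rule is an assumed stand-in, while the keyword test implements the hard matching.

```python
import re

def split_sentences(text):
    # Naive segmentation on Chinese/Western terminal punctuation; a
    # production system would use a language-aware splitter.
    return [s.strip() for s in re.split(r"(?<=[。！？.!?])\s*", text) if s.strip()]

def f_filter(sentences, keywords, min_len=10):
    # Sketch of F_filter(S_i, W): hard keyword matching plus one assumed
    # heuristic rule (drop very short fragments).
    return [
        s for s in sentences
        if len(s) >= min_len and any(w.lower() in s.lower() for w in keywords)
    ]

for doc in chapter_level_set:
    sentence_level_clues = f_filter(split_sentences(doc["content"]),
                                    ["artificial intelligence"])  # the set S*_i
```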
As an improvement of the above method, the step 3) specifically includes:
step 3-1) traversing the chapter-level intelligence clue screening set, inputting each document's title and body into a pre-established and trained topic classification model for classification, to obtain a chapter-level intelligence clue compilation set;
step 3-2) traversing the sentence-level intelligence clue screening set, inputting the clue sentences into a pre-established and trained segment classification model for classification, to obtain a sentence-level intelligence clue compilation set;
step 3-3) inputting the 'viewpoint' sentences in the sentence-level intelligence clue compilation set into a pre-established and trained viewpoint classification model for classification, to obtain a viewpoint intelligence clue compilation subset.
As an improvement of the above method, the step 3-1) specifically includes:
step 3-1-1) traversing the set D* and processing each document as follows:
for a document d*_i = (t_i, c_i), where t_i is the document title and c_i is the document body, splicing t_i and c_i with the special symbols [CLS] and [SEP] as separators to obtain the input sequence, and obtaining from it the word embedding representation H_0, which is the sum of the character embeddings, position embeddings, and segment embeddings;
encoding H_0 with a pre-trained language model comprising L pre-trained Transformer blocks applied in sequence:
H_l = TransformerBlock(H_{l-1}), l ∈ [1, L]
where H_l and H_{l-1} are the hidden-state representations output by the l-th and (l-1)-th Transformer blocks respectively, and TransformerBlock() denotes the Transformer function;
taking the [CLS] vector H_L[0] of the hidden-state representation H_L output by the L-th Transformer block and feeding it into a multi-layer perceptron layer to obtain the topic classification probability distribution Y_topic:
Y_topic = softmax(W_topic · H_L[0])
where softmax() denotes the normalized exponential function and W_topic is the parameter matrix of the topic classification model;
decoding Y_topic to obtain the topic classification result topic_i of the input document d*_i;
step 3-1-2) aggregating the topic classification results of every document in D* to obtain the chapter-level intelligence clue compilation set.
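As a concrete illustration of step 3-1), the following sketch implements the [CLS]-vector-plus-MLP topic classifier on top of a pre-trained BERT encoder with the Hugging Face transformers library. The 23-way output follows the embodiment below; the checkpoint name, example strings, and single-example batching are assumptions.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class TopicClassifier(nn.Module):
    # [CLS] vector of the last Transformer block -> MLP -> softmax, as in
    # step 3-1-1). Checkpoint is an assumption; the patent only says a
    # pre-trained language model such as BERT.
    def __init__(self, num_topics=23, checkpoint="bert-base-chinese"):
        super().__init__()
        self.encoder = BertModel.from_pretrained(checkpoint)  # L Transformer blocks
        self.mlp = nn.Linear(self.encoder.config.hidden_size, num_topics)  # W_topic

    def forward(self, **enc):
        h_cls = self.encoder(**enc).last_hidden_state[:, 0]  # H_L[0]
        return torch.softmax(self.mlp(h_cls), dim=-1)        # Y_topic

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = TopicClassifier()

t_i, c_i = "Hypothetical title", "Hypothetical body text of the document."
# Pair encoding yields [CLS] t [SEP] c [SEP] with segment ids handled
# automatically, reproducing the splicing described above.
enc = tokenizer(t_i, c_i, truncation=True, max_length=512, return_tensors="pt")
y_topic = model(**enc)                   # probability distribution, dimension 23
topic_i = y_topic.argmax(dim=-1).item()  # decoded topic classification result
```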
As an improvement of the above method, the step 3-2) specifically includes:
step 3-2-1) traversing the sentence-level intelligence clue screening set S*; for the sentence-level intelligence clue set S*_i of each document d*_i, processing every intelligence clue sentence s*_j in S*_i as follows:
splicing the sentence with the special symbols [CLS] and [SEP] as separators to obtain the input sequence, and obtaining from it the word embedding representation H^s_0, which is the sum of the character embeddings, position embeddings, and segment embeddings;
encoding H^s_0 with a pre-trained language model comprising L pre-trained Transformer blocks applied in sequence:
H^s_l = TransformerBlock(H^s_{l-1}), l ∈ [1, L]
taking the [CLS] vector H^s_L[0] of the hidden-state representation H^s_L output by the L-th Transformer block and feeding it into a multi-layer perceptron layer to obtain the segment classification probability distribution Y_segment:
Y_segment = softmax(W_segment · H^s_L[0])
where softmax() denotes the normalized exponential function and W_segment is the parameter matrix of the segment classification model;
decoding Y_segment to obtain the segment classification result segment_j of the input sentence s*_j;
step 3-2-2) aggregating the segment classification results of the sentences of every document in S* to obtain the sentence-level intelligence clue compilation set.
As an improvement of the above method, the step 3-3) specifically includes:
step 3-3-1) traversing the 'viewpoint' sentences in the sentence-level intelligence clue compilation set and processing each as follows:
splicing the 'viewpoint' sentence with the special symbols [CLS] and [SEP] as separators to obtain the input sequence, and obtaining from it the word embedding representation H^o_0;
encoding H^o_0 with a pre-trained language model comprising L pre-trained Transformer blocks applied in sequence:
H^o_l = TransformerBlock(H^o_{l-1}), l ∈ [1, L]
taking the [CLS] vector H^o_L[0] of the hidden-state representation H^o_L output by the L-th Transformer block and feeding it into a multi-layer perceptron layer to obtain the viewpoint classification probability distribution Y_opinion:
Y_opinion = softmax(W_opinion · H^o_L[0])
where softmax() denotes the normalized exponential function and W_opinion is the parameter matrix of the viewpoint classification model;
decoding Y_opinion to obtain the fine-grained viewpoint classification result of the sentence;
step 3-3-2) aggregating the fine-grained viewpoint classification results of the 'viewpoint' sentences of every document to obtain the viewpoint intelligence clue compilation subset.
As an improvement of the above method, the viewpoint classification model takes the same form of input data and has the same network structure as the segment classification model, differing only in the prediction dimension.
As an improvement of the above method, the method further comprises steps of training the topic classification model, the segment classification model, and the viewpoint classification model, specifically:
selecting news data in the national defense science and technology intelligence field as the data source; after acquiring the in-domain texts, manually annotating each document according to the predefined topic classification categories; for a document with an annotated topic, annotating the sentences in the document according to the predefined segment classification categories; and annotating the 'viewpoint' sentences at fine granularity according to the predefined viewpoint classification categories;
for an annotated document, splicing the document title and body with the special symbols as separators to obtain the input sequence and its corresponding word embedding representation; encoding the word embedding representation with the pre-trained language model to obtain the hidden-state representation of the input sequence; outputting the predicted classification probability distribution from the [CLS] vector of that representation through a multi-layer perceptron layer; computing the cross-entropy loss against the true classification label; and training the topic classification model on this loss to obtain a topic classification model meeting the training requirement;
for an annotated sentence, splicing the sentence with the special symbols as separators to obtain the input sequence and its corresponding word embedding representation; encoding the word embedding representation with the pre-trained language model to obtain the hidden-state representation of the input sequence; outputting the predicted classification probability distribution from the [CLS] vector of that representation through a multi-layer perceptron layer; computing the cross-entropy loss against the true classification label; and training the segment classification model on this loss to obtain a segment classification model meeting the training requirement;
for an annotated 'viewpoint' sentence, splicing the sentence with the special symbols as separators to obtain the input sequence and its corresponding word embedding representation; encoding the word embedding representation with the pre-trained language model to obtain the hidden-state representation of the input sequence; outputting the predicted classification probability distribution from the [CLS] vector of that representation through a multi-layer perceptron layer; computing the cross-entropy loss against the true classification label; and training the viewpoint classification model on this loss to obtain a viewpoint classification model meeting the training requirement.
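A minimal fine-tuning sketch matching the training procedure above (prediction from the [CLS] vector, cross-entropy against the true label) might look as follows; it reuses the TopicClassifier sketch given earlier, and the optimizer choice, learning rate, and single-example updates are assumptions. The same loop trains the segment and viewpoint models with output dimensions 19 and 5.

```python
import torch
import torch.nn as nn
from torch.optim import AdamW

# `model` and `tokenizer` are the TopicClassifier sketch from above.
criterion = nn.CrossEntropyLoss()               # cross-entropy against true label
optimizer = AdamW(model.parameters(), lr=2e-5)  # lr is an assumed value

def train_step(title, body, label_id):
    enc = tokenizer(title, body, truncation=True, max_length=512,
                    return_tensors="pt")
    # CrossEntropyLoss consumes pre-softmax scores, so take the MLP output
    # directly rather than the softmax distribution.
    logits = model.mlp(model.encoder(**enc).last_hidden_state[:, 0])
    loss = criterion(logits, torch.tensor([label_id]))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```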
Compared with the prior art, the invention has the following advantages:
1. the invention provides a hierarchical retrieval method that, by combining technical means such as full-text search and keyword matching in a human-machine-cooperative manner, realizes a simple and efficient coarse-to-fine retrieval procedure for recalling a chapter-level/sentence-level multi-granularity intelligence clue screening set;
2. the invention designs a multi-dimensional automatic classification method that, addressing the incomplete single-dimensional classification perspective of traditional classification models, combines a topic classification model, a segment classification model, and a viewpoint classification model to automatically classify multi-granularity intelligence clues along multiple dimensions, effectively improving the efficiency and accuracy of intelligence clue classification and compilation.
Drawings
FIG. 1 is a flow chart of the hierarchical retrieval and multi-dimensional classification method for intelligence clue screening and compilation according to the present invention;
FIG. 2 is a schematic diagram of hierarchical information retrieval;
FIG. 3 is a schematic diagram of a topic classification model structure;
FIG. 4 is a schematic diagram of a segment classification model and a point of view classification model structure.
Detailed Description
The invention provides a hierarchical retrieval and multi-dimensional classification method for intelligence clue discovery, comprising a hierarchical retrieval module and a multi-dimensional classification module, and the method comprises the following steps:
step 1) given the internet open-source data collected for an intelligence research topic, constructing a full-text search index library of the research topic;
step 2) given a search keyword list, performing hierarchical retrieval over the index library, comprising chapter-level coarse-grained retrieval and sentence-level fine-grained retrieval, to obtain a multi-granularity intelligence clue screening set;
step 3) given the retrieved multi-granularity intelligence clue screening set, performing multi-dimensional automatic classification of the clues at different granularities along three dimensions (topic classification, segment classification, and viewpoint classification) to obtain a multi-dimensional intelligence clue compilation set.
In the above technical solution, the step 1) specifically includes:
step 1-1) given an intelligence research topic and the collected related internet open-source data, creating a document index based on the full-text search engine Elasticsearch and formulating the data mapping;
step 1-2) importing the acquired data into the index according to the data mapping format to obtain a full-text search index library for the intelligence topic.
In the above technical solution, the step 2) specifically includes:
step 2-1) given a search keyword list and a retrieval time range, generating a search query statement;
step 2-2) performing chapter-level coarse-grained retrieval: full-text retrieval on the index library with the query statement, to obtain a chapter-level intelligence clue screening set;
step 2-3) performing sentence-level fine-grained retrieval: segmenting the chapter-level intelligence clues into paragraphs and sentences, and filtering the sentence set based on the search keyword list by combining heuristic rules with hard matching, to obtain a sentence-level intelligence clue screening set.
In the above technical solution, the step 3) specifically includes:
step 3-1) traversing the chapter-level intelligence clue screening set, inputting each document's title and body into the topic classification model for classification, to obtain a chapter-level intelligence clue compilation set;
step 3-2) traversing the sentence-level intelligence clue screening set, inputting the clue sentences into the segment classification model for classification, to obtain a sentence-level intelligence clue compilation set;
step 3-3) inputting the 'viewpoint' sentences in the sentence-level intelligence clue compilation set into the viewpoint classification model for classification, to obtain a viewpoint intelligence clue compilation subset.
For the multi-dimensional automatic classification required by the method, the main steps of training the topic classification, segment classification, and viewpoint classification models are as follows (a sketch of a hypothetical annotation record follows this list):
step S1) collecting news data in the national defense science and technology intelligence field and annotating it according to the topic classification, segment classification, and viewpoint classification label systems, to obtain an annotated data set;
step S2) training a topic classification model on the annotated data, to output the topic category of chapter-level intelligence clues;
step S3) training a segment classification model on the annotated data, to output the segment category of sentence-level intelligence clues;
step S4) training a viewpoint classification model on the annotated data, to output the fine-grained viewpoint category of 'viewpoint'-class sentence-level intelligence clues.
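For illustration, one annotated record produced by step S1) might be stored in the following hypothetical shape; the field names are assumptions, while the category values follow the label systems of Tables 2-4 below.

```python
# Hypothetical annotation record (field names assumed; categories per Tables 2-4).
annotated_doc = {
    "title": "...",
    "content": "...",
    "topic": "Artificial intelligence",            # topic label (step S1)
    "sentences": [
        {"text": "...", "segment": "Research and development"},
        {"text": "...", "segment": "Viewpoint",
         "opinion": "Expert opinion"},             # fine-grained viewpoint label
    ],
}
```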
The technical solution of the present invention will be described in detail below with reference to the accompanying drawings and examples.
Example 1
As shown in fig. 1, the present invention provides a hierarchical retrieval and multidimensional classification method for intelligence clue discovery, which mainly comprises the following steps:
Step 1) given the internet open-source data collected for the intelligence research topic, a full-text search index library of the topic is constructed, specifically: given an intelligence research topic, relevant data is first collected from internet public information sources by data mining, obtaining the topic-associated document set D = {d_1, ..., d_n}, where n is the total number of documents; an index of the document set is then created based on the full-text search engine Elasticsearch and the index fields are formulated, a typical document index being shown in Table 1. The document set D is then traversed, data mapping is performed for each document d_i (for example, the document title is mapped to the title field), and the mapped fields are imported into the index to obtain the full-text search index library G for the intelligence topic.
TABLE 1 Document index example

Index field   Field interpretation
index         Document index name
type          Document index type
id            Document index ID number
year          Document publication year
description   Brief description of the document
language      Language used by the document
source        Data source of the document
title         Document title
content       Document content
crawled       Document crawl time
url           Hyperlink to the document
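The data mapping of step 1) might then amount to projecting each crawled item onto the Table 1 fields and importing it, as in the following sketch; the raw-document field names and the helper itself are hypothetical.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def map_and_index(raw_doc, doc_id):
    # Map a crawled document onto the Table 1 fields and import it into the
    # index library G. The raw field names are hypothetical; only the target
    # fields come from Table 1.
    mapped = {
        "year": raw_doc.get("published_year"),
        "description": raw_doc.get("summary", ""),
        "language": raw_doc.get("lang", "zh"),
        "source": raw_doc.get("site"),
        "title": raw_doc.get("title"),      # e.g. title -> the title field
        "content": raw_doc.get("body"),
        "crawled": raw_doc.get("fetch_time"),
        "url": raw_doc.get("link"),
    }
    es.index(index="intel_topic_index", id=doc_id, document=mapped)
```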
Step 2) given a search keyword list, hierarchical retrieval is performed over the index library, comprising chapter-level coarse-grained retrieval and sentence-level fine-grained retrieval, to obtain a multi-granularity intelligence clue screening set; the flow is shown in FIG. 2. Specifically:

Chapter-level coarse-grained retrieval is performed first. Given the search keyword list W = {w_1, ..., w_m}, where m is the number of keywords, and the retrieval time range [T_s, T_e], an Elasticsearch query statement q = F_query(W, T_s, T_e) is generated, where F_query is the query statement generation function. Full-text retrieval is then performed on the index library G with the query statement q, D* = F_retri(G, q), returning the chapter-level intelligence clue screening set D* = {d*_1, ..., d*_{n*}}, where n* ≤ n is the number of documents in the set and F_retri is the full-text retrieval function.

After the chapter-level coarse-grained retrieval, sentence-level fine-grained retrieval is performed on the screening set D*. Specifically, for a document d*_i in D*, paragraph and sentence segmentation is first performed to obtain the document's sentence set S_i = {s_1, ..., s_k}, where k is the number of sentences. Then, given the search keyword list W, the sentence set S_i is filtered by combining heuristic rules with hard matching, S*_i = F_filter(S_i, W), yielding the sentence-level intelligence clue set S*_i = {s*_1, ..., s*_{k*}} of document d*_i, where k* ≤ k is the number of sentences after screening and F_filter is the sentence-level filtering function. This screening operation is applied to every document in D*, and the results are finally aggregated into the sentence-level intelligence clue screening set S* = {S*_1, ..., S*_{n*}}.
Step 3) given the retrieved multi-granularity intelligence clue screening sets D* and S*, the clues at different granularities are classified along three dimensions (topic classification, segment classification, and viewpoint classification) to obtain a multi-dimensional intelligence clue compilation set, specifically:
first, the chapter-level intelligence clues are subject-classified. The topic classification comprises 23 categories including 22 topics such as artificial intelligence, 5G technology and the like, and 1 'none' category, which indicates that the document does not belong to any topic above, and some disclosed topic classification categories are shown in table 2.
TABLE 2 Partially disclosed topic classification categories

Artificial intelligence    Unmanned systems          5G technology
Quantum technology         Electronic information    Biotechnology
Advanced materials and manufacturing                 Basic science
The specific steps of topic classification are as follows. First, the chapter-level intelligence clue screening set D* is traversed. For a document d*_i = (t_i, c_i) in D*, where t_i is the document title and c_i is the document body, t_i and c_i are spliced into the input sequence <[CLS], Tok_1, ..., Tok_|t|, [SEP], Tok_1, ..., Tok_|c|, [SEP]>, where the title contains |t| characters Tok_1, ..., Tok_|t|, the body contains |c| characters Tok_1, ..., Tok_|c|, and [CLS] and [SEP] are special separators. The sequence is then input into the topic classification model for classification; the model structure is shown in FIG. 3. Specifically, the word embedding representation H_0 of the sequence is first obtained as the sum of the character embeddings, position embeddings, and segment embeddings; its dimension is (|t| + |c| + 2) × |h|, where |h| is the hidden-state dimension. The word embedding representation is then encoded with a pre-trained language model such as BERT, which applies L pre-trained Transformer blocks to H_0 in sequence:

H_l = TransformerBlock(H_{l-1}), l ∈ [1, L]

where H_l is the hidden-state representation output by the l-th Transformer block and TransformerBlock() denotes the Transformer function. The [CLS] vector H_L[0] of the hidden-state representation H_L output by the L-th Transformer block is fed into a multi-layer perceptron layer to obtain the topic classification probability distribution Y_topic:

Y_topic = softmax(W_topic · H_L[0])

where W_topic is the parameter matrix of the topic classification model with dimension 23 × |h|, and Y_topic is a probability distribution of dimension 23; decoding it yields the topic classification result topic_i of the input document d*_i.
Second, segment classification is performed on the sentence-level intelligence clues. The segment classification comprises 19 categories: 18 categories such as research and development, deployment, and viewpoint, plus one 'none' category indicating that the sentence belongs to none of the listed categories. The partially disclosed segment classification categories are shown in Table 3.
TABLE 3 Partially disclosed segment classification categories

Research and development    Production    Deployment
Test                        Delivery      Disclosure
Viewpoint                   Influence
The specific steps of segment classification are as follows. First, the sentence-level intelligence clue screening set S* is traversed. For the sentence-level intelligence clue set S*_i of each document d*_i, every intelligence clue sentence s*_j in S*_i is spliced with the special symbols [CLS] and [SEP] as separators into the input sequence <[CLS], Tok_1, ..., Tok_|s|, [SEP]>, where |s| is the number of characters of the sentence. The sequence is then input into the segment classification model for classification; the model structure is shown in FIG. 4. Specifically, the word embedding representation H^s_0 of the sequence is first obtained as the sum of the character embeddings, position embeddings, and segment embeddings; its dimension is (|s| + 2) × |h|. The word embedding representation is then encoded with a pre-trained language model such as BERT, which applies L pre-trained Transformer blocks to H^s_0 in sequence:

H^s_l = TransformerBlock(H^s_{l-1}), l ∈ [1, L]

The [CLS] vector H^s_L[0] of the hidden-state representation H^s_L output by the L-th Transformer block is fed into a multi-layer perceptron layer to obtain the segment classification probability distribution Y_segment:

Y_segment = softmax(W_segment · H^s_L[0])

where W_segment is the parameter matrix of the segment classification model with dimension 19 × |h|, and Y_segment is a probability distribution of dimension 19; decoding it yields the segment classification result segment_j of the input sentence s*_j.
Finally, viewpoint classification is performed on the sentence-level intelligence clues whose segment classification category is 'viewpoint'. The viewpoint classification comprises 5 categories in total, including government stance, expert opinion, and official speech, as shown in Table 4.
TABLE 4 Viewpoint classification categories

Independent viewpoint    Government stance    Expert opinion
Report viewpoint         Official speech
The specific steps of viewpoint classification are essentially the same as those of segment classification. Specifically, a 'viewpoint'-class sentence is spliced with the special symbols [CLS] and [SEP] as separators into the input sequence <[CLS], input sentence, [SEP]>, from which the word embedding representation H^o_0 is obtained. The word embedding representation is then encoded with the pre-trained language model:

H^o_l = TransformerBlock(H^o_{l-1}), l ∈ [1, L]

The [CLS] vector H^o_L[0] of the hidden-state representation H^o_L output by the L-th Transformer block is fed into a multi-layer perceptron layer to obtain the viewpoint classification probability distribution Y_opinion:

Y_opinion = softmax(W_opinion · H^o_L[0])

where W_opinion is the parameter matrix of the viewpoint classification model with dimension 5 × |h|, and Y_opinion is a probability distribution of dimension 5; decoding it yields the fine-grained viewpoint classification result of the input 'viewpoint' sentence.
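Since the segment and viewpoint models share their input format and network structure and differ only in prediction dimension, both can be instantiated from one parameterized module, as in the following sketch using the Hugging Face transformers library; the checkpoint name and example sentence are assumptions.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class ClueSentenceClassifier(nn.Module):
    # One structure for both models: per the description they differ only
    # in the dimension of the output probability distribution.
    def __init__(self, num_labels, checkpoint="bert-base-chinese"):
        super().__init__()
        self.encoder = BertModel.from_pretrained(checkpoint)
        self.mlp = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, **enc):
        h_cls = self.encoder(**enc).last_hidden_state[:, 0]  # [CLS] vector
        return torch.softmax(self.mlp(h_cls), dim=-1)

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
segment_model = ClueSentenceClassifier(num_labels=19)  # Y_segment
opinion_model = ClueSentenceClassifier(num_labels=5)   # Y_opinion

s = "An expert stated that the system is expected to enter service next year."
enc = tokenizer(s, truncation=True, max_length=512, return_tensors="pt")
segment_j = segment_model(**enc).argmax(dim=-1)  # segment classification result
# Only sentences whose segment class is "viewpoint" are passed on:
opinion_j = opinion_model(**enc).argmax(dim=-1)  # fine-grained viewpoint result
```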
For the multi-dimensional automatic classification required by the method, the main steps of training the topic classification, segment classification, and viewpoint classification models are as follows:
Step S1) annotating the classification data, specifically: first, news data in the national defense science and technology intelligence field is chosen as the data source; after acquiring the in-domain texts, each document is manually annotated according to the predefined topic classification categories, an annotation example being shown in Table 5. Then, for a document with an annotated topic, the sentences in the document are annotated according to the predefined segment classification categories; in particular, 'viewpoint'-class sentences are annotated at fine granularity according to the predefined viewpoint classification categories, an annotation example being shown in Table 6.
Step S2) training the topic classification model on the annotated data, specifically: for an annotated document, the document title and body are spliced with the special symbols [CLS] and [SEP] as separators into the input sequence <[CLS], title, [SEP], body, [SEP]>, and the word embedding representation of the input sequence is obtained; the word embedding representation is then encoded with the pre-trained language model to obtain the hidden-state representation of the input sequence; based on the [CLS] vector of that representation, the multi-layer perceptron layer outputs the predicted classification probability distribution, the cross-entropy loss against the true classification label is computed, and the topic classification model is finally trained on this loss.
TABLE 5 Topic classification annotation example (table image; contents not recoverable)

TABLE 6 Segment classification and viewpoint classification annotation example (table image; contents not recoverable)
Step S3) training the segment classification model on the annotated data, specifically: for an annotated sentence, the sentence is spliced with the special symbols [CLS] and [SEP] as separators into the input sequence <[CLS], sentence, [SEP]>, and the word embedding representation of the input sequence is obtained; the word embedding representation is then encoded with the pre-trained language model to obtain the hidden-state representation of the input sequence; based on the [CLS] vector of that representation, the multi-layer perceptron layer outputs the predicted classification probability distribution, the cross-entropy loss against the true classification label is computed, and the segment classification model is finally trained on this loss.
Step S4) training the viewpoint classification model on the annotated data; since the viewpoint classification model takes the same input data and has the same network structure as the segment classification model, differing only in the prediction dimension, the training procedure is the same as step S3) and is not repeated here.
Finally, it should be noted that the above embodiments are intended only to illustrate, not to limit, the technical solution of the present invention. Although the present invention has been described in detail with reference to the embodiments, those skilled in the art will understand that various changes may be made and equivalents substituted without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (10)

1. A hierarchical retrieval and multi-dimensional classification method for intelligence clue discovery, the method comprising the following steps:
step 1) acquiring related Internet open source data according to an intelligence research topic, and constructing a full-text search index library of the research topic;
step 2) according to a search keyword list, performing hierarchical retrieval based on the full-text search index library to obtain a multi-granularity intelligence clue screening set;
step 3) according to the multi-granularity intelligence clue screening set, performing multi-dimensional automatic classification of the clues at different granularities to obtain a multi-dimensional intelligence clue compilation set.
2. The hierarchical retrieval and multi-dimensional classification method for intelligence clue discovery according to claim 1, wherein the step 1) specifically includes:
step 1-1) collecting relevant data from internet public information sources by data mining according to the intelligence research topic to obtain the topic-related document set D = {d_1, ..., d_i, ..., d_n}, i ∈ [1, n], where n is the total number of documents; creating an index of the document set based on the full-text search engine Elasticsearch and formulating the index fields;
step 1-2) traversing the document set D, performing data mapping for each document d_i, and importing the mapped fields into the index to obtain the full-text search index library G for the intelligence research topic.
3. The hierarchical retrieval and multi-dimensional classification method for intelligence clue discovery according to claim 2, wherein the index fields of step 1-1) include: document index name, document index type, document index ID number, document publication year, brief document description, language used by the document, data source of the document, document title, document content, document crawl time, and document hyperlink.
4. The hierarchical retrieval and multi-dimensional classification method for intelligence clue discovery according to claim 3, wherein the step 2) specifically includes:
step 2-1) according to the search keyword list W = {w_1, ..., w_m}, where m is the number of keywords, and the retrieval time range [T_s, T_e], generating an Elasticsearch query statement q = F_query(W, T_s, T_e), where F_query() is the query statement generation function;
step 2-2) performing full-text retrieval on the index library G according to the query statement q, D* = F_retri(G, q), to obtain the chapter-level intelligence clue screening set D* = {d*_1, ..., d*_i, ..., d*_{n*}}, where n* is the number of documents in the set, n* ≤ n, and F_retri() is the full-text retrieval function;
step 2-3) traversing the set D* and processing each document as follows: for a document d*_i, performing paragraph and sentence segmentation to obtain its sentence set S_i = {s_1, ..., s_k}, where k is the number of sentences of d*_i; then filtering the sentence set S_i according to the search keyword list W by combining heuristic rules with hard matching, S*_i = F_filter(S_i, W), to obtain the sentence-level intelligence clue set S*_i = {s*_1, ..., s*_{k*}} of document d*_i, where k* is the number of sentences after screening, k* ≤ k, and F_filter() is the sentence-level filtering function;
step 2-4) aggregating the intelligence clue set of every document in D* to obtain the sentence-level intelligence clue screening set S* = {S*_1, ..., S*_{n*}}.
5. The hierarchical retrieval and multi-dimensional classification method for intelligence clue discovery according to claim 4, wherein the step 3) comprises:
step 3-1) traversing the chapter-level intelligence clue screening set, inputting each document's title and body into a pre-established and trained topic classification model for classification, to obtain a chapter-level intelligence clue compilation set;
step 3-2) traversing the sentence-level intelligence clue screening set, inputting the clue sentences into a pre-established and trained segment classification model for classification, to obtain a sentence-level intelligence clue compilation set;
step 3-3) inputting the 'viewpoint' sentences in the sentence-level intelligence clue compilation set into a pre-established and trained viewpoint classification model for classification, to obtain a viewpoint intelligence clue compilation subset.
6. The hierarchical retrieval and multi-dimensional classification method for intelligence clue discovery according to claim 5, wherein the step 3-1) specifically includes:
step 3-1-1) traversing the set D* and processing each document as follows:
for a document d*_i = (t_i, c_i), where t_i is the document title and c_i is the document body, splicing t_i and c_i with the special symbols [CLS] and [SEP] as separators to obtain the input sequence, and obtaining from it the word embedding representation H_0, which is the sum of the character embeddings, position embeddings, and segment embeddings;
encoding H_0 with a pre-trained language model comprising L pre-trained Transformer blocks applied in sequence:
H_l = TransformerBlock(H_{l-1}), l ∈ [1, L]
where H_l and H_{l-1} are the hidden-state representations output by the l-th and (l-1)-th Transformer blocks respectively, and TransformerBlock() denotes the Transformer function;
taking the [CLS] vector H_L[0] of the hidden-state representation H_L output by the L-th Transformer block and feeding it into a multi-layer perceptron layer to obtain the topic classification probability distribution Y_topic:
Y_topic = softmax(W_topic · H_L[0])
where softmax() denotes the normalized exponential function and W_topic is the parameter matrix of the topic classification model;
decoding Y_topic to obtain the topic classification result topic_i of the input document d*_i;
step 3-1-2) aggregating the topic classification results of every document in D* to obtain the chapter-level intelligence clue compilation set.
7. The hierarchical retrieval and multi-dimensional classification method for intelligence clue discovery according to claim 6, wherein the step 3-2) specifically includes:
step 3-2-1) traversing the sentence-level intelligence clue screening set S*; for the sentence-level intelligence clue set S*_i of each document d*_i, processing every intelligence clue sentence s*_j in S*_i as follows:
splicing the sentence with the special symbols [CLS] and [SEP] as separators to obtain the input sequence, and obtaining from it the word embedding representation H^s_0, which is the sum of the character embeddings, position embeddings, and segment embeddings;
encoding H^s_0 with a pre-trained language model comprising L pre-trained Transformer blocks applied in sequence:
H^s_l = TransformerBlock(H^s_{l-1}), l ∈ [1, L]
taking the [CLS] vector H^s_L[0] of the hidden-state representation H^s_L output by the L-th Transformer block and feeding it into a multi-layer perceptron layer to obtain the segment classification probability distribution Y_segment:
Y_segment = softmax(W_segment · H^s_L[0])
where softmax() denotes the normalized exponential function and W_segment is the parameter matrix of the segment classification model;
decoding Y_segment to obtain the segment classification result segment_j of the input sentence s*_j;
step 3-2-2) aggregating the segment classification results of the sentences of every document in S* to obtain the sentence-level intelligence clue compilation set.
8. The hierarchical retrieval and multi-dimensional classification method for intelligence clue discovery according to claim 7, wherein the step 3-3) specifically includes:
step 3-3-1) traversing the 'viewpoint' sentences in the sentence-level intelligence clue compilation set and processing each as follows:
splicing the 'viewpoint' sentence with the special symbols [CLS] and [SEP] as separators to obtain the input sequence, and obtaining from it the word embedding representation H^o_0;
encoding H^o_0 with a pre-trained language model comprising L pre-trained Transformer blocks applied in sequence:
H^o_l = TransformerBlock(H^o_{l-1}), l ∈ [1, L]
taking the [CLS] vector H^o_L[0] of the hidden-state representation H^o_L output by the L-th Transformer block and feeding it into a multi-layer perceptron layer to obtain the viewpoint classification probability distribution Y_opinion:
Y_opinion = softmax(W_opinion · H^o_L[0])
where softmax() denotes the normalized exponential function and W_opinion is the parameter matrix of the viewpoint classification model;
decoding Y_opinion to obtain the fine-grained viewpoint classification result of the sentence;
step 3-3-2) aggregating the fine-grained viewpoint classification results of the 'viewpoint' sentences of every document to obtain the viewpoint intelligence clue compilation subset.
9. The hierarchical retrieval and multi-dimensional classification method for intelligence clue discovery according to claim 8, wherein the viewpoint classification model takes the same form of input data and has the same network structure as the segment classification model, differing only in the prediction dimension.
10. The hierarchical retrieval and multi-dimensional classification method for intelligence clue discovery according to claim 9, further comprising steps of training the topic classification model, the segment classification model, and the viewpoint classification model, specifically:
selecting news data in the national defense science and technology intelligence field as the data source; after acquiring the in-domain texts, manually annotating each document according to the predefined topic classification categories; for a document with an annotated topic, annotating the sentences in the document according to the predefined segment classification categories; and annotating the 'viewpoint' sentences at fine granularity according to the predefined viewpoint classification categories;
for an annotated document, splicing the document title and body with the special symbols as separators to obtain the input sequence and its corresponding word embedding representation; encoding the word embedding representation with the pre-trained language model to obtain the hidden-state representation of the input sequence; outputting the predicted classification probability distribution from the [CLS] vector of that representation through a multi-layer perceptron layer; computing the cross-entropy loss against the true classification label; and training the topic classification model on this loss to obtain a topic classification model meeting the training requirement;
for an annotated sentence, splicing the sentence with the special symbols as separators to obtain the input sequence and its corresponding word embedding representation; encoding the word embedding representation with the pre-trained language model to obtain the hidden-state representation of the input sequence; outputting the predicted classification probability distribution from the [CLS] vector of that representation through a multi-layer perceptron layer; computing the cross-entropy loss against the true classification label; and training the segment classification model on this loss to obtain a segment classification model meeting the training requirement;
for an annotated 'viewpoint' sentence, splicing the sentence with the special symbols as separators to obtain the input sequence and its corresponding word embedding representation; encoding the word embedding representation with the pre-trained language model to obtain the hidden-state representation of the input sequence; outputting the predicted classification probability distribution from the [CLS] vector of that representation through a multi-layer perceptron layer; computing the cross-entropy loss against the true classification label; and training the viewpoint classification model on this loss to obtain a viewpoint classification model meeting the training requirement.
Patent application CN202111601210.2A, priority date 2021-12-24, filed 2021-12-24, published as CN114328830A; status: Pending.

Publications (1)

Publication Number Publication Date
CN114328830A (en) 2022-04-12

Family

ID=81014006


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination