CN114328830A - Hierarchical retrieval and multi-dimensional classification method for information clue discovery - Google Patents


Info

Publication number
CN114328830A
Authority
CN
China
Prior art keywords
classification
document
sentence
intelligence
clue
Prior art date
Legal status
Pending
Application number
CN202111601210.2A
Other languages
Chinese (zh)
Inventor
胡明昊
罗准辰
罗威
谭玉珊
叶宇铭
宋宇
周纤
毛彬
田昌海
Current Assignee
Military Science Information Research Center Of Military Academy Of Chinese Pla
Original Assignee
Military Science Information Research Center Of Military Academy Of Chinese Pla
Priority date: 2021-12-24
Filing date: 2021-12-24
Publication date: 2022-04-12
Application filed by Military Science Information Research Center Of Military Academy Of Chinese Pla
Priority to CN202111601210.2A
Publication of CN114328830A


Abstract

The invention discloses a hierarchical retrieval and multi-dimensional classification method for intelligence clue discovery, comprising the following steps: acquiring relevant internet open-source data according to an intelligence research topic and constructing a full-text search index library for the topic; performing hierarchical retrieval over the index library according to a search keyword list to obtain a multi-granularity intelligence clue screening set; and performing multi-dimensional automatic classification of the clues at different granularities according to the screening set to obtain a multi-dimensional intelligence clue compilation set. The method significantly improves the efficiency of intelligence clue screening while maintaining high recall and precision, and effectively improves the efficiency and accuracy of intelligence clue classification and compilation.

Description

Hierarchical retrieval and multi-dimensional classification method for information clue discovery
Technical Field
The invention relates to the technical fields of information retrieval and natural language processing, and in particular to a hierarchical retrieval and multi-dimensional classification method for intelligence clue discovery.
Background
In the intelligence research business model of the national defense science and technology information field, the traditional way to carry out an open-source intelligence research topic relies on the domain knowledge of field experts: intelligence clues related to the topic are first found in internet open-source information sources such as news feeds, reports, documents, and social media; the collected clues are then screened, classified, compiled, analyzed, and assessed; and intelligence products such as research reports, message interpretations, and news briefings are finally produced. However, the huge volume of information resources emerging in the big-data era poses serious challenges to this research model, mainly in two respects: 1) manual screening of intelligence clues is inefficient and prone to screening errors and information omissions; 2) manual classification and compilation of clues involves a large amount of repetitive work, consumes considerable labor and time, and cannot meet the business requirement of rapid response. A new intelligence research paradigm that applies big-data and artificial-intelligence techniques to intelligence clue discovery is therefore needed.
For the clue-screening challenge, the current common practice is to build an index over the collected data with a search engine such as Elasticsearch and perform full-text retrieval to screen and filter clues. However, such retrieval generally returns only chapter-level results, such as a particular news item or report, and experts must still read those results to extract sentence-level or even phrase-level intelligence clues, so the process remains inefficient and prone to screening omissions.
For the clue-compilation challenge, the traditional approach applies text classification or clustering models to classify and compile the intelligence clues. In methods represented by text classification, however, the classification label system is generally constructed in advance, so the model can classify the input text along only a single dimension; a civil news classification task, for example, contains only a few fixed categories such as sports, finance, and society. The professional nature of the intelligence research field requires classification models that can compile clues of different granularities along multiple dimensions: a news report may belong to the 'artificial intelligence' topic, one sentence in it may belong to the 'research and development' category, another sentence may belong to the 'viewpoint' category, and that viewpoint sentence may be further subdivided into 'expert opinion' or 'official speech'. Existing text classification methods therefore cannot meet the business requirement of intelligence clue classification and compilation.
Disclosure of Invention
Aiming at the problems in the existing intelligence research process, such as the low efficiency of clue screening and compilation and the incomplete, single-dimensional classification perspective, the invention seeks to overcome the defects of the prior art and provides a hierarchical retrieval and multi-dimensional classification method for intelligence clue discovery.
In order to achieve the above object, the present invention provides a hierarchical retrieval and multi-dimensional classification method for intelligence clue discovery, the method comprising:
step 1) acquiring related Internet open source data according to an intelligence research topic, and constructing a full-text search index library of the research topic;
step 2) according to a search keyword list, performing hierarchical retrieval based on the full-text search index library to obtain a multi-granularity intelligence clue screening set;
step 3) according to the multi-granularity intelligence clue screening set, performing multi-dimensional automatic classification of the clues at different granularities to obtain a multi-dimensional intelligence clue compilation set.
As an improvement of the above method, the step 1) specifically includes:
step 1-1) collecting relevant data from internet public information sources by data mining according to the intelligence research topic to obtain the topic-related document set D = {d_1, ..., d_i, ..., d_n}, i ∈ [1, n], where n is the total number of documents; creating an index of the document set based on the full-text search engine Elasticsearch and formulating the index fields;
step 1-2) traversing the document set D, performing data mapping for each document d_i, and importing the mapped fields into the index to obtain the full-text search index library G for the intelligence research topic.
As an improvement of the above method, the index fields of step 1-1) include: document index name, document index type, document index ID number, document publication year, brief document description, language used by the document, data source of the document, document title, document content, document crawl time, and document hyperlink.
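For illustration only, the following is a minimal sketch of how an index with these fields might be created using the Python Elasticsearch client (elasticsearch-py 8.x keyword-argument style); the index name and field types are assumptions, since the patent names only the fields themselves. The index, type, and id fields correspond to Elasticsearch's built-in _index, _type, and _id metadata and need not be declared in the mapping.

```python
from elasticsearch import Elasticsearch

# Assumed endpoint and index name; not specified in the patent.
es = Elasticsearch("http://localhost:9200")

mappings = {
    "properties": {
        "year":        {"type": "integer"},  # document publication year
        "description": {"type": "text"},     # brief description of the document
        "language":    {"type": "keyword"},  # language used by the document
        "source":      {"type": "keyword"},  # data source of the document
        "title":       {"type": "text"},     # document title
        "content":     {"type": "text"},     # document content
        "crawled":     {"type": "date"},     # document crawl time
        "url":         {"type": "keyword"},  # hyperlink to the document
    }
}

# Create the full-text search index library G of step 1-1).
es.indices.create(index="intel_topic_index", mappings=mappings)
```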
As an improvement of the above method, the step 2) specifically includes:
step 2-1) according to the search keyword list W = {w_1, ..., w_m}, where m is the number of keywords, and the retrieval time range [T_s, T_e], generating an Elasticsearch query statement q = F_query(W, T_s, T_e), where F_query() is the query statement generation function;
step 2-2) performing full-text retrieval on the index library G according to the query statement q, D* = F_retri(G, q), to obtain the chapter-level intelligence clue screening set D* = {d*_1, ..., d*_i, ..., d*_{n*}}, where n* is the number of documents in the set, n* ≤ n, and F_retri() is the full-text retrieval function (sketches of F_query, F_retri, and the sentence-level filter F_filter follow step 2-4);
step 2-3) traversing the set D* and processing each document as follows: for a document d*_i, performing paragraph and sentence segmentation to obtain its sentence set S_i = {s_1, ..., s_k}, where k is the number of sentences of d*_i; then filtering the sentence set S_i according to the search keyword list W by combining heuristic rules with hard matching, S*_i = F_filter(S_i, W), to obtain the sentence-level intelligence clue set S*_i = {s*_1, ..., s*_{k*}} of document d*_i, where k* is the number of sentences after screening, k* ≤ k, and F_filter() is the sentence-level filtering function;
step 2-4) aggregating the intelligence clue set of every document in D* to obtain the sentence-level intelligence clue screening set S* = {S*_1, ..., S*_{n*}}.
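For concreteness, a minimal sketch of the query generation function F_query(W, T_s, T_e) and the full-text retrieval function F_retri(G, q) might look as follows; the field names, the any-keyword-matches semantics, and the hit limit are assumptions rather than details disclosed by the patent.

```python
from elasticsearch import Elasticsearch

def f_query(keywords, t_s, t_e):
    # Sketch of F_query(W, Ts, Te): match any keyword in title or content,
    # restricted to the retrieval time range. Field names are assumptions.
    return {
        "bool": {
            "should": [
                {"multi_match": {"query": w, "fields": ["title", "content"]}}
                for w in keywords
            ],
            "minimum_should_match": 1,
            "filter": [{"range": {"crawled": {"gte": t_s, "lte": t_e}}}],
        }
    }

def f_retri(es, index, q, max_hits=1000):
    # Sketch of F_retri(G, q): full-text retrieval returning documents d*_i.
    resp = es.search(index=index, query=q, size=max_hits)
    return [hit["_source"] for hit in resp["hits"]["hits"]]

es = Elasticsearch("http://localhost:9200")
q = f_query(["artificial intelligence", "unmanned system"],
            "2021-01-01", "2021-12-24")
chapter_level_set = f_retri(es, "intel_topic_index", q)  # the set D*
```

Continuing from the sketch above, the sentence-level filter F_filter(S_i, W) of step 2-3) could be sketched as below; the patent does not disclose its heuristic rules, so the minimum-length rule is an assumed stand-in, while the keyword test implements the hard matching.

```python
import re

def split_sentences(text):
    # Naive segmentation on Chinese/Western terminal punctuation; a
    # production system would use a language-aware splitter.
    return [s.strip() for s in re.split(r"(?<=[。！？.!?])\s*", text) if s.strip()]

def f_filter(sentences, keywords, min_len=10):
    # Sketch of F_filter(S_i, W): hard keyword matching plus one assumed
    # heuristic rule (drop very short fragments).
    return [
        s for s in sentences
        if len(s) >= min_len and any(w.lower() in s.lower() for w in keywords)
    ]

for doc in chapter_level_set:
    sentence_level_clues = f_filter(split_sentences(doc["content"]),
                                    ["artificial intelligence"])  # the set S*_i
```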
As an improvement of the above method, the step 3) specifically includes:
step 3-1) traversing the chapter-level intelligence clue screening set, inputting each document's title and body into a pre-established and trained topic classification model for classification, to obtain a chapter-level intelligence clue compilation set;
step 3-2) traversing the sentence-level intelligence clue screening set, inputting the clue sentences into a pre-established and trained segment classification model for classification, to obtain a sentence-level intelligence clue compilation set;
step 3-3) inputting the 'viewpoint' sentences in the sentence-level intelligence clue compilation set into a pre-established and trained viewpoint classification model for classification, to obtain a viewpoint intelligence clue compilation subset.
As an improvement of the above method, the step 3-1) specifically includes:
step 3-1-1) traversing the set D* and processing each document as follows:
for a document d*_i = (t_i, c_i), where t_i is the document title and c_i is the document body, splicing t_i and c_i with the special symbols [CLS] and [SEP] as separators to obtain the input sequence, and obtaining from it the word embedding representation H_0, which is the sum of the character embeddings, position embeddings, and segment embeddings;
encoding H_0 with a pre-trained language model comprising L pre-trained Transformer blocks applied in sequence:
H_l = TransformerBlock(H_{l-1}), l ∈ [1, L]
where H_l and H_{l-1} are the hidden-state representations output by the l-th and (l-1)-th Transformer blocks respectively, and TransformerBlock() denotes the Transformer function;
taking the [CLS] vector H_L[0] of the hidden-state representation H_L output by the L-th Transformer block and feeding it into a multi-layer perceptron layer to obtain the topic classification probability distribution Y_topic:
Y_topic = softmax(W_topic · H_L[0])
where softmax() denotes the normalized exponential function and W_topic is the parameter matrix of the topic classification model;
decoding Y_topic to obtain the topic classification result topic_i of the input document d*_i;
step 3-1-2) aggregating the topic classification results of every document in D* to obtain the chapter-level intelligence clue compilation set.
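As a concrete illustration of step 3-1), the following sketch implements the [CLS]-vector-plus-MLP topic classifier on top of a pre-trained BERT encoder with the Hugging Face transformers library. The 23-way output follows the embodiment below; the checkpoint name, example strings, and single-example batching are assumptions.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class TopicClassifier(nn.Module):
    # [CLS] vector of the last Transformer block -> MLP -> softmax, as in
    # step 3-1-1). Checkpoint is an assumption; the patent only says a
    # pre-trained language model such as BERT.
    def __init__(self, num_topics=23, checkpoint="bert-base-chinese"):
        super().__init__()
        self.encoder = BertModel.from_pretrained(checkpoint)  # L Transformer blocks
        self.mlp = nn.Linear(self.encoder.config.hidden_size, num_topics)  # W_topic

    def forward(self, **enc):
        h_cls = self.encoder(**enc).last_hidden_state[:, 0]  # H_L[0]
        return torch.softmax(self.mlp(h_cls), dim=-1)        # Y_topic

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = TopicClassifier()

t_i, c_i = "Hypothetical title", "Hypothetical body text of the document."
# Pair encoding yields [CLS] t [SEP] c [SEP] with segment ids handled
# automatically, reproducing the splicing described above.
enc = tokenizer(t_i, c_i, truncation=True, max_length=512, return_tensors="pt")
y_topic = model(**enc)                   # probability distribution, dimension 23
topic_i = y_topic.argmax(dim=-1).item()  # decoded topic classification result
```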
As an improvement of the above method, the step 3-2) specifically includes:
step 3-2-1) traversing the sentence-level intelligence clue screening set S*; for the sentence-level intelligence clue set S*_i of each document d*_i, processing every intelligence clue sentence s*_j in S*_i as follows:
splicing the sentence with the special symbols [CLS] and [SEP] as separators to obtain the input sequence, and obtaining from it the word embedding representation H^s_0, which is the sum of the character embeddings, position embeddings, and segment embeddings;
encoding H^s_0 with a pre-trained language model comprising L pre-trained Transformer blocks applied in sequence:
H^s_l = TransformerBlock(H^s_{l-1}), l ∈ [1, L]
taking the [CLS] vector H^s_L[0] of the hidden-state representation H^s_L output by the L-th Transformer block and feeding it into a multi-layer perceptron layer to obtain the segment classification probability distribution Y_segment:
Y_segment = softmax(W_segment · H^s_L[0])
where softmax() denotes the normalized exponential function and W_segment is the parameter matrix of the segment classification model;
decoding Y_segment to obtain the segment classification result segment_j of the input sentence s*_j;
step 3-2-2) aggregating the segment classification results of the sentences of every document in S* to obtain the sentence-level intelligence clue compilation set.
As an improvement of the above method, the step 3-3) specifically includes:
step 3-3-1) traversing the 'viewpoint' sentences in the sentence-level intelligence clue compilation set and processing each as follows:
splicing the 'viewpoint' sentence with the special symbols [CLS] and [SEP] as separators to obtain the input sequence, and obtaining from it the word embedding representation H^o_0;
encoding H^o_0 with a pre-trained language model comprising L pre-trained Transformer blocks applied in sequence:
H^o_l = TransformerBlock(H^o_{l-1}), l ∈ [1, L]
taking the [CLS] vector H^o_L[0] of the hidden-state representation H^o_L output by the L-th Transformer block and feeding it into a multi-layer perceptron layer to obtain the viewpoint classification probability distribution Y_opinion:
Y_opinion = softmax(W_opinion · H^o_L[0])
where softmax() denotes the normalized exponential function and W_opinion is the parameter matrix of the viewpoint classification model;
decoding Y_opinion to obtain the fine-grained viewpoint classification result of the sentence;
step 3-3-2) aggregating the fine-grained viewpoint classification results of the 'viewpoint' sentences of every document to obtain the viewpoint intelligence clue compilation subset.
As an improvement of the above method, the viewpoint classification model takes the same form of input data and has the same network structure as the segment classification model, differing only in the prediction dimension.
As an improvement of the above method, the method further comprises steps of training the topic classification model, the segment classification model, and the viewpoint classification model, specifically:
selecting news data in the national defense science and technology intelligence field as the data source; after acquiring the in-domain texts, manually annotating each document according to the predefined topic classification categories; for a document with an annotated topic, annotating the sentences in the document according to the predefined segment classification categories; and annotating the 'viewpoint' sentences at fine granularity according to the predefined viewpoint classification categories;
for an annotated document, splicing the document title and body with the special symbols as separators to obtain the input sequence and its corresponding word embedding representation; encoding the word embedding representation with the pre-trained language model to obtain the hidden-state representation of the input sequence; outputting the predicted classification probability distribution from the [CLS] vector of that representation through a multi-layer perceptron layer; computing the cross-entropy loss against the true classification label; and training the topic classification model on this loss to obtain a topic classification model meeting the training requirement;
for an annotated sentence, splicing the sentence with the special symbols as separators to obtain the input sequence and its corresponding word embedding representation; encoding the word embedding representation with the pre-trained language model to obtain the hidden-state representation of the input sequence; outputting the predicted classification probability distribution from the [CLS] vector of that representation through a multi-layer perceptron layer; computing the cross-entropy loss against the true classification label; and training the segment classification model on this loss to obtain a segment classification model meeting the training requirement;
for an annotated 'viewpoint' sentence, splicing the sentence with the special symbols as separators to obtain the input sequence and its corresponding word embedding representation; encoding the word embedding representation with the pre-trained language model to obtain the hidden-state representation of the input sequence; outputting the predicted classification probability distribution from the [CLS] vector of that representation through a multi-layer perceptron layer; computing the cross-entropy loss against the true classification label; and training the viewpoint classification model on this loss to obtain a viewpoint classification model meeting the training requirement.
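A minimal fine-tuning sketch matching the training procedure above (prediction from the [CLS] vector, cross-entropy against the true label) might look as follows; it reuses the TopicClassifier sketch given earlier, and the optimizer choice, learning rate, and single-example updates are assumptions. The same loop trains the segment and viewpoint models with output dimensions 19 and 5.

```python
import torch
import torch.nn as nn
from torch.optim import AdamW

# `model` and `tokenizer` are the TopicClassifier sketch from above.
criterion = nn.CrossEntropyLoss()               # cross-entropy against true label
optimizer = AdamW(model.parameters(), lr=2e-5)  # lr is an assumed value

def train_step(title, body, label_id):
    enc = tokenizer(title, body, truncation=True, max_length=512,
                    return_tensors="pt")
    # CrossEntropyLoss consumes pre-softmax scores, so take the MLP output
    # directly rather than the softmax distribution.
    logits = model.mlp(model.encoder(**enc).last_hidden_state[:, 0])
    loss = criterion(logits, torch.tensor([label_id]))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```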
Compared with the prior art, the invention has the following advantages:
1. the invention provides a hierarchical retrieval method that, by combining technical means such as full-text search and keyword matching in a human-machine-cooperative manner, realizes a simple and efficient coarse-to-fine retrieval procedure for recalling a chapter-level/sentence-level multi-granularity intelligence clue screening set;
2. the invention designs a multi-dimensional automatic classification method that, addressing the incomplete single-dimensional classification perspective of traditional classification models, combines a topic classification model, a segment classification model, and a viewpoint classification model to automatically classify multi-granularity intelligence clues along multiple dimensions, effectively improving the efficiency and accuracy of intelligence clue classification and compilation.
Drawings
FIG. 1 is a flow chart of the hierarchical retrieval and multi-dimensional classification method for intelligence clue screening and compilation according to the present invention;
FIG. 2 is a schematic diagram of hierarchical information retrieval;
FIG. 3 is a schematic diagram of a topic classification model structure;
FIG. 4 is a schematic diagram of a segment classification model and a point of view classification model structure.
Detailed Description
The invention provides a hierarchical retrieval and multi-dimensional classification method for intelligence clue discovery, comprising a hierarchical retrieval module and a multi-dimensional classification module, and the method comprises the following steps:
step 1) given the internet open-source data collected for an intelligence research topic, constructing a full-text search index library of the research topic;
step 2) given a search keyword list, performing hierarchical retrieval over the index library, comprising chapter-level coarse-grained retrieval and sentence-level fine-grained retrieval, to obtain a multi-granularity intelligence clue screening set;
step 3) given the retrieved multi-granularity intelligence clue screening set, performing multi-dimensional automatic classification of the clues at different granularities along three dimensions (topic classification, segment classification, and viewpoint classification) to obtain a multi-dimensional intelligence clue compilation set.
In the above technical solution, the step 1) specifically includes:
step 1-1) given an intelligence research topic and the collected related internet open-source data, creating a document index based on the full-text search engine Elasticsearch and formulating the data mapping;
step 1-2) importing the acquired data into the index according to the data mapping format to obtain a full-text search index library for the intelligence topic.
In the above technical solution, the step 2) specifically includes:
step 2-1) given a search keyword list and a retrieval time range, generating a search query statement;
step 2-2) performing chapter-level coarse-grained retrieval: full-text retrieval on the index library with the query statement, to obtain a chapter-level intelligence clue screening set;
step 2-3) performing sentence-level fine-grained retrieval: segmenting the chapter-level intelligence clues into paragraphs and sentences, and filtering the sentence set based on the search keyword list by combining heuristic rules with hard matching, to obtain a sentence-level intelligence clue screening set.
In the above technical solution, the step 3) specifically includes:
step 3-1) traversing the chapter-level intelligence clue screening set, inputting each document's title and body into the topic classification model for classification, to obtain a chapter-level intelligence clue compilation set;
step 3-2) traversing the sentence-level intelligence clue screening set, inputting the clue sentences into the segment classification model for classification, to obtain a sentence-level intelligence clue compilation set;
step 3-3) inputting the 'viewpoint' sentences in the sentence-level intelligence clue compilation set into the viewpoint classification model for classification, to obtain a viewpoint intelligence clue compilation subset.
For the multi-dimensional automatic classification required by the method, the main steps of training the topic classification, segment classification, and viewpoint classification models are as follows (a sketch of a hypothetical annotation record follows this list):
step S1) collecting news data in the national defense science and technology intelligence field and annotating it according to the topic classification, segment classification, and viewpoint classification label systems, to obtain an annotated data set;
step S2) training a topic classification model on the annotated data, to output the topic category of chapter-level intelligence clues;
step S3) training a segment classification model on the annotated data, to output the segment category of sentence-level intelligence clues;
step S4) training a viewpoint classification model on the annotated data, to output the fine-grained viewpoint category of 'viewpoint'-class sentence-level intelligence clues.
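For illustration, one annotated record produced by step S1) might be stored in the following hypothetical shape; the field names are assumptions, while the category values follow the label systems of Tables 2-4 below.

```python
# Hypothetical annotation record (field names assumed; categories per Tables 2-4).
annotated_doc = {
    "title": "...",
    "content": "...",
    "topic": "Artificial intelligence",            # topic label (step S1)
    "sentences": [
        {"text": "...", "segment": "Research and development"},
        {"text": "...", "segment": "Viewpoint",
         "opinion": "Expert opinion"},             # fine-grained viewpoint label
    ],
}
```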
The technical solution of the present invention will be described in detail below with reference to the accompanying drawings and examples.
Example 1
As shown in fig. 1, the present invention provides a hierarchical retrieval and multidimensional classification method for intelligence clue discovery, which mainly comprises the following steps:
Step 1) given the internet open-source data collected for the intelligence research topic, a full-text search index library of the topic is constructed, specifically: given an intelligence research topic, relevant data is first collected from internet public information sources by data mining, obtaining the topic-associated document set D = {d_1, ..., d_n}, where n is the total number of documents; an index of the document set is then created based on the full-text search engine Elasticsearch and the index fields are formulated, a typical document index being shown in Table 1. The document set D is then traversed, data mapping is performed for each document d_i (for example, the document title is mapped to the title field), and the mapped fields are imported into the index to obtain the full-text search index library G for the intelligence topic.
TABLE 1 Document index example

Index field   Field interpretation
index         Document index name
type          Document index type
id            Document index ID number
year          Document publication year
description   Brief description of the document
language      Language used by the document
source        Data source of the document
title         Document title
content       Document content
crawled       Document crawl time
url           Hyperlink to the document
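The data mapping of step 1) might then amount to projecting each crawled item onto the Table 1 fields and importing it, as in the following sketch; the raw-document field names and the helper itself are hypothetical.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def map_and_index(raw_doc, doc_id):
    # Map a crawled document onto the Table 1 fields and import it into the
    # index library G. The raw field names are hypothetical; only the target
    # fields come from Table 1.
    mapped = {
        "year": raw_doc.get("published_year"),
        "description": raw_doc.get("summary", ""),
        "language": raw_doc.get("lang", "zh"),
        "source": raw_doc.get("site"),
        "title": raw_doc.get("title"),      # e.g. title -> the title field
        "content": raw_doc.get("body"),
        "crawled": raw_doc.get("fetch_time"),
        "url": raw_doc.get("link"),
    }
    es.index(index="intel_topic_index", id=doc_id, document=mapped)
```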
Step 2) given a search keyword list, hierarchical retrieval is performed over the index library, comprising chapter-level coarse-grained retrieval and sentence-level fine-grained retrieval, to obtain a multi-granularity intelligence clue screening set; the flow is shown in FIG. 2. Specifically:

Chapter-level coarse-grained retrieval is performed first. Given the search keyword list W = {w_1, ..., w_m}, where m is the number of keywords, and the retrieval time range [T_s, T_e], an Elasticsearch query statement q = F_query(W, T_s, T_e) is generated, where F_query is the query statement generation function. Full-text retrieval is then performed on the index library G with the query statement q, D* = F_retri(G, q), returning the chapter-level intelligence clue screening set D* = {d*_1, ..., d*_{n*}}, where n* ≤ n is the number of documents in the set and F_retri is the full-text retrieval function.

After the chapter-level coarse-grained retrieval, sentence-level fine-grained retrieval is performed on the screening set D*. Specifically, for a document d*_i in D*, paragraph and sentence segmentation is first performed to obtain the document's sentence set S_i = {s_1, ..., s_k}, where k is the number of sentences. Then, given the search keyword list W, the sentence set S_i is filtered by combining heuristic rules with hard matching, S*_i = F_filter(S_i, W), yielding the sentence-level intelligence clue set S*_i = {s*_1, ..., s*_{k*}} of document d*_i, where k* ≤ k is the number of sentences after screening and F_filter is the sentence-level filtering function. This screening operation is applied to every document in D*, and the results are finally aggregated into the sentence-level intelligence clue screening set S* = {S*_1, ..., S*_{n*}}.
Step 3) given the retrieved multi-granularity intelligence clue screening sets D* and S*, the clues at different granularities are classified along three dimensions (topic classification, segment classification, and viewpoint classification) to obtain a multi-dimensional intelligence clue compilation set, specifically:
first, the chapter-level intelligence clues are subject-classified. The topic classification comprises 23 categories including 22 topics such as artificial intelligence, 5G technology and the like, and 1 'none' category, which indicates that the document does not belong to any topic above, and some disclosed topic classification categories are shown in table 2.
TABLE 2 Partially disclosed topic classification categories

Artificial intelligence    Unmanned systems          5G technology
Quantum technology         Electronic information    Biotechnology
Advanced materials and manufacturing                 Basic science
The specific steps of topic classification are as follows. First, the chapter-level intelligence clue screening set D* is traversed. For a document d*_i = (t_i, c_i) in D*, where t_i is the document title and c_i is the document body, t_i and c_i are spliced into the input sequence <[CLS], Tok_1, ..., Tok_|t|, [SEP], Tok_1, ..., Tok_|c|, [SEP]>, where the title contains |t| characters Tok_1, ..., Tok_|t|, the body contains |c| characters Tok_1, ..., Tok_|c|, and [CLS] and [SEP] are special separators. The sequence is then input into the topic classification model for classification; the model structure is shown in FIG. 3. Specifically, the word embedding representation H_0 of the sequence is first obtained as the sum of the character embeddings, position embeddings, and segment embeddings; its dimension is (|t| + |c| + 2) × |h|, where |h| is the hidden-state dimension. The word embedding representation is then encoded with a pre-trained language model such as BERT, which applies L pre-trained Transformer blocks to H_0 in sequence:

H_l = TransformerBlock(H_{l-1}), l ∈ [1, L]

where H_l is the hidden-state representation output by the l-th Transformer block and TransformerBlock() denotes the Transformer function. The [CLS] vector H_L[0] of the hidden-state representation H_L output by the L-th Transformer block is fed into a multi-layer perceptron layer to obtain the topic classification probability distribution Y_topic:

Y_topic = softmax(W_topic · H_L[0])

where W_topic is the parameter matrix of the topic classification model with dimension 23 × |h|, and Y_topic is a probability distribution of dimension 23; decoding it yields the topic classification result topic_i of the input document d*_i.
Second, segment classification is performed on the sentence-level intelligence clues. The segment classification comprises 19 categories: 18 categories such as research and development, deployment, and viewpoint, plus one 'none' category indicating that the sentence belongs to none of the listed categories. The partially disclosed segment classification categories are shown in Table 3.
TABLE 3 Partially disclosed segment classification categories

Research and development    Production    Deployment
Test                        Delivery      Disclosure
Viewpoint                   Influence
The specific steps of segment classification are as follows. First, the sentence-level intelligence clue screening set S* is traversed. For the sentence-level intelligence clue set S*_i of each document d*_i, every intelligence clue sentence s*_j in S*_i is spliced with the special symbols [CLS] and [SEP] as separators into the input sequence <[CLS], Tok_1, ..., Tok_|s|, [SEP]>, where |s| is the number of characters of the sentence. The sequence is then input into the segment classification model for classification; the model structure is shown in FIG. 4. Specifically, the word embedding representation H^s_0 of the sequence is first obtained as the sum of the character embeddings, position embeddings, and segment embeddings; its dimension is (|s| + 2) × |h|. The word embedding representation is then encoded with a pre-trained language model such as BERT, which applies L pre-trained Transformer blocks to H^s_0 in sequence:

H^s_l = TransformerBlock(H^s_{l-1}), l ∈ [1, L]

The [CLS] vector H^s_L[0] of the hidden-state representation H^s_L output by the L-th Transformer block is fed into a multi-layer perceptron layer to obtain the segment classification probability distribution Y_segment:

Y_segment = softmax(W_segment · H^s_L[0])

where W_segment is the parameter matrix of the segment classification model with dimension 19 × |h|, and Y_segment is a probability distribution of dimension 19; decoding it yields the segment classification result segment_j of the input sentence s*_j.
Finally, viewpoint classification is performed on the sentence-level intelligence clues whose segment classification category is 'viewpoint'. The viewpoint classification comprises 5 categories in total, including government stance, expert opinion, and official speech, as shown in Table 4.
TABLE 4 Viewpoint classification categories

Independent viewpoint    Government stance    Expert opinion
Report viewpoint         Official speech
The specific steps of viewpoint classification are essentially the same as those of segment classification. Specifically, a 'viewpoint'-class sentence is spliced with the special symbols [CLS] and [SEP] as separators into the input sequence <[CLS], input sentence, [SEP]>, from which the word embedding representation H^o_0 is obtained. The word embedding representation is then encoded with the pre-trained language model:

H^o_l = TransformerBlock(H^o_{l-1}), l ∈ [1, L]

The [CLS] vector H^o_L[0] of the hidden-state representation H^o_L output by the L-th Transformer block is fed into a multi-layer perceptron layer to obtain the viewpoint classification probability distribution Y_opinion:

Y_opinion = softmax(W_opinion · H^o_L[0])

where W_opinion is the parameter matrix of the viewpoint classification model with dimension 5 × |h|, and Y_opinion is a probability distribution of dimension 5; decoding it yields the fine-grained viewpoint classification result of the input 'viewpoint' sentence.
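Since the segment and viewpoint models share their input format and network structure and differ only in prediction dimension, both can be instantiated from one parameterized module, as in the following sketch using the Hugging Face transformers library; the checkpoint name and example sentence are assumptions.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class ClueSentenceClassifier(nn.Module):
    # One structure for both models: per the description they differ only
    # in the dimension of the output probability distribution.
    def __init__(self, num_labels, checkpoint="bert-base-chinese"):
        super().__init__()
        self.encoder = BertModel.from_pretrained(checkpoint)
        self.mlp = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, **enc):
        h_cls = self.encoder(**enc).last_hidden_state[:, 0]  # [CLS] vector
        return torch.softmax(self.mlp(h_cls), dim=-1)

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
segment_model = ClueSentenceClassifier(num_labels=19)  # Y_segment
opinion_model = ClueSentenceClassifier(num_labels=5)   # Y_opinion

s = "An expert stated that the system is expected to enter service next year."
enc = tokenizer(s, truncation=True, max_length=512, return_tensors="pt")
segment_j = segment_model(**enc).argmax(dim=-1)  # segment classification result
# Only sentences whose segment class is "viewpoint" are passed on:
opinion_j = opinion_model(**enc).argmax(dim=-1)  # fine-grained viewpoint result
```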
For the multi-dimensional automatic classification required by the method, the main steps of training the topic classification, segment classification, and viewpoint classification models are as follows:
Step S1) annotating the classification data, specifically: first, news data in the national defense science and technology intelligence field is chosen as the data source; after acquiring the in-domain texts, each document is manually annotated according to the predefined topic classification categories, an annotation example being shown in Table 5. Then, for a document with an annotated topic, the sentences in the document are annotated according to the predefined segment classification categories; in particular, 'viewpoint'-class sentences are annotated at fine granularity according to the predefined viewpoint classification categories, an annotation example being shown in Table 6.
Step S2) training the topic classification model on the annotated data, specifically: for an annotated document, the document title and body are spliced with the special symbols [CLS] and [SEP] as separators into the input sequence <[CLS], title, [SEP], body, [SEP]>, and the word embedding representation of the input sequence is obtained; the word embedding representation is then encoded with the pre-trained language model to obtain the hidden-state representation of the input sequence; based on the [CLS] vector of that representation, the multi-layer perceptron layer outputs the predicted classification probability distribution, the cross-entropy loss against the true classification label is computed, and the topic classification model is finally trained on this loss.
TABLE 5 Topic classification annotation example (table image; contents not recoverable)

TABLE 6 Segment classification and viewpoint classification annotation example (table image; contents not recoverable)
Step S3) training the segment classification model on the annotated data, specifically: for an annotated sentence, the sentence is spliced with the special symbols [CLS] and [SEP] as separators into the input sequence <[CLS], sentence, [SEP]>, and the word embedding representation of the input sequence is obtained; the word embedding representation is then encoded with the pre-trained language model to obtain the hidden-state representation of the input sequence; based on the [CLS] vector of that representation, the multi-layer perceptron layer outputs the predicted classification probability distribution, the cross-entropy loss against the true classification label is computed, and the segment classification model is finally trained on this loss.
Step S4) training the viewpoint classification model on the annotated data; since the viewpoint classification model takes the same input data and has the same network structure as the segment classification model, differing only in the prediction dimension, the training procedure is the same as step S3) and is not repeated here.
Finally, it should be noted that the above embodiments are intended only to illustrate, not to limit, the technical solution of the present invention. Although the present invention has been described in detail with reference to the embodiments, those skilled in the art will understand that various changes may be made and equivalents substituted without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (10)

1. A hierarchical retrieval and multi-dimensional classification method for intelligence clue discovery, the method comprising the following steps:
step 1) acquiring related Internet open source data according to an intelligence research topic, and constructing a full-text search index library of the research topic;
step 2) according to a search keyword list, performing hierarchical retrieval based on the full-text search index library to obtain a multi-granularity intelligence clue screening set;
step 3) according to the multi-granularity intelligence clue screening set, performing multi-dimensional automatic classification of the clues at different granularities to obtain a multi-dimensional intelligence clue compilation set.
2. The hierarchical retrieval and multi-dimensional classification method for intelligence clue discovery according to claim 1, wherein the step 1) specifically includes:
step 1-1) collecting relevant data from internet public information sources by data mining according to the intelligence research topic to obtain the topic-related document set D = {d_1, ..., d_i, ..., d_n}, i ∈ [1, n], where n is the total number of documents; creating an index of the document set based on the full-text search engine Elasticsearch and formulating the index fields;
step 1-2) traversing the document set D, performing data mapping for each document d_i, and importing the mapped fields into the index to obtain the full-text search index library G for the intelligence research topic.
3. The hierarchical retrieval and multi-dimensional classification method for intelligence clue discovery according to claim 2, wherein the index fields of step 1-1) include: document index name, document index type, document index ID number, document publication year, brief document description, language used by the document, data source of the document, document title, document content, document crawl time, and document hyperlink.
4. The hierarchical retrieval and multi-dimensional classification method for intelligence clue discovery according to claim 3, wherein the step 2) specifically includes:
step 2-1) according to the search keyword list W = {w_1, ..., w_m}, where m is the number of keywords, and the retrieval time range [T_s, T_e], generating an Elasticsearch query statement q = F_query(W, T_s, T_e), where F_query() is the query statement generation function;
step 2-2) performing full-text retrieval on the index library G according to the query statement q, D* = F_retri(G, q), to obtain the chapter-level intelligence clue screening set D* = {d*_1, ..., d*_i, ..., d*_{n*}}, where n* is the number of documents in the set, n* ≤ n, and F_retri() is the full-text retrieval function;
step 2-3) traversing the set D* and processing each document as follows: for a document d*_i, performing paragraph and sentence segmentation to obtain its sentence set S_i = {s_1, ..., s_k}, where k is the number of sentences of d*_i; then filtering the sentence set S_i according to the search keyword list W by combining heuristic rules with hard matching, S*_i = F_filter(S_i, W), to obtain the sentence-level intelligence clue set S*_i = {s*_1, ..., s*_{k*}} of document d*_i, where k* is the number of sentences after screening, k* ≤ k, and F_filter() is the sentence-level filtering function;
step 2-4) aggregating the intelligence clue set of every document in D* to obtain the sentence-level intelligence clue screening set S* = {S*_1, ..., S*_{n*}}.
5. The hierarchical retrieval and multi-dimensional classification method for intelligence clue discovery according to claim 4, wherein the step 3) comprises:
step 3-1) traversing the chapter-level intelligence clue screening set, inputting each document's title and body into a pre-established and trained topic classification model for classification, to obtain a chapter-level intelligence clue compilation set;
step 3-2) traversing the sentence-level intelligence clue screening set, inputting the clue sentences into a pre-established and trained segment classification model for classification, to obtain a sentence-level intelligence clue compilation set;
step 3-3) inputting the 'viewpoint' sentences in the sentence-level intelligence clue compilation set into a pre-established and trained viewpoint classification model for classification, to obtain a viewpoint intelligence clue compilation subset.
6. The hierarchical retrieval and multi-dimensional classification method for intelligence clue discovery according to claim 5, wherein the step 3-1) specifically includes:
step 3-1-1) traversing the set D* and processing each document as follows:
for a document d*_i = (t_i, c_i), where t_i is the document title and c_i is the document body, splicing t_i and c_i with the special symbols [CLS] and [SEP] as separators to obtain the input sequence, and obtaining from it the word embedding representation H_0, which is the sum of the character embeddings, position embeddings, and segment embeddings;
encoding H_0 with a pre-trained language model comprising L pre-trained Transformer blocks applied in sequence:
H_l = TransformerBlock(H_{l-1}), l ∈ [1, L]
where H_l and H_{l-1} are the hidden-state representations output by the l-th and (l-1)-th Transformer blocks respectively, and TransformerBlock() denotes the Transformer function;
taking the [CLS] vector H_L[0] of the hidden-state representation H_L output by the L-th Transformer block and feeding it into a multi-layer perceptron layer to obtain the topic classification probability distribution Y_topic:
Y_topic = softmax(W_topic · H_L[0])
where softmax() denotes the normalized exponential function and W_topic is the parameter matrix of the topic classification model;
decoding Y_topic to obtain the topic classification result topic_i of the input document d*_i;
step 3-1-2) aggregating the topic classification results of every document in D* to obtain the chapter-level intelligence clue compilation set.
7. The hierarchical retrieval and multi-dimensional classification method for intelligence clue discovery according to claim 6, wherein the step 3-2) specifically includes:
step 3-2-1) traversing the sentence-level intelligence clue screening set S*; for the sentence-level intelligence clue set S*_i of each document d*_i, processing every intelligence clue sentence s*_j in S*_i as follows:
splicing the sentence with the special symbols [CLS] and [SEP] as separators to obtain the input sequence, and obtaining from it the word embedding representation H^s_0, which is the sum of the character embeddings, position embeddings, and segment embeddings;
encoding H^s_0 with a pre-trained language model comprising L pre-trained Transformer blocks applied in sequence:
H^s_l = TransformerBlock(H^s_{l-1}), l ∈ [1, L]
taking the [CLS] vector H^s_L[0] of the hidden-state representation H^s_L output by the L-th Transformer block and feeding it into a multi-layer perceptron layer to obtain the segment classification probability distribution Y_segment:
Y_segment = softmax(W_segment · H^s_L[0])
where softmax() denotes the normalized exponential function and W_segment is the parameter matrix of the segment classification model;
decoding Y_segment to obtain the segment classification result segment_j of the input sentence s*_j;
step 3-2-2) aggregating the segment classification results of the sentences of every document in S* to obtain the sentence-level intelligence clue compilation set.
8. The hierarchical retrieval and multi-dimensional classification method for intelligence clue discovery according to claim 7, wherein the step 3-3) specifically includes:
step 3-3-1) traversing the 'viewpoint' sentences in the sentence-level intelligence clue compilation set and processing each as follows:
splicing the 'viewpoint' sentence with the special symbols [CLS] and [SEP] as separators to obtain the input sequence, and obtaining from it the word embedding representation H^o_0;
encoding H^o_0 with a pre-trained language model comprising L pre-trained Transformer blocks applied in sequence:
H^o_l = TransformerBlock(H^o_{l-1}), l ∈ [1, L]
taking the [CLS] vector H^o_L[0] of the hidden-state representation H^o_L output by the L-th Transformer block and feeding it into a multi-layer perceptron layer to obtain the viewpoint classification probability distribution Y_opinion:
Y_opinion = softmax(W_opinion · H^o_L[0])
where softmax() denotes the normalized exponential function and W_opinion is the parameter matrix of the viewpoint classification model;
decoding Y_opinion to obtain the fine-grained viewpoint classification result of the sentence;
step 3-3-2) aggregating the fine-grained viewpoint classification results of the 'viewpoint' sentences of every document to obtain the viewpoint intelligence clue compilation subset.
9. The hierarchical retrieval and multi-dimensional classification method for intelligence clue discovery according to claim 8, wherein the viewpoint classification model takes the same form of input data and has the same network structure as the segment classification model, differing only in the prediction dimension.
10. The hierarchical retrieval and multi-dimensional classification method for intelligence clue discovery according to claim 9, further comprising steps of training the topic classification model, the segment classification model, and the viewpoint classification model, specifically:
selecting news data in the national defense science and technology intelligence field as the data source; after acquiring the in-domain texts, manually annotating each document according to the predefined topic classification categories; for a document with an annotated topic, annotating the sentences in the document according to the predefined segment classification categories; and annotating the 'viewpoint' sentences at fine granularity according to the predefined viewpoint classification categories;
for an annotated document, splicing the document title and body with the special symbols as separators to obtain the input sequence and its corresponding word embedding representation; encoding the word embedding representation with the pre-trained language model to obtain the hidden-state representation of the input sequence; outputting the predicted classification probability distribution from the [CLS] vector of that representation through a multi-layer perceptron layer; computing the cross-entropy loss against the true classification label; and training the topic classification model on this loss to obtain a topic classification model meeting the training requirement;
for an annotated sentence, splicing the sentence with the special symbols as separators to obtain the input sequence and its corresponding word embedding representation; encoding the word embedding representation with the pre-trained language model to obtain the hidden-state representation of the input sequence; outputting the predicted classification probability distribution from the [CLS] vector of that representation through a multi-layer perceptron layer; computing the cross-entropy loss against the true classification label; and training the segment classification model on this loss to obtain a segment classification model meeting the training requirement;
for an annotated 'viewpoint' sentence, splicing the sentence with the special symbols as separators to obtain the input sequence and its corresponding word embedding representation; encoding the word embedding representation with the pre-trained language model to obtain the hidden-state representation of the input sequence; outputting the predicted classification probability distribution from the [CLS] vector of that representation through a multi-layer perceptron layer; computing the cross-entropy loss against the true classification label; and training the viewpoint classification model on this loss to obtain a viewpoint classification model meeting the training requirement.
Patent application CN202111601210.2A, priority date 2021-12-24, filed 2021-12-24, published as CN114328830A; status: Pending.

Publications (1)

Publication Number Publication Date
CN114328830A (en) 2022-04-12

Family

ID=81014006


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination