CN116150335A - Text semantic retrieval method under military scene - Google Patents

Text semantic retrieval method under military scene

Info

Publication number
CN116150335A
Authority
CN
China
Prior art keywords
military
text
model
retrieval
semantic
Prior art date
Legal status
Pending
Application number
CN202211630251.9A
Other languages
Chinese (zh)
Inventor
孙斌
韩立斌
赵文成
袁翔
郑少秋
王静
周宇
黎健
Current Assignee
CETC 28 Research Institute
Original Assignee
CETC 28 Research Institute
Priority date
Filing date
Publication date
Application filed by CETC 28 Research Institute filed Critical CETC 28 Research Institute
Priority to CN202211630251.9A
Publication of CN116150335A
Legal status: Pending


Classifications

    • G PHYSICS
      • G06 COMPUTING; CALCULATING OR COUNTING
        • G06F ELECTRIC DIGITAL DATA PROCESSING
          • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
            • G06F 16/30 Information retrieval of unstructured textual data
              • G06F 16/33 Querying
                • G06F 16/332 Query formulation
                  • G06F 16/3329 Natural language query formulation or dialogue systems
                • G06F 16/3331 Query processing
                  • G06F 16/334 Query execution
                    • G06F 16/3343 Query execution using phonetics
                • G06F 16/338 Presentation of query results
              • G06F 16/34 Browsing; Visualisation therefor
          • G06F 40/00 Handling natural language data
            • G06F 40/20 Natural language analysis
              • G06F 40/205 Parsing
                • G06F 40/216 Parsing using statistical methods
              • G06F 40/279 Recognition of textual entities
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
      • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
        • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT]
          • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text semantic retrieval method for military scenarios. First, a dual (two-branch) semantic retrieval model is built on top of a military pre-training model and fine-tuned on a military semantic retrieval data set to form question and answer language representation models; a semantic vector library of the military text data is computed offline, and a two-level inverted index is built by vector clustering. Second, a text retrieval fine-ranking model is built on the military pre-training model and fine-tuned on a military semantic retrieval fine-ranking data set. Facing a real-time retrieval task, the question language representation model produces a semantic vector for the question, a text set matching the user requirement is recalled through vector similarity computation and retrieval, and the fine-ranking model then pinpoints the specific text data, which is fed back to the user. The method can locate the data a user needs in real time within massive military text data and is applicable to large-scale text search and retrieval-based question answering in military scenarios.

Description

Text semantic retrieval method under military scene
Technical Field
The invention belongs to the technical field of semantic retrieval and intelligent question answering, and in particular relates to a text semantic retrieval method for military scenarios.
Background
With the growing maturity of military digitalization and data engineering construction, military data assets are increasing exponentially, and retrieval of military data resources under big-data conditions has become an indispensable part of how military users execute their tasks. Faced with the broad and massive distribution of text data in a joint information service environment, helping military personnel of all kinds quickly and accurately locate the data they need in this ocean of data is a key and urgent problem.
The traditional text retrieval method handles massive text data through keyword retrieval. Specifically, the search request is first segmented into words to obtain the query keywords; the keywords are then expanded by querying semantic lexicons, synonym forests, related-word lexicons, and the like; finally, a text set strongly related to the question is recalled by a statistical ranking algorithm such as BM25 (Best Match 25), built around term-frequency statistics such as TF-IDF (term frequency-inverse document frequency). This "hard" keyword-matching statistical approach solves text retrieval at the word level but ignores semantic-level information, and works well only when the search keywords are precise and the search intent is clear.
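For concreteness, the BM25 score referred to above can be written as a short function. The following Python sketch (not part of the patent; k1 and b are the customary defaults) illustrates the word-level "hard" matching being described:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each word-segmented document against the query with BM25.

    docs: list of token lists (already segmented, as in step 1-2 below).
    Returns (doc_index, score) pairs sorted best-first.
    """
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(t for d in docs for t in set(d))   # document frequency per term
    scores = []
    for i, d in enumerate(docs):
        tf, dl, s = Counter(d), len(d), 0.0
        for t in query_terms:
            if t not in tf:
                continue                            # "hard" matching: no overlap, no score
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1)
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * dl / avgdl))
        scores.append((i, s))
    return sorted(scores, key=lambda x: -x[1])
```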
With the development of natural language processing technology, researchers have abstracted text information retrieval into a text matching problem, building discriminative LTR (Learning to Rank) models that obtain text feature representations from a language representation model and recall the text set strongly related to the question by computing inter-text similarity. For example, Fu Jian et al. use a convolutional deep neural network to extract the intra-sentence structure of each question and answer together with their interaction information and compute similarity, improving document-retrieval-based question answering. Shao Mingrui et al. apply a deep neural network combining a Transformer with an attention mechanism to FAQ-based question answering, enabling best-match text matching even on a small data set. Zhu Zongkui et al. use a BERT pre-training model to handle entity recognition and similarity computation for the particularities of Chinese questions, optimizing Chinese knowledge-graph question answering. Compared with traditional methods, deep-learning text matching automatically learns the semantics of questions and answers and performs better at semantic text matching.
The existing text retrieval methods have the following limitations in military scenarios: first, military text retrieval requests are highly specialized and user questions vary widely in how clearly they are expressed, so traditional retrieval based on word-frequency statistics performs poorly; second, military text corpora contain many specialized terms to which general-purpose pre-trained language models transfer poorly, so a large amount of military text corpus data must be collected to retrain a military pre-training model; third, ranking recalled texts by similarity alone cannot meet the high-precision requirements of military scenarios, and the recall results must be processed further to obtain the best text answer.
Disclosure of Invention
The invention aims to: address the shortcomings of the prior art with a text semantic retrieval method for military scenarios. For massive military text data, a military question-answer semantic representation model is built to deeply characterize question semantics that are highly specialized and unevenly expressed, and a vector similarity retrieval scheme recalls the text sets strongly related to the question. On this basis, a text retrieval fine-ranking model quickly and accurately locates the required text data, improving both the recall ratio and the precision ratio of text retrieval from a semantic standpoint and thereby the user experience.
To solve the above technical problems, the invention discloses a text semantic retrieval method for military scenarios, divided into an offline part and an online part, as follows:
Offline part: step 1, construct the military pre-training model: build a military text corpus data set, and on that basis train a pre-training model applicable to military scenarios. Step 2, construct the dual semantic retrieval model offline, producing the question-answer language representation models and the military text data semantic vector library: build a dual semantic retrieval model on top of the military pre-training model, fine-tune it on the military semantic retrieval data set to obtain the question and answer language representation models, use the answer language representation model to compute the semantic vector representations of the military texts to be retrieved, and assemble the military text data semantic vector library; then build the two-level index of the military text base to be retrieved: over the semantic vector library, construct a two-level inverted index of semantic vectors with the K-means clustering algorithm. Step 3, construct the text retrieval fine-ranking model: traverse all questions in the military semantic retrieval data set, obtain the text set strongly related to each question, and build the military semantic retrieval fine-ranking data set; on top of the military pre-training model, build a multi-class fine-ranking model and fine-tune it on that data set to produce the text retrieval fine-ranking model.
Online part: step 4, for a real-time text retrieval task: (1) take the user data requirement as input and obtain the question semantic representation vector with the question language representation model; (2) recall the text set strongly related to the user requirement through vector similarity computation and retrieval; (3) use the text retrieval fine-ranking model to obtain the text answer that best satisfies the user requirement and feed it back to the user.
Further, the step 1 military pre-training model offline construction comprises the following steps:
step 1-1, collecting military raw corpus data;
step 1-2, preprocess the military raw corpus data by cleaning and converting redundant characters, stop words, and traditional-form characters; segment the cleaned data into words; and collect the vocabulary data of the semantic lexicons, synonym forests, related-word lexicons, and expanded lexicons of existing military information retrieval and intelligent question-answering systems to form a military vocabulary;
step 1-3, select a pre-training model from the natural language processing field; map and convert each military raw corpus with the military vocabulary, i.e., look up the index of each word and digitize the corpus, to form the military text corpus data set;
step 1-4, set the model training parameters and train the pre-training model on the military text corpus data set to form the military pre-training model.
Further, the offline construction of the dual semantic retrieval model in step 2 comprises the following steps:
step 2-1, collect military retrieval question-answer corpus data, digitize the text data through military vocabulary mapping conversion, and construct the military semantic retrieval data set;
step 2-2, construct the dual semantic retrieval model on top of the military pre-training model; after training, obtain the question and answer language representation models, which are respectively the left and right branch networks of the dual semantic retrieval model;
step 2-3, for the military data text set to be retrieved, generate the military text data semantic vector library with the answer language representation model;
step 2-4, construct a two-level inverted index over the military text data semantic vector library with a clustering algorithm.
Further, for the text retrieval task, step 2-1 constructs the military semantic retrieval data set mainly from the retrieval question-answer corpus data in the raw corpus data, and comprises the following steps:
step 2-1-1: take the question-answer pair data in the raw corpus as positive samples, with the label set to 1;
step 2-1-2: for each question in the positive samples, construct negative samples in two ways: (1) related but non-optimal answer samples: use the BM25 algorithm to retrieve from the raw corpus data the three texts most relevant to the question other than its answer, with similarities set to 0.8, 0.5 and 0.3 respectively; (2) unrelated samples: randomly draw two texts from the military text corpus, excluding the positive sample and the texts of way (1), or from other question-answer pairs, with the similarity set to 0;
step 2-1-3: map and convert the text data of the positive and negative sample sets with the military vocabulary, constructing the military semantic retrieval data set.
Further, step 2-2 constructs the dual semantic retrieval model on top of the military pre-training model and obtains the question-answer language representation models after model training, comprising the following steps:
step 2-2-1: for the military semantic retrieval task, construct two input branch networks, which form the dual semantic retrieval model; encode the data with the military pre-training model generated in step 1 to obtain the feature representations of questions and answers, then obtain the question-answer similarity by vector similarity computation;
step 2-2-2: fine-tune the dual semantic retrieval model on the military semantic retrieval data set; after training, the question and answer language representation models are obtained.
Further, step 2-4 constructs a two-level inverted index over the military text data semantic vector library with a clustering algorithm, comprising the following steps:
step 2-4-1: cluster the military text data semantic vector library with a K-means algorithm using Euclidean distance as the distance measure, with the number of initial cluster classes set to C; take the class number and class center vector corresponding to each semantic vector as the first-level index;
step 2-4-2: for each class of semantic vectors, cluster again with the Euclidean-distance K-means algorithm, with the number of sub-classes set to K, K >= 10; take the sub-class center vector corresponding to each semantic vector as the second-level index.
Further, the offline construction of the text retrieval fine-ranking model in step 3 comprises step 3-1: construct the military semantic retrieval fine-ranking data set, where step 3-1 comprises the following steps:
step 3-1-1: traverse each question in the military semantic retrieval data set and obtain its semantic vector representation with the question language representation model generated in step 2-2;
step 3-1-2: for each question semantic vector, use the two-level inverted index to quickly locate the Top-N text set strongly related to the question semantics in the military text data semantic vector library; N is the number of recalled texts; in a military scenario, if the user wants to retrieve M texts strongly related to the requirement, with M ranging from 1 to 5, the number of recalled texts is N = 10 × M;
step 3-1-3: for the N recall results of each question, if the answer text of the original question-answer pair appears among them, that answer text is the positive sample with label 1, and the other recall results are negative samples with label 0; if the correct answer text does not appear among the recall results, randomly delete one recall result, define the remaining results as negative samples, and add the correct answer text as the positive sample;
step 3-1-4: map and convert the positive and negative sample texts with the military vocabulary to construct the military semantic retrieval fine-ranking data set; each record comprises a question and N recalled texts, and the label is an N-dimensional 0-1 vector in which 1 marks the text answer that best matches the question.
Further, the offline construction of the text retrieval fine-ranking model in step 3 comprises step 3-2: construct the text retrieval fine-ranking model on top of the military pre-training model, where step 3-2 comprises the following steps:
step 3-2-1: construct the text retrieval fine-ranking model with the military pre-training model of step 1 as the backbone; the model takes the question and the N recalled texts as input, outputs an N-dimensional classification vector, and decides which potential answer best fits the question;
step 3-2-2: fine-tune the text retrieval fine-ranking model on the military semantic retrieval fine-ranking data set obtained in step 3-1.
Further, the text semantic retrieval for real-time tasks in step 4 comprises the following steps:
step 4-1: facing a real-time retrieval task, clean the question data, then obtain the question semantic vector representation with the question language representation model;
step 4-2: retrieve and recall the text set strongly related to the user requirement by vector similarity;
step 4-3: map and convert the question and the recall results of step 4-2 with the military vocabulary, take them as input, and use the text retrieval fine-ranking model to precisely locate the answer text satisfying the user requirement and feed it back to the user.
Further, step 4-2 retrieves the text set strongly related to the user requirement by vector similarity, comprising the following steps:
step 4-2-1: traverse the C class center vectors of step 2-4-1, compute the similarity between each class center and the question semantic vector with a vector dot product, sort by similarity, and select the Top(2 × M) classes as the next retrieval target;
step 4-2-2: for the 2 × M classes of step 4-2-1, traverse the K sub-classes of each and compute the similarity between each sub-class center and the question semantic vector with a vector dot product; sort the 2 × M × K sub-class center similarities from large to small and select the Top(4 × 10 × M) sub-classes as the retrieval text candidate set;
step 4-2-3: traverse every text semantic vector in the candidate set of step 4-2-2, compute its similarity to the question semantic vector with a vector dot product, sort from large to small, and select the Top(N) texts as the recalled text corpus.
The beneficial effects are as follows:
Compared with the prior art, the invention has these notable advantages. First, the recall ratio is significantly improved: traditional "hard" keyword-matching retrieval loses part of the relevant results when the requirement is vaguely described, whereas the semantic recall proposed here matches by vector similarity and retains those results.
Second, the precision ratio is significantly improved: to meet the need of military users to obtain exactly the required data, the text retrieval fine-ranking model built on the military pre-training model picks the text that best satisfies the user requirement out of the potential answers, greatly improving retrieval precision.
Third, the method splits into an offline system and an online system: the computation-heavy vectorization of military text data is completed offline, and the online system only has to vectorize the text requirement, run the vector similarity retrieval, and fine-rank the results; with the two-level inverted index, the similarity retrieval costs relatively little time, so user requests are answered promptly and retrieval efficiency improves.
Drawings
The foregoing and other advantages of the invention will become more apparent from the following detailed description taken in conjunction with the accompanying drawings.
FIG. 1 is the flow of the text semantic retrieval method in a military scenario;
FIG. 2 is the network structure of the BERT pre-training model;
FIG. 3 is the network structure of the dual semantic retrieval model;
FIG. 4 is the network structure of the multi-class fine-ranking model.
Detailed Description
Embodiments of the present invention will be described below with reference to the accompanying drawings.
The invention provides a text semantic retrieval method for military scenarios. Drawing on the strong semantic feature representation capability of pre-training models in natural language processing, it builds a dual semantic retrieval model and a text retrieval fine-ranking model, improving text retrieval capability in military scenarios. The method specifically comprises the following steps:
Step 1, offline construction of the military pre-training model: construct a military text corpus data set, select an open-source pre-training model, and train it on the military text corpus data set to form the military pre-training model, comprising the following steps:
Step 1-1: collect military raw corpus data: the military raw corpus data consist mainly of three parts: (1) text data such as military situation information, military documents, and historical archives collected from military business information systems; (2) military-domain data gathered from the Internet; (3) question-and-answer data collected from military information retrieval, intelligent question-answering, and dialogue systems, chiefly user search logs containing user requirements and the corresponding answers.
Step 1-2: preprocess the raw corpus data: clean redundant characters, stop words, traditional-form characters, and similar noise out of the military raw corpus data; segment the cleaned data into words; and merge the vocabulary data of the semantic lexicons, synonym lexicons, related-word lexicons, and expanded lexicons of existing military information retrieval and intelligent question-answering systems into a military vocabulary.
Step 1-3: construct the military pre-training model and the military text corpus data set: as shown in FIG. 2, this embodiment selects the BERT model as the pre-training model; the military raw corpus is preprocessed with the military vocabulary to form the military text corpus data set. BERT pre-training involves two tasks, a masked language model (MLM) and next sentence prediction (NSP); accordingly, two training data sets are built from the military text corpus data set: an NSP data set and an MLM data set.
Step 1-3-1: construct the next-sentence-prediction data set: 20 to 30 training samples are randomly excerpted from each military text corpus, each containing two consecutive sentences (sentence A and sentence B) that stand in a "next sentence" relation and serve as a positive sample; replacing sentence B of a training sample with an arbitrary other sentence from the raw corpus yields a "not next sentence" relation and thus a negative sample; together these constitute the NSP data set;
Step 1-3-2: construct the masked-language-model data set: part of the words of each military text corpus are masked at random, with the masking ratio set to 15%; the sentences before and after masking serve as the model's target output and input, forming the MLM data set. To mitigate the mismatch between pre-training and downstream tasks, the randomly selected words are handled as follows: 80% are replaced with the [MASK] token, 10% are replaced with random other words, and 10% are kept unchanged.
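The 80/10/10 masking strategy above can be made concrete with a minimal Python sketch (illustrative, not from the patent; the VOCAB placeholder stands for the military vocabulary of step 1-2):

```python
import random

MASK = "[MASK]"
VOCAB = ["..."]  # placeholder: substitute the military vocabulary of step 1-2

def mask_tokens(tokens, mask_ratio=0.15):
    """Apply BERT's 80/10/10 masking to a word-segmented sentence.

    Returns (masked_tokens, labels); labels hold the original word at masked
    positions and None elsewhere, serving as the MLM training target.
    """
    masked, labels = list(tokens), [None] * len(tokens)
    n_mask = max(1, int(len(tokens) * mask_ratio))
    for i in random.sample(range(len(tokens)), n_mask):
        labels[i] = tokens[i]
        r = random.random()
        if r < 0.8:                       # 80%: replace with [MASK]
            masked[i] = MASK
        elif r < 0.9:                     # 10%: replace with a random other word
            masked[i] = random.choice(VOCAB)
        # remaining 10%: keep the original word unchanged
    return masked, labels
```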
Step 1-4: train the military pre-training model: an open-source Chinese BERT model is obtained from the Internet as the initial pre-training model. Because open-source pre-training models are trained on public data sets, they do not fit military scenarios and must be retrained on the NSP and MLM data sets prepared in step 1-3. The training pseudocode of the military pre-training model is shown in Table 1, and the network structure of the pre-training model is shown in FIG. 2;
TABLE 1 Pre-training model training process pseudocode (provided only as images in the original filing and not reproduced here)
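Since the pseudocode of Table 1 survives only as an image, the following is a minimal sketch of such a pre-training loop, written against the HuggingFace transformers API with the hyperparameters quoted below (Adam, learning rate 10e-5, dropout 0.2); the dataset argument and the checkpoint name are assumptions, not part of the patent:

```python
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import BertForPreTraining

def pretrain_military_bert(dataset: Dataset, epochs: int = 100):
    """Retrain an open-source Chinese BERT on the military NSP + MLM data sets.

    `dataset` is assumed to yield dicts of tensors with keys input_ids,
    token_type_ids, attention_mask, labels (MLM targets) and
    next_sentence_label, built as in steps 1-3-1 and 1-3-2.
    """
    # "bert-base-chinese" is a stand-in; the patent uses a Large configuration.
    model = BertForPreTraining.from_pretrained("bert-base-chinese",
                                               hidden_dropout_prob=0.2)
    optimizer = torch.optim.Adam(model.parameters(), lr=10e-5)  # as stated (= 1e-4)
    loader = DataLoader(dataset, batch_size=32, shuffle=True)
    model.train()
    for _ in range(epochs):              # watch the loss and stop once it converges
        for batch in loader:
            loss = model(**batch).loss   # sum of the MLM and NSP losses
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return model
```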
Model input representation: text is vectorized at the BERT input layer. Each training corpus from step 1-3 is processed as follows: a [CLS] token is added at the head and a [SEP] token at the tail of the corpus before the sentence is fed into the model for vectorized representation. The maximum corpus length is set to 512. The input representation is the sum of the word vector, the text (segment) vector, and the position vector, each of dimension 768. The word vector is obtained by looking up the word's id (index) in the vocabulary of step 1-2 and converting each word through a randomly initialized word vector table; the text vector marks whether the input belongs to the same sentence: words of the first sentence are labeled 0 and, if the text contains two sentences, each word of the second sentence is labeled 1; the position vector is a randomly initialized vector indexed by each word's position in the text.
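A small sketch of this input encoding (illustrative; it follows the standard BERT convention of a [SEP] after each sentence, and the helper name is not from the patent):

```python
def encode_input(sent_a, sent_b, vocab, max_len=512):
    """Turn a (sentence A, sentence B) pair into BERT input ids and segment ids.

    vocab: the military vocabulary of step 1-2, mapping word -> id; [CLS],
    [SEP] and [UNK] entries are assumed to exist in it. sent_b may be None.
    """
    tokens = ["[CLS]"] + sent_a + ["[SEP]"]
    tokens += (sent_b + ["[SEP]"]) if sent_b else []
    tokens = tokens[:max_len]                        # maximum corpus length 512
    input_ids = [vocab.get(t, vocab["[UNK]"]) for t in tokens]
    boundary = len(sent_a) + 2                       # [CLS] + sentence A + [SEP]
    segment_ids = [0 if i < boundary else 1 for i in range(len(tokens))]
    return input_ids, segment_ids
```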
Model structure selection: the invention selects the BERT-Large network model as the initial pre-training model, with 24 Transformer block (Trm) layers, a hidden dimension of 1024, 16 self-attention heads, and roughly 340M parameters in total.
Model output representation: in the BERT pre-training model, input and output positions correspond one to one; the position corresponding to [CLS] carries the sentence-level semantic relation feature, and every other position carries the word-level semantic feature of its word. For the two BERT pre-training tasks, next sentence prediction and the masked language model, the following two output formats are designed.
Next-sentence-prediction task output: NSP is a simple classification problem; considering the sentence-level semantic relation feature, the hidden vector at the [CLS] position (dimension 1024) is passed through a [1024, 2] linear layer and a softmax function to predict the relation between the two sentences.
Masked-language-model task output: the MLM task predicts the word at each position of the sentence, which can be understood as a multi-class classification problem: the hidden vector at each word position passes through a [1024, 768] fully connected layer and is multiplied by the transpose of the word embedding matrix; a softmax then yields a vector whose dimension equals the vocabulary size, giving the probability of each word.
Model training: an Adam optimizer is selected to train the BERT model, with the maximum number of training epochs set to 100, the learning rate set to 10e-5, and the dropout probability set to 0.2. The convergence of the loss value is observed during training and the learning rate adjusted as needed; the model is saved once the loss converges.
This completes the training of the military pre-training model.
Step 2, offline construction of the dual semantic retrieval model: construct the military semantic retrieval data set; build the dual semantic retrieval model on top of the military pre-training model and fine-tune it on the military semantic retrieval data set, generating the question-answer language representation models, namely a question language representation model and an answer language representation model; collect the military data text set to be retrieved, i.e., the military data texts gathered by the user according to business needs; for this text set, generate the military text data semantic vector library offline with the answer language representation model, and build a two-level inverted index with a clustering algorithm to improve retrieval efficiency.
Step 2-1: construct the military semantic retrieval data set. Step 2-1-1: select question-answer pair data from the raw corpus data as positive samples, with the similarity set to 1. Step 2-1-2: to let the model distinguish the positive sample from negative samples whose answers are highly similar, and thus improve robustness, negative samples are built in two ways: (1) related but non-optimal answer samples: use the BM25 algorithm to retrieve from the raw corpus the three texts most relevant to the question other than its answer, with similarities set to 0.8, 0.5 and 0.3 respectively; (2) unrelated samples: randomly draw two texts from the raw corpus data (excluding the positive sample and the texts of way (1)) or from other question-answer pairs, with the similarity set to 0. Step 2-1-3: for the positive and negative question-answer sample sets, map and convert the text data with the military-context vocabulary obtained in step 1-2, constructing the military semantic retrieval data set.
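A minimal sketch of this sampling scheme, assuming the rank_bm25 package for the BM25 step and pre-segmented texts (helper names are illustrative, not from the patent):

```python
import random
from rank_bm25 import BM25Okapi

def build_retrieval_dataset(qa_pairs, corpus):
    """qa_pairs: list of (question_tokens, answer_tokens); corpus: token lists.

    Returns (question, text, similarity) triples with the 1 / 0.8 / 0.5 / 0.3 / 0
    labels of steps 2-1-1 and 2-1-2.
    """
    bm25 = BM25Okapi(corpus)
    samples = []
    for question, answer in qa_pairs:
        samples.append((question, answer, 1.0))            # positive sample
        ranked = bm25.get_top_n(question, corpus, n=4)
        hard = [d for d in ranked if d != answer][:3]      # top-3 related non-answers
        for doc, sim in zip(hard, (0.8, 0.5, 0.3)):
            samples.append((question, doc, sim))
        pool = [d for d in corpus if d != answer and d not in hard]
        for doc in random.sample(pool, 2):                 # two unrelated samples
            samples.append((question, doc, 0.0))
    return samples
```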
Step 2-2: build the dual semantic retrieval model. Step 2-2-1: construct two input branch networks for the military semantic retrieval task, forming the dual semantic retrieval model: encode the question-answer pairs of the military semantic retrieval data set with the military pre-training model of step 1 to obtain the question and answer feature representations, then obtain the question-answer similarity by vector similarity computation; vector dot product, Euclidean distance, cosine distance and the like may serve as the similarity measure. In this embodiment, the inner product of the question and answer semantic vectors is passed through a Sigmoid function to obtain the question-answer similarity. The network model, shown in FIG. 3, divides into three parts.
(1) Question language representation model: the question is characterized by the main body network of the military pre-training model (the model without its input and output layers). The input preprocessing is the same as in step 1-4, and the word vectors of the question, of dimension 768, are the input of the question language representation model.
(2) Answer language representation model: the candidate answer is characterized by the main body network of the military pre-training model (the model without its input and output layers). The input preprocessing is the same as in step 1-4, and the word vectors of the answer, of dimension 768, are the input of the answer language representation model.
(3) Similarity computation: the sentence-level feature representations of the question and answer (the output vectors at the [CLS] position of each language representation model) are selected, and the question-answer similarity is obtained with a vector dot product.
Step 2-2-2: build the question-answer language representation models: because the training tasks differ, the pre-training model cannot be used directly as a question or answer language representation model. To improve semantic retrieval accuracy, the dual semantic retrieval model must be fine-tuned on the military semantic retrieval data set. An Adam optimizer is selected for fine-tuning, with the number of training epochs set to 5, the learning rate set to 10e-5, and the dropout probability set to 0.5. After training completes, the question-answer language representation models are obtained.
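The dual (two-branch) model of FIG. 3 can be sketched as follows; this is an illustration under assumptions, with a stand-in checkpoint name, whereas the patent's own backbone is the military BERT of step 1:

```python
import torch
import torch.nn as nn
from transformers import BertModel

class DualSemanticRetrievalModel(nn.Module):
    """Question branch and answer branch over a shared-architecture BERT.

    The [CLS] vectors of the two branches are compared with a dot product
    squashed by a Sigmoid, as in step 2-2-1.
    """
    def __init__(self, backbone="bert-base-chinese"):  # stand-in for the military BERT
        super().__init__()
        self.q_encoder = BertModel.from_pretrained(backbone)
        self.a_encoder = BertModel.from_pretrained(backbone)

    def encode(self, encoder, ids, mask):
        # sentence-level feature: hidden state at the [CLS] position
        return encoder(input_ids=ids, attention_mask=mask).last_hidden_state[:, 0]

    def forward(self, q_ids, q_mask, a_ids, a_mask):
        q = self.encode(self.q_encoder, q_ids, q_mask)
        a = self.encode(self.a_encoder, a_ids, a_mask)
        return torch.sigmoid((q * a).sum(dim=-1))       # similarity in (0, 1)

# Fine-tuning regresses the predicted similarity onto the 1 / 0.8 / 0.5 / 0.3 / 0
# labels of the military semantic retrieval data set, e.g. with nn.MSELoss().
```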
Step 2-3: vector representation of the military data text set to be retrieved: preprocess each corpus in the text set with the same input preprocessing as in step 1-4, then characterize each corpus with the answer language representation model of step 2-2 to obtain the military text data semantic vector library.
Step 2-4: construct the two-level inverted index of the texts to be retrieved: apply K-means clustering analysis to the military text data semantic vector library to build a two-level inverted index and improve vector retrieval efficiency. The K-means algorithm is as follows:
TABLE 2 K-means algorithm pseudocode (provided only as images in the original filing and not reproduced here)
Step 2-4-1: cluster the military text data semantic vector library with a K-means algorithm using Euclidean distance as the distance measure; for the millions of military corpora to be retrieved, the number of clusters C is set to 1000. The class number and class center vector corresponding to each semantic vector serve as the first-level index;
Step 2-4-2: for each class of semantic vectors, cluster again with the Euclidean-distance K-means algorithm, with the number of sub-classes set to K, K >= 10. The sub-class center vector corresponding to each semantic vector serves as the second-level index.
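A sketch of the two-level index construction, assuming scikit-learn's KMeans for both clustering passes (function and variable names are illustrative, not from the patent):

```python
import numpy as np
from sklearn.cluster import KMeans

def build_two_level_index(vectors: np.ndarray, C: int = 1000, K: int = 10):
    """Build the two-level inverted index of steps 2-4-1 / 2-4-2.

    vectors: (n_texts, dim) semantic vectors from the answer representation
    model. Returns the C coarse centroids, the per-class sub-centroids, and,
    for each (class, sub-class) bucket, the text ids it contains.
    """
    coarse = KMeans(n_clusters=C, n_init=10).fit(vectors)   # Euclidean K-means
    sub_centroids, buckets = {}, {}
    for c in range(C):
        ids = np.flatnonzero(coarse.labels_ == c)
        if len(ids) == 0:
            continue
        km = KMeans(n_clusters=min(K, len(ids)), n_init=10).fit(vectors[ids])
        sub_centroids[c] = km.cluster_centers_
        for s in range(len(km.cluster_centers_)):
            buckets[(c, s)] = ids[km.labels_ == s]          # second-level postings
    return coarse.cluster_centers_, sub_centroids, buckets
```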
Step 3, offline construction of the text retrieval fine-ranking model: construct the military semantic retrieval fine-ranking data set, build a multi-class fine-ranking model on top of the military pre-training model, fine-tune it on the fine-ranking data set, and generate the text retrieval fine-ranking model.
Step 3-1: construct the military semantic retrieval fine-ranking data set.
Step 3-1-1: traverse each question in the military semantic retrieval data set and obtain its semantic vector representation with the question language representation model generated in step 2-2;
Step 3-1-2: for each question semantic vector, use the two-level inverted index to quickly locate the Top-N text set strongly related to the question semantics in the military text data semantic vector library. N, the number of recalled texts, depends on the application scenario: in the military scenario, if the user wants to retrieve M texts strongly related to the requirement, with M ranging from 1 to 5, then the number of recalled texts is N = 10 × M. The specific steps are:
Step 3-1-2-1: traverse the C class center vectors of step 2-4-1, compute the similarity between each class center and the question vector with a vector dot product, and select the Top(2 × M) classes by similarity as the next retrieval target; M is the number of texts the user wants, and the Top(2 × M) classes essentially cover the strongly related texts the user wants to retrieve.
Step 3-1-2-2: for the 2 × M classes of step 3-1-2-1, traverse the K sub-classes of each class from step 2-4-2 and compute the similarity between each sub-class center and the question vector with a vector dot product. Sort the 2 × M × K sub-class center similarities from large to small and select the Top(4 × 10 × M) sub-classes as the retrieval text candidate set.
Step 3-1-2-3: traverse the texts of the 4 × 10 × M sub-classes of step 3-1-2-2, compute the similarity between each vector and the question vector with a vector dot product, sort from large to small, and select the Top(N) texts as the recalled text corpus, where N = 10 × M.
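The coarse-to-fine lookup just described, as a sketch operating on the index parts produced by build_two_level_index above (names are illustrative; the same routine serves the online step 4-2):

```python
import numpy as np

def two_level_search(q_vec, centroids, sub_centroids, buckets, vectors, M=5):
    """Recall of steps 3-1-2-1 to 3-1-2-3: returns Top-N text ids, N = 10 * M."""
    top_classes = np.argsort(centroids @ q_vec)[::-1][:2 * M]   # Top(2M) coarse classes
    cand = [(c, s, float(sub_centroids[c][s] @ q_vec))          # rank 2M*K sub-classes
            for c in top_classes if c in sub_centroids
            for s in range(len(sub_centroids[c]))]
    cand.sort(key=lambda t: -t[2])
    ids = np.concatenate([buckets[(c, s)] for c, s, _ in cand[:4 * 10 * M]])
    scores = vectors[ids] @ q_vec                               # exact dot products
    return ids[np.argsort(scores)[::-1][:10 * M]]               # Top(N)
```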
Step 3-1-3: for the N recall results of each question, if the answer text of the original question-answer pair appears among the recall results, that answer text is the positive sample with label 1, and the other recall results are negative samples with label 0; if the correct answer text does not appear among the recall results, randomly delete one recall result, define the remaining results as negative samples, and add the correct answer text as the positive sample.
Step 3-1-4: map and convert the positive and negative sample texts with the military vocabulary to construct the military semantic retrieval fine-ranking data set; each record comprises a question and N recalled texts, and the label is an N-dimensional "0-1" vector in which "1" marks the text answer that best matches the question.
Step 3-2: build the text retrieval fine-ranking model: on top of the military pre-training model of step 1, construct a multi-class text fine-ranking model; the network structure is shown in FIG. 4. Specifically:
(1) Model input: the model takes the question and the potential-answer corpus (the N recalled texts) as input. Each question-and-answer set is preprocessed as follows: (1) concatenate the 1 + N texts into one data corpus, adding a [CLS] token at the head of the question and a [SEP] token at the tail of each corpus during splicing; (2) digitize the input corpus with the military vocabulary mapping generated in step 1-2, i.e., look up each word's id (index) in the vocabulary and convert each word into its vector representation through the randomly initialized word vector table, as the model input.
(2) Main network structure: the main body network of the military pre-training model (the model without its input and output layers) serves as the main network of the text retrieval fine-ranking model.
(3) Model output layer: a fully connected classification network is built at the output position corresponding to [CLS]: two fully connected layers, [1024, 1024] and [1024, N], perform feature extraction and dimension reduction, and a softmax function selects the best-matching answer.
Because the training task differs from pre-training, the model must be fine-tuned on the military semantic retrieval fine-ranking data set. An Adam optimizer is selected for fine-tuning the multi-class text fine-ranking model, with the number of training epochs set to 5, the learning rate set to 10e-5, and the dropout probability set to 0.2. After training completes, the text retrieval fine-ranking model is generated.
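A sketch of the fine-ranking network just described (illustrative; the ReLU between the two fully connected layers and the checkpoint name are assumptions, since the patent specifies only the layer shapes):

```python
import torch.nn as nn
from transformers import BertModel

class FineRankingModel(nn.Module):
    """Multi-class fine-ranking model of FIG. 4: the concatenated question +
    N recalled texts go in, N-way logits come out; the [CLS] hidden vector
    feeds two fully connected layers of shapes [hidden, hidden] and [hidden, N]."""
    def __init__(self, n_candidates, backbone="bert-base-chinese"):  # stand-in
        super().__init__()
        self.bert = BertModel.from_pretrained(backbone)
        hidden = self.bert.config.hidden_size      # 1024 for the patent's BERT-Large
        self.head = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),  # ReLU is an assumption
            nn.Linear(hidden, n_candidates),
        )

    def forward(self, input_ids, attention_mask):
        cls = self.bert(input_ids=input_ids,
                        attention_mask=attention_mask).last_hidden_state[:, 0]
        return self.head(cls)   # apply softmax, or train with nn.CrossEntropyLoss
```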
Step 4, text semantic retrieval for real-time tasks: take the user data requirement as input; first obtain the question semantic representation vector with the question language representation model generated in step 2; then obtain the text set strongly related to the user requirement through vector similarity computation and retrieval; finally obtain the text answer that best meets the requirement with the text retrieval fine-ranking model of step 3 and feed it back to the user. The specific steps are:
Step 4-1: for the user's data requirement, clean the question data, then obtain the question semantic vector representation with the question language representation model generated in step 2-2;
Step 4-2: retrieve and recall the text set strongly related to the user requirement by vector similarity;
Step 4-2-1: traverse the C class center vectors of step 2-4-1, compute the similarity between each class center and the question vector with a vector dot product, sort by similarity, and select the Top(2 × M) classes as the next retrieval target; M is the number of texts the user wants, and the Top(2 × M) classes essentially contain the strongly related texts the user wants to retrieve.
Step 4-2-2: for the 2 × M classes of step 4-2-1, traverse the K sub-classes of each and compute the similarity between each sub-class center and the question vector with a vector dot product. Sort the 2 × M × K sub-class center similarities from large to small and select the Top(4 × 10 × M) sub-classes as the retrieval text candidate set.
Step 4-2-3: traverse every text semantic vector in the candidate set of step 4-2-2, compute its similarity to the question vector with a vector dot product, sort from large to small, and select the Top(N) texts as the recalled text corpus, where N = 10 × M.
Step 4-3: map and convert the question and the recall results of step 4-2-3 with the military vocabulary, take them as input, use the text retrieval fine-ranking model to precisely locate the answer text, and feed it back to the user.
As shown in FIG. 1, for a user's real-time data requirement: first, the semantic vector representation of the requirement is obtained through the question language representation model; second, vector similarity computation and retrieval recall the text set strongly related to the user requirement from the semantic vector library of the military texts to be retrieved; then the text retrieval fine-ranking model picks the best text answer; finally, the corresponding original text data are fed back to the user.
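Wiring the earlier sketches together, the online flow of FIG. 1 might look like this (every helper here is one of the illustrative functions defined above or, in the case of build_rerank_input, a hypothetical stand-in for the input assembly of step 3-2):

```python
import torch

def retrieve(question_tokens, vocab, dual_model, rerank_model,
             centroids, sub_centroids, buckets, vectors, raw_texts, M=3):
    """End-to-end online flow: embed the question, recall Top-N, fine-rank."""
    ids, _ = encode_input(question_tokens, None, vocab)
    q_ids = torch.tensor([ids])
    mask = torch.ones_like(q_ids)
    with torch.no_grad():
        q_vec = dual_model.encode(dual_model.q_encoder, q_ids, mask)[0].numpy()
        cand_ids = two_level_search(q_vec, centroids, sub_centroids,
                                    buckets, vectors, M=M)      # Top-N recall
        # build_rerank_input (hypothetical) concatenates question + candidates
        # into the [CLS] ... [SEP] sequence of step 3-2 and digitizes it.
        logits = rerank_model(*build_rerank_input(
            question_tokens, [raw_texts[i] for i in cand_ids]))
        best = cand_ids[int(logits.argmax())]                   # best-matching answer
    return raw_texts[best]
```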
The principle of this embodiment is as follows:
The embodiment draws fully on the strong language-learning pre-training models of the natural language processing field to design a text semantic retrieval method for military scenarios. First, on the basis of the pre-training model's representation of question semantic features, vector similarity computation and retrieval raise the recall ratio of military text retrieval at the semantic level; second, the text retrieval fine-ranking model precisely locates the answers users need, improving the precision ratio of military text retrieval and thereby the user experience.
Specifically: offline part, step 1, construction of the military pre-training model: the BERT model is selected as the initial pre-training model, and the raw corpus is obtained by collecting text data from information systems, question-answering systems, dialogue systems, historical archive libraries and the like in the military domain. A military text corpus data set is built for BERT's masked language model and next sentence prediction tasks, and model training yields the military pre-training model, a general-purpose model for military-domain text processing. Step 2, generation of the question-answer language representation models and the text data semantic vector library: the dual semantic retrieval model is built on top of the military pre-training model and trained on the military semantic retrieval data set to generate the question-answer language representation models, which fully characterize question-answer semantic features and raise the recall ratio of military text retrieval at the semantic level; construction of the two-level index of the text base to be retrieved: the K-means algorithm clusters the semantic vector library twice, generating the two-level inverted index and speeding up vector-based text recall. Step 3, construction of the text retrieval fine-ranking model: built on the military pre-training model, it selects the text that best satisfies the user requirement from the recalled texts, improving the precision ratio of military text retrieval.
In summary, the invention provides a text semantic retrieval method for military scenarios. To take full account of textual semantic features and avoid poor retrieval when user requirements are vaguely described, the method makes full use of a pre-training model to capture semantic-level features between question-answer pairs, and through vector similarity computation and search helps military users of all kinds locate the required data precisely within massive text data, improving the user experience.
The main contributions are as follows: first, a construction method for a military pre-training model. The raw corpus is obtained by collecting text data from military information systems and military-domain sources on the Internet. Training on top of an open-source pre-trained language model yields the military pre-training model, a general-purpose model usable for downstream military natural language processing tasks such as event classification, event extraction, sentiment extraction, and intelligent question answering.
Second, a construction method for the dual semantic retrieval model for military text retrieval: question and answer language models are built separately on the military pre-training model to obtain question-answer sentence-level features, and feature-vector similarity computation recalls the Top-N text set strongly related to the question. By fine-tuning on the military semantic retrieval data set, the question-answer language representation models are finally obtained, and the text corpus vector library can be computed offline with the answer language representation model.
Third, to improve retrieval efficiency over massive military text data, a two-level inverted index construction method: the K-means algorithm clusters the semantic vector library to be retrieved twice, and the two-level inverted index built from the result greatly improves text retrieval efficiency.
Fourth, to meet the need to locate text answers precisely in military scenarios, a multi-class fine-ranking model construction scheme: with a pre-training model as the main body, the model takes the question and the potential answer set as input and outputs the probability that each potential answer is the best answer. Fine-tuning on the military semantic retrieval fine-ranking data set yields the fine-ranking model. Facing a real-time search task, the question language representation model produces the question feature vector, similarity computation and retrieval generate the potential answer set, and the text retrieval fine-ranking model then locates the information the user needs in real time, improving the efficiency of user information retrieval.
In a specific implementation, the application also provides a computer storage medium and a corresponding data processing unit. The computer storage medium can store a computer program which, when executed by the data processing unit, can run the inventive content and some or all of the steps of each embodiment of the text semantic retrieval method for military scenarios. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.
It will be apparent to those skilled in the art that the technical solutions in the embodiments of the present invention may be implemented by means of a computer program and a corresponding general-purpose hardware platform. On this understanding, the technical solutions may be embodied essentially as a computer program, i.e., a software product, which may be stored in a storage medium and include several instructions that cause a device containing a data processing unit (a personal computer, server, single-chip microcomputer (MCU), network device, or the like) to perform the methods described in the embodiments or parts of the embodiments of the present invention.
The invention provides a text semantic retrieval method for military scenarios, and there are many ways to implement the technical scheme; the above is only a specific embodiment of the invention. It should be pointed out that those of ordinary skill in the art can make several improvements and modifications without departing from the principle of the invention, and these are also regarded as within the protection scope of the invention. Components not explicitly described in this embodiment can be implemented with existing technology.

Claims (10)

1. The text semantic retrieval method under the military scene is characterized by comprising the following steps:
step 1, offline construction of a military pre-training model: constructing a military text corpus data set; an open source pre-training model is selected for training in a military text corpus data set to form a military pre-training model;
step 2, offline construction of a dual semantic retrieval model: constructing a military semantic retrieval data set; based on a military pre-training model, constructing a dual-type semantic retrieval model, training fine adjustment on a military semantic retrieval data set, and generating a question-answer pair language representation model, wherein the question-answer pair language representation model comprises a question language representation model and an answer language representation model; collecting a military data text set to be searched, generating a military text data semantic vector library offline by using an answer language representation model aiming at the military data text set to be searched, and constructing a secondary inverted index by using a clustering algorithm;
step 3, offline construction of a text retrieval precision model: constructing a military semantic retrieval fine-ranking data set, constructing a multi-classification fine-ranking model based on a military pre-training model, training and fine-tuning the military semantic retrieval fine-ranking data set, and generating a text retrieval fine-ranking model;
step 4, text semantic retrieval for real-time tasks: inputting the user's data requirement; first using the question language representation model generated in step 2 to obtain the question semantic representation vector; then obtaining a text set strongly related to the user requirement through vector similarity calculation and retrieval; and finally using the text retrieval fine-ranking model of step 3 to obtain a text answer meeting the requirement and feeding it back to the user (the offline stages of this claim are sketched below).
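For orientation, the three offline stages of this claim can be outlined as one function. Every argument is a caller-supplied callable standing in for a component whose concrete training routine the claim does not specify; all names are hypothetical.

def build_offline_assets(pretrain, finetune_dual, encode_corpus,
                         build_two_level_index, finetune_reranker,
                         corpus, qa_data, docs_to_search):
    plm = pretrain(corpus)                          # step 1: military pre-training model
    q_enc, a_enc = finetune_dual(plm, qa_data)      # step 2: question/answer encoders
    index = build_two_level_index(encode_corpus(a_enc, docs_to_search))
    reranker = finetune_reranker(plm, qa_data)      # step 3: fine-ranking model
    return q_enc, index, reranker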
2. The text semantic retrieval method under a military scene according to claim 1, wherein the step 1 offline construction of the military pre-training model comprises the following steps:
step 1-1, collecting military raw corpus data;
step 1-2, preprocessing the military raw corpus data by cleaning and converting redundant characters, stop words and traditional-form characters; performing word segmentation on the preprocessed data; and collecting vocabulary data from the semantic vocabulary, synonym forest, related-word list and expanded vocabulary of existing military information retrieval and intelligent question-answering systems to form a military vocabulary list;
step 1-3, selecting a pre-training model in the natural language processing field; for each military raw corpus text, performing mapping conversion with the military vocabulary list, namely looking up the position corresponding to each word and digitizing the text into a position sequence, to form the military text corpus data set (a toy mapping sketch follows this claim);
and step 1-4, setting model training parameters and training the pre-training model on the military text corpus data set to form the military pre-training model.
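A toy illustration of the mapping conversion of step 1-3, as flagged above: each segmented word is replaced by its position in the military vocabulary list, and out-of-vocabulary words fall back to an unknown token. The five-entry vocabulary is invented for the example.

vocab = {"[PAD]": 0, "[UNK]": 1, "雷达": 2, "目标": 3, "探测": 4}

def to_position_sequence(tokens):
    # Look up each word's position; unknown words map to [UNK].
    return [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]

print(to_position_sequence(["雷达", "探测", "目标"]))  # -> [2, 4, 3]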
3. The text semantic retrieval method under a military scene according to claim 2, wherein the step 2 offline construction of the dual semantic retrieval model comprises the following steps:
step 2-1, collecting military retrieval question-answer corpus data, digitizing the text data through military vocabulary mapping conversion, and constructing the military semantic retrieval data set;
step 2-2, constructing a dual semantic retrieval model based on the military pre-training model, and obtaining the question-answer pair language representation models after training, the two language representation models being respectively the left-branch and right-branch network models of the dual semantic retrieval model;
step 2-3, for the military data text set to be searched, generating the military text data semantic vector library using the answer language representation model (see the sketch after this claim);
and step 2-4, constructing a two-level inverted index with a clustering algorithm for the military text data semantic vector library.
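A minimal sketch of step 2-3's offline vector library generation, as flagged above. The byte-count projection below is a crude stand-in for the answer language representation model, used only so the snippet runs without a trained network; a real system would encode each text with the fine-tuned model.

import numpy as np

rng = np.random.default_rng(0)
PROJ = rng.standard_normal((256, 64)) / 16.0        # fixed toy projection matrix

def toy_answer_encode(text):
    # Stand-in encoder: project UTF-8 byte counts to a 64-dimensional vector.
    counts = np.zeros(256)
    for b in text.encode("utf-8"):
        counts[b] += 1.0
    return counts @ PROJ

docs = ["radar detection range report", "logistics supply schedule"]
vector_library = np.stack([toy_answer_encode(d) for d in docs])  # (num_docs, 64)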
4. The text semantic retrieval method under a military scene according to claim 3, wherein step 2-1 constructs the military semantic retrieval data set from the retrieval question-answer corpus data in the raw corpus data for the text retrieval task, and comprises the following steps:
step 2-1-1: setting the question-answer pair data in the raw corpus as positive samples, with the label set to 1;
step 2-1-2: for each question in the positive samples, constructing negative samples in two ways (see the BM25 sketch after this claim): (1) related but non-optimal answer samples: using the BM25 algorithm to retrieve, from the raw corpus data, the three texts most relevant to the question other than its answer, with the similarity labels set to 0.8, 0.5 and 0.3 respectively; (2) unrelated samples: randomly selecting two pieces of text data from the military text corpus, excluding the positive sample, the samples of way (1), and other question-answer pairs, with the similarity label set to 0;
step 2-1-3: mapping and converting the text data in the positive and negative sample sets with the military vocabulary list to construct the military semantic retrieval data set.
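A hedged sketch of the way-(1) hard negatives of step 2-1-2, using the open-source rank_bm25 package (a choice assumed here; the claim only names the BM25 algorithm). The four-document corpus is invented; the 0.8/0.5/0.3 labels follow the claim.

from rank_bm25 import BM25Okapi

corpus = ["radar detection range", "supply convoy schedule",
          "radar maintenance manual", "air defense radar siting"]
bm25 = BM25Okapi([doc.split() for doc in corpus])

def hard_negatives(question, gold_answer):
    # Top-3 BM25 hits other than the gold answer, labeled 0.8/0.5/0.3.
    scores = bm25.get_scores(question.split())
    ranked = sorted(range(len(corpus)), key=lambda i: -scores[i])
    hits = [corpus[i] for i in ranked if corpus[i] != gold_answer][:3]
    return list(zip(hits, [0.8, 0.5, 0.3]))

print(hard_negatives("radar detection", "radar detection range"))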
5. The text semantic retrieval method under a military scene according to claim 4, wherein step 2-2 constructs the dual semantic retrieval model based on the military pre-training model and obtains the question-answer pair language representation model after model training, and comprises the following steps:
step 2-2-1: for the military semantic retrieval task, constructing two input branch networks to form the dual semantic retrieval model; encoding the data with the military pre-training model generated in step 1 to obtain feature representations of questions and answers, and then obtaining the similarity between questions and answers through vector similarity calculation;
step 2-2-2: fine-tuning the dual semantic retrieval model on the military semantic retrieval data set, and obtaining the question-answer pair language representation model after training (a training-objective sketch follows this claim).
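One plausible training objective for step 2-2-2, sketched with toy tensors: regress the cosine similarity of the two towers' pooled outputs toward the graded labels of claim 4 (1.0, 0.8, 0.5, 0.3, 0.0). The claim does not name a loss function, so mean-squared error is an assumption here.

import torch
import torch.nn.functional as F

# Toy pooled outputs of the question and answer towers (5 pairs, dim 16);
# in the real model these come from the military pre-training model branches.
q_vecs = torch.randn(5, 16, requires_grad=True)
a_vecs = torch.randn(5, 16)

targets = torch.tensor([1.0, 0.8, 0.5, 0.3, 0.0])   # graded similarity labels
cos = F.cosine_similarity(q_vecs, a_vecs, dim=-1)   # one score per pair
loss = F.mse_loss(cos, targets)
loss.backward()                                     # gradients flow into the towers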
6. The text semantic retrieval method under a military scene according to claim 5, wherein step 2-4 constructs the two-level inverted index with a clustering algorithm for the military text data semantic vector library, and comprises the following steps:
step 2-4-1: performing preliminary clustering on the military text data semantic vector library with the K-means algorithm using Euclidean distance as the distance measure, with the number of initial clusters set to C; using the class serial number and class center point vector corresponding to each semantic vector as the primary index;
step 2-4-2: for each class of semantic vectors, performing clustering again within the class with the K-means algorithm using Euclidean distance, with the number of subdivision categories set to K, K being not less than 10; and using the subclass center point vector corresponding to each semantic vector as the secondary index (a runnable sketch follows this claim).
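A runnable sketch of steps 2-4-1 and 2-4-2, as flagged above, using scikit-learn's K-means (Euclidean distance by default). The random vector library and the values of C and K are illustrative only.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
vectors = rng.standard_normal((1000, 64)).astype(np.float32)  # toy vector library

C, K = 8, 10   # initial cluster count and subdivision count (claim: K >= 10)
top = KMeans(n_clusters=C, n_init=10, random_state=0).fit(vectors)

index = {}     # class id -> (sub-level KMeans model, member vector ids)
for c in range(C):
    member_ids = np.where(top.labels_ == c)[0]
    sub = KMeans(n_clusters=min(K, len(member_ids)), n_init=10,
                 random_state=0).fit(vectors[member_ids])
    index[c] = (sub, member_ids)
# Primary index: top.labels_ and top.cluster_centers_;
# secondary index: each sub.cluster_centers_ within its class.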
7. The text semantic retrieval method under a military scene according to claim 6, wherein the step 3 offline construction of the text retrieval fine-ranking model comprises step 3-1: constructing the military semantic retrieval fine-ranking data set, and step 3-1 comprises the following steps:
step 3-1-1: traversing each question in the military semantic retrieval data set, and obtaining its semantic vector representation with the question language representation model generated in step 2-2;
step 3-1-2: for each question semantic vector representation, using the two-level inverted index to quickly locate, in the military text data semantic vector library, the Top-N text set strongly related to the question semantics, N being the number of recalled texts; in a military scene, if the user wants to retrieve M texts strongly related to the requirement, M ranging from 1 to 5, the number of recalled texts is N = 10×M;
step 3-1-3: for the N recall results of each question, if the answer text of the original question-answer pair is among the recall results, that answer text is the positive sample with label 1, and the other recall results are negative samples with label 0; if the correct answer text is not among the recall results, randomly deleting one recall result, defining the remaining recall results as negative samples, and inserting the correct answer text as the positive sample;
step 3-1-4: mapping and converting the positive and negative sample text data with the military vocabulary list to construct the military semantic retrieval fine-ranking data set, in which each item comprises a question and N recalled texts, the label is an N-dimensional 0/1 vector, and the 1 marks the text answer best matching the question (a labeling sketch follows this claim).
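A small sketch of the labeling rules of steps 3-1-3 and 3-1-4, as flagged above: an N-dimensional 0/1 label vector with a single 1 on the best-matching text, with the gold answer swapped in for a randomly deleted recall when it was not recalled. The function name is invented.

import random

def build_rerank_example(question, gold_answer, recalls):
    # Returns (question, texts, labels); labels is the N-dimensional 0/1 vector.
    texts = list(recalls)
    if gold_answer in texts:
        pos = texts.index(gold_answer)
    else:
        pos = random.randrange(len(texts))   # randomly delete one recall result
        texts[pos] = gold_answer             # insert the correct answer text
    labels = [1 if i == pos else 0 for i in range(len(texts))]
    return question, texts, labels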
8. The text semantic retrieval method under a military scene according to claim 7, wherein the step 3 offline construction of the text retrieval fine-ranking model comprises step 3-2: constructing the text retrieval fine-ranking model based on the military pre-training model, and step 3-2 comprises the following steps:
step 3-2-1: constructing the text retrieval fine-ranking model with the military pre-training model of step 1 as the backbone; the model takes the question and the N recalled texts as input and an N-dimensional classification vector as output, judging which of the potential answers best matches the question (a toy model sketch follows this claim);
step 3-2-2: fine-tuning the text retrieval fine-ranking model on the military semantic retrieval fine-ranking data set obtained in step 3-1.
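A toy stand-in for the model shape of step 3-2-1: score each question-candidate pair and take a softmax over the N candidates to obtain the N-dimensional classification vector. The linear scorer replaces the pre-training-model backbone purely so the snippet runs; it is not the claimed architecture.

import torch
import torch.nn as nn

class ToyReranker(nn.Module):
    def __init__(self, dim=16):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)   # stand-in for the PLM backbone

    def forward(self, q_vec, cand_vecs):
        # q_vec: (dim,); cand_vecs: (N, dim) -> probability per candidate.
        pairs = torch.cat([q_vec.expand_as(cand_vecs), cand_vecs], dim=-1)
        return self.score(pairs).squeeze(-1).softmax(dim=-1)

probs = ToyReranker()(torch.randn(16), torch.randn(10, 16))  # N = 10 candidates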
9. The text semantic retrieval method under a military scene according to claim 8, wherein the step 4 text semantic retrieval for real-time tasks comprises the following steps:
step 4-1: for a real-time search task, cleaning the question data and then obtaining the question semantic vector representation with the question language representation model (a toy cleaning sketch follows this claim);
step 4-2: retrieving and recalling a text set strongly related to the user requirement by using the vector similarity;
step 4-3: mapping and converting the question and the recall results of step 4-2 with the military vocabulary list, taking them as input to the text retrieval fine-ranking model, accurately locating the answer text meeting the user requirement, and feeding it back to the user.
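A toy sketch of the question cleaning mentioned in step 4-1, as flagged above; the claim does not specify the cleaning rules, so the two regular expressions below are assumptions.

import re

def clean_question(q):
    # Trim, collapse internal whitespace, then drop non-word symbols.
    q = re.sub(r"\s+", " ", q.strip())
    return re.sub(r"[^\w\u4e00-\u9fff ]", "", q)

print(clean_question("  雷达  探测距离??  "))  # -> "雷达 探测距离"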
10. The text semantic retrieval method under a military scene according to claim 9, wherein step 4-2 retrieves the text set strongly related to the user requirement using vector similarity, and comprises the following steps:
step 4-2-1: traversing the C class center point vectors of step 2-4-1, calculating the similarity between each class center point and the question semantic vector by vector dot product, sorting by similarity, and selecting the Top(2×M) classes as the next retrieval targets;
step 4-2-2: for the 2×M classes of step 4-2-1, traversing their K subclasses respectively and calculating the similarity between each subclass center point and the question semantic vector by vector dot product; sorting the similarities of the 2×M×K subclass center points from large to small and selecting the Top(4×10×M) subclasses as the retrieval text candidate set;
step 4-2-3: traversing each text semantic vector in the candidate set of step 4-2-2, calculating the similarity between each vector and the question semantic vector by vector dot product, sorting from large to small by similarity, and selecting the Top(N) texts as the recalled text set (a numpy sketch of this coarse-to-fine search follows).
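A numpy sketch of the coarse-to-fine search of steps 4-2-1 to 4-2-3, as flagged above. The index layout (an array of top-level class centers, plus per-class subclass centers and member-id arrays) is one hypothetical arrangement consistent with claim 6; dot products serve as the similarity throughout, per the claim.

import numpy as np

def two_level_search(q_vec, top_centers, sub_index, vectors, M=3):
    # sub_index: class id -> (subclass centers of shape (K, dim),
    #                         list of K arrays of member vector ids).
    # Step 4-2-1: keep the Top(2*M) most similar top-level classes.
    classes = np.argsort(-(top_centers @ q_vec))[: 2 * M]
    # Step 4-2-2: rank all 2*M*K subclass centers, keep Top(4*10*M).
    cands = []
    for c in classes:
        centers, _ = sub_index[c]
        sims = centers @ q_vec
        cands.extend((sims[j], c, j) for j in range(len(sims)))
    cands.sort(key=lambda t: -t[0])
    cands = cands[: 4 * 10 * M]
    # Step 4-2-3: exact dot products over every vector in the surviving subclasses.
    ids = np.concatenate([sub_index[c][1][j] for _, c, j in cands])
    order = np.argsort(-(vectors[ids] @ q_vec))[: 10 * M]   # N = 10*M recalls
    return ids[order]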
CN202211630251.9A 2022-12-19 2022-12-19 Text semantic retrieval method under military scene Pending CN116150335A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211630251.9A CN116150335A (en) 2022-12-19 2022-12-19 Text semantic retrieval method under military scene

Publications (1)

Publication Number Publication Date
CN116150335A true CN116150335A (en) 2023-05-23

Family

ID=86338114

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211630251.9A Pending CN116150335A (en) 2022-12-19 2022-12-19 Text semantic retrieval method under military scene

Country Status (1)

Country Link
CN (1) CN116150335A (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116860953A (en) * 2023-09-05 2023-10-10 联通在线信息科技有限公司 Question-answer matching method and system based on question-answer system
CN116860953B (en) * 2023-09-05 2024-01-26 联通在线信息科技有限公司 Question-answer matching method and system based on question-answer system
CN116910377A (en) * 2023-09-14 2023-10-20 长威信息科技发展股份有限公司 Grid event classified search recommendation method and system
CN116910377B (en) * 2023-09-14 2023-12-08 长威信息科技发展股份有限公司 Grid event classified search recommendation method and system
CN117093696A (en) * 2023-10-16 2023-11-21 浙江同花顺智能科技有限公司 Question text generation method, device, equipment and medium of large language model
CN117093696B (en) * 2023-10-16 2024-02-02 浙江同花顺智能科技有限公司 Question text generation method, device, equipment and medium of large language model
CN117150305A (en) * 2023-11-01 2023-12-01 杭州光云科技股份有限公司 Text data enhancement method and device integrating retrieval and filling and electronic equipment
CN117150305B (en) * 2023-11-01 2024-02-27 杭州光云科技股份有限公司 Text data enhancement method and device integrating retrieval and filling and electronic equipment
CN117349186A (en) * 2023-12-04 2024-01-05 山东大学 Program language defect positioning method, system and medium based on semantic flowsheet
CN117349186B (en) * 2023-12-04 2024-03-05 山东大学 Program language defect positioning method, system and medium based on semantic flowsheet
CN118152575A (en) * 2024-05-09 2024-06-07 中电云计算技术有限公司 Event classification method and related device based on recall ordering
CN118568084A (en) * 2024-07-25 2024-08-30 国网山东省电力公司信息通信公司 Semantic-based power data quality detection method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination