CN113641793B - Retrieval system for long text matching optimization aiming at electric power standard - Google Patents

Retrieval system for long text matching optimization aiming at electric power standard

Info

Publication number
CN113641793B
Authority
CN
China
Prior art keywords
bert
semantic
training
search term
terminal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110937101.1A
Other languages
Chinese (zh)
Other versions
CN113641793A (en)
Inventor
赵常威
钱宇骋
李坚林
潘超
甄超
朱太云
李森林
胡啸宇
吴正阳
吴杰
吴海峰
黄文礼
温招洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Nanrui Jiyuan Power Grid Technology Co ltd
Electric Power Research Institute of State Grid Anhui Electric Power Co Ltd
State Grid Anhui Electric Power Co Ltd
Original Assignee
Anhui Nanrui Jiyuan Power Grid Technology Co ltd
Electric Power Research Institute of State Grid Anhui Electric Power Co Ltd
State Grid Anhui Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Nanrui Jiyuan Power Grid Technology Co ltd, Electric Power Research Institute of State Grid Anhui Electric Power Co Ltd, State Grid Anhui Electric Power Co Ltd filed Critical Anhui Nanrui Jiyuan Power Grid Technology Co ltd
Priority to CN202110937101.1A priority Critical patent/CN113641793B/en
Publication of CN113641793A publication Critical patent/CN113641793A/en
Application granted granted Critical
Publication of CN113641793B publication Critical patent/CN113641793B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/088Non-supervised learning, e.g. competitive learning
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a retrieval system optimized for long-text matching of electric power standards, which belongs to the field of text retrieval. When the length of each section of a power standard exceeds 512 characters, effectively matching search terms against the long text is the core problem in building document retrieval for power standards. The traditional TF-IDF and BM25 algorithms consider matching only in the word dimension and do not consider the degree of deep semantic matching or relevance, so the resulting matching similarity is limited. To address the problem that the single-character Mask operation in the original BERT cannot learn the context of domain-specific vocabulary, a Mask operation at the level of continuous vocabulary segments is performed on the results of domain word segmentation, forcing the model to learn vocabulary-level context, which has a certain effect on improving Chinese retrieval tasks.

Description

Retrieval system for long text matching optimization aiming at electric power standard
Technical Field
The invention belongs to the field of text retrieval, and particularly relates to a retrieval system optimized for long-text matching of electric power standards.
Background
Electric power standards are the technical regulations and technical management basis that must be jointly observed in electric power construction and in the production, transformation, transmission, sale and use of electric energy. Because the production, transmission and sale of electric energy are all completed instantaneously, and the power system has a great influence on social life and production as a whole, the power system must have high reliability, stability and safety; accordingly, except for a few provisions marked "may be executed by reference", electric power standards are mandatory standards.
When the length of each section in a power standard exceeds 512 characters, effectively matching a search term against such long text is the core problem in building document retrieval for power standards. The traditional TF-IDF and BM25 algorithms consider matching only in the word dimension and do not consider the degree of deep semantic matching or relevance, so the resulting matching similarity is limited.
Disclosure of Invention
In order to solve the above problems, the invention provides a retrieval system optimized for long-text matching of power standards.
The aim of the invention is achieved by the following technical scheme: a retrieval system for long-text matching optimization of electric power standards comprises a vocabulary extraction terminal, a pre-training BERT coding terminal, a vocabulary processing terminal and a semantic long text ordering terminal;
the pre-training BERT encoding terminal adopts two different pre-training BERT codes to encode the paragraph d and the corresponding search term q to obtain different vectors, the different vectors are expressed as d-vecor and q-vecor, and then cosine similarity is calculated for the two vectors to be used as a relevance score of the two vectors;
The vocabulary processing terminal internally comprises two models, namely a domain-adapted BERT pre-trained language model and a domain-adapted unsupervised semantic similarity model.
Preferably, the vocabulary extraction terminal extracts documents and titles of all chapters in the power standard text as paragraphs and corresponding search terms, wherein the paragraphs are marked as d, and the corresponding search terms are marked as q.
Preferably, the pretrained BERT coding terminal comprises an expansion unit, and the search term corresponding to each document is expanded by the expansion unit.
Preferably, the domain-adapted BERT pre-trained language model forces the model to learn vocabulary-level context while eliminating NSP tasks.
Preferably, the domain-adapted unsupervised semantic similarity model adopts an unsupervised method to train two BERT unsupervised semantic similarity representation models, one for the search term q and one for the paragraph d.
Preferably, the semantic long text ordering terminal is used for constructing a deep semantic long text ordering model so as to obtain a BERT representation suited to q-d matching.
Preferably, the positive and negative samples for the q-d matching algorithm are constructed as follows:
Step one, for a search term with a complete semantic relation, the q-d pairs of other chapters are used: the paragraphs d of the other chapters serve as negative examples, and the paragraph d matched with the search term q having the complete semantic relation serves as the positive example;
Step two, for a search term without a complete semantic relation, pairs are constructed using the word-segmentation results of the search term q: the paragraphs d of other chapters serve as negative examples, and the paragraph d matched with the search term q serves as the positive example;
During training, within each batch of size batch_size, the positive document paragraph corresponding to each search term is taken as the positive sample and the document paragraphs of the other batch_size-1 samples are taken as negative examples, so that batch_size² sample pairs are constructed for training;
Preferably, BERT semantic similarity representation models are constructed separately and in a targeted manner for search terms and for document paragraphs, and are called q-BERT and d-BERT respectively.
Preferably, the q-BERT and the d-BERT encode the search term and the document paragraph respectively as initialization representations. The final objective is to learn the encoders d-encoder and q-encoder, which encode the search term q and the document paragraph d into the same vector space, in which a strongly correlated (q, d) pair lies closer together than a weakly correlated one, and the following loss function is designed for this:

L(q_i, d_i^+, d_{i,1}^-, …, d_{i,n}^-) = −log [ exp(sim(q_i, d_i^+)) / ( exp(sim(q_i, d_i^+)) + Σ_{j=1}^{n} exp(sim(q_i, d_{i,j}^-)) ) ]

The loss function is the negative log-likelihood of the positive example, where q_i is the search term, d_i^+ is the strongly correlated positive document paragraph, and the d_{i,j}^- are the negative examples.
Compared with the prior art, the invention has the beneficial effects that:
1. To address the problem that the single-character Mask operation in the original BERT cannot learn the context of domain-specific vocabulary, a Mask operation at the level of continuous vocabulary segments is performed on the results of domain word segmentation, forcing the model to learn vocabulary-level context and having a certain effect on improving Chinese retrieval tasks;
2. For the search term q and the paragraph d, two BERT unsupervised semantic similarity representation models are trained separately using an unsupervised method. A sentence is passed through the encoder to obtain its BERT representation, the BERT representations obtained from other sentences serve as negative examples, and the positive example is obtained by feeding the same sentence into the encoder twice, where different dropout masks yield different BERT representations; this works better than common text-augmentation methods such as clipping and word replacement;
3. In each batch of size batch_size, the positive document paragraph corresponding to each search term is taken as the positive sample and the document paragraphs of the other batch_size-1 samples are taken as negative examples, so that batch_size² sample pairs are constructed for training, thereby effectively training the deep semantic long text ordering model.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic block diagram of the present invention;
FIG. 2 is a schematic block diagram of a similarity model of the present invention.
Detailed Description
The technical solutions of the present invention will be clearly and completely described in connection with the embodiments, and it is obvious that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
As shown in fig. 1, a retrieval system for long-text matching optimization of power standards comprises a vocabulary extraction terminal, a pre-training BERT coding terminal, a vocabulary processing terminal and a semantic long text ordering terminal;
The vocabulary extraction terminal output end is electrically connected with the pre-training BERT coding terminal input end, the pre-training BERT coding terminal output end is electrically connected with the vocabulary processing terminal input end, and the vocabulary processing terminal output end is electrically connected with the semantic long text ordering terminal input end;
The vocabulary extraction terminal extracts documents and titles of all chapters in the power standard text as paragraphs and corresponding search words, wherein the paragraphs are marked as d, and the corresponding search words are marked as q;
The pre-training BERT encoding terminal adopts two different pre-trained BERT encoders to encode the paragraph d and the corresponding search term q into different vectors, denoted d-vector and q-vector, and then computes the cosine similarity of the two vectors as their relevance score. An expansion unit is arranged in the pre-training BERT encoding terminal, and through this unit the search term corresponding to each document can be expanded, for example by other topic keywords already present in the terminal (such as a stable winding);
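As an illustrative sketch only (not the patented implementation), the dual-encoder cosine-similarity scoring described above could be realised as follows; the checkpoint name bert-base-chinese and the helper names are assumptions, since the patent does not specify them:

import torch
from transformers import BertModel, BertTokenizerFast

# The two encoders stand in for the separately trained q-BERT and d-BERT;
# here both start from a generic Chinese BERT checkpoint for illustration.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
q_bert = BertModel.from_pretrained("bert-base-chinese")
d_bert = BertModel.from_pretrained("bert-base-chinese")

def encode(text, model):
    # Encode the text and take its [CLS] vector as a fixed-size representation.
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        return model(**inputs).last_hidden_state[:, 0]      # shape (1, 768)

def relevance(q, d):
    q_vec = encode(q, q_bert)                                # q-vector
    d_vec = encode(d, d_bert)                                # d-vector
    return torch.cosine_similarity(q_vec, d_vec).item()     # relevance score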
the vocabulary processing terminal internally comprises two models, namely a domain-adapted BERT pre-trained language model and a domain-adapted unsupervised semantic similarity model;
The domain-adapted BERT pre-trained language model has the following characteristics:
1. To address the problem that the single-character Mask operation in the original BERT cannot learn the context of domain-specific vocabulary, a Mask operation at the level of continuous vocabulary segments is performed on the results of domain word segmentation, forcing the model to learn vocabulary-level context and having a certain effect on improving Chinese retrieval tasks (a sketch of this segment-level masking follows this list);
2. Because the next-sentence-prediction (NSP) task in the original BERT splits the text, the long-text semantic information obtained by modeling is incomplete, which harms long-text retrieval; the NSP task is therefore cancelled to avoid this effect on the long-text semantic information.
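A minimal sketch of this segment-level Mask operation, assuming a jieba-style segmenter pre-loaded with a power-domain dictionary and the Hugging Face tokenizer; the function name mask_segments and the 15% masking rate are illustrative assumptions rather than the patented implementation:

import random
import jieba
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")

def mask_segments(sentence, mask_prob=0.15):
    # Segment the sentence into domain vocabulary units, then mask whole
    # segments rather than single characters, so the model must recover the
    # entire domain term from its surrounding context.
    tokens = []
    for segment in jieba.cut(sentence):
        piece = tokenizer.tokenize(segment)
        if piece and random.random() < mask_prob:
            tokens.extend(["[MASK]"] * len(piece))
        else:
            tokens.extend(piece)
    return tokenizer.convert_tokens_to_ids(tokens)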
The domain-adapted unsupervised semantic similarity model has the following characteristics:
For the search term q and the paragraph d, two BERT unsupervised semantic similarity representation models are trained separately using an unsupervised method;
As shown in fig. 2, a sentence is passed through the encoder to obtain its BERT representation, and the BERT representations obtained from other sentences serve as negative examples, while the positive example is obtained by feeding the same sentence into the encoder twice, where different dropout masks produce different BERT representations. Experiments show that this works better than conventional text-augmentation methods such as clipping and word replacement. In the figure, arrows indicate the matching pairs formed between the encoded feature vectors of the texts (the first with the second, and the first with the third). A sketch of this dropout-based contrastive objective follows below;
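A minimal sketch of this dropout-based unsupervised contrastive objective (in the spirit of SimCSE), assuming PyTorch and the transformers library; the temperature value, checkpoint name and helper names are illustrative assumptions:

import torch
import torch.nn.functional as F
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
encoder = BertModel.from_pretrained("bert-base-chinese")
encoder.train()  # keep dropout active so two passes over the same sentence differ

def unsupervised_contrastive_loss(sentences, temperature=0.05):
    batch = tokenizer(sentences, padding=True, truncation=True,
                      max_length=512, return_tensors="pt")
    # Two forward passes: dropout produces two different BERT representations
    # of the same sentence, which form the positive pair; the other sentences
    # in the batch serve as negatives.
    z1 = encoder(**batch).last_hidden_state[:, 0]            # (N, 768)
    z2 = encoder(**batch).last_hidden_state[:, 0]            # (N, 768)
    sim = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1) / temperature
    labels = torch.arange(sim.size(0))                       # positives on the diagonal
    return F.cross_entropy(sim, labels)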
BERT semantic similarity representation models are constructed separately and in a targeted manner for search terms and for document paragraphs, and are called q-BERT and d-BERT respectively;
Through this training optimization on the power standard text in the semantic-similarity calculation scenario, a BERT representation better suited to q-d matching is obtained; on this basis, a deep semantic long text ordering model needs to be constructed.
The semantic long text ordering terminal is used for constructing the deep semantic long text ordering model, and the positive and negative samples for the q-d matching algorithm are constructed as follows:
Step one, for a search term with a complete semantic relation, the q-d pairs of other chapters are used for a given chapter: the paragraphs d of the other chapters serve as negative examples, and the paragraph d matched with the search term q having the complete semantic relation serves as the positive example;
Step two, for a search term without a complete semantic relation, pairs are constructed using the word-segmentation results of the search term q: the paragraphs d of other chapters serve as negative examples, and the paragraph d matched with the search term q serves as the positive example.
Step three, during training, within each batch of size batch_size, the positive document paragraph corresponding to each search term is taken as the positive sample and the document paragraphs of the other batch_size-1 samples are taken as negative examples, so that batch_size² sample pairs are constructed for training, effectively training the deep semantic long text ordering model (see the training-step sketch following the loss function below).
Application and training are performed through the following steps:
First, the d-encoder is used offline to encode the text paragraphs into fixed-dimensional vectors, the fixed dimension being 768, and indexes are built over these vectors. In actual use, the q-encoder semantically encodes the search term into a fixed-dimensional vector, and the CAISS vector search system is used to find the K most relevant document paragraphs; this is the application flow of the system. The similarity of the encoded vectors of the search term q and the paragraph d is measured by the following formula:

sim(q, d) = (q-vector)^T · (d-vector)

That is, the semantic encoding vectors of a (q, d) pair are taken and their inner product is computed as the correlation metric between the two, where T denotes the transpose of the matrix.
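The offline-indexing and online-search flow can be sketched as follows; a brute-force NumPy inner-product search stands in for the CAISS vector search system, and the checkpoint name and corpus contents are illustrative assumptions:

import numpy as np
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
q_encoder = BertModel.from_pretrained("bert-base-chinese").eval()   # stands in for the q-encoder
d_encoder = BertModel.from_pretrained("bert-base-chinese").eval()   # stands in for the d-encoder

def encode(text, model):
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        return model(**inputs).last_hidden_state[0, 0].numpy()      # 768-dimensional vector

# Offline: encode every document paragraph with the d-encoder and build the index.
corpus = ["变压器绕组试验段落……", "互感器检测段落……"]
index = np.stack([encode(p, d_encoder) for p in corpus])             # (num_docs, 768)

def search(query, k=2):
    # Online: encode the search term with the q-encoder and rank paragraphs by
    # the inner product sim(q, d) = (q-vector)^T (d-vector).
    q_vec = encode(query, q_encoder)
    scores = index @ q_vec
    top = np.argsort(-scores)[:k]
    return [(corpus[i], float(scores[i])) for i in top]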
During training, the q-BERT and d-BERT obtained above are used to encode the search term and the document paragraph respectively as initialization representations. The final objective is to learn the encoders d-encoder and q-encoder, which encode the search term q and the document paragraph d into the same vector space, in which a strongly correlated (q, d) pair lies closer together than a weakly correlated one. Finally, the following loss function is designed for this task:

L(q_i, d_i^+, d_{i,1}^-, …, d_{i,n}^-) = −log [ exp(sim(q_i, d_i^+)) / ( exp(sim(q_i, d_i^+)) + Σ_{j=1}^{n} exp(sim(q_i, d_{i,j}^-)) ) ]

The above formula is the negative log-likelihood of the positive example, where q_i is the search term, d_i^+ is the strongly correlated positive document paragraph, and the d_{i,j}^- are the negative examples, i.e. samples other than the correctly matched one.
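A minimal sketch of one training step with this in-batch negative log-likelihood loss; the checkpoint names, learning rate and helper names are illustrative assumptions rather than the patented implementation:

import torch
import torch.nn.functional as F
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
q_encoder = BertModel.from_pretrained("bert-base-chinese")   # initialised from q-BERT
d_encoder = BertModel.from_pretrained("bert-base-chinese")   # initialised from d-BERT
q_encoder.train()
d_encoder.train()
optimizer = torch.optim.AdamW(
    list(q_encoder.parameters()) + list(d_encoder.parameters()), lr=2e-5)

def cls_vectors(texts, model):
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=512, return_tensors="pt")
    return model(**batch).last_hidden_state[:, 0]             # (batch_size, 768)

def train_step(queries, positive_paragraphs):
    # queries[i] matches positive_paragraphs[i]; the paragraphs of the other
    # batch_size-1 samples act as in-batch negatives, giving batch_size^2 scored pairs.
    q_vecs = cls_vectors(queries, q_encoder)                   # (B, 768)
    d_vecs = cls_vectors(positive_paragraphs, d_encoder)       # (B, 768)
    scores = q_vecs @ d_vecs.T                                 # sim(q_i, d_j) for all pairs
    labels = torch.arange(scores.size(0))                      # positives sit on the diagonal
    loss = F.cross_entropy(scores, labels)                     # negative log-likelihood of positives
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()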
The above formulas are all dimensionless and operate on calculated numerical values; they are formulas obtained by collecting a large amount of data and performing software simulation so as to approximate the actual situation as closely as possible, and the preset parameters and preset thresholds in the formulas are set by a person skilled in the art according to the actual situation or are obtained by simulation over a large amount of data.
Working principle: to address the problem that the single-character Mask operation in the original BERT cannot learn the context of domain-specific vocabulary, a Mask operation at the level of continuous vocabulary segments is performed on the results of domain word segmentation, forcing the model to learn vocabulary-level context and having a certain effect on improving Chinese retrieval tasks;
For the search term q and the paragraph d, two BERT unsupervised semantic similarity representation models are trained separately using an unsupervised method. A sentence is passed through the encoder to obtain its BERT representation, the BERT representations obtained from other sentences serve as negative examples, and the positive example is obtained by feeding the same sentence into the encoder twice, where different dropout masks yield different BERT representations; this works better than common text-augmentation methods such as clipping and word replacement;
In each batch of size batch_size, the positive document paragraph corresponding to each search term is taken as the positive sample and the document paragraphs of the other batch_size-1 samples are taken as negative examples, so that batch_size² sample pairs are constructed for training, thereby effectively training the deep semantic long text ordering model.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is merely a logical function division, and there may be other manners of division when actually implemented; the modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical units, may be located in one place, or may be distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the method of this embodiment.
It will also be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof.
The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
Furthermore, it is evident that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the system claims can also be implemented by one unit or means through software or hardware. Terms such as first and second are used to denote names and do not indicate any particular order.
Finally, it should be noted that the above embodiments are only for illustrating the technical method of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that the technical method of the present invention may be modified or substituted without departing from the spirit and scope of the technical method of the present invention.

Claims (1)

1. A retrieval system for long-text matching optimization of electric power standards, characterized by comprising a vocabulary extraction terminal, a pre-training BERT coding terminal, a vocabulary processing terminal and a semantic long text ordering terminal;
The vocabulary extraction terminal output end is electrically connected with the pre-training BERT coding terminal input end, the pre-training BERT coding terminal output end is electrically connected with the vocabulary processing terminal input end, and the vocabulary processing terminal output end is electrically connected with the semantic long text ordering terminal input end; the vocabulary extraction terminal extracts documents and titles of all chapters in the power standard text as paragraphs and corresponding search terms, wherein the paragraphs are marked as d and the corresponding search terms are marked as q; the domain-adapted unsupervised semantic similarity model adopts an unsupervised method to train two BERT unsupervised semantic similarity representation models, one for the search term q and one for the paragraph d;
the pre-training BERT encoding terminal adopts two different pre-trained BERT encoders to encode the paragraph d and the corresponding search term q into different vectors, denoted d-vector and q-vector, and then computes the cosine similarity of the two vectors as their relevance score;
the pre-training BERT coding terminal internally comprises an expansion unit, and the search word corresponding to each document is expanded through the expansion unit;
the vocabulary processing terminal internally comprises two models, namely a domain-adapted BERT pre-trained language model and a domain-adapted unsupervised semantic similarity model;
The domain-adapted BERT pre-trained language model forces the model to learn vocabulary-level context, while the NSP task is cancelled;
BERT semantic similarity representation models are constructed separately and in a targeted manner for search terms and for document paragraphs, and are called q-BERT and d-BERT respectively; the semantic long text ordering terminal is used for constructing a deep semantic long text ordering model, so as to obtain a BERT representation suited to q-d matching;
The semantic long text ordering terminal is used for constructing the deep semantic long text ordering model, and the positive and negative samples for the q-d matching algorithm are constructed as follows:
Step one, for a search term with a complete semantic relation, the q-d pairs of other chapters are used: the paragraphs d of the other chapters serve as negative examples, and the paragraph d matched with the search term q having the complete semantic relation serves as the positive example;
Step two, for a search term without a complete semantic relation, pairs are constructed using the word-segmentation results of the search term q: the paragraphs d of other chapters serve as negative examples, and the paragraph d matched with the search term q serves as the positive example;
During training, within each batch of size batch_size, the positive document paragraph corresponding to each search term is taken as the positive sample and the document paragraphs of the other batch_size-1 samples are taken as negative examples, so that batch_size² sample pairs are constructed for training;
The q-BERT and d-BERT obtained above are used to encode the search term and the document paragraph respectively as initialization representations; the goal is to learn the encoders d-encoder and q-encoder, which encode the search term q and the document paragraph d into the same vector space, in which a strongly correlated (q, d) pair lies closer together than a weakly correlated one, and the following loss function is designed for this:

L(q_i, d_i^+, d_{i,1}^-, …, d_{i,n}^-) = −log [ exp(sim(q_i, d_i^+)) / ( exp(sim(q_i, d_i^+)) + Σ_{j=1}^{n} exp(sim(q_i, d_{i,j}^-)) ) ]

The loss function is the negative log-likelihood of the positive example, where q_i is the search term, d_i^+ is the strongly correlated positive document paragraph, and the d_{i,j}^- are the negative examples.
CN202110937101.1A 2021-08-16 2021-08-16 Retrieval system for long text matching optimization aiming at electric power standard Active CN113641793B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110937101.1A CN113641793B (en) 2021-08-16 2021-08-16 Retrieval system for long text matching optimization aiming at electric power standard

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110937101.1A CN113641793B (en) 2021-08-16 2021-08-16 Retrieval system for long text matching optimization aiming at electric power standard

Publications (2)

Publication Number Publication Date
CN113641793A CN113641793A (en) 2021-11-12
CN113641793B true CN113641793B (en) 2024-05-07

Family

ID=78422036

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110937101.1A Active CN113641793B (en) 2021-08-16 2021-08-16 Retrieval system for long text matching optimization aiming at electric power standard

Country Status (1)

Country Link
CN (1) CN113641793B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109241321A (en) * 2018-07-19 2019-01-18 杭州电子科技大学 The image and model conjoint analysis method adapted to based on depth field
CN110516229A (en) * 2019-07-10 2019-11-29 杭州电子科技大学 A kind of domain-adaptive Chinese word cutting method based on deep learning
CN111931490A (en) * 2020-09-27 2020-11-13 平安科技(深圳)有限公司 Text error correction method, device and storage medium
CN112000805A (en) * 2020-08-24 2020-11-27 平安国际智慧城市科技股份有限公司 Text matching method, device, terminal and storage medium based on pre-training model
CN112527999A (en) * 2020-12-22 2021-03-19 江苏省农业科学院 Extraction type intelligent question and answer method and system introducing agricultural field knowledge
CN112749544A (en) * 2020-12-28 2021-05-04 苏州思必驰信息科技有限公司 Training method and system for paragraph segmentation model
CN113239700A (en) * 2021-04-27 2021-08-10 哈尔滨理工大学 Text semantic matching device, system, method and storage medium for improving BERT
CN113239148A (en) * 2021-05-14 2021-08-10 廖伟智 Scientific and technological resource retrieval method based on machine reading understanding

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8838433B2 (en) * 2011-02-08 2014-09-16 Microsoft Corporation Selection of domain-adapted translation subcorpora

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109241321A (en) * 2018-07-19 2019-01-18 杭州电子科技大学 The image and model conjoint analysis method adapted to based on depth field
CN110516229A (en) * 2019-07-10 2019-11-29 杭州电子科技大学 A kind of domain-adaptive Chinese word cutting method based on deep learning
CN112000805A (en) * 2020-08-24 2020-11-27 平安国际智慧城市科技股份有限公司 Text matching method, device, terminal and storage medium based on pre-training model
CN111931490A (en) * 2020-09-27 2020-11-13 平安科技(深圳)有限公司 Text error correction method, device and storage medium
CN112527999A (en) * 2020-12-22 2021-03-19 江苏省农业科学院 Extraction type intelligent question and answer method and system introducing agricultural field knowledge
CN112749544A (en) * 2020-12-28 2021-05-04 苏州思必驰信息科技有限公司 Training method and system for paragraph segmentation model
CN113239700A (en) * 2021-04-27 2021-08-10 哈尔滨理工大学 Text semantic matching device, system, method and storage medium for improving BERT
CN113239148A (en) * 2021-05-14 2021-08-10 廖伟智 Scientific and technological resource retrieval method based on machine reading understanding

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A Survey of Neural-Network-Based Machine Reading Comprehension; 顾迎捷; 桂小林; 李德福; 沈毅; 廖东; Journal of Software (软件学报), No. 07; full text *
Research on Domain Adaptation of Translation Models Based on Semantic Distribution Similarity; 姚亮; 洪宇; 刘昊; 刘乐; 姚建民; Journal of Shandong University (Natural Science) (山东大学学报(理学版)), No. 07; full text *

Also Published As

Publication number Publication date
CN113641793A (en) 2021-11-12

Similar Documents

Publication Publication Date Title
CN109840287B (en) Cross-modal information retrieval method and device based on neural network
Chung et al. Speech2vec: A sequence-to-sequence framework for learning word embeddings from speech
CN109918666B (en) Chinese punctuation mark adding method based on neural network
CN106502985B (en) neural network modeling method and device for generating titles
CN111259127B (en) Long text answer selection method based on transfer learning sentence vector
CN109344399B (en) Text similarity calculation method based on stacked bidirectional lstm neural network
CN107798140A (en) A kind of conversational system construction method, semantic controlled answer method and device
CN110619043A (en) Automatic text abstract generation method based on dynamic word vector
CN112883171B (en) Document keyword extraction method and device based on BERT model
CN113204611A (en) Method for establishing reading understanding model, reading understanding method and corresponding device
CN113051368B (en) Double-tower model training method, retrieval device and electronic equipment
CN114020906A (en) Chinese medical text information matching method and system based on twin neural network
CN109145946B (en) Intelligent image recognition and description method
CN113239666A (en) Text similarity calculation method and system
CN111221964B (en) Text generation method guided by evolution trends of different facet viewpoints
CN111428518B (en) Low-frequency word translation method and device
CN115033753A (en) Training corpus construction method, text processing method and device
CN116756303A (en) Automatic generation method and system for multi-topic text abstract
CN113392191B (en) Text matching method and device based on multi-dimensional semantic joint learning
CN116226357B (en) Document retrieval method under input containing error information
CN115828931B (en) Chinese and English semantic similarity calculation method for paragraph level text
CN117592563A (en) Power large model training and adjusting method with field knowledge enhancement
CN113641793B (en) Retrieval system for long text matching optimization aiming at electric power standard
CN115860002A (en) Combat task generation method and system based on event extraction
CN116662819A (en) Short text-oriented matching method and system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant