CN111339261A - Document extraction method and system based on pre-training model

Info

Publication number: CN111339261A
Application number: CN202010185622.1A
Authority: CN
Inventors: 韩庆宏; 李纪为
Original assignee: Beijing Xiangnong Huiyu Technology Co ltd
Current assignee: Beijing Xiangnong Huiyu Technology Co ltd
Priority date: 2020-03-17
Filing date: 2020-03-17
Publication date: 2020-06-26

Abstract

The invention discloses a document extraction method and system based on a pre-training model, belonging to the technical field of document extraction. The method comprises two steps. The model pre-training step comprises: characterizing the documents, i.e. pre-training on each document to obtain a feature vector of the document; characterizing the descriptions, i.e. taking a sentence in a document as a description and likewise pre-training to obtain a feature vector of the description; performing an inner product operation on the two feature vectors and normalizing the result; and saving the document feature vectors. The document extraction step comprises: characterizing a new description; performing the inner product operation on the feature vector of the new description and the stored document feature vectors and normalizing the results; and selecting the document with the maximum normalized value. By pre-training the model on massive, high-quality Wikipedia data, the method endows the model with rich semantic knowledge; by applying a neural network pre-training method, it reflects the true relationship between the search description and the document, thereby improving the accuracy of document extraction.

Description

Document extraction method and system based on pre-training model

Technical Field

The invention relates to the technical field of document selection, in particular to a document extraction method based on a pre-training model.

Background

In the prior art, document extraction means that, given a description, a system extracts from a massive document library the document that best matches the description. The task is characterized by an extremely large amount of data: each document is relatively long, and the document library contains a very large number of documents. In document selection it is therefore important to obtain the document matching the search description accurately and quickly.

Past document extraction systems mainly work in two steps: first, some candidate documents are extracted from the document library; then the document that best matches the description is selected from the candidates. The first step requires extracting some features, such as keyword information, from each document and comparing them with the keywords in the description. The second step requires extracting finer-grained features from the candidate documents and comparing them with the search description to obtain a finer and more accurate result. Because the relevant document features are manually defined, comparing them with the search description cannot accurately reflect the actual relationship between the document and the description; the extraction result is therefore inaccurate and cannot meet the requirements of the search description.

Disclosure of Invention

The main technical problem solved by the invention is to provide a document extraction method and system based on a pre-training model that improve the accuracy of document extraction.

To achieve the above object, the first technical solution adopted by the invention is a document extraction method based on a pre-training model, comprising a model pre-training step and a document extraction step. The model pre-training step comprises: selecting a document set and pre-training on the documents in the document set to obtain a first feature vector of each document; randomly selecting a sentence in each document and pre-training on it to obtain a second feature vector; performing an inner product operation on the first feature vectors and the second feature vectors and normalizing the results; and storing the first feature vectors obtained for the documents through the trained model. The document extraction step comprises: inputting a search description into the pre-training model to obtain a third feature vector; performing the inner product operation on each first feature vector and the third feature vector and normalizing the results; and selecting the document with the maximum normalized value as the final result.

To achieve the above object, the second technical solution adopted by the invention is a document extraction system based on a pre-training model, comprising: a text feature vector extraction module, used for pre-training on the documents in a document set to obtain a first feature vector of each document, and for extracting the feature vector of a search description to obtain a second feature vector; a feature vector operation module, used for performing an inner product operation on the first feature vectors and the second feature vector, normalizing the results, and storing the first feature vectors; and a document selecting module, which selects the document with the maximum normalized result as the final result.

The beneficial effects of the invention are as follows: pre-training the model on massive, high-quality Wikipedia data endows it with rich semantic knowledge; a neural network pre-training method converts text information into feature vectors; and operations among the feature vectors fully and truly reflect the relationship between the search description and the document, improving the accuracy of document extraction.

Drawings

FIG. 1 is a schematic flow chart of a document extraction method based on a pre-training model according to the present invention;

FIG. 2 is a schematic diagram of a document extraction system based on a pre-training model according to the present invention.

Detailed Description

The preferred embodiments of the invention are described in detail below with reference to the accompanying drawings, so that those skilled in the art can more easily understand the advantages and features of the invention, and so that the scope of protection of the invention is defined more clearly.

It is noted that the terms first and second in the claims and the description of the present application are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.

Referring to FIG. 1, which is a schematic flow chart of the document extraction method based on a pre-training model, in an embodiment of the invention the method includes the following steps:

Step S101, model pre-training step.

Document extraction means that, given a description q, the system extracts from the document library D the document d that best matches the description q. In one embodiment of the invention, a document set M is selected from massive, high-quality Wikipedia text data as the object of model pre-training. Neural network pre-training is carried out on the document set M, converting the text information of each document in M into a feature vector, so that each document is represented by its feature vector. Compared with manually defined text features based on keyword information, the feature vector obtained through neural network pre-training reflects the characteristics of the document better and improves the accuracy of document extraction.
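As a purely illustrative stand-in for the conversion of text into a feature vector, the sketch below maps a text to a fixed-length vector with a hashed bag-of-words. This is not the neural network pre-training described above: the name `toy_encode`, the dimension of 8, and the use of `zlib.crc32` are assumptions chosen only so that the interface (text in, fixed-length vector out) can be shown in runnable form.

```python
import zlib
from typing import List


def toy_encode(text: str, dim: int = 8) -> List[float]:
    """Map text to a fixed-length vector of hashed token counts.

    Hypothetical stand-in for the pre-trained neural encoder; it only
    illustrates the text -> feature vector interface.
    """
    vec = [0.0] * dim
    for token in text.lower().split():
        # crc32 is deterministic across runs, unlike Python's built-in hash().
        bucket = zlib.crc32(token.encode("utf-8")) % dim
        vec[bucket] += 1.0
    return vec


doc_vector = toy_encode("Paris is the capital of France")
```

Every text, whatever its length, yields a vector of the same dimension, which is what allows documents and search descriptions to be compared by an inner product.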

In a specific implementation, a sentence q in a document is randomly selected as the search description of that document. The search description q is feature-encoded: its text information is converted into a feature vector using the neural network pre-training method. An inner product operation is then performed between the feature vector of each document and the feature vector of the search description q, and the inner product results are normalized to (0,1).
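The random selection of a sentence as a search description can be sketched as follows. The splitting rule (on full stops), the fixed seed, and the names `split_sentences` and `sample_descriptions` are assumptions for illustration; the source does not specify how sentences are delimited or sampled.

```python
import random
from typing import Dict, List


def split_sentences(document: str) -> List[str]:
    """Naive sentence splitter: split on '.' and drop empty pieces (assumption)."""
    return [s.strip() for s in document.split(".") if s.strip()]


def sample_descriptions(documents: List[str], seed: int = 0) -> Dict[int, str]:
    """Pick one random sentence per document to act as its search description."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return {i: rng.choice(split_sentences(doc)) for i, doc in enumerate(documents)}


documents = [
    "Paris is the capital of France. It lies on the Seine.",
    "Tokyo is the capital of Japan. It is a very large city.",
]
descriptions = sample_descriptions(documents)
```

Each entry of `descriptions` pairs a document index with a sentence taken from that document, which gives a (description, matching document) pair for pre-training without manual labelling.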

In one embodiment of the invention, a search description is extracted from a first document. The first document has its feature vector, the search description extracted from it has its own feature vector, and the remaining documents have their corresponding feature vectors. An inner product operation is performed between the feature vector of each document and the feature vector of the search description, yielding one inner product result per document. These inner product results are normalized to (0,1); that is, the ratio of each inner product result to the sum of all inner product results is calculated. For example, the normalized result for the first document is the inner product of its feature vector with the feature vector of the search description, divided by the sum of the inner products over all documents. Because the search description is a sentence in the first document, it matches the first document to the highest degree; the normalized result for the first document is therefore close to 1, which shows that the first document is the document that conforms to the description, i.e. the document corresponding to the description. The normalized result thus also reflects how well a document matches the search description. The remaining documents in the document set and their descriptions are likewise subjected to the feature inner product operation and normalization, after which the document extraction model f is fixed.
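The normalization described above can be written compactly. The symbols are assumptions introduced here for illustration: $\mathbf{e}_{d_i}$ denotes the feature vector of the $i$-th document, $\mathbf{e}_{q}$ the feature vector of the search description, and $N$ the number of documents in the set.

```latex
s_i = \mathbf{e}_{d_i}^{\top}\mathbf{e}_{q}, \qquad
p_i = \frac{s_i}{\sum_{j=1}^{N} s_j}, \qquad i = 1, \dots, N
```

Under this ratio the values $p_i$ sum to 1, and each lies in $(0,1)$ provided that $N \ge 2$ and every inner product $s_j$ is positive. When the description is a sentence taken from the first document, pre-training aims to make $p_1$ approach 1.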

After the document extraction model has been pre-trained and fixed, when the model receives another search description it performs the same series of operations as during pre-training and selects from the document set the document whose normalized result with the search description feature vector is close to 1, i.e. the document that best matches the search description.

Step S102, document extraction step.

In one embodiment of the invention, after the model pre-training of step S101, the document extraction model has already obtained the feature vectors of all documents in the document set. These document feature vectors are stored and used directly during document extraction, without being recomputed. This further simplifies the model and improves the efficiency of document extraction.
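Storing the document feature vectors so that they are computed only once can be sketched as follows. The class name `DocumentIndex`, the `encoder` callable, and the character-count encoder used in the demonstration are hypothetical; any function mapping text to a fixed-length vector could be passed in.

```python
from typing import Callable, List

Vector = List[float]


class DocumentIndex:
    """Encode every document once and keep the vectors for later queries."""

    def __init__(self, documents: List[str], encoder: Callable[[str], Vector]):
        self.documents = documents
        # Computed a single time here; extraction reuses these stored vectors.
        self.vectors: List[Vector] = [encoder(doc) for doc in documents]


call_log: List[str] = []  # records every text passed to the encoder


def counting_encoder(text: str) -> Vector:
    """Toy encoder (assumption): counts of vowels and of all other characters."""
    call_log.append(text)
    vowels = sum(ch in "aeiou" for ch in text.lower())
    return [float(vowels), float(len(text) - vowels)]


index = DocumentIndex(["alpha beta", "gamma"], counting_encoder)
stored_first = index.vectors[0]  # reading a stored vector triggers no encoding
```

After the index is built, `call_log` holds one entry per document, and later reads of `index.vectors` add none; only a new search description needs to be encoded at extraction time.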

In one embodiment of the invention, when the document extraction model receives a search description, the text information of the search description is converted into the feature vector of the search description. An inner product operation is performed between this vector and the stored feature vector of each document, the results are normalized to (0,1), and the document whose result is closest to 1 is selected as the document that best matches the search description.
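The extraction step can be sketched as below, under the assumption that the document vectors have already been stored and that all inner products are positive, so that the ratio normalization yields values in (0, 1). The vectors are hand-written toy values rather than outputs of a trained model, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def inner_product(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of two equal-length feature vectors."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same dimension")
    return sum(a * b for a, b in zip(u, v))


def normalize(scores: Sequence[float]) -> List[float]:
    """Ratio of each score to the sum of all scores."""
    if any(s <= 0 for s in scores):
        raise ValueError("ratio normalization assumes positive inner products")
    total = sum(scores)
    return [s / total for s in scores]


def extract(doc_vectors: Sequence[Sequence[float]],
            query_vector: Sequence[float]) -> Tuple[int, List[float]]:
    """Return the index of the best-matching document and all normalized scores."""
    scores = [inner_product(d, query_vector) for d in doc_vectors]
    normalized = normalize(scores)
    best = max(range(len(normalized)), key=normalized.__getitem__)
    return best, normalized


stored_vectors = [[1.0, 2.0], [3.0, 1.0], [0.5, 0.5]]  # toy stored document vectors
best_index, normalized_scores = extract(stored_vectors, [2.0, 1.0])
```

A softmax over the inner products is a common alternative that tolerates negative scores; it is mentioned here only as a possible variant, not as the method of the source.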

During document extraction, the document extraction model is pre-trained with the neural network pre-training method on the basis of the pre-training model, and the texts of the documents and of the search description are converted into feature vectors that represent their features. Compared with manually defined keyword features, these feature vectors reflect the real relationship between the search description and the document more faithfully. Using massive, high-quality Wikipedia data as the source of pre-training documents makes the document extraction model more comprehensive and improves the quality of the extraction results.

In one embodiment of the invention, FIG. 2 is a schematic diagram of a document extraction system based on a pre-training model. The system comprises:

a text feature vector extraction part, which obtains the feature vectors of the documents in the document set by the neural network pre-training method and obtains the feature vector of the search description by the same method. In one embodiment of the invention, a document set M is selected from massive, high-quality Wikipedia text data as the object of model pre-training. Neural network pre-training is carried out on the document set M, converting the text information of each document in M into a feature vector. A sentence q in a document is randomly selected as the search description of that document, and the search description q is feature-encoded: its text information is converted into a feature vector using the neural network pre-training method. An inner product operation is then performed between the feature vector of each document and the feature vector of the search description q, and the inner product results are normalized to (0,1).

In an embodiment of the invention, as shown in FIG. 2, the document extraction system based on the pre-training model further includes a feature vector operation part, which performs an inner product operation between the feature vectors of the documents in the document set and the feature vector of the search description, normalizes the results, stores the first feature vectors, and fixes the document extraction model.

In one embodiment of the invention, a search description is extracted from a first document. The first document has its feature vector, the search description extracted from it has its own feature vector, and the remaining documents have their corresponding feature vectors. An inner product operation is performed between the feature vector of each document and the feature vector of the search description, and the results are normalized. Because the search description is a sentence in the first document, it matches the first document to the highest degree; the normalized result for the first document is therefore close to 1, which shows that the first document is the document that conforms to the description, i.e. the document corresponding to the description.

In one embodiment of the invention, as shown in FIG. 2, the document extraction system based on the pre-training model further includes a document extraction part. The document extraction model has already obtained the feature vectors of all documents in the document set; these feature vectors are stored and used directly during document extraction, without being recomputed. This further simplifies the model and improves the efficiency of document extraction. When the document extraction model receives a search description, the text information of the search description is converted into the feature vector of the search description, an inner product operation is performed between this vector and the stored feature vector of each document, the results are normalized to (0,1), and the document whose result is closest to 1 is selected as the document that best matches the search description.

Claims

1. A document extraction method based on a pre-training model is characterized by comprising the following steps:

a model pre-training step, comprising:

selecting a document set, and pre-training documents in the document set to obtain a first feature vector of the documents;

respectively and randomly selecting a sentence in the document as a description, and respectively performing the pre-training to obtain a second feature vector;

performing inner product operation on the second feature vectors and the first feature vectors respectively, and performing normalization processing;

the first feature vector obtained by the document through the training model is stored;

a document extraction step comprising:

inputting a search description into the pre-training model to obtain a third feature vector;

respectively carrying out the inner product operation on the first feature vector and the third feature vector and carrying out the normalization processing;

and selecting a document with the maximum value of the normalization result as a final result.

2. The pre-training model-based document extraction method of claim 1, wherein in the document set selection, wikipedia data is adopted.

3. The method for extracting documents based on pre-trained model as claimed in claim 1, wherein in the model pre-training step, said pre-training method adopts neural network pre-training method to convert the text into the feature vector of said text.

4. The method for extracting documents based on pre-training model as claimed in claim 1, wherein in the model pre-training step, the normalization processing normalizes the inner product operation result to the interval (0,1), and makes the normalized result of the first feature vector of a document and the second feature vector corresponding to the document approach 1.

5. A system for extracting documents based on a pre-trained model, comprising:

the text feature vector extraction module is used for pre-training the documents in the document set to obtain a first feature vector of the documents, and extracting the feature vector of the search description to obtain a second feature vector;

the feature vector operation module is used for respectively carrying out inner product operation on the first feature vector and the second feature vector, carrying out normalization processing and storing the first feature vector;

and the document selecting module is used for selecting one document with the maximum normalized result as a final result.

6. The pre-trained model-based document extraction system of claim 5, wherein wikipedia data is employed in document set selection.

7. The pre-trained model based document extraction system of claim 5, wherein in the model pre-training step, said pre-training method uses a neural network pre-training method to convert text into feature vectors of said text.

8. The pre-trained model based document extraction system of claim 5, wherein in the model pre-training step, said normalization processing normalizes the result of said inner product operation to the interval (0,1) and makes the normalized result of the first feature vector of a document and the second feature vector corresponding to the document approach 1.