CN111339261A - Document extraction method and system based on pre-training model - Google Patents
Document extraction method and system based on pre-training model
- Publication number
- CN111339261A (application CN202010185622.1A)
- Authority
- CN
- China
- Prior art keywords
- document
- feature vector
- training
- model
- extraction
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/335—Filtering based on additional data, e.g. user or group profiles
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3347—Query execution using vector based model
Abstract
The invention discloses a document extraction method and system based on a pre-trained model, belonging to the technical field of document extraction. The method comprises two steps. The model pre-training step comprises: document characterization, in which the documents are pre-trained on to obtain a feature vector for each document; description characterization, in which a sentence from a document is taken as a description and pre-trained on to obtain its feature vector; computing the inner product of the two feature vectors and normalizing the result; and storing the document feature vectors. The document extraction step comprises: characterizing a new description; computing the inner product of the new description's feature vector with each stored document feature vector and normalizing the results; and selecting the document with the largest normalized value. By pre-training the model on a large volume of high-quality Wikipedia data, the method gives the model rich semantic knowledge, and by applying a neural network pre-training method it reflects the actual relation between the search description and the document, thereby improving the accuracy of document extraction.
Description
Technical Field
The invention relates to the technical field of document selection, in particular to a document extraction method based on a pre-training model.
Background
In the prior art, document extraction means that, given a description, a system retrieves from a very large document library the document that best matches that description. The task is characterized by an extremely large amount of data: each document is relatively long, and the library contains a very large number of documents. It is therefore important to obtain the document matching the search description both accurately and quickly.
Previous document extraction systems mostly work in two steps: first, a set of candidate documents is extracted from the document library; then the document that best matches the description is selected from the candidates. The first step extracts features such as keyword information from each document and compares them with the keywords in the description. The second step extracts finer-grained features from the candidate documents and compares them with the search description to obtain a more precise result. Because the document features used in this process are manually defined, comparing them with the search description does not accurately reflect the actual relationship between the document and the description. The extraction result is therefore inaccurate and fails to meet the needs of the search description.
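The two-step keyword pipeline described above can be illustrated with a minimal Python sketch. It is not taken from the patent: the function names, the word-set features, and the use of Jaccard overlap as the "finer-grained" second-stage score are all assumptions chosen only to show how hand-defined features drive the result.

```python
import re


def keywords(text):
    """Lowercased word set: a hand-defined feature of the kind the prior art relies on."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def candidate_documents(description, library, min_overlap=1):
    """Stage one: keep documents sharing at least min_overlap keywords with the description."""
    query_terms = keywords(description)
    return [doc for doc in library if len(query_terms & keywords(doc)) >= min_overlap]


def best_candidate(description, candidates):
    """Stage two: rank candidates by a finer score (Jaccard overlap, an assumed choice)."""
    query_terms = keywords(description)

    def jaccard(doc):
        doc_terms = keywords(doc)
        union = query_terms | doc_terms
        return len(query_terms & doc_terms) / len(union) if union else 0.0

    # Returns None when stage one produced no candidates.
    return max(candidates, key=jaccard, default=None)
```

A description that paraphrases a document without sharing its words scores zero in both stages, which is the weakness the background section attributes to manually defined features.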
Disclosure of Invention
The invention mainly solves the technical problem of providing a document extraction method and system based on a pre-training model, and improving the accuracy of document extraction.
To achieve the above object, the first technical solution adopted by the invention is a document extraction method based on a pre-trained model, comprising a model pre-training step and a document extraction step. The model pre-training step comprises: selecting a document set and pre-training on the documents in the set to obtain a first feature vector for each document; randomly selecting a sentence from each document and pre-training on it to obtain a second feature vector; computing the inner product of each first feature vector with the second feature vector and normalizing the results; and storing the first feature vectors that the trained model produces for the documents. The document extraction step comprises: inputting a search description into the pre-trained model to obtain a third feature vector; computing the inner product of each first feature vector with the third feature vector and normalizing the results; and selecting the document with the largest normalized value as the final result.
The second technical solution adopted by the invention is a document extraction system based on a pre-trained model, comprising: a text feature vector extraction module, which pre-trains on the documents in a document set to obtain a first feature vector for each document and extracts the feature vector of a search description to obtain a second feature vector; a feature vector operation module, which computes the inner product of each first feature vector with the second feature vector, normalizes the results, and stores the first feature vectors; and a document selecting module, which selects the document with the largest normalized result as the final result.
The beneficial effects of the invention are as follows: pre-training the model on a large volume of high-quality Wikipedia data gives the model rich semantic knowledge; a neural network pre-training method converts text into feature vectors; and operations on those feature vectors reflect the relation between the search description and the document, improving the accuracy of document extraction.
Drawings
FIG. 1 is a schematic flow chart of a document extraction method based on a pre-training model according to the present invention;
FIG. 2 is a schematic diagram of a document extraction system based on a pre-training model according to the present invention.
Detailed Description
The preferred embodiments of the invention are described in detail below with reference to the accompanying drawings, so that those skilled in the art can more readily understand the advantages and features of the invention and so that the scope of protection is clearly defined.
It is noted that the terms first and second in the claims and the description of the present application are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
Referring to fig. 1, which is a schematic flow chart of a document extraction method based on a pre-trained model according to the present invention, in an embodiment of the present invention, the document extraction method based on the pre-trained model includes the following steps:
Step S101, model pre-training step.
Document extraction means that, given a description $q$, the system extracts from the document library $D$ the document $d$ that best matches $q$. In one embodiment of the invention, a document set $M$ is selected from a large volume of high-quality Wikipedia text data as the object of model pre-training. Neural network pre-training is performed on $M$, and the text of each document $d$ in $M$ is converted into a feature vector, denoted $h_d$, which represents the document. Compared with manually defined text features based on keyword information, a document feature vector obtained by neural network pre-training better reflects the characteristics of the document and improves the accuracy of document extraction.
In this embodiment, a sentence $q$ is randomly selected from a document to serve as the search description for that document. The search description $q$ is feature-encoded: the neural network pre-training method converts its text into a feature vector, denoted $h_q$. The inner product $h_q \cdot h_d$ of the search description's feature vector $h_q$ with the feature vector $h_d$ of each document is computed, and the inner product results are normalized to the interval (0,1).
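The random selection of one sentence per document as its search description can be sketched as follows. This is only an illustration: the patent does not say how sentences are split or sampled, so the punctuation-based splitter, the fixed seed, and the function names are assumptions.

```python
import random
import re


def split_sentences(document):
    """Naive splitter on '.', '!' and '?' (an assumption; the source names no splitter)."""
    parts = re.split(r"(?<=[.!?])\s+", document.strip())
    return [part for part in parts if part]


def sample_training_pairs(documents, seed=0):
    """Return (description, document_index) pairs, one random sentence per document."""
    rng = random.Random(seed)  # seeded only so the sketch is reproducible
    pairs = []
    for index, document in enumerate(documents):
        sentences = split_sentences(document)
        if sentences:  # empty documents yield no description
            pairs.append((rng.choice(sentences), index))
    return pairs
```

Each pair records which document a description was drawn from, which is the only supervision signal this pre-training scheme needs.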
In one embodiment of the present invention, the first document $d_1$ is represented by the feature vector $h_{d_1}$, the search description $q_1$ extracted from the first document has the feature vector $h_{q_1}$, and the remaining documents $d_2, d_3, \dots$ have the feature vectors $h_{d_2}, h_{d_3}, \dots$. Taking the inner product of the search description's feature vector with each document's feature vector gives $h_{q_1} \cdot h_{d_1}, h_{q_1} \cdot h_{d_2}, h_{q_1} \cdot h_{d_3}, \dots$. These inner products are normalized to (0,1) by computing the ratio of each inner product to the sum of all inner products. For example, the normalized result of $h_{q_1} \cdot h_{d_1}$ is

$$\frac{h_{q_1} \cdot h_{d_1}}{\sum_{i} h_{q_1} \cdot h_{d_i}}.$$

Because the search description $q_1$ is a sentence in the first document $d_1$, $q_1$ matches $d_1$ most closely, so this normalized result is close to 1, indicating that $d_1$ is the document corresponding to description $q_1$. The normalized result thus also reflects how well a document matches a search description. The same feature inner product and normalization are carried out for the remaining documents in the document set and their descriptions, after which the document extraction model $f$ is fixed.
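As a worked illustration of the inner product and ratio normalization just described, the following Python sketch uses small hand-picked vectors. The vectors and function names are hypothetical; in the patent the vectors come from the pre-trained neural network.

```python
def inner_product(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def ratio_normalize(scores):
    """Divide each inner product by the sum of all inner products."""
    total = sum(scores)
    if total <= 0:
        raise ValueError("ratio normalization needs a positive sum of inner products")
    return [score / total for score in scores]


# Hypothetical feature vectors: the description was drawn from the first document.
description_vector = [3.0, 0.1, 0.1]
document_vectors = [
    [3.0, 0.0, 0.0],  # first document, the source of the description
    [0.0, 1.0, 0.0],  # second document
    [0.0, 0.0, 1.0],  # third document
]

scores = [inner_product(description_vector, d) for d in document_vectors]
normalized = ratio_normalize(scores)
best_index = max(range(len(normalized)), key=normalized.__getitem__)
```

With these toy values the first document receives almost all of the normalized mass. Note that the ratio only stays inside (0,1) when every inner product is positive; a softmax over the inner products is a common alternative with that guarantee, but the source describes the plain ratio.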
Once the document extraction model has been pre-trained and fixed, a new search description received by the model goes through the same sequence of operations as in pre-training, and the document whose normalized result with the search description's feature vector is closest to 1, that is, the document that best matches the search description, is selected from the document set.
Step S102, document extraction step.
In one embodiment of the present invention, after the model pre-training of step S101, the document extraction model has already obtained the feature vectors of all documents in the document set. These document feature vectors are stored and used directly during document extraction, without being recomputed. This further simplifies the model and improves the efficiency of document extraction.
In one embodiment of the present invention, when the document extraction model receives a search description, it converts the text of the search description into a feature vector, computes the inner product of that vector with each stored document feature vector, and normalizes the results to (0,1). The document whose normalized result is closest to 1 is selected as the document that best matches the search description.
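The extraction step, with document vectors computed once and reused for every query, might look like the sketch below. `ToyEncoder` is a deliberately simple word-count stand-in for the pre-trained neural encoder, which the patent does not specify; it and the `DocumentExtractor` class are assumptions made so the example runs with the standard library alone.

```python
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


class ToyEncoder:
    """Stand-in for the neural encoder: word-count vectors over a fixed vocabulary."""

    def __init__(self, corpus):
        vocabulary = sorted({tok for text in corpus for tok in tokenize(text)})
        self.index = {tok: i for i, tok in enumerate(vocabulary)}

    def encode(self, text):
        vector = [0.0] * len(self.index)
        for tok, count in Counter(tokenize(text)).items():
            if tok in self.index:
                vector[self.index[tok]] = float(count)
        return vector


class DocumentExtractor:
    def __init__(self, encoder, documents):
        self.encoder = encoder
        self.documents = list(documents)
        # Document vectors are computed once and stored, not recomputed per query.
        self.document_vectors = [encoder.encode(doc) for doc in self.documents]

    def extract(self, description):
        """Return (best document, normalized scores), or (None, []) if nothing matches."""
        query = self.encoder.encode(description)
        scores = [sum(q * d for q, d in zip(query, vec)) for vec in self.document_vectors]
        total = sum(scores)
        if total <= 0:
            return None, []
        normalized = [score / total for score in scores]
        best = max(range(len(normalized)), key=normalized.__getitem__)
        return self.documents[best], normalized
```

Only the query is encoded at search time, so the per-query cost is one encoding plus one inner product per stored document.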
During document extraction, the document extraction model is pre-trained with the neural network pre-training method on top of the pre-trained model, and the texts of documents and search descriptions are converted into feature vectors that represent their features. Compared with manually defined keyword features, these vectors reflect the relation between the search description and the document more faithfully. Using a large volume of high-quality Wikipedia text as the source of pre-training documents makes the document extraction model more comprehensive and improves the quality of the extraction results.
In one embodiment of the present invention, fig. 2 is a schematic diagram of a document extraction system based on a pre-trained model, in which the document extraction system based on the pre-trained model comprises:
a text feature vector extraction part, which applies the neural network pre-training method to the documents in the document set to obtain the document feature vectors, and applies the same method to a search description to obtain the feature vector of the search description. In one embodiment of the invention, a document set $M$ is selected from a large volume of high-quality Wikipedia text data as the object of model pre-training. Neural network pre-training is performed on $M$, and the text of each document $d$ in $M$ is converted into a feature vector, denoted $h_d$. A sentence $q$ is randomly selected from a document as the search description of that document and is feature-encoded: the neural network pre-training method converts its text into a feature vector, denoted $h_q$. The inner product $h_q \cdot h_d$ of the search description's feature vector with the feature vector of each document is computed, and the inner product results are normalized to (0,1).
In an embodiment of the present invention, as shown in fig. 2, the document extraction system based on the pre-trained model further includes a feature vector operation part, which computes the inner product of the feature vector of each document in the document set with the feature vector of the search description, normalizes the results, stores the first feature vectors, and fixes the document extraction model.
In one embodiment of the invention, the first document $d_1$ is represented by the feature vector $h_{d_1}$, the search description $q_1$ extracted from the first document has the feature vector $h_{q_1}$, and the remaining documents $d_2, d_3, \dots$ have the feature vectors $h_{d_2}, h_{d_3}, \dots$. Taking the inner product of the search description's feature vector with each document's feature vector gives $h_{q_1} \cdot h_{d_1}, h_{q_1} \cdot h_{d_2}, h_{q_1} \cdot h_{d_3}, \dots$. These inner products are normalized to (0,1) by computing the ratio of each inner product to the sum of all inner products. For example, the normalized result of $h_{q_1} \cdot h_{d_1}$ is

$$\frac{h_{q_1} \cdot h_{d_1}}{\sum_{i} h_{q_1} \cdot h_{d_i}}.$$

Because the search description $q_1$ is a sentence in the first document $d_1$, $q_1$ matches $d_1$ most closely, so this normalized result is close to 1, indicating that $d_1$ is the document corresponding to description $q_1$. The normalized result thus also reflects how well a document matches a search description. The same feature inner product and normalization are carried out for the remaining documents in the document set and their descriptions, after which the document extraction model $f$ is fixed.
In one embodiment of the present invention, as shown in fig. 2, the document extraction system based on the pre-trained model further includes a document extraction part. The document extraction model has already obtained the feature vectors of all documents in the document set; these document feature vectors are stored and used directly during document extraction, without being recomputed. This further simplifies the model and improves the efficiency of document extraction. When the document extraction model receives a search description, it converts the text of the search description into a feature vector, computes the inner product of that vector with each stored document feature vector, normalizes the results to (0,1), and selects the document whose normalized result is closest to 1, which is the document that best matches the search description.
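The three parts of the system (feature vector extraction, feature vector operation, document selection) can be pictured as three small cooperating classes. The sketch below is one possible decomposition, not the patent's implementation: the class and method names are hypothetical, and the encoder is passed in as an arbitrary callable because the source does not identify the network.

```python
from typing import Callable, List, Sequence

Vector = List[float]


class TextFeatureVectorExtractor:
    """Wraps whatever encoder maps text to a feature vector (a hypothetical callable)."""

    def __init__(self, encode: Callable[[str], Vector]) -> None:
        self._encode = encode

    def extract(self, text: str) -> Vector:
        return self._encode(text)


class FeatureVectorOperator:
    """Stores document vectors and computes ratio-normalized inner products."""

    def __init__(self) -> None:
        self.stored: List[Vector] = []

    def store(self, vectors: Sequence[Vector]) -> None:
        self.stored = [list(v) for v in vectors]

    def normalized_scores(self, query: Vector) -> List[float]:
        scores = [sum(q * d for q, d in zip(query, doc)) for doc in self.stored]
        total = sum(scores)
        if total <= 0:
            raise ValueError("ratio normalization needs a positive score sum")
        return [score / total for score in scores]


class DocumentSelector:
    """Picks the document with the largest normalized score."""

    def select(self, documents: Sequence[str], scores: Sequence[float]) -> str:
        best = max(range(len(scores)), key=scores.__getitem__)
        return documents[best]


def toy_encode(text: str) -> Vector:
    """Toy stand-in encoder: counts of each vowel in the text."""
    return [float(text.lower().count(vowel)) for vowel in "aeiou"]
```

Wiring them together means encoding the documents once, calling `store`, and then, for each search description, calling `extract`, `normalized_scores`, and `select` in that order.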
Claims (8)
1. A document extraction method based on a pre-training model is characterized by comprising the following steps:
a model pre-training step, comprising:
selecting a document set, and pre-training documents in the document set to obtain a first feature vector of the documents;
respectively and randomly selecting a sentence in the document as a description, and respectively performing the pre-training to obtain a second feature vector;
performing inner product operation on the second feature vectors and the first feature vectors respectively, and performing normalization processing;
the first feature vector obtained by the document through the training model is stored;
a document extraction step comprising:
inputting a search description into the pre-training model to obtain a third feature vector;
respectively carrying out the inner product operation on the first feature vector and the third feature vector and carrying out the normalization processing;
and selecting a document with the maximum value of the normalization result as a final result.
2. The pre-training model-based document extraction method of claim 1, wherein in the document set selection, wikipedia data is adopted.
3. The method for extracting documents based on pre-trained model as claimed in claim 1, wherein in the model pre-training step, said pre-training method adopts neural network pre-training method to convert the text into the feature vector of said text.
4. The method for extracting documents based on pre-training model as claimed in claim 1, wherein in the model pre-training step, the normalization process normalizes the inner product operation result to between (0,1), and makes the normalized result of a first feature vector of a document and a second feature vector corresponding to the document approach to 1.
5. A system for extracting documents based on a pre-trained model, comprising:
the text feature vector extraction module is used for pre-training the documents in the document set to obtain a first feature vector of the documents, and extracting the feature vector of the search description to obtain a second feature vector;
the feature vector operation module is used for respectively carrying out inner product operation on the first feature vector and the second feature vector, carrying out normalization processing and storing the first feature vector;
and the document selecting module is used for selecting one document with the maximum normalized result as a final result.
6. The pre-trained model-based document extraction system of claim 5, wherein wikipedia data is employed in document set selection.
7. The pre-trained model based document extraction system of claim 5, wherein in the model pre-training step, said pre-training method uses a neural network pre-training method to convert text into feature vectors of said text.
8. The pre-trained model based document extraction system of claim 5, wherein in the model pre-training step, said normalization process normalizes the result of said inner product operation to between (0,1) and makes the normalized result of a first feature vector of a document and a second feature vector corresponding to the document approach to 1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010185622.1A CN111339261A (en) | 2020-03-17 | 2020-03-17 | Document extraction method and system based on pre-training model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111339261A (en) | 2020-06-26 |
Family
ID=71186118
Family Applications (1)
Application Number | Status | Priority Date | Filing Date | Title |
---|---|---|---|---|
CN202010185622.1A CN111339261A (en) | Pending | 2020-03-17 | 2020-03-17 | Document extraction method and system based on pre-training model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111339261A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||