CN109800303A

CN109800303A - Document information extraction method, storage medium and terminal

Info

Publication number: CN109800303A
Application number: CN201811621569.4A
Authority: CN
Inventors: 陈满棠
Original assignee: Shenzhen Strong Component Network Co Ltd
Current assignee: Shenzhen Strong Component Network Co Ltd
Priority date: 2018-12-28
Filing date: 2018-12-28
Publication date: 2019-05-24

Abstract

The present invention relates to a document information extraction method, a storage medium and a terminal. The method comprises: obtaining the text information and text position information of a document, the text information corresponding to the text position information; extracting keywords from the text information using a training morpheme classification model; and setting a hyperlink corresponding to each keyword. The keyword, its corresponding hyperlink, its corresponding text position information, the document attribute information of the document containing the keyword, and the keyword classification are then stored. The invention can extract technical term keywords, product keywords, category keywords and attribute keywords from information documents in a vertical field, making document information lookup more precise, improving search matching, and improving the user's search experience.

Description

Document information extraction method, storage medium and terminal

Technical field

The present invention relates to the field of document retrieval, and more specifically to a document information extraction method, a storage medium and a terminal.

Background technique

There are currently two methods for extracting text from information documents. One uses OCR technology: the information document is converted into an image, and the result is output after layout analysis, line and character segmentation, and character recognition. The other parses the information document, extracts the text information, and outputs the result directly. Both methods, however, focus on extracting the text of the information document. They do not describe the vertical-field technical term keywords, product keywords, category keywords and attribute keywords of the original document content, nor the relationships between keywords. This has become a bottleneck restricting information retrieval in vertical industry fields. Research on extracting information from information documents is therefore particularly important.

Summary of the invention

The technical problem to be solved by the present invention is to provide a document information extraction method, a storage medium and a terminal that address the above drawbacks of the prior art.

The technical solution adopted by the present invention to solve this technical problem is to construct a document information extraction method, comprising:

obtaining the text information and text position information of a document, the text information corresponding to the text position information;

extracting keywords from the text information using a training morpheme classification model;

setting a hyperlink corresponding to each keyword.

Further, in the document information extraction method of the present invention, the document is a PDF document, and obtaining the text information and text position information of the document comprises:

recognizing the text information in the PDF document using optical character recognition, while obtaining the position information of the text information within a page of the document and its page number information.

Further, in the document information extraction method of the present invention, the text position information comprises x-axis information, y-axis information and z-axis information of the text information, wherein the x-axis information and y-axis information are the position information of the text information within a page of the document, and the z-axis information is the page number information of the text information in the document.

Further, in the document information extraction method of the present invention, extracting keywords from the text information using the training morpheme classification model comprises:

extracting keywords from the text information using the training morpheme list in the training morpheme classification model, the parts of speech of the training morpheme list, the correlation between the training morpheme list and a preset resource, and preset target morphemes.

Further, in the document information extraction method of the present invention, after extracting keywords from the text information using the training morpheme classification model and before setting the hyperlink corresponding to each keyword, the method further comprises:

performing keyword decoding and keyword classification on the keywords, wherein keyword decoding refers to data decoding according to the file structure of the document, and keyword classification refers to classification according to preset classification modes, the preset classification modes comprising a technical term keyword mode, a product keyword mode, a category keyword mode and an attribute keyword mode.

Further, in the document information extraction method of the present invention, after setting the hyperlink corresponding to each keyword, the method further comprises:

storing the keyword, the hyperlink corresponding to the keyword, the text position information corresponding to the keyword, the document attribute information of the document containing the keyword, and the keyword classification, wherein the document attribute information comprises the document title, document generation date and document version number.

Further, in the document information extraction method of the present invention, after storing the keyword, the hyperlink corresponding to the keyword, the text position information corresponding to the keyword, the document attribute information of the document containing the keyword, and the keyword classification, the method further comprises:

receiving a keyword;

looking up the search result corresponding to the keyword, the search result comprising the document title, document generation date, document version number, keyword, the text position information corresponding to the keyword, and the hyperlink corresponding to the keyword.

Further, in the document information extraction method of the present invention, after looking up the search result corresponding to the keyword, the method further comprises:

opening the document containing the keyword according to the hyperlink, and locating and displaying the position of the keyword according to the text position information corresponding to the keyword.

In addition, the present invention also provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the document information extraction method described above.

In addition, the present invention also provides a terminal comprising a processor; when executing a computer program stored in a memory, the processor implements the steps of the document information extraction method described above.

Implementing the document information extraction method, storage medium and terminal of the present invention has the following beneficial effects. The method comprises: obtaining the text information and text position information of a document, the text information corresponding to the text position information; extracting keywords from the text information using a training morpheme classification model; and setting a hyperlink corresponding to each keyword. The keyword, its corresponding hyperlink, its corresponding text position information, the document attribute information of the document containing the keyword, and the keyword classification are stored. The invention can extract technical term keywords, product keywords, category keywords and attribute keywords from information documents in a vertical field, making document information lookup more precise, improving search matching, and improving the user's search experience.

Detailed description of the invention

The present invention is further described below with reference to the accompanying drawings and embodiments, in which:

Fig. 1 is a flow chart of the document information extraction method provided by one embodiment of the present invention;

Fig. 2 is a flow chart of the document information extraction method provided by one embodiment of the present invention;

Fig. 3 is a flow chart of the document information extraction method provided by one embodiment of the present invention;

Fig. 4 is a structural schematic diagram of a terminal of the present invention.

Specific embodiment

For a clearer understanding of the technical features, objects and effects of the present invention, specific embodiments of the invention are now described in detail with reference to the accompanying drawings.

Embodiment

As shown in Fig. 1, the document information extraction method of this embodiment comprises:

S1: Obtain the text information and text position information of the document; the text information corresponds to the text position information. Optionally, the document includes, but is not limited to, a Word document, PDF document, Excel document, TXT document, PPT document or WPS document, and the document contains text information. Each piece of text information in the document has corresponding text position information, by which the text information can be located. Preferably, the document is a PDF document, and obtaining the text information and text position information of the document comprises: recognizing the text information in the PDF document using optical character recognition, while obtaining the position information of the text information within a page of the document and its page number information.

Further, a coordinate system comprising an x-axis, a y-axis and a z-axis is established in the document. The x-axis and y-axis lie within each page of the document and locate the position of the text information on the page; the z-axis represents the page number of the document and locates the page on which the text information appears. Each piece of text position information obtained therefore comprises the x-axis information, y-axis information and z-axis information of the text information, where the x-axis information and y-axis information are the position of the text information within a page of the document and the z-axis information is the page number of the text information in the document. The position of the text information in the document can be located quickly and accurately from the x-axis, y-axis and z-axis information.
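The coordinate scheme described above can be pictured with a small data structure. The following is a minimal sketch only: the names `TextPosition`, `TextItem` and `find_positions` are hypothetical, the coordinate values are made up, and the source does not specify units or the origin of the x and y axes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextPosition:
    x: float  # horizontal position within the page (units are an assumption)
    y: float  # vertical position within the page (units are an assumption)
    z: int    # page number of the page containing the text


@dataclass(frozen=True)
class TextItem:
    text: str               # one piece of text information
    position: TextPosition  # its corresponding text position information


def find_positions(items, text):
    """Return every position at which `text` occurs, in document order."""
    return [item.position for item in items if item.text == text]


# Illustrative data: the same text appearing on pages 1 and 3.
items = [
    TextItem("capacitor", TextPosition(72.0, 540.5, 1)),
    TextItem("rated voltage", TextPosition(72.0, 510.0, 1)),
    TextItem("capacitor", TextPosition(90.0, 300.0, 3)),
]
```

Because each piece of text carries its own (x, y, z) triple, a lookup returns both the page and the in-page location in a single step.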

S2: Extract keywords from the text information using the training morpheme classification model. The training morpheme classification model is obtained by training on a training corpus containing various training morphemes. It comprises a training morpheme list, the parts of speech of the training morpheme list, the correlation between the training morpheme list and a preset resource, and preset target morphemes. Extracting keywords from the text information using the training morpheme classification model therefore comprises: extracting keywords from the text information using the training morpheme list in the model, the parts of speech of the training morpheme list, the correlation between the training morpheme list and the preset resource, and the preset target morphemes.
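The source does not say how the four components of the model are combined. As one possible reading, and purely as an illustrative sketch, the code below treats the model as a lookup table and keeps a morpheme only if it is a preset target, has an allowed part of speech, and meets a relevance threshold. The names `MorphemeEntry` and `extract_keywords`, the threshold of 0.5, the restriction to nouns, and the substring matching are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MorphemeEntry:
    morpheme: str
    part_of_speech: str
    relevance: float  # correlation with the preset resource, assumed in [0, 1]


def extract_keywords(text, entries, target_morphemes,
                     allowed_pos=("noun",), min_relevance=0.5):
    """Return the listed morphemes that occur in `text` and pass all filters."""
    lowered = text.lower()
    keywords = []
    for entry in entries:
        if entry.morpheme not in target_morphemes:
            continue  # not one of the preset target morphemes
        if entry.part_of_speech not in allowed_pos:
            continue  # part of speech filtered out
        if entry.relevance < min_relevance:
            continue  # too weakly related to the preset resource
        if entry.morpheme.lower() in lowered and entry.morpheme not in keywords:
            keywords.append(entry.morpheme)
    return keywords


entries = [
    MorphemeEntry("capacitor", "noun", 0.9),
    MorphemeEntry("rated voltage", "noun", 0.8),
    MorphemeEntry("install", "verb", 0.7),
    MorphemeEntry("page", "noun", 0.1),
]
targets = {"capacitor", "rated voltage", "install", "page"}
sample = "Install the capacitor; its rated voltage is listed on page 3."
found = extract_keywords(sample, entries, targets)
```

In this sample, "install" is dropped by the part-of-speech filter and "page" by the relevance threshold, leaving the two domain terms.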

Optionally, in the document information extraction method of this embodiment, after extracting keywords from the text information using the training morpheme classification model and before setting the hyperlink corresponding to each keyword, the method further comprises:

performing keyword decoding and keyword classification on the keywords, wherein keyword decoding refers to data decoding according to the file structure of the document, and keyword classification refers to classification according to preset classification modes, the preset classification modes comprising a technical term keyword mode, a product keyword mode, a category keyword mode and an attribute keyword mode.
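The four preset classification modes can be sketched as a simple vocabulary lookup. This is an assumption made for illustration: the source names the four modes but not how a keyword is assigned to one. The vocabularies, the made-up product name `ACME-C100`, and the names `KeywordCategory` and `classify_keyword` are hypothetical. Keyword decoding is not shown because the source gives no detail beyond its dependence on the file structure.

```python
from enum import Enum


class KeywordCategory(Enum):
    TECHNICAL_TERM = "technical term"
    PRODUCT = "product"
    CATEGORY = "category"
    ATTRIBUTE = "attribute"


# Hypothetical vocabularies standing in for the preset classification modes.
PRESET_MODES = {
    KeywordCategory.TECHNICAL_TERM: {"equivalent series resistance"},
    KeywordCategory.PRODUCT: {"ACME-C100"},
    KeywordCategory.CATEGORY: {"capacitor", "resistor"},
    KeywordCategory.ATTRIBUTE: {"rated voltage", "tolerance"},
}


def classify_keyword(keyword):
    """Return the first category whose vocabulary contains `keyword`, else None."""
    for category, vocabulary in PRESET_MODES.items():
        if keyword in vocabulary:
            return category
    return None
```

A keyword absent from every vocabulary returns `None`, so the caller decides whether to discard it or store it unclassified.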

S3: Set the hyperlink corresponding to each keyword. A hyperlink is set for every keyword extracted from the text information, with keywords and hyperlinks in one-to-one correspondence. Each hyperlink contains the text position information of the corresponding text information, so the position of the keyword in the document can be located quickly through the hyperlink.

This embodiment can extract technical term keywords, product keywords, category keywords and attribute keywords from information documents in a vertical field, making document information lookup more precise.
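One way a hyperlink could carry the text position information is as URL query parameters. The sketch below assumes that encoding; the source does not define the hyperlink format, and the parameter names `page`, `x` and `y`, the function names, and the example URL are hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_hyperlink(document_url, x, y, page):
    """Append the (x, y, z) position to the document URL as query parameters."""
    return f"{document_url}?{urlencode({'page': page, 'x': x, 'y': y})}"


def parse_hyperlink(link):
    """Recover the document URL and the position from a link built above."""
    parts = urlsplit(link)
    query = parse_qs(parts.query)
    return {
        "document": f"{parts.scheme}://{parts.netloc}{parts.path}",
        "page": int(query["page"][0]),
        "x": float(query["x"][0]),
        "y": float(query["y"][0]),
    }


link = build_hyperlink("https://example.com/docs/datasheet.pdf", 72.0, 540.5, 1)
target = parse_hyperlink(link)
```

Keeping the position inside the link means a viewer needs nothing beyond the link itself to open the document and jump to the keyword.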

Embodiment

As shown in Fig. 2, on the basis of the above embodiment, the document information extraction method of this embodiment further comprises, after setting the hyperlink corresponding to each keyword, a step of storing the extracted information:

S4: Establish a database and store the keyword, the hyperlink corresponding to the keyword, the text position information corresponding to the keyword, the document attribute information of the document containing the keyword, and the keyword classification, wherein the document attribute information comprises the document title, document generation date and document version number. In the database, each keyword together with its corresponding hyperlink, its corresponding text position information, the document attribute information of the document containing it, and its keyword classification forms one storage record. During later retrieval, the keyword serves as the retrieval matching object, and the entire storage record can be obtained by keyword matching. It will be understood that, because multiple keywords may exist in the same document and the same keyword may exist in different documents, the same keyword may have multiple storage records.

Optionally, the database may be stored on a separately provided server or set up on a cloud platform.

This embodiment can extract technical term keywords, product keywords, category keywords and attribute keywords from information documents in a vertical field and establish a dedicated database, making document information lookup more precise.
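A storage record of the kind described in S4 can be sketched with an in-memory SQLite table. The table name, column names and sample values are hypothetical, and the source does not name a database engine; SQLite is used here only because it ships with Python.

```python
import sqlite3


def create_database():
    """Create an in-memory database with one row per storage record."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE keyword_records (
            keyword        TEXT    NOT NULL,
            hyperlink      TEXT    NOT NULL,
            x              REAL    NOT NULL,
            y              REAL    NOT NULL,
            page           INTEGER NOT NULL,
            doc_title      TEXT    NOT NULL,
            doc_generated  TEXT    NOT NULL,
            doc_version    TEXT    NOT NULL,
            classification TEXT    NOT NULL
        )
        """
    )
    # The keyword is the retrieval matching object, so index it.
    conn.execute("CREATE INDEX idx_keyword ON keyword_records (keyword)")
    return conn


def store_record(conn, record):
    """Insert one storage record given as a 9-tuple in column order."""
    conn.execute(
        "INSERT INTO keyword_records VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)", record
    )
    conn.commit()


conn = create_database()
store_record(conn, ("capacitor", "https://example.com/a.pdf?page=1", 72.0, 540.5, 1,
                    "Datasheet A", "2018-01-05", "1.0", "category"))
store_record(conn, ("capacitor", "https://example.com/b.pdf?page=3", 90.0, 300.0, 3,
                    "Datasheet B", "2018-06-30", "2.1", "category"))
count = conn.execute(
    "SELECT COUNT(*) FROM keyword_records WHERE keyword = ?", ("capacitor",)
).fetchone()[0]
```

The keyword column is deliberately not unique, which reflects the point above that one keyword may have multiple storage records.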

Embodiment

As shown in Fig. 3, on the basis of the above embodiment, the document information extraction method of this embodiment further comprises, after storing the keyword, the hyperlink corresponding to the keyword, the text position information corresponding to the keyword, the document attribute information of the document containing the keyword, and the keyword classification, the following retrieval steps:

S5: Receive a keyword. Optionally, the keyword may be received through an input device, received and recognized through a voice receiving device, or obtained by scanning the barcode or QR code of an electronic component with a camera.

S6: Look up the search result corresponding to the keyword. The lookup proceeds as follows: matching is used to determine whether the received keyword is in the database. If the received keyword matches a keyword in the database, the storage record corresponding to that keyword is read to obtain the search result. If the received keyword matches no keyword in the database, no data exists for that keyword. The search result includes the document title, document generation date, document version number, keyword, the text position information corresponding to the keyword, and the hyperlink corresponding to the keyword.
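The lookup in S6 can be sketched as an exact-match index from keyword to storage records. The class name `KeywordIndex`, the record fields and the sample data are hypothetical; the source does not say whether matching is exact or fuzzy, so exact matching is assumed here.

```python
from collections import defaultdict


class KeywordIndex:
    """Exact-match index from a keyword to its storage records."""

    def __init__(self):
        self._records = defaultdict(list)

    def add(self, keyword, record):
        """Store one record under `keyword`; a keyword may have several."""
        self._records[keyword].append(record)

    def search(self, keyword):
        """Return all records for `keyword`, or an empty list if none match."""
        return list(self._records.get(keyword, []))


index = KeywordIndex()
index.add("capacitor", {
    "title": "Datasheet A", "generated": "2018-01-05", "version": "1.0",
    "position": (72.0, 540.5, 1), "hyperlink": "https://example.com/a.pdf?page=1",
})
index.add("capacitor", {
    "title": "Datasheet B", "generated": "2018-06-30", "version": "2.1",
    "position": (90.0, 300.0, 3), "hyperlink": "https://example.com/b.pdf?page=3",
})
hits = index.search("capacitor")
misses = index.search("inductor")
```

An empty result corresponds to the case in which the received keyword matches nothing in the database.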

Optionally, in the document information extraction method of this embodiment, after looking up the search result corresponding to the keyword, the method further comprises a step of displaying the search result:

S7: Open the document containing the keyword according to the hyperlink, and locate and display the position of the keyword according to the text position information corresponding to the keyword. Each piece of text position information comprises the x-axis information, y-axis information and z-axis information of the text information, where the x-axis information and y-axis information are the position of the text information within a page of the document and the z-axis information is the page number of the text information in the document. The position of the text information in the document can be located quickly and accurately from the x-axis, y-axis and z-axis information.

Optionally, if the search result contains multiple keyword records, the results are displayed in a preset order, for example by document generation date, by the order in which the keyword appears in the document, or by the frequency with which the keyword appears in each document, with keywords in documents of higher frequency displayed first. The display windows may be arranged in a stacked arrangement, tiled horizontally, tiled vertically, or arranged in a checkerboard pattern. Multiple keywords in the same document may be displayed by splitting the display window.

Optionally, after the position of the keyword has been located and displayed, the keyword may be emphasized by highlighting, underlining, background color or similar means, making it easier for the user to view.

This embodiment can extract technical term keywords, product keywords, category keywords and attribute keywords from information documents in a vertical field and retrieve by keyword, making document information lookup more precise, improving search matching, and improving the user's search experience.
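Two of the preset orderings mentioned above can be sketched as follows. The record fields and function names are hypothetical. Sorting newest first is an assumption, since the source does not give a direction for the date ordering, and the dates are assumed to be ISO-formatted strings so that they sort lexicographically.

```python
from collections import Counter


def sort_by_date(results):
    """Order results by document generation date, newest first (assumed)."""
    return sorted(results, key=lambda r: r["generated"], reverse=True)


def sort_by_frequency(results):
    """Order results so documents containing the keyword most often come first.

    Within one document, hits keep page order.
    """
    counts = Counter(r["title"] for r in results)
    return sorted(results, key=lambda r: (-counts[r["title"]], r["page"]))


results = [
    {"title": "A", "generated": "2018-01-05", "page": 2},
    {"title": "B", "generated": "2018-06-30", "page": 1},
    {"title": "A", "generated": "2018-01-05", "page": 7},
]
by_date = sort_by_date(results)
by_frequency = sort_by_frequency(results)
```

Here document A contains the keyword twice, so both of its hits precede the single hit in document B under the frequency ordering, while the date ordering puts the more recent document B first.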

Optionally, the document information extraction methods described above are applied to electronic component documents. Electronic component documents here include component parameter documents, component operating instruction documents, order documents, component circuit documents and the like.

This embodiment also provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the document information extraction method described above.

Embodiment

As shown in Fig. 4, this embodiment also provides a terminal comprising a processor; when executing a computer program stored in a memory, the processor implements the steps of the document information extraction method described above. Optionally, the terminal includes, but is not limited to, a smartphone, tablet computer, laptop, desktop computer or server.

The present invention can extract technical term keywords, product keywords, category keywords and attribute keywords from information documents in a vertical field, making document information lookup more precise, improving search matching, and improving the user's search experience.

The embodiments in this specification are described progressively; each embodiment focuses on its differences from the others, and the same or similar parts of the embodiments may be cross-referenced. Because the device disclosed in an embodiment corresponds to the method disclosed in an embodiment, its description is relatively brief; for relevant details, refer to the description of the method.

Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To illustrate the interchangeability of hardware and software clearly, the composition and steps of each example have been described above in general terms according to function. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Skilled persons may implement the described functions in different ways for each specific application, but such implementations should not be considered beyond the scope of the present invention.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.

The above embodiments merely illustrate the technical concept and features of the present invention. Their purpose is to enable those skilled in the art to understand and implement the invention; they do not limit its scope of protection. All equivalent changes and modifications made within the scope of the claims of the present invention fall within the coverage of the claims.

Claims

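1. A document information extraction method, characterized by comprising:

obtaining the text information and text position information of a document, the text information corresponding to the text position information;

extracting keywords from the text information using a training morpheme classification model;

setting a hyperlink corresponding to each keyword.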

2. The document information extraction method according to claim 1, characterized in that the document is a PDF document, and obtaining the text information and text position information of the document comprises:

recognizing the text information in the PDF document using optical character recognition, while obtaining the position information of the text information within a page of the document and its page number information.

3. The document information extraction method according to claim 1, characterized in that the text position information comprises x-axis information, y-axis information and z-axis information of the text information, wherein the x-axis information and y-axis information are the position information of the text information within a page of the document, and the z-axis information is the page number information of the text information in the document.

4. The document information extraction method according to claim 1, characterized in that extracting keywords from the text information using the training morpheme classification model comprises:

extracting keywords from the text information using the training morpheme list in the training morpheme classification model, the parts of speech of the training morpheme list, the correlation between the training morpheme list and a preset resource, and preset target morphemes.

5. The document information extraction method according to claim 1, characterized in that, after extracting keywords from the text information using the training morpheme classification model and before setting the hyperlink corresponding to each keyword, the method further comprises:

performing keyword decoding and keyword classification on the keywords, wherein keyword decoding refers to data decoding according to the file structure of the document, and keyword classification refers to classification according to preset classification modes, the preset classification modes comprising a technical term keyword mode, a product keyword mode, a category keyword mode and an attribute keyword mode.

6. The document information extraction method according to claim 5, characterized in that, after setting the hyperlink corresponding to each keyword, the method further comprises:

storing the keyword, the hyperlink corresponding to the keyword, the text position information corresponding to the keyword, the document attribute information of the document containing the keyword, and the keyword classification, wherein the document attribute information comprises the document title, document generation date and document version number.

7. The document information extraction method according to claim 6, characterized in that, after storing the keyword, the hyperlink corresponding to the keyword, the text position information corresponding to the keyword, the document attribute information of the document containing the keyword, and the keyword classification, the method further comprises:

receiving a keyword;

looking up the search result corresponding to the keyword, the search result comprising the document title, document generation date, document version number, keyword, the text position information corresponding to the keyword, and the hyperlink corresponding to the keyword.

8. The document information extraction method according to claim 7, characterized in that, after looking up the search result corresponding to the keyword, the method further comprises:

opening the document containing the keyword according to the hyperlink, and locating and displaying the position of the keyword according to the text position information corresponding to the keyword.

9. A computer-readable storage medium on which a computer program is stored, characterized in that, when executed by a processor, the computer program implements the document information extraction method according to any one of claims 1-8.

10. A terminal, characterized in that the terminal comprises a processor, and the processor, when executing a computer program stored in a memory, implements the steps of the document information extraction method according to any one of claims 1-8.