CN116227438A

CN116227438A - Knowledge document structure understanding and converting device

Info

Publication number: CN116227438A
Application number: CN202310145977.1A
Authority: CN
Inventors: 赵涛涛; 周松柏
Original assignee: Zhongke Guoli Zhenjiang Intelligent Technology Co ltd
Current assignee: Zhongke Guoli Zhenjiang Intelligent Technology Co ltd
Priority date: 2023-02-21
Filing date: 2023-02-21
Publication date: 2023-06-06

Abstract

The invention discloses a knowledge document structure understanding and converting device comprising seven modules. Module 1: if the input is a Word or PDF knowledge document, convert it into an HTML knowledge document; module 2: convert the HTML knowledge document into a sequence of plain text, tables and pictures; module 3: identify the titles in the plain text, table and picture sequence of the HTML knowledge document; module 4: identify the body text of each title in the plain text, table and picture sequence of the HTML knowledge document; module 5: generate an analysis graph of the HTML knowledge document; module 6: normalize the title hierarchy of the HTML knowledge document; module 7: convert the HTML knowledge document to a normalized knowledge document template. By means of natural language understanding, deep learning and related technologies, the invention improves the construction efficiency of the intelligent knowledge base and reduces the manual workload of converting historical knowledge documents into the intelligent knowledge base.

Description

Knowledge document structure understanding and converting device

Technical Field

The invention relates to the field of knowledge document natural language processing, in particular to a knowledge document structure understanding and converting device.

Background

In recent years, artificial intelligence technology has matured and attracted attention across industries, and artificial intelligence techniques and products are increasingly being put to practical use. In enterprise knowledge management and service, natural language understanding technology is now being introduced into knowledge document management to improve knowledge service efficiency and reduce knowledge service cost, forming an intelligent knowledge base.

The intelligent knowledge base is an intelligent enterprise knowledge management tool. It manages document knowledge structurally, converting unstructured free text into structured knowledge that is defined to a standard, multi-granular and multi-level, and thereby supports services such as fast location and search of fine-grained knowledge and recommendation of related knowledge. In practical application, when a user searches the knowledge base by keyword, the system ranks results by weight and returns the relevant knowledge document items rather than whole knowledge documents, which greatly improves the quality of the knowledge document search service. Likewise, after understanding the user's intention, an intelligent question-answering robot can directly return the specific, shorter content within a knowledge document rather than the whole knowledge document or a link to it.

Although the structured document knowledge base has many advantages in application, collecting knowledge for it is more labor intensive. The knowledge collector cannot upload a freely written document as a whole, but must enter the knowledge document contents piece by piece according to a predefined knowledge document template and its entries, which adds a great deal of manual work. In addition, enterprises usually hold a large number of historical knowledge documents, which must also be converted into structured documents when the knowledge base is upgraded, requiring a large amount of manual editing and conversion work.

In particular, various enterprises (especially large enterprises) commonly face three technical difficulties when collecting and upgrading unstructured non-canonical knowledge documents to a structured document intelligent knowledge base:

Technical difficulty 1: how can an enterprise's existing knowledge documents be entered into the knowledge base quickly? For most knowledge bases, either the content is entered manually from the knowledge document, or the original knowledge document is uploaded and then previewed directly or viewed as an attachment. Knowledge documents entered in this way do not support search services well, consume a large amount of human resources, and contribute nothing to making the knowledge base intelligent.

Technical difficulty 2: the titles of knowledge documents are a particular problem. Title errors degrade the presentation quality of a knowledge document and hinder users from reading and understanding it, so they need to be identified and corrected. From a survey of a large number of knowledge documents, the inventors found that knowledge document titles have several common editing errors:

(1) A title sequence number is missing; for example, the title sequence numbers run one, two, four, five, so the number three is missing.

(2) The title sequence numbers are discontinuous; for example, the title sequence numbers run one, two, four, five, so the sequence breaks after two.

(3) The titles are not uniform or standard, mainly in that some titles carry a colon and explanatory text.

(4) A title has no subordinate subtitle and no body text.

Technical difficulty 3: how can an enterprise's historical knowledge documents be structured quickly and entered into the knowledge base automatically? Most enterprises hold a large number of historical knowledge documents that serve their customers or internal employees. However, for lack of specifications and strict management, these historical knowledge documents suffer from problems such as non-uniform structure, disordered titles, and unclear title wording. Converting these non-canonical, unstructured knowledge documents uniformly and clearly onto templates is a further technical difficulty that enterprises face in actual production and service.

Disclosure of Invention

In view of the problems in the prior art, the invention provides a knowledge document structure understanding and converting device that automatically converts unstructured knowledge documents into structured documents, applies several automatic identification and correction methods to title errors in the knowledge documents, structures the knowledge documents, and stores them in a knowledge base.

The aim of the invention is achieved by the following technical scheme.

A knowledge document structure understanding and converting apparatus comprising:

module 1: converting the input Word or PDF knowledge document into an HTML knowledge document;

module 2: converting the HTML knowledge document into a plain text, a table and a picture sequence;

module 3: identifying the titles in the plain text, the table and the picture sequence of the HTML knowledge document;

module 4: identifying the text of the title in the sequence of the plain text, the table and the picture of the HTML knowledge document;

module 5: generating an analysis graph of the HTML knowledge document;

module 6: normalizing the title hierarchy of the HTML knowledge document;

module 7: converting the HTML knowledge document to a normalized knowledge document template.

Module 1 calls the API provided by POI to convert a Word or PDF knowledge document input by the user directly into an HTML document;

module 2 receives the HTML document output by module 1 and processes it with the Jsoup tool to form a plain text, table and picture sequence;

module 3 divides the titles of the HTML knowledge document into titles with a primary sequence number, titles with a non-primary sequence number, catalog titles, emphasized titles, step titles, FAQ titles and no-sequence-number titles, and identifies each category separately;

module 4 matches the titles at every level of the knowledge document with their corresponding body text during knowledge document understanding;

module 5 outputs an analysis graph of the knowledge document, in which each node consists of a title in the document, its level in the knowledge document and its classification category, and each edge is labeled "lower-level" or "prior";

module 6 converts all lower-level titles of a non-canonical knowledge document title, together with the title bodies of those lower-level titles, into the title body of that non-canonical title.

Step 2-1: for the i-th table in the HTML document, take out the table in full and denote it Tab_i; the tables taken out, in their order of appearance in the HTML document, are denoted Tab_1, Tab_2, … and are called the table sequence of the HTML document;

step 2-2: for the j-th picture in the HTML document, take out the picture in full and denote it IMG_j; the pictures taken out, in their order of appearance in the HTML document, are denoted IMG_1, IMG_2, … and are called the picture sequence of the HTML document;

step 2-3: for the k-th plain text paragraph in the HTML document, take out the paragraph in full using Jsoup and denote it TXT_k; the plain text paragraphs taken out, in their order of appearance in the HTML document, are denoted TXT_1, TXT_2, … and are called the plain text paragraph sequence of the HTML document;

step 2-4: arrange the tables, pictures and plain text paragraphs taken out in steps 2-1, 2-2 and 2-3 in their order of appearance in the HTML document to form the content sequence B_1, B_2, …, B_m, …, B_n, called the plain text, table and picture sequence of the HTML document, where each B_m is a table in the table sequence, a picture in the picture sequence, or a plain text paragraph in the plain text paragraph sequence of the HTML document;

step 2-5: return B_1, B_2, …, B_m, …, B_n.
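As an illustration only, steps 2-1 to 2-5 can be sketched in Python. The sketch uses the standard-library html.parser as a stand-in for Jsoup, keeps only the cell text of each table and the src attribute of each picture, and treats each <p> element as one plain text paragraph; the names BlockSequencer and to_sequence are hypothetical and are not part of the invention.

```python
from html.parser import HTMLParser


class BlockSequencer(HTMLParser):
    """Collects plain text paragraphs, tables and pictures in document order."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # content sequence B_1 ... B_n as (kind, content)
        self._table_depth = 0  # > 0 while inside a <table>
        self._in_paragraph = False
        self._buffer = []      # text collected for the current block

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self._table_depth == 0:
                self._buffer = []
            self._table_depth += 1
        elif self._table_depth == 0 and tag == "img":
            self.blocks.append(("picture", dict(attrs).get("src", "")))
        elif self._table_depth == 0 and tag == "p":
            self._in_paragraph = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "table" and self._table_depth > 0:
            self._table_depth -= 1
            if self._table_depth == 0:
                # The table is taken out as one block; only its text is kept here.
                text = " ".join(" ".join(self._buffer).split())
                self.blocks.append(("table", text))
        elif tag == "p" and self._in_paragraph and self._table_depth == 0:
            self._in_paragraph = False
            text = "".join(self._buffer).strip()
            if text:
                self.blocks.append(("text", text))

    def handle_data(self, data):
        if self._in_paragraph or self._table_depth > 0:
            self._buffer.append(data)


def to_sequence(html):
    """Return the plain text, table and picture sequence of an HTML string."""
    parser = BlockSequencer()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

Because blocks are appended as the parser meets them, the returned list already follows the order of appearance required by step 2-4.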

Step 3-1: take the sequence B_1, B_2, …, B_m, …, B_n output by module 2 and set DocSeq = B_1, B_2, …, B_m, …, B_n, EOS, where EOS is a special marker indicating the end of the sequence, introduced to simplify the processing of module 3; set B′ = B_1 and move = 1;

step 3-2: if B′ = EOS, set the tag of B′ to "*", return DocSeq, and end the method;

step 3-3: if B′ is a table, set DocSeq = B_1, B_2, …, B_{move-1}, B′/table, B_{move+1}, …, B_n, EOS, set B′ = B_{move+1} and move = move + 1, and go to step 3-2;

step 3-4: if B′ is a picture, set DocSeq = B_1, B_2, …, B_{move-1}, B′/picture, B_{move+1}, …, B_n, EOS, set B′ = B_{move+1} and move = move + 1, and go to step 3-2;

step 3-5: if B′ is a plain text paragraph and B′ is matched by a recognition pattern Pat from any of the 7 title categories, denote the title category recognized by Pat as Tag and perform the following sub-steps:

step 3-5-1: if B′ is a title with a primary sequence number and Tag ≠ […], set DocSeq = B_1/Tag_1, B_2/Tag_2, …, B_{move-1}/Tag_{move-1}, B′/Tag, B_{move+1}, …, B_n, EOS, set B′ = B_{move+1} and move = move + 1, and go to step 3-2;

step 3-5-2: if B′ is completely matched by Pat, Tag = title with primary sequence number, and B_move can be matched by <character string 1> or <character string 2>, set DocSeq = B_1/Tag_1, B_2/Tag_2, …, B_{move-1}/Tag_{move-1}, <character string 1>/title with primary sequence number, <character string 1>/title text, B_{move+1}/Tag_{move+1}, …, B_n/Tag_n, EOS, set B′ = B_{move+1} and move = move + 2, and go to step 3-2;

step 3-5-3: split B′ into two parts, B′ = B′_1 + B′_2, where B′_1 is completely matched by Pat and B′_2 is the remainder of B′ not matched by Pat; set DocSeq = B_1/Tag_1, B_2/Tag_2, …, B_{move-1}/Tag_{move-1}, B′_1/Tag, B′_2, B_{move+1}, …, B_n, set B′ = B′_2 and move = move + 1, and go to step 3-2.
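The tagging loop of module 3 can be illustrated with a small Python sketch. It is a simplification under stated assumptions: only two of the 7 title categories are covered, the regular expressions in PATTERNS are hypothetical stand-ins for the recognition pattern library, and plain text that matches no pattern is tagged "title body", a case the steps above do not spell out. When a pattern matches only the beginning of a paragraph, the remainder is put back and processed next, as in step 3-5-3.

```python
import re

# Hypothetical stand-ins for two recognition patterns (not the patent's library).
PATTERNS = [
    ("title with primary sequence number",
     re.compile(r"[0-9一二三四五六七八九十]+[、.][^:：。]+[:：]?")),
    ("no-sequence-number title",
     re.compile(r"[^0-9:：。]{1,20}[:：]")),
]


def tag_sequence(blocks):
    """blocks: list of (kind, content) with kind in {'text', 'table', 'picture'}."""
    queue = list(blocks)
    doc_seq = []
    while queue:
        kind, content = queue.pop(0)
        if kind != "text":
            doc_seq.append((content, kind))       # steps 3-3 and 3-4
            continue
        for tag, pattern in PATTERNS:
            match = pattern.match(content)
            if match is None:
                continue
            doc_seq.append((match.group(0), tag))
            rest = content[match.end():].strip()
            if rest:
                queue.insert(0, ("text", rest))   # B'_2 is processed next
            break
        else:
            doc_seq.append((content, "title body"))  # assumption of this sketch
    doc_seq.append(("EOS", "*"))                  # step 3-2
    return doc_seq
```

For instance, the paragraph "1、Product profile: The TLL unit is quiet" is split into a tagged title "1、Product profile:" followed by the body paragraph "The TLL unit is quiet".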

Step 4-1: set move = k;

step 4-2: set TextParagraphs = B_move;

step 4-3: for the item B_move/Tag_move in DocSeq, if B_move = BOS, return TitleText and end the method;

step 4-4: if Tag_move ≠ title with primary sequence number and Tag_move ≠ title with non-primary sequence number, set TitleText(B_move) = "" (the empty string), set TextParagraphs = B_move + "<br>" + TextParagraphs and move = move - 1, and go to step 4-2, where "<br>" is a special marker separating two paragraphs;

step 4-5: set TitleText(B_move) = TextParagraphs and move = move - 1, and go to step 4-2.
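One possible reading of module 4 is sketched below in Python: the tagged sequence is scanned from the end towards the beginning, paragraphs are accumulated with "<br>" between them, and the accumulated text is assigned to the nearest preceding title that carries a sequence number. Clearing the accumulator after each assignment is an assumption of this sketch, as are the names collect_title_text and NUMBERED.

```python
NUMBERED = {
    "title with primary sequence number",
    "title with non-primary sequence number",
}


def collect_title_text(doc_seq):
    """doc_seq: list of (content, tag) pairs ending with ("EOS", "*")."""
    title_text = {}
    pending = []  # paragraphs seen since the last numbered title, last one first
    for content, tag in reversed(doc_seq):
        if tag == "*":                 # skip the end-of-sequence marker
            continue
        if tag in NUMBERED:
            title_text[content] = "<br>".join(reversed(pending))
            pending = []               # assumption: start afresh for the next title
        else:
            title_text[content] = ""   # non-title items have no title text
            pending.append(content)
    return title_text
```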

Step 5-1: set DocSeq = B_1/Tag_1, B_2/Tag_2, …, B_m/Tag_m, …, B_k/Tag_k, EOS/*, FatherTitle = root, FatherTag = "*", level = 0, BeginIndex = 1, EndIndex = k + 1, PreviousIndex = 0, Graph = {};

step 5-2: set Graph = BuildDocumentGraph(DocSeq, FatherTitle, level, FatherTag, BeginIndex, EndIndex, PreviousIndex, Graph);

step 5-3: return Graph.

In step 5-2, BuildDocumentGraph is a recursive method, implemented as follows:

step 5-2-1: set move = BeginIndex;

step 5-2-2: if move = k, return Graph and end the method;

step 5-2-3: searching DocSeq from position move, find the first item matched by a recognition pattern for a title with a primary sequence number or a title with a non-primary sequence number in the knowledge document title recognition pattern library; if there is none, set move = move + 1 and go to step 5-2-2;

step 5-2-4: denote the title recognition pattern found in step 5-2-3 as Pat, and the item of DocSeq that it matches as B_move/Tag_move;

step 5-2-5: set Graph = Graph ∪ {(FatherTitle, level, FatherTag, lower-level, B_move, level+1, Tag_move)}, PreviousTitle = B_move, PreviousTag = Tag_move, PreviousIndex = move;

step 5-2-6: set move = move + 1;

step 5-2-7: starting from position move in DocSeq, perform steps 5-2-7-1 to 5-2-7-4:

step 5-2-7-1: if move = EndIndex, call BuildDocumentGraph recursively, that is, Graph = Graph ∪ BuildDocumentGraph(DocSeq, PreviousTitle, level+1, Tag_move, PreviousIndex+1, move, PreviousIndex, Graph), and return the result Graph;

step 5-2-7-2: if Pat matches B_move in B_move/Tag_move, set Graph = Graph ∪ {(FatherTitle, level, FatherTag, lower-level, B_move, level+1, Tag_move)} ∪ {(PreviousTitle, level+1, PreviousTag, prior, B_move, level+1, Tag_move)} and PreviousTitle = B_move;

step 5-2-7-3: call BuildDocumentGraph recursively, that is, Graph = Graph ∪ BuildDocumentGraph(DocSeq, PreviousTitle, level+1, PreviousTag, PreviousIndex+1, move, PreviousIndex, Graph);

step 5-2-7-4: go to step 5-2-6.
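The kind of graph that module 5 produces can be illustrated with a short Python sketch. It is not the recursive BuildDocumentGraph method itself but an iterative simplification built on the same idea: titles matched by the same recognition pattern are siblings, and a title matched by a new pattern opens a lower level. The input format (title, pattern id), the pattern ids and the function name build_graph are assumptions of this sketch, and levels and tags are omitted from the edges for brevity.

```python
def build_graph(titles):
    """titles: list of (title, pattern_id) in document order.

    Returns edges (source, relation, target), where relation is
    "lower-level" (parent to child) or "prior" (previous sibling to next).
    """
    edges = []
    path = [("root", None)]  # chain of currently open titles, root first
    for title, pattern_id in titles:
        depth = next(
            (i for i, (_, p) in enumerate(path) if p == pattern_id), None
        )
        if depth is not None:
            previous = path[depth][0]
            del path[depth:]  # close the previous sibling and its descendants
            edges.append((previous, "prior", title))
        parent = path[-1][0]
        edges.append((parent, "lower-level", title))
        path.append((title, pattern_id))
    return edges
```

With three primary titles and two subtitles under the second one, the sketch yields "lower-level" edges from root to each primary title and from the second title to its subtitles, plus "prior" edges between consecutive siblings.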

Step 6-1: mark every node (Title, level, Tag) of Graph as "unprocessed";

step 6-2: for any node (Title, level, Tag) of Graph that is marked "unprocessed", perform the following sub-steps:

step 6-2-1: if no lower-level node can be reached from (Title, level, Tag) through a "lower-level" edge, mark (Title, level, Tag) as "processed" and go to step 6-2;

step 6-2-2: denote the lower-level nodes reached from (Title, level, Tag) through "lower-level" edges as (Title_1, level+1, Tag_1), …, (Title_m, level+1, Tag_m), …, (Title_n, level+1, Tag_n);

step 6-3: among (Title_1, level+1, Tag_1), …, (Title_m, level+1, Tag_m), …, (Title_n, level+1, Tag_n), let Subtitles be the number of nodes for which TitleText(Title_m) = "";

step 6-4: if Subtitles/n > 0.5, then in Graph replace (Title_1, level+1, Tag_1), …, (Title_n, level+1, Tag_n) with (Title_1, level+1, title body), …, (Title_n, level+1, title body);

step 6-5: mark (Title, level, Tag) as "processed" and go to step 6-2.
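The threshold rule of steps 6-3 and 6-4 can be sketched in Python as follows. The dictionaries children, title_text and tags are hypothetical containers chosen for this illustration; the invention itself works on the nodes and "lower-level" edges of Graph.

```python
def normalize_children(children, title_text, tags):
    """Retag lower-level titles as "title body" when most of them have no text.

    children:   parent title -> list of its lower-level titles
    title_text: title -> its TitleText ("" when empty)
    tags:       title -> current tag
    """
    result = dict(tags)
    for parent, kids in children.items():
        if not kids:
            continue                      # step 6-2-1: nothing to normalize
        empty = sum(1 for kid in kids if title_text.get(kid, "") == "")
        if empty / len(kids) > 0.5:       # step 6-4: Subtitles / n > 0.5
            for kid in kids:
                result[kid] = "title body"
    return result
```

The comparison is strict, so a title with exactly half of its lower-level titles empty is left unchanged.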

Step 7-1: mark every node (Title, level, Tag) of Graph as "unconverted";

step 7-2: if Graph contains no node (Title, level, Tag) marked "unconverted", end the method;

step 7-3: for any node (Title, level, Tag) of Graph that is marked "unconverted", perform the following sub-steps:

step 7-3-1: if Tag ≠ title with primary sequence number and Tag ≠ title with non-primary sequence number, mark (Title, level, Tag) as "converted" and go to step 7-2;

step 7-3-2: segment Title with the jieba segmenter and denote the segmentation result as Tokens(Title) = TW_1 TW_2 … TW_i … TW_p;

step 7-3-3: for each term TW_i (1 ≤ i ≤ p) of Tokens(Title) = TW_1 TW_2 … TW_i … TW_p, perform the following steps:

step 7-3-4: for every title KMTitle of the knowledge template KM, initialize the array entry TitleHits(KMTitle) = 0;

step 7-3-5: denote the word segmentation result of KMTitle as Tokens(KMTitle) = {KTW_1, …, KTW_j, …, KTW_M}; if there exists a KTW_j such that TW_i ∈ SimWords(KTW_j), set TitleHits(KMTitle) = TitleHits(KMTitle) + 1;

step 7-3-6: among all titles of the knowledge template KM, find the title with the maximum value of TitleHits(KMTitle) and denote it Title*; if several titles attain the maximum, that is, Title* is not unique, select one of them arbitrarily as Title*;

step 7-3-7: set the knowledge document title conversion array TitleTrans(Title) = Title* and go to step 7-2.

Finally, the automatic conversion from the knowledge document to the standard knowledge template is completed according to the knowledge document analysis graph Graph output by module 5, the TitleTrans array output by module 7 and the TitleText output by module 4.
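The hit counting of module 7 can be sketched in Python as below. The sketch assumes that titles are already segmented into terms (the invention uses the jieba segmenter for this) and that SimWords is available as a dictionary from term to a set of similar words; the function name match_title and the sample data in the usage note are hypothetical. The hit counter is initialized once per document title here.

```python
def match_title(title_tokens, template_tokens, sim_words):
    """Return the template title with the most hits for a document title.

    title_tokens:    terms TW_1 ... TW_p of the document title
    template_tokens: template title KMTitle -> its terms KTW_1 ... KTW_M
    sim_words:       term -> set of similar words, SimWords(term)
    """
    hits = {}
    for km_title, km_tokens in template_tokens.items():
        hits[km_title] = sum(
            1
            for tw in title_tokens
            if any(tw in sim_words.get(ktw, {ktw}) for ktw in km_tokens)
        )
    # max() keeps the first title that attains the maximum, which is one
    # admissible way of choosing arbitrarily among ties.
    return max(hits, key=hits.get)
```

For example, with template titles "product profile" and "after-sales policy" and SimWords(profile) containing "introduction", the document title terms ["product", "introduction"] are mapped to "product profile".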

Compared with the prior art, the invention has the following advantages. By means of natural language understanding, deep learning and related technologies, the invention improves the construction efficiency of the intelligent knowledge base and reduces the manual workload of converting historical knowledge documents into the intelligent knowledge base. In a test on 200 knowledge documents with manual checking, the system of the invention understood the structure of the documents with an accuracy of 99.0% and converted them to the knowledge template with an accuracy of 87.9%. The system therefore achieves excellent knowledge document structure understanding and conversion performance and solves the problem of quickly storing large numbers of enterprise documents.

Drawings

FIG. 1 is a diagram of a product knowledge document template.

Fig. 2 is a block diagram of a knowledge document structure understanding and converting apparatus.

FIG. 3 is a schematic diagram of a product knowledge template.

Fig. 4 is a schematic word segmentation diagram of fig. 3.

Detailed Description

The invention will now be described in detail with reference to the drawings and the accompanying specific examples.

In order to clearly state the invention, several basic terms are first given below.

(1) Character: the most basic unit of a character string in the present invention. For example, a Chinese character, an English letter, a Japanese kana, a digit, or a Chinese or English punctuation mark is a character.

(2) Knowledge document: a document that describes content such as services, products, management and regulations as knowledge content with a certain structure. Although different organizations hold different document knowledge, the basic form of knowledge documents is consistent: a knowledge document typically contains multiple levels of titles, and one upper-level title may include multiple lower-level subtitles. Each title usually carries a sequence number. Title sequence numbers take various forms, such as one, two, three or 1, 2, 3, and lower-level subtitles take even more forms, such as (one), (two), (three) or (1), (2), (3).

(3) HTML, DOM, CSS: HTML (full name HyperText Markup Language) is the markup language for web pages. DOM (full name Document Object Model) is a standard of the W3C (full name World Wide Web Consortium); it is a platform-neutral and language-neutral interface that allows programs and scripts to dynamically access and update the content, structure and style of documents. CSS (full name Cascading Style Sheets) is a computer language for describing the style of HTML or XML documents.

(4) Word knowledge document, PDF knowledge document, HTML knowledge document: a Word knowledge document is a document produced with word processing software such as Word in Microsoft Office or WPS (Word Processing System), called a Word document for short. A PDF (Portable Document Format) knowledge document is a document in PDF form, called a PDF document for short. An HTML knowledge document is a document typeset in HTML, called an HTML document for short.

(5) Jsoup: a Java HTML parser. It is a free, open-source tool on the Java platform for parsing HTML documents, which uses DOM or CSS selectors to find and retrieve data from an HTML document.

(6) POI: full name Java API for Microsoft Documents, also called the POI tool. It is an open-source Java tool for processing Microsoft Office documents and can automatically convert Word or PDF documents into HTML documents, so that the documents can be structurally understood in a uniform way.

Labeling of knowledge documents

(1) Title of knowledge document and 7 categories thereof

Based on a survey of a large number of industry knowledge documents, the invention classifies the titles of knowledge documents into 7 title categories; see the knowledge document "double eleven TLL air conditioner sales activities" below.

The 7 title categories are described below, and for each a specific recognition pattern for knowledge document title identification is given in BNF (Backus-Naur Form). Here <N> is a Chinese numeral string or an Arabic numeral string, <M> is a Chinese numeral string, an Arabic numeral string or an English letter, and <title> is a character string composed of Chinese characters, English words, round brackets or square brackets. For specific examples, refer to the "double eleven TLL air conditioner sales activities" knowledge document.

1) Title with primary sequence number: the sequence number is a Chinese numeral (one, two, …, nine, ten, eleven, etc.) or an Arabic numeral (1, 2, …, 9, 10, 11, etc.). The recognition pattern of the title with primary sequence number is as follows:

< title with primary sequence number > = < N > < title > | < N >, < title > | < N >: < title > < N > < title > < colon > < N > < title > < colon > < N > < colon > < N >. < title > < colon > | < N >: < title > < colon >

<colon> ::= : | ：

Note that, according to the above expression, <title with primary sequence number> has 10 forms, where each "|" separates one form from the next. This convention also applies to the recognition patterns below.

2) Title with non-primary sequence number: titles with non-primary sequence numbers refer to low-level titles such as secondary, tertiary, etc. under the title. The identification pattern of the title with the non-primary sequence number is as follows:

3) Catalog title: when a knowledge document is long, it often has a catalog (table of contents) to make it easier for the user to read.

The identification pattern of the directory title is as follows:

4) Emphasized title: emphasized titles alert the user to important content or content that requires special attention. The recognition pattern is as follows:

<emphasized title> ::= <left square bracket><title>item<right square bracket> | <left square bracket><title>alert<right square bracket> | <left square bracket><title>alert<right square bracket> | <left square bracket>note<right square bracket>

where:

<left square bracket> ::= [ | 【

<right square bracket> ::= ] | 】

5) Step title: step titles mark the steps in a process. The recognition pattern is as follows:

<step title> ::= step<Z> | step<Z><colon> | <N>、step<Z> | <N>、step<Z><colon>

where <Z> is a Chinese numeral string, an English letter, an Arabic numeral string or an Arabic numeral hierarchy string, and:

<Arabic numeral hierarchy string> ::= <Arabic numeral string><connector><Arabic numeral string> | <Arabic numeral string><connector><Arabic numeral hierarchy string>

<connector> ::= . | - | _

For example, "4-3-1" and "3.2.2" are Arabic numeral hierarchy strings.
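The grammar of the hierarchy string translates directly into a regular expression: one numeral string followed by one or more groups of a connector and a numeral string. The Python sketch below assumes the connectors ".", "-" and "_" (the "." is inferred from the example "3.2.2"); the names HIERARCHY and is_hierarchy_string are hypothetical.

```python
import re

# <numeral string> (<connector> <numeral string>)+ with connectors . - _
HIERARCHY = re.compile(r"[0-9]+(?:[.\-_][0-9]+)+")


def is_hierarchy_string(text):
    """True when the whole text is an Arabic numeral hierarchy string."""
    return HIERARCHY.fullmatch(text) is not None
```

A plain numeral string such as "42" is not a hierarchy string, because the grammar requires at least one connector.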

6) FAQ title: an FAQ (Frequently Asked Questions) title is a title for a common question. An FAQ title may or may not carry a sequence number, and it ends with a question mark. The recognition pattern of the FAQ title is as follows:

<FAQ title> ::= <title with primary sequence number><question mark>answer<colon> | <title with non-primary sequence number><question mark>answer<colon> | <title><question mark>answer<colon>

<question mark> ::= ? | ？

7) No-sequence-number title: the invention also uses a special recognition pattern to recognize titles without a sequence number. Although a no-sequence-number title carries no sequence number, it acts as a title for the text that follows it.

The recognition pattern is as follows:

<no-sequence-number title> ::= <title><colon>

To improve the accuracy of recognizing no-sequence-number titles, the <title> in a <no-sequence-number title> must carry no sequence number and must not exceed a certain length (20 characters by default).
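The two constraints on the no-sequence-number title, no sequence number and a bounded length, can be checked as in the following Python sketch. The regular expression LEADING_NUMBER is a hypothetical approximation of "carries a sequence number" that covers only Chinese and Arabic numerals, and the function name is likewise an assumption of this sketch.

```python
import re

# Hypothetical test for a leading sequence number such as "1、", "(2)" or "三、".
LEADING_NUMBER = re.compile(r"[(（]?[0-9一二三四五六七八九十]+[)）、.．]")


def is_unnumbered_title(text, max_len=20):
    """True when text is a title without sequence number followed by a colon."""
    text = text.strip()
    if not text or text[-1] not in ":：":
        return False
    name = text[:-1]
    return 0 < len(name) <= max_len and LEADING_NUMBER.match(name) is None
```

The default of 20 characters follows the text above and can be overridden through max_len.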

Given a recognition pattern Pat and a text Text, Pat completely matches Text when the following two conditions are both satisfied: (1) Pat can match Text, and (2) all of the content in Text is matched by Pat. Pat partially matches Text when condition (1) is satisfied. For example, let the recognition patterns be Pat_1 = <N>、<title><colon> and Pat_2 = <N>、<title>, and let Text = "one, product profile: the TLL double-frequency air conditioner adopts advanced refrigeration and heating technology, has a fashionable appearance, and offers rapid cooling and heating, low noise, low energy consumption, automatic cleaning and sterilization, and other functions." Pat_1 partially matches Text, with the matching result <N> = one, <title> = product profile. Pat_2 completely matches Text, with the matching result <N> = one, <title> = product profile: the TLL double-frequency air conditioner adopts advanced refrigeration and heating technology, has a fashionable appearance, and offers rapid cooling and heating, low noise, low energy consumption, automatic cleaning and sterilization, and other functions.
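The distinction between complete and partial matching corresponds closely to the difference between fullmatch and match in Python's re module, as the sketch below shows. The regular expressions PAT_1 and PAT_2 are simplified, hypothetical renderings of Pat_1 and Pat_2 that accept only Arabic numerals, and match_kind reports the strongest kind of match that holds.

```python
import re


def match_kind(pattern, text):
    """Classify how a compiled pattern matches text."""
    if pattern.fullmatch(text):
        return "complete"  # conditions (1) and (2) both hold
    if pattern.match(text):
        return "partial"   # only condition (1) holds
    return "none"


PAT_1 = re.compile(r"[0-9]+、[^:：]+[:：]")  # <N>、<title><colon>
PAT_2 = re.compile(r"[0-9]+、.+")           # <N>、<title>
```

For the text "1、Product profile: the TLL air conditioner is quiet.", PAT_1 stops after the colon and therefore matches partially, while PAT_2 consumes the whole text and matches completely.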

For a better understanding of the 7 title categories, an example document is given below: the knowledge document "double eleven TLL air conditioner sales activities".

The knowledge document has four titles with primary sequence numbers, namely "one, product introduction", "two, main product parameters", "three, after-sales policy (including returns, upgrades, installation and warranty)" and "four, others". Under the title "two, main product parameters" there are three no-sequence-number titles, namely "energy consumption level:", "rated cooling capacity:" and "cooling power:". Under the title "three, after-sales policy" there are three secondary titles with non-primary sequence numbers, namely "(one) return and upgrade policy", "(two) installation charging standard" and "(three) warranty and charging policy". Under the secondary title "(one) return and upgrade policy" there are two tertiary titles, namely "(1) return: the user may return the product without giving a reason within 7 days of sale, by booking after-sales personnel to collect it at the door." and "(2) upgrade: after one year of use, if the user needs to upgrade to a more powerful air conditioner, the price difference between the products is charged, plus an upgrade fee of 200 yuan per unit." Under the title "four, others" there are two secondary FAQ titles, "a) how is the air conditioner replaced after purchase?" and "b) how is the air conditioner covered by warranty? How is warranty service charged?".

(2) Title body of knowledge documents

A title is usually followed by paragraphs that describe it in more detail. The present invention calls these paragraphs the title body, or body text for short.

For example, in the knowledge document "double eleven TLL air conditioner sales activities", the paragraph "The TLL double-frequency air conditioner adopts advanced refrigeration and heating technology, has a fashionable appearance, and offers rapid cooling and heating, low noise, low energy consumption, automatic cleaning and sterilization, and other functions." is the body of the title "one, product introduction".

(3) Tables and pictures of knowledge documents

Knowledge documents often contain tables and pictures. The invention labels tables and pictures but does not analyze them further. To this end, the invention introduces two corresponding tags: table and picture.

(4) Annotation set of knowledge documents

Based on the above, the invention introduces a first-level label set of knowledge documents, denoted TAGS = {title, body text, table, picture}, and a second-level label set, denoted TAGS⁺ = {title with primary sequence number, title with non-primary sequence number, catalog title, emphasized title, step title, FAQ title, no-sequence-number title, title body, table, picture}.

Normalized knowledge document template

Before converting a large number of knowledge documents into canonical knowledge document templates, an enterprise first creates a set of canonical knowledge document templates, abbreviated as knowledge templates. FIG. 3 is a simple knowledge template-the "product knowledge template":

typically, a normalized knowledge document template includes two aspects of content:

(1) Canonical names of knowledge document titles and their hierarchy. For example, the "product knowledge template" has 4 primary document titles (primary titles for short), namely product profile, product performance parameters, product after-sales policy, and others.

(2) The subordinate titles of each knowledge document title. For example, in the "product knowledge template", the product after-sales policy has 3 lower-level titles, namely return and upgrade, installation and charging, and warranty and charging.

In the present invention, to improve the conversion efficiency of knowledge documents, the titles in the knowledge template are segmented in advance; see the right part of Fig. 4. The word segmentation tool is the Java edition of the jieba segmenter (see https://github.com/huaban/jieba-analysis).

Given a title Title in a knowledge template, Tokens(Title) = {TW_1, …, TW_M} denotes the terms of Title.

Given a knowledge template KM, let KMTokens(KM) = {W_1, …, W_i, …, W_N} be all the terms in the knowledge template. For each W_i (1 ≤ i ≤ N), the 20 words most similar to W_i are computed with a Gensim model (https://radimrehurek.com/gensim/) and denoted SimWords(W_i). For example, SimWords(policies) = {policies, rules, …, regimes}.
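To make the role of SimWords concrete, the following Python sketch ranks words by cosine similarity over a small table of word vectors. It is a toy stand-in for the Gensim model, not a use of Gensim itself: the vectors would in practice come from a trained embedding model, and the function names and the choice to include the word itself in its own SimWords set (as in the example above) are assumptions of this sketch.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def sim_words(word, vectors, top_n=20):
    """Return the word itself plus its top_n most similar words."""
    others = sorted(
        (w for w in vectors if w != word),
        key=lambda w: cosine(vectors[word], vectors[w]),
        reverse=True,
    )
    return {word, *others[:top_n]}
```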

For ease of computation, the invention stores the normalized knowledge template as a graph, called the knowledge template graph (template graph for short); see FIG. 1. The nodes of the knowledge template graph are the titles in the knowledge template plus one special node, called the top node of the template graph and denoted root. The knowledge template graph has three types of edges. The first type connects root to each primary title, and the second type connects a title to each of its next-level titles; both are drawn as solid lines in FIG. 1 and mean "lower level". The third type records the order in which titles of the same level appear in the knowledge template; it is drawn as a dashed line in FIG. 1 and means "prior to".
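The template graph described above can be sketched as a small data structure. This is only an illustration: the class and function names, the nested-dictionary input format, and the English title strings are assumptions made for the example, not part of the invention's stated implementation.

```python
from dataclasses import dataclass, field

ROOT = "root"  # top node of the template graph


@dataclass
class TemplateGraph:
    # Each edge is (source title, relation, target title),
    # with relation either "lower" or "prior".
    edges: list = field(default_factory=list)

    def add_children(self, parent, children):
        previous = None
        for child in children:
            self.edges.append((parent, "lower", child))
            if previous is not None:
                # Siblings are linked in their order of appearance.
                self.edges.append((previous, "prior", child))
            previous = child


def build_template_graph(template):
    """template: nested dict mapping each title to a dict of its lower-level titles."""
    graph = TemplateGraph()

    def visit(parent, subtree):
        graph.add_children(parent, list(subtree))
        for title, children in subtree.items():
            visit(title, children)

    visit(ROOT, template)
    return graph


# Illustrative rendering of the "product knowledge template".
product_template = {
    "product profile": {},
    "product performance parameters": {},
    "after-sales policy": {
        "returns and upgrades": {},
        "installation and charges": {},
        "warranty and charges": {},
    },
    "others": {},
}
graph = build_template_graph(product_template)
```

Storing edges as plain triples keeps the "lower" and "prior" relations in one list, which is enough for the lookups used by the later modules.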

The device of the present invention and its method of implementation are explained in detail below in conjunction with fig. 2. The device for understanding and converting the structure of the knowledge document comprises 7 modules:

Module 1: if the input is a Word or PDF knowledge document, convert it into an HTML knowledge document

The invention mainly understands and converts HTML knowledge documents. When the document supplied by the user is in Word or PDF format, module 1 automatically converts it into a knowledge document in HTML format.

Module 1 is implemented as follows: the API (Application Programming Interface) provided by POI is called to convert the Word or PDF knowledge document supplied by the user directly into an HTML document.

Module 2: converting the HTML knowledge document into a sequence of plain text, tables and pictures

The module 2 receives the HTML document output by the module 1, and processes it to form a plain text, table and picture sequence. During processing, the HTML document is manipulated using the Jsoup tool.

The implementation method of the module 2 is as follows:

Step 2-1: for the HTML document, the i-th table is extracted in full and denoted Tab_i. The extracted tables, in their order of appearance in the HTML document, are denoted Tab_1, Tab_2, ..., and are called the table sequence of the HTML document.

Step 2-2: for the HTML document, the j-th picture is extracted in full and denoted IMG_j. The extracted pictures, in their order of appearance in the HTML document, are denoted IMG_1, IMG_2, ..., and are called the picture sequence of the HTML document.

Step 2-3: for the HTML document, Jsoup is used to identify the k-th plain text paragraph (the tags <p>, <br>, the H tags, <li> and <ol> denote HTML paragraphs), which is extracted in full and denoted TXT_k. The extracted plain text paragraphs, in their order of appearance in the HTML document, are denoted TXT_1, TXT_2, ..., and are called the plain text paragraph sequence of the HTML document.

Step 2-4: from the tables, pictures and plain text paragraphs extracted in steps 2-1, 2-2 and 2-3, output the content sequence B_1, B_2, ..., B_m, ..., B_n in order of appearance in the HTML document. It is called the plain text, table and picture sequence of the HTML document, where each B_m is a table from the table sequence, a picture from the picture sequence, or a plain text paragraph from the plain text paragraph sequence of the HTML document.

Step 2-5: return B_1, B_2, ..., B_m, ..., B_n.
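The flattening performed by module 2 can be sketched as follows. The text names Jsoup as the tool; this sketch instead uses Python's standard html.parser module as a stand-in, and the class name, the function name, and the (kind, content) tuple format are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Tags treated as paragraph boundaries (per step 2-3).
PARAGRAPH_TAGS = {"p", "li", "ol", "h1", "h2", "h3", "h4", "h5", "h6"}


class BlockExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.blocks = []       # output sequence B_1 ... B_n
        self.table_depth = 0   # > 0 while inside a (possibly nested) table
        self.table_text = []
        self.text = []

    def _flush_text(self):
        paragraph = "".join(self.text).strip()
        if paragraph:
            self.blocks.append(("text", paragraph))
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self.table_depth == 0:
                self._flush_text()
            self.table_depth += 1
        elif self.table_depth == 0:
            if tag == "img":
                self._flush_text()
                self.blocks.append(("picture", dict(attrs).get("src", "")))
            elif tag in PARAGRAPH_TAGS or tag == "br":
                self._flush_text()

    def handle_endtag(self, tag):
        if tag == "table" and self.table_depth > 0:
            self.table_depth -= 1
            if self.table_depth == 0:
                # The whole table becomes a single block.
                self.blocks.append(("table", " ".join(self.table_text)))
                self.table_text = []
        elif self.table_depth == 0 and tag in PARAGRAPH_TAGS:
            self._flush_text()

    def handle_data(self, data):
        if self.table_depth > 0:
            if data.strip():
                self.table_text.append(data.strip())
        else:
            self.text.append(data)


def to_block_sequence(html):
    parser = BlockExtractor()
    parser.feed(html)
    parser.close()
    parser._flush_text()
    return parser.blocks
```

A single pass keeps tables, pictures and paragraphs in document order, so the three separate sequences of steps 2-1 to 2-3 never need to be merged afterwards.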

Module 3: identifying titles in the plain text, table and picture sequence of the HTML knowledge document

The main function of module 3 is to recognize titles in the output of module 2. Title recognition is one of the core modules of the whole device: once the titles of the HTML knowledge document are identified, the hierarchical structure of the whole document becomes clear.

Our examination of knowledge documents from multiple industries shows that, because business personnel usually write knowledge documents freely, the types and formats of titles are highly inconsistent. To deal with this problem, the invention first classifies the titles of HTML knowledge documents into 7 major categories (see the section "titles of knowledge documents and 7 categories thereof") and then recognizes each category separately.

From the above statements, the title recognition method of the knowledge document is given below:

Step 3-1: for the output B_1, B_2, ..., B_m, ..., B_n of module 2, set DocSeq = B_1, B_2, ..., B_m, ..., B_n, EOS, where EOS (End of Sequence) is a special marker that indicates the end of the sequence and simplifies the processing of module 3; set B′ = B_1 and move = 1.

Step 3-2: if B′ = EOS, set the label of B′ to "" (i.e., EOS/""; here "" is a special label that lets the subsequent modules process the sequence uniformly), return DocSeq, and end the method.

Step 3-3: if B′ is a table, set DocSeq = B_1, B_2, ..., B_{move-1}, B′/table, B_{move+1}, ..., B_n, EOS; set B′ = B_{move+1} and move = move+1; go to step 3-2.

Step 3-4: if B′ is a picture, set DocSeq = B_1, B_2, ..., B_{move-1}, B′/picture, B_{move+1}, ..., B_n, EOS; set B′ = B_{move+1} and move = move+1; go to step 3-2.

Step 3-5-1: if B′ is a title with a primary sequence number and Tag ≠ …, set DocSeq = B_1/Tag_1, B_2/Tag_2, ..., B_{move-1}/Tag_{move-1}, B′/Tag, B_{move+1}, ..., B_n, EOS; set B′ = B_{move+1} and move = move+1; go to step 3-2.

Step 3-5-2: if B′ is completely matched by Pat, Tag = title with primary sequence number, and B_move can be matched by <string 1>(<string 2>), set DocSeq = B_1/Tag_1, B_2/Tag_2, ..., B_{move-1}/Tag_{move-1}, <string 1>/title with primary sequence number, <string 2>/title body, B_{move+1}/Tag_{move+1}, ..., B_n/Tag_n, EOS; set B′ = B_{move+1} and move = move+2; go to step 3-2.

Step 3-5-3: split B′ into two parts, B′ = B′_1 + B′_2, where B′_1 is completely matched by Pat and B′_2 is the remainder of B′ not matched by Pat. Set DocSeq = B_1/Tag_1, B_2/Tag_2, ..., B_{move-1}/Tag_{move-1}, B′_1/Tag, B′_2, B_{move+1}, ..., B_n; set B′ = B′_2 and move = move+1; go to step 3-2.

Note that step 3-5-2 handles a common irregularity in knowledge documents. For example, in the title "three, after-sales policy (including returns, upgrades, installation and warranty)" of the knowledge document "double eleven TLL air conditioner sales campaign", the explanatory words "(including returns, upgrades, installation and warranty)" appear inside the title and break the consistency of the document's style. Module 3 therefore converts such explanatory words into title body text, which makes the knowledge document more standardized and more readable.
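The labelling loop of module 3, including the parenthetical split of step 3-5-2, can be sketched as below. The title recognition pattern library is not reproduced in this section, so the two regular expressions, the function name, and the choice to label unmatched text as "title body" are all assumptions made for illustration; the prefix split of step 3-5-3 is omitted.

```python
import re

PRIMARY = "title with primary sequence number"
NON_PRIMARY = "title with non-primary sequence number"

# Hypothetical pattern library: (label, pattern) pairs.
PATTERNS = [
    (PRIMARY, re.compile(r"^\d+\.\s+\S.*$")),            # e.g. "3. Policy"
    (NON_PRIMARY, re.compile(r"^\d+(\.\d+)+\s+\S.*$")),  # e.g. "3.1 Returns"
]
# <string 1>(<string 2>): a title followed by explanatory words in parentheses.
PAREN = re.compile(r"^(?P<head>[^()]+?)\s*\((?P<note>[^()]*)\)$")


def label_blocks(blocks):
    """blocks: list of (kind, content), kind in {'text', 'table', 'picture'}.
    Returns DocSeq as a list of (content, tag) pairs ending with ('EOS', '')."""
    doc_seq = []
    for kind, content in blocks:
        if kind in ("table", "picture"):        # steps 3-3 and 3-4
            doc_seq.append((content, kind))
            continue
        tag = next((t for t, p in PATTERNS if p.match(content)), None)
        if tag is None:
            # Assumption: text matched by no title pattern is body text.
            doc_seq.append((content, "title body"))
            continue
        split = PAREN.match(content)
        if tag == PRIMARY and split:            # step 3-5-2
            doc_seq.append((split.group("head"), tag))
            doc_seq.append((split.group("note"), "title body"))
        else:
            doc_seq.append((content, tag))
    doc_seq.append(("EOS", ""))                 # step 3-2
    return doc_seq
```

Because the split emits two items for one input block, the output can be longer than the input, which is why module 4 re-indexes the sequence.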

Module 4: identifying the title body in the plain text, table and picture sequence of the HTML knowledge document

A knowledge document contains multiple titles at different hierarchical levels. Knowledge document understanding therefore has to associate the title at each level accurately with its corresponding body text.

In step 3-5-3 of module 3, B′ is split into two parts, so the output of module 3 may be longer than its input. To avoid ambiguity, the output of module 3 is hereinafter denoted DocSeq = BOS/"", B_1/Tag_1, B_2/Tag_2, ..., B_m/Tag_m, ..., B_k/Tag_k, with length k, where BOS (Begin of Sequence) is a sequence-start marker added by this module to simplify the method.

The implementation method of the module 4 is as follows:

Step 4-1: set move = k.

Step 4-2: set TextParagraphs = B_move.

Step 4-3: for B_move/Tag_move in DocSeq, if B_move = BOS, return TitleText and end the method.

Step 4-4: if Tag_move ≠ title with primary sequence number and Tag_move ≠ title with non-primary sequence number, set TitleText(B_move) = "" (i.e., the empty string), TextParagraphs = B_move + "<br>" + TextParagraphs and move = move-1, and go to step 4-2. Here "<br>" is a special marker that separates two paragraphs; when implementing the invention it may be replaced by "<p>", "\n" or "\r\n".

Step 4-5: set TitleText(B_move) = TextParagraphs and move = move-1, and go to step 4-2.
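One possible reading of the backward scan in module 4 is sketched below. It assumes that the accumulated paragraphs are reset once they have been assigned to a title, and that the BOS and EOS markers have already been removed; the function name and the index-keyed result are hypothetical choices made for the example.

```python
TITLE_TAGS = {
    "title with primary sequence number",
    "title with non-primary sequence number",
}


def attach_title_text(doc_seq, separator="<br>"):
    """doc_seq: list of (content, tag) pairs without the BOS/EOS markers.
    Returns a dict mapping each position in doc_seq to its TitleText."""
    title_text = {}
    paragraphs = []  # body paragraphs seen since the last title (scanning backwards)
    for index in range(len(doc_seq) - 1, -1, -1):
        content, tag = doc_seq[index]
        if tag in TITLE_TAGS:
            # Everything collected below this title becomes its body.
            title_text[index] = separator.join(paragraphs)
            paragraphs = []
        else:
            title_text[index] = ""
            paragraphs.insert(0, content)  # keep document order
    return title_text
```

Scanning from the end means each title is reached only after all of the blocks that follow it have been collected.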

Module 5: generating a parse graph of an HTML knowledge document

Module 4 has two outputs. The first is DocSeq = B_1/Tag_1, B_2/Tag_2, ..., B_m/Tag_m, ..., B_k/Tag_k, EOS/"", where EOS (End of Sequence) is a special marker that indicates the end of the sequence and simplifies the processing of module 5. The second is TitleText(B_m), where 1 ≤ m ≤ k.

The output of module 5 is a knowledge document parse graph, Graph. A node of Graph has the form (Title, level, Tag), i.e., a title, its level in the knowledge document, and its classification category Tag. There are two types of edges: "lower level" and "prior to". For convenience of description, the invention represents an edge of Graph as a tuple of the form (Title_1, level_1, Tag_1, lower level, Title_2, level_2, Tag_2) or (Title_1, level_1, Tag_1, prior to, Title_2, level_2, Tag_2), where (Title_1, level_1, Tag_1) and (Title_2, level_2, Tag_2) are nodes of Graph.

Step 5-1: set DocSeq = B_1/Tag_1, B_2/Tag_2, ..., B_m/Tag_m, ..., B_k/Tag_k, EOS/"", FatherTitle = root, FatherTag = "", level = 0, BeginIndex = 1, EndIndex = k+1, PreviousIndex = 0, Graph = {}.

Step 5-2: set Graph = buildingDocumentGraph(DocSeq, FatherTitle, level, FatherTag, BeginIndex, EndIndex, PreviousIndex, Graph).

Step 5-3: return Graph.

In step 5-2, buildingDocumentGraph is a recursive method, implemented as follows:

buildingDocumentGraph(DocSeq, FatherTitle, level, FatherTag, BeginIndex, EndIndex, PreviousIndex, Graph)

Input: DocSeq is the output of module 4; FatherTitle is the upper-level title; level is the level of the upper-level title; FatherTag is the classification category of the upper-level title; BeginIndex and EndIndex are two positions in DocSeq; PreviousIndex is used to establish the "prior to" relation between the lower-level titles of FatherTitle; Graph is the knowledge document parse graph generated so far.

Output: the knowledge document parse graph generated so far.

The implementation process is as follows:

Step 5-2-1: set move = BeginIndex.

Step 5-2-2: if move = k, return Graph and end the method.

Step 5-2-3: search DocSeq from position move for the first item matched by a recognition pattern for titles with a primary sequence number or titles with a non-primary sequence number in the knowledge document title recognition pattern library; if none is found, set move = move+1 and go to step 5-2-2.

Step 5-2-4: denote the title recognition pattern found in step 5-2-3 as Pat, and the item of DocSeq that it matches as B_move/Tag_move.

Step 5-2-5: set Graph = Graph ∪ {(FatherTitle, level, FatherTag, lower level, B_move, level+1, Tag_move)}, PreviousTitle = B_move, PreviousTag = Tag_move, PreviousIndex = move.

Step 5-2-6: set move = move+1.

Step 5-2-7-1: if move = EndIndex, recursively call buildingDocumentGraph, i.e., Graph = Graph ∪ buildingDocumentGraph(DocSeq, PreviousTitle, level+1, Tag_move, PreviousIndex+1, move, PreviousIndex, Graph), and return the resulting Graph.

Step 5-2-7-2: if Pat matches B_move in B_move/Tag_move, set Graph = Graph ∪ {(FatherTitle, level, FatherTag, lower level, B_move, level+1, Tag_move)} ∪ {(PreviousTitle, level+1, PreviousTag, prior to, B_move, level+1, Tag_move)}, PreviousTitle = B_move.

Step 5-2-7-3: recursively call buildingDocumentGraph, i.e., Graph = Graph ∪ buildingDocumentGraph(DocSeq, PreviousTitle, level+1, PreviousTag, PreviousIndex+1, move, PreviousIndex, Graph).

Step 5-2-7-4: go to step 5-2-6.
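The shape of the resulting parse graph can be illustrated with a simplified, non-recursive sketch. It is not the recursive buildingDocumentGraph procedure itself: it assumes that each title already carries its level, and it uses a stack in place of the pattern-driven recursion. The function name and input format are hypothetical.

```python
def build_parse_graph(titles):
    """titles: list of (title, level, tag) in document order, with level >= 1.
    Returns edges as tuples
    (title1, level1, tag1, relation, title2, level2, tag2)."""
    root = ("root", 0, "")
    edges = []
    stack = [root]    # path from root to the most recently seen title
    last_child = {}   # parent node -> most recently added child node
    for node in titles:
        # Pop until the top of the stack is a strict ancestor of this node.
        while stack[-1][1] >= node[1]:
            stack.pop()
        parent = stack[-1]
        edges.append(parent + ("lower level",) + node)
        if parent in last_child:
            edges.append(last_child[parent] + ("prior to",) + node)
        last_child[parent] = node
        stack.append(node)
    return edges
```

The sketch keys siblings by the parent's (title, level, tag) tuple, so it assumes that no two titles in a document share all three values.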

Module 6: normalizing the title hierarchy of the HTML knowledge document

Knowledge documents often contain irregular title hierarchies, which lowers their quality. The typical symptom is that some of the lower-level titles of a knowledge document title lack a title body, while its other lower-level titles have one.

For such irregular cases, the invention converts all lower-level titles of the irregular knowledge document title, together with the title bodies of those lower-level titles, into the title body of the irregular knowledge document title. The specific method is as follows:

The output of module 5 is denoted Graph; it is a graph representation of the hierarchy and order of the titles of the knowledge document being converted.

Step 6-1: mark every node (Title, level, Tag) of Graph as "unprocessed".

Step 6-2-2: denote the lower-level nodes reached from (Title, level, Tag) through "lower level" edges as (Title_1, level+1, Tag_1), ..., (Title_m, level+1, Tag_m), ..., (Title_n, level+1, Tag_n).

Step 6-3: among (Title_1, level+1, Tag_1), ..., (Title_m, level+1, Tag_m), ..., (Title_n, level+1, Tag_n), let Subtitles be the number of nodes with TitleText(Title_m) = "".

Step 6-4: if Subtitles/n > 0.5, then in Graph replace (Title_1, level+1, Tag_1), ..., (Title_n, level+1, Tag_n) with (Title_1, level+1, title body), ..., (Title_n, level+1, title body).

Step 6-5: mark (Title, level, Tag) as "processed" and go to step 6-2.
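The threshold test of steps 6-3 and 6-4 can be sketched for the lower-level nodes of a single title. The function name and the dictionary used for TitleText are assumptions made for illustration.

```python
def normalize_children(children, title_text):
    """children: list of (title, level, tag) lower-level nodes of one title.
    title_text: dict mapping a title to its body ('' when it has none).
    Returns the children, retagged as 'title body' when more than half
    of them have an empty body."""
    if not children:
        return children
    # Step 6-3: count lower-level titles that have no body.
    empty = sum(1 for title, _, _ in children if title_text.get(title, "") == "")
    # Step 6-4: strict majority without a body demotes all of them.
    if empty / len(children) > 0.5:
        return [(title, level, "title body") for title, level, _ in children]
    return children
```

The comparison is strict, so a title whose lower-level titles are split exactly half and half is left unchanged.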

Module 7: converting HTML knowledge document to normalized knowledge document template

The foregoing modules 1 to 6 mainly perform structural understanding on the knowledge document, and the module 7 converts the knowledge document after structural understanding into a normalized knowledge document template, thereby generating a normalized knowledge document.

Given a normalized knowledge document template KM, let KMWords = {W_1, ..., W_N} denote the set of terms obtained by segmenting the titles in KM. The output of module 6 is denoted Graph.

Step 7-1: mark every node (Title, level, Tag) of Graph as "unconverted".

Step 7-2: if Graph contains no node (Title, level, Tag) marked "unconverted", the method ends.

Step 7-3-1: if Tag ≠ title with primary sequence number and Tag ≠ title with non-primary sequence number, mark (Title, level, Tag) as "converted" and go to step 7-2.

Step 7-3-2: segment Title with the jieba segmenter and denote the result Tokens(Title) = TW_1 TW_2 ... TW_i ... TW_p.

Step 7-3-3: for Tokens(Title) = TW_1 TW_2 ... TW_i ... TW_p, perform the following steps for each term TW_i (1 ≤ i ≤ p):

Step 7-3-4: for every title KMTitle of KM, initialize the array entry TitleHits(KMTitle) = 0.

Step 7-4-5: denote the word segmentation result of KMTitle as Tokens(KMTitle) = {KTW_1, ..., KTW_j, ..., KTW_M}. If there exists a KTW_j such that TW_i ∈ SimWords(KTW_j), set TitleHits(KMTitle) = TitleHits(KMTitle) + 1.

Step 7-4-6: among all titles of the knowledge template KM, find the title with the maximum value of TitleHits(KMTitle) and denote it Title*. If several titles attain the maximum (i.e., Title* is not unique), choose one arbitrarily as Title*.

Step 7-4-7: set the knowledge document title conversion array TitleTrans(Title) = Title*, and go to step 7-2.
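The hit-counting match between a document title and the template titles can be sketched as follows. The sketch assumes that titles are already segmented and that SimWords is available as a plain dictionary; the function name and the sample vocabulary are hypothetical, and no jieba or Gensim call is made.

```python
def convert_title(title_tokens, template_tokens, sim_words):
    """title_tokens: terms TW_1..TW_p of the document title.
    template_tokens: dict mapping each template title to its terms KTW_1..KTW_M.
    sim_words: dict mapping a term to its set of similar words (SimWords).
    Returns the template title with the most hits (ties broken arbitrarily).
    Raises ValueError if template_tokens is empty."""
    hits = {km_title: 0 for km_title in template_tokens}
    for tw in title_tokens:
        for km_title, ktws in template_tokens.items():
            # One hit when TW_i is similar to at least one term of KMTitle.
            if any(tw in sim_words.get(ktw, set()) for ktw in ktws):
                hits[km_title] += 1
    return max(hits, key=hits.get)
```

Here max returns the first template title that attains the maximum in dictionary order, which is one way to realize the arbitrary choice allowed by step 7-4-6.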

Claims

1. A knowledge document structure understanding and converting apparatus, characterized by comprising:

module 5: generating an analysis chart of the HTML knowledge document;

2. The knowledge document structure understanding and transformation apparatus of claim 1, wherein:

3. A knowledge document structure understanding and transforming device according to claim 1 or 2, wherein: the implementation method of the module 2 is as follows:

Step 2-1: for the HTML document, the i-th table is extracted in full and denoted Tab_i; the extracted tables, in their order of appearance in the HTML document, are denoted Tab_1, Tab_2, ..., and are called the table sequence of the HTML document;

Step 2-5: return B_1, B_2, ..., B_m, ..., B_n.

4. A knowledge document structure understanding and transformation apparatus according to claim 3, wherein: the implementation method of the module 3 is as follows:

Step 3-5-2: if B′ is completely matched by Pat, Tag = title with primary sequence number, and B_move can be matched by <string 1>(<string 2>), set DocSeq = B_1/Tag_1, B_2/Tag_2, ..., B_{move-1}/Tag_{move-1}, <string 1>/title with primary sequence number, <string 2>/title body, B_{move+1}/Tag_{move+1}, ..., B_n/Tag_n, EOS; set B′ = B_{move+1} and move = move+2; go to step 3-2;

Step 3-5-3: split B′ into two parts, B′ = B′_1 + B′_2, where B′_1 is completely matched by Pat and B′_2 is the remainder of B′ not matched by Pat; set DocSeq = B_1/Tag_1, B_2/Tag_2, ..., B_{move-1}/Tag_{move-1}, B′_1/Tag, B′_2, B_{move+1}, ..., B_n; set B′ = B′_2 and move = move+1; go to step 3-2.

5. The knowledge document structure understanding and transformation apparatus of claim 4, wherein: the implementation method of the module 4 is as follows:

Step 4-1: set move = k;

Step 4-2: set TextParagraphs = B_move;

Step 4-5: set TitleText(B_move) = TextParagraphs and move = move-1, and go to step 4-2.

6. The knowledge document structure understanding and transformation apparatus of claim 5, wherein: the implementation method of the module 5 is as follows:

Step 5-3: return Graph;

Step 5-2-1: set move = BeginIndex;

Step 5-2-2: if move = k, return Graph and end the method;

Step 5-2-6: set move = move+1;

Step 5-2-7-1: if move = EndIndex, recursively call buildingDocumentGraph, i.e., Graph = Graph ∪ buildingDocumentGraph(DocSeq, PreviousTitle, level+1, Tag_move, PreviousIndex+1, move, PreviousIndex, Graph), and return the resulting Graph;

Step 5-2-7-4: go to step 5-2-6.

7. The knowledge document structure understanding and transformation apparatus of claim 6, wherein: the implementation method of the module 6 is as follows:

Step 6-1: mark every node (Title, level, Tag) of Graph as "unprocessed";

Step 6-4: if Subtitles/n > 0.5, then in Graph replace (Title_1, level+1, Tag_1), ..., (Title_n, level+1, Tag_n) with (Title_1, level+1, title body), ..., (Title_n, level+1, title body);

Step 6-5: mark (Title, level, Tag) as "processed" and go to step 6-2.

8. The knowledge document structure understanding and transformation apparatus of claim 7, wherein: the implementation method of the module 7 is as follows:

Step 7-1: mark every node (Title, level, Tag) of Graph as "unconverted";

Step 7-3-2: segment Title with the jieba segmenter and denote the result Tokens(Title) = TW_1 TW_2 ... TW_i ... TW_p;

Step 7-4-6: among all titles of the knowledge template KM, find the title with the maximum value of TitleHits(KMTitle) and denote it Title*; if several titles attain the maximum (i.e., Title* is not unique), choose one arbitrarily as Title*.