CN113723078A - Text logic information structuring method and device and electronic equipment - Google Patents
- Publication number
- CN113723078A CN113723078A CN202111044975.0A CN202111044975A CN113723078A CN 113723078 A CN113723078 A CN 113723078A CN 202111044975 A CN202111044975 A CN 202111044975A CN 113723078 A CN113723078 A CN 113723078A
- Authority
- CN
- China
- Prior art keywords
- text
- classified
- classification result
- texts
- logic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
Abstract
The invention discloses a text logic information structuring method, which comprises the following steps: acquiring a text to be edited, and encoding the text to be edited in paragraph order to obtain a plurality of paragraphs to be classified; acquiring a text to be classified for each paragraph to be classified, and sequentially inputting each text to be classified into a trained twin network in coding order for binary classification to obtain a first classification result, wherein the first classification result is either a chapter title or chapter content; obtaining the sentence vectors of the texts to be classified whose first classification result is chapter title, and inputting the sentence vectors of each two adjacent such texts into the trained twin network for logical structure classification to obtain a logical structure classification result, wherein the logical structure classification result is an upper-lower level relation, a same-level relation, or a cross-level relation; and performing logical information structuring on the text to be edited based on the first classification result and the logical structure classification result.
Description
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a text logic information structuring method and apparatus, and an electronic device.
Background
In recent years, as natural language processing technology has matured, large amounts of text information have been processed (for example, through information extraction) to construct knowledge graphs of related fields and to support related tasks such as knowledge question answering. Common text information includes news, comments, and short descriptive texts, as well as longer document-like texts stored as Word or PDF files. Compared with short texts such as news, a document is often composed of titles, chapters, paragraphs, and other logical structures, and thus carries richer logical information; information extraction based on document logical structure information is more conducive to forming knowledge and constructing knowledge graphs, so that more complete domain knowledge can be built and used in downstream tasks.
However, because documents are often written irregularly, their logical structure information frequently cannot be used directly. Moreover, current knowledge graph construction focuses on extracting relevant content from large amounts of unstructured text information, so the rich logical structure information in documents is ignored. In real business scenarios, the cost of acquiring labeled data for information extraction is very high, extraction performance with only a small number of labeled samples is poor, and large amounts of text information are not fully used; this has become one of the main bottlenecks hindering the large-scale application of knowledge graphs in vertical domains.
Disclosure of Invention
The technical problem to be solved by the embodiments of the present application is to provide a text logic information structuring method, apparatus, and electronic device, so as to solve the technical problem that the logical structure information of existing documents is difficult to extract and use.
In order to achieve the above purpose, the embodiments of the present application adopt the following technical solutions:
in a first aspect, an embodiment of the present application provides a text logic information structuring method, where the method includes:
acquiring a text to be edited, and encoding the text to be edited according to a paragraph sequence to acquire a plurality of paragraphs to be classified;
acquiring a text to be classified for each paragraph to be classified, and sequentially inputting each text to be classified into a trained twin network in coding order for binary classification to obtain a first classification result, wherein the first classification result is either a chapter title or chapter content;
obtaining the sentence vectors of the texts to be classified whose first classification result is chapter title, and inputting the sentence vectors of each two adjacent such texts into the trained twin network for logical structure classification to obtain a logical structure classification result, wherein the logical structure classification result is an upper-lower level relation, a same-level relation, or a cross-level relation;
and carrying out logic information structuralization processing on the text to be edited based on the first classification result and the logic structure classification result.
In a second aspect, an embodiment of the present application provides an apparatus for structuring text logic information, where the apparatus includes:
the first obtaining unit is configured to obtain a text to be edited, and encode the text to be edited in paragraph order to obtain a plurality of paragraphs to be classified;
the second obtaining unit is configured to obtain a text to be classified for each paragraph to be classified, and sequentially input each text to be classified into the trained twin network in coding order for binary classification to obtain a first classification result, wherein the first classification result is either a chapter title or chapter content;
the first classification unit is configured to obtain the sentence vectors of the texts to be classified whose first classification result is chapter title, and input the sentence vectors of two adjacent such texts into the trained twin network for logical structure classification to obtain a logical structure classification result, wherein the logical structure classification result is an upper-lower level relation, a same-level relation, or a cross-level relation;
and the first processing unit is used for carrying out logic information structuralization processing on the text to be edited based on the first classification result and the logic structure classification result.
In a third aspect, an embodiment of the present application provides an electronic device, which includes a processor and a memory, where the memory stores at least one instruction, at least one program, a set of codes, or a set of instructions, and the at least one instruction, the at least one program, the set of codes, or the set of instructions is executed by the processor to implement the text logic information structuring method according to the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, where at least one instruction, at least one program, a set of codes, or a set of instructions is stored in the computer-readable storage medium, and the at least one instruction, the at least one program, the set of codes, or the set of instructions is executed by a processor to implement the text logic information structuring method according to the first aspect.
The beneficial effects of the embodiments of the application are as follows: the embodiments of the application provide a text logic information structuring method and apparatus, and an electronic device. Based on the technical solution provided by the embodiments, when completing knowledge-extraction-related tasks such as knowledge graph construction, the structural information of the text can be fully used, which facilitates the construction and refinement of knowledge. In addition, using a twin network greatly reduces the time complexity of comparing sentences pairwise, so that more text information is obtained and text processing efficiency is improved.
Drawings
Fig. 1 is a schematic flowchart of a text logic information structuring method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a twin network according to an embodiment of the present application;
fig. 3 is a schematic flowchart of a method for processing structured text logic information to be edited according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an apparatus for structuring text logic information according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions of the present application are described in further detail below with reference to specific embodiments. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present application.
The embodiment of the application provides a text logic information structuring method, a text logic information structuring device and electronic equipment, and aims to solve the technical problem that logic structure information of an existing document is difficult to extract and use.
The technical solutions provided by the embodiments of the present application are described in detail below with reference to the accompanying drawings.
Referring to fig. 1, a schematic flow chart of a text logic information structuring method according to an embodiment of the present application is shown, where the method includes:
s101, obtaining a text to be edited, and coding the text to be edited according to a paragraph sequence to obtain a plurality of paragraphs to be classified;
For step S101, in an embodiment, before the text to be edited is encoded in paragraph order, any overly long paragraph in the text to be edited is truncated to obtain a truncated paragraph, and the truncated paragraph replaces the long paragraph to form the text to be classified.
It can be understood that, when the text to be edited contains long paragraphs, each long paragraph is truncated to improve processing efficiency, and the truncated paragraph stands in for the long paragraph in subsequent processing.
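The truncation step above can be sketched as follows. This is a minimal illustration; the 256-character limit and the function name are assumptions, since the embodiment does not specify a particular threshold.

```python
def truncate_long_paragraph(paragraph: str, max_chars: int = 256) -> str:
    # Replace an overly long paragraph with its truncated prefix;
    # short paragraphs pass through unchanged.
    if len(paragraph) <= max_chars:
        return paragraph
    return paragraph[:max_chars]
```

The truncated paragraph then represents the original long paragraph in all subsequent classification steps.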
S102, sequentially inputting each text to be classified into the trained twin network in coding order for binary classification to obtain a first classification result, wherein the first classification result is either a chapter title or chapter content;
S103, obtaining the sentence vectors of the texts to be classified whose first classification result is chapter title, and inputting the sentence vectors of each two adjacent such texts into the trained twin network for logical structure classification to obtain a logical structure classification result, wherein the logical structure classification result is an upper-lower level relation, a same-level relation, or a cross-level relation;
and S104, carrying out logic information structuralization processing on the text to be edited based on the first classification result and the logic structure classification result.
In one embodiment, the training data construction of the trained twin network comprises:
Each text in the training set is structured according to its chapter information using template matching together with manual checking and labeling, and the text information of each paragraph (delimited by line breaks) is annotated. The annotation content comprises a text id ("textIndex"), a chapter-title flag ("isTitle"), a parent node id ("parentId"), and the text content ("content"). The textIndex increases sequentially in document order, starting from 1. When a line of text is a chapter title, "isTitle" is 1; otherwise it is 0. For a chapter title, "parentId" is the textIndex of the chapter title one directory level above it; if the chapter title is a first-level title, "parentId" is 0. For text that is not a chapter title, "parentId" is the textIndex of the chapter title it belongs to. For text that is a chapter title, its "content" is the chapter name; for a non-title paragraph, its "content" is the paragraph content. In particular, since chapter titles only appear at the beginning of paragraphs, pre-trained language models limit the length of input text, overly long texts incur extra time overhead, and only the beginning of each paragraph matters for logical structure extraction, non-title text is split on punctuation marks and only the first sentence is kept for analysis and training.
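The labeling scheme just described could be represented as follows. This is a sketch: the exact punctuation set used for sentence splitting and the helper names are assumptions for illustration.

```python
import re

def first_sentence(text: str) -> str:
    # Non-title paragraphs keep only their first sentence,
    # split on common Chinese/English sentence-ending punctuation.
    return re.split(r"[。！？.!?]", text, maxsplit=1)[0]

def make_record(text_index: int, is_title: bool, parent_id: int, text: str) -> dict:
    return {
        "textIndex": text_index,          # 1-based position in document order
        "isTitle": 1 if is_title else 0,  # chapter-title flag
        "parentId": parent_id,            # 0 for first-level titles
        "content": text if is_title else first_sentence(text),
    }

records = [
    make_record(1, True, 0, "1 Introduction"),
    make_record(2, False, 1, "This chapter gives background. Details follow."),
]
```

Titles keep their full name as "content", while content paragraphs are reduced to their first sentence, matching the observation that only the beginning of a paragraph matters for logical structure extraction.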
After all texts in the training set have been processed, the processed structured texts are used to construct training data. The embodiments of the present application mainly adopt a Sentence-BERT twin network to complete the logical information extraction and structuring tasks simultaneously. The twin network handles two classification tasks:
The first classification task: judging whether a single input text describes a chapter title, i.e., a binary classification task;
The second classification task: inputting two texts simultaneously and judging the logical structure relationship between them, i.e., a logical structure classification task.
The second classification task concerns the relation between the two texts, which falls into the following 4 cases:
Case 1: both input texts describe chapter titles, and their logical structures are at the same level; the relation id is labeled 0;
Case 2: both input texts describe chapter titles, and their logical structures are in an upper-lower level relation; the relation id is labeled 1;
Case 3: both input texts describe chapter titles, but their logical structures are in a cross-level relationship, i.e., the relationship between the last subsection title of the previous chapter and the chapter title of the next chapter; or the first input text is not a chapter title, the second input text is a chapter title, and the two belong to different chapters. In both situations the relation id is labeled 2;
Case 4: neither input text describes a chapter title, or one input describes a chapter title and the other does not but both belong to the same section. In both situations the relation id is labeled 3.
Because text carries natural contextual order, shuffled text data does not match the logic of practical applications, and logical structure analysis only needs to proceed sequentially in text order. In the training data construction stage, two consecutive texts are taken from the processed data each time as one model input, and the two classification tasks are trained simultaneously.
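Given the annotation fields described above (textIndex, isTitle, parentId), the relation id of each consecutive pair can be derived mechanically. The following is a sketch of one way to do so, matching the four cases; the function name and record layout are assumptions.

```python
def relation_id(a: dict, b: dict) -> int:
    # a, b: two consecutive annotated texts; returns the relation id for cases 1-4.
    if b["isTitle"] == 0:
        return 3                      # case 4: b is content within the current section
    if a["isTitle"] == 0:
        return 2                      # case 3: content followed by a new chapter title
    if b["parentId"] == a["textIndex"]:
        return 1                      # case 2: upper-lower (parent-child) level relation
    if b["parentId"] == a["parentId"]:
        return 0                      # case 1: same-level relation
    return 2                          # case 3: cross-level relation between titles

doc = [
    {"textIndex": 1, "isTitle": 1, "parentId": 0},  # "1 Intro"
    {"textIndex": 2, "isTitle": 1, "parentId": 1},  # "1.1 Background"
    {"textIndex": 3, "isTitle": 0, "parentId": 2},  # paragraph under 1.1
    {"textIndex": 4, "isTitle": 1, "parentId": 0},  # "2 Method"
]
pairs = [relation_id(a, b) for a, b in zip(doc, doc[1:])]
```

Each `(a, b, relation_id)` triple then becomes one training input for the second classification task.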
Referring to fig. 2, a schematic structural diagram of a twin network provided in an embodiment of the present application is shown. The twin network structure is used to fine-tune BERT and update the model parameters, so that the sentence embeddings generated by the tuned network can complete the classification tasks directly through an attached classifier. In the twin network, two sentences form a sentence pair, which is input into two parameter-sharing BERT models for training, yielding a sentence vector for each of the two sentences. Adopting a twin network ensures that the model can be used both for the single-sentence classification task and for the task of classifying the relation between two sentences; this avoids training multiple models and repeatedly recomputing sentence vectors, and thereby improves efficiency.
In the training input stage, each sentence is split into characters, with [CLS] added at the start and [SEP] added at the end to mark the beginning and end of the sentence. In the output stage, the BERT model's output vector at the [CLS] position is taken as the sentence vector of the input sentence, and this vector is fed into a Softmax classifier to train classification task one. At the same time, the element-wise absolute difference of the two sentence vectors, |U-V|, is fed into another Softmax classifier to train classification task two. Both loss functions are cross-entropy loss functions.
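This two-task training step can be sketched without any deep learning framework. In the sketch below, the real shared BERT encoder is replaced by a toy character-count encoder, and the 4-dimensional vectors and fixed head weights are illustrative assumptions; only the structure (shared encoder, U for task one, |U-V| for task two, summed cross-entropy losses) follows the description above.

```python
import math

def encode(sentence: str, dim: int = 4) -> list:
    # Toy stand-in for the shared BERT encoder's [CLS] vector;
    # both sentences of a pair pass through this same (parameter-shared) encoder.
    vec = [0.0] * dim
    for i, ch in enumerate(sentence):
        vec[i % dim] += ord(ch) / 1000.0
    return vec

def head_logits(weights: list, x: list) -> list:
    # Linear classifier head: one weight row per class.
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def cross_entropy(logits: list, label: int) -> float:
    # -log softmax(logits)[label], computed stably via log-sum-exp.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return lse - logits[label]

# One training pair: two consecutive texts.
u = encode("1.1 Background")                       # sentence vector U
v = encode("1.2 Related work")                     # sentence vector V
uv = [abs(a - b) for a, b in zip(u, v)]            # |U - V| pair feature

w_title = [[0.1] * 4, [0.2] * 4]                           # task-one head: 2 classes
w_rel = [[0.1] * 4, [0.2] * 4, [0.3] * 4, [0.4] * 4]       # task-two head: 4 relation ids

loss_one = cross_entropy(head_logits(w_title, u), 1)   # U alone -> is-title classification
loss_two = cross_entropy(head_logits(w_rel, uv), 0)    # |U - V| -> logical relation classification
total_loss = loss_one + loss_two                       # overall loss is the sum of both tasks
```

In the real embodiment, `total_loss` would be backpropagated through the shared BERT parameters so that the fine-tuned embeddings serve both classifiers.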
Softmax is a common function widely used in classification tasks. It maps its inputs to real numbers between 0 and 1 and normalizes them to sum to 1, so the probabilities over the classes also sum exactly to 1. The definition of the Softmax function is shown in formula (2-1):

Si = exp(Vi) / Σ(j=1..C) exp(Vj)    (2-1)

where Vi is the classifier output for category i, i is the category index, and C is the total number of categories; Si is the ratio of the exponential of the current element to the sum of the exponentials of all elements. Softmax converts the outputs of a multi-class classifier into relative probabilities, and in practical applications the class with the highest probability is taken as the classification result.
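Formula (2-1) can be implemented directly; a minimal sketch:

```python
import math

def softmax(v: list) -> list:
    # Subtract the max for numerical stability; the result matches formula (2-1),
    # since exp(x - m) / sum exp(x - m) == exp(x) / sum exp(x).
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
```

The outputs are strictly between 0 and 1 and sum to 1, and the class with the largest input keeps the largest probability.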
For the model's loss during training, the overall loss is obtained by adding the loss of classification task one and the loss of classification task two, and the final logical information structuring model is obtained through training.
For step S102, the trained twin network performs the task-one binary classification on each text to be classified in sequence, judging whether the paragraph to be classified represented by the text is a chapter title or chapter content. For a text classified as a chapter title, its isTitle is recorded as 1 and its sentence vector is retained for the next processing step. For a text classified as not a chapter title, its isTitle is recorded as 0; following the natural layout of the document structure, the paragraph it represents is directly taken as chapter content of the nearest preceding text to be classified that was judged to be a chapter title, and that chapter title's textIndex is taken as the paragraph's parentId, completing the processing of that paragraph to be classified.
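The nearest-preceding-title assignment described in this step can be sketched as follows. Field names reuse the labeling scheme above; the function name is an assumption.

```python
def assign_content_parents(records: list) -> list:
    # Each non-title paragraph gets the nearest preceding chapter title
    # (in document order) as its parentId.
    last_title_index = 0
    for rec in records:
        if rec["isTitle"] == 1:
            last_title_index = rec["textIndex"]
        else:
            rec["parentId"] = last_title_index
    return records

doc = [
    {"textIndex": 1, "isTitle": 1},
    {"textIndex": 2, "isTitle": 0},
    {"textIndex": 3, "isTitle": 0},
    {"textIndex": 4, "isTitle": 1},
    {"textIndex": 5, "isTitle": 0},
]
doc = assign_content_parents(doc)
```

After this pass, only the chapter-title nodes still need their parent relations resolved, which is done by the logical structure classification that follows.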
In one embodiment, the performing of the logical information structuring process on the text to be edited based on the first classification result and the logical structure classification result includes:
taking each text to be classified whose first classification result is chapter content as the chapter content of the nearest preceding text to be classified whose first classification result is a chapter title.
In one embodiment, the performing the logical information structuring process on the text to be edited based on the first classification result and the logical structure classification result further includes:
when the logical structure classification result is an upper-lower level relation, the earlier text to be classified is used as the parent node of the later text to be classified;
when the logical structure classification result is a same-level relation, the later text to be classified and the earlier text to be classified share the same parent node;
and when the logical structure classification result is a cross-level relation, the parent node of the later text to be classified is resolved separately.
In one embodiment, the texts to be classified whose first classification results are chapter titles and that share the same parent node are sorted according to the parent node relations of the texts to be classified.
Referring to fig. 3, a flow diagram of a method for structuring the logical information of a text to be edited according to an embodiment of the present application is shown. Logical structuring proceeds over the texts with isTitle = 1 in order of appearance. The chapter title that appears first automatically becomes the first first-level title, with parent node parentId 0; then the sentence vector of the second chapter title in sequence is input into the trained twin network for logical structure classification, and so on through the sequence. When the classification result is an upper-lower level relation, the textIndex of the earlier title is taken as the parentId of the later title. When the classification result is a same-level relation, the two texts share a common parentId. When the classification result is a cross-level relation, the second text is classified against the parent node of the first text, and this process loops until a node in a same-level relation is found. After all chapter-title nodes have been classified and their parent nodes found, the chapter titles of each directory level are numbered in sequence, giving every chapter title a reasonable logical order; chapters belonging to the same parent node are arranged in order of appearance, and a consistent directory notation is maintained. Once the logical-order numbering is complete, the structuring of the logical information of the text to be edited is finished.
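The looping parent search described above can be sketched with the pairwise classifier abstracted as a callable. In the sketch, the twin network is simulated from known heading depths so the example is runnable; the fallback to parentId 0 when the walk passes the top of the hierarchy is an assumption not spelled out in the embodiment.

```python
def structure_titles(titles: list, classify) -> dict:
    # titles: chapter-title textIndex values in order of appearance;
    # classify(a, b) -> 0 same level, 1 upper-lower level, 2 cross level.
    parent = {titles[0]: 0}           # the first title becomes a first-level title
    for prev, cur in zip(titles, titles[1:]):
        anchor = prev
        while True:
            rel = classify(anchor, cur)
            if rel == 1:              # cur is a child of anchor
                parent[cur] = anchor
                break
            if rel == 0:              # cur is a sibling of anchor
                parent[cur] = parent[anchor]
                break
            if parent[anchor] == 0:   # walked past the top: treat as first-level
                parent[cur] = 0
                break
            anchor = parent[anchor]   # cross level: retry against anchor's parent
    return parent

levels = {1: 1, 2: 2, 3: 2, 4: 3, 5: 1}   # assumed heading depths for the simulation
def classify(a: int, b: int) -> int:
    if levels[b] == levels[a]:
        return 0
    if levels[b] == levels[a] + 1:
        return 1
    return 2

parents = structure_titles([1, 2, 3, 4, 5], classify)
```

Title 5 (a new first-level chapter after a depth-3 subsection) illustrates the cross-level loop: the classifier is re-applied against each ancestor of the previous title until a same-level node is found.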
Referring to fig. 4, a schematic structural diagram of an apparatus for structuring text logic information according to an embodiment of the present application is shown, where the apparatus includes:
a first obtaining unit 401, configured to obtain a text to be edited, and encode the text to be edited according to a paragraph sequence to obtain a plurality of paragraphs to be classified;
a second obtaining unit 402, configured to obtain a text to be classified for each paragraph to be classified, and sequentially input each text to be classified into a trained twin network in coding order for binary classification to obtain a first classification result, where the first classification result is either a chapter title or chapter content;
a first classification unit 403, configured to obtain the sentence vectors of the texts to be classified whose first classification result is chapter title, and input the sentence vectors of two adjacent such texts into the trained twin network for logical structure classification to obtain a logical structure classification result, where the logical structure classification result is an upper-lower level relation, a same-level relation, or a cross-level relation;
a first processing unit 404, configured to perform logic information structuring processing on the text to be edited based on the first classification result and the logic structure classification result.
Referring to fig. 5, a schematic structural diagram of an electronic device according to an embodiment of the present application is shown, where the electronic device may include: at least one network interface 502, a memory 503, and at least one processor 501. The various components in the electronic device are coupled together by a bus system 504. It will be appreciated that the bus system 504 is used to enable communications among the components. The bus system 504 includes a power bus, a control bus, and a status signal bus in addition to a data bus, but for clarity of illustration, the various buses are labeled as bus system 504 in FIG. 5.
In some embodiments, memory 503 stores the following elements, executable modules or data structures, or a subset thereof, or an expanded set thereof: an operating system 5031 and application programs 5032.
The operating system 5031 includes various system programs, such as a framework layer, a core library layer, and a driver layer, for implementing various basic services and processing hardware-based tasks. The application programs 5032 include various applications, such as a Media Player and a Browser, for implementing various application services. The program implementing the method of the embodiments of the present application may be included in an application program.
In the above embodiment, the electronic device further includes: at least one instruction, at least one program, set of codes, or set of instructions stored on the memory 503 that is executable by the processor 501 to perform steps implementing any of the textual logic information structuring methods described in embodiments of the present application.
In one embodiment, the present application further provides a computer-readable storage medium, in which at least one instruction, at least one program, a code set, or a set of instructions is stored, and when executed by a processor, the at least one instruction, the at least one program, the code set, or the set of instructions implements the steps of any of the text logic information structuring methods in the embodiments of the present application.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above may be implemented by a computer program instructing the relevant hardware, and that the at least one instruction, the at least one program, the code set, or the instruction set may be stored in a non-volatile computer-readable storage medium; when executed, it may implement the steps of any of the text logic information structuring methods described in the embodiments of the present application. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above-described embodiments, which are only illustrative and not restrictive; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, which are within the protection scope of the present application.
Claims (10)
1. A text logic information structuring method is characterized by comprising the following steps:
acquiring a text to be edited, and encoding the text to be edited according to a paragraph sequence to acquire a plurality of paragraphs to be classified;
acquiring a text to be classified for each paragraph to be classified, and sequentially inputting each text to be classified into a trained twin network in coding order for binary classification to obtain a first classification result, wherein the first classification result is either a chapter title or chapter content;
obtaining the sentence vectors of the texts to be classified whose first classification result is chapter title, and inputting the sentence vectors of each two adjacent such texts into the trained twin network for logical structure classification to obtain a logical structure classification result, wherein the logical structure classification result is an upper-lower level relation, a same-level relation, or a cross-level relation;
and carrying out logic information structuralization processing on the text to be edited based on the first classification result and the logic structure classification result.
2. The method according to claim 1, wherein before the text to be edited is encoded in paragraph order, an overly long paragraph in the text to be edited is truncated to obtain a truncated paragraph, and the truncated paragraph replaces the long paragraph to form the text to be classified.
3. The method as claimed in claim 1, wherein the performing of the logical information structuring process on the text to be edited based on the first classification result and the logical structure classification result comprises:
and taking each text to be classified whose first classification result is chapter content as the chapter content of the nearest preceding text to be classified whose first classification result is a chapter title.
4. The method as claimed in claim 3, characterized in that performing the logic information structuring on the text to be edited based on the first classification result and the logical-structure classification result further comprises:
when the logical-structure classification result is a superior-subordinate relation, taking the earlier-ordered text to be classified as the parent node of the later-ordered text to be classified;
when the logical-structure classification result is a peer relation, giving the later-ordered text to be classified the same parent node as the earlier-ordered text to be classified;
and when the logical-structure classification result is a cross-level relation, taking the later-ordered text to be classified as an independent parent node.
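The three claim-4 rules amount to assigning each title a parent index. A minimal sketch under the assumption that relations are given pairwise between adjacent titles (names and relation labels are hypothetical):

```python
from typing import List, Optional


def build_title_tree(titles: List[str],
                     relations: List[str]) -> List[Optional[int]]:
    """Sketch of the claim-4 rules. relations[i] relates titles[i]
    (earlier) to titles[i+1] (later):
      'subordinate' -> earlier title becomes parent of the later one;
      'peer'        -> later title shares the earlier title's parent;
      'cross'       -> later title starts an independent subtree (no parent).
    Returns a parent index (into titles) per title, None for roots."""
    assert len(relations) == len(titles) - 1
    parents: List[Optional[int]] = [None]  # the first title is a root
    for i, rel in enumerate(relations):
        if rel == "subordinate":
            parents.append(i)              # parent is the earlier title
        elif rel == "peer":
            parents.append(parents[i])     # same parent as the earlier title
        else:                              # 'cross': independent parent node
            parents.append(None)
    return parents
```

For titles `["A", "A.1", "A.2", "B"]` with relations `["subordinate", "peer", "cross"]`, "A.1" and "A.2" both hang under "A", while "B" opens a new subtree.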
5. The method as claimed in claim 4, characterized in that the method further comprises:
sorting the texts to be classified that share the same parent node, according to the parent-node relations of the texts to be classified whose first classification result is chapter title.
6. The method as claimed in claim 1, characterized in that the loss function of the trained Siamese network is obtained by adding the binary classification loss function and the logical-structure classification loss function.
7. The method as claimed in claim 6, characterized in that each of the loss functions is a cross-entropy loss function.
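Claims 6 and 7 define the training objective as the sum of two cross-entropy terms, which can be written out as follows (the symbols are assumptions; $R$ denotes the number of logical-relation classes):

```latex
\mathcal{L} \;=\; \mathcal{L}_{\text{cls}} + \mathcal{L}_{\text{struct}},
\qquad
\mathcal{L}_{\text{cls}} = -\sum_{c=1}^{2} y_c \log \hat{y}_c,
\qquad
\mathcal{L}_{\text{struct}} = -\sum_{r=1}^{R} y_r \log \hat{y}_r
```

Here $y$ and $\hat{y}$ are the one-hot target and predicted class probabilities for the binary title/content head and the logical-structure head, respectively.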
8. A text logic information structuring device, characterized in that the device comprises:
a first acquisition unit, configured to acquire a text to be edited and number the text to be edited in paragraph order to obtain a plurality of paragraphs to be classified;
a second acquisition unit, configured to acquire the text to be classified of each paragraph to be classified, and input each text to be classified in numbering order into a trained Siamese network for binary classification to obtain a first classification result, wherein the first classification result is a chapter title or chapter content;
a first classification unit, configured to obtain sentence vectors of the texts to be classified whose first classification result is chapter title, and input the sentence vectors of every two adjacent such texts into the trained Siamese network for logical-structure classification to obtain a logical-structure classification result, wherein the logical-structure classification result is a superior-subordinate relation, a peer relation, or a cross-level relation;
and a first processing unit, configured to perform logic information structuring on the text to be edited based on the first classification result and the logical-structure classification result.
9. An electronic device, characterized in that the electronic device comprises a processor and a memory, wherein the memory stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is executed by the processor to implement the text logic information structuring method according to any one of claims 1-7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is executed by a processor to implement the text logic information structuring method according to any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111044975.0A CN113723078A (en) | 2021-09-07 | 2021-09-07 | Text logic information structuring method and device and electronic equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113723078A true CN113723078A (en) | 2021-11-30 |
Family
ID=78682350
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111044975.0A Pending CN113723078A (en) | 2021-09-07 | 2021-09-07 | Text logic information structuring method and device and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113723078A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109670162A (en) * | 2017-10-13 | 2019-04-23 | 北大方正集团有限公司 | The determination method, apparatus and terminal device of title |
CN109977366A (en) * | 2017-12-27 | 2019-07-05 | 珠海金山办公软件有限公司 | A kind of catalogue generation method and device |
CN111079402A (en) * | 2019-12-31 | 2020-04-28 | 北大方正集团有限公司 | Document hierarchy dividing method, document hierarchy dividing device, and readable storage medium |
CN111460083A (en) * | 2020-03-31 | 2020-07-28 | 北京百度网讯科技有限公司 | Document title tree construction method and device, electronic equipment and storage medium |
CN113312452A (en) * | 2021-06-16 | 2021-08-27 | 哈尔滨工业大学 | Chapter-level text continuity classification method based on multi-task learning |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115982342A (en) * | 2022-12-27 | 2023-04-18 | 中科天网(广东)标准技术研究有限公司 | Integration formulation method and system based on achievement conversion standard |
CN115982342B (en) * | 2022-12-27 | 2023-08-25 | 中科天网(广东)标准技术研究有限公司 | Integration formulation method and system based on achievement conversion standard |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106844424B (en) | LDA-based text classification method | |
CN109791569B (en) | Causal relationship identification device and storage medium | |
WO2021203581A1 (en) | Key information extraction method based on fine annotation text, and apparatus and storage medium | |
CN107239529B (en) | Public opinion hotspot category classification method based on deep learning | |
CN111738004A (en) | Training method of named entity recognition model and named entity recognition method | |
CN107578292B (en) | User portrait construction system | |
CN110750635B (en) | French recommendation method based on joint deep learning model | |
CN109446423B (en) | System and method for judging sentiment of news and texts | |
CN112417854A (en) | Chinese document abstraction type abstract method | |
CN112347255B (en) | Text classification method based on title and text combination of graph network | |
CN114372153A (en) | Structured legal document warehousing method and system based on knowledge graph | |
CN114881043B (en) | Deep learning model-based legal document semantic similarity evaluation method and system | |
CN107357895A (en) | A kind of processing method of the text representation based on bag of words | |
CN116205211A (en) | Document level resume analysis method based on large-scale pre-training generation model | |
CN116595406A (en) | Event argument character classification method and system based on character consistency | |
CN115292490A (en) | Analysis algorithm for policy interpretation semantics | |
CN111428502A (en) | Named entity labeling method for military corpus | |
CN113076720B (en) | Long text segmentation method and device, storage medium and electronic device | |
CN113723078A (en) | Text logic information structuring method and device and electronic equipment | |
CN115600602B (en) | Method, system and terminal device for extracting key elements of long text | |
CN116304064A (en) | Text classification method based on extraction | |
CN116795789A (en) | Method and device for automatically generating patent retrieval report | |
CN111859955A (en) | Public opinion data analysis model based on deep learning | |
CN111813927A (en) | Sentence similarity calculation method based on topic model and LSTM | |
CN114330350A (en) | Named entity identification method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||