CN112257412A - Chapter analysis method, electronic device and storage device


Info

Publication number
CN112257412A
Authority
CN
China
Legal status
Granted
Application number
CN202011024707.8A
Other languages
Chinese (zh)
Other versions
CN112257412B
Inventor
刘加新
胡加学
王琳博
方逸群
Current Assignee
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date
Filing date
Publication date
Application filed by iFlytek Co Ltd
Priority to CN202011024707.8A
Publication of CN112257412A
Application granted
Publication of CN112257412B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/279 Recognition of textual entities


Abstract

The application discloses a chapter parsing method, an electronic device and a storage device. The chapter parsing method includes: acquiring a chapter to be parsed, where the chapter to be parsed contains several paragraphs; identifying the paragraphs, and taking consecutive paragraphs that belong to the same topic type as a section corresponding to that topic type; and determining, for each topic type, the edit relationships between the sections corresponding to that topic type. With this scheme, the depth of chapter analysis can be increased.

Description

Chapter analysis method, electronic device and storage device
Technical Field
The present application relates to the field of information technologies, and in particular, to a chapter parsing method, an electronic device, and a storage device.
Background
In daily work and life, people need to read chapters such as contract texts, regulatory clauses and journal papers. However, reading and understanding a chapter usually takes time, and for longer chapters it can take considerably longer to digest the whole text. In view of this, how to deepen the analysis of chapters so as to aid reading and understanding is a topic of great research value.
Disclosure of Invention
The technical problem mainly solved by the application is to provide a chapter analysis method, electronic equipment and a storage device, which can deepen chapter analysis depth.
In order to solve the above problem, a first aspect of the present application provides a chapter parsing method, including: acquiring a chapter to be parsed, where the chapter to be parsed contains several paragraphs; identifying the paragraphs, and taking consecutive paragraphs that belong to the same topic type as a section corresponding to that topic type; and determining, for each topic type, the edit relationships between the sections corresponding to that topic type.
In order to solve the above problem, a second aspect of the present application provides an electronic device, which includes a memory and a processor coupled to each other, wherein the memory stores program instructions, and the processor is configured to execute the program instructions to implement the chapter parsing method in the first aspect.
In order to solve the above problem, a third aspect of the present application provides a storage device storing program instructions executable by a processor, the program instructions being used to implement the chapter parsing method in the first aspect.
According to the above scheme, a chapter to be parsed containing several paragraphs is acquired, the paragraphs are identified, and consecutive paragraphs that belong to the same topic type are taken as a section corresponding to that topic type, so that the edit relationships between the sections corresponding to the same topic type are determined respectively. In this way, the sections corresponding to each topic type are identified at the level of chapter structure, and the edit relationships between sections of the same topic type are determined at the level of chapter semantics, so the depth of chapter analysis can be deepened along the two dimensions of chapter structure and chapter semantics.
Drawings
FIG. 1 is a schematic flowchart of an embodiment of the chapter parsing method of the present application;
FIG. 2 is a state diagram of an embodiment of text recognition;
FIG. 3 is a state diagram of another embodiment of text recognition;
FIG. 4 is a flowchart of an embodiment of step S12 in FIG. 1;
FIG. 5 is a block diagram of an embodiment of a topic type identification model;
FIG. 6 is a state diagram of an embodiment of similarity calculation between a preset sentence and definition keywords;
FIG. 7 is a flowchart of an embodiment of step S13 in FIG. 1;
FIG. 8 is a block diagram of an embodiment of a section relation prediction model;
FIG. 9 is a block diagram of an embodiment of an electronic device of the present application;
FIG. 10 is a block diagram of an embodiment of a storage device of the present application.
Detailed Description
The following describes in detail the embodiments of the present application with reference to the drawings attached hereto.
In the following description, for purposes of explanation and not limitation, specific details are set forth such as particular system structures, interfaces, techniques, etc. in order to provide a thorough understanding of the present application.
The terms "system" and "network" are often used interchangeably herein. The term "and/or" herein is merely an association describing an associated object, meaning that three relationships may exist, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" herein generally indicates that the former and latter related objects are in an "or" relationship. Further, the term "plurality" herein means two or more than two.
Referring to fig. 1, fig. 1 is a schematic flowchart of an embodiment of the chapter parsing method of the present application. Specifically, the method may include the following steps:
Step S11: obtain a chapter to be parsed, where the chapter to be parsed contains several paragraphs.
In the embodiment of the present disclosure, the chapter to be parsed may be stored in text format and may include, but is not limited to: contract texts, literature papers, conference chapters, analysis reports and the like, which may be set according to actual application needs and are not limited herein. For example, when a conference chapter that has undergone multiple revisions needs to be parsed, the chapter to be parsed may include the conference chapter from past revisions; or, when a sales contract needs to be analyzed, the chapter to be parsed may include the sales contract, and so on, which is not illustrated one by one here.
In the embodiment of the present disclosure, the chapter to be parsed may contain, for example, 1, 2, 3 or 4 paragraphs, which is not limited herein. In addition, the chapter to be parsed may also include, but is not limited to: headings of each level (e.g., first-level headings, second-level headings), headers, footers, tables and the like, which may be set according to the actual situation and are not limited herein. For example, in the case of an analysis report, the chapter to be parsed usually includes several parts (e.g., a current-situation overview part, a data analysis part), each part usually corresponds to at least one heading to distinguish it from the other parts, and tables are usually inserted between paragraphs to present data clearly and intuitively. Other cases can be deduced by analogy and are not illustrated one by one here.
In one implementation scenario, the chapter to be parsed in text format can be obtained directly. For example, a contract text in text format may be obtained directly from sales staff, a literature paper in text format may be obtained directly from a researcher, or a survey report in text format may be downloaded from the internet, and so on, which is not illustrated one by one here.
In another implementation scenario, the chapter to be parsed in text format can be obtained by recognizing an image to be processed on which the chapter is recorded. For example, scanned images of historically revised conference chapters may be recognized to obtain the conference chapter in text format; or a photograph of a survey report may be recognized to obtain the survey report in text format, and so on. Specifically, the image to be processed on which the chapter to be parsed is recorded may be obtained, image recognition may be performed on the image to obtain several text lines, and text recognition may then be performed on the text lines to determine the line type of each text line; the line types may include, but are not limited to, paragraph start, body text, etc. The recognized text lines and their line types can then be used to obtain the chapter to be parsed containing several paragraphs. In this manner, through two-stage recognition, i.e., first performing image recognition to obtain several text lines and then performing text recognition on the text lines to obtain their line types, the text structure information lost in the image to be processed can be recovered, which helps improve the accuracy of the recognized chapter to be parsed and, in turn, the accuracy of subsequent chapter parsing.
In a specific implementation scenario, an OCR (Optical Character Recognition) method may be used to perform image recognition on the image to be processed to obtain several text lines. Specifically, a text detection model such as PixelLink or TextBoxes++ may be used to detect the text line regions in the image to be processed, a CNN (Convolutional Neural Network) may be used to extract features from each text line region to obtain its image features, an RNN (Recurrent Neural Network) may be used to predict on the image features, and the prediction result may be transcribed by CTC (Connectionist Temporal Classification) to obtain the text line in text format.
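For illustration, the following is a minimal sketch of such a CNN + RNN + CTC text-line recognizer in PyTorch. The layer sizes, the character-set size and the module names are assumptions made for this sketch and are not taken from the original disclosure.

```python
import torch
import torch.nn as nn

class CRNNRecognizer(nn.Module):
    """Sketch of the CNN -> RNN -> CTC pipeline; sizes are illustrative only."""
    def __init__(self, num_chars: int, hidden: int = 128):
        super().__init__()
        # CNN extracts a feature map from the cropped text-line image
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 1)),
        )
        # RNN predicts a per-timestep character distribution over the width axis
        self.rnn = nn.LSTM(128 * 8, hidden, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, num_chars + 1)  # +1 for the CTC blank

    def forward(self, line_img: torch.Tensor) -> torch.Tensor:
        # line_img: (batch, 1, 32, width) grayscale crops of detected text lines
        feat = self.cnn(line_img)                     # (batch, 128, 8, width/2)
        feat = feat.permute(0, 3, 1, 2).flatten(2)    # (batch, steps, 128*8)
        out, _ = self.rnn(feat)
        return self.fc(out).log_softmax(-1)           # CTC-ready log-probs

model = CRNNRecognizer(num_chars=5000)                # e.g. a Chinese charset
logits = model(torch.randn(2, 1, 32, 256))            # two detected line crops
print(logits.shape)                                   # (2, steps, 5001)
```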
In another specific implementation scenario, to improve the accuracy of the line type, line features of the text line may be extracted, and the line features may specifically include at least one of the following: the text content of the text line, the relative position of the text line in the image to be processed and the text size of the text line, so that the line type of the text line is determined by utilizing the line characteristics of the text line. By the method, the line type of the text line can be determined by integrating various line characteristics, so that the accuracy of the line type can be improved.
Referring to fig. 2, fig. 2 is a state diagram of an embodiment of text recognition. As shown in fig. 2, several text lines such as text line 1, text line 2, ..., text line n are input into a text line encoding network to obtain a line feature related to the text content of each line, i.e., text line character content feature 1 for text line 1, text line character content feature 2 for text line 2, and so on up to text line character content feature n for text line n. The character content feature of each text line and the corresponding visual feature (i.e., text line visual feature 1, text line visual feature 2, ..., text line visual feature n in fig. 2) are then input into a text line type prediction network to obtain the line type of the corresponding text line. In particular, the text line encoding network may include, but is not limited to, a BERT (Bidirectional Encoder Representations from Transformers) network, and the text line type prediction network may include, but is not limited to, an LSTM (Long Short-Term Memory) network. The visual features may include, but are not limited to: the relative position of the text line in the image to be processed (e.g., its position relative to the left, right, upper and lower borders of the image) and the text size of the text line (e.g., the ratio of the height of the text line to the height of the image to be processed, the ratio of the width of the text line to the width of the image to be processed), which are not limited herein.
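A minimal sketch of such a line type prediction network follows, assuming 768-dimensional content features (e.g., from a BERT-style encoder), six visual features per line and three line types; these dimensions and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LineTypePredictor(nn.Module):
    """Sketch: fuse per-line content features with visual features, then let an
    LSTM classify every line (e.g. "title", "paragraph start", "text")."""
    def __init__(self, content_dim=768, visual_dim=6, hidden=128, num_types=3):
        super().__init__()
        self.lstm = nn.LSTM(content_dim + visual_dim, hidden,
                            bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, num_types)

    def forward(self, content_feats, visual_feats):
        # content_feats: (batch, n_lines, content_dim), e.g. BERT [CLS] vectors
        # visual_feats:  (batch, n_lines, visual_dim), e.g. relative positions, size ratios
        fused = torch.cat([content_feats, visual_feats], dim=-1)
        hidden, _ = self.lstm(fused)          # context across neighbouring lines
        return self.classifier(hidden)        # per-line type logits

predictor = LineTypePredictor()
logits = predictor(torch.randn(1, 20, 768), torch.rand(1, 20, 6))
print(logits.argmax(-1))                      # predicted type id of each line
```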
In yet another specific implementation scenario, please refer to fig. 3, which is a state diagram of another embodiment of text recognition. As shown in fig. 3, by recognizing the image to be processed on which the chapter to be parsed is recorded, several text lines (represented by solid rectangular boxes) and the line type of each text line (indicated by dashed arrows) can be obtained. For example, the line type of the text line "Articles of Association of the Company" is "title", and the line type of the text line beginning "Chapter 1: In accordance with ... and related laws and regulations" is "paragraph start". A text line whose type is "paragraph start" opens a new paragraph, and the following text lines whose type is "text" are merged into that paragraph; for example, the text lines stating that the company is jointly funded by XX and XX in accordance with the relevant laws and regulations and that XXX Co., Ltd. is established and these articles are formulated are grouped into one paragraph, and the text lines of Chapter 2 stating that matters not covered by these articles shall be handled in accordance with the provisions of laws and regulations are grouped into another paragraph. Other cases can be deduced by analogy and are not illustrated one by one here.
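A minimal sketch of merging recognized text lines into paragraphs by line type; the line type names and the sample lines are illustrative assumptions.

```python
def group_lines_into_paragraphs(lines):
    """Merge recognized text lines into paragraphs using their predicted line type.
    `lines` is a list of (text, line_type) pairs, where line_type is one of
    "title", "paragraph start" or "text" (names are illustrative)."""
    paragraphs, current = [], None
    for text, line_type in lines:
        if line_type == "title":
            paragraphs.append({"type": "title", "text": text})
            current = None
        elif line_type == "paragraph start" or current is None:
            current = {"type": "paragraph", "text": text}
            paragraphs.append(current)
        else:  # "text": continuation of the current paragraph
            current["text"] += text
    return paragraphs

doc = group_lines_into_paragraphs([
    ("Articles of Association", "title"),
    ("Chapter 1 In accordance with related laws,", "paragraph start"),
    ("the company formulates these articles.", "text"),
])
print(doc)
```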
Step S12: and identifying a plurality of paragraphs, and taking the paragraphs which belong to the same topic type and are continuous as the sections corresponding to the topic type.
In the embodiment of the present disclosure, the subject type is used to represent the subject matter described in the paragraph text. Taking the to-be-parsed chapters as the conference chapters as an example, the topic types may include but are not limited to: decision making, resolution flow, etc.; alternatively, taking the piece to be parsed as a literature paper as an example, the topic types may include but are not limited to: background technology, theoretical introduction, experimental design, experimental result analysis and the like; other situations can be set according to actual situations, and are not exemplified.
Specifically, a plurality of paragraphs may be identified to obtain a topic type to which each paragraph belongs, and then the paragraphs that belong to the same topic type and are consecutive may be used as the sections corresponding to the topic type. Taking the chapter to be analyzed as the conference chapter as an example, the chapter to be analyzed sequentially comprises: in the case of paragraph 1, paragraph 2, paragraph 3, paragraph 4, and paragraph 5, if the topic type of paragraph 1 is identified as "decision item" and the topic types of paragraphs 2 to 5 are identified as "resolution flow", paragraph 1 may be used as a section corresponding to the topic type "decision item", and paragraphs 2 to 5 may be used as a section corresponding to the topic type "resolution flow", and the rest may be similar, which is not illustrated herein.
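A minimal sketch of this grouping step, assuming a topic label has already been predicted for each paragraph; the label strings follow the example above.

```python
from itertools import groupby

def paragraphs_to_sections(topic_labels):
    """Collapse consecutive paragraphs with the same predicted topic type into
    sections. `topic_labels[i]` is the topic type of paragraph i."""
    sections, start = [], 0
    for topic, run in groupby(topic_labels):
        length = len(list(run))
        sections.append({"topic": topic, "paragraphs": list(range(start, start + length))})
        start += length
    return sections

labels = ["decision item"] + ["resolution flow"] * 4   # paragraphs 1..5
print(paragraphs_to_sections(labels))
# [{'topic': 'decision item', 'paragraphs': [0]},
#  {'topic': 'resolution flow', 'paragraphs': [1, 2, 3, 4]}]
```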
In an implementation scenario, in order to improve the efficiency of determining topic types, a topic type identification model may be trained in advance, so that the several paragraphs can be processed as input data of the topic type identification model to identify the topic type of each paragraph. Specifically, the topic type identification model may include an encoding sub-network and a prediction sub-network: the encoding sub-network encodes the paragraphs to obtain an encoded representation of each paragraph, and the prediction sub-network obtains, from the encoded representation of each paragraph, the probability values that the paragraph belongs to each of several topic types, so that the topic type corresponding to the maximum probability value can be taken as the topic type of the paragraph. Taking the chapter to be parsed being a conference chapter as an example, when the probability that paragraph 1 belongs to "decision item" is predicted to be 90% and the probability that it belongs to "resolution flow" is predicted to be 10%, "decision item" can be taken as the topic type of paragraph 1. Other cases can be deduced by analogy and are not illustrated one by one here.
In a specific implementation scenario, the encoding sub-network may include, but is not limited to, a BERT model, and the prediction sub-network may include at least a fully connected layer and a softmax layer connected in sequence, which is not limited herein.
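A minimal sketch of the prediction sub-network described above (a fully connected layer followed by softmax), assuming the encoding sub-network has already produced a 768-dimensional vector per paragraph; the dimensions and topic names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TopicTypeClassifier(nn.Module):
    """Sketch of the prediction sub-network: a fully connected layer followed by
    softmax over topic types. The encoding sub-network (e.g. BERT) is assumed to
    have already produced one vector per paragraph."""
    def __init__(self, encoding_dim=768, num_topics=2):
        super().__init__()
        self.fc = nn.Linear(encoding_dim, num_topics)

    def forward(self, paragraph_encodings):
        # paragraph_encodings: (n_paragraphs, encoding_dim)
        return torch.softmax(self.fc(paragraph_encodings), dim=-1)

topics = ["decision item", "resolution flow"]
clf = TopicTypeClassifier(num_topics=len(topics))
probs = clf(torch.randn(5, 768))                          # 5 paragraph encodings
print([topics[i] for i in probs.argmax(-1).tolist()])     # predicted topic per paragraph
```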
In an implementation scenario, several definition sentences may also be set for each topic type, so that the several paragraphs are identified using the definition sentences of the topic types and the topic type to which each paragraph belongs is determined. Specifically, the correlation between each paragraph and the several definition sentences of a topic type may be calculated, and when the correlation between a paragraph and the definition sentences of a certain topic type is greater than a preset correlation threshold, the paragraph is determined to belong to that topic type. In this manner, the topic type to which a paragraph belongs is identified through the definition sentences of the topic type rather than predicted directly, which reduces the sensitivity to the specific set of topic types; when a new topic type is added, only the definition sentences of the new topic type need to be maintained, and the topic type identification model does not have to be retrained.
In a specific implementation scenario, the definition sentences of a topic type include several sentences closely related to that topic type. Taking the chapter to be parsed being a conference chapter as an example, when the topic type is "resolution flow", the definition sentences may include but are not limited to: "the resolution flow specifies the resolutions of the shareholders' meeting and the board meeting", "the conditions under which a meeting resolution is passed", "the specification of the personnel who should attend the meeting", etc.; alternatively, taking the chapter to be parsed being a literature paper as an example, when the topic type is "experimental design", the definition sentences may include but are not limited to: "preparation before the experiment", "the experiment can be designed from several dimensions", "points of attention of the experiment", etc., which are not limited herein. Other cases can be deduced by analogy and are not illustrated one by one here.
In another specific implementation scenario, the several definition sentences of a topic type may be encoded to obtain an encoded representation of each definition sentence, and a paragraph may be encoded to obtain an encoded representation of the paragraph, so that the similarity between the encoded representation of the paragraph and that of each definition sentence can be calculated; the overall correlation between the paragraph and the topic type is then obtained by aggregating the correlations with the individual definition sentences. In the same way, the overall correlations between the paragraph and the other topic types can be obtained, and the topic type corresponding to the maximum overall correlation can be taken as the topic type to which the paragraph belongs. Still taking the chapter to be parsed being a conference chapter as an example, the correlation between paragraph 1 and each definition sentence of the topic type "resolution flow" can be obtained, and aggregating these correlations yields the overall correlation between paragraph 1 and "resolution flow"; similarly, the overall correlation between paragraph 1 and the topic type "decision item" can be obtained, and the topic type with the maximum overall correlation is taken as the topic type of paragraph 1. Other cases can be deduced by analogy and are not illustrated one by one here.
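A minimal sketch of this definition-sentence matching, assuming the paragraph and the definition sentences have already been encoded into vectors; cosine similarity and mean aggregation are assumptions, since the text only requires some similarity measure and some aggregation.

```python
import numpy as np

def topic_by_definition_sentences(paragraph_vec, topic_definitions):
    """Sketch: pick the topic whose definition-sentence encodings are, on average,
    most similar to the paragraph encoding."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    scores = {
        topic: np.mean([cosine(paragraph_vec, d) for d in defs])
        for topic, defs in topic_definitions.items()
    }
    return max(scores, key=scores.get), scores

rng = np.random.default_rng(0)
definitions = {"resolution flow": [rng.normal(size=16) for _ in range(3)],
               "decision item":   [rng.normal(size=16) for _ in range(3)]}
print(topic_by_definition_sentences(rng.normal(size=16), definitions))
```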
Step S13: and respectively determining the editing relation between sections corresponding to the same theme type.
In one implementation scenario, the edit relationship may include, but is not limited to: equivalence relations, complement relations, alternative relations, etc., which are not limited herein.
In a specific implementation scenario, the equivalence relation indicates that the two segments are substantially the same, for example, segment 1 is "XXXX month XX day, XX ten thousand yuan paid by XXX limited, occupying XX%" of registered capital, segment 2 is "XX ten thousand yuan paid by XXX limited in XXXX month XX day, occupying XX%" of registered capital, and segment 1 and segment 2 are substantially the same in content although they are described in different words, so segment 1 and segment 2 are equivalent relations. Other cases may be analogized, and no one example is given here.
In another specific implementation scenario, the supplemental relationship indicates that the content described in one section is a proper subset of the content described in another section. For example, the section 1 is "XX month XX day in XXXX year, XX ten thousand yuan paid by XXX limited account for XX% of registered capital", the section 2 is "XX ten thousand yuan paid by XXX limited company in XX month XX day in XXXX year, account for XX% of registered capital, and award XX ten thousand yuan again in XX month XX day in XXXX year", and the section 2 is substantially added with "and award XX ten thousand yuan again in XX month XX day in XXXX year" on the basis of the section 1, so that the section 1 and the section 2 are in a complementary relationship. Other cases may be analogized, and no one example is given here.
In yet another specific implementation scenario, the replacement relationship indicates that the content described in one section is invalidated and replaced by the content described in another section. For example, sector 1 is "XXXX month XX day, XX ten thousand yuan of XXX limited company, occupies XX%" of registered capital, sector 2 is "XXX limited company modified capital, XXX ten thousand yuan of capital, occupies XX%" of registered capital, and sector 2 essentially replaces the description of sector 1, so that sector 1 and sector 2 are in replacement relationship. Other cases may be analogized, and no one example is given here.
In one implementation scenario, the sections corresponding to the same topic type may be respectively used as sections to be analyzed, and the section characteristics of the sections to be analyzed are used to determine the edit relationship between the sections to be analyzed, where the section characteristics include at least one of: the content of the section to be analyzed and the position of the section to be analyzed in the chapter to be analyzed. On this basis, the above operations are respectively performed on the sections corresponding to at least one topic type, and finally the editing relationship between the sections corresponding to the at least one topic type can be determined. The at least one theme type can be set according to the actual application requirement. Taking the to-be-analyzed chapters as the conference chapters for example, the topic type may be at least one of "decision item" and "resolution flow", for example, if the user is interested in the edit relationship between the sections under the topic type of "decision item", the at least one topic type may be "decision item", or if the user is interested in the edit relationship between the sections under the topic types of "decision item" and "resolution flow", the at least one topic type may be "decision item" and "resolution flow", and the like, which is not illustrated herein. By the method, the editing relation can be determined by integrating the contents of the sections to be analyzed, positions in the sections to be analyzed and other dimensions, so that the accuracy of the editing relation can be improved.
In a specific implementation scenario, a section relation prediction model may be trained in advance, so that section features of a section to be analyzed may be input into the section relation prediction model, and an editing relation between sections to be analyzed may be obtained through prediction. In particular, the segment relational prediction model may comprise a segment coding network for extracting a segment-embedded representation of the segment to be parsed and a relational prediction network for predicting an edit relationship based on the segment-embedded representation. The segment-coded network may specifically include, but is not limited to: the BERT model, the relationship prediction network may specifically include but is not limited to: relation extraction model (relationship Extractor).
In another specific implementation scenario, the position of the to-be-analyzed section in the to-be-analyzed chapter may specifically include a page number of the to-be-analyzed section in the to-be-analyzed chapter.
In yet another specific implementation scenario, in the case that the segment to be parsed includes a table, the segment characteristics of the segment to be parsed may include at least one of: header of table, title of table. By the aid of the method, the expression dimensionality of the section characteristics can be further enriched, and accuracy of the editing relation can be further improved.
According to the above scheme, a chapter to be parsed containing several paragraphs is acquired, the paragraphs are identified, and consecutive paragraphs that belong to the same topic type are taken as a section corresponding to that topic type, so that the edit relationships between the sections corresponding to the same topic type are determined respectively. In this way, the sections corresponding to each topic type are identified at the level of chapter structure, and the edit relationships between sections of the same topic type are determined at the level of chapter semantics, so the depth of chapter analysis can be deepened along the two dimensions of chapter structure and chapter semantics.
Referring to fig. 4, fig. 4 is a schematic flowchart of an embodiment of step S12 in fig. 1. Specifically, in the embodiment of the present disclosure, the several paragraphs are identified using several definition sentences of the topic types, and the topic type to which each paragraph belongs is determined. The method may specifically include the following steps:
Step S41: extract first feature representations of the several paragraphs and second feature representations of the several definition sentences, respectively.
In the embodiment of the disclosure, the first feature representations contain contextual semantic information between the several paragraphs. As described above, referring to fig. 5, which is a block diagram of an embodiment of the topic type identification model, the model includes an encoding sub-network, and the encoding sub-network further includes a first encoding sub-network and a second encoding sub-network. The several paragraphs and the several definition sentences can be input into the first encoding sub-network to extract a first embedding representation of each paragraph and a second embedding representation of each definition sentence; the first embedding representations of the paragraphs are then input into the second encoding sub-network to obtain the corresponding first feature representations, so that after processing by the second encoding sub-network, each first feature representation contains the contextual semantic information of the several paragraphs. In particular, the first encoding sub-network may include, but is not limited to, a BERT model, and the second encoding sub-network may include, but is not limited to, a bidirectional LSTM model, which is not limited herein.
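A minimal sketch of the second encoding sub-network described above, assuming 768-dimensional paragraph embeddings from a BERT-like first encoding sub-network; the dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ContextualParagraphEncoder(nn.Module):
    """Sketch of the second encoding sub-network: a bidirectional LSTM run over the
    per-paragraph embeddings so that each first feature representation also reflects
    the neighbouring paragraphs."""
    def __init__(self, embed_dim=768, hidden=256):
        super().__init__()
        self.bilstm = nn.LSTM(embed_dim, hidden, bidirectional=True, batch_first=True)

    def forward(self, paragraph_embeddings):
        # paragraph_embeddings: (1, n_paragraphs, embed_dim), in document order
        first_feature_reps, _ = self.bilstm(paragraph_embeddings)
        return first_feature_reps               # (1, n_paragraphs, 2 * hidden)

encoder = ContextualParagraphEncoder()
print(encoder(torch.randn(1, 5, 768)).shape)    # torch.Size([1, 5, 512])
```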
In one implementation scenario, the definition sentences of a topic type may include not only several sentences related to the topic type but also sentences unrelated to the topic type that serve to distinguish it; for convenience of description, the sentences related to the topic type may be called positive-example definition sentences and the unrelated sentences may be called identification definition sentences. Specifically, the identification definition sentences are used to distinguish paragraphs that might, on the surface, be considered to belong to the topic type, and adding identification definition sentences to the several definition sentences can improve the accuracy of the topic type.
In a specific implementation scenario, taking the definition sentences of "resolution flow" as an example, sentences related to the resolution flow, such as "the resolution flow specifies the resolutions of the shareholders' meeting and the board meeting", "the conditions under which a meeting resolution is passed" and "the specification of the personnel who should attend the meeting", can be taken as positive-example definition sentences of the topic type "resolution flow", while a sentence that is unrelated to the resolution flow but easy to confuse with it, such as "the resolution flow does not include matters related to meeting notification", can be taken as an identification definition sentence of the topic type "resolution flow". For example, the main point of the paragraph "when the company holds a shareholders' meeting, the convener shall notify all shareholders twenty days before the meeting is held" is essentially "meeting notification", which is unrelated to the "resolution flow" but can easily be confused with it on the surface, so the identification definition sentence can effectively reduce the probability of this paragraph being mistakenly classified as "resolution flow".
In one implementation scenario, the several definition sentences of a topic type may be custom-set by the user. For example, the user can set several definition sentences for each topic type involved in the chapter to be parsed.
In another implementation scenario, in order to improve the efficiency of obtaining definition sentences, several definition keywords of the topic type and several preset sentences related to the topic type may also be obtained, and feature representations of the definition keywords and of the preset sentences are extracted respectively, so that the total similarity score of each preset sentence is obtained from the similarity scores between its feature representation and the feature representations of the definition keywords; a preset sentence is taken as a definition sentence of the topic type when its total similarity score satisfies a preset condition. In this way, the definition sentences of a topic type can be screened out using only the definition keywords of the topic type and the preset sentences, which avoids setting definition sentences manually and improves the efficiency of obtaining the definition sentences.
In a specific implementation scenario, similar to the definition sentences, the several definition keywords of a topic type may include at least one of positive-example definition keywords and identification definition keywords, where a positive-example definition keyword is a keyword related to the topic type, and an identification definition keyword is used to distinguish from the positive-example definition keywords and is a keyword unrelated to the topic type. Still taking the topic type "resolution flow" as an example, the positive-example definition keywords may include, but are not limited to, "voting method" and "attendees", and the identification definition keywords may include, but are not limited to, "meeting notification"; other cases can be deduced by analogy and are not illustrated one by one here. Specifically, when the positive-example definition sentences of the topic type need to be acquired, several positive-example definition keywords of the topic type can be acquired, and the step of acquiring several preset sentences related to the topic type and the subsequent steps are performed; alternatively, when the identification definition sentences of the topic type need to be acquired, several identification definition keywords of the topic type can be acquired, and the step of acquiring several preset sentences related to the topic type and the subsequent steps are performed.
In another specific implementation scenario, please refer to fig. 6, which is a state diagram of an embodiment of similarity calculation between a preset sentence and definition keywords. As shown in fig. 6, several definition keywords (for convenience of description, i positive-example definition keywords denoted P = {t_p1, t_p2, ..., t_pi} and j identification definition keywords denoted N = {t_n1, t_n2, ..., t_nj}) may be input into an encoding model (such as a BERT model) to obtain the feature representation of each definition keyword, and the preset sentence may be input into an encoding model (such as a BERT model) to obtain the feature representation of the preset sentence. The maximum of the similarity scores between the feature representation of a definition keyword and the partial feature representations of the preset sentence at each position can then be taken as the similarity score between that definition keyword and the preset sentence. For example, for the positive-example definition keyword t_p1, the length of t_p1 can be used as the size of a sliding window (shown as a dashed rectangle in fig. 6), which slides over the feature representation of the preset sentence with a preset step size (such as 1 or 2); at each step, the partial feature representation of the preset sentence inside the window is taken and its similarity score with t_p1 is computed, and the maximum value s_p1 over the sliding process is taken as the similarity score between the preset sentence and t_p1. Similarly, the similarity scores between the preset sentence and the other positive-example definition keywords can be obtained, so that the total similarity score of the preset sentence can be obtained by the following formula:
s = (1 / |P|) * (s_p1 + s_p2 + ... + s_pi)    (1)
In the above formula (1), s represents the total similarity score of the preset sentence, and |P| represents the total number of positive-example definition keywords.
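A minimal sketch of this sliding-window scoring and of formula (1), assuming token-level vector representations and cosine similarity over the flattened window; both are assumptions, since the text does not fix the similarity measure.

```python
import numpy as np

def keyword_sentence_similarity(keyword_vecs, sentence_vecs, step=1):
    """Slide a window of the keyword's length over the token representations of the
    preset sentence and keep the best match (the maximum similarity score)."""
    k = len(keyword_vecs)
    best = -1.0
    for start in range(0, len(sentence_vecs) - k + 1, step):
        window = sentence_vecs[start:start + k]
        a, b = keyword_vecs.ravel(), window.ravel()
        score = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
        best = max(best, score)
    return best

def total_similarity_score(positive_keyword_vecs, sentence_vecs):
    """Average of the per-keyword maxima, following formula (1)."""
    scores = [keyword_sentence_similarity(kw, sentence_vecs) for kw in positive_keyword_vecs]
    return sum(scores) / len(scores)

rng = np.random.default_rng(0)
sentence = rng.normal(size=(12, 8))                       # 12 token vectors
keywords = [rng.normal(size=(2, 8)), rng.normal(size=(3, 8))]
print(total_similarity_score(keywords, sentence))
```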
In yet another specific implementation scenario, the preset sentences related to a topic type may be obtained from text material related to that topic type. For example, several preset sentences related to the topic type "resolution flow" may be obtained from conference chapters related to the resolution flow, several preset sentences related to the topic type "experimental design" may be obtained from literature papers related to experimental design, and so on, which is not illustrated one by one here.
In another specific implementation scenario, the preset condition may specifically include: the total similarity score is greater than a preset score threshold. The preset score threshold may be set according to the actual application requirements and is not limited herein. For example, when a high accuracy of the definition sentences is required, the preset score threshold may be set relatively high, e.g., 90 or 95 points out of a maximum of 100; when the requirement on accuracy is relatively loose, the preset score threshold may be set somewhat lower, e.g., 75 or 85 points, which is not limited herein.
In another specific implementation scenario, in order to further improve the accuracy of the definition sentences, after the definition sentences of a topic type have been screened out in the above manner, the screened definition sentences may be output and the user's inspection result for them may be received, so that the screened definition sentences can be adjusted according to the inspection result to obtain the final definition sentences. Specifically, the inspection result may include a judgment on whether an output definition sentence really belongs to the topic type; alternatively, the inspection result may also include modification information for an output definition sentence, which is not limited herein.
Step S42: determine the topic type to which each paragraph belongs based on the first feature representations and the second feature representations.
With reference to fig. 5, for convenience of description, the first feature representations corresponding to the m paragraphs are denoted P and the second feature representations corresponding to the n definition sentences are denoted Q. A dot product of P and Q may be taken to obtain an m x n correlation matrix, denoted S, where each element of S represents the correlation between a paragraph and a definition sentence; for example, the element in the first row and first column represents the correlation between the first paragraph and the first definition sentence, and the other elements can be deduced by analogy. After that, on the one hand, the correlation matrix S may be normalized along the row dimension (e.g., with softmax) and multiplied by the second feature representations Q to update the first feature representations, denoted P'; on the other hand, S may be normalized along the row and column dimensions respectively (e.g., with softmax) and multiplied by the first feature representations P to update the first feature representations again, denoted P''. The correlation between each paragraph and the topic type to which the several definition sentences belong can then be predicted based on the original first feature representations P and the first feature representations P' and P'' updated through the correlation matrix S, so that the topic type of each paragraph can be determined according to its correlation with the topic types to which the input definition sentences belong.
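A minimal sketch of this interaction is given below. The exact way the row- and column-normalized matrices are composed to produce P'' and how P, P' and P'' are finally combined for prediction are not fully specified above, so the composition used here is an assumption.

```python
import torch
import torch.nn.functional as F

def interact_paragraphs_with_definitions(P, Q):
    """Sketch of the interaction: P is (m, d) paragraph features, Q is (n, d)
    definition-sentence features. Returns the correlation matrix S and the two
    updated paragraph representations P' and P''."""
    S = P @ Q.T                                        # (m, n) correlation matrix
    P_prime = F.softmax(S, dim=1) @ Q                  # row-normalised S times Q
    A = F.softmax(S, dim=1) @ F.softmax(S, dim=0).T    # (m, m), routed through the definitions
    P_double = A @ P                                   # second update of the paragraph features
    return S, P_prime, P_double

S, P1, P2 = interact_paragraphs_with_definitions(torch.randn(4, 64), torch.randn(6, 64))
print(S.shape, P1.shape, P2.shape)                     # (4, 6), (4, 64), (4, 64)
```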
In one implementation scenario, the correlation may include: the probability value of the relevance of each paragraph and the topic type of the definition sentences belongs to, so that in the case that the probability value is larger than a preset probability threshold (e.g. 90%, 95%), the topic type of the paragraph is determined as the topic type of the definition sentences.
In another implementation scenario, the correlation may also include: whether each paragraph is related (e.g., related, or unrelated) to the topic type to which the several definition sentences belong, so that the topic type to which the paragraph belongs can be determined as the topic type to which the several definition sentences belong in the case that the paragraph is related to the topic type to which the several definition sentences belong.
Different from the foregoing embodiment, the first feature representations of the several paragraphs and the second feature representations of the several definition sentences are extracted respectively, and the first feature representations contain the contextual semantic information between the several paragraphs, so that the topic type to which each paragraph belongs is determined based on the first feature representations and the second feature representations; this can improve the accuracy of determining the topic type.
Referring to fig. 7, fig. 7 is a flowchart of an embodiment of step S13 in fig. 1. In the embodiment of the present disclosure, the sections corresponding to the same topic type are respectively taken as sections to be parsed, and on this basis the section features of the sections to be parsed can be used to determine the edit relationships between the sections to be parsed. Specifically, the method may include the following steps:
Step S71: perform relation prediction using the section features of the sections to be parsed to obtain at least one candidate relation between the sections to be parsed.
Specifically, the section features of the sections to be parsed may be used to perform relation prediction, so as to obtain first probability values corresponding to at least one candidate relation between the sections to be parsed. The section features may include at least one of: the content of the section to be parsed and the position of the section to be parsed in the chapter to be parsed; for details, reference may be made to the related descriptions in the foregoing embodiments, which are not repeated here.
In one implementation scenario, please refer to fig. 8, which is a block diagram of an embodiment of the section relation prediction model. The section relation prediction model may include a section encoding network (e.g., a BERT model): the section features of a section to be parsed (e.g., the content of the section and its position in the chapter to be parsed) are taken as input data of the section encoding network, which produces the section embedding representation corresponding to the section. The section relation prediction model may further include a relation prediction network (e.g., a Relation Extractor): the section embedding representations are taken as its input data, and it outputs the first probability values corresponding to at least one candidate relation between the sections to be parsed. Taking the sections to be parsed 1, 2, ..., k shown in fig. 8 as an example, a k x k edit probability matrix may be obtained from the section relation prediction model, where the element in row i and column j represents the first probability values respectively corresponding to at least one candidate relation between the i-th and the j-th sections to be parsed, for example the first probability values of an equivalence relation, a replacement relation and a supplement relation between them. Other cases can be deduced by analogy and are not illustrated one by one here.
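A minimal sketch of such pairwise relation scoring follows, assuming the section encoding network has already produced one embedding per section; the pair-MLP scorer and the dimensions are illustrative assumptions standing in for the Relation Extractor named above.

```python
import torch
import torch.nn as nn

class SectionRelationPredictor(nn.Module):
    """Sketch: score every ordered pair of section embeddings against the candidate
    relations (equivalence, replacement, supplement)."""
    RELATIONS = ["equivalence", "replacement", "supplement"]

    def __init__(self, embed_dim=768, hidden=256):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(2 * embed_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, len(self.RELATIONS)),
        )

    def forward(self, section_embeddings):
        # section_embeddings: (k, embed_dim) -> (k, k, num_relations) probabilities
        k = section_embeddings.size(0)
        src = section_embeddings.unsqueeze(1).expand(k, k, -1)
        dst = section_embeddings.unsqueeze(0).expand(k, k, -1)
        return torch.softmax(self.scorer(torch.cat([src, dst], dim=-1)), dim=-1)

probs = SectionRelationPredictor()(torch.randn(3, 768))
print(probs.shape)     # torch.Size([3, 3, 3]) edit probability matrix
```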
In another implementation scenario, besides the edit probability matrix, the first probability values corresponding to at least one candidate relation between the sections to be parsed may also be represented by several triples, e.g., a triple <i-th section to be parsed, first probability value, j-th section to be parsed> for the i-th and j-th sections to be parsed. In addition, it may be agreed that the first element of a triple is the subject of the edit relationship and the last element is its object; for example, when the first probability value of the at least one candidate relation includes the first probability value of a replacement relation, the triple may be agreed to mean the first probability value that the first element replaces the last element. Alternatively, it may be agreed that the first element is the object of the edit relationship and the last element is the subject, in which case the triple may be agreed to mean the first probability value that the last element replaces the first element. This may be set according to actual needs and is not limited herein.
Step S72: post-process the at least one candidate relation between the sections to be parsed, and determine the edit relationships between the sections to be parsed.
In an implementation scenario, the first probability values may be sorted from large to small, and for each pair of sections the candidate relation corresponding to the largest first probability value is selected as a temporary relation between the sections to be parsed. The temporary relations are then checked: if a set of contradictory temporary relations exists between the sections to be parsed, the temporary relation with the smaller corresponding first probability value can be deleted and replaced by the candidate relation with the second-largest first probability value for that pair, and the temporary relations are checked again until no contradiction remains. For example, referring to table 1, which lists the predicted relations between the sections to be parsed (the source section corresponds to the subject element of the triple and the target section corresponds to the object element): the first probability value of the replacement relation between section 1 to be parsed and section 2 to be parsed is 0.88 and is the largest among all candidate relations for that pair; the first probability value of the replacement relation between section 2 to be parsed and section 3 to be parsed is 0.92 and is the largest for that pair; and the first probability value of the equivalence relation between section 3 to be parsed and section 1 to be parsed is 0.82 and is the largest for that pair, so these three relations are all taken as temporary relations. Checking them reveals a contradiction: if the replacement relation between section 1 and section 2 and the replacement relation between section 2 and section 3 are both correct, the equivalence relation between section 3 and section 1 cannot hold. In this case, the temporary relation with the smallest first probability value among the three, i.e., the equivalence relation between section 3 and section 1, can be deleted and replaced by the replacement relation with the second-largest first probability value between section 3 and section 1; once no contradiction remains among the three, the temporary relations can be taken as the final edit relationships.
TABLE 1 Predicted relations between the sections to be parsed
Source section          Target section          Relation type    First probability value
Section 1 to be parsed  Section 2 to be parsed  Replacement      0.88
Section 2 to be parsed  Section 3 to be parsed  Replacement      0.92
Section 3 to be parsed  Section 1 to be parsed  Equivalence      0.82
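A minimal sketch of the post-processing loop described above follows. The contradiction check implemented here only covers the pattern in Table 1 (a replacement chain closed by an equivalence) and is an illustrative stand-in for a full consistency check; the candidate lists are assumed to be sorted by probability.

```python
def resolve_relations(candidates):
    """`candidates[(i, j)]` is a list of (relation, probability) pairs sorted by
    probability, highest first. Greedily keep the best candidate per pair and demote
    the weakest member of any contradictory set to its next candidate."""
    choice = {pair: 0 for pair in candidates}          # index of the kept candidate

    def current(pair):
        return candidates[pair][choice[pair]]

    def find_contradiction():
        # Illustrative rule: A replaces B, B replaces C, yet C is equivalent to A.
        for (a, b) in candidates:
            for (b2, c) in candidates:
                if b2 != b or (c, a) not in candidates:
                    continue
                if (current((a, b))[0] == "replacement"
                        and current((b, c))[0] == "replacement"
                        and current((c, a))[0] == "equivalence"):
                    return [(a, b), (b, c), (c, a)]
        return None

    while (conflict := find_contradiction()):
        weakest = min(conflict, key=lambda p: current(p)[1])
        choice[weakest] += 1                           # fall back to next candidate
    return {pair: current(pair) for pair in candidates}

table1 = {
    (1, 2): [("replacement", 0.88), ("supplement", 0.07)],
    (2, 3): [("replacement", 0.92), ("equivalence", 0.05)],
    (3, 1): [("equivalence", 0.82), ("replacement", 0.71)],
}
print(resolve_relations(table1))
```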
In another implementation scenario, in order to further improve the accuracy of the edit relationships, before the post-processing, logical analysis may be performed on the sections to be parsed to obtain the logical features of the sections to be parsed, so that the first probability values can be adjusted using the logical features to obtain second probability values respectively corresponding to at least one candidate relation between the sections to be parsed. A maximum spanning tree is then obtained based on the second probability values respectively corresponding to at least one candidate relation between the sections to be parsed, and finally the edit relationships between the sections to be parsed are determined using the maximum spanning tree. In this way, the first probability values can be refined using the logical features, and a maximum spanning tree is generated on that basis to determine the edit relationships between the sections to be parsed, which helps improve the accuracy of the edit relationships.
In one particular implementation scenario, the logical features may include, but are not limited to: time-related features in the sections to be parsed, edit-distance-related features between the sections to be parsed, and text-editing-related features in the sections to be parsed. Specifically, the time-related features may include time-related text in the section to be parsed (e.g., "XX XX, XXXX"); the edit distance between one section to be parsed and another is N if the former becomes identical to the latter after N editing operations such as insertion, deletion and modification; and the text-editing-related features may include keywords such as "modified to" or "similar to" in the section to be parsed, which is not limited herein.
In another specific implementation scenario, the logical features may be obtained by processing the section embedding representation of the section to be parsed with a logic recognition network. Specifically, the logic recognition network may include, but is not limited to, a Conditional Random Field (CRF).
In yet another specific implementation scenario, when the logical features include the time-related features in the sections to be parsed, if the time in section 1 to be parsed precedes the time in section 2 to be parsed, section 1 cannot be a replacement or a supplement of section 2; on this basis, the first probability value of the replacement relation and the first probability value of the supplement relation between section 1 and section 2 can be reduced correspondingly, for example by multiplying each of them by a coefficient greater than or equal to 0 and smaller than 1 (e.g., 0 or 0.5).
In yet another specific implementation scenario, when the logical features include the edit-distance-related features between the sections to be parsed, if the edit distance between section 1 to be parsed and section 2 to be parsed is smaller than a preset distance threshold, section 1 and section 2 may be considered to be in an equivalence relation, and on this basis only the first probability value corresponding to the equivalence relation between them may be retained. In addition, the preset distance threshold may be set according to practical application requirements: when the requirement for precision is high it may be set small (e.g., 1 or 2), and when the requirement for precision is relatively low it may be set larger (e.g., 20 or 30), which is not limited herein.
In yet another specific implementation scenario, when the logical features include the text-editing-related features in the sections to be parsed, if section 1 to be parsed includes keywords such as "modified to" or "replaced by" while section 2 to be parsed does not, the first probability value of section 1 being a replacement of section 2 should be greater than the first probability value of the equivalence relation; on this basis, the first probability value of the replacement relation can be increased, for example by multiplying the first probability value of section 1 being a replacement of section 2 by a coefficient greater than 1 (e.g., 1.5 or 2).
In another specific implementation scenario, the first probability values of at least one candidate relation between the sections to be parsed may be used to construct a directed weighted graph whose nodes are the sections to be parsed and whose edge weights are the first probability values of the at least one candidate relation between them, so that the maximum spanning tree can be obtained based on algorithms such as Kruskal or Prim, which are not described in detail here. By constructing the maximum spanning tree, contradictory candidate relations can be eliminated to the greatest extent possible, so that the final edit relationships are as accurate as possible.
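A minimal sketch of the spanning-tree step follows, using a Kruskal-style construction with a union-find over the highest-scoring candidate relation of each pair; treating the graph as undirected for the cycle check is an assumption consistent with the algorithms named above.

```python
def maximum_spanning_tree(num_sections, scored_relations):
    """Kruskal-style maximum spanning tree. `scored_relations` is a list of
    (probability, section_i, section_j, relation) tuples; the tree keeps the
    highest-probability acyclic subset of edges."""
    parent = list(range(num_sections))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for prob, i, j, relation in sorted(scored_relations, reverse=True):
        ri, rj = find(i), find(j)
        if ri != rj:                      # adding this edge creates no cycle
            parent[ri] = rj
            tree.append((i, j, relation, prob))
    return tree

edges = [(0.88, 0, 1, "replacement"),
         (0.92, 1, 2, "replacement"),
         (0.82, 2, 0, "equivalence")]
print(maximum_spanning_tree(3, edges))
# [(1, 2, 'replacement', 0.92), (0, 1, 'replacement', 0.88)]
```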
Different from the foregoing embodiment, relation prediction is performed using the section features of the sections to be analyzed to obtain at least one candidate relationship between the sections to be analyzed, and the at least one candidate relationship is then post-processed to determine the editing relationship between the sections to be analyzed. Because the post-processing operates on the predicted candidate relationships, candidate relationships that are contradictory or do not conform to logic can be reduced as much as possible, which helps improve the accuracy of the editing relationship.
Referring to fig. 9, fig. 9 is a schematic block diagram of an embodiment of an electronic device 90 according to the present application. The electronic device 90 includes a memory 91 and a processor 92 coupled to each other; the memory 91 stores program instructions, and the processor 92 is configured to execute the program instructions to implement the steps in any of the above embodiments of the chapter analysis method. Specifically, the electronic device 90 may include, but is not limited to: a mobile phone, a notebook computer, a tablet computer, and the like, which is not limited herein.
Specifically, the processor 92 is configured to control itself and the memory 91 to implement the steps in any of the above embodiments of the chapter analysis method. The processor 92 may also be referred to as a CPU (Central Processing Unit). The processor 92 may be an integrated circuit chip having signal processing capabilities. The processor 92 may also be a general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. In addition, the processor 92 may also be implemented jointly by integrated circuit chips.
In the disclosed embodiment, the processor 92 is configured to acquire a chapter to be analyzed, wherein the chapter to be analyzed includes a plurality of paragraphs; the processor 92 is configured to identify the plurality of paragraphs and take consecutive paragraphs belonging to the same topic type as the section corresponding to that topic type; and the processor 92 is configured to respectively determine the editing relationships between sections corresponding to the same topic type.
According to the above scheme, the chapter to be analyzed containing the plurality of paragraphs is acquired, the plurality of paragraphs are identified, and consecutive paragraphs belonging to the same topic type are taken as the sections corresponding to that topic type, so that the editing relationships between sections corresponding to the same topic type are determined respectively. In this way, the sections corresponding to the topic types can be identified at the chapter structure level, and the editing relationships between sections of the same topic type can be determined at the chapter semantic level, so that the analysis depth of the chapter can be deepened from the two dimensions of chapter structure and, further, chapter semantics.
In some disclosed embodiments, the processor 92 is configured to respectively take the sections corresponding to the same topic type as sections to be analyzed; the processor 92 is configured to determine the editing relationship between the sections to be analyzed by using the section features of the sections to be analyzed; wherein the section features include at least one of: the content of the sections to be analyzed and the positions of the sections to be analyzed in the chapter to be analyzed.
Different from the foregoing embodiment, the editing relationship between the sections to be analyzed is determined by using section features that include at least one of the content of the sections to be analyzed and the positions of the sections to be analyzed in the chapter to be analyzed, which helps determine the editing relationship by integrating multiple dimensions and thereby helps improve the accuracy of the editing relationship.
In some disclosed embodiments, the processor 92 is configured to perform relation prediction using the section features of the sections to be analyzed, so as to obtain at least one candidate relationship between the sections to be analyzed; the processor 92 is configured to post-process the at least one candidate relationship between the sections to be analyzed and determine the editing relationship between the sections to be analyzed.
Different from the foregoing embodiment, relation prediction is performed using the section features of the sections to be analyzed to obtain at least one candidate relationship between the sections to be analyzed, and the at least one candidate relationship is then post-processed to determine the editing relationship between the sections to be analyzed. Because the post-processing operates on the predicted candidate relationships, candidate relationships that are contradictory or do not conform to logic can be reduced as much as possible, which helps improve the accuracy of the editing relationship.
In some disclosed embodiments, the processor 92 is configured to perform relation prediction using the section features of the sections to be analyzed to obtain first probability values respectively corresponding to at least one candidate relationship between the sections to be analyzed; the processor 92 is configured to perform logic analysis on the sections to be analyzed to obtain logic features of the sections to be analyzed; the processor 92 is configured to adjust the first probability values using the logic features to obtain second probability values respectively corresponding to the at least one candidate relationship between the sections to be analyzed; the processor 92 is configured to obtain a maximum spanning tree based on the second probability values respectively corresponding to the at least one candidate relationship between the sections to be analyzed; and the processor 92 is configured to determine the editing relationship between the sections to be analyzed using the maximum spanning tree.
Different from the foregoing embodiment, the first probability values may be optimized using the logic features, and the maximum spanning tree is then generated on this basis to determine the editing relationship between the sections to be analyzed, which is beneficial to improving the accuracy of the editing relationship.
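Putting these pieces together, a rough sketch of the overall post-processing flow might look as follows; it reuses the adjust_first_probs and maximum_spanning_tree helpers sketched above, and the relation_model predictor and the section fields are again hypothetical.

def resolve_edit_relations(sections, relation_model):
    """Illustrative pipeline: predict candidate relations, adjust them with logic
    features, then prune contradictions with a maximum spanning tree."""
    edges = []
    for i, sec1 in enumerate(sections):
        for sec2 in sections[i + 1:]:
            first_probs = relation_model.predict(sec1, sec2)             # relation prediction
            second_probs = adjust_first_probs(first_probs, sec1, sec2)   # logic-feature adjustment
            for relation, prob in second_probs.items():
                edges.append((prob, sec1["id"], sec2["id"], relation))

    nodes = [sec["id"] for sec in sections]
    return maximum_spanning_tree(nodes, edges)    # surviving edges = editing relationships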
In some disclosed embodiments, the logic features include at least one of: a feature related to time in the sections to be analyzed, a feature related to the edit distance between the sections to be analyzed, and a feature related to text editing in the sections to be analyzed.
Different from the foregoing embodiments, setting the logic features to include at least one of a feature related to time in the sections to be analyzed, a feature related to the edit distance between the sections to be analyzed, and a feature related to text editing in the sections to be analyzed helps optimize the first probability values from multiple dimensions, thereby further improving the accuracy of the editing relationship.
In some disclosed embodiments, when the segment to be parsed comprises a table, the segment characteristics further comprise at least one of: header of table, title of table.
In contrast to the foregoing embodiment, when the section to be analyzed includes a table, setting the section features to further include at least one of the header of the table and the title of the table can further enrich the representation dimensions of the section features, thereby helping to improve the accuracy of the editing relationship.
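As a small, hypothetical sketch of how table information might be folded into the section features, the snippet below simply appends the table title and header text to the section content; the field names are assumptions, not taken from the patent.

def enrich_with_table_features(section):
    """Append the table title and header text to the section content (illustrative only)."""
    table = section.get("table")
    if table:
        extra = " ".join(filter(None, [table.get("title"), " ".join(table.get("header", []))]))
        section = {**section, "text": (section["text"] + " " + extra).strip()}
    return section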
In some disclosed embodiments, the processor 92 is configured to identify the plurality of paragraphs using a plurality of definition sentences of the topic types, and to determine the topic type to which each paragraph belongs.
Different from the foregoing embodiment, the plurality of paragraphs are identified using the plurality of definition sentences of the topic types to determine the topic type to which each paragraph belongs. This avoids directly predicting the topic type of a paragraph and reduces the sensitivity to specific topic types; furthermore, when a topic type is newly added, only the definition sentences of the new topic type need to be maintained, which avoids retraining the model that identifies topic types.
In some disclosed embodiments, the processor 92 is configured to extract first feature representations of the plurality of paragraphs, respectively, and extract second feature representations of the plurality of definition sentences, respectively; wherein the first feature representation comprises contextual semantic information between the paragraphs; the processor 92 is configured to determine a type of subject to which each paragraph belongs based on the first feature representation and the second feature representation.
Different from the foregoing embodiment, the first feature representations of the plurality of paragraphs and the second feature representations of the plurality of definition sentences are extracted respectively, and the first feature representations include contextual semantic information between the plurality of paragraphs, so that the topic type to which each paragraph belongs is determined based on the first and second feature representations, which can improve the accuracy of determining the topic type.
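A rough sketch of this matching step is shown below, assuming some sentence encoder is available; the encode_with_context and encode callables are placeholders standing in for whatever feature extractors are actually used, and the cosine-similarity matching is only one plausible way to compare the two representations.

import numpy as np

def classify_paragraphs(paragraphs, definition_sentences, encode_with_context, encode):
    """Assign each paragraph the topic type of its most similar definition sentence.

    paragraphs:           list of paragraph strings
    definition_sentences: dict mapping topic type -> list of definition sentences
    encode_with_context:  encodes all paragraphs jointly, so each first feature
                          representation carries contextual semantic information
    encode:               encodes an individual definition sentence
    """
    first_reprs = np.asarray(encode_with_context(paragraphs))   # (num_paragraphs, dim)

    topics, second_reprs = [], []
    for topic, sentences in definition_sentences.items():
        for sentence in sentences:
            topics.append(topic)
            second_reprs.append(encode(sentence))
    second_reprs = np.stack(second_reprs)                       # (num_definitions, dim)

    # Cosine similarity between every paragraph and every definition sentence.
    a = first_reprs / np.linalg.norm(first_reprs, axis=1, keepdims=True)
    b = second_reprs / np.linalg.norm(second_reprs, axis=1, keepdims=True)
    sims = a @ b.T

    return [topics[j] for j in sims.argmax(axis=1)]             # best-matching topic per paragraph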
In some disclosed embodiments, the processor 92 is configured to obtain a plurality of definition keywords of the topic type and a plurality of preset sentences related to the topic type; the processor 92 is configured to extract third feature representations of the plurality of defined keywords, and extract fourth feature representations of the plurality of preset sentences; the processor 92 is configured to obtain a total similarity score corresponding to each preset sentence by using the similarity score between the fourth feature representation of each preset sentence and the third feature representations of the plurality of defined keywords; the processor 92 is configured to use the preset sentence as a definition sentence of the topic type based on that the total similarity score of the preset sentence satisfies a preset condition.
Different from the foregoing embodiment, the definition sentences of a topic type can be screened out using only the definition keywords of the topic type and the preset sentences, so that manually composing the definition sentences can be avoided, which improves the efficiency of obtaining the definition sentences.
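A minimal sketch of this screening step follows; the encode callable is again a placeholder, and the mean similarity with a fixed threshold is only an assumed stand-in for the unspecified total-score computation and preset condition.

import numpy as np

def select_definition_sentences(keywords, preset_sentences, encode, score_threshold=0.5):
    """Keep preset sentences whose total similarity to the definition keywords is high enough."""
    third_reprs = np.stack([encode(k) for k in keywords])       # third feature representations
    selected = []
    for sentence in preset_sentences:
        fourth_repr = encode(sentence)                           # fourth feature representation
        sims = third_reprs @ fourth_repr / (
            np.linalg.norm(third_reprs, axis=1) * np.linalg.norm(fourth_repr))
        total_score = sims.mean()                                # total similarity score
        if total_score >= score_threshold:                       # assumed preset condition
            selected.append(sentence)
    return selected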
In some disclosed embodiments, the plurality of definition sentences includes at least one of: a positive-case definition sentence and an identification definition sentence, wherein the positive-case definition sentence represents a sentence related to the topic type, and the identification definition sentence is distinguished from the positive-case definition sentence and represents a sentence unrelated to the topic type; and/or, the plurality of definition keywords includes at least one of: a positive-case definition keyword and an identification definition keyword, wherein the positive-case definition keyword represents a keyword related to the topic type, and the identification definition keyword is distinguished from the positive-case definition keyword and represents a keyword unrelated to the topic type.
Different from the foregoing embodiments, the plurality of definition sentences is set to include at least one of a positive-case definition sentence (representing a sentence related to the topic type) and an identification definition sentence (distinguished from the positive-case definition sentence and representing a sentence unrelated to the topic type), and the plurality of definition keywords is set to include at least one of a positive-case definition keyword (representing a keyword related to the topic type) and an identification definition keyword (distinguished from the positive-case definition keyword and representing a keyword unrelated to the topic type). This helps determine the final topic type from at least one of the positive-case and identification dimensions, which is beneficial to improving the accuracy of determining the topic type.
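A tiny, invented example of how positive-case and identification definitions for one topic type might be organized is given below; the topic name, sentences, and keywords are purely hypothetical.

# Invented example for a hypothetical "amendment clause" topic type.
topic_definitions = {
    "amendment_clause": {
        "positive_sentences": ["This clause describes how the contract terms are amended."],
        "identification_sentences": ["This clause lists the contact details of the parties."],
        "positive_keywords": ["amend", "modify", "revise"],
        "identification_keywords": ["address", "telephone", "signature"],
    }
}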
In some disclosed embodiments, the editing relationship includes: an equivalence relationship, a supplement relationship, and a replacement relationship.
Unlike the foregoing embodiment, setting the editing relationship to include the equivalence relationship, the supplement relationship, and the replacement relationship helps cover the various relationships between sections, improving the user experience.
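For completeness, the three relation types could be represented as a small enumeration; this is purely illustrative and not part of the patent.

from enum import Enum

class EditRelation(Enum):
    EQUIVALENCE = "equivalence"   # the two sections express the same provision
    SUPPLEMENT = "supplement"     # one section adds content to the other
    REPLACEMENT = "replacement"   # one section supersedes the other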
Referring to fig. 10, fig. 10 is a schematic diagram of an embodiment of a storage device 100 according to the present application. The storage device 100 stores program instructions 101 that can be executed by a processor, and the program instructions 101 are configured to implement the steps in any of the above embodiments of the chapter analysis method.
According to the scheme, the analysis depth of chapters can be deepened.
In some embodiments, functions of or modules included in the apparatus provided in the embodiments of the present disclosure may be used to execute the method described in the above method embodiments, and specific implementation thereof may refer to the description of the above method embodiments, and for brevity, will not be described again here.
The foregoing description of the various embodiments is intended to highlight various differences between the embodiments, and the same or similar parts may be referred to each other, and for brevity, will not be described again herein.
In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a module or a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some interfaces, and may be in an electrical, mechanical or other form.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

Claims (13)

1. A chapter parsing method is characterized by comprising the following steps:
acquiring a chapter to be analyzed; the chapter to be analyzed comprises a plurality of paragraphs;
identifying the paragraphs, and taking the paragraphs which belong to the same topic type and are continuous as the sections corresponding to the topic type;
and respectively determining editing relations between sections corresponding to the same theme type.
2. The method according to claim 1, wherein the respectively determining editing relationship between sections corresponding to the same subject type comprises:
respectively taking sections corresponding to the same theme type as sections to be analyzed;
determining an editing relation between the sections to be analyzed by using the section characteristics of the sections to be analyzed;
wherein the segment features include at least one of: the content of the section to be analyzed and the position of the section to be analyzed in the chapter to be analyzed.
3. The method according to claim 2, wherein the determining the edit relationship between the sections to be resolved by using the section features of the sections to be resolved comprises:
carrying out relation prediction by using the section characteristics of the sections to be analyzed to obtain at least one candidate relation among the sections to be analyzed;
and carrying out post-processing on at least one candidate relationship between the sections to be analyzed, and determining an editing relationship between the sections to be analyzed.
4. The method according to claim 3, wherein the performing the relation prediction by using the section features of the sections to be resolved to obtain at least one candidate relation between the sections to be resolved comprises:
performing relation prediction by using the section characteristics of the sections to be analyzed to obtain first probability values respectively corresponding to at least one candidate relation among the sections to be analyzed;
before the post-processing at least one candidate relationship between the sections to be analyzed and determining the editing relationship between the sections to be analyzed, the method further includes:
carrying out logic analysis on the section to be analyzed to obtain logic characteristics of the section to be analyzed;
the post-processing at least one candidate relationship between the sections to be analyzed to determine the editing relationship between the sections to be analyzed includes:
adjusting the first probability value by using the logic characteristics to obtain second probability values respectively corresponding to at least one candidate relationship among the sections to be analyzed;
obtaining a maximum spanning tree based on second probability values respectively corresponding to at least one candidate relationship among the sections to be analyzed;
and determining the editing relation between the sections to be analyzed by utilizing the maximum spanning tree.
5. The method of claim 4, wherein the logic characteristics comprise at least one of: a characteristic related to time in the sections to be analyzed, a characteristic related to the edit distance between the sections to be analyzed, and a characteristic related to text editing in the sections to be analyzed.
6. The method of claim 2, further comprising:
when the section to be parsed comprises a table, the section features further comprise at least one of: a header of the table, a title of the table.
7. The method of claim 1, wherein the identifying the paragraphs comprises:
identifying the paragraphs by using a plurality of definition sentences of the topic type, and determining the topic type to which each paragraph belongs.
8. The method of claim 7, wherein the identifying the paragraphs by using the plurality of definition sentences of the topic type and determining the topic type to which each of the paragraphs belongs comprises:
respectively extracting first feature representations of the paragraphs and second feature representations of the definition sentences; wherein the first feature representations comprise contextual semantic information between the paragraphs;
determining a topic type to which each of the paragraphs belongs based on the first feature representation and the second feature representation.
9. The method of claim 7, wherein before the identifying the paragraphs by using the plurality of definition sentences of the topic type and determining the topic type to which each of the paragraphs belongs, the method further comprises:
acquiring a plurality of definition keywords of the topic type and a plurality of preset sentences related to the topic type;
respectively extracting third feature representations of the plurality of definition keywords, and respectively extracting fourth feature representations of the plurality of preset sentences;
obtaining a total similarity score corresponding to each preset sentence by using similarity scores between the fourth feature representation of each preset sentence and the third feature representations of the plurality of definition keywords;
and taking the preset sentence as a definition sentence of the topic type based on the total similarity score of the preset sentence satisfying a preset condition.
10. The method of claim 9,
the plurality of definition sentences comprises at least one of: a positive-case definition sentence and an identification definition sentence, wherein the positive-case definition sentence represents a sentence related to the topic type, and the identification definition sentence is distinguished from the positive-case definition sentence and represents a sentence unrelated to the topic type;
and/or, the plurality of definition keywords comprises at least one of: a positive-case definition keyword and an identification definition keyword, wherein the positive-case definition keyword represents a keyword related to the topic type, and the identification definition keyword is distinguished from the positive-case definition keyword and represents a keyword unrelated to the topic type.
11. The method of any one of claims 1 to 10, wherein the editing relationship comprises: an equivalence relationship, a supplement relationship, and a replacement relationship.
12. An electronic device, comprising a memory and a processor coupled to each other, wherein the memory stores program instructions, and the processor is configured to execute the program instructions to implement the chapter parsing method of any one of claims 1 to 11.
13. A storage device storing program instructions executable by a processor, wherein the program instructions are used to implement the chapter parsing method of any one of claims 1 to 11.
CN202011024707.8A 2020-09-25 2020-09-25 Chapter analysis method, electronic equipment and storage device Active CN112257412B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011024707.8A CN112257412B (en) 2020-09-25 2020-09-25 Chapter analysis method, electronic equipment and storage device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011024707.8A CN112257412B (en) 2020-09-25 2020-09-25 Chapter analysis method, electronic equipment and storage device

Publications (2)

Publication Number Publication Date
CN112257412A true CN112257412A (en) 2021-01-22
CN112257412B CN112257412B (en) 2023-12-01

Family

ID=74234962

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011024707.8A Active CN112257412B (en) 2020-09-25 2020-09-25 Chapter analysis method, electronic equipment and storage device

Country Status (1)

Country Link
CN (1) CN112257412B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0466516A2 (en) * 1990-07-13 1992-01-15 Artificial Linguistics Inc Text analysis system
US20150324340A1 (en) * 2014-05-07 2015-11-12 Golden Board Cultural and Creative Ltd., Co, Method for generating reflow-content electronic book and website system thereof
CN106021224A (en) * 2016-05-13 2016-10-12 中国科学院自动化研究所 Bilingual discourse annotation method
US20180260472A1 (en) * 2017-03-10 2018-09-13 Eduworks Corporation Automated tool for question generation
CN107145479A (en) * 2017-05-04 2017-09-08 北京文因互联科技有限公司 Structure of an article analysis method based on text semantic
US20190205322A1 (en) * 2017-12-29 2019-07-04 Aiqudo, Inc. Generating Command-Specific Language Model Discourses for Digital Assistant Interpretation
CN110046355A (en) * 2019-04-25 2019-07-23 讯飞智元信息科技有限公司 A kind of title paragraph detection method and device
CN111291188A (en) * 2020-02-20 2020-06-16 阿基米德(上海)传媒有限公司 Intelligent information extraction method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SUJIAN LI et al.: "Text-level discourse dependency parsing", Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, vol. 1, pages 25-35 *
FU Honghu et al.: "Design of a knowledge-oriented retrieval system based on paragraph retrieval and paragraph content analysis", Information Studies: Theory & Application (《情报理论与实践》), no. 5, pages 681-685 *

Also Published As

Publication number Publication date
CN112257412B (en) 2023-12-01

Similar Documents

Publication Publication Date Title
Hill et al. Quantifying the impact of dirty OCR on historical text analysis: Eighteenth Century Collections Online as a case study
US7801392B2 (en) Image search system, image search method, and storage medium
CN108920467B (en) Method and device for learning word meaning of polysemous word and search result display method
US8447588B2 (en) Region-matching transducers for natural language processing
US10803387B1 (en) Deep neural architectures for detecting false claims
US8843815B2 (en) System and method for automatically extracting metadata from unstructured electronic documents
US20100161639A1 (en) Complex Queries for Corpus Indexing and Search
US20070230787A1 (en) Method for automated processing of hard copy text documents
US20100254613A1 (en) System and method for duplicate text recognition
CN112926345B (en) Multi-feature fusion neural machine translation error detection method based on data enhancement training
CN112818093A (en) Evidence document retrieval method, system and storage medium based on semantic matching
CN117076653B (en) Knowledge base question-answering method based on thinking chain and visual lifting context learning
Kim et al. Figure text extraction in biomedical literature
US11574287B2 (en) Automatic document classification
CN113312478A (en) Viewpoint mining method and device based on reading understanding
CN115017303A (en) Method, computing device and medium for enterprise risk assessment based on news text
CN114896305A (en) Smart internet security platform based on big data technology
CN113722490A (en) Visual rich document information extraction method based on key value matching relation
CN111931491B (en) Domain dictionary construction method and device
CN112069307B (en) Legal provision quotation information extraction system
Dölek et al. A deep learning model for Ottoman OCR
CN111274354B (en) Referee document structuring method and referee document structuring device
CN112257412B (en) Chapter analysis method, electronic equipment and storage device
CN114861630A (en) Information acquisition and related model training method and device, electronic equipment and medium
CN114495138A (en) Intelligent document identification and feature extraction method, device platform and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant