CN115600593A

CN115600593A - Method and device for acquiring key content of literature

Info

Publication number: CN115600593A
Application number: CN202211362675.1A
Authority: CN
Inventors: 刘译璟; 李亚博; 李�赫; 李彦泽; 宋成; 毛健
Original assignee: Beijing Percent Technology Group Co ltd
Current assignee: Beijing Percent Technology Group Co ltd
Priority date: 2022-11-02
Filing date: 2022-11-02
Publication date: 2023-01-13

Abstract

The application discloses a method and a device for acquiring key content of a document, wherein the method comprises the following steps: acquiring a target document of key content to be extracted; inputting target content of the target document into a target model, and outputting a plurality of kinds of key information extracted from the target document, wherein the target model comprises a plurality of sub models corresponding to the plurality of kinds of key information, the plurality of sub models are obtained by training based on labeled document corpora in advance, one sub model is used for extracting one kind of key information, the plurality of kinds of key information comprise at least one of a research object, a specific problem, a solution method, a basic principle and a conclusion, the sub models corresponding to the research object, the specific problem, the solution method and the basic principle are BERT + CRF, and the sub model corresponding to the conclusion is a rule matching model; and combining the multiple kinds of key information to obtain a key content report of the target document. The method and the device can improve the acquisition efficiency of the key content of the literature.

Description

Method and device for acquiring key content of literature

Technical Field

The present application relates to the field of computers, and in particular, to a method and an apparatus for obtaining key content of a document.

Background

At present, document reading is still an important work before research is carried out by scientific researchers, in order to know the research dynamics in a certain field, the scientific researchers need to spend a lot of time reading document abstracts or full texts, summarize key contents in the documents, and then find own research direction to carry out research work according to the key contents.

However, the method of manually reading the documents and summarizing the key contents of the documents occupies much time and energy of scientific research personnel, and is inefficient.

Disclosure of Invention

The embodiment of the application provides a method and a device for acquiring key contents of documents, so that the efficiency of acquiring the key contents of the documents is improved.

In a first aspect, an embodiment of the present application provides a method for acquiring key content of a document, including:

acquiring a target document of key content to be extracted;

inputting target content of the target document into a target model, and outputting a plurality of kinds of key information extracted from the target document, wherein the target model comprises a plurality of sub models corresponding to the plurality of kinds of key information, the plurality of sub models are obtained by training based on labeled document corpora in advance, one sub model is used for extracting one kind of key information, the plurality of kinds of key information comprise at least one of a research object, a specific problem, a solution method, a basic principle and a conclusion, the sub models corresponding to the research object, the specific problem, the solution method and the basic principle are BERT + CRF, and the sub model corresponding to the conclusion is a rule matching model;

and combining the multiple kinds of key information to obtain a key content report of the target document.

In a second aspect, an embodiment of the present application further provides an apparatus for acquiring key content of a document, including:

the first acquisition module is used for acquiring a target document of key content to be extracted;

the information extraction module is used for inputting the target content of the target document into a target model and outputting a plurality of key information extracted from the target document, wherein the target model comprises a plurality of sub-models corresponding to the key information, the sub-models are obtained by training based on labeled document corpora in advance, one sub-model is used for extracting one piece of key information, the key information comprises at least one of a research object, a target problem, a solution method, a basic principle and a conclusion, the sub-models corresponding to the research object, the target problem, the solution method and the basic principle are BERT + CRF, and the sub-model corresponding to the conclusion is a rule matching model;

and the report generation module is used for combining the various key information to obtain a key content report of the target document.

In a third aspect, an embodiment of the present application provides an electronic device, including: a memory, a processor and computer-executable instructions stored on the memory and executable on the processor, which when executed by the processor implement the steps of the method as described in the first aspect above.

In a fourth aspect, embodiments of the present application provide a computer-readable storage medium for storing computer-executable instructions that, when executed by a processor, implement the steps of the method according to the first aspect.

According to the technical scheme, the method and the device for acquiring the key content of the target document can automatically extract various key information in a research object, a problem, a solution, a basic principle and a conclusion in the target document by using the pre-trained target model, and automatically combine the key information to form the key content report of the target document, so that the trouble of manually reading the document and manually summarizing the key content of the document is avoided, and the efficiency of acquiring the key content of the document can be improved.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:

Fig. 1A is a schematic flowchart of a method for acquiring key content of a document according to an embodiment of the present application.

Fig. 1B is a diagram illustrating a preset report template according to an embodiment of the present application.

Fig. 2 is a schematic diagram of a framework of a method for acquiring key content of a document according to an embodiment of the present application.

Fig. 3 is a flowchart illustrating a method for obtaining key content of a document according to another embodiment of the present application.

Fig. 4 is a schematic structural diagram of a BERT + CRF model according to an embodiment of the present application.

Fig. 5 is a schematic structural diagram of a CRF model according to an embodiment of the present application.

Fig. 6A is a first part of a schematic view illustrating a visualization effect of a method for acquiring key content of a document according to an embodiment of the present application.

Fig. 6B is a second part of one of schematic visual display effects of a method for acquiring key content of a document according to an embodiment of the present application.

Fig. 7A is a first part of a second schematic view illustrating a visualization display effect of a method for obtaining key content of a document according to an embodiment of the present application.

Fig. 7B is a second part of a schematic diagram of a visualization display effect of a method for acquiring key content of a document according to an embodiment of the present application.

Fig. 8 is a schematic structural diagram of an apparatus for acquiring key content of a document according to an embodiment of the present application.

Fig. 9 is a schematic structural diagram of an apparatus for acquiring key content of a document according to another embodiment of the present application.

Fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

In order to improve the efficiency of obtaining the key content of the document and save the time of scientific research personnel, the embodiment of the application provides a method and a device for obtaining the key content of the document. The method and the device can be applied to an application program with a visual operation interface. The application program may be run in an electronic device, such as a terminal device or a server device. In other words, the above method may be performed by software or hardware installed in the terminal device or the server device. The server includes but is not limited to: a single server, a server cluster, a cloud server or a cloud server cluster, and the like.

In the embodiments of the present application, the literature refers to a literature of a periodical, a magazine, a paper (including a conference paper), a patent, a book, an article, and the like, which are of research value.

The method for acquiring the key content of a document provided by the embodiment of the application is explained below. The method can comprise two parts: one part is the training of the target model, and the other part is the extraction of the key content from the target document by applying the trained target model. The latter part is described first with reference to Fig. 1A, on the basis that the target model has already been trained.

Fig. 1A shows a schematic flowchart of a method for acquiring key content of a document according to an embodiment of the present application. As shown in fig. 1A, the method may include:

Step 101, acquiring a target document from which key content is to be extracted.

The target document may be one or more documents determined by a researcher to be required to obtain key content therein. For example, if a researcher wants to know about a research situation under a certain technical subject (or called research object), some or all of the documents under the technical subject may be retrieved from the document retrieval platform and used as target documents.

Step 102, inputting the target content of the target document into a target model, and outputting a plurality of key information extracted from the target document, wherein the target model comprises a plurality of sub-models corresponding to the plurality of key information, the plurality of sub-models are obtained by training based on the labeled document corpus, and one sub-model is used for extracting one key information.

The key information capable of reflecting the key content (gist) of a document generally includes at least one of the research object, the problem addressed, the solution, the rationale and the conclusion of the document. The following description therefore takes the research object, the problem, the solution, the rationale and the conclusion as examples of the plurality of kinds of key information.

The target content of one document refers to content including the above-mentioned various key information, and it is understood that the target content includes the above-mentioned various key information so as to extract the key information therefrom. For example, assuming that the above-mentioned various key information includes research objects, problems, solutions, rationales and conclusions, since these 5 key information items are generally contained in the title, abstract and body text of the document, the above-mentioned target content may include at least one of the title, abstract and body text.

Optionally, the target content includes two fields, namely a title and an abstract, and it can be understood that, since the title and the abstract of a document usually have summarized the above 5 items of key information, and the data volume of the title and the abstract is significantly less than that of the text, inputting the title and the abstract as the target content into the target model can improve the computational efficiency, thereby improving the efficiency of extracting the various kinds of key information from the target document.

Optionally, the target content may also include other fields of the target document, such as keywords, author, source, publication time, type, citation number, download number, publication organization, funding, EI or SCI index, etc., in order to obtain other key information of the target document.

In the embodiment of the application, the target model comprises a plurality of sub-models corresponding to the plurality of kinds of key information, the plurality of sub-models are obtained by training based on labeled document corpora in advance, and one sub-model is used for extracting one kind of key information. For example, in the case where the plurality of kinds of key information include a research object, a problem, a solution, a rationale, and a conclusion, the research object, the problem, the solution, the rationale, and the conclusion each correspond to one sub-model.

The applicant has found through research that: 1) for the key information "research object, problem, solution and rationale", a "BERT + CRF" sub-model is more suitable as the extraction model, where BERT stands for Bidirectional Encoder Representations from Transformers and CRF stands for Conditional Random Fields; 2) for the key information "conclusion", the conclusion-related content in a document generally follows regular patterns, for example, it usually contains second preset keywords such as "the results show", "experiments prove", "the conclusion shows" and "the simulation results show". It is therefore more appropriate to use a "rule matching model" as the sub-model for extracting the "conclusion". Specifically, the rule matching model may be constructed from the second preset keywords, where a second preset keyword is a keyword that signals a conclusion.
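As a minimal sketch of such a rule matching model, the following Python function returns the sentences that contain a conclusion keyword. The keyword tuple is an illustrative English rendering, and the sentence splitter and the name `extract_conclusions` are assumptions of this sketch rather than details given in the application.

```python
import re

# Assumed, illustrative English renderings of the "second preset keywords".
CONCLUSION_KEYWORDS = (
    "results show",
    "result shows",
    "experiments prove",
    "conclusion shows",
    "simulation results show",
)


def extract_conclusions(text, keywords=CONCLUSION_KEYWORDS):
    """Return the sentences of `text` that contain any conclusion keyword."""
    # Naive splitter: break after Western or Chinese sentence-ending punctuation.
    parts = re.split(r"(?<=[.!?。！？])\s*", text)
    sentences = [part.strip() for part in parts if part.strip()]
    return [s for s in sentences if any(k in s.lower() for k in keywords)]


abstract = (
    "A method for directly estimating the synthetic steering vector is provided. "
    "The experimental result shows the effectiveness of the method."
)
print(extract_conclusions(abstract))
```

Plain substring matching keeps the rule set easy to audit and extend; a production rule set would likely need a more careful sentence splitter.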

It can be understood that, since the target model includes a plurality of sub models corresponding to the plurality of kinds of key information, and one sub model is used for extracting one kind of key information, the plurality of kinds of key information extracted from the target document can be output after the target content of the target document is input into the target model.

Optionally, after step 102 and before step 103, the method shown in fig. 1A may further include: and normalizing the plurality of kinds of key information. It can be understood that, since some entities (i.e. key information) extracted from the target document have the same or similar meanings but different expressions, which may interfere with the generation of the key content report later, the extracted multiple kinds of key information may be normalized to make the expressions of the entities having different expressions but the same or similar meanings consistent.

In a specific implementation, a Levenshtein similarity algorithm may be used to normalize the extracted key information. Its calculation formula is r = (sum - ldist)/sum, where sum is the sum of the lengths of character string 1 (str1) and character string 2 (str2), ldist is an edit-distance-like measure between the two strings, and r is the similarity between character string 1 and character string 2. Two pieces of key information whose similarity is greater than a preset threshold (e.g., 0.7) may be normalized to the one with the longer text. For example, assuming that entity 1 in the plurality of kinds of key information is "synthetic aperture radar image" and entity 2 is "synthetic aperture radar map", both may be normalized to "synthetic aperture radar image".
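The similarity r = (sum - ldist)/sum can be sketched in Python as follows. The application does not define ldist precisely, so this sketch assumes an edit distance in which a substitution costs 2 and an insertion or deletion costs 1; the function names and the default threshold of 0.7 (taken from the example above) are illustrative.

```python
def edit_distance(a, b, substitution_cost=2):
    """Edit distance by dynamic programming; the substitution cost is an assumption."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else substitution_cost
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]


def similarity(a, b):
    """r = (sum - ldist) / sum, where sum = len(a) + len(b)."""
    total = len(a) + len(b)
    if total == 0:
        return 1.0
    return (total - edit_distance(a, b)) / total


def normalize_pair(a, b, threshold=0.7):
    """Map both strings to the longer one when they are similar enough."""
    if similarity(a, b) > threshold:
        longer = a if len(a) >= len(b) else b
        return longer, longer
    return a, b


print(normalize_pair("synthetic aperture radar image", "synthetic aperture radar map"))
```

With these assumptions r stays in the range 0 to 1, which makes a fixed threshold such as 0.7 straightforward to interpret.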

Step 103, combining the multiple kinds of key information to obtain a key content report of the target document.

As an example, the plurality of types of key information are assembled according to a preset report template to obtain a key content report (or referred to as a document briefing) of the target document. For example, in the case that the plurality of kinds of key information include a research object, a problem, a solution, a rationale, and a conclusion, the assembling may be performed according to a template as shown in fig. 1B, specifically, the corresponding key information may be filled in a corresponding position.

Alternatively, after the research object, the problem, the solution, the rationale and the conclusion of the target document are obtained, a key content report of the following form can be composed by combining them with other fields of the target document, such as the publication time and the publishing institution: "In <year> <month>, addressing the <problem> of <research object>, <publishing institution> proposed a <solution> method, whose rationale is <rationale> and whose conclusion is <conclusion>." In the key content report, the content within "< >" is to be replaced with the specific key information.

It should be noted that the preset report template can take various forms and is not limited to the form shown in Fig. 1B or to the above report form that includes the publication time and the publishing institution.
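Filling a preset report template can be sketched as simple string substitution. The template wording, the field names and the placeholder for missing fields below are assumptions made for illustration; the application leaves the template form open.

```python
# Hypothetical template; field names are illustrative, not taken from Fig. 1B.
REPORT_TEMPLATE = (
    "In {publication_time}, addressing the {problem} of {research_object}, "
    "{institution} proposed {solution}, whose rationale is {rationale}. "
    "Conclusion: {conclusion}"
)

FIELDS = (
    "publication_time", "problem", "research_object",
    "institution", "solution", "rationale", "conclusion",
)


def build_report(info, template=REPORT_TEMPLATE, missing="(not extracted)"):
    """Fill the template; fields that were not extracted get a visible placeholder."""
    values = {name: info.get(name) or missing for name in FIELDS}
    return template.format(**values)


print(build_report({
    "publication_time": "2022",
    "research_object": "meter wave radar",
    "problem": "low elevation height measurement",
    "solution": "a synthetic steering vector method",
    "institution": "XX institution",
    "conclusion": "the method is effective.",
}))
```

Using a visible placeholder rather than an empty string makes it obvious in the report which sub-model returned nothing.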

Fig. 2 shows a schematic frame diagram of a method for acquiring key content of a document according to an embodiment of the present application. As can be seen from fig. 2, after the target document is obtained, the method automatically preprocesses the target document (e.g., determines the target content), then automatically inputs the target content of the target document into a plurality of submodels in the target model, such as inputting submodel 1, submodel 2, submodel 3, submodel 4, and submodel 5, and then the plurality of submodels output 5 pieces of key information corresponding to the study object, the problem, the solution, the rationale, and the conclusion; and finally, automatically assembling the 5 items of key information to obtain a key content report of the target document. For scientific research personnel, only the target literature needs to be provided, and the key content report reflecting the gist of the target literature can be obtained, so that the time and the energy of the scientific research personnel are greatly saved.
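The framework can be sketched as a dispatch over sub-models, one per kind of key information. In this sketch the sub-models are stand-in callables, and the names `select_target_content` and `extract_key_information` are hypothetical; real sub-models would be the trained BERT + CRF extractors and the rule matching model.

```python
def select_target_content(document):
    """Preprocessing step: use the title and abstract as the target content."""
    parts = (document.get("title"), document.get("abstract"))
    return " ".join(part for part in parts if part)


def extract_key_information(document, sub_models):
    """Run every sub-model on the same target content, one output per kind."""
    content = select_target_content(document)
    return {kind: extractor(content) for kind, extractor in sub_models.items()}


# Stand-in sub-models for illustration only.
stub_models = {
    "research_object": lambda text: "meter wave radar" if "radar" in text else None,
    "conclusion": lambda text: text.split(". ")[-1],
}
doc = {"title": "A radar method", "abstract": "We study height. The method works."}
print(extract_key_information(doc, stub_models))
```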

According to the method for acquiring the key content of the literature, provided by the embodiment of the application, at least one of key information of a research object, a problem, a solution, a basic principle and a conclusion in the target literature can be automatically extracted by using a pre-trained target model, and the key information can be automatically combined to form a key content report of the target literature, so that the trouble of manually reading the literature and manually summarizing the key content of the literature is avoided, and the efficiency of acquiring the key content of the literature can be improved.

It should be noted that, in the embodiments of the present application, the expression of the research object, the problem, the solution, the rationale, and the conclusion may not be limited to the current description, but may be replaced by other descriptions with the same meaning, for example, for the key information of "conclusion", the expression may also be expressed by descriptions of research conclusion, research result, experimental conclusion, and the like.

The training of the target model is described below with reference to Fig. 3.

As shown in fig. 3, the method for acquiring key content of a document according to the embodiment of the present application may further include, before step 101, the following steps:

Step 104, obtaining a document corpus, wherein the document corpus comprises content data of a plurality of documents.

In a specific implementation, the application program implementing the method for acquiring the key content of a document provided by the embodiment of the present application may automatically collect relevant documents under the target technical subject in the target technical field, or collect them after purchase, as the document corpus. In the document corpus, the fields collected for one document may include, but are not limited to, title, abstract, keywords, body text, author, source, publication time, type, citation count, download count, publishing institution, funding, EI or SCI indexing, and the like.

Optionally, relevant documents under the target technical subject in the target technical field may be periodically acquired as the document corpus to continuously update the document corpus, so that the target model is periodically updated.

Optionally, after step 104 and before step 105, the corpus may be preprocessed to clean dirty data in the corpus, where the preprocessing may include, but is not limited to, one or more of the following:

1) Deleting documents whose title is empty from the document corpus. It can be understood that the title is a high-level summary of the document content and plays a very important role in both annotation and model training, so documents with an empty title can be deleted from the document corpus.

2) Deleting documents whose abstract is empty from the document corpus. It can be understood that the abstract is the key content containing the above 5 items of key information. When the target model is trained with the abstract as the target content, a document with an empty abstract contributes nothing to training, so documents with an empty abstract can be deleted from the document corpus.

3) Deleting review-type documents. It can be understood that review documents mainly describe the development or trend of a technology and generally do not provide a targeted solution to a problem in a certain field, so when the document corpus mainly consists of technical innovation documents, the review documents need to be deleted. As an example, deleting review-type documents may include: determining whether the title of a document in the document corpus matches a first preset keyword; and if so, deleting the document from the document corpus. A first preset keyword is a keyword that characterizes a review document; for example, the first preset keywords may include, but are not limited to, review, progress, status quo, thinking, discussion, trend, prospect, development, and the like.
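The three cleaning rules above can be sketched as a single filter over the corpus. The dictionary keys `title` and `abstract`, the function name `clean_corpus`, and the use of case-insensitive substring matching against the title are assumptions of this sketch; the keyword tuple mirrors the examples listed above.

```python
# First preset keywords, taken from the examples in the text.
REVIEW_KEYWORDS = (
    "review", "progress", "status quo", "thinking",
    "discussion", "trend", "prospect", "development",
)


def clean_corpus(corpus, review_keywords=REVIEW_KEYWORDS):
    """Drop documents with an empty title or abstract, and review-type documents."""
    cleaned = []
    for doc in corpus:
        title = (doc.get("title") or "").strip()
        abstract = (doc.get("abstract") or "").strip()
        if not title or not abstract:
            continue  # rules 1 and 2: empty title or empty abstract
        if any(keyword in title.lower() for keyword in review_keywords):
            continue  # rule 3: title matches a review keyword
        cleaned.append(doc)
    return cleaned


corpus = [
    {"title": "A terrain compensation method", "abstract": "Due to multipath..."},
    {"title": "", "abstract": "No title"},
    {"title": "A review of radar height measurement", "abstract": "This paper..."},
]
print([doc["title"] for doc in clean_corpus(corpus)])
```

Substring matching is deliberately coarse: a keyword such as "development" will also remove some non-review titles, so the keyword list may need tuning per corpus.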

Step 105, obtaining a labeling result for the document corpus, wherein the labeling result of one document comprises a labeling result of the multiple kinds of key information in the document and a sequence labeling result of the multiple kinds of key information in the document.

In the embodiment of the present application, the labeling of various key information described in one document can be done manually. Since it is often difficult for non-technical personnel to understand the key content of technical documents during labeling, and multiple field information of the documents needs to be paid attention to for summary, which is likely to cause inaccurate labeling result and unsatisfactory training result of the target model, in the embodiment of the present application, a keyword list with weight information is extracted from the target content (such as title, keyword and abstract) of the documents in advance, and then the labeling personnel refers to the keyword list with weight information for labeling. It can be understood that, because the research objects of the literature generally have a higher proportion in the titles and the abstracts and have a higher weight in the keyword list, the research objects can be easily noticed during labeling, so that the labeling accuracy can be improved, and the training effect of the target model is finally improved.

In view of the above, before the obtaining of the labeling result for the corpus of the document, the method may further include the following steps:

firstly, at least part of keywords of the document are added to a target dictionary to ensure the accuracy of professional vocabulary segmentation when segmenting the title, wherein the target dictionary can be any existing dictionary or any dictionary which is about to appear in the future and is used for segmenting the word;

secondly, segmenting words of the title of the document based on the target dictionary, and calculating word frequency of each segmented word in the title in the abstract of the document to obtain the weight of each segmented word in the title in the abstract of the document;

thirdly, segmenting words of the abstract of the document based on the target dictionary, and calculating the word frequency of the key words of the document in the abstract of the document to obtain the weight of the key words of the document in the abstract of the document;

and finally, combining the weight of each participle in the title in the abstract of the literature with the weight of the keyword of the literature in the abstract of the literature to obtain the keyword list with the weight information.

Once the keyword list with weight information is available, the annotator can, during manual labeling, treat the top-ranked keywords in the list as candidate key information of the document. The list thus gives the annotator a strong reference and improves the labeling accuracy.
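The weighting steps above can be sketched as counting how often each title word and each keyword occurs in the abstract. Word segmentation with a target dictionary is assumed to have been done already, so the sketch takes the segmented title terms as input; the function name and the plain substring count are illustrative simplifications.

```python
def weighted_keyword_list(title_terms, keywords, abstract):
    """Weight each title term and keyword by its frequency in the abstract."""
    text = abstract.lower()
    weights = {}
    for term in list(title_terms) + list(keywords):
        # The same term from both sources gets the same count, so merging is a union.
        weights[term] = text.count(term.lower())
    # Highest weight first, so annotators see the most prominent terms at the top.
    return dict(sorted(weights.items(), key=lambda item: item[1], reverse=True))


result = weighted_keyword_list(
    title_terms=["radar", "terrain", "method"],
    keywords=["height measurement", "radar"],
    abstract="Radar height measurement with radar over undulating terrain.",
)
print(result)
```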

For example, for a document to be labeled, assume:

The title is: A terrain compensation method for low elevation height measurement of meter-wave radar.

The abstract is as follows: due to the multipath effect, the low elevation height measurement of the meter-wave radar is generally realized by spatial smoothing and decorrelation, or a synthetic steering vector is established by a reflection model, and then elevation matching algorithms such as multi-signal classification and maximum likelihood are adopted. However, the fixed coefficient reflection model is generally only suitable for flat terrain, and it is very difficult to establish an accurate variable coefficient reflection model under the condition of undulating terrain. The change situation of the ground reflection coefficient in the target position change process is analyzed, and a method for directly estimating and synthesizing the guide vector through flight detection data is provided, so that the robustness and the accuracy of the low elevation height measurement algorithm of the meter wave radar under the condition of undulating terrain are improved. The experimental result shows the effectiveness of the method, and particularly when the target distance is far, the elevation angle precision obtained by using the method is obviously superior to that of the traditional method.

The keywords are: relief topography, height measurement, meter wave radar, synthetic steering vector, flight detection and topography compensation.

Then, the keyword list with weight information obtained by applying the above method may be: {"meter wave radar": 4, "method": 4, "topography": 3, "low elevation height measurement": 2, "relief topography": 2, "height measurement": 2, "synthetic steering vector": 2, "fly through": 1}.

Correspondingly, the labeling result of the document to be labeled can be: the research object is the meter wave radar, the problem addressed is low elevation height measurement, and the solution is the synthetic steering vector.

Further, the sequence labeling of the plurality of kinds of key information in the document to be labeled may include: labeling the sequence of the multiple kinds of key information in the document based on the BIO rule to obtain a sequence labeling result of the multiple kinds of key information in the document, where B marks the start position of a named entity, I marks the remaining part of the named entity other than the start position, and O marks everything that is not a predefined entity. For example, "terrain compensation method for low elevation height measurement of meter-wave radar" is labeled as "meter/B wave/I radar/I low/O elevation/O angle/O measurement/O high/O ground/O shape/O compensation/O method/O".
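BIO labeling of a token sequence can be sketched as follows. The function name `bio_tags` and the choice to mark every occurrence of the entity are assumptions; the tokens are English stand-ins for the character-level units of the original example.

```python
def bio_tags(tokens, entity_tokens):
    """Tag each occurrence of `entity_tokens` in `tokens` with B/I, everything else O."""
    tags = ["O"] * len(tokens)
    n = len(entity_tokens)
    if n == 0:
        return tags
    for start in range(len(tokens) - n + 1):
        if tokens[start:start + n] == entity_tokens:
            tags[start] = "B"  # first token of the entity
            for i in range(start + 1, start + n):
                tags[i] = "I"  # remaining tokens of the entity
    return tags


tokens = ["meter", "wave", "radar", "low", "elevation", "height", "measurement"]
tags = bio_tags(tokens, ["meter", "wave", "radar"])
print(" ".join(f"{token}/{tag}" for token, tag in zip(tokens, tags)))
```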

Step 106, taking the document corpus and the labeling result of the document corpus as input to train the target model.

As described above, the target model includes a plurality of sub-models corresponding to the plurality of kinds of key information, the sub-models are obtained by training based on a labeled document corpus, and one sub-model is used for extracting one kind of key information. In the case where the plurality of kinds of key information include the research object, the problem, the solution, the rationale and the conclusion, each of them corresponds to one sub-model. The sub-models corresponding to the research object, the problem, the solution and the rationale are BERT + CRF, whose structure is shown in Fig. 4, and the sub-model corresponding to the conclusion is a rule matching model.

For the training of "BERT + CRF", as shown in Fig. 4, the input is the target content in the document corpus, and the output is the named entity extraction result (i.e., the key information extraction result). In Fig. 4, CLS denotes the start token, E denotes an initial vector, and T denotes a vector obtained by the BERT model. The BERT model encodes the labeled data to obtain accurate character-level semantic representations, and the CRF layer then constrains the output of the upper semantic layer.

BERT stands for Bidirectional Encoder Representations from Transformers. The input of BERT comprises two sentences separated by the separator [SEP]; the first sentence can be denoted A and the second sentence B. In this embodiment, the two sentences are two node sequences, where sentence A is {n0, n1, n2, n3} and sentence B is {n4, n5, n6, n7}. The input representation of BERT comprises three parts: word vectors (Token Embeddings), sentence vectors (Segment Embeddings), and position vectors (Position Embeddings).

The most important part of BERT is a bidirectional Transformer coding structure, and a Self-Attention (Self-Attention) part of the BERT can fully fuse context information and calculate the association degree of words in a text, so that better sequence representation is obtained. The calculation method comprises the following steps:

wherein Q represents a query vector, K represents a key vector, V represents a value vector, d _k Is the input vector dimension. In order to enhance the diversity of the Attention expression, the Transformer adopts a 'multi-head' mode to linearly combine a plurality of enhanced semantic vectors of each word, so as to obtain a final enhanced semantic vector with the same length as the original word vector, and the calculation method comprises the following steps:

$$\mathrm{MultiHead}(Q,K,V)=\mathrm{Concat}(\mathrm{head}_1,\mathrm{head}_2,\ldots,\mathrm{head}_h)\,W$$

wherein W is a weight matrix.
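As an illustration of the self-attention computation, the following is a minimal pure-Python sketch. It assumes the standard scaled dot-product form softmax(QK^T / sqrt(d_k))V used in the Transformer; the function names and the toy matrices are hypothetical and are not taken from the application.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention for lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out


# Toy example with made-up numbers: 2 tokens, dimension 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
result = attention(Q, K, V)
```

In a multi-head arrangement, this computation would be repeated once per head on separately projected inputs, and the head outputs would be concatenated and multiplied by the weight matrix W.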

In general, BERT followed by a fully connected layer can already solve the sequence labeling problem: the output vector of each token is processed by Softmax, and the value in each dimension represents the probability that the token carries a certain label. However, in the embodiment of the present application, a CRF layer is added on top of BERT so that constraints are imposed to ensure that the final prediction result is valid. During training, these constraints can be learned automatically by the CRF layer, for example, that an entity must start with the label B and that the label I can only follow the label B (or a preceding I of the same entity). Thus, the text is input into the BERT + CRF model, the corresponding BIO labels are output, and the B and I labels are then extracted and combined to form entities.
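The last step, combining B and I labels into entities, can be sketched as follows. This is a minimal illustration rather than the implementation of the application; the function names `decode_bio` and `is_valid_bio` and the sample tokens are hypothetical.

```python
def decode_bio(tokens, labels, sep=""):
    """Merge each B label and the I labels that follow it into one entity.

    An I label with no preceding B is treated as outside any entity.
    """
    entities, current = [], []
    for tok, lab in zip(tokens, labels):
        if lab == "B":
            if current:
                entities.append(sep.join(current))
            current = [tok]
        elif lab == "I" and current:
            current.append(tok)
        else:
            if current:
                entities.append(sep.join(current))
            current = []
    if current:
        entities.append(sep.join(current))
    return entities


def is_valid_bio(labels):
    """Check that no I label starts the sequence or directly follows an O."""
    prev = "O"
    for lab in labels:
        if lab == "I" and prev == "O":
            return False
        prev = lab
    return True


tokens = ["meter", "wave", "radar", "low", "elevation"]
labels = ["B", "I", "I", "O", "O"]
entities = decode_bio(tokens, labels, sep=" ")
```

For character-level Chinese text the default empty separator would be used, so the characters of an entity are simply concatenated.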

Fig. 5 shows a structure diagram of a CRF. In fig. 5, the lower points represent the input x and the upper points represent the output y. The edges between the points fall into two categories: one is the connecting line between x and y, which represents their correlation; the other is the correlation between y at adjacent time steps. That is, when y at a certain time step is predicted, the neighboring labels are considered at the same time.
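To make the two kinds of edges concrete, the sketch below decodes a label sequence with the Viterbi algorithm, which is the usual decoding procedure for a linear-chain CRF; the application does not name a decoding procedure, so this choice is an assumption. The emission scores stand for the x-to-y edges and the transition scores for the edges between adjacent y; all numbers are hand-set for illustration and are not learned parameters.

```python
def viterbi(emissions, transitions, labels):
    """Return the highest-scoring label path.

    emissions[t][y]   : score linking the input at step t to label y
    transitions[a][b] : score for label b directly following label a
    """
    # best[y] is the best score of any path ending in label y at the current step.
    best = {y: emissions[0][y] for y in labels}
    back = []
    for t in range(1, len(emissions)):
        new_best, pointers = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[p] + transitions[p][y])
            new_best[y] = best[prev] + transitions[prev][y] + emissions[t][y]
            pointers[y] = prev
        best = new_best
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda y: best[y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return list(reversed(path))


labels = ["B", "I", "O"]
# Hypothetical transition scores: an I directly after an O is heavily penalised.
transitions = {a: {b: 0.0 for b in labels} for a in labels}
transitions["O"]["I"] = -100.0
# Hypothetical emission scores for a two-token input.
emissions = [
    {"B": 0.1, "I": 0.0, "O": 1.0},
    {"B": 0.2, "I": 2.0, "O": 0.1},
]
path = viterbi(emissions, transitions, labels)
```

Taking the best label at each step independently would give O followed by I, which violates the BIO constraint; with the transition penalty the decoded path starts with B instead.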

It can be understood that after the target model is trained, preparation is made for extracting the key information in the target document.

Optionally, on the basis of any of the foregoing embodiments, if the target document includes a plurality of documents under a target technical subject, the method may further include: constructing a knowledge graph for the target technical subject based on the plurality of kinds of key information of the target documents, wherein in the knowledge graph, one piece of key information represents one node, and an edge exists between pieces of key information that have an association relationship. In general, there is an association between key elements extracted from the same document; for example, in the same document there is an association between the research object and the specific problem, between the specific problem and the solution, between the solution and the basic principle, and so on.
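A minimal sketch of such a graph construction is given below, assuming each document has already been reduced to a dictionary of extracted key information. The field names, the `build_graph` function and the sample documents are hypothetical; the association chain follows the examples given in the paragraph above.

```python
def build_graph(documents):
    """Build nodes and edges from per-document key information.

    documents: list of dicts mapping a key-information type to its extracted text.
    Returns (nodes, edges), where a node is a (type, text) pair.
    """
    # Associations within one document, following the examples in the text.
    chain = ["research object", "problem", "solution", "basic principle"]
    nodes, edges = set(), set()
    for doc in documents:
        for kind in chain:
            if doc.get(kind):
                nodes.add((kind, doc[kind]))
        for a, b in zip(chain, chain[1:]):
            if doc.get(a) and doc.get(b):
                edges.add(((a, doc[a]), (b, doc[b])))
    return nodes, edges


# Made-up documents that share the same research object.
docs = [
    {"research object": "meter-wave radar", "problem": "low-angle height error",
     "solution": "terrain compensation", "basic principle": "multipath model"},
    {"research object": "meter-wave radar", "problem": "clutter",
     "solution": "adaptive filtering", "basic principle": "covariance estimation"},
]
nodes, edges = build_graph(docs)
```

Because nodes are keyed by type and text, documents that share a research object are joined through a single node, which is what lets the graph connect several documents under one technical subject.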

Optionally, after obtaining the key content report of the target document and after constructing the knowledge graph, the method may further include: displaying at least one of the knowledge graph and the specified key content report in a preset visualization mode.

Wherein the specified key content report may include, but is not limited to, at least one of:

(1) A key content report corresponding to a target document of a newly added research problem of the target technical subject, wherein the newly added research problem is determined based on publication time of the target document under the target technical subject and specified key information, and the specified key information comprises a specific problem and a solution;

(2) A key content report corresponding to a target document of the latest application exploration of the target technical subject, wherein the latest application exploration is determined based on the publication time of the target documents under the target technical subject and preset information, and the preset information comprises a title.

The newly added research problem of the target technical subject can be determined according to the publication time of the target documents and the occurrence frequency of the research problem in all documents under the target technical subject: if a research problem has recently appeared in a document and has not appeared before, it is a newly added research problem.

Similarly, the latest application exploration of the target technical subject can be determined according to the publication time of the target documents and the occurrence frequency of a third keyword or the like in all documents under the target technical subject: if an application has recently appeared in a document and has not appeared before, it is the latest application exploration. The third keyword is a keyword related to the application exploration of the target technical subject.
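The "recently appeared and not seen before" rule can be sketched as a first-appearance check. The cut-off date, the field names and the `newly_added` function are assumptions made for illustration; the application does not specify how "recently" is defined.

```python
from datetime import date


def newly_added(documents, field, since):
    """Return values of `field` whose earliest publication date is on or after `since`."""
    first_seen = {}
    for doc in documents:
        value = doc.get(field)
        if value is None:
            continue
        published = doc["published"]
        # Track the earliest date at which each value appears.
        if value not in first_seen or published < first_seen[value]:
            first_seen[value] = published
    return sorted(v for v, d in first_seen.items() if d >= since)


# Made-up documents under one technical subject.
docs = [
    {"problem": "low-angle height error", "published": date(2019, 3, 1)},
    {"problem": "low-angle height error", "published": date(2022, 5, 1)},
    {"problem": "terrain mismatch", "published": date(2022, 6, 1)},
]
new_problems = newly_added(docs, "problem", since=date(2022, 1, 1))
```

The same function could be applied to an application-related field, for example one filled by matching the third keyword against titles, to find the latest application exploration.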

The preset visual display mode can comprise that the knowledge graph and the specified key content report are displayed in the same display interface at the same time, or the knowledge graph and the specified key content report can be displayed in different display interfaces.

As shown in fig. 6A to 7B, fig. 6A shows a display effect diagram of the knowledge graph, and fig. 6B shows a display effect diagram of a key content report corresponding to "latest application exploration" under the knowledge graph shown in fig. 6A; fig. 7A shows a display effect diagram of the knowledge graph, and fig. 7B shows a display effect diagram of a key content report corresponding to the "latest research question" in the knowledge graph shown in fig. 7A.

The knowledge-graph shown in FIG. 6A includes four types of nodes, study object, problem, solution, and rationale. The knowledge graph shown in fig. 7A includes seven types of nodes of study objects, problem (newly added), solution (newly added), rationale (newly added), problem (history), solution (history), and rationale (history).

As shown in fig. 6A and fig. 7A, when the knowledge graph is displayed, different types of nodes (corresponding to different types of key information) may be displayed by being filled in different filling manners (including at least one of pattern filling or color filling), and edges (connecting lines) between different types of nodes may be displayed by lines with different colors and/or different line types; furthermore, for a node connected to a plurality of other nodes, it can also be represented by a relatively larger node.

As shown in fig. 6B and fig. 7B, when presenting the specified key content report, different key information in the specified key content report may be presented in the same or different highlighting manners, which may include highlighting, flashing, underlining, bolding, blacking, and so on.

When the knowledge graph and the specified key content report are displayed simultaneously in the same display interface, fig. 6A and 6B can be displayed in different display areas of the same display interface, and similarly, fig. 7A and 7B can be displayed in different display areas of the same display interface. Optionally, in this case, the nodes corresponding to the specified key content report may be displayed in the knowledge graph in a preset highlighting manner, which may include highlighting, flashing, enlarging, and the like.

When the display mode of displaying the knowledge graph and the specified key content report in different display interfaces is adopted, fig. 6A and 6B can be respectively displayed in different display interfaces, and similarly, fig. 7A and 7B can be respectively displayed in different display interfaces. Optionally, in this case, the nodes corresponding to the specified key content reports may also be displayed in a preset highlighting manner in the knowledge graph. Further, in the display interface for displaying the knowledge graph, if a preset operation (such as clicking) of the user for a node corresponding to the specified key content report is received, the display interface for displaying the specified key content report may be switched (or jumped to). For example, in the display interface displaying the knowledge graph shown in fig. 6A, if a preset operation of the user for the node corresponding to the "latest application exploration" is received, the display interface shown in fig. 6B may be switched to display the key content report corresponding to the "latest application exploration". Similarly, in the display interface displaying the knowledge graph shown in fig. 7A, if a preset operation of the user for the node corresponding to the "latest research question" is received, the display interface shown in fig. 7B may be switched to display the key content report corresponding to the "latest research question".

It should be noted that, in addition to the "new research problem" and the "latest application exploration", the key content report to be displayed may also be determined from other angles according to the actual demand, for example, as shown in fig. 6B, the key content report to be displayed may also be determined from the angles of a new solution, a new basic principle, a historical solution, a historical basic principle, and the like, which is not limited in this specification.

It can be understood that by visually displaying the knowledge graph and specifying the key content report, researchers can quickly understand the current research situation of the related technology, such as latest application exploration, newly-added research problems and the like.

In summary, according to the method for acquiring the key content of a document provided by the embodiment of the present application, a pre-trained target model can be used to automatically extract multiple kinds of key information, among the research object, the specific problem, the solution, the basic principle and the conclusion, from a target document, and the extracted key information is combined to form a key content report of the target document. This avoids the trouble of manually reading the document and manually summarizing its key content, and can therefore improve the efficiency of acquiring the key content of documents.

The above describes a method for acquiring key content of a document provided in an embodiment of the present application, and in accordance with the above method for acquiring key content of a document, an apparatus for acquiring key content of a document is also provided in an embodiment of the present application, which is described below.

As shown in fig. 8, an apparatus 800 for acquiring key content of a document according to an embodiment of the present application may include: a first acquisition module 801, an information extraction module 802, and a report generation module 803.

The first acquisition module 801 is configured to acquire a target document from which key content is to be extracted.

An information extraction module 802, configured to input target content of the target document into a target model, and output multiple types of key information extracted from the target document, where the target model includes multiple sub models corresponding to the multiple types of key information, the multiple sub models are obtained by training based on labeled document corpora in advance, and one sub model is used to extract one type of key information.

The target content of a document refers to content that contains the above-mentioned kinds of key information, so that the key information can be extracted from it. For example, assuming that the kinds of key information include the research object, the specific problem, the solution, the basic principle and the conclusion, since these 5 kinds of key information are usually contained in the title, abstract and body text of a document, the target content may include at least one of the title, the abstract and the body text.

Optionally, the target content may also include other fields of the target document, such as keywords, author, source, publication time, type, citation number, download number, publication institution, funding, EI or SCI index, etc., in order to obtain other key information of the target document.

1) For the key information "research object, specific problem, solution and basic principle", it is more suitable to adopt "BERT + CRF" as the sub-model for extraction.

2) For the key information "conclusion", the conclusion-related content in a document is generally quite regular; for example, it usually contains second preset keywords such as "the results show", "experiments prove", "the conclusion shows" and "simulation results show". It is therefore more appropriate to use a "rule matching model" as the sub-model for extracting the key information "conclusion". Specifically, the rule matching model may be constructed from the second preset keywords, where a second preset keyword is a keyword that signals a conclusion.
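A rule matching model of this kind can be sketched as keyword matching over sentences. The English keyword list, the sentence-splitting pattern and the `extract_conclusions` function below are illustrative assumptions and not the rules of the application, whose keywords are defined for the original document language.

```python
import re

# Hypothetical English renderings of the second preset keywords.
CONCLUSION_KEYWORDS = [
    "the results show",
    "experiments prove",
    "the conclusion shows",
    "simulation results show",
]


def extract_conclusions(text, keywords=CONCLUSION_KEYWORDS):
    """Return the sentences of `text` that contain any conclusion keyword."""
    # Split into sentences at Western or Chinese end-of-sentence punctuation.
    sentences = [s.strip() for s in re.findall(r"[^.!?。！？]+[.!?。！？]?", text)]
    return [s for s in sentences
            if s and any(k in s.lower() for k in keywords)]


abstract = ("We propose a compensation method. "
            "Simulation results show that the error is reduced. "
            "Future work remains.")
conclusions = extract_conclusions(abstract)
```

A simple splitter like this one also breaks on decimal points, so a production rule set would need a more careful sentence boundary rule.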

Optionally, the apparatus shown in fig. 8 may further include a module configured to normalize the multiple kinds of key information. It can be understood that some entities (i.e., key information) extracted from the target document have the same or similar meanings but different expressions, which may interfere with the generation of the key content report; the extracted key information may therefore be normalized so that entities with the same or similar meanings but different expressions are expressed in the same way.

In a specific implementation, a Levenshtein similarity algorithm can be adopted to normalize the extracted key information. The similarity is calculated as r = (sum - ldist)/sum, where sum represents the sum of the lengths of character string 1 (str1) and character string 2 (str2), ldist represents an edit-distance-like value, and r represents the similarity between character string 1 and character string 2. Two pieces of key information whose similarity is greater than a preset threshold (e.g., 0.7) may be normalized to the one with the longer text length.
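The similarity formula and the normalization rule can be sketched as follows. The text does not define ldist precisely; this sketch assumes an edit distance in which insertion and deletion cost 1 and substitution costs 2, which makes r equal to 0 for strings with no characters in common. That cost choice, the function names and the sample strings are assumptions.

```python
def edit_distance(a, b, sub_cost=2):
    """Edit distance with unit insert/delete cost and a configurable substitution cost."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                               # delete ca
                cur[j - 1] + 1,                            # insert cb
                prev[j - 1] + (0 if ca == cb else sub_cost),  # match or substitute
            ))
        prev = cur
    return prev[-1]


def similarity(a, b):
    """r = (sum - ldist) / sum, with sum = len(a) + len(b)."""
    total = len(a) + len(b)
    if total == 0:
        return 1.0
    return (total - edit_distance(a, b)) / total


def normalize(entities, threshold=0.7):
    """Map each entity to a representative; similar entities collapse to the longer text."""
    representatives, mapping = [], {}
    # Visit longer strings first so that the longer text becomes the representative.
    for entity in sorted(entities, key=len, reverse=True):
        for rep in representatives:
            if similarity(entity, rep) > threshold:
                mapping[entity] = rep
                break
        else:
            representatives.append(entity)
            mapping[entity] = entity
    return mapping


mapping = normalize(["radar", "radars", "sonar"])
```

With the threshold of 0.7 from the text, "radar" and "radars" (r of about 0.91) collapse to the longer "radars", while "sonar" stays separate.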

A report generating module 803, configured to combine the multiple types of key information to obtain a key content report of the target document.

As an example, the multiple kinds of key information are assembled according to a preset report template to obtain a key content report (or referred to as a document briefing) of the target document. For example, in the case that the plurality of kinds of key information include a research object, a problem, a solution, a rationale, and a conclusion, the assembling may be performed according to a template as shown in fig. 1B, specifically, corresponding key information may be filled in a corresponding position.
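Template filling of this kind can be sketched as below. The template text and field names are hypothetical placeholders; the actual layout is the one shown in fig. 1B, which is not reproduced here.

```python
# Hypothetical template; the real layout is defined by fig. 1B.
TEMPLATE = (
    "Research object: {research_object}\n"
    "Problem: {problem}\n"
    "Solution: {solution}\n"
    "Basic principle: {basic_principle}\n"
    "Conclusion: {conclusion}"
)
FIELDS = ["research_object", "problem", "solution", "basic_principle", "conclusion"]


def build_report(info):
    """Fill each template position with the matching key information."""
    # Missing kinds of key information are marked instead of raising an error.
    return TEMPLATE.format(**{f: info.get(f, "N/A") for f in FIELDS})


report = build_report({
    "research_object": "meter-wave radar",
    "problem": "low-angle height error",
    "solution": "terrain compensation",
})
```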

Optionally, as shown in fig. 9, the apparatus 800 for acquiring key content of a document according to an embodiment of the present application may further include a second acquisition module 804, a third acquisition module 805, and a model training module 806, in addition to the first acquisition module 801, the information extraction module 802, and the report generation module 803.

A second acquisition module 804, configured to acquire a document corpus, where the document corpus includes content data of a plurality of documents.

Optionally, relevant documents under the subject of the target technology in the target technical field may be periodically acquired as the document corpus to continuously update the document corpus, so that the target model is periodically updated.

Optionally, the apparatus 800 may further include: a preprocessing module, configured to preprocess the corpus to clean dirty data in the corpus, where the preprocessing may include, but is not limited to, one or more of the following:

1) Deleting documents with an empty title from the document corpus. It can be understood that the title of a document is a high-level summary of its content and plays a very important role in both annotation and model training, so documents with an empty title can be deleted from the document corpus.

2) Deleting documents with an empty abstract from the document corpus. It can be understood that the abstract is key content containing the above 5 kinds of key information; when the target model is trained with the abstract as the target content, a document with an empty abstract contributes nothing to training, so such documents can be deleted from the document corpus.

3) Deleting review-type documents. It can be understood that review-type documents mainly describe the development or trends of a technology and generally do not provide a targeted solution to a problem in a certain field, so when the document corpus consists mainly of technical-innovation documents, review-type documents need to be deleted. As an example, deleting review-type documents may include: determining whether the title of a document in the document corpus matches a first preset keyword, where the first preset keyword is a keyword that characterizes a review-type document; and if so, deleting the document from the document corpus. For example, the first preset keywords may include, but are not limited to, review, progress, status quo, thinking, discussion, trend, prospect, development, and the like.
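The three cleaning rules above can be sketched as a single filter. The dictionary layout of a document, the `clean_corpus` function and the shortened English keyword list are assumptions for illustration.

```python
# A subset of the first preset keywords listed above, in English for illustration.
REVIEW_KEYWORDS = ["review", "progress", "status quo", "trend", "prospect"]


def clean_corpus(docs, review_keywords=REVIEW_KEYWORDS):
    """Drop documents with an empty title, an empty abstract, or a review-like title."""
    kept = []
    for doc in docs:
        title = (doc.get("title") or "").strip()
        abstract = (doc.get("abstract") or "").strip()
        if not title or not abstract:
            continue  # rules 1) and 2): empty title or empty abstract
        if any(k in title.lower() for k in review_keywords):
            continue  # rule 3): title matches a review keyword
        kept.append(doc)
    return kept


# Made-up corpus.
corpus = [
    {"title": "Terrain compensation for meter-wave radar", "abstract": "We propose ..."},
    {"title": "", "abstract": "No title here."},
    {"title": "Height measurement method", "abstract": "   "},
    {"title": "A review of radar height measurement", "abstract": "This paper surveys ..."},
]
cleaned = clean_corpus(corpus)
```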

A third acquisition module 805, configured to obtain a labeling result for the document corpus, where the labeling result of a document includes a labeling result of the multiple kinds of key information in the document and a language sequence labeling result of the multiple kinds of key information in the document.

In the embodiment of the present application, the labeling of the various kinds of key information in a document can be done manually. Non-technical personnel often have difficulty understanding the key content of technical documents during labeling and need to attend to several fields of the document at once in order to summarize it, which makes the labeling result inaccurate and the training result of the target model unsatisfactory. The embodiment of the present application therefore extracts a keyword list with weight information from the target content of the document (such as the title, keywords and abstract) in advance, and the annotator then labels the target content with reference to this keyword list. It can be understood that, since the research object of a document generally accounts for a higher proportion of the title and abstract and has a higher weight in the keyword list, it is easily noticed during labeling, so the accuracy of labeling can be improved and the training effect of the target model is ultimately improved.

Secondly, segmenting the title of the document based on the target dictionary, and calculating the word frequency, in the abstract of the document, of each word segmented from the title, to obtain the weight of each such word in the abstract of the document;

Thirdly, segmenting the abstract of the document based on the target dictionary, and calculating the word frequency of the keywords of the document in the abstract of the document, to obtain the weight of the keywords of the document in the abstract of the document;
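These weighting steps, together with the dictionary step and the combination step recited in claim 3, can be sketched as follows. The sketch makes several assumptions that are not in the application: English text with whitespace segmentation instead of a real word segmenter, word frequency defined as the count divided by the number of abstract tokens, and a simple merge as the combination. All function names are hypothetical.

```python
from collections import Counter


def segment(text, dictionary):
    """Whitespace segmentation that keeps dictionary phrases as single tokens."""
    text = text.lower()
    for phrase in sorted(dictionary, key=len, reverse=True):
        # Protect multi-word dictionary entries from being split.
        text = text.replace(phrase, phrase.replace(" ", "_"))
    tokens = (tok.strip(".,;:") for tok in text.split())
    return [t.replace("_", " ") for t in tokens if t]


def keyword_weights(title, abstract, keywords):
    """Return (word, weight) pairs sorted by descending weight."""
    dictionary = {k.lower() for k in keywords}        # add keywords to the dictionary
    abstract_tokens = segment(abstract, dictionary)
    counts = Counter(abstract_tokens)
    total = len(abstract_tokens) or 1
    # Weight of each title word = its frequency in the abstract.
    title_weights = {t: counts[t] / total for t in segment(title, dictionary)}
    # Weight of each document keyword = its frequency in the abstract.
    kw_weights = {k: counts[k] / total for k in dictionary}
    merged = {**title_weights, **kw_weights}          # combine both lists
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)


weights = keyword_weights(
    title="Terrain compensation for meter-wave radar",
    abstract=("Meter-wave radar suffers from terrain error. "
              "The meter-wave radar height is compensated."),
    keywords=["meter-wave radar"],
)
```

In this toy input the research object "meter-wave radar" receives the highest weight, which matches the observation in the text that the research object tends to stand out in the keyword list.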

Further, the language sequence labeling of the plurality of kinds of key information in the document to be labeled may include: labeling the language sequences of the key information in the document based on the BIO rule to obtain the language sequence labeling results, where B marks the starting position of a named entity, I marks the parts of a named entity other than its starting position, and O marks other content that is not a predefined entity. For example, "terrain compensation method for low elevation angle elevation measurement of meter-wave radar" is labeled as "meter/B wave/I radar/I low/O elevation/O angle/O measurement/O high/O ground/O shape/O compensation/O method/O".
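The BIO labeling of the example above can be reproduced with a short sketch that marks every occurrence of an annotated entity in a token sequence. The `bio_labels` function is hypothetical, and the tokens follow the word-by-word rendering used in the example.

```python
def bio_labels(tokens, entity_tokens):
    """Label each occurrence of `entity_tokens` in `tokens` with B/I, everything else with O."""
    labels = ["O"] * len(tokens)
    n = len(entity_tokens)
    i = 0
    while n and i <= len(tokens) - n:
        if tokens[i:i + n] == entity_tokens:
            labels[i] = "B"                  # start of the entity
            for j in range(i + 1, i + n):
                labels[j] = "I"              # remainder of the entity
            i += n
        else:
            i += 1
    return labels


tokens = ["meter", "wave", "radar", "low", "elevation", "angle",
          "measurement", "high", "ground", "shape", "compensation", "method"]
labels = bio_labels(tokens, ["meter", "wave", "radar"])
tagged = " ".join(f"{t}/{l}" for t, l in zip(tokens, labels))
```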

A model training module 806, configured to train the target model using the literature corpus and the labeling result of the literature corpus as input.

Optionally, on the basis of any of the above embodiments, the apparatus 800 may further include: a graph construction module, configured to construct a knowledge graph for the target technical subject based on the various kinds of key information of the target documents in the case where the target document includes a plurality of documents under the target technical subject, wherein in the knowledge graph, one piece of key information represents one node, and an edge exists between pieces of key information that have an association relationship. In general, there is an association between key elements extracted from the same document; for example, in the same document there is an association between a research object and a problem, between a problem and a solution, between a solution and a rationale, between a rationale and a conclusion, and so on.

Optionally, on the basis of any of the foregoing embodiments, the apparatus 800 may further include: a visual display module, configured to display at least one of the knowledge graph and the specified key content report in a preset visualization mode. The specified key content report may include, but is not limited to, at least one of the following:

(1) A key content report corresponding to a target document of a newly added research problem aiming at the target technical subject, wherein the newly added research problem is determined based on publication time of the target document under the target technical subject and specified key information, and the specified key information comprises a problem and a solution;

It should be noted that, in addition to "new research questions" and "latest application exploration", the key content reports to be presented may be determined from other angles according to actual needs.

It can be understood that by visually displaying the knowledge graph and specifying the key content report, researchers can quickly understand the current research situation of the related technology, such as the latest application exploration, newly added research problems and the like.

It should be noted that, since the apparatus for acquiring key content of a document provided in the embodiments of the present application corresponds to the method for acquiring key content of a document provided in the embodiments of the present application, the description of the apparatus is relatively brief, and for relevant details reference may be made to the above description of the method.

Fig. 10 shows a schematic structural diagram of an electronic device according to an embodiment of the present application. Referring to fig. 10, at the hardware level, the electronic device includes a processor, and optionally further includes an internal bus, a network interface, and a memory. The memory may include an internal memory, such as a Random-Access Memory (RAM), and may further include a non-volatile memory, such as at least one disk memory. Of course, the electronic device may also include hardware required for other services.

The processor, the network interface, and the memory may be connected to each other by an internal bus, which may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one double-headed arrow is shown in FIG. 10, but this does not indicate only one bus or one type of bus.

The memory is used for storing programs. In particular, a program may include program code, and the program code comprises computer operating instructions. The memory may include both internal memory and non-volatile storage, and provides instructions and data to the processor.

The processor reads the corresponding computer program from the non-volatile memory into the internal memory and then runs it, forming the apparatus for acquiring the key content of a document at the logical level, and is specifically configured to perform the following operations:

acquiring a target document of key content to be extracted;

The method for acquiring the key content of a document disclosed in the embodiment shown in fig. 1A of the present application may be applied to, or implemented by, a processor. The processor may be an integrated circuit chip having signal processing capability. In implementation, the steps of the above method may be completed by integrated logic circuits of hardware in the processor or by instructions in the form of software. The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and it may implement or perform the methods, steps and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be directly performed by a hardware decoding processor, or performed by a combination of hardware and software modules in a decoding processor. The software modules may be located in a storage medium well known in the art, such as a random-access memory, a flash memory, a read-only memory, a programmable read-only memory or an erasable programmable read-only memory, or a register. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above method in combination with its hardware.

Therefore, the electronic device executing the method provided by the embodiment of the present application may execute the methods described in the foregoing method embodiments, and implement the functions and beneficial effects of the methods described in the foregoing method embodiments, which are not described herein again.

The electronic device of the embodiments of the present application exists in various forms, including but not limited to the following devices.

(1) Mobile communication devices: such devices are characterized by mobile communication functions and mainly aim at providing voice and data communication. Such terminals include smart phones (e.g., iPhone), multimedia phones, feature phones, low-end phones, and the like.

(2) Ultra-mobile personal computer devices: such devices belong to the category of personal computers, have computing and processing functions, and generally also have mobile internet access. Such terminals include PDA, MID and UMPC devices, such as the iPad.

(3) Servers: a server has an architecture similar to that of a general-purpose computer, but has higher requirements on processing capacity, stability, reliability, security, expandability, manageability and the like because it needs to provide highly reliable services.

(4) Other electronic devices with data interaction functions.

An embodiment of the present application further provides a computer-readable storage medium storing one or more programs, where the one or more programs include instructions, which, when executed by an electronic device including multiple application programs, enable the electronic device to perform the method for acquiring the document key content in the embodiment shown in fig. 1A, and are specifically configured to perform the following operations:

acquiring a target document of key content to be extracted;

inputting the target content of the target document into a target model, and outputting a plurality of key information extracted from the target document, wherein the target model comprises a plurality of sub-models corresponding to the key information, the sub-models are obtained by training based on labeled document corpora in advance, one sub-model is used for extracting one piece of key information, the key information comprises at least one of a research object, a target problem, a solution method, a basic principle and a conclusion, the sub-models corresponding to the research object, the target problem, the solution method and the basic principle are BERT + CRF, and the sub-model corresponding to the conclusion is a rule matching model;

and combining the plurality of key information to obtain a key content report of the target document.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

It should be noted that all the embodiments in the present application are described in a related manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and reference may be made to the partial description of the method embodiment for relevant points.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of additional identical elements in the process, method, article, or apparatus comprising the element.

The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art to which the present application pertains. Any modification, equivalent replacement, improvement or the like made within the spirit and principle of the present application shall be included in the scope of the claims of the present application.

Claims

1. A method of obtaining key content of a document, the method comprising:

acquiring a target document of key content to be extracted;

2. The method according to claim 1, wherein before the obtaining of the target document of the key content to be extracted, the method further comprises:

acquiring a literature corpus, wherein the literature corpus comprises content data of a plurality of literatures;

obtaining a labeling result for the literature corpus, wherein the labeling result of one document comprises a labeling result of the multiple kinds of key information in the document and a language sequence labeling result of the multiple kinds of key information in the document;

and taking the literature corpus and the labeling result of the literature corpus as input to train the target model.

3. The method according to claim 2, wherein the target content includes a title and an abstract, the labeling result of the multiple kinds of key information in a document is obtained by manual labeling with reference to a keyword list with weight information, and before the obtaining of the labeling result for the literature corpus, the method further comprises:

adding the keywords of the document to a target dictionary;

segmenting the title of the document into words based on the target dictionary, and calculating the word frequency, in the abstract of the document, of each segmented word of the title to obtain the weight of each segmented word of the title in the abstract of the document;

segmenting the abstract of the document into words based on the target dictionary, and calculating the word frequency of the keywords of the document in the abstract of the document to obtain the weight of the keywords of the document in the abstract of the document;

and combining the weight of each segmented word of the title in the abstract of the document with the weight of the keywords of the document in the abstract of the document to obtain the keyword list with the weight information.
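The weighting steps of claim 3 can be illustrated with a minimal sketch. It rests on several assumptions that are not stated in the claims: the weight is taken to be the relative word frequency in the abstract, the two weight sets are combined by a simple union, and `segment` is a hypothetical greedy longest-match segmenter over whitespace tokens that stands in for a real dictionary-based word segmenter. The names `segment` and `weighted_keywords` are illustrative only.

```python
from collections import Counter


def segment(text, dictionary, max_len=4):
    """Hypothetical stand-in segmenter: greedy longest match over whitespace tokens."""
    tokens = text.lower().replace(",", " ").replace(".", " ").split()
    out, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first; a single token is always accepted.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            cand = " ".join(tokens[i:i + n])
            if n == 1 or cand in dictionary:
                out.append(cand)
                i += n
                break
    return out


def weighted_keywords(title, abstract, keywords):
    """Return (word, weight) pairs sorted by descending weight, then by word."""
    # Add the document's keywords to the target dictionary.
    dictionary = {k.lower() for k in keywords}
    # Word frequencies in the abstract (assumed weight: relative frequency).
    counts = Counter(segment(abstract, dictionary))
    total = sum(counts.values()) or 1
    # Weight of each segmented title word in the abstract.
    title_weights = {w: counts[w] / total for w in segment(title, dictionary)}
    # Weight of each document keyword in the abstract.
    keyword_weights = {k: counts[k] / total for k in dictionary}
    # Assumed combination rule: union of the two weight tables.
    merged = {**title_weights, **keyword_weights}
    return sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))
```

In such a sketch, words that appear in the title but never in the abstract receive a weight of zero, which may help an annotator see which title terms the abstract does not support.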

4. The method according to claim 3, wherein before the obtaining the labeling result for the corpus of documents, the method further comprises:

and labeling the language sequences of the multiple kinds of key information in the document based on a BIO rule to obtain the language sequence labeling results of the multiple kinds of key information in the document, wherein B marks the starting position of a named entity, I marks the remaining part of the named entity other than the starting position, and O marks content that does not belong to any predefined entity.
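The BIO rule of claim 4 can be sketched as follows. The function name `bio_tags`, the span format (start index, exclusive end index, label), and the label names in the usage note are assumptions made for illustration; the claims do not specify a tag vocabulary.

```python
def bio_tags(tokens, spans):
    """Convert entity spans over token indices into one BIO tag per token.

    spans: iterable of (start, end_exclusive, label); spans are assumed
    not to overlap.
    """
    tags = ["O"] * len(tokens)          # O: outside any predefined entity
    for start, end, label in spans:
        tags[start] = f"B-{label}"      # B: first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"      # I: remaining tokens of the entity
    return tags
```

For example, with the tokens `["graph", "neural", "network", "for", "traffic", "prediction"]` and the hypothetical spans `[(0, 3, "METHOD"), (4, 6, "PROBLEM")]`, the first three tokens are tagged `B-METHOD`, `I-METHOD`, `I-METHOD`, the word "for" is tagged `O`, and the last two tokens are tagged `B-PROBLEM`, `I-PROBLEM`.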

5. The method according to any one of claims 2-4, wherein before the obtaining the labeling result for the literature corpus, the method further comprises:

preprocessing the literature corpus, wherein the preprocessing comprises at least one of the following:

deleting documents whose title content is empty from the literature corpus;

deleting documents whose abstract content is empty from the literature corpus;

deleting review-like documents.

6. The method of claim 5, wherein the preprocessing comprises deleting review-like documents, and wherein the deleting of review-like documents comprises:

determining whether the titles of the documents in the document corpus are matched with a first preset keyword, wherein the first preset keyword is a keyword capable of representing a review document;

and if the title of a document matches, deleting that document from the literature corpus.
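The preprocessing of claims 5 and 6 can be sketched as a single filtering pass. The keyword tuple `REVIEW_KEYWORDS`, the dictionary layout of a document (`title`, `abstract`), and the use of plain substring matching are all assumptions; the claims only require that the title be matched against a first preset keyword that can represent a review document.

```python
# Hypothetical "first preset keywords" indicating a review document.
REVIEW_KEYWORDS = ("review", "survey", "overview", "综述")


def preprocess(corpus):
    """Keep only documents with a non-empty title and abstract that are not reviews."""
    kept = []
    for doc in corpus:
        title = (doc.get("title") or "").strip()
        abstract = (doc.get("abstract") or "").strip()
        if not title or not abstract:
            continue  # empty title or empty abstract: drop the document
        if any(k in title.lower() for k in REVIEW_KEYWORDS):
            continue  # title matches a review keyword: drop the document
        kept.append(doc)
    return kept
```

Substring matching is deliberately naive here: a title containing "preview" would also match "review", so a practical implementation would likely use word-boundary or segmenter-based matching.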

7. The method of claim 1, wherein the target document comprises a plurality of documents under a target technical topic, the method further comprising:

and constructing a knowledge graph for the target technical topic based on the multiple kinds of key information of the target document, wherein in the knowledge graph, each piece of key information is represented by one node, and an edge exists between pieces of key information that have an association relation.
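A minimal sketch of the graph construction in claim 7 follows. It assumes, beyond what the claim states, that two pieces of key information are associated when they were extracted from the same document, and it stores the graph as a plain adjacency mapping rather than in any particular graph library. The function name `build_graph` and the field names are illustrative.

```python
from collections import defaultdict
from itertools import combinations


def build_graph(documents):
    """Build (nodes, edges) from per-document key information.

    documents: list of dicts mapping a kind of key information
    (e.g. "research_object") to the extracted text.
    """
    nodes = set()
    edges = defaultdict(set)  # undirected adjacency: node -> set of neighbours
    for doc in documents:
        # One node per non-empty piece of key information, keyed by (kind, text).
        items = [(kind, value) for kind, value in doc.items() if value]
        nodes.update(items)
        # Assumed association rule: co-occurrence in the same document.
        for a, b in combinations(items, 2):
            edges[a].add(b)
            edges[b].add(a)
    return nodes, edges
```

Because nodes are keyed by kind and text, two documents that share the same research object contribute to a single node, which then links the distinct problems and methods studied for that object.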

8. The method of claim 7, wherein said combining said plurality of key information to obtain a key content report of said target document comprises:

and assembling the plurality of kinds of key information according to a preset report template to obtain a key content report of the target document.
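The assembly step of claim 8 can be sketched with a fixed text template. The template wording, the field names in `FIELDS`, and the placeholder text used for missing fields are assumptions; the claim only requires that the key information be assembled according to a preset report template.

```python
from string import Template

# Hypothetical field names for the five kinds of key information.
FIELDS = (
    "research_object",
    "specific_problem",
    "solution_method",
    "basic_principle",
    "conclusion",
)

# Hypothetical preset report template.
REPORT_TEMPLATE = Template(
    "Title: $title\n"
    "Research object: $research_object\n"
    "Specific problem: $specific_problem\n"
    "Solution method: $solution_method\n"
    "Basic principle: $basic_principle\n"
    "Conclusion: $conclusion"
)


def build_report(title, key_info):
    """Fill the template, marking any kind of key information that is missing."""
    fields = {k: key_info.get(k) or "(not extracted)" for k in FIELDS}
    return REPORT_TEMPLATE.substitute(title=title, **fields)
```

Calling `build_report("Some title", {"research_object": "traffic flow"})` would produce a six-line report in which the four remaining fields read "(not extracted)".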

9. The method of claim 8, further comprising:

displaying the knowledge graph and a specified key content report in a preset visualization mode, wherein the specified key content report comprises at least one of the following items:

a key content report corresponding to a target document of a newly added research problem of the target technical topic, wherein the newly added research problem is determined based on publication time of the target document under the target technical topic and specified key information, and the specified key information comprises a specific problem and a solution method;

and a key content report corresponding to a target document of the latest application exploration of the target technical topic, wherein the latest application exploration is determined based on publication time of the target document under the target technical topic and preset information, and the preset information comprises a title.
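One possible reading of the "newly added research problem" in claim 9 can be sketched as follows. The claim does not define the rule, so this sketch assumes that a combination of specific problem and solution method counts as newly added when its earliest publication date falls on or after a chosen cutoff. The function name, the `since` parameter, and the document field names are hypothetical.

```python
from datetime import date


def newly_added_problems(documents, since):
    """Return the earliest document of each (problem, method) pair first published on or after `since`."""
    first_seen = {}
    for doc in documents:
        key = (doc["specific_problem"], doc["solution_method"])
        # Remember the earliest publication for each (problem, method) pair.
        if key not in first_seen or doc["published"] < first_seen[key]["published"]:
            first_seen[key] = doc
    # Assumed rule: "newly added" means first published on or after the cutoff.
    return [d for d in first_seen.values() if d["published"] >= since]
```

Under this assumption, a problem that was already studied before the cutoff is not reported again even if newer documents address it, so only the key content reports of genuinely new problem and method pairs are selected for display.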

10. An apparatus for obtaining key content of a document, the apparatus comprising:

and the report generation module is used for combining the multiple kinds of key information to obtain a key content report of the target document.