CN115270738B - Research report generation method, system and computer storage medium - Google Patents

Research report generation method, system and computer storage medium


Publication number
CN115270738B
Authority
CN
China
Prior art keywords
outline
semantic
title
report
paragraphs
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211210980.9A
Other languages
Chinese (zh)
Other versions
CN115270738A (en
Inventor
刘明童
韦松伟
朱晴晴
周明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Lanzhou Technology Co ltd
Original Assignee
Beijing Lanzhou Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Lanzhou Technology Co ltd
Priority to CN202211210980.9A
Publication of CN115270738A
Application granted
Publication of CN115270738B
Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/258 Heading extraction; Automatic titling; Numbering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Abstract

The invention relates to the technical field of natural language processing, and in particular to a research report generation method comprising the following steps: obtaining a semantic computation model; acquiring a material library and retrieval words and sentences, and retrieving related content from the material library based on the retrieval words and sentences to form a research report subset; obtaining an outline from the research report subset through the semantic computation model; recalling, for the outline, paragraphs from the research report subset; sorting and combining the recalled paragraphs to form the module content of the corresponding outline entry; and combining the module contents into a research report text. The method quickly finds the information of interest to the user in a given material library and completes the writing of the report automatically. Content is generated on the basis of the semantic computation model: semantic computation supports outline generation, outline-to-content generation and content ordering, so that a large amount of information is integrated efficiently and a high-quality, readable report is produced for the user's reference.

Description

Research report generation method, system and computer storage medium
Technical Field
The present invention relates to the field of natural language processing technologies, and in particular to a research report generation method, a research report generation system and a computer storage medium.
Background
Research reports in professional fields, such as industry development reports and securities analysis reports, are an important source of information. Because such reports demand professional expertise and rigor, they usually devote considerable space to in-depth analysis of a specific industry, field, event or enterprise. They are normally researched and written by professionals who need comprehensive skills in collecting and analyzing data and information. Many reports follow fixed formats, cover a broad range of knowledge and draw heavily on specialist knowledge, so writing them requires a great deal of manual effort spent on lengthy, tedious listing and sorting of data, which is time-consuming and labor-intensive. To address this problem, an automatic research report generation system is developed for efficient information aggregation: it uses natural language processing technology to complete report writing automatically on the basis of a database, providing users with a fast writing-assistance service and improving writing efficiency.
Traditional research report generation techniques are usually built on rule-based methods or simple retrieval: they search the content stored in a database and return a large amount of data related to the user's query. These techniques focus only on information lookup. Their retrieval accuracy is low, and they neither refine the retrieved content nor process it through semantic understanding. The results are voluminous, fail to highlight or integrate the key information, and are hard to read. The user must still spend time locating the information of interest in the retrieval results, the accuracy of that information is insufficient, and efficiency is hard to improve in practice.
Disclosure of Invention
The invention provides a research report generation method, a system and a computer storage medium, and aims to solve the problem that report writing requires a large amount of manual effort for lengthy, tedious listing and sorting of data, which is time-consuming and labor-intensive.
In order to solve the above technical problem, the invention provides the following technical solution: a research report generation method, comprising the following steps:
obtaining a semantic computation model; specifically, title-paragraph pair training data are constructed using a title acquisition method based on keyword extraction and/or a title generation method based on summarization, a two-tower model is adopted as the base model, and the two-tower model is trained on the title-paragraph pair training data to obtain the semantic computation model;
acquiring a material library and retrieval words and sentences, and retrieving related content from the material library based on the retrieval words and sentences to form a research report subset;
obtaining an outline from the research report subset through the semantic computation model; specifically, extracting the titles in the research report subset, obtaining the vectors of all the titles through the semantic computation model, clustering the titles based on the vectors to obtain at least one cluster, calculating the semantic distance between each title in a cluster and the cluster center, wherein the semantic distance is measured by cosine similarity, and selecting in each cluster the title semantically closest to the cluster center as an outline title to form a first-level outline;
recalling, for the outline, paragraphs from the research report subset;
sorting and combining the recalled paragraphs to form the module content of the corresponding outline entry;
and combining the module contents into a research report text.
Preferably, retrieving related content from the material library based on the retrieval words and sentences to form the research report subset comprises the following steps:
vectorizing the content of the material library and the retrieval words and sentences to obtain text vectors and a retrieval word-and-sentence vector;
performing similarity matching between the text vectors and the retrieval word-and-sentence vector;
and selecting the content whose similarity falls within a preset range, and sorting it by similarity to form the research report subset.
Preferably, the vectors of the titles are clustered using the K-Means algorithm.
Preferably, after selecting in each cluster the title semantically closest to the cluster center to form the first-level outline, the method further comprises:
if a cluster contains titles other than the outline title, taking the remaining titles as candidate secondary titles, thereby obtaining the candidate secondary titles corresponding to the outline title of each cluster;
determining the similarity of every pair of candidate secondary titles and, if the similarity is greater than a preset value, connecting the two candidate secondary titles with an edge, a bipartite graph being formed after iteration, wherein the nodes of the bipartite graph represent the candidate secondary titles and the edges represent the similarity between two candidate secondary titles;
and taking the out-degree of a node as the amount of information contained in the corresponding candidate secondary title, processing the bipartite graph with a greedy algorithm to obtain a secondary title set that carries the most information with the fewest candidate secondary titles, and taking the secondary title set as a second-level outline.
Preferably, recalling, for the outline, paragraphs from the research report subset based on semantic vectors comprises the following steps:
obtaining module titles from the outline, and generating a title semantic vector for each module title with the semantic computation model;
obtaining the paragraph semantic vectors of the paragraphs in the research report subset;
and calculating the semantic distance between the title semantic vector and each paragraph semantic vector, and selecting the paragraphs whose semantic distance falls within a preset range as the paragraphs of the corresponding module title.
Preferably, sorting and combining the recalled paragraphs to form the module content of the corresponding outline entry comprises the following steps:
screening the paragraphs selected for each module title, specifically calculating the similarity between paragraphs from their semantic vectors and, if the similarity is greater than a preset value, removing the shorter paragraph;
combining the screened paragraphs in pairs, encoding each pair to obtain a deeply interactive semantic representation, and determining the ordering relation with a binary classifier;
and finally, taking the sorted paragraphs as the module content corresponding to the module title.
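One way to turn pairwise ordering decisions into a full paragraph order is sketched below. This is only an illustration under stated assumptions: `comes_before` is a hypothetical stand-in for the trained pairwise classifier, and aggregating its decisions by counting "wins" is an assumption of this sketch, since the text does not say how the pairwise judgments are combined.

```python
from itertools import combinations

def order_paragraphs(paragraphs, comes_before):
    """Order unique paragraph strings by how many pairwise comparisons place them first.

    comes_before(a, b) is a hypothetical callback standing in for the pairwise
    classifier; it returns True when paragraph a should precede paragraph b.
    """
    wins = {p: 0 for p in paragraphs}
    for a, b in combinations(paragraphs, 2):
        if comes_before(a, b):
            wins[a] += 1
        else:
            wins[b] += 1
    # More wins means the paragraph precedes more of the others.
    return sorted(paragraphs, key=lambda p: wins[p], reverse=True)

# Toy classifier: a fixed rank table replaces the real model (assumption).
rank = {"background": 0, "analysis": 1, "outlook": 2}
ordered = order_paragraphs(
    ["outlook", "background", "analysis"],
    lambda a, b: rank[a] < rank[b],
)
print(ordered)
```

With a consistent classifier the win counts give a total order; an inconsistent one can produce ties, which `sorted` resolves by keeping the input order.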
In order to solve the above technical problem, the invention provides another technical solution as follows: a research report generation system for implementing the steps of the research report generation method, comprising the following modules:
a retrieval module: acquiring a material library and retrieval words and sentences, and retrieving related content from the material library based on the retrieval words and sentences to form a research report subset;
an outline generation module: obtaining an outline from the research report subset through semantic computation;
a content generation module: recalling, for the outline, paragraphs from the research report subset based on semantic vectors;
a content sorting module: sorting and combining the recalled paragraphs to form the module content of the corresponding outline entry;
a combination module: combining the module contents into a research report text.
In order to solve the above technical problem, the invention provides a further technical solution as follows: a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the research report generation method described above.
Compared with the prior art, the research report generation method, system and computer storage medium provided by the invention have the following beneficial effects:
1. The research report generation method provided by the embodiment of the invention quickly finds the information of interest to the user in a given material library and completes the writing of the report automatically. The method obtains semantic vectors of the text and then generates content on the basis of the semantic computation model; semantic computation supports outline generation, outline-to-content generation and content ordering, so that a large amount of information is integrated efficiently and a high-quality, readable report is produced for the user's reference.
2. The research report generation method provided by the embodiment of the invention retrieves the information the user wants to focus on before generating the research report, so that unneeded content in the material library is excluded, generation efficiency is improved, a better-quality research report is obtained, and the content of the material library is organized efficiently.
3. The research report generation method provided by the embodiment of the invention generates the outline through the semantic computation model, which makes the research report easier to read and lets the user look up related content by way of the outline.
4. The research report generation method provided by the embodiment of the invention divides the content of the research report more finely by generating a second-level outline, so that the resulting research report has a stronger structural logic.
5. The research report generation method provided by the embodiment of the invention takes the paragraphs of the research report subset as the smallest unit when generating content for the outline, so that more accurate content is obtained, irrelevant paragraphs are excluded, and the quality and readability of the generated research report are improved.
6. The research report generation method provided by the embodiment of the invention calculates the similarity between paragraphs from their semantic vectors and removes redundant paragraphs with similar semantics, which streamlines the content of the research report, improves its quality and makes it easier to read.
7. An embodiment of the invention further provides a research report generation system, which has the same beneficial effects as the research report generation method described above; they are not repeated here.
8. An embodiment of the invention further provides a computer storage medium, which has the same beneficial effects as the research report generation method described above; they are not repeated here.
Drawings
In order to illustrate the technical solutions in the embodiments of the invention more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of the steps of a research report generation method according to a first embodiment of the present invention.
Fig. 2 is a flowchart of step S2 of a research report generation method according to a first embodiment of the present invention.
Fig. 3 is a flowchart of step S3 of a research report generation method according to a first embodiment of the present invention.
Fig. 4 is a flowchart of the steps after S34 of a research report generation method according to a first embodiment of the present invention.
Fig. 5 is a flowchart of step S4 of a research report generation method according to a first embodiment of the present invention.
Fig. 6 is a flowchart of step S5 of a research report generation method according to a first embodiment of the present invention.
Fig. 7 is a flowchart of step S1 of a research report generation method according to a first embodiment of the present invention.
Fig. 8 is a block diagram of a research report generation system according to a first embodiment of the present invention.
Description of the figures:
1. research report generation system;
10. retrieval module; 20. outline generation module; 30. content generation module; 40. content sorting module; 50. combination module.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Referring to fig. 1, a first embodiment of the present invention provides a research report generation method, including the following steps:
S1: obtaining a semantic computation model;
S2: acquiring a material library and retrieval words and sentences, and retrieving related content from the material library based on the retrieval words and sentences to form a research report subset;
S3: obtaining an outline from the research report subset through the semantic computation model;
S4: recalling, for the outline, paragraphs from the research report subset;
S5: sorting and combining the recalled paragraphs to form the module content of the corresponding outline entry;
S6: combining the module contents into a research report text.
The research report generation method provided by this embodiment uses a semantic computation model to find, in a prepared material library, the content the user cares about according to the retrieval words and sentences the user provides, and then writes the research report automatically from the content found. A large amount of information is thereby integrated efficiently, and a high-quality, readable report is produced for the user's reference.
It can be understood that the material library can be regarded as a collection of a large number of documents, namely the materials or news items collected by the user, which form the content basis of the research report to be written. The volume of materials or news is large and the content is scattered, so finding the points of interest manually is time-consuming and labor-intensive. The invention first takes the retrieval words and sentences provided by the user, which may be sentences, keywords and the like, and retrieves the most relevant document or documents from the material library. These documents usually also contain irrelevant content that would otherwise need manual sorting; the invention performs this sorting automatically to form a research report with a logical structure. In addition, a research report generally needs to cover content in several directions, whereas one retrieval word or sentence often corresponds to content in only one direction.
After the related content has been retrieved, an outline is generated from it; specifically, the outline is obtained through semantic computation with the semantic computation model. The outline can be understood as a table of contents of titles: it reflects the general content of the whole research report to some extent and makes it easier for the user to browse and read. Each outline entry then corresponds to a module and serves as the title of that module. Paragraphs in the related content are matched against the outline entry to find the paragraphs related to it and to exclude irrelevant paragraphs in the documents. The paragraphs are then sorted and combined to give each module content with a certain logical structure, and finally the module contents are combined into a complete, structured research report.
It should be noted that the method proposed by the invention organizes existing content efficiently and presents the result to the reader, improving the efficiency of obtaining the information needed to write a research report. The generated research report can also serve as a first draft, which the user can edit and refine to further improve its quality.
Referring to fig. 2, further, in step S2, "retrieving related content from the material library based on the retrieval words and sentences to form a research report subset" includes the following steps:
S21: vectorizing the content of the material library and the retrieval words and sentences to obtain text vectors and a retrieval word-and-sentence vector;
S22: performing similarity matching between the text vectors and the retrieval word-and-sentence vector;
S23: selecting the content whose similarity falls within a preset range, and sorting it by similarity to form the research report subset.
It can be understood that each document in the material library is first vectorized by the semantic computation model to obtain a text vector for the document, which serves as the semantic vector of the document. The retrieval words and sentences input by the user are likewise vectorized by the semantic computation model to obtain a retrieval word-and-sentence vector, which serves as the semantic vector of the retrieval words and sentences. The similarity between each text vector and the retrieval word-and-sentence vector, namely the cosine similarity, is then calculated, and the content of the material library, that is, the documents, whose similarity falls within a preset range is selected. The preset range can be set according to actual requirements. Finally, the selected content is sorted by the calculated similarity to obtain the research report subset related to the user's retrieval words and sentences, which serves as the information source for generating the research report.
Retrieving the information the user wants to focus on before generating the research report excludes unneeded content in the material library, improves generation efficiency, yields a better-quality research report, and organizes the content of the material library efficiently.
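The retrieval step (S21 to S23) can be illustrated with a minimal sketch. The vectors below are toy values standing in for the output of the semantic computation model, and treating the "preset range" as a lower bound `min_sim` is an assumption of this sketch; `build_subset` is a hypothetical helper name.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def build_subset(query_vec, doc_vecs, min_sim=0.5):
    """Return (doc_id, similarity) pairs with similarity >= min_sim, best first."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    kept = [pair for pair in scored if pair[1] >= min_sim]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

# Toy document vectors (assumption: real ones come from the semantic model).
docs = {"doc_a": [1.0, 0.0], "doc_b": [0.6, 0.8], "doc_c": [0.0, 1.0]}
subset = build_subset([1.0, 0.0], docs, min_sim=0.5)
print([doc_id for doc_id, _ in subset])
```

Here `doc_c` is orthogonal to the query and falls below the threshold, so only `doc_a` and `doc_b` enter the subset, in that order.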
Referring to fig. 3, step S3, "obtaining an outline from the research report subset through the semantic computation model", includes the following steps:
S31: extracting the titles in the research report subset, and obtaining the vectors of all the titles through the semantic computation model;
S32: clustering the titles based on the vectors to obtain at least one cluster;
S33: calculating the semantic distance between each title in a cluster and the cluster center, wherein the semantic distance is measured by cosine similarity;
S34: selecting in each cluster the title semantically closest to the cluster center to form a first-level outline.
It should be noted that the research report subset consists of documents retrieved from the material library. Such documents, for example articles, news reports and research reports, generally carry titles, and a title represents the general content of a document.
First, the titles in the research report subset are extracted directly and their semantic vectors are obtained. The titles are then clustered according to their semantic vectors: titles on similar topics are grouped into one category, and titles on different topics fall into different categories. Clustering the titles of the whole research report subset can be regarded as automatic mining of topic information, where each category represents one type of information, so that the content of the research report subset is reflected more comprehensively. The number of clusters obtained is the number of categories into which the titles are divided.
After clustering, if a cluster contains only one title, that title is taken directly as an entry of the first-level outline. If a cluster contains several titles, step S33 and the subsequent steps are performed: the semantic distance between each title in the cluster and the cluster center is calculated using cosine similarity, that is, the distance between the vector of each title and the cluster center is calculated, and the title semantically closest to the cluster center in each cluster is selected as the outline title. The outline titles together form the first-level outline.
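The selection of one outline title per cluster can be sketched as follows. The clusters and vectors are toy data, the cluster center is taken here as the mean of the member vectors (an assumption of this sketch), and `pick_outline_titles` is a hypothetical helper name.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def pick_outline_titles(clusters):
    """clusters: list of lists of (title, vector). Returns one outline title per cluster."""
    outline = []
    for members in clusters:
        if len(members) == 1:
            # A single-title cluster contributes its title directly.
            outline.append(members[0][0])
            continue
        center = centroid([vec for _, vec in members])
        # Highest cosine similarity to the center means semantically closest.
        best = max(members, key=lambda m: cosine(m[1], center))
        outline.append(best[0])
    return outline

clusters = [
    [("Market overview", [1.0, 0.0]), ("Market size", [0.9, 0.1]), ("Market trends", [0.7, 0.3])],
    [("Policy environment", [0.0, 1.0])],
]
print(pick_outline_titles(clusters))
```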
Further, the extracted outline titles are unordered, whereas the content of a research report usually follows a certain logical order; for example, background knowledge usually appears near the front of an article. Based on this observation, the relative position of each title is calculated first, and the relative positions of all the titles in each cluster are then considered together when sorting. The titles of the first-level outline are sorted as follows:
setting a position information function for titles that converts the position of each title into a specific value; the title position function is statistical: suppose a title in a cluster comes from a certain document; if the document contains 5 titles and the position of the current title is 2, the value of the position function is 2/5 = 0.4, and the smaller the value, the nearer the front the title is; the position of the title is also described as the semantic distance between the title and the cluster center, where the closer the distance, the smaller the position;
if the same document in the research report subset has several titles, normalizing the position values of the different titles in that document;
calculating the position value of each title in a cluster and then the average value for the cluster, and sorting the clusters from large to small by their average values, that is, sorting the outline titles extracted from the clusters, to obtain an ordered first-level outline.
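The position-based ordering can be sketched as below. The position value is computed as the title's index divided by the number of titles in its source document, which matches the 2/5 = 0.4 example. The text specifies sorting by cluster average from large to small while also describing smaller values as nearer the front, so the sort direction is left as a parameter here; `order_clusters` and its input layout are assumptions of this sketch.

```python
def position_value(index, total):
    """Relative position of a title inside its source document (1-based index)."""
    return index / total

def order_clusters(cluster_positions, descending=True):
    """Order outline titles by the mean position value of the titles in their cluster.

    cluster_positions maps each outline title to a list of (index, total) pairs,
    one pair per title in that cluster (hypothetical input layout).
    Returns the ordered outline titles and the per-cluster means.
    """
    means = {
        title: sum(position_value(i, n) for i, n in pairs) / len(pairs)
        for title, pairs in cluster_positions.items()
    }
    return sorted(means, key=means.get, reverse=descending), means

toy = {"Background": [(1, 5), (2, 5)], "Outlook": [(4, 5), (5, 5)]}
by_desc, means = order_clusters(toy, descending=True)
by_asc, _ = order_clusters(toy, descending=False)
print(by_desc, by_asc, means)
```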
Specifically, in this embodiment, the vectors of the titles are clustered using the K-Means algorithm, and the cluster centers used in the semantic distance calculation are those obtained by the K-Means algorithm.
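For readers unfamiliar with K-Means, a minimal pure-Python version of Lloyd's algorithm is sketched below. Using the first `k` points as initial centers and squared Euclidean distance for assignment are assumptions of this sketch; a production system would more likely use a library implementation with a better initialization.

```python
def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; returns (labels, centers)."""
    centers = [list(p) for p in points[:k]]  # naive deterministic init (assumption)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Toy title vectors (assumption: real ones come from the semantic model).
title_vecs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
labels, centers = kmeans(title_vecs, k=2)
print(labels, centers)
```

The returned centers are the cluster centers against which each title's semantic distance would then be measured.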
Referring to fig. 4, further, after "selecting in each cluster the title semantically closest to the cluster center to form a first-level outline" in step S34, the method further includes:
S35: if a cluster contains titles other than the outline title, taking the other titles of the cluster as candidate secondary titles;
S36: determining the similarity of every pair of candidate secondary titles and, if the similarity is greater than a preset value, connecting the two candidate secondary titles with an edge, a bipartite graph being formed after iteration, wherein the nodes of the bipartite graph represent the candidate secondary titles and the edges represent the similarity between two candidate secondary titles;
S37: taking the out-degree of a node as the amount of information contained in the corresponding candidate secondary title, processing the bipartite graph with a greedy algorithm to obtain a secondary title set that carries the most information with the fewest candidate secondary titles, and taking the secondary title set as a second-level outline.
It should be noted that the outline can be understood as a table of contents: the first-level outline is the set of first-level titles in the table of contents, and the second-level outline is the set of second-level titles under each first-level title. If a cluster contains no titles other than the outline title, there is no secondary title under that outline title.
It can be understood that the similarity between candidate secondary titles is likewise calculated as a semantic distance based on their semantic vectors. If the similarity is greater than the preset value, the two candidate secondary titles are connected by an edge, indicating a certain correlation between them; after iteration, a bipartite graph is obtained. Finally, the greedy algorithm yields the secondary title set that carries the most information with the fewest candidate secondary titles and removes candidate secondary titles that carry little information, so that the subsequently generated research report does not become disorganized because too many secondary titles were obtained.
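The text does not spell out the greedy procedure, so the sketch below shows one plausible reading, labeled as an assumption: build an undirected similarity graph, then repeatedly pick the title connected to the most not-yet-covered titles (a greedy dominating-set heuristic), using node degree as the stand-in for information content. `greedy_secondary_titles` and the `similarity` callback are hypothetical.

```python
def greedy_secondary_titles(titles, similarity, threshold=0.8):
    """Pick few titles that together cover all candidates (greedy, dominating-set style)."""
    # Build the graph: an edge joins two titles whose similarity exceeds the threshold.
    neighbors = {t: set() for t in titles}
    for i, a in enumerate(titles):
        for b in titles[i + 1:]:
            if similarity(a, b) > threshold:
                neighbors[a].add(b)
                neighbors[b].add(a)
    uncovered, chosen = set(titles), []
    while uncovered:
        # Degree restricted to still-uncovered nodes stands in for "information content".
        best = max(titles, key=lambda t: (t in uncovered) + len(neighbors[t] & uncovered))
        chosen.append(best)
        uncovered -= neighbors[best] | {best}
    return chosen

# Toy similarities (assumption): A is similar to B and C, D is unrelated.
pair_sim = {frozenset(("A", "B")): 0.9, frozenset(("A", "C")): 0.85}
selected = greedy_secondary_titles(
    ["A", "B", "C", "D"],
    lambda a, b: pair_sim.get(frozenset((a, b)), 0.1),
)
print(selected)
```

In this toy case "A" covers "B" and "C", and "D" is kept because nothing else represents it.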
Further, after the secondary title set is obtained, the secondary titles under each outline title are sorted. Specifically, the secondary titles are sorted by importance, where importance is how closely a secondary title is related to its outline title: the more closely related, the more important. Sorting gives the research report a rigorous, orderly structure.
Referring to fig. 5, further, step S4, "recalling, for the outline, paragraphs from the research report subset", includes the following steps:
S41: obtaining module titles from the outline, and generating a title semantic vector for each module title with the semantic computation model;
S42: obtaining the paragraph semantic vectors of the paragraphs in the research report subset;
S43: calculating the semantic distance between the title semantic vector and each paragraph semantic vector, and selecting the paragraphs whose semantic distance falls within a preset range as the paragraphs of the corresponding module title.
It can be understood that the outline includes a first-level outline composed of outline titles and a second-level outline composed of secondary titles. If an outline title has secondary titles, each secondary title is used as a module title; if an outline title has no secondary title, the outline title itself is used as the module title. In this way a number of modules, each with a module title, are obtained.
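The rule for deriving module titles from the two-level outline is simple enough to state in a few lines. The dictionary layout and the name `module_titles` are assumptions made for illustration.

```python
def module_titles(outline):
    """outline: {outline_title: [secondary titles]}; returns the module titles in order."""
    modules = []
    for top, secondary in outline.items():
        # Secondary titles become modules; otherwise the outline title itself does.
        modules.extend(secondary if secondary else [top])
    return modules

toy_outline = {
    "Market overview": ["Market size", "Market trends"],
    "Policy environment": [],
}
print(module_titles(toy_outline))
```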
When the content is generated corresponding to the outline, the paragraphs in the report subset are used as the minimum unit, so that more accurate content is obtained, other irrelevant paragraphs are eliminated, and the high quality and readability of the generated report are improved.
First, the heading semantic vector corresponding to each module heading and the paragraph semantic vector of each paragraph of the report subset are obtained, the heading semantic vector may also be the vector already obtained in step S31, and the paragraph semantic vector of the report subset may also be pre-vectorized, so as to improve the generation efficiency of the report.
Specifically, for pictures, tables and the like existing in documents in the research subset, corresponding word descriptions are used as features to obtain corresponding semantic vectors.
After the semantic distance is calculated, according to actual requirements, a paragraph in a proper range is selected as the module content of the module, which may be one paragraph or multiple paragraphs, and is not specifically limited.
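Steps S41 to S43 can be sketched as follows, assuming the title and paragraph vectors have already been produced by the semantic calculation model. Cosine similarity is used as the semantic distance, and the "preset range" is modelled as a hypothetical minimum similarity plus an optional cap on the number of paragraphs; both parameters are illustrative assumptions.

```python
import math
from typing import List, Optional, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def recall_paragraphs(
    title_vec: Sequence[float],
    paragraph_vecs: Sequence[Sequence[float]],
    min_similarity: float = 0.7,   # hypothetical preset value
    top_k: Optional[int] = None,   # hypothetical cap on recalled paragraphs
) -> List[int]:
    """Return indices of paragraphs similar enough to the module title,
    ordered from most to least similar."""
    scored = [(cosine(title_vec, vec), idx) for idx, vec in enumerate(paragraph_vecs)]
    kept = [(score, idx) for score, idx in scored if score >= min_similarity]
    kept.sort(key=lambda pair: (-pair[0], pair[1]))
    indices = [idx for _, idx in kept]
    return indices if top_k is None else indices[:top_k]


# Toy example: paragraphs 0 and 2 point in roughly the same direction as the title.
title = [1.0, 0.0]
paragraphs = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.6], [0.6, 0.8]]
recalled = recall_paragraphs(title, paragraphs, min_similarity=0.7)
```

Because paragraph vectors do not depend on the module title, they can be computed once and reused for every module, which matches the pre-vectorization noted in the description.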
Referring to fig. 6, step S5, "sorting and combining the recalled paragraphs to form the module content of the corresponding outline", includes the following steps:
S51: screening the paragraphs selected for each module title; specifically, calculating the similarity between paragraphs according to their semantic vectors, and if the similarity is greater than a preset value, removing the shorter paragraph;
S52: combining the screened paragraphs pairwise, encoding each pair to obtain a deep interactive semantic representation, and judging the ordering relation using a binary classifier;
S53: finally, taking the sorted paragraphs as the module content of the corresponding module title.
It can be understood that, because the research report subset contains a large amount of content, paragraphs with very similar or identical semantics are likely to appear. The similarity between paragraphs is therefore calculated from their semantic vectors, and the shorter redundant paragraphs are removed to streamline the content of the research report; whether paragraphs are removed can be controlled according to specific requirements.
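The screening in step S51 can be sketched as below. The sketch assumes one precomputed semantic vector per paragraph and uses cosine similarity with a hypothetical threshold; when two paragraphs are too similar, the shorter one (by character count) is dropped, and on a length tie the later one is dropped, which is an arbitrary choice made only for this illustration.

```python
import math
from typing import List, Sequence, Set


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def drop_redundant(
    paragraphs: Sequence[str],
    vectors: Sequence[Sequence[float]],
    threshold: float = 0.9,  # hypothetical preset value
) -> List[str]:
    """Remove the shorter paragraph of every pair whose similarity exceeds the threshold."""
    removed: Set[int] = set()
    for i in range(len(paragraphs)):
        for j in range(i + 1, len(paragraphs)):
            if i in removed or j in removed:
                continue  # a removed paragraph no longer competes with others
            if cosine(vectors[i], vectors[j]) > threshold:
                shorter = i if len(paragraphs[i]) < len(paragraphs[j]) else j
                removed.add(shorter)
    return [p for k, p in enumerate(paragraphs) if k not in removed]


texts = ["Revenue grew.", "Revenue grew by a wide margin this year.", "Costs were flat."]
vecs = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
screened = drop_redundant(texts, vecs)
```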
Then the pairwise-combined paragraphs are input into the binary classifier. For example, for an input paragraph pair AB, the binary classifier gives a classification label of 1 or 0 to indicate whether A precedes B, where 1 indicates yes and 0 indicates no. Specifically, if the input is paragraph A and paragraph B, the corresponding paragraph vectors are obtained through the semantic calculation model, the paragraph vectors are subtracted, and a nonlinear mapping is applied using the Sigmoid function. During back propagation, the loss between the output value and the label is calculated and the corresponding parameter values are updated layer by layer from back to front; iteration stops once the parameters gradually converge.
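The classifier described above can be illustrated with a deliberately small sketch: the difference of the two paragraph vectors is passed through a single linear layer and a Sigmoid, and the weights are updated by gradient descent on the cross-entropy loss. The single layer, the learning rate, the fixed epoch count and the toy vectors are all assumptions; the text does not specify the network depth or the loss function.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(w: Sequence[float], b: float, a: Vector, other: Vector) -> float:
    """Probability that paragraph A precedes paragraph B, from the vector difference."""
    diff = [x - y for x, y in zip(a, other)]
    return sigmoid(sum(wi * di for wi, di in zip(w, diff)) + b)


def train_order_classifier(
    pairs: Sequence[Tuple[Vector, Vector]],
    labels: Sequence[int],
    lr: float = 0.5,     # hypothetical learning rate
    epochs: int = 200,   # fixed budget instead of a convergence check
) -> Tuple[List[float], float]:
    """Fit the weights with stochastic gradient descent on cross-entropy loss."""
    dim = len(pairs[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for (a, other), y in zip(pairs, labels):
            diff = [x - z for x, z in zip(a, other)]
            grad = score(w, b, a, other) - y  # d(loss)/d(logit) for Sigmoid + cross-entropy
            w = [wi - lr * grad * di for wi, di in zip(w, diff)]
            b -= lr * grad
    return w, b


# Toy data: the first vector always belongs before the second one.
intro, detail = [1.0, 0.0], [0.0, 1.0]
w, b = train_order_classifier([(intro, detail), (detail, intro)], [1, 0])
```

After training on the toy data, `score(w, b, intro, detail)` is above 0.5 and the reversed pair is below 0.5, corresponding to labels 1 and 0.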
The finally sorted module contents make the whole research report more rigorous, logical and organized, and more suitable for users to read.
Referring to fig. 7, further, step S1, obtaining the semantic calculation model, includes the following steps:
S11: constructing title-paragraph pair training data using a title acquisition method based on keyword extraction and/or a title generation method based on abstracts;
S12: using a double-tower model as the basic model, and training the double-tower model with the title-paragraph pair training data to obtain the semantic calculation model.
It should be noted that generating content through semantic vectors involves the semantic representation of titles and paragraphs. Because title-paragraph pair data are difficult to construct, the invention automatically generates large-scale training data using a title acquisition method based on keyword extraction and/or a title generation method based on abstracts, and uses the data for model training. A double-tower model is adopted as the basic model and trained to obtain the semantic calculation model that is finally used.
First, a model based on semantic vectors usually requires a large manually labeled data set for training, and such a model is prone to the out-of-domain problem: a model trained in one field performs much worse when moved to another field. So that the model learns information from a wide range of fields, unsupervised training is first performed on a large-scale, broad-domain data set. Semantically similar samples (positive samples) are contrasted with semantically dissimilar samples (negative samples), and the model structure and contrastive loss are designed so that the vector representations of semantically similar samples lie closer together in the representation space and those of semantically dissimilar samples lie farther apart, achieving a clustering-like effect.
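The contrastive objective can be illustrated with one common choice, an InfoNCE-style loss with in-batch negatives. The text does not name a specific loss, so this form, the temperature value and the function name are assumptions. Row i of the title vectors and row i of the paragraph vectors form a positive pair, and every other paragraph in the batch acts as a negative for title i.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def contrastive_loss(
    title_vecs: Sequence[Sequence[float]],
    paragraph_vecs: Sequence[Sequence[float]],
    temperature: float = 0.05,  # hypothetical value
) -> float:
    """Mean of -log softmax(similarity of the matching pair) over the batch."""
    n = len(title_vecs)
    total = 0.0
    for i in range(n):
        logits = [cosine(title_vecs[i], paragraph_vecs[j]) / temperature for j in range(n)]
        peak = max(logits)  # log-sum-exp trick for numerical stability
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / n


titles = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(titles, [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss(titles, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is close to zero when each title is nearest to its own paragraph and large when the pairs are mismatched, which is the behaviour that pulls similar samples together and pushes dissimilar samples apart.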
The title acquisition method based on key sentence extraction is as follows: given a piece of text, a key sentence extraction algorithm selects one sentence of the text as the title, and the remaining sentences serve as the document, thereby constructing a title-paragraph training sample. The title generation method based on text abstracts is as follows: an abstract generation model is first trained on an abstract data set; the model then generates an abstract for a paragraph, and the generated abstract is used as the title to construct title-paragraph training data. With these two methods, a large amount of training data can be constructed, which effectively solves the problem that the semantic calculation model lacks training data in research report generation.
The two methods complement each other well: a title obtained by the key sentence extraction algorithm better takes context information into account, while a title obtained by the abstract generation algorithm better reflects global information. The two methods are combined to obtain positive training samples.
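The key-sentence route for building a title-paragraph sample can be sketched as follows. The text does not say which key sentence extraction algorithm is used, so the scoring rule here (average corpus-internal word frequency per sentence) is a stand-in assumption, and the regular-expression sentence splitter only suits English-like punctuation.

```python
import re
from collections import Counter
from typing import List, Optional, Tuple


def split_sentences(text: str) -> List[str]:
    """Naive splitter on ., ! and ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence: str) -> List[str]:
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def make_title_paragraph_pair(text: str) -> Optional[Tuple[str, str]]:
    """Pick the highest-scoring sentence as the title; the rest form the paragraph."""
    sentences = split_sentences(text)
    if len(sentences) < 2:
        return None  # nothing would be left over for the paragraph
    freq = Counter(word for s in sentences for word in tokenize(s))

    def sentence_score(s: str) -> float:
        words = tokenize(s)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    best = max(range(len(sentences)), key=lambda i: sentence_score(sentences[i]))
    paragraph = " ".join(s for i, s in enumerate(sentences) if i != best)
    return sentences[best], paragraph


sample = "Lithium prices rose. Lithium prices drive battery costs. The weather was mild."
pair = make_title_paragraph_pair(sample)
```

Each returned pair would serve as one positive training sample; pairs produced by the abstract-based route could be added to the same training set.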
Model training requires negative samples as well as positive samples. Most random negative samples are easy negatives, which do little to improve the model. In training the semantic calculation model, appropriately increasing the difficulty of the negative examples in the training data helps to improve the model. To this end, an initial model is first trained with random negative samples, and the initial model is then used to retrieve semantically similar samples to serve as negative samples, namely hard negative examples. Because the hard negative examples are obtained by retrieval, false negatives may be introduced, that is, samples that are strongly semantically related but are wrongly treated as semantically unrelated. To solve this problem, a filtering method based on subject words is further adopted: if the overlap between the subject words of a retrieved negative example and the subject words of the corresponding positive example exceeds a certain proportion, the negative sample is discarded. The subject words are extracted with the classic TF-IDF algorithm.
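The subject-word filter can be sketched as below, assuming pre-tokenized documents. The TF-IDF variant (term frequency times log(N/df)), the number of subject words per document and the overlap ratio are illustrative assumptions; the text only states that TF-IDF is used and that a negative is discarded when the overlap exceeds a certain proportion.

```python
import math
from collections import Counter
from typing import List, Sequence, Set


def subject_words(docs: Sequence[Sequence[str]], index: int, k: int = 3) -> Set[str]:
    """Top-k words of docs[index] ranked by TF-IDF (ties broken alphabetically)."""
    n_docs = len(docs)
    doc_freq: Counter = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))
    term_freq = Counter(docs[index])
    total = len(docs[index])
    scores = {
        word: (count / total) * math.log(n_docs / doc_freq[word])
        for word, count in term_freq.items()
    }
    ranked = sorted(scores, key=lambda word: (-scores[word], word))
    return set(ranked[:k])


def is_false_negative(positive: Set[str], negative: Set[str], max_overlap: float = 0.5) -> bool:
    """Flag a retrieved negative whose subject words overlap too much with the positive."""
    if not positive:
        return False
    return len(positive & negative) / len(positive) > max_overlap


docs: List[List[str]] = [
    ["lithium", "battery", "price", "the", "market"],   # positive example
    ["lithium", "battery", "cost", "the", "market"],    # retrieved negative, same topic
    ["wheat", "harvest", "yield", "the", "market"],     # retrieved negative, other topic
]
positive_words = subject_words(docs, 0)
keep_mask = [not is_false_negative(positive_words, subject_words(docs, i)) for i in (1, 2)]
```

In the toy corpus the second document shares two of three subject words with the positive example and is discarded, while the third shares none and is kept as a hard negative.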
Specifically, in order to establish the order of the generated module contents, a Pairwise approach is adopted to train the semantic calculation model. The Pairwise approach approximates the ordering problem as a classification problem. For the multiple paragraphs corresponding to a module, any two paragraphs are combined into a paragraph pair as an input sample, and the pair is jointly encoded by BERT to obtain a deep interactive semantic representation. A binary classifier is then learned to classify all paragraph pairs, yielding a set of partial order relations from which the ordering is constructed. For an input paragraph pair AB, the binary classifier gives a classification label of 1 or 0 according to whether A precedes B. The principle of the method is that, for a given document set S, the ordering error is reduced by reducing the number of out-of-order document pairs in the ordering, thereby optimizing the ordering result.
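One simple way to turn the pairwise labels into a full ordering is to count, for each paragraph, how many pairwise comparisons it wins and sort by that count. The text does not state how the partial order relations are aggregated, so the win-count rule is an assumption, and `precedes` stands in for the trained binary classifier.

```python
from itertools import combinations
from typing import Callable, List, Sequence


def order_paragraphs(
    paragraphs: Sequence[str],
    precedes: Callable[[str, str], int],
) -> List[str]:
    """Order paragraphs by the number of pairwise comparisons each one wins.

    `precedes(a, b)` returns 1 if a should come before b, otherwise 0.
    """
    wins = [0] * len(paragraphs)
    for i, j in combinations(range(len(paragraphs)), 2):
        if precedes(paragraphs[i], paragraphs[j]) == 1:
            wins[i] += 1
        else:
            wins[j] += 1
    # More wins means earlier in the report; ties keep the original order.
    order = sorted(range(len(paragraphs)), key=lambda k: (-wins[k], k))
    return [paragraphs[k] for k in order]


# Stand-in classifier for the demo: a fixed reference ranking.
reference_rank = {"background": 0, "analysis": 1, "outlook": 2}


def toy_classifier(a: str, b: str) -> int:
    return 1 if reference_rank[a] < reference_rank[b] else 0


ordered = order_paragraphs(["outlook", "background", "analysis"], toy_classifier)
```

When the classifier is perfectly consistent, the win counts reproduce the reference order and the result contains no out-of-order pairs; with a noisy classifier the count acts as a majority vote over the pairwise judgments.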
In summary, the research report generation method provided by the invention can quickly find the information the user is concerned with in a given material library through the semantic calculation model and automatically complete the writing of the research report. The method uses the semantic calculation model to obtain semantic vectors of text and then generates content based on the semantic vectors. Through semantic calculation, the invention realizes intelligent information processing technologies such as outline generation, outline-to-content generation and content ordering, thereby efficiently integrating a large amount of information and finally generating a high-quality, readable research report for the user's reference.
As shown in fig. 8, the second embodiment of the present invention further provides a research report generation system 1 for implementing the steps of the aforementioned research report generation method, which includes the following modules:
the retrieval module 10: acquiring a material library and retrieval words and sentences, and searching the material library for related content based on the retrieval words and sentences to form a research report subset;
the outline generation module 20: obtaining an outline through semantic calculation based on the research report subset;
the content generation module 30: recalling paragraphs in the research report subset based on the semantic vector corresponding to each outline;
the content ordering module 40: sorting and combining the recalled paragraphs to form the module content corresponding to each outline;
the combination module 50: combining the module contents into the research report text.
The system 1 has the same advantages as the above-mentioned method, and will not be described herein.
The third embodiment of the present invention further provides a computer readable storage medium, on which a computer program is stored, which when executed by a processor, implements a research report generation method as described above.
The computer storage medium has the same beneficial effects as the aforementioned generation method of a report, and is not described herein again.
In the embodiments provided herein, it should be understood that "B corresponding to A" means that B is associated with A and that B can be determined from A. It should also be understood, however, that determining B from A does not mean determining B from A alone; B may also be determined from A and/or other information.
It should be appreciated that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. Those skilled in the art should also appreciate that the embodiments described in this specification are exemplary embodiments in nature, and that the acts and modules involved are not necessarily essential to the invention.
In various embodiments of the present invention, it should be understood that the sequence numbers of the above-mentioned processes do not imply an inevitable order of execution, and the execution order of the processes should be determined by their functions and inherent logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.
The flowchart and block diagrams in the figures of the present application illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will be understood that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Compared with the prior art, the research report generation method, the research report generation system and the computer storage medium have the following beneficial effects:
1. According to the research report generation method provided by the embodiment of the invention, the information the user is concerned with is quickly found in a given material library, and the writing of the research report is completed automatically. The method obtains semantic vectors of the text and then generates content based on the semantic calculation model; through semantic calculation, intelligent information processing technologies such as outline generation, outline-to-content generation and content ordering are realized, so that a large amount of information is integrated efficiently and a high-quality, readable research report is finally generated for the user's reference.
2. According to the research report generation method provided by the embodiment of the invention, the information the user wants to focus on is first retrieved, and the research report is then generated; that is, unnecessary content in the material library is removed, which improves the generation efficiency of the research report, yields a research report of better quality, and achieves efficient organization of the material library content.
3. According to the research report generation method provided by the embodiment of the invention, the outline is generated through the semantic calculation model, which makes the research report easier to read and lets the user look up related content based on the outline.
4. According to the research report generation method provided by the embodiment of the invention, the content of the research report is divided more finely by generating a second-level outline, so that the structure of the subsequently obtained research report is more logical.
5. According to the research report generation method provided by the embodiment of the invention, when content is generated for the outline, paragraphs in the research report subset are used as the minimum unit, so that more accurate content is obtained, irrelevant paragraphs are excluded, and the quality and readability of the generated research report are improved.
6. According to the research report generation method provided by the embodiment of the invention, the similarity between paragraphs is calculated from the semantic vectors, and redundant paragraphs with similar semantics are removed, which streamlines the content of the research report, improves its quality, and makes it easier for the user to read.
7. The embodiment of the present invention further provides a system for generating a report, which has the same beneficial effects as the above-mentioned method for generating a report, and is not described herein again.
8. The embodiment of the present invention further provides a computer storage medium, which has the same beneficial effects as the above-mentioned research report generation method, and is not described herein again.
The above detailed description is provided for a research and report generating method, system and computer storage medium disclosed in the embodiments of the present invention, and the present application describes the principle and implementation manner of the present invention by applying specific embodiments, and the description of the above embodiments is only used to help understanding the method and the core idea of the present invention; meanwhile, for those skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and applications, and in view of the above, the content of the present specification should not be construed as a limitation to the present invention, and any modifications, equivalent substitutions and improvements made within the principle of the present invention should be included in the protection scope of the present invention.

Claims (8)

1. A research report generation method, characterized by comprising the following steps:
obtaining a semantic calculation model; specifically, constructing title-paragraph pair training data using a title acquisition method based on keyword extraction and/or a title generation method based on abstracts, adopting a double-tower model as the basic model, and training the double-tower model with the title-paragraph pair training data to obtain the semantic calculation model;
acquiring a material library and retrieval words and sentences, and searching the material library for related content based on the retrieval words and sentences to form a research report subset;
obtaining an outline through the semantic calculation model based on the research report subset; specifically, extracting the titles in the research report subset, obtaining vectors of all the titles through the semantic calculation model, clustering the titles based on the vectors to obtain at least one cluster, calculating the semantic distance between each title in a cluster and the cluster center, wherein the semantic distance is cosine similarity, and selecting the title closest to the cluster center in each cluster as an outline title to form a first-level outline;
recalling paragraphs in the research report subset corresponding to the outline;
sorting and combining the recalled paragraphs to form the module content of the corresponding outline;
and combining the module contents into the research report text.
2. The research report generation method as recited in claim 1, wherein searching the material library for related content based on the retrieval words and sentences to form the research report subset comprises the following steps:
vectorizing the content of the material library, the retrieval words and sentences to obtain text vectors and retrieval word and sentence vectors;
carrying out similarity matching on the text vector and the retrieval word and sentence vector;
and selecting the content whose similarity is within a preset range, and sorting it by similarity to form the research report subset.
3. The research report generation method as recited in claim 1, wherein the vectors of the titles are clustered using the K-Means algorithm.
4. The research report generation method as recited in claim 1, wherein after the title closest in semantic distance to the cluster center in each cluster is selected as the first-level outline, the method further comprises the following steps:
if the cluster has other titles besides the outline title, taking the rest titles as candidate secondary titles, and further obtaining the candidate secondary titles corresponding to the outline title in each cluster;
judging the similarity of every two candidate secondary titles, if the similarity is greater than a preset value, connecting edges of the two candidate secondary titles, and forming a bipartite graph after iteration, wherein nodes of the bipartite graph represent the candidate secondary titles, and the edges represent the similarity of the two candidate secondary titles;
and taking the out degree of the node as the information content contained in the corresponding candidate secondary title, calculating the bipartite graph through a greedy algorithm to obtain a secondary title set with the most information content and the least number of the candidate secondary titles, and taking the secondary title set as a second-level outline.
5. The research report generation method as recited in claim 1, wherein recalling paragraphs in the research report subset corresponding to the outline based on the semantic vectors comprises the following steps:
obtaining a module title based on the outline, and generating a corresponding title semantic vector according to the module title based on a semantic calculation model;
obtaining paragraph semantic vectors of paragraphs of a research subset;
and calculating semantic distances between the title semantic vector and the paragraph semantic vector, and selecting the paragraphs with the semantic distances within a preset range as the paragraphs of the corresponding module titles.
6. The research report generation method according to claim 5, wherein sorting and combining the recalled paragraphs to form the module content of the corresponding outline comprises the following steps:
screening the paragraphs selected for each module title; specifically, calculating the similarity between paragraphs according to their semantic vectors, and if the similarity is greater than a preset value, removing the shorter paragraph;
combining the screened paragraphs pairwise, encoding each pair to obtain a deep interactive semantic representation, and judging the ordering relation using a binary classifier;
and finally, taking the sorted paragraphs as the module content of the corresponding module title.
7. A research report generation system for implementing the steps of the research report generation method as recited in any one of claims 1 to 6, wherein the system comprises the following modules:
a retrieval module: acquiring a material library and retrieval words and sentences, and searching the material library for related content based on the retrieval words and sentences to form a research report subset;
an outline generation module: obtaining an outline through semantic calculation based on the research report subset;
a content generation module: recalling paragraphs in the research report subset based on the semantic vector corresponding to each outline;
a content ordering module: sorting and combining the recalled paragraphs to form the module content corresponding to each outline;
a combination module: combining the module contents into the research report text.
8. A computer-readable storage medium having stored thereon a computer program, characterized in that the computer program, when executed by a processor, implements the research report generation method according to any one of claims 1 to 6.
CN202211210980.9A 2022-09-30 2022-09-30 Research and report generation method, system and computer storage medium Active CN115270738B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211210980.9A CN115270738B (en) 2022-09-30 2022-09-30 Research and report generation method, system and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211210980.9A CN115270738B (en) 2022-09-30 2022-09-30 Research and report generation method, system and computer storage medium

Publications (2)

Publication Number Publication Date
CN115270738A CN115270738A (en) 2022-11-01
CN115270738B true CN115270738B (en) 2023-02-03

Family

ID=83757950

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211210980.9A Active CN115270738B (en) 2022-09-30 2022-09-30 Research and report generation method, system and computer storage medium

Country Status (1)

Country Link
CN (1) CN115270738B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115952279B (en) * 2022-12-02 2023-09-12 杭州瑞成信息技术股份有限公司 Text outline extraction method and device, electronic device and storage medium
CN116089599B (en) * 2023-04-07 2023-07-25 北京澜舟科技有限公司 Information query method, system and storage medium
CN116383334B (en) * 2023-06-05 2023-08-08 长沙丹渥智能科技有限公司 Method, device, computer equipment and medium for removing duplicate report
CN117473072B (en) * 2023-12-28 2024-03-15 杭州同花顺数据开发有限公司 Financial research report generation method, device, equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111190946A (en) * 2019-12-10 2020-05-22 平安医疗健康管理股份有限公司 Report generation method and device, computer equipment and storage medium
CN114492362A (en) * 2022-04-12 2022-05-13 北京澜舟科技有限公司 Method and system for generating research and report questions and answers and computer readable storage medium
CN114970467A (en) * 2022-05-30 2022-08-30 平安科技(深圳)有限公司 Composition initial draft generation method, device, equipment and medium based on artificial intelligence

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9378065B2 (en) * 2013-03-15 2016-06-28 Advanced Elemental Technologies, Inc. Purposeful computing
US11748555B2 (en) * 2021-01-22 2023-09-05 Bao Tran Systems and methods for machine content generation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
科技查新语义角色标注及其在报告自动生成系统中的应用;范午攸;《图书馆学研究》;20200515(第09期);62-66+81 *

Also Published As

Publication number Publication date
CN115270738A (en) 2022-11-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant