CN113918708B - Abstract extraction method - Google Patents

Abstract extraction method

Info

Publication number
CN113918708B
CN113918708B
Authority
CN
China
Prior art keywords
words
level
word
semantic
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111532196.5A
Other languages
Chinese (zh)
Other versions
CN113918708A (en)
Inventor
胡为民
郑喜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Dib Enterprise Risk Management Technology Co ltd
Original Assignee
Shenzhen Dib Enterprise Risk Management Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Dib Enterprise Risk Management Technology Co ltd filed Critical Shenzhen Dib Enterprise Risk Management Technology Co ltd
Priority to CN202111532196.5A
Publication of CN113918708A
Application granted
Publication of CN113918708B
Active legal status
Anticipated expiration legal status

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34Browsing; Visualisation therefor
    • G06F16/345Summarisation for human users
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/374Thesaurus
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2135Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the technical field of natural language processing, and in particular to an abstract extraction method comprising the following steps: S1, preprocessing, namely generalizing the numerical and time-type data in the bulletin text; S2, constructing a first word list; S3, constructing a word co-occurrence matrix for the first word list; S4, reducing the dimension of the word co-occurrence matrix and extracting the semantic representations of all words in the first word list; S5, repeating S2 to S4 to extract the semantic representations of all words in the bulletin text; S6, accumulating and combining the semantic representations sentence by sentence to form sentence-context semantic representations; S7, extracting the semantic representation of a keyword group input by the user; and S8, judging the similarity between the semantic representation of the keyword group and the sentence-context semantic representations and, if the similarity exceeds a set value, extracting the bulletin text sentence comprising the keyword group into the bulletin text abstract. The content of the extracted abstract is highly associated with the keywords input by the user.

Description

Abstract extraction method
Technical Field
The invention relates to the technical field of natural language processing, and in particular to an abstract extraction method.
Background
At present, the number of listed companies is increasing day by day, and listed-company bulletins, which report a company's interim or annual operating conditions in finance, business and other respects, contain a large amount of information. However, bulletin texts lack standard writing conventions and are lengthy, which is not conducive to reading and data analysis; data analysts and auditors have to extract key sentences and other information from the bulletin texts manually, so working efficiency is low. It is therefore necessary to provide an abstract extraction method for listed-company bulletin texts that compresses the bulletin text, removes the 'redundant' information that analysts and auditors do not care about, and improves their working efficiency.
Related abstract extraction methods already exist, but they mainly search the full text for sentences that contain the keywords or words semantically similar to the keywords, and extract and combine those sentences into an abstract; the main technique employed is word-vector similarity calculation. However, the existing abstract extraction methods have problems when applied to listed-company bulletin texts: only the semantic associations among keywords are considered, the semantic associations between the keywords and the paragraphs and chapters are not, and some keywords run through the whole bulletin text, so the extracted abstract content is not accurate enough.
Disclosure of Invention
The invention provides an abstract extraction method, aiming to solve the problem that the abstract content extracted by existing abstract extraction methods is not accurate enough, and to produce an abstract that is highly associated with the keywords input by the user.
An abstract extraction method comprises the following steps:
S1, preprocessing, namely generalizing the numerical and time-type data in the bulletin text;
S2, constructing a first word list;
S3, constructing a word co-occurrence matrix for the first word list;
S4, reducing the dimension of the word co-occurrence matrix, and extracting the semantic representations of all words in the first word list;
S5, repeating S2 to S4, and extracting the semantic representations of all words in the bulletin text;
S6, accumulating and combining the semantic representations sentence by sentence to form sentence-context semantic representations;
S7, extracting the semantic representation of a keyword group input by the user;
and S8, judging the similarity between the semantic representation of the keyword group and the sentence-context semantic representations, and, if the similarity exceeds a set value, extracting the bulletin text sentence comprising the keyword group into the bulletin text abstract.
In the method, the semantic representations of words are extracted, the similarity between the sentence-context semantic representations and the keyword-group semantic representation is judged, and the sentences whose similarity exceeds a set value are extracted to form the bulletin text abstract, so that the abstract content is highly associated with the keywords input by the user.
Further, the numerical values in the bulletin text Text are replaced with the Chinese-character token for numerical values, and the times in the bulletin text Text are replaced with the Chinese-character token for times;
the marks among the punctuation, as well as the enumeration comma and the colon among the stop marks, are eliminated, and the bulletin text is decomposed into sentences using the reserved stop marks as separators; the bulletin text Text is segmented into Chinese words with the jieba word segmentation method, the stop words among the Chinese words are eliminated, the words are weighted with TF-IDF, and the words are arranged from the largest weight to the smallest;
further, the step of constructing the first vocabulary in S2 includes obtaining words of 2000 words before the weight arrangement to construct the first vocabularyWords
Figure 480129DEST_PATH_IMAGE001
Whereinw i Is shown asiThe number of the words is one,w j is shown asjThe number of the words is one,nis the number of words
Figure DEST_PATH_IMAGE002
Further, the S3 includes the steps of:
for any two words w_i and w_j that appear in the same sentence, the same paragraph or the same chapter, establishing an association and constructing the word co-occurrence matrices
M_sen, M_par, M_doc,
where M_sen is the sentence-level word co-occurrence matrix, M_par is the paragraph-level word co-occurrence matrix, and M_doc is the chapter-level word co-occurrence matrix; the matrix row index i and column index j are the indexes of the two co-occurring words w_i and w_j; each matrix element is the joint probability of the two words pointed to by the row and column indexes:
M[i][j] = p(w_i, w_j).
Further, the step S4 includes reducing the dimensions of the sentence-level word co-occurrence matrix, the paragraph-level word co-occurrence matrix and the chapter-level word co-occurrence matrix respectively by principal component analysis; the dimension after the reduction is 2000 x 100, where 2000 is the number of words and 100 is the dimension of each word semantic vector; after dimension reduction, the rows of the three reduced word co-occurrence matrices are the three-level semantic representations of the words, namely their sentence-level, paragraph-level and chapter-level semantic representations; the three-level semantic representations of all the words in the first word list are extracted.
Further, the dimension reduction calculation formula is as follows:
v_k = (M_k / σ_k) · U_100
where σ_k is the standard deviation of the k-th row vector, M_k is the k-th row vector of the word co-occurrence matrix M, C is the covariance matrix of the row-normalized matrix, U_100 is the matrix formed by the first 100 eigenvector columns of the covariance matrix, and v_k is the three-level semantic representation of the k-th word in the word co-occurrence matrix.
Further, in each repetition of S5, S2 constructs a new word list consisting of the next 2000 words in the weight ranking, until all words in the bulletin text are included;
further, the statement context three-level semantic representation is
Figure 377099DEST_PATH_IMAGE015
Wherein t is the t-th word in the sentence.
The S7 comprises the steps that the user inputs a keyword group and the three-level semantic representations of all keywords of the keyword group are extracted; the three-level semantic representations of all the keywords are accumulated and combined to form the keyword-group three-level semantic representation
V_key = Σ_t v_t
where t indexes the t-th word in the keyword group.
Furthermore, a semantic similarity calculation model based on a twin neural network is constructed; the model comprises two groups of isomorphic feedback neural networks, its inputs are the sentence-context three-level semantic representation and the user keyword-group three-level semantic representation, and its output is the similarity;
the sentence-context three-level semantic representations and the user keyword-group three-level semantic representation are input, and when the similarity is greater than a set value, the sentence corresponding to the input sentence-context three-level semantic representation is extracted;
this is repeated, inputting in turn all the sentence-context three-level semantic representations in the bulletin text, until all the sentences whose sentence-context three-level semantic representations have a similarity with the user keyword-group three-level semantic representation greater than the set value have been extracted to form the bulletin text abstract.
Advantageous effects: by extracting the semantic representations of words, judging the similarity between the sentence-context semantic representations and the keyword-group semantic representation, and extracting the sentences whose similarity exceeds a set value to form the bulletin text abstract, the abstract content is highly associated with the keywords input by the user; the 'redundant' information that the user does not care about is effectively removed, and the user's working efficiency is improved.
Drawings
The invention is described in further detail below with reference to the figures and specific embodiments.
Fig. 1 is a flowchart of the present embodiment.
Fig. 2 is an architecture diagram of the twin neural network of the present embodiment.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the description of the present invention, it should be noted that the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc., indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience of description and simplicity of description, but do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In the description of the present invention, it should be noted that, unless otherwise explicitly specified or limited, the terms "mounted," "connected," and "connected" are to be construed broadly, e.g., as meaning either a fixed connection, a removable connection, or an integral connection; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art. In addition, the technical features involved in the different embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
Examples
In this embodiment, taking the public bulletin text of a listed company as an example, an abstract extraction method is provided, which specifically includes the following steps.
S1, preprocessing, namely generalizing the numerical and time-type data in the bulletin text, comprising:
replacing the numerical values in the bulletin text Text with the Chinese-character token for numerical values, and replacing the times in the bulletin text Text with the Chinese-character token for times;
eliminating the marks among the punctuation, as well as the enumeration comma and the colon among the stop marks, and decomposing the bulletin text into sentences using the reserved stop marks as separators; segmenting the bulletin text Text into Chinese words with the jieba word segmentation method, and obtaining the words of the bulletin text Text after eliminating the stop words among them;
weighting the words with TF-IDF and arranging the words from the largest weight to the smallest.
S2, constructing the first word list, comprising:
taking the 2000 words ranked highest by weight to construct the first word list Words:
Words = {w_1, w_2, ..., w_n}
where w_i denotes the i-th word, w_j denotes the j-th word, and n is the number of words, n = 2000.
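As an illustration of how S1 and S2 could be realized, the following Python sketch generalizes numbers and times, splits the bulletin text into sentences on the retained stop marks, segments it with jieba, ranks the words by TF-IDF and keeps the top 2000 as the first word list. It is a minimal sketch under stated assumptions, not the patented implementation: the regular expressions, the placeholder tokens, the stop-word list, the smoothed IDF and all helper names are introduced here for illustration only.

```python
import math
import re
from collections import Counter

import jieba  # Chinese word segmentation library named in the patent

STOPWORDS = {"的", "了", "在", "是", "和"}  # illustrative stop-word list (assumption)

def generalize(text):
    """S1: generalize numeric and time-type data by replacing them with
    Chinese-character placeholder tokens (the exact tokens are assumptions)."""
    text = re.sub(r"\d{4}年\d{1,2}月\d{1,2}日", "时间", text)  # dates   -> '时间'
    text = re.sub(r"\d+(\.\d+)?%?", "数值", text)              # numbers -> '数值'
    return text

def split_sentences(text):
    """Drop quotation-style marks, keep sentence-ending stop marks and split on them."""
    text = re.sub(r"[“”‘’《》()()\"']", "", text)
    return [s.strip() for s in re.split(r"[。！？；]", text) if s.strip()]

def tokenize(sentence):
    """jieba segmentation followed by stop-word removal."""
    return [w for w in jieba.lcut(sentence) if w.strip() and w not in STOPWORDS]

def tfidf_ranking(token_lists):
    """Weight every word with TF-IDF over the sentence collection and sort the
    words from the largest weight to the smallest (smoothed IDF is an assumption)."""
    tf, df = Counter(), Counter()
    for tokens in token_lists:
        tf.update(tokens)
        df.update(set(tokens))
    n_docs = len(token_lists)
    weights = {w: tf[w] * (math.log((1 + n_docs) / (1 + df[w])) + 1) for w in tf}
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)

def build_word_list(ranking, size=2000):
    """S2: first word list Words = {w_1, ..., w_n}, the top `size` words by weight."""
    return [word for word, _weight in ranking[:size]]

# sentences = split_sentences(generalize(bulletin_text))
# ranking   = tfidf_ranking([tokenize(s) for s in sentences])
# words     = build_word_list(ranking)   # n = 2000
```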
S3, constructing a word co-occurrence matrix for the first word list:
for any two words w_i and w_j that appear in the same sentence, the same paragraph or the same chapter, establishing an association and constructing the word co-occurrence matrices
M_sen, M_par, M_doc,
where M_sen is the sentence-level word co-occurrence matrix, M_par is the paragraph-level word co-occurrence matrix, and M_doc is the chapter-level word co-occurrence matrix;
the matrix row index i and column index j are the indexes of the two co-occurring words w_i and w_j; each matrix element is the joint probability of the two words pointed to by the row and column indexes:
M[i][j] = p(w_i, w_j).
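One plausible reading of S3 is that the joint probability p(w_i, w_j) is estimated as the fraction of units (sentences, paragraphs, or the whole chapter) in which the two words co-occur. The patent does not spell out the estimator, so this counting scheme, like the helper name below, is an assumption of the sketch.

```python
import numpy as np

def cooccurrence_matrix(units, vocab):
    """Word co-occurrence matrix for one granularity level (sentence, paragraph
    or chapter). Element [i, j] estimates the joint probability that w_i and
    w_j appear together in the same unit."""
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for unit in units:                                   # each unit is a token list
        present = {index[w] for w in unit if w in index}
        for i in present:
            for j in present:
                counts[i, j] += 1.0
    return counts / max(len(units), 1)                   # co-occurrence frequency as p(w_i, w_j)

# Three levels, one matrix per granularity:
# M_sen = cooccurrence_matrix(sentence_tokens, words)
# M_par = cooccurrence_matrix(paragraph_tokens, words)
# M_doc = cooccurrence_matrix([all_tokens], words)       # the whole chapter as one unit
```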
S4, reducing the dimension of the word co-occurrence matrices and extracting the semantic representations of all words in the first word list, comprising:
reducing the dimensions of the sentence-level word co-occurrence matrix, the paragraph-level word co-occurrence matrix and the chapter-level word co-occurrence matrix respectively by principal component analysis, the dimension after the reduction being 2000 x 100, where 2000 is the number of words and 100 is the dimension of each word semantic vector; after dimension reduction, the rows of the three reduced word co-occurrence matrices are the three-level semantic representations of the words, namely their sentence-level, paragraph-level and chapter-level semantic representations; the dimension reduction calculation formula is as follows:
v_k = (M_k / σ_k) · U_100
where σ_k is the standard deviation of the k-th row vector, M_k is the k-th row vector of the word co-occurrence matrix M, C is the covariance matrix of the row-normalized matrix, U_100 is the matrix formed by the first 100 eigenvector columns of the covariance matrix, and v_k is the three-level semantic representation of the k-th word in the word co-occurrence matrix.
S2 to S4 form a three-level semantic coding method, and the three-level semantic representations of all words in the bulletin text are extracted with this three-level semantic coding method. The text is then split with the sentence as the unit, and the three-level semantic representations of the words in each sentence context are accumulated and combined to form the sentence-context three-level semantic representation.
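As an illustration of the dimension reduction in S4 and of the accumulation into sentence-context representations just described, the following sketch normalizes each co-occurrence matrix row by its standard deviation, projects it onto the first 100 eigenvectors of the covariance matrix, and then sums the per-word vectors of a span. The explicit eigendecomposition, the exact normalization and the choice to merge the three levels by summation are assumptions of this sketch, not details fixed by the patent.

```python
import numpy as np

def reduce_dimension(M, dim=100):
    """S4: project every row of the word co-occurrence matrix M (n x n) onto
    `dim` principal components; row k of the result is the semantic vector v_k."""
    M_norm = M / (M.std(axis=1, keepdims=True) + 1e-12)   # M_k / sigma_k, row-wise
    C = np.cov(M_norm, rowvar=False)                      # covariance matrix C
    eigvals, eigvecs = np.linalg.eigh(C)
    U = eigvecs[:, np.argsort(eigvals)[::-1][:dim]]       # first `dim` eigenvector columns
    return M_norm @ U                                     # n x dim; rows are v_k

def accumulate(tokens, vocab_index, V_sen, V_par, V_doc):
    """Sentence-context accumulation: sum the sentence-, paragraph- and
    chapter-level vectors v_t of every word in the span into one 100-dim vector.
    Merging the three levels by simple addition is an assumption."""
    total = np.zeros(V_sen.shape[1])
    for V in (V_sen, V_par, V_doc):
        for w in tokens:
            if w in vocab_index:
                total += V[vocab_index[w]]
    return total

# V_sen, V_par, V_doc = (reduce_dimension(M) for M in (M_sen, M_par, M_doc))   # 2000 x 100 each
# vocab_index = {w: i for i, w in enumerate(words)}
# sentence_vectors = [accumulate(t, vocab_index, V_sen, V_par, V_doc) for t in sentence_tokens]
```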
S5, repeating S2 to S4 and extracting the semantic representations of all words in the bulletin text; in each repetition, S2 constructs a new word list consisting of the next 2000 words in the weight ranking, until all words in the bulletin text are included.
S6, accumulating and combining the semantic representations sentence by sentence to form the sentence-context three-level semantic representations;
the sentence-context three-level semantic representation is as follows:
V_sentence = Σ_t v_t
where t indexes the t-th word in the sentence and v_t is its three-level semantic representation.
S7, extracting the semantic representation of the keyword group input by the user, comprising:
the user inputs a keyword group, and the three-level semantic representations of all keywords of the keyword group are extracted; the three-level semantic representations of all the keywords are accumulated and combined to form the keyword-group three-level semantic representation;
the keyword-group three-level semantic representation is as follows:
V_key = Σ_t v_t
where t indexes the t-th word in the keyword group.
S8, judging the similarity between the sentence-context semantic representations and the keyword-group semantic representation, and extracting the sentences whose similarity exceeds a set value to form the bulletin text abstract, comprising:
constructing a semantic similarity calculation model based on a twin neural network, denoted Similarity(Text, key-words);
the semantic similarity calculation model based on the twin neural network comprises two groups of isomorphic feedback neural networks; its inputs are the sentence-context three-level semantic representation and the user keyword-group three-level semantic representation, and its output is the similarity. Specifically, the model consists of two independent parallel input layers, two independent parallel hidden layers and one output layer; the input layer dimension is 1 x 100 and the hidden layer dimension is 1 x 10; each of the two independent parallel input layers is connected to its hidden layer through a Sigmoid() activation, and the two independent parallel hidden layers are jointly connected to the output layer through a Sigmoid() activation; the output layer is trained with a cross-entropy loss function, and its output is the similarity;
the sentence-context three-level semantic representations and the user keyword-group three-level semantic representation are used as the inputs of the model, the model is trained, and the similarity Similarity(Text, key-words) between a sentence-context three-level semantic representation and the user keyword-group three-level semantic representation is calculated by the model; specifically, the sentence-context three-level semantic representation is fed into one of the two independent parallel input layers, and the keyword-group three-level semantic representation is fed into the other input layer;
the similarity between the keyword-group semantic representation and the sentence-context semantic representation is judged against a set value of 0.7; when
Similarity(Text, key-words) > 0.7,
the corresponding bulletin text sentence is extracted into the bulletin text abstract.
S6 to S8 form an abstract extraction method based on context semantic similarity calculation, which extracts the sentences containing key information in the bulletin text to form the abstract.
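A minimal PyTorch sketch of the twin-network similarity model and the selection step described in S8: two weight-independent 100 -> 10 branches with Sigmoid activations, a shared Sigmoid output neuron giving the similarity, and a loop that keeps every sentence scoring above the set value of 0.7. The layer sizes, activations and the threshold follow the text; treating the two branches as plain feedforward layers, the use of a binary cross-entropy loss, the optimizer, the training labels and all names are assumptions of this sketch, not the patented implementation.

```python
import torch
import torch.nn as nn

class TwinSimilarity(nn.Module):
    """Semantic similarity model with two independent, parallel branches
    (input 1 x 100, hidden 1 x 10, Sigmoid activations, Sigmoid output)."""

    def __init__(self, in_dim=100, hidden_dim=10):
        super().__init__()
        # Two independent parallel input->hidden branches, Sigmoid activated.
        self.branch_text = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.Sigmoid())
        self.branch_keys = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.Sigmoid())
        # Both hidden layers jointly feed one Sigmoid output neuron: the similarity.
        self.out = nn.Sequential(nn.Linear(2 * hidden_dim, 1), nn.Sigmoid())

    def forward(self, sent_vec, key_vec):
        h = torch.cat([self.branch_text(sent_vec), self.branch_keys(key_vec)], dim=-1)
        return self.out(h).squeeze(-1)

def extract_summary(sentences, sentence_vectors, key_vec, model, threshold=0.7):
    """Keep every bulletin sentence whose context representation the trained
    model judges similar enough to the user's keyword group (similarity > 0.7)."""
    summary = []
    key_t = torch.as_tensor(key_vec, dtype=torch.float32)
    with torch.no_grad():
        for sentence, vec in zip(sentences, sentence_vectors):
            sim = model(torch.as_tensor(vec, dtype=torch.float32), key_t).item()
            if sim > threshold:
                summary.append(sentence)
    return summary

# One plausible reading of the cross-entropy output layer: train on labelled
# (sentence, keyword-group, relevant?) pairs with a binary cross-entropy loss.
# model = TwinSimilarity()
# loss_fn = nn.BCELoss()
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# sim = model(sentence_batch, keyword_batch)        # shapes (B, 100) each
# loss = loss_fn(sim, labels.float()); loss.backward(); optimizer.step()
```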
The abstract extraction method provided by this embodiment extracts the sentence-level, paragraph-level and chapter-level semantic representations of words with the three-level semantic coding method, judges the similarity between the sentence-context semantic representations and the keyword-group semantic representation with the abstract extraction method based on context semantic similarity calculation, and extracts the sentences whose similarity exceeds the set value to form the abstract of the bulletin text.
The abstract extraction method provided by this embodiment considers the relevance of the keywords input by the user to sentences, paragraphs and chapters, and accurately extracts the sentences highly relevant to the keywords input by the user to form the abstract text; the 'redundant' information that the user does not care about is effectively removed, and the user's working efficiency is improved.
It should be understood that the above examples are only for clarity of illustration and are not intended to limit the embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. And are neither required nor exhaustive of all embodiments. And obvious variations or modifications of the invention may be made without departing from the spirit or scope of the invention.

Claims (7)

1. An abstract extraction method, characterized by comprising the following steps:
S1, preprocessing, namely generalizing the numerical and time-type data in the bulletin text;
S2, constructing a first word list;
S3, constructing a word co-occurrence matrix for the first word list;
S4, reducing the dimension of the word co-occurrence matrix, and extracting the semantic representations of all words in the first word list;
S5, repeating S2 to S4, and extracting the semantic representations of all words in the bulletin text;
S6, accumulating and combining the semantic representations sentence by sentence to form sentence-context semantic representations;
S7, extracting the semantic representation of a keyword group input by the user;
S8, judging the similarity between the semantic representation of the keyword group and the sentence-context semantic representations, and, if the similarity exceeds a set value, extracting the bulletin text sentence comprising the keyword group into the bulletin text abstract;
the S1 includes the steps of,
replacement bulletin textTextThe numerical value in the text is a Chinese character numerical value, and the announcement text is replacedTextThe middle time is the Chinese character time;
eliminating the mark number in the punctuation mark, and the pause number and the colon number in the point number, and decomposing the bulletin text into sentences by using the reserved point number as a separator; bulletin text by adopting jieba word-separating methodTextPerforming Chinese word segmentation, after eliminating stop words in the Chinese words, weighting the words by adopting TFIDF, and arranging the words from large to small according to weight;
the step of S2 constructing the first vocabulary includes that the words of 2000 before the weight arrangement are obtained to construct the first vocabularyWords
Figure DEST_PATH_IMAGE001
Whereinw i Is shown asiThe number of the words is one,w j is shown asjThe number of the words is one,nis the number of words
Figure 997326DEST_PATH_IMAGE002
the S3 includes the steps of:
for any two words w_i and w_j that appear in the same sentence, the same paragraph or the same chapter, establishing an association and constructing the word co-occurrence matrices
M_sen, M_par, M_doc,
where M_sen is the sentence-level word co-occurrence matrix, M_par is the paragraph-level word co-occurrence matrix, and M_doc is the chapter-level word co-occurrence matrix;
the matrix row index i and column index j are the indexes of the two co-occurring words w_i and w_j, and each matrix element is the joint probability of the two words pointed to by the row and column indexes:
M[i][j] = p(w_i, w_j).
2. The abstract extraction method as claimed in claim 1, wherein the S4 includes reducing the dimensions of the sentence-level word co-occurrence matrix, the paragraph-level word co-occurrence matrix and the chapter-level word co-occurrence matrix respectively by principal component analysis, the dimension after the reduction being 2000 x 100, where 2000 is the number of words and 100 is the dimension of each word semantic vector; after dimension reduction, the rows of the three reduced word co-occurrence matrices are the three-level semantic representations of the words, namely their sentence-level, paragraph-level and chapter-level semantic representations; and the three-level semantic representations of all the words in the first word list are extracted.
3. The abstract extraction method as claimed in claim 2, wherein the dimension reduction calculation formula is as follows:
v_k = (M_k / σ_k) · U_100
where σ_k is the standard deviation of the k-th row vector, M_k is the k-th row vector of the word co-occurrence matrix M, C is the covariance matrix of the row-normalized matrix, U_100 is the matrix formed by the first 100 eigenvector columns of the covariance matrix, and v_k is the three-level semantic representation of the k-th word in the word co-occurrence matrix.
4. The abstract extraction method as claimed in claim 3, wherein in step S5, each repetition of step S2 constructs a new word list consisting of the next 2000 words in the weight ranking, until all words in the bulletin text are included.
5. The abstract extraction method as claimed in claim 4, wherein the semantic representations are accumulated and combined sentence by sentence to form the sentence-context three-level semantic representation; the sentence-context three-level semantic representation is
V_sentence = Σ_t v_t
where t indexes the t-th word in the sentence and v_t is its three-level semantic representation.
6. The abstract extraction method as claimed in claim 5, wherein the step S7 includes: the user inputs a keyword group, and the three-level semantic representations of all keywords of the keyword group are extracted; the three-level semantic representations of all the keywords are accumulated and combined to form the keyword-group three-level semantic representation; the keyword-group three-level semantic representation is
V_key = Σ_t v_t
where t indexes the t-th word in the keyword group.
7. The abstract extraction method as claimed in claim 6, wherein the S8 includes:
constructing a semantic similarity calculation model based on a twin neural network, the model comprising two groups of isomorphic feedback neural networks, its inputs being the sentence-context three-level semantic representation and the user keyword-group three-level semantic representation, and its output being the similarity;
inputting the sentence-context three-level semantic representations and the user keyword-group three-level semantic representation, and extracting the sentence corresponding to the input sentence-context three-level semantic representation when the similarity is greater than a set value;
and repeating the above, inputting in turn all the sentence-context three-level semantic representations in the bulletin text, until all the sentences whose sentence-context three-level semantic representations have a similarity with the user keyword-group three-level semantic representation greater than the set value have been extracted to form the bulletin text abstract.
CN202111532196.5A 2021-12-15 2021-12-15 Abstract extraction method Active CN113918708B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111532196.5A CN113918708B (en) 2021-12-15 2021-12-15 Abstract extraction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111532196.5A CN113918708B (en) 2021-12-15 2021-12-15 Abstract extraction method

Publications (2)

Publication Number Publication Date
CN113918708A CN113918708A (en) 2022-01-11
CN113918708B (en) 2022-03-22

Family

ID=79248937

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111532196.5A Active CN113918708B (en) 2021-12-15 2021-12-15 Abstract extraction method

Country Status (1)

Country Link
CN (1) CN113918708B (en)

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102646114A (en) * 2012-02-17 2012-08-22 清华大学 News topic timeline abstract generating method based on breakthrough point
CN104679730A (en) * 2015-02-13 2015-06-03 刘秀磊 Webpage summarization extraction method and device thereof
CN110069622A (en) * 2017-08-01 2019-07-30 武汉楚鼎信息技术有限公司 A kind of personal share bulletin abstract intelligent extract method
CN110188349A (en) * 2019-05-21 2019-08-30 清华大学深圳研究生院 A kind of automation writing method based on extraction-type multiple file summarization method
CN110851598B (en) * 2019-10-30 2023-04-07 深圳价值在线信息科技股份有限公司 Text classification method and device, terminal equipment and storage medium
CN111259136B (en) * 2020-01-09 2024-03-22 信阳师范学院 Method for automatically generating theme evaluation abstract based on user preference
CN111460131A (en) * 2020-02-18 2020-07-28 平安科技(深圳)有限公司 Method, device and equipment for extracting official document abstract and computer readable storage medium
US11586829B2 (en) * 2020-05-01 2023-02-21 International Business Machines Corporation Natural language text generation from a set of keywords using machine learning and templates

Also Published As

Publication number Publication date
CN113918708A (en) 2022-01-11

Similar Documents

Publication Publication Date Title
CN110717047B (en) Web service classification method based on graph convolution neural network
Daud et al. Urdu language processing: a survey
Maekawa et al. Balanced corpus of contemporary written Japanese
US8595245B2 (en) Reference resolution for text enrichment and normalization in mining mixed data
CN113704451B (en) Power user appeal screening method and system, electronic device and storage medium
Golpar-Rabooki et al. Feature extraction in opinion mining through Persian reviews
CN113918708B (en) Abstract extraction method
Melero et al. Holaaa!! writin like u talk is kewl but kinda hard 4 NLP
Mollaei et al. Question classification in Persian language based on conditional random fields
Singkul et al. Parsing thai social data: A new challenge for thai nlp
CN115952794A (en) Chinese-Tai cross-language sensitive information recognition method fusing bilingual sensitive dictionary and heterogeneous graph
JP6168057B2 (en) Failure occurrence cause extraction device, failure occurrence cause extraction method, and failure occurrence cause extraction program
CN113836941B (en) Contract navigation method and device
Das et al. An improvement of Bengali factoid question answering system using unsupervised statistical methods
Akhtar et al. A machine learning approach for Urdu text sentiment analysis
Ali et al. Word embedding based new corpus for low-resourced language: Sindhi
CN115908027A (en) Financial data consistency auditing module of financial long text rechecking system
CN115619443A (en) Company operation prediction method and system for emotion analysis based on annual report of listed company
Hamza et al. Text mining: A survey of Arabic root extraction algorithms
CN111428475A (en) Word segmentation word bank construction method, word segmentation method, device and storage medium
Saneifar et al. From terminology extraction to terminology validation: an approach adapted to log files
Seresangtakul et al. Thai-Isarn dialect parallel corpus construction for machine translation
Modrzejewski Improvement of the Translation of Named Entities in Neural Machine Translation
Amezian et al. Towards a large Biscript Moroccan Lexicon
CN115221335A (en) Construction method of knowledge graph

Legal Events

Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract
  Application publication date: 20220111
  Assignee: Shenzhen Mingji Agricultural Development Co.,Ltd.
  Assignor: SHENZHEN DIB ENTERPRISE RISK MANAGEMENT TECHNOLOGY CO.,LTD.
  Contract record no.: X2023980049635
  Denomination of invention: A Summary Extraction Method
  Granted publication date: 20220322
  License type: Common License
  Record date: 20231204