CN110069623B - Abstract text generation method and device, storage medium and computer equipment - Google Patents


Info

Publication number
CN110069623B
CN110069623B (application CN201711278814.1A)
Authority
CN
China
Prior art keywords
text
module
key
texts
sentence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711278814.1A
Other languages
Chinese (zh)
Other versions
CN110069623A (en)
Inventor
刘康
赵占平
窦晓妍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201711278814.1A priority Critical patent/CN110069623B/en
Priority to PCT/CN2018/119214 priority patent/WO2019109918A1/en
Publication of CN110069623A publication Critical patent/CN110069623A/en
Application granted granted Critical
Publication of CN110069623B publication Critical patent/CN110069623B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval of unstructured textual data
    • G06F 16/34: Browsing; Visualisation therefor
    • G06F 16/345: Summarisation for human users
    • G06F 16/35: Clustering; Classification
    • G06F 17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a summary text generation method and apparatus, a computer-readable storage medium, and computer equipment. The method comprises the following steps: acquiring a normalized text and a corresponding category label; querying preset paradigm features corresponding to the category label; extracting key text from the normalized text according to the paradigm features; identifying the text category to which the normalized text belongs; and combining the extracted key text according to a template corresponding to the text category to obtain the summary text. The scheme provided by the application can improve the efficiency of rewriting text.

Description

Abstract text generation method and device, storage medium and computer equipment
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for generating a summary text, a computer-readable storage medium, and a computer device.
Background
With the rapid development of the internet, more and more information is published on the network. As terminals receive ever more information from the internet, quickly extracting the key information from this redundant information becomes very important.
Traditionally, a worker with a high level of expertise refines the published information, rewrites it with the help of a simple template, and then sends the rewritten text to a terminal. As more and more information is published on the network, this reliance on manual rewriting is clearly inefficient.
Disclosure of Invention
In view of the above, it is necessary to provide a method and an apparatus for generating a summary text, a computer-readable storage medium, and a computer device, for solving the technical problem that the existing rewriting method is inefficient.
A summary text generation method comprises the following steps:
acquiring a normalized text and a corresponding category label;
querying preset paradigm features corresponding to the category label;
extracting key text from the normalized text according to the paradigm features;
identifying the text category to which the normalized text belongs;
and combining the extracted key text according to a template corresponding to the text category to obtain the summary text.
A summary text generation apparatus comprises:
an acquisition module, configured to acquire a normalized text and a corresponding category label;
a query module, configured to query preset paradigm features corresponding to the category label;
an extraction module, configured to extract key text from the normalized text according to the paradigm features;
a recognition module, configured to recognize the text category to which the normalized text belongs;
and a splicing module, configured to splice the extracted key text according to a template corresponding to the text category to obtain the summary text.
A computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to perform the steps of the above summary text generation method.
A computer device comprises a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the above summary text generation method.
According to the above summary text generation method and apparatus, computer-readable storage medium, and computer equipment, key text can be extracted from the normalized text by means of the queried paradigm features corresponding to it, and once the text category of the normalized text has been identified, the extracted key text can be spliced according to the template corresponding to that category to obtain the summary text. Because no manual participation is needed in the whole process of generating the summary text, the efficiency of rewriting text can be greatly improved.
Drawings
FIG. 1 is a diagram of an application environment of a method for generating a digest text in one embodiment;
FIG. 2 is a flowchart illustrating a method for generating abstract text according to an embodiment;
FIG. 3 is a diagram illustrating an interface of templates corresponding to normalized texts of rights assignment classes in one embodiment;
FIG. 4 is a flowchart illustrating the steps for obtaining a normalized text and corresponding category labels in one embodiment;
FIG. 5 is a flowchart illustrating the steps of combining the extracted key texts according to the templates corresponding to the text categories to obtain the abstract texts in one embodiment;
FIG. 6 is a flowchart illustrating a method for generating a summary text in an exemplary embodiment;
FIG. 7 is a schematic diagram illustrating generation of a corresponding summary text from an announcement file published by an exchange according to an embodiment of the present application;
FIG. 8 is a schematic diagram illustrating generation of a corresponding summary text from an announcement file issued by an exchange according to an embodiment of the present application;
FIG. 9 is a schematic diagram of key text extracted from the text content of an announcement file issued by an exchange according to an embodiment of the present application;
FIG. 10 is a schematic diagram of extracting key texts from an announcement file according to an embodiment of the present application;
FIG. 11 is a schematic diagram illustrating splicing of key texts extracted from an announcement file according to an embodiment of the present application;
FIG. 12 is a schematic diagram illustrating identification of the text category of an announcement file according to an embodiment of the present application;
FIG. 13 is a schematic diagram illustrating merging of the extracted key texts according to an embodiment of the present application;
FIG. 14 is a diagram illustrating matching of semantic words with abstract text according to an embodiment of the present application;
FIG. 15 is a diagram illustrating matching of semantic words for a side-by-side structure of abstract text according to an exemplary embodiment of the present application;
FIG. 16 is a block diagram showing the construction of an abstract text generation apparatus according to an embodiment;
FIG. 17 is a block diagram showing the construction of an abstract text generation apparatus according to an embodiment;
FIG. 18 is a block diagram showing the construction of a digest text generation apparatus according to an embodiment;
FIG. 19 is a block diagram showing the construction of an abstract text generation apparatus according to an embodiment;
FIG. 20 is a block diagram showing the structure of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Fig. 1 is an application environment diagram of a digest text generation method in one embodiment. Referring to fig. 1, the digest text generation method is applied to a digest text generation system. The digest text generation system includes a terminal 110 and a server 120. The terminal 110 and the server 120 are connected through a network. The terminal 110 may specifically be a desktop terminal or a mobile terminal, and the mobile terminal may specifically be at least one of a mobile phone, a tablet computer, a notebook computer, and the like. The server 120 may be implemented as a stand-alone server or a server cluster composed of a plurality of servers.
As shown in fig. 2, in one embodiment, a summary text generation method is provided. The embodiment is mainly illustrated by applying the method to the server 120 in fig. 1. Referring to fig. 2, the method for generating the abstract text specifically includes the following steps:
s202, acquiring the normalized text and the corresponding category label.
The normalized text may be text content with a fixed paradigm, i.e., text content whose components are specified. For example, the normalized text may specifically be the text content of an announcement file about a listed company issued by an exchange, or the text content of a normative legal document. Announcement files may include a transaction reminder file, an exchange announcement file, a regulatory information file, a listed company information file, a margin financing information file, a fund information file, a transaction information disclosure file, or a bond information file, etc. Normative legal documents include civil adjudication documents, administrative adjudication documents, criminal adjudication documents, arbitration legal documents, notary legal documents, litigation legal documents, non-litigation legal documents, corporate management documents, corporate liquidation documents, patent application publications, or patent grant publications, and the like.
The category label is a label used to classify files having different fixed paradigms. Files with different category labels have different paradigms, while files with the same category label have the same paradigm. The server can obtain the category label carried by the normalized text when obtaining that text.
For example, among the announcement files issued by an exchange, the listed company information file corresponds to category label A, and the margin financing information file corresponds to category label B. Then the listed company information file about a first company issued by the exchange and the margin financing information file are in different paradigms and correspond to category label A and category label B, respectively, while the listed company information files about the first company and a second company issued by the exchange are in the same paradigm and both correspond to category label A.
In an embodiment, the server may capture a paradigm file over the network, determine the category label corresponding to the captured paradigm file, and extract the normalized text from it; the category label corresponding to the paradigm file is the category label corresponding to the normalized text.
In an embodiment, the server may monitor a webpage publishing the paradigm file, capture the HTML content corresponding to the webpage in real time, analyze the tags in the captured HTML content to find the download link corresponding to the paradigm file published by the webpage, and download the paradigm file via the link. The paradigm file may be a file in Portable Document Format (PDF), and the PDF file is converted into a file in plain text (TXT) format to obtain the normalized text.
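As an illustrative sketch of the link-discovery step (the class and function names are assumptions, not part of the disclosure), the HTML captured from the monitored webpage can be scanned for PDF download links with Python's standard html.parser; converting the downloaded PDF into TXT would additionally require a PDF library such as pdfminer.six, which is not shown here:

```python
from html.parser import HTMLParser

class PdfLinkFinder(HTMLParser):
    """Collects href attributes of <a> tags that point at PDF files."""
    def __init__(self):
        super().__init__()
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and value.lower().endswith(".pdf"):
                    self.pdf_links.append(value)

def find_pdf_links(html: str) -> list:
    parser = PdfLinkFinder()
    parser.feed(html)
    return parser.pdf_links
```

Each link found this way would then be downloaded and handed to the PDF-to-TXT conversion step.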
S204, querying the preset paradigm features corresponding to the category label.
A paradigm feature is a feature that characterizes the paradigm of the normalized text, and can be used to locate the key text within it. The preset paradigm features are pre-stored paradigm features corresponding to category labels, and different category labels typically correspond to different paradigm features. Specifically, a paradigm feature may include the paragraph position of a key paragraph in the normalized text, a sequence text cue word, or a keyword.
For example, among the announcement files issued by an exchange, the listed company information file corresponds to category label A. After the server extracts the normalized text from the listed company information file, the queried paradigm features corresponding to category label A may include: the position of the key paragraph in the listed company information text, namely the summary section of the text; the sequence text cue word, namely "important content tips"; and the keywords, such as "transfer," "equity," and the like.
In an embodiment, after acquiring the normalized text and the category label corresponding to the normalized text, the server may search a mapping relationship between the category label and the paradigm feature, which are stored in advance, and query the paradigm feature corresponding to the category label according to the mapping relationship.
In an embodiment, after obtaining the normalized file, the server may read a file identification number corresponding to the normalized file, and determine, according to the file identification number, a category tag corresponding to the normalized text corresponding to the normalized file.
S206, extracting the key text from the normalized text according to the paradigm features.
The key text is the text carrying the key information in the normalized text, and may be composed of multiple pieces of text. Specifically, the server may extract the key text from the normalized text according to a summary extraction algorithm.
In one embodiment, the server may perform a pruning process on redundant characters in the extracted key text after extracting the key text from the normalized text. The redundant characters may be at least one of parentheses, comments, or appendices. Specifically, the server may match each character in the extracted key text with a preset redundant character set one by one, for example, match with a regular expression, and then delete the matched redundant character belonging to the preset redundant character set from the extracted key text.
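A minimal sketch of this pruning step, assuming the redundant-character set consists of parenthesised remarks (the exact regular expressions are illustrative assumptions, not specified by the description):

```python
import re

# Patterns treated as redundant: parenthesised remarks in both ASCII and
# full-width CJK brackets.  The concrete pattern set is an assumption.
REDUNDANT_PATTERNS = [
    re.compile(r"\([^)]*\)"),   # ASCII parentheses and their content
    re.compile(r"（[^）]*）"),   # full-width parentheses and their content
]

def prune_redundant(text: str) -> str:
    """Delete every match of the redundant-character patterns."""
    for pattern in REDUNDANT_PATTERNS:
        text = pattern.sub("", text)
    return text
```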
In one embodiment, after extracting the key text from the normalized text, the server can correct missing or disordered sequence numbers in the key text. Specifically, the server may determine whether the sequence numbers are out of order by comparing the sequentially extracted numbers, and when the sequence numbers in the key text are found to be discontinuous, correct them.
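The sequence-number correction can be sketched as follows, assuming each extracted item carries a leading "N." or "N、" numbering prefix (an illustrative convention):

```python
import re

NUM_PREFIX = re.compile(r"^(\d+)([.、])\s*")

def renumber(items):
    """Rewrite leading sequence numbers so they run 1, 2, 3, ... even when
    the source numbering had gaps or was out of order."""
    fixed = []
    for i, item in enumerate(items, start=1):
        match = NUM_PREFIX.match(item)
        sep = match.group(2) if match else "."
        body = item[match.end():] if match else item
        fixed.append(f"{i}{sep} {body}")
    return fixed
```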
In one embodiment, the server may simplify the expression of numerical values in the extracted key text. For example, decimals in the key text may be rounded, such as converting "25.3831%" to "25.4%". The server can also express long numeric strings in larger units, for example reducing "100000 yuan" in the key text to "100 thousand yuan". The server may also round larger values, e.g., replacing "1000004567 yuan" with "about 1 billion yuan".
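The numeric simplifications above can be sketched as follows; the rounding rules mirror the examples in this paragraph, while the function names and the thousand/billion unit conventions are illustrative assumptions:

```python
import re

def simplify_percent(text: str) -> str:
    """Round percentages such as 25.3831% to one decimal place."""
    return re.sub(r"(\d+\.\d+)%", lambda m: f"{float(m.group(1)):.1f}%", text)

def simplify_yuan(amount: int) -> str:
    """Express a yuan amount compactly in larger units."""
    if amount >= 10**9:
        return f"about {amount / 10**9:.0f} billion yuan"
    if amount >= 10**3:
        return f"{amount // 10**3} thousand yuan"
    return f"{amount} yuan"
```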
In one embodiment, the server may perform deduplication processing on the extracted key text. Specifically, the server may inspect the extracted key texts and, when it detects key information that is identical or whose similarity reaches a preset value, deduplicate that key information.
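A sketch of this deduplication step using Python's standard difflib to measure similarity; the 0.9 threshold stands in for the "preset value", which the description does not fix:

```python
from difflib import SequenceMatcher

def dedupe(sentences, threshold=0.9):
    """Keep a sentence only if it is not identical or highly similar to one
    already kept.  The threshold value is an illustrative assumption."""
    kept = []
    for s in sentences:
        if all(SequenceMatcher(None, s, k).ratio() < threshold for k in kept):
            kept.append(s)
    return kept
```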
S208, identifying the text type to which the normalized text belongs.
The text category is the category obtained by classifying the text content of the normalized text. Text categories are divided according to the text content of the normalized text, and a text category is different from a category label.
For example, among the announcement files issued by an exchange, the listed company information file corresponds to category label A, and such a file may be an enterprise reorganization file related to the listed company, an enterprise merger file, an enterprise equity transfer file, or the like. The normalized texts extracted from the enterprise reorganization file, the enterprise merger file, and the enterprise equity transfer file then correspond to three different text categories, namely a reorganization category, a merger category, and a transfer category.
In one embodiment, the server, after identifying the text category of the normalized text, obtains a plurality of text categories corresponding to the normalized text. The plurality of text categories may specifically be 3 text categories.
In an embodiment, the server may classify the input normalized text according to a classification algorithm to obtain a text category corresponding to the normalized text. The classification algorithm can be Rocchio algorithm, naive Bayes classification algorithm, K-nearest neighbor algorithm, decision tree algorithm, neural network algorithm or support vector machine algorithm, etc.
In one embodiment, the server may identify the text category of the normalized text through a machine learning model. And the server inputs the vector corresponding to the constructed normalized text into the machine learning model for prediction to obtain the text category corresponding to the normalized text. The Machine learning model may specifically be an SVM (Support Vector Machine) model.
In one embodiment, the server classifies the text by determining text features from the normalized text and transforming the text features into category features; a corresponding score is assigned to each category feature, and the category with the highest score is taken as the text category corresponding to the normalized text.
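The score-based classification described in this paragraph can be sketched as follows; the category keywords and weights are purely illustrative assumptions, not part of the disclosure:

```python
# Hypothetical category-feature keywords and weights for illustration only.
CATEGORY_KEYWORDS = {
    "transfer": {"transfer": 2.0, "equity": 1.5, "assignee": 1.0},
    "merge": {"merge": 2.0, "absorb": 1.5, "consolidation": 1.0},
    "reorganization": {"reorganization": 2.0, "restructuring": 1.5},
}

def classify(text: str) -> str:
    """Score each candidate category by summing the weights of its keywords
    that occur in the text, and return the highest-scoring category."""
    lowered = text.lower()
    scores = {
        category: sum(w for kw, w in keywords.items() if kw in lowered)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    return max(scores, key=scores.get)
```

In practice this hand-built scorer would be replaced by one of the trained classifiers named above (naive Bayes, SVM, and so on).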
And S210, combining the extracted key texts according to the templates corresponding to the text categories to obtain abstract texts.
The template is a design with a predetermined fixed format, which can be used to splice the extracted key texts in that format to obtain the summary text. The spliced summary text presents the key text in the template's preset fixed format.
As shown in FIG. 3, a template 300 corresponding to the equity transfer text category is briefly illustrated. In template 300, the title of the summary text is presented with a preset symbol 302, such as "[ ]"; the title is presented at a preset first position 304 in the summary text; the report object corresponding to the summary text is displayed at a preset second position 306; and the report date of the summary text is displayed at a preset third position 308.
In an embodiment, the server may pre-establish a correspondence table between a text category and a template in the database, and after the server identifies the text category corresponding to the normalized text, pull the template corresponding to the text category by looking up the correspondence table, and stitch the extracted key text with the template to obtain the abstract text.
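A minimal sketch of the template lookup and splicing step; the template string and field names loosely follow the slots described for FIG. 3 but are assumptions:

```python
# A hypothetical template for the "transfer" text category; the bracketed
# title and fixed slot layout echo FIG. 3, but the exact wording is assumed.
TEMPLATES = {
    "transfer": "[{title}]\nReport object: {company}\nReport date: {date}\n{body}",
}

def build_summary(category: str, key_text: dict) -> str:
    template = TEMPLATES[category]      # look up the template by text category
    return template.format(**key_text)  # splice the key text into the slots
```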
According to the above summary text generation method, key text can be extracted from the normalized text by means of the queried paradigm features corresponding to it, and once the text category of the normalized text has been identified, the extracted key text can be spliced according to the template corresponding to that category to obtain the summary text. Because no manual participation is needed in the whole process, the efficiency of rewriting text can be greatly improved.
As shown in fig. 4, in an embodiment, step S202 specifically includes:
s402, monitoring the announcement file source.
The announcement file source is the source of the announcement files. It may specifically be a webpage that publishes announcement files, or one or more databases corresponding to a website that publishes them. Specifically, after the server obtains access rights to the database corresponding to the website, the server monitors the database in real time.
In one embodiment, the server may establish a monitoring thread for monitoring the database corresponding to the website that issues the announcement files, and periodically query, at a preset time interval, whether the number of announcement files in the database has changed, thereby monitoring the announcement file source in real time.
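The monitoring thread's polling loop can be sketched as follows; `count_announcements` stands in for the database query, and all names and the `polls` bound (used to make the loop finite) are illustrative assumptions:

```python
import time

def watch_announcements(count_announcements, on_new, interval=60.0, polls=None):
    """Poll the announcement-file source at a fixed interval and invoke the
    callback with the number of newly added files when the count grows."""
    last = count_announcements()
    while polls is None or polls > 0:
        time.sleep(interval)
        current = count_announcements()
        if current > last:
            on_new(current - last)  # newly added announcement files detected
        last = current
        if polls is not None:
            polls -= 1
```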
S404, when the newly added announcement file of the announcement file source is monitored, the newly added announcement file is obtained.
In one embodiment, when the monitoring thread detects that the number of announcement files in the announcement file source has increased, a newly added announcement file is present at the source; the monitoring thread returns the identifier corresponding to the announcement file, an association query is performed according to the identifier to obtain the download address corresponding to the announcement file, and the newly added announcement file is downloaded from that address.
S406, extracting the normalized text from the bulletin file.
In one embodiment, the announcement file exists in the form of a picture; in order to read the normalized text in it, the server can recognize the text in the announcement file through an image recognition algorithm to obtain the normalized text.
S408, reading the category label associated with the notice file.
In one embodiment, the announcement files are stored in a table in the database corresponding to the website publishing them, and the fields associated with an announcement file's identifier include the category label, the download address, and the like. When the server finds the identifier corresponding to a newly added announcement file, it can read the corresponding category label through an association query.
In the above embodiment, the server monitors the announcement file source in real time; when it detects that a new announcement file has been issued at the source, it acquires the newly issued file, extracts the normalized text from it, and uses the extracted normalized text as material for generating the summary text. In this way the announcement file source can be monitored comprehensively, and newly published announcement files can be rewritten in real time.
In one embodiment, the key text includes at least one of a key paragraph, a key whole sentence, and a key half sentence, and step S206 specifically includes: when the paradigm features include the paragraph position of the key paragraph in the normalized text, extracting the key paragraph from the normalized text according to the paragraph position; when the paradigm features include a sequence text cue word, extracting a key whole sentence from the position corresponding to the cue word in the normalized text; and when the paradigm features include a keyword, extracting key half sentences containing the keyword from the normalized text.
A half sentence is a fragment obtained by splitting the normalized text at any punctuation mark, such as at least one of a comma, pause mark, period, or line feed. A whole sentence is a fragment obtained by splitting the normalized text at a period, exclamation mark, question mark, or similar terminal punctuation. A paragraph is a fragment obtained by splitting the normalized text at line feeds. Key paragraphs, key whole sentences, and key half sentences are, respectively, paragraphs, whole sentences, and half sentences extracted from the normalized text.
The paragraph position is a preset position for indexing a key paragraph in the normalized text. The sequence text cue word is a preset cue word used for indexing a key whole sentence in the normalized text. The keyword is a preset word for indexing a key half sentence.
In one embodiment, when the paradigm features include the paragraph position of the key paragraph in the normalized text, the server can first determine the location of the key paragraph according to the paragraph position, and then extract the key paragraph from that location in the normalized text.
In one embodiment, when the paradigm features include a sequence text cue word, the server matches the cue word against the normalized text to determine the position of the key whole sentence, and then extracts the key whole sentence from the position corresponding to the cue word.
In one embodiment, when the paradigm features include a keyword, the server matches the keyword against the normalized text, and when a half sentence containing the keyword exists in the text, extracts that key half sentence.
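The keyword-based extraction of key half sentences can be sketched as follows; the punctuation set follows the half-sentence definition given earlier, and the function name is an assumption:

```python
import re

# Any punctuation mark splits the text into half sentences: commas, pause
# marks, periods, question/exclamation marks, and line feeds.
HALF_SENTENCE_SPLIT = re.compile(r"[，,、。.！？!?\n]")

def extract_key_half_sentences(text, keywords):
    """Return the half sentences that contain at least one keyword."""
    halves = [h.strip() for h in HALF_SENTENCE_SPLIT.split(text) if h.strip()]
    return [h for h in halves if any(kw in h for kw in keywords)]
```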
In the above embodiment, the server queries the paradigm features corresponding to the category label of the normalized text, and can extract from the normalized text in a more targeted way according to those features, so that the extracted key text is both comprehensive and accurate.
In one embodiment, the step of extracting key paragraphs from the normalized text according to the paragraph position specifically includes: screening the first half sentences split from the paragraph position in the normalized text; acquiring the weight value corresponding to each screened first half sentence; determining, among the screened first half sentences, those whose weight values meet a first preset condition; and forming a key paragraph from the consecutive first half sentences that meet the first preset condition.
A first half sentence is any half sentence obtained by splitting the text at the paragraph position in the normalized text, taking the half sentence as the unit. The weight value is a quantified measure of the importance of each half sentence to the corresponding normalized text. The first preset condition is a condition that the weight values of some of the first half sentences must meet; it may specifically be that the half sentence's weight value ranks in the top 10 among the first half sentences.
Specifically, the server may split the normalized text in units of half sentences and obtain the weight value corresponding to each split half sentence; after obtaining the paragraph position of the key paragraph, it screens out the half sentences at that paragraph position as the first half sentences, and forms a key paragraph from the consecutive first half sentences whose weight values meet the first preset condition.
In one embodiment, the server may sequentially traverse punctuation marks in the normalized text, and when any punctuation mark is found, the text which is continuous before the punctuation mark and does not contain the punctuation mark is taken as a half sentence.
For example, when the server matches the first punctuation mark in the normalized text, the text before that mark forms a half sentence; when the second punctuation mark is matched, the text after the first mark and before the second forms the next half sentence; and so on, splitting the normalized text in units of half sentences.
In one embodiment, the server may sequentially attach a label to each split half sentence, the label value indicating the half sentence's position and order in the normalized text. When the weight values of the screened first half sentences meet the first preset condition, the server can judge whether those half sentences are continuous by checking whether their label values are sequential, and the continuous half sentences are joined in order to obtain the corresponding key paragraph.
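The labelling and continuity check described in this paragraph can be sketched as follows; the top-k selection mirrors the first preset condition (top 10 by weight), and the function name and join delimiter are assumptions:

```python
def key_paragraph(half_sentences, weights, top_k=10):
    """Label each half sentence by its position index, keep the top_k by
    weight, and join the longest run of consecutive indices into a paragraph."""
    if not half_sentences:
        return ""
    ranked = sorted(range(len(half_sentences)), key=lambda i: weights[i], reverse=True)
    selected = sorted(ranked[:top_k])        # position labels of the kept halves
    runs, current = [], [selected[0]]
    for idx in selected[1:]:
        if idx == current[-1] + 1:           # label values are sequential
            current.append(idx)
        else:
            runs.append(current)
            current = [idx]
    runs.append(current)
    longest = max(runs, key=len)             # the longest continuous run
    return ", ".join(half_sentences[i] for i in longest)
```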
In one embodiment, the weight value of each half sentence with respect to the corresponding normalized text can be obtained through a summary extraction algorithm.
In the above embodiment, first half sentences are screened from the paragraph positions of the key paragraphs, and those with higher weight values are selected from them to form the key paragraphs, so that the key paragraphs obtained from the normalized text are more accurate.
In one embodiment, the step of extracting the key whole sentence from the position corresponding to the prompt word of the sequence text in the normalized text specifically includes: screening the second half sentences corresponding to the prompt word of the sequence text in the normalized text; acquiring the weight values corresponding to the screened second half sentences; determining the second half sentences whose weight values satisfy a second preset condition; and forming a key whole sentence from the consecutive second half sentences satisfying the second preset condition.
The second half sentences are the half sentences obtained by splitting, in units of half sentences, the text at the position corresponding to the prompt word of the sequence text in the normalized text. The second preset condition is a condition satisfied by the weight values of some of the second half sentences; for example, it may specifically be that a half sentence's weight value is ranked in the top 5 among the second half sentences.
Specifically, the server may split the normalized text in units of half sentences, obtain the weight value corresponding to each split half sentence, screen out the half sentences at the position corresponding to the prompt word of the sequence text as second half sentences after obtaining that position, obtain the weight values corresponding to the screened second half sentences, and form the key whole sentence from the consecutive second half sentences whose weight values satisfy the second preset condition.
In one embodiment, after obtaining the second half sentences whose weight values satisfy the second preset condition, the server may traverse the screened second half sentences to search for a period, an exclamation mark or a question mark, and when any such mark is found, take the continuous text before the mark that contains no other period, exclamation mark or question mark as a key whole sentence.
For example, when the server matches the first period in the screened second half sentences, the text before the first period forms a key whole sentence; when the second period is matched, the text after the first period and before the second period is taken as a key whole sentence; and so on, to obtain the key whole sentences corresponding to the screened second half sentences.
In the above embodiment, second half sentences are screened from the position corresponding to the prompt word of the sequence text, and those with higher weight values are selected from them to form the key whole sentences, so that the extracted key whole sentences better match the normalized text.
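The assembly of whole sentences at terminal marks, as described above, can be sketched like this. The mark set (ASCII plus CJK forms) is an assumption; any Python 3.7+ interpreter supports the zero-width split used here.

```python
import re

# Sketch: cut the text after each terminal mark (period, exclamation mark,
# question mark), so every resulting piece is one whole sentence ending at
# such a mark and containing no other terminal mark.
def split_whole_sentences(text):
    parts = re.split(r"(?<=[.!?。！？])", text)
    return [p.strip() for p in parts if p.strip()]
```

Each piece corresponds to one candidate key whole sentence drawn from the screened second half sentences.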
In one embodiment, the step of extracting the key half sentences including the keywords from the normalized text specifically includes: screening the third half sentences including the keywords from the half sentences split from the normalized text; acquiring the weight values corresponding to the screened third half sentences; and taking the third half sentences whose weight values satisfy a third preset condition as key half sentences.
The third half sentences are the half sentences, obtained by splitting the normalized text in units of half sentences, that contain the keywords. The third preset condition is a condition satisfied by the weight values of some of the third half sentences; for example, it may specifically be that a half sentence's weight value is ranked in the top 10 among the third half sentences. The keywords are preset keywords corresponding to the text category of the normalized text.
Specifically, the server splits the normalized text in units of half sentences, calculates the weight value corresponding to each half sentence, screens the third half sentences containing the keywords from the split half sentences, obtains the weight values corresponding to the screened third half sentences, and takes the third half sentences whose weight values satisfy the third preset condition as the key half sentences.
In one embodiment, after obtaining the keywords corresponding to the normalized text, the server may split the normalized text into half sentences, traverse the obtained half sentences against the keywords, and take any half sentence that contains a keyword as a third half sentence.
In one embodiment, there may be a plurality of keywords. The server may match each half sentence obtained by splitting the normalized text against each keyword respectively, so as to obtain the third half sentences corresponding to the normalized text.
In the above embodiment, the key half sentences used for generating the abstract text can be located in the normalized text through the keywords in the paradigm features corresponding to the category label.
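The keyword screen combined with the third preset condition (top-10 weight values) can be sketched as follows; the weight function is left abstract, and the top-N cutoff mirrors the example condition above.

```python
# Sketch: screen half sentences containing any preset keyword (the "third
# half sentences"), then keep those whose weight values rank highest.
# The weight function itself is not specified by the patent.
def key_half_sentences(halves, keywords, weight, top_n=10):
    third = [h for h in halves if any(k in h for k in keywords)]
    return sorted(third, key=weight, reverse=True)[:top_n]
```

With several keywords, a half sentence qualifies if it contains at least one of them, matching the plural-keyword embodiment above.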
In one embodiment, step S208 specifically includes: screening words belonging to a preset word set from a normalized text; and identifying the text category to which the normalized text belongs according to the screened words.
The preset word set is a word set obtained through statistics according to the text content of the normalized text of each text category.
Specifically, the server may perform word segmentation on the normalized text, match the obtained words with the preset word set, screen out the words belonging to the preset word set from the words obtained by segmentation, and identify the text category to which the normalized text belongs according to the screened words.
In an embodiment, before the server matches the obtained word with the preset word set, the server may further calculate a TF (term frequency) value of the normalized text corresponding to each word after removing stop words in the normalized text, screen out words whose frequency meets a preset condition from the obtained words, and then match the words with the preset word set.
In one embodiment, the server may divide the preset word set according to the text categories to obtain sub-preset word sets corresponding to the text categories. When a preset proportion of the screened words belongs to a certain sub-preset word set, the text category corresponding to that sub-preset word set is taken as the text category of the normalized text.
In the above embodiment, the server generates the preset word set and the corresponding text category in advance, and obtains the text category corresponding to the normalized text according to the words in the normalized text by the correspondence between the preset word set and the text category, so as to implement the classification of the normalized text.
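The proportion-based category decision can be sketched as follows. The threshold value is an assumption; the patent only says a "preset proportion".

```python
# Sketch: each text category has a sub-preset word set; the category whose
# set covers the largest proportion of the screened words wins, provided
# that proportion reaches the preset threshold (an assumed value here).
def classify_by_word_sets(words, sub_sets, threshold=0.5):
    best_cat, best_ratio = None, 0.0
    for cat, word_set in sub_sets.items():
        ratio = sum(1 for w in words if w in word_set) / len(words)
        if ratio > best_ratio:
            best_cat, best_ratio = cat, ratio
    return best_cat if best_ratio >= threshold else None
```

Returning None when no set reaches the threshold leaves room for the model-based classification described next.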
In one embodiment, the step of identifying the text category to which the normalized text belongs according to the screened words specifically includes: acquiring the importance degree of the screened words to the normalized text; constructing a text vector representing the canonical text according to the importance degree; and inputting the text vector into the trained machine learning model to obtain the text category.
The importance degree of a word to the normalized text is the degree of association between the information expressed by the word and the information expressed by the normalized text as a whole; the greater the degree of association, the more representative the word is of the information the normalized text expresses, and the greater its importance. A text vector is a vector that represents the text content of the normalized text. The trained machine learning model is a machine learning model for predicting the text category of a normalized text: the text vector corresponding to the normalized text is input into the model, which outputs the text category corresponding to the normalized text. The trained machine learning model may specifically be an SVM (support vector machine) model.
In one embodiment, the server may obtain the TF-IDF (term frequency-inverse document frequency) value corresponding to each screened word, and use it as the importance degree of that word to the normalized text. The TF-IDF value is equal to the product of the TF value and the IDF value, where the TF value represents the frequency with which the word occurs in the normalized text and the IDF value represents the ability of the word to represent the text category to which the normalized text belongs:

TF(i, j) = n(i, j) / m(j)

where n(i, j) is the number of times word i appears in normalized text j, and m(j) is the number of all words remaining in normalized text j except stop words;

IDF(i) = log(P / P(i))

where P is the total number of files in the document library formed by the server from a plurality of preset normalized texts, and P(i) is the number of files in the document library that contain word i.
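The two formulas above can be sketched directly in code; word counts are taken over one normalized text with stop words removed, and the document library is modeled as a list of word sets.

```python
import math

# Sketch of the TF and IDF formulas: word_counts maps words of one
# normalized text (stop words removed) to their occurrence counts; docs is
# the document library, each file given as the set of its words.
def tf(word_counts, word):
    return word_counts[word] / sum(word_counts.values())   # n(i, j) / m(j)

def idf(docs, word):
    p_i = sum(1 for d in docs if word in d)                # files containing word i
    return math.log(len(docs) / p_i)                       # log(P / P(i))

def tf_idf(word_counts, docs, word):
    return tf(word_counts, word) * idf(docs, word)
```

This assumes every scored word occurs in at least one library file, so P(i) is nonzero; production TF-IDF variants often add smoothing for that case.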
In an embodiment, the server may collect statistics over the preset normalized texts of each text category, perform word segmentation on the normalized texts of each category, form the corresponding preset word set from the segmented words, generate an n-dimensional vector from the obtained preset word set, generate the text vector of a normalized text from the words screened from that text and their corresponding TF-IDF values, input the text vector into the trained machine learning model, and output the text category corresponding to the normalized text.
For example, a preset word set W = {w1, w2, w3, ..., wn} is generated from the normalized texts in the document library, and the n-dimensional vector generated from W is V = (v1, v2, v3, ..., vn). The words screened from a normalized text M that appear in W include w1, w3, ..., wk, ..., wn, with corresponding TF-IDF values TI1, TI3, ..., TIk, ..., TIn, so the vector generated for M is Vm = (TI1, 0, TI3, ..., TIk, ..., TIn), where the dimensions of words of W that do not appear in M are 0.
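The vector construction in the example above amounts to a simple lookup over the ordered preset word set:

```python
# Sketch of building the n-dimensional text vector: each dimension
# corresponds to one word of the preset word set W, carrying that word's
# TF-IDF value in text M, or 0 if the word does not occur in M.
def text_vector(preset_words, doc_tfidf):
    return [doc_tfidf.get(w, 0.0) for w in preset_words]
```

The resulting fixed-length vector is what gets fed to the trained classification model.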
In the above embodiment, by inputting the text vector representing the normalized text into the trained machine learning model, the corresponding text category can be obtained accurately, even for normalized texts of the same text category written by authors with different writing styles.
In an embodiment, step S208 specifically includes: classifying the normalized text to obtain a primary classification result; acquiring historical data corresponding to the primary classification result; comparing the normalized text with the historical data to obtain a comparison result; and when the comparison result satisfies a fourth preset condition, taking the primary classification result as the text category to which the normalized text belongs.
The primary classification result is the classification result obtained after classifying the normalized text; the server may further verify it through historical data. The historical data is data corresponding to the classified category. The fourth preset condition is a preset value of the matching degree between the normalized text and the historical data.
Specifically, after classifying the normalized text, the server may compare the normalized text with the historical data corresponding to the resulting category, and if the comparison result satisfies the fourth preset condition, take the primary classification result as the text category corresponding to the normalized text.
In an embodiment, the server may pre-establish a mapping relationship between historical normalized text data and classification results. After obtaining the primary classification result corresponding to the normalized text, the server pulls the historical normalized text data corresponding to that result according to the mapping relationship and compares the normalized text with it, for example by a duplication-checking method; if the repetition rate reaches a preset value, the primary classification result is taken as the text category corresponding to the normalized text.
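The duplication check against historical data can be sketched as follows. The patent does not name a concrete checking method, so character-bigram overlap stands in here, and the threshold is an assumed value.

```python
# Sketch: verify a primary classification result by measuring how much the
# normalized text overlaps the historical texts of that category.
# Character-bigram overlap is a stand-in for the unspecified check method.
def repetition_rate(text, history):
    bigrams = lambda s: {s[i:i + 2] for i in range(len(s) - 1)}
    a, b = bigrams(text), bigrams(history)
    return len(a & b) / len(a) if a else 0.0

def confirm_category(text, history, primary_cat, threshold=0.8):
    # keep the primary classification only if the repetition rate reaches
    # the preset value; the threshold here is an assumption
    return primary_cat if repetition_rate(text, history) >= threshold else None
```

Returning None signals that the primary result failed verification and the text needs re-examination.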
In one embodiment, the primary classification result obtained by the server may be obtained by inputting a text vector corresponding to the normalized text into the trained machine learning model and outputting the classification result.
In the above embodiment, the accuracy of the primary classification result is further verified by comparing the primary classification result corresponding to the normalized text with the historical data, so that the text category corresponding to the normalized text obtained by classification is more accurate.
As shown in fig. 5, in an embodiment, step S210 specifically includes:
S502, allocating a template corresponding to the text category to each extracted key text respectively.
Specifically, the server may pre-store a corresponding relationship between the text category and the template, and after the text category corresponding to the normalized text is obtained, match the corresponding template for the extracted key text according to the corresponding relationship.
S504, matching the extracted key texts with corresponding connecting words through the allocated templates.
The connecting words are used to connect the extracted key texts. Specifically, one template may correspond to a plurality of connecting words. For example, the connecting words of the template corresponding to a normalized text whose text category is financial announcement may include "after preliminary calculation by the company's finance department"; since such wording does not usually appear in the normalized text, the extracted key texts do not contain it, so matching connecting words to the extracted key texts makes the generated abstract text read more smoothly.
S506, the key texts are spliced through corresponding connecting words to obtain abstract texts.
Specifically, the key paragraphs, key whole sentences and key half sentences extracted from the normalized text can be spliced together according to the connecting words corresponding to the template, so as to obtain the abstract text corresponding to the normalized text.
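The template-driven splicing can be sketched as follows. A template is modeled here as a list of connecting words interleaved with None slots that receive the key texts in order; the template content in the test is illustrative, not from the patent.

```python
# Sketch: fill a template's slots with the extracted key texts in order,
# keeping the template's connecting words between them.
def splice(template, key_texts):
    parts, i = [], 0
    for item in template:
        if item is None:                   # slot: insert the next key text
            parts.append(key_texts[i])
            i += 1
        else:                              # connecting word from the template
            parts.append(item)
    return "".join(parts)
```

One text category could map to several such templates, matching the multi-template embodiment described next.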
In one embodiment, the server may further assign the extracted key texts to a plurality of templates corresponding to the text categories respectively to generate abstract texts expressing different key texts.
In the above embodiment, the extracted key texts are matched with connecting words through the allocated templates, and the key texts are spliced by using the allocated templates and connecting words, so that an abstract text with smoother sentences can be obtained.
In one embodiment, the summary text generation method further includes: determining the logic structure type of the abstract text; separating the logic unit text from the abstract text; and recombining the logical unit texts according to a text recombination mode corresponding to the logical structure type to obtain a recombined abstract text.
The logical structure type is the type of text logic corresponding to the generated abstract text, and may specifically include at least one of a parallel structure type, a progressive structure type, a transition structure type, and a summary structure type. The logic unit text is the unit text in the abstract text corresponding to the logical structure type. For example, an abstract text whose logical structure type is the parallel structure type may include two text sections A and B that are logically parallel, so both A and B belong to the logic unit texts of that abstract text.
The text recombination mode is a mode for recombining the logic unit texts in the abstract texts with different logic structure types.
In one embodiment, the server may extract logical words present in the abstract text to determine the logical structure type corresponding to the abstract text. The logical words may reflect the logical structure type of the abstract text. Logical words such as "wherein", "further", "although", "but", etc., may also be words that are repeated in the logical unit text. Specifically, the server may generate a correspondence between a logical word and a logical structure type in advance, match a word in the digest text with a preset logical word, and determine the logical structure type corresponding to the digest text according to the correspondence when a word matching the preset logical word exists.
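The logical-word lookup can be sketched as a preset mapping from logical words to structure types, where the first matching word decides the type. The word list below is an illustrative assumption; the patent only gives examples such as "wherein", "further", "although" and "but".

```python
# Sketch: preset correspondence between logical words and structure types;
# scanning the abstract text for any of these words determines its type.
LOGIC_WORDS = {
    "respectively": "parallel",
    "furthermore": "progressive",
    "however": "transition",
    "but": "transition",
}

def detect_structure(text):
    for word, stype in LOGIC_WORDS.items():
        if word in text:
            return stype
    return None
```

A None result means no preset logical word matched, so no recombination is attempted.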
In one embodiment, after determining the logical structure type corresponding to the abstract text, the server may split the text before the logical word and the text after the logical word in the abstract text, so as to separate the logical unit text from the abstract text.
In one embodiment, after determining the logical structure type corresponding to the abstract text, the server may match corresponding conjunction words for the abstract text, and implement the reorganization of the logical unit text through the conjunction words.
In one embodiment, when the logical structure type is a parallel structure type, the step of recombining the logical unit texts according to the text recombination mode corresponding to the logical structure type to obtain a recombined abstract text includes: determining a head text and a tail text in each separated logic unit text; merging the head texts according to a parallel expression mode to obtain a merged head text; merging the tail texts according to a parallel expression mode to obtain merged tail texts; and connecting the combined head text and the combined tail text through the parallel connecting words corresponding to the parallel structure types to obtain the recombined abstract text.
In an abstract text whose logical structure type is the parallel structure type, the head texts of the logic unit texts are in a parallel relation, as are the tail texts. A parallel connecting word is a preset connecting word corresponding to the parallel structure type, for example "respectively" or "in turn".

For example, suppose the obtained abstract text is: "The main reasons for loss reduction are: compared with the same period of last year, sales of the company's cement and commodity clinker increased in the period, the main business gross profit rate improved, and the total of the three expenses decreased considerably. The main reasons for loss are: the costs of some subsidiaries and production lines put into operation by the company in recent years are relatively high, and under the current market environment their production and sales volumes are low, so capacity cannot be brought into play effectively." From the repeated "main reasons" in the abstract text, the server can judge that its logical structure type is the parallel structure type, and separate out logic unit text A: "The main reasons for loss reduction: compared with the same period of last year, sales of the company's cement and commodity clinker increased in the period, the main business gross profit rate improved, and the total of the three expenses decreased considerably", and logic unit text B: "The main reasons for loss: the costs of some subsidiaries and production lines put into operation by the company in recent years are relatively high, and under the current market environment their production and sales volumes are low, so capacity cannot be brought into play effectively". The head text of A is "the main reasons for loss reduction", and its tail text is the rest of A; the head text of B is "the main reasons for loss", and its tail text is the rest of B. Merging the head texts in a parallel expression gives the merged head text "the main reasons for loss reduction and for loss"; merging the tail texts gives the two tails in sequence. Connecting the merged head text and the merged tail text through the parallel connecting word of the parallel structure then yields the recombined abstract text: "The main reasons for loss reduction and for loss are respectively: compared with the same period of last year, sales of the company's cement and commodity clinker increased in the period, the main business gross profit rate improved, and the total of the three expenses decreased considerably; the costs of some subsidiaries and production lines put into operation by the company in recent years are relatively high, and under the current market environment their production and sales volumes are low, so capacity cannot be brought into play effectively."
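The parallel recombination illustrated above can be sketched as follows. The head-merging and connector conventions are simplified assumptions; the patent leaves the exact parallel expression mode open.

```python
# Sketch: each logic unit is a (head, tail) pair; duplicate heads are merged
# (deduplicated, order preserved) and the tails are listed in order after a
# parallel connecting word.
def recombine_parallel(units, connector="respectively"):
    heads = list(dict.fromkeys(head for head, _ in units))
    merged_head = " and ".join(heads)
    merged_tails = " ".join(tail for _, tail in units)
    return f"{merged_head} are {connector}: {merged_tails}"
```

With distinct heads, both survive the merge; with identical heads, the shared head is stated once.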
In one embodiment, when the logical structure type is a progressive structure type, the step of recombining the logical unit texts according to a text recombination mode corresponding to the logical structure type to obtain a recombined abstract text includes: determining the progressive sequence of the texts of the logic units; acquiring progressive connection words corresponding to the progressive structure types and the progressive sequence; and linking the logic unit texts according to the progressive sequence and the corresponding progressive linking words to obtain the recombined abstract text.
The progressive order is the hierarchical order between the logic unit texts in an abstract text whose logical structure type is the progressive structure type; for example, logic unit text D builds on logic unit text C. A progressive connecting word is a preset connecting word corresponding to the progressive structure type, for example "further" or "moreover".

In one embodiment, when the logical structure type is the transition structure type, the step of recombining the logic unit texts according to the text recombination mode corresponding to the logical structure type to obtain a recombined abstract text includes: identifying the logic unit text of basic semantics and the logic unit text of transition semantics from the separated logic unit texts; determining the transition connecting word in the abstract text; and deleting the logic unit text of basic semantics and the transition connecting word from the abstract text to obtain the recombined abstract text.
The logic unit text of basic semantics is the subordinate clause in an abstract text whose logical structure type is the transition structure type, and the logic unit text of transition semantics is the main clause. In a sentence with a transition relation, the subordinate clause is opposite or contrary in meaning to the main clause.
In one embodiment, when the logical structure type is the summary structure type, the step of recombining the logical unit texts according to the text recombination mode corresponding to the logical structure type to obtain the recombined abstract text includes: determining a parent-level logic structure type of the abstract text and a child-level logic structure type of each logic unit text; separating corresponding sub logic unit texts from the logic unit texts respectively; recombining the corresponding sub-logic unit texts separated from each logic unit text according to the text recombination modes corresponding to the corresponding sub-level logic structure types to obtain recombined logic unit texts; and recombining the recombined logic unit texts according to a text recombination mode corresponding to the parent-level logic structure type to obtain a recombined abstract text.
The summary structure type is the logical structure type of an abstract text in which two or more logical structure types are nested. The parent-level logical structure type corresponding to the abstract text and the child-level logical structure types corresponding to its logic unit texts are the nested logical structure types. For example, suppose the parent-level logical structure type of an abstract text is the parallel structure type, and logic unit texts A, B and C in a parallel relation are separated from the abstract text. The sub-logic unit texts separated from A, B and C are A1, A2 and A3; B1, B2 and B3; and C1, C2 and C3, respectively. The child-level logical structure type corresponding to A1, A2 and A3 may be at least one of the parallel, transition or progressive structure types, and so on for each logic unit text. The sub-logic unit texts can then be recombined according to the text recombination mode corresponding to their child-level logical structure type to obtain recombined logic unit texts, and finally the recombined logic unit texts are recombined according to the text recombination mode corresponding to the parent-level logical structure type to obtain the recombined abstract text.
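The nested, bottom-up recombination described above can be sketched as a recursion over a small tree. The per-type merging rules below are illustrative placeholders for the text recombination modes of each structure type.

```python
# Sketch: a node pairs a structure type with children that are either plain
# logic unit texts (strings) or nested nodes; children are recombined
# bottom-up, then merged by the rule for the node's own structure type.
def recombine(node, rules):
    stype, children = node
    parts = [c if isinstance(c, str) else recombine(c, rules) for c in children]
    return rules[stype](parts)

RULES = {                      # placeholder recombination modes per type
    "parallel": "; ".join,
    "progressive": " furthermore, ".join,
}
```

The recursion naturally handles any nesting depth: child-level types are resolved before the parent-level merge, as the embodiment requires.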
In the above embodiment, the logic unit texts in the abstract text are recombined according to the logical structure type corresponding to the abstract text, so that the key texts are not merely mechanically spliced together through templates, and the recombined abstract text can meet the requirements of a formal report.
In one embodiment, the method for generating abstract text further comprises: acquiring user data; determining the pushing priority of the abstract text according to the user data and the abstract text; and pushing the abstract text to the terminal corresponding to the user data according to the pushing priority.
The user data is data capable of reflecting the rule that a user reads the normalized file from the network. The user data may specifically include download amount, usage frequency, access amount, access rate, retention time, etc. The push priority is the value level of the same abstract text corresponding to different user data. It can be understood that since each user data reflects the characteristics of the corresponding user reading the normalized file from the network, the same abstract text generally has different push priorities for different user data.
In one embodiment, the server may record users' registration information on the website corresponding to the normalized files, mine the user data corresponding to the registration information, match the user data with the generated abstract text, determine the push priority of the abstract text for each item of user data according to the matching result, and push the abstract text to the terminal corresponding to the user data according to the obtained push priority.
In the embodiment, by determining the push priority of the generated abstract text and the user data, the generated abstract text can be pushed to the terminal where the user data is located more accurately according to the push priority.
In one embodiment, as shown in fig. 6, the method for generating the abstract text specifically includes:
s601, monitoring a bulletin file source;
s602, when monitoring that the announcement file source adds an announcement file, acquiring the added announcement file;
s603, extracting a normalized text from the bulletin file;
s604, reading a category label associated with the announcement file;
s605, inquiring preset paradigm characteristics corresponding to the category labels;
s606-1, when the paradigm characteristics comprise paragraph positions of key paragraphs in the paradigm text, screening a first sentence split from the paragraph positions in the paradigm text; acquiring a weight value corresponding to the screened first sentence; determining a first half sentence with a weight value meeting a first preset condition in the screened first half sentences; forming a key paragraph by using continuous first half sentences meeting first preset conditions;
s606-2, when the paradigm feature comprises a sequence text cue word, screening a second half sentence corresponding to the sequence text cue word in the paradigm text; acquiring a weighted value corresponding to the screened second half sentence; determining a second half sentence with a weight value meeting a second preset condition in the screened second half sentence; forming a key whole sentence by the continuous second half sentence which meets a second preset condition;
s606-3, when the paradigm characteristics comprise the keywords, extracting key half sentences comprising the keywords from the paradigm texts; screening a third half sentence comprising the key words from the half sentences split from the normalized texts; acquiring a weighted value corresponding to the screened third half sentence; taking a third half sentence with a weight value meeting a third preset condition as a key half sentence;
s607, screening words belonging to a preset word set from the paradigm text;
s608, acquiring the importance degree of each screened word to the paradigm text;
s609, constructing a text vector expressing the paradigm text according to the importance degrees;
s610, inputting the text vector into a trained machine learning model to obtain the text category;
s611, allocating a template corresponding to the text category to each extracted key text;
s612, matching the extracted key texts with corresponding connecting words through the allocated templates, and splicing the key texts through the corresponding connecting words to obtain an abstract text;
s613, determining the logic structure type of the abstract text, and separating logic unit texts from the abstract text;
s614, recombining the logic unit texts according to a text recombination mode corresponding to the logic structure type to obtain a recombined abstract text;
s615, acquiring user data, and determining a push priority of the abstract text according to the user data and the abstract text;
s616, pushing the abstract text to a terminal corresponding to the user data according to the push priority.
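Steps s606-1 to s606-3 share one mechanism: split text into half sentences, weight each half sentence, keep those whose weight meets a preset condition, and join consecutive survivors. A minimal Python sketch follows; the keyword-sum weighting scheme, the punctuation set, and the keep-the-longest-run rule are illustrative assumptions, since the patent leaves the weighting method open:

```python
import re

def split_half_sentences(text):
    """Split text into half sentences at commas, semicolons and full stops
    (both ASCII and CJK punctuation)."""
    parts = re.split(r"[,;.\u3002\uff0c\uff1b]", text)
    return [p.strip() for p in parts if p.strip()]

def weight(half_sentence, keyword_weights):
    """Weight a half sentence by summing the weights of keywords it contains."""
    return sum(w for kw, w in keyword_weights.items() if kw in half_sentence)

def key_paragraph(paragraph, keyword_weights, threshold):
    """Keep the longest run of consecutive half sentences whose weight
    meets the preset condition (corresponds to step s606-1)."""
    halves = split_half_sentences(paragraph)
    kept, run = [], []
    for h in halves:
        if weight(h, keyword_weights) >= threshold:
            run.append(h)
        else:
            if len(run) > len(kept):
                kept = run
            run = []
    if len(run) > len(kept):
        kept = run
    return ", ".join(kept)
```

The same skeleton serves s606-2 and s606-3 by changing only the screening predicate (cue-word match or keyword containment) before weighting.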
According to the abstract text generation method, key texts can be extracted from the paradigm text through the queried paradigm features corresponding to the paradigm text, and after the text category of the paradigm text is identified, the extracted key texts can be spliced using the template corresponding to that text category to obtain the abstract text. Because the whole process of generating the abstract text requires no manual participation, the efficiency of rewriting the text can be greatly improved.
Fig. 6 is a flowchart illustrating a method for generating an abstract text in one embodiment. It should be understood that, although the steps in the flowchart of fig. 6 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, these steps are not subject to a strict order limitation and may be performed in other orders. Moreover, at least a portion of the steps in fig. 6 may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times; these sub-steps or stages are not necessarily performed sequentially, and may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
Figs. 7 and 8 are schematic diagrams illustrating generation of corresponding abstract texts from announcement files issued by an exchange according to an embodiment of the application. With reference to figs. 7 and 8, sources of fixed-paradigm announcement files, such as exchanges, are monitored in real time to obtain published announcement files; key sentences and paragraphs are extracted from the text content of each announcement file; the text structure is reconstructed by matching a template corresponding to the document type of the announcement file, yielding an abstract text; and the news value of the abstract text is judged before it is pushed to a terminal. In this way, fixed-paradigm official-document abstracts, typified by announcements of listed companies, are written by machine, news points are mined from announcement files automatically instead of manually, and thousands of listed companies can be monitored around the clock (7×24).
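The monitoring-to-summary flow just described can be sketched end to end. In this sketch each announcement is a plain string, the "paradigm feature" is reduced to a single keyword, and the template is a format string — deliberate simplifications of the fuller mechanics described in the embodiments below:

```python
def monitor_and_summarize(announcements, keyword, template):
    """For each published announcement, keep the sentences containing the
    paradigm keyword and fill them into the category template."""
    summaries = []
    for text in announcements:
        # extraction step: sentences carrying the feature keyword
        key = [s.strip() for s in text.split(".") if keyword in s]
        if key:
            # splicing step: template supplies the surrounding wording
            summaries.append(template.format(key_text="; ".join(key)))
    return summaries
```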
For example, the following is the textual content of the announcement file: "
XX stock control: notice about assignment of XX property and strategic cooperation with YY property
The announcement date: 2017-07-0100:00:00
Security code: 6006XX  Stock abbreviation: XX stock control  Number: Provisional 2017-048
Announcement of XX stock control Group Co., Ltd. on transferring XX property and performing strategic cooperation with YY property
The board of directors and all directors of the company ensure that the content of this announcement contains no false records, misleading statements or material omissions, and assume individual and joint liability for the authenticity, accuracy and completeness of its content.
Important content cues
Transaction content
The company intends to transfer 100% of the equity of Shanghai XX property service company, a wholly-owned subsidiary, to YY property management service company at an equity transfer price of 1,000,000,000.00 yuan. Meanwhile, the company plans to form a strategic partnership with YY property and to support the future rapid development of YY property's property management and various community value-added services. To this end, the company has agreed with YY property that, in the 5 years from 2018 to 2022, real estate projects developed by the company will take YY property as the priority property partner.
The transaction does not constitute a related transaction.
This transaction does not constitute a significant reorganization of assets.
There is no major legal barrier to the implementation of this transaction.
First, this transaction summary
1. Basic situation
Recently, the company signed a "cooperative framework agreement" with YY group holding limited company and YY property management service limited company, and the company's wholly-owned subsidiary XX group holding limited company (hereinafter "XX group") signed an "equity transfer agreement" concerning XX property with YY property management service limited company (hereinafter "YY property") and Shanghai XX property service limited company (hereinafter "XX property" or the "target company"). The company intends to transfer 100% of the equity (the "target equity") of Shanghai XX property service limited company, its wholly-owned subsidiary, to YY property management service limited company at an equity transfer price of 1,000,000,000.00 yuan. Meanwhile, the company plans to form a strategic partnership with YY property and to support the future rapid development of YY property's property management and various community value-added services. To this end, the company has agreed with YY property that, in the 5 years from 2018 to 2022, real estate projects developed by the company will take YY property as the priority property partner.
2. Approval and other procedures in need of fulfillment of transactions
The transaction is validated.
……
Fourth, protocol main content
1. Equity transfer price
The transfer price of 100% of the XX property equity is 100,000 ten-thousand yuan (1 billion yuan). At this price, YY property (the "transferee") pays 90,909.1 ten-thousand yuan and 9,090.9 ten-thousand yuan to XX group and the company (collectively, the "transferor") for 90.9091% and 9.0909% of the XX property equity, respectively.
2. Equity delivery
The transaction parties agree that the date on which the following conditions are simultaneously satisfied is the date of transfer and delivery of the targeted equity:
(1) the right to stock transfer obtains the resolution of the board of the company;
(2) payment is completed for 51% of the equity transfer;
(3) all parties have signed the book of confirmation of equity delivery.
Each party shall use its best efforts to ensure that the delivery date is no later than June 30, 2017.
All right obligations related to the target equity, profits and losses of the target company belong to the transferee from the delivery date to the business change completion date.
3. Payment of equity transfer
YY property shall pay 51% of the equity transfer price to the transferor before June 30, 2017;
the remaining 49% of the equity transfer price shall be paid to the transferor within 60 days from the signing date of the "equity transfer agreement".
4. Strategic collaboration
After the equity transfer is completed, YY property will adopt a dual-brand strategy of YY property and XX property, build XX property into a benchmark brand for residential and commercial property management projects, expand the business scale and service capacity of XX property, improve its operation and management level, and realize the value-added protection of real estate projects developed by the XX group through YY property and XX property. The company undertakes that, within the scope permitted by applicable laws, regulations and rules, the trademarks required for XX property's business operation (including but not limited to the character and graphic trademarks related to "XX") will be granted to XX property free of charge for use within the scope of its property management business; the free grant period shall be no less than five years, and upon its expiry YY property and the company may negotiate separately.
The company supports the rapid development of YY property's property management and various community value-added services, will discuss cooperation at the equity level, and supports the development of XX community finance, smart home, community elderly care, telemedicine and other services. Real estate projects subsequently developed by the company will take YY property as the priority property partner. Within the five years from January 1, 2018 to December 31, 2022, on the premise of complying with all applicable laws, regulations and rules, YY property is ensured to obtain, through legal compliance procedures, a property service area of 700 ten-thousand square meters each year from properties developed by the company; on this basis, YY property will be assisted in preferentially obtaining, through legal compliance procedures, a further 300 ten-thousand square meters of property service area each year from properties developed by the company. The average unit price of property management fees for the aforementioned property service area shall in principle be no lower than the fair market price, subject to compliance with relevant government guide prices (if any).
Given that YY property is positioned as a mid-to-high-end property service provider, for real estate projects that the company develops and legally delivers to YY property for property service, a price matched with YY property's service level will be reasonably determined on the premise of meeting the requirements of relevant government guide prices (if any) and other pricing laws.
Meanwhile, the company supports policies on the aspects of the early intervention fee, the repair fee, the house inspection fee and the like of the property project accepted by the YY property according to the marketization principle.
……”
The abstract text obtained by the abstract text generation method in the embodiment is as follows:
"[XX stock control: the company plans to transfer XX property for 1 billion yuan] XX stock control announced on the evening of June 30 that the company plans to transfer 100% of the equity of Shanghai XX property service company to YY property management service company for 1 billion yuan. Meanwhile, the two parties will become strategic partners, and in the 5 years from 2018 to 2022 real estate projects developed by the company will take YY property as the priority property partner. XX property is a primary supplier of the XX group and realized 1.22 hundred-million yuan in revenue and 286.33 ten-thousand yuan in profit in 2016."
Fig. 9 is a schematic diagram of key texts extracted from the text content of an announcement file issued by an exchange according to an embodiment of the present application. As shown in fig. 9, the key paragraphs extracted from the original announcement file are: "The operating performance for the first half of 2017 is expected to be a loss, with net profit attributable to shareholders of the listed company of about -4,306 ten-thousand yuan, a decrease of 6,639 ten-thousand yuan compared with the same period last year." and "The operating performance for the first half of 2017 is expected to be a loss, with net profit attributable to shareholders of the listed company of about -4,306 ten-thousand yuan, a loss reduction of 6,639 ten-thousand yuan compared with the same period last year."; the extracted key whole sentences are: "The main reasons for the loss reduction are: compared with the same period last year, the company's sales of cement and commodity clinker increased in the period, the main-business gross profit margin improved, and the total of the three expenses decreased considerably." and "The main reasons for the loss are: some subsidiaries and production lines of the company put into operation in recent years have high costs, and in the current market environment their production and sales volumes are low, so capacity cannot be effectively utilized."; the extracted key half sentences are: "Fujian Cement (stock code 6008XX)" and "the net profit attributable to shareholders of the listed company in the same period last year was -1.09 hundred-million yuan.".
Fig. 10 is a schematic diagram illustrating extraction of key texts from an announcement file according to an embodiment of the present application. As shown in fig. 10, an announcement file in PDF format is downloaded from an announcement source and converted into a TXT file to obtain the text content of the announcement file; key paragraphs, key whole sentences and key half sentences can then be extracted from the text content using the TextRank algorithm, according to the preset key-text positions, chapter sequence cue words and keyword information corresponding to the announcement file.
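The TextRank step can be sketched as follows. Word-overlap similarity and a damping factor of 0.85 are the standard TextRank choices, and the `\w+` tokenization here is a simplification — the embodiment would use Chinese word segmentation:

```python
import re

def textrank_sentences(sentences, damping=0.85, iters=50):
    """Score sentences by TextRank over a word-overlap similarity graph."""
    words = [set(re.findall(r"\w+", s.lower())) for s in sentences]
    n = len(sentences)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and words[i] and words[j]:
                # overlap normalized by combined length, as in classic TextRank
                sim[i][j] = len(words[i] & words[j]) / (len(words[i]) + len(words[j]))
    out_sum = [sum(row) for row in sim]  # total outgoing weight per sentence
    scores = [1.0] * n
    for _ in range(iters):  # power iteration until scores stabilize
        scores = [
            (1 - damping) + damping * sum(
                sim[j][i] * scores[j] / out_sum[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores
```

Sentences well connected to the rest of the document score higher and become candidates for key whole sentences.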
Fig. 11 is a diagram illustrating splicing of key texts extracted from an announcement file. As shown in fig. 11, the extracted key texts are first simplified: redundant information is removed, and missing or disordered sequence numbers as well as unit and data errors are checked and corrected. A matched template is then allocated to each coarsely extracted key text, and the extracted key texts are spliced using the templates to obtain the abstract text. Finally, the text logic of the abstract text is identified and connective words are matched for it, so as to further refine the generated abstract text.
Fig. 12 is a diagram illustrating identification of the text category of an announcement file. As shown in fig. 12, an announcement file and its text content are obtained by monitoring an announcement source; word segmentation is performed on the text content to obtain a word sequence; words whose TF values reach a preset threshold are screened from the word sequence; the screened words are input into an SVM model to obtain a primary classification result; the text content of the announcement file is compared with historical data corresponding to the primary classification result; and if the matching degree reaches a preset value, the primary classification result is taken as the text category of the announcement file.
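The flow of fig. 12 — term-frequency screening, a vector classifier, then verification against historical data — can be sketched as follows. A nearest-centroid classifier stands in here for the SVM model, and both thresholds are invented for illustration:

```python
from collections import Counter

def tf_screen(tokens, tf_threshold):
    """Keep words whose term frequency reaches the preset threshold."""
    tf = Counter(tokens)
    return {w: c for w, c in tf.items() if c >= tf_threshold}

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    shared = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in shared)
    na = sum(v * v for v in a.values()) ** 0.5
    nb = sum(v * v for v in b.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def classify(tokens, centroids, tf_threshold=1, match_threshold=0.1):
    """Primary classification plus the historical-data check of fig. 12:
    the primary result is accepted only if it matches well enough."""
    vec = tf_screen(tokens, tf_threshold)
    label, centroid = max(centroids.items(), key=lambda kv: cosine(vec, kv[1]))
    return label if cosine(vec, centroid) >= match_threshold else None
```

Here `centroids` plays the role of the historical data per category; returning `None` corresponds to rejecting the primary classification result.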
Fig. 13 is a diagram illustrating splicing of the extracted key texts. As shown in fig. 13, the extracted key texts are matched with the template of the corresponding text category, the extracted key texts are matched with connecting words through the template, and the key texts are then spliced through the matched connecting words to obtain the abstract text.
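A minimal sketch of this splicing: the template for a text category carries the connecting words, and the extracted key texts fill its slots in order. The template string and the category name are invented for illustration:

```python
import string

# hypothetical per-category templates; connecting words live in the template
TEMPLATES = {
    "equity_transfer": "{company} announced that {deal}; meanwhile, {extra}.",
}

def splice(category, key_texts):
    """Fill the category's template with the extracted key texts in order."""
    template = TEMPLATES[category]
    fields = [f for _, f, _, _ in string.Formatter().parse(template) if f]
    return template.format(**dict(zip(fields, key_texts)))
```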
Figs. 14 and 15 are schematic diagrams showing matching of connective words for an abstract text whose logic structure type is a parallel structure. With reference to figs. 14 and 15, the sentence pattern of the generated abstract text is judged to be a parallel structure; the parallel abstract text is separated into logic unit texts, i.e., parallel sub-clauses; the head contents and tail contents of the parallel sub-clauses are then respectively merged, and the refined abstract text is obtained by matching the connective "are respectively".
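The head-merging step of figs. 14 and 15 can be sketched with a longest-common-word-prefix heuristic. Splitting on semicolons and joining with the connective "respectively" are assumptions about details the figures only illustrate:

```python
def merge_parallel(summary):
    """Merge parallel clauses sharing a head, inserting 'respectively'."""
    clauses = [c.strip() for c in summary.split(";") if c.strip()]
    if len(clauses) < 2:
        return summary
    split = [c.split() for c in clauses]
    # longest word prefix shared by all clauses
    i = 0
    while all(len(s) > i for s in split) and len({s[i] for s in split}) == 1:
        i += 1
    if i == 0:  # no shared head: leave the summary unchanged
        return summary
    head = " ".join(split[0][:i])
    tails = [" ".join(s[i:]) for s in split]
    return f"{head} respectively {' and '.join(tails)}"
```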
In one embodiment, as shown in fig. 16, a summary text generation apparatus 1600 is provided. Referring to fig. 16, the digest text generation apparatus 1600 includes: an acquisition module 1602, a query module 1604, an extraction module 1606, an identification module 1608, and a splicing module 1610.
An obtaining module 1602, configured to obtain a canonical text and a corresponding category label;
the query module 1604 is configured to query a preset paradigm characteristic corresponding to the category label;
an extracting module 1606, configured to extract a key text from the canonical form text according to the canonical form feature;
an identifying module 1608 for identifying a text category to which the normalized text belongs;
the splicing module 1610 is configured to splice the extracted key texts according to the template corresponding to the text category to obtain the abstract text.
According to the abstract text generation device, the key text can be extracted from the normalized text through the searched normal form characteristics corresponding to the normalized text, and after the text category corresponding to the normalized text is identified, the extracted key text can be spliced by depending on the template corresponding to the text category, so that the abstract text is obtained. Because the whole process of generating the abstract text does not need manual participation, the efficiency of rewriting the text can be greatly improved.
In one embodiment, as shown in fig. 17, the obtaining module 1602 in the abstract text generating apparatus 1600 includes: an announcement file source monitoring module 1702, an announcement file obtaining module 1704, a normalized text extracting module 1706, and a category tag reading module 1708.
An announcement file source monitoring module 1702, configured to monitor an announcement file source;
an announcement file obtaining module 1704, configured to obtain a newly added announcement file when it is monitored that the announcement file is newly added to the announcement file source;
a normalized text extracting module 1706, configured to extract a normalized text from the bulletin file;
a category tag reading module 1708, configured to read a category tag associated with the announcement file.
In one embodiment, the key text includes at least one of a key paragraph, a key whole sentence, and a key half sentence; as shown in fig. 18, the extraction module 1606 in the abstract text generation apparatus 1600 includes: a key paragraph extraction module 1802, a key whole sentence extraction module 1804, and a key half sentence extraction module 1806.
A key paragraph extracting module 1802, configured to extract a key paragraph from the normalized text according to a paragraph position when the canonical form feature includes the paragraph position of the key paragraph in the normalized text;
a key whole sentence extracting module 1804, configured to, when the paradigm feature comprises a sequence text cue word, extract a key whole sentence from a position in the paradigm text corresponding to the sequence text cue word;
a key half sentence extracting module 1806, configured to extract a key half sentence including the keyword from the canonical text when the canonical form feature includes the keyword.
In one embodiment, the key paragraph extraction module 1802 includes: a first screening module, a first half sentence weight value obtaining module, a first half sentence screening module, and a key paragraph forming module. The first screening module is used for screening first half sentences split from the paragraph position in the normalized text; the first half sentence weight value obtaining module is used for acquiring a weight value corresponding to each screened first half sentence; the first half sentence screening module is used for determining, among the screened first half sentences, first half sentences whose weight values meet a first preset condition; and the key paragraph forming module is used for forming a key paragraph from the consecutive first half sentences that meet the first preset condition.
In one embodiment, the key whole sentence extraction module 1804 includes: the second filtering module, the second half sentence weight value obtaining module, the second half sentence filtering module and the key whole sentence forming module. The second screening module is used for screening a second half sentence corresponding to the prompt words of the sequence text in the normalized text; the second half sentence weight value acquisition module is used for acquiring a weight value corresponding to the screened second half sentence; the second half sentence screening module is used for determining a second half sentence with a weight value meeting a second preset condition in the screened second half sentence; and the key whole sentence forming module is used for forming a continuous second half sentence which meets a second preset condition into a key whole sentence.
In one embodiment, the key half sentence extraction module 1806 includes: the third screening module, the third half sentence weight value obtaining module and the key half sentence forming module. The third screening module is used for screening a third half sentence comprising the key words in the half sentences split from the paradigm text; the third half sentence weight value acquisition module is used for acquiring a weight value corresponding to the screened third half sentence; and the key half sentence forming module is used for taking a third half sentence with the weight value meeting a third preset condition as a key half sentence.
In one embodiment, the recognition module 1608 further includes a screening module for screening words belonging to a predetermined set of words from the normalized text; the recognition module is also used for recognizing the text category to which the normalized text belongs according to the screened words.
In one embodiment, the recognition module 1608 comprises an importance level acquisition module, a text vector construction module, and a text category recognition module. The importance degree acquisition module is used for acquiring the importance degree of the screened words to the normalized text; the text vector construction module is used for constructing a text vector for expressing the canonical text according to the importance degree; and the text type identification module is used for inputting the text vector into the trained machine learning model to obtain the text type.
In one embodiment, the identification module 1608 includes a classification module, a historical data acquisition module, a comparison module, and a text category determination module. The classification module is used for classifying the normalized texts to obtain an initial classification result; the historical data acquisition module is used for acquiring historical data corresponding to the primary classification result; the comparison module is used for comparing the normalized text with the historical data to obtain a comparison result; and the text type determining module is used for taking the primary classification result as the text type to which the normalized text belongs when the comparison result meets a fourth preset condition.
In one embodiment, the splicing module 1610 includes: a template allocation module, a connecting word matching module, and a key text splicing module. The template allocation module is used for allocating a template corresponding to the text category to each extracted key text; the connecting word matching module is used for matching the extracted key texts with corresponding connecting words through the allocated templates; and the key text splicing module is used for splicing the key texts through the corresponding connecting words to obtain the abstract text.
In one embodiment, as shown in fig. 19, the abstract text generating apparatus 1600 further includes: a digest text logic determination module 1902, a digest text separation module 1904, and a reorganization module 1906.
A summary text logic determination module 1902, configured to determine a logical structure type of the summary text;
a summarized text separating module 1904, configured to separate a logic unit text from the summarized text;
the restructuring module 1906 is configured to restructure the text of the logic unit according to the text restructuring manner corresponding to the logic structure type, so as to obtain a restructured abstract text.
In one embodiment, the digest text generation apparatus further includes: the device comprises a user data acquisition module, a push priority acquisition module and a push module. The user data acquisition module is used for acquiring user data; the push priority acquisition module is used for determining the push priority of the abstract text according to the user data and the abstract text; and the pushing module is used for pushing the abstract text to the terminal corresponding to the user data according to the pushing priority.
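A minimal sketch of the push-priority decision implemented by these modules: score each abstract text against the user's interest keywords (taken here as the "user data") and push in descending score order. The keyword-hit scoring rule is an invented placeholder for whatever priority function an implementation uses:

```python
def push_priority(summaries, user_keywords):
    """Order abstract texts by keyword hits; higher scores are pushed first."""
    def score(summary):
        return sum(1 for kw in user_keywords if kw in summary)
    return sorted(summaries, key=score, reverse=True)
```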
According to the abstract text generation device, the key text can be extracted from the normal form text through the searched normal form characteristics corresponding to the normal form text, and after the text type corresponding to the normal form text is identified, the extracted key text can be spliced by depending on the template corresponding to the text type, so that the abstract text is obtained. Because the whole process of generating the abstract text does not need manual participation, the efficiency of rewriting the text can be greatly improved.
FIG. 20 is a diagram that illustrates an internal structure of the computer device in one embodiment. The computer device may specifically be the server 120 in fig. 1. As shown in fig. 20, the computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program that, when executed by the processor, causes the processor to implement the digest text generation method. The internal memory may also have a computer program stored therein, which when executed by the processor, causes the processor to perform the method for generating a text summary.
Those skilled in the art will appreciate that the architecture shown in fig. 20 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than those shown, or combine certain components, or have a different arrangement of components.
In one embodiment, the abstract text generation apparatus provided in the present application may be implemented in the form of a computer program, and the computer program may be run on a computer device as shown in fig. 20. The memory of the computer device may store the various program modules constituting the digest text generation apparatus, such as the acquisition module 1602, the query module 1604, the extraction module 1606, the recognition module 1608, and the splicing module 1610 shown in fig. 16. The computer program constituted by these program modules causes the processor to execute the steps of the digest text generation method of the embodiments of the present application described in this specification.
For example, the computer device shown in fig. 20 may execute step S202 by the acquisition module 1602 in the digest text generation apparatus shown in fig. 16. Step S204 is performed by the query module 1604. Step S206 is performed by the extraction module 1606. Step S208 is performed by the recognition module 1608. Step S210 is performed by the splicing module 1610.
In one embodiment, a computer readable storage medium is provided, having a computer program stored thereon, which, when executed by a processor, causes the processor to perform the steps of: acquiring a normalized text and a corresponding category label; querying preset paradigm features corresponding to the category label; extracting key texts from the normalized text according to the paradigm features; identifying a text category to which the normalized text belongs; and splicing the extracted key texts according to the template corresponding to the text category to obtain the abstract text.
In one embodiment, the computer program causes the processor when executing the step of obtaining the canonical text and the corresponding category label to further specifically perform the following steps: monitoring a source of announcement documents; when monitoring that an announcement file is newly added to an announcement file source, acquiring the newly added announcement file; extracting a normalized text from the announcement file; a category label associated with the announcement file is read.
In one embodiment, the key text comprises at least one of a key paragraph, a key whole sentence, and a key half sentence; the computer program causes the processor when performing the step of extracting the key text from the canonical representation text according to the canonical representation feature to specifically further perform the steps of: when the paradigm feature comprises a paragraph position of the key paragraph in the paradigm text, extracting the key paragraph from the paradigm text according to the paragraph position; when the paradigm characteristic comprises a sequence text cue word, extracting a key whole sentence from a position corresponding to the sequence text cue word in the paradigm text; when the paradigm features include keywords, key half sentences including the keywords are extracted from the paradigm texts.
In one embodiment, the computer program causes the processor in performing the step of extracting key paragraphs from the normalized text according to paragraph positions to perform in particular the further steps of: screening a first half sentence split from a paragraph position in the normalized text; acquiring a weight value corresponding to the screened first sentence; determining a first half sentence with a weight value meeting a first preset condition in the screened first half sentences; and forming a key paragraph by the continuous first half sentence which meets the first preset condition.
In one embodiment, the computer program causes the processor when performing the step of extracting key whole sentences from positions in the normalized text corresponding to the sequence text prompt words to further specifically perform the steps of: screening a second half sentence corresponding to the prompt word of the sequence text in the normalized text; acquiring a weighted value corresponding to the screened second half sentence; determining a second half sentence with a weight value meeting a second preset condition in the screened second half sentence; and forming a key whole sentence by using continuous second half sentences meeting second preset conditions.
In one embodiment, the computer program causes the processor in performing the step of extracting key half-sentences comprising keywords from the canonicalized text to perform in particular the further steps of: screening a third half sentence comprising the key words from the half sentences split from the normalized texts; acquiring a weighted value corresponding to the screened third half sentence; and taking the third half sentence with the weight value meeting a third preset condition as a key half sentence.
In one embodiment, the computer program causes the processor when performing the step of identifying a text category to which the canonical text belongs from the filtered words specifically further performs the steps of: screening words belonging to a preset word set from the normalized text; and identifying the text category to which the normalized text belongs according to the screened words.
In one embodiment, the computer program causes the processor when performing the step of identifying a text category to which the canonical text belongs to further specifically perform the steps of: acquiring the importance degree of the screened words to the normalized text; constructing a text vector representing the canonical text according to the importance degree; and inputting the text vector into the trained machine learning model to obtain the text category.
In one embodiment, the computer program causes the processor when performing the step of identifying a text category to which the canonical text belongs to further specifically perform the steps of: classifying the normalized texts to obtain an initial classification result; acquiring historical data corresponding to the primary classification result; comparing the normalized text with the historical data to obtain a comparison result; and when the comparison result meets a fourth preset condition, taking the primary classification result as the text category to which the normalized text belongs.
In one embodiment, when performing the step of combining the extracted key texts according to the template corresponding to the text category to obtain the abstract text, the computer program further causes the processor to specifically perform the steps of: allocating, to each extracted key text, a template corresponding to the text category; matching each extracted key text with a corresponding connecting word through the allocated template; and splicing the key texts through the corresponding connecting words to obtain the abstract text.
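The template allocation and splicing steps above might look like the following minimal sketch, in which each template is modeled as a hypothetical format string that carries its own connecting words (the patent does not specify the template representation):

```python
def splice_key_texts(key_texts, templates):
    """Sketch of template-based splicing: each extracted key text is
    allocated one template whose fixed text supplies the connecting
    words; the filled pieces are spliced into the abstract text."""
    pieces = [template.format(text)
              for template, text in zip(templates, key_texts)]
    return " ".join(pieces)
```

Usage: two key texts and two category-specific templates yield one connected summary sentence pair.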
In one embodiment, the computer program, when executed by the processor, further causes the processor to perform the steps of: determining the logical structure type of the abstract text; separating logic unit texts from the abstract text; and recombining the logic unit texts according to a text recombination mode corresponding to the logical structure type to obtain a recombined abstract text.
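As one concrete case of the recombination step (the parallel structure type detailed later in the claims), the merge might be sketched as below. The split of each logic unit text into a (head, tail) pair and the connecting word are illustrative assumptions:

```python
def reorganize_parallel(units, connector="and"):
    """Sketch of parallel-structure recombination: each logic unit text
    is a (head, tail) pair; heads are merged in a parallel expression,
    duplicate tails are merged, and the parallel connecting word links
    the result."""
    heads = [head for head, _ in units]
    tails = [tail for _, tail in units]
    if len(heads) > 1:
        merged_head = ", ".join(heads[:-1]) + f" {connector} " + heads[-1]
    else:
        merged_head = heads[0]
    # Merge tail texts, dropping exact duplicates while keeping order.
    merged_tail = "; ".join(dict.fromkeys(tails))
    return f"{merged_head} {merged_tail}"
```

When two units share an identical tail, the tail is stated once, which is the space saving the parallel recombination is after.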
In one embodiment, the computer program, when executed by the processor, further causes the processor to perform the steps of: acquiring user data; determining the pushing priority of the abstract text according to the user data and the abstract text; and pushing the abstract text to the terminal corresponding to the user data according to the pushing priority.
The computer storage medium extracts the key texts from the normalized text through the queried paradigm features corresponding to the normalized text and, after identifying the text category to which the normalized text belongs, splices the extracted key texts by means of the template corresponding to the text category to obtain the abstract text. Because the whole process of generating the abstract text requires no manual participation, the efficiency of rewriting text can be greatly improved.
In one embodiment, there is provided a computer device comprising a memory and a processor, the memory having stored therein a computer program that, when executed by the processor, causes the processor to perform the steps of: acquiring a normalized text and a corresponding category label; querying preset paradigm characteristics corresponding to the category labels; extracting a key text from the normal text according to the normal features; identifying a text category to which the normalized text belongs; and combining the extracted key texts according to the templates corresponding to the text categories to obtain the abstract text.
In one embodiment, when performing the step of acquiring the normalized text and the corresponding category label, the computer program further causes the processor to specifically perform the steps of: monitoring an announcement file source; when it is monitored that an announcement file is newly added to the announcement file source, acquiring the newly added announcement file; extracting the normalized text from the announcement file; and reading the category label associated with the announcement file.
In one embodiment, the key text includes at least one of a key paragraph, a key whole sentence, and a key half sentence; when performing the step of extracting the key text from the normalized text according to the paradigm feature, the computer program further causes the processor to specifically perform the steps of: when the paradigm feature includes a paragraph position of the key paragraph in the normalized text, extracting the key paragraph from the normalized text according to the paragraph position; when the paradigm feature includes a sequence text cue word, extracting a key whole sentence from the position in the normalized text corresponding to the sequence text cue word; and when the paradigm feature includes a keyword, extracting a key half sentence including the keyword from the normalized text.
In one embodiment, when performing the step of extracting the key paragraph from the normalized text according to the paragraph position, the computer program further causes the processor to specifically perform the steps of: screening first half sentences split from the paragraph position in the normalized text; acquiring weight values corresponding to the screened first half sentences; determining, among the screened first half sentences, those whose weight values meet a first preset condition; and forming a key paragraph from consecutive first half sentences that meet the first preset condition.
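The "consecutive half sentences meeting the first preset condition" step above can be sketched as a run-grouping pass. Treating the condition as a simple weight threshold is an assumption; the patent leaves the condition and the weighting unspecified:

```python
def form_key_paragraphs(half_sentences, weights, threshold):
    """Sketch: keep the first half sentences whose weight value meets
    the first preset condition (assumed: weight >= threshold), then join
    each maximal run of consecutive kept half sentences into a key
    paragraph."""
    paragraphs, run = [], []
    for half, weight in zip(half_sentences, weights):
        if weight >= threshold:
            run.append(half)
        elif run:
            # A low-weight half sentence ends the current run.
            paragraphs.append(", ".join(run))
            run = []
    if run:
        paragraphs.append(", ".join(run))
    return paragraphs
```

Note that non-consecutive qualifying half sentences end up in separate key paragraphs rather than being glued across the gap.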
In one embodiment, when performing the step of extracting a key whole sentence from the position in the normalized text corresponding to the sequence text cue word, the computer program further causes the processor to specifically perform the steps of: screening second half sentences corresponding to the sequence text cue word in the normalized text; acquiring weight values corresponding to the screened second half sentences; determining, among the screened second half sentences, those whose weight values meet a second preset condition; and forming a key whole sentence from consecutive second half sentences that meet the second preset condition.
In one embodiment, when performing the step of extracting a key half sentence including a keyword from the normalized text, the computer program further causes the processor to specifically perform the steps of: screening, from the half sentences split from the normalized text, third half sentences that include the keyword; acquiring weight values corresponding to the screened third half sentences; and taking a third half sentence whose weight value meets a third preset condition as a key half sentence.
In one embodiment, when performing the step of identifying the text category to which the normalized text belongs, the computer program further causes the processor to specifically perform the steps of: screening, from the normalized text, words belonging to a preset word set; and identifying the text category to which the normalized text belongs according to the screened words.
In one embodiment, when performing the step of identifying the text category to which the normalized text belongs according to the screened words, the computer program further causes the processor to specifically perform the steps of: acquiring the importance degree of each screened word to the normalized text; constructing a text vector representing the normalized text according to the importance degrees; and inputting the text vector into a trained machine learning model to obtain the text category.
In one embodiment, when performing the step of identifying the text category to which the normalized text belongs, the computer program further causes the processor to specifically perform the steps of: classifying the normalized text to obtain an initial classification result; acquiring historical data corresponding to the initial classification result; comparing the normalized text with the historical data to obtain a comparison result; and when the comparison result meets a fourth preset condition, taking the initial classification result as the text category to which the normalized text belongs.
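The historical-data check above can be sketched as a similarity gate. Modeling the "fourth preset condition" as a minimum Jaccard word overlap with texts previously filed under the initial category is an assumption; the patent does not define the comparison:

```python
def confirm_category(text, initial_category, history, min_similarity=0.5):
    """Sketch: accept the initial classification result only if the
    normalized text is similar enough (assumed: Jaccard word overlap
    >= min_similarity) to some historical text of that category."""
    words = set(text.lower().split())
    best = 0.0
    for past_text in history.get(initial_category, []):
        past_words = set(past_text.lower().split())
        union = words | past_words
        if union:
            best = max(best, len(words & past_words) / len(union))
    # Accept the initial classification only when the condition is met.
    return initial_category if best >= min_similarity else None
```

Returning `None` stands in for whatever fallback the system uses when the comparison fails (for example, re-classification or manual review).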
In one embodiment, when performing the step of combining the extracted key texts according to the template corresponding to the text category to obtain the abstract text, the computer program further causes the processor to specifically perform the steps of: allocating, to each extracted key text, a template corresponding to the text category; matching each extracted key text with a corresponding connecting word through the allocated template; and splicing the key texts through the corresponding connecting words to obtain the abstract text.
In one embodiment, the computer program, when executed by the processor, further causes the processor to perform the steps of: determining the logical structure type of the abstract text; separating logic unit texts from the abstract text; and recombining the logic unit texts according to a text recombination mode corresponding to the logical structure type to obtain a recombined abstract text.
In one embodiment, the computer program, when executed by the processor, further causes the processor to perform the steps of: acquiring user data; determining the pushing priority of the abstract text according to the user data and the abstract text; and pushing the abstract text to the terminal corresponding to the user data according to the pushing priority.
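The push-priority step above might be scored as follows. This is an illustrative formula only; the patent fixes neither the user-data fields nor the scoring rule, and the `interest_tags` field is hypothetical:

```python
def push_priority(user_data, summary_text):
    """Illustrative priority score: priority rises with the overlap
    between the user's interest tags (a hypothetical field of the user
    data) and words appearing in the abstract text."""
    interests = {tag.lower() for tag in user_data.get("interest_tags", [])}
    words = set(summary_text.lower().split())
    return len(interests & words)
```

Summaries would then be pushed to each user's terminal in descending order of this score.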
The computer device extracts the key texts from the normalized text through the queried paradigm features corresponding to the normalized text and, after identifying the text category to which the normalized text belongs, combines the extracted key texts by means of the template corresponding to the text category to obtain the abstract text. Because the whole process of generating the abstract text requires no manual participation, the efficiency of rewriting text can be greatly improved.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing relevant hardware; the program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory, among others. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is specific and detailed, but not construed as limiting the scope of the present application. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, and these are all within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (34)

1. A summary text generation method comprises the following steps:
acquiring a normalized text and a corresponding category label, wherein category labels are used for classifying files of different paradigms, and files with different category labels have different paradigms;
querying a preset paradigm characteristic corresponding to the category label;
extracting a key text from the normalized text according to the paradigm characteristics;
identifying a text category to which the normalized text belongs, wherein the text category is a category corresponding to text content of the normalized text; the text category is different from the category label;
and combining the extracted key texts according to the templates corresponding to the text categories to obtain the abstract text.
2. The method of claim 1, wherein obtaining the normalized text and the corresponding category label comprises:
monitoring an announcement file source;
when monitoring that an announcement file is newly added to an announcement file source, acquiring the newly added announcement file;
extracting a normalized text from the announcement file;
reading a category label associated with the announcement file.
3. The method of claim 1, wherein the key text comprises at least one of a key paragraph, a key whole sentence, and a key half sentence; the extracting the key text from the normalized text according to the paradigm characteristics comprises:
when the paradigm characteristics comprise a paragraph position of a key paragraph in the normalized text, extracting the key paragraph from the normalized text according to the paragraph position;
when the paradigm characteristics comprise a sequence text cue word, extracting a key whole sentence from a position corresponding to the sequence text cue word in the normalized text;
and when the paradigm characteristics comprise a keyword, extracting a key half sentence comprising the keyword from the normalized text.
4. The method of claim 3, wherein extracting key paragraphs from the normalized text according to the paragraph positions comprises:
screening a first half sentence split from the paragraph position in the normalized text;
acquiring a weight value corresponding to the screened first half sentence;
determining a first half sentence with a weight value meeting a first preset condition in the screened first half sentence;
and forming a key paragraph by the continuous first half sentence which meets the first preset condition.
5. The method of claim 3, wherein the extracting the key whole sentence from the position in the normalized text corresponding to the sequence text cue word comprises:
screening a second half sentence corresponding to the prompt words of the sequence text in the normalized text;
acquiring a weighted value corresponding to the screened second half sentence;
determining a second half sentence with a weight value meeting a second preset condition in the screened second half sentence;
and forming the continuous second half sentences which meet second preset conditions into key whole sentences.
6. The method of claim 3, wherein extracting key half sentences comprising the keywords from the normalized text comprises:
screening a third half sentence comprising the key words from the half sentences split from the normalized texts;
acquiring a weight value corresponding to the screened third half sentence;
and taking the third half sentence with the weight value meeting a third preset condition as a key half sentence.
7. The method according to any one of claims 1 to 6, wherein the identifying the text category to which the normalized text belongs comprises:
screening words belonging to a preset word set from the normalized text;
and identifying the text category to which the normalized text belongs according to the screened words.
8. The method according to claim 7, wherein the identifying the text category to which the normalized text belongs according to the screened words comprises:
acquiring the importance degree of the screened words to the normalized text;
constructing a text vector representing the normalized text according to the importance degree;
and inputting the text vector into a trained machine learning model to obtain the text category.
9. The method according to any one of claims 1 to 6, wherein the identifying the text category to which the normalized text belongs comprises:
classifying the normalized texts to obtain an initial classification result;
acquiring historical data corresponding to the primary classification result;
comparing the normalized text with the historical data to obtain a comparison result;
and when the comparison result meets a fourth preset condition, taking the primary classification result as the text category to which the normalized text belongs.
10. The method according to any one of claims 1 to 6, wherein the combining the extracted key texts according to the templates corresponding to the text categories to obtain the abstract text comprises:
respectively allocating templates corresponding to the text categories for each extracted key text;
matching the extracted key texts with corresponding connecting words through the distributed templates;
and splicing the key texts through the corresponding connecting words to obtain abstract texts.
11. The method according to any one of claims 1 to 6, further comprising:
acquiring user data;
determining the pushing priority of the abstract text according to the user data and the abstract text;
and pushing the abstract text to a terminal corresponding to the user data according to the pushing priority.
12. The method according to any one of claims 1 to 6, further comprising:
determining the logic structure type of the abstract text;
separating a logic unit text from the abstract text;
and recombining the logical unit texts according to a text recombination mode corresponding to the logical structure type to obtain a recombined abstract text.
13. The method according to claim 12, wherein when the logical structure type is a parallel structure type, the recombining the logical unit texts according to the text recombination manner corresponding to the logical structure type to obtain a recombined abstract text comprises:
determining a head text and a tail text in each separated logic unit text;
merging the head texts in a parallel expression mode to obtain merged head texts;
merging the tail texts in a parallel expression mode to obtain merged tail texts;
and linking the combined head text and the combined tail text through the parallel linking words corresponding to the parallel structure types to obtain the recombined abstract text.
14. The method according to claim 12, wherein when the logical structure type is a progressive structure type, the recombining the logical unit texts according to the text recombination manner corresponding to the logical structure type to obtain a recombined abstract text comprises:
determining the progressive sequence of the texts of the logic units;
acquiring progressive connection words corresponding to the progressive structure types and corresponding to the progressive sequence;
and linking the logic unit texts according to the progressive sequence and the corresponding progressive linking words to obtain a recombined abstract text.
15. The method according to claim 12, wherein when the logical structure type is a turn structure type, the recombining the logical unit texts according to the text recombination manner corresponding to the logical structure type to obtain a recombined abstract text comprises:
identifying a logic unit text of basic semantics and a logic unit text of transition semantics from the separated logic unit texts;
determining turning conjunctions in the abstract text;
and deleting the logic unit text of the basic semantics and the turning conjunctions from the abstract text to obtain a recombined abstract text.
16. The method according to claim 12, wherein when the logical structure type is an overview structure type, the recombining the logical unit texts according to the text recombination manner corresponding to the logical structure type to obtain a recombined abstract text comprises:
determining a parent-level logic structure type of the abstract text and a child-level logic structure type of each logic unit text;
separating corresponding sub logic unit texts from the logic unit texts respectively;
recombining the corresponding sub-logic unit texts separated from each logic unit text according to the text recombination modes corresponding to the corresponding sub-level logic structure types to obtain recombined logic unit texts;
and recombining the recombined logic unit texts according to the text recombination mode corresponding to the parent-level logic structure type to obtain a recombined abstract text.
17. An abstract text generation apparatus, the apparatus comprising:
the acquisition module is used for acquiring a normalized text and a corresponding category label, wherein category labels are used for classifying files of different paradigms, and files with different category labels have different paradigms;
the query module is used for querying the preset paradigm characteristics corresponding to the category labels;
the extraction module is used for extracting a key text from the normalized text according to the paradigm features;
the recognition module is used for recognizing the text category to which the normalized text belongs, wherein the text category is a category corresponding to the text content of the normalized text; the text category is different from the category label;
and the splicing module is used for splicing the extracted key texts according to the template corresponding to the text category to obtain the abstract text.
18. The apparatus of claim 17, wherein the acquisition module comprises: an announcement file source monitoring module, an announcement file acquisition module, a normalized text extraction module, and a category label reading module;
the announcement file source monitoring module is used for monitoring an announcement file source;
the announcement file acquisition module is used for acquiring a newly added announcement file when monitoring that the announcement file is newly added to an announcement file source;
the normalized text extraction module is used for extracting the normalized text from the announcement file;
the category label reading module is used for reading the category label associated with the announcement file.
19. The apparatus of claim 17, wherein the extraction module comprises: the system comprises a key paragraph extraction module, a key whole sentence extraction module and a key half sentence extraction module;
the key paragraph extraction module is used for extracting a key paragraph from the normalized text according to the paragraph position when the paradigm feature comprises the paragraph position of the key paragraph in the normalized text;
the key whole sentence extraction module is used for extracting a key whole sentence from a position corresponding to the sequence text cue word in the normalized text when the paradigm feature comprises the sequence text cue word;
and the key half sentence extraction module is used for extracting a key half sentence comprising the keyword from the normalized text when the paradigm feature comprises the keyword.
20. The apparatus of claim 19, wherein the key paragraph extraction module comprises: a first screening module, a first half sentence weight value obtaining module, a first half sentence screening module, and a key paragraph forming module;
the first screening module is used for screening first half sentences split from the paragraph position in the normalized text;
the first half sentence weight value obtaining module is used for obtaining a weight value corresponding to each screened first half sentence;
the first half sentence screening module is used for determining, among the screened first half sentences, a first half sentence whose weight value meets a first preset condition;
and the key paragraph forming module is used for forming a key paragraph from consecutive first half sentences that meet the first preset condition.
21. The apparatus of claim 19, wherein the key whole sentence extraction module comprises: the second filtering module, the second half sentence weight value obtaining module, the second half sentence filtering module and the key whole sentence forming module;
the second screening module is used for screening a second half sentence corresponding to the sequence text cue word in the normalized text;
the second half sentence weight value obtaining module is used for obtaining a weight value corresponding to the screened second half sentence;
the second half sentence screening module is used for determining a second half sentence with a weight value meeting a second preset condition in the screened second half sentence;
and the key whole sentence forming module is used for forming the continuous second half sentences which meet the second preset condition into key whole sentences.
22. The apparatus of claim 19, wherein the key half sentence extraction module comprises: the third screening module, the third half sentence weight value obtaining module and the key half sentence forming module;
the third screening module is used for screening a third half sentence comprising the keyword from the half sentences split from the normalized texts;
the third half sentence weight value obtaining module is used for obtaining a weight value corresponding to the screened third half sentence;
and the key half sentence forming module is used for taking the third half sentence with the weight value meeting a third preset condition as the key half sentence.
23. The apparatus of any one of claims 17 to 22, wherein the identification module further comprises a screening module;
the screening module is used for screening words belonging to a preset word set from the normalized text;
and the identification module is also used for identifying the text category to which the normalized text belongs according to the screened words.
24. The apparatus according to claim 23, wherein the recognition module comprises an importance level obtaining module, a text vector construction module and a text category recognition module;
the importance degree acquisition module is used for acquiring the importance degree of the screened words to the normalized text;
the text vector construction module is used for constructing a text vector representing the normalized text according to the importance degree;
and the text type identification module is used for inputting the text vector into the trained machine learning model to obtain the text type.
25. The apparatus according to any one of claims 17 to 22, wherein the identification module comprises a classification module, a historical data acquisition module, a comparison module and a text category determination module;
the classification module is used for classifying the normalized texts to obtain an initial classification result;
the historical data acquisition module is used for acquiring historical data corresponding to the primary classification result;
the comparison module is used for comparing the normalized text with the historical data to obtain a comparison result;
and the text type determining module is used for taking the initial classification result as the text type to which the normalized text belongs when the comparison result meets a fourth preset condition.
26. The apparatus of any one of claims 17 to 22, wherein the splicing module comprises: a template allocation module, a connecting word matching module, and a key text splicing module;
the template allocation module is used for allocating, to each extracted key text, a template corresponding to the text category;
the connecting word matching module is used for matching each extracted key text with a corresponding connecting word through the allocated template;
and the key text splicing module is used for splicing the key texts through the corresponding connecting words to obtain the abstract text.
27. The apparatus of any one of claims 17 to 22, further comprising:
the user data acquisition module is used for acquiring user data;
the pushing priority acquiring module is used for determining the pushing priority of the abstract text according to the user data and the abstract text;
and the pushing module is used for pushing the abstract text to the terminal corresponding to the user data according to the pushing priority.
28. The apparatus of any one of claims 17 to 22, further comprising: the device comprises a summary text logic determination module, a summary text separation module and a recombination module;
the abstract text logic determination module is used for determining the logic structure type of the abstract text;
the abstract text separation module is used for separating the logic unit text from the abstract text;
and the recombination module is used for recombining the logic unit texts according to the text recombination mode corresponding to the logic structure type to obtain a recombined abstract text.
29. The apparatus according to claim 28, wherein the restructuring module is further configured to determine a head text and a tail text in each of the separated logical unit texts when the logical structure type is a parallel structure type; merging the head texts according to a parallel expression mode to obtain merged head texts; merging the tail texts in a parallel expression mode to obtain merged tail texts; and linking the combined head text and the combined tail text through the parallel linking words corresponding to the parallel structure types to obtain the recombined abstract text.
30. The apparatus of claim 28, wherein the restructuring module is further configured to determine a progressive order of each of the logical unit texts when the logical structure type is a progressive structure type; acquiring progressive connection words corresponding to the progressive structure types and corresponding to the progressive sequence; and linking the logic unit texts according to the progressive sequence and the corresponding progressive linking words to obtain a recombined abstract text.
31. The apparatus of claim 28, wherein the reorganization module is further configured to identify a logical unit text of a base semantic and a logical unit text of a transition semantic from the separated logical unit texts when the logical structure type is a transition structure type; determining turning conjunctions in the abstract text; and deleting the logic unit text of the basic semantics and the turning conjunctions from the abstract text to obtain a recombined abstract text.
32. The apparatus of claim 28, wherein the restructuring module is further configured to determine a parent level logical structure type of the abstract text and a child level logical structure type of each logical unit text when the logical structure type is a summary structure type; separating corresponding sub logic unit texts from the logic unit texts respectively; recombining the corresponding sub-logic unit texts separated from each logic unit text according to the text recombination modes corresponding to the corresponding sub-level logic structure types to obtain recombined logic unit texts; and recombining the recombined logic unit texts according to a text recombination mode corresponding to the parent-level logic structure type to obtain a recombined abstract text.
33. A computer-readable storage medium, storing a computer program which, when executed by a processor, causes the processor to perform the steps of the method according to any one of claims 1 to 16.
34. A computer device comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of the method of any one of claims 1 to 16.
CN201711278814.1A 2017-12-06 2017-12-06 Abstract text generation method and device, storage medium and computer equipment Active CN110069623B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201711278814.1A CN110069623B (en) 2017-12-06 2017-12-06 Abstract text generation method and device, storage medium and computer equipment
PCT/CN2018/119214 WO2019109918A1 (en) 2017-12-06 2018-12-04 Abstract text generation method, computer readable storage medium and computer device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711278814.1A CN110069623B (en) 2017-12-06 2017-12-06 Abstract text generation method and device, storage medium and computer equipment

Publications (2)

Publication Number Publication Date
CN110069623A CN110069623A (en) 2019-07-30
CN110069623B true CN110069623B (en) 2022-09-23

Family

ID=66750771

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711278814.1A Active CN110069623B (en) 2017-12-06 2017-12-06 Abstract text generation method and device, storage medium and computer equipment

Country Status (2)

Country Link
CN (1) CN110069623B (en)
WO (1) WO2019109918A1 (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110750974B (en) * 2019-09-20 2023-04-25 成都星云律例科技有限责任公司 Method and system for structured processing of referee document
CN110706774A (en) * 2019-09-29 2020-01-17 广州达美智能科技有限公司 Medical record generation method, terminal device and computer readable storage medium
CN110956041A (en) * 2019-11-27 2020-04-03 重庆邮电大学 Depth learning-based co-purchase recombination bulletin summarization method
CN111160019B (en) * 2019-12-30 2023-08-15 中国联合网络通信集团有限公司 Public opinion monitoring method, device and system
CN111539012B (en) * 2020-03-19 2021-07-20 重庆特斯联智慧科技股份有限公司 Privacy data distribution storage system and method of edge framework
CN113742478B (en) * 2020-05-29 2023-09-05 国家计算机网络与信息安全管理中心 Directional screening device and method for massive text data
CN111859885A (en) * 2020-06-19 2020-10-30 广州大学 Automatic generation method and system for legal decision book
CN111737989A (en) * 2020-06-24 2020-10-02 深圳前海微众银行股份有限公司 Intention identification method, device, equipment and storage medium
CN112182224A (en) * 2020-10-12 2021-01-05 深圳壹账通智能科技有限公司 Referee document abstract generation method and device, electronic equipment and readable storage medium
CN112183077A (en) * 2020-10-13 2021-01-05 京华信息科技股份有限公司 Mode recognition-based official document abstract extraction method and system
CN112668316A (en) * 2020-11-17 2021-04-16 国家计算机网络与信息安全管理中心 word document key information extraction method
CN112395885B (en) * 2020-11-27 2024-01-26 安徽迪科数金科技有限公司 Short text semantic understanding template generation method, semantic understanding processing method and device
CN112541073B (en) * 2020-12-15 2022-12-06 科大讯飞股份有限公司 Text abstract generation method and device, electronic equipment and storage medium
CN112784585A (en) * 2021-02-07 2021-05-11 新华智云科技有限公司 Abstract extraction method and terminal for financial bulletin
CN113658652B (en) * 2021-08-18 2023-07-28 四川大学华西医院 Binary relation extraction method based on electronic medical record data text
CN113435212B (en) * 2021-08-26 2021-11-16 山东大学 Text inference method and device based on rule embedding
CN113806522A (en) * 2021-09-18 2021-12-17 北京百度网讯科技有限公司 Abstract generation method, device, equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101692240A (en) * 2009-08-14 2010-04-07 北京中献电子技术开发中心 Rule-based method for patent abstract automatic extraction and keyword indexing
CN105468713A (en) * 2015-11-19 2016-04-06 西安交通大学 Multi-model fused short text classification method
CN106156204A (en) * 2015-04-23 2016-11-23 深圳市腾讯计算机系统有限公司 The extracting method of text label and device
CN106599041A (en) * 2016-11-07 2017-04-26 中国电子科技集团公司第三十二研究所 Text processing and retrieval system based on big data platform
CN106897439A (en) * 2017-02-28 2017-06-27 百度在线网络技术(北京)有限公司 The emotion identification method of text, device, server and storage medium

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7251781B2 (en) * 2001-07-31 2007-07-31 Invention Machine Corporation Computer based summarization of natural language documents
CN101604312A (en) * 2007-12-07 2009-12-16 宗刚 The method and system of the searching, managing and communicating of information
CN101620608A (en) * 2008-07-04 2010-01-06 全国组织机构代码管理中心 Information collection method and system
US8335754B2 (en) * 2009-03-06 2012-12-18 Tagged, Inc. Representing a document using a semantic structure
CN103699525B (en) * 2014-01-03 2016-08-31 江苏金智教育信息股份有限公司 A kind of method and apparatus automatically generating summary based on text various dimensions feature
US9940099B2 (en) * 2014-01-03 2018-04-10 Oath Inc. Systems and methods for content processing
CN104572849A (en) * 2014-12-17 2015-04-29 西安美林数据技术股份有限公司 Automatic standardized filing method based on text semantic mining
CN105159886B (en) * 2015-10-10 2016-10-12 广东卓维网络有限公司 A kind of Outlier Detection method and system based on voucher summary texts
CN107403375A (en) * 2017-04-19 2017-11-28 北京文因互联科技有限公司 A kind of listed company's bulletin classification and abstraction generating method based on deep learning

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
Combining MPEG Tools to Generate Video Summaries Adapted to the Terminal and Network; Luis Herranz et al.; IEEE; 2013-05-31; pp. 529-553 *
An automatic summarization method based on a text-unit association network; Tao Yuhui et al.; Pattern Recognition and Artificial Intelligence; 2009-06-15 (No. 03); pp. 440-444 *
A financial text classification method based on semantic annotation features; Luo Ming et al.; Application Research of Computers; 2017-07-21 (No. 08); pp. 2281-2284+228 *
Automatic text summarization based on comprehensive sentence features; Cheng Yuan et al.; Computer Science; 2015-04-15 (No. 04); pp. 226-229 *
Building a semantic-graph-based extraction model for medical multi-document summarization; Zhang Han et al.; Library and Information Service; 2017-04-20 (No. 08); pp. 112-119 *
Research and implementation of automatic text topic extraction methods; Zhang Qiwen et al.; Computer Engineering and Design; 2006-08-16 (No. 15); pp. 2744-2746+2766 *

Also Published As

Publication number Publication date
CN110069623A (en) 2019-07-30
WO2019109918A1 (en) 2019-06-13

Similar Documents

Publication Publication Date Title
CN110069623B (en) Abstract text generation method and device, storage medium and computer equipment
US11861751B2 (en) Machine evaluation of contract terms
He et al. A database linking Chinese patents to China’s census firms
US20230334254A1 (en) Fact checking
US20150032645A1 (en) Computer-implemented systems and methods of performing contract review
WO2019222742A1 (en) Real-time content analysis and ranking
EP3584728B1 (en) Method and device for analyzing open-source license
CN112182246B (en) Method, system, medium, and application for creating an enterprise representation through big data analysis
CN106649223A (en) Financial report automatic generation method based on natural language processing
US11720615B2 (en) Self-executing protocol generation from natural language text
US20070088743A1 (en) Information processing device and information processing method
CN110990529B (en) Industry detail dividing method and system for enterprises
CN112035595A (en) Construction method and device of audit rule engine in medical field and computer equipment
CN115358201B (en) Method and system for processing research report in futures field
CN115423578B (en) Bid bidding method and system based on micro-service containerized cloud platform
CN109710918A (en) Public sentiment relation recognition method, apparatus, computer equipment and storage medium
US20230401247A1 (en) Clause taxonomy system and method for structured document construction and analysis
CN114303140A (en) Analysis of intellectual property data related to products and services
US20240062235A1 (en) Systems and methods for automated processing and analysis of deduction backup data
Mahadevan et al. Credible user-review incorporated collaborative filtering for video recommendation system
CN117195319A (en) Verification method and device for electronic part of file, electronic equipment and medium
CN115841365A (en) Model selection and quotation method, system, equipment and medium based on natural language processing
US11880394B2 (en) System and method for machine learning architecture for interdependence detection
CN114861622A (en) Documentary credit generating method, documentary credit generating device, documentary credit generating equipment, storage medium and program product
Plachouras et al. Information extraction of regulatory enforcement actions: From anti-money laundering compliance to countering terrorism finance

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant