WO2021196825A1 - Abstract generation method, device, electronic device, and medium - Google Patents

Abstract generation method, device, electronic device, and medium

Info

Publication number
WO2021196825A1
Authority
WO
WIPO (PCT)
Prior art keywords
abstract
target
template
text
abstracts
Prior art date
Application number
PCT/CN2021/070995
Other languages
English (en)
French (fr)
Inventor
赵焕丽
徐国强
Original Assignee
深圳壹账通智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳壹账通智能科技有限公司
Publication of WO2021196825A1 (zh)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34 Browsing; Visualisation therefor
    • G06F16/345 Summarisation for human users
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/186 Templates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Definitions

  • This application relates to the field of data processing technology, and in particular to an abstract generation method, device, electronic device, and medium.
  • Deep learning-based abstract generation schemes analyze the outline of the original report text and then generate an abstract of the report.
  • Because the outline of the original text must be analyzed, this approach requires a large number of annotated training samples. Annotated training samples are not easy to obtain, and without enough of them the accuracy of automatic abstract generation is low.
  • Traditional extractive abstract generation schemes extract sentences from the text. Although this approach is convenient, coherence between the extracted sentences is weak and readability is poor.
  • A first aspect of the present application provides an abstract generation method, and the abstract generation method includes:
  • A second aspect of the present application provides an electronic device including a processor and a memory, the processor being configured to execute computer-readable instructions stored in the memory to implement the following steps:
  • A third aspect of the present application provides a computer-readable storage medium storing at least one computer-readable instruction, the at least one computer-readable instruction being executed by a processor to implement the following steps:
  • A fourth aspect of the present application provides an abstract generation device, and the abstract generation device includes:
  • an execution unit, configured to obtain at least one announcement abstract of at least one enterprise and to deduplicate the at least one announcement abstract;
  • a preprocessing unit, configured to preprocess each deduplicated announcement abstract to obtain at least one segmented word of each announcement abstract;
  • a generation unit, configured to input the at least one segmented word of each announcement abstract into a pre-trained parameter extraction model to generate at least one abstract template;
  • a fusion unit, configured to fuse the at least one abstract template to obtain an abstract template library;
  • an extraction unit, configured to extract target text from an abstract generation instruction when the abstract generation instruction is received;
  • a determination unit, configured to determine the text type to which the target text belongs and the enterprise type to which the enterprise corresponding to the target text belongs;
  • the determination unit being further configured to determine, from the abstract template library, a target abstract template that matches both the text type and the enterprise type;
  • the generation unit being further configured to extract the information required by the target abstract template from the target text, and to generate an abstract corresponding to the target text according to the extracted information and the target abstract template.
  • Because this application directly analyzes the information in announcement abstracts without analyzing the main idea of the original report, fewer training samples are required. With the same training samples, the model obtained by this application is more accurate, which improves the accuracy of abstract generation.
  • The abstract is generated according to an abstract template, which ensures the coherence of the generated abstract.
  • Fig. 1 is a flowchart of a preferred embodiment of a method for generating an abstract of the present application.
  • Fig. 2 is a functional block diagram of a preferred embodiment of the abstract generation device of the present application.
  • FIG. 3 is a schematic structural diagram of an electronic device implementing a preferred embodiment of the abstract generation method according to the present application.
  • FIG. 1 is a flowchart of a preferred embodiment of the abstract generation method of the present application. According to different needs, the order of the steps in the flowchart can be changed, and some steps can be omitted.
  • The abstract generation method is applied to one or more electronic devices.
  • The electronic device is a device that can automatically perform numerical calculation and/or information processing in accordance with preset or stored instructions. Its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), and embedded devices.
  • The electronic device may be any electronic product that can interact with a user, such as a personal computer, a tablet computer, a smartphone, a personal digital assistant (PDA), a game console, an Internet Protocol television (IPTV), or a smart wearable device.
  • The electronic device may also include a network device and/or user equipment.
  • The network device includes, but is not limited to, a single network server, a server group composed of multiple network servers, or a cloud composed of a large number of hosts or network servers based on cloud computing.
  • The network where the electronic device is located includes, but is not limited to, the Internet, a wide area network, a metropolitan area network, a local area network, and a virtual private network (VPN).
  • S10: Obtain at least one announcement abstract of at least one enterprise, and deduplicate the at least one announcement abstract.
  • The sources of the at least one announcement abstract include, but are not limited to, financial websites on which announcement abstracts are published and the websites of the at least one enterprise. The at least one announcement abstract is obtained by building crawlers for these websites and crawling webpage information from them.
  • The electronic device deduplicating the at least one announcement abstract includes:
  • The electronic device calculates a hash value for each announcement abstract from the abstract title in that announcement abstract. The electronic device then extracts preset features from each announcement abstract and builds a feature index. Based on the hash values of every two announcement abstracts, the electronic device uses the cosine distance formula to calculate the similarity distance of each abstract pair, where an abstract pair consists of any two announcement abstracts. Through the feature index, the electronic device searches for abstract pairs whose similarity distance is greater than a threshold and determines them to be similar abstract pairs. Finally, the electronic device determines whether the preset features of the two abstracts in a similar abstract pair are the same and, when they are, deletes either abstract of the pair.
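  • The pairwise comparison can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of the application: it omits the title hash and the feature index, compares plain term-frequency vectors directly, and the names `cosine_similarity`, `deduplicate`, and the `text`/`features` record fields are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def deduplicate(abstracts, threshold=0.9):
    """Drop one abstract of every pair that is similar AND shares preset features.

    `abstracts` is a list of dicts with hypothetical keys 'text' and 'features'.
    """
    vectors = [Counter(a["text"].lower().split()) for a in abstracts]
    removed = set()
    for i, j in combinations(range(len(abstracts)), 2):
        if i in removed or j in removed:
            continue
        similar = cosine_similarity(vectors[i], vectors[j]) > threshold
        if similar and abstracts[i]["features"] == abstracts[j]["features"]:
            removed.add(j)  # keep the first abstract of the pair
    return [a for k, a in enumerate(abstracts) if k not in removed]
```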
  • The electronic device preprocessing each deduplicated announcement abstract to obtain at least one segmented word of each announcement abstract includes:
  • The electronic device denoises each deduplicated announcement abstract to obtain a first text, and performs lexical analysis on the preset fields in the first text to obtain a second text. The electronic device then segments the second text according to a preset custom dictionary to obtain segmentation positions, constructs at least one directed acyclic graph (DAG) according to the segmentation positions, and calculates the probability of each directed acyclic graph according to the weights in the custom dictionary. The electronic device determines the segmentation positions corresponding to the directed acyclic graph with the highest probability as the target segmentation positions, determines at least one feature word according to the target segmentation positions, and standardizes the at least one feature word to obtain the at least one segmented word of each announcement abstract.
  • The denoising includes removing tags, special characters, and stop words from each announcement abstract.
  • The preset fields include, but are not limited to, time, amount, and percentage.
  • The preset custom dictionary stores at least one custom word and the weight corresponding to each custom word.
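  • A dictionary-weighted DAG segmentation of this kind can be sketched as below. The sketch assumes that the "probability" of a path is the product of word weights divided by the total weight (computed in log space), that characters not covered by the dictionary fall back to single-character words with weight 1, and it uses an unspaced English string for readability. The function names and the sample dictionary are hypothetical.

```python
import math


def build_dag(text, dictionary):
    """Map each start index to the end indices of dictionary words starting there."""
    dag = {}
    for start in range(len(text)):
        ends = [end for end in range(start + 1, len(text) + 1)
                if text[start:end] in dictionary]
        dag[start] = ends or [start + 1]  # fall back to a single character
    return dag


def segment(text, dictionary):
    """Return the highest-probability segmentation of `text`."""
    total = sum(dictionary.values())
    dag = build_dag(text, dictionary)
    n = len(text)
    # best[i] = (log-probability of the best path from i to the end, next cut)
    best = {n: (0.0, n)}
    for start in range(n - 1, -1, -1):
        best[start] = max(
            (math.log(dictionary.get(text[start:end], 1) / total) + best[end][0], end)
            for end in dag[start]
        )
    words, pos = [], 0
    while pos < n:
        end = best[pos][1]
        words.append(text[pos:end])
        pos = end
    return words
```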
  • The electronic device standardizing the at least one feature word to obtain the at least one segmented word of each announcement abstract includes:
  • The electronic device uses a shallow semantic analysis method to identify the at least one feature word, and normalizes the identified feature words that have similar meanings to obtain the at least one segmented word.
  • For example, the electronic device recognizes that "turnover" and "operating income" are feature words with similar meanings, and normalizes them to obtain the segmented word "turnover".
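  • The normalization step can be illustrated with a simple lookup. Note the substitution: the application identifies similar feature words by shallow semantic analysis, whereas this sketch assumes the synonym groups are already known. `SYNONYM_GROUPS`, `build_lookup`, and `normalize` are hypothetical names.

```python
# Hypothetical synonym groups: canonical word -> variants with similar meaning.
SYNONYM_GROUPS = {
    "turnover": {"turnover", "operating income"},
}


def build_lookup(groups):
    """Invert the groups into a variant -> canonical word mapping."""
    return {variant: canonical
            for canonical, variants in groups.items()
            for variant in variants}


def normalize(feature_words, lookup):
    """Replace each feature word by its canonical form; unknown words pass through."""
    return [lookup.get(word, word) for word in feature_words]
```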
  • S12: Input the at least one segmented word of each announcement abstract into a pre-trained parameter extraction model to generate at least one abstract template.
  • Before the at least one segmented word of each announcement abstract is input into the pre-trained parameter extraction model, the method further includes:
  • The electronic device uses web crawler technology to obtain at least one historical abstract and labels each historical abstract with an abstract category. The electronic device constructs a data set from the at least one historical abstract and the corresponding abstract categories, and divides the data set using a cross-validation method to obtain a training set and a validation set. The electronic device performs word segmentation on each historical abstract in the training set and the validation set to obtain at least one feature of the training set and at least one feature of the validation set. The electronic device inputs the at least one feature of the training set into an input gate layer for training to obtain a learner, and then performs error analysis and adjustment on the learner according to the at least one feature of the validation set until the error is less than a configured value, which yields the parameter extraction model.
  • The input gate layer contains a preset parameter list for at least one enterprise type, and the preset parameter list can be determined by analyzing announcement abstracts.
  • In this way, an accurate parameter extraction model can be trained, so that parameters can be extracted from each announcement abstract based on the model, which facilitates the generation of abstract templates.
  • The method further includes:
  • The electronic device counts the historical abstracts corresponding to each abstract category and determines whether the count is less than a preset number. When the count is less than the preset number, the electronic device increases the number of historical abstracts of that category by a perturbation method.
  • This application does not limit the value of the preset number.
  • The electronic device dividing the data set using a cross-validation method to obtain a training set and a validation set includes:
  • The electronic device randomly divides the data set into at least one data packet according to a preset ratio. The electronic device then determines any one of the at least one data packet as the validation set and the remaining data packets as the training set, and repeats these steps until every data packet has been used as the validation set in turn.
  • The preset ratio can be customized and is not limited in this application.
  • For example, the electronic device divides the data set into three data packets: data packet E, data packet F, and data packet G. First, data packet E is used as the validation set, and data packets F and G are used as the training set. Second, data packet F is used as the validation set, and data packets E and G are used as the training set. Finally, data packet G is used as the validation set, and data packets E and F are used as the training set.
  • In this way, every item in the data set participates in both training and validation, which improves the fit of the trained parameter extraction model.
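  • The rotation of data packets described above corresponds to k-fold cross-validation. A minimal sketch, assuming equal-sized folds and a fixed random seed (the function name `k_fold_splits` and its parameters are illustrative, not taken from the application):

```python
import random


def k_fold_splits(data, k=3, seed=0):
    """Yield (training, validation) pairs; each fold is the validation set once."""
    items = list(data)
    random.Random(seed).shuffle(items)        # random division into packets
    folds = [items[i::k] for i in range(k)]   # k roughly equal data packets
    for i in range(k):
        validation = folds[i]
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, validation
```

  • With `k=3`, the three folds play the roles of data packets E, F, and G in the example above.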
  • The electronic device inputs the at least one segmented word of each announcement abstract into the pre-trained parameter extraction model to obtain the entity corresponding to each segmented word. The electronic device integrates the entities in each announcement abstract to obtain the abstract template corresponding to that announcement abstract, and then integrates the abstract templates that belong to the same abstract category and the same enterprise category to obtain the at least one abstract template.
  • The abstract template library records template information of at least one abstract template. The template information includes the abstract template, the abstract category of the abstract template, and the enterprise category corresponding to the abstract template.
  • The abstract category may be, for example, financial indicators, and the enterprise category may be, for example, the chemical industry.
  • When the electronic device crawls webpage information from the websites, it can obtain the column identifier of the column to which an announcement abstract belongs on the webpage, and determine the abstract category and the enterprise category of the announcement abstract according to the column identifier.
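  • One possible in-memory shape for such a template library is a mapping keyed by the pair (abstract category, enterprise category). The sketch below is an assumption about the data structure, which the application does not specify; `TemplateInfo` and `build_template_library` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TemplateInfo:
    """Template information: the template plus its two categories."""
    abstract_category: str
    enterprise_category: str
    template: str


def build_template_library(templates):
    """Fuse templates into a library keyed by (abstract category, enterprise category)."""
    library = {}
    for info in templates:
        key = (info.abstract_category, info.enterprise_category)
        library.setdefault(key, []).append(info.template)
    return library
```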
  • The information in the abstract generation instruction includes the target text, the text type to which the target text belongs, the enterprise type to which the enterprise corresponding to the target text belongs, and the like.
  • Extracting the target text from the abstract generation instruction includes:
  • The electronic device determines a target tag, and then extracts the text information corresponding to the target tag from the abstract generation instruction as the target text.
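  • The application does not specify how the abstract generation instruction is encoded. Purely as an illustration, the sketch below assumes a JSON payload in which each tag is a key; the tag name `text` and the function `extract_target_text` are hypothetical.

```python
import json


def extract_target_text(instruction: str, target_tag: str = "text") -> str:
    """Return the value stored under `target_tag` in a JSON-encoded instruction."""
    payload = json.loads(instruction)
    if target_tag not in payload:
        raise KeyError(f"instruction has no tag {target_tag!r}")
    return payload[target_tag]
```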
  • S15: Determine the text type to which the target text belongs, and determine the enterprise type to which the enterprise corresponding to the target text belongs.
  • According to the target text, the electronic device can determine the text type and the enterprise type from the abstract generation instruction.
  • The electronic device can determine, from the abstract template library, the target abstract category and the target enterprise category corresponding to the text type and the enterprise type, and then determine the target abstract template from the abstract template library according to the target abstract category and the target enterprise category.
  • S17: Extract the information required by the target abstract template from the target text, and generate an abstract corresponding to the target text according to the extracted information and the target abstract template.
  • Extracting the information required by the target abstract template from the target text includes:
  • The electronic device extracts the target identifier corresponding to each blank in the target abstract template, and then extracts the feature value corresponding to the target identifier from the target text as the required information.
  • The target identifier acts as a bridge between the target abstract template and the target text, which helps the information to be entered accurately into the target abstract template.
  • The electronic device enters the extracted information into the blank corresponding to the target identifier in the target abstract template to obtain the abstract corresponding to the target text.
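  • Filling a template through its identifiers can be sketched as follows. The sketch assumes that blanks are written as `{identifier}` placeholders and that each identifier has a regular expression locating its value in the target text. Both conventions, and the names `template_identifiers`, `extract_values`, and `fill_template`, are assumptions for illustration only.

```python
import re
from string import Formatter


def template_identifiers(template):
    """List the identifiers of the blanks ({name} placeholders) in a template."""
    return [name for _, name, _, _ in Formatter().parse(template) if name]


def extract_values(target_text, identifiers, patterns):
    """Find the feature value for each identifier using its regex (first group)."""
    values = {}
    for name in identifiers:
        match = re.search(patterns[name], target_text)
        if match:
            values[name] = match.group(1)
    return values


def fill_template(template, target_text, patterns):
    """Enter the extracted values into the matching blanks of the template."""
    identifiers = template_identifiers(template)
    values = extract_values(target_text, identifiers, patterns)
    missing = [name for name in identifiers if name not in values]
    if missing:
        raise ValueError(f"no value found for: {missing}")
    return template.format(**values)
```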
  • The method further includes:
  • The electronic device determines a target parameter list according to the enterprise type and acquires all the parameters in the target parameter list. The electronic device then determines whether the abstract includes all the parameters. When it is detected that the abstract includes all the parameters, prompt information is generated according to the abstract, and the electronic device sends the prompt information to the terminal device of a designated contact.
  • The parameters in the target parameter list are the parameters that an announcement abstract of the enterprise type must include.
  • The designated contact may be the person in charge of generating the abstract.
  • Checking whether the generated abstract contains all the parameters ensures that the generated abstract has all the necessary parameters.
  • The prompt information is sent to the terminal device of the designated contact to remind the designated contact to check the abstract.
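  • The completeness check can be sketched as below, assuming the filled-in values are available as a mapping from parameter name to value and that parameter lists are stored per enterprise type. The function names and the prompt wording are hypothetical, and sending the prompt to a terminal device is left out.

```python
def missing_parameters(filled_values, enterprise_type, parameter_lists):
    """Return the required parameters of the enterprise type absent from the abstract."""
    required = parameter_lists.get(enterprise_type, [])
    return [name for name in required if name not in filled_values]


def build_prompt(abstract, missing):
    """Generate prompt information only when no required parameter is missing."""
    if missing:
        return None
    return f"Please review the generated abstract: {abstract}"
```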
  • Because this application directly analyzes the information in announcement abstracts without analyzing the main idea of the original report, fewer training samples are required. With the same training samples, the model obtained by this application is more accurate, which improves the accuracy of abstract generation.
  • The abstract is generated according to an abstract template, which ensures the coherence of the generated abstract.
  • The abstract generation device 11 includes an execution unit 110, a preprocessing unit 111, a generation unit 112, a fusion unit 113, an extraction unit 114, a determination unit 115, an acquisition unit 116, an annotation unit 117, a construction unit 118, a division unit 119, a processing unit 120, an input unit 121, a calculation unit 122, a judgment unit 123, and a sending unit 124.
  • A module/unit referred to in this application is a series of computer program segments that are stored in the memory 12, can be acquired by the processor 13, and can complete a fixed function. In this embodiment, the functions of each module/unit are described in detail in the subsequent embodiments.
  • The execution unit 110 obtains at least one announcement abstract of at least one enterprise, and deduplicates the at least one announcement abstract.
  • The sources of the at least one announcement abstract include, but are not limited to, financial websites on which announcement abstracts are published and the websites of the at least one enterprise. The at least one announcement abstract is obtained by building crawlers for these websites and crawling webpage information from them.
  • The execution unit 110 deduplicating the at least one announcement abstract includes:
  • The execution unit 110 calculates a hash value for each announcement abstract from the abstract title in that announcement abstract. The execution unit 110 then extracts preset features from each announcement abstract and builds a feature index. Based on the hash values of every two announcement abstracts, the execution unit 110 uses the cosine distance formula to calculate the similarity distance of each abstract pair, where an abstract pair consists of any two announcement abstracts. Through the feature index, the execution unit 110 searches for abstract pairs whose similarity distance is greater than a threshold and determines them to be similar abstract pairs. Finally, the execution unit 110 determines whether the preset features of the two abstracts in a similar abstract pair are the same and, when they are, deletes either abstract of the pair.
  • The preprocessing unit 111 preprocesses each deduplicated announcement abstract to obtain at least one segmented word of each announcement abstract.
  • The preprocessing unit 111 preprocessing each deduplicated announcement abstract to obtain at least one segmented word of each announcement abstract includes:
  • The preprocessing unit 111 denoises each deduplicated announcement abstract to obtain a first text, and performs lexical analysis on the preset fields in the first text to obtain a second text. The preprocessing unit 111 then segments the second text according to a preset custom dictionary to obtain segmentation positions, constructs at least one directed acyclic graph (DAG) according to the segmentation positions, and calculates the probability of each directed acyclic graph according to the weights in the custom dictionary. The preprocessing unit 111 determines the segmentation positions corresponding to the directed acyclic graph with the highest probability as the target segmentation positions, determines at least one feature word according to the target segmentation positions, and standardizes the at least one feature word to obtain the at least one segmented word of each announcement abstract.
  • The denoising includes removing tags, special characters, and stop words from each announcement abstract.
  • The preset fields include, but are not limited to, time, amount, and percentage.
  • The preset custom dictionary stores at least one custom word and the weight corresponding to each custom word.
  • The preprocessing unit 111 standardizing the at least one feature word to obtain the at least one segmented word of each announcement abstract includes:
  • The preprocessing unit 111 uses a shallow semantic analysis method to identify the at least one feature word, and normalizes the identified feature words that have similar meanings to obtain the at least one segmented word.
  • For example, the preprocessing unit 111 recognizes that "sales" and "operating income" are feature words with similar meanings, and normalizes them to obtain the segmented word "turnover".
  • The generation unit 112 inputs the at least one segmented word of each announcement abstract into a pre-trained parameter extraction model to generate at least one abstract template.
  • The acquisition unit 116 uses web crawler technology to obtain at least one historical abstract, and the annotation unit 117 labels each historical abstract with an abstract category.
  • The construction unit 118 constructs a data set from the at least one historical abstract and the corresponding abstract categories, and the division unit 119 divides the data set using a cross-validation method to obtain a training set and a validation set. The processing unit 120 performs word segmentation on each historical abstract in the training set and the validation set to obtain at least one feature of the training set and at least one feature of the validation set.
  • The input unit 121 inputs the at least one feature of the training set into an input gate layer for training to obtain a learner, and the execution unit 110 performs error analysis and adjustment on the learner according to the at least one feature of the validation set until the error is less than a configured value, which yields the parameter extraction model.
  • The input gate layer contains a preset parameter list for at least one enterprise type, and the preset parameter list can be determined by analyzing announcement abstracts.
  • In this way, an accurate parameter extraction model can be trained, so that parameters can be extracted from each announcement abstract based on the model, which facilitates the generation of abstract templates.
  • The calculation unit 122 counts the historical abstracts corresponding to each abstract category, and the judgment unit 123 determines whether the count is less than a preset number. When the count is less than the preset number, the execution unit 110 increases the number of historical abstracts of that category by a perturbation method.
  • This application does not limit the value of the preset number.
  • The division unit 119 dividing the data set using a cross-validation method to obtain a training set and a validation set includes:
  • The division unit 119 randomly divides the data set into at least one data packet according to a preset ratio. The division unit 119 then determines any one of the at least one data packet as the validation set and the remaining data packets as the training set, and repeats these steps until every data packet has been used as the validation set in turn.
  • The preset ratio can be customized and is not limited in this application.
  • For example, the division unit 119 divides the data set into three data packets: data packet E, data packet F, and data packet G. First, data packet E is used as the validation set, and data packets F and G are used as the training set. Second, data packet F is used as the validation set, and data packets E and G are used as the training set. Finally, data packet G is used as the validation set, and data packets E and F are used as the training set.
  • In this way, every item in the data set participates in both training and validation, which improves the fit of the trained parameter extraction model.
  • The generation unit 112 inputs the at least one segmented word of each announcement abstract into the pre-trained parameter extraction model to obtain the entity corresponding to each segmented word. The generation unit 112 integrates the entities in each announcement abstract to obtain the abstract template corresponding to that announcement abstract, and then integrates the abstract templates that belong to the same abstract category and the same enterprise category to obtain the at least one abstract template.
  • The fusion unit 113 fuses the at least one abstract template to obtain an abstract template library.
  • The abstract template library records template information of at least one abstract template. The template information includes the abstract template, the abstract category of the abstract template, and the enterprise category corresponding to the abstract template.
  • The abstract category may be, for example, financial indicators, and the enterprise category may be, for example, the chemical industry.
  • When the acquisition unit 116 crawls webpage information from the websites, it can obtain the column identifier of the column to which an announcement abstract belongs on the webpage, and the determination unit 115 determines the abstract category and the enterprise category of the announcement abstract according to the column identifier.
  • When an abstract generation instruction is received, the extraction unit 114 extracts the target text from the abstract generation instruction.
  • The information in the abstract generation instruction includes the target text, the text type to which the target text belongs, the enterprise type to which the enterprise corresponding to the target text belongs, and the like.
  • The extraction unit 114 extracting the target text from the abstract generation instruction includes:
  • The extraction unit 114 determines a target tag, and then extracts the text information corresponding to the target tag from the abstract generation instruction as the target text.
  • The determination unit 115 determines the text type to which the target text belongs, and determines the enterprise type to which the enterprise corresponding to the target text belongs.
  • According to the target text, the determination unit 115 can determine the text type and the enterprise type from the abstract generation instruction.
  • The determination unit 115 determines, from the abstract template library, a target abstract template that matches both the text type and the enterprise type.
  • The determination unit 115 can determine, from the abstract template library, the target abstract category and the target enterprise category corresponding to the text type and the enterprise type, and then determine the target abstract template from the abstract template library according to the target abstract category and the target enterprise category.
  • the generating unit 112 extracts the information required by the target abstract template from the target text, and generates an abstract corresponding to the target text according to the extracted information and the target abstract template.
  • the generating unit 112 extracting the information required by the target summary template from the target text includes:
  • the generating unit 112 extracts the target identifier corresponding to the space in the target summary template. Further, the generating unit 112 extracts the feature value corresponding to the target identifier from the target text as the required information.
  • the target identification facilitates the establishment of a bridge between the target summary template and the target text, which in turn facilitates the accurate entry of the information into the target summary template.
  • the generating unit 112 enters the extracted information into the space corresponding to the target identifier in the target summary template to obtain the summary corresponding to the target text.
  • the determining unit 115 determines the target parameter list according to the enterprise type, and the obtaining unit 116 obtains all parameters in the target parameter list.
  • the judging unit 123 judges whether the summary contains all the parameters; when it is detected that the summary contains all the parameters, the generating unit 112 generates prompt information according to the summary, and the sending unit 124 sends the prompt information to the terminal device of the designated contact.
  • the parameters in the target parameter list are the parameters that must be included in the announcement summary of the enterprise type.
  • the designated contact person may be the person in charge of generating the abstract.
  • because different enterprise types correspond to different parameter lists, judging whether the generated summary contains all the parameters ensures that the generated summary has every required parameter.
  • the prompt information is sent to the terminal device of the designated contact to remind the designated contact to review it.
  • because this application directly analyzes the information in announcement summaries without analyzing the gist of the original reports, fewer training samples are required; with the same training samples, the model obtained by this application is more accurate, which improves the accuracy of summary generation.
  • the summary is generated according to the summary template to ensure the continuity of the generated summary.
  • FIG. 3 is a schematic structural diagram of an electronic device implementing a preferred embodiment of the summary generation method of the present application.
  • the electronic device 1 includes, but is not limited to, a memory 12, a processor 13, and a computer program stored in the memory 12 and runnable on the processor 13, such as a summary generation program.
  • the schematic diagram is only an example of the electronic device 1 and does not constitute a limitation on the electronic device 1.
  • the electronic device 1 may also include an input/output device, a network access device, a bus, and the like.
  • the processor 13 may be a central processing unit (Central Processing Unit, CPU), other general-purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (ASIC), Field-Programmable Gate Array (FPGA) or other programmable logic devices, discrete gates or transistor logic devices, discrete hardware components, etc.
  • the general-purpose processor can be a microprocessor or the processor can also be any conventional processor, etc.
  • the processor 13 is the computing core and control center of the electronic device 1; it uses various interfaces and lines to connect all parts of the electronic device 1, and obtains the operating system of the electronic device 1 and the installed applications, program code, and so on.
  • the processor 13 obtains the operating system of the electronic device 1 and various installed applications.
  • the processor 13 obtains the application program to implement the steps in the above-mentioned summary generation method embodiments, for example, the steps shown in FIG. 1.
  • the computer program may be divided into one or more modules/units, and the one or more modules/units are stored in the memory 12 and obtained by the processor 13 to implement the present application.
  • the one or more modules/units may be a series of computer-readable instruction segments capable of completing specific functions, and the instruction segments are used to describe the acquisition process of the computer program in the electronic device 1.
  • the computer program can be divided into an execution unit 110, a preprocessing unit 111, a generation unit 112, a fusion unit 113, an extraction unit 114, a determination unit 115, an acquisition unit 116, an annotation unit 117, a construction unit 118, a division unit 119, a processing unit 120, an input unit 121, a calculation unit 122, a judgment unit 123, and a sending unit 124.
  • the memory 12 may be used to store the computer-readable instructions and/or modules.
  • the processor 13 runs or executes the computer-readable instructions and/or modules stored in the memory 12 and calls the data stored in the memory 12 to implement the various functions of the electronic device 1.
  • the memory 12 may mainly include a readable-instruction storage area and a data storage area, where the readable-instruction storage area may store an operating system and the application readable instructions required by at least one function (such as a sound playback function or an image playback function), and the data storage area may store data created through use of the electronic device.
  • the memory 12 may include non-volatile and volatile memory, such as a hard disk, a memory, a plug-in hard disk, a smart memory card (Smart Media Card, SMC), a Secure Digital (SD) card, a flash memory card ( Flash Card), at least one magnetic disk storage device, flash memory device, or other storage device.
  • the memory 12 may be an external memory and/or an internal memory of the electronic device 1. Further, the memory 12 may be a memory in a physical form, such as a memory stick, a TF card (Trans-flash Card), and so on.
  • if the integrated module/unit of the electronic device 1 is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium, which may be non-volatile or volatile.
  • on this understanding, all or part of the processes in the methods of the above embodiments can also be completed by computer-readable instructions that instruct the relevant hardware, and the computer-readable instructions can be stored in a computer-readable storage medium.
  • when executed by the processor, the computer-readable instructions can implement the steps of the foregoing method embodiments.
  • the computer-readable instruction includes computer-readable instruction code
  • the computer-readable instruction code may be in the form of source code, object code, executable file, or some intermediate form.
  • the computer-readable medium may include: any entity or device capable of carrying the computer-readable instruction code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, computer memory, read-only memory, random access memory, etc.
  • the memory 12 in the electronic device 1 stores multiple instructions to implement a summary generation method.
  • the processor 13 can obtain the multiple instructions so as to: obtain at least one announcement summary of at least one enterprise, and de-duplicate the at least one announcement summary; preprocess each de-duplicated announcement summary to obtain at least one token of each announcement summary; input the at least one token of each announcement summary into the pre-trained parameter extraction model to generate at least one summary template; fuse the at least one summary template to obtain a summary template library; when a summary generation instruction is received, extract the target text from the summary generation instruction; determine the text type to which the target text belongs and the enterprise type of the enterprise corresponding to the target text; determine from the summary template library a target summary template that matches both the text type and the enterprise type; and extract the information required by the target summary template from the target text and generate a summary corresponding to the target text according to the extracted information and the target summary template.
  • modules described as separate components may or may not be physically separated, and the components displayed as modules may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional modules in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit may be implemented in the form of hardware, or may be implemented in the form of hardware plus software functional modules.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

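A summary generation method, apparatus, electronic device, and medium, relating to artificial intelligence technology. The method obtains at least one announcement summary of at least one enterprise and de-duplicates the summaries; preprocesses each de-duplicated announcement summary to obtain at least one token; inputs the at least one token of each announcement summary into a pre-trained parameter extraction model to generate at least one summary template; and fuses the at least one summary template to obtain a summary template library. When a summary generation instruction is received, the method extracts a target text from the instruction, determines the text type to which the target text belongs and the enterprise type of the corresponding enterprise, determines a target summary template that matches both the text type and the enterprise type, extracts the information required by the target summary template from the target text, and generates the summary corresponding to the target text. The method can improve the accuracy of summary generation.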

Description

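Summary generation method, apparatus, electronic device, and medium
This application claims priority to Chinese patent application No. 202010244210.0, filed with the China Patent Office on March 31, 2020 and entitled "Summary generation method, apparatus, electronic device, and medium", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of data processing technology, and in particular to a summary generation method, apparatus, electronic device, and medium.
Background
As regulators strengthen their supervision and guidance of enterprises, enterprises are required to periodically announce major events related to securities trading to the public and to disclose the relevant statistics in reports such as prospectuses and listing announcements. Summaries of these reports are needed so that third parties can quickly understand how an enterprise is operating. Because each report typically runs to dozens of pages, a person must read the whole report closely before writing its summary, which limits the efficiency of summary generation. Automatic summary generation emerged in response.
The inventors realized that, among existing schemes, deep-learning-based summary generation analyzes the specific gist of the original report and then produces a summary that generalizes it. Because the gist of the original must be analyzed, this approach needs a large number of labeled training samples; such samples are hard to obtain, and without enough of them the accuracy of automatic summary generation is low. Traditional extractive summarization pulls sentences out of the text; this is convenient, but the sentences cohere poorly and the result is hard to read.
How to build a summary generation scheme that is both accurate and coherent has therefore become a technical problem to be solved.
Summary of the Invention
In view of the above, it is necessary to provide a summary generation method, apparatus, electronic device, and medium that not only improve the accuracy of summary generation but also ensure the coherence of the generated summary.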
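A first aspect of this application provides a summary generation method, the summary generation method comprising:
obtaining at least one announcement summary of at least one enterprise, and de-duplicating the at least one announcement summary;
preprocessing each de-duplicated announcement summary to obtain at least one token of each announcement summary;
inputting the at least one token of each announcement summary into a pre-trained parameter extraction model to generate at least one summary template;
fusing the at least one summary template to obtain a summary template library;
when a summary generation instruction is received, extracting a target text from the summary generation instruction;
determining the text type to which the target text belongs, and determining the enterprise type of the enterprise corresponding to the target text;
determining, from the summary template library, a target summary template that matches both the text type and the enterprise type;
extracting, from the target text, the information required by the target summary template, and generating a summary corresponding to the target text according to the extracted information and the target summary template.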
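A second aspect of this application provides an electronic device, the electronic device comprising a processor and a memory, the processor being configured to execute computer-readable instructions stored in the memory to implement the following steps:
obtaining at least one announcement summary of at least one enterprise, and de-duplicating the at least one announcement summary;
preprocessing each de-duplicated announcement summary to obtain at least one token of each announcement summary;
inputting the at least one token of each announcement summary into a pre-trained parameter extraction model to generate at least one summary template;
fusing the at least one summary template to obtain a summary template library;
when a summary generation instruction is received, extracting a target text from the summary generation instruction;
determining the text type to which the target text belongs, and determining the enterprise type of the enterprise corresponding to the target text;
determining, from the summary template library, a target summary template that matches both the text type and the enterprise type;
extracting, from the target text, the information required by the target summary template, and generating a summary corresponding to the target text according to the extracted information and the target summary template.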
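A third aspect of this application provides a computer-readable storage medium storing at least one computer-readable instruction, the at least one computer-readable instruction being executed by a processor to implement the following steps:
obtaining at least one announcement summary of at least one enterprise, and de-duplicating the at least one announcement summary;
preprocessing each de-duplicated announcement summary to obtain at least one token of each announcement summary;
inputting the at least one token of each announcement summary into a pre-trained parameter extraction model to generate at least one summary template;
fusing the at least one summary template to obtain a summary template library;
when a summary generation instruction is received, extracting a target text from the summary generation instruction;
determining the text type to which the target text belongs, and determining the enterprise type of the enterprise corresponding to the target text;
determining, from the summary template library, a target summary template that matches both the text type and the enterprise type;
extracting, from the target text, the information required by the target summary template, and generating a summary corresponding to the target text according to the extracted information and the target summary template.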
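A fourth aspect of this application provides a summary generation apparatus, the summary generation apparatus comprising:
an execution unit, configured to obtain at least one announcement summary of at least one enterprise and to de-duplicate the at least one announcement summary;
a preprocessing unit, configured to preprocess each de-duplicated announcement summary to obtain at least one token of each announcement summary;
a generation unit, configured to input the at least one token of each announcement summary into a pre-trained parameter extraction model to generate at least one summary template;
a fusion unit, configured to fuse the at least one summary template to obtain a summary template library;
an extraction unit, configured to extract a target text from a summary generation instruction when the summary generation instruction is received;
a determination unit, configured to determine the text type to which the target text belongs and the enterprise type of the enterprise corresponding to the target text;
the determination unit being further configured to determine, from the summary template library, a target summary template that matches both the text type and the enterprise type;
the generation unit being further configured to extract, from the target text, the information required by the target summary template, and to generate a summary corresponding to the target text according to the extracted information and the target summary template.
As can be seen from the above technical solutions, because this application directly analyzes the information in announcement summaries and does not need to analyze the gist of the original reports, fewer training samples are required; consequently, with the same training samples, the model obtained by this application is more accurate, which improves the accuracy of summary generation. In addition, generating the summary from a summary template ensures the coherence of the generated summary.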
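Brief Description of the Drawings
FIG. 1 is a flowchart of a preferred embodiment of the summary generation method of this application.
FIG. 2 is a functional block diagram of a preferred embodiment of the summary generation apparatus of this application.
FIG. 3 is a schematic structural diagram of an electronic device implementing a preferred embodiment of the summary generation method of this application.
Detailed Description
To make the objectives, technical solutions, and advantages of this application clearer, this application is described in detail below with reference to the drawings and specific embodiments.
FIG. 1 is a flowchart of a preferred embodiment of the summary generation method of this application. Depending on requirements, the order of the steps in the flowchart may be changed and some steps may be omitted.
The summary generation method is applied in one or more electronic devices. An electronic device is a device that can automatically perform numerical computation and/or information processing according to preset or stored instructions; its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA), a digital signal processor (Digital Signal Processor, DSP), an embedded device, and the like.
The electronic device may be any electronic product capable of human-computer interaction with a user, for example a personal computer, a tablet computer, a smartphone, a personal digital assistant (Personal Digital Assistant, PDA), a game console, an interactive network television (Internet Protocol Television, IPTV), or a smart wearable device.
The electronic device may also include a network device and/or user equipment. The network device includes, but is not limited to, a single network server, a server group composed of multiple network servers, or a cloud composed of a large number of hosts or network servers based on cloud computing (Cloud Computing).
The network in which the electronic device is located includes, but is not limited to, the Internet, a wide area network, a metropolitan area network, a local area network, a virtual private network (Virtual Private Network, VPN), and the like.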
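S10: obtain at least one announcement summary of at least one enterprise, and de-duplicate the at least one announcement summary.
In at least one embodiment of this application, the sources of the at least one announcement summary include, but are not limited to, financial websites that publish announcement summaries and the websites of at least one enterprise. Crawlers are built for these websites and used to crawl web page information from them, yielding at least one publicly available announcement summary.
Because the same announcement summary may appear in different sources, the announcement summaries need to be de-duplicated.
In at least one embodiment of this application, the electronic device de-duplicating the at least one announcement summary includes:
The electronic device computes a hash value for each announcement summary from the summary's title. It then extracts preset features from each announcement summary and builds a feature index. From the hash values of every two announcement summaries, the electronic device uses the cosine distance formula to compute their similarity distance, obtaining a similarity distance for each summary pair, where each summary pair consists of any two announcement summaries. Through the feature index, the electronic device searches for summary pairs whose similarity distance is greater than a threshold and designates them as similar summary pairs. Finally, the electronic device checks whether the preset features of the two summaries in a similar summary pair are the same and, when they are, deletes either one of the two summaries.
With the above implementation, duplicate announcement summaries do not need to be analyzed again, which improves the efficiency of generating the summary template library and also saves memory on the electronic device.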
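The de-duplication step can be illustrated with a minimal Python sketch. It is not the application's implementation: the application names neither the hash function nor the preset features, so the sketch assumes a hashed character-bigram vector of the title and a free-form `features` field, and it compares all pairs directly instead of searching through a feature index. The names `title_vector`, `cosine`, `deduplicate`, and the default threshold are hypothetical.

```python
import math
import zlib
from itertools import combinations


def title_vector(title, dims=256):
    """Hash each character bigram of the title into a fixed-size count vector (assumed scheme)."""
    vec = [0.0] * dims
    for i in range(len(title) - 1):
        bigram = title[i:i + 2]
        vec[zlib.crc32(bigram.encode("utf-8")) % dims] += 1.0
    return vec


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def deduplicate(summaries, threshold=0.9):
    """Drop the later summary of any pair that is similar AND has identical preset features."""
    vectors = [title_vector(s["title"]) for s in summaries]
    removed = set()
    for i, j in combinations(range(len(summaries)), 2):
        if i in removed or j in removed:
            continue
        similar = cosine(vectors[i], vectors[j]) > threshold
        if similar and summaries[i]["features"] == summaries[j]["features"]:
            removed.add(j)
    return [s for k, s in enumerate(summaries) if k not in removed]
```

The two-stage check mirrors the text: title similarity only nominates a candidate pair, and a summary is deleted only when the preset features also agree, so two enterprises that publish identically titled announcements are both kept.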
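S11: preprocess each de-duplicated announcement summary to obtain at least one token of each announcement summary.
In at least one embodiment of this application, the electronic device preprocessing each de-duplicated announcement summary to obtain at least one token of each announcement summary includes:
The electronic device denoises each de-duplicated announcement summary to obtain a first text, and performs lexical analysis on preset fields in the first text to obtain a second text. It then segments the second text according to a preset custom dictionary to obtain segmentation positions, constructs at least one directed acyclic graph (Directed acyclic graph, DAG) from those positions, and computes the probability of each directed acyclic graph from the weights in the custom dictionary. The segmentation positions of the directed acyclic graph with the highest probability are taken as the target segmentation positions, from which at least one feature word is determined. The electronic device standardizes the at least one feature word to obtain the at least one token of each announcement summary.
The denoising removes tags, special characters, stop words, and the like from each announcement summary.
The preset fields include, but are not limited to, times, monetary amounts, and percentages.
The preset custom dictionary stores at least one custom word and the weight corresponding to each custom word.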
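The dictionary-based segmentation described above can be sketched as follows. This is an illustrative reading rather than the application's code: it assumes that the probability of a segmentation is the product of word weights normalized by the total weight, that a character missing from the dictionary falls back to a single-character word of weight 1, and that the best path is found by dynamic programming over the DAG. The function names and the sample lexicon are hypothetical.

```python
import math


def build_dag(text, lexicon):
    """Map each start index to the end indices of dictionary words beginning there."""
    dag = {}
    for start in range(len(text)):
        ends = [end for end in range(start + 1, len(text) + 1)
                if text[start:end] in lexicon]
        # Fall back to a single character when no dictionary word starts here.
        dag[start] = ends or [start + 1]
    return dag


def segment(text, lexicon):
    """Return the segmentation whose path through the DAG has the highest probability."""
    total = sum(lexicon.values())
    dag = build_dag(text, lexicon)
    n = len(text)
    best = {n: (0.0, n)}  # position -> (best log-probability from here, next cut)
    for start in range(n - 1, -1, -1):
        best[start] = max(
            (math.log(lexicon.get(text[start:end], 1) / total) + best[end][0], end)
            for end in dag[start]
        )
    words, pos = [], 0
    while pos < n:
        end = best[pos][1]
        words.append(text[pos:end])
        pos = end
    return words


# Hypothetical weighted custom dictionary.
LEXICON = {"营业": 5, "营业额": 20, "额": 1, "增长": 15, "增": 1, "长": 1}
```

Working in log space avoids underflow on long texts, and scanning from the end of the text lets each position reuse the best result already computed for every later cut.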
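Specifically, the electronic device standardizing the at least one feature word to obtain the at least one token of each announcement summary includes:
The electronic device uses a shallow semantic analysis method to recognize the at least one feature word, and normalizes the recognized feature words that have similar meanings to obtain the at least one token.
For example, the electronic device recognizes that "营业额" (turnover) and "营业收入" (operating revenue) are feature words with similar meanings, normalizes them, and obtains the token "营业额".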
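A minimal sketch of the normalization step follows. The application identifies similar feature words through shallow semantic analysis; the sketch swaps in a hand-written synonym table instead, which is an assumption made purely for illustration, as are the names `SYNONYM_GROUPS` and `normalize`.

```python
# Hypothetical synonym table: canonical token -> variants that should collapse to it.
SYNONYM_GROUPS = {
    "营业额": {"营业额", "营业收入"},
}


def normalize(words, groups=SYNONYM_GROUPS):
    """Replace every known variant with its canonical token; leave other words unchanged."""
    canonical = {variant: canon
                 for canon, variants in groups.items()
                 for variant in variants}
    return [canonical.get(word, word) for word in words]
```

Collapsing variants before entity extraction means the parameter extraction model sees one surface form per concept.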
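Denoising each de-duplicated announcement summary both reduces the amount of useless data and saves memory on the electronic device; performing lexical analysis on the preset fields avoids unnecessary perturbation when summary templates are generated later; segmenting the second text with a weighted custom dictionary allows the at least one feature word to be determined accurately; and standardizing the at least one feature word unifies how the feature words are expressed, which helps the parameter extraction model extract entities.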
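S12: input the at least one token of each announcement summary into a pre-trained parameter extraction model to generate at least one summary template.
In at least one embodiment of this application, before the at least one token of each announcement summary is input into the pre-trained parameter extraction model, the method further includes:
The electronic device obtains at least one historical summary using web crawler technology and labels each historical summary with a summary category. It then constructs a data set from the at least one historical summary and the corresponding summary categories, and divides the data set by cross-validation to obtain a training set and a validation set. The electronic device performs word segmentation on each historical summary in the training set and the validation set to obtain at least one feature of the training set and at least one feature of the validation set. It inputs the at least one feature of the training set into an input gate layer for training to obtain a learner, and, according to the at least one feature of the validation set, performs error analysis on the learner and adjusts it until the error is less than a configured value, yielding the parameter extraction model.
The input gate layer contains a preset parameter list for at least one enterprise type, and the preset parameter list can be determined by analyzing announcement summaries.
With the above implementation, an accurate parameter extraction model can be trained and then used to extract parameters from each announcement summary, which facilitates the generation of summary templates.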
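In at least one embodiment of this application, after the data set is constructed from the at least one historical summary and the corresponding summary categories, the method further includes:
The electronic device counts the number of historical summaries corresponding to each summary category and determines whether the number is less than a preset number. When the number is less than the preset number, the electronic device uses a perturbation method to increase the number of historical summaries of that summary category.
This application places no restriction on the value of the preset number.
The above implementation prevents an insufficient number of historical summary samples in some summary category from producing an inaccurate parameter extraction model, which would in turn reduce the accuracy of summary generation.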
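The application does not say what the perturbation method does, so the following sketch should be read as one possible interpretation: for every summary category that has fewer than `min_count` samples, it creates extra samples by dropping one randomly chosen token from an existing sample. The functions `perturb` and `augment`, the token-dropping rule, and the fixed seed are all assumptions.

```python
import random
from collections import defaultdict


def perturb(tokens, rng):
    """Return a copy of the token list with one randomly chosen token removed."""
    if len(tokens) < 2:
        return list(tokens)
    drop = rng.randrange(len(tokens))
    return [t for k, t in enumerate(tokens) if k != drop]


def augment(dataset, min_count, seed=0):
    """Top up each under-represented category to min_count with perturbed copies."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for tokens, label in dataset:
        by_label[label].append(tokens)
    augmented = list(dataset)
    for label, samples in by_label.items():
        for _ in range(max(0, min_count - len(samples))):
            augmented.append((perturb(rng.choice(samples), rng), label))
    return augmented
```

Categories that already meet the preset number are left untouched, so the augmentation only affects the minority categories.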
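In at least one embodiment of this application, the electronic device dividing the data set by cross-validation to obtain a training set and a validation set includes:
The electronic device randomly divides the data set into at least one data packet according to a preset ratio. It then designates any one of the data packets as the validation set and the remaining data packets as the training set, and repeats this step until every data packet has been used in turn as the validation set.
The preset ratio can be set as desired; this application places no restriction on it.
For example, the electronic device divides the data set into three data packets, E, F, and G. It first uses data packet E as the validation set and data packets F and G as the training set. Next, it uses data packet F as the validation set and data packets E and G as the training set. Finally, it uses data packet G as the validation set and data packets E and F as the training set.
Dividing the data set in this way lets every item in the data set take part in both training and validation, which improves the fit of the trained parameter extraction model.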
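The rotation over data packets E, F, and G is ordinary k-fold cross-validation, which can be sketched as below. The function name `k_fold_splits` and the fixed shuffle seed are illustrative choices, and the sketch assumes equally sized folds rather than an arbitrary preset ratio.

```python
import random


def k_fold_splits(dataset, k, seed=0):
    """Yield (training, validation) pairs so that each fold is the validation set exactly once."""
    items = list(dataset)
    random.Random(seed).shuffle(items)
    folds = [items[i::k] for i in range(k)]  # k near-equal data packets
    for i in range(k):
        validation = folds[i]
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, validation
```

With `k=3` the generator produces the three rounds of the example: each data packet serves once as the validation set while the other two form the training set.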
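In at least one embodiment of this application, the electronic device inputs the at least one token of each announcement summary into the pre-trained parameter extraction model to obtain the entity corresponding to each token. It then fuses the entities in each announcement summary to obtain the summary template corresponding to that announcement summary, and consolidates the summary templates that share the same summary category and the same enterprise category to obtain the at least one summary template.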
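One way to picture how entities become a template is the sketch below. It assumes the parameter extraction model returns, for each token, either an entity name or `None`, and that a template is the summary text with every entity-bearing token replaced by a named slot. The tagged input, the `{name}` slot syntax, and `build_template` are hypothetical; the application does not describe the template format.

```python
def build_template(tagged_tokens):
    """Replace each token that carries an entity label with a named slot."""
    parts = []
    for token, entity in tagged_tokens:
        parts.append("{" + entity + "}" if entity else token)
    return " ".join(parts)


# Hypothetical model output for one announcement summary.
tagged = [
    ("Revenue", None),
    ("reached", None),
    ("5 billion yuan", "revenue"),
    ("in", None),
    ("2019", "period"),
]
template = build_template(tagged)
```

Templates built this way from summaries of the same summary category and enterprise category could then be consolidated, for example by keeping one copy of each distinct template string.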
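S13: fuse the at least one summary template to obtain a summary template library.
In at least one embodiment of this application, the summary template library records the template information of at least one summary template; the template information includes the summary template, the summary category of the summary template, and the enterprise category corresponding to the summary template.
The summary category may be, for example, financial indicators, and the enterprise category may be, for example, the chemical industry.
Specifically, when the electronic device crawls web page information from various websites, it can obtain the identifier of the web page column to which each announcement summary belongs, and it then determines the summary category and the enterprise category of the announcement summary from that column identifier.
Recording the summary category and enterprise category of each summary template in the summary template library lays the foundation for the electronic device to select the target summary template from the library later.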
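S14: when a summary generation instruction is received, extract a target text from the summary generation instruction.
In at least one embodiment of this application, the information in the summary generation instruction includes the target text, the text type to which the target text belongs, the enterprise type of the enterprise corresponding to the target text, and the like.
In at least one embodiment of this application, extracting the target text from the summary generation instruction includes:
The electronic device determines a target label and extracts the text information corresponding to the target label from the summary generation instruction as the target text.
S15: determine the text type to which the target text belongs, and determine the enterprise type of the enterprise corresponding to the target text.
Because the information in the summary generation instruction includes the target text, the text type to which the target text belongs, and the enterprise type of the enterprise corresponding to the target text, the electronic device can use the target text to determine the text type and the enterprise type from the summary generation instruction.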
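Steps S14 and S15 can be sketched under the assumption that the summary generation instruction arrives as a JSON object. The application does not define the instruction format, so the JSON encoding, the key names `target_text`, `text_type`, and `enterprise_type`, and the function `parse_instruction` are all hypothetical.

```python
import json


def parse_instruction(raw, target_label="target_text"):
    """Extract the target text (found via the target label) plus the two type fields."""
    payload = json.loads(raw)
    return payload[target_label], payload["text_type"], payload["enterprise_type"]


# Hypothetical instruction payload.
raw_instruction = json.dumps({
    "target_text": "In 2019 the company reported revenue of 5.2 billion yuan.",
    "text_type": "financial_indicators",
    "enterprise_type": "chemical",
})
target_text, text_type, enterprise_type = parse_instruction(raw_instruction)
```

Because the types travel with the instruction, no classifier is needed at this point; the device only reads the fields that accompany the target text.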
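S16: determine, from the summary template library, a target summary template that matches both the text type and the enterprise type.
In at least one embodiment of this application, because the summary template library records the template information of at least one summary template, and the template information includes the summary template, the summary category of the summary template, and the enterprise category corresponding to the summary template, the electronic device can determine from the summary template library the target summary category and the target enterprise category that correspond to the text type and the enterprise type. The electronic device then determines the target summary template from the summary template library according to the target summary category and the target enterprise category.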
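The library and the lookup in S16 can be sketched as a dictionary keyed by the pair (summary category, enterprise category). The record layout, the two type-to-category mappings, and the functions `build_library` and `select_template` are assumptions for illustration; the application only states that the library records each template together with its two categories.

```python
from collections import defaultdict


def build_library(records):
    """Group template strings by (summary category, enterprise category), dropping duplicates."""
    library = defaultdict(list)
    for record in records:
        key = (record["summary_category"], record["enterprise_category"])
        if record["template"] not in library[key]:
            library[key].append(record["template"])
    return dict(library)


def select_template(library, text_type, enterprise_type,
                    summary_category_of, enterprise_category_of):
    """Map the two types to their categories, then return the first matching template or None."""
    key = (summary_category_of[text_type], enterprise_category_of[enterprise_type])
    candidates = library.get(key, [])
    return candidates[0] if candidates else None
```

Keying on the pair enforces the requirement that the target template match both the text type and the enterprise type at once; a template that matches only one of the two is never returned.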
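S17: extract, from the target text, the information required by the target summary template, and generate a summary corresponding to the target text according to the extracted information and the target summary template.
In at least one embodiment of this application, extracting the information required by the target summary template from the target text includes:
The electronic device extracts the target identifiers corresponding to the blank slots in the target summary template, and then extracts from the target text the feature values corresponding to the target identifiers as the required information.
The target identifiers form a bridge between the target summary template and the target text, which helps the information to be entered accurately into the target summary template.
In at least one embodiment of this application, the electronic device enters the extracted information into the blank slots of the target summary template that correspond to the target identifiers, obtaining the summary corresponding to the target text.
With the above implementation, an accurate summary can be generated.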
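Slot filling in S17 might look like the following sketch. It assumes slots are written as `{identifier}` and that each identifier has a regular expression, with one capturing group, that locates its value in the target text; the application specifies neither convention, so the `SLOT` pattern, the `extractors` mapping, and `fill_template` are hypothetical.

```python
import re

SLOT = re.compile(r"\{(\w+)\}")


def fill_template(template, target_text, extractors):
    """Find each slot's identifier, pull its value from the text, and fill the slot."""
    values = {}
    for identifier in SLOT.findall(template):
        match = extractors[identifier].search(target_text)
        values[identifier] = match.group(1) if match else ""
    summary = SLOT.sub(lambda m: values[m.group(1)], template)
    return summary, values


# Hypothetical extractors, one per target identifier.
EXTRACTORS = {
    "period": re.compile(r"\b(20\d{2})\b"),
    "revenue": re.compile(r"revenue of ([\d.]+ (?:million|billion) yuan)"),
}
```

Returning the extracted values alongside the filled summary makes a later completeness check straightforward, since an identifier with an empty value marks a slot that could not be filled.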
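In at least one embodiment of this application, after the summary corresponding to the target text is generated, the method further includes:
The electronic device determines a target parameter list according to the enterprise type and obtains all parameters in the target parameter list. It then determines whether the summary contains all the parameters; when it detects that the summary contains all the parameters, it generates prompt information according to the summary and sends the prompt information to the terminal device of a designated contact.
The parameters in the target parameter list are the parameters that an announcement summary of that enterprise type must contain.
The designated contact may be the person responsible for generating the summary.
Because different enterprise types correspond to different parameter lists, determining whether the generated summary contains all the parameters ensures that the generated summary has every required parameter. In addition, the prompt information is sent to the terminal device of the designated contact to remind the designated contact to review it.
As can be seen from the above technical solutions, because this application directly analyzes the information in announcement summaries and does not need to analyze the gist of the original reports, fewer training samples are required; consequently, with the same training samples, the model obtained by this application is more accurate, which improves the accuracy of summary generation. In addition, generating the summary from a summary template ensures the coherence of the generated summary.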
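The completeness check can be sketched as follows, assuming the generated summary is accompanied by a dictionary of the values that were filled in. The per-enterprise-type parameter lists, the message wording, and the functions `missing_parameters` and `review_prompt` are hypothetical, and actually delivering the prompt to a terminal device is out of scope for the sketch.

```python
# Hypothetical required-parameter lists, one per enterprise type.
REQUIRED = {
    "chemical": ["period", "revenue"],
}


def missing_parameters(values, required):
    """Return the required parameter names that have no value in the summary."""
    return [name for name in required if not values.get(name)]


def review_prompt(summary, values, enterprise_type, contact):
    """Build prompt information only when every required parameter is present."""
    if missing_parameters(values, REQUIRED[enterprise_type]):
        return None
    return {"to": contact, "message": "Summary ready for review: " + summary}
```

As in the text, the prompt is generated only when the summary contains all the parameters; the sketch simply returns `None` in the other case, which the application does not describe.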
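FIG. 2 is a functional block diagram of a preferred embodiment of the summary generation apparatus of this application. The summary generation apparatus 11 includes an execution unit 110, a preprocessing unit 111, a generation unit 112, a fusion unit 113, an extraction unit 114, a determination unit 115, an acquisition unit 116, an annotation unit 117, a construction unit 118, a division unit 119, a processing unit 120, an input unit 121, a calculation unit 122, a judgment unit 123, and a sending unit 124. A module/unit in this application refers to a series of computer program segments that can be obtained by the processor 13, can perform a fixed function, and are stored in the memory 12. The functions of each module/unit are described in detail in the following embodiments.
The execution unit 110 obtains at least one announcement summary of at least one enterprise and de-duplicates the at least one announcement summary.
In at least one embodiment of this application, the sources of the at least one announcement summary include, but are not limited to, financial websites that publish announcement summaries and the websites of at least one enterprise. Crawlers are built for these websites and used to crawl web page information from them, yielding at least one publicly available announcement summary.
Because the same announcement summary may appear in different sources, the announcement summaries need to be de-duplicated.
In at least one embodiment of this application, the execution unit 110 de-duplicating the at least one announcement summary includes:
The execution unit 110 computes a hash value for each announcement summary from the summary's title. It then extracts preset features from each announcement summary and builds a feature index. From the hash values of every two announcement summaries, the execution unit 110 uses the cosine distance formula to compute their similarity distance, obtaining a similarity distance for each summary pair, where each summary pair consists of any two announcement summaries. Through the feature index, the execution unit 110 searches for summary pairs whose similarity distance is greater than a threshold and designates them as similar summary pairs. Finally, the execution unit 110 checks whether the preset features of the two summaries in a similar summary pair are the same and, when they are, deletes either one of the two summaries.
With the above implementation, duplicate announcement summaries do not need to be analyzed again, which improves the efficiency of generating the summary template library and also saves memory on the electronic device.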
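The preprocessing unit 111 preprocesses each de-duplicated announcement summary to obtain at least one token of each announcement summary.
In at least one embodiment of this application, the preprocessing unit 111 preprocessing each de-duplicated announcement summary to obtain at least one token of each announcement summary includes:
The preprocessing unit 111 denoises each de-duplicated announcement summary to obtain a first text, and performs lexical analysis on preset fields in the first text to obtain a second text. It then segments the second text according to a preset custom dictionary to obtain segmentation positions, constructs at least one directed acyclic graph (Directed acyclic graph, DAG) from those positions, and computes the probability of each directed acyclic graph from the weights in the custom dictionary. The segmentation positions of the directed acyclic graph with the highest probability are taken as the target segmentation positions, from which at least one feature word is determined. The preprocessing unit 111 standardizes the at least one feature word to obtain the at least one token of each announcement summary.
The denoising removes tags, special characters, stop words, and the like from each announcement summary.
The preset fields include, but are not limited to, times, monetary amounts, and percentages.
The preset custom dictionary stores at least one custom word and the weight corresponding to each custom word.
Specifically, the preprocessing unit 111 standardizing the at least one feature word to obtain the at least one token of each announcement summary includes:
The preprocessing unit 111 uses a shallow semantic analysis method to recognize the at least one feature word, and normalizes the recognized feature words that have similar meanings to obtain the at least one token.
For example, the preprocessing unit 111 recognizes that "营业额" (turnover) and "营业收入" (operating revenue) are feature words with similar meanings, normalizes them, and obtains the token "营业额".
Denoising each de-duplicated announcement summary both reduces the amount of useless data and saves memory on the electronic device; performing lexical analysis on the preset fields avoids unnecessary perturbation when summary templates are generated later; segmenting the second text with a weighted custom dictionary allows the at least one feature word to be determined accurately; and standardizing the at least one feature word unifies how the feature words are expressed, which helps the parameter extraction model extract entities.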
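The generation unit 112 inputs the at least one token of each announcement summary into a pre-trained parameter extraction model to generate at least one summary template.
In at least one embodiment of this application, before the at least one token of each announcement summary is input into the pre-trained parameter extraction model, the acquisition unit 116 obtains at least one historical summary using web crawler technology; the annotation unit 117 labels each historical summary with a summary category; the construction unit 118 constructs a data set from the at least one historical summary and the corresponding summary categories; the division unit 119 divides the data set by cross-validation to obtain a training set and a validation set; the processing unit 120 performs word segmentation on each historical summary in the training set and the validation set to obtain at least one feature of the training set and at least one feature of the validation set; the input unit 121 inputs the at least one feature of the training set into an input gate layer for training to obtain a learner; and the execution unit 110, according to the at least one feature of the validation set, performs error analysis on the learner and adjusts it until the error is less than a configured value, yielding the parameter extraction model.
The input gate layer contains a preset parameter list for at least one enterprise type, and the preset parameter list can be determined by analyzing announcement summaries.
With the above implementation, an accurate parameter extraction model can be trained and then used to extract parameters from each announcement summary, which facilitates the generation of summary templates.
In at least one embodiment of this application, after the data set is constructed from the at least one historical summary and the corresponding summary categories, the calculation unit 122 counts the number of historical summaries corresponding to each summary category, and the judgment unit 123 determines whether the number is less than a preset number. When the number is less than the preset number, the execution unit 110 uses a perturbation method to increase the number of historical summaries of that summary category.
This application places no restriction on the value of the preset number.
The above implementation prevents an insufficient number of historical summary samples in some summary category from producing an inaccurate parameter extraction model, which would in turn reduce the accuracy of summary generation.
In at least one embodiment of this application, the division unit 119 dividing the data set by cross-validation to obtain a training set and a validation set includes:
The division unit 119 randomly divides the data set into at least one data packet according to a preset ratio. It then designates any one of the data packets as the validation set and the remaining data packets as the training set, and repeats this step until every data packet has been used in turn as the validation set.
The preset ratio can be set as desired; this application places no restriction on it.
For example, the division unit 119 divides the data set into three data packets, E, F, and G. It first uses data packet E as the validation set and data packets F and G as the training set. Next, it uses data packet F as the validation set and data packets E and G as the training set. Finally, it uses data packet G as the validation set and data packets E and F as the training set.
Dividing the data set in this way lets every item in the data set take part in both training and validation, which improves the fit of the trained parameter extraction model.
In at least one embodiment of this application, the generation unit 112 inputs the at least one token of each announcement summary into the pre-trained parameter extraction model to obtain the entity corresponding to each token. It then fuses the entities in each announcement summary to obtain the summary template corresponding to that announcement summary, and consolidates the summary templates that share the same summary category and the same enterprise category to obtain the at least one summary template.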
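The fusion unit 113 fuses the at least one summary template to obtain a summary template library.
In at least one embodiment of this application, the summary template library records the template information of at least one summary template; the template information includes the summary template, the summary category of the summary template, and the enterprise category corresponding to the summary template.
The summary category may be, for example, financial indicators, and the enterprise category may be, for example, the chemical industry.
Specifically, when the acquisition unit 116 crawls web page information from various websites, it can obtain the identifier of the web page column to which each announcement summary belongs, and the determination unit 115 then determines the summary category and the enterprise category of the announcement summary from that column identifier.
Recording the summary category and enterprise category of each summary template in the summary template library lays the foundation for selecting the target summary template from the library later.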
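When a summary generation instruction is received, the extraction unit 114 extracts a target text from the summary generation instruction.
In at least one embodiment of this application, the information in the summary generation instruction includes the target text, the text type to which the target text belongs, the enterprise type of the enterprise corresponding to the target text, and the like.
In at least one embodiment of this application, the extraction unit 114 extracting the target text from the summary generation instruction includes:
The extraction unit 114 determines a target label and extracts the text information corresponding to the target label from the summary generation instruction as the target text.
The determination unit 115 determines the text type to which the target text belongs and the enterprise type of the enterprise corresponding to the target text.
Because the information in the summary generation instruction includes the target text, the text type to which the target text belongs, and the enterprise type of the enterprise corresponding to the target text, the determination unit 115 can use the target text to determine the text type and the enterprise type from the summary generation instruction.
The determination unit 115 determines, from the summary template library, a target summary template that matches both the text type and the enterprise type.
In at least one embodiment of this application, because the summary template library records the template information of at least one summary template, and the template information includes the summary template, the summary category of the summary template, and the enterprise category corresponding to the summary template, the determination unit 115 can determine from the summary template library the target summary category and the target enterprise category that correspond to the text type and the enterprise type. The determination unit 115 then determines the target summary template from the summary template library according to the target summary category and the target enterprise category.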
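The generation unit 112 extracts, from the target text, the information required by the target summary template, and generates a summary corresponding to the target text according to the extracted information and the target summary template.
In at least one embodiment of this application, the generation unit 112 extracting the information required by the target summary template from the target text includes:
The generation unit 112 extracts the target identifiers corresponding to the blank slots in the target summary template, and then extracts from the target text the feature values corresponding to the target identifiers as the required information.
The target identifiers form a bridge between the target summary template and the target text, which helps the information to be entered accurately into the target summary template.
In at least one embodiment of this application, the generation unit 112 enters the extracted information into the blank slots of the target summary template that correspond to the target identifiers, obtaining the summary corresponding to the target text.
With the above implementation, an accurate summary can be generated.
In at least one embodiment of this application, after the summary corresponding to the target text is generated, the determination unit 115 determines a target parameter list according to the enterprise type, the acquisition unit 116 obtains all parameters in the target parameter list, and the judgment unit 123 determines whether the summary contains all the parameters. When it is detected that the summary contains all the parameters, the generation unit 112 generates prompt information according to the summary, and the sending unit 124 sends the prompt information to the terminal device of a designated contact.
The parameters in the target parameter list are the parameters that an announcement summary of that enterprise type must contain.
The designated contact may be the person responsible for generating the summary.
Because different enterprise types correspond to different parameter lists, determining whether the generated summary contains all the parameters ensures that the generated summary has every required parameter. In addition, the prompt information is sent to the terminal device of the designated contact to remind the designated contact to review it.
As can be seen from the above technical solutions, because this application directly analyzes the information in announcement summaries and does not need to analyze the gist of the original reports, fewer training samples are required; consequently, with the same training samples, the model obtained by this application is more accurate, which improves the accuracy of summary generation. In addition, generating the summary from a summary template ensures the coherence of the generated summary.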
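FIG. 3 is a schematic structural diagram of an electronic device implementing a preferred embodiment of the summary generation method of this application.
In one embodiment of this application, the electronic device 1 includes, but is not limited to, a memory 12, a processor 13, and a computer program stored in the memory 12 and runnable on the processor 13, such as a summary generation program.
Those skilled in the art will understand that the schematic diagram is merely an example of the electronic device 1 and does not limit it; the electronic device 1 may include more or fewer components than shown, may combine certain components, or may use different components. For example, the electronic device 1 may also include input/output devices, network access devices, buses, and the like.
The processor 13 may be a central processing unit (Central Processing Unit, CPU), or another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. A general-purpose processor may be a microprocessor or any conventional processor. The processor 13 is the computing core and control center of the electronic device 1; it uses various interfaces and lines to connect all parts of the electronic device 1, and obtains the operating system of the electronic device 1 and the installed applications, program code, and so on.
The processor 13 obtains the operating system of the electronic device 1 and the installed applications. The processor 13 obtains the applications to implement the steps in the above summary generation method embodiments, for example the steps shown in FIG. 1.
By way of example, the computer program may be divided into one or more modules/units, which are stored in the memory 12 and obtained by the processor 13 to implement this application. The one or more modules/units may be a series of computer-readable instruction segments capable of performing specific functions; the instruction segments describe how the computer program is obtained in the electronic device 1. For example, the computer program may be divided into an execution unit 110, a preprocessing unit 111, a generation unit 112, a fusion unit 113, an extraction unit 114, a determination unit 115, an acquisition unit 116, an annotation unit 117, a construction unit 118, a division unit 119, a processing unit 120, an input unit 121, a calculation unit 122, a judgment unit 123, and a sending unit 124.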
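The memory 12 may be used to store the computer-readable instructions and/or modules. The processor 13 implements the various functions of the electronic device 1 by running or executing the computer-readable instructions and/or modules stored in the memory 12 and by calling the data stored in the memory 12. The memory 12 may mainly include a readable-instruction storage area and a data storage area: the readable-instruction storage area may store an operating system and the application readable instructions required by at least one function (such as a sound playback function or an image playback function), and the data storage area may store data created through use of the electronic device. In addition, the memory 12 may include non-volatile and volatile memory, such as a hard disk, internal memory, a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, a flash card (Flash Card), at least one magnetic disk storage device, a flash memory device, or another storage device.
The memory 12 may be an external memory and/or an internal memory of the electronic device 1. Further, the memory 12 may be a memory in physical form, such as a memory stick or a TF card (Trans-flash Card).
If the modules/units integrated in the electronic device 1 are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium, which may be non-volatile or volatile. On this understanding, all or part of the processes in the methods of the above embodiments can also be completed by computer-readable instructions that instruct the relevant hardware; the computer-readable instructions can be stored in a computer-readable storage medium and, when executed by a processor, can implement the steps of the above method embodiments.
The computer-readable instructions include computer-readable instruction code, which may be in source code form, object code form, an executable file, or some intermediate form. The computer-readable medium may include any entity or device capable of carrying the computer-readable instruction code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, computer memory, read-only memory, random access memory, and so on.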
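With reference to FIG. 1, the memory 12 in the electronic device 1 stores multiple instructions to implement a summary generation method, and the processor 13 can obtain the multiple instructions so as to: obtain at least one announcement summary of at least one enterprise, and de-duplicate the at least one announcement summary; preprocess each de-duplicated announcement summary to obtain at least one token of each announcement summary; input the at least one token of each announcement summary into a pre-trained parameter extraction model to generate at least one summary template; fuse the at least one summary template to obtain a summary template library; when a summary generation instruction is received, extract a target text from the summary generation instruction; determine the text type to which the target text belongs and the enterprise type of the enterprise corresponding to the target text; determine, from the summary template library, a target summary template that matches both the text type and the enterprise type; and extract, from the target text, the information required by the target summary template and generate a summary corresponding to the target text according to the extracted information and the target summary template.
Specifically, for how the processor 13 implements the above instructions, refer to the description of the relevant steps in the embodiment corresponding to FIG. 1; the details are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into modules is only a division by logical function, and other divisions are possible in an actual implementation.
The modules described as separate components may or may not be physically separate, and the components shown as modules may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional modules in the embodiments of this application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional modules.
Therefore, the embodiments should be regarded in every respect as exemplary and non-limiting. The scope of this application is defined by the appended claims rather than by the above description, and all changes that fall within the meaning and range of equivalents of the claims are intended to be embraced by this application. No reference sign in a claim should be construed as limiting the claim concerned.
Furthermore, the word "comprising" clearly does not exclude other units or steps, and the singular does not exclude the plural. Multiple units or apparatuses stated in this application may also be implemented by a single unit or apparatus through software or hardware. Words such as "first" and "second" are used as names and do not indicate any particular order.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the preferred embodiments, those of ordinary skill in the art should understand that the technical solutions of this application may be modified or equivalently replaced without departing from the spirit and scope of the technical solutions of this application.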

Claims (20)

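  1. A summary generation method, wherein the summary generation method comprises:
    obtaining at least one announcement summary of at least one enterprise, and de-duplicating the at least one announcement summary;
    preprocessing each de-duplicated announcement summary to obtain at least one token of each announcement summary;
    inputting the at least one token of each announcement summary into a pre-trained parameter extraction model to generate at least one summary template;
    fusing the at least one summary template to obtain a summary template library;
    when a summary generation instruction is received, extracting a target text from the summary generation instruction;
    determining the text type to which the target text belongs, and determining the enterprise type of the enterprise corresponding to the target text;
    determining, from the summary template library, a target summary template that matches both the text type and the enterprise type;
    extracting, from the target text, the information required by the target summary template, and generating a summary corresponding to the target text according to the extracted information and the target summary template.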
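  2. The summary generation method according to claim 1, wherein de-duplicating the at least one announcement summary comprises:
    computing a hash value of each announcement summary according to the summary title in each announcement summary;
    extracting preset features from each announcement summary and building a feature index;
    computing, according to the hash values of every two announcement summaries, the similarity distance of every two announcement summaries using the cosine distance formula, to obtain the similarity distance of each summary pair, wherein each summary pair comprises any two announcement summaries;
    searching, through the feature index, for summary pairs whose similarity distance is greater than a threshold, and determining such summary pairs as similar summary pairs;
    determining whether the preset features in a similar summary pair are the same;
    when the preset features in the similar summary pair are the same, deleting either summary of the similar summary pair.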
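  3. The summary generation method according to claim 1, wherein preprocessing each de-duplicated announcement summary to obtain at least one token of each announcement summary comprises:
    denoising each de-duplicated announcement summary to obtain a first text;
    performing lexical analysis on preset fields in the first text to obtain a second text;
    segmenting the second text according to a preset custom dictionary to obtain segmentation positions;
    constructing at least one directed acyclic graph according to the segmentation positions;
    computing the probability of each directed acyclic graph according to the weights in the custom dictionary;
    determining the segmentation positions corresponding to the directed acyclic graph with the highest probability as target segmentation positions;
    determining at least one feature word according to the target segmentation positions;
    standardizing the at least one feature word to obtain the at least one token of each announcement summary.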
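  4. The summary generation method according to claim 1, wherein, before the at least one token of each announcement summary is input into the pre-trained parameter extraction model, the summary generation method further comprises:
    obtaining at least one historical summary using web crawler technology;
    labeling the at least one historical summary with summary categories to obtain the summary category corresponding to each historical summary;
    constructing a data set based on the at least one historical summary and the corresponding summary categories;
    dividing the data set by cross-validation to obtain a training set and a validation set;
    performing word segmentation on each historical summary in the training set and the validation set to obtain at least one feature of the training set and at least one feature of the validation set;
    inputting the at least one feature of the training set into an input gate layer for training to obtain a learner;
    performing error analysis on the learner and adjusting it according to the at least one feature of the validation set until the error is less than a configured value, to obtain the parameter extraction model.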
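  5. The summary generation method according to claim 4, wherein, after the data set is constructed based on the at least one historical summary and the corresponding summary categories, the summary generation method further comprises:
    counting the number of historical summaries corresponding to each summary category;
    determining whether the number is less than a preset number;
    when the number is less than the preset number, increasing the number of historical summaries of the corresponding summary category by a perturbation method.
  6. The summary generation method according to claim 1, wherein the summary template library records the template information of at least one summary template, the template information comprising the summary template, the summary category of the summary template, and the enterprise category corresponding to the summary template.
  7. The summary generation method according to claim 1, wherein, after the summary corresponding to the target text is generated, the summary generation method further comprises:
    determining a target parameter list according to the enterprise type;
    obtaining all parameters in the target parameter list;
    determining whether the summary contains all the parameters;
    when it is detected that the summary contains all the parameters, generating prompt information according to the summary;
    sending the prompt information to the terminal device of a designated contact.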
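  8. A summary generation apparatus, wherein the summary generation apparatus comprises:
    an execution unit, configured to obtain at least one announcement summary of at least one enterprise and to de-duplicate the at least one announcement summary;
    a preprocessing unit, configured to preprocess each de-duplicated announcement summary to obtain at least one token of each announcement summary;
    a generation unit, configured to input the at least one token of each announcement summary into a pre-trained parameter extraction model to generate at least one summary template;
    a fusion unit, configured to fuse the at least one summary template to obtain a summary template library;
    an extraction unit, configured to extract a target text from a summary generation instruction when the summary generation instruction is received;
    a determination unit, configured to determine the text type to which the target text belongs and the enterprise type of the enterprise corresponding to the target text;
    the determination unit being further configured to determine, from the summary template library, a target summary template that matches both the text type and the enterprise type;
    the generation unit being further configured to extract, from the target text, the information required by the target summary template, and to generate a summary corresponding to the target text according to the extracted information and the target summary template.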
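  9. An electronic device, wherein the electronic device comprises a processor and a memory, the processor being configured to execute at least one computer-readable instruction stored in the memory to implement the following steps:
    obtaining at least one announcement summary of at least one enterprise, and de-duplicating the at least one announcement summary;
    preprocessing each de-duplicated announcement summary to obtain at least one token of each announcement summary;
    inputting the at least one token of each announcement summary into a pre-trained parameter extraction model to generate at least one summary template;
    fusing the at least one summary template to obtain a summary template library;
    when a summary generation instruction is received, extracting a target text from the summary generation instruction;
    determining the text type to which the target text belongs, and determining the enterprise type of the enterprise corresponding to the target text;
    determining, from the summary template library, a target summary template that matches both the text type and the enterprise type;
    extracting, from the target text, the information required by the target summary template, and generating a summary corresponding to the target text according to the extracted information and the target summary template.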
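  10. The electronic device according to claim 9, wherein, when de-duplicating the at least one announcement summary, the processor executes the at least one computer-readable instruction to implement the following steps:
    computing a hash value of each announcement summary according to the summary title in each announcement summary;
    extracting preset features from each announcement summary and building a feature index;
    computing, according to the hash values of every two announcement summaries, the similarity distance of every two announcement summaries using the cosine distance formula, to obtain the similarity distance of each summary pair, wherein each summary pair comprises any two announcement summaries;
    searching, through the feature index, for summary pairs whose similarity distance is greater than a threshold, and determining such summary pairs as similar summary pairs;
    determining whether the preset features in a similar summary pair are the same;
    when the preset features in the similar summary pair are the same, deleting either summary of the similar summary pair.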
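  11. The electronic device according to claim 9, wherein, when preprocessing each de-duplicated announcement summary to obtain at least one token of each announcement summary, the processor executes the at least one computer-readable instruction to implement the following steps:
    denoising each de-duplicated announcement summary to obtain a first text;
    performing lexical analysis on preset fields in the first text to obtain a second text;
    segmenting the second text according to a preset custom dictionary to obtain segmentation positions;
    constructing at least one directed acyclic graph according to the segmentation positions;
    computing the probability of each directed acyclic graph according to the weights in the custom dictionary;
    determining the segmentation positions corresponding to the directed acyclic graph with the highest probability as target segmentation positions;
    determining at least one feature word according to the target segmentation positions;
    standardizing the at least one feature word to obtain the at least one token of each announcement summary.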
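  12. The electronic device according to claim 9, wherein, before the at least one token of each announcement summary is input into the pre-trained parameter extraction model, the processor executes the at least one computer-readable instruction to further implement the following steps:
    obtaining at least one historical summary using web crawler technology;
    labeling the at least one historical summary with summary categories to obtain the summary category corresponding to each historical summary;
    constructing a data set based on the at least one historical summary and the corresponding summary categories;
    dividing the data set by cross-validation to obtain a training set and a validation set;
    performing word segmentation on each historical summary in the training set and the validation set to obtain at least one feature of the training set and at least one feature of the validation set;
    inputting the at least one feature of the training set into an input gate layer for training to obtain a learner;
    performing error analysis on the learner and adjusting it according to the at least one feature of the validation set until the error is less than a configured value, to obtain the parameter extraction model.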
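  13. The electronic device according to claim 12, wherein, after the data set is constructed based on the at least one historical summary and the corresponding summary categories, the processor executes the at least one computer-readable instruction to further implement the following steps:
    counting the number of historical summaries corresponding to each summary category;
    determining whether the number is less than a preset number;
    when the number is less than the preset number, increasing the number of historical summaries of the corresponding summary category by a perturbation method.
  14. The electronic device according to claim 9, wherein, after the summary corresponding to the target text is generated, the processor executes the at least one computer-readable instruction to further implement the following steps:
    determining a target parameter list according to the enterprise type;
    obtaining all parameters in the target parameter list;
    determining whether the summary contains all the parameters;
    when it is detected that the summary contains all the parameters, generating prompt information according to the summary;
    sending the prompt information to the terminal device of a designated contact.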
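  15. A computer-readable storage medium, wherein the computer-readable storage medium stores at least one computer-readable instruction, and the at least one computer-readable instruction, when executed by a processor, implements the following steps:
    obtaining at least one announcement summary of at least one enterprise, and de-duplicating the at least one announcement summary;
    preprocessing each de-duplicated announcement summary to obtain at least one token of each announcement summary;
    inputting the at least one token of each announcement summary into a pre-trained parameter extraction model to generate at least one summary template;
    fusing the at least one summary template to obtain a summary template library;
    when a summary generation instruction is received, extracting a target text from the summary generation instruction;
    determining the text type to which the target text belongs, and determining the enterprise type of the enterprise corresponding to the target text;
    determining, from the summary template library, a target summary template that matches both the text type and the enterprise type;
    extracting, from the target text, the information required by the target summary template, and generating a summary corresponding to the target text according to the extracted information and the target summary template.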
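  16. The storage medium according to claim 15, wherein, when de-duplicating the at least one announcement summary, the at least one computer-readable instruction is executed by the processor to implement the following steps:
    computing a hash value of each announcement summary according to the summary title in each announcement summary;
    extracting preset features from each announcement summary and building a feature index;
    computing, according to the hash values of every two announcement summaries, the similarity distance of every two announcement summaries using the cosine distance formula, to obtain the similarity distance of each summary pair, wherein each summary pair comprises any two announcement summaries;
    searching, through the feature index, for summary pairs whose similarity distance is greater than a threshold, and determining such summary pairs as similar summary pairs;
    determining whether the preset features in a similar summary pair are the same;
    when the preset features in the similar summary pair are the same, deleting either summary of the similar summary pair.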
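  17. The storage medium according to claim 15, wherein, when preprocessing each de-duplicated announcement summary to obtain at least one token of each announcement summary, the at least one computer-readable instruction is executed by the processor to implement the following steps:
    denoising each de-duplicated announcement summary to obtain a first text;
    performing lexical analysis on preset fields in the first text to obtain a second text;
    segmenting the second text according to a preset custom dictionary to obtain segmentation positions;
    constructing at least one directed acyclic graph according to the segmentation positions;
    computing the probability of each directed acyclic graph according to the weights in the custom dictionary;
    determining the segmentation positions corresponding to the directed acyclic graph with the highest probability as target segmentation positions;
    determining at least one feature word according to the target segmentation positions;
    standardizing the at least one feature word to obtain the at least one token of each announcement summary.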
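  18. The storage medium according to claim 15, wherein, before the at least one token of each announcement summary is input into the pre-trained parameter extraction model, the at least one computer-readable instruction is executed by the processor to further implement the following steps:
    obtaining at least one historical summary using web crawler technology;
    labeling the at least one historical summary with summary categories to obtain the summary category corresponding to each historical summary;
    constructing a data set based on the at least one historical summary and the corresponding summary categories;
    dividing the data set by cross-validation to obtain a training set and a validation set;
    performing word segmentation on each historical summary in the training set and the validation set to obtain at least one feature of the training set and at least one feature of the validation set;
    inputting the at least one feature of the training set into an input gate layer for training to obtain a learner;
    performing error analysis on the learner and adjusting it according to the at least one feature of the validation set until the error is less than a configured value, to obtain the parameter extraction model.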
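  19. The storage medium according to claim 18, wherein, after the data set is constructed based on the at least one historical summary and the corresponding summary categories, the at least one computer-readable instruction, when executed by the processor, further implements the following steps:
    counting the number of historical summaries corresponding to each summary category;
    determining whether the number is less than a preset number;
    when the number is less than the preset number, increasing the number of historical summaries of the corresponding summary category by a perturbation method.
  20. The storage medium according to claim 15, wherein, after the summary corresponding to the target text is generated, the at least one computer-readable instruction is executed by the processor to further implement the following steps:
    determining a target parameter list according to the enterprise type;
    obtaining all parameters in the target parameter list;
    determining whether the summary contains all the parameters;
    when it is detected that the summary contains all the parameters, generating prompt information according to the summary;
    sending the prompt information to the terminal device of a designated contact.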
PCT/CN2021/070995 2020-03-31 2021-01-09 摘要生成方法、装置、电子设备及介质 WO2021196825A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010244210.0 2020-03-31
CN202010244210.0A CN111552800A (zh) 2020-03-31 2020-03-31 摘要生成方法、装置、电子设备及介质

Publications (1)

Publication Number Publication Date
WO2021196825A1 true WO2021196825A1 (zh) 2021-10-07

Family

ID=72003780

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/070995 WO2021196825A1 (zh) 2020-03-31 2021-01-09 Summary generation method and apparatus, electronic device, and medium

Country Status (2)

Country Link
CN (1) CN111552800A (zh)
WO (1) WO2021196825A1 (zh)

Also Published As

Publication number Publication date
CN111552800A (zh) 2020-08-18

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21780100

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 16/01/2023)

122 Ep: pct application non-entry in european phase

Ref document number: 21780100

Country of ref document: EP

Kind code of ref document: A1