CN109558583A - Method, apparatus and device for automatically generating a digest - Google Patents

Method, apparatus and device for automatically generating a digest

Publication number
CN109558583A
Authority
CN
China
Prior art keywords
processed
vector
document
feature
lexical characteristics
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710892106.0A
Other languages
Chinese (zh)
Inventor
姜珊珊
童毅轩
张永伟
张佳师
董滨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ricoh Co Ltd
Original Assignee
Ricoh Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ricoh Co Ltd
Priority to CN201710892106.0A (patent CN109558583A)
Priority to JP2018134689A (patent JP6579239B2)
Publication of CN109558583A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Abstract

The present invention provides a method, apparatus and device for automatically generating a digest, relates to the technical field of text processing, and can improve the readability of the generated digest. The method comprises: extracting lexical features of a document to be processed; extracting syntactic features of the document to be processed; obtaining word vectors of the document to be processed, and obtaining vector representations of the lexical features and vector representations of the syntactic features; concatenating the word vectors, the vector representations of the lexical features, and the vector representations of the syntactic features to obtain information to be processed; and using the information to be processed as the input of an encoder to obtain a digest of the document to be processed. Embodiments of the present invention can improve the readability of the generated digest.

Description

Method, apparatus and device for automatically generating a digest
Technical field
The present invention relates to the technical field of text processing, and in particular to a method, apparatus and device for automatically generating a digest.
Background art
The goal of automatic summarization is, given one or more documents, to generate a summary that is much shorter than the original text while retaining the important information of the original documents. With the development and spread of deep learning, methods based on topic words and language models are gradually being replaced by text generation methods based on the encoder-decoder architecture. Typical encoders and decoders include the recurrent neural network (RNN) and its variants LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit).
The prior art proposes an encoder that incorporates lexical features: the word vectors and the vector representations of the lexical features are concatenated and used together as the encoder input, thereby emphasizing the role of entity words and content words.
An important measure of text summarization quality is readability. The above prior art focuses mainly on keywords, which are usually nouns or noun phrases. Having more keywords without good connections between them does not make a digest readable, so the readability of a digest generated with the prior art cannot be guaranteed.
Summary of the invention
In view of this, the present invention provides a method, apparatus and device for automatically generating a digest that can improve the readability of the generated digest.
To solve the above technical problem, in a first aspect, the present invention provides a method for automatically generating a digest, comprising:
extracting lexical features of a document to be processed;
extracting syntactic features of the document to be processed;
obtaining word vectors of the document to be processed, and obtaining vector representations of the lexical features and vector representations of the syntactic features;
concatenating the word vectors, the vector representations of the lexical features, and the vector representations of the syntactic features to obtain information to be processed;
using the information to be processed as the input of an encoder to obtain a digest of the document to be processed.
Wherein the lexical features include: a part-of-speech feature, a named-entity feature, a term frequency statistical feature, and an inverse document frequency statistical feature.
Wherein the syntactic features include: a dependency relation feature and a syntactic constituent feature.
Wherein obtaining the vector representations of the lexical features and the vector representations of the syntactic features comprises:
obtaining the continuous-valued features and the discrete-valued features among the lexical features and the syntactic features;
representing the discrete-valued features as one-hot vectors;
converting the continuous-valued features into target discrete-valued features, and representing the target discrete-valued features as one-hot vectors.
Wherein converting the continuous-valued features into target discrete-valued features, and representing the target discrete-valued features as one-hot vectors, comprises:
assigning the continuous feature values to a preset number of buckets, thereby converting them into target discrete feature values;
representing, as a one-hot vector, the number of the bucket used to convert a target continuous feature value, among the continuous feature values, into a discrete feature value.
Wherein concatenating the word vectors, the vector representations of the lexical features, and the vector representations of the syntactic features to obtain the information to be processed comprises:
for each word in the document to be processed, concatenating end to end the word vector, the vector representation of the lexical features, and the vector representation of the syntactic features corresponding to that word to form one vector, and using the multiple vectors so formed as the information to be processed.
In a second aspect, an embodiment of the present invention provides an apparatus for automatically generating a digest, comprising:
a first extraction module, configured to extract lexical features of a document to be processed;
a second extraction module, configured to extract syntactic features of the document to be processed;
an obtaining module, configured to obtain word vectors of the document to be processed, and to obtain vector representations of the lexical features and vector representations of the syntactic features;
a concatenation module, configured to concatenate the word vectors, the vector representations of the lexical features, and the vector representations of the syntactic features to obtain information to be processed;
a processing module, configured to use the information to be processed as the input of an encoder to obtain a digest of the document to be processed.
In a third aspect, an embodiment of the present invention provides an electronic device, comprising: a processor; and a memory in which computer program instructions are stored,
wherein the computer program instructions, when run by the processor, cause the processor to perform the following steps:
extracting lexical features of a document to be processed;
extracting syntactic features of the document to be processed;
obtaining word vectors of the document to be processed, and obtaining vector representations of the lexical features and vector representations of the syntactic features;
concatenating the word vectors, the vector representations of the lexical features, and the vector representations of the syntactic features to obtain information to be processed;
using the information to be processed as the input of an encoder to obtain a digest of the document to be processed.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium storing a computer program which, when run by a processor, causes the processor to perform the following steps:
extracting lexical features of a document to be processed;
extracting syntactic features of the document to be processed;
obtaining word vectors of the document to be processed, and obtaining vector representations of the lexical features and vector representations of the syntactic features;
concatenating the word vectors, the vector representations of the lexical features, and the vector representations of the syntactic features to obtain information to be processed;
using the information to be processed as the input of an encoder to obtain a digest of the document to be processed.
The above technical solutions of the present invention have the following beneficial effects:
In the embodiments of the present invention, the lexical features and syntactic features of the document to be processed are extracted; the word vectors of the document, the vector representations of the lexical features, and the vector representations of the syntactic features are concatenated to form the information to be processed; and this information is then processed as the input of an encoder to obtain a digest. Because the syntactic features of the document to be processed are part of the encoder input, and syntactic features better reflect the connections between the constituents of a sentence, the scheme of the embodiments of the present invention can improve the readability of the resulting digest.
Brief description of the drawings
Fig. 1 is a flowchart of the method for automatically generating a digest according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the lexical features in an embodiment of the present invention;
Fig. 3 and Fig. 4 are schematic diagrams of the process of obtaining syntactic features in an embodiment of the present invention;
Fig. 5 is a flowchart of step 103 in an embodiment of the present invention;
Fig. 6 is a schematic diagram of the apparatus for automatically generating a digest according to an embodiment of the present invention;
Fig. 7 is a schematic diagram of the obtaining module in an embodiment of the present invention;
Fig. 8 is a schematic diagram of the second processing submodule in an embodiment of the present invention;
Fig. 9 is a schematic diagram of the electronic device according to an embodiment of the present invention;
Fig. 10 is a schematic diagram of the system hardware according to an embodiment of the present invention.
Detailed description of the embodiments
Specific embodiments of the present invention are described in further detail below with reference to the drawings and examples. The following examples illustrate the present invention but are not intended to limit its scope.
As shown in Fig. 1, the method for automatically generating a digest according to an embodiment of the present invention comprises:
Step 101: extract the lexical features of the document to be processed.
In the embodiments of the present invention, the lexical features include, but are not limited to: a part-of-speech feature, a named-entity feature, a term frequency statistical feature (Term Frequency, TF), and an inverse document frequency statistical feature (Inverse Document Frequency, IDF). In practice, key entity words and content words can be extracted by combining particular values of these features, for example: the part-of-speech feature is noun or noun phrase; the named-entity feature is person, organization, or place; the TF-IDF value is relatively high; and so on.
The part-of-speech feature and the named-entity feature are discrete-valued features; the term frequency statistical feature and the inverse document frequency statistical feature are continuous-valued features. Fig. 2 is a schematic diagram of the lexical features.
Specifically, the continuous-valued features are computed statistically, while the discrete-valued features can be extracted with a sequence labeling model, such as a hidden Markov model (Hidden Markov Model, HMM) or a conditional random field model (Conditional Random Fields, CRFs).
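As an illustration of how the continuous-valued lexical features might be computed statistically, the following is a minimal sketch. The text does not give formulas, so the definitions used here (TF as count divided by document length, IDF as the natural logarithm of the number of documents divided by the document frequency) are assumptions, and the function name `tf_idf` and the sample documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """For each tokenized document, return a dict mapping word -> (tf, idf).

    Assumed definitions (not specified in the text):
      tf  = occurrences of the word in the document / document length
      idf = ln(number of documents / number of documents containing the word)
    """
    n_docs = len(documents)
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter(word for doc in documents for word in set(doc))
    features = []
    for doc in documents:
        counts = Counter(doc)
        features.append({
            word: (count / len(doc), math.log(n_docs / doc_freq[word]))
            for word, count in counts.items()
        })
    return features

# Hypothetical, already-tokenized documents.
docs = [["ricoh", "files", "patent"], ["patent", "describes", "encoder"]]
feats = tf_idf(docs)
```

In this sketch a word that occurs in every document, such as "patent", receives an IDF of zero, while a word unique to one document receives a higher value.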
Step 102: extract the syntactic features of the document to be processed.
In the embodiments of the present invention, the syntactic features include, but are not limited to: a dependency relation feature (dependency, abbreviated DEP) and a syntactic constituent feature (abbreviated SC). Both are discrete-valued features. These features provide sentence structure information and emphasize the role of verbs and predicates in the sentence.
In practice, syntactic features can be obtained with a syntactic analysis model, such as a context-free grammar. Fig. 3 and Fig. 4 are schematic diagrams of the process of obtaining syntactic features: analyzing the sentence shown in Fig. 3 yields the syntactic features shown in Fig. 4.
Step 103: obtain the word vectors of the document to be processed, and obtain the vector representations of the lexical features and the vector representations of the syntactic features.
In this step, the continuous-valued and discrete-valued features obtained in the above steps are processed as follows, as shown in Fig. 5:
Step 1031: obtain the word vectors of the document to be processed.
Word vector (word embedding) is the collective name for a set of language modeling and feature learning techniques in natural language processing in which words or phrases are mapped to low-dimensional real-valued vectors. Word vectors (distributed representations of words) can be generated with methods such as Word2vec or GloVe.
Step 1032: represent the discrete-valued features as one-hot vectors.
The discrete-valued features obtained in the above steps are represented as one-hot vectors, which gives the vector representations of the discrete-valued features.
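The one-hot representation of a discrete-valued feature can be sketched as follows. The tag inventories `POS_TAGS` and `NER_TAGS` and the helper `one_hot` are hypothetical and chosen only for illustration; a real tagger or parser defines its own label set.

```python
def one_hot(value, vocabulary):
    """Return a list with a single 1 at the position of `value` in `vocabulary`."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(value)] = 1
    return vector

# Hypothetical label sets; not taken from the patent text.
POS_TAGS = ["NOUN", "VERB", "ADJ", "OTHER"]
NER_TAGS = ["PERSON", "ORG", "LOC", "NONE"]

pos_vec = one_hot("VERB", POS_TAGS)  # part-of-speech feature of one word
ner_vec = one_hot("NONE", NER_TAGS)  # named-entity feature of the same word
```

The same helper would apply to the syntactic features (DEP and SC), since they are also discrete-valued.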
Step 1033: convert the continuous-valued features into target discrete-valued features, and represent the target discrete-valued features as one-hot vectors.
Here, the continuous-valued features obtained in the above steps can be converted into target discrete-valued features, which are then represented as one-hot vectors.
Specifically, the continuous feature values are assigned to a preset number of buckets and thereby converted into target discrete feature values; the number of the bucket used to convert a target continuous feature value, among the continuous feature values, into a discrete feature value is then represented as a one-hot vector. The preset number can be set empirically.
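The bucketing step can be sketched as below. The text only says that values are assigned to a preset number of buckets; the use of equal-width buckets over a known value range, and the names `bucket_index` and `bucket_one_hot`, are assumptions made for this example.

```python
def bucket_index(value, low, high, n_buckets):
    """Map a continuous value to a bucket number in [0, n_buckets - 1].

    Assumes equal-width buckets over [low, high]; values outside the
    range are clipped to the nearest boundary.
    """
    if high <= low:
        raise ValueError("high must be greater than low")
    clipped = min(max(value, low), high)
    index = int((clipped - low) / (high - low) * n_buckets)
    return min(index, n_buckets - 1)  # value == high goes in the last bucket

def bucket_one_hot(value, low, high, n_buckets):
    """One-hot vector whose single 1 marks the bucket number of `value`."""
    vector = [0] * n_buckets
    vector[bucket_index(value, low, high, n_buckets)] = 1
    return vector

# A hypothetical TF value of 0.33 with 5 buckets over [0, 1].
tf_vec = bucket_one_hot(0.33, low=0.0, high=1.0, n_buckets=5)
```

Other bucketing schemes (for example quantile-based boundaries) would fit the description equally well; only the bucket number is carried forward into the one-hot vector.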
In the above process, the order of the steps is not limited.
Step 104: concatenate the word vectors, the vector representations of the lexical features, and the vector representations of the syntactic features to obtain the information to be processed.
Here, for each word in the document to be processed, the vector representation of each of its feature values is obtained, and the word vector, the vector representation of the lexical features, and the vector representation of the syntactic features corresponding to that word are concatenated end to end to form one vector. The order of concatenation can be chosen arbitrarily. Multiple vectors are formed in this way, and together they serve as the information to be processed.
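The per-word concatenation can be sketched as follows. The dictionary layout, the vector sizes, and the function name `concatenate` are hypothetical; the sketch only shows one fixed order (word vector, then lexical features, then syntactic features), whereas the text allows any order.

```python
def concatenate(word_vector, lexical_vectors, syntactic_vectors):
    """Join the word vector and all feature vectors end to end into one list."""
    combined = list(word_vector)
    for part in lexical_vectors + syntactic_vectors:
        combined.extend(part)
    return combined

# Hypothetical per-word inputs: a 3-dimensional word vector, two one-hot
# lexical features (4 values each) and one one-hot syntactic feature (3 values).
tokens = [
    {"embedding": [0.2, -0.1, 0.7],
     "lexical": [[1, 0, 0, 0], [0, 0, 0, 1]],
     "syntactic": [[0, 1, 0]]},
    {"embedding": [0.5, 0.3, -0.4],
     "lexical": [[0, 1, 0, 0], [0, 0, 0, 1]],
     "syntactic": [[1, 0, 0]]},
]

# One concatenated vector per word; the list as a whole is the encoder input.
encoder_input = [
    concatenate(t["embedding"], t["lexical"], t["syntactic"]) for t in tokens
]
```

Whatever order is chosen, it needs to be the same for every word so that each position of the concatenated vector always carries the same feature.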
Step 105: use the information to be processed as the input of an encoder to obtain the digest of the document to be processed.
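To make the encoder input concrete, the following is a minimal sketch of a plain recurrent (Elman-style) encoder consuming one concatenated vector per word. The background section names RNN, LSTM and GRU as typical choices but the text does not fix an architecture, so this is only an assumed illustration: the weights are random and untrained, the decoder that would produce the digest is omitted, and the names `make_rnn` and `encode` are hypothetical.

```python
import math
import random

def make_rnn(input_size, hidden_size, seed=0):
    """Create small random weights for a plain recurrent cell (untrained)."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "W_in": matrix(hidden_size, input_size),    # input-to-hidden weights
        "W_rec": matrix(hidden_size, hidden_size),  # hidden-to-hidden weights
        "bias": [0.0] * hidden_size,
    }

def encode(rnn, sequence):
    """Run the cell over the sequence and return the hidden state at each word."""
    hidden = [0.0] * len(rnn["bias"])
    states = []
    for x in sequence:
        # h_t = tanh(W_in x_t + W_rec h_{t-1} + b)
        hidden = [
            math.tanh(
                sum(w * xi for w, xi in zip(row_in, x))
                + sum(w * hi for w, hi in zip(row_rec, hidden))
                + b
            )
            for row_in, row_rec, b in zip(rnn["W_in"], rnn["W_rec"], rnn["bias"])
        ]
        states.append(hidden)
    return states

# Hypothetical information to be processed: one concatenated vector per word.
sequence = [[0.2, -0.1, 1.0, 0.0], [0.5, 0.3, 0.0, 1.0]]
states = encode(make_rnn(input_size=4, hidden_size=3), sequence)
```

In a full encoder-decoder system the hidden states would be passed to a trained decoder that generates the digest word by word; that part is outside the scope of this sketch.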
In the embodiments of the present invention, the lexical features and syntactic features of the document to be processed are extracted; the word vectors of the document, the vector representations of the lexical features, and the vector representations of the syntactic features are concatenated to form the information to be processed; and this information is then processed as the input of an encoder to obtain a digest. Because the syntactic features of the document to be processed are part of the encoder input, and syntactic features better reflect the connections between the constituents of a sentence, the scheme of the embodiments of the present invention can improve the readability of the resulting digest.
As shown in Fig. 6, the apparatus 1000 for automatically generating a digest according to an embodiment of the present invention comprises:
a first extraction module 501, configured to extract lexical features of a document to be processed;
a second extraction module 502, configured to extract syntactic features of the document to be processed;
an obtaining module 503, configured to obtain word vectors of the document to be processed, and to obtain vector representations of the lexical features and vector representations of the syntactic features;
a concatenation module 504, configured to concatenate the word vectors, the vector representations of the lexical features, and the vector representations of the syntactic features to obtain information to be processed;
a processing module 505, configured to use the information to be processed as the input of an encoder to obtain a digest of the document to be processed.
In the embodiments of the present invention, the lexical features include, but are not limited to: a part-of-speech feature, a named-entity feature, a term frequency statistical feature, and an inverse document frequency statistical feature. The syntactic features include, but are not limited to: a dependency relation feature and a syntactic constituent feature.
As shown in Fig. 7, the obtaining module 503 comprises:
a first obtaining submodule 5031, configured to obtain the word vectors of the document to be processed;
a second obtaining submodule 5032, configured to obtain the continuous-valued features and the discrete-valued features among the lexical features and the syntactic features;
a first processing submodule 5033, configured to represent the discrete-valued features as one-hot vectors;
a second processing submodule 5034, configured to convert the continuous-valued features into target discrete-valued features, and to represent the target discrete-valued features as one-hot vectors.
As shown in Fig. 8, the second processing submodule 5034 comprises: an allocation unit 50341, configured to assign the continuous feature values to a preset number of buckets, thereby converting them into target discrete feature values; and a processing unit 50342, configured to represent, as a one-hot vector, the number of the bucket used to convert a target continuous feature value, among the continuous feature values, into a discrete feature value.
Specifically, the concatenation module 504 is configured to, for each word in the document to be processed, concatenate end to end the word vector, the vector representation of the lexical features, and the vector representation of the syntactic features corresponding to that word to form one vector, and to use the multiple vectors so formed as the information to be processed.
For the working principle of the apparatus of the embodiment of the present invention, refer to the description of the foregoing method embodiment.
In the embodiments of the present invention, the lexical features and syntactic features of the document to be processed are extracted; the word vectors of the document, the vector representations of the lexical features, and the vector representations of the syntactic features are concatenated to form the information to be processed; and this information is then processed as the input of an encoder to obtain a digest. Because the syntactic features of the document to be processed are part of the encoder input, and syntactic features better reflect the connections between the constituents of a sentence, the scheme of the embodiments of the present invention can improve the readability of the resulting digest.
As shown in Fig. 9, an embodiment of the present invention provides an electronic device, comprising: a processor 801 and a memory 802, the memory 802 storing computer program instructions, wherein the computer program instructions, when run by the processor, cause the processor 801 to perform the following steps:
extracting lexical features of a document to be processed;
extracting syntactic features of the document to be processed;
obtaining word vectors of the document to be processed, and obtaining vector representations of the lexical features and vector representations of the syntactic features;
concatenating the word vectors, the vector representations of the lexical features, and the vector representations of the syntactic features to obtain information to be processed;
using the information to be processed as the input of an encoder to obtain a digest of the document to be processed.
Further, as shown in Fig. 9, the electronic device also includes a network interface 803, an input device 804, a hard disk 805, and a display device 806.
The above interfaces and devices can be interconnected through a bus architecture. The bus architecture may include any number of interconnected buses and bridges, which connect together various circuits, specifically one or more central processing units (CPUs) represented by the processor 801 and one or more memories represented by the memory 802. The bus architecture may also connect together various other circuits such as peripheral devices, voltage regulators, and power management circuits. It will be understood that the bus architecture is used to implement connection and communication between these components. In addition to a data bus, the bus architecture includes a power bus, a control bus, and a status signal bus; these are all well known in the art and are therefore not described in further detail here.
The network interface 803 can be connected to a network (such as the Internet or a local area network) to obtain relevant data from the network, which can be stored on the hard disk 805.
The input device 804 can receive various instructions entered by an operator and send them to the processor 801 for execution. The input device 804 may include a keyboard or a pointing device (for example, a mouse, a trackball, a touchpad, or a touch screen).
The display device 806 can display the results obtained by the processor 801 executing instructions.
The memory 802 is used to store the programs and data necessary for running the operating system, as well as data such as intermediate results of the computations of the processor 801.
It will be understood that the memory 802 in the embodiments of the present invention may be volatile memory or non-volatile memory, or may include both. The non-volatile memory may be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory. The volatile memory may be random access memory (RAM), which is used as an external cache. The memory 802 of the apparatus and methods described herein is intended to include, without being limited to, these and any other suitable types of memory.
In some embodiments, the memory 802 stores the following elements, executable modules or data structures, or a subset or superset of them: an operating system 8021 and application programs 808.
The operating system 8021 includes various system programs, such as a framework layer, a core library layer, and a driver layer, for implementing various basic services and processing hardware-based tasks. The application programs 808 include various applications, such as a browser, for implementing various application services. A program implementing the method of the embodiments of the present invention may be included in the application programs 808.
When the processor 801 calls and executes the application programs and data stored in the memory 802, specifically the programs or instructions stored in the application programs 808, it performs the following steps:
extracting lexical features of a document to be processed;
extracting syntactic features of the document to be processed;
obtaining word vectors of the document to be processed, and obtaining vector representations of the lexical features and vector representations of the syntactic features;
concatenating the word vectors, the vector representations of the lexical features, and the vector representations of the syntactic features to obtain information to be processed;
using the information to be processed as the input of an encoder to obtain a digest of the document to be processed.
The method disclosed in the above embodiments of the present invention may be applied to, or implemented by, the processor 801. The processor 801 may be an integrated circuit chip with signal processing capability. During implementation, each step of the above method may be completed by an integrated logic circuit of hardware in the processor 801 or by instructions in the form of software. The processor 801 may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of the present invention. The general-purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in connection with the embodiments of the present invention may be embodied directly as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium that is mature in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or a register. The storage medium is located in the memory 802; the processor 801 reads the information in the memory 802 and completes the steps of the above method in combination with its hardware.
It will be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or a combination thereof. For a hardware implementation, the processing unit may be implemented in one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), general-purpose processors, controllers, microcontrollers, microprocessors, other electronic units for performing the functions described herein, or a combination thereof.
For a software implementation, the techniques described herein may be implemented by modules (such as procedures and functions) that perform the functions described herein. The software code may be stored in a memory and executed by a processor. The memory may be implemented inside or outside the processor.
Wherein the lexical features include: a part-of-speech feature, a named-entity feature, a term frequency statistical feature, and an inverse document frequency statistical feature.
Wherein the syntactic features include: a dependency relation feature and a syntactic constituent feature.
Specifically, the processor 801 is further configured to read the computer program and perform the following steps:
obtaining the continuous-valued features and the discrete-valued features among the lexical features and the syntactic features;
representing the discrete-valued features as one-hot vectors;
converting the continuous-valued features into target discrete-valued features, and representing the target discrete-valued features as one-hot vectors.
Specifically, the processor 801 is further configured to read the computer program and perform the following steps:
assigning the continuous feature values to a preset number of buckets, thereby converting them into target discrete feature values;
representing, as a one-hot vector, the number of the bucket used to convert a target continuous feature value, among the continuous feature values, into a discrete feature value.
Specifically, the processor 801 is further configured to read the computer program and perform the following steps:
for each word in the document to be processed, concatenating end to end the word vector, the vector representation of the lexical features, and the vector representation of the syntactic features corresponding to that word to form one vector, and using the multiple vectors so formed as the information to be processed.
In addition, the embodiment of the invention also provides a kind of computer readable storage medium, the computer-readable storage medium Matter is stored with computer program, when the computer program is run by processor, so that the processor executes following steps:
Extract the lexical characteristics of document to be processed;
Extract the syntactic feature of the document to be processed;
The term vector of the document to be processed is obtained, and obtains vector expression and the syntax spy of the lexical characteristics The vector of sign indicates;
Connect the term vector, the lexical characteristics vector indicate and the syntactic feature vector indicate, obtain to Handle information;
Using the information to be processed as the input of encoder, the digest of the document to be processed is obtained.
Wherein, the lexical characteristics include: part of speech feature, name substance feature, word frequency statistics feature, reverse document frequency Statistical nature.
Wherein, the syntactic feature includes: interdependent syntax dependence feature, syntactic constituent feature.
Wherein, the vector for obtaining the lexical characteristics indicates and the vector of the syntactic feature indicates, comprising:
Obtain the continuous value tag and discrete value tag in the lexical characteristics and the syntactic feature;
The only hotlist of discrete value tag is shown;
It is target discrete value tag by the successive value Feature Conversion, and shows the target discrete value tag with only hotlist.
Wherein, it is described by the successive value Feature Conversion be target discrete value tag, and with only hotlist show the target from Dissipate value tag, comprising:
The continuous characteristic value is assigned in the bucket of preset quantity and is converted to target discrete characteristic value;
Shown with only hotlist for converting the continuous characteristic value of target in the continuous characteristic value as the bucket of Discrete Eigenvalue Number.
Wherein, concatenating the term vectors, the vector representation of the lexical features, and the vector representation of the syntactic features to obtain the information to be processed comprises:
For each word in the document to be processed, concatenating end to end the term vector corresponding to the word, the vector representation of its lexical features, and the vector representation of its syntactic features to form one vector, and using the multiple vectors so formed as the information to be processed.
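The per-word concatenation can be sketched as follows. The dictionary layout, the vector dimensions, and the function names are hypothetical; only the end-to-end joining of the three vectors per word comes from the text.

```python
def concatenate_features(term_vector, lexical_vector, syntactic_vector):
    """Join the three per-word vectors end to end into a single vector."""
    return list(term_vector) + list(lexical_vector) + list(syntactic_vector)


def build_encoder_input(words):
    """Produce one concatenated vector per word, in document order."""
    return [
        concatenate_features(w["term"], w["lexical"], w["syntactic"])
        for w in words
    ]


# Hypothetical two-word document with tiny illustrative vectors.
words = [
    {"term": [0.1, 0.2], "lexical": [1.0, 0.0], "syntactic": [0.0, 1.0, 0.0]},
    {"term": [0.3, 0.4], "lexical": [0.0, 1.0], "syntactic": [1.0, 0.0, 0.0]},
]
encoder_input = build_encoder_input(words)
```

Every word yields a vector of the same length (here 2 + 2 + 3 = 7), so the resulting sequence can be fed to an encoder that expects fixed-width inputs per time step.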
Figure 10 shows the system hardware architecture to which the embodiment of the present invention is applied. The system is deployed on a PC system: inputs and outputs are stored in storage device 1013; the functional modules and intermediate results are stored in main memory 1011; and the functional modules are executed by central processing unit 1000. Data is input to the system through input unit 1014, and the output results are displayed by display unit 1015.
In the several embodiments provided in this application, it should be understood that the disclosed method and apparatus may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in an actual implementation; for instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. Furthermore, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.
An integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes any medium capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above are preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art may make improvements and refinements without departing from the principles of the present invention, and such improvements and refinements shall also be regarded as falling within the protection scope of the present invention.

Claims (9)

1. A method for automatically generating a digest, characterized by comprising:
Extracting lexical features of a document to be processed;
Extracting syntactic features of the document to be processed;
Obtaining the term vectors of the document to be processed, and obtaining a vector representation of the lexical features and a vector representation of the syntactic features;
Concatenating the term vectors, the vector representation of the lexical features, and the vector representation of the syntactic features to obtain information to be processed;
Using the information to be processed as the input of an encoder to obtain the digest of the document to be processed.
2. The method according to claim 1, characterized in that the lexical features include: a part-of-speech feature, a named entity feature, a word frequency statistics feature, and an inverse document frequency statistics feature.
3. The method according to claim 1, characterized in that the syntactic features include: a dependency relation feature from dependency syntax, and a syntactic constituent feature.
4. The method according to claim 1, characterized in that obtaining the vector representation of the lexical features and the vector representation of the syntactic features comprises:
Obtaining the continuous-valued features and the discrete-valued features among the lexical features and the syntactic features;
Representing the discrete-valued features as one-hot vectors;
Converting the continuous-valued features into target discrete-valued features, and representing the target discrete-valued features as one-hot vectors.
5. The method according to claim 4, characterized in that converting the continuous-valued features into target discrete-valued features and representing the target discrete-valued features as one-hot vectors comprises:
Assigning the continuous feature values to a preset number of buckets, thereby converting them into target discrete feature values;
Representing, as a one-hot vector, the number of the bucket into which a target continuous feature value among the continuous feature values is converted.
6. The method according to claim 1, characterized in that concatenating the term vectors, the vector representation of the lexical features, and the vector representation of the syntactic features to obtain the information to be processed comprises:
For each word in the document to be processed, concatenating end to end the term vector corresponding to the word, the vector representation of its lexical features, and the vector representation of its syntactic features to form one vector, and using the multiple vectors so formed as the information to be processed.
7. A device for automatically generating a digest, characterized by comprising:
A first extraction module, configured to extract lexical features of a document to be processed;
A second extraction module, configured to extract syntactic features of the document to be processed;
An obtaining module, configured to obtain the term vectors of the document to be processed, and to obtain a vector representation of the lexical features and a vector representation of the syntactic features;
A connection module, configured to concatenate the term vectors, the vector representation of the lexical features, and the vector representation of the syntactic features to obtain information to be processed;
A processing module, configured to use the information to be processed as the input of an encoder to obtain the digest of the document to be processed.
8. An electronic device, characterized by comprising: a processor; and a memory in which computer program instructions are stored,
Wherein, when the computer program instructions are run by the processor, the processor is caused to perform the following steps:
Extracting lexical features of a document to be processed;
Extracting syntactic features of the document to be processed;
Obtaining the term vectors of the document to be processed, and obtaining a vector representation of the lexical features and a vector representation of the syntactic features;
Concatenating the term vectors, the vector representation of the lexical features, and the vector representation of the syntactic features to obtain information to be processed;
Using the information to be processed as the input of an encoder to obtain the digest of the document to be processed.
9. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when run by a processor, causes the processor to perform the following steps:
Extracting lexical features of a document to be processed;
Extracting syntactic features of the document to be processed;
Obtaining the term vectors of the document to be processed, and obtaining a vector representation of the lexical features and a vector representation of the syntactic features;
Concatenating the term vectors, the vector representation of the lexical features, and the vector representation of the syntactic features to obtain information to be processed;
Using the information to be processed as the input of an encoder to obtain the digest of the document to be processed.
CN201710892106.0A 2017-09-27 2017-09-27 A kind of method, device and equipment automatically generating digest Pending CN109558583A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201710892106.0A CN109558583A (en) 2017-09-27 2017-09-27 A kind of method, device and equipment automatically generating digest
JP2018134689A JP6579239B2 (en) 2017-09-27 2018-07-18 Abstract sentence automatic generation method, apparatus, and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710892106.0A CN109558583A (en) 2017-09-27 2017-09-27 A kind of method, device and equipment automatically generating digest

Publications (1)

Publication Number Publication Date
CN109558583A true CN109558583A (en) 2019-04-02

Family

ID=65864224

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710892106.0A Pending CN109558583A (en) 2017-09-27 2017-09-27 A kind of method, device and equipment automatically generating digest

Country Status (2)

Country Link
JP (1) JP6579239B2 (en)
CN (1) CN109558583A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110457483A (en) * 2019-06-21 2019-11-15 浙江大学 A kind of long text generation method based on neural topic model
CN111178053A (en) * 2019-12-30 2020-05-19 电子科技大学 Text generation method for performing generation type abstract extraction by combining semantics and text structure
CN111209751A (en) * 2020-02-14 2020-05-29 全球能源互联网研究院有限公司 Chinese word segmentation method, device and storage medium
CN112765987A (en) * 2021-01-26 2021-05-07 武汉大学 Event identification method and system based on recursive conditional random field decoder
CN113515627A (en) * 2021-05-19 2021-10-19 北京世纪好未来教育科技有限公司 Document detection method, device, equipment and storage medium
WO2022121165A1 (en) * 2020-12-10 2022-06-16 平安科技(深圳)有限公司 Long text generation method and apparatus, device and storage medium

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112560456B (en) * 2020-11-03 2024-04-09 重庆安石泽太科技有限公司 Method and system for generating generated abstract based on improved neural network
US20230124296A1 (en) * 2021-10-18 2023-04-20 Samsung Electronics Co., Ltd. Method of natural language processing by performing semantic analysis using syntactic information, and an apparatus for the same

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104281645A (en) * 2014-08-27 2015-01-14 北京理工大学 Method for identifying emotion key sentence on basis of lexical semantics and syntactic dependency
CN104536950A (en) * 2014-12-11 2015-04-22 北京百度网讯科技有限公司 Text summarization generating method and device
CN106383817A (en) * 2016-09-29 2017-02-08 北京理工大学 Paper title generation method capable of utilizing distributed semantic information

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3581074B2 (en) * 2000-03-07 2004-10-27 日本電信電話株式会社 Document digest creation method, document search device, and recording medium
JP2003108571A (en) * 2001-09-28 2003-04-11 Seiko Epson Corp Document summary device, control method of document summary device, control program of document summary device and recording medium
JP2015088064A (en) * 2013-10-31 2015-05-07 日本電信電話株式会社 Text summarization device, text summarization method, and program
JP6537340B2 (en) * 2015-04-28 2019-07-03 ヤフー株式会社 Summary generation device, summary generation method, and summary generation program
EP3394798A1 (en) * 2016-03-18 2018-10-31 Google LLC Generating dependency parses of text segments using neural networks

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110457483A (en) * 2019-06-21 2019-11-15 浙江大学 A kind of long text generation method based on neural topic model
CN111178053A (en) * 2019-12-30 2020-05-19 电子科技大学 Text generation method for performing generation type abstract extraction by combining semantics and text structure
CN111209751A (en) * 2020-02-14 2020-05-29 全球能源互联网研究院有限公司 Chinese word segmentation method, device and storage medium
WO2022121165A1 (en) * 2020-12-10 2022-06-16 平安科技(深圳)有限公司 Long text generation method and apparatus, device and storage medium
CN112765987A (en) * 2021-01-26 2021-05-07 武汉大学 Event identification method and system based on recursive conditional random field decoder
CN113515627A (en) * 2021-05-19 2021-10-19 北京世纪好未来教育科技有限公司 Document detection method, device, equipment and storage medium
CN113515627B (en) * 2021-05-19 2023-07-25 北京世纪好未来教育科技有限公司 Document detection method, device, equipment and storage medium

Also Published As

Publication number Publication date
JP2019061656A (en) 2019-04-18
JP6579239B2 (en) 2019-09-25

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190402