CN109558583A - A kind of method, device and equipment automatically generating digest - Google Patents
- Publication number
- CN109558583A (application number CN201710892106.0A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Abstract
The present invention provides a method, apparatus and device for automatically generating a digest, relates to the technical field of text processing, and can improve the readability of the generated digest. The method comprises: extracting lexical features of a document to be processed; extracting syntactic features of the document to be processed; obtaining word vectors of the document to be processed, and obtaining vector representations of the lexical features and vector representations of the syntactic features; concatenating the word vectors, the vector representations of the lexical features and the vector representations of the syntactic features to obtain information to be processed; and using the information to be processed as the input of an encoder to obtain the digest of the document to be processed. Embodiments of the present invention can improve the readability of the generated digest.
Description
Technical field
The present invention relates to the technical field of text processing, and in particular to a method, apparatus and device for automatically generating a digest.
Background
The goal of automatic digest generation is, given one or more documents, to generate a digest (summary) that is much shorter than the original text while retaining the important information of the original documents. With the development and spread of deep learning, methods based on topic words and language models have gradually been replaced by text generation methods based on the encoder-decoder architecture. Typical encoders and decoders include the recurrent neural network (RNN) and its variants LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit).
The prior art proposes an encoder that incorporates lexical features: the word vectors and the vector representations of the lexical features are concatenated and used as the input of the encoder, thereby emphasizing the role of entity words and content words.
An important criterion for evaluating a text digest is readability. The above prior art focuses mainly on keywords, which are usually nouns or noun phrases. A digest that merely contains many keywords, without good connections between them, is hard to read; the readability of a digest generated with the prior art therefore cannot be guaranteed.
Summary of the invention
In view of this, the present invention provides a method, apparatus and device for automatically generating a digest, which can improve the readability of the generated digest.
To solve the above technical problem, in a first aspect, the present invention provides a method for automatically generating a digest, comprising:
Extracting lexical features of a document to be processed;
Extracting syntactic features of the document to be processed;
Obtaining word vectors of the document to be processed, and obtaining vector representations of the lexical features and vector representations of the syntactic features;
Concatenating the word vectors, the vector representations of the lexical features and the vector representations of the syntactic features to obtain information to be processed;
Using the information to be processed as the input of an encoder to obtain the digest of the document to be processed.
Wherein, the lexical features include: a part-of-speech feature, a named-entity feature, a term frequency feature, and an inverse document frequency feature.
Wherein, the syntactic features include: a dependency relation feature and a syntactic constituent feature.
Wherein, obtaining the vector representations of the lexical features and the vector representations of the syntactic features comprises:
Obtaining the continuous-valued features and the discrete-valued features among the lexical features and the syntactic features;
Representing the discrete-valued features as one-hot vectors;
Converting the continuous-valued features into target discrete-valued features, and representing the target discrete-valued features as one-hot vectors.
Wherein, converting the continuous-valued features into target discrete-valued features and representing the target discrete-valued features as one-hot vectors comprises:
Assigning the continuous feature values to a preset number of buckets, thereby converting them into target discrete feature values;
Representing, as a one-hot vector, the index of the bucket used to convert a target continuous feature value among the continuous feature values into a discrete feature value.
Wherein, concatenating the word vectors, the vector representations of the lexical features and the vector representations of the syntactic features to obtain the information to be processed comprises:
For each word in the document to be processed, concatenating end to end the word vector corresponding to the word, the vector representation of its lexical features and the vector representation of its syntactic features to form one vector, and using the resulting plurality of vectors as the information to be processed.
In a second aspect, an embodiment of the present invention provides an apparatus for automatically generating a digest, comprising:
A first extraction module, configured to extract lexical features of a document to be processed;
A second extraction module, configured to extract syntactic features of the document to be processed;
An obtaining module, configured to obtain word vectors of the document to be processed, and to obtain vector representations of the lexical features and vector representations of the syntactic features;
A concatenation module, configured to concatenate the word vectors, the vector representations of the lexical features and the vector representations of the syntactic features to obtain information to be processed;
A processing module, configured to use the information to be processed as the input of an encoder to obtain the digest of the document to be processed.
In a third aspect, an embodiment of the present invention provides an electronic device, comprising: a processor; and a memory in which computer program instructions are stored,
Wherein, when the computer program instructions are run by the processor, the processor is caused to perform the following steps:
Extracting lexical features of a document to be processed;
Extracting syntactic features of the document to be processed;
Obtaining word vectors of the document to be processed, and obtaining vector representations of the lexical features and vector representations of the syntactic features;
Concatenating the word vectors, the vector representations of the lexical features and the vector representations of the syntactic features to obtain information to be processed;
Using the information to be processed as the input of an encoder to obtain the digest of the document to be processed.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium storing a computer program; when the computer program is run by a processor, the processor is caused to perform the following steps:
Extracting lexical features of a document to be processed;
Extracting syntactic features of the document to be processed;
Obtaining word vectors of the document to be processed, and obtaining vector representations of the lexical features and vector representations of the syntactic features;
Concatenating the word vectors, the vector representations of the lexical features and the vector representations of the syntactic features to obtain information to be processed;
Using the information to be processed as the input of an encoder to obtain the digest of the document to be processed.
The beneficial effects of the above technical solutions of the present invention are as follows:
In the embodiments of the present invention, the lexical features and syntactic features of the document to be processed are extracted; the word vectors of the document, the vector representations of the lexical features and the vector representations of the syntactic features are concatenated to form the information to be processed; and this information is then processed as the input of the encoder to obtain the digest. Because the syntactic features of the document to be processed are part of the encoder input, and syntactic features better reflect the connections between the constituents of a sentence, the scheme of the embodiments of the present invention can improve the readability of the resulting digest.
Brief description of the drawings
Fig. 1 is a flowchart of the method for automatically generating a digest according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the lexical features according to an embodiment of the present invention;
Fig. 3 and Fig. 4 are schematic diagrams of the process of obtaining syntactic features according to an embodiment of the present invention;
Fig. 5 is a flowchart of step 103 according to an embodiment of the present invention;
Fig. 6 is a schematic diagram of the apparatus for automatically generating a digest according to an embodiment of the present invention;
Fig. 7 is a schematic diagram of the obtaining module according to an embodiment of the present invention;
Fig. 8 is a schematic diagram of the second processing submodule according to an embodiment of the present invention;
Fig. 9 is a schematic diagram of the electronic device according to an embodiment of the present invention;
Fig. 10 is a schematic diagram of the system hardware according to an embodiment of the present invention.
Detailed description of the embodiments
Specific embodiments of the present invention are described in further detail below with reference to the drawings and examples. The following examples serve to illustrate the present invention and are not intended to limit its scope.
As shown in Fig. 1, the method for automatically generating a digest according to an embodiment of the present invention comprises:
Step 101: extract the lexical features of the document to be processed.
In the embodiments of the present invention, the lexical features include, but are not limited to: a part-of-speech feature, a named-entity feature, and term frequency (TF) and inverse document frequency (IDF) statistical features. In practical applications, key entity words and content words can be extracted by combining particular values of these features, for example: the part-of-speech feature is noun or noun phrase; the named-entity feature is person, organization or place; the TF-IDF value is relatively high; and so on.
The part-of-speech feature and the named-entity feature are discrete-valued features; the term frequency feature and the inverse document frequency feature are continuous-valued features. Fig. 2 is a schematic diagram of the lexical features.
Specifically, the continuous-valued features above are computed statistically; the discrete-valued features can be extracted with a sequence labeling model, such as a hidden Markov model (HMM) or conditional random fields (CRFs).
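The statistical computation of the two continuous-valued features can be illustrated with a minimal sketch. The source does not specify a formula, so the common definitions are assumed here: TF as the count of a word divided by the document length, and IDF as the natural logarithm of the number of documents divided by the number of documents containing the word. The function name `tf_idf_features` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf_features(documents):
    """For each tokenized document, map every word to its (tf, idf) pair.

    Assumed definitions (not given in the source):
      tf  = occurrences of the word in the document / document length
      idf = ln(number of documents / number of documents containing the word)
    """
    n_docs = len(documents)
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))

    features = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        features.append({
            word: (count / total, math.log(n_docs / doc_freq[word]))
            for word, count in counts.items()
        })
    return features


# Hypothetical toy corpus of two already-tokenized documents.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
]
feats = tf_idf_features(docs)
```

A word that occurs in every document, such as "the" above, receives an IDF of zero under this definition, which is why a high TF-IDF value tends to single out content words.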
Step 102: extract the syntactic features of the document to be processed.
In the embodiments of the present invention, the syntactic features include, but are not limited to: a dependency relation feature (dependency, abbreviated DP) and a syntactic constituent feature (abbreviated SC). Both are discrete-valued features. These features provide sentence structure information and emphasize the role of verbs and predicates in a sentence.
In practical applications, various syntactic analysis models, such as a context-free grammar, can be used to obtain the syntactic features. Fig. 3 and Fig. 4 are schematic diagrams of the process of obtaining syntactic features: analyzing the sentence shown in Fig. 3 yields the syntactic features shown in Fig. 4.
Step 103: obtain the word vectors of the document to be processed, and obtain the vector representations of the lexical features and the vector representations of the syntactic features.
In this step, the continuous-valued features and discrete-valued features obtained in the steps above are processed as follows, as shown in Fig. 5:
Step 1031: obtain the word vectors of the document to be processed.
"Word vector" is the collective name for a set of language modeling and feature learning techniques in natural language processing in which words (or phrases) are mapped to low-dimensional real-valued vectors. Word vectors (distributed representations of words) can be generated with methods such as Word2vec or GloVe.
Step 1032: represent the discrete-valued features as one-hot vectors.
The discrete-valued features obtained in the steps above are represented as one-hot vectors, which gives the vector representation of the discrete-valued features.
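As an illustration of this step, the following minimal sketch encodes one discrete-valued feature as a one-hot vector. The tag set `POS_TAGS` and the function name `one_hot` are hypothetical; the source does not fix a tag inventory.

```python
def one_hot(value, vocabulary):
    """Return a list with 1 at the position of `value` in `vocabulary`, 0 elsewhere."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(value)] = 1  # raises ValueError for unknown values
    return vector


# Hypothetical part-of-speech tag set, for illustration only.
POS_TAGS = ["NOUN", "VERB", "ADJ", "OTHER"]

print(one_hot("VERB", POS_TAGS))  # [0, 1, 0, 0]
```

The same function applies unchanged to the other discrete-valued features (named-entity type, dependency relation, syntactic constituent), each with its own label vocabulary.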
Step 1033: convert the continuous-valued features into target discrete-valued features, and represent the target discrete-valued features as one-hot vectors.
Here, the continuous-valued features obtained in the steps above can be converted into target discrete-valued features, which are then represented as one-hot vectors.
Specifically, the continuous feature values are assigned to a preset number of buckets and thereby converted into target discrete feature values; for each target continuous feature value, the index of the bucket used to convert it into a discrete feature value is represented as a one-hot vector. The preset number can be set empirically.
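The bucketing described above can be sketched as follows. The source only states that the values are assigned to a preset number of buckets; equal-width buckets over a known value range are an assumption made here, and the names `bucket_index` and `one_hot_bucket` are hypothetical.

```python
def bucket_index(value, low, high, n_buckets):
    """Map a continuous value to a bucket index in 0 .. n_buckets - 1.

    Assumption: equal-width buckets over [low, high]; values outside the
    range are clamped to the first or last bucket.
    """
    if high <= low or n_buckets < 1:
        raise ValueError("need high > low and at least one bucket")
    width = (high - low) / n_buckets
    index = int((value - low) / width)
    return min(max(index, 0), n_buckets - 1)


def one_hot_bucket(value, low, high, n_buckets):
    """One-hot vector of length n_buckets marking the bucket that holds `value`."""
    vector = [0] * n_buckets
    vector[bucket_index(value, low, high, n_buckets)] = 1
    return vector


# A TF-IDF-like value of 0.37 in the range [0, 1] with 5 buckets lands in
# the second bucket (index 1), which covers [0.2, 0.4).
encoded = one_hot_bucket(0.37, 0.0, 1.0, 5)
```

With this encoding, continuous-valued and discrete-valued features end up in the same one-hot form, so they can be handled uniformly in the later concatenation step.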
In the above process, the order of the steps is not limited.
Step 104: concatenate the word vectors, the vector representations of the lexical features and the vector representations of the syntactic features to obtain the information to be processed.
Here, for each word in the document to be processed, the vector representations of its feature values are obtained, and the word vector corresponding to the word, the vector representation of its lexical features and the vector representation of its syntactic features are concatenated end to end to form one vector. The order of concatenation can be chosen arbitrarily. Multiple vectors are formed in this way, one per word, and the vectors formed by concatenation are used as the information to be processed.
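A minimal sketch of the per-word concatenation, assuming the three kinds of vectors have already been computed for every word. The function name `build_encoder_input` and the toy vectors are hypothetical; the order word vector, lexical features, syntactic features is one arbitrary choice among those the text allows.

```python
def build_encoder_input(word_vectors, lexical_vectors, syntactic_vectors):
    """Concatenate, word by word, the three vectors into one input vector each."""
    if not (len(word_vectors) == len(lexical_vectors) == len(syntactic_vectors)):
        raise ValueError("each word needs exactly one vector of each kind")
    return [
        list(word) + list(lexical) + list(syntactic)
        for word, lexical, syntactic in zip(
            word_vectors, lexical_vectors, syntactic_vectors
        )
    ]


# Toy example for a two-word document (all values are made up).
word_vectors = [[0.1, 0.2], [0.3, 0.4]]   # 2-dimensional word vectors
lexical_vectors = [[1, 0, 0], [0, 1, 0]]  # one-hot lexical features
syntactic_vectors = [[0, 1], [1, 0]]      # one-hot syntactic features

encoder_input = build_encoder_input(word_vectors, lexical_vectors, syntactic_vectors)
```

Each resulting vector has length 2 + 3 + 2 = 7; the sequence of such vectors, one per word, is what would be fed to the encoder in step 105.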
Step 105: use the information to be processed as the input of an encoder to obtain the digest of the document to be processed.
In the embodiments of the present invention, the lexical features and syntactic features of the document to be processed are extracted; the word vectors of the document, the vector representations of the lexical features and the vector representations of the syntactic features are concatenated to form the information to be processed; and this information is then processed as the input of the encoder to obtain the digest. Because the syntactic features of the document to be processed are part of the encoder input, and syntactic features better reflect the connections between the constituents of a sentence, the scheme of the embodiments of the present invention can improve the readability of the resulting digest.
As shown in Fig. 6, the apparatus 1000 for automatically generating a digest according to an embodiment of the present invention comprises:
A first extraction module 501, configured to extract lexical features of a document to be processed;
A second extraction module 502, configured to extract syntactic features of the document to be processed;
An obtaining module 503, configured to obtain word vectors of the document to be processed, and to obtain vector representations of the lexical features and vector representations of the syntactic features;
A concatenation module 504, configured to concatenate the word vectors, the vector representations of the lexical features and the vector representations of the syntactic features to obtain information to be processed;
A processing module 505, configured to use the information to be processed as the input of an encoder to obtain the digest of the document to be processed.
In the embodiments of the present invention, the lexical features include, but are not limited to: a part-of-speech feature, a named-entity feature, a term frequency feature, and an inverse document frequency feature. The syntactic features include, but are not limited to: a dependency relation feature and a syntactic constituent feature.
As shown in Fig. 7, the obtaining module 503 includes:
A first obtaining submodule 5031, configured to obtain the word vectors of the document to be processed;
A second obtaining submodule 5032, configured to obtain the continuous-valued features and the discrete-valued features among the lexical features and the syntactic features;
A first processing submodule 5033, configured to represent the discrete-valued features as one-hot vectors;
A second processing submodule 5034, configured to convert the continuous-valued features into target discrete-valued features and to represent the target discrete-valued features as one-hot vectors.
As shown in Fig. 8, the second processing submodule 5034 includes: an allocation unit 50341, configured to assign the continuous feature values to a preset number of buckets, thereby converting them into target discrete feature values; and a processing unit 50342, configured to represent, as a one-hot vector, the index of the bucket used to convert a target continuous feature value among the continuous feature values into a discrete feature value.
Specifically, the concatenation module 504 is configured to, for each word in the document to be processed, concatenate end to end the word vector corresponding to the word, the vector representation of its lexical features and the vector representation of its syntactic features to form one vector, and to use the resulting plurality of vectors as the information to be processed.
For the working principle of the apparatus of the embodiment of the present invention, refer to the description of the foregoing method embodiment.
In the embodiments of the present invention, the lexical features and syntactic features of the document to be processed are extracted; the word vectors of the document, the vector representations of the lexical features and the vector representations of the syntactic features are concatenated to form the information to be processed; and this information is then processed as the input of the encoder to obtain the digest. Because the syntactic features of the document to be processed are part of the encoder input, and syntactic features better reflect the connections between the constituents of a sentence, the scheme of the embodiments of the present invention can improve the readability of the resulting digest.
As shown in Fig. 9, an embodiment of the present invention provides an electronic device, comprising a processor 801 and a memory 802. Computer program instructions are stored in the memory 802; when the computer program instructions are run by the processor, the processor 801 is caused to perform the following steps:
Extracting lexical features of a document to be processed;
Extracting syntactic features of the document to be processed;
Obtaining word vectors of the document to be processed, and obtaining vector representations of the lexical features and vector representations of the syntactic features;
Concatenating the word vectors, the vector representations of the lexical features and the vector representations of the syntactic features to obtain information to be processed;
Using the information to be processed as the input of an encoder to obtain the digest of the document to be processed.
Further, as shown in Fig. 9, the electronic device also includes a network interface 803, an input device 804, a hard disk 805 and a display device 806.
The above interfaces and devices may be interconnected by a bus architecture. The bus architecture may include any number of interconnected buses and bridges, which connect together various circuits of one or more central processing units (CPUs), represented by the processor 801, and of one or more memories, represented by the memory 802. The bus architecture may also connect together various other circuits, such as peripheral devices, voltage regulators and power management circuits. It can be understood that the bus architecture is used to implement connection and communication among these components. In addition to a data bus, the bus architecture includes a power bus, a control bus and a status signal bus; these are all well known in the art and are therefore not described in further detail here.
The network interface 803 may be connected to a network (such as the Internet or a local area network) to obtain relevant data from the network, which may be stored in the hard disk 805.
The input device 804 may receive various instructions entered by an operator and send them to the processor 801 for execution. The input device 804 may include a keyboard or a pointing device (for example, a mouse, a trackball, a touchpad or a touch screen).
The display device 806 may display the results obtained by the processor 801 executing instructions.
The memory 802 is used to store the programs and data necessary for running the operating system, as well as data such as intermediate results of the computations of the processor 801.
It can be understood that the memory 802 in the embodiments of the present invention may be a volatile memory or a non-volatile memory, or may include both. The non-volatile memory may be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which is used as an external cache. The memory 802 of the apparatus and methods described herein is intended to include, but is not limited to, these and any other suitable types of memory.
In some embodiments, the memory 802 stores the following elements, executable modules or data structures, or a subset or superset thereof: an operating system 8021 and application programs 808.
The operating system 8021 includes various system programs, such as a framework layer, a core library layer and a driver layer, for implementing various basic services and processing hardware-based tasks. The application programs 808 include various applications, such as a browser, for implementing various application services. A program implementing the method of the embodiments of the present invention may be included in the application programs 808.
When the processor 801 calls and executes the application programs and data stored in the memory 802, specifically the programs or instructions stored in the application programs 808, it performs the following steps:
Extracting lexical features of a document to be processed;
Extracting syntactic features of the document to be processed;
Obtaining word vectors of the document to be processed, and obtaining vector representations of the lexical features and vector representations of the syntactic features;
Concatenating the word vectors, the vector representations of the lexical features and the vector representations of the syntactic features to obtain information to be processed;
Using the information to be processed as the input of an encoder to obtain the digest of the document to be processed.
The method disclosed in the above embodiments of the present invention may be applied to, or implemented by, the processor 801. The processor 801 may be an integrated circuit chip with signal processing capability. During implementation, each step of the above method may be completed by an integrated logic circuit of hardware in the processor 801 or by instructions in the form of software. The processor 801 may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or perform the methods, steps and logic block diagrams disclosed in the embodiments of the present invention. The general-purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in connection with the embodiments of the present invention may be performed directly by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software modules may be located in a storage medium that is mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 802; the processor 801 reads the information in the memory 802 and completes the steps of the above method in combination with its hardware.
It can be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or a combination thereof. For a hardware implementation, the processing unit may be implemented in one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), general-purpose processors, controllers, microcontrollers, microprocessors, other electronic units for performing the functions described herein, or a combination thereof.
For a software implementation, the techniques described herein may be implemented by modules (such as procedures and functions) that perform the functions described herein. The software code may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.
Wherein, the lexical features include: a part-of-speech feature, a named-entity feature, a term frequency feature, and an inverse document frequency feature.
Wherein, the syntactic features include: a dependency relation feature and a syntactic constituent feature.
Specifically, the processor 801 is further configured to read the computer program and perform the following steps:
Obtaining the continuous-valued features and the discrete-valued features among the lexical features and the syntactic features;
Representing the discrete-valued features as one-hot vectors;
Converting the continuous-valued features into target discrete-valued features, and representing the target discrete-valued features as one-hot vectors.
Specifically, the processor 801 is further configured to read the computer program and perform the following steps:
Assigning the continuous feature values to a preset number of buckets, thereby converting them into target discrete feature values;
Representing, as a one-hot vector, the index of the bucket used to convert a target continuous feature value among the continuous feature values into a discrete feature value.
Specifically, the processor 801 is further configured to read the computer program and perform the following step:
For each word in the document to be processed, concatenating end to end the word vector corresponding to the word, the vector representation of its lexical features and the vector representation of its syntactic features to form one vector, and using the resulting plurality of vectors as the information to be processed.
In addition, an embodiment of the present invention provides a computer-readable storage medium storing a computer program; when the computer program is run by a processor, the processor is caused to perform the following steps:
Extracting lexical features of a document to be processed;
Extracting syntactic features of the document to be processed;
Obtaining word vectors of the document to be processed, and obtaining vector representations of the lexical features and vector representations of the syntactic features;
Concatenating the word vectors, the vector representations of the lexical features and the vector representations of the syntactic features to obtain information to be processed;
Using the information to be processed as the input of an encoder to obtain the digest of the document to be processed.
The lexical features include: part-of-speech features, named entity features, word frequency statistical features, and inverse document frequency statistical features.
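As an illustration of the two statistical lexical features named above, the sketch below computes a word frequency and an inverse document frequency per word. The text does not give formulas, so the normalization of word frequency by document length, the smoothed form log((1 + N) / (1 + df)), the use of a tokenized reference corpus, and the function names are assumptions.

```python
import math
from collections import Counter
from typing import Dict, Sequence


def term_frequencies(tokens: Sequence[str]) -> Dict[str, float]:
    """Word frequency: occurrences of each word divided by the document length."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}


def inverse_document_frequencies(corpus: Sequence[Sequence[str]]) -> Dict[str, float]:
    """Smoothed inverse document frequency over a tokenized reference corpus."""
    n_docs = len(corpus)
    doc_freq: Counter = Counter()
    for doc in corpus:
        # Count each word at most once per document.
        doc_freq.update(set(doc))
    return {
        word: math.log((1 + n_docs) / (1 + df))
        for word, df in doc_freq.items()
    }
```

Both values are continuous, so under the scheme described in this document they would be discretized before being represented in one-hot form.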
The syntactic features include: dependency relation features from dependency parsing, and syntactic constituent features.
Obtaining the vector representation of the lexical features and the vector representation of the syntactic features comprises:
Obtaining the continuous-valued features and the discrete-valued features among the lexical features and the syntactic features;
Representing the discrete-valued features in one-hot form;
Converting the continuous-valued features into target discrete-valued features, and representing the target discrete-valued features in one-hot form.
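A one-hot representation of a discrete-valued feature, such as a part-of-speech tag or a dependency relation label, can be sketched as follows. The label inventory in the usage note, the extra slot reserved for unseen labels, and the names build_index and one_hot_label are assumptions for illustration and are not specified in the text.

```python
from typing import Dict, Iterable, List


def build_index(labels: Iterable[str]) -> Dict[str, int]:
    """Map each distinct label to a fixed position, in sorted order."""
    return {label: i for i, label in enumerate(sorted(set(labels)))}


def one_hot_label(label: str, index: Dict[str, int]) -> List[int]:
    """One-hot encode a label; the final slot is reserved for unseen labels."""
    vec = [0] * (len(index) + 1)
    # Assumption: labels missing from the index share the last position.
    vec[index.get(label, len(index))] = 1
    return vec
```

With an index built from the hypothetical tags NOUN, VERB and ADJ, the tag NOUN maps to position 1 of a four-element vector, and any tag outside the index maps to the last position.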
Converting the continuous-valued features into target discrete-valued features, and representing the target discrete-valued features in one-hot form, comprises:
Assigning the continuous feature values to a preset number of buckets, thereby converting them into target discrete feature values;
Representing, in one-hot form, the number of the bucket that converts a target continuous feature value among the continuous feature values into a discrete feature value.
Concatenating the term vectors, the vector representation of the lexical features, and the vector representation of the syntactic features to obtain the information to be processed comprises:
For each word in the document to be processed, concatenating end to end the term vector corresponding to the word, the vector representation of its lexical features, and the vector representation of its syntactic features to form one vector, and using the resulting plurality of vectors as the information to be processed.
Figure 10 shows the hardware architecture of a system to which the embodiment of the present invention is applied. The system is deployed on a PC: inputs and outputs are stored in the storage device 1013, the functional modules and intermediate results are stored in the main memory 1011, and the functional modules are executed by the central processing unit 1000. Data is input into the system through the input unit 1014, and the output result is displayed by the display unit 1015.
In the several embodiments provided in this application, it should be understood that the disclosed method and apparatus may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in an actual implementation. Multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be realized through some interfaces, and the indirect coupling or communication connection between devices or units may be electrical, mechanical, or of another form.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.
The integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform some of the steps of the receiving/transmission method described in the embodiments of the present invention. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.
The above is a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art may make several improvements and refinements without departing from the principles of the present invention, and such improvements and refinements shall also be regarded as falling within the protection scope of the present invention.
Claims (9)
1. A method for automatically generating a digest, characterized by comprising:
Extracting the lexical features of a document to be processed;
Extracting the syntactic features of the document to be processed;
Obtaining the term vectors of the document to be processed, and obtaining the vector representation of the lexical features and the vector representation of the syntactic features;
Concatenating the term vectors, the vector representation of the lexical features, and the vector representation of the syntactic features to obtain information to be processed;
Using the information to be processed as the input of an encoder to obtain the digest of the document to be processed.
2. The method according to claim 1, characterized in that the lexical features include: part-of-speech features, named entity features, word frequency statistical features, and inverse document frequency statistical features.
3. The method according to claim 1, characterized in that the syntactic features include: dependency relation features from dependency parsing, and syntactic constituent features.
4. The method according to claim 1, characterized in that obtaining the vector representation of the lexical features and the vector representation of the syntactic features comprises:
Obtaining the continuous-valued features and the discrete-valued features among the lexical features and the syntactic features;
Representing the discrete-valued features in one-hot form;
Converting the continuous-valued features into target discrete-valued features, and representing the target discrete-valued features in one-hot form.
5. The method according to claim 4, characterized in that converting the continuous-valued features into target discrete-valued features, and representing the target discrete-valued features in one-hot form, comprises:
Assigning the continuous feature values to a preset number of buckets, thereby converting them into target discrete feature values;
Representing, in one-hot form, the number of the bucket that converts a target continuous feature value among the continuous feature values into a discrete feature value.
6. The method according to claim 1, characterized in that concatenating the term vectors, the vector representation of the lexical features, and the vector representation of the syntactic features to obtain the information to be processed comprises:
For each word in the document to be processed, concatenating end to end the term vector corresponding to the word, the vector representation of its lexical features, and the vector representation of its syntactic features to form one vector, and using the resulting plurality of vectors as the information to be processed.
7. A device for automatically generating a digest, characterized by comprising:
A first extraction module, configured to extract the lexical features of a document to be processed;
A second extraction module, configured to extract the syntactic features of the document to be processed;
An obtaining module, configured to obtain the term vectors of the document to be processed, and to obtain the vector representation of the lexical features and the vector representation of the syntactic features;
A connection module, configured to concatenate the term vectors, the vector representation of the lexical features, and the vector representation of the syntactic features to obtain information to be processed;
A processing module, configured to use the information to be processed as the input of an encoder to obtain the digest of the document to be processed.
8. An electronic device, characterized by comprising: a processor; and a memory storing computer program instructions,
Wherein, when the computer program instructions are run by the processor, the processor is caused to perform the following steps:
Extracting the lexical features of a document to be processed;
Extracting the syntactic features of the document to be processed;
Obtaining the term vectors of the document to be processed, and obtaining the vector representation of the lexical features and the vector representation of the syntactic features;
Concatenating the term vectors, the vector representation of the lexical features, and the vector representation of the syntactic features to obtain information to be processed;
Using the information to be processed as the input of an encoder to obtain the digest of the document to be processed.
9. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when run by a processor, causes the processor to perform the following steps:
Extracting the lexical features of a document to be processed;
Extracting the syntactic features of the document to be processed;
Obtaining the term vectors of the document to be processed, and obtaining the vector representation of the lexical features and the vector representation of the syntactic features;
Concatenating the term vectors, the vector representation of the lexical features, and the vector representation of the syntactic features to obtain information to be processed;
Using the information to be processed as the input of an encoder to obtain the digest of the document to be processed.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710892106.0A CN109558583A (en) | 2017-09-27 | 2017-09-27 | A kind of method, device and equipment automatically generating digest |
JP2018134689A JP6579239B2 (en) | 2017-09-27 | 2018-07-18 | Abstract sentence automatic generation method, apparatus, and electronic device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710892106.0A CN109558583A (en) | 2017-09-27 | 2017-09-27 | A kind of method, device and equipment automatically generating digest |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109558583A true CN109558583A (en) | 2019-04-02 |
Family
ID=65864224
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710892106.0A Pending CN109558583A (en) | 2017-09-27 | 2017-09-27 | A kind of method, device and equipment automatically generating digest |
Country Status (2)
Country | Link |
---|---|
JP (1) | JP6579239B2 (en) |
CN (1) | CN109558583A (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110457483A (en) * | 2019-06-21 | 2019-11-15 | 浙江大学 | A kind of long text generation method based on neural topic model |
CN111178053A (en) * | 2019-12-30 | 2020-05-19 | 电子科技大学 | Text generation method for performing generation type abstract extraction by combining semantics and text structure |
CN111209751A (en) * | 2020-02-14 | 2020-05-29 | 全球能源互联网研究院有限公司 | Chinese word segmentation method, device and storage medium |
CN112765987A (en) * | 2021-01-26 | 2021-05-07 | 武汉大学 | Event identification method and system based on recursive conditional random field decoder |
CN113515627A (en) * | 2021-05-19 | 2021-10-19 | 北京世纪好未来教育科技有限公司 | Document detection method, device, equipment and storage medium |
WO2022121165A1 (en) * | 2020-12-10 | 2022-06-16 | 平安科技(深圳)有限公司 | Long text generation method and apparatus, device and storage medium |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112560456B (en) * | 2020-11-03 | 2024-04-09 | 重庆安石泽太科技有限公司 | Method and system for generating generated abstract based on improved neural network |
US20230124296A1 (en) * | 2021-10-18 | 2023-04-20 | Samsung Electronics Co., Ltd. | Method of natural language processing by performing semantic analysis using syntactic information, and an apparatus for the same |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104281645A (en) * | 2014-08-27 | 2015-01-14 | 北京理工大学 | Method for identifying emotion key sentence on basis of lexical semantics and syntactic dependency |
CN104536950A (en) * | 2014-12-11 | 2015-04-22 | 北京百度网讯科技有限公司 | Text summarization generating method and device |
CN106383817A (en) * | 2016-09-29 | 2017-02-08 | 北京理工大学 | Paper title generation method capable of utilizing distributed semantic information |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP3581074B2 (en) * | 2000-03-07 | 2004-10-27 | 日本電信電話株式会社 | Document digest creation method, document search device, and recording medium |
JP2003108571A (en) * | 2001-09-28 | 2003-04-11 | Seiko Epson Corp | Document summary device, control method of document summary device, control program of document summary device and recording medium |
JP2015088064A (en) * | 2013-10-31 | 2015-05-07 | 日本電信電話株式会社 | Text summarization device, text summarization method, and program |
JP6537340B2 (en) * | 2015-04-28 | 2019-07-03 | ヤフー株式会社 | Summary generation device, summary generation method, and summary generation program |
EP3394798A1 (en) * | 2016-03-18 | 2018-10-31 | Google LLC | Generating dependency parses of text segments using neural networks |
2017
- 2017-09-27 CN CN201710892106.0A patent/CN109558583A/en active Pending
2018
- 2018-07-18 JP JP2018134689A patent/JP6579239B2/en active Active
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110457483A (en) * | 2019-06-21 | 2019-11-15 | 浙江大学 | A kind of long text generation method based on neural topic model |
CN111178053A (en) * | 2019-12-30 | 2020-05-19 | 电子科技大学 | Text generation method for performing generation type abstract extraction by combining semantics and text structure |
CN111209751A (en) * | 2020-02-14 | 2020-05-29 | 全球能源互联网研究院有限公司 | Chinese word segmentation method, device and storage medium |
WO2022121165A1 (en) * | 2020-12-10 | 2022-06-16 | 平安科技(深圳)有限公司 | Long text generation method and apparatus, device and storage medium |
CN112765987A (en) * | 2021-01-26 | 2021-05-07 | 武汉大学 | Event identification method and system based on recursive conditional random field decoder |
CN113515627A (en) * | 2021-05-19 | 2021-10-19 | 北京世纪好未来教育科技有限公司 | Document detection method, device, equipment and storage medium |
CN113515627B (en) * | 2021-05-19 | 2023-07-25 | 北京世纪好未来教育科技有限公司 | Document detection method, device, equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
JP2019061656A (en) | 2019-04-18 |
JP6579239B2 (en) | 2019-09-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109558583A (en) | A kind of method, device and equipment automatically generating digest | |
KR102557681B1 (en) | Time series knowledge graph generation method, device, equipment and medium | |
Nie et al. | Improving named entity recognition with attentive ensemble of syntactic information | |
Overmyer et al. | Conceptual modeling through linguistic analysis using LIDA | |
CN112329465A (en) | Named entity identification method and device and computer readable storage medium | |
US20170315984A1 (en) | Systems and methods for text analytics processor | |
US20080010615A1 (en) | Generic frequency weighted visualization component | |
Khan et al. | Deep recurrent neural networks with word embeddings for Urdu named entity recognition | |
Singh et al. | Development of Marathi part of speech tagger using statistical approach | |
JP6693582B2 (en) | Document abstract generation method, device, electronic device, and computer-readable storage medium | |
Yu et al. | Character composition model with convolutional neural networks for dependency parsing on morphologically rich languages | |
CN108932218A (en) | A kind of example extended method, device, equipment and medium | |
Umber et al. | NL-based automated software requirements elicitation and specification | |
EP3921760A1 (en) | Mapping natural language utterances to operations over a knowledge graph | |
CN113761197A (en) | Application book multi-label hierarchical classification method capable of utilizing expert knowledge | |
CN116245097A (en) | Method for training entity recognition model, entity recognition method and corresponding device | |
JP2021197179A (en) | Entity identification method, device, and computer readable storage medium | |
CN108491423B (en) | Sorting method and device | |
Wu et al. | Clinical named entity recognition via bi-directional LSTM-CRF model | |
CN110134935A (en) | A kind of method, device and equipment for extracting font style characteristic | |
Roth et al. | Interactive feature space construction using semantic information | |
CN109471969A (en) | A kind of application searches method, device and equipment | |
CN115510188A (en) | Text keyword association method, device, equipment and storage medium | |
US8838562B1 (en) | Methods and apparatus for providing query parameters to a search engine | |
Choudhury et al. | Morphological analyzer for manipuri: design and implementation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190402 |