CN109408826A - A kind of text information extracting method, device, server and storage medium - Google Patents
A kind of text information extracting method, device, server and storage medium Download PDFInfo
- Publication number
- CN109408826A CN109408826A CN201811317522.9A CN201811317522A CN109408826A CN 109408826 A CN109408826 A CN 109408826A CN 201811317522 A CN201811317522 A CN 201811317522A CN 109408826 A CN109408826 A CN 109408826A
- Authority
- CN
- China
- Prior art keywords
- candidate word
- text
- vector
- sentence
- keyword
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000000034 method Methods 0.000 title claims abstract description 37
- 239000013598 vector Substances 0.000 claims abstract description 103
- 238000000605 extraction Methods 0.000 claims description 7
- 238000004590 computer program Methods 0.000 claims 1
- 238000012545 processing Methods 0.000 description 20
- 230000006870 function Effects 0.000 description 5
- 238000004891 communication Methods 0.000 description 4
- 238000010586 diagram Methods 0.000 description 4
- 230000005291 magnetic effect Effects 0.000 description 4
- 230000003287 optical effect Effects 0.000 description 4
- 241000220225 Malus Species 0.000 description 3
- 235000021016 apples Nutrition 0.000 description 3
- 238000004364 calculation method Methods 0.000 description 3
- 238000003306 harvesting Methods 0.000 description 3
- 240000007594 Oryza sativa Species 0.000 description 2
- 235000007164 Oryza sativa Nutrition 0.000 description 2
- 238000009792 diffusion process Methods 0.000 description 2
- 230000002093 peripheral effect Effects 0.000 description 2
- 235000009566 rice Nutrition 0.000 description 2
- 241001269238 Data Species 0.000 description 1
- 230000001133 acceleration Effects 0.000 description 1
- 230000006399 behavior Effects 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 230000015572 biosynthetic process Effects 0.000 description 1
- 238000006243 chemical reaction Methods 0.000 description 1
- 230000009193 crawling Effects 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000005611 electricity Effects 0.000 description 1
- 230000003203 everyday effect Effects 0.000 description 1
- 238000013507 mapping Methods 0.000 description 1
- 238000012544 monitoring process Methods 0.000 description 1
- 239000013307 optical fiber Substances 0.000 description 1
- 238000005201 scrubbing Methods 0.000 description 1
- 239000004065 semiconductor Substances 0.000 description 1
- 238000003786 synthesis reaction Methods 0.000 description 1
- 238000012549 training Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Abstract
A kind of text information extracting method, device, server and storage medium provided in an embodiment of the present invention.This method comprises: determining the term vector of candidate word in text by Word2Vec model, and determine the similarity value between different term vectors;Using term vector as node, and according to the side between the similarity value building node between term vector, candidate word atlas is obtained;Candidate word weight is determined according to the candidate word atlas by TextRank algorithm;According to candidate word weight, the keyword of text is determined.Candidate word is converted into term vector by using Word2Vec model, candidate word can be made to be indicated by the vector of low-dimensional, improve treatment effeciency, it is calculated by similarity value, and construct atlas, it can visually reflect the incidence relation between candidate word, the weighted value of candidate word is calculated finally by TextRank algorithm, thus the more accurate keyword for comprehensively determining text.
Description
Technical field
The present embodiments relate to text extraction techniques field more particularly to a kind of text information extracting methods, device, clothes
Business device and storage medium.
Background technique
With the fast development of internet, the function of network is more and more comprehensive, the amount of web documents information also rapid growth.
But many web documents, there are biggish length, people usually require to consume a large amount of time to read entire article ability
Obtain crucial news information.For needing to extract for the editor of article information or the monitoring personnel of network, in order to obtain
Crucial article information, requires a great deal of time to read the article of big length, greatly reduces working efficiency.Therefore,
The automatically extracting of text key word and text snippet greatly shortens people and obtains key message from big length web documents
Time, while also having saved the human cost of some companies or enterprise well.
Currently used keyword and abstract extraction method are the sort method based on TextRank algorithm, TextRank's
PageRank algorithm of the basic thought based on Google.TextRank universal model can be expressed as an oriented authorized graph G=(V,
E), it is made of point set V and line set E, E is the subset of V × V.In (Vi) is the point set for being directed toward point Vi, and Out (Vi) is
The point set that point Vi is directed toward.The score of point Vi is defined as follows:
Wherein, d is damped coefficient, and value range is 0 to 1, represents a certain specified point from figure and is directed toward any other point
Probability.
Weighted value is calculated according to above-mentioned algorithm and needs to construct atlas according to cooccurrence relation, but this method needs are built in advance
It stands the side between all point sets, then is wherein being chosen by the window being arranged, obtain that there are the sides of incidence relation and candidate
Word node, building process is cumbersome, and treatment effeciency is low, and is unable to get the relative size of each edge weighted value, causes to pass through
The keyword or abstract that TextRank algorithm obtains be not comprehensively accurate.In addition, traditional alphanumeric method form is simple,
The vector dimension of conversion is larger, is unfavorable for calculating and handle.
Summary of the invention
The embodiment of the invention provides a kind of text information extracting method, device, server and storage mediums, solve current
Comprehensively not accurate using keyword in TextRank algorithm progress information extraction process or abstract acquisition, treatment effeciency is low to ask
Topic.
In a first aspect, the embodiment of the invention provides a kind of text information extracting methods, comprising:
The term vector of candidate word in text is determined by Word2Vec model, and determines the similarity between different term vectors
Value;
Using term vector as node, and according to the side between the similarity value building node between term vector, candidate is obtained
Word atlas;
Candidate word weight is determined according to the candidate word atlas by TextRank algorithm;
According to candidate word weight, the keyword of text is determined.
Second aspect, the embodiment of the invention provides a kind of text information extraction element, described device includes:
First determining module for determining the term vector of candidate word in text by Word2Vec model, and determines different
Similarity value between term vector;
First building module, is used for using term vector as node, and construct node according to the similarity value between term vector
Between side, obtain candidate word atlas;
First weight determination module, for according to the candidate word atlas, determining that candidate word is weighed by TextRank algorithm
Weight;
Keyword determining module, for determining the keyword of text according to candidate word weight.
The third aspect, the embodiment of the invention provides a kind of servers, comprising:
One or more processors;
Memory, for storing one or more programs;
When one or more of programs are executed by one or more of processors, so that one or more of processing
Device is realized such as any text information extracting method in the embodiment of the present invention.
Fourth aspect, the embodiment of the invention provides a kind of computer readable storage mediums, are stored thereon with computer journey
Sequence realizes any text information extracting method in the embodiment of the present invention when program is executed by processor.
A kind of text information extracting method, device, server and storage medium provided in an embodiment of the present invention, by using
Candidate word is converted to term vector by Word2Vec model, and candidate word can be made to be indicated by the vector of low-dimensional, raising processing
Efficiency is calculated by similarity value, and constructs atlas, can visually reflect the incidence relation between candidate word, and lead to
The weighted value that TextRank algorithm calculates candidate word is crossed, thus the more accurate keyword for comprehensively determining text.
Detailed description of the invention
To describe the technical solutions in the embodiments of the present invention more clearly, make required in being described below to embodiment
Attached drawing does one and simply introduces, it should be apparent that, drawings in the following description are some embodiments of the invention, for this
For the those of ordinary skill of field, without creative efforts, it can also be obtained according to these attached drawings others
Attached drawing.
Fig. 1 is a kind of text information extracting method flow chart that the embodiment of the present invention one provides;
Fig. 2 is a kind of text information extracting method flow chart provided by Embodiment 2 of the present invention;
Fig. 3 is a kind of text information extraction element structural schematic diagram that the embodiment of the present invention three provides;
Fig. 4 is a kind of server architecture schematic diagram that the embodiment of the present invention four provides.
Specific embodiment
To make the object, technical solutions and advantages of the present invention clearer, hereinafter with reference to attached in the embodiment of the present invention
Figure, clearly and completely describes technical solution of the present invention by embodiment, it is clear that described embodiment is the present invention one
Section Example, instead of all the embodiments.Based on the embodiments of the present invention, those of ordinary skill in the art are not doing
Every other embodiment obtained under the premise of creative work out, shall fall within the protection scope of the present invention.
Embodiment one
Fig. 1 is a kind of text information extracting method flow chart that the embodiment of the present invention one provides.The technical side of the present embodiment
Case can be adapted for the case where extracting to key messages such as keywords in text.This method can be extracted by text information
Device executes, which can be realized by the mode of software and/or hardware, and is integrated in server.This method specifically include as
Lower operation:
S110, the term vector that candidate word in text is determined by Word2Vec model, and determine between different term vectors
Similarity value.
Specifically, crawling text as text to be processed by web crawlers, wherein the text can be different field
Newsletter archive.Data scrubbing is carried out to text to be processed, removes the non-textual information in text to be processed, such as punctuate symbol
Number, plain text is obtained, and plain text is split as complete sentence.Plain text is segmented using participle tool and carries out word
Property mark, remove stop words, leave the everyday words such as noun, adjective, verb as candidate word.Optionally, participle tool can be with
Tool is segmented for Ansj or jieba segments tool.
The term vector of text candidates word is determined by Word2Vec model.Wherein, Word2Vec model is to determine text
It is obtained before the term vector of candidate word by training, illustratively, a large amount of different types of text datas of selection, such as society,
The newsletter archives such as the people's livelihood, sport, music carry out data cleansing, and participle obtains candidate word, and candidate word is put into file
In, wherein the candidate word of each one sentence of behavior in file.It is trained using the model in Word2Vec algorithm, it is optional
, using the CBOW model based on hierarchical softmax or the Skip- based on hierarchical softmax
Gram model is trained, and obtains Word2Vec model.
After mapping obtains the term vector of candidate word, the similarity value between different term vectors is determined.Wherein, similarity value
It can be indicated with the cosine value of two different term vectors, it may be assumed that
Wherein, a, b indicate two different candidate words, and similarity (" a ", " b ") is indicated between two candidate words
Similarity value, A, B indicate that the corresponding term vector of two candidate words, AB indicate that the dot product of two term vectors, ‖ A ‖ and ‖ B ‖ indicate
The vector length of two term vectors, n indicate the dimension of term vector.The similarity value of two candidate words is obtained according to above-mentioned formula,
Such as:
Similarity (" Shandong ", " Jiangsu ")=0.41542658
Similarity (" Shandong ", " Beijing ")=0.19865009
Similarity (" Shandong ", " men's basketball ")=0.16770135.
Wherein, the similarity value of two candidate words in Shandong and Jiangsu is larger, then illustrates the association between the two candidate words
Degree is higher, and Shandong and the similarity value of two candidate words of men's basketball are smaller, illustrate that the degree of association between the two candidate words is smaller.
S120, using term vector as node, and according between term vector similarity value building node between side, obtain
Candidate word atlas.
Specifically, forming point set using term vector as node, V can be expressed as.Optionally, according between term vector
Similarity value constructs the side between node, comprising: if the similarity value between two term vectors is greater than preset first similarity
Threshold value then constructs the side between described two term vectors, forms side collection, can be expressed as E, and E is the subset of V × V.Wherein, institute
Stating preset first similarity threshold can be configured as needed by technical staff, illustratively, by preset first phase
It is set as 0.450 like degree threshold value, when the similarity value between two candidate words is greater than 0.450, shows the relevance between it
It is higher, then construct the side between two candidate word nodes.Candidate word atlas G=(V, E) is obtained according to point set and Bian Ji, according to time
Select the incidence relation between the available keyword of word atlas G.
S130, candidate word weight is determined according to the candidate word atlas by TextRank algorithm.
Specifically, calculating the weighted value of candidate word, calculation formula according to TextRank algorithm formula are as follows:
Wherein, d is damped coefficient, and value range is 0 to 1, and the general of any other point is directed toward in the certain point represented from figure
Rate, general value are 0.85, ViAnd VjIndicate two different candidate word nodes, WS (Vi) and WS (Vj) indicate two candidate words
Weight, In (Vi) indicate to be directed toward node ViNode set, Out (Vj) indicate node VjThe set of the node of direction, wjiWith
wjkIndicate the weighted value on the side between two nodes, i.e. similarity value between the candidate word of two node on behalf.Pass through above-mentioned public affairs
Formula, and according to the incidence relation between the candidate word in candidate word atlas, determine the weight of each candidate word, and iterative diffusion is each
The weight of node, until convergence, optionally, convergency value can be set to 0.0001.
S140, according to candidate word weight, determine the keyword of text.
Optionally, according to candidate word weight, the keyword of text is determined, comprising: press to candidate word according to candidate word weight
It is ranked up according to inverted order;Keyword of the preceding candidate word of selected and sorted as text.Specifically, being obtained by TextRank algorithm
To the weighted value of candidate word, candidate word is ranked up according to weighted value size, optionally, is arranged according to weighted value inverted order
Sequence selects weighted value to sort preceding candidate word as text key word, and the keyword number of selection can be by technical staff's root
According to being set.
Illustratively, according to candidate word weight, after the keyword for determining text, if can also include: at least two passes
The position of keyword in the text is adjacent, then synthesizes at least two keyword.Specifically, obtained keyword is existed
Position mark is carried out in original text, if the position of at least two keywords in the text is adjacent, by least two adjacent passes
Keyword is synthesized, and more word keywords are formed.
A kind of text information extracting method provided in an embodiment of the present invention, firstly, determining text by Word2Vec model
The term vector of middle candidate word, and determine the similarity value between different term vectors;Then, using term vector as node, and according to
The side between similarity value building node between term vector, obtains candidate word atlas;By TextRank algorithm, according to described
Candidate word atlas determines candidate word weight;Finally, determining the keyword of text according to candidate word weight.By using
Candidate word is converted to term vector by Word2Vec model, and candidate word can be made to be indicated by the vector of low-dimensional, improves place
Efficiency is managed, is calculated by similarity value, can accurately obtain the connection being selected between word, and by building atlas, it can
Reflect the incidence relation between candidate word accurate and visually, the weighted value of candidate word calculated finally by TextRank algorithm,
To the more accurate keyword for comprehensively determining text.
Embodiment two
Fig. 2 is a kind of text information extracting method flow chart provided by Embodiment 2 of the present invention.The present embodiment is in above-mentioned reality
It applies and advanced optimizes on the basis of example, wherein the not content detailed in Example one of detailed description in the present embodiment.Such as Fig. 2 institute
Show, a kind of text information extracting method provided by Embodiment 2 of the present invention specifically includes the following steps:
S210, the term vector that candidate word in text is determined by Word2Vec model, and determine between different term vectors
Similarity value.
S220, using term vector as node, and according between term vector similarity value building node between side, obtain
Candidate word atlas.
S230, candidate word weight is determined according to the candidate word atlas by TextRank algorithm.
S240, according to candidate word weight, determine the keyword of text.
The term vector of S250, the candidate word according to included by sentence in text determine that the vector of sentence indicates, and determine not
With the similarity value between the vector expression of sentence.
Reader is not often only intended to know the keyword of text, it is also necessary to pass through text when reading newsletter archive information
This abstract more comprehensively specifically understands content of text.Illustratively, term vector included by sentence in text is overlapped, is obtained
Dimension to sentence vector, the sentence vector is identical as the dimension of the term vector of candidate word.By between different sentences to
Amount indicates to determine the similarity value between different sentences.Illustratively, available:
Similarity (" Apples of Shandong good harvest ", " peasant is in Jiangsu kind rice ")=0.48500857
Similarity (" Apples of Shandong good harvest ", " failure of Shandong football ")=0.31601506.
Wherein, the similarity value between " Apples of Shandong good harvest " and " peasant is in Jiangsu kind rice " is larger, then illustrates two
The degree of association between sentence is larger.
S260, the vector table of sentence is shown as to node, and section is constructed according to the similarity between the expression of the vector of sentence
Side between point, obtains sentence atlas.
The vector table of sentence is shown as node, point set is formed, V ' can be expressed as.Optionally, according to the vector of sentence
The side between similarity value building node between expression, comprising: if the similarity value between the vector expression of two sentences is big
In preset second similarity threshold, then the side between the vector expression of described two sentences is constructed, side collection is formed, can indicate
It is V ' × V ' subset for E ', E '.Wherein, preset second similarity threshold can be carried out as needed by technical staff
Setting, illustratively, sets 0.550 for preset second similarity threshold, when the similarity value between two candidate words is big
When 0.550, show that the relevance between it is higher, then constructs the side between two sentence nodes.It is obtained according to point set and side collection
To sentence atlas G '=(V ', E '), according to the incidence relation between the available sentence of sentence atlas G '.
S270, sentence weight is determined according to the sentence atlas by TextRank algorithm.
Specifically, being closed by the calculation formula in TextRank according to the association between the different sentences in sentence atlas
System, determines the weighted value of sentence, and the weight of each node of iterative diffusion, until convergence, optionally, convergency value be can be set to
0.0001。
S280, according to sentence weight, determine the abstract of text.
Optionally, according to sentence weight, determine the abstract of text, comprising: according to sentence weight to sentence according to inverted order into
Row sequence;The abstract of the preceding sentence composition text of selected and sorted.Specifically, obtaining the weight of sentence by TextRank algorithm
Value, is ranked up candidate word according to weighted value size, optionally, is ranked up according to weighted value inverted order, selects weighted value row
The preceding sentence of sequence forms text snippet, and the sentence number of selection can be set as needed by technical staff.
A kind of text information extracting method provided in an embodiment of the present invention, increases step: being wrapped according to sentence in text
The term vector of the candidate word included determines that the vector of sentence indicates, and the similarity value between the vector expression of determining different sentences;
The vector table of sentence is shown as node, and according to the side between the similarity building node between the expression of the vector of sentence, is obtained
To sentence atlas;Sentence weight is determined according to the sentence atlas by TextRank algorithm;According to sentence weight, text is determined
This abstract.The vector expression of sentence is formed by the way that candidate term vector to be overlapped, the vector for reducing sentence indicates dimension,
Treatment effeciency is improved, and by calculating similarity value and building atlas, reflects the association between sentence more accurate and visually
Property, sentence weight is obtained finally by TextRank algorithm, and then obtain text snippet, solved at present since sentence vector is tieed up
The low problem of higher caused treatment effeciency is spent, and because of incomplete ask of making a summary caused by similarity calculation error between sentence
Topic, obtained abstract more comprehensively intuitively reflect content of text.
Embodiment three
Fig. 3 is that a kind of text information that the embodiment of the present invention three provides proposes apparatus structure schematic diagram.As shown in figure 3, described
Device includes:
First determining module 310 for determining the term vector of candidate word in text by Word2Vec model, and determines not
With the similarity value between term vector;
First building module 320, is used for using term vector as node, and construct section according to the similarity value between term vector
Side between point, obtains candidate word atlas;
First weight determination module 330, for being determined candidate by TextRank algorithm according to the candidate word atlas
Word weight;
Keyword determining module 340, for determining the keyword of text according to candidate word weight.
Optionally, the first building module 320 is specifically used for:
If similarity value between two term vectors is greater than preset first similarity threshold, construct described two words to
Side between amount.
Optionally, the keyword determining module 340, is specifically used for:
Candidate word is ranked up according to inverted order according to candidate word weight;
Keyword of the preceding candidate word of selected and sorted as text.
Optionally, the keyword determining module 340, is also used to:
If the position of at least two keywords in the text is adjacent, at least two keyword is synthesized.
Optionally, further includes:
Second determining module determines the vector of sentence for the term vector of the candidate word according to included by sentence in text
It indicates, and the similarity value between the vector expression of determining different sentences;
Second building module, for the vector table of sentence to be shown as node, and according between the expression of the vector of sentence
Similarity constructs the side between node, obtains sentence atlas;
Second weight determination module, for determining sentence weight according to the sentence atlas by TextRank algorithm;
Abstract determining module, for determining the abstract of text according to sentence weight.
Optionally, further includes:
Synthesis module, it is crucial to described at least two if adjacent for the position of at least two keywords in the text
Word is synthesized.
A kind of text information extraction element provided in an embodiment of the present invention, a kind of text information proposed with above-described embodiment
Extracting method belongs to same inventive concept, and the technical detail of detailed description not can be found in above-described embodiment in the present embodiment, and
And the present embodiment and above-described embodiment beneficial effect having the same.
Example IV
Fig. 4 is a kind of structure chart for server that the embodiment of the present invention four provides.Fig. 4, which is shown, to be suitable for being used to realizing this hair
The block diagram of the exemplary processing devices 412 of bright embodiment.The processing equipment 412 that Fig. 4 is shown is only an example, should not be right
The function and use scope of the embodiment of the present invention bring any restrictions.
As shown in figure 4, processing equipment 412 is showed in the form of universal computing device.The component of processing equipment 412 can wrap
Include but be not limited to: one or more processor or processing unit 416, system storage 428 connect different system components
The bus 418 of (including system storage 428 and processing unit 416).
Bus 418 indicates one of a few class bus structures or a variety of, including memory bus or Memory Controller,
Peripheral bus, graphics acceleration port, processor or the local bus using any bus structures in a variety of bus structures.It lifts
For example, these architectures include but is not limited to industry standard architecture (ISA) bus, microchannel architecture (MAC)
Bus, enhanced isa bus, Video Electronics Standards Association (VESA) local bus and peripheral component interconnection (PCI) bus.
Processing equipment 412 typically comprises a variety of computer system readable media.These media can be it is any can be by
The usable medium that processing equipment 412 accesses, including volatile and non-volatile media, moveable and immovable medium.
System storage 428 may include the computer system readable media of form of volatile memory, such as deposit at random
Access to memory (RAM) 430 and/or cache memory 432.Processing equipment 412 may further include it is other it is removable/no
Movably, volatile/non-volatile computer system storage medium.Only as an example, storage system 434 can be used for reading and writing
Immovable, non-volatile magnetic media (Fig. 4 do not show, commonly referred to as " hard disk drive ").Although not shown in fig 4, may be used
To provide the disc driver for reading and writing to removable non-volatile magnetic disk (such as " floppy disk "), and it is non-volatile to moving
Property CD (such as CD-ROM, DVD-ROM or other optical mediums) read and write CD drive.In these cases, each drive
Dynamic device can be connected by one or more data media interfaces with bus 418.Memory 428 may include at least one journey
Sequence product, the program product have one group of (for example, at least one) program module, these program modules are configured to perform this hair
The function of bright each embodiment.
Program/utility 440 with one group of (at least one) program module 442, can store in such as memory
In 428, such program module 442 includes but is not limited to operating system, one or more application program, other program modules
And program data, it may include the realization of network environment in each of these examples or certain combination.Program module 442
Usually execute the function and/or method in embodiment described in the invention.
Processing equipment 412 can also be with one or more external equipments 414 (such as keyboard, sensing equipment, display 424
Deng) communication, can also be enabled a user to one or more equipment interact with the processing equipment 412 communicate, and/or with make
Any equipment (such as network interface card, the modem that the processing equipment 412 can be communicated with one or more of the other calculating equipment
Etc.) communication.This communication can be carried out by input/output (I/O) interface 422.Also, processing equipment 412 can also lead to
Cross network adapter 420 and one or more network (such as local area network (LAN), wide area network (WAN) and/or public network, example
Such as internet) communication.As shown, network adapter 420 is communicated by bus 418 with other modules of processing equipment 412.It answers
When understanding, although not shown in the drawings, other hardware and/or software module can be used with combination processing equipment 412, including but unlimited
In: microcode, device driver, redundant processing unit, external disk drive array, RAID system, tape drive and number
According to backup storage system etc..
Processing unit 416 is by running at least one of other programs in the multiple programs being stored in system storage 428
It is a, thereby executing various function application and data processing, such as realize a kind of text information provided by the embodiment of the present invention
Extracting method, comprising:
The term vector of candidate word in text is determined by Word2Vec model, and determines the similarity between different term vectors
Value;
Using term vector as node, and according to the side between the similarity value building node between term vector, candidate is obtained
Word atlas;
Candidate word weight is determined according to the candidate word atlas by TextRank algorithm;
According to candidate word weight, the keyword of text is determined.
Embodiment five
The embodiment of the present invention five additionally provides a kind of storage medium comprising computer executable instructions, and the computer can
It executes instruction when being executed by computer processor for executing a kind of text information extracting method, comprising:
The term vector of candidate word in text is determined by Word2Vec model, and determines the similarity between different term vectors
Value;
Using term vector as node, and according to the side between the similarity value building node between term vector, candidate is obtained
Word atlas;
Candidate word weight is determined according to the candidate word atlas by TextRank algorithm;
According to candidate word weight, the keyword of text is determined.
The computer storage medium of the embodiment of the present invention, can be using any of one or more computer-readable media
Combination.Computer-readable medium can be computer-readable signal media or computer readable storage medium.It is computer-readable
Storage medium for example may be-but not limited to-the system of electricity, magnetic, optical, electromagnetic, infrared ray or semiconductor, device or
Device, or any above combination.The more specific example (non exhaustive list) of computer readable storage medium includes: tool
There are electrical connection, the portable computer diskette, hard disk, random access memory (RAM), read-only memory of one or more conducting wires
(ROM), erasable programmable read only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-
ROM), light storage device, magnetic memory device or above-mentioned any appropriate combination.In this document, computer-readable storage
Medium can be any tangible medium for including or store program, which can be commanded execution system, device or device
Using or it is in connection.
Computer-readable signal media may include in a base band or as carrier wave a part propagate data-signal,
Wherein carry computer-readable program code.The data-signal of this propagation can take various forms, including but unlimited
In electromagnetic signal, optical signal or above-mentioned any appropriate combination.Computer-readable signal media can also be that computer can
Any computer-readable medium other than storage medium is read, which can send, propagates or transmit and be used for
By the use of instruction execution system, device or device or program in connection.
The program code for including on computer-readable medium can transmit with any suitable medium, including --- but it is unlimited
In wireless, electric wire, optical cable, RF etc. or above-mentioned any appropriate combination.
The computer for executing operation of the present invention can be write with one or more programming languages or combinations thereof
Program code, described program design language include object oriented program language-such as Java, Smalltalk, C++,
It further include conventional procedural programming language-such as " C " language or similar programming language.Program code can be with
It fully executes, partly execute on the user computer on the user computer, being executed as an independent software package, portion
Divide and partially executes or executed on a remote computer or server completely on the remote computer on the user computer.?
Be related in the situation of remote computer, remote computer can pass through the network of any kind --- including local area network (LAN) or
Wide area network (WAN)-be connected to subscriber computer, or, it may be connected to outer computer (such as mentioned using Internet service
It is connected for quotient by internet).
Note that the above is only a better embodiment of the present invention and the applied technical principle.It will be appreciated by those skilled in the art that
The invention is not limited to the specific embodiments described herein, be able to carry out for a person skilled in the art it is various it is apparent variation,
It readjusts and substitutes without departing from protection scope of the present invention.Therefore, although being carried out by above embodiments to the present invention
It is described in further detail, but the present invention is not limited to the above embodiments only, without departing from the inventive concept, also
It may include more other equivalent embodiments, and the scope of the invention is determined by the scope of the appended claims.
Claims (10)
1. a kind of text information extracting method, which is characterized in that the described method includes:
The term vector of candidate word in text is determined by Word2Vec model, and determines the similarity value between different term vectors;
Using term vector as node, and according to the side between the similarity value building node between term vector, candidate word figure is obtained
Collection;
Candidate word weight is determined according to the candidate word atlas by TextRank algorithm;
According to candidate word weight, the keyword of text is determined.
2. the method according to claim 1, wherein according between the similarity value building node between term vector
Side, comprising:
If similarity value between two term vectors is greater than preset first similarity threshold, construct described two term vectors it
Between side.
3. being wrapped the method according to claim 1, wherein determining the keyword of text according to candidate word weight
It includes:
Candidate word is ranked up according to inverted order according to candidate word weight;
Keyword of the preceding candidate word of selected and sorted as text.
4. the method according to claim 1, wherein determining the word of candidate word in text by Word2Vec model
After vector, further includes:
The term vector of the candidate word according to included by sentence in text determines that the vector of sentence indicates, and determines different sentences
Similarity value between vector expression;
The vector table of sentence is shown as node, and according between the similarity building node between the expression of the vector of sentence
Side obtains sentence atlas;
Sentence weight is determined according to the sentence atlas by TextRank algorithm;
According to sentence weight, the abstract of text is determined.
5. the method according to claim 1, wherein according to candidate word weight, after the keyword for determining text,
Further include:
If the position of at least two keywords in the text is adjacent, at least two keyword is synthesized.
6. a kind of text information extraction element characterized by comprising
First determining module, for determining the term vector of candidate word in text by Word2Vec model, and determine different words to
Similarity value between amount;
First building module, is used for using term vector as node, and according between the similarity value building node between term vector
Side, obtain candidate word atlas;
First weight determination module, for determining candidate word weight according to the candidate word atlas by TextRank algorithm;
Keyword determining module, for determining the keyword of text according to candidate word weight.
7. device according to claim 6, which is characterized in that the first building module is specifically used for:
If similarity value between two term vectors is greater than preset first similarity threshold, construct described two term vectors it
Between side.
8. device according to claim 6, which is characterized in that the keyword determining module is specifically used for:
Candidate word is ranked up according to inverted order according to candidate word weight;
Keyword of the preceding candidate word of selected and sorted as text.
9. a kind of server characterized by comprising
One or more processors;
Memory, for storing one or more programs;
When one or more of programs are executed by one or more of processors, so that one or more of processors are real
Now such as a kind of text information extracting method as claimed in any one of claims 1 to 5.
10. a kind of computer readable storage medium, is stored thereon with computer program, which is characterized in that the program is by processor
A kind of such as text information extracting method as claimed in any one of claims 1 to 5 is realized when execution.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811317522.9A CN109408826A (en) | 2018-11-07 | 2018-11-07 | A kind of text information extracting method, device, server and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811317522.9A CN109408826A (en) | 2018-11-07 | 2018-11-07 | A kind of text information extracting method, device, server and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109408826A true CN109408826A (en) | 2019-03-01 |
Family
ID=65471876
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811317522.9A Pending CN109408826A (en) | 2018-11-07 | 2018-11-07 | A kind of text information extracting method, device, server and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109408826A (en) |
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110362678A (en) * | 2019-06-04 | 2019-10-22 | 哈尔滨工业大学(威海) | A kind of method and apparatus automatically extracting Chinese text keyword |
CN110377725A (en) * | 2019-07-12 | 2019-10-25 | 深圳新度博望科技有限公司 | Data creation method, device, computer equipment and storage medium |
CN110457699A (en) * | 2019-08-06 | 2019-11-15 | 腾讯科技(深圳)有限公司 | A kind of stop words method for digging, device, electronic equipment and storage medium |
CN110705282A (en) * | 2019-09-04 | 2020-01-17 | 东软集团股份有限公司 | Keyword extraction method and device, storage medium and electronic equipment |
CN110781669A (en) * | 2019-10-24 | 2020-02-11 | 泰康保险集团股份有限公司 | Text key information extraction method and device, electronic equipment and storage medium |
CN110888990A (en) * | 2019-11-22 | 2020-03-17 | 深圳前海微众银行股份有限公司 | Text recommendation method, device, equipment and medium |
CN111125348A (en) * | 2019-11-25 | 2020-05-08 | 北京明略软件系统有限公司 | Text abstract extraction method and device |
CN111241288A (en) * | 2020-01-17 | 2020-06-05 | 烟台海颐软件股份有限公司 | Emergency sensing system of large centralized power customer service center and construction method |
CN111640025A (en) * | 2020-06-09 | 2020-09-08 | 国泰君安证券股份有限公司 | Method for realizing information labeling processing based on label system |
CN112347288A (en) * | 2020-11-10 | 2021-02-09 | 北京北大方正电子有限公司 | Character and picture vectorization method |
CN112445891A (en) * | 2019-08-30 | 2021-03-05 | 智慧芽信息科技(苏州)有限公司 | Text information navigation browsing method, device, server and storage medium |
CN112560477A (en) * | 2020-12-09 | 2021-03-26 | 中科讯飞互联(北京)信息科技有限公司 | Text completion method, electronic device and storage device |
CN113064979A (en) * | 2021-03-10 | 2021-07-02 | 国网河北省电力有限公司 | Keyword retrieval-based method for judging construction period and price reasonability |
CN113326374A (en) * | 2021-05-25 | 2021-08-31 | 成都信息工程大学 | Short text emotion classification method and system based on feature enhancement |
CN113656429A (en) * | 2021-07-28 | 2021-11-16 | 广州荔支网络技术有限公司 | Keyword extraction method and device, computer equipment and storage medium |
CN113672705A (en) * | 2021-08-27 | 2021-11-19 | 工银科技有限公司 | Resume screening method, apparatus, device, medium and program product |
CN114912425A (en) * | 2022-05-17 | 2022-08-16 | 中国银行股份有限公司 | Presentation generation method and device |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105893410A (en) * | 2015-11-18 | 2016-08-24 | 乐视网信息技术(北京)股份有限公司 | Keyword extraction method and apparatus |
CN106227722A (en) * | 2016-09-12 | 2016-12-14 | 中山大学 | A kind of extraction method based on listed company's bulletin summary |
CN106970910A (en) * | 2017-03-31 | 2017-07-21 | 北京奇艺世纪科技有限公司 | A kind of keyword extracting method and device based on graph model |
CN107704503A (en) * | 2017-08-29 | 2018-02-16 | 平安科技(深圳)有限公司 | User's keyword extracting device, method and computer-readable recording medium |
CN107832457A (en) * | 2017-11-24 | 2018-03-23 | 国网山东省电力公司电力科学研究院 | Power transmission and transforming equipment defect dictionary method for building up and system based on TextRank algorithm |
US20180285928A1 (en) * | 2017-03-29 | 2018-10-04 | Ebay Inc. | Generating keywords by associative context with input words |
-
2018
- 2018-11-07 CN CN201811317522.9A patent/CN109408826A/en active Pending
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105893410A (en) * | 2015-11-18 | 2016-08-24 | 乐视网信息技术(北京)股份有限公司 | Keyword extraction method and apparatus |
CN106227722A (en) * | 2016-09-12 | 2016-12-14 | 中山大学 | A kind of extraction method based on listed company's bulletin summary |
US20180285928A1 (en) * | 2017-03-29 | 2018-10-04 | Ebay Inc. | Generating keywords by associative context with input words |
CN106970910A (en) * | 2017-03-31 | 2017-07-21 | 北京奇艺世纪科技有限公司 | A kind of keyword extracting method and device based on graph model |
CN107704503A (en) * | 2017-08-29 | 2018-02-16 | 平安科技(深圳)有限公司 | User's keyword extracting device, method and computer-readable recording medium |
CN107832457A (en) * | 2017-11-24 | 2018-03-23 | 国网山东省电力公司电力科学研究院 | Power transmission and transforming equipment defect dictionary method for building up and system based on TextRank algorithm |
Non-Patent Citations (1)
Title |
---|
宁建飞 等: "融合Word2vec与TextRank的关键词抽取研究", 《现代图书情报技术》 * |
Cited By (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110362678A (en) * | 2019-06-04 | 2019-10-22 | 哈尔滨工业大学(威海) | A kind of method and apparatus automatically extracting Chinese text keyword |
CN110377725A (en) * | 2019-07-12 | 2019-10-25 | 深圳新度博望科技有限公司 | Data creation method, device, computer equipment and storage medium |
CN110377725B (en) * | 2019-07-12 | 2021-09-24 | 深圳新度博望科技有限公司 | Data generation method and device, computer equipment and storage medium |
CN110457699A (en) * | 2019-08-06 | 2019-11-15 | 腾讯科技(深圳)有限公司 | A kind of stop words method for digging, device, electronic equipment and storage medium |
CN110457699B (en) * | 2019-08-06 | 2023-07-04 | 腾讯科技(深圳)有限公司 | Method and device for mining stop words, electronic equipment and storage medium |
CN112445891A (en) * | 2019-08-30 | 2021-03-05 | 智慧芽信息科技(苏州)有限公司 | Text information navigation browsing method, device, server and storage medium |
CN110705282A (en) * | 2019-09-04 | 2020-01-17 | 东软集团股份有限公司 | Keyword extraction method and device, storage medium and electronic equipment |
CN110781669A (en) * | 2019-10-24 | 2020-02-11 | 泰康保险集团股份有限公司 | Text key information extraction method and device, electronic equipment and storage medium |
CN110888990A (en) * | 2019-11-22 | 2020-03-17 | 深圳前海微众银行股份有限公司 | Text recommendation method, device, equipment and medium |
CN110888990B (en) * | 2019-11-22 | 2024-04-12 | 深圳前海微众银行股份有限公司 | Text recommendation method, device, equipment and medium |
CN111125348A (en) * | 2019-11-25 | 2020-05-08 | 北京明略软件系统有限公司 | Text abstract extraction method and device |
CN111241288A (en) * | 2020-01-17 | 2020-06-05 | 烟台海颐软件股份有限公司 | Emergency sensing system of large centralized power customer service center and construction method |
CN111640025A (en) * | 2020-06-09 | 2020-09-08 | 国泰君安证券股份有限公司 | Method for realizing information labeling processing based on label system |
CN111640025B (en) * | 2020-06-09 | 2023-08-01 | 国泰君安证券股份有限公司 | Method for realizing information labeling processing based on label system |
CN112347288A (en) * | 2020-11-10 | 2021-02-09 | 北京北大方正电子有限公司 | Character and picture vectorization method |
CN112347288B (en) * | 2020-11-10 | 2024-02-20 | 北京北大方正电子有限公司 | Vectorization method of word graph |
CN112560477A (en) * | 2020-12-09 | 2021-03-26 | 中科讯飞互联(北京)信息科技有限公司 | Text completion method, electronic device and storage device |
CN112560477B (en) * | 2020-12-09 | 2024-04-16 | 科大讯飞(北京)有限公司 | Text completion method, electronic equipment and storage device |
CN113064979A (en) * | 2021-03-10 | 2021-07-02 | 国网河北省电力有限公司 | Keyword retrieval-based method for judging construction period and price reasonability |
CN113326374B (en) * | 2021-05-25 | 2022-12-20 | 成都信息工程大学 | Short text emotion classification method and system based on feature enhancement |
CN113326374A (en) * | 2021-05-25 | 2021-08-31 | 成都信息工程大学 | Short text emotion classification method and system based on feature enhancement |
CN113656429A (en) * | 2021-07-28 | 2021-11-16 | 广州荔支网络技术有限公司 | Keyword extraction method and device, computer equipment and storage medium |
CN113672705A (en) * | 2021-08-27 | 2021-11-19 | 工银科技有限公司 | Resume screening method, apparatus, device, medium and program product |
CN114912425A (en) * | 2022-05-17 | 2022-08-16 | 中国银行股份有限公司 | Presentation generation method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109408826A (en) | A kind of text information extracting method, device, server and storage medium | |
US10402433B2 (en) | Method and apparatus for recommending answer to question based on artificial intelligence | |
US20200401765A1 (en) | Man-machine conversation method, electronic device, and computer-readable medium | |
US10942958B2 (en) | User interface for a query answering system | |
US20190163691A1 (en) | Intent Based Dynamic Generation of Personalized Content from Dynamic Sources | |
US10599983B2 (en) | Inferred facts discovered through knowledge graph derived contextual overlays | |
CN110276023B (en) | POI transition event discovery method, device, computing equipment and medium | |
CN109657054A (en) | Abstraction generating method, device, server and storage medium | |
CN110569335B (en) | Triple verification method and device based on artificial intelligence and storage medium | |
CN113434636B (en) | Semantic-based approximate text searching method, semantic-based approximate text searching device, computer equipment and medium | |
JP2022046759A (en) | Retrieval method, device, electronic apparatus and storage medium | |
US10474747B2 (en) | Adjusting time dependent terminology in a question and answer system | |
CN107861948B (en) | Label extraction method, device, equipment and medium | |
US20150169676A1 (en) | Generating a Table of Contents for Unformatted Text | |
CN110334268B (en) | Block chain project hot word generation method and device | |
CN110704608A (en) | Text theme generation method and device and computer equipment | |
CN111597800A (en) | Method, device, equipment and storage medium for obtaining synonyms | |
CN106663123B (en) | Comment-centric news reader | |
CN111813993A (en) | Video content expanding method and device, terminal equipment and storage medium | |
CN114706973A (en) | Extraction type text abstract generation method and device, computer equipment and storage medium | |
WO2020052060A1 (en) | Method and apparatus for generating correction statement | |
Yajian et al. | A short text classification algorithm based on semantic extension | |
CN111241273A (en) | Text data classification method and device, electronic equipment and computer readable medium | |
CN113077312A (en) | Hotel recommendation method, system, equipment and storage medium | |
CN116796730A (en) | Text error correction method, device, equipment and storage medium based on artificial intelligence |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20190301 |