CN109408826A - Text information extraction method, device, server and storage medium - Google Patents

Info

Publication number
CN109408826A
Authority
CN
China
Prior art keywords
candidate word
text
vector
sentence
keyword
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811317522.9A
Other languages
Chinese (zh)
Inventor
谢永恒
段小文
万月亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Ruian Technology Co Ltd
Original Assignee
Beijing Ruian Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Ruian Technology Co Ltd filed Critical Beijing Ruian Technology Co Ltd
Priority to CN201811317522.9A priority Critical patent/CN109408826A/en
Publication of CN109408826A publication Critical patent/CN109408826A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

Embodiments of the present invention provide a text information extraction method, device, server, and storage medium. The method comprises: determining word vectors for the candidate words in a text by means of a Word2Vec model, and determining similarity values between different word vectors; taking the word vectors as nodes and constructing edges between the nodes according to the similarity values between the word vectors, to obtain a candidate word graph; determining candidate word weights from the candidate word graph by means of the TextRank algorithm; and determining the keywords of the text according to the candidate word weights. Converting candidate words into word vectors with the Word2Vec model lets each candidate word be represented by a low-dimensional vector, which improves processing efficiency. Computing similarity values and building a graph from them makes the associations between candidate words explicit. Finally, computing the candidate word weights with the TextRank algorithm allows the keywords of the text to be determined more accurately and comprehensively.

Description

Text information extraction method, device, server and storage medium
Technical field
Embodiments of the present invention relate to the technical field of text extraction, and in particular to a text information extraction method, device, server, and storage medium.
Background art
With the rapid development of the Internet, networks offer ever more functions and the volume of web document information is growing quickly. Many web documents are long, however, and readers often have to spend a great deal of time reading an entire article to obtain the key news information. Editors who need to extract article information, and staff who monitor network content, must likewise spend much time reading long articles to obtain the key information, which greatly reduces their working efficiency. Automatic extraction of keywords and summaries from text therefore greatly shortens the time needed to obtain key information from long web documents, and also saves labor costs for companies and enterprises.
A commonly used method for extracting keywords and summaries is ranking based on the TextRank algorithm, whose basic idea derives from Google's PageRank algorithm. The general TextRank model can be expressed as a directed weighted graph G = (V, E) consisting of a point set V and an edge set E, where E is a subset of V × V. In(V_i) is the set of points that point to V_i, Out(V_i) is the set of points that V_i points to, and w_{ji} is the weight of the edge between V_j and V_i. The score of point V_i is defined as follows:
$$WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} \, WS(V_j)$$
where d is the damping coefficient, with a value between 0 and 1, representing the probability that a given point in the graph points to any other point.
Computing weights with the above algorithm requires building a graph from co-occurrence relations. This approach must first establish edges between all points and then select among them with a preset window to obtain the edges that carry an association and the candidate word nodes. The construction process is cumbersome and inefficient, and it does not yield the relative magnitude of each edge weight, so the keywords or summaries obtained by the TextRank algorithm are neither comprehensive nor accurate. In addition, the traditional way of converting text into numeric form is simplistic, and the resulting vectors have high dimensionality, which hinders computation and processing.
Summary of the invention
Embodiments of the present invention provide a text information extraction method, device, server, and storage medium, which address the problems that keywords or summaries obtained when extracting information with the TextRank algorithm are neither comprehensive nor accurate and that processing efficiency is low.
In a first aspect, an embodiment of the present invention provides a text information extraction method, comprising:
determining word vectors for the candidate words in a text by means of a Word2Vec model, and determining similarity values between different word vectors;
taking the word vectors as nodes, and constructing edges between the nodes according to the similarity values between the word vectors, to obtain a candidate word graph;
determining candidate word weights from the candidate word graph by means of the TextRank algorithm; and
determining the keywords of the text according to the candidate word weights.
In a second aspect, an embodiment of the present invention provides a text information extraction device, the device comprising:
a first determining module, configured to determine word vectors for the candidate words in a text by means of a Word2Vec model, and to determine similarity values between different word vectors;
a first building module, configured to take the word vectors as nodes and to construct edges between the nodes according to the similarity values between the word vectors, to obtain a candidate word graph;
a first weight determining module, configured to determine candidate word weights from the candidate word graph by means of the TextRank algorithm; and
a keyword determining module, configured to determine the keywords of the text according to the candidate word weights.
In a third aspect, an embodiment of the present invention provides a server, comprising:
one or more processors; and
a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the text information extraction method of any embodiment of the present invention.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the text information extraction method of any embodiment of the present invention.
In the text information extraction method, device, server, and storage medium provided by the embodiments of the present invention, converting candidate words into word vectors with a Word2Vec model lets each candidate word be represented by a low-dimensional vector, which improves processing efficiency; computing similarity values and building a graph from them makes the associations between candidate words explicit; and computing the candidate word weights with the TextRank algorithm allows the keywords of the text to be determined more accurately and comprehensively.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the drawings needed for the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a text information extraction method provided by Embodiment one of the present invention;
Fig. 2 is a flowchart of a text information extraction method provided by Embodiment two of the present invention;
Fig. 3 is a schematic structural diagram of a text information extraction device provided by Embodiment three of the present invention;
Fig. 4 is a schematic structural diagram of a server provided by Embodiment four of the present invention.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the present invention clearer, the technical solutions of the present invention are described clearly and completely below through embodiments, with reference to the drawings in the embodiments. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the protection scope of the present invention.
Embodiment one
Fig. 1 is a flowchart of a text information extraction method provided by Embodiment one of the present invention. The technical solution of this embodiment is applicable to extracting key information, such as keywords, from text. The method can be executed by a text information extraction device, which can be implemented in software and/or hardware and integrated in a server. The method comprises the following operations:
S110: Determine word vectors for the candidate words in the text by means of a Word2Vec model, and determine similarity values between different word vectors.
Specifically, text is crawled by a web crawler to serve as the text to be processed; the text may be news text from different fields. The text to be processed is cleaned to remove non-text information, such as punctuation marks, yielding plain text, which is split into complete sentences. A word segmentation tool segments the plain text and tags parts of speech; stop words are removed, and common words such as nouns, adjectives, and verbs are kept as candidate words. Optionally, the segmentation tool may be Ansj or jieba.
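The sentence splitting and candidate filtering described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the stop-word list, and the part-of-speech tags are hypothetical, and the tagged tokens are supplied by hand where a real pipeline would obtain them from a segmentation tool such as Ansj or jieba.

```python
import re

# Hypothetical stop-word list and part-of-speech tags (assumptions for
# illustration); a real pipeline would take both from its segmentation tool.
STOP_WORDS = {"the", "a", "of", "in", "is"}
KEPT_POS = {"n", "a", "v"}  # noun, adjective, verb


def split_sentences(text):
    """Split text into sentences on common Chinese and English end marks."""
    parts = re.split(r"[。！？.!?]+", text)
    return [part.strip() for part in parts if part.strip()]


def select_candidates(tagged_tokens):
    """Keep nouns, adjectives and verbs that are not stop words.

    tagged_tokens is a list of (word, pos) pairs.
    """
    return [
        word
        for word, pos in tagged_tokens
        if pos in KEPT_POS and word.lower() not in STOP_WORDS
    ]
```

For example, `select_candidates([("Shandong", "n"), ("a", "u"), ("good", "a")])` keeps "Shandong" and "good" and drops "a".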
The word vectors of the candidate words are determined by a Word2Vec model, which is trained before it is used for this purpose. For example, a large amount of text data of different types is selected, such as news text on society, people's livelihood, sports, and music; the data is cleaned and segmented to obtain candidate words, and the candidate words are written to a file in which each line holds the candidate words of one sentence. The model is then trained with one of the Word2Vec architectures, optionally the CBOW model based on hierarchical softmax or the Skip-Gram model based on hierarchical softmax, to obtain the Word2Vec model.
After the word vectors of the candidate words are obtained by mapping, the similarity values between different word vectors are determined. The similarity value can be expressed as the cosine of the angle between two word vectors, that is:
$$\text{similarity}(\text{"a"}, \text{"b"}) = \cos\theta = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}$$
where a and b denote two different candidate words, similarity("a", "b") denotes the similarity value between them, A and B denote the corresponding word vectors, A·B denotes their dot product, ‖A‖ and ‖B‖ denote the lengths of the two vectors, and n denotes the dimension of the word vectors. The formula yields the similarity value of two candidate words, for example:
Similarity (" Shandong ", " Jiangsu ")=0.41542658
Similarity (" Shandong ", " Beijing ")=0.19865009
Similarity (" Shandong ", " men's basketball ")=0.16770135.
Here the similarity value between the candidate words "Shandong" and "Jiangsu" is relatively large, indicating a high degree of association between them, whereas the similarity value between "Shandong" and "men's basketball" is small, indicating a low degree of association.
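The cosine similarity used above can be written in a few lines. The sketch below assumes word vectors are plain Python lists of floats; the function name and the choice to return 0.0 for a zero-length vector are assumptions, since the text does not address that case.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)
```

Vectors that point in the same direction score 1, and orthogonal vectors score 0.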
S120: Take the word vectors as nodes, and construct edges between the nodes according to the similarity values between the word vectors, to obtain a candidate word graph.
Specifically, the word vectors are taken as nodes to form a point set, denoted V. Optionally, constructing edges between the nodes according to the similarity values between the word vectors comprises: if the similarity value between two word vectors is greater than a preset first similarity threshold, constructing an edge between the two word vectors. The edges form an edge set, denoted E, which is a subset of V × V. The preset first similarity threshold can be set by a technician as needed. For example, with the threshold set to 0.450, a similarity value greater than 0.450 between two candidate words indicates a strong association, and an edge is constructed between the two candidate word nodes. The point set and the edge set give the candidate word graph G = (V, E), from which the associations between keywords can be obtained.
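The thresholded graph construction can be sketched as below. This is an illustration under assumptions: the graph is stored as a dictionary of neighbor dictionaries, edges are treated as undirected because cosine similarity is symmetric, and the names `build_graph` and `_cosine` are hypothetical.

```python
import math
from itertools import combinations


def _cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_graph(word_vectors, threshold=0.45):
    """Build a weighted graph: one node per word, one edge per similar pair.

    word_vectors maps each candidate word to its vector. An edge is added
    only when the similarity exceeds the threshold, and the similarity
    value is stored as the edge weight.
    """
    graph = {word: {} for word in word_vectors}
    for a, b in combinations(word_vectors, 2):
        sim = _cosine(word_vectors[a], word_vectors[b])
        if sim > threshold:
            graph[a][b] = sim
            graph[b][a] = sim
    return graph
```

Words whose similarity to every other word stays at or below the threshold remain in the graph as isolated nodes.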
S130: Determine candidate word weights from the candidate word graph by means of the TextRank algorithm.
Specifically, the weight of each candidate word is computed with the TextRank formula:
$$WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} \, WS(V_j)$$
where d is the damping coefficient, with a value between 0 and 1, representing the probability that a given point in the graph points to any other point, and is generally set to 0.85; V_i and V_j denote two different candidate word nodes; WS(V_i) and WS(V_j) denote the weights of the two candidate words; In(V_i) denotes the set of nodes that point to V_i; Out(V_j) denotes the set of nodes that V_j points to; and w_{ji} and w_{jk} denote the weights of the edges between two nodes, that is, the similarity values between the candidate words those nodes represent. Using this formula and the associations between candidate words in the candidate word graph, the weight of each candidate word is determined, and the node weights are propagated iteratively until convergence. Optionally, the convergence threshold can be set to 0.0001.
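The iteration described above can be sketched as follows. The sketch assumes an undirected graph stored as a dictionary of neighbor dictionaries, so that In(V_i) and Out(V_i) are both the neighbors of V_i; the initial score of 1.0, the iteration cap, and the function name are assumptions not stated in the text.

```python
def textrank(graph, d=0.85, tol=1e-4, max_iter=200):
    """Iterate the weighted TextRank update until scores stop changing.

    graph maps each node to a dict {neighbor: edge_weight}. The graph is
    assumed undirected, so graph[i] lists both In(i) and Out(i).
    """
    scores = {node: 1.0 for node in graph}
    # Total outgoing edge weight of each node: the denominator in the formula.
    out_weight = {node: sum(nbrs.values()) for node, nbrs in graph.items()}
    for _ in range(max_iter):
        updated = {}
        for i in graph:
            total = 0.0
            for j, w_ji in graph[i].items():
                if out_weight[j] > 0:
                    total += w_ji / out_weight[j] * scores[j]
            updated[i] = (1 - d) + d * total
        change = max((abs(updated[n] - scores[n]) for n in graph), default=0.0)
        scores = updated
        if change < tol:  # convergence threshold, 0.0001 in the text
            break
    return scores
```

In a star-shaped graph the center receives weight from every leaf and therefore ends with the highest score, while an isolated node settles at 1 - d.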
S140: Determine the keywords of the text according to the candidate word weights.
Optionally, determining the keywords of the text according to the candidate word weights comprises: sorting the candidate words in descending order of candidate word weight, and selecting the top-ranked candidate words as the keywords of the text. Specifically, once the TextRank algorithm has produced the candidate word weights, the candidate words are sorted by weight, optionally in descending order, and the highest-weighted candidate words are selected as the keywords of the text. The number of keywords selected can be set by a technician as needed.
For example, after the keywords of the text are determined according to the candidate word weights, the method may further comprise: if at least two keywords are adjacent in the text, combining those keywords. Specifically, the positions of the obtained keywords are marked in the original text; if at least two keywords occupy adjacent positions, the adjacent keywords are combined to form a multi-word keyword.
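Ranking the candidates and merging adjacent keywords can be sketched as below. Both function names are hypothetical, and the separator argument is an assumption: a space suits English tokens, while an empty string would suit Chinese.

```python
def top_keywords(weights, k):
    """Return the k candidate words with the largest weights, best first."""
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:k]]


def merge_adjacent(tokens, keywords, sep=" "):
    """Join keywords that sit next to each other in the token sequence.

    tokens is the segmented original text in order; runs of consecutive
    keywords become one multi-word keyword. Duplicates are dropped.
    """
    keyword_set = set(keywords)
    phrases, run = [], []
    for token in list(tokens) + [None]:  # None flushes the final run
        if token in keyword_set:
            run.append(token)
        elif run:
            phrase = sep.join(run)
            if phrase not in phrases:
                phrases.append(phrase)
            run = []
    return phrases
```

With the tokens "Shandong", "men's", "basketball", "won" and the keywords "men's" and "basketball", the two adjacent keywords are returned as the single phrase "men's basketball".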
In the text information extraction method provided by this embodiment of the present invention, word vectors for the candidate words in the text are first determined by a Word2Vec model, and the similarity values between different word vectors are determined. The word vectors are then taken as nodes, and edges between the nodes are constructed according to the similarity values between the word vectors, giving a candidate word graph. Candidate word weights are determined from the candidate word graph by the TextRank algorithm, and finally the keywords of the text are determined according to those weights. Converting candidate words into word vectors with the Word2Vec model lets each candidate word be represented by a low-dimensional vector, which improves processing efficiency; computing similarity values captures the connections between candidate words accurately; building the graph represents the associations between candidate words accurately and explicitly; and computing the candidate word weights with the TextRank algorithm allows the keywords of the text to be determined more accurately and comprehensively.
Embodiment two
Fig. 2 is a flowchart of a text information extraction method provided by Embodiment two of the present invention. This embodiment further refines the above embodiment; for content not described in detail here, see Embodiment one. As shown in Fig. 2, the text information extraction method provided by Embodiment two comprises the following steps:
S210: Determine word vectors for the candidate words in the text by means of a Word2Vec model, and determine similarity values between different word vectors.
S220: Take the word vectors as nodes, and construct edges between the nodes according to the similarity values between the word vectors, to obtain a candidate word graph.
S230: Determine candidate word weights from the candidate word graph by means of the TextRank algorithm.
S240: Determine the keywords of the text according to the candidate word weights.
S250: Determine the vector representation of each sentence from the word vectors of the candidate words that the sentence contains, and determine similarity values between the vector representations of different sentences.
When reading news text, a reader often wants not only the keywords of the text but also a summary that conveys its content more fully and specifically. For example, the word vectors of the candidate words contained in a sentence are summed to obtain a sentence vector, whose dimension is the same as that of the candidate word vectors. The similarity values between different sentences are determined from their vector representations. For example:
similarity("Shandong apples have a good harvest", "Farmers grow rice in Jiangsu") = 0.48500857
similarity("Shandong apples have a good harvest", "Shandong football loses") = 0.31601506.
Here the similarity value between "Shandong apples have a good harvest" and "Farmers grow rice in Jiangsu" is the larger of the two, indicating a stronger association between those two sentences.
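Summing word vectors into a sentence vector can be sketched as follows. The function name is hypothetical, and skipping words that have no vector is an assumption, since the text does not say how out-of-vocabulary words are handled.

```python
def sentence_vector(words, word_vectors, dim):
    """Sum the vectors of a sentence's candidate words, element by element.

    The result has the same dimension as the word vectors. Words missing
    from word_vectors are skipped (an assumption for illustration).
    """
    total = [0.0] * dim
    for word in words:
        vector = word_vectors.get(word)
        if vector is None:
            continue
        for i, value in enumerate(vector):
            total[i] += value
    return total
```

The resulting sentence vectors can be compared with the same cosine similarity that is used for word vectors.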
S260: Take the vector representations of the sentences as nodes, and construct edges between the nodes according to the similarity between the vector representations of the sentences, to obtain a sentence graph.
The vector representations of the sentences are taken as nodes to form a point set, denoted V′. Optionally, constructing edges between the nodes according to the similarity values between the vector representations of the sentences comprises: if the similarity value between the vector representations of two sentences is greater than a preset second similarity threshold, constructing an edge between the vector representations of the two sentences. The edges form an edge set, denoted E′, which is a subset of V′ × V′. The preset second similarity threshold can be set by a technician as needed. For example, with the threshold set to 0.550, a similarity value greater than 0.550 between two sentences indicates a strong association, and an edge is constructed between the two sentence nodes. The point set and the edge set give the sentence graph G′ = (V′, E′), from which the associations between sentences can be obtained.
S270: Determine sentence weights from the sentence graph by means of the TextRank algorithm.
Specifically, using the TextRank formula and the associations between the different sentences in the sentence graph, the weight of each sentence is determined, and the node weights are propagated iteratively until convergence. Optionally, the convergence threshold can be set to 0.0001.
S280: Determine the summary of the text according to the sentence weights.
Optionally, determining the summary of the text according to the sentence weights comprises: sorting the sentences in descending order of sentence weight, and selecting the top-ranked sentences to form the summary of the text. Specifically, once the TextRank algorithm has produced the sentence weights, the sentences are sorted by weight, optionally in descending order, and the highest-weighted sentences are selected to form the summary of the text. The number of sentences selected can be set by a technician as needed.
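Selecting the summary sentences can be sketched as below. The function name is hypothetical. The `keep_document_order` flag is an assumption added for illustration: the text only specifies sorting by weight, but a summary is often easier to read when the chosen sentences appear in their original order.

```python
def select_summary(sentences, weights, k, keep_document_order=False):
    """Pick the k highest-weighted sentences as the summary.

    sentences and weights are parallel lists. By default the result is in
    descending weight order, as the text describes.
    """
    ranked = sorted(range(len(sentences)), key=lambda i: weights[i], reverse=True)
    chosen = ranked[:k]
    if keep_document_order:
        chosen.sort()  # assumption: restore original sentence order
    return [sentences[i] for i in chosen]
```

With weights 0.5, 0.3, and 0.9 and k = 2, the default call returns the third sentence followed by the first, and the flag reverses that to document order.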
The text information extraction method provided by this embodiment of the present invention adds the following steps: determining the vector representation of each sentence from the word vectors of the candidate words that the sentence contains, and determining similarity values between the vector representations of different sentences; taking the vector representations of the sentences as nodes, and constructing edges between the nodes according to the similarity between the vector representations, to obtain a sentence graph; determining sentence weights from the sentence graph by means of the TextRank algorithm; and determining the summary of the text according to the sentence weights. Forming sentence vectors by summing candidate word vectors reduces the dimension of the sentence representation and improves processing efficiency. Computing similarity values and building the graph represents the associations between sentences more accurately and explicitly, and the TextRank algorithm then yields the sentence weights from which the summary is obtained. This addresses the low processing efficiency caused by high-dimensional sentence vectors and the incomplete summaries caused by errors in computing sentence similarity, so the resulting summary reflects the content of the text more comprehensively.
Embodiment three
Fig. 3 is a schematic structural diagram of a text information extraction device provided by Embodiment three of the present invention. As shown in Fig. 3, the device comprises:
a first determining module 310, configured to determine word vectors for the candidate words in a text by means of a Word2Vec model, and to determine similarity values between different word vectors;
a first building module 320, configured to take the word vectors as nodes and to construct edges between the nodes according to the similarity values between the word vectors, to obtain a candidate word graph;
a first weight determining module 330, configured to determine candidate word weights from the candidate word graph by means of the TextRank algorithm; and
a keyword determining module 340, configured to determine the keywords of the text according to the candidate word weights.
Optionally, the first building module 320 is specifically configured to:
construct an edge between two word vectors if the similarity value between them is greater than a preset first similarity threshold.
Optionally, the keyword determining module 340 is specifically configured to:
sort the candidate words in descending order of candidate word weight; and
select the top-ranked candidate words as the keywords of the text.
Optionally, the keyword determining module 340 is further configured to:
combine at least two keywords if their positions in the text are adjacent.
Optionally, the device further comprises:
a second determining module, configured to determine the vector representation of each sentence from the word vectors of the candidate words that the sentence contains, and to determine similarity values between the vector representations of different sentences;
a second building module, configured to take the vector representations of the sentences as nodes and to construct edges between the nodes according to the similarity between the vector representations, to obtain a sentence graph;
a second weight determining module, configured to determine sentence weights from the sentence graph by means of the TextRank algorithm; and
a summary determining module, configured to determine the summary of the text according to the sentence weights.
Optionally, the device further comprises:
a combining module, configured to combine at least two keywords if their positions in the text are adjacent.
The text information extraction device provided by this embodiment of the present invention belongs to the same inventive concept as the text information extraction method of the above embodiments. For technical details not described in this embodiment, see the above embodiments; this embodiment has the same beneficial effects as those embodiments.
Embodiment four
Fig. 4 is a structural diagram of a server provided by Embodiment four of the present invention. Fig. 4 shows a block diagram of an exemplary processing device 412 suitable for implementing embodiments of the present invention. The processing device 412 shown in Fig. 4 is only an example and should not limit the functions or scope of use of the embodiments of the present invention in any way.
As shown in Fig. 4, the processing device 412 takes the form of a general-purpose computing device. Its components may include, but are not limited to: one or more processors or processing units 416, a system memory 428, and a bus 418 connecting the different system components (including the system memory 428 and the processing unit 416).
The bus 418 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
The processing device 412 typically includes a variety of computer-system-readable media. These media can be any available media accessible by the processing device 412, including volatile and non-volatile media, and removable and non-removable media.
The system memory 428 may include computer-system-readable media in the form of volatile memory, such as random access memory (RAM) 430 and/or cache memory 432. The processing device 412 may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, the storage system 434 can be used to read from and write to non-removable, non-volatile magnetic media (not shown in Fig. 4, commonly called a "hard disk drive"). Although not shown in Fig. 4, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (such as a "floppy disk"), and an optical disc drive for reading from and writing to a removable non-volatile optical disc (such as a CD-ROM, DVD-ROM, or other optical media), may be provided. In these cases, each drive can be connected to the bus 418 through one or more data media interfaces. The memory 428 may include at least one program product having a set of (for example, at least one) program modules configured to perform the functions of the embodiments of the present invention.
A program/utility 440 having a set of (at least one) program modules 442 may be stored, for example, in the memory 428. Such program modules 442 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination of them, may include an implementation of a network environment. The program modules 442 generally perform the functions and/or methods of the embodiments described in the present invention.
The processing device 412 may also communicate with one or more external devices 414 (such as a keyboard, a pointing device, or a display 424), with one or more devices that enable a user to interact with the processing device 412, and/or with any device (such as a network card or modem) that enables the processing device 412 to communicate with one or more other computing devices. Such communication can take place through input/output (I/O) interfaces 422. The processing device 412 can also communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through a network adapter 420. As shown, the network adapter 420 communicates with the other modules of the processing device 412 through the bus 418. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in conjunction with the processing device 412, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
The processing unit 416 runs at least one of the programs stored in the system memory 428 to execute various functional applications and data processing, for example to implement the text information extraction method provided by the embodiments of the present invention, comprising:
The term vector of candidate word in text is determined by Word2Vec model, and determines the similarity between different term vectors Value;
Using term vector as node, and according to the side between the similarity value building node between term vector, candidate is obtained Word atlas;
Candidate word weight is determined according to the candidate word atlas by TextRank algorithm;
According to candidate word weight, the keyword of text is determined.
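The four steps above can be illustrated with a minimal sketch. It is not the implementation of the invention: the hand-written two-dimensional vectors stand in for the output of a trained Word2Vec model, cosine similarity is assumed as the similarity measure, and the function names, the threshold of 0.5, the damping factor of 0.85, and the fixed iteration count are all illustrative assumptions.

```python
import math
from itertools import combinations

# Toy vectors standing in for Word2Vec output (assumption: a real system
# would look these up in a trained Word2Vec model).
TOY_VECTORS = {
    "network": [1.0, 0.1],
    "router": [0.9, 0.2],
    "protocol": [0.8, 0.3],
    "banana": [0.0, 1.0],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def build_graph(vectors, threshold):
    """Create an undirected weighted edge for every pair of words whose
    similarity exceeds the threshold. Returns word -> {neighbor: weight}."""
    graph = {word: {} for word in vectors}
    for a, b in combinations(vectors, 2):
        sim = cosine(vectors[a], vectors[b])
        if sim > threshold:
            graph[a][b] = sim
            graph[b][a] = sim
    return graph

def textrank(graph, damping=0.85, iterations=50):
    """Weighted TextRank: each node repeatedly receives a share of its
    neighbors' scores proportional to the connecting edge weight."""
    scores = {node: 1.0 for node in graph}
    weight_sum = {node: sum(graph[node].values()) for node in graph}
    for _ in range(iterations):
        updated = {}
        for node in graph:
            incoming = sum(
                weight / weight_sum[nbr] * scores[nbr]
                for nbr, weight in graph[node].items()
            )
            updated[node] = (1 - damping) + damping * incoming
        scores = updated
    return scores

def extract_keywords(vectors, threshold=0.5, top_k=2):
    """Rank candidate words by TextRank score and keep the top_k."""
    scores = textrank(build_graph(vectors, threshold))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k]

if __name__ == "__main__":
    print(extract_keywords(TOY_VECTORS, threshold=0.5, top_k=3))
```

In this toy data, "banana" is not similar enough to any other word to receive an edge, so it keeps only the baseline score and ranks last; the three mutually similar words share the remaining weight.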
Embodiment five
Embodiment five of the present invention further provides a storage medium containing computer-executable instructions which, when executed by a computer processor, are used to perform a text information extraction method comprising:
determining word vectors of candidate words in a text by means of a Word2Vec model, and determining similarity values between different word vectors;
taking the word vectors as nodes, and constructing edges between the nodes according to the similarity values between the word vectors, to obtain a candidate word graph;
determining candidate word weights from the candidate word graph by means of the TextRank algorithm; and
determining keywords of the text according to the candidate word weights.
The computer storage medium of the embodiments of the present invention may be any combination of one or more computer-readable media. A computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection having one or more wires, a portable computer diskette, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In this document, a computer-readable storage medium may be any tangible medium that contains or stores a program, where the program may be used by or in connection with an instruction execution system, apparatus, or device.
A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code contained on a computer-readable medium may be transmitted over any suitable medium, including but not limited to wireless, wire, optical cable, RF, etc., or any suitable combination of the above.
Computer program code for carrying out the operations of the present invention may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will understand that the present invention is not limited to the specific embodiments described herein, and that various obvious changes, readjustments, and substitutions can be made by those skilled in the art without departing from the protection scope of the present invention. Therefore, although the present invention has been described in further detail through the above embodiments, it is not limited to those embodiments; it may include other equivalent embodiments without departing from the inventive concept, and its scope is determined by the scope of the appended claims.

Claims (10)

1. A text information extraction method, characterized in that the method comprises:
determining word vectors of candidate words in a text by means of a Word2Vec model, and determining similarity values between different word vectors;
taking the word vectors as nodes, and constructing edges between the nodes according to the similarity values between the word vectors, to obtain a candidate word graph;
determining candidate word weights from the candidate word graph by means of the TextRank algorithm; and
determining keywords of the text according to the candidate word weights.
2. The method according to claim 1, characterized in that constructing edges between the nodes according to the similarity values between the word vectors comprises:
if the similarity value between two word vectors is greater than a preset first similarity threshold, constructing an edge between the two word vectors.
3. The method according to claim 1, characterized in that determining keywords of the text according to the candidate word weights comprises:
sorting the candidate words in descending order of candidate word weight; and
selecting the top-ranked candidate words as keywords of the text.
4. The method according to claim 1, characterized in that, after determining the word vectors of the candidate words in the text by means of the Word2Vec model, the method further comprises:
determining a vector representation of each sentence according to the word vectors of the candidate words contained in that sentence of the text, and determining similarity values between the vector representations of different sentences;
taking the vector representations of the sentences as nodes, and constructing edges between the nodes according to the similarity between the vector representations of the sentences, to obtain a sentence graph;
determining sentence weights from the sentence graph by means of the TextRank algorithm; and
determining a summary of the text according to the sentence weights.
5. The method according to claim 1, characterized in that, after determining keywords of the text according to the candidate word weights, the method further comprises:
if at least two keywords occupy adjacent positions in the text, merging the at least two keywords.
6. A text information extraction device, characterized by comprising:
a first determining module, configured to determine word vectors of candidate words in a text by means of a Word2Vec model, and to determine similarity values between different word vectors;
a first building module, configured to take the word vectors as nodes and construct edges between the nodes according to the similarity values between the word vectors, to obtain a candidate word graph;
a first weight determining module, configured to determine candidate word weights from the candidate word graph by means of the TextRank algorithm; and
a keyword determining module, configured to determine keywords of the text according to the candidate word weights.
7. The device according to claim 6, characterized in that the first building module is specifically configured to:
if the similarity value between two word vectors is greater than a preset first similarity threshold, construct an edge between the two word vectors.
8. The device according to claim 6, characterized in that the keyword determining module is specifically configured to:
sort the candidate words in descending order of candidate word weight; and
select the top-ranked candidate words as keywords of the text.
9. A server, characterized by comprising:
one or more processors; and
a memory for storing one or more programs;
wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the text information extraction method according to any one of claims 1 to 5.
10. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the text information extraction method according to any one of claims 1 to 5.
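Two of the dependent method steps, the sentence vector representation of claim 4 and the merging of adjacent keywords of claim 5, can be sketched as follows. This is only an illustration under stated assumptions: the claims do not say how a sentence vector is derived from its word vectors, so an element-wise mean is assumed here; they also do not say how merged keywords are joined, so a configurable separator is assumed. The function names `sentence_vector` and `merge_adjacent_keywords` are hypothetical.

```python
def sentence_vector(sentence_tokens, word_vectors):
    """Represent a sentence as the element-wise mean of the vectors of the
    candidate words it contains (averaging is an assumption of this sketch).
    Returns None when the sentence contains no candidate word."""
    vecs = [word_vectors[tok] for tok in sentence_tokens if tok in word_vectors]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(vec[i] for vec in vecs) / len(vecs) for i in range(dim)]

def merge_adjacent_keywords(tokens, keywords, sep=" "):
    """Scan the tokenized text and join every maximal run of consecutive
    keywords into one phrase. A keyword with no keyword neighbor is kept
    as-is. Phrases are returned in order of first appearance."""
    keyword_set = set(keywords)
    phrases, run = [], []
    for tok in list(tokens) + [None]:  # the sentinel flushes the final run
        if tok in keyword_set:
            run.append(tok)
            continue
        if run:
            phrase = sep.join(run)
            if phrase not in phrases:
                phrases.append(phrase)
        run = []
    return phrases

if __name__ == "__main__":
    toy_vectors = {"neural": [1.0, 0.0], "network": [0.0, 1.0]}
    print(sentence_vector(["a", "neural", "network"], toy_vectors))
    print(merge_adjacent_keywords(
        ["the", "neural", "network", "learns"], ["neural", "network"]))
```

The sentence vectors produced this way would then serve as nodes of the sentence graph, with edges and TextRank weights computed as claim 4 describes. For Chinese text, where words are not separated by spaces, `sep=""` would be the natural choice.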
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811317522.9A CN109408826A (en) 2018-11-07 2018-11-07 A kind of text information extracting method, device, server and storage medium

Publications (1)

Publication Number Publication Date
CN109408826A true CN109408826A (en) 2019-03-01

Family

ID=65471876

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811317522.9A Pending CN109408826A (en) 2018-11-07 2018-11-07 A kind of text information extracting method, device, server and storage medium

Country Status (1)

Country Link
CN (1) CN109408826A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105893410A (en) * 2015-11-18 2016-08-24 乐视网信息技术(北京)股份有限公司 Keyword extraction method and apparatus
CN106227722A (en) * 2016-09-12 2016-12-14 中山大学 A kind of extraction method based on listed company's bulletin summary
CN106970910A (en) * 2017-03-31 2017-07-21 北京奇艺世纪科技有限公司 A kind of keyword extracting method and device based on graph model
CN107704503A (en) * 2017-08-29 2018-02-16 平安科技(深圳)有限公司 User's keyword extracting device, method and computer-readable recording medium
CN107832457A (en) * 2017-11-24 2018-03-23 国网山东省电力公司电力科学研究院 Power transmission and transforming equipment defect dictionary method for building up and system based on TextRank algorithm
US20180285928A1 (en) * 2017-03-29 2018-10-04 Ebay Inc. Generating keywords by associative context with input words

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
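Ning Jianfei et al.: "Research on Keyword Extraction Integrating Word2vec and TextRank" (融合Word2vec与TextRank的关键词抽取研究), New Technology of Library and Information Service (《现代图书情报技术》) *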

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110362678A (en) * 2019-06-04 2019-10-22 哈尔滨工业大学(威海) A kind of method and apparatus automatically extracting Chinese text keyword
CN110377725A (en) * 2019-07-12 2019-10-25 深圳新度博望科技有限公司 Data creation method, device, computer equipment and storage medium
CN110377725B (en) * 2019-07-12 2021-09-24 深圳新度博望科技有限公司 Data generation method and device, computer equipment and storage medium
CN110457699A (en) * 2019-08-06 2019-11-15 腾讯科技(深圳)有限公司 A kind of stop words method for digging, device, electronic equipment and storage medium
CN110457699B (en) * 2019-08-06 2023-07-04 腾讯科技(深圳)有限公司 Method and device for mining stop words, electronic equipment and storage medium
CN112445891A (en) * 2019-08-30 2021-03-05 智慧芽信息科技(苏州)有限公司 Text information navigation browsing method, device, server and storage medium
CN110705282A (en) * 2019-09-04 2020-01-17 东软集团股份有限公司 Keyword extraction method and device, storage medium and electronic equipment
CN110781669A (en) * 2019-10-24 2020-02-11 泰康保险集团股份有限公司 Text key information extraction method and device, electronic equipment and storage medium
CN110888990A (en) * 2019-11-22 2020-03-17 深圳前海微众银行股份有限公司 Text recommendation method, device, equipment and medium
CN110888990B (en) * 2019-11-22 2024-04-12 深圳前海微众银行股份有限公司 Text recommendation method, device, equipment and medium
CN111125348A (en) * 2019-11-25 2020-05-08 北京明略软件系统有限公司 Text abstract extraction method and device
CN111241288A (en) * 2020-01-17 2020-06-05 烟台海颐软件股份有限公司 Emergency sensing system of large centralized power customer service center and construction method
CN111640025A (en) * 2020-06-09 2020-09-08 国泰君安证券股份有限公司 Method for realizing information labeling processing based on label system
CN111640025B (en) * 2020-06-09 2023-08-01 国泰君安证券股份有限公司 Method for realizing information labeling processing based on label system
CN112347288A (en) * 2020-11-10 2021-02-09 北京北大方正电子有限公司 Character and picture vectorization method
CN112347288B (en) * 2020-11-10 2024-02-20 北京北大方正电子有限公司 Vectorization method of word graph
CN112560477A (en) * 2020-12-09 2021-03-26 中科讯飞互联(北京)信息科技有限公司 Text completion method, electronic device and storage device
CN112560477B (en) * 2020-12-09 2024-04-16 科大讯飞(北京)有限公司 Text completion method, electronic equipment and storage device
CN113064979A (en) * 2021-03-10 2021-07-02 国网河北省电力有限公司 Keyword retrieval-based method for judging construction period and price reasonability
CN113326374B (en) * 2021-05-25 2022-12-20 成都信息工程大学 Short text emotion classification method and system based on feature enhancement
CN113326374A (en) * 2021-05-25 2021-08-31 成都信息工程大学 Short text emotion classification method and system based on feature enhancement
CN113656429A (en) * 2021-07-28 2021-11-16 广州荔支网络技术有限公司 Keyword extraction method and device, computer equipment and storage medium
CN113672705A (en) * 2021-08-27 2021-11-19 工银科技有限公司 Resume screening method, apparatus, device, medium and program product
CN114912425A (en) * 2022-05-17 2022-08-16 中国银行股份有限公司 Presentation generation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190301