CN106776530A - Key words extraction method and device - Google Patents
- Publication number
- CN106776530A
- Authority
- CN
- China
- Prior art keywords
- word
- document
- descriptor
- matrix
- latent semantic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/258—Heading extraction; Automatic titling; Numbering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a topic word (keyword) extraction method and device. The method includes: obtaining all documents from which topic words need to be extracted and the words appearing in those documents; building a word-document matrix based on the frequency with which each word occurs in the documents, where each row of the word-document matrix represents the term frequency information of every word in one document and each column represents the term frequency information of one word in every document; performing semantic analysis on the word-document matrix using a latent semantic analysis model to generate a latent semantic space; and extracting, according to the latent semantic space, the topic words of all the documents from which topic words need to be extracted. The invention solves the technical problem that polysemy (one word with several meanings) or synonymy (several words with one meaning) degrades the quality of topic word extraction.
Description
Technical field
The present invention relates to the field of natural language processing, and in particular to a topic word extraction method and device.
Background
A topic embodies the central idea expressed by a document and is one of the effective means by which a computer can represent a document. Extracting topic information helps in understanding the useful information in a document and improves the efficiency with which computers process documents. Topic extraction is currently an active area of natural language processing.
Taking Chinese topic extraction as an example, the topic extraction task is generally divided into three levels: topic words, topic concepts, and topic sentences. Although a single topic word does not carry as definite a meaning as a topic concept or a topic sentence, a set of topic words can clearly describe a topic and is easier for a computer to process.
In the related art, a topic word extraction method is provided whose procedure is as follows: (1) collect a large number of documents to build a large-scale document collection, count the frequency with which words occur across all documents, and build a word-document frequency model (Inverse Document Frequency, IDF); (2) for a document from which topics need to be extracted, count the term frequency (Term Frequency, TF) of each word in that document; (3) build a weight calculation model based on the term frequency information, determine the weight of each word in the document, and sort all words by weight; (4) according to a preset threshold, output the top-n words from the sorted list of the previous step.
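The four steps of this related-art procedure can be sketched as follows. This is a minimal illustration, not the method of any particular system: the smoothed IDF formula, the tie-breaking rule, and the function names `build_idf` and `top_n_words` are assumptions made for the example.

```python
import math
from collections import Counter

def build_idf(corpus):
    """Step (1): learn a smoothed IDF table from a list of tokenized documents."""
    n_docs = len(corpus)
    doc_freq = Counter(word for doc in corpus for word in set(doc))
    return {w: math.log((1 + n_docs) / (1 + df)) + 1.0 for w, df in doc_freq.items()}

def top_n_words(doc, idf, n):
    """Steps (2)-(4): weight each word by TF * IDF, sort, and keep the top n."""
    tf = Counter(doc)
    # Assumption: words never seen in the corpus are treated as maximally rare.
    unseen_idf = max(idf.values(), default=1.0)
    scores = {w: (c / len(doc)) * idf.get(w, unseen_idf) for w, c in tf.items()}
    # Sort by descending weight; ties are broken alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:n]

corpus = [["cat", "sat", "mat"], ["cat", "ate", "fish"], ["dog", "ate", "bone"]]
idf = build_idf(corpus)
print(top_n_words(["cat", "fish", "fish", "ate"], idf, n=2))
```

Because the ranking depends only on counts, a frequent but uninformative word can outrank a semantically central one, which is the weakness discussed next.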
The inventors found that the above method has the following shortcomings: (1) a topic word extraction model based on term frequency information relies on term frequency when extracting topic words and is easily affected by high-frequency noise words, so the extracted topic words and topic word sets are easily polluted by such words and extraction quality cannot be guaranteed; (2) topic word extraction based on weight ranking cannot take the semantics of each word into account no matter how the weight calculation model is changed, and therefore cannot handle problems such as polysemy or synonymy in Chinese; that is, it cannot effectively distinguish word meanings, which affects the quality of the extracted topic words and topic word sets. In addition, the above scheme requires learning an IDF model. An IDF model works well on web-wide data that is not divided by domain, but its effect drops markedly when processing documents from a single domain, so the IDF model generally has to be retrained for that domain, which is inflexible.
No effective solution to the above problems has yet been proposed.
Summary of the invention
The embodiments of the invention provide a topic word extraction method and device, to solve at least the technical problem that polysemy or synonymy degrades the quality of topic word extraction.
According to one aspect of the embodiments of the invention, a topic word extraction method is provided, including: obtaining all documents from which topic words need to be extracted and the words appearing in those documents; building a word-document matrix based on the frequency with which each word occurs in the documents, where each row of the word-document matrix represents the term frequency information of every word in one document and each column represents the term frequency information of one word in every document; performing semantic analysis on the word-document matrix using a latent semantic analysis model to generate a latent semantic space; and extracting, according to the latent semantic space, the topic words of all the documents from which topic words need to be extracted.
Further, performing semantic analysis on the word-document matrix using the latent semantic analysis model to generate the latent semantic space includes: analyzing the correspondence between the words and the documents in the word-document matrix using the latent semantic analysis model; and mapping the words and documents in the word-document matrix, according to the correspondence, into a vector space that satisfies a predetermined dimension condition, thereby generating the latent semantic space.
Further, performing semantic analysis on the word-document matrix using the latent semantic analysis model to generate the latent semantic space includes: performing semantic analysis on the word-document matrix using a singular value decomposition model, a non-negative matrix factorization model, or a probabilistic latent semantic model, to generate the latent semantic space.
Further, extracting the topic words of the documents according to the latent semantic space includes: determining a topic-word matrix according to the latent semantic space, where each row of the topic-word matrix represents a semantic category of topic words and each column represents a word that occurs in the documents from which topic words need to be extracted; sorting the words in each row of the topic-word matrix by weight; and extracting the words whose weight exceeds a preset threshold in the sorted topic-word matrix as the topic words of the documents.
Further, obtaining all documents from which topic words need to be extracted and the words appearing in those documents includes: obtaining all documents from which topic words need to be extracted; and performing word segmentation on these documents to obtain the words appearing in them.
According to another aspect of the embodiments of the invention, a topic word extraction device is also provided, including: an acquiring unit, for obtaining all documents from which topic words need to be extracted and the words appearing in those documents; a construction unit, for building a word-document matrix based on the frequency with which each word occurs in the documents, where each row of the word-document matrix represents the term frequency information of every word in one document and each column represents the term frequency information of one word in every document; a generation unit, for performing semantic analysis on the word-document matrix using a latent semantic analysis model to generate a latent semantic space; and an extracting unit, for extracting, according to the latent semantic space, the topic words of all the documents from which topic words need to be extracted.
Further, the generation unit includes: an analysis module, for analyzing the correspondence between the words and the documents in the word-document matrix using the latent semantic analysis model; and a generation module, for mapping the words and documents in the word-document matrix, according to the correspondence, into a vector space that satisfies a predetermined dimension condition, thereby generating the latent semantic space.
Further, the generation unit is also used to perform semantic analysis on the word-document matrix using a singular value decomposition model, a non-negative matrix factorization model, or a probabilistic latent semantic model, to generate the latent semantic space.
Further, the extracting unit includes: a determining module, for determining a topic-word matrix according to the latent semantic space, where each row of the topic-word matrix represents a semantic category of topic words and each column represents a word that occurs in the documents from which topic words need to be extracted; a sorting module, for sorting the words in each row of the topic-word matrix by weight; and an extraction module, for extracting the words whose weight exceeds a preset threshold in the sorted topic-word matrix as the topic words of the documents.
Further, the acquiring unit includes: an acquisition module, for obtaining all documents from which topic words need to be extracted; and a word segmentation module, for performing word segmentation on these documents to obtain the words appearing in them.
In the embodiments of the invention, topic words are extracted on the basis of semantic analysis results: all documents from which topic words need to be extracted and the words appearing in them are obtained; a word-document matrix is built based on the frequency with which each word occurs in the documents, where each row represents the term frequency information of every word in one document and each column represents the term frequency information of one word in every document; semantic analysis is performed on the word-document matrix using a latent semantic analysis model to generate a latent semantic space; and the topic words of the documents are extracted according to the latent semantic space. This achieves the purpose of extracting topic words based on semantic analysis results, thereby improving the quality of topic word extraction and solving the technical problem that polysemy or synonymy degrades that quality.
Brief description of the drawings
The drawings described here provide a further understanding of the invention and form part of this application. The illustrative embodiments of the invention and their descriptions explain the invention and do not unduly limit it. In the drawings:
Fig. 1 is a flowchart of an optional topic word extraction method according to an embodiment of the invention;
Fig. 2 is a schematic diagram of an optional topic word extraction device according to an embodiment of the invention.
Detailed description of the embodiments
To help those skilled in the art better understand the solution of the invention, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the invention.
Note that the terms "first", "second", and so on in the description, the claims, and the above drawings are used to distinguish similar objects and not to describe a particular order or sequence. Data so termed are interchangeable where appropriate, so that the embodiments described here can be implemented in orders other than those illustrated or described. In addition, the terms "include" and "have" and any variants of them are intended to cover non-exclusive inclusion: for example, a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, and may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.
Embodiment 1
According to an embodiment of the invention, a method embodiment of a topic word extraction method is provided. Note that the steps shown in the flowchart of the drawings can be executed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is shown in the flowchart, in some cases the steps shown or described can be executed in a different order.
Fig. 1 is a flowchart of an optional topic word extraction method according to an embodiment of the invention. As shown in Fig. 1, the method includes the following steps:
Step S102, obtaining all documents from which topic words need to be extracted and the words appearing in those documents;
Step S104, building a word-document matrix based on the frequency with which each word occurs in the documents, where each row of the word-document matrix represents the term frequency information of every word in one document and each column represents the term frequency information of one word in every document;
Step S106, performing semantic analysis on the word-document matrix using a latent semantic analysis model to generate a latent semantic space;
Step S108, extracting, according to the latent semantic space, the topic words of all the documents from which topic words need to be extracted.
For example, suppose there are N documents from which topic words need to be extracted, and these documents involve M words in total. The document set is denoted D = {d1, d2, d3, ..., dN} and the set of the M words is denoted W = {w1, w2, w3, ..., wM}. An N×M word-document matrix A can then be built from these documents and words. Matrix A is as follows:
$$A = \begin{pmatrix} a_{11} & \cdots & a_{1M} \\ \vdots & \ddots & \vdots \\ a_{N1} & \cdots & a_{NM} \end{pmatrix}$$
Each row of matrix A corresponds to one document, and each element in the row represents the term frequency of the corresponding word in that document. Each column corresponds to one word, and each element in the column represents the term frequency of that word in the corresponding document. Specifically, element $a_{ij}$ of A is obtained from D and W through the mapping $a_{ij} = D_i W_j$ and represents the term frequency of word j in document i.
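As a minimal sketch of this construction, the following builds an N×M count matrix from tokenized documents. The function name `build_word_document_matrix` and the choice of raw counts as the term frequency information are assumptions for illustration; the text does not prescribe a specific term frequency measure.

```python
from collections import Counter

def build_word_document_matrix(docs):
    """docs: list of N tokenized documents. Returns (vocab, A) with A of size N x M."""
    vocab = sorted({word for doc in docs for word in doc})
    column = {word: j for j, word in enumerate(vocab)}
    A = [[0] * len(vocab) for _ in docs]
    for i, doc in enumerate(docs):
        for word, count in Counter(doc).items():
            A[i][column[word]] = count  # a_ij: frequency of word j in document i
    return vocab, A

vocab, A = build_word_document_matrix([["apple", "banana", "apple"],
                                       ["banana", "cherry"]])
print(vocab)
print(A)
```

Here row 0 describes the first document and column 0 describes the word "apple" across both documents.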
Further, a normalization factor can be calculated from matrix A and each row vector normalized. There are many ways to compute the normalization factor and no limitation is imposed here; for example, L2 normalization can be used to normalize the vectors. Specifically, the L2 normalization factor is computed as follows, where $d_1, \dots, d_n$ are the components of the vector being normalized:
$$\mathrm{Norm} = \sqrt{(d_1)^2 + \cdots + (d_n)^2}$$
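A short sketch of L2 row normalization follows. Leaving an all-zero row unchanged is an assumption made here to avoid division by zero; the text does not say how such rows are handled.

```python
import math

def l2_normalize_rows(A):
    """Divide every row of A by its L2 norm (square root of the sum of squares)."""
    normalized = []
    for row in A:
        norm = math.sqrt(sum(x * x for x in row))
        # Assumption: an all-zero row is returned unchanged.
        normalized.append([x / norm for x in row] if norm > 0 else list(row))
    return normalized

print(l2_normalize_rows([[3, 4], [0, 0]]))
```

After normalization every non-zero row has unit length, so long documents no longer dominate short ones purely because of their word counts.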
Through the above steps, chapter-level documents can be processed using latent semantic analysis. This remedies the shortcomings of extracting topic words from term frequency alone: word semantics are taken into account to reduce the influence of noise words on topic word quality, so that the topic words used to represent a topic cover the document's information better and express the topic more completely. The quality of the extracted topic words is thereby effectively improved, the extracted topics are more generally applicable in downstream use, and tasks such as similarity calculation or document retrieval benefit considerably.
Optionally, performing semantic analysis on the word-document matrix using the latent semantic analysis model to generate the latent semantic space includes:
S2, analyzing the correspondence between the words and the documents in the word-document matrix using the latent semantic analysis model;
S4, mapping the words and documents in the word-document matrix, according to the correspondence, into a vector space that satisfies a predetermined dimension condition, thereby generating the latent semantic space.
The purpose of latent semantic analysis is to find the true meaning of each word in a document, that is, its latent semantics, and thereby obtain the semantic information of words and the relationship between words and topics. Specifically, generating the latent semantic space means modeling a large document collection in a space of reasonable dimension and representing both words and documents in that space. For example, given 2000 documents containing 7000 words, latent semantic analysis represents the words and documents, according to their correspondence, in a vector space of dimension 100.
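The example above can be sketched at a smaller scale (20 documents, 70 words, 10 dimensions instead of 2000, 7000, and 100). This is an illustration using NumPy's `numpy.linalg.svd`; the random count matrix, the function name `latent_space`, and the convention of scaling coordinates by the singular values are assumptions, not details given in the text.

```python
import numpy as np

def latent_space(A, k):
    """Map the documents (rows of A) and words (columns of A) into k dimensions."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    doc_vectors = U[:, :k] * s[:k]       # one k-dimensional vector per document
    word_vectors = Vt[:k, :].T * s[:k]   # one k-dimensional vector per word
    return doc_vectors, word_vectors

rng = np.random.default_rng(0)
A = rng.poisson(0.3, size=(20, 70)).astype(float)  # synthetic word counts
doc_vectors, word_vectors = latent_space(A, k=10)
print(doc_vectors.shape, word_vectors.shape)
```

Both words and documents end up as points in the same 10-dimensional space, which is what allows them to be compared with each other.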
Through this embodiment, extracting topics based on the latent semantic analysis model reduces the influence of noise words, so that the extracted topic words describe the topics of the documents better.
Based on the above embodiment, optionally, performing semantic analysis on the word-document matrix using the latent semantic analysis model to generate the latent semantic space includes:
S6, performing semantic analysis on the word-document matrix using a singular value decomposition model, a non-negative matrix factorization (NMF) model, or a probabilistic latent semantic indexing (pLSI) model, to generate the latent semantic space.
The process of generating the latent semantic space is described in detail below, taking the singular value decomposition (K-SVD) model as an example.
Singular value decomposition (SVD) is an important matrix factorization in linear algebra. It generalizes the unitary diagonalization of normal matrices in matrix analysis and has important applications in fields such as signal processing and statistics. A unitary matrix U is an n×n complex matrix satisfying $U^T U = U U^T = E_n$, where $U^T$ is the conjugate transpose of U and $E_n$ is the identity matrix of order n. In linear algebra, the column rank of a matrix is the maximum number of linearly independent columns of the matrix. Similarly, the row rank is the maximum number of linearly independent rows of the matrix.
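In implementation, the word-document matrix is processed using SVD: matrix A is decomposed as $A = U \Sigma V^T$ into three matrices U, Σ, and $V^T$, where Σ is a diagonal matrix whose diagonal elements are the singular values of A (the square roots of the eigenvalues of $A^T A$). A simple way to compute $A = U \Sigma V^T$ is described below:
(1) Find the unitarily similar diagonal matrix of $A^T A$ and the corresponding unitary matrix V: $V^T (A^T A) V = \begin{pmatrix} \Delta^2 & 0 \\ 0 & 0 \end{pmatrix}$, where $\Delta = \mathrm{diag}(\sigma_1, \dots, \sigma_r)$ holds the r nonzero singular values of A;
(2) Write $V = (V_1, V_2)$, with $V_1 \in C^{n \times r}$ and $V_2 \in C^{n \times (n-r)}$;
(3) Let $U_1 = A V_1 \Delta^{-1}$, with $U_1 \in C^{m \times r}$;
(4) Extend $U_1$ to a unitary matrix $U = (U_1, U_2)$;
(5) Construct the singular value decomposition: $A = U \begin{pmatrix} \Delta & 0 \\ 0 & 0 \end{pmatrix} V^T$.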
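Steps (1) to (3) can be sketched for a real matrix with NumPy as shown below. This is an illustrative version only: it returns the compact factors $U_1$, Δ, and $V_1$ and skips the extension to a full unitary U in step (4), since $A = U_1 \Delta V_1^T$ already holds. The function name `svd_via_eig` and the tolerance `tol` used to decide which eigenvalues count as nonzero are assumptions.

```python
import numpy as np

def svd_via_eig(A, tol=1e-10):
    """Compact SVD of a real matrix A via the eigendecomposition of A^T A."""
    A = np.asarray(A, dtype=float)
    # Step (1): diagonalize the symmetric matrix A^T A.
    eigenvalues, V = np.linalg.eigh(A.T @ A)
    order = np.argsort(eigenvalues)[::-1]          # largest eigenvalue first
    eigenvalues, V = eigenvalues[order], V[:, order]
    # Step (2): V1 holds the eigenvectors of the r nonzero eigenvalues.
    r = int(np.sum(eigenvalues > tol))
    V1 = V[:, :r]
    delta = np.sqrt(eigenvalues[:r])               # singular values of A
    # Step (3): U1 = A V1 Delta^{-1} (divide each column by its singular value).
    U1 = (A @ V1) / delta
    return U1, delta, V1

A = np.array([[2.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
U1, delta, V1 = svd_via_eig(A)
print(np.allclose(U1 @ np.diag(delta) @ V1.T, A))
```

In practice a library routine such as `numpy.linalg.svd` is numerically more robust than forming $A^T A$ explicitly; the sketch only mirrors the derivation.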
Each singular value in Σ corresponds to the weight of one "semantic" dimension. Further, weights that are not important can be set to 0: all dimension values below a certain weight threshold are set to 0 so that only the most important dimensions are retained. The resulting latent semantic space can thus filter out some noise words.
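A minimal sketch of this singular value filtering, assuming NumPy and a hypothetical function name `filter_singular_values`; the threshold value is chosen only for the example.

```python
import numpy as np

def filter_singular_values(A, weight_threshold):
    """Decompose A and set every singular value below the threshold to 0."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_filtered = np.where(s >= weight_threshold, s, 0.0)
    return U, s_filtered, Vt

A = np.diag([5.0, 2.0, 0.1])
U, s_filtered, Vt = filter_singular_values(A, weight_threshold=1.0)
print(s_filtered)
# Rebuilding the matrix from the filtered factors drops the weak dimension.
print((U * s_filtered) @ Vt)
```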
Through this embodiment, using singular value decomposition, two kinds of filtering (singular value filtering and membership filtering) remove the semantic categories of topic words with small weights and the words with low membership. This eliminates the influence of high-frequency noise words, so that the extracted topic words describe the topics of the documents better.
Optionally, extracting the topic words of all the documents from which topic words need to be extracted according to the latent semantic space includes:
S8, determining a topic-word matrix according to the latent semantic space, where each row of the topic-word matrix represents a semantic category of topic words and each column represents a word that occurs in the documents from which topic words need to be extracted;
S10, sorting the words in each row of the topic-word matrix by weight;
S12, extracting the words whose weight exceeds a preset threshold in the sorted topic-word matrix as the topic words of the documents.
Based on the foregoing embodiment, after singular value decomposition of the word-document matrix A, the diagonal matrix Σ and the matrix $V^T$ are taken from the three resulting matrices, and an intermediate matrix $T_1$ is obtained as the product $T_1 = \Sigma V^T$. The all-zero rows and all-zero columns of $T_1$ are filtered out to obtain the final topic-word matrix $T_2$. The rows of $T_2$ represent the semantic categories of the extracted topic words, the columns represent the words in the documents, and each element represents the degree of membership between the word of its column and the topic of its row. Each row of $T_2$ is then sorted by weight, and the words whose weight exceeds the weight threshold, together with their weights, are added to the topic set as topic words and topic information. The resulting topic word set is used to represent the topic of each document.
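The construction of $T_1$ and $T_2$ and the per-row selection can be sketched as follows. The function names `topic_word_matrix` and `extract_topic_words` are hypothetical, and the sketch ranks raw entries; since the signs of SVD factors are arbitrary, a real system might rank absolute values instead, which the text does not specify. The inputs are hand-picked so the result is easy to follow.

```python
import numpy as np

def topic_word_matrix(s, Vt, vocab):
    """Form T1 = Sigma V^T, then drop all-zero rows and columns to get T2."""
    T1 = np.diag(s) @ Vt
    keep_rows = ~np.all(T1 == 0, axis=1)
    keep_cols = ~np.all(T1 == 0, axis=0)
    T2 = T1[keep_rows][:, keep_cols]
    words = [w for w, keep in zip(vocab, keep_cols) if keep]
    return T2, words

def extract_topic_words(T2, words, weight_threshold):
    """Sort each row by weight and keep the words above the threshold."""
    topics = []
    for row in T2:
        order = np.argsort(-row)  # indices from largest to smallest weight
        topics.append([(words[j], float(row[j])) for j in order
                       if row[j] > weight_threshold])
    return topics

s = np.array([2.0, 0.0])                      # second dimension already filtered out
Vt = np.array([[0.6, 0.0, 0.8],
               [0.8, 0.0, -0.6]])
T2, words = topic_word_matrix(s, Vt, ["apple", "banana", "cherry"])
print(words)
print(extract_topic_words(T2, words, weight_threshold=1.0))
```

The zeroed singular value produces an all-zero row, and the word "banana" produces an all-zero column, so both disappear from $T_2$.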
Note that, depending on the task, the weight threshold can be of two kinds. The first is an integer m, meaning that the top m topic words related to a topic are extracted to represent the topic of the document. The second is a decimal f, meaning that all words whose weight is greater than f are extracted as topic words to represent the topic of the document.
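The two threshold types can be handled by one small selection function. This is a sketch under the assumption that the threshold's Python type (`int` versus `float`) is used to tell the two cases apart; the function name `select_topic_words` is hypothetical.

```python
def select_topic_words(weighted_words, threshold):
    """weighted_words: list of (word, weight) pairs for one topic.

    An integer threshold m keeps the top m words; a decimal threshold f keeps
    every word whose weight is greater than f.
    """
    ranked = sorted(weighted_words, key=lambda pair: -pair[1])
    if isinstance(threshold, int):
        return [word for word, _ in ranked[:threshold]]
    return [word for word, weight in ranked if weight > threshold]

topic = [("b", 0.5), ("a", 0.9), ("c", 0.1)]
print(select_topic_words(topic, 2))    # integer type: top 2 words
print(select_topic_words(topic, 0.6))  # decimal type: weights above 0.6
```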
Optionally, obtaining all documents from which topic words need to be extracted and the words appearing in those documents includes:
S14, obtaining all documents from which topic words need to be extracted;
S16, performing word segmentation on these documents to obtain the words appearing in them.
That is, after all documents from which topic words need to be extracted are obtained, they need to be preprocessed: word segmentation is performed on the documents to obtain the words they involve, and the term frequency information of these words is counted. For Chinese documents, a Chinese word segmentation tool can be used, turning each long text document into a set of words. To improve the quality of the extracted topic words and reduce the influence of high-frequency noise words, common Chinese stop words such as "uh" can be filtered out after segmentation.
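The preprocessing described above can be sketched as follows. The text does not name a segmentation tool, so the `segment` parameter is a placeholder that defaults to whitespace splitting; for Chinese text it would be replaced by a real word segmenter. The stop-word list and the function name `preprocess` are likewise illustrative assumptions.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "uh"}  # illustrative stop list, not from the source

def preprocess(text, segment=str.split, stop_words=STOP_WORDS):
    """Segment a document, drop stop words, and count term frequencies."""
    words = [w for w in segment(text) if w not in stop_words]
    return words, Counter(words)

words, term_freq = preprocess("uh the cat saw a cat")
print(words)
print(term_freq)
```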
Through this embodiment, no large-scale corpus model needs to be trained in advance; the method is flexible to use and applies generally to document collections from different domains as well as to web-wide data.
Embodiment 2
According to an embodiment of the invention, a device embodiment of a topic word extraction device is provided.
Fig. 2 is a schematic diagram of an optional topic word extraction device according to an embodiment of the invention. As shown in Fig. 2, the device includes: an acquiring unit 202, for obtaining all documents from which topic words need to be extracted and the words appearing in those documents; a construction unit 204, for building a word-document matrix based on the frequency with which each word occurs in the documents, where each row of the word-document matrix represents the term frequency information of every word in one document and each column represents the term frequency information of one word in every document; a generation unit 206, for performing semantic analysis on the word-document matrix using a latent semantic analysis model to generate a latent semantic space; and an extracting unit 208, for extracting, according to the latent semantic space, the topic words of all the documents from which topic words need to be extracted.
For example, suppose there are N documents from which topic words need to be extracted, and these documents involve M words in total. The document set is denoted D = {d1, d2, d3, ..., dN} and the set of the M words is denoted W = {w1, w2, w3, ..., wM}. An N×M word-document matrix A can then be built from these documents and words. Matrix A is as follows:
$$A = \begin{pmatrix} a_{11} & \cdots & a_{1M} \\ \vdots & \ddots & \vdots \\ a_{N1} & \cdots & a_{NM} \end{pmatrix}$$
Each row of matrix A corresponds to one document, and each element in the row represents the term frequency of the corresponding word in that document. Each column corresponds to one word, and each element in the column represents the term frequency of that word in the corresponding document. Specifically, element $a_{ij}$ of A is obtained from D and W through the mapping $a_{ij} = D_i W_j$ and represents the term frequency of word j in document i.
Further, a normalization factor can be calculated from matrix A and each row vector normalized. There are many ways to compute the normalization factor and no limitation is imposed here; for example, L2 normalization can be used to normalize the vectors. Specifically, the L2 normalization factor is computed as follows, where $d_1, \dots, d_n$ are the components of the vector being normalized:
$$\mathrm{Norm} = \sqrt{(d_1)^2 + \cdots + (d_n)^2}$$
Through the above embodiment, chapter-level documents can be processed using latent semantic analysis. This remedies the shortcomings of extracting topic words from term frequency alone: word semantics are taken into account to reduce the influence of noise words on topic word quality, so that the topic words used to represent a topic cover the document's information better and express the topic more completely. The quality of the extracted topic words is thereby effectively improved, the extracted topics are more generally applicable in downstream use, and tasks such as similarity calculation or document retrieval benefit considerably.
Optionally, the generation unit includes: an analysis module, for analyzing the correspondence between the words and the documents in the word-document matrix using the latent semantic analysis model; and a generation module, for mapping the words and documents in the word-document matrix, according to the correspondence, into a vector space that satisfies a predetermined dimension condition, thereby generating the latent semantic space.
The purpose of latent semantic analysis is to find the true meaning of each word in a document, that is, its latent semantics, and thereby obtain the semantic information of words and the relationship between words and topics. Specifically, generating the latent semantic space means modeling a large document collection in a space of reasonable dimension and representing both words and documents in that space. For example, given 2000 documents containing 7000 words, latent semantic analysis represents the words and documents, according to their correspondence, in a vector space of dimension 100.
Through this embodiment, extracting topics based on the latent semantic analysis model reduces the influence of noise words, so that the extracted topic words describe the topics of the documents better.
Based on the above embodiment, optionally, the generation unit is also used to perform semantic analysis on the word-document matrix using a singular value decomposition model, a non-negative matrix factorization model, or a probabilistic latent semantic model, to generate the latent semantic space.
The process of generating the latent semantic space using the singular value decomposition (K-SVD) model is the same as that described in Embodiment 1 and is not repeated here.
Through this embodiment, using singular value decomposition, two kinds of filtering (singular value filtering and membership filtering) remove the semantic categories of topic words with small weights and the words with low membership. This eliminates the influence of high-frequency noise words, so that the extracted topic words describe the topics of the documents better.
Alternatively, above-mentioned extracting unit includes:Determining module, for determining descriptor word according to latent semantic space
Matrix, wherein, every a line of descriptor word matrix represents the semantic classes of descriptor, and each row are represented in all need
Extract the word occurred in the document of descriptor;Order module, for every a line word in descriptor word matrix
Sorted by its weighted value;Abstraction module, default threshold is more than for weighted value in the descriptor word matrix after extraction sequence
The word of value as it is in need extract descriptor document descriptor.
Based on the foregoing embodiment, after singular value decomposition is performed on the word-document matrix A, the diagonal matrix Σ and the matrix V^T are taken from the three resulting matrices, and an intermediate matrix T1 is obtained as the product T1 = ΣV^T. The all-zero rows and all-zero columns of T1 are filtered out to obtain the final descriptor-word matrix T2. Each row of T2 represents a semantic category of extracted descriptors, each column represents a word in the documents, and each element of T2 represents the membership relation (i.e., the degree of membership) between the word represented by the element's column and the descriptor represented by the element's row. Each row of T2 is then sorted by weight value, and the words corresponding to the columns whose weight value exceeds the weight threshold, together with their weights, are added to the topic set as descriptors and topic information. This forms the descriptor word set used to represent the topic of each document.
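The construction of T1 and T2 described above can be sketched as follows. This is a minimal illustration, assuming Σ is supplied as a list of singular values and V^T as a list of rows; the function names, the vocabulary and the toy numbers are hypothetical and do not come from a real decomposition.

```python
def descriptor_word_matrix(sigma, vt, words):
    """Compute T1 = Sigma * V^T, then drop all-zero rows and columns (T2)."""
    # Multiplying by a diagonal matrix scales row i of V^T by sigma[i].
    t1 = [[s * v for v in row] for s, row in zip(sigma, vt)]
    keep_rows = [i for i, row in enumerate(t1) if any(x != 0 for x in row)]
    keep_cols = [j for j in range(len(words)) if any(row[j] != 0 for row in t1)]
    t2 = [[t1[i][j] for j in keep_cols] for i in keep_rows]
    return t2, [words[j] for j in keep_cols]


def rank_rows(t2, kept_words):
    """Sort the words of each semantic category by weight, largest first."""
    return [
        sorted(zip(kept_words, row), key=lambda pair: pair[1], reverse=True)
        for row in t2
    ]


# Toy inputs (assumed): three semantic categories, five words.
words = ["price", "market", "goal", "match", "the"]
sigma = [2.0, 1.0, 0.0]
vt = [
    [0.75, 0.5, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.75, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],
]
t2, kept_words = descriptor_word_matrix(sigma, vt, words)
ranked = rank_rows(t2, kept_words)
```

Here the third category disappears because its singular value is zero, and the column for "the" disappears because no remaining category gives it any weight.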
It should be noted that, depending on task requirements, the weight threshold can be of two kinds. The first is an integer m, meaning that the top m descriptors related to a topic are extracted to represent the topic of the document. The second is a decimal f, meaning that all words whose weight value is greater than f are extracted as descriptors to represent the topic of the document.
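The two threshold kinds can be handled by one selection routine, sketched below. The function name `select_descriptors` and the sample weights are hypothetical; the input is assumed to be one row of the descriptor-word matrix, already sorted by descending weight.

```python
def select_descriptors(ranked, threshold):
    """Pick descriptors from (word, weight) pairs sorted by descending weight.

    An int threshold m keeps the top m words; a float threshold f keeps
    every word whose weight is strictly greater than f.
    """
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(threshold, bool):
        raise TypeError("threshold must be an int (m) or a float (f)")
    if isinstance(threshold, int):
        if threshold < 0:
            raise ValueError("m must not be negative")
        return ranked[:threshold]
    if isinstance(threshold, float):
        return [(word, weight) for word, weight in ranked if weight > threshold]
    raise TypeError("threshold must be an int (m) or a float (f)")


ranked = [("price", 1.5), ("market", 1.0), ("trade", 0.25)]
top_two = select_descriptors(ranked, 2)      # integer type m
above_half = select_descriptors(ranked, 0.5)  # decimal type f
```

Dispatching on the Python type mirrors the integer/decimal distinction in the text; an implementation could equally use two separate parameters.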
Optionally, the acquiring unit includes: an acquisition module, configured to obtain all documents from which descriptors need to be extracted; and a word segmentation module, configured to perform word segmentation on those documents to obtain the words appearing in them.
That is, after all documents from which descriptors need to be extracted are obtained, these documents need to be preprocessed: word segmentation is performed on the documents to obtain the words they contain, and the word frequency information of these words is counted. For Chinese documents, a Chinese word segmentation tool can be used, so that each long text document is turned into a set of words. To improve the quality of the extracted descriptors and reduce the influence of high-frequency noise words, common Chinese stop words (for example, filler words such as "uh") can be filtered out after segmentation.
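The preprocessing steps above (segmentation, stop-word filtering, word frequency counting) can be sketched as follows. The whitespace segmenter is only a stand-in, since Chinese text would need a dedicated segmentation tool; the stop-word list, the function names and the sample documents are all hypothetical.

```python
from collections import Counter

# Hypothetical stop-word list; a real system would load a fuller one.
STOP_WORDS = {"uh", "the", "a", "of"}


def whitespace_segment(text):
    """Stand-in segmenter; Chinese text needs a real segmentation tool."""
    return text.lower().split()


def preprocess(documents, segment=whitespace_segment, stop_words=STOP_WORDS):
    """Segment each document and drop stop words after segmentation."""
    tokenized = []
    for text in documents:
        tokens = [tok for tok in segment(text) if tok not in stop_words]
        tokenized.append(tokens)
    return tokenized


docs = ["uh the price of the market", "the match the goal the match"]
tokenized = preprocess(docs)
# Word frequency information, one Counter per document.
counts = [Counter(tokens) for tokens in tokenized]
```

Passing the segmenter as a parameter keeps the sketch independent of any particular segmentation library.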
According to this embodiment of the present invention, no large-scale corpus model needs to be trained in advance. The method is flexible to use and applies generally to document sets from different fields or to data from the whole web.
The key words extraction device includes a processor and a memory. The acquiring unit, construction unit, generation unit, extracting unit and so on are stored in the memory as program units, and the processor executes these program units stored in the memory.
The processor contains a kernel, which retrieves the corresponding program unit from the memory. One or more kernels may be provided, and the text content is parsed by adjusting kernel parameters.
The memory may include forms of computer-readable media such as volatile memory, random access memory (RAM) and/or non-volatile memory, for example read-only memory (ROM) or flash memory (flash RAM). The memory includes at least one memory chip.
The present application also provides an embodiment of a computer program product which, when executed on a data processing device, is adapted to execute program code initialized with the following method steps: obtaining all documents from which descriptors need to be extracted and the words appearing in those documents; building a word-document matrix based on the frequency with which each word occurs in the documents, where each row of the word-document matrix represents the word frequency information of each word in one document, and each column represents the word frequency information of one word in each document; performing semantic analysis on the word-document matrix using a latent semantic analysis model to generate a latent semantic space; and extracting, according to the latent semantic space, the descriptors of all documents from which descriptors need to be extracted.
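The matrix-building step can be illustrated with a short sketch in which rows are documents and columns are words, as stated above. The function name `build_word_document_matrix` and the sample tokens are hypothetical, and raw counts are assumed as the word frequency information.

```python
from collections import Counter


def build_word_document_matrix(tokenized_docs):
    """Return (matrix, vocabulary): one row per document, one column per word."""
    # A sorted vocabulary gives every word a stable column index.
    vocabulary = sorted({tok for doc in tokenized_docs for tok in doc})
    matrix = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        # Counter returns 0 for words that do not occur in this document.
        matrix.append([counts[word] for word in vocabulary])
    return matrix, vocabulary


tokenized_docs = [
    ["price", "market", "price"],
    ["match", "goal", "match"],
    ["market", "goal"],
]
matrix, vocabulary = build_word_document_matrix(tokenized_docs)
```

A numerical routine such as NumPy's `numpy.linalg.svd` could then decompose this matrix for the semantic analysis step; that choice of library is an assumption, not something the application specifies.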
The above embodiments of the present invention are for description only and do not represent the relative merits of the embodiments.
In the above embodiments of the present invention, the description of each embodiment has its own emphasis. For a part not described in detail in one embodiment, refer to the related description of the other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed technical content may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division into units may be a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, units or modules, and may be electrical or take other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated in one processing unit, each unit may exist physically on its own, or two or more units may be integrated in one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and is sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to perform all or part of the steps of the methods described in the embodiments of the present invention. The foregoing storage medium includes various media that can store program code, such as a USB flash drive, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), a removable hard disk, a magnetic disk or an optical disc.
The above is only the preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art can make several improvements and refinements without departing from the principles of the present invention, and these improvements and refinements should also be regarded as falling within the protection scope of the present invention.
Claims (10)
1. A key words extraction method, characterized by comprising:
obtaining all documents from which descriptors need to be extracted and the words appearing in the documents;
building a word-document matrix based on the frequency with which each word occurs in the documents, wherein each row of the word-document matrix represents the word frequency information of each word in one document, and each column represents the word frequency information of one word in each document;
performing semantic analysis on the word-document matrix using a latent semantic analysis model to generate a latent semantic space; and
extracting, according to the latent semantic space, the descriptors of all the documents from which descriptors need to be extracted.
2. The method according to claim 1, characterized in that performing semantic analysis on the word-document matrix using the latent semantic analysis model to generate the latent semantic space comprises:
analyzing, using the latent semantic analysis model, the correspondence between the words and the documents in the word-document matrix; and
mapping, according to the correspondence, the words and the documents in the word-document matrix into a vector space that meets a predetermined dimension condition, to generate the latent semantic space.
3. method according to claim 1 and 2, it is characterised in that using latent semantic analysis model to institute's predicate
Language document matrix carries out semantic analysis, and generation latent semantic space includes:
Using singular value decomposition model or Non-negative Matrix Factorization model or probability Vector Space Model to institute's predicate
Language document matrix carries out semantic analysis, generates latent semantic space.
4. The method according to claim 1, characterized in that extracting, according to the latent semantic space, the descriptors of all the documents from which descriptors need to be extracted comprises:
determining a descriptor-word matrix according to the latent semantic space, wherein each row of the descriptor-word matrix represents a semantic category of descriptors, and each column represents a word occurring in the documents from which descriptors need to be extracted;
sorting the words in each row of the descriptor-word matrix by weight value; and
extracting, from the sorted descriptor-word matrix, the words whose weight value is greater than a preset threshold as the descriptors of all the documents from which descriptors need to be extracted.
5. The method according to claim 1, characterized in that obtaining all documents from which descriptors need to be extracted and the words appearing in the documents comprises:
obtaining all the documents from which descriptors need to be extracted; and
performing word segmentation on all the documents from which descriptors need to be extracted, to obtain the words appearing in the documents.
6. A key words extraction device, characterized by comprising:
an acquiring unit, configured to obtain all documents from which descriptors need to be extracted and the words appearing in the documents;
a construction unit, configured to build a word-document matrix based on the frequency with which each word occurs in the documents, wherein each row of the word-document matrix represents the word frequency information of each word in one document, and each column represents the word frequency information of one word in each document;
a generation unit, configured to perform semantic analysis on the word-document matrix using a latent semantic analysis model to generate a latent semantic space; and
an extracting unit, configured to extract, according to the latent semantic space, the descriptors of all the documents from which descriptors need to be extracted.
7. The device according to claim 6, characterized in that the generation unit comprises:
an analysis module, configured to analyze, using the latent semantic analysis model, the correspondence between the words and the documents in the word-document matrix; and
a generation module, configured to map, according to the correspondence, the words and the documents in the word-document matrix into a vector space that meets a predetermined dimension condition, to generate the latent semantic space.
8. The device according to claim 6 or 7, characterized in that the generation unit is further configured to perform semantic analysis on the word-document matrix using a singular value decomposition model, a non-negative matrix factorization model, or a probability vector space model, to generate the latent semantic space.
9. The device according to claim 6, characterized in that the extracting unit comprises:
a determining module, configured to determine a descriptor-word matrix according to the latent semantic space, wherein each row of the descriptor-word matrix represents a semantic category of descriptors, and each column represents a word occurring in the documents from which descriptors need to be extracted;
a sorting module, configured to sort the words in each row of the descriptor-word matrix by weight value; and
an extraction module, configured to extract, from the sorted descriptor-word matrix, the words whose weight value is greater than a preset threshold as the descriptors of all the documents from which descriptors need to be extracted.
10. The device according to claim 6, characterized in that the acquiring unit comprises:
an acquisition module, configured to obtain all the documents from which descriptors need to be extracted; and
a word segmentation module, configured to perform word segmentation on all the documents from which descriptors need to be extracted, to obtain the words appearing in the documents.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510819148.2A CN106776530B (en) | 2015-11-23 | 2015-11-23 | Method and device for extracting subject term |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106776530A (en) | 2017-05-31 |
CN106776530B (en) | 2020-07-03 |
Family
ID=58963111
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510819148.2A Active CN106776530B (en) | 2015-11-23 | 2015-11-23 | Method and device for extracting subject term |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106776530B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117494726A (en) * | 2023-12-29 | 2024-02-02 | 成都航空职业技术学院 | Information keyword extraction method |
Non-Patent Citations (1)
Title |
---|
周进华 等: "基于衰减词共现图的多文档摘要研究", 《小型微型计算机系统》 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117494726A (en) * | 2023-12-29 | 2024-02-02 | 成都航空职业技术学院 | Information keyword extraction method |
CN117494726B (en) * | 2023-12-29 | 2024-04-12 | 成都航空职业技术学院 | Information keyword extraction method |
Also Published As
Publication number | Publication date |
---|---|
CN106776530B (en) | 2020-07-03 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information | Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing Applicant after: Beijing Guoshuang Technology Co.,Ltd. Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing Applicant before: Beijing Guoshuang Technology Co.,Ltd. |
GR01 | Patent grant | ||