CN106776530A - Key words extraction method and device - Google Patents

Publication number
CN106776530A
Authority
CN
China
Prior art keywords
word
document
descriptor
matrix
latent semantic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510819148.2A
Other languages
Chinese (zh)
Other versions
CN106776530B (en)
Inventor
祁国晟
徐文斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd filed Critical Beijing Gridsum Technology Co Ltd
Priority to CN201510819148.2A priority Critical patent/CN106776530B/en
Publication of CN106776530A publication Critical patent/CN106776530A/en
Application granted granted Critical
Publication of CN106776530B publication Critical patent/CN106776530B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/258 Heading extraction; Automatic titling; Numbering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a topic word (keyword) extraction method and device. The method includes: obtaining all documents from which topic words need to be extracted and the words appearing in those documents; building a word-document matrix based on the frequency with which each word occurs in the documents, where each row of the word-document matrix represents the word frequency information of every word in one document and each column represents the word frequency information of one word in every document; performing semantic analysis on the word-document matrix using a latent semantic analysis model to generate a latent semantic space; and extracting, according to the latent semantic space, the topic words of all the documents from which topic words need to be extracted. The invention solves the technical problem that polysemy (one word with several meanings) or synonymy (several words with one meaning) degrades the quality of topic word extraction.

Description

Key words extraction method and device
Technical field
The present invention relates to the field of natural language processing, and in particular to a topic word extraction method and device.
Background
A topic embodies the central idea expressed by a document and is one of the effective ways for a computer to represent a document. Extracting topic information helps in understanding the useful information in a document and improves the efficiency with which a computer processes documents. Topic extraction is currently an active area of natural language processing.
Taking Chinese topic extraction as an example, the topic extraction task is generally divided into three levels: topic words, topic concepts, and topic sentences. Although a single topic word, unlike a topic concept or a topic sentence, does not carry a definite meaning on its own, a set of topic words can describe a topic clearly and is more convenient for computer processing.
In the related art, a topic word extraction method is provided, which is carried out as follows: (1) collect a large number of documents to build a large-scale document collection, count the frequency with which words occur across all documents, and build a word-document frequency model (Inverse Document Frequency, IDF); (2) for a document from which topics need to be extracted, count the word frequency information (Term Frequency, TF) of the words in that document; (3) build a weight calculation model based on the word frequency information, determine the weight of each word in the document, and sort all words by weight; (4) according to a preset threshold, output the top-n words of the ranking obtained in the previous step.
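The four steps of this related-art approach can be sketched in a few lines of Python. This is only an illustrative sketch, not the weighting model of any particular system: the function names `train_idf` and `top_n_keywords`, the smoothed IDF formula, and the choice to give unseen words the largest known IDF value are all assumptions made for the example.

```python
import math
from collections import Counter

def train_idf(corpus):
    """Step (1): corpus is a list of token lists; returns an IDF value per word.

    The +1 smoothing is an assumption of this sketch, not taken from the source.
    """
    n_docs = len(corpus)
    doc_freq = Counter()
    for tokens in corpus:
        doc_freq.update(set(tokens))  # count each word once per document
    return {w: math.log((1 + n_docs) / (1 + df)) + 1.0 for w, df in doc_freq.items()}

def top_n_keywords(tokens, idf, n):
    """Steps (2)-(4): weight each word by TF * IDF, sort, and keep the top n."""
    tf = Counter(tokens)
    unseen = max(idf.values())  # assumption: unseen words are treated as rare
    weights = {w: (c / len(tokens)) * idf.get(w, unseen) for w, c in tf.items()}
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:n]

corpus = [["apple", "pie", "the"], ["apple", "tart", "the"], ["banana", "bread", "the"]]
idf = train_idf(corpus)
keywords = top_n_keywords(["apple", "apple", "the", "pie"], idf, 2)
```

Because the ranking depends only on counts, two words with the same meaning are weighted independently, which is the limitation the description goes on to discuss.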
The inventors found that the above technique has the following shortcomings: (1) a topic word extraction model based on word frequency information depends on word frequency when extracting topic words and is easily affected by high-frequency noise words, so the extracted topic words and topic word sets are easily polluted by such words and extraction quality cannot be guaranteed; (2) a topic word extraction technique based on weight ranking cannot take the semantics of each word into account no matter how the weight calculation model is changed, and therefore cannot resolve Chinese polysemy or synonymy, that is, it cannot effectively distinguish the meanings of words, which degrades the quality of the extracted topic words and topic word sets. In addition, the above scheme requires learning an IDF model. An IDF model works well on web-wide data that is not restricted to a field, but its effectiveness drops markedly when the documents being processed all belong to the same field; the IDF model for that field generally has to be retrained, which is inflexible.
No effective solution to the above problems has yet been proposed.
Summary of the invention
Embodiments of the present invention provide a topic word extraction method and device, so as to at least solve the technical problem that polysemy or synonymy degrades the quality of topic word extraction.
According to one aspect of the embodiments of the present invention, a topic word extraction method is provided, including: obtaining all documents from which topic words need to be extracted and the words appearing in those documents; building a word-document matrix based on the frequency with which each word occurs in the documents, where each row of the word-document matrix represents the word frequency information of every word in one document and each column represents the word frequency information of one word in every document; performing semantic analysis on the word-document matrix using a latent semantic analysis model to generate a latent semantic space; and extracting, according to the latent semantic space, the topic words of all the documents from which topic words need to be extracted.
Further, performing semantic analysis on the word-document matrix using the latent semantic analysis model to generate the latent semantic space includes: analyzing, with the latent semantic analysis model, the correspondence between the words and the documents in the word-document matrix; and mapping the words and the documents in the word-document matrix, according to that correspondence, into a vector space that satisfies a predetermined dimension condition, thereby generating the latent semantic space.
Further, performing semantic analysis on the word-document matrix using the latent semantic analysis model to generate the latent semantic space includes: performing semantic analysis on the word-document matrix using a singular value decomposition model, a non-negative matrix factorization model, or a probabilistic latent semantic model to generate the latent semantic space.
Further, extracting the topic words of the documents according to the latent semantic space includes: determining a topic-word matrix according to the latent semantic space, where each row of the topic-word matrix represents a semantic category of topic words and each column represents a word appearing in the documents from which topic words need to be extracted; sorting the words in each row of the topic-word matrix by weight; and extracting the words whose weight in the sorted topic-word matrix exceeds a preset threshold as the topic words of the documents.
Further, obtaining all documents from which topic words need to be extracted and the words appearing in those documents includes: obtaining all the documents from which topic words need to be extracted; and performing word segmentation on those documents to obtain the words appearing in them.
According to another aspect of the embodiments of the present invention, a topic word extraction device is also provided, including: an acquiring unit, configured to obtain all documents from which topic words need to be extracted and the words appearing in those documents; a construction unit, configured to build a word-document matrix based on the frequency with which each word occurs in the documents, where each row of the word-document matrix represents the word frequency information of every word in one document and each column represents the word frequency information of one word in every document; a generation unit, configured to perform semantic analysis on the word-document matrix using a latent semantic analysis model to generate a latent semantic space; and an extraction unit, configured to extract, according to the latent semantic space, the topic words of all the documents from which topic words need to be extracted.
Further, the generation unit includes: an analysis module, configured to analyze, with the latent semantic analysis model, the correspondence between the words and the documents in the word-document matrix; and a generation module, configured to map the words and the documents in the word-document matrix, according to that correspondence, into a vector space that satisfies a predetermined dimension condition, thereby generating the latent semantic space.
Further, the generation unit is also configured to perform semantic analysis on the word-document matrix using a singular value decomposition model, a non-negative matrix factorization model, or a probabilistic latent semantic model to generate the latent semantic space.
Further, the extraction unit includes: a determining module, configured to determine a topic-word matrix according to the latent semantic space, where each row of the topic-word matrix represents a semantic category of topic words and each column represents a word appearing in the documents from which topic words need to be extracted; a sorting module, configured to sort the words in each row of the topic-word matrix by weight; and an extraction module, configured to extract the words whose weight in the sorted topic-word matrix exceeds a preset threshold as the topic words of the documents.
Further, the acquiring unit includes: an acquisition module, configured to obtain all the documents from which topic words need to be extracted; and a word segmentation module, configured to perform word segmentation on those documents to obtain the words appearing in them.
In the embodiments of the present invention, topic words are extracted on the basis of semantic analysis results: all documents from which topic words need to be extracted and the words appearing in those documents are obtained; a word-document matrix is built based on the frequency with which each word occurs in the documents, where each row of the word-document matrix represents the word frequency information of every word in one document and each column represents the word frequency information of one word in every document; semantic analysis is performed on the word-document matrix using a latent semantic analysis model to generate a latent semantic space; and the topic words of the documents are extracted according to the latent semantic space. Extracting topic words from the results of semantic analysis in this way improves the quality of topic word extraction and thereby solves the technical problem that polysemy or synonymy degrades that quality.
Brief description of the drawings
The drawings described here are provided for further understanding of the present invention and form part of this application. The exemplary embodiments of the present invention and their descriptions are used to explain the present invention and do not unduly limit it. In the drawings:
Fig. 1 is a flowchart of an optional topic word extraction method according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of an optional topic word extraction device according to an embodiment of the present invention.
Detailed description
To help those skilled in the art better understand the solution of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings of the embodiments. The described embodiments are obviously only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort shall fall within the scope of protection of the present invention.
It should be noted that the terms "first", "second", and the like in the description, the claims, and the above drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data used in this way are interchangeable where appropriate, so that the embodiments of the invention described here can be implemented in orders other than those illustrated or described here. In addition, the terms "comprise" and "have" and any variants of them are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.
Embodiment 1
According to an embodiment of the present invention, a method embodiment of a topic word extraction method is provided. It should be noted that the steps shown in the flowchart of the drawings can be executed in a computer system, such as one running a set of computer-executable instructions, and that, although a logical order is shown in the flowchart, in some cases the steps shown or described may be performed in an order different from the one given here.
Fig. 1 is a flowchart of an optional topic word extraction method according to an embodiment of the present invention. As shown in Fig. 1, the method includes the following steps:
Step S102: obtain all documents from which topic words need to be extracted and the words appearing in those documents;
Step S104: build a word-document matrix based on the frequency with which each word occurs in the documents, where each row of the word-document matrix represents the word frequency information of every word in one document and each column represents the word frequency information of one word in every document;
Step S106: perform semantic analysis on the word-document matrix using a latent semantic analysis model to generate a latent semantic space;
Step S108: extract, according to the latent semantic space, the topic words of all the documents from which topic words need to be extracted.
For example, suppose there are N documents from which topic words need to be extracted and that these documents involve M words in total. The document set is denoted D = {d1, d2, d3, ..., dN} and the set of the M words is denoted W = {w1, w2, w3, ..., wM}. An N × M word-document matrix A can then be built from these documents and words:
$$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1M} \\ a_{21} & a_{22} & \cdots & a_{2M} \\ \vdots & \vdots & \ddots & \vdots \\ a_{N1} & a_{N2} & \cdots & a_{NM} \end{pmatrix}$$
Each row of matrix A corresponds to one document, and each element in the row represents the word frequency information of the corresponding word in that document; each column corresponds to one word, and each element in the column represents the word frequency information of that word in the corresponding document. Specifically, element $a_{ij}$ of A is obtained from D and W through the mapping $a_{ij} = D_i W_j$ and represents the word frequency information of word j in document i.
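As an illustration of this construction, the following minimal Python sketch builds the matrix from documents that have already been split into words. The function name `build_word_document_matrix`, the use of raw counts as the word frequency information, and the alphabetical column order are assumptions of the example.

```python
from collections import Counter

def build_word_document_matrix(docs):
    """docs: list of token lists, one per document.

    Returns (vocabulary, A) where A[i][j] is the number of times
    vocabulary[j] occurs in document i (rows: documents, columns: words).
    """
    vocabulary = sorted({w for tokens in docs for w in tokens})
    A = []
    for tokens in docs:
        counts = Counter(tokens)
        A.append([counts[w] for w in vocabulary])
    return vocabulary, A

vocabulary, A = build_word_document_matrix([["cat", "sat", "cat"], ["dog", "sat"]])
```

Here `A` has N = 2 rows and M = 3 columns, matching the N × M layout described above.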
Further, on the basis of matrix A, a normalization factor can be computed and each row vector normalized. There are many ways to compute the normalization factor and no limitation is imposed here; for example, the L2-norm method can be used to normalize the row vectors. Specifically, the L2-norm normalization factor is computed as follows:
$$\mathrm{Norm} = \sqrt{(d_1)^2 + \cdots + (d_n)^2}$$
where $d_1, \ldots, d_n$ denote the components of the row vector being normalized.
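A minimal sketch of this row normalization follows, assuming the usual definition of the L2 norm (the square root of the sum of squared components). The function name `l2_normalize_rows` and the decision to leave all-zero rows unchanged are assumptions of the example.

```python
import math

def l2_normalize_rows(A):
    """Divide every row of A by its L2 norm; all-zero rows are returned unchanged."""
    normalized = []
    for row in A:
        norm = math.sqrt(sum(x * x for x in row))  # the normalization factor
        normalized.append([x / norm for x in row] if norm > 0 else list(row))
    return normalized

rows = l2_normalize_rows([[3, 4], [0, 0]])
```

After normalization every non-zero row has unit length, so long and short documents contribute on a comparable scale.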
Through the above steps, document-level texts can be processed with latent semantic analysis, which remedies the shortcomings of extracting topic words from word frequency information alone. Taking word semantics into account reduces the influence of noise words on topic word quality, so that the topic words used to represent a topic cover the document information better and the topic is expressed more completely. This effectively improves the quality of the extracted topic words, makes the extracted topics more generally applicable in later use, and is of great value for tasks such as similarity calculation and document retrieval.
Optionally, performing semantic analysis on the word-document matrix using the latent semantic analysis model to generate the latent semantic space includes:
S2: analyzing, with the latent semantic analysis model, the correspondence between the words and the documents in the word-document matrix;
S4: mapping the words and the documents in the word-document matrix, according to that correspondence, into a vector space that satisfies a predetermined dimension condition, thereby generating the latent semantic space.
The purpose of latent semantic analysis is to find the real meaning of each word in a document, that is, its latent semantics, and thereby to obtain the semantic information of the words and the relationship between words and topics. In particular, generating a latent semantic space means modeling a large document collection in a space of reasonable dimension and representing both the words and the documents in that space. For example, given 2000 documents containing 7000 words, latent semantic analysis represents the words and the documents, according to their correspondence, in a vector space of dimension 100.
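One common way to obtain such a space is a truncated singular value decomposition. The sketch below assumes NumPy is available and uses `numpy.linalg.svd`; the function name `latent_space` and the choice to scale both sets of coordinates by the singular values are assumptions of the example, not details given in the source.

```python
import numpy as np

def latent_space(A, k):
    """Map documents and words of an (N documents x M words) matrix into k dimensions.

    Returns (doc_vectors, word_vectors) with shapes (N, k) and (M, k).
    """
    U, s, Vt = np.linalg.svd(np.asarray(A, dtype=float), full_matrices=False)
    k = min(k, len(s))                   # cannot keep more dimensions than exist
    doc_vectors = U[:, :k] * s[:k]       # document coordinates in the latent space
    word_vectors = Vt[:k, :].T * s[:k]   # word coordinates in the same space
    return doc_vectors, word_vectors

rng = np.random.default_rng(0)
A = rng.integers(0, 5, size=(20, 70))    # toy stand-in for a real count matrix
doc_vectors, word_vectors = latent_space(A, 10)
```

With the sizes quoted in the text (2000 documents, 7000 words, dimension 100), the same call would return arrays of shape (2000, 100) and (7000, 100).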
Through this embodiment of the present invention, topics are extracted on the basis of a latent semantic analysis model, which reduces the influence of noise words and allows the extracted topic words to describe the topics of the documents better.
On the basis of the above embodiment, optionally, performing semantic analysis on the word-document matrix using the latent semantic analysis model to generate the latent semantic space includes:
S6: performing semantic analysis on the word-document matrix using a singular value decomposition model, a non-negative matrix factorization (NMF) model, or a probabilistic latent semantic (pLSI) model to generate the latent semantic space.
The process of generating the latent semantic space is described in detail below, taking the singular value decomposition (K-SVD) model as an example:
Singular value decomposition (SVD) is an important matrix factorization in linear algebra. It generalizes the unitary diagonalization of normal matrices in matrix analysis and has important applications in fields such as signal processing and statistics. A unitary matrix U is a complex matrix with n rows and n columns that satisfies $U^{T}U = UU^{T} = E_n$, where $U^{T}$ denotes the conjugate transpose of U and $E_n$ is the identity matrix of order n. In linear algebra, the column rank of a matrix is the maximum number of linearly independent columns of the matrix. Similarly, the row rank is the maximum number of linearly independent rows of the matrix.
In practice, the word-document matrix is processed with SVD: matrix A is factorized as $A = U\Sigma V^{T}$ into the three matrices $U$, $\Sigma$, and $V^{T}$, where $\Sigma$ is a diagonal matrix whose diagonal elements are the singular values of A (the square roots of the eigenvalues of $A^{T}A$). A simple way of computing $A = U\Sigma V^{T}$ is described below:
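(1) For $A \in C^{m \times n}$ of rank $r$, find the unitary matrix $V$ that diagonalizes $A^{T}A$ by unitary similarity: $V^{T}(A^{T}A)V = \begin{pmatrix} \Delta^{2} & 0 \\ 0 & 0 \end{pmatrix}$, where $\Delta = \mathrm{diag}(\sigma_1, \ldots, \sigma_r)$ holds the nonzero singular values of $A$;
(2) Write $V = (V_1, V_2)$, with $V_1 \in C^{n \times r}$ and $V_2 \in C^{n \times (n-r)}$;
(3) Let $U_1 = AV_1\Delta^{-1}$, with $U_1 \in C^{m \times r}$;
(4) Extend $U_1$ to a unitary matrix $U = (U_1, U_2)$;
(5) Construct the singular value decomposition $A = U \begin{pmatrix} \Delta & 0 \\ 0 & 0 \end{pmatrix} V^{T}$.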
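The five steps can be followed numerically for a real-valued matrix, as in the NumPy sketch below. It is a didactic sketch only: the function name `svd_via_gram`, the tolerance `tol` used to decide which eigenvalues count as nonzero, and the QR-based way of extending $U_1$ are assumptions of the example. Forming $A^{T}A$ explicitly loses numerical accuracy, so a library routine such as `numpy.linalg.svd` is usually preferable in practice.

```python
import numpy as np

def svd_via_gram(A, tol=1e-10):
    """Follow steps (1)-(5) for a real matrix A; returns (U, Sigma, V) with A = U Sigma V^T."""
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    # (1) Diagonalize A^T A with an orthogonal V, eigenvalues in descending order.
    eigvals, V = np.linalg.eigh(A.T @ A)
    order = np.argsort(eigvals)[::-1]
    eigvals, V = eigvals[order], V[:, order]
    r = int(np.sum(eigvals > tol))       # numerical rank
    delta = np.sqrt(eigvals[:r])         # nonzero singular values
    # (2) Split V = (V1, V2); only V1 is needed explicitly.
    V1 = V[:, :r]
    # (3) U1 = A V1 Delta^{-1} (divide column j by delta[j]).
    U1 = A @ V1 / delta
    # (4) Extend U1 to an orthogonal U: QR of [U1 | I] yields a basis whose
    #     last m - r columns are orthogonal to the columns of U1.
    Q, _ = np.linalg.qr(np.hstack([U1, np.eye(m)]))
    U = Q.copy()
    U[:, :r] = U1
    # (5) Assemble Sigma with Delta in the top-left block.
    Sigma = np.zeros((m, n))
    Sigma[:r, :r] = np.diag(delta)
    return U, Sigma, V

A = np.array([[2.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
U, Sigma, V = svd_via_gram(A)
```

Multiplying the factors back together (`U @ Sigma @ V.T`) reproduces `A` up to rounding error.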
Here, each singular value in $\Sigma$ corresponds to the weight of one "semantic" dimension. Further, less important weights can be set to 0: all dimension values below a certain weight threshold are set to 0 so that only the most important dimensions are retained. The latent semantic space obtained in this way filters out some noise words.
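The filtering of low-weight dimensions can be sketched as follows, assuming NumPy. The function name `truncate_singular_values` and the convention that a dimension is kept when its singular value is at least `weight_threshold` are assumptions of the example.

```python
import numpy as np

def truncate_singular_values(A, weight_threshold):
    """Zero every singular value below weight_threshold and rebuild the matrix."""
    U, s, Vt = np.linalg.svd(np.asarray(A, dtype=float), full_matrices=False)
    s_filtered = np.where(s >= weight_threshold, s, 0.0)  # drop weak "semantic" dimensions
    A_filtered = (U * s_filtered) @ Vt                     # U diag(s_filtered) V^T
    return s_filtered, A_filtered

s_filtered, A_filtered = truncate_singular_values([[3.0, 0.0], [0.0, 0.1]], 1.0)
```

In this toy case the dimension with weight 0.1 is discarded and only the dimension with weight 3 survives in `A_filtered`.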
Through this embodiment of the present invention, singular value decomposition allows two kinds of filtering, by singular value and by degree of membership. These remove the semantic categories of topic words with small weights and the words with low membership, eliminating the influence of high-frequency noise words so that the extracted topic words describe the topics of the documents better.
Optionally, extracting the topic words of the documents according to the latent semantic space includes:
S8: determining a topic-word matrix according to the latent semantic space, where each row of the topic-word matrix represents a semantic category of topic words and each column represents a word appearing in the documents from which topic words need to be extracted;
S10: sorting the words in each row of the topic-word matrix by weight;
S12: extracting the words whose weight in the sorted topic-word matrix exceeds a preset threshold as the topic words of the documents from which topic words need to be extracted.
Continuing from the previous embodiment, after singular value decomposition of the word-document matrix A, the diagonal matrix $\Sigma$ and the matrix $V^{T}$ are taken from the three resulting matrices, and an intermediate matrix $T_1$ is obtained as the product $T_1 = \Sigma V^{T}$. The all-zero rows and all-zero columns of $T_1$ are filtered out to obtain the final topic-word matrix $T_2$. The rows of $T_2$ represent the semantic categories of the extracted topic words and the columns represent the words in the documents; each element of $T_2$ represents the degree of membership between the word represented by the element's column and the topic represented by the element's row. Each row of $T_2$ is then sorted by weight, and the words corresponding to the columns whose weight exceeds the weight threshold, together with those weights, are added to the topic set as topic words and topic information. The resulting topic word sets are used to represent the topic of each document.
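The construction of $T_1$ and $T_2$ and the per-row selection can be sketched with NumPy as below. The function names, the tolerance `tol` for deciding that a row or column is all zero, and in particular the use of absolute values as membership weights are assumptions of this example: the signs of singular vectors are arbitrary, and the source does not say how negative entries are handled.

```python
import numpy as np

def topic_word_matrix(A, vocabulary, tol=1e-12):
    """Return (T2, words): T1 = Sigma V^T with all-zero rows and columns removed."""
    _, s, Vt = np.linalg.svd(np.asarray(A, dtype=float), full_matrices=False)
    T1 = s[:, None] * Vt                          # Sigma V^T
    keep_rows = np.abs(T1).max(axis=1) > tol      # drop all-zero topics
    keep_cols = np.abs(T1).max(axis=0) > tol      # drop words with no weight anywhere
    T2 = np.abs(T1[keep_rows][:, keep_cols])      # assumption: |entry| is the weight
    words = [w for w, keep in zip(vocabulary, keep_cols) if keep]
    return T2, words

def extract_topic_words(T2, words, weight_threshold):
    """For each topic (row), list (word, weight) pairs above the threshold, heaviest first."""
    topics = []
    for row in T2:
        order = np.argsort(-row)
        topics.append([(words[j], float(row[j])) for j in order if row[j] > weight_threshold])
    return topics

T2, words = topic_word_matrix([[2, 0, 0], [0, 1, 0]], ["a", "b", "c"])
topics = extract_topic_words(T2, words, 0.5)
```

In this toy matrix the word "c" never occurs, so its column is removed, and each of the two remaining topics is represented by a single word.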
It should be noted that, depending on the requirements of the task, the weight threshold can be of two kinds. The first is an integer m, which means that the top m topic words related to the topic are extracted to represent the topic of the document. The second is a decimal f, which means that all words whose weight is greater than f are extracted as topic words to represent the topic of the document.
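The two kinds of threshold can be handled by a single selection routine, as in the following sketch. The function name `select_topic_words` and the convention of distinguishing the two cases by Python type (`int` for m, `float` for f) are assumptions of the example.

```python
def select_topic_words(weighted_words, threshold):
    """weighted_words: list of (word, weight) pairs for one topic.

    An int threshold m keeps the m heaviest words;
    a float threshold f keeps every word whose weight exceeds f.
    """
    if isinstance(threshold, bool):
        raise TypeError("threshold must be an int or a float")
    ranked = sorted(weighted_words, key=lambda kv: kv[1], reverse=True)
    if isinstance(threshold, int):
        return ranked[:threshold]
    return [(w, v) for w, v in ranked if v > threshold]

candidates = [("a", 0.9), ("b", 0.2), ("c", 0.5)]
top_two = select_topic_words(candidates, 2)
above_06 = select_topic_words(candidates, 0.6)
```

The integer form fixes the size of each topic word set, whereas the decimal form lets the size vary with how many words are strongly associated with the topic.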
Optionally, obtaining all documents from which topic words need to be extracted and the words appearing in those documents includes:
S14: obtaining all the documents from which topic words need to be extracted;
S16: performing word segmentation on those documents to obtain the words appearing in them.
That is, after all the documents from which topic words need to be extracted have been obtained, the documents need to be preprocessed: word segmentation is performed on each document to obtain the words it involves, and the word frequency information of those words is counted. For Chinese documents, a Chinese word segmentation tool can be used so that a long text document is turned into a set of words. To improve the quality of the extracted topic words and reduce the influence of high-frequency noise words, common Chinese stop words, such as the interjection "uh", can be filtered out after segmentation.
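The preprocessing just described can be sketched as follows. The function name `preprocess` is hypothetical, the default whitespace split is only a stand-in for a real Chinese word segmentation tool (which would be passed in as the `segment` argument), and the stop-word list is a made-up example.

```python
from collections import Counter

def preprocess(text, segment=str.split, stop_words=frozenset()):
    """Segment a document, drop stop words, and count word frequencies.

    segment: any callable that turns a string into a list of words.
    Returns (tokens, counts).
    """
    tokens = [t for t in segment(text) if t not in stop_words]
    return tokens, Counter(tokens)

tokens, counts = preprocess("the cat uh sat on the cat mat", stop_words={"the", "uh", "on"})
```

The resulting counts are the per-document word frequency information from which the word-document matrix is built.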
Through this embodiment of the present invention, no large-scale corpus model needs to be trained in advance. The method is flexible to use and applies generally to document collections from different fields and to web-wide data.
Embodiment 2
According to an embodiment of the present invention, a device embodiment of a topic word extraction device is provided.
Fig. 2 is a schematic diagram of an optional topic word extraction device according to an embodiment of the present invention. As shown in Fig. 2, the device includes: an acquiring unit 202, configured to obtain all documents from which topic words need to be extracted and the words appearing in those documents; a construction unit 204, configured to build a word-document matrix based on the frequency with which each word occurs in the documents, where each row of the word-document matrix represents the word frequency information of every word in one document and each column represents the word frequency information of one word in every document; a generation unit 206, configured to perform semantic analysis on the word-document matrix using a latent semantic analysis model to generate a latent semantic space; and an extraction unit 208, configured to extract, according to the latent semantic space, the topic words of all the documents from which topic words need to be extracted.
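The way the four units hand data to one another can be pictured as a simple pipeline. The sketch below is purely structural: the class name `TopicWordExtractionDevice`, its field names, and the trivial stand-in functions used to exercise it are all hypothetical and do not implement the actual units.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class TopicWordExtractionDevice:
    acquire: Callable[[Any], Any]    # acquiring unit 202: raw documents -> segmented documents
    build: Callable[[Any], Any]      # construction unit 204: documents -> word-document matrix
    generate: Callable[[Any], Any]   # generation unit 206: matrix -> latent semantic space
    extract: Callable[[Any], Any]    # extraction unit 208: latent space -> topic words

    def run(self, raw_documents):
        """Pass the output of each unit to the next, in the order 202 -> 204 -> 206 -> 208."""
        documents = self.acquire(raw_documents)
        matrix = self.build(documents)
        space = self.generate(matrix)
        return self.extract(space)

# Trivial stand-ins, used only to show the data flow between units.
device = TopicWordExtractionDevice(
    acquire=lambda raws: [r.split() for r in raws],
    build=lambda docs: [len(d) for d in docs],
    generate=lambda matrix: sorted(matrix),
    extract=lambda space: space[-1],
)
result = device.run(["a b", "c d e"])
```

Keeping each unit behind its own callable mirrors the optional sub-modules described for the units: any one stage can be swapped without touching the others.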
For example, suppose there are N documents from which topic words need to be extracted and that these documents involve M words in total. The document set is denoted D = {d1, d2, d3, ..., dN} and the set of the M words is denoted W = {w1, w2, w3, ..., wM}. An N × M word-document matrix A can then be built from these documents and words:
$$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1M} \\ a_{21} & a_{22} & \cdots & a_{2M} \\ \vdots & \vdots & \ddots & \vdots \\ a_{N1} & a_{N2} & \cdots & a_{NM} \end{pmatrix}$$
Each row of matrix A corresponds to one document, and each element in the row represents the word frequency information of the corresponding word in that document; each column corresponds to one word, and each element in the column represents the word frequency information of that word in the corresponding document. Specifically, element $a_{ij}$ of A is obtained from D and W through the mapping $a_{ij} = D_i W_j$ and represents the word frequency information of word j in document i.
Further, on the basis of matrix A, a normalization factor can be computed and each row vector normalized. There are many ways to compute the normalization factor and no limitation is imposed here; for example, the L2-norm method can be used to normalize the row vectors. Specifically, the L2-norm normalization factor is computed as follows:
$$\mathrm{Norm} = \sqrt{(d_1)^2 + \cdots + (d_n)^2}$$
where $d_1, \ldots, d_n$ denote the components of the row vector being normalized.
Through the above embodiment, document-level texts can be processed with latent semantic analysis, which remedies the shortcomings of extracting topic words from word frequency information alone. Taking word semantics into account reduces the influence of noise words on topic word quality, so that the topic words used to represent a topic cover the document information better and the topic is expressed more completely. This effectively improves the quality of the extracted topic words, makes the extracted topics more generally applicable in later use, and is of great value for tasks such as similarity calculation and document retrieval.
Optionally, the generation unit includes: an analysis module, configured to analyze, with the latent semantic analysis model, the correspondence between the words and the documents in the word-document matrix; and a generation module, configured to map the words and the documents in the word-document matrix, according to that correspondence, into a vector space that satisfies a predetermined dimension condition, thereby generating the latent semantic space.
The purpose of latent semantic analysis is to find the real meaning of each word in a document, that is, its latent semantics, and thereby to obtain the semantic information of the words and the relationship between words and topics. In particular, generating a latent semantic space means modeling a large document collection in a space of reasonable dimension and representing both the words and the documents in that space. For example, given 2000 documents containing 7000 words, latent semantic analysis represents the words and the documents, according to their correspondence, in a vector space of dimension 100.
Through this embodiment of the present invention, topics are extracted on the basis of a latent semantic analysis model, which reduces the influence of noise words and allows the extracted topic words to describe the topics of the documents better.
On the basis of the above embodiment, optionally, the generation unit is also configured to perform semantic analysis on the word-document matrix using a singular value decomposition model, a non-negative matrix factorization model, or a probabilistic latent semantic model to generate the latent semantic space.
The process of generating the latent semantic space with the singular value decomposition (K-SVD) model is the same as that described in Embodiment 1 and is not repeated here.
Through this embodiment of the present invention, singular value decomposition allows two kinds of filtering, by singular value and by degree of membership. These remove the semantic categories of topic words with small weights and the words with low membership, eliminating the influence of high-frequency noise words so that the extracted topic words describe the topics of the documents better.
Alternatively, above-mentioned extracting unit includes:Determining module, for determining descriptor word according to latent semantic space Matrix, wherein, every a line of descriptor word matrix represents the semantic classes of descriptor, and each row are represented in all need Extract the word occurred in the document of descriptor;Order module, for every a line word in descriptor word matrix Sorted by its weighted value;Abstraction module, default threshold is more than for weighted value in the descriptor word matrix after extraction sequence The word of value as it is in need extract descriptor document descriptor.
Based on the foregoing embodiment, after singular value decomposition is performed on the word-document matrix A, the diagonal matrix Σ and the matrix V^T are taken from the three resulting matrices, and the intermediate matrix T1 is obtained as the product T1 = Σ·V^T. The all-zero rows and all-zero columns of T1 are filtered out to obtain the final descriptor-word matrix T2, in which each row represents a semantic category of the extracted descriptors, each column represents a word in the documents, and each element represents the membership relation (that is, the degree of membership) between the word represented by the element's column and the descriptor represented by the element's row. Each row of T2 is then sorted by weight, and the words whose weights exceed the weight threshold are added, together with their weights, to the topic set as descriptors and topic information, forming the descriptor word set that represents the topic of each document.
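The construction of T1 and T2 can be sketched with NumPy as follows. The function names, the tolerance used to decide that an entry is zero, the toy vocabulary, and in particular the use of the absolute value of an entry as its weight are assumptions: singular vectors may contain negative entries, and the text does not say how those are handled.

```python
import numpy as np


def descriptor_word_matrix(A, k, tol=1e-12):
    """Compute T2 from a (documents x words) matrix A.

    Returns (T2, kept_columns); kept_columns maps each T2 column back to
    its original word index.
    """
    _, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = min(k, len(s))
    T1 = np.diag(s[:k]) @ Vt[:k, :]             # intermediate matrix T1 = Sigma * V^T
    nonzero = ~np.isclose(T1, 0.0, atol=tol)    # treat round-off noise as zero
    rows = nonzero.any(axis=1)                  # drop all-zero rows
    cols = nonzero.any(axis=0)                  # drop all-zero columns
    T2 = T1[rows][:, cols]
    return T2, np.flatnonzero(cols)


def rank_row(T2, row, vocabulary, kept_columns):
    """Sort the words of one semantic category by weight, largest first.

    Assumption: the weight of a word is the absolute value of its entry.
    """
    weights = np.abs(T2[row])
    order = np.argsort(-weights)
    return [(vocabulary[kept_columns[j]], float(weights[j])) for j in order]


# Toy data (hypothetical): the word "unused" never occurs, so its column is all zero.
vocabulary = ["stock", "bank", "match", "goal", "unused"]
A = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 0, 0, 0],
    [0, 0, 3, 1, 0],
    [0, 0, 1, 2, 0],
], dtype=float)

T2, kept = descriptor_word_matrix(A, k=2)
ranking = rank_row(T2, 0, vocabulary, kept)
print(ranking)
```

Keeping only the first `k` singular values plays the role of singular value filtering here; applying the weight threshold to each ranked row is the membership filtering step.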
It should be noted that, depending on the task requirements, the weight threshold can be of two kinds: an integer m, indicating that the top m descriptors related to the topic are to be extracted to represent the topic of the document; or a decimal f, indicating that all words whose weight is greater than f are to be extracted as descriptors to represent the topic of the document.
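The two threshold types can be handled by a single selection function, sketched below. The name `select_descriptors`, the example words, and the example weights are hypothetical; the input is assumed to be one row of the descriptor-word matrix already sorted by weight in descending order.

```python
def select_descriptors(ranked, threshold):
    """Pick descriptors from (word, weight) pairs sorted by weight, descending.

    An integer threshold m keeps the top m words; a float threshold f keeps
    every word whose weight is greater than f.
    """
    if isinstance(threshold, bool):
        raise TypeError("threshold must be an int or a float")
    if isinstance(threshold, int):
        if threshold < 0:
            raise ValueError("m must be non-negative")
        return ranked[:threshold]
    return [(word, weight) for word, weight in ranked if weight > threshold]


# Hypothetical ranked row.
ranked = [("economy", 0.9), ("market", 0.6), ("bank", 0.3), ("today", 0.05)]

top_m = select_descriptors(ranked, 2)        # integer type m
above_f = select_descriptors(ranked, 0.25)   # decimal type f
print(top_m)
print(above_f)
```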
Optionally, the above acquisition unit includes: an acquisition module, configured to obtain all documents from which descriptors need to be extracted; and a word segmentation module, configured to perform word segmentation on all the documents from which descriptors need to be extracted, to obtain the words appearing in the documents.
That is, after all the documents from which descriptors need to be extracted have been obtained, the documents need to be preprocessed: word segmentation is performed on the documents to obtain the words they contain, and the word frequency information of these words is counted. For Chinese documents, a Chinese word segmentation tool can be used, so that long text documents are turned into sets of words. To improve the quality of the extracted descriptors and reduce the influence of high-frequency noise words, common Chinese stop words (for example, filler words such as "uh") can be filtered out after segmentation.
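The preprocessing described above (segmentation, stop-word filtering, word frequency counting) can be sketched in plain Python. The `segment` function here is only a placeholder that splits on whitespace; a real pipeline for Chinese text would call a word segmentation tool at that point. The stop-word list and the sample documents are hypothetical.

```python
from collections import Counter

# Hypothetical stop-word list; a real list would be much longer.
STOP_WORDS = {"uh", "the", "of", "a"}


def segment(text):
    """Placeholder tokenizer: lowercase and split on whitespace.

    For Chinese documents this is where a word segmentation tool would run.
    """
    return text.lower().split()


def preprocess(documents):
    """Return one word-frequency Counter per document, stop words removed."""
    counts = []
    for text in documents:
        words = [w for w in segment(text) if w not in STOP_WORDS]
        counts.append(Counter(words))
    return counts


counts = preprocess(["The price of oil", "uh oil oil market"])
print(counts)
```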
Through this embodiment of the present invention, no large-scale corpus model needs to be trained in advance. The approach is flexible to use and is generally applicable to document collections from different domains as well as to whole-web data.
The above key words extraction device includes a processor and a memory. The acquisition unit, construction unit, generation unit, extraction unit, and so on are stored in the memory as program units, and the processor executes the program units stored in the memory.
The processor contains a kernel, and the kernel retrieves the corresponding program unit from the memory. One or more kernels may be provided, and text content is parsed by adjusting kernel parameters.
The memory may include forms of computer-readable media such as volatile memory, random access memory (RAM), and/or non-volatile memory, for example read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
The present application also provides an embodiment of a computer program product which, when executed on a data processing device, is adapted to execute program code that initializes the following method steps: obtaining all documents from which descriptors need to be extracted and the words appearing in the documents; building a word-document matrix based on the frequency with which each word occurs in the documents, where each row of the word-document matrix represents the word frequency information of every word in one document, and each column represents the word frequency information of one word in every document; performing semantic analysis on the word-document matrix using a latent semantic analysis model to generate a latent semantic space; and extracting, according to the latent semantic space, the descriptors of all the documents from which descriptors need to be extracted.
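The matrix-building step listed above can be sketched as follows, with one row per document and one column per word, as the text specifies. The function name `build_word_document_matrix`, the alphabetical ordering of the vocabulary, and the sample documents are assumptions for illustration.

```python
from collections import Counter


def build_word_document_matrix(tokenized_docs):
    """Build the frequency matrix from already-segmented documents.

    Returns (vocabulary, matrix) where matrix[i][j] is the number of times
    vocabulary[j] occurs in document i.
    """
    vocabulary = sorted({word for doc in tokenized_docs for word in doc})
    index = {word: j for j, word in enumerate(vocabulary)}
    matrix = []
    for doc in tokenized_docs:
        row = [0] * len(vocabulary)
        for word, count in Counter(doc).items():
            row[index[word]] = count
        matrix.append(row)
    return vocabulary, matrix


# Hypothetical segmented documents.
tokenized_docs = [["oil", "price", "oil"], ["market", "price"]]
vocabulary, matrix = build_word_document_matrix(tokenized_docs)
print(vocabulary)
print(matrix)
```

The resulting list of rows can be passed to a numerical library as the matrix A for the semantic analysis step.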
The foregoing embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.
In the above embodiments of the present invention, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, refer to the related description of the other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed technical content may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division into units may be a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, units, or modules, and may be electrical or take other forms.
Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in hardware or as a software functional unit.
If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The storage medium includes various media that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
The above is only a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art may make several improvements and refinements without departing from the principles of the present invention, and these improvements and refinements shall also be regarded as falling within the scope of protection of the present invention.

Claims (10)

1. A key words extraction method, characterized by comprising:
obtaining all documents from which descriptors need to be extracted and the words appearing in the documents;
building a word-document matrix based on the frequency with which each word occurs in the documents, wherein each row of the word-document matrix represents the word frequency information of every word in one document, and each column represents the word frequency information of one word in every document;
performing semantic analysis on the word-document matrix using a latent semantic analysis model to generate a latent semantic space;
extracting, according to the latent semantic space, the descriptors of all the documents from which descriptors need to be extracted.
2. The method according to claim 1, characterized in that performing semantic analysis on the word-document matrix using a latent semantic analysis model to generate a latent semantic space comprises:
analyzing, using the latent semantic analysis model, the correspondence between the words and the documents in the word-document matrix;
mapping the words and documents in the word-document matrix, according to the correspondence, into a vector space that satisfies a predetermined dimension condition, to generate the latent semantic space.
3. method according to claim 1 and 2, it is characterised in that using latent semantic analysis model to institute's predicate Language document matrix carries out semantic analysis, and generation latent semantic space includes:
Using singular value decomposition model or Non-negative Matrix Factorization model or probability Vector Space Model to institute's predicate Language document matrix carries out semantic analysis, generates latent semantic space.
4. The method according to claim 1, characterized in that extracting, according to the latent semantic space, the descriptors of all the documents from which descriptors need to be extracted comprises:
determining a descriptor-word matrix according to the latent semantic space, wherein each row of the descriptor-word matrix represents a semantic category of descriptors, and each column represents a word appearing in the documents from which descriptors need to be extracted;
sorting the words in each row of the descriptor-word matrix by weight;
extracting, from the sorted descriptor-word matrix, the words whose weight is greater than a preset threshold as the descriptors of all the documents from which descriptors need to be extracted.
5. The method according to claim 1, characterized in that obtaining all documents from which descriptors need to be extracted and the words appearing in the documents comprises:
obtaining all the documents from which descriptors need to be extracted;
performing word segmentation on all the documents from which descriptors need to be extracted, to obtain the words appearing in the documents.
6. A key words extraction device, characterized by comprising:
an acquisition unit, configured to obtain all documents from which descriptors need to be extracted and the words appearing in the documents;
a construction unit, configured to build a word-document matrix based on the frequency with which each word occurs in the documents, wherein each row of the word-document matrix represents the word frequency information of every word in one document, and each column represents the word frequency information of one word in every document;
a generation unit, configured to perform semantic analysis on the word-document matrix using a latent semantic analysis model to generate a latent semantic space;
an extraction unit, configured to extract, according to the latent semantic space, the descriptors of all the documents from which descriptors need to be extracted.
7. The device according to claim 6, characterized in that the generation unit comprises:
an analysis module, configured to analyze, using the latent semantic analysis model, the correspondence between the words and the documents in the word-document matrix;
a generation module, configured to map the words and documents in the word-document matrix, according to the correspondence, into a vector space that satisfies a predetermined dimension condition, to generate the latent semantic space.
8. The device according to claim 6 or 7, characterized in that the generation unit is further configured to perform semantic analysis on the word-document matrix using a singular value decomposition model, a non-negative matrix factorization model, or a probabilistic vector space model, to generate the latent semantic space.
9. The device according to claim 6, characterized in that the extraction unit comprises:
a determining module, configured to determine a descriptor-word matrix according to the latent semantic space, wherein each row of the descriptor-word matrix represents a semantic category of descriptors, and each column represents a word appearing in the documents from which descriptors need to be extracted;
a sorting module, configured to sort the words in each row of the descriptor-word matrix by weight;
an extraction module, configured to extract, from the sorted descriptor-word matrix, the words whose weight is greater than a preset threshold as the descriptors of all the documents from which descriptors need to be extracted.
10. The device according to claim 6, characterized in that the acquisition unit comprises:
an acquisition module, configured to obtain all the documents from which descriptors need to be extracted;
a word segmentation module, configured to perform word segmentation on all the documents from which descriptors need to be extracted, to obtain the words appearing in the documents.
CN201510819148.2A 2015-11-23 2015-11-23 Method and device for extracting subject term Active CN106776530B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510819148.2A CN106776530B (en) 2015-11-23 2015-11-23 Method and device for extracting subject term

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510819148.2A CN106776530B (en) 2015-11-23 2015-11-23 Method and device for extracting subject term

Publications (2)

Publication Number Publication Date
CN106776530A true CN106776530A (en) 2017-05-31
CN106776530B CN106776530B (en) 2020-07-03

Family

ID=58963111

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510819148.2A Active CN106776530B (en) 2015-11-23 2015-11-23 Method and device for extracting subject term

Country Status (1)

Country Link
CN (1) CN106776530B (en)

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
周进华 等: "基于衰减词共现图的多文档摘要研究", 《小型微型计算机系统》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117494726A (en) * 2023-12-29 2024-02-02 成都航空职业技术学院 Information keyword extraction method
CN117494726B (en) * 2023-12-29 2024-04-12 成都航空职业技术学院 Information keyword extraction method

Also Published As

Publication number Publication date
CN106776530B (en) 2020-07-03

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing

Applicant after: Beijing Guoshuang Technology Co.,Ltd.

Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing

Applicant before: Beijing Guoshuang Technology Co.,Ltd.

GR01 Patent grant