CN106598997A - Method and device for computing membership degree of text subject - Google Patents
- Publication number
- CN106598997A (application number CN201510680277.8A)
- Authority
- CN
- China
- Prior art keywords
- sentence
- text
- measured
- keyword
- theme
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Class or cluster creation or modification
Abstract
The invention discloses a method and a device for computing a membership degree of a text subject, relates to the technical field of computers, and solves the problem that computation of the membership degree has much error due to the fact that subject keywords appearing in a text are uncorrelated with the text subject. The main technical scheme of the method provided by the invention comprises the steps of segmenting the text to be tested into sentences, thus obtaining a sentence list; finding the sentences including keywords in a subject keyword set from the sentence list according to the preset subject keyword set; determining position weighted values of the sentences according to the positions of the sentences in the text to be tested; and computing the subject membership degree of the text to be tested according to the position weighted values of the sentences and the quantity of the sentences including the keywords. The method and the device provided by the invention are mainly used for computing the membership degree of the text.
Description
Technical field
The present invention relates to the field of computer technology, and in particular to a method and device for computing the subject membership degree of a text.
Background
In the big data era, extracting relevant information is an important problem. Information extraction does not attempt to understand an entire document; it analyzes only the parts of the document that contain relevant information, and determines the subject of an article from the characteristic keywords extracted from it.
Most existing extraction algorithms judge whether an article belongs to a subject by checking whether the article contains characteristic keywords related to that subject. Using the mere presence of a keyword as the feature captures the relevant information in the article fairly completely, but the extracted information may contain a large amount of noise, because not every word in an article is closely tied to its subject. As a result, the final judgment of the article's subject may be the opposite of the truth, which increases the error of subsequent analysis.
Summary of the invention
In view of this, the present invention provides a method and device for computing the subject membership degree of a text. The main purpose is to compute the subject membership degree from the positions and frequency at which subject keywords appear in the text, thereby improving the accuracy of the membership judgment.
To achieve this purpose, the present invention mainly provides the following technical solutions.
In one aspect, the invention provides a method for computing the subject membership degree of a text, the method comprising:
segmenting the text to be tested into sentences to obtain a sentence list;
finding, according to a preset subject keyword set, the sentences in the sentence list that contain keywords from the subject keyword set;
determining the position weight of each sentence according to its position in the text to be tested;
computing the subject membership degree of the text to be tested according to the position weights of the sentences and the number of sentences containing keywords.
In another aspect, the invention provides a device for computing the subject membership degree of a text, the device comprising:
a sentence segmentation unit, configured to segment the text to be tested into sentences to obtain a sentence list;
a search unit, configured to find, according to a preset subject keyword set, the sentences containing keywords from the subject keyword set in the sentence list obtained by the sentence segmentation unit;
a determining unit, configured to determine the position weight of each sentence found by the search unit according to its position in the text to be tested;
a computing unit, configured to compute the subject membership degree of the text to be tested according to the position weights determined by the determining unit and the number of keyword-containing sentences found by the search unit.
The method and device proposed above segment the text to be tested into sentences, determine whether each sentence contains a subject keyword, and record the number of sentences that do. The position weight of each such sentence is then determined from its position in the text to be tested, and the subject membership degree of the text is obtained as the ratio of the sum of the position weights of the keyword-containing sentences to the sum of the position weights of all sentences. Compared with the prior art, the invention expresses the degree of correlation between the text and the subject through the size of this ratio, which avoids the overly absolute conclusions of a binary judgment. In addition, because the membership degree is computed from the position weights of sentences, the positions at which keywords appear in the text are quantified and included in the calculation. This reduces the analysis error caused by the noise described in the background section and thereby improves the accuracy of the membership judgment.
Description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art from the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be regarded as limiting the invention. Throughout the drawings, the same reference numerals denote the same parts. In the drawings:
Fig. 1 shows a flowchart of a method for computing the subject membership degree of a text according to an embodiment of the invention;
Fig. 2 shows a flowchart of another method for computing the subject membership degree of a text according to an embodiment of the invention;
Fig. 3 shows a block diagram of a device for computing the subject membership degree of a text according to an embodiment of the invention;
Fig. 4 shows a block diagram of another device for computing the subject membership degree of a text according to an embodiment of the invention.
Detailed description of the embodiments
Exemplary embodiments of the invention are described in more detail below with reference to the drawings. Although the drawings show exemplary embodiments of the invention, it should be understood that the invention may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the invention can be understood more thoroughly and its scope can be fully conveyed to those skilled in the art.
An embodiment of the invention provides a method for computing the subject membership degree of a text. As shown in Fig. 1, the method includes the following steps:
101. Segment the text to be tested into sentences to obtain a sentence list.
The text to be tested in this embodiment is mostly Chinese text, and a text that expresses a topic is usually a long or multi-paragraph article. Existing subject analysis methods mainly segment the article into words and then check whether those words include keywords related to the subject. However, existing word segmentation methods are relatively complex, and their accuracy still needs improvement. To avoid analysis errors caused by word segmentation, this embodiment instead splits the text to be tested into sentences and collects these sentences into a sentence list.
102. According to a preset subject keyword set, find the sentences in the sentence list that contain keywords from the set.
Besides splitting the text to be tested, a subject keyword set must also be created, containing a number of keywords related to the subject. Once the sentence list has been obtained, the sentences in the list are matched against the keywords in the subject keyword set, and the sentences containing keywords are selected.
When matching a sentence against the keywords, the sentence may be decomposed into words that are then matched against the keywords one by one; alternatively, each keyword may be taken into the sentence in turn and matched against the characters or words in it. This embodiment does not restrict the specific matching method. The purpose is simply to select the sentences in the sentence list that contain keywords.
103. Determine the position weight of each sentence according to its position in the text to be tested.
The position weight of a sentence expresses the importance of the sentence's position in the text. In an article, keywords related to the subject tend to appear in prominent and important positions, such as the title, the first paragraph, or the last paragraph. The position of a sentence is therefore correlated with the subject of the article. In this embodiment, the position weight derived from this correlation can be used to represent the degree of correlation between the sentence and the subject.
It should be noted that this embodiment does not restrict how the position weight is obtained: it may be derived by a fixed algorithm or set manually based on experience.
104. Compute the subject membership degree of the text to be tested according to the position weights of the sentences and the number of sentences containing keywords.
Once the position weights and the number of keyword-containing sentences have been determined, the sum of the position weights of all keyword-containing sentences is obtained by accumulation. After the position weights of all sentences in the text to be tested have also been computed, the share of the keyword-containing sentences' position weights in the position weights of all sentences can be derived. This share is the subject membership degree of the text to be tested relative to the subject. The higher the share, the more closely the subject of the text is related to the test subject.
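As a minimal sketch of step 104, the ratio can be computed directly from per-sentence weights. The function name, the input shape, and the numeric weights below are illustrative assumptions; the source does not prescribe them.

```python
def membership_degree(sentences):
    """Return the share of position weight carried by keyword sentences.

    sentences: list of (position_weight, has_keyword) pairs, one per sentence.
    This input shape is an assumption made for illustration.
    """
    total = sum(weight for weight, _ in sentences)
    if total == 0:
        # Assumption: a text with no weighted sentences belongs to no subject.
        return 0.0
    matched = sum(weight for weight, has_keyword in sentences if has_keyword)
    return matched / total


# Hypothetical text: four sentences with made-up position weights.
example = [(4.0, True), (2.0, False), (1.0, True), (1.0, False)]
degree = membership_degree(example)
print(degree)  # (4.0 + 1.0) / 8.0 = 0.625
```

A degree close to 1 would indicate that most of the weighted positions contain subject keywords; a degree close to 0 would indicate the opposite.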
As the above implementation shows, the method of this embodiment segments the text to be tested into sentences, determines whether each sentence contains a subject keyword, and records the number of sentences that do. The position weight of each such sentence is then determined from its position in the text, and the subject membership degree is obtained as the ratio of the sum of the position weights of the keyword-containing sentences to the sum of the position weights of all sentences. Compared with the prior art, the invention expresses the degree of correlation between the text and the subject through the size of this ratio, which avoids the overly absolute conclusions of a binary judgment. In addition, because the membership degree is computed from the position weights of sentences, the positions at which keywords appear in the text are quantified and included in the calculation. This reduces the analysis error caused by the noise described in the background section and thereby improves the accuracy of the membership judgment.
To explain the proposed method in more detail, a specific implementation is described below. As shown in Fig. 2, the method includes the following steps when computing the subject membership degree of a text:
201. Segment the text to be tested into sentences to obtain a sentence list.
Sentence segmentation is simple and easy compared with word segmentation: the sentences are obtained simply by breaking the text at fixed punctuation marks. In Chinese punctuation, the end of a sentence is usually marked by a full stop, a question mark, an exclamation mark, and so on. In this embodiment, these punctuation marks can therefore be selected in advance, and the text to be tested is then compared against them byte by byte. When a byte in the text is found to be one of the preset punctuation marks, the content between this point and the previous sentence break is extracted and stored in the sentence list as one sentence.
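The splitting procedure of step 201 can be sketched as follows. This is an illustration under stated assumptions: it scans Unicode characters rather than raw bytes, the terminator set `SENTENCE_END` is a hypothetical choice, and the name `split_sentences` is not from the source.

```python
# Assumed set of sentence-ending marks (full-width and ASCII); adjust as needed.
SENTENCE_END = set("。？！?!")


def split_sentences(text, terminators=SENTENCE_END):
    """Split text into sentences at the preset punctuation marks."""
    sentences = []
    start = 0  # index just after the previous sentence break
    for i, ch in enumerate(text):
        if ch in terminators:
            # Take everything from the previous break up to this mark.
            piece = text[start:i + 1].strip()
            if piece:
                sentences.append(piece)
            start = i + 1
    # Keep any trailing text that has no closing punctuation.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


print(split_sentences("今天下雨。你带伞了吗？带了！"))
```

Note that `strip()` discards surrounding whitespace, including line breaks, so a splitter that also needs paragraph positions would have to record those breaks before stripping.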
202. According to a preset subject keyword set, find the sentences in the sentence list that contain keywords from the set.
Before this step is performed, a number of subject keywords must first be obtained. They should be chosen by ranking candidates by their degree of correlation with the subject; with the number of keywords fixed, the keywords most highly correlated with the subject should be selected. Selecting highly correlated keywords also improves the accuracy of the membership judgment.
Once the subject keywords are determined, the sentences in the sentence list from step 201 are screened to select those containing subject keywords. Specifically, a sentence is taken from the sentence list and segmented into the words that make it up. These words are then matched against all the subject keywords, and if a word is identical to a keyword, the sentence is recorded as a sentence containing a subject keyword. There may be many subject keywords, but the purpose of this step is only to find the sentences that contain subject keywords, not to count how many keywords each sentence contains. Therefore, it is not necessary to match every word in the sentence against every keyword: as soon as one word matches a subject keyword, the matching process for that sentence stops and the sentence is recorded as containing a subject keyword. Any existing word segmentation method may be used to segment the sentence, and the details are not repeated here.
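The early-exit matching described in step 202 might look like the sketch below. The function names are hypothetical, and the default tokenizer (`str.split`, which splits on whitespace) is only a stand-in: Chinese text would need a real word segmenter passed in as `tokenize`.

```python
def contains_keyword(sentence, keywords, tokenize=str.split):
    """Return True as soon as one word of the sentence is a subject keyword."""
    keyword_set = set(keywords)
    for token in tokenize(sentence):
        if token in keyword_set:
            # One hit is enough; the remaining words are not examined.
            return True
    return False


def find_keyword_sentences(sentences, keywords, tokenize=str.split):
    """Select the sentences that contain at least one subject keyword."""
    return [s for s in sentences if contains_keyword(s, keywords, tokenize)]


sentences = ["the stock market fell", "rain is expected"]
hits = find_keyword_sentences(sentences, ["market", "shares"])
print(hits)  # ['the stock market fell']
```

Converting the keywords to a set keeps each lookup cheap, which matters when the keyword set is large.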
203. Determine the position weight of each sentence according to its position in the text to be tested.
The positions of sentences in an article can be roughly divided into the title position, the first-paragraph position, and the paragraph-initial and paragraph-final positions. Different position weights are set for different positions according to the probability that subject keywords appear there. In general, the position weights in descending order are the title sentence weight, the first-paragraph sentence weight, the paragraph-initial and paragraph-final sentence weight, and the ordinary sentence weight.
The position of a sentence in the text to be tested can be marked during sentence segmentation, using fixed marker characters. For example, title sentences of different levels can be distinguished by the style of the text. A carriage return character identifies paragraph boundaries: the sentence after it is the first sentence of a paragraph, and the sentence before it is the last sentence of a paragraph. The first-paragraph sentences are those between the title sentence and the first carriage return character after it. This strategy is sufficient to mark sentence positions, although this embodiment does not restrict the specific marking method. The purpose is to assign different position weights to sentences according to their positions.
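The position categories of step 203 can be illustrated with a small labeling routine. It assumes the text has already been separated into title sentences and paragraphs (each a list of sentences); the labels, the function name, and the numeric weights are made up for illustration and only respect the ordering given above.

```python
# Illustrative weights only; the source fixes their order, not their values.
POSITION_WEIGHTS = {
    "title": 4.0,
    "para_first": 3.0,   # sentences in the first paragraph
    "first_last": 2.0,   # first or last sentence of a later paragraph
    "common": 1.0,
}


def label_positions(title_sentences, paragraphs):
    """Return (sentence, position_label) pairs for the whole text."""
    labeled = [(s, "title") for s in title_sentences]
    for p_index, paragraph in enumerate(paragraphs):
        last = len(paragraph) - 1
        for s_index, sentence in enumerate(paragraph):
            if p_index == 0:
                label = "para_first"
            elif s_index == 0 or s_index == last:
                label = "first_last"
            else:
                label = "common"
            labeled.append((sentence, label))
    return labeled


labeled = label_positions(["T"], [["a", "b"], ["c", "d", "e"], ["f"]])
weights = [POSITION_WEIGHTS[label] for _, label in labeled]
print(labeled)
print(weights)
```

When a sentence fits more than one category, this sketch gives it the higher-weighted one; that tie-breaking rule is an assumption, since the source does not state one.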
204. Accumulate the position weights of the sentences containing keywords to obtain the subject sentence weight of the text to be tested.
The subject sentence weight is the sum of the position weights of the sentences containing keywords. To compute it, the number of sentences sharing each position weight can be counted, and the products of each position weight and its count are then summed. The subject sentence weight shows how frequently the subject keywords appear at the key positions of the text to be tested, which gives a relatively intuitive indication of how close the subject of the text is to the subject expressed by the keywords.
The calculation may follow the formula below:
$$B = N_{\mathrm{title}} \cdot \mathrm{Weight}_{\mathrm{title}} + N_{\mathrm{first\text{-}last}} \cdot \mathrm{Weight}_{\mathrm{first\text{-}last}} + N_{\mathrm{para\text{-}first}} \cdot \mathrm{Weight}_{\mathrm{para\text{-}first}} + N_{\mathrm{common}} \cdot \mathrm{Weight}_{\mathrm{common}}$$
where $B$ is the subject sentence weight; $N_{\mathrm{title}}$, $N_{\mathrm{first\text{-}last}}$, $N_{\mathrm{para\text{-}first}}$ and $N_{\mathrm{common}}$ are the numbers of keyword-containing title sentences, paragraph-initial and paragraph-final sentences, first-paragraph sentences, and ordinary sentences respectively; and $\mathrm{Weight}_{\mathrm{title}}$, $\mathrm{Weight}_{\mathrm{first\text{-}last}}$, $\mathrm{Weight}_{\mathrm{para\text{-}first}}$ and $\mathrm{Weight}_{\mathrm{common}}$ are the corresponding position weights.
205. Compute the total weight of the text to be tested.
The total weight is the sum of the position weights of all sentences in the text to be tested.
The formula is as follows:
$$B_{\mathrm{all}} = N_{\mathrm{title\text{-}all}} \cdot \mathrm{Weight}_{\mathrm{title}} + N_{\mathrm{first\text{-}last\text{-}all}} \cdot \mathrm{Weight}_{\mathrm{first\text{-}last}} + N_{\mathrm{para\text{-}first\text{-}all}} \cdot \mathrm{Weight}_{\mathrm{para\text{-}first}} + N_{\mathrm{common\text{-}all}} \cdot \mathrm{Weight}_{\mathrm{common}}$$
where $B_{\mathrm{all}}$ is the total weight, $N_{\mathrm{title\text{-}all}}$ is the number of all title sentences, $N_{\mathrm{first\text{-}last\text{-}all}}$ is the number of all paragraph-initial and paragraph-final sentences, $N_{\mathrm{para\text{-}first\text{-}all}}$ is the number of all first-paragraph sentences, and $N_{\mathrm{common\text{-}all}}$ is the number of all ordinary sentences.
206. Compute the quotient of the subject sentence weight and the total weight to obtain the subject membership degree of the text to be tested.
The subject membership degree of the text to be tested is a coefficient measuring the similarity between the subject or central idea of the text and the test subject. The subject sentence weight reflects the amount of content in the text that is related to the test subject. Dividing it by the total weight therefore gives the proportion of the total content that is related to the test subject, which is the subject membership degree of the text, i.e. the value of $B / B_{\mathrm{all}}$.
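Steps 204 to 206 can be sketched with per-category counts, mirroring the count-times-weight form of the formulas for B and the total weight. The category labels, the weight values, and the function names are assumptions chosen for the example.

```python
from collections import Counter

# Illustrative weights; only their descending order comes from the source.
WEIGHTS = {"title": 4.0, "para_first": 3.0, "first_last": 2.0, "common": 1.0}


def weighted_sum(labels, weights=WEIGHTS):
    """Sum of count * weight over the position categories present in labels."""
    counts = Counter(labels)
    return sum(counts[label] * weights[label] for label in counts)


def subject_membership(labeled_flags, weights=WEIGHTS):
    """labeled_flags: list of (position_label, has_keyword) pairs."""
    # B: only the sentences that contain a subject keyword.
    b = weighted_sum([label for label, hit in labeled_flags if hit], weights)
    # B_all: every sentence in the text.
    b_all = weighted_sum([label for label, _ in labeled_flags], weights)
    return b / b_all if b_all else 0.0


text = [
    ("title", True),
    ("para_first", True),
    ("para_first", False),
    ("first_last", False),
    ("first_last", False),
    ("common", True),
    ("common", False),
]
print(subject_membership(text))  # B = 8.0, B_all = 16.0, ratio 0.5
```

Grouping by category before multiplying gives the same result as summing the weights sentence by sentence; it simply follows the layout of the formulas.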
Further, as an implementation of the above method, an embodiment of the invention also provides a device for computing the subject membership degree of a text, as shown in Fig. 3. This device embodiment corresponds to the method embodiment above. For ease of reading, the details of the method embodiment are not repeated one by one here, but it should be understood that the device in this embodiment can implement everything in the method embodiment. The device includes:
a sentence segmentation unit 31, configured to segment the text to be tested into sentences to obtain a sentence list;
a search unit 32, configured to find, according to a preset subject keyword set, the sentences containing keywords from the subject keyword set in the sentence list obtained by the sentence segmentation unit 31;
a determining unit 33, configured to determine the position weight of each sentence found by the search unit 32 according to its position in the text to be tested;
a computing unit 34, configured to compute the subject membership degree of the text to be tested according to the position weights determined by the determining unit 33 and the number of keyword-containing sentences found by the search unit 32.
Further, as shown in Fig. 4, the determining unit 33 of the device includes:
a first determining module 331, configured to determine the position of the sentence in the text to be tested, the positions including the title, the first paragraph, the paragraph-initial and paragraph-final positions, and other ordinary positions;
a second determining module 332, configured to determine the position weight corresponding to the sentence according to the position determined by the first determining module 331, the position weights being ordered by degree of correlation with the subject and including the title sentence weight, the first-paragraph sentence weight, the paragraph-initial and paragraph-final sentence weight, and the ordinary sentence weight.
Further, as shown in Fig. 4, the search unit 32 of the device includes:
a word segmentation module 321, configured to segment the sentences in the sentence list into words;
a matching module 322, configured to match the words obtained by the word segmentation module 321 against the keywords;
a recording module 323, configured to record the sentence as a sentence containing a keyword when the matching module 322 finds a match.
Further, the matching module 322 is configured to match the words against the keywords in the subject keyword set one by one, and to stop matching the remaining words of a sentence once one word in the sentence matches a keyword.
Further, as shown in Fig. 4, the computing unit 34 includes:
a first computing module 341, configured to accumulate the position weights of the sentences containing keywords to obtain the subject sentence weight of the text to be tested;
a second computing module 342, configured to compute the total weight of the text to be tested, the total weight being the sum of the position weights of all sentences;
a third computing module 343, configured to compute the quotient of the subject sentence weight obtained by the first computing module 341 and the total weight obtained by the second computing module 342 to obtain the subject membership degree of the text to be tested.
Further, the sentence segmentation unit 31 of the device is also configured to segment the text to be tested into sentences according to predetermined punctuation marks.
In summary, the method and device of the embodiments segment the text to be tested into sentences, determine whether each sentence contains a subject keyword, and record the number of sentences that do. The position weight of each such sentence is then determined from its position in the text, and the subject membership degree is obtained as the ratio of the sum of the position weights of the keyword-containing sentences to the sum of the position weights of all sentences. Compared with the prior art, the invention expresses the degree of correlation between the text and the subject through the size of this ratio, which avoids the overly absolute conclusions of a binary judgment. In addition, because the membership degree is computed from the position weights of sentences, the positions at which keywords appear in the text are quantified and included in the calculation. This reduces the analysis error caused by the noise described in the background section and thereby improves the accuracy of the membership judgment.
The device for computing the subject membership degree of a text includes a processor and a memory. The sentence segmentation unit, search unit, determining unit, computing unit and so on are stored in the memory as program units, and the processor executes these program units to implement the corresponding functions.
The processor contains a kernel, which fetches the corresponding program unit from the memory. One or more kernels may be provided. The subject membership degree of the text to be tested relative to the preset subject is computed by adjusting kernel parameters, thereby improving the accuracy of the membership judgment.
The memory may include volatile memory in a computer-readable medium, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). The memory includes at least one memory chip.
The present application also provides a computer program product which, when executed on a data processing device, is adapted to execute program code initialized with the following method steps: segmenting the text to be tested into sentences to obtain a sentence list; finding, according to a preset subject keyword set, the sentences in the sentence list that contain keywords from the subject keyword set; determining the position weight of each sentence according to its position in the text to be tested; and computing the subject membership degree of the text to be tested according to the position weights of the sentences and the number of sentences containing keywords.
Those skilled in the art should understand that embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
The application is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to embodiments of the application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce a computer-implemented process. The instructions executed on the computer or other programmable device thus provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, a computing device includes one or more processors (CPUs), an input/output interface, a network interface, and memory.
The memory may include volatile memory in a computer-readable medium, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, which can store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
The above are merely embodiments of the present application and are not intended to limit the present application. For those skilled in the art, the present application may have various modifications and variations. Any modification, equivalent substitution, improvement, and the like made within the spirit and principle of the present application shall fall within the scope of the claims of the present application.
Claims (10)
1. A method for calculating the subject membership degree of a text, characterized in that the method comprises:
performing sentence segmentation on a text to be tested to obtain a sentence list;
searching the sentence list, according to a preset subject keyword set, for sentences containing a keyword from the subject keyword set;
determining a position weight value of the sentence according to the position of the sentence in the text to be tested;
calculating the subject membership degree of the text to be tested according to the position weight value of the sentence and the number of sentences containing a keyword.
2. The method according to claim 1, characterized in that determining the position weight value of the sentence according to the position of the sentence in the text to be tested comprises:
determining the position of the sentence in the text to be tested, the position including a title sentence position, a first-paragraph sentence position, a first-and-last sentence position, and an ordinary sentence position;
determining the position weight value corresponding to the sentence according to the position of the sentence in the text to be tested, the position weight value being set according to the degree of correlation between the position and the subject, and including a title sentence weight value, a first-paragraph sentence weight value, a first-and-last sentence weight value, and an ordinary sentence weight value.
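As an illustration of the position-to-weight mapping in claim 2, the following is a minimal sketch rather than the patented implementation. The numeric weights, the precedence order (title, then first paragraph, then first or last sentence), the reading of "first-and-last sentence" as the first or last sentence of a paragraph, and the names `Position`, `POSITION_WEIGHTS`, `classify_position`, and `position_weight` are all assumptions introduced here; the claims only state that each weight reflects how strongly a position correlates with the subject.

```python
from enum import Enum


class Position(Enum):
    TITLE = "title"
    FIRST_PARAGRAPH = "first_paragraph"
    FIRST_OR_LAST = "first_or_last"
    ORDINARY = "ordinary"


# Hypothetical values: the claims give no numbers for the weights.
POSITION_WEIGHTS = {
    Position.TITLE: 4.0,
    Position.FIRST_PARAGRAPH: 3.0,
    Position.FIRST_OR_LAST: 2.0,
    Position.ORDINARY: 1.0,
}


def classify_position(paragraph_index, sentence_index, paragraph_length, is_title=False):
    """Map a sentence's location to one of the four position classes.

    Assumed precedence: title > first paragraph > first/last sentence of a
    paragraph > ordinary sentence.
    """
    if is_title:
        return Position.TITLE
    if paragraph_index == 0:
        return Position.FIRST_PARAGRAPH
    if sentence_index in (0, paragraph_length - 1):
        return Position.FIRST_OR_LAST
    return Position.ORDINARY


def position_weight(position):
    """Look up the weight assigned to a position class."""
    return POSITION_WEIGHTS[position]
```

In practice the weights would be tuned to the corpus; the only property this sketch relies on is that positions more closely tied to the subject receive larger values.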
3. The method according to claim 2, characterized in that searching the sentence list, according to the preset subject keyword set, for sentences containing a keyword from the subject keyword set comprises:
performing word segmentation on the sentences in the sentence list;
matching the segmented words against the keywords;
if the match succeeds, recording the sentence as a sentence containing a keyword.
4. The method according to claim 3, characterized in that matching the segmented words against the keywords comprises:
matching the segmented words one by one against the keywords in the subject keyword set;
when a segmented word in the sentence is successfully matched with a keyword, no longer matching the other segmented words in the sentence.
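The matching steps of claims 3 and 4 can be sketched as follows. This is an assumption-laden illustration: `tokenize` is a whitespace-splitting stand-in for a real word segmenter (Chinese text would need a dedicated segmentation tool), matching is by exact token equality, and the function names are hypothetical.

```python
def tokenize(sentence):
    # Stand-in for a real word segmenter: splits on whitespace only.
    return sentence.split()


def contains_keyword(sentence, keywords):
    """Return True as soon as one segmented word matches a keyword."""
    for token in tokenize(sentence):
        if token in keywords:
            # First hit is enough; the remaining words are not examined.
            return True
    return False


def find_keyword_sentences(sentences, keywords):
    """Return the indices of sentences that contain at least one keyword."""
    keyword_set = set(keywords)
    return [i for i, s in enumerate(sentences) if contains_keyword(s, keyword_set)]


sentences = [
    "the engine uses diesel fuel",
    "weather is nice today",
    "fuel prices rose",
]
print(find_keyword_sentences(sentences, ["fuel", "engine"]))  # [0, 2]
```

Stopping at the first match means a sentence counts once no matter how many keywords it contains, which is what the later weighting step needs: it sums one position weight per matching sentence.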
5. The method according to any one of claims 1-4, characterized in that calculating the subject membership degree of the text to be tested according to the position weight value of the sentence and the number of sentences containing a keyword comprises:
accumulating the position weight values of the sentences containing the keyword to obtain a subject sentence weight value of the text to be tested;
calculating a total weight value of the text to be tested, the total weight value being the sum of the position weight values of all sentences;
calculating the quotient of the subject sentence weight value and the total weight value to obtain the subject membership degree of the text to be tested.
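The quotient in claim 5 can be written as a short function. This is a sketch under stated assumptions: the function name `subject_membership`, the parallel-list input format, and the choice to return 0.0 for a text whose total weight is zero are not taken from the source.

```python
def subject_membership(weights, keyword_flags):
    """Ratio of the summed weights of keyword sentences to the summed weights of all sentences.

    weights[i] is the position weight of sentence i; keyword_flags[i] is True
    when sentence i contains a keyword.
    """
    if len(weights) != len(keyword_flags):
        raise ValueError("weights and keyword_flags must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        # Assumption: an empty or zero-weight text has membership 0.
        return 0.0
    subject_sentence_weight = sum(w for w, hit in zip(weights, keyword_flags) if hit)
    return subject_sentence_weight / total_weight


# Four sentences with weights 4, 3, 1, 2; the first and third contain a keyword.
print(subject_membership([4.0, 3.0, 1.0, 2.0], [True, False, True, False]))  # 0.5
```

In the example the subject sentence weight is 4 + 1 = 5 and the total weight is 10, giving 0.5. Provided all weights are non-negative, the result lies between 0 and 1, and a keyword in a heavily weighted position raises it more than the same keyword in an ordinary sentence.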
6. The method according to claim 1, characterized in that performing sentence segmentation on the text to be tested to obtain the sentence list comprises:
performing sentence segmentation on the text to be tested according to predetermined punctuation marks.
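Punctuation-based sentence segmentation, as in claim 6, might look like the sketch below. The particular set of punctuation marks and the names `SENTENCE_END` and `split_sentences` are assumptions; the claim only says the marks are predetermined.

```python
import re

# Hypothetical set of sentence-ending marks (Chinese and Western).
SENTENCE_END = re.compile(r"[。！？；!?;.]+")


def split_sentences(text):
    """Split text at the predetermined punctuation marks and drop empty pieces."""
    return [part.strip() for part in SENTENCE_END.split(text) if part.strip()]


print(split_sentences("Fuel prices rose. Why? Nobody knows!"))
# ['Fuel prices rose', 'Why', 'Nobody knows']
```

Because this simple pattern treats every period as a boundary, decimals and abbreviations would be split as well; a production splitter would need extra rules for those cases.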
7. A device for calculating the subject membership degree of a text, characterized in that the device comprises:
a sentence segmentation unit, configured to perform sentence segmentation on a text to be tested to obtain a sentence list;
a searching unit, configured to search, according to a preset subject keyword set, the sentence list obtained by the sentence segmentation unit for sentences containing a keyword from the subject keyword set;
a determining unit, configured to determine the position weight value of the sentence found by the searching unit according to the position of the sentence in the text to be tested;
a computing unit, configured to calculate the subject membership degree of the text to be tested according to the position weight value of the sentence determined by the determining unit and the number of sentences containing a keyword found by the searching unit.
8. The device according to claim 7, characterized in that the determining unit comprises:
a first determining module, configured to determine the position of the sentence in the text to be tested, the position including the title, the first paragraph, first and last sentences, and other ordinary positions;
a second determining module, configured to determine the position weight value corresponding to the sentence according to the position of the sentence in the text to be tested determined by the first determining module, the position weight value being set according to the degree of correlation between the position and the subject, and including a title sentence weight value, a first-paragraph sentence weight value, a first-and-last sentence weight value, and an ordinary sentence weight value.
9. The device according to claim 7 or 8, characterized in that the searching unit comprises:
a word segmentation module, configured to perform word segmentation on the sentences in the sentence list;
a matching module, configured to match the segmented words obtained by the word segmentation module against the keywords;
a recording module, configured to record the sentence as a sentence containing a keyword when the matching module matches successfully.
10. The device according to any one of claims 7-9, characterized in that the computing unit comprises:
a first computing module, configured to accumulate the position weight values of the sentences containing the keyword to obtain a subject sentence weight value of the text to be tested;
a second computing module, configured to calculate a total weight value of the text to be tested, the total weight value being the sum of the position weight values of all sentences;
a third computing module, configured to calculate the quotient of the subject sentence weight value obtained by the first computing module and the total weight value obtained by the second computing module to obtain the subject membership degree of the text to be tested.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510680277.8A CN106598997B (en) | 2015-10-19 | 2015-10-19 | Method and device for calculating text theme attribution degree |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106598997A (en) | 2017-04-26 |
CN106598997B (en) | 2021-05-18 |
Family ID=58555102
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510680277.8A Active CN106598997B (en) | 2015-10-19 | 2015-10-19 | Method and device for calculating text theme attribution degree |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106598997B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101231634A (en) * | 2007-12-29 | 2008-07-30 | 中国科学院计算技术研究所 | Autoabstract method for multi-document |
CN101315624A (en) * | 2007-05-29 | 2008-12-03 | 阿里巴巴集团控股有限公司 | Text subject recommending method and device |
CN103136300A (en) * | 2011-12-05 | 2013-06-05 | 北京百度网讯科技有限公司 | Recommendation method and device of text related subject |
CN103744953A (en) * | 2014-01-02 | 2014-04-23 | 中国科学院计算机网络信息中心 | Network hotspot mining method based on Chinese text emotion recognition |
US20150341751A1 (en) * | 2013-11-07 | 2015-11-26 | a la mode technologies, inc. | System and Method for Gathering Information about a Subject in Close Proximity to a User |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107704763A (en) * | 2017-09-04 | 2018-02-16 | 中国移动通信集团广东有限公司 | Multi-source heterogeneous leak information De-weight method, stage division and device |
CN109657202A (en) * | 2017-10-10 | 2019-04-19 | 北京国双科技有限公司 | The method and device of text-processing |
CN109657202B (en) * | 2017-10-10 | 2022-10-28 | 北京国双科技有限公司 | Text processing method and device |
CN111369294A (en) * | 2020-03-06 | 2020-07-03 | 中国铁塔股份有限公司 | Software cost estimation method and device |
CN111369294B (en) * | 2020-03-06 | 2023-06-23 | 中国铁塔股份有限公司 | Software cost estimation method and device |
CN111581358A (en) * | 2020-04-08 | 2020-08-25 | 北京百度网讯科技有限公司 | Information extraction method and device and electronic equipment |
CN111581358B (en) * | 2020-04-08 | 2023-08-18 | 北京百度网讯科技有限公司 | Information extraction method and device and electronic equipment |
CN111950037A (en) * | 2020-08-25 | 2020-11-17 | 北京天融信网络安全技术有限公司 | Detection method, detection device, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN106598997B (en) | 2021-05-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105389349B (en) | Dictionary update method and device | |
CN107229668B (en) | Text extraction method based on keyword matching | |
CN106649260B (en) | Product characteristic structure tree construction method based on comment text mining | |
CN106598997A (en) | Method and device for computing membership degree of text subject | |
CN108920456B (en) | Automatic keyword extraction method | |
CN103336766B (en) | Short text garbage identification and modeling method and device | |
CN104199965B (en) | Semantic information retrieval method | |
CN107463658B (en) | Text classification method and device | |
CN104077407B (en) | A kind of intelligent data search system and method | |
CN106570128A (en) | Mining algorithm based on association rule analysis | |
CN106598999A (en) | Method and device for calculating text theme membership degree | |
CN103309852A (en) | Method for discovering compound words in specific field based on statistics and rules | |
CN107229627B (en) | Text processing method and device and computing equipment | |
CN106446124A (en) | Website classification method based on network relation graph | |
CN103324641B (en) | Information record recommendation method and device | |
CN111860981B (en) | Enterprise national industry category prediction method and system based on LSTM deep learning | |
CN114997288A (en) | Design resource association method | |
CN114385775A (en) | Sensitive word recognition method based on big data | |
CN107357794A (en) | Optimize the method and apparatus of the data store organisation of key value database | |
CN106919576A (en) | Using the method and device of two grades of classes keywords database search for application now | |
CN107992402A (en) | Blog management method and log management apparatus | |
CN106815209B (en) | Uygur agricultural technical term identification method | |
CN107861950A (en) | The detection method and device of abnormal text | |
CN106934024A (en) | A kind of data processing method and device | |
CN102789466A (en) | Question title quality judgment method and device and question guiding method and device | |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
CB02 | Change of applicant information | Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing. Applicant after: Beijing Guoshuang Technology Co.,Ltd. Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing. Applicant before: Beijing Guoshuang Technology Co.,Ltd. |
GR01 | Patent grant | |