CN108268602A - Method, apparatus, device and computer storage medium for analyzing text topic points
- Publication number: CN108268602A
- Application number: CN201711390850.7A
- Authority: CN (China)
- Prior art keywords: word, text data, data, text, primary
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G06F16/35: Information retrieval of unstructured textual data; Clustering; Classification
- G06F40/253: Natural language analysis; Grammatical analysis; Style critique
- G06F40/289: Recognition of textual entities; Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/30: Handling natural language data; Semantic analysis
Abstract
The present invention provides a method, apparatus, device and computer storage medium for analyzing text topic points. The method includes: obtaining text data; extracting primary words from the text data; and performing syntactic analysis on the text data and obtaining the topic point of the text data according to the syntactic structure content in the text data that is related to the primary words. With the technical solution provided by the present invention, the obtained topic point is important, fluent and on-topic, and accurately expresses the core semantics of the original text data, thereby improving the accuracy of text topic point analysis.
Description
【Technical field】
The present invention relates to natural language processing, and in particular to a method, apparatus, device and computer storage medium for analyzing text topic points.
【Background】
When analyzing text topic points, the prior art usually predicts the topic of a text with a topic model. This approach has the following disadvantages: because a topic model is in effect a classification model over specific topic categories, it can only yield those specific categories, and the number of categories is limited; and the topics produced by a topic model are highly abstract, which makes it difficult to accurately describe the core semantics of the text. There is therefore an urgent need for a method that can accurately analyze text topic points.
【Summary of the invention】
In view of this, the present invention provides a method, apparatus, device and computer storage medium for analyzing text topic points, in order to improve the accuracy of text topic point analysis.
The technical solution adopted by the present invention to solve the technical problem is to provide a method for analyzing text topic points, the method including: obtaining text data; extracting primary words from the text data; and performing syntactic analysis on the text data, and obtaining the topic point of the text data according to the syntactic structure content in the text data that is related to the primary words.
According to a preferred embodiment of the present invention, extracting primary words from the text data includes: extracting, from the text data, words that meet a preset part-of-speech requirement as primary words; and/or determining the importance score of each word in the text data, and extracting words that meet a preset score requirement as primary words.
According to a preferred embodiment of the present invention, determining the importance score of each word in the text data includes: determining the importance score of each word in the text data based on statistical indicators of words in large-scale data; or inputting each word in the text data into a pre-trained word ranking model, and determining the importance score of each word in the text data according to the output of the word ranking model.
According to a preferred embodiment of the present invention, the word ranking model is trained in advance as follows: obtaining training data, the training data including text data annotated with the importance score of each word; and training a deep learning model, with each word of the text data in the training data as input and the importance score of each word in the text data as output, to obtain the word ranking model.
According to a preferred embodiment of the present invention, obtaining the topic point of the text data according to the syntactic structure content in the text data that is related to the primary words includes: obtaining the syntax tree of the text data; determining, according to the obtained syntax tree, the syntactic structure content related to the primary words; and combining the determined syntactic structure content to obtain the topic point of the text data.
According to a preferred embodiment of the present invention, combining the determined syntactic structure content includes: selecting, from the determined syntactic structure content, the content that meets a preset syntactic structure requirement, and combining it.
The technical solution adopted by the present invention to solve the technical problem is to provide an apparatus for analyzing text topic points, the apparatus including: an acquiring unit, for obtaining text data; an extraction unit, for extracting primary words from the text data; and a processing unit, for performing syntactic analysis on the text data and obtaining the topic point of the text data according to the syntactic structure content in the text data that is related to the primary words.
According to a preferred embodiment of the present invention, when extracting primary words from the text data, the extraction unit specifically performs: extracting, from the text data, words that meet a preset part-of-speech requirement as primary words; and/or determining the importance score of each word in the text data, and extracting words that meet a preset score requirement as primary words.
According to a preferred embodiment of the present invention, when determining the importance score of each word in the text data, the extraction unit specifically performs: determining the importance score of each word in the text data based on statistical indicators of words in large-scale data; or inputting each word in the text data into a pre-trained word ranking model, and determining the importance score of each word in the text data according to the output of the word ranking model.
According to a preferred embodiment of the present invention, the apparatus further includes a training unit, for training the word ranking model in advance as follows: obtaining training data, the training data including text data annotated with the importance score of each word; and training a deep learning model, with each word of the text data in the training data as input and the importance score of each word in the text data as output, to obtain the word ranking model.
According to a preferred embodiment of the present invention, when obtaining the topic point of the text data according to the syntactic structure content in the text data that is related to the primary words, the processing unit specifically performs: obtaining the syntax tree of the text data; determining, according to the obtained syntax tree, the syntactic structure content related to the primary words; and combining the determined syntactic structure content to obtain the topic point of the text data.
According to a preferred embodiment of the present invention, when combining the determined syntactic structure content, the processing unit specifically performs: selecting, from the determined syntactic structure content, the content that meets a preset syntactic structure requirement, and combining it.
As can be seen from the above technical solutions, the present invention extracts the primary words of the original text data and then obtains the topic point of the original text data based on the syntactic structure of the original text data and the primary words. The obtained topic point is therefore important, fluent and on-topic, and accurately expresses the core semantics of the original text data, thereby improving the accuracy of text topic point analysis.
【Brief description of the drawings】
Fig. 1 is a flowchart of a method for analyzing text topic points provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of the syntactic structure of text data provided by an embodiment of the present invention;
Fig. 3 is a structural diagram of an apparatus for analyzing text topic points provided by an embodiment of the present invention;
Fig. 4 is a block diagram of a computer system/server provided by an embodiment of the present invention.
【Detailed description】
To make the objectives, technical solutions and advantages of the present invention clearer, the present invention is described in detail below with reference to the drawings and specific embodiments.
The terms used in the embodiments of the present invention are for the purpose of describing specific embodiments only and are not intended to limit the present invention. The singular forms "a", "said" and "the" used in the embodiments of the present invention and the appended claims are also intended to include plural forms, unless the context clearly indicates otherwise.
It should be understood that the term "and/or" used herein merely describes an association between associated objects and indicates that three relationships are possible. For example, "A and/or B" can mean: A alone, both A and B, or B alone. In addition, the character "/" herein generally indicates an "or" relationship between the objects before and after it.
Depending on the context, the word "if" as used herein may be interpreted as "at the time of ..." or "when ..." or "in response to determining" or "in response to detecting". Similarly, depending on the context, the phrase "if it is determined" or "if (a stated condition or event) is detected" may be interpreted as "when it is determined" or "in response to determining" or "when (the stated condition or event) is detected" or "in response to detecting (the stated condition or event)".
Fig. 1 is a flowchart of a method for analyzing text topic points provided by an embodiment of the present invention. As shown in Fig. 1, the method includes:
In 101, text data is obtained.
In this step, the obtained text data may be text consisting of a single character string or of multiple character strings. The text data may be a sentence, a phrase, etc. in Chinese. The obtained text data may be text data in a text format, or text data converted from data obtained in a non-text format such as voice or images.
In 102, primary word is extracted from the text data.
In this step, the primary words of the text data are extracted from the text data obtained in step 101 according to a preset extraction requirement.
Specifically, the primary words of the text data may be extracted as follows: word segmentation is performed on the text data to obtain its word segmentation result; then, based on the word segmentation result, the words that meet the preset extraction requirement are extracted as the primary words of the text data. In this step, the preset extraction requirement includes at least one of a preset part-of-speech requirement and a preset score requirement.
Specifically, primary words may be extracted from the text data in any of the following ways:
(1) word for meeting preset part of speech requirement in text data is extracted as primary word.
Wherein, preset part of speech requirement can be notional word, such as common noun, proper noun, the verb for having actual demand
Deng.In the primary word during this kind of mode is used to extract text data, can be determined in text data by part of speech analytical technology
Then the part of speech of each word requires according to preset part of speech, extracts primary word of the word met the requirements as text data.Example
Such as, if the requirement of preset part of speech is noun, acquired text data is " I likes A ", the corresponding cutting word result of this article notebook data
For " I ", " love " and " A ", if wherein " A " represents city name, the part of speech of " A " is noun, then extracts " A " as the text
The primary word of data.
(2) word for meeting preset score requirement in text data is extracted as primary word.
Wherein, it is more than predetermined threshold value that preset score requirement, which can be the importance score of each word in text data,;Also
Can according in text data each word importance score, choose and come the word of top N, wherein N is positive integer.Citing
For, if text data is " I likes AB ", the importance score of each word is respectively " I 0.168497 ", " love in cutting word result
0.221857 ", " A 0.203215 " and " B 0.406431 ", if wherein " A " represents city name, " B " represents sight spot name, if in advance
If score requirement to choose the word that makes number one as primary word, then choose the primary word of " B " as text data.
It specifically, can be based on word in large-scale data in the importance of each word in obtaining text data
Statistical indicator obtains the importance score of each word in text data.For example, the TF-IDF of text data can be passed through
The calculating knot of the information such as (termfrequency-inversedocumentfrequency, term frequency-inverse document frequency), mutual information
Fruit, to obtain the importance score of each word in text data.The word order models that training obtains in advance can also be used, it will
After the cutting word result of text data inputs the model, according to the output of the model as a result, obtaining the weight of each word in text data
The property wanted score.
The word ranking model may be trained in advance as follows: training data is obtained, the training data including text data annotated with the importance score of each word; a deep learning model is then trained with each word of the text data in the training data as input and the importance score of each word as output, yielding the word ranking model. The deep learning model may be, for example, a multilayer perceptron model, a convolutional neural network model or a recurrent neural network model. With the word ranking model, the importance score of each word can be obtained from the words of the input text data.
(3) it extracts and meets the word of the requirement of preset part of speech and the requirement of preset score in text data simultaneously as should
The primary word of text data.
In this kind of mode, the part of speech of each word and importance score in text data need to be obtained simultaneously, it is pre- by meeting
If part of speech requirement and score requirement primary word of the word as this article notebook data.For example, if being wrapped in text data
During the word for meeting the requirement of preset part of speech containing multiple, then required according to preset score, importance score is sorted in top N
Primary word of the word as this article notebook data, wherein N can be preset more than 1 integer;It is if alternatively, each in text data
The importance score of word sorts when the word of top N has various parts of speech, then makees the word for meeting preset part of speech requirement
For the primary word of this article notebook data, wherein N can be preset more than 1 integer.It is understood that the present invention is to from text
The number of the primary word extracted in data can be one or multiple without limiting.
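The three extraction modes above can be sketched as a single filter. This is only an illustration under stated assumptions, not the implementation described in the patent: the `Word` record, the `extract_primary_words` function, its parameters and the part-of-speech tags ("r", "v", "n") are hypothetical, and the scores are the ones quoted in the "I love AB" example. When both requirements are given, this sketch filters by part of speech first and ranks afterwards, which is only one of the two orders the text allows.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Word:
    text: str
    pos: str      # part-of-speech tag (hypothetical tag set: "r", "v", "n")
    score: float  # importance score of the word


def extract_primary_words(
    words: List[Word],
    allowed_pos: Optional[Set[str]] = None,
    threshold: Optional[float] = None,
    top_n: Optional[int] = None,
) -> List[str]:
    """Apply whichever requirements are given: part of speech, threshold, top N."""
    candidates = list(words)
    if allowed_pos is not None:
        # Mode (1): keep only words with an allowed part of speech.
        candidates = [w for w in candidates if w.pos in allowed_pos]
    if threshold is not None:
        # Mode (2), threshold variant: keep words scoring above the threshold.
        candidates = [w for w in candidates if w.score > threshold]
    if top_n is not None:
        # Mode (2), ranking variant: keep the N highest-scoring words.
        candidates = sorted(candidates, key=lambda w: w.score, reverse=True)[:top_n]
    return [w.text for w in candidates]


# Scores taken from the "I love AB" example; the tags are assumptions.
EXAMPLE = [
    Word("I", "r", 0.168497),
    Word("love", "v", 0.221857),
    Word("A", "n", 0.203215),
    Word("B", "n", 0.406431),
]
```

Calling `extract_primary_words(EXAMPLE, top_n=1)` selects "B", matching the example in mode (2); passing `allowed_pos={"n"}` as well corresponds to mode (3).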
In 103, syntactic analysis is performed on the text data, and the topic point of the text data is obtained according to the syntactic structure content in the text data that is related to the primary word.
In this step, based on the primary word obtained in step 102, the syntactic structure content related to the primary word is determined from the text data, and the determined syntactic structure content is combined to obtain the topic point of the text data.
Specifically, the topic point of the text data may be obtained as follows. First, the syntax tree of the text data is obtained. The syntax tree can be obtained with a dependency parsing algorithm, and it gives the dependency relations, that is, the syntactic structure relations, between the words in the text data. Next, the syntactic structure content related to the extracted primary word is determined from the obtained syntax tree; that is, the syntactic structure content around the extracted primary word is found in the syntax tree, such as subject-predicate structure content, verb-object structure content, modification structure content and negation structure content related to the primary word. Finally, the determined syntactic structure content is combined to obtain the topic point of the text data. When the determined syntactic structure content is combined, only part of it may be selected for combination, for example the content that meets a preset syntactic structure requirement; the preset requirement may be to choose structures such as the subject-predicate structure, verb-object structure and modification structure, with other syntactic structures not selected. Alternatively, all of the determined syntactic structure content may be combined.
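The selection step can be sketched as a filter over dependency arcs. This is a minimal illustration that assumes "syntactic structure content related to the primary word" means the arcs that touch the primary word directly; the `Arc` type, the `related_arcs` function and the example arcs (including the two "ATT" arcs) are hypothetical and are not taken from Fig. 2.

```python
from typing import List, NamedTuple, Optional, Set


class Arc(NamedTuple):
    head: int   # position of the governing word in the sentence
    dep: int    # position of the dependent word
    label: str  # relation label, e.g. "SBV" (subject-predicate), "VOB" (verb-object)


def related_arcs(
    arcs: List[Arc],
    primary_index: int,
    allowed_labels: Optional[Set[str]] = None,
) -> List[Arc]:
    """Return the arcs touching the primary word, optionally filtered by label."""
    hits = [a for a in arcs if primary_index in (a.head, a.dep)]
    if allowed_labels is not None:
        # Preset syntactic structure requirement: keep only the chosen relations.
        hits = [a for a in hits if a.label in allowed_labels]
    return hits


# Illustrative arcs for a sentence whose primary word sits at position 3.
ARCS = [
    Arc(3, 2, "SBV"),
    Arc(3, 4, "MT"),
    Arc(3, 6, "VOB"),
    Arc(2, 1, "ATT"),  # assumed arc, does not touch the primary word
    Arc(6, 5, "ATT"),  # assumed arc, does not touch the primary word
]
```

Passing `allowed_labels=None` corresponds to combining all of the determined content; passing `{"SBV", "VOB"}` corresponds to a preset requirement of subject-predicate and verb-object structures.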
When the syntactic structure content is combined, the words other than the primary word can be extracted from each selected piece of syntactic structure content and then combined with the primary word in the order in which the words appear in the text data; the combined result is taken as the topic point of the text data. Alternatively, the pieces of syntactic structure content can be combined in the order in which they appear in the text data, and the result after removing the repeated parts is taken as the topic point of the text data.
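The second combination strategy, joining the structures in their order of appearance and dropping repeated parts, might look like the sketch below. The representation of a structure as a non-empty list of `(position, word)` pairs and the function name `combine_by_structure_order` are assumptions made for illustration; the text does not say how repeated parts are detected, so this sketch treats a word at an already emitted position as a repeat.

```python
from typing import List, Tuple

Structure = List[Tuple[int, str]]  # (position in sentence, word) pairs, non-empty


def combine_by_structure_order(structures: List[Structure]) -> str:
    """Join structures by first appearance, emitting each sentence position once."""
    # Order the structures by the earliest word position they contain.
    ordered = sorted(structures, key=lambda s: min(pos for pos, _ in s))
    seen = set()
    words = []
    for structure in ordered:
        for pos, word in sorted(structure):
            if pos not in seen:  # skip words already emitted by an earlier structure
                seen.add(pos)
                words.append(word)
    return " ".join(words)
```

For two structures that share the primary word, such as a subject-predicate pair and a verb-object pair, the shared word is emitted once and the remaining words keep their sentence order.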
For example, suppose the text data is "The Sagittarius in our dorm has pretended to be a Scorpio for 3 years", and the syntax tree of the text data obtained by the dependency parsing algorithm is as shown in Fig. 2. If the primary word determined in step 102 is "pretend", then according to the syntax tree the syntactic structure content related to the primary word is "Sagittarius pretend (SBV, subject-predicate structure)", "pretend (MT, voice structure)" and "pretend Scorpio (VOB, verb-object structure)". If the preset syntactic structure requirement is the subject-predicate structure and the verb-object structure, the structure content corresponding to these two structures, namely "Sagittarius pretend" and "pretend Scorpio", is selected from the syntactic structure content related to the primary word, and the selected content is combined to form the topic point of the text data. For the combination, "Sagittarius" is extracted from "Sagittarius pretend" and "Scorpio" from "pretend Scorpio"; "Sagittarius", "Scorpio" and the primary word "pretend" are then combined in the order in which they appear in the text data, and the result, "Sagittarius pretend Scorpio", is taken as the topic point of the text data.
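The worked example can be traced end to end with a short sketch. Everything below is an assumption made for illustration: the English glosses of the tokens are a loose rendering of the example sentence, Fig. 2 is not reproduced here so only the SBV, MT and VOB arcs come from the text (the "ATT" arcs are invented filler), and `topic_point` is a hypothetical helper that implements the first combination strategy (words joined in sentence order).

```python
from typing import List, Set, Tuple

# Loose English glosses of the example sentence, one entry per segmented word.
TOKENS = ["our", "dorm", "Sagittarius", "pretend", "<aspect particle>", "3 years", "Scorpio"]

# (head position, dependent position, label). Only SBV, MT and VOB are from the text.
ARCS: List[Tuple[int, int, str]] = [
    (3, 2, "SBV"),  # Sagittarius <- pretend (subject-predicate)
    (3, 4, "MT"),   # pretend -> aspect particle
    (3, 6, "VOB"),  # pretend -> Scorpio (verb-object)
    (2, 1, "ATT"),  # assumed modifier arc
    (1, 0, "ATT"),  # assumed modifier arc
    (6, 5, "ATT"),  # assumed modifier arc
]


def topic_point(
    tokens: List[str],
    arcs: List[Tuple[int, int, str]],
    primary_index: int,
    allowed_labels: Set[str],
) -> str:
    """Join the primary word with its selected neighbours in sentence order."""
    positions = {primary_index}
    for head, dep, label in arcs:
        if label in allowed_labels and primary_index in (head, dep):
            # Keep the word at the other end of the arc.
            positions.add(dep if head == primary_index else head)
    return " ".join(tokens[i] for i in sorted(positions))


RESULT = topic_point(TOKENS, ARCS, primary_index=3, allowed_labels={"SBV", "VOB"})
```

With the subject-predicate and verb-object structures selected, `RESULT` is the three-word topic point described in the example; adding "MT" to `allowed_labels` would also pull in the aspect particle.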
It should be understood that one or more topic points may be obtained in this step. If one primary word is extracted in step 102, a unique topic point is obtained based on that primary word. If multiple primary words are extracted in step 102, the topic points obtained from them may be one or several. This is because, in the syntactic structure of the text data, different primary words may correspond to the same syntactic structure; in that case only one topic point can be obtained from the multiple primary words. When the syntactic structures related to different primary words differ, multiple topic points can be obtained from the multiple primary words. The present invention does not limit the number of topic points obtained.
Fig. 3 is a structural diagram of an apparatus for analyzing text topic points provided by an embodiment of the present invention. As shown in Fig. 3, the apparatus includes: an acquiring unit 31, a training unit 32, an extraction unit 33 and a processing unit 34.
Acquiring unit 31, for obtaining text data.
The text data obtained by the acquiring unit 31 may be text consisting of a single character string or of multiple character strings. The text data may be a sentence, a phrase, etc. in Chinese. The text data obtained by the acquiring unit 31 may be text data in a text format, or text data converted from data obtained in a non-text format such as voice or images.
Training unit 32, for training to obtain the word ranking model.
The word ranking model trained by the training unit 32 is used to obtain the importance score of each word in the text data, so that the primary words of the text data can be extracted. The training unit 32 may train the word ranking model in advance as follows:
Training data is obtained; the training data obtained by the training unit 32 includes text data annotated with the importance score of each word. The training unit 32 then trains a deep learning model, with each word of the text data in the training data as input and the importance score of each word as output, to obtain the word ranking model. The deep learning model may be a multilayer perceptron model, a convolutional neural network model, a recurrent neural network model, etc.
With the word ranking model trained by the training unit 32, the importance score of each word can be obtained from the words of the input text data.
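The training procedure described above can be illustrated with a toy multilayer perceptron written in plain Python. This is a sketch under heavy assumptions rather than the model used in the patent: the text does not specify the architecture, input features, loss or optimizer, so the one-hot word input, the single tanh hidden layer, the squared-error loss, the plain stochastic gradient descent and the function name `train_word_ranker` are all choices made here for illustration. Words not seen in training are given a score of 0.0, which is likewise an assumption.

```python
import math
import random
from typing import Callable, List, Tuple


def train_word_ranker(
    samples: List[Tuple[str, float]],
    hidden: int = 4,
    epochs: int = 2000,
    lr: float = 0.1,
    seed: int = 0,
) -> Callable[[str], float]:
    """Fit word -> importance score with a one-hidden-layer perceptron."""
    rng = random.Random(seed)
    vocab = {w: i for i, w in enumerate(sorted({w for w, _ in samples}))}
    # A one-hot input simply selects one row of the first-layer weights.
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(hidden)] for _ in vocab]
    b1 = [0.0] * hidden
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
    b2 = 0.0

    def forward(word: str):
        row = w1[vocab[word]]
        h = [math.tanh(row[j] + b1[j]) for j in range(hidden)]
        return h, sum(w2[j] * h[j] for j in range(hidden)) + b2

    for _ in range(epochs):
        for word, target in samples:
            h, pred = forward(word)
            err = pred - target  # derivative of 0.5 * (pred - target) ** 2
            row = w1[vocab[word]]
            for j in range(hidden):
                grad_pre = err * w2[j] * (1.0 - h[j] ** 2)  # back through tanh
                w2[j] -= lr * err * h[j]
                row[j] -= lr * grad_pre
                b1[j] -= lr * grad_pre
            b2 -= lr * err

    def score(word: str) -> float:
        # Unknown words get 0.0 (an assumption, not stated in the text).
        return forward(word)[1] if word in vocab else 0.0

    return score


# Annotated scores borrowed from the "I love AB" example.
SAMPLES = [("I", 0.168497), ("love", 0.221857), ("A", 0.203215), ("B", 0.406431)]
ranker = train_word_ranker(SAMPLES)
```

A real word ranking model would presumably take the word's context into account; this sketch scores each word in isolation purely to show the input/output pairing of annotated words and importance scores.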
Extraction unit 33, for extracting primary word from the text data.
The extraction unit 33 extracts the primary words of the text data from the text data obtained by the acquiring unit 31 according to a preset extraction requirement.
Specifically, the extraction unit 33 may extract the primary words of the text data as follows: the extraction unit 33 performs word segmentation on the text data to obtain its word segmentation result; then, based on the word segmentation result, the extraction unit 33 extracts the words that meet the preset extraction requirement as the primary words of the text data. The preset extraction requirement includes at least one of a preset part-of-speech requirement and a preset score requirement.
Specifically, the extraction unit 33 may extract primary words from the text data in any of the following ways:
(1) extraction unit 33 extracts the word for meeting preset part of speech requirement in text data as primary word.
Wherein, preset part of speech requirement can be notional word, such as common noun, proper noun, the verb for having actual demand
Deng.It, can be by using part of speech analytical technology in the primary word during extraction unit 33 extracts text data using this kind of mode
It determines the part of speech of each word in text data, is then required according to preset part of speech, extract the word met the requirements as text
The primary word of data.For example, if the requirement of preset part of speech is noun, acquired text data is " I likes A ", this article notebook data
Corresponding cutting word result is " I ", " love " and " A ", if wherein " A " represents city name, the part of speech of " A " is noun, then extracts
Unit 32 extracts the primary word of " A " as this article notebook data.
(2) extraction unit 33 extracts the word for meeting preset score requirement in text data as primary word.
Wherein, it is more than predetermined threshold value that preset score requirement, which can be the importance score of each word in text data,;Also
Can according in text data each word importance score, choose and come the word of top N, wherein N is positive integer.Citing
For, if text data is " I likes AB ", the importance score of each word is respectively " I 0.168497 ", " love in cutting word result
0.221857 ", " A 0.203215 " and " B 0.406431 ", wherein " A " represents city name, " B " represents sight spot name, if default
Score requirement to choose the word that makes number one as primary word, then extraction unit 33 extracts " B " as text data
Primary word.
Specifically, it when extraction unit 33 obtains the importance of each word in text data, can greatly advised based on word
Statistical indicator of the modulus in obtains the importance score of each word in text data.For example, text data can be passed through
The information such as TF-IDF (termfrequency-inversedocumentfrequency, term frequency-inverse document frequency), mutual information
Result of calculation, to obtain the importance score of each word in text data.Training unit 32 can also be used, and training obtains in advance
Word order models, after the cutting word result of text data is inputted the model, according to the output of the model as a result, obtain text
The importance score of each word in data.
(3) extraction unit 33 extracts meets the requirement of preset part of speech and the requirement of preset score simultaneously in text data
Primary word of the word as this article notebook data.
In this kind of mode, extraction unit 33 need to obtain the part of speech of each word and importance in text data and obtain simultaneously
Point, primary word of the word as this article notebook data of preset part of speech requirement and score requirement will be met.For example, it is if literary
During the word for meeting the requirement of preset part of speech comprising multiple in notebook data, then required according to preset score, extraction unit 33 can
Using by importance score sequence top N word be used as this article notebook data primary word, wherein N can be preset more than 1
Integer;Alternatively, if the importance score of each word sorts when the word of top N there are various parts of speech in text data, carry
Take unit 33 that can will meet the word of preset part of speech requirement as the primary word of this article notebook data, wherein N can be default
More than 1 integer.It is understood that the present invention to the number of primary word extracted from text data without limit
It is fixed, can be one or multiple.
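Of the statistical indicators mentioned under mode (2), TF-IDF is the simplest to sketch. The example below is illustrative only: the function name `tf_idf_scores`, the tiny corpus and the smoothed IDF formula `ln((1 + N) / (1 + df)) + 1` are assumptions (the smoothing is one common variant), and the patent does not say which TF-IDF variant or corpus is used.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf_scores(doc_tokens: List[str], corpus: List[List[str]]) -> Dict[str, float]:
    """Score each word of one segmented text against a reference corpus."""
    counts = Counter(doc_tokens)
    n_docs = len(corpus)
    scores = {}
    for word, count in counts.items():
        # Document frequency: number of corpus texts that contain the word.
        df = sum(1 for doc in corpus if word in doc)
        # Smoothed inverse document frequency (one common variant).
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        scores[word] = (count / len(doc_tokens)) * idf
    return scores


# A toy stand-in for "large-scale data"; real use would need a far larger corpus.
CORPUS = [["I", "love", "A"], ["I", "love", "B"], ["I", "visit", "A"]]
SCORES = tf_idf_scores(["I", "love", "B"], CORPUS)
```

In this toy corpus "B" appears in the fewest texts and so receives the highest score, while "I", which appears in every text, receives the lowest.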
Processing unit 34, for performing syntactic analysis on the text data and obtaining the topic point of the text data according to the syntactic structure content in the text data that is related to the primary word.
Based on the primary word obtained by the extraction unit 33, the processing unit 34 determines the syntactic structure content related to the primary word from the text data and combines the determined syntactic structure content to obtain the topic point of the text data.
Specifically, processing unit 34 may obtain the topic point of the text data in the following manner.
First, it obtains the syntax tree of the text data; processing unit 34 can obtain the syntax tree by means
of a dependency parsing algorithm, and the syntax tree gives the dependency relations between the words in
the text data, that is, the syntactic structure relations between the words. Second, according to the
obtained syntax tree, processing unit 34 determines the syntactic structure content related to the
extracted primary word; that is, processing unit 34 searches the syntax tree around the extracted primary
word for the syntactic structure content related to it, such as subject-predicate structure content,
verb-object structure content, modifier structure content, and negation structure content related to the
primary word. Third, it combines the determined syntactic structure content to obtain the topic point of
the text data. When combining the determined syntactic structure content, processing unit 34 may select
only a part of it, for example the syntactic structure content that meets a preset syntactic structure
requirement, where the preset syntactic structure requirement may be a subject-predicate structure, a
verb-object structure, a modifier structure, and so on; or it may combine all of the determined syntactic
structure content.
When combining the syntactic structure content, processing unit 34 may extract the words other than the
primary word from each selected piece of syntactic structure content, combine them with the primary word
in the order in which the words appear in the text data, and take the combined result as the topic point
of the text data. Processing unit 34 may also combine the pieces of syntactic structure content in the
order in which they appear in the text data, and take the result obtained after removing the repeated
parts as the topic point of the text data.
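The combination step can be sketched as follows, assuming each selected piece of syntactic structure content is given as a (relation, head index, dependent index) triple over the segmented text. The function name `combine_topic_point` and the space separator are assumptions for this example; for Chinese text an empty separator would be the natural choice.

```python
from typing import List, Tuple


def combine_topic_point(
    tokens: List[str],
    structures: List[Tuple[str, int, int]],
    sep: str = " ",
) -> str:
    """Join the words covered by `structures` in their order of appearance.

    Collecting indices in a set means a word shared by several structures
    (typically the primary word) is emitted only once.
    """
    indices = set()
    for _relation, head, dep in structures:
        indices.update((head, dep))
    return sep.join(tokens[i] for i in sorted(indices))


# Hypothetical structures found around the primary word "movies".
tokens = ["I", "like", "horror", "movies"]
structures = [("VOB", 1, 3), ("ATT", 3, 2)]
topic_point = combine_topic_point(tokens, structures)
```

Sorting by token index reproduces the order of appearance in the text data, and the set removes the repeated part, so this one routine covers both combination variants described above for this simple case.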
It is understood that processing unit 34 may obtain one topic point or multiple topic points. If
extraction unit 33 extracts only one primary word, a single topic point is obtained based on that primary
word. If extraction unit 33 extracts multiple primary words, the topic points obtained based on them may
number one or more. This is because, in the syntactic structure of the text data, the syntactic structures
related to different primary words may be identical; in that case, only one topic point can be obtained
from the multiple primary words. When the syntactic structures related to different primary words differ,
multiple topic points can be obtained from the multiple primary words. The present invention does not
limit the number of topic points obtained.
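The relationship between the number of primary words and the number of topic points can be illustrated with a short sketch. It assumes that the words covered by the syntactic structures related to each primary word have already been collected as a set of token indices; the function name `distinct_topic_points` and the input layout are hypothetical.

```python
from typing import Dict, List, Set


def distinct_topic_points(
    tokens: List[str],
    indices_by_primary_word: Dict[str, Set[int]],
    sep: str = " ",
) -> List[str]:
    """Build one topic point per distinct set of covered token indices."""
    seen = set()
    points = []
    for _word, indices in indices_by_primary_word.items():
        key = frozenset(indices)
        # Skip empty coverage and coverage already produced by another primary word.
        if not key or key in seen:
            continue
        seen.add(key)
        points.append(sep.join(tokens[i] for i in sorted(key)))
    return points


tokens = ["I", "like", "horror", "movies"]
# Both primary words are related to the same structure: one topic point.
same = distinct_topic_points(tokens, {"horror": {2, 3}, "movies": {2, 3}})
# The related structures differ: two topic points.
different = distinct_topic_points(tokens, {"horror": {2, 3}, "movies": {1, 2, 3}})
```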
The topic points obtained by the present invention can be applied in a variety of scenarios:
For example, in a dialogue system: after the topic point of the current chat utterance is obtained, the
dialogue system generates a reply for the chat utterance according to the obtained topic point, so that
the generated reply is reasonable and logical. In a search system: after the topic point of the query
input by the user is obtained, the search engine searches with the obtained topic point, which can expand
the search range so that the search results better meet the user's search needs. The topic points can
also be used to judge the user's consumption intention, travel intention, and the like: after the topic
points appearing in the user's search text, dialogue text, and so on are obtained, they are used to build
a user profile, and the user's consumption intention, travel intention, and the like are judged according
to the profile built.
Fig. 4 shows a block diagram of an exemplary computer system/server 012 suitable for implementing
embodiments of the present invention. The computer system/server 012 shown in Fig. 4 is only an example
and should not impose any limitation on the function or scope of use of the embodiments of the present
invention.
As shown in Fig. 4, computer system/server 012 takes the form of a general-purpose computing device. The
components of computer system/server 012 may include, but are not limited to: one or more processors or
processing units 016, a system memory 028, and a bus 018 connecting the different system components
(including system memory 028 and processing unit 016).
Bus 018 represents one or more of several types of bus structures, including a memory bus or memory
controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a
variety of bus structures. By way of example, these architectures include, but are not limited to, the
Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA
(EISA) bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component
Interconnect (PCI) bus.
Computer system/server 012 typically includes a variety of computer-system-readable media. These media
can be any available media accessible by computer system/server 012, including volatile and non-volatile
media, and removable and non-removable media.
System memory 028 may include computer-system-readable media in the form of volatile memory, such as
random access memory (RAM) 030 and/or cache memory 032. Computer system/server 012 may further include
other removable/non-removable, volatile/non-volatile computer system storage media. By way of example
only, storage system 034 can be used to read from and write to a non-removable, non-volatile magnetic
medium (not shown in Fig. 4, commonly referred to as a "hard disk drive"). Although not shown in Fig. 4, a
magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (such as a
"floppy disk") and an optical disc drive for reading from and writing to a removable non-volatile optical
disc (such as a CD-ROM, DVD-ROM, or other optical medium) may be provided. In these cases, each drive can
be connected to bus 018 through one or more data media interfaces. Memory 028 may include at least one
program product having a set of (for example, at least one) program modules configured to perform the
functions of the embodiments of the present invention.
A program/utility 040 having a set of (at least one) program modules 042 may be stored in, for example,
memory 028. Such program modules 042 include, but are not limited to, an operating system, one or more
application programs, other program modules, and program data; each of these examples, or some
combination of them, may include an implementation of a network environment. Program modules 042
generally perform the functions and/or methods of the embodiments described in the present invention.
Computer system/server 012 may also communicate with one or more external devices 014 (such as a
keyboard, a pointing device, or a display 024); in the present invention, computer system/server 012
communicates with an external radar device. It may also communicate with one or more devices that enable
a user to interact with computer system/server 012, and/or with any device (such as a network card or a
modem) that enables computer system/server 012 to communicate with one or more other computing devices.
Such communication can take place through input/output (I/O) interface 022. In addition, computer
system/server 012 can communicate with one or more networks (such as a local area network (LAN), a wide
area network (WAN), and/or a public network such as the Internet) through network adapter 020. As shown
in the figure, network adapter 020 communicates with the other modules of computer system/server 012
through bus 018. It should be understood that, although not shown in the figure, other hardware and/or
software modules may be used in conjunction with computer system/server 012, including but not limited
to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape
drives, and data backup storage systems.
Processing unit 016 runs the programs stored in system memory 028 to execute various functional
applications and data processing, for example to implement a method for analyzing text topic points,
which may include:
Obtaining text data;
Extracting a primary word from the text data;
Performing syntactic analysis on the text data, and obtaining the topic point of the text data according
to the syntactic structure content in the text data that is related to the primary word.
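The three steps listed above might be wired together as in the following sketch. The tagger, scorer, and parser are passed in as callables because the embodiment leaves their implementation open; the function name `analyze_topic_points`, its parameters, and the arc format (head index, dependent index, relation label) are assumptions made for this example.

```python
from typing import Callable, Dict, List, Set, Tuple

Arc = Tuple[int, int, str]  # (head index, dependent index, relation label)


def analyze_topic_points(
    text: str,
    tag: Callable[[str], List[Tuple[str, str]]],
    score: Callable[[List[str]], Dict[str, float]],
    parse: Callable[[List[str]], List[Arc]],
    allowed_pos: Set[str],
    top_n: int = 1,
) -> List[str]:
    # Step 1: obtain text data (here simply the `text` argument) and segment it.
    tagged = tag(text)
    tokens = [word for word, _ in tagged]

    # Step 2: extract primary words by part of speech and importance score.
    scores = score(tokens)
    candidates = {word for word, pos in tagged if pos in allowed_pos}
    primary = sorted(candidates, key=lambda w: (-scores.get(w, 0.0), w))[:top_n]

    # Step 3: syntactic analysis, then combine the structures around each primary word.
    arcs = parse(tokens)
    points: List[str] = []
    for word in primary:
        covered = sorted(
            {i for head, dep, _ in arcs
             if word in (tokens[head], tokens[dep])
             for i in (head, dep)}
        )
        point = " ".join(tokens[i] for i in covered)
        if point and point not in points:
            points.append(point)
    return points
```

A caller would supply, for example, a segmenter with part-of-speech tagging for `tag`, a TF-IDF or model-based scorer for `score`, and a dependency parser for `parse`.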
The above computer program may be provided in a computer storage medium; that is, the computer storage
medium is encoded with a computer program which, when executed by one or more computers, causes the one
or more computers to perform the method flows and/or device operations shown in the above embodiments of
the present invention. For example, the method flow executed by the one or more processors may include:
Obtaining text data;
Extracting a primary word from the text data;
Performing syntactic analysis on the text data, and obtaining the topic point of the text data according
to the syntactic structure content in the text data that is related to the primary word.
With the passage of time and the development of technology, the meaning of "medium" has become
increasingly broad, and the distribution of computer programs is no longer limited to tangible media; they
can also be downloaded directly from a network, for example. Any combination of one or more
computer-readable media may be used. A computer-readable medium may be a computer-readable signal medium
or a computer-readable storage medium. A computer-readable storage medium may be, for example, but is not
limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system,
apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of
computer-readable storage media include: an electrical connection having one or more wires, a portable
computer diskette, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable
read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory
(CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
In this document, a computer-readable storage medium may be any tangible medium that contains or stores a
program that can be used by or in connection with an instruction execution system, apparatus, or device.
A computer-readable signal medium may include a data signal propagated in baseband or as part of a
carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take
many forms, including, but not limited to, an electromagnetic signal, an optical signal, or any suitable
combination of the above. A computer-readable signal medium may also be any computer-readable medium
other than a computer-readable storage medium that can send, propagate, or transmit a program for use by
or in connection with an instruction execution system, apparatus, or device.
The program code contained on a computer-readable medium may be transmitted by any suitable medium,
including, but not limited to, wireless, wire, optical cable, RF, and the like, or any suitable
combination of the above.
Computer program code for carrying out the operations of the present invention may be written in one or
more programming languages or a combination thereof, including object-oriented programming languages such
as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C"
language or similar programming languages. The program code may execute entirely on the user's computer,
partly on the user's computer, as a stand-alone software package, partly on the user's computer and
partly on a remote computer, or entirely on a remote computer or server. In cases involving a remote
computer, the remote computer may be connected to the user's computer through any kind of network,
including a local area network (LAN) or a wide area network (WAN), or it may be connected to an external
computer (for example, through the Internet using an Internet service provider).
With the technical solution provided by the present invention, the primary word of the original text
data is extracted, and the topic point of the original text data is then obtained based on the syntactic
structure of the original text data and the primary word. The topic point obtained is therefore
important, fluent, and does not stray from the topic, and it can accurately express the core semantics of
the original text data, thereby improving the accuracy of text topic point analysis.
In the several embodiments provided by the present invention, it should be understood that the disclosed
system, device, and method may be implemented in other ways. For example, the device embodiments
described above are merely illustrative; the division into units is only a division by logical function,
and other divisions are possible in an actual implementation.
The units described as separate components may or may not be physically separate, and the components
shown as units may or may not be physical units; that is, they may be located in one place or distributed
over multiple network elements. Some or all of the units may be selected according to actual needs to
achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one
processing unit, each unit may exist physically on its own, or two or more units may be integrated into
one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus
software functional units.
An integrated unit implemented in the form of a software functional unit may be stored in a
computer-readable storage medium. The software functional unit is stored in a storage medium and includes
a number of instructions for causing a computer device (which may be a personal computer, a server, a
network device, or the like) or a processor to perform some of the steps of the methods of the
embodiments of the present invention. The aforementioned storage medium includes various media that can
store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random
access memory (RAM), a magnetic disk, or an optical disc.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit
the invention. Any modification, equivalent substitution, or improvement made within the spirit and
principles of the present invention shall fall within the scope of protection of the present invention.
Claims (14)
- 1. A method for analyzing text topic points, characterized in that the method comprises: obtaining text data; extracting a primary word from the text data; and performing syntactic analysis on the text data, and obtaining the topic point of the text data according to the syntactic structure content in the text data that is related to the primary word.
- 2. The method according to claim 1, characterized in that extracting a primary word from the text data comprises: extracting from the text data a word that meets a preset part-of-speech requirement as the primary word; and/or determining the importance score of each word in the text data, and extracting a word that meets a preset score requirement as the primary word.
- 3. The method according to claim 2, characterized in that determining the importance score of each word in the text data comprises: determining the importance score of each word in the text data based on statistical indicators of words in large-scale data; or inputting each word in the text data into a word ranking model trained in advance, and determining the importance score of each word in the text data according to the output of the word ranking model.
- 4. The method according to claim 3, characterized in that the word ranking model is trained in advance in the following manner: obtaining training data, the training data including text data annotated with the importance score of each word; and taking each word of the text data in the training data as input and the importance score of each word in the text data as output, training a deep learning model to obtain the word ranking model.
- 5. The method according to claim 1, characterized in that obtaining the topic point of the text data according to the syntactic structure content in the text data that is related to the primary word comprises: obtaining the syntax tree of the text data; determining, according to the obtained syntax tree, the syntactic structure content related to the primary word; and combining the determined syntactic structure content to obtain the topic point of the text data.
- 6. The method according to claim 5, characterized in that combining the determined syntactic structure content comprises: selecting, from the determined syntactic structure content, the content that meets a preset syntactic structure requirement and combining it.
- 7. A device for analyzing text topic points, characterized in that the device comprises: an acquiring unit for obtaining text data; an extraction unit for extracting a primary word from the text data; and a processing unit for performing syntactic analysis on the text data and obtaining the topic point of the text data according to the syntactic structure content in the text data that is related to the primary word.
- 8. The device according to claim 7, characterized in that, when extracting a primary word from the text data, the extraction unit specifically performs: extracting from the text data a word that meets a preset part-of-speech requirement as the primary word; and/or determining the importance score of each word in the text data, and extracting a word that meets a preset score requirement as the primary word.
- 9. The device according to claim 8, characterized in that, when determining the importance score of each word in the text data, the extraction unit specifically performs: determining the importance score of each word in the text data based on statistical indicators of words in large-scale data; or inputting each word in the text data into a word ranking model trained in advance, and determining the importance score of each word in the text data according to the output of the word ranking model.
- 10. The device according to claim 9, characterized in that the device further comprises a training unit for training the word ranking model in advance in the following manner: obtaining training data, the training data including text data annotated with the importance score of each word; and taking each word of the text data in the training data as input and the importance score of each word in the text data as output, training a deep learning model to obtain the word ranking model.
- 11. The device according to claim 7, characterized in that, when obtaining the topic point of the text data according to the syntactic structure content in the text data that is related to the primary word, the processing unit specifically performs: obtaining the syntax tree of the text data; determining, according to the obtained syntax tree, the syntactic structure content related to the primary word; and combining the determined syntactic structure content to obtain the topic point of the text data.
- 12. The device according to claim 11, characterized in that, when combining the determined syntactic structure content, the processing unit specifically performs: selecting, from the determined syntactic structure content, the content that meets a preset syntactic structure requirement and combining it.
- 13. A piece of equipment, characterized in that the equipment comprises: one or more processors; and a storage device for storing one or more programs, wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method according to any one of claims 1-6.
- 14. A storage medium containing computer-executable instructions, wherein the computer-executable instructions, when executed by a computer processor, are used to perform the method according to any one of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711390850.7A CN108268602A (en) | 2017-12-21 | 2017-12-21 | Analyze method, apparatus, equipment and the computer storage media of text topic point |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108268602A (en) | 2018-07-10 |
Family
ID=62772458
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711390850.7A Pending CN108268602A (en) | 2017-12-21 | 2017-12-21 | Analyze method, apparatus, equipment and the computer storage media of text topic point |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108268602A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109783733A (en) * | 2019-01-15 | 2019-05-21 | 三角兽(北京)科技有限公司 | User's portrait generating means and method, information processing unit and storage medium |
CN110096709A (en) * | 2019-05-07 | 2019-08-06 | 百度在线网络技术(北京)有限公司 | Command processing method and device, server and computer-readable medium |
CN110851560A (en) * | 2018-07-27 | 2020-02-28 | 杭州海康威视数字技术股份有限公司 | Information retrieval method, device and equipment |
CN114491013A (en) * | 2021-12-09 | 2022-05-13 | 重庆邮电大学 | Topic mining method, storage medium and system for merging syntactic structure information |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070233465A1 (en) * | 2006-03-20 | 2007-10-04 | Nahoko Sato | Information extracting apparatus, and information extracting method |
CN104536950A (en) * | 2014-12-11 | 2015-04-22 | 北京百度网讯科技有限公司 | Text summarization generating method and device |
CN105260359A (en) * | 2015-10-16 | 2016-01-20 | 晶赞广告(上海)有限公司 | Semantic keyword extraction method and apparatus |
CN106502994A (en) * | 2016-11-29 | 2017-03-15 | 上海智臻智能网络科技股份有限公司 | A kind of method and apparatus of the keyword extraction of text |
CN106815213A (en) * | 2016-12-30 | 2017-06-09 | 全民互联科技(天津)有限公司 | A kind of contract performance clause extraction method and system |
CN106959944A (en) * | 2017-02-14 | 2017-07-18 | 中国电子科技集团公司第二十八研究所 | A kind of Event Distillation method and system based on Chinese syntax rule |
- 2017-12-21: CN application CN201711390850.7A (CN108268602A), status: Pending
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110851560A (en) * | 2018-07-27 | 2020-02-28 | 杭州海康威视数字技术股份有限公司 | Information retrieval method, device and equipment |
CN110851560B (en) * | 2018-07-27 | 2023-03-10 | 杭州海康威视数字技术股份有限公司 | Information retrieval method, device and equipment |
CN109783733A (en) * | 2019-01-15 | 2019-05-21 | 三角兽(北京)科技有限公司 | User's portrait generating means and method, information processing unit and storage medium |
CN110096709A (en) * | 2019-05-07 | 2019-08-06 | 百度在线网络技术(北京)有限公司 | Command processing method and device, server and computer-readable medium |
CN114491013A (en) * | 2021-12-09 | 2022-05-13 | 重庆邮电大学 | Topic mining method, storage medium and system for merging syntactic structure information |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20180710 |