CN108268602A - Method, apparatus, device and computer storage medium for analyzing text topic points - Google Patents


Info

Publication number
CN108268602A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711390850.7A
Other languages
Chinese (zh)
Inventor
郭振
吴文权
刘占
刘占一
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201711390850.7A
Publication of CN108268602A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/253 Grammatical analysis; Style critique
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The present invention provides a method, apparatus, device and computer storage medium for analyzing text topic points. The method includes: obtaining text data; extracting a primary word from the text data; and performing syntactic analysis on the text data and obtaining the topic point of the text data according to the syntactic structure content related to the primary word in the text data. With the technical solution provided by the present invention, the obtained topic point is important, fluent and faithful to the original meaning, and accurately expresses the core semantics of the original text data, thereby improving the accuracy of text topic point analysis.

Description

Method, apparatus, device and computer storage medium for analyzing text topic points
【Technical field】
The present invention relates to natural language processing, and in particular to a method, apparatus, device and computer storage medium for analyzing text topic points.
【Background art】
When analyzing the topic points of a text, the prior art usually predicts the subject of the text with a topic model. This approach has the following disadvantages. Because a topic model is in effect a classification model over specific subject categories, it can only yield those specific categories, and the number of categories is limited. In addition, the subjects produced by a topic model are highly abstract and can hardly describe the core semantics of the text accurately. A method that can accurately analyze text topic points is therefore urgently needed.
【Summary of the invention】
In view of this, the present invention provides a method, apparatus, device and computer storage medium for analyzing text topic points, so as to improve the accuracy of text topic point analysis.
The technical solution adopted by the present invention to solve the technical problem is to provide a method for analyzing text topic points, the method including: obtaining text data; extracting a primary word from the text data; and performing syntactic analysis on the text data and obtaining the topic point of the text data according to the syntactic structure content related to the primary word in the text data.
According to a preferred embodiment of the present invention, extracting a primary word from the text data includes: extracting, from the text data, a word that meets a preset part-of-speech requirement as the primary word; and/or determining an importance score of each word in the text data and extracting a word that meets a preset score requirement as the primary word.
According to a preferred embodiment of the present invention, determining the importance score of each word in the text data includes: determining the importance score of each word in the text data based on statistical indicators of the word in large-scale data; or inputting each word in the text data into a pre-trained word order model and determining the importance score of each word in the text data according to the output of the word order model.
According to a preferred embodiment of the present invention, the word order model is trained in advance as follows: obtaining training data, the training data including text data in which the importance score of each word is labeled; and training a deep learning model with each word of the text data in the training data as input and the importance score of each word in the text data as output, to obtain the word order model.
According to a preferred embodiment of the present invention, obtaining the topic point of the text data according to the syntactic structure content related to the primary word in the text data includes: obtaining a syntax tree of the text data; determining, according to the obtained syntax tree, the syntactic structure content related to the primary word; and combining the determined syntactic structure content to obtain the topic point of the text data.
According to a preferred embodiment of the present invention, combining the determined syntactic structure content includes: selecting, from the determined syntactic structure content, the content that meets a preset syntactic structure requirement and combining it.
The technical solution adopted by the present invention to solve the technical problem is also to provide an apparatus for analyzing text topic points, the apparatus including: an acquiring unit for obtaining text data; an extraction unit for extracting a primary word from the text data; and a processing unit for performing syntactic analysis on the text data and obtaining the topic point of the text data according to the syntactic structure content related to the primary word in the text data.
According to a preferred embodiment of the present invention, when extracting a primary word from the text data, the extraction unit specifically performs: extracting, from the text data, a word that meets a preset part-of-speech requirement as the primary word; and/or determining an importance score of each word in the text data and extracting a word that meets a preset score requirement as the primary word.
According to a preferred embodiment of the present invention, when determining the importance score of each word in the text data, the extraction unit specifically performs: determining the importance score of each word in the text data based on statistical indicators of the word in large-scale data; or inputting each word in the text data into a pre-trained word order model and determining the importance score of each word in the text data according to the output of the word order model.
According to a preferred embodiment of the present invention, the apparatus further includes a training unit for training the word order model in advance as follows: obtaining training data, the training data including text data in which the importance score of each word is labeled; and training a deep learning model with each word of the text data in the training data as input and the importance score of each word in the text data as output, to obtain the word order model.
According to a preferred embodiment of the present invention, when obtaining the topic point of the text data according to the syntactic structure content related to the primary word in the text data, the processing unit specifically performs: obtaining a syntax tree of the text data; determining, according to the obtained syntax tree, the syntactic structure content related to the primary word; and combining the determined syntactic structure content to obtain the topic point of the text data.
According to a preferred embodiment of the present invention, when combining the determined syntactic structure content, the processing unit specifically performs: selecting, from the determined syntactic structure content, the content that meets a preset syntactic structure requirement and combining it.
As can be seen from the above technical solutions, the present invention extracts the primary word of the original text data and then obtains the topic point of the original text data based on the syntactic structure of the original text data and the primary word. The obtained topic point is therefore important, fluent and faithful to the original meaning, and accurately expresses the core semantics of the original text data, thereby improving the accuracy of text topic point analysis.
【Brief description of the drawings】
Fig. 1 is a flowchart of a method for analyzing text topic points according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the syntactic structure of text data according to an embodiment of the present invention;
Fig. 3 is a structural diagram of an apparatus for analyzing text topic points according to an embodiment of the present invention;
Fig. 4 is a block diagram of a computer system/server according to an embodiment of the present invention.
【Detailed description】
To make the objectives, technical solutions and advantages of the present invention clearer, the present invention is described in detail below with reference to the drawings and specific embodiments.
The terms used in the embodiments of the present invention are for the purpose of describing specific embodiments only and are not intended to limit the present invention. The singular forms "a", "said" and "the" used in the embodiments of the present invention and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise.
It should be understood that the term "and/or" used herein merely describes an association between associated objects and indicates that three relationships may exist. For example, "A and/or B" may indicate three cases: A exists alone, A and B exist together, and B exists alone. In addition, the character "/" herein generally indicates an "or" relationship between the objects before and after it.
Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", "in response to determining" or "in response to detecting". Similarly, depending on the context, the phrase "if it is determined" or "if (a stated condition or event) is detected" may be interpreted as "when it is determined", "in response to determining", "when (the stated condition or event) is detected" or "in response to detecting (the stated condition or event)".
Fig. 1 is a flowchart of a method for analyzing text topic points according to an embodiment of the present invention. As shown in Fig. 1, the method includes:
In 101, text data is obtained.
In this step, the obtained text data may be text consisting of a single character string or text consisting of multiple character strings. The text data may be a sentence, a phrase, etc. in Chinese. The obtained text data may be in text format, or may be text data obtained by converting acquired data in a non-text format such as speech or images.
In 102, a primary word is extracted from the text data.
In this step, the primary word of the text data is extracted from the text data obtained in step 101 according to a preset extraction requirement.
Specifically, the primary word of the text data may be extracted as follows: word segmentation is performed on the text data to obtain a word segmentation result; then, according to the word segmentation result, the words that meet the preset extraction requirement are extracted as the primary words of the text data. In this step, the preset extraction requirement includes at least one of a preset part-of-speech requirement and a preset score requirement.
Specifically, the primary word may be extracted from the text data in any of the following ways:
(1) The words in the text data that meet the preset part-of-speech requirement are extracted as primary words.
Here, the preset part-of-speech requirement may be content words (notional words), such as common nouns, proper nouns and verbs that carry actual meaning. When primary words are extracted in this way, the part of speech of each word in the text data may be determined by part-of-speech analysis, and the words that meet the preset part-of-speech requirement are then extracted as the primary words of the text data. For example, suppose the preset part-of-speech requirement is nouns and the obtained text data is "I love A", whose word segmentation result is "I", "love" and "A". If "A" denotes a city name, the part of speech of "A" is noun, so "A" is extracted as the primary word of the text data.
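Mode (1) can be sketched as a simple filter over part-of-speech-tagged words. This is only an illustrative sketch: the tag names (`r`, `v`, `n`, `ns`) and the function `extract_by_pos` are assumptions made for the example, and the tagging itself is assumed to have been done by some external part-of-speech tagger.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical tag set: "r" = pronoun, "v" = verb, "n" = common noun, "ns" = place name.
PRESET_NOUN_TAGS: Set[str] = {"n", "ns"}


def extract_by_pos(tagged_words: Iterable[Tuple[str, str]],
                   allowed_pos: Set[str]) -> List[str]:
    """Keep the words whose part-of-speech tag meets the preset requirement."""
    return [word for word, pos in tagged_words if pos in allowed_pos]


# Word segmentation result of "I love A", where "A" stands for a city name.
tagged = [("I", "r"), ("love", "v"), ("A", "ns")]
print(extract_by_pos(tagged, PRESET_NOUN_TAGS))  # ['A']
```

Changing `allowed_pos` is all that is needed to express a different preset part-of-speech requirement, for example nouns plus verbs.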
(2) The words in the text data that meet the preset score requirement are extracted as primary words.
Here, the preset score requirement may be that the importance score of a word in the text data exceeds a preset threshold; alternatively, the words ranked in the top N by importance score may be chosen, where N is a positive integer. For example, suppose the text data is "I love AB" and the importance scores of the words in the word segmentation result are "I 0.168497", "love 0.221857", "A 0.203215" and "B 0.406431", where "A" denotes a city name and "B" denotes the name of a scenic spot. If the preset score requirement is to choose the top-ranked word as the primary word, "B" is chosen as the primary word of the text data.
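The two forms of the score requirement described above (a threshold, or the top N words) can be sketched as follows, using the scores from the example. The function names and the threshold value 0.21 are assumptions chosen for illustration only.

```python
from typing import List, Sequence, Tuple

ScoredWord = Tuple[str, float]


def extract_by_threshold(scored: Sequence[ScoredWord], threshold: float) -> List[str]:
    """Keep the words whose importance score exceeds the preset threshold."""
    return [word for word, score in scored if score > threshold]


def extract_top_n(scored: Sequence[ScoredWord], n: int) -> List[str]:
    """Keep the n words with the highest importance scores, best first."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in ranked[:n]]


# Importance scores from the "I love AB" example.
scores = [("I", 0.168497), ("love", 0.221857), ("A", 0.203215), ("B", 0.406431)]

print(extract_top_n(scores, 1))            # ['B']
print(extract_by_threshold(scores, 0.21))  # ['love', 'B']
```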
Specifically, the importance score of each word in the text data may be obtained based on statistical indicators of the word in large-scale data. For example, the importance score of each word may be obtained from the calculated values of indicators such as TF-IDF (term frequency-inverse document frequency) and mutual information. Alternatively, a pre-trained word order model may be used: the word segmentation result of the text data is input into the model, and the importance score of each word in the text data is obtained from the output of the model.
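As one possible illustration of a statistics-based importance score, the sketch below computes TF-IDF for the words of one segmented text against a small reference corpus. The source does not specify which TF-IDF variant is used; the smoothed IDF formula here, the function name `tf_idf_scores` and the toy corpus are all assumptions.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf_scores(doc_tokens: List[str], corpus: List[List[str]]) -> Dict[str, float]:
    """Score each word of doc_tokens by term frequency times smoothed inverse document frequency."""
    term_counts = Counter(doc_tokens)
    n_docs = len(corpus)
    scores = {}
    for word, count in term_counts.items():
        tf = count / len(doc_tokens)
        doc_freq = sum(1 for doc in corpus if word in doc)
        # Smoothed IDF (an assumed variant): never divides by zero, never negative.
        idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1.0
        scores[word] = tf * idf
    return scores


# Hypothetical stand-in for "large-scale data": three already segmented texts.
corpus = [["I", "love", "A"], ["I", "love", "tea"], ["I", "visit", "B"]]
scores = tf_idf_scores(["I", "love", "A", "B"], corpus)
for word, score in sorted(scores.items(), key=lambda pair: pair[1], reverse=True):
    print(word, round(score, 4))
```

Words that occur in every reference text, such as "I" here, receive the lowest score, which matches the intuition that frequent function words are rarely primary words.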
The word order model may be trained in advance as follows: training data is obtained, the training data including text data in which the importance score of each word is labeled; a deep learning model is then trained with each word of the text data in the training data as input and the importance score of each word in the text data as output, to obtain the word order model. The deep learning model may be, for example, a multilayer perceptron model, a convolutional neural network model or a recurrent neural network model. With the word order model, the importance score of each word can be obtained from the words of the input text data.
(3) The words in the text data that meet both the preset part-of-speech requirement and the preset score requirement are extracted as the primary words of the text data.
In this mode, both the part of speech and the importance score of each word in the text data need to be obtained, and the words that meet both the preset part-of-speech requirement and the preset score requirement are taken as the primary words of the text data. For example, if the text data contains multiple words that meet the preset part-of-speech requirement, then, according to the preset score requirement, the words among them whose importance scores rank in the top N are taken as the primary words of the text data, where N may be a preset integer greater than 1. Alternatively, if the words whose importance scores rank in the top N have various parts of speech, the words among them that meet the preset part-of-speech requirement are taken as the primary words of the text data, where N may be a preset integer greater than 1. It should be understood that the present invention does not limit the number of primary words extracted from the text data, which may be one or more.
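The two orderings described for mode (3) can give different results, which the following sketch makes concrete. The tag names, function names and the reuse of the scores from the earlier "I love AB" example are assumptions for illustration.

```python
from typing import List, Sequence, Set, Tuple

# (word, part-of-speech tag, importance score); tags are hypothetical.
TaggedScoredWord = Tuple[str, str, float]


def pos_then_top_n(words: Sequence[TaggedScoredWord],
                   allowed_pos: Set[str], n: int) -> List[str]:
    """Filter by part of speech first, then keep the n highest-scoring survivors."""
    candidates = [(w, s) for w, pos, s in words if pos in allowed_pos]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return [w for w, _ in candidates[:n]]


def top_n_then_pos(words: Sequence[TaggedScoredWord],
                   allowed_pos: Set[str], n: int) -> List[str]:
    """Take the n highest-scoring words first, then keep those with an allowed part of speech."""
    ranked = sorted(words, key=lambda item: item[2], reverse=True)[:n]
    return [w for w, pos, _ in ranked if pos in allowed_pos]


words = [("I", "r", 0.168497), ("love", "v", 0.221857),
         ("A", "ns", 0.203215), ("B", "ns", 0.406431)]
nouns = {"n", "ns"}

print(pos_then_top_n(words, nouns, 2))  # ['B', 'A']
print(top_n_then_pos(words, nouns, 2))  # ['B']
```

Filtering by part of speech first always returns up to N words of the required type, whereas ranking first may return fewer, because high-scoring words of other parts of speech occupy some of the top N positions.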
In 103, syntactic analysis is performed on the text data, and the topic point of the text data is obtained according to the syntactic structure content related to the primary word in the text data.
In this step, based on the primary word obtained in step 102, the syntactic structure content related to the primary word is determined from the text data, and the determined syntactic structure content is combined to obtain the topic point of the text data.
Specifically, the topic point of the text data may be obtained as follows. First, the syntax tree of the text data is obtained, for example by a dependency parsing algorithm; the syntax tree gives the dependency relations between the words in the text data, that is, the syntactic structure relations between the words. Next, according to the obtained syntax tree, the syntactic structure content related to the extracted primary word is determined; that is, the syntax tree is searched around the extracted primary word for syntactic structure content related to it, such as subject-predicate structure content, verb-object structure content, modifier structure content and negation structure content. Finally, the determined syntactic structure content is combined to obtain the topic point of the text data. When the determined syntactic structure content is combined, a part of it may be selected for combination, for example the syntactic structure content that meets a preset syntactic structure requirement; the preset syntactic structure requirement may be to select syntactic structures such as the subject-predicate structure, the verb-object structure and the modifier structure, while other syntactic structures are not selected. Alternatively, all of the determined syntactic structure content may be combined.
When syntactic structure content is combined, the words other than the primary word may be extracted from each selected piece of syntactic structure content and then combined with the primary word in the order in which the words appear in the text data; the combined result is taken as the topic point of the text data. Alternatively, the pieces of syntactic structure content may be combined in the order in which they appear in the text data, and the result after the repeated parts are removed is taken as the topic point of the text data.
For example, suppose the text data is "The Sagittarius in our dorm has pretended to be a Scorpio for 3 years", and the syntax tree of the text data obtained by the dependency parsing algorithm is as shown in Fig. 2. If the primary word determined in step 102 is "pretend", then, according to the syntax tree, the syntactic structure content related to the primary word is "Sagittarius pretend" (SBV, subject-predicate structure), "pretend" (MT, voice structure) and "pretend Scorpio" (VOB, verb-object structure). If the preset syntactic structure requirement is the subject-predicate structure and the verb-object structure, the structure content corresponding to these two structures is selected from the syntactic structure content related to the primary word, namely "Sagittarius pretend" and "pretend Scorpio", and the selected structure content is combined to form the topic point of the text data. When combining, "Sagittarius" in "Sagittarius pretend" and "Scorpio" in "pretend Scorpio" may be extracted, and "Sagittarius", "Scorpio" and the primary word "pretend" are then combined in the order in which they appear in the text data; the resulting "Sagittarius pretend Scorpio" is taken as the topic point of the text data.
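The selection and combination in this example can be sketched over a hand-written dependency tree. Everything below is an illustrative assumption rather than the output of a real parser: the tokens are English glosses of the zodiac-sign example (Sagittarius, pretend, Scorpio), `<aspect>` stands in for the grammatical marker attached to the verb, the `ATT` label for modifiers is assumed, and the arcs are written by hand instead of being read from Fig. 2.

```python
from typing import List, NamedTuple, Sequence, Set


class Arc(NamedTuple):
    head: int        # index of the governing word in the token list
    dependent: int   # index of the dependent word
    relation: str    # dependency label, e.g. "SBV", "VOB", "MT"


def related_arcs(arcs: Sequence[Arc], primary_idx: int) -> List[Arc]:
    """Arcs in which the primary word takes part, as head or as dependent."""
    return [a for a in arcs if primary_idx in (a.head, a.dependent)]


def topic_point(tokens: Sequence[str], arcs: Sequence[Arc],
                primary_idx: int, allowed_relations: Set[str]) -> List[str]:
    """Combine the primary word with the words of the selected structures, in text order."""
    selected = {primary_idx}  # a set also removes the repeated primary word
    for arc in related_arcs(arcs, primary_idx):
        if arc.relation in allowed_relations:
            selected.update((arc.head, arc.dependent))
    return [tokens[i] for i in sorted(selected)]


TOKENS = ["our", "dorm", "Sagittarius", "pretend", "<aspect>", "3-year", "Scorpio"]
ARCS = [
    Arc(1, 0, "ATT"),  # our -> dorm (modifier)
    Arc(2, 1, "ATT"),  # dorm -> Sagittarius (modifier)
    Arc(3, 2, "SBV"),  # Sagittarius is the subject of pretend
    Arc(3, 4, "MT"),   # marker attached to pretend
    Arc(6, 5, "ATT"),  # 3-year -> Scorpio (modifier)
    Arc(3, 6, "VOB"),  # Scorpio is the object of pretend
]
PRIMARY = TOKENS.index("pretend")

print(topic_point(TOKENS, ARCS, PRIMARY, {"SBV", "VOB"}))
# ['Sagittarius', 'pretend', 'Scorpio']
```

Sorting the selected indices reproduces the "order of appearance in the text data" rule, and widening `allowed_relations` corresponds to a looser preset syntactic structure requirement.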
It should be understood that one or more topic points may be obtained in this step. If only one primary word is extracted in step 102, a single topic point is obtained based on that primary word. If multiple primary words are extracted in step 102, the primary words may yield one topic point or several. This is because, in the syntactic structure of the text data, different primary words may correspond to the same syntactic structure; in that case only one topic point can be obtained from the multiple primary words. When the syntactic structures related to different primary words differ, multiple topic points can be obtained from the multiple primary words. The present invention does not limit the number of topic points obtained.
Fig. 3 is a structural diagram of an apparatus for analyzing text topic points according to an embodiment of the present invention. As shown in Fig. 3, the apparatus includes an acquiring unit 31, a training unit 32, an extraction unit 33 and a processing unit 34.
The acquiring unit 31 is configured to obtain text data.
The text data obtained by the acquiring unit 31 may be text consisting of a single character string or text consisting of multiple character strings. The text data may be a sentence, a phrase, etc. in Chinese. The text data obtained by the acquiring unit 31 may be in text format, or may be text data obtained by converting acquired data in a non-text format such as speech or images.
The training unit 32 is configured to train the word order model.
The word order model trained by the training unit 32 is used to obtain the importance score of each word in the text data, so that the primary word of the text data can be extracted. The training unit 32 may train the word order model in advance as follows:
Training data is obtained, the training data obtained by the training unit 32 including text data in which the importance score of each word is labeled. The training unit 32 then trains a deep learning model with each word of the text data in the training data as input and the importance score of each word in the text data as output, to obtain the word order model. The deep learning model may be, for example, a multilayer perceptron model, a convolutional neural network model or a recurrent neural network model.
With the word order model trained by the training unit 32, the importance score of each word can be obtained from the words of the input text data.
The extraction unit 33 is configured to extract a primary word from the text data.
The extraction unit 33 extracts the primary word of the text data from the text data obtained by the acquiring unit 31 according to a preset extraction requirement.
Specifically, the extraction unit 33 may extract the primary word of the text data as follows: the extraction unit 33 performs word segmentation on the text data to obtain a word segmentation result; then, according to the word segmentation result, the extraction unit 33 extracts the words that meet the preset extraction requirement as the primary words of the text data. The preset extraction requirement includes at least one of a preset part-of-speech requirement and a preset score requirement.
Specifically, the extraction unit 33 may extract the primary word from the text data in any of the following ways:
(1) The extraction unit 33 extracts the words in the text data that meet the preset part-of-speech requirement as primary words.
Here, the preset part-of-speech requirement may be content words (notional words), such as common nouns, proper nouns and verbs that carry actual meaning. When the extraction unit 33 extracts primary words in this way, it may determine the part of speech of each word in the text data by part-of-speech analysis and then extract the words that meet the preset part-of-speech requirement as the primary words of the text data. For example, suppose the preset part-of-speech requirement is nouns and the obtained text data is "I love A", whose word segmentation result is "I", "love" and "A". If "A" denotes a city name, the part of speech of "A" is noun, so the extraction unit 33 extracts "A" as the primary word of the text data.
(2) The extraction unit 33 extracts the words in the text data that meet the preset score requirement as primary words.
Here, the preset score requirement may be that the importance score of a word in the text data exceeds a preset threshold; alternatively, the words ranked in the top N by importance score may be chosen, where N is a positive integer. For example, suppose the text data is "I love AB" and the importance scores of the words in the word segmentation result are "I 0.168497", "love 0.221857", "A 0.203215" and "B 0.406431", where "A" denotes a city name and "B" denotes the name of a scenic spot. If the preset score requirement is to choose the top-ranked word as the primary word, the extraction unit 33 extracts "B" as the primary word of the text data.
Specifically, when obtaining the importance score of each word in the text data, the extraction unit 33 may obtain it based on statistical indicators of the word in large-scale data. For example, the importance score of each word may be obtained from the calculated values of indicators such as TF-IDF (term frequency-inverse document frequency) and mutual information. Alternatively, the word order model trained in advance by the training unit 32 may be used: the word segmentation result of the text data is input into the model, and the importance score of each word in the text data is obtained from the output of the model.
(3) The extraction unit 33 extracts the words in the text data that meet both the preset part-of-speech requirement and the preset score requirement as the primary words of the text data.
In this mode, the extraction unit 33 needs to obtain both the part of speech and the importance score of each word in the text data, and takes the words that meet both the preset part-of-speech requirement and the preset score requirement as the primary words of the text data. For example, if the text data contains multiple words that meet the preset part-of-speech requirement, then, according to the preset score requirement, the extraction unit 33 may take the words among them whose importance scores rank in the top N as the primary words of the text data, where N may be a preset integer greater than 1. Alternatively, if the words whose importance scores rank in the top N have various parts of speech, the extraction unit 33 may take the words among them that meet the preset part-of-speech requirement as the primary words of the text data, where N may be a preset integer greater than 1. It should be understood that the present invention does not limit the number of primary words extracted from the text data, which may be one or more.
Processing unit 34 is configured to perform syntactic analysis on the text data and to obtain the topic point of the text data according to the syntactic structure content in the text data that is related to the primary word.
Based on the primary word obtained by extraction unit 33, processing unit 34 determines the syntactic structure content related to that primary word in the text data and combines the determined syntactic structure content to obtain the topic point of the text data.
Specifically, processing unit 34 may obtain the topic point of the text data as follows. First, it obtains the syntax tree of the text data; processing unit 34 may obtain this tree with a dependency parsing algorithm, and the tree gives the dependency relations, that is, the syntactic structure relations, between the words in the text data. Second, according to the obtained syntax tree, processing unit 34 determines the syntactic structure content related to the extracted primary word; that is, centering on the extracted primary word, it finds in the syntax tree the syntactic structure content related to that word, such as subject-predicate structure content, verb-object structure content, modifier structure content and negation structure content. Third, it combines the determined syntactic structure content to obtain the topic point of the text data. When combining, processing unit 34 may select only part of the determined syntactic structure content, for example the content that meets a preset syntactic structure requirement, which may be a subject-predicate structure, a verb-object structure, a modifier structure and so on; or it may combine all of the determined syntactic structure content.
When combining syntactic structure content, processing unit 34 may extract from each selected piece of syntactic structure content the words other than the primary word, then combine them with the primary word in the order in which the words appear in the text data, and take the combined result as the topic point of the text data. Alternatively, processing unit 34 may combine the pieces of syntactic structure content in the order in which they appear in the text data, remove the repeated parts, and take the result as the topic point of the text data.
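As a rough sketch of the first combination mode, the example below works on a hand-written dependency parse rather than the output of a real parser. The `Token` type, the relation labels (`SBV`, `VOB`, `ATT`, `NEG`, `HED`) and the rule that a negation word attached to any selected word is also kept are assumptions made for illustration; the patent does not fix a label set or a traversal rule.

```python
from typing import List, NamedTuple, Set


class Token(NamedTuple):
    index: int  # position of the word in the text
    text: str
    head: int   # index of the governing word, -1 for the root
    rel: str    # relation to the head (hypothetical label set)


def topic_point(tokens: List[Token], primary_index: int, allowed_rels: Set[str]) -> str:
    """Collect words linked to the primary word by an allowed relation and
    join them, together with the primary word, in order of appearance."""
    selected = {primary_index}
    primary = tokens[primary_index]
    # Arc from the primary word up to its head (e.g. verb-object).
    if primary.head >= 0 and primary.rel in allowed_rels:
        selected.add(primary.head)
    # Arcs from dependents down to the primary word (e.g. modifiers).
    for t in tokens:
        if t.head == primary_index and t.rel in allowed_rels:
            selected.add(t.index)
    # Assumption: a negation word attached to any selected word is kept too.
    if "NEG" in allowed_rels:
        for t in tokens:
            if t.rel == "NEG" and t.head in selected:
                selected.add(t.index)
    return " ".join(tokens[i].text for i in sorted(selected))


# Hand-written parse of "I do not like rainy weather".
parse = [
    Token(0, "I", 3, "SBV"),
    Token(1, "do", 3, "AUX"),
    Token(2, "not", 3, "NEG"),
    Token(3, "like", -1, "HED"),
    Token(4, "rainy", 5, "ATT"),
    Token(5, "weather", 3, "VOB"),
]
print(topic_point(parse, 5, {"VOB", "ATT", "NEG"}))  # not like rainy weather
print(topic_point(parse, 5, {"VOB"}))                # like weather
```

Sorting the selected indices is what keeps the words in their original order of appearance; narrowing `allowed_rels` plays the role of the preset syntactic structure requirement.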
It should be understood that processing unit 34 may obtain one topic point or several. If extraction unit 33 extracts a single primary word, a unique topic point is obtained from that word. If extraction unit 33 extracts multiple primary words, the topic points obtained from them may number one or more. This is because, in the syntactic structure of the text data, different primary words may be related to the same syntactic structure; in that case only one topic point can be obtained from the multiple primary words. When different primary words are related to different syntactic structures, multiple topic points can be obtained. The present invention does not limit the number of topic points obtained.
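The relationship between the number of primary words and the number of topic points can be illustrated with a small sketch. The function `collect_topic_points` and the lookup tables are hypothetical; `topic_point_for` stands in for whatever syntactic analysis produces a topic point for one primary word.

```python
from typing import Callable, Iterable, List


def collect_topic_points(
    primary_words: Iterable[str],
    topic_point_for: Callable[[str], str],
) -> List[str]:
    """Derive one topic point per primary word and drop repeats,
    keeping the order in which the points were first produced."""
    points: List[str] = []
    for word in primary_words:
        point = topic_point_for(word)
        if point not in points:
            points.append(point)
    return points


# Two primary words related to the same syntactic structure: one topic point.
shared = {"basketball": "watch basketball games", "games": "watch basketball games"}
print(collect_topic_points(shared, lambda w: shared[w]))

# Two primary words related to different structures: two topic points.
distinct = {"basketball": "watch basketball", "weather": "rainy weather"}
print(collect_topic_points(distinct, lambda w: distinct[w]))
```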
The topic points obtained by the present invention can be applied in several scenarios:
For example, in a dialogue system, after the topic point of the current chat utterance is obtained, the dialogue system generates a reply to that utterance according to the obtained topic point, so that the generated reply is reasonable and logical. In a search system, after the topic point of the query entered by the user is obtained, the search engine searches with the obtained topic point, which can widen the search range so that the results better meet the user's search needs. The topic points can also be used to judge a user's consumption intention, travel intention and the like: after the topic points of content such as the user's search text and dialogue text are obtained, they are used to build a user profile, and the user's consumption intention, travel intention and so on are judged from that profile.
Fig. 4 shows a block diagram of an exemplary computer system/server 012 suitable for implementing embodiments of the present invention. The computer system/server 012 shown in Fig. 4 is only an example and should not impose any limitation on the function or scope of use of the embodiments of the present invention.
As shown in Fig. 4, computer system/server 012 takes the form of a general-purpose computing device. Its components may include, but are not limited to: one or more processors or processing units 016, a system memory 028, and a bus 018 connecting the different system components (including system memory 028 and processing unit 016).
Bus 018 represents one or more of several types of bus structure, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA (EISA) bus, the Video Electronics Standards Association (VESA) local bus and the Peripheral Component Interconnect (PCI) bus.
Computer system/server 012 typically includes a variety of computer-system-readable media. These may be any available media accessible by computer system/server 012, including volatile and non-volatile media, and removable and non-removable media.
System memory 028 may include computer-system-readable media in the form of volatile memory, such as random access memory (RAM) 030 and/or cache memory 032. Computer system/server 012 may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, storage system 034 may be used to read from and write to a non-removable, non-volatile magnetic medium (not shown in Fig. 4, commonly called a "hard disk drive"). Although not shown in Fig. 4, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (such as a "floppy disk") and an optical disc drive for reading from and writing to a removable non-volatile optical disc (such as a CD-ROM, DVD-ROM or other optical medium) may be provided. In these cases, each drive may be connected to bus 018 through one or more data media interfaces. Memory 028 may include at least one program product having a set of (for example, at least one) program modules configured to perform the functions of the embodiments of the present invention.
A program/utility 040 having a set of (at least one) program modules 042 may be stored, for example, in memory 028. Such program modules 042 include, but are not limited to, an operating system, one or more application programs, other program modules and program data; each of these examples, or some combination of them, may include an implementation of a network environment. Program modules 042 generally perform the functions and/or methods of the embodiments described in the present invention.
Computer system/server 012 may also communicate with one or more external devices 014 (such as a keyboard, a pointing device, a display 024 and so on); in the present invention, computer system/server 012 communicates with an external radar device. It may also communicate with one or more devices that enable a user to interact with computer system/server 012, and/or with any device (such as a network card or modem) that enables computer system/server 012 to communicate with one or more other computing devices. Such communication may take place through input/output (I/O) interface 022. Computer system/server 012 may also communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN) and/or a public network such as the Internet) through network adapter 020. As shown in the figure, network adapter 020 communicates with the other modules of computer system/server 012 through bus 018. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used with computer system/server 012, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives and data backup storage systems.
By running the programs stored in system memory 028, processing unit 016 performs various functional applications and data processing, for example implementing a method for analyzing text topic points, which may include:
Obtaining text data;
Extracting a primary word from the text data;
Performing syntactic analysis on the text data, and obtaining the topic point of the text data according to the syntactic structure content in the text data that is related to the primary word.
The above computer program may be provided in a computer storage medium; that is, the computer storage medium is encoded with a computer program which, when executed by one or more computers, causes the one or more computers to perform the method flows and/or device operations shown in the above embodiments of the present invention. For example, the method flow performed by the one or more processors may include:
Obtaining text data;
Extracting a primary word from the text data;
Performing syntactic analysis on the text data, and obtaining the topic point of the text data according to the syntactic structure content in the text data that is related to the primary word.
As time passes and technology develops, the meaning of "medium" has become broader, and the distribution of computer programs is no longer limited to tangible media; programs can also be downloaded directly from a network, for example. Any combination of one or more computer-readable media may be used. A computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination of the above. More specific examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection having one or more wires, a portable computer diskette, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In this document, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by, or in connection with, an instruction execution system, apparatus or device.
A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate or transmit a program for use by, or in connection with, an instruction execution system, apparatus or device.
The program code contained on a computer-readable medium may be transmitted over any appropriate medium, including but not limited to wireless, wire, optical cable, RF and so on, or any suitable combination of the above.
Computer program code for performing the operations of the present invention may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, it may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it may be connected to an external computer (for example, through the Internet using an Internet service provider).
With the technical solution provided by the present invention, the primary word of the original text data is extracted, and the topic point of the original text data is then obtained from the syntactic structure of the original text data and the primary word. The topic point obtained is therefore important, fluent and on-topic, and accurately expresses the core semantics of the original text data, which improves the accuracy of text topic point analysis.
In the several embodiments provided by the present invention, it should be understood that the disclosed system, device and method may be implemented in other ways. For example, the device embodiments described above are merely illustrative; the division into units is only a division by logical function, and other divisions are possible in an actual implementation.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network elements. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated in one processing unit, each unit may exist physically on its own, or two or more units may be integrated in one unit. The integrated unit may be implemented in hardware, or in hardware plus software functional units.
An integrated unit implemented as a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) or a processor to perform some of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes any medium that can store program code, such as a USB flash disk, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk or an optical disc.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit it. Any modification, equivalent substitution or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (14)

  1. A method for analyzing text topic points, characterized in that the method comprises:
    Obtaining text data;
    Extracting a primary word from the text data;
    Performing syntactic analysis on the text data, and obtaining the topic point of the text data according to the syntactic structure content in the text data that is related to the primary word.
  2. The method according to claim 1, characterized in that extracting a primary word from the text data comprises:
    Extracting from the text data a word that meets a preset part-of-speech requirement as the primary word; and/or
    Determining the importance score of each word in the text data, and extracting a word that meets a preset score requirement as the primary word.
  3. The method according to claim 2, characterized in that determining the importance score of each word in the text data comprises:
    Determining the importance score of each word in the text data based on statistical indicators of the word in large-scale data; or
    Inputting each word in the text data into a pre-trained word ranking model, and determining the importance score of each word in the text data according to the output of the word ranking model.
  4. The method according to claim 3, characterized in that the word ranking model is pre-trained as follows:
    Obtaining training data, the training data comprising text data in which the importance score of each word is labeled;
    Taking each word of the text data in the training data as input and the importance score of each word in the text data as output, training a deep learning model to obtain the word ranking model.
  5. The method according to claim 1, characterized in that obtaining the topic point of the text data according to the syntactic structure content in the text data that is related to the primary word comprises:
    Obtaining the syntax tree of the text data;
    Determining, according to the obtained syntax tree, the syntactic structure content related to the primary word;
    Combining the determined syntactic structure content to obtain the topic point of the text data.
  6. The method according to claim 5, characterized in that combining the determined syntactic structure content comprises:
    Selecting, from the determined syntactic structure content, the content that meets a preset syntactic structure requirement, and combining it.
  7. A device for analyzing text topic points, characterized in that the device comprises:
    An acquiring unit, configured to obtain text data;
    An extraction unit, configured to extract a primary word from the text data;
    A processing unit, configured to perform syntactic analysis on the text data and to obtain the topic point of the text data according to the syntactic structure content in the text data that is related to the primary word.
  8. The device according to claim 7, characterized in that, when extracting a primary word from the text data, the extraction unit specifically performs:
    Extracting from the text data a word that meets a preset part-of-speech requirement as the primary word; and/or
    Determining the importance score of each word in the text data, and extracting a word that meets a preset score requirement as the primary word.
  9. The device according to claim 8, characterized in that, when determining the importance score of each word in the text data, the extraction unit specifically performs:
    Determining the importance score of each word in the text data based on statistical indicators of the word in large-scale data; or
    Inputting each word in the text data into a pre-trained word ranking model, and determining the importance score of each word in the text data according to the output of the word ranking model.
  10. The device according to claim 9, characterized in that the device further comprises a training unit, configured to pre-train the word ranking model as follows:
    Obtaining training data, the training data comprising text data in which the importance score of each word is labeled;
    Taking each word of the text data in the training data as input and the importance score of each word in the text data as output, training a deep learning model to obtain the word ranking model.
  11. The device according to claim 7, characterized in that, when obtaining the topic point of the text data according to the syntactic structure content in the text data that is related to the primary word, the processing unit specifically performs:
    Obtaining the syntax tree of the text data;
    Determining, according to the obtained syntax tree, the syntactic structure content related to the primary word;
    Combining the determined syntactic structure content to obtain the topic point of the text data.
  12. The device according to claim 11, characterized in that, when combining the determined syntactic structure content, the processing unit specifically performs:
    Selecting, from the determined syntactic structure content, the content that meets a preset syntactic structure requirement, and combining it.
  13. An apparatus, characterized in that the apparatus comprises:
    One or more processors;
    A storage device, configured to store one or more programs,
    Wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method according to any one of claims 1-6.
  14. A storage medium containing computer-executable instructions which, when executed by a computer processor, are used to perform the method according to any one of claims 1-6.
CN201711390850.7A 2017-12-21 2017-12-21 Analyze method, apparatus, equipment and the computer storage media of text topic point Pending CN108268602A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711390850.7A CN108268602A (en) 2017-12-21 2017-12-21 Analyze method, apparatus, equipment and the computer storage media of text topic point

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711390850.7A CN108268602A (en) 2017-12-21 2017-12-21 Analyze method, apparatus, equipment and the computer storage media of text topic point

Publications (1)

Publication Number Publication Date
CN108268602A true CN108268602A (en) 2018-07-10

Family

ID=62772458

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711390850.7A Pending CN108268602A (en) 2017-12-21 2017-12-21 Analyze method, apparatus, equipment and the computer storage media of text topic point

Country Status (1)

Country Link
CN (1) CN108268602A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070233465A1 (en) * 2006-03-20 2007-10-04 Nahoko Sato Information extracting apparatus, and information extracting method
CN104536950A (en) * 2014-12-11 2015-04-22 北京百度网讯科技有限公司 Text summarization generating method and device
CN105260359A (en) * 2015-10-16 2016-01-20 晶赞广告(上海)有限公司 Semantic keyword extraction method and apparatus
CN106502994A (en) * 2016-11-29 2017-03-15 上海智臻智能网络科技股份有限公司 A kind of method and apparatus of the keyword extraction of text
CN106815213A (en) * 2016-12-30 2017-06-09 全民互联科技(天津)有限公司 A kind of contract performance clause extraction method and system
CN106959944A (en) * 2017-02-14 2017-07-18 中国电子科技集团公司第二十八研究所 A kind of Event Distillation method and system based on Chinese syntax rule

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110851560A (en) * 2018-07-27 2020-02-28 杭州海康威视数字技术股份有限公司 Information retrieval method, device and equipment
CN110851560B (en) * 2018-07-27 2023-03-10 杭州海康威视数字技术股份有限公司 Information retrieval method, device and equipment
CN109783733A (en) * 2019-01-15 2019-05-21 三角兽(北京)科技有限公司 User's portrait generating means and method, information processing unit and storage medium
CN110096709A (en) * 2019-05-07 2019-08-06 百度在线网络技术(北京)有限公司 Command processing method and device, server and computer-readable medium
CN114491013A (en) * 2021-12-09 2022-05-13 重庆邮电大学 Topic mining method, storage medium and system for merging syntactic structure information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180710