CN108052520A - Conjunctive word analysis method, electronic device and storage medium based on topic model - Google Patents


Info

Publication number
CN108052520A
Authority
CN
China
Prior art keywords
word
theme
topic model
text
checked
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711059225.4A
Other languages
Chinese (zh)
Inventor
赵清源
吕梓燊
韦邕
徐亮
肖京
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN201711059225.4A (publication CN108052520A)
Priority to PCT/CN2017/113720 (publication WO2019085118A1)
Publication of CN108052520A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation

Abstract

A conjunctive word analysis method based on a topic model, comprising the following steps: A, when topic modeling is required for a technical field, obtaining the texts to be queried of that technical field from a predetermined database corresponding to the technical field, and performing topic modeling on the obtained texts to be queried to obtain a topic model corresponding to each text to be queried; B, training on the texts to be queried based on the topic model to obtain the topics contained in the texts to be queried and a probability distribution matrix of the probabilities with which the words of those topics occur in each topic; C, obtaining the topic vector corresponding to each word from the probability distribution matrix, and analyzing the relationships between the topic vectors of the words according to a preset conjunctive word weight analysis rule, so as to determine the conjunctive words corresponding to a word to be retrieved. This allows conjunctive words to be calculated more accurately and more comprehensively when information retrieval is performed in a specific professional field.

Description

Conjunctive word analysis method, electronic device and storage medium based on topic model
Technical field
The present invention relates to the field of information retrieval, and in particular to a conjunctive word analysis method based on a topic model, an electronic device and a storage medium.
Background
In an information retrieval system, conjunctive word calculation is a crucial step. On the one hand, when the user types in little content, conjunctive word calculation can infer what the user may have in mind and so broaden the search. On the other hand, it can find near-synonyms of the user's input, that is, other words in the database with similar meanings, and perform associative matching on them.
Unlike the near-synonyms and synonyms used in everyday life, conjunctive words in some specific fields, such as medical and health care or scientific and technological innovation, are used in an information retrieval system with a certain degree of specialization. If an open-source online thesaurus of near-synonyms and synonyms is applied directly to these specialized fields, the retrieval results are usually inaccurate and incomplete.
Summary of the invention
In view of this, the present invention proposes a conjunctive word analysis method based on a topic model, a device and a computer-readable medium. They are applicable to the information retrieval system of any professional field and can quickly and accurately calculate the conjunctive words corresponding to a feature word to be retrieved.
First, to achieve the above object, the present invention proposes a conjunctive word analysis method based on a topic model, the method comprising the following steps:
A, when topic modeling is required for a technical field, obtaining the texts to be queried of that technical field from a predetermined database corresponding to the technical field, and performing topic modeling on the obtained texts to be queried to obtain a topic model corresponding to each text to be queried;
B, training on the texts to be queried based on the topic model to obtain the topics contained in the texts to be queried and a probability distribution matrix of the probabilities with which the words of those topics occur in each topic;
C, obtaining the topic vector corresponding to each word from the probability distribution matrix, and analyzing the relationships between the topic vectors of the words according to a preset conjunctive word weight analysis rule, so as to determine the conjunctive words corresponding to a word to be retrieved.
Preferably, obtaining the topic vector corresponding to each word from the probability distribution matrix comprises:
Normalizing the parameters of each column of the probability distribution matrix, thereby obtaining, with the word as the dimension, the topic vector corresponding to each word.
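The column normalization can be illustrated with a minimal sketch. It assumes the matrix is laid out with one row per topic and one column per word, and that "normalizing" means dividing each column by its sum; the text does not fix the normalization, so this is only one plausible reading. The function name, the sample matrix and the vocabulary are hypothetical.

```python
from typing import Dict, List, Sequence


def word_topic_vectors(
    topic_word: Sequence[Sequence[float]],
    vocabulary: Sequence[str],
) -> Dict[str, List[float]]:
    """Turn a topic-by-word matrix into one normalized topic vector per word.

    topic_word[k][j] is assumed to hold P(word j | topic k); rows are topics
    and columns are words.
    """
    vectors: Dict[str, List[float]] = {}
    for j, word in enumerate(vocabulary):
        column = [row[j] for row in topic_word]
        total = sum(column)
        # Assumption: normalization divides by the column sum, so the entries
        # of each word's vector add up to 1. A word with zero mass in every
        # topic keeps an all-zero vector.
        vectors[word] = [p / total for p in column] if total > 0 else list(column)
    return vectors


# Hypothetical 2-topic, 3-word matrix (each row sums to 1).
phi = [
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
]
vocab = ["fever", "cough", "invoice"]
vectors = word_topic_vectors(phi, vocab)
```

Each resulting vector has one entry per topic, so words that concentrate in the same topics end up with similar vectors.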
Preferably, the preset conjunctive word weight analysis rule comprises:
Calculating the Euclidean distance between the topic vector of the word to be retrieved and the topic vector of each other word in the probability distribution matrix;
Comparing the calculated Euclidean distances and finding the minimum Euclidean distance;
Taking the other word or words corresponding to the minimum Euclidean distance found as the conjunctive words of the word to be retrieved.
Preferably, the other words are the words of the probability distribution matrix other than the word to be retrieved.
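The rule above amounts to a nearest-neighbour lookup over topic vectors. The sketch below is one way to express it with the standard library's `math.dist`; the function name and the sample vectors are hypothetical, and the vectors are assumed to have been normalized already.

```python
import math
from typing import Dict, Sequence


def nearest_word(query: str, vectors: Dict[str, Sequence[float]]) -> str:
    """Return the other word whose topic vector is closest to the query's."""
    query_vec = vectors[query]
    # "Other words" are all words in the matrix except the query itself.
    candidates = {w: v for w, v in vectors.items() if w != query}
    if not candidates:
        raise ValueError("no other words to compare against")
    return min(candidates, key=lambda w: math.dist(query_vec, candidates[w]))


# Hypothetical topic vectors (already normalized per word).
topic_vectors = {
    "fever": [0.75, 0.25],
    "cough": [0.60, 0.40],
    "invoice": [0.14, 0.86],
}
closest = nearest_word("fever", topic_vectors)
```

With these sample values, "cough" is the closest word to "fever" because both put most of their weight on the first topic.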
Preferably, the following step is further included before step A:
Determining, according to a predetermined mapping between technical fields and corpora, the corpus corresponding to the technical field to which the obtained texts to be queried belong, and using the determined corpus as the corpus of the topic model for that technical field.
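The field-to-corpus mapping can be as simple as a lookup table. In the sketch below the field names and corpus paths are placeholders invented for illustration; the patent does not say how the mapping is stored.

```python
from typing import Dict

# Hypothetical mapping from technical field to corpus location; the field
# names and paths are placeholders, not values from the patent.
FIELD_TO_CORPUS: Dict[str, str] = {
    "medical": "corpora/medical_papers",
    "technology": "corpora/tech_blogs",
}


def corpus_for_field(field: str) -> str:
    """Look up the corpus predetermined for a technical field."""
    try:
        return FIELD_TO_CORPUS[field]
    except KeyError:
        raise ValueError(f"no corpus is mapped to field {field!r}") from None


selected = corpus_for_field("medical")
```

Failing loudly on an unmapped field is a design choice of this sketch; it keeps a topic model from being trained on a corpus from the wrong field.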
In addition, to achieve the above object, the present invention also provides an electronic device based on a topic model. The device includes a memory and a processor; a conjunctive word analysis system based on a topic model is stored in the memory, and when executed by the processor it implements the following operations:
S1, when topic modeling is required for a technical field, obtaining the texts to be queried of that technical field from a predetermined database corresponding to the technical field, and performing topic modeling on the obtained texts to be queried to obtain a topic model corresponding to each text to be queried;
S2, training on the texts to be queried based on the topic model to obtain the topics contained in the texts to be queried and a probability distribution matrix of the probabilities with which the words of those topics occur in each topic;
S3, obtaining the topic vector corresponding to each word from the probability distribution matrix, and analyzing the relationships between the topic vectors of the words according to a preset conjunctive word weight analysis rule, so as to determine the conjunctive words corresponding to a word to be retrieved.
Preferably, the operation, implemented when the processor executes the conjunctive word analysis system based on the topic model, of obtaining the topic vector corresponding to each word from the topic feature probability distribution matrix comprises:
Normalizing the parameters of each column of the probability distribution matrix, thereby obtaining, with the word as the dimension, the topic vector corresponding to each word.
Preferably, the operation of applying the preset conjunctive word weight analysis rule, implemented when the processor executes the conjunctive word analysis system based on the topic model, comprises:
Calculating the Euclidean distance between the topic vector of the word to be retrieved and the topic vector of each other word in the probability distribution matrix;
Comparing the calculated Euclidean distances and finding the minimum Euclidean distance;
Taking the other word or words corresponding to the minimum Euclidean distance found as the conjunctive words of the word to be retrieved.
Preferably, before step S1 the processor, executing the conjunctive word analysis system based on the topic model, also implements the following operation:
Determining, according to a predetermined mapping between technical fields and corpora, the corpus corresponding to the technical field to which the obtained texts to be queried belong, and using the determined corpus as the corpus of the topic model for that technical field.
Further, to achieve the above object, the present invention also provides a computer-readable storage medium on which a conjunctive word analysis program based on a topic model is stored; when executed by a processor, the program implements the steps of the above conjunctive word analysis method based on a topic model.
Compared with the prior art, the conjunctive word analysis method based on a topic model, the electronic device and the computer-readable storage medium proposed by the present invention operate as follows. First, when topic modeling is required for a technical field, the texts to be queried of that technical field are obtained from a predetermined database corresponding to the technical field, and topic modeling is performed on them to obtain a topic model corresponding to each text to be queried. Then, the texts to be queried are trained on based on the topic model to obtain the topics they contain and a probability distribution matrix of the probabilities with which the words of those topics occur in each topic. Next, the topic vector corresponding to each word is obtained from the probability distribution matrix, and the relationships between the topic vectors of the words are analyzed according to a preset conjunctive word weight analysis rule to determine the conjunctive words corresponding to a word to be retrieved. This not only avoids applying an open-source online thesaurus of near-synonyms and synonyms directly to a specific professional field, but also yields retrieval results that are more accurate and more comprehensive than those of existing conjunctive word analysis approaches.
Description of the drawings
Fig. 1 is a schematic diagram of the hardware architecture of the electronic device based on a topic model of the present invention;
Fig. 2 is a program module diagram of the conjunctive word analysis system based on a topic model of the present invention;
Fig. 3 is a schematic diagram of the hardware architecture of the analysis module in Fig. 2;
Fig. 4 is a flow chart of the conjunctive word analysis method based on a topic model of the present invention;
Fig. 5 is a flow chart of step S403 in Fig. 4.
The realization of the object, the functional features and the advantages of the present invention are further described below with reference to the accompanying drawings and in conjunction with the embodiments.
Detailed description of the embodiments
To make the object, technical solution and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention without creative work fall within the scope of protection of the present invention.
Referring to Fig. 1, it is a schematic diagram of the hardware architecture of the electronic device of the present invention.
In the present embodiment, the electronic device 1 includes, but is not limited to, a memory 11, a processor 12 and a network interface 13 that can be communicatively connected to one another through a system bus. It should be noted that Fig. 1 shows only the electronic device with components 11-13; it should be understood that not all of the illustrated components are required, and more or fewer components may be implemented instead.
The memory 11 includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disc and the like. In some embodiments, the memory 11 may be an internal storage unit of the electronic device 1, such as its hard disk or internal memory. In other embodiments, the memory 11 may be an external storage device of the electronic device 1, such as a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card or a flash card (Flash Card) fitted to the electronic device 1. Of course, the memory 11 may also include both the internal storage unit of the electronic device 1 and its external storage device. In the present embodiment, the memory 11 is generally used to store the operating system and the various application software installed on the electronic device 1, such as the program code of the conjunctive word analysis system 200 based on a topic model. In addition, the memory 11 may be used to temporarily store various data that has been output or is to be output.
In some embodiments, the processor 12 may be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor or another data processing chip. The processor 12 is generally used to control the overall operation of the electronic device 1, for example to perform control and processing related to data interaction or communication with the electronic device 1. In the present embodiment, the processor 12 is used to run the program code stored in the memory 11 or to process data, for example to run the conjunctive word analysis system 200 based on a topic model.
The network interface 13 may include a wireless network interface or a wired network interface, and is generally used to establish a communication connection between the electronic device 1 and other electronic equipment.
In other embodiments of the present invention, the electronic device 1 further includes a display (not shown in Fig. 1). The display may be an LED display, a liquid crystal display, a touch-control liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device or the like. The display is used to show the information processed in the electronic device 1 and to present a visual user interface.
So far, the application environment of the embodiments of the present invention and the hardware structure and functions of the related equipment have been described in detail. The embodiments of the present invention are proposed below on the basis of that application environment and equipment.
First, the present invention proposes a conjunctive word analysis system 200 based on a topic model.
Referring to Fig. 2, it is a program module diagram of the first embodiment of the conjunctive word analysis system 200 based on a topic model of the present invention. In the present embodiment, the system 200 may be divided into one or more modules, which are stored in the memory 11 and executed by one or more processors (the processor 12 in the present embodiment) to carry out the present invention. For example, in Fig. 2 the system 200 may be divided into a modeling module 201, a training module 202 and an analysis module 203. A program module in the sense of the present invention is a series of computer program instruction segments capable of performing a specific function, and is better suited than a program to describing the execution of the conjunctive word analysis system 200 in the electronic device 1. The functions of the program modules 201-203 are described in detail below.
Modeling module 201, used, when topic modeling is required for a technical field, to obtain the texts to be queried of that technical field from a predetermined database corresponding to the technical field (for example, a paper repository or a blog article repository for the technical field), and to perform topic modeling on the obtained texts to be queried to obtain a topic model corresponding to each text to be queried.
In one embodiment of the present invention, the texts to be queried may be edited according to the application scenarios of the technical field, so as to obtain well-formed texts to be queried.
For example, if the technical field is the medical field, the texts in the predetermined database of the medical field are taken as the texts to be queried. First, texts of no practical significance are deleted according to the core keywords of the medical field, to exclude interference. For example, the kinds of predetermined medical-field core keywords contained in each text to be queried, and the corresponding counts, are analyzed; if the number of kinds of core keyword contained in a text to be queried is less than a first threshold (for example, 2) and the total number of core keywords it contains is less than a second threshold (for example, 2), the text is determined to be a meaningless text. During word segmentation, only nouns and verbs are retained, and some adjectives, auxiliary words and the like are deleted, for example "的", "得" and "地", to exclude interference.
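The keyword-based filtering can be sketched as follows. The sketch assumes the texts are already tokenized, and it reads the rule as discarding a text only when both counts fall below their thresholds, which is one interpretation of the description. The function names, the keyword set and the sample texts are hypothetical.

```python
from typing import Iterable, List, Set


def is_meaningful(
    tokens: Iterable[str],
    core_keywords: Set[str],
    min_kinds: int = 2,
    min_total: int = 2,
) -> bool:
    """Keep a text unless it falls below both keyword thresholds."""
    hits = [t for t in tokens if t in core_keywords]
    kinds = len(set(hits))
    total = len(hits)
    # Assumption: a text is discarded only when BOTH counts are below
    # their thresholds, which is one reading of the description.
    return not (kinds < min_kinds and total < min_total)


def filter_texts(
    texts: List[List[str]], core_keywords: Set[str]
) -> List[List[str]]:
    """Return only the texts judged meaningful for the field."""
    return [t for t in texts if is_meaningful(t, core_keywords)]


# Hypothetical keyword list and pre-tokenized texts.
keywords = {"fever", "cough", "diagnosis"}
docs = [
    ["patient", "fever", "cough", "diagnosis"],
    ["weather", "today", "fever"],
    ["meeting", "agenda"],
]
kept = filter_texts(docs, keywords)
```

Only the first sample text survives: it contains three kinds of core keyword, while the other two contain one and none.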
In general, in a topic model a topic represents a concept or an aspect and appears as a series of related words; it is the conditional probability of these words. Figuratively, a topic is a bucket filled with words that have a high probability of occurrence, and these words are strongly correlated with the topic.
A corpus is needed before topic modeling is carried out. In the present embodiment, the corpus corresponding to the technical field to which the obtained texts to be queried belong is determined according to the predetermined mapping between technical fields and corpora, and the determined corpus is used as the corpus for topic modeling of that technical field.
Training module 202, used to train on the texts to be queried based on the topic model to obtain the topics contained in the texts to be queried and a probability distribution matrix of the probabilities with which the words of those topics occur in each topic.
In general, in the LDA topic model the generation of each word depends on the topic to which the word belongs. That is, there is usually a conditional probability between a word and a topic, typically written as P(word | topic). If this relationship is represented as a matrix, the number of rows equals the number of topics and the number of columns equals the number of all words, so each row of the matrix is the probability distribution over the different words generated under one topic. In other words, when a text to be queried is trained on based on the topic model, what is usually trained is the probability distribution matrix between the topics contained in the text and the different words corresponding to those topics.
In the present embodiment, the analysis focuses on the columns of the probability distribution matrix. Each column expresses the probabilities with which a given word occurs in each topic; after the parameters of each column are normalized, the topic vector of each word is obtained, with the word as the dimension.
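The matrix and the normalization step described above can be written compactly. The notation below ($K$ topics $z_k$, $V$ words $w_j$, the matrix $\Phi$ and the vector $v_j$) is introduced here for illustration and does not appear in the original text. Dividing by the column sum is an assumption, since the text does not specify how the columns are normalized.

```latex
\Phi \in \mathbb{R}^{K \times V}, \qquad
\Phi_{k,j} = P(w_j \mid z_k), \qquad
\sum_{j=1}^{V} \Phi_{k,j} = 1 \quad (k = 1, \dots, K)

v_j = \frac{1}{\sum_{k=1}^{K} \Phi_{k,j}}
      \bigl(\Phi_{1,j}, \Phi_{2,j}, \dots, \Phi_{K,j}\bigr)^{\top}
```

Under this reading each row of $\Phi$ is a distribution over words, while each normalized column $v_j$ is a distribution over topics for the single word $w_j$.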
Analysis module 203, used to analyze the relationships between the topic vectors of the words according to a preset conjunctive word weight analysis rule, so as to determine the conjunctive words corresponding to a word to be retrieved.
The preset conjunctive word weight analysis rule includes:
Calculating the Euclidean distance between the topic vector of the word to be retrieved and the topic vector of each other word in the probability distribution matrix;
Comparing the calculated Euclidean distances and finding the minimum Euclidean distance;
Taking the other word or words corresponding to the minimum Euclidean distance found as the conjunctive words of the word to be retrieved.
Here, the other words are the words of the probability distribution matrix other than the word to be retrieved.
The present embodiment uses the result of training the topic model LDA (Latent Dirichlet Allocation) on the corpus to be queried: the latent probability distribution of words within each topic is abstracted into a probability distribution of each word over the topics, this distribution is used to calculate the Euclidean distance between the topic vectors of the words, and the association between the words in the entire corpus is then calculated from those Euclidean distances. It should be noted that the feature in the present embodiment is the word.
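For reference, the distance and the selection rule can be stated as formulas. The symbols are illustrative and not taken from the original text: $v_q$ is the topic vector of the word to be retrieved $w_q$, $v_j$ is the topic vector of another word $w_j$, and $K$ is the number of topics.

```latex
d(w_q, w_j) = \lVert v_q - v_j \rVert_2
            = \sqrt{\sum_{k=1}^{K} \bigl(v_{q,k} - v_{j,k}\bigr)^2},
\qquad
w^{*} = \operatorname*{arg\,min}_{j \neq q} \; d(w_q, w_j)
```

A smaller distance means the two words distribute their probability over the topics in a more similar way, which is what the embodiment treats as a stronger association.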
In a preferred embodiment, Fig. 3 is a schematic diagram of the hardware architecture of the analysis module 203 in Fig. 2. As Fig. 3 shows, the analysis module 203 includes a computing unit 301, a comparing unit 302 and a conjunctive word determination unit 303.
Computing unit 301, used to calculate the Euclidean distance between the topic vector of the word to be retrieved and the topic vector of each other word in the probability distribution matrix.
Comparing unit 302, used to compare the calculated Euclidean distances and find the minimum Euclidean distance.
Conjunctive word determination unit 303, used to take the other word or words corresponding to the minimum Euclidean distance found as the conjunctive words of the word to be retrieved.
The above conjunctive word analysis system based on a topic model operates as follows. First, when topic modeling is required for a technical field, the texts to be queried of that technical field are obtained from a predetermined database corresponding to the technical field, and topic modeling is performed on them to obtain a topic model corresponding to each text to be queried. Then, the texts to be queried are trained on based on the topic model to obtain the topics they contain and a probability distribution matrix of the probabilities with which the words of those topics occur in each topic. Next, the topic vector corresponding to each word is obtained from the probability distribution matrix, and the relationships between the topic vectors of the words are analyzed according to a preset conjunctive word weight analysis rule to determine the conjunctive words corresponding to a word to be retrieved. This not only avoids applying an open-source online thesaurus of near-synonyms and synonyms directly to a specific professional field, but also yields retrieval results that are more accurate and more comprehensive than those of existing conjunctive word analysis approaches.
In addition, the present invention also proposes a conjunctive word analysis method based on a topic model.
Fig. 4 is a flow chart of the conjunctive word analysis method based on a topic model of the present invention. As Fig. 4 shows, the method includes the following steps S401 to S403.
Step S401, when topic modeling is required for a technical field, obtaining the texts to be queried of that technical field from a predetermined database corresponding to the technical field (for example, a paper repository or a blog article repository for the technical field), and performing topic modeling on the obtained texts to be queried to obtain a topic model corresponding to each text to be queried.
In one embodiment of the present invention, the texts to be queried may be edited according to the application scenarios of the technical field, so as to obtain well-formed texts to be queried.
For example, if the technical field is the medical field, the texts in the predetermined database of the medical field are taken as the texts to be queried. First, texts of no practical significance are deleted according to the core keywords of the medical field, to exclude interference. For example, the kinds of predetermined medical-field core keywords contained in each text to be queried, and the corresponding counts, are analyzed; if the number of kinds of core keyword contained in a text to be queried is less than a first threshold (for example, 2) and the total number of core keywords it contains is less than a second threshold (for example, 2), the text is determined to be a meaningless text. During word segmentation, only nouns and verbs are retained, and some adjectives, auxiliary words and the like are deleted, for example "的", "得" and "地", to exclude interference. In general, in a topic model a topic represents a concept or an aspect and appears as a series of related words; it is the conditional probability of these words. Figuratively, a topic is a bucket filled with words that have a high probability of occurrence, and these words are strongly correlated with the topic.
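The part-of-speech filtering applied during word segmentation can be sketched as below. The sketch assumes the tokens have already been segmented and tagged by some external tool; the tag names ("n", "v", "a", "u") and the sample sentence are illustrative, not taken from the patent.

```python
from typing import List, Tuple

# Assumption: tokens arrive already segmented and tagged by an external
# tool, with coarse tags such as "n" (noun), "v" (verb), "a" (adjective)
# and "u" (auxiliary word). The tag names are illustrative.
KEPT_TAGS = {"n", "v"}


def keep_nouns_and_verbs(tagged: List[Tuple[str, str]]) -> List[str]:
    """Drop every token whose part-of-speech tag is not a noun or verb."""
    return [word for word, tag in tagged if tag in KEPT_TAGS]


# Hypothetical tagged sentence mixing content words and function words.
tagged_sentence = [
    ("patient", "n"),
    ("的", "u"),
    ("severe", "a"),
    ("cough", "n"),
    ("worsens", "v"),
]
content_words = keep_nouns_and_verbs(tagged_sentence)
```

Filtering before training keeps function words, which occur in nearly every topic, from dominating the word-topic probabilities.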
A corpus is needed before topic modeling is carried out. In the present embodiment, the corpus corresponding to the technical field to which the obtained texts to be queried belong is determined according to the predetermined mapping between technical fields and corpora, and the determined corpus is used as the corpus for topic modeling of that technical field.
Step S402, training on the texts to be queried based on the topic model to obtain the topics contained in the texts to be queried and a probability distribution matrix of the probabilities with which the words of those topics occur in each topic.
In general, in the LDA topic model the generation of each word depends on the topic to which the word belongs. That is, there is usually a conditional probability between a word and a topic, typically written as P(word | topic). If this relationship is represented as a matrix, the number of rows equals the number of topics and the number of columns equals the number of all words, so each row of the matrix is the probability distribution over the different words generated under one topic. In other words, when a text to be queried is trained on based on the topic model, what is usually trained is the probability distribution matrix between the topics contained in the text and the different words corresponding to those topics.
In the present embodiment, the analysis focuses on the columns of the probability distribution matrix. Each column expresses the probabilities with which a given word occurs in each topic; after the parameters of each column are normalized, the topic vector of each word is obtained, with the word as the dimension.
Step S403, analyzing the relationships between the topic vectors of the words according to a preset conjunctive word weight analysis rule, so as to determine the conjunctive words corresponding to a word to be retrieved.
The preset conjunctive word weight analysis rule includes:
Calculating the Euclidean distance between the topic vector of the word to be retrieved and the topic vector of each other word in the probability distribution matrix;
Comparing the calculated Euclidean distances and finding the minimum Euclidean distance;
Taking the other word or words corresponding to the minimum Euclidean distance found as the conjunctive words of the word to be retrieved.
Here, the other words are the words of the probability distribution matrix other than the word to be retrieved.
The present embodiment uses the result of training the topic model LDA (Latent Dirichlet Allocation) on the corpus to be queried: the latent probability distribution of words within each topic is abstracted into a probability distribution of each word over the topics, this distribution is used to calculate the Euclidean distance between the topic vectors of the words, and the association between the words in the entire corpus is then calculated from those Euclidean distances. It should be noted that the feature in the present embodiment is the word.
In a preferred embodiment, Fig. 5 is a flow chart of step S403 in Fig. 4. As Fig. 5 shows, step S403 in the first embodiment specifically includes the following steps S501 to S503.
Step S501, calculating the Euclidean distance between the topic vector of the word to be retrieved and the topic vector of each other word in the probability distribution matrix.
Step S502, comparing the calculated Euclidean distances and finding the minimum Euclidean distance.
Step S503, taking the other word or words corresponding to the minimum Euclidean distance found as the conjunctive words of the word to be retrieved.
The above conjunctive word analysis method based on a topic model is implemented as follows. First, when theme modeling needs to be carried out for a technical field, the texts to be checked of that technical field are obtained from the predetermined database corresponding to the technical field, and theme modeling is carried out on the obtained texts to be checked to obtain the topic model corresponding to each text to be checked. Then, the texts to be checked are trained based on the topic model to obtain the themes contained in the texts to be checked and a probability distribution matrix of the probabilities with which the words in those themes occur in each theme. Then, the theme vector corresponding to each word is obtained from the probability distribution matrix, and the relationship between the theme vectors corresponding to the words is analyzed according to the preset conjunctive word weight analysis rule, so as to determine the conjunctive word corresponding to the word to be retrieved. In this way, open-source dictionaries of synonyms and near-synonyms available online can be applied directly to a specific professional field, and, compared with existing conjunctive word analysis approaches, the retrieval results of this conjunctive word analysis method based on a topic model are more accurate and more comprehensive.
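The training step that yields the probability distribution matrix can be illustrated with a very small LDA trainer. The specification does not state how LDA is trained; the sketch below assumes collapsed Gibbs sampling, and the function name, the hyperparameters (alpha, beta, the iteration count, the seed), and the toy documents are all hypothetical. It returns a theme-by-word matrix of the kind consumed by the theme vector and distance steps described above.

```python
import random


def train_lda(docs, num_themes, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Tiny collapsed Gibbs sampler; returns (vocab, phi), phi[k][w] = P(word w | theme k)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    size = len(vocab)
    doc_theme = [[0] * num_themes for _ in docs]          # tokens of doc d in theme k
    theme_word = [[0] * size for _ in range(num_themes)]  # tokens of word w in theme k
    theme_total = [0] * num_themes                        # tokens assigned to theme k
    assign = []

    def update(d, wi, k, delta):
        doc_theme[d][k] += delta
        theme_word[k][wi] += delta
        theme_total[k] += delta

    # Random initial theme assignment for every token.
    for d, doc in enumerate(docs):
        assign.append([rng.randrange(num_themes) for _ in doc])
        for w, k in zip(doc, assign[d]):
            update(d, index[w], k, +1)

    # Resample each token's theme conditioned on all other assignments.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wi = index[w]
                update(d, wi, assign[d][i], -1)
                weights = [
                    (doc_theme[d][t] + alpha)
                    * (theme_word[t][wi] + beta)
                    / (theme_total[t] + size * beta)
                    for t in range(num_themes)
                ]
                k = rng.choices(range(num_themes), weights=weights)[0]
                assign[d][i] = k
                update(d, wi, k, +1)

    # Smoothed per-theme word probabilities; each row sums to 1.
    phi = [
        [(theme_word[k][w] + beta) / (theme_total[k] + size * beta) for w in range(size)]
        for k in range(num_themes)
    ]
    return vocab, phi


# Hypothetical tokenized texts standing in for the texts to be checked.
docs = [
    ["loan", "credit", "bank", "loan"],
    ["credit", "bank", "interest"],
    ["football", "goal", "match"],
    ["match", "goal", "football", "team"],
]
vocab, phi = train_lda(docs, num_themes=2)
print(len(phi), "themes over", len(vocab), "words")
```

In practice an established LDA implementation would normally be preferred over a hand-written sampler; the sketch is only meant to make the shape and meaning of the probability distribution matrix concrete.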
The numbering of the above embodiments of the present invention is for description only and does not indicate the relative merits of the embodiments.
From the above description of the embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software together with a necessary general-purpose hardware platform, and of course also by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence, or the part of it that contributes over the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disc) and includes a number of instructions that cause a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the methods described in the embodiments of the present invention.
The above are only preferred embodiments of the present invention and do not limit the scope of the invention. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the scope of protection of the present invention.

Claims (10)

1. A conjunctive word analysis method based on a topic model, characterized in that the method comprises the following steps:
A. when theme modeling needs to be carried out for a technical field, obtaining the texts to be checked of the technical field from the predetermined database corresponding to the technical field, and carrying out theme modeling on the obtained texts to be checked to obtain the topic model corresponding to each text to be checked;
B. training the texts to be checked based on the topic model to obtain the themes contained in the texts to be checked and a probability distribution matrix of the probabilities with which the words in the contained themes occur in each theme;
C. obtaining the theme vector corresponding to each word from the probability distribution matrix, and analyzing the relationship between the theme vectors corresponding to the words according to a preset conjunctive word weight analysis rule, so as to determine the conjunctive word corresponding to the word to be retrieved.
2. The conjunctive word analysis method based on a topic model according to claim 1, characterized in that obtaining the theme vector corresponding to each word from the probability distribution matrix comprises:
normalizing the parameters corresponding to each column of the probability distribution matrix, so as to obtain, with the word as the dimension, the theme vector corresponding to each word.
3. The conjunctive word analysis method based on a topic model according to claim 1, characterized in that the preset conjunctive word weight analysis rule comprises:
calculating, respectively, the Euclidean distance between the theme vector corresponding to the word to be retrieved and the theme vector corresponding to each of the other words in the probability distribution matrix;
comparing the calculated Euclidean distances with one another to find the minimum Euclidean distance;
taking the other word corresponding to the minimum Euclidean distance found as the conjunctive word of the word to be retrieved.
4. The conjunctive word analysis method based on a topic model according to claim 3, characterized in that the other words are the words corresponding to the probability distribution matrix other than the word to be retrieved.
5. The conjunctive word analysis method based on a topic model according to claim 1, characterized in that the method further comprises the following step before step A:
determining, according to a predetermined mapping relationship between technical fields and corpora, the corpus corresponding to the technical field to which the obtained texts to be checked belong, and taking the determined corpus as the corpus of the topic model of the technical field.
6. An electronic device, characterized in that the electronic device comprises a memory and a processor, the memory storing a conjunctive word analysis system based on a topic model that can run on the processor, wherein the conjunctive word analysis system based on a topic model, when executed by the processor, implements the following operations:
S1. when theme modeling needs to be carried out for a technical field, obtaining the texts to be checked of the technical field from the predetermined database corresponding to the technical field, and carrying out theme modeling on the obtained texts to be checked to obtain the topic model corresponding to each text to be checked;
S2. training the texts to be checked based on the topic model to obtain the themes contained in the texts to be checked and a probability distribution matrix of the probabilities with which the words in the contained themes occur in each theme;
S3. obtaining the theme vector corresponding to each word from the probability distribution matrix, and analyzing the relationship between the theme vectors corresponding to the words according to a preset conjunctive word weight analysis rule, so as to determine the conjunctive word corresponding to the word to be retrieved.
7. The electronic device according to claim 6, characterized in that the operation, implemented when the processor executes the conjunctive word analysis system based on a topic model, of obtaining the theme vector corresponding to each word from the theme feature probability distribution matrix comprises:
normalizing the parameters corresponding to each column of the probability distribution matrix, so as to obtain, with the word as the dimension, the theme vector corresponding to each word.
8. The electronic device according to claim 6, characterized in that the operation, implemented when the processor executes the conjunctive word analysis system based on a topic model, of applying the preset conjunctive word weight analysis rule comprises:
calculating, respectively, the Euclidean distance between the theme vector corresponding to the word to be retrieved and the theme vector corresponding to each of the other words in the probability distribution matrix;
comparing the calculated Euclidean distances with one another to find the minimum Euclidean distance;
taking the other word corresponding to the minimum Euclidean distance found as the conjunctive word of the word to be retrieved.
9. The electronic device according to claim 6, characterized in that, before implementing step S1, the processor executing the conjunctive word analysis system based on a topic model further implements the following operation:
determining, according to a predetermined mapping relationship between technical fields and corpora, the corpus corresponding to the technical field to which the obtained texts to be checked belong, and taking the determined corpus as the corpus of the topic model of the technical field.
10. A computer-readable storage medium, characterized in that a conjunctive word analysis system based on a topic model is stored on the computer-readable storage medium, and the conjunctive word analysis system based on a topic model, when executed by a processor, implements the steps of the conjunctive word analysis method based on a topic model according to any one of claims 1 to 5.
CN201711059225.4A 2017-11-01 2017-11-01 Conjunctive word analysis method, electronic device and storage medium based on topic model Pending CN108052520A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201711059225.4A CN108052520A (en) 2017-11-01 2017-11-01 Conjunctive word analysis method, electronic device and storage medium based on topic model
PCT/CN2017/113720 WO2019085118A1 (en) 2017-11-01 2017-11-30 Topic model-based associated word analysis method, and electronic apparatus and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711059225.4A CN108052520A (en) 2017-11-01 2017-11-01 Conjunctive word analysis method, electronic device and storage medium based on topic model

Publications (1)

Publication Number Publication Date
CN108052520A true CN108052520A (en) 2018-05-18

Family

ID=62118863

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711059225.4A Pending CN108052520A (en) 2017-11-01 2017-11-01 Conjunctive word analysis method, electronic device and storage medium based on topic model

Country Status (2)

Country Link
CN (1) CN108052520A (en)
WO (1) WO2019085118A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110413737A (en) * 2019-07-29 2019-11-05 腾讯科技(深圳)有限公司 A kind of determination method, apparatus, server and the readable storage medium storing program for executing of synonym
CN111382566A (en) * 2018-12-28 2020-07-07 北京搜狗科技发展有限公司 Site theme determination method and device and electronic equipment
CN115730592A (en) * 2022-11-30 2023-03-03 贵州电网有限责任公司信息中心 Power grid redundant data elimination method, device, equipment and storage medium

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8326820B2 (en) * 2009-09-30 2012-12-04 Microsoft Corporation Long-query retrieval
CN103425799A (en) * 2013-09-04 2013-12-04 北京邮电大学 Personalized research direction recommending system and method based on themes
CN103699625A (en) * 2013-12-20 2014-04-02 北京百度网讯科技有限公司 Method and device for retrieving based on keyword
CN105224521A (en) * 2015-09-28 2016-01-06 北大方正集团有限公司 Key phrases extraction method and use its method obtaining correlated digital resource and device
CN105955948A (en) * 2016-04-22 2016-09-21 武汉大学 Short text topic modeling method based on word semantic similarity
CN106202294A (en) * 2016-07-01 2016-12-07 北京奇虎科技有限公司 The related news computational methods merged based on key word and topic model and device
CN106202177A (en) * 2016-06-27 2016-12-07 腾讯科技(深圳)有限公司 A kind of file classification method and device
CN106844424A (en) * 2016-12-09 2017-06-13 宁波大学 A kind of file classification method based on LDA
CN106897276A (en) * 2015-12-17 2017-06-27 中国科学院深圳先进技术研究院 A kind of internet data clustering method and system
CN107220232A (en) * 2017-04-06 2017-09-29 北京百度网讯科技有限公司 Keyword extracting method and device, equipment and computer-readable recording medium based on artificial intelligence

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102831234B (en) * 2012-08-31 2015-04-22 北京邮电大学 Personalized news recommendation device and method based on news content and theme feature
CN103440329B (en) * 2013-09-04 2016-05-18 北京邮电大学 Authority author and high-quality paper commending system and recommend method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
周亦鹏 等: "基于关联词的主题模型语义标注", 《智能系统学报》 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111382566A (en) * 2018-12-28 2020-07-07 北京搜狗科技发展有限公司 Site theme determination method and device and electronic equipment
CN110413737A (en) * 2019-07-29 2019-11-05 腾讯科技(深圳)有限公司 A kind of determination method, apparatus, server and the readable storage medium storing program for executing of synonym
CN110413737B (en) * 2019-07-29 2022-10-14 腾讯科技(深圳)有限公司 Synonym determination method, synonym determination device, server and readable storage medium
CN115730592A (en) * 2022-11-30 2023-03-03 贵州电网有限责任公司信息中心 Power grid redundant data elimination method, device, equipment and storage medium

Also Published As

Publication number Publication date
WO2019085118A1 (en) 2019-05-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180518