CN107577656B - Text implicit semantic activation method and system - Google Patents


Info

Publication number
CN107577656B
CN107577656B (application CN201710565733.3A)
Authority
CN
China
Prior art keywords
text
detected
word
determining
semantic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710565733.3A
Other languages
Chinese (zh)
Other versions
CN107577656A (en)
Inventor
曾大军
白洁
李林静
王磊
李秋丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN201710565733.3A priority Critical patent/CN107577656B/en
Publication of CN107577656A publication Critical patent/CN107577656A/en
Application granted granted Critical
Publication of CN107577656B publication Critical patent/CN107577656B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Machine Translation (AREA)

Abstract

The invention relates to a text implicit semantic activation method and system. The activation method comprises: acquiring information on the terms to be detected of a text to be detected; determining an activation coefficient for each reference term in the vocabulary of a text set knowledge base according to the knowledge base and the information on the terms to be detected, wherein the knowledge base comprises a vocabulary formed by a plurality of reference terms together with the reference word vector and reference word frequency corresponding to each reference term; selecting, according to the activation coefficients, the corresponding reference terms to form an implicit semantic set of the text to be detected; and adding the implicit semantic set to the text to be detected for semantic expansion. In this way, the implicit information of the text to be detected can be determined accurately.

Description

Text implicit semantic activation method and system
Technical Field
The invention relates to the technical field of computer science, in particular to a text implied semantic activation method and a text implied semantic activation system.
Background
Faced with the massive dynamic data on today's internet, information overload has become a major obstacle for network users trying to obtain useful information. On social media platforms in particular, user-generated text is noisy and semantically vague, and its content style is free-form and fast-changing, so a large amount of implicit information exists. Mining this implicit information from large-scale data and realizing efficient and accurate implicit semantic analysis therefore has very important application value.
Previous implicit semantic analysis work falls mainly into two categories. The first uses an external knowledge base (WordNet, Wikipedia, etc.) or an external technology (search engines, machine translation, etc.) to expand the semantic information of the text to be detected. Representative work includes Kim et al. (Kim, H.-J., K.-J. Hong and J.Y. Chang (2015). Semantics Enriched Text Representation Model for Document Clustering. Proceedings of the 30th Annual ACM Symposium on Applied Computing), in which text information from Wikipedia is used to expand the text to be detected and is applied to text clustering. The drawback of this approach is that it requires external knowledge, offers poor flexibility, and its knowledge is difficult to update in real time to keep pace with rapidly changing internet information. The second category mines implicit semantic information by constructing language models and similar methods, representing the text as an abstract vector. Representative methods include LDA proposed by Blei et al. (Blei, D.M., A.Y. Ng and M.I. Jordan (2003). "Latent Dirichlet Allocation," Journal of Machine Learning Research 3: 993-1022) and paragraph2vec (Le, Q.V. and T. Mikolov (2014). Distributed Representations of Sentences and Documents. Proceedings of the 31st International Conference on Machine Learning). These methods obtain high-level abstract semantic information of the text through modeling, but the original text information is lost, so the high-level abstract content is difficult to interpret intuitively and difficult to combine with other semantic analysis methods.
Disclosure of Invention
In order to solve the problems in the prior art, namely that obtaining the high-level abstract semantic information of a text through modeling loses the original text information, so that the result cannot be interpreted intuitively and is difficult to combine with other semantic analysis methods, the invention provides a text implicit semantic activation method and system.
In order to achieve the purpose, the invention provides the following scheme:
a text implied semantic activation method, the activation method comprising:
acquiring information of a to-be-detected term of a to-be-detected text;
determining an activation coefficient of each reference lexical item in a word list of a text set knowledge base according to the text set knowledge base and information of the lexical item to be detected of the text to be detected; the text set knowledge base comprises a word list formed by a plurality of reference terms, reference word vectors corresponding to the reference terms and reference word frequencies;
selecting corresponding reference terms to form a hidden semantic set of the text to be tested according to each activation coefficient;
and adding the implicit semantic set into the text to be tested for semantic expansion.
Optionally, the activation method further includes:
and training the pre-collected original text set to construct a text set knowledge base.
Optionally, the training of the pre-collected original text set to construct a text set knowledge base specifically includes:
training and preprocessing an original text set, and determining an original lexical item set containing all reference lexical items;
filtering stop words in the original lexical item set to obtain a reference lexical item set;
constructing a word list according to each reference lexical item in the reference lexical item set;
counting the occurrence frequency of each reference word item, and determining the corresponding reference word frequency;
and training the text set through a word vector training tool, and determining the reference word vector corresponding to each reference word item.
Optionally, the acquiring information of the to-be-detected term of the to-be-detected text specifically includes:
carrying out sentence segment division and word segmentation on the text to be detected to obtain a word item to be detected of the text to be detected;
and determining the word vector to be detected and the word frequency to be detected of each term to be detected based on the text set knowledge base.
Optionally, the determining the word vector to be detected and the word frequency to be detected of each term to be detected specifically includes:
searching each reference word vector in the text set knowledge base according to the lexical item to be detected, and determining a word vector to be detected corresponding to each lexical item to be detected;
and determining the word frequency to be detected corresponding to each word to be detected according to each reference word frequency in the text set knowledge base.
Optionally, the determining the word vector to be detected and the word frequency to be detected of each term to be detected further includes:
training the term to be detected through a word vector training tool to obtain an incremental word vector to be detected corresponding to the term to be detected;
and counting the occurrence frequency of each term to be detected in the text to be detected, and determining the term frequency to be detected of each term to be detected by combining each reference term frequency in the text set knowledge base.
Optionally, the activation method further includes:
and adding the dynamically obtained updated word item to be detected, word vector to be detected and word frequency to be detected of the text to be detected into the text set knowledge base for expansion, and updating the word list, reference word vector and reference word frequency of the text set knowledge base.
Optionally, the determining an activation coefficient of each reference term in a vocabulary of the text collection knowledge base specifically includes:
calculating the joint association strength of each reference lexical item in the word list of the text set knowledge base and the text to be tested;
and determining the activation coefficient of each reference word item according to the word frequency of each reference word item and the joint association strength of the reference word item and the text to be tested.
Optionally, the calculating the joint association strength between each reference term in the vocabulary of the text collection knowledge base and the text to be tested specifically includes any one of the following:
respectively calculating the similarity of the word vector corresponding to each reference term and the word vector corresponding to each term to be detected in the text to be detected, calculating the weighted average value of each similarity, and determining the joint association strength;
calculating a weighted average vector of word vectors corresponding to each term to be detected in the text to be detected, calculating the similarity of the weighted average vector and the word vectors corresponding to each reference term, and determining the joint association strength;
carrying out sentence segment division on the text to be detected, calculating division weighted average vectors of word vectors corresponding to division lexical items in each divided set, respectively calculating the similarity of each division weighted average vector and the word vector corresponding to each reference lexical item, calculating the average value of each similarity, and determining the joint association strength;
randomly selecting a plurality of sub-segments in the text to be detected, calculating segment weighted average vectors of word vectors corresponding to segment terms in each sub-segment, respectively calculating the similarity of each segment weighted average vector and the word vectors corresponding to reference terms, calculating the average value of each similarity, and determining the joint association strength.
Optionally, the determining the activation coefficient of each reference term according to the joint association strength of each reference term and the text to be tested specifically includes any one of the following:
the joint association strength of each reference lexical item and the text to be tested is used as an activation coefficient corresponding to each reference lexical item;
and calculating the reference word frequency of each reference word item and the combined association strength weighted sum of the corresponding reference word item and the text to be tested, and determining the activation coefficient of each reference word item.
Optionally, the selecting, according to each activation coefficient, a corresponding reference term to form a hidden semantic set of the text to be tested specifically includes:
calculating the length N_Y of the implicit semantic set based on the text to be detected;
sorting all the activation coefficients, and determining the implicit semantic set according to the sorting result and the length N_Y of the implicit semantic set.
Optionally, the calculating the length N_Y of the implicit semantic set based on the text to be detected specifically comprises:
determining the length N_Y of the implicit semantic set according to the following formula:
N_Y = αN_X
wherein α denotes the activation ratio and N_X denotes the number of terms to be detected in the text to be detected.
Optionally, the determining the implicit semantic set according to the sorting result and the length N_Y of the implicit semantic set specifically comprises any one of the following:
selecting the reference terms corresponding to the top N_Y activation coefficients, ordered from large to small, to form the implicit semantic set;
sequentially selecting the term with the largest activation coefficient and adding it to the implicit semantic set, until the sum of the activation coefficients corresponding to all terms in the implicit semantic set is greater than or equal to N_Y.
In order to achieve the above purpose, the invention also provides the following scheme:
a text implication semantic activation system, the activation system comprising:
the acquisition unit is used for acquiring the information of the terms of the text to be detected;
the determining unit is used for determining the activation coefficient of each reference lexical item in the word list of the text set knowledge base according to the text set knowledge base and the information of the lexical item to be detected of the text to be detected; the text set knowledge base comprises a word list formed by a plurality of reference terms, reference word vectors corresponding to the reference terms and reference word frequencies;
the selection unit is used for selecting the corresponding reference terms to form a hidden semantic set of the text to be tested according to each activation coefficient;
and the expansion unit is used for adding the implicit semantic set into the text to be tested for semantic expansion.
Optionally, the activation system further includes:
and the knowledge base construction unit is used for training the pre-collected original text set to construct a text set knowledge base.
Optionally, the knowledge base building unit includes:
the preprocessing module is used for training and preprocessing the original text set and determining an original lexical item set containing all reference lexical items;
the filtering module is used for filtering stop words in the original lexical item set to obtain a reference lexical item set;
the construction module is used for constructing a word list according to each reference lexical item in the reference lexical item set;
the statistic module is used for counting the occurrence frequency of each reference word and determining the corresponding reference word frequency;
and the training module is used for training the text set through a word vector training tool and determining the reference word vector corresponding to each reference word item.
According to the embodiment of the invention, the invention discloses the following technical effects:
according to the text implicit semantic activation method and system, the activation coefficient of each reference lexical item in the word list of the text set knowledge base can be determined through the text set knowledge base and the information of the to-be-detected lexical item of the to-be-detected text, and the implicit semantic set of the to-be-detected text is determined according to the activation coefficients, so that semantic expansion is performed on the to-be-detected text, the implicit information of the to-be-detected text can be accurately determined, and accuracy is high.
Drawings
FIG. 1 is a flow chart of a text implied semantic activation method of the present invention;
FIG. 2 is a flow chart of determining activation coefficients in the text implicit semantic activation method according to the present invention;
FIG. 3 is a schematic structural diagram of the text implication semantic activation system of the present invention.
Detailed Description
Preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood by those skilled in the art that these embodiments are only for explaining the technical principle of the present invention, and are not intended to limit the scope of the present invention.
The invention aims to provide a text implicit semantic activation method and a text implicit semantic activation system.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
As shown in FIG. 1, the text implicit semantic activation method of the present invention includes:
step 110: and acquiring the information of the vocabulary item to be detected of the text to be detected.
Step 120: and determining the activation coefficient of each reference lexical item in the word list of the text set knowledge base according to the text set knowledge base and the information of the lexical item to be detected of the text to be detected.
The text set knowledge base comprises a word list formed by a plurality of reference terms, reference word vectors corresponding to the reference terms and reference word frequency.
Step 130: and selecting corresponding reference terms to form a hidden semantic set of the text to be tested according to each activation coefficient.
Step 140: and adding the implicit semantic set into the text to be tested for semantic expansion.
In addition, the text implicit semantic activation method further comprises the following steps:
step 100: and training the pre-collected original text set to construct a text set knowledge base.
When the text implicit semantic activation method is used, step 100 only needs to be executed the first time, in order to construct the knowledge base.
The pre-collected original text set comprises at least 1000 texts of the same type. The text type may be long text or short text: a long-text document contains more than 50 words, and a short-text document contains between 2 and 50 words.
In step 100, the training of the pre-collected original text set to construct a text set knowledge base specifically includes:
step 1001: training and preprocessing the original text set, and determining an original lexical item set containing all the reference lexical items.
The preprocessing comprises the steps of cleaning an original text set, segmenting words and dividing a semantic set to obtain all terms.
The "word" represents the smallest semantic unit in the text. For English and other texts taking a space as a separator, a word is a word between two spaces; for the text without space as separator, such as Chinese, the word is obtained after the word segmentation is carried out by the word segmentation tool. Terms represent words that are not repeated.
Step 1002: and filtering stop words in the original lexical item set to obtain a reference lexical item set.
All terms appearing in the text set are counted, optionally removing common stop words, to obtain the text-set vocabulary, which is stored in the following set form:
{ "rain"; "always"; "Down"; "atmosphere"; "not counting"; "harmonious"; … … }.
Step 1003: and constructing a word list according to each reference lexical item in the reference lexical item set.
Step 1004: and counting the occurrence frequency of each reference word item, and determining the corresponding reference word frequency.
For each term in the vocabulary, its frequency of occurrence in the text set is counted and stored in the form of a dictionary, as follows:
{ "rain": 20;
"atmosphere": 8;
"harmonious": 32;
"people": 20;
"pyramid": 12;
…… }.
step 1005: and training the text set through a word vector training tool, and determining the reference word vector corresponding to each reference word item.
A word vector is a 1 × L vectorized representation of a word; each term has a uniquely determined word vector serving as its abstract semantic information. Word vectors are obtained with a word vector training tool such as Word2Vec. Taking Word2Vec as an example, for a text set containing M documents and V terms, the input of Word2Vec is M lists, each being the word sequence of the corresponding document, and the output is V word vectors of size 1 × L. The word vectors are stored in the form of a dictionary as follows (for convenience of presentation in this embodiment, L = 5 is used):
{ "rain": [5.6,9.0,4.1,9.5,4.5 ];
"atmosphere": [3.3,7.7,1.7,7.1,8.9 ];
"harmonious": [6.8,10.0,9.3,8.8,4.0 ];
"people": [5.4,9.4,1.8,6.2,4.2 ];
a "pyramid": [1.8,9.3,4.0,1.5,0.8 ];
……}
to speed up the next search, the dictionaries mentioned here and below are preferably stored using hash structures.
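As an illustration of steps 1001 to 1005, the following is a minimal Python sketch of knowledge-base construction, assuming gensim (version 4 or later) as the word vector training tool; the whitespace tokenizer, the stop-word handling and the field names are illustrative assumptions, not part of the invention:

```python
# Minimal sketch of knowledge-base construction (steps 1001-1005).
# Assumes gensim >= 4.0; the whitespace tokenizer and field names are illustrative only.
from collections import Counter

from gensim.models import Word2Vec


def build_knowledge_base(documents, stop_words, vector_size=5):
    # Step 1001: preprocess each document into a word sequence
    # (for Chinese, replace the split with a word segmentation tool).
    tokenized = [doc.lower().split() for doc in documents]

    # Steps 1002-1004: filter stop words, build the vocabulary and reference word frequencies.
    counts = Counter(w for words in tokenized for w in words if w not in stop_words)
    total = sum(counts.values())
    vocabulary = set(counts)                                  # word list of reference terms
    frequencies = {w: c / total for w, c in counts.items()}   # reference word frequencies

    # Step 1005: train reference word vectors with a word-vector tool (Word2Vec here).
    model = Word2Vec(sentences=tokenized, vector_size=vector_size, min_count=1)
    vectors = {w: model.wv[w] for w in vocabulary if w in model.wv}

    # Dictionaries (hash structures) give the fast lookups recommended above.
    return {"vocabulary": vocabulary, "frequencies": frequencies, "vectors": vectors}
```

The dictionary layout mirrors the vocabulary, word-frequency and word-vector dictionaries shown above; any equivalent hash-based storage would serve the same purpose.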
In step 110, the acquiring information of the term to be detected of the text to be detected specifically includes:
step 1101: and carrying out sentence segment division and word segmentation on the text to be detected to obtain a word item to be detected of the text to be detected. The dividing standard can be divided by sentences or clauses, can also be divided by natural paragraphs, and can also be divided in other forms according to actual use requirements. The words are English words or obtained by segmenting Chinese texts by a segmentation tool.
Step 1102: and determining the word vector to be detected and the word frequency to be detected of each term to be detected based on the text set knowledge base.
In step 1102, the determining the word vector to be detected and the word frequency to be detected of each term to be detected specifically includes:
step 11021: searching each reference word vector in the text set knowledge base according to the lexical item to be detected, and determining a word vector to be detected corresponding to each lexical item to be detected;
step 11022: and determining the word frequency to be detected corresponding to each word to be detected according to each reference word frequency in the text set knowledge base.
In addition, in step 1102, the determining the word vector to be detected and the word frequency to be detected of each term to be detected may further include:
step 11021 a: and training the term to be detected through a word vector training tool to obtain an incremental word vector to be detected corresponding to the term to be detected. By the method, the word vector of the new term can be acquired while the existing word vector training result is utilized, and the utilization efficiency of the knowledge base is greatly improved.
Step 11022 a: and counting the occurrence frequency of each term to be detected in the text to be detected, and determining the updated term frequency to be detected of each term to be detected by combining each reference term frequency in the text set knowledge base.
Based on steps 11021a and 11022a, the text implicit semantic activation method further comprises the following step:
adding the dynamically obtained terms to be detected, word vectors to be detected and word frequencies to be detected of the text to be detected to the text set knowledge base for expansion, and updating the vocabulary, reference word vectors and reference word frequencies of the knowledge base.
As shown in fig. 2, in step 120, the determining an activation coefficient of each reference term in a vocabulary of the knowledge base of text sets specifically includes:
step 1201: calculating the joint association strength of each reference lexical item in the word list of the text set knowledge base and the text to be tested;
step 1202: and determining the activation coefficient of each reference word item according to the word frequency of each reference word item and the joint association strength of the reference word item and the text to be tested.
In step 1201, the calculating the joint association strength between each reference term in the vocabulary of the text collection knowledge base and the text to be tested specifically includes any one of the following:
the first correlation calculation method comprises the following steps: respectively calculating the similarity of the word vector corresponding to each reference term and the word vector corresponding to each term to be detected in the text to be detected, calculating the weighted average value of each similarity, and determining the joint association strength;
and a second correlation calculation method: calculating a weighted average vector of word vectors corresponding to each term to be detected in the text to be detected, calculating the similarity of the weighted average vector and the word vectors corresponding to each reference term, and determining the joint association strength;
and a third correlation calculation method: carrying out sentence segment division on the text to be detected, calculating division weighted average vectors of word vectors corresponding to division lexical items in each divided set, respectively calculating the similarity of each division weighted average vector and the word vector corresponding to each reference lexical item, calculating the average value of each similarity, and determining the joint association strength;
and a correlation calculation method four: randomly selecting a plurality of sub-segments in the text to be detected, calculating segment weighted average vectors of word vectors corresponding to segment terms in each sub-segment, respectively calculating the similarity of each segment weighted average vector and the word vectors corresponding to reference terms, calculating the average value of each similarity, and determining the joint association strength.
The weights of the weighted average may be the occurrence ratios of the words in the text to be detected, or the TF-IDF (Term Frequency - Inverse Document Frequency) values of the terms to be detected. The similarity may be computed as cosine similarity or as a vector inner product, or by calculating the Euclidean or Manhattan distance and taking its reciprocal.
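As a sketch of the first two joint-association-strength calculations, the following assumes cosine similarity and, by default, equal weights (TF-IDF weights could be substituted as noted above); the function names are illustrative:

```python
import numpy as np


def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def strength_method_one(term_vectors, ref_vector, weights=None):
    # First method: weighted average of the similarities between each term to be
    # detected and the reference term.
    vecs = list(term_vectors.values())
    if weights is None:
        weights = [1.0 / len(vecs)] * len(vecs)
    return sum(w * cosine(v, ref_vector) for w, v in zip(weights, vecs))


def strength_method_two(term_vectors, ref_vector, weights=None):
    # Second method: similarity between the weighted average of the text's word
    # vectors and the reference term's word vector.
    vecs = np.array(list(term_vectors.values()), dtype=float)
    if weights is None:
        weights = [1.0 / len(vecs)] * len(vecs)
    return cosine(np.average(vecs, axis=0, weights=weights), ref_vector)
```

With the example word vectors listed below and equal weights, strength_method_one yields roughly 0.86 for "people" and 0.77 for "pyramid", the values used in the worked example.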
The calculation process of the joint association strength is explained by taking the text to be detected "rain straight down" and the reference terms "people" and "pyramid" in the text set knowledge base as examples. Suppose that the word sequence extracted from "rain straight down" and its corresponding word vectors are:
[ "rain": [5.6,9.0,4.1,9.5,4.5 ];
"atmosphere": [7.1,9.9,3.8,2.6,2.1 ];
"harmonious": [9.0,5.9,7.8,5.3,4.8 ];
"eave": [8.5,2.5,2.0,3.1,6.6 ];
"feel": [1.1,9.5,5.8,1.3,1.1]]
The word vectors of "people" and "pyramid" are respectively:
"people": [5.4,9.4,1.8,6.2,4.2]
A "pyramid": [1.8,9.3,4.0,1.5,0.8]
According to the first method, when cosine similarity is adopted and the weights are equal (w_x = 1/5 for each of the 5 terms to be detected), the joint association strength s_Xy of "people" and "rain straight down" is:
s_Xy = Σ_{x∈X} w_x · (v_x · v_y) / (‖v_x‖ · ‖v_y‖) ≈ 0.86    (1)
where X is the set of terms to be detected, v_x is the word vector of term x to be detected, and v_y is the word vector of the reference term.
The joint association strength of "pyramid" and "rain straight down", computed in the same way, is:
s_Xy ≈ 0.77    (2)
in step 1202, the determining the activation coefficient of each reference term according to the joint association strength of each reference term and the text to be tested specifically includes any one of the following:
the activation coefficient calculation method comprises the following steps: the joint association strength of each reference lexical item and the text to be tested is used as an activation coefficient corresponding to each reference lexical item;
and an activation coefficient calculation method II: and calculating the reference word frequency of each reference word item and the combined association strength weighted sum of the corresponding reference word item and the text to be tested, and determining the activation coefficient of each reference word item.
In the second activation coefficient calculation method, the activation coefficient of each reference term is determined according to formula (3):
a_Xy = β·p_y + (1 − β)·s_Xy    (3)
wherein p_y denotes the reference word frequency of reference term y, s_Xy denotes the joint association strength of reference term y and the text X to be detected, and β is a scale coefficient with a value between 0 and 1 that can be adjusted according to the actual situation during use.
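Formula (3) can be sketched in one line of Python, with β as an adjustable parameter:

```python
def activation_coefficient(ref_freq, strength, beta=0.2):
    # Formula (3): a_Xy = beta * p_y + (1 - beta) * s_Xy.
    return beta * ref_freq + (1 - beta) * strength
```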
The calculation process of the activation coefficient is explained by taking the current text "rain straight down" and the existing terms "people" and "pyramid" in the knowledge base as examples.
The word frequencies of "people" and "pyramid" are as follows:
"people": 20
A "pyramid": 12
Assuming that the total number of term occurrences in the text set is 2000, the reference word frequency of "people" is 0.01 and that of "pyramid" is 0.006. According to formula (1) and formula (2), the joint association strengths of "people" and "pyramid" with "rain straight down" are 0.86 and 0.77 respectively. Calculating the activation coefficients with the second method and β = 0.2, the activation coefficient of "people" is:
a_Xy = 0.2 × 0.01 + 0.8 × 0.86 = 0.69;
the activation coefficient of "pyramid" is:
a_Xy = 0.2 × 0.006 + 0.8 × 0.77 ≈ 0.62.
In step 130, the selecting of corresponding reference terms to form the implicit semantic set of the text to be detected according to each activation coefficient specifically comprises:
Step 1301: calculating the length N_Y of the implicit semantic set based on the text to be detected.
Step 1302: sorting all the activation coefficients, and determining the implicit semantic set according to the sorting result and the length N_Y of the implicit semantic set.
In step 1301, the calculating the length N_Y of the implicit semantic set based on the text to be detected specifically comprises:
determining the length N_Y of the implicit semantic set according to the following formula (4):
N_Y = αN_X    (4)
wherein α denotes the activation ratio and N_X denotes the number of terms to be detected in the text to be detected.
In step 1302, the determining the implicit semantic set according to the sorting result and the length N_Y specifically includes any one of the following:
First extraction method: selecting the reference terms corresponding to the top N_Y activation coefficients, ordered from large to small, to form the implicit semantic set; in this case the weight of each term in the implicit semantic set equals 1.
Second extraction method: sequentially selecting the term with the largest activation coefficient and adding it to the implicit semantic set, until the sum of the activation coefficients corresponding to all terms in the implicit semantic set is greater than or equal to N_Y; in this case the weight of each term equals its activation coefficient.
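The two extraction methods above can be sketched as follows (formula (4) and the activation coefficients are assumed to be computed already; rounding N_Y to an integer for the first method is an implementation assumption):

```python
def semantic_set_length(n_x, alpha=0.2):
    # Formula (4): N_Y = alpha * N_X, where N_X is the number of terms to be detected.
    return alpha * n_x


def extract_method_one(coefficients, n_y):
    # First method: top-N_Y terms by activation coefficient, each with weight 1.
    ranked = sorted(coefficients.items(), key=lambda kv: kv[1], reverse=True)
    return {term: 1 for term, _ in ranked[:int(round(n_y))]}


def extract_method_two(coefficients, n_y):
    # Second method: keep adding the largest remaining coefficient until their sum
    # reaches N_Y; each selected term keeps its activation coefficient as its weight.
    ranked = sorted(coefficients.items(), key=lambda kv: kv[1], reverse=True)
    selected, total = {}, 0.0
    for term, a in ranked:
        if total >= n_y:
            break
        selected[term] = a
        total += a
    return selected
```

For the worked example that follows ({"people": 0.69, "pyramid": 0.62} with N_Y = 1), extract_method_one returns {"people": 1} and extract_method_two returns both terms, matching the results given below.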
The extraction process of the implicit semantic set Y is illustrated by taking the current text "rain straight down" and the reference terms "people" and "pyramid" in the text set knowledge base as examples.
Sorted by activation coefficient, the results for "people" and "pyramid" are as follows:
"people": 0.69;
a "pyramid": 0.62;
When the number of words in "rain straight down" is 5 and the activation ratio α is 0.2, N_Y = 0.2 × 5 = 1, and the implicit semantic set extracted according to the first extraction method is { "people": 1 };
according to the second extraction method, the implicit semantic set obtained is: { "people": 0.69; "pyramid": 0.62 }.
The method trains an original text set to obtain a corresponding knowledge base, which contains the vocabulary formed by all reference terms, the reference word vectors obtained by training, the reference word frequencies, and so on. For the text to be detected, the corresponding information is looked up in the knowledge base, including the word vectors of all words in the document and all terms in the vocabulary together with their word vectors and word frequencies. Using the joint activation idea from cognitive psychology, the joint association strength of each reference term in the vocabulary and the text to be detected is computed from the reference word vectors, and the activation coefficient is then computed by combining this strength with the word frequency information. This approach takes into account both the information of the terms themselves and the association between the document and the terms. All terms in the vocabulary are then sorted from high to low by activation coefficient, and the terms with the highest activation coefficients are extracted to form the implicit semantic set of the text to be detected. Finally, the activated implicit semantic set can be taken directly as the implicit semantic analysis result, or added to the original text for semantic expansion and used in subsequent text analysis work such as text classification, text retrieval and topic analysis. In this way, the implicit semantic information of a document is activated using the current text corpus itself, and the result remains interpretable.
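Putting the hypothetical sketches above together, one possible end-to-end use of the method could look as follows; all function names come from the earlier sketches rather than from the patent itself, and the documents and stop words are placeholders:

```python
# Hypothetical end-to-end run using the sketches above.
documents = [
    "the rain keeps falling and the atmosphere is not harmonious",
    "people feel the rain under the eaves",
]
kb = build_knowledge_base(documents, stop_words={"the", "is", "and"})

text = "the rain keeps falling"
terms, vectors, _ = get_term_info(text, kb)

coefficients = {}
for ref_term, ref_vec in kb["vectors"].items():
    s = strength_method_one(vectors, ref_vec)                # joint association strength
    coefficients[ref_term] = activation_coefficient(kb["frequencies"][ref_term], s)

n_y = semantic_set_length(len(terms), alpha=0.2)
implicit_set = extract_method_one(coefficients, n_y)         # implicit semantic set
expanded_text = terms + list(implicit_set)                   # semantic expansion of the text
```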
In addition, the present invention also provides a text implication semantic activation system, as shown in fig. 3, the text implication semantic activation system of the present invention includes: the knowledge base building unit 200, the obtaining unit 210, the determining unit 220, the selecting unit 230 and the expanding unit 240.
The knowledge base construction unit 200 is configured to train a pre-collected original text set, and construct a text set knowledge base; the obtaining unit 210 is configured to obtain term information of a text to be detected; the determining unit 220 is configured to determine, according to a text set knowledge base and information of a to-be-detected term of the to-be-detected text, an activation coefficient of each reference term in a vocabulary of the text set knowledge base; the text set knowledge base comprises a word list formed by a plurality of reference terms, reference word vectors corresponding to the reference terms and reference word frequencies; the selecting unit 230 is configured to select, according to each activation coefficient, a corresponding reference term to form a hidden semantic collection of the text to be tested; the expansion unit 240 is configured to add the implicit semantic set to the text to be tested for semantic expansion.
The knowledge base construction unit 200 includes a preprocessing module, a filtering module, a construction module, a statistical module, and a training module.
The preprocessing module is used for training and preprocessing an original text set and determining an original lexical item set containing all reference lexical items; the filtering module is used for filtering stop words in the original lexical item set to obtain a reference lexical item set; the construction module is used for constructing a word list according to each reference lexical item in the reference lexical item set; the statistic module is used for counting the occurrence frequency of each reference word and determining the corresponding reference word frequency; the training module is used for training the text set through a word vector training tool and determining the reference word vector corresponding to each reference word item.
The acquiring unit 210 acquires the information on the terms to be detected of the text to be detected, which specifically includes: carrying out sentence-segment division and word segmentation on the text to be detected to obtain the terms to be detected of the text to be detected; and determining the word vector to be detected and the word frequency to be detected of each term to be detected based on the text set knowledge base.
The determining of the word vector to be detected and the word frequency to be detected of each term to be detected specifically includes any one of the following:
searching each reference word vector in the text set knowledge base according to the lexical item to be detected, and determining a word vector to be detected corresponding to each lexical item to be detected; determining the word frequency to be detected corresponding to each word to be detected according to each reference word frequency in the text set knowledge base;
training the term to be detected through a word vector training tool to obtain an incremental word vector to be detected corresponding to the term to be detected; and counting the occurrence frequency of each term to be detected in the text to be detected, and determining the term frequency to be detected of each term to be detected by combining each reference term frequency in the text set knowledge base.
Based on the incremental word vector to be detected and the word frequency to be detected, the text implicit semantic activation system further comprises an updating unit, which is used for adding the updated terms to be detected, word vectors to be detected and word frequencies to be detected of the text to be detected to the text set knowledge base for expansion, and for updating the vocabulary, reference word vectors and reference word frequencies of the knowledge base.
The determining unit 220 includes a calculating subunit, configured to calculate the joint association strength between each reference term in the vocabulary of the text set knowledge base and the text to be detected, and a determining subunit, configured to determine the activation coefficient of each reference term according to the joint association strength of each reference term and the text to be detected.
The calculation subunit includes any one of a first calculation module, a second calculation module, a third calculation module and a fourth calculation module:
the first calculation module is used for calculating the similarity between the word vector corresponding to each reference term and the word vector corresponding to each term to be detected in the text to be detected, calculating the weighted average value of the similarities and determining the joint association strength;
the second calculation module is used for calculating weighted average vectors of word vectors corresponding to all terms to be detected in the text to be detected, calculating the similarity between the weighted average vectors and the word vectors corresponding to all reference terms, and determining the joint association strength;
the third calculation module is used for dividing the text to be detected into sentence segments, calculating the division weighted average vector of the word vectors corresponding to the division terms in each divided set, calculating the similarity of each division weighted average vector with the word vector corresponding to each reference term, calculating the average value of the similarities, and determining the joint association strength;
the fourth calculation module is used for randomly selecting a plurality of sub-segments in the text to be detected, calculating segment weighted average vectors of word vectors corresponding to segment terms in each sub-segment, calculating similarity of each segment weighted average vector and the word vectors corresponding to reference terms respectively, calculating an average value of each similarity, and determining the joint association strength.
The determining subunit includes a first determining module or a second determining module:
the first determining module is used for determining the joint association strength of each reference word and the text to be tested as the activation coefficient corresponding to each reference word;
the second determining module is used for calculating the reference word frequency of each reference word and the weighted sum of the joint association strength of the corresponding reference word and the text to be tested, and determining the activation coefficient of each reference word.
The selecting unit 230 includes: a length calculation module, for calculating the length N_Y of the implicit semantic set based on the text to be detected; and a selecting module, for sorting the activation coefficients and determining the implicit semantic set according to the sorting result and the length N_Y.
The length calculation module calculates the length N_Y of the implicit semantic set according to the following formula (4):
N_Y = αN_X    (4)
wherein α denotes the activation ratio and N_X denotes the number of terms to be detected in the text to be detected.
The selecting module includes a first extraction submodule or a second extraction submodule. The first extraction submodule is configured to select the reference terms corresponding to the top N_Y activation coefficients, ordered from large to small, to form the implicit semantic set. The second extraction submodule is configured to sequentially select the term with the largest activation coefficient and add it to the implicit semantic set, until the sum of the activation coefficients corresponding to all terms in the implicit semantic set is greater than or equal to N_Y.
Compared with the prior art, the text implied semantic activation system has the same beneficial effects as the text implied semantic activation method, and is not repeated herein.
So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.

Claims (15)

1. A text implied semantic activation method, characterized in that the activation method comprises:
acquiring information of a to-be-detected term of a to-be-detected text;
determining an activation coefficient of each reference lexical item in a word list of a text set knowledge base according to the text set knowledge base and information of the lexical item to be detected of the text to be detected; the text set knowledge base comprises a word list formed by a plurality of reference terms, reference word vectors corresponding to the reference terms and reference word frequencies;
selecting corresponding reference terms to form a hidden semantic set of the text to be tested according to each activation coefficient;
adding the implicit semantic set into the text to be tested for semantic expansion;
the acquiring of the information of the to-be-detected term of the to-be-detected text specifically includes:
carrying out sentence segment division and word segmentation on the text to be detected to obtain a word item to be detected of the text to be detected;
and determining the word vector to be detected and the word frequency to be detected of each term to be detected based on the text set knowledge base.
2. The text implication semantic activation method of claim 1, wherein the activation method further comprises:
and training the pre-collected original text set to construct a text set knowledge base.
3. The text implication semantic activation method according to claim 2, wherein the training of the pre-collected original text set to construct a text set knowledge base specifically comprises:
training and preprocessing an original text set, and determining an original lexical item set containing all reference lexical items;
filtering stop words in the original lexical item set to obtain a reference lexical item set;
constructing a word list according to each reference lexical item in the reference lexical item set;
counting the occurrence frequency of each reference word item, and determining the corresponding reference word frequency;
and training the text set through a word vector training tool, and determining the reference word vector corresponding to each reference word item.
4. The method for activating implicit semantics of text according to claim 1, wherein the determining a word vector to be detected and a word frequency to be detected of each term to be detected specifically comprises:
searching each reference word vector in the text set knowledge base according to the lexical item to be detected, and determining a word vector to be detected corresponding to each lexical item to be detected;
and determining the word frequency to be detected corresponding to each word to be detected according to each reference word frequency in the text set knowledge base.
5. The method for activating implicit semantics of text according to claim 1, wherein the determining a word vector to be tested and a word frequency to be tested for each term to be tested further comprises:
training the term to be detected through a word vector training tool to obtain an incremental word vector to be detected corresponding to the term to be detected;
and counting the occurrence frequency of each term to be detected in the text to be detected, and determining the term frequency to be detected of each term to be detected by combining each reference term frequency in the text set knowledge base.
6. The text implication semantic activation method of claim 5, wherein the activation method further comprises:
and adding the dynamically obtained updated word item to be detected, word vector to be detected and word frequency to be detected of the text to be detected into the text set knowledge base for expansion, and updating the word list, reference word vector and reference word frequency of the text set knowledge base.
7. The method according to claim 1, wherein the determining the activation coefficient of each reference term in the vocabulary of the knowledge base of text sets specifically comprises:
calculating the joint association strength of each reference lexical item in the word list of the text set knowledge base and the text to be tested;
and determining the activation coefficient of each reference word item according to the word frequency of each reference word item and the joint association strength of the reference word item and the text to be tested.
8. The method for activating text implication semantics according to claim 7, wherein the calculating of the joint association strength of each reference term in the vocabulary of the text set knowledge base and the text to be tested specifically includes any one of:
respectively calculating the similarity of the word vector corresponding to each reference term and the word vector corresponding to each term to be detected in the text to be detected, calculating the weighted average value of each similarity, and determining the joint association strength;
calculating a weighted average vector of word vectors corresponding to each term to be detected in the text to be detected, calculating the similarity of the weighted average vector and the word vectors corresponding to each reference term, and determining the joint association strength;
carrying out sentence segment division on the text to be detected, calculating division weighted average vectors of word vectors corresponding to division lexical items in each divided set, respectively calculating the similarity of each division weighted average vector and the word vector corresponding to each reference lexical item, calculating the average value of each similarity, and determining the joint association strength;
randomly selecting a plurality of sub-segments in the text to be detected, calculating segment weighted average vectors of word vectors corresponding to segment terms in each sub-segment, respectively calculating the similarity of each segment weighted average vector and the word vectors corresponding to reference terms, calculating the average value of each similarity, and determining the joint association strength.
9. The text implication semantic activation method according to claim 7, wherein the determining of the activation coefficient of each reference term according to the joint association strength of each reference term and the text to be tested specifically includes any one of:
the joint association strength of each reference lexical item and the text to be tested is used as an activation coefficient corresponding to each reference lexical item;
and calculating the reference word frequency of each reference word item and the combined association strength weighted sum of the corresponding reference word item and the text to be tested, and determining the activation coefficient of each reference word item.
10. The text implied semantic activation method according to claim 1, characterized in that selecting corresponding reference terms to form an implied semantic set of the text to be tested according to each activation coefficient specifically comprises:
calculating the length N_Y of the implicit semantic set based on the text to be detected;
sorting all the activation coefficients, and determining the implicit semantic set according to the sorting result and the length N_Y of the implicit semantic set.
11. The text implicit semantic activation method according to claim 10, wherein the calculating the length N_Y of the implicit semantic set based on the text to be detected specifically comprises:
determining the length N_Y of the implicit semantic set according to the following formula:
N_Y = αN_X
wherein α denotes the activation ratio and N_X denotes the number of terms to be detected in the text to be detected.
12. The text implicit semantic activation method according to claim 10, wherein the determining the implicit semantic set according to the sorting result and the length N_Y of the implicit semantic set specifically comprises any one of the following:
selecting the reference terms corresponding to the top N_Y activation coefficients, ordered from large to small, to form the implicit semantic set;
sequentially selecting the term with the largest activation coefficient and adding it to the implicit semantic set, until the sum of the activation coefficients corresponding to all terms in the implicit semantic set is greater than or equal to N_Y.
13. A text implication semantic activation system, the activation system comprising:
the acquisition unit is used for acquiring the information of the terms of the text to be detected;
the determining unit is used for determining the activation coefficient of each reference lexical item in the word list of the text set knowledge base according to the text set knowledge base and the information of the lexical item to be detected of the text to be detected; the text set knowledge base comprises a word list formed by a plurality of reference terms, reference word vectors corresponding to the reference terms and reference word frequencies;
the selection unit is used for selecting the corresponding reference terms to form a hidden semantic set of the text to be tested according to each activation coefficient;
the extension unit is used for adding the implicit semantic set into the text to be tested for semantic extension;
the acquiring unit acquires the information of the to-be-detected term of the to-be-detected text, and specifically includes: carrying out sentence segment division and word segmentation on the text to be detected to obtain a word item to be detected of the text to be detected;
and determining the word vector to be detected and the word frequency to be detected of each term to be detected based on the text set knowledge base.
14. The text implication semantic activation system of claim 13, wherein the activation system further comprises:
and the knowledge base construction unit is used for training the pre-collected original text set to construct a text set knowledge base.
15. The text implication semantic activation system of claim 14, wherein the knowledge base construction unit comprises:
the preprocessing module is used for training and preprocessing the original text set and determining an original lexical item set containing all reference lexical items;
the filtering module is used for filtering stop words in the original lexical item set to obtain a reference lexical item set;
the construction module is used for constructing a word list according to each reference lexical item in the reference lexical item set;
the statistic module is used for counting the occurrence frequency of each reference word and determining the corresponding reference word frequency;
and the training module is used for training the text set through a word vector training tool and determining the reference word vector corresponding to each reference word item.
CN201710565733.3A 2017-07-12 2017-07-12 Text implicit semantic activation method and system Active CN107577656B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710565733.3A CN107577656B (en) 2017-07-12 2017-07-12 Text implicit semantic activation method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710565733.3A CN107577656B (en) 2017-07-12 2017-07-12 Text implicit semantic activation method and system

Publications (2)

Publication Number Publication Date
CN107577656A CN107577656A (en) 2018-01-12
CN107577656B true CN107577656B (en) 2020-02-14

Family

ID=61049103

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710565733.3A Active CN107577656B (en) 2017-07-12 2017-07-12 Text implicit semantic activation method and system

Country Status (1)

Country Link
CN (1) CN107577656B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH11250072A (en) * 1998-02-26 1999-09-17 Nippon Telegr & Teleph Corp <Ntt> Information sorting method, device therefor and storage medium stored with information sorting program
CN1706173A (en) * 2002-10-16 2005-12-07 皇家飞利浦电子股份有限公司 Directory assistant method and apparatus
CN101833561A (en) * 2010-02-12 2010-09-15 西安电子科技大学 Natural language processing oriented Web service intelligent agent
CN104408033A (en) * 2014-11-25 2015-03-11 中国人民解放军国防科学技术大学 Text message extracting method and system
CN104965819A (en) * 2015-07-12 2015-10-07 大连理工大学 Biomedical event trigger word identification method based on syntactic word vector
CN106557476A (en) * 2015-09-24 2017-04-05 北京奇虎科技有限公司 The acquisition methods and device of relevant information


Also Published As

Publication number Publication date
CN107577656A (en) 2018-01-12

Similar Documents

Publication Publication Date Title
CN106055538B (en) The automatic abstracting method of the text label that topic model and semantic analysis combine
CN106997382B (en) Innovative creative tag automatic labeling method and system based on big data
CN107451126B (en) Method and system for screening similar meaning words
CN1701323B (en) Digital ink database searching using handwriting feature synthesis
CN108681574B (en) Text abstract-based non-fact question-answer selection method and system
CN110020189A (en) A kind of article recommended method based on Chinese Similarity measures
CN104391942A (en) Short text characteristic expanding method based on semantic atlas
CN108920482B (en) Microblog short text classification method based on lexical chain feature extension and LDA (latent Dirichlet Allocation) model
CN110362678A (en) A kind of method and apparatus automatically extracting Chinese text keyword
CN107423282A (en) Semantic Coherence Sexual Themes and the concurrent extracting method of term vector in text based on composite character
CN104199972A (en) Named entity relation extraction and construction method based on deep learning
CN110879834B (en) Viewpoint retrieval system based on cyclic convolution network and viewpoint retrieval method thereof
CN102081642A (en) Chinese label extraction method for clustering search results of search engine
CN103038764A (en) Method for keyword extraction
CN104899188A (en) Problem similarity calculation method based on subjects and focuses of problems
CN108710611A (en) A kind of short text topic model generation method of word-based network and term vector
CN105551485A (en) Audio file retrieval method and system
JP4534666B2 (en) Text sentence search device and text sentence search program
CN110674378A (en) Chinese semantic recognition method based on cosine similarity and minimum editing distance
CN112148886A (en) Method and system for constructing content knowledge graph
CN114491062B (en) Short text classification method integrating knowledge graph and topic model
Indhuja et al. Text based language identification system for indian languages following devanagiri script
CN110929022A (en) Text abstract generation method and system
CN107577656B (en) Text implicit semantic activation method and system
CN110413985B (en) Related text segment searching method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant