CN109101485A

CN109101485A - Information processing method, apparatus, electronic device, and computer storage medium

Info

Publication number: CN109101485A
Application number: CN201810745000.2A
Authority: CN
Inventors: 杜若; 覃勋辉; 向海; 侯聪; 刘科
Original assignee: Chongqing Yuzhi Technology Co Ltd
Current assignee: Hubei Central China Technology Development Of Electric Power Co ltd
Priority date: 2018-07-09
Filing date: 2018-07-09
Publication date: 2018-12-28
Anticipated expiration: 2038-07-09
Also published as: CN109101485B

Abstract

An embodiment of the invention discloses an information processing method, an apparatus, an electronic device, and a computer storage medium. Term frequency-inverse document frequency (TF-IDF) can be used to assess how important a word is to a given file. Current practice treats each word as an independent element, so text classification and information retrieval based on the resulting TF-IDF values have low accuracy. The embodiment obtains, for each text word in a piece of text information, the synonyms of that word within the text information; obtains a first synonym set for the text word from the word and its synonyms; obtains a second synonym set for the text information based on the first synonym set; and finally calculates the TF-IDF value of the second synonym set. Because the synonym relationships between text words in the text information are taken into account, the accuracy of text classification or information retrieval based on this TF-IDF value can be improved.

Description

Information processing method, apparatus, electronic device, and computer storage medium

Technical field

The present invention relates to the field of Internet technologies, and in particular to an information processing method, an apparatus, an electronic device, and a computer storage medium.

Background technique

Term frequency-inverse document frequency (TF-IDF) is a weighting technique used in text classification and information retrieval. TF-IDF can be used to assess how important a word is to one file in a file set or corpus. The importance of a word increases in proportion to the number of times it occurs in the file, but decreases in inverse proportion to the frequency with which it occurs across the corpus.

In current practice, each word is treated as an independent element and only its own TF-IDF value is calculated, so text classification and information retrieval based on TF-IDF values obtained this way have low accuracy.

Summary of the invention

Embodiments of the invention disclose an information processing method, an apparatus, an electronic device, and a computer storage medium that can obtain the TF-IDF value of a synonym set contained in text information, which in turn helps improve the accuracy of text classification and information retrieval.

In a first aspect, an embodiment of the invention discloses an information processing method. The method may include: receiving an information processing request, where the information processing request includes multiple pieces of text information and each piece of text information includes at least one text word; obtaining, according to the text words included in the multiple pieces of text information, a first synonym set for each text word, where the first synonym set includes the text word and at least one synonym of the text word; for each piece of text information, determining first coefficients of the text information, where each first coefficient corresponds to a second synonym set that contains a text word of the text information, the first synonym set includes the second synonym set, and the first coefficient is used to establish a linear representation relationship between the second synonym set and the text information; and obtaining the TF-IDF value of the second synonym set according to the first coefficients of the text information.

In one implementation, the information processing request further includes a target number of target synonym sets. After the TF-IDF value of the second synonym set is obtained according to the first coefficients of the text information, the method may further include: determining, from the second synonym sets, the target synonym sets that have the largest TF-IDF values and whose count matches the target number.

In one implementation, determining the first coefficients of the text information may be done as follows: obtain the word frequency of each text word in the text information, where the word frequency of a text word is used to establish a linear representation relationship between the text word and the text information; obtain the second synonym sets that contain each text word; and, for each second synonym set, obtain the first coefficient of the second synonym set according to the second coefficient of the second synonym set for each text word and the word frequency of that text word.
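The passage above does not spell out how the second coefficients and the word frequencies are combined. One reading that is consistent with the two linear representation relationships is to substitute one into the other, so that the first coefficient of a set is the sum, over text words, of word frequency times second coefficient. The Python sketch below illustrates that reading only; the function name, the dictionary layout, the set labels, and the numbers are all hypothetical.

```python
def first_coefficients(word_freq, second_coeff):
    """Combine word frequencies and second coefficients by linear substitution.

    word_freq:    text word -> word frequency in one piece of text information
    second_coeff: (set label, text word) -> second coefficient of that set for the word
    Returns:      set label -> first coefficient (assumed to be a weighted sum)
    """
    coeffs = {}
    for (set_label, word), coeff in second_coeff.items():
        # Assumption: each word contributes frequency * second coefficient.
        coeffs[set_label] = coeffs.get(set_label, 0.0) + coeff * word_freq.get(word, 0)
    return coeffs


# Hypothetical data for one piece of text information.
word_freq = {"w1": 2, "w2": 1, "w3": 1}
second_coeff = {("S1", "w1"): 1.0, ("S1", "w2"): 1.0, ("S2", "w3"): 1.0}
s = first_coefficients(word_freq, second_coeff)
```

With these inputs the text information would be represented as s["S1"] * S1 + s["S2"] * S2.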

In one implementation, before the first coefficient of the second synonym set is obtained according to the second coefficient of the second synonym set for each text word and the word frequency of the text word, the method may further include: for each text word in the text information, determining a first vector of the text word; obtaining, according to the first vector of the text word, a second vector of a third synonym set that contains the text word, where the second synonym set includes the third synonym set; obtaining the cosine similarity between the text word and the third synonym set according to the first vector of the text word and the second vector of the third synonym set; and obtaining, according to the cosine similarity, the second coefficient of the third synonym set for the text word.
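As a minimal illustration of the cosine-similarity step, the following Python sketch compares a word vector with a set vector. How the second vector of a synonym set is built is not stated in this passage, so averaging the member word vectors is purely an assumption made here for the example; the function names and the vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Assumed stand-in for the set vector: element-wise mean of member vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


# Hypothetical first vectors of two text words in the same synonym set.
word_a = [1.0, 0.0]
word_b = [0.0, 1.0]
set_vector = mean_vector([word_a, word_b])
similarity = cosine_similarity(word_a, set_vector)
```

The resulting similarity could then feed into the second coefficient; the exact mapping from similarity to coefficient is left open here.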

In one implementation, the information processing request further includes the number of all pieces of text information. Obtaining the TF-IDF value of the second synonym set according to the first coefficients of the text information may be done as follows: sum all first coefficients of the text information to obtain a first value; divide the first coefficient corresponding to the second synonym set by the first value to obtain a second value; sum the first coefficients of the second synonym set over every piece of text information to obtain a third value; take the logarithm of the number of all pieces of text information included in the information processing request divided by the third value to obtain a fourth value; and multiply the second value by the fourth value to obtain the TF-IDF value of the second synonym set.
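The five values described above can be traced in a short Python sketch. The data layout (one dictionary of first coefficients per piece of text information), the function name, and the use of the natural logarithm are assumptions; the text does not fix the logarithm base.

```python
import math


def synonym_set_tfidf(first_coeffs, doc_index, set_label):
    """TF-IDF of one synonym set for one piece of text information.

    first_coeffs: list with one dict per piece of text information,
                  mapping set label -> first coefficient.
    """
    doc = first_coeffs[doc_index]
    first_value = sum(doc.values())                      # all first coefficients of this text
    second_value = doc[set_label] / first_value          # word frequency of the set
    third_value = sum(d.get(set_label, 0.0) for d in first_coeffs)  # set's coefficients over all texts
    fourth_value = math.log(len(first_coeffs) / third_value)        # inverse document frequency
    return second_value * fourth_value


# Hypothetical first coefficients for four pieces of text information.
coeffs = [
    {"A": 1.0, "B": 1.0},
    {"B": 1.0},
    {"B": 2.0},
    {"C": 1.0},
]
tfidf_a = synonym_set_tfidf(coeffs, 0, "A")
tfidf_b = synonym_set_tfidf(coeffs, 0, "B")
```

In this toy data, set "A" occurs in only one text and gets a positive score, while set "B" accumulates a third value equal to the number of texts, so its score is zero.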

In one implementation, obtaining the first synonym set for each text word according to the text words included in the multiple pieces of text information may be done as follows: perform word segmentation on the multiple pieces of text information to obtain a text word set that includes at least one text word; look up the synonyms of each text word in a preset synonym database to obtain a fourth synonym set for each text word, where the fourth synonym set includes the text word and the synonyms found for it; and obtain the first synonym set according to the fourth synonym sets.

In one implementation, obtaining the first synonym set according to the fourth synonym sets may be done as follows: determine a target fourth synonym set, that is, a fourth synonym set whose text word and all of that word's synonyms are present in another fourth synonym set, where the other fourth synonym sets are the fourth synonym sets for the text words other than the target fourth synonym set; and determine the other fourth synonym sets as the first synonym sets.

In a second aspect, an embodiment of the invention discloses an information processing apparatus that includes units for performing the method of the first aspect.

In a third aspect, an embodiment of the invention discloses an electronic device that includes a memory and a processor. The memory is used to store a computer program, the computer program includes program instructions, and the processor is configured to call the program instructions to perform the method of the first aspect.

In a fourth aspect, an embodiment of the invention discloses a computer storage medium that stores a computer program. The computer program includes program instructions that, when executed by a processor, cause the processor to perform the method of the first aspect.

By implementing the embodiments of the invention, the second synonym sets and first coefficients corresponding to each piece of text information can be obtained, and the TF-IDF values of those second synonym sets can be obtained from the first coefficients. Because these TF-IDF values take into account the synonym relationships between text words in the text information, the accuracy of text classification or information retrieval based on them can be improved.

Brief description of the drawings

To explain the technical solutions in the embodiments of the invention or in the prior art more clearly, the drawings needed to describe the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a flow diagram of an information processing method provided by an embodiment of the invention;

Fig. 2 is a flow diagram of another information processing method provided by an embodiment of the invention;

Fig. 3 is a structural diagram of an information processing apparatus provided by an embodiment of the invention;

Fig. 4 is a structural diagram of an electronic device provided by an embodiment of the invention.

Specific embodiment

The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the invention.

The main principle of the technical solution of this application may be summarized as follows: obtain, for each text word in a piece of text information, the synonyms of that word within the text information; obtain a first synonym set for the text word from the word and its synonyms; obtain the second synonym sets corresponding to the text information based on the first synonym set; and finally calculate the TF-IDF value of each second synonym set. Because this TF-IDF value takes into account the synonym relationships between text words in the text information, the accuracy of text classification or information retrieval based on it can be improved.

Refer to Fig. 1, which is a flow diagram of an information processing method provided by an embodiment of the invention. As shown in Fig. 1, the information processing method of this embodiment may include, but is not limited to, the following steps:

S101, the electronic device receives an information processing request that includes multiple pieces of text information.

Specifically, on receiving the information processing request, the electronic device can extract the multiple pieces of text information that the request includes. In one implementation, the information processing request may be sent by a terminal device, or may be generated automatically by the electronic device when it detects an information processing event. The information processing event may be triggered when a user clicks a confirm-processing button in an information processing interface displayed by the electronic device. The electronic device may be a terminal device or a server. The terminal device may be a smartphone, a tablet computer, a personal computer (PC), a smart television, a smartwatch, a vehicle-mounted device, a wearable device, a terminal device in a future fifth-generation (5G) mobile communication network, and so on.

Text information may be a single sentence, a combination of several sentences, a paragraph, or a chapter; the embodiments of the invention do not limit this. Each piece of text information includes at least one text word. A text word may be a single word in the segmentation result obtained by applying a word segmentation algorithm to the text information. For example, when the text information is "glad and happy are synonyms", a text word of the text information may be any one of "glad", "and", "happy", "are", and "synonyms". In one implementation, a text word may instead be a single word in the target segmentation result obtained by applying a word segmentation algorithm to the text information and then removing the stop words from the segmentation result. For example, when the text information is "glad and happy are synonyms", a text word may be any one of "glad", "happy", and "synonyms".

Word segmentation algorithms include, but are not limited to, algorithms based on string matching (such as forward maximum matching, reverse maximum matching, minimum segmentation, and bidirectional maximum matching), algorithms based on understanding, and algorithms based on statistics; the embodiments of the invention do not limit this. Stop words are characters or words that are filtered out automatically before or after natural language data (or text) is processed in information retrieval, in order to save storage space and improve search efficiency. Stop words generally fall into two classes. The first class consists of words that are used very widely, even excessively often, such as "I" and "if" in Chinese. The second class consists of words that occur very frequently in text but carry little practical meaning, including modal particles, adverbs, prepositions, and conjunctions, such as "and".
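To make the segmentation step concrete, here is a minimal Python sketch of forward maximum matching followed by stop-word removal. It is an illustration only: the dictionary, the stop-word list, and the function names are hypothetical, and the example sentence is written without spaces to imitate text that needs segmentation.

```python
def forward_max_match(text, vocabulary):
    """Greedy left-to-right segmentation: always take the longest dictionary word."""
    max_len = max(len(word) for word in vocabulary)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocabulary:
                tokens.append(piece)
                i += size
                break
    return tokens


def remove_stop_words(tokens, stop_words):
    """Drop tokens that appear in the stop-word list."""
    return [token for token in tokens if token not in stop_words]


VOCABULARY = {"glad", "and", "happy", "are", "synonyms"}  # hypothetical dictionary
STOP_WORDS = {"and", "are"}                               # hypothetical stop-word list

tokens = forward_max_match("gladandhappyaresynonyms", VOCABULARY)
text_words = remove_stop_words(tokens, STOP_WORDS)
```

Here `tokens` holds the raw segmentation result and `text_words` holds the target segmentation result after stop-word removal.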

S102, the electronic device obtains a first synonym set for each text word according to the text words included in the multiple pieces of text information.

The first synonym set includes the text word and at least one synonym of the text word. In other words, the text word and all of its synonyms within the text information it belongs to are present in the first synonym set. For example, when the text information is "glad and happy are synonyms" and the text word is "glad", the synonym of "glad" within the text information is "happy", and the first synonym set for the text word "glad" may be {"glad", "happy"}.

In one implementation, the multiple pieces of text information may correspond to multiple first synonym sets, and each text word in the text information corresponds to one first synonym set.

In one implementation, different text words in the same piece of text information may correspond to the same first synonym set or to different ones. In one implementation, the same text word in different pieces of text information corresponds to the same first synonym set, while different text words in different pieces of text information may correspond to the same first synonym set or to different ones. For example, when there are two pieces of text information, text information 1 being "glad and happy are synonyms" and text information 2 being "glad and cheerful are synonyms", the text information may correspond to two first synonym sets: {"glad", "happy", "cheerful"} and {"synonyms"}. In this case, the first synonym set corresponding to the text words "glad" and "happy" in text information 1 and to the text words "glad" and "cheerful" in text information 2 is {"glad", "happy", "cheerful"}.

In one implementation, the first synonym set for a text word may include the text word and all of its synonyms within the text information it belongs to, and may also include text words from other text information. For example, the first synonym set {"glad", "happy", "cheerful"} above includes text words from both text information 1 and text information 2. In one implementation, all words in the first synonym set for a text word are present in the text information the text word belongs to. For example, when there are two pieces of text information, text information 1 being "glad and happy are synonyms" and text information 2 being "today is Monday", the text information may correspond to four first synonym sets: {"glad", "happy"}, {"synonyms"}, {"today"}, and {"Monday"}. In this case, the first synonym set of the text word "glad" in text information 1 is {"glad", "happy"}, and all words in that set are present in text information 1, to which "glad" belongs.

In one implementation, the multiple pieces of text information may correspond to a single first synonym set. For example, the first synonym set corresponding to text information 1 and text information 2 above may be {"glad", "happy", "synonyms", "today", "Monday"}.

S103, for each piece of text information, the electronic device determines the first coefficients of the text information, where each first coefficient corresponds to a second synonym set that contains a text word of the text information.

The first coefficient is used to establish the linear representation relationship between the second synonym set and the text information, and the first synonym set may include the second synonym set.

In one implementation, each text word in each piece of text information may correspond to one second synonym set, and different text words in the same piece of text information may correspond to the same second synonym set or to different ones. For example, when there are two pieces of text information, text information 1 being "glad and happy are synonyms" and text information 2 being "today is Monday", the text information may correspond to one first synonym set: {"glad", "happy", "synonyms", "today", "Monday"}. The text words "glad" and "happy" in text information 1 may correspond to the same second synonym set, {"glad", "happy"}, and the text word "synonyms" in text information 1 corresponds to the second synonym set {"synonyms"}.

In one implementation, text information can be represented by second synonym sets. For example, text information 1 can be represented by the second synonym sets {"glad", "happy"} and {"synonyms"}. Specifically, each second synonym set used to represent the text information corresponds to one first coefficient, and the first coefficients establish the linear representation relationship between the second synonym sets and the text information. For example, when the first coefficients corresponding to the second synonym sets {"glad", "happy"} and {"synonyms"} are s1 and s2 respectively, the linear representation relationship may be: text information 1 = s1 * {"glad", "happy"} + s2 * {"synonyms"}. In one implementation, the number of second synonym sets corresponding to a piece of text information equals the number of first coefficients corresponding to that text information, so that the first coefficients can establish the linear representation relationship between the second synonym sets and the text information. In one implementation, each second synonym set corresponding to each piece of text information may correspond to one first coefficient.

S104, the electronic device obtains the TF-IDF value of the second synonym set according to the first coefficients of the text information.

Specifically, the electronic device can obtain the word frequency and the inverse document frequency of the second synonym set according to the first coefficients of the text information, and then obtain the TF-IDF value of the second synonym set from that word frequency and inverse document frequency.

The word frequency of the second synonym set can be obtained from all first coefficients corresponding to the text information (that is, the text information to which the second synonym set corresponds). For example, if the linear representation relationship between the second synonym sets and text information 1 is text information 1 = s1 * {"glad", "happy"} + s2 * {"synonyms"}, then the word frequency of a second synonym set can be obtained from s1 and s2.

In one implementation, a second synonym set may correspond to one or more pieces of text information; that is, the second synonym set can be used in the linear representation of each piece of text information it corresponds to. If the second synonym set corresponds to multiple pieces of text information, its inverse document frequency can be obtained from the first coefficients it has in each corresponding piece of text information. For example, suppose the second synonym set {"glad", "happy", "cheerful"} corresponds to text information 1 ("glad and happy are synonyms") and text information 2 ("glad and cheerful are synonyms"), the linear representation relationship for text information 1 is text information 1 = s1 * {"glad", "happy", "cheerful"} + s2 * {"synonyms"}, and the linear representation relationship for text information 2 is text information 2 = s1' * {"glad", "happy", "cheerful"} + s2' * {"synonyms"}. Then the inverse document frequency of this second synonym set can be obtained from s1 and s1'.

Refer to Fig. 2, which is a flow diagram of another information processing method provided by an embodiment of the invention. As shown in Fig. 2, this information processing method may include, but is not limited to, the following steps:

S201, the electronic device receives an information processing request that includes multiple pieces of text information.

It should be noted that, for the execution of step S201, refer to the description of step S101 in Fig. 1; details are not repeated here.

S202, the electronic device performs word segmentation on the multiple pieces of text information to obtain a text word set that includes at least one text word.

Specifically, the electronic device can perform word segmentation on the multiple pieces of text information to obtain one text word set. The text words of every piece of text information are present in this text word set.

In one implementation, the electronic device can perform word segmentation on the pieces of text information in parallel and merge the resulting segmentation results to obtain the text word set.

S203, the electronic device looks up the synonyms of each text word in a preset synonym database to obtain a fourth synonym set for each text word. The fourth synonym set includes the text word and the synonyms found for it.

In one implementation, the electronic device can look up the synonyms of each text word in the preset synonym database to obtain a fifth synonym set for each text word; the fifth synonym set does not include the text word itself. The electronic device can then determine the intersection of the fifth synonym set and the text word set as a sixth synonym set, and add the text word to the sixth synonym set to obtain the fourth synonym set. In this way, all words in the fourth synonym set are present in the text information corresponding to the text words. For example, when there are two pieces of text information, text information 1 being "glad and happy are synonyms" and text information 2 being "today is Monday", the text word set corresponding to the two pieces of text information may be {"glad", "happy", "synonyms", "today", "Monday"}. If the synonyms found in the preset synonym database for the text word "glad" in text information 1 are "happy" and "cheerful", then the fifth synonym set of "glad" is {"happy", "cheerful"}, the sixth synonym set obtained by intersecting the fifth synonym set with the text word set is {"happy"}, and adding the text word "glad" to the sixth synonym set gives the fourth synonym set {"glad", "happy"}.
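The fifth, sixth, and fourth synonym sets described above map directly onto set operations. The Python sketch below follows that description; the in-memory dictionary standing in for the preset synonym database, its contents, and the function name are hypothetical.

```python
def fourth_synonym_sets(text_words, synonym_db):
    """Return a dict mapping each text word to its fourth synonym set."""
    text_word_set = set(text_words)
    result = {}
    for word in text_word_set:
        fifth = set(synonym_db.get(word, ())) - {word}  # looked-up synonyms, word itself excluded
        sixth = fifth & text_word_set                    # keep synonyms that occur in the texts
        result[word] = sixth | {word}                    # add the text word back
    return result


# Hypothetical stand-in for the preset synonym database.
SYNONYM_DB = {
    "glad": ["happy", "cheerful"],
    "happy": ["glad", "cheerful"],
}

words = ["glad", "happy", "synonyms", "today", "Monday"]
fourth_sets = fourth_synonym_sets(words, SYNONYM_DB)
```

A word with no database entry, such as "today", ends up with a fourth synonym set that contains only itself.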

In one implementation, the preset synonym database may be stored in the electronic device or in the cloud. In one implementation, the preset synonym database may also be stored in another electronic device that establishes a wired or wireless connection with the electronic device, so that the electronic device can query the preset synonym database stored there. The embodiments of the invention do not limit where the preset synonym database is stored.

In one implementation, the electronic device can also find the synonyms of each text word through a network search to obtain the fourth synonym set for each text word.

S204, the electronic device obtains the first synonym sets according to the fourth synonym sets.

In one implementation, the electronic device may obtain the first synonym sets according to the fourth synonym sets as follows: the electronic device determines a target fourth synonym set, that is, a fourth synonym set whose text word and all of that word's synonyms are present in another fourth synonym set, and determines the other fourth synonym sets as the first synonym sets. The other fourth synonym sets are the fourth synonym sets for the text words other than the target fourth synonym set. For example, when there are two pieces of text information, text information 1 being "glad and happy are synonyms" and text information 2 being "I am very glad", the fourth synonym sets corresponding to text information 1 may include {"glad", "happy"} and {"synonyms"}, and the fourth synonym sets corresponding to text information 2 may include {"I"}, {"very"}, and {"glad"}. All words in the fourth synonym set {"glad"} of the text word "glad" in text information 2 are present in the fourth synonym set {"glad", "happy"} of text information 1. In this case, the fourth synonym set {"glad"} of the text word "glad" in text information 2 is a target fourth synonym set, and the fourth synonym set {"glad", "happy"} of text information 1 is one of the other fourth synonym sets.

In one implementation, the electronic device may obtain the first synonym sets according to the fourth synonym sets as follows: the electronic device sorts the fourth synonym sets by the number of words they contain, and then repeatedly applies compression processing to the sorted fourth synonym sets to obtain the first synonym sets. The compression processing is: if all words in a first set are present in a second set, delete the first set and determine the second set as a first synonym set, where the first set and the second set are two different fourth synonym sets.

In one implementation, the electronic device may sort the fourth synonym sets in descending order or in ascending order; the embodiments of the invention do not limit this.
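The sort-then-compress procedure can be sketched in Python as below. This is one possible arrangement, assuming a descending sort by size so that each candidate only needs to be checked against sets already kept; the function name is hypothetical, and identical sets are merged before compression.

```python
def compress_synonym_sets(fourth_sets):
    """Drop every set whose words all appear in another, larger set."""
    unique = {frozenset(s) for s in fourth_sets}  # identical sets count once
    # Larger sets first; the secondary key only makes the order deterministic.
    ordered = sorted(unique, key=lambda s: (-len(s), sorted(s)))
    kept = []
    for candidate in ordered:
        # A candidate is deleted if some kept set already contains all its words.
        if not any(candidate <= other for other in kept):
            kept.append(candidate)
    return [set(s) for s in kept]


fourth_sets = [
    {"glad", "happy"}, {"synonyms"},  # from text information 1
    {"I"}, {"very"}, {"glad"},        # from text information 2
]
first_sets = compress_synonym_sets(fourth_sets)
```

In this example {"glad"} is removed because it is contained in {"glad", "happy"}, and the remaining four sets serve as the first synonym sets.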

S205, the electronic device obtains the word frequency of each text word in the text information. The word frequency of a text word is used to establish the linear representation relationship between the text word and the text information.

In one implementation, the word frequency of a text word may be the number of times the text word occurs in the text information it belongs to. In one implementation, the word frequency of a text word may be the number of times the text word occurs in the text information it belongs to, divided by the total number of occurrences of all text words in that text information. In one implementation, the word frequency of a text word may be the number of times the text word occurs in the text information it belongs to, divided by the number of times a first text word occurs in that text information, where the first text word is the text word that occurs most often in the text information.

In one implementation, the linear representation relationship between the text words and the text information can be established from the text words and their word frequencies. For example, when the word frequency of a text word is the number of times it occurs in the text information it belongs to, and text information 1 is "glad and happy are synonyms, I am very glad", the word frequency of the text word "glad" is 2, the word frequency of the text word "happy" is 1, the word frequency of the text word "synonyms" is 1, and the word frequency of the text word "very" is 1. In this case, the linear representation relationship between the text words and text information 1 may be: text information 1 = "glad" * 2 + "happy" + "synonyms" + "very".
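The three word-frequency definitions given for step S205 can be compared side by side in a short Python sketch. The function name and the mode labels are hypothetical, and the token list assumes that segmentation and stop-word removal have already been applied to text information 1.

```python
from collections import Counter


def word_frequencies(tokens, mode="count"):
    """Word frequency of each text word under one of three definitions."""
    counts = Counter(tokens)
    if mode == "count":            # raw number of occurrences
        return dict(counts)
    if mode == "relative":         # occurrences / total occurrences of all text words
        total = sum(counts.values())
        return {word: n / total for word, n in counts.items()}
    if mode == "max_normalized":   # occurrences / occurrences of the most frequent word
        top = max(counts.values())
        return {word: n / top for word, n in counts.items()}
    raise ValueError(f"unknown mode: {mode}")


# Assumed text words of "glad and happy are synonyms, I am very glad".
tokens = ["glad", "happy", "synonyms", "very", "glad"]
raw = word_frequencies(tokens, "count")
relative = word_frequencies(tokens, "relative")
scaled = word_frequencies(tokens, "max_normalized")
```

Whichever definition is chosen, the resulting values act as the coefficients in the linear representation of the text information by its text words.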

S206, the electronic equipment obtains the second synonym sets that contain each text word, where the aforementioned first synonym sets may include the second synonym sets.

In one implementation, the electronic equipment may store the second synonym sets corresponding to each piece of text information; the union of all second synonym sets corresponding to a piece of text information contains all text words in that text information. For example, if text information 1 (containing m text words) corresponds to 3 second synonym sets (set 1, set 2 and set 3), then the union of set 1, set 2 and set 3 contains those m text words.

In one implementation, after obtaining the second synonym sets that contain each text word, the electronic equipment may also execute steps s2061-s2064:

S2061: for each text word in the text information, determine the first vector of the text word.

The first vector can be used to uniquely identify the text word. In one implementation, the first vector may be a word vector, which converts a word in natural language into a dense vector that a computer can process. In one implementation, the word vector may be a one-hot vector. For example, assume the number of distinct text words is N, so that each text word corresponds one-to-one to an integer from 0 to N-1. If the integer corresponding to a text word is i, its one-hot vector can be obtained by creating an all-zero vector of length N and setting its i-th element to 1. For example, when N is 3, the word vector of a text word can be [1, 0, 0]. In one implementation, the word vector may also be obtained by a Word2vec model or another model, which the embodiments of the present invention do not limit. For example, the word vector of a text word can be [0.5, 0.3, 0.2].
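The one-hot construction described above can be sketched as follows. The function name `one_hot_vectors` and the choice to assign the integers 0 to N-1 in sorted order are assumptions made for illustration; the embodiment only requires a one-to-one correspondence.

```python
def one_hot_vectors(text_words):
    """Map each distinct text word to a one-hot vector of length N."""
    vocabulary = sorted(set(text_words))  # assumed ordering: alphabetical
    n = len(vocabulary)
    vectors = {}
    for i, word in enumerate(vocabulary):
        vec = [0] * n   # all-zero vector of length N
        vec[i] = 1      # set the i-th element to 1
        vectors[word] = vec
    return vectors

vectors = one_hot_vectors(["glad", "happy", "synonym", "glad"])
```

With N = 3 distinct text words, the word assigned index 0 receives the vector [1, 0, 0], as in the example above.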

S2062: according to the first vector of the text word, obtain the second vector of a third synonym set that contains the text word, where the aforementioned second synonym sets include the third synonym set.

In one implementation, a text word may correspond to one or more third synonym sets. If the text word corresponds to multiple third synonym sets, the text word is present in each of them.

The second vector of a third synonym set can be used to uniquely identify that third synonym set. In one implementation, the second vector of a third synonym set may be the arithmetic mean of the first vectors of all text words in the third synonym set. For example, when the third synonym set is {"glad", "happy"}, the first vector of the text word "glad" is [0.5, 0.3, 0.2] and the first vector of the text word "happy" is [0.4, 0.1, 0.2], the second vector of the third synonym set is ([0.5, 0.3, 0.2] + [0.4, 0.1, 0.2]) / 2 = [0.45, 0.2, 0.2].
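The arithmetic mean used for the second vector can be sketched in a few lines. The helper name `mean_vector` is hypothetical, and the sketch assumes all first vectors have the same length.

```python
def mean_vector(vectors):
    """Component-wise arithmetic mean of equal-length vectors."""
    n = len(vectors)
    return [sum(components) / n for components in zip(*vectors)]

# First vectors of the two text words in the third synonym set.
second_vector = mean_vector([[0.5, 0.3, 0.2], [0.4, 0.1, 0.2]])
```

For the two first vectors in the example, each component of the result is approximately 0.45, 0.2 and 0.2 (up to floating-point rounding).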

S2063: according to the first vector of the text word and the second vector of the third synonym set, obtain the cosine similarity between the text word and the third synonym set.

In one implementation, the electronic equipment may determine the dot product of the first vector of the text word and the second vector of the third synonym set as the cosine similarity between the text word and the third synonym set.
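A minimal sketch of step s2063 follows. The embodiment takes the dot product as the cosine similarity; the two coincide when both vectors have unit length. The explicit normalisation below, the zero-vector handling, and the function names are assumptions added here for the general case, not part of the embodiment.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (0.0 if either vector is all zeros)."""
    norm_u = math.sqrt(dot(u, u))
    norm_v = math.sqrt(dot(v, v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot(u, v) / (norm_u * norm_v)

# First vector of a text word and second vector of a third synonym set.
similarity = cosine_similarity([0.5, 0.3, 0.2], [0.45, 0.2, 0.2])
```

If the first and second vectors are normalised to unit length beforehand, `dot(u, v)` alone gives the same value, which matches the dot-product formulation in the text.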

S2064: according to the cosine similarity, obtain the second coefficient of the third synonym set for the text word.

In one implementation, the electronic equipment may determine the sum of the cosine similarities between the text word and all third synonym sets corresponding to the text word, and divide the cosine similarity between the text word and a given third synonym set by that sum to obtain the second coefficient of that third synonym set for the text word. It should be noted that when the text word corresponds to multiple third synonym sets, the second coefficients of all those third synonym sets for the text word sum to 1.
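The normalisation in step s2064 can be sketched as below. The function name `second_coefficients` and the similarity values 0.2 and 0.3 are hypothetical; the sketch also assumes the similarities are non-negative and do not all equal zero.

```python
def second_coefficients(similarities):
    """Divide each cosine similarity by the sum over all third synonym sets.

    `similarities` maps a third synonym set identifier to the cosine
    similarity between that set and one text word.
    """
    total = sum(similarities.values())
    if total == 0:
        raise ValueError("similarities must not sum to zero")
    return {set_id: sim / total for set_id, sim in similarities.items()}

# Hypothetical similarities of one text word to its two third synonym sets.
coeffs = second_coefficients({"set 1": 0.2, "set 2": 0.3})
```

With these inputs the second coefficients are approximately 0.4 and 0.6, and they sum to 1 as the text requires.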

In one implementation, the linear expression relationship between the third synonym sets and the text word can be established from the second coefficients and the third synonym sets. For example, suppose the text word "glad" corresponds to two third synonym sets, where third synonym set 1 is {"glad", "happy"} and third synonym set 2 is {"glad", "happy"}, the second coefficient of third synonym set 1 for the text word is 0.4, and the second coefficient of third synonym set 2 for the text word is 0.6. Then the linear expression relationship between the third synonym sets and the text word can be: text word "glad" = 0.4 * {"glad", "happy"} + 0.6 * {"glad", "happy"}.

S207, for each second synonym set, the electronic equipment obtains the first coefficient of the second synonym set according to the second coefficient of the second synonym set for each text word and the word frequency of that text word.

As noted in step S205, the linear expression relationship between the text words and the text information can be established from the text words and their word frequencies; as noted in step s2064, the linear expression relationship between the third synonym sets and a text word can be established from the second coefficients and the third synonym sets. It follows that, based on steps S205 and s2064, the linear expression relationship between the third synonym sets and the text information can be established and, in turn, the linear expression relationship between the second synonym sets and the text information.

For example, suppose the text information is "glad and happy are synonyms", the third synonym set of the text word "glad" is {"glad", "happy"}, the third synonym set of the text word "happy" is {"glad", "happy"}, and the third synonym set of the text word "synonym" is {"synonym"}. The text word "glad" can then be represented by the third synonym set {"glad", "happy"}, the text word "happy" by the third synonym set {"glad", "happy"}, and the text word "synonym" by the third synonym set {"synonym"}. Because the text information can be represented by the text words "glad", "happy" and "synonym", it can also be represented by the third synonym sets {"glad", "happy"} and {"synonym"}. In this case, {"glad", "happy"} and {"synonym"} are determined to be the second synonym sets corresponding to the text information.

In one implementation, suppose the text words in text information 1 are text word 1 and text word 2, the third synonym sets corresponding to text word 1 are set a (with a second coefficient of 0.4 for text word 1) and set b (with a second coefficient of 0.6 for text word 1), the third synonym set corresponding to text word 2 is set b (with a second coefficient of 1 for text word 2), the word frequency of text word 1 is 2, and the word frequency of text word 2 is 1. The linear expression relationship between the text words and text information 1 is then: text information 1 = 2 * text word 1 + text word 2. Substituting the linear expression relationship between the third synonym sets and each text word into this formula gives the linear expression relationship between the third synonym sets and text information 1: text information 1 = 2 * (0.4 * set a + 0.6 * set b) + set b = 0.8 * set a + 2.2 * set b. Here set a is a third synonym set corresponding to text word 1 and, at the same time, a second synonym set of text information 1; the same holds for set b. Therefore, the first coefficient of the second synonym set (set a) for text information 1 is 0.8, and the first coefficient of the second synonym set (set b) for text information 1 is 2.2.
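The substitution that yields the first coefficients can be sketched as a weighted accumulation. The function name `first_coefficients` and the dictionary layout are assumptions made for illustration; the numbers are those of the example above.

```python
from collections import defaultdict

def first_coefficients(word_freq, second_coeff):
    """Accumulate word frequency * second coefficient for each synonym set.

    word_freq:    text word -> word frequency in the text information.
    second_coeff: text word -> {synonym set id -> second coefficient}.
    """
    coeffs = defaultdict(float)
    for word, freq in word_freq.items():
        for set_id, weight in second_coeff[word].items():
            coeffs[set_id] += freq * weight
    return dict(coeffs)

word_freq = {"text word 1": 2, "text word 2": 1}
second_coeff = {
    "text word 1": {"set a": 0.4, "set b": 0.6},
    "text word 2": {"set b": 1.0},
}
result = first_coefficients(word_freq, second_coeff)
```

For text information 1 this gives approximately 0.8 for set a and 2.2 for set b, matching the expansion 2 * (0.4 * set a + 0.6 * set b) + set b.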

S208, the electronic equipment obtains the TF-IDF value of the second synonym set according to the first coefficients of the text information.

In one implementation, when obtaining the TF-IDF value of the second synonym set according to the first coefficients of the text information, the electronic equipment specifically executes steps s2081-s2085:

S2081: sum all first coefficients of the text information to obtain a first value.

Specifically, after obtaining the first coefficients of all second synonym sets corresponding to the text information, the electronic equipment may sum all the first coefficients to obtain the first value. For example, if the linear expression relationship between the second synonym sets (set a and set b) and text information 1 is: text information 1 = 0.8 * set a + 2.2 * set b, then the first value is 0.8 + 2.2 = 3.

S2082: divide the first coefficient corresponding to the second synonym set by the first value to obtain a second value. Specifically, the number of second values corresponding to a piece of text information equals the number of second synonym sets corresponding to that text information, and the electronic equipment may compute the second values in parallel. For example, if the linear expression relationship between the second synonym sets (set a and set b) and text information 1 is: text information 1 = 0.8 * set a + 2.2 * set b, then one second value corresponding to text information 1 is 0.8/3 and the other is 2.2/3.

S2083: sum the first coefficients of the second synonym set for each piece of text information to obtain a third value. In one implementation, the same second synonym set may correspond to multiple pieces of text information, each of which can be linearly represented with that second synonym set, so the second synonym set has one first coefficient for each such piece of text information. The electronic equipment may sum the first coefficients of the second synonym set over those pieces of text information to obtain the third value. For example, suppose the second synonym set (set a) corresponds to text information 1 and text information 2, the linear expression relationship between text information 1 and its second synonym sets (set a and set b) is: text information 1 = 0.8 * set a + 2.2 * set b, and the linear expression relationship between text information 2 and its second synonym sets (set a and set c) is: text information 2 = 0.6 * set a + 1.3 * set c. Then the first coefficient of the second synonym set (set a) for text information 1 is 0.8 and its first coefficient for text information 2 is 0.6, so the third value is 0.8 + 0.6 = 1.4.

S2084: divide the number of all pieces of text information included in the information processing request by the third value, and take the logarithm of the result to obtain a fourth value.

In one implementation, the aforementioned information processing request may also include the number of all pieces of text information. For example, if the number of all pieces of text information is 2 and the third value is 1.4, then the fourth value is lg(2/1.4).

S2085: multiply the aforementioned second value by the fourth value to obtain the TF-IDF value of the second synonym set.

In one implementation, each piece of text information may correspond to one or more second synonym sets; when it corresponds to multiple second synonym sets, the electronic equipment may compute the TF-IDF value of each second synonym set in parallel. For example, if the second synonym sets corresponding to text information 1 are set a and set b, the second value corresponding to set a is 0.8/3 and its fourth value is lg(2/1.4), and the second value corresponding to set b is 2.2/3 and its fourth value is lg(2/1.3), then the TF-IDF value of the second synonym set (set a) is (0.8/3) * lg(2/1.4), and the TF-IDF value of the second synonym set (set b) is (2.2/3) * lg(2/1.3).
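Steps s2081-s2085 can be sketched end to end as follows. This is a sketch under stated assumptions: "lg" is read as the base-10 logarithm, the number of pieces of text information is taken from the input rather than from the request, and the function name `synonym_set_tfidf` and the dictionary layout are hypothetical.

```python
import math

def synonym_set_tfidf(first_coeffs_by_text):
    """Compute the TF-IDF value of each second synonym set in each text.

    first_coeffs_by_text: text id -> {synonym set id -> first coefficient}.
    """
    n_texts = len(first_coeffs_by_text)  # assumed equal to the requested quantity

    # s2083: third value = sum of a set's first coefficients over all texts.
    third = {}
    for coeffs in first_coeffs_by_text.values():
        for set_id, c in coeffs.items():
            third[set_id] = third.get(set_id, 0.0) + c

    result = {}
    for text_id, coeffs in first_coeffs_by_text.items():
        first_value = sum(coeffs.values())                 # s2081
        result[text_id] = {}
        for set_id, c in coeffs.items():
            second_value = c / first_value                 # s2082
            fourth_value = math.log10(n_texts / third[set_id])  # s2084
            result[text_id][set_id] = second_value * fourth_value  # s2085
    return result

texts = {
    "text information 1": {"set a": 0.8, "set b": 2.2},
    "text information 2": {"set a": 0.6, "set c": 1.3},
}
tfidf = synonym_set_tfidf(texts)
```

For set a in text information 1, the sketch evaluates (0.8/3) * lg(2/1.4), the same expression as in the worked example for steps s2081-s2085.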

S209, the electronic equipment determines, from the second synonym sets, the target synonym sets that have the larger TF-IDF values and whose number equals the target quantity.

In one implementation, the aforementioned information processing request may also include the target quantity of target synonym sets. If a second synonym set corresponds to text information 1, the target synonym set is the second synonym set with the highest degree of association with text information 1.

In one implementation, the electronic equipment may first determine, among the multiple second synonym sets corresponding to the text information, the second synonym set with the largest TF-IDF value as a target synonym set, then determine, among the remaining second synonym sets, the one with the largest TF-IDF value as a target synonym set, and so on until the number of target synonym sets equals the target quantity.

In one implementation, the electronic equipment may select, in descending order of TF-IDF value, the target quantity of second synonym sets from the multiple second synonym sets as the target synonym sets. For example, if the target quantity is 1, the second synonym sets corresponding to text information 1 are set a and set b, the TF-IDF value of set a is 0.4 and the TF-IDF value of set b is 0.8, then the target synonym set of text information 1 is set b. It should be noted that the values in the above examples are for illustration only and do not limit the embodiments of the present invention.
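Selecting the target synonym sets in descending order of TF-IDF value can be sketched as a sort followed by a slice. The function name `select_target_sets` is hypothetical; the values are those of the example above.

```python
def select_target_sets(tfidf_by_set, target_quantity):
    """Return the identifiers of the sets with the largest TF-IDF values."""
    ranked = sorted(tfidf_by_set.items(), key=lambda item: item[1], reverse=True)
    return [set_id for set_id, _ in ranked[:target_quantity]]

targets = select_target_sets({"set a": 0.4, "set b": 0.8}, 1)
```

With a target quantity of 1, only set b is selected, since its TF-IDF value of 0.8 is the larger of the two.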

In one implementation, the electronic equipment may also receive an information retrieval request from a terminal device, where the information retrieval request includes a search term. The electronic equipment may obtain the synonym set of the search term, obtain each second synonym set that contains the same words as that synonym set, and obtain the TF-IDF value of each such second synonym set in its corresponding text information. It then sends the text information corresponding to the second synonym set with the largest TF-IDF value to the terminal device. Alternatively, the electronic equipment may send to the terminal device the text information corresponding to a preset quantity of second synonym sets with the larger TF-IDF values. The preset quantity may be set by default in the electronic equipment or sent to the electronic equipment by the terminal device; the embodiments of the present invention do not limit this.
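The retrieval variant described above can be sketched as follows. The function name `retrieve`, the data layout, and the sample values are hypothetical, and the sketch assumes that "contains the same words" means the second synonym set has exactly the same members as the synonym set of the search term.

```python
def retrieve(query_synonyms, tfidf_by_text, set_members, preset_quantity=1):
    """Return text ids ranked by the TF-IDF value of the matching synonym set.

    query_synonyms: words in the synonym set of the search term.
    tfidf_by_text:  text id -> {synonym set id -> TF-IDF value}.
    set_members:    synonym set id -> words in that second synonym set.
    """
    query = frozenset(query_synonyms)
    scored = []
    for text_id, scores in tfidf_by_text.items():
        for set_id, score in scores.items():
            if frozenset(set_members[set_id]) == query:
                scored.append((score, text_id))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text_id for _, text_id in scored[:preset_quantity]]

set_members = {"set a": ["glad", "happy"], "set b": ["synonym"]}
tfidf_by_text = {
    "text information 1": {"set a": 0.4, "set b": 0.8},
    "text information 2": {"set a": 0.7},
}
best = retrieve(["happy", "glad"], tfidf_by_text, set_members)
```

Here only set a matches the search term's synonym set, and text information 2 is returned first because set a has the larger TF-IDF value there.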

Referring to Fig. 3, Fig. 3 is a schematic structural diagram of an information processing device provided in an embodiment of the present invention. Specifically, as shown in Fig. 3, the information processing device 30 may include:

A receiving unit 301, configured to receive an information processing request, where the information processing request includes multiple pieces of text information and each piece of text information includes at least one text word.

A processing unit 302, configured to obtain, according to the text words included in the multiple pieces of text information, first synonym sets about the text words, where a first synonym set includes a text word and at least one synonym of that text word.

The processing unit 302 is further configured to determine, for each piece of text information, the first coefficient of the text information, where the first coefficient corresponds to a second synonym set that contains a text word in the text information, the first synonym sets include the second synonym set, and the first coefficient is used to establish the linear expression relationship between the second synonym set and the text information.

The processing unit 302 is further configured to obtain the TF-IDF value of the second synonym set according to the first coefficient of the text information.

In one implementation, the information processing request may also include the target quantity of target synonym sets, and the processing unit 302 may be further configured to determine, from the second synonym sets, the target synonym sets that have the larger TF-IDF values and whose number equals the target quantity.

In one implementation, the processing unit 302 is specifically configured to: obtain the word frequency of each text word in the text information, where the word frequency of a text word is used to establish the linear expression relationship between the text word and the text information; obtain the second synonym sets that contain each text word; and, for each second synonym set, obtain the first coefficient of the second synonym set according to the second coefficient of the second synonym set for each text word and the word frequency of that text word.

In one implementation, the processing unit 302 may be further configured to: for each text word in the text information, determine the first vector of the text word; according to the first vector of the text word, obtain the second vector of a third synonym set that contains the text word, where the second synonym sets include the third synonym set; according to the first vector of the text word and the second vector of the third synonym set, obtain the cosine similarity between the text word and the third synonym set; and, according to the cosine similarity, obtain the second coefficient of the third synonym set for the text word.

In one implementation, the aforementioned information processing request may also include the number of all pieces of text information, and the processing unit 302 is specifically configured to: sum all first coefficients of the text information to obtain a first value; divide the first coefficient corresponding to the second synonym set by the first value to obtain a second value; sum the first coefficients of the second synonym set for each piece of text information to obtain a third value; divide the number of all pieces of text information included in the information processing request by the third value and take the logarithm of the result to obtain a fourth value; and multiply the second value by the fourth value to obtain the TF-IDF value of the second synonym set.

In one implementation, the processing unit 302 is specifically configured to: perform word segmentation on the multiple pieces of text information to obtain a text word set, where the text word set includes at least one text word; look up the synonyms of each text word in a preset synonym database to obtain a fourth synonym set about each text word, where the fourth synonym set includes the text word and the synonyms found for that text word; and obtain the first synonym sets according to the fourth synonym sets.

In one implementation, the processing unit 302 is specifically configured to: determine a target fourth synonym set, namely a fourth synonym set whose text word and all synonyms of that text word are present in other fourth synonym sets, where the other fourth synonym sets are the fourth synonym sets, among those about each text word, other than the target fourth synonym set; and determine the other fourth synonym sets as the first synonym sets.
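The derivation of first synonym sets from fourth synonym sets can be sketched as below. Several points are assumptions rather than statements of the embodiment: the synonym database is modelled as a plain dictionary, a target fourth synonym set is interpreted as one that is strictly contained in a single other fourth synonym set, identical sets are kept once, and the word "joyful" is a hypothetical database entry.

```python
def first_synonym_sets(text_words, synonym_db):
    """Build fourth synonym sets, then drop those contained in another set."""
    # Fourth synonym set: the text word plus the synonyms found for it.
    fourth = []
    for word in text_words:
        candidate = frozenset({word, *synonym_db.get(word, ())})
        if candidate not in fourth:  # keep identical sets only once
            fourth.append(candidate)
    # Remove each target set, i.e. a set strictly contained in another set.
    return [s for s in fourth if not any(s < other for other in fourth)]

synonym_db = {
    "glad": ["happy", "joyful"],  # hypothetical database contents
    "happy": ["glad"],
    "synonym": [],
}
first_sets = first_synonym_sets(["glad", "happy", "synonym"], synonym_db)
```

In this sample, the fourth synonym set of "happy" is contained in that of "glad", so it is removed and two first synonym sets remain.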

This embodiment of the present invention is based on the same concept as the method embodiments shown in Fig. 1 and Fig. 2 and brings the same technical effects; for the specific principles, refer to the description of the embodiments shown in Fig. 1 and Fig. 2, which is not repeated here.

Referring to Fig. 4, Fig. 4 is a schematic structural diagram of an electronic equipment provided in an embodiment of the present invention. The electronic equipment 40 may include a receiver 401, a memory 402 and a processor 403, which are connected by one or more communication buses.

The receiver 401 can be used to receive data; for example, the receiver 401 can be used to receive an information processing request.

The memory 402 may include read-only memory and random access memory, and provides instructions and data to the processor 403. A part of the memory 402 may also include non-volatile random access memory.

The processor 403 may be a central processing unit (Central Processing Unit, CPU), or another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor or, optionally, any conventional processor. Wherein:

The memory 402 is configured to store program instructions.

The processor 403 is configured to call the program instructions stored in the memory 402 in order to:

Receive an information processing request, where the information processing request includes multiple pieces of text information and each piece of text information includes at least one text word;

Obtain, according to the text words included in the multiple pieces of text information, first synonym sets about the text words, where a first synonym set includes a text word and at least one synonym of that text word;

For each piece of text information, determine the first coefficient of the text information, where the first coefficient corresponds to a second synonym set that contains a text word in the text information, the first synonym sets include the second synonym set, and the first coefficient is used to establish the linear expression relationship between the second synonym set and the text information;

Obtain the TF-IDF value of the second synonym set according to the first coefficient of the text information.

In one implementation, the information processing request may also include the target quantity of target synonym sets, and the processor 403 may be further configured to determine, from the second synonym sets, the target synonym sets that have the larger TF-IDF values and whose number equals the target quantity.

In one implementation, when determining the first coefficient of the text information, the processor 403 is specifically configured to: obtain the word frequency of each text word in the text information, where the word frequency of a text word is used to establish the linear expression relationship between the text word and the text information; obtain the second synonym sets that contain each text word; and, for each second synonym set, obtain the first coefficient of the second synonym set according to the second coefficient of the second synonym set for each text word and the word frequency of that text word.

In one implementation, the processor 403 may be further configured to: for each text word in the text information, determine the first vector of the text word; according to the first vector of the text word, obtain the second vector of a third synonym set that contains the text word, where the second synonym sets include the third synonym set; according to the first vector of the text word and the second vector of the third synonym set, obtain the cosine similarity between the text word and the third synonym set; and, according to the cosine similarity, obtain the second coefficient of the third synonym set for the text word.

In one implementation, the aforementioned information processing request may also include the number of all pieces of text information. When obtaining the TF-IDF value of the second synonym set according to the first coefficient of the text information, the processor 403 is specifically configured to: sum all first coefficients of the text information to obtain a first value; divide the first coefficient corresponding to the second synonym set by the first value to obtain a second value; sum the first coefficients of the second synonym set for each piece of text information to obtain a third value; divide the number of all pieces of text information included in the information processing request by the third value and take the logarithm of the result to obtain a fourth value; and multiply the second value by the fourth value to obtain the TF-IDF value of the second synonym set.

In one implementation, when obtaining the first synonym sets about the text words according to the text words included in the multiple pieces of text information, the processor 403 is specifically configured to: perform word segmentation on the multiple pieces of text information to obtain a text word set, where the text word set includes at least one text word; look up the synonyms of each text word in a preset synonym database to obtain a fourth synonym set about each text word, where the fourth synonym set includes the text word and the synonyms found for that text word; and obtain the first synonym sets according to the fourth synonym sets.

In one implementation, when obtaining the first synonym sets according to the fourth synonym sets, the processor 403 is specifically configured to: determine a target fourth synonym set, namely a fourth synonym set whose text word and all synonyms of that text word are present in other fourth synonym sets, where the other fourth synonym sets are the fourth synonym sets, among those about each text word, other than the target fourth synonym set; and determine the other fourth synonym sets as the first synonym sets.

It should be noted that, for content not mentioned in the embodiment corresponding to Fig. 4 and for the specific implementation of each step, reference may be made to the embodiments shown in Fig. 1 to Fig. 3 and the foregoing description, which are not repeated here.

An embodiment of the present invention also provides a computer-readable storage medium. The computer-readable storage medium stores a computer program, the computer program includes program instructions, and the program instructions, when executed by a processor, cause the processor to perform the steps of the method embodiments shown in Fig. 1 and Fig. 2.

The above disclosure covers only some embodiments of the present invention and certainly cannot be used to limit the scope of the rights of the present invention. Those of ordinary skill in the art can understand all or part of the processes for implementing the above embodiments, and equivalent changes made in accordance with the claims of the present invention still fall within the scope covered by the invention.

Claims

1. An information processing method, characterized by comprising:

receiving an information processing request, where the information processing request includes multiple pieces of text information and each piece of text information includes at least one text word;

obtaining, according to the text words included in the multiple pieces of text information, a first synonym set about the text word, where the first synonym set includes the text word and at least one synonym of the text word;

for each piece of text information, determining a first coefficient of the text information, where the first coefficient corresponds to a second synonym set that contains a text word in the text information, the first synonym set includes the second synonym set, and the first coefficient is used to establish a linear expression relationship between the second synonym set and the text information;

obtaining the TF-IDF value of the second synonym set according to the first coefficient of the text information.

2. The method according to claim 1, characterized in that the information processing request further includes a target quantity of target synonym sets, and after obtaining the TF-IDF value of the second synonym set according to the first coefficient of the text information, the method further comprises:

determining, from the second synonym sets, the target synonym sets that have the larger TF-IDF values and whose number equals the target quantity.

3. The method according to claim 1, characterized in that determining the first coefficient of the text information comprises:

obtaining the word frequency of each text word in the text information, where the word frequency of the text word is used to establish the linear expression relationship between the text word and the text information;

obtaining the second synonym sets that contain each text word;

for each second synonym set, obtaining the first coefficient of the second synonym set according to the second coefficient of the second synonym set for each text word and the word frequency of the text word.

4. The method according to claim 3, characterized in that, before obtaining the first coefficient of the second synonym set according to the second coefficient of the second synonym set for each text word and the word frequency of the text word, the method further comprises:

for each text word in the text information, determining a first vector of the text word;

according to the first vector of the text word, obtaining a second vector of a third synonym set that contains the text word, where the second synonym set includes the third synonym set;

according to the first vector of the text word and the second vector of the third synonym set, obtaining the cosine similarity between the text word and the third synonym set;

according to the cosine similarity, obtaining the second coefficient of the third synonym set for the text word.

5. The method according to any one of claims 1 to 4, characterized in that the information processing request further includes the number of all pieces of text information, and obtaining the TF-IDF value of the second synonym set according to the first coefficient of the text information comprises:

summing all first coefficients of the text information to obtain a first value;

dividing the first coefficient corresponding to the second synonym set by the first value to obtain a second value;

summing the first coefficients of the second synonym set for each piece of text information to obtain a third value;

dividing the number of all pieces of text information included in the information processing request by the third value and taking the logarithm of the result to obtain a fourth value;

multiplying the second value by the fourth value to obtain the TF-IDF value of the second synonym set.

6. The method according to any one of claims 1 to 4, characterized in that obtaining the first synonym set about the text word according to the text words included in the multiple pieces of text information comprises:

performing word segmentation on the multiple pieces of text information to obtain a text word set, where the text word set includes at least one text word;

looking up the synonyms of each text word in a preset synonym database to obtain a fourth synonym set about each text word, where the fourth synonym set includes the text word and the synonyms found for the text word;

obtaining the first synonym set according to the fourth synonym set.

7. The method according to claim 6, characterized in that obtaining the first synonym set according to the fourth synonym set comprises:

determining a target fourth synonym set, namely a fourth synonym set whose text word and all synonyms of the text word are present in other fourth synonym sets, where the other fourth synonym sets are the fourth synonym sets, among those about each text word, other than the target fourth synonym set;

determining the other fourth synonym sets as the first synonym set.

8. An information processing device, characterized in that the device includes units for executing the method according to any one of claims 1 to 7.

9. An electronic equipment, characterized by including a memory and a processor, where the memory is used to store a computer program, the computer program includes program instructions, and the processor is configured to call the program instructions to execute the method according to any one of claims 1 to 7.

10. A computer storage medium, characterized in that the computer storage medium stores a computer program, the computer program includes program instructions, and the program instructions, when executed by a processor, cause the processor to execute the method according to any one of claims 1 to 7.