CN109858012A - Barrage text similarity calculation method, storage medium, device and system - Google Patents
- Publication number
- CN109858012A (CN201811459848.5A)
- Authority
- CN
- China
- Prior art keywords
- barrage
- text
- similarity
- lexical item
- identical lexical
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a barrage text similarity calculation method, storage medium, device and system, and relates to the field of big data processing. The method includes: segmenting the texts of barrage A and barrage B into words to obtain the identical lexical items of barrage A and barrage B and the minimum word frequency of each identical lexical item; calculating the proportion of the identical lexical items in the texts of barrage A and barrage B; calculating the text similarity of barrage A and barrage B based on identical-lexical-item word frequency; mapping the texts of barrage A and barrage B to space vectors with a word2vec model, and then calculating the text similarity of barrage A and barrage B in text space with the cosine similarity algorithm; and weighting the text similarity based on identical-lexical-item word frequency and the text similarity in text space to obtain the final similarity of barrage A and barrage B. The invention can effectively ensure the accuracy of the similarity calculated between barrage texts.
Description
Technical field
The present invention relates to the field of big data processing, and in particular to a barrage text similarity calculation method, storage medium, device and system.
Background art
With the rapid development of the mobile Internet, the live-streaming industry is also booming, and more and more young people spend their free time watching live streams.
While watching a live stream, users can interact with the streamer or with other users by sending barrage text (comments overlaid on the video). In the live rooms of popular streamers, however, the number of users is large, so the volume of barrages is enormous. If every barrage sent by users were displayed, barrages would cover the entire live picture. To protect the viewing experience, the live-streaming platform calculates the similarity between barrage texts; if two barrages are highly similar, only one of the two is displayed. In the prior art, barrage similarity is calculated with algorithms such as cosine similarity and Euclidean distance, but these algorithms usually consider only the distance between the two barrages in vector space, so the similarity calculated between barrages is not accurate enough.
Summary of the invention
In view of the deficiencies of the prior art, the purpose of the present invention is to provide a barrage text similarity calculation method, storage medium, device and system that can effectively ensure the accuracy of the similarity calculated between barrage texts.
A first aspect of the present invention provides a barrage text similarity calculation method, comprising the following steps:
segmenting the texts of barrage A and barrage B into words to obtain the identical lexical items of barrage A and barrage B and the minimum word frequency of each identical lexical item;
calculating the proportion of the identical lexical items in the texts of barrage A and barrage B;
calculating the text similarity of barrage A and barrage B based on identical-lexical-item word frequency;
mapping the texts of barrage A and barrage B to space vectors with a word2vec model, and then calculating the text similarity of barrage A and barrage B in text space with the cosine similarity algorithm;
weighting the text similarity of barrage A and barrage B based on identical-lexical-item word frequency and the text similarity in text space to obtain the final similarity of barrage A and barrage B.
With reference to the first aspect, in a first possible implementation, the proportion of the identical lexical items in the texts of barrage A and barrage B is calculated by the following formula:
$$P(A,B) = \frac{\sum_{i=1}^{m} n_i}{L_A + L_B}$$
where P(A, B) denotes the proportion of the identical lexical items in the texts of barrage A and barrage B, word_i denotes the i-th identical lexical item, n_i denotes the minimum word frequency of word_i, m denotes the number of identical lexical items, L_A denotes the text length of barrage A, and L_B denotes the text length of barrage B.
With reference to the first possible implementation of the first aspect, in a second possible implementation, the text similarity of barrage A and barrage B based on identical-lexical-item word frequency is calculated by the following formula:
where Sim_tf(A, B) denotes the text similarity of barrage A and barrage B based on identical-lexical-item word frequency.
With reference to the second possible implementation of the first aspect, in a third possible implementation, the text similarity of barrage A and barrage B in text space is calculated with the cosine similarity algorithm by the following formula:
$$\mathrm{Sim}_{word2vec}(A,B) = \frac{\sum_{i=1}^{p} a_i b_i}{\sqrt{\sum_{i=1}^{p} a_i^2}\,\sqrt{\sum_{i=1}^{p} b_i^2}}$$
where Sim_word2vec(A, B) denotes the similarity of barrage A and barrage B in text space, (a_1, ..., a_p) denotes the space vector obtained by mapping barrage A with the word2vec model, (b_1, ..., b_p) denotes the space vector obtained by mapping barrage B with the word2vec model, and p denotes the dimension of the vectors.
With reference to the third possible implementation of the first aspect, in a fourth possible implementation, the text similarity of barrage A and barrage B based on identical-lexical-item word frequency and the text similarity in text space are weighted to obtain the final similarity of barrage A and barrage B by the following formula:
$$\mathrm{Sim}(A,B) = \lambda \cdot \mathrm{Sim}_{tf}(A,B) + (1-\lambda) \cdot \mathrm{Sim}_{word2vec}(A,B)$$
where Sim(A, B) denotes the final similarity of barrage A and barrage B, and λ is an adjustment coefficient with a value range of [0.6, 0.8].
With reference to the third possible implementation of the first aspect, in a fifth possible implementation, when the final similarity of barrage A and barrage B is greater than a set threshold, either barrage A or barrage B is selected and displayed on the live picture, and the other, unselected barrage is not displayed on the live picture.
A second aspect of the present invention provides a storage medium on which a computer program is stored, the computer program implementing the following steps when executed by a processor:
segmenting the texts of barrage A and barrage B into words to obtain the identical lexical items of barrage A and barrage B and the minimum word frequency of each identical lexical item;
calculating the proportion of the identical lexical items in the texts of barrage A and barrage B;
calculating the text similarity of barrage A and barrage B based on identical-lexical-item word frequency;
mapping the texts of barrage A and barrage B to space vectors with a word2vec model, and then calculating the text similarity of barrage A and barrage B in text space with the cosine similarity algorithm;
weighting the text similarity of barrage A and barrage B based on identical-lexical-item word frequency and the text similarity in text space to obtain the final similarity of barrage A and barrage B.
A third aspect of the present invention provides an electronic device, comprising:
a word segmentation unit, configured to select barrage A and barrage B to be displayed and segment the texts of barrage A and barrage B into words to obtain the identical lexical items of barrage A and barrage B and the minimum word frequency of each identical lexical item;
a proportion calculation unit, configured to calculate the proportion of the identical lexical items in the texts of barrage A and barrage B;
a first text similarity calculation unit, configured to calculate the text similarity of barrage A and barrage B based on identical-lexical-item word frequency;
a second text similarity calculation unit, configured to map the texts of barrage A and barrage B to space vectors with a word2vec model and then calculate the text similarity of barrage A and barrage B in text space with the cosine similarity algorithm;
a final similarity calculation unit, configured to weight the text similarity of barrage A and barrage B based on identical-lexical-item word frequency and the text similarity in text space to obtain the final similarity of barrage A and barrage B.
A fourth aspect of the present invention provides a barrage text similarity calculation system, comprising:
a word segmentation module, configured to select barrage A and barrage B to be displayed and segment the texts of barrage A and barrage B into words to obtain the identical lexical items of barrage A and barrage B and the minimum word frequency of each identical lexical item;
a proportion calculation module, configured to calculate the proportion of the identical lexical items in the texts of barrage A and barrage B;
a first text similarity calculation module, configured to calculate the text similarity of barrage A and barrage B based on identical-lexical-item word frequency;
a second text similarity calculation module, configured to map the texts of barrage A and barrage B to space vectors with a word2vec model and then calculate the text similarity of barrage A and barrage B in text space with the cosine similarity algorithm;
a final similarity calculation module, configured to weight the text similarity of barrage A and barrage B based on identical-lexical-item word frequency and the text similarity in text space to obtain the final similarity of barrage A and barrage B.
With reference to the fourth aspect, in a first possible implementation, the proportion calculation module calculates the proportion of the identical lexical items in the texts of barrage A and barrage B by the following formula:
$$P(A,B) = \frac{\sum_{i=1}^{m} n_i}{L_A + L_B}$$
where P(A, B) denotes the proportion of the identical lexical items in the texts of barrage A and barrage B, word_i denotes the i-th identical lexical item, n_i denotes the minimum word frequency of word_i, m denotes the number of identical lexical items, L_A denotes the text length of barrage A, and L_B denotes the text length of barrage B.
Compared with the prior art, the advantages of the present invention are as follows. When calculating the similarity between barrage texts, the barrage texts are first segmented into words, and the identical lexical items between the barrages and their minimum word frequencies are obtained from the segmentation. From the identical lexical items and the minimum word frequencies, the text similarity between the barrages based on identical-lexical-item word frequency is calculated, together with the text similarity in text space. Finally, the two similarities are each given a weight and combined, and the resulting value is taken as the final similarity between the barrage texts. Because identical lexical items are taken into account when the text similarity is calculated, the accuracy of the similarity calculated between barrage texts is effectively ensured.
Brief description of the drawings
Fig. 1 is a flowchart of a barrage text similarity calculation method in an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of an electronic device in an embodiment of the present invention.
Detailed description of the embodiments
An embodiment of the present invention provides a barrage text similarity calculation method that calculates similarity from the word frequency of lexical items in the barrage text, which effectively ensures the accuracy of the similarity calculated between barrages. Embodiments of the present invention also provide a corresponding storage medium, electronic device and barrage text similarity calculation system.
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those skilled in the art from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Referring to Fig. 1, an embodiment of the barrage text similarity calculation method provided by the present invention includes:
S1: Segment the texts of barrage A and barrage B into words to obtain the identical lexical items of barrage A and barrage B and the minimum word frequency of each identical lexical item.
In this embodiment, the barrage text can be segmented with common word segmentation software from the prior art, such as Jieba.
How the identical lexical items of barrage A and barrage B and their minimum word frequencies are obtained is illustrated below. Suppose the text of barrage A is "the streamer's play is awesome" and the text of barrage B is "this streamer is really awesome". After segmentation, barrage A becomes "streamer", "play" and "awesome", and barrage B becomes "this", "streamer", "really" and "awesome", so the identical lexical items of barrage A and barrage B are "streamer" and "awesome". "Streamer" occurs once in barrage A and once in barrage B, so the minimum word frequency of the identical lexical item "streamer" is 1; "awesome" occurs once in barrage A and once in barrage B, so the minimum word frequency of "awesome" is 1. Word frequency is the number of times an identical lexical item occurs in a given barrage, and the minimum word frequency of an identical lexical item is the smallest of its word frequencies across the barrages. For example, if the word frequency of an identical lexical item in barrage A is 2 and its word frequency in barrage B is 1, then its minimum word frequency is 1.
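Step S1 can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the token lists are assumed to come from a segmenter such as Jieba and are hard-coded here as English glosses of the example above, and the function name `shared_terms_min_freq` is hypothetical.

```python
from collections import Counter

def shared_terms_min_freq(tokens_a, tokens_b):
    """Return {term: minimum word frequency} for terms present in both token lists."""
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    # Counter intersection keeps each common term with the smaller of its two counts.
    return dict(freq_a & freq_b)

# Hard-coded stand-ins for segmenter output (assumption: real input would be Chinese tokens).
tokens_a = ["streamer", "play", "awesome"]
tokens_b = ["this", "streamer", "really", "awesome"]

shared = shared_terms_min_freq(tokens_a, tokens_b)
print(shared)  # {'streamer': 1, 'awesome': 1}
```

A term that occurs twice in one barrage and once in the other gets a minimum word frequency of 1, matching the definition in the text.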
S2: Calculate the proportion of the identical lexical items in the texts of barrage A and barrage B. That is, barrage A and barrage B are treated together as one whole, and the share of that whole taken up by the identical lexical items is calculated. This proportion reflects, to a degree, how related the texts are: the more identical lexical items there are and the larger their share of the whole, the more semantically related the barrage texts are.
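The proportion in S2 can be sketched as follows, assuming it is the sum of the minimum word frequencies divided by the combined text length, which is the arithmetic the worked example in this document uses, (1+1)/(4+5). The function name is hypothetical, and the lengths 4 and 5 are taken from that worked example because the text does not say how text length is measured.

```python
def shared_term_proportion(min_freqs, len_a, len_b):
    """Share of the combined text (len_a + len_b) covered by the identical lexical items."""
    total_length = len_a + len_b
    if total_length == 0:
        return 0.0  # two empty texts share nothing
    return sum(min_freqs.values()) / total_length

# Minimum word frequencies and text lengths as used in the document's worked example.
p_ab = shared_term_proportion({"streamer": 1, "awesome": 1}, 4, 5)
print(round(p_ab, 2))  # 0.22
```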
S3: Calculate the text similarity of barrage A and barrage B based on identical-lexical-item word frequency. This similarity reflects how representative the identical lexical items are of barrage A and barrage B: the higher the word frequency of an identical lexical item, the less representative it is, and the harder it becomes to judge the similarity between the two barrage texts.
S4: Map the texts of barrage A and barrage B to space vectors with a word2vec model, and then calculate the text similarity of barrage A and barrage B in text space with the cosine similarity algorithm.
In this embodiment, word2vec is a group of related models used to produce word vectors. These models are shallow, two-layer neural networks trained to reconstruct the linguistic contexts of words: the network takes a word as input and has to guess the words at adjacent positions. Under the bag-of-words assumption in word2vec, the order of the words is unimportant. After training, the word2vec model can be used to map each word to a vector that represents the relationships between words; this vector is the hidden layer of the neural network.
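The cosine step in S4 can be sketched without any external library. Two things here are assumptions rather than statements from the patent: the text does not say how per-word vectors are combined into one vector per barrage, so the sketch simply averages them, and the two-dimensional `toy_embeddings` table stands in for a trained word2vec model.

```python
import math

def mean_vector(tokens, embeddings):
    """Average the vectors of the tokens that have an embedding (assumed pooling choice)."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        raise ValueError("no token has an embedding")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy embedding table (hypothetical values, not output of a real word2vec model).
toy_embeddings = {"streamer": [1.0, 0.0], "awesome": [0.0, 1.0], "play": [1.0, 1.0]}
vec_a = mean_vector(["streamer", "play", "awesome"], toy_embeddings)
vec_b = mean_vector(["this", "streamer", "really", "awesome"], toy_embeddings)
print(cosine_similarity(vec_a, vec_b))
```

With a real model, the lookup table would be replaced by the model's word vectors; tokens missing from the vocabulary are skipped here, which is one of several reasonable choices.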
S5: Weight the text similarity of barrage A and barrage B based on identical-lexical-item word frequency and the text similarity in text space to obtain the final similarity of barrage A and barrage B.
In this embodiment, two text similarities are calculated for barrage A and barrage B: the text similarity based on identical-lexical-item word frequency and the text similarity in text space. Each is given a weight, the two are combined, and the resulting value is the final similarity of barrage A and barrage B.
In this embodiment, when the final similarity of barrage A and barrage B is greater than a set threshold, either barrage A or barrage B is selected and displayed on the live picture, and the other, unselected barrage is not displayed on the live picture.
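The display rule above is stated for a single pair of barrages. One way to apply it to a queue of pending barrages is a greedy pass that shows a barrage only if it is not too similar to any barrage already shown. The sketch below makes that extension explicit as an assumption; `filter_similar`, the threshold value 0.5 and the word-overlap `toy_similarity` are all hypothetical stand-ins, and the patent's final similarity would be passed in as the `similarity` argument.

```python
def filter_similar(barrages, similarity, threshold):
    """Keep a barrage only if its similarity to every already-kept barrage is <= threshold."""
    shown = []
    for text in barrages:
        if all(similarity(text, kept) <= threshold for kept in shown):
            shown.append(text)
    return shown

def toy_similarity(a, b):
    """Stand-in similarity (word-set overlap), NOT the similarity defined in this document."""
    set_a, set_b = set(a.split()), set(b.split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

queue = ["nice play", "nice play", "hello everyone"]
print(filter_similar(queue, toy_similarity, 0.5))  # ['nice play', 'hello everyone']
```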
In the barrage text similarity calculation method of this embodiment, when calculating the similarity between barrage texts, the barrage texts are first segmented into words, and the identical lexical items between the barrages and their minimum word frequencies are obtained from the segmentation. From the identical lexical items and the minimum word frequencies, the text similarity between the barrages based on identical-lexical-item word frequency is calculated, together with the text similarity in text space. Finally, the two similarities are each given a weight and combined, and the resulting value is taken as the final similarity between the barrage texts. Because identical lexical items are taken into account when the text similarity is calculated, the accuracy of the similarity calculated between barrage texts is effectively ensured.
Optionally, on the basis of the embodiment corresponding to Fig. 1, in a first optional embodiment of the barrage text similarity calculation method provided by the present invention, the proportion of the identical lexical items in the texts of barrage A and barrage B is calculated by the following formula:
$$P(A,B) = \frac{\sum_{i=1}^{m} n_i}{L_A + L_B}$$
where P(A, B) denotes the proportion of the identical lexical items in the texts of barrage A and barrage B, word_i denotes the i-th identical lexical item, n_i denotes the minimum word frequency of word_i, m denotes the number of identical lexical items, L_A denotes the text length of barrage A, and L_B denotes the text length of barrage B.
The text similarity of barrage A and barrage B based on identical-lexical-item word frequency is calculated by the following formula:
where Sim_tf(A, B) denotes the text similarity of barrage A and barrage B based on identical-lexical-item word frequency.
Optionally, on the basis of the first optional embodiment corresponding to Fig. 1, in a second optional embodiment of the barrage text similarity calculation method provided by the present invention, the text similarity of barrage A and barrage B in text space is calculated with the cosine similarity algorithm by the following formula:
$$\mathrm{Sim}_{word2vec}(A,B) = \frac{\sum_{i=1}^{p} a_i b_i}{\sqrt{\sum_{i=1}^{p} a_i^2}\,\sqrt{\sum_{i=1}^{p} b_i^2}}$$
where Sim_word2vec(A, B) denotes the similarity of barrage A and barrage B in text space, (a_1, ..., a_p) denotes the space vector obtained by mapping barrage A with the word2vec model, (b_1, ..., b_p) denotes the space vector obtained by mapping barrage B with the word2vec model, and p denotes the dimension of the vectors.
The text similarity of barrage A and barrage B based on identical-lexical-item word frequency and the text similarity in text space are weighted to obtain the final similarity of barrage A and barrage B by the following formula:
$$\mathrm{Sim}(A,B) = \lambda \cdot \mathrm{Sim}_{tf}(A,B) + (1-\lambda) \cdot \mathrm{Sim}_{word2vec}(A,B)$$
where Sim(A, B) denotes the final similarity of barrage A and barrage B, and λ is an adjustment coefficient with a value range of [0.6, 0.8] and a preferred value of 0.7. Because the text similarity based on identical-lexical-item word frequency is the more important of the two and better reflects the similarity between barrage texts, it is given the larger weight.
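The weighted combination is a one-line computation. The sketch below uses the preferred λ = 0.7 and rejects values outside the stated range [0.6, 0.8]; the function name and the choice to raise an error on out-of-range values are assumptions, and the two input similarities are the figures quoted in this document's worked example.

```python
def final_similarity(sim_tf, sim_word2vec, lam=0.7):
    """Weighted sum of the two similarities; lam is the adjustment coefficient."""
    if not 0.6 <= lam <= 0.8:
        raise ValueError("lam is expected to lie in [0.6, 0.8]")
    return lam * sim_tf + (1 - lam) * sim_word2vec

# Inputs taken from the worked example (0.201 and 0.68), combined with the preferred lam.
value = final_similarity(0.201, 0.68)
print(f"{value:.4f}")  # 0.3447
```

Note that the worked example in this document uses λ = 0.6 rather than 0.7, so its final figure differs from the one printed here.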
The complete calculation of the final similarity of barrage A and barrage B in this embodiment is illustrated below with an example.
Suppose the text of barrage A is "the streamer's play is awesome" and the text of barrage B is "this streamer is really awesome". The identical lexical items of barrage A and barrage B are then "streamer" and "awesome"; the minimum word frequency of "streamer" is 1 and the minimum word frequency of "awesome" is 1.
Calculate the proportion of the identical lexical items in the texts of barrage A and barrage B:
P(A, B) = (1 + 1) / (4 + 5) ≈ 0.22;
Calculate the text similarity of barrage A and barrage B based on identical-lexical-item word frequency:
Sim_tf(A, B) = log(1 + (2 − 0.22) / (2 + 1)) ≈ 0.201;
Calculate the text similarity of barrage A and barrage B in text space with the cosine similarity algorithm:
Sim_word2vec(A, B) = 0.68
Calculate the final similarity of barrage A and barrage B:
Sim(A, B) = λ · Sim_tf(A, B) + (1 − λ) · Sim_word2vec(A, B) = 0.6 × 0.201 + 0.4 × 0.68 ≈ 0.392
The final similarity of barrage A and barrage B is therefore 0.392.
Now suppose there is also a barrage C whose text is "China is proud", and calculate the similarity between barrage A and barrage C. Barrage A and barrage C have no identical lexical items, so the minimum word frequency is 0.
Calculate the proportion of the identical lexical items in the texts of barrage A and barrage C:
P(A, C) = 0 / (5 + 3) = 0;
Calculate the text similarity of barrage A and barrage C based on identical-lexical-item word frequency:
Sim_tf(A, C) = log(1 + 0) = 0;
Calculate the text similarity of barrage A and barrage C in text space with the cosine similarity algorithm:
Sim_word2vec(A, C) = 0.35
Calculate the final similarity of barrage A and barrage C:
Sim(A, C) = λ · Sim_tf(A, C) + (1 − λ) · Sim_word2vec(A, C) = 0.6 × 0 + 0.4 × 0.35 = 0.14
The final similarity of barrage A and barrage C is therefore 0.14. The similarity between barrage A and barrage B is clearly much greater than the similarity between barrage A and barrage C, which matches the facts and shows that the results of the barrage text similarity calculation method of this embodiment are highly accurate.
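The arithmetic of the two worked examples can be replayed in Python as a check. This sketch only reproduces the numbers as printed; it does not claim to be the general formula for Sim_tf. It assumes that "log" means the base-10 logarithm (the only common base that lands near 0.201) and takes the cosine similarities 0.68 and 0.35 as given.

```python
import math

LAMBDA = 0.6  # adjustment coefficient used in the worked examples

# Pair A/B: two identical lexical items, each with minimum word frequency 1.
p_ab = (1 + 1) / (4 + 5)
sim_tf_ab = math.log10(1 + (2 - p_ab) / (2 + 1))  # base 10 is an assumption
sim_ab = LAMBDA * sim_tf_ab + (1 - LAMBDA) * 0.68

# Pair A/C: no identical lexical items.
p_ac = 0 / (5 + 3)
sim_tf_ac = math.log10(1 + 0)
sim_ac = LAMBDA * sim_tf_ac + (1 - LAMBDA) * 0.35

print(f"Sim(A, B) = {sim_ab:.4f}")
print(f"Sim(A, C) = {sim_ac:.4f}")
print("A/B more similar than A/C:", sim_ab > sim_ac)
```

Carrying full precision through the A/B chain gives a value within a few thousandths of the printed 0.392, since the document rounds P(A, B) and Sim_tf(A, B) along the way; the A/C result matches 0.14.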
An embodiment of the storage medium provided by the present invention includes: a computer program is stored on the storage medium, and the computer program implements the following steps when executed by a processor:
segmenting the texts of barrage A and barrage B into words to obtain the identical lexical items of barrage A and barrage B and the minimum word frequency of each identical lexical item;
calculating the proportion of the identical lexical items in the texts of barrage A and barrage B;
calculating the text similarity of barrage A and barrage B based on identical-lexical-item word frequency;
mapping the texts of barrage A and barrage B to space vectors with a word2vec model, and then calculating the text similarity of barrage A and barrage B in text space with the cosine similarity algorithm;
weighting the text similarity of barrage A and barrage B based on identical-lexical-item word frequency and the text similarity in text space to obtain the final similarity of barrage A and barrage B.
Optionally, on the basis of the above storage medium embodiment, in a first optional embodiment of the storage medium provided by the present invention, the storage medium may be any combination of one or more computer-readable media. A computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination of the above. More specific examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection with one or more wires, a portable computer diskette, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In this document, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus or device.
Optionally, on the basis of the above storage medium embodiment and its first optional embodiment, in a second optional embodiment of the storage medium provided by the present invention, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate or transmit a program for use by or in connection with an instruction execution system, apparatus or device. Program code contained on a computer-readable medium may be transmitted over any suitable medium, including but not limited to wireless, wire, optical cable, RF, or any suitable combination of the above.
Optionally, on the basis of the above storage medium embodiment and its first and second optional embodiments, in a third optional embodiment of the storage medium provided by the present invention, the computer program code for carrying out the operations of the present invention may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
Referring to Fig. 2, an embodiment of the electronic device provided by the present invention includes:
a word segmentation unit, configured to select barrage A and barrage B to be displayed and segment the texts of barrage A and barrage B into words to obtain the identical lexical items of barrage A and barrage B and the minimum word frequency of each identical lexical item;
a proportion calculation unit, configured to calculate the proportion of the identical lexical items in the texts of barrage A and barrage B;
a first text similarity calculation unit, configured to calculate the text similarity of barrage A and barrage B based on identical-lexical-item word frequency;
a second text similarity calculation unit, configured to map the texts of barrage A and barrage B to space vectors with a word2vec model and then calculate the text similarity of barrage A and barrage B in text space with the cosine similarity algorithm;
a final similarity calculation unit, configured to weight the text similarity of barrage A and barrage B based on identical-lexical-item word frequency and the text similarity in text space to obtain the final similarity of barrage A and barrage B.
An embodiment of a barrage text similarity calculation system provided by the embodiments of the present invention includes:
A word segmentation module, configured to select barrage A and barrage B to be displayed, segment the text of barrage A and barrage B, and obtain the identical lexical items of barrage A and barrage B and the minimum word frequency of each identical lexical item;
A ratio calculation module, configured to calculate the proportion of the identical lexical items in the text of barrage A and barrage B;
A first text similarity calculation module, configured to calculate the text similarity of barrage A and barrage B based on the word frequency of the identical lexical items;
A second text similarity calculation module, configured to map the text of barrage A and barrage B to space vectors through a word2vec model, and then calculate the text similarity of barrage A and barrage B in the text space based on the cosine similarity algorithm;
A final similarity calculation module, configured to perform a weighted calculation on the text similarity of barrage A and barrage B based on the word frequency of the identical lexical items and on their text similarity in the text space, to obtain the final similarity of barrage A and barrage B.
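The vector-space step in the list above can be sketched in Python as follows. The text does not say how the word vectors of one barrage are combined into a single space vector, so this sketch assumes simple averaging. A tiny hand-written table, `TOY_VECTORS`, stands in for a trained word2vec model. Both choices, and all names in the code, are illustrative assumptions.

```python
import math

# Hypothetical 3-dimensional embeddings standing in for a word2vec model.
TOY_VECTORS = {
    "good": [1.0, 0.0, 1.0],
    "great": [0.9, 0.1, 1.0],
    "bad": [-1.0, 0.0, 0.5],
}


def text_vector(tokens, vectors):
    """Average the vectors of known tokens (assumed pooling); None if none known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine_similarity(u, v):
    """Standard cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


vec_a = text_vector(["good"], TOY_VECTORS)
vec_b = text_vector(["great"], TOY_VECTORS)
vec_c = text_vector(["bad"], TOY_VECTORS)
print(cosine_similarity(vec_a, vec_b) > cosine_similarity(vec_a, vec_c))
```

With a real model, the lookup table would be replaced by the trained embeddings; the cosine computation itself stays the same.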
Optionally, on the basis of the embodiment corresponding to the above barrage text similarity calculation system, in a first optional embodiment of the barrage text similarity calculation system provided by the embodiments of the present invention, the ratio calculation module calculates the proportion of the identical lexical items in the text of barrage A and barrage B, with the calculation formula as follows:
Wherein $P(A, B)$ denotes the proportion of the identical lexical items in the text of barrage A and barrage B, $word_i$ denotes an identical lexical item, $n_i$ denotes the minimum word frequency of the identical lexical item, $m$ denotes the number of identical lexical items, $L_A$ denotes the text length of barrage A, and $L_B$ denotes the text length of barrage B.
When calculating the similarity between barrage texts, the barrage text similarity calculation system of the embodiments of the present invention first segments the barrage texts, and then obtains, from the segmentation result, the identical lexical items between the barrages and the minimum word frequency of each identical lexical item. From the identical lexical items and their minimum word frequencies, it calculates the text similarity between the barrages based on the word frequency of the identical lexical items, and it also calculates their text similarity in the text space. Finally, it assigns a weight to each of the two similarities and combines them, taking the resulting value as the final similarity between the barrage texts. Because identical lexical items are taken into account when the text similarity is calculated, the accuracy of the calculated similarity between barrage texts is effectively ensured.
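The final weighting step described above can be sketched in Python as below. The combination λ·Sim_tf + (1 − λ)·Sim_word2vec and the range [0.6, 0.8] for λ are taken from claim 5. The function name, the default value 0.7, and the decision to reject values outside that range are assumptions made for illustration.

```python
def final_similarity(sim_tf, sim_word2vec, lam=0.7):
    """Weighted combination of the two similarity scores.

    sim_tf:        similarity based on the word frequency of identical terms
    sim_word2vec:  cosine similarity in the word2vec text space
    lam:           adjustment coefficient; claim 5 gives the range [0.6, 0.8]
    """
    if not 0.6 <= lam <= 0.8:
        # Enforcing the claimed range is a choice of this sketch.
        raise ValueError("lam is expected to lie in [0.6, 0.8]")
    return lam * sim_tf + (1 - lam) * sim_word2vec


print(final_similarity(0.5, 0.9, lam=0.7))
```

Because λ is at least 0.6, the term-frequency score always carries the larger weight, which matches the emphasis the text places on identical lexical items.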
The present invention is not limited to the above embodiments. For those of ordinary skill in the art, several improvements and modifications may be made without departing from the principle of the present invention, and these improvements and modifications are also regarded as falling within the protection scope of the present invention. Content not described in detail in this specification belongs to the prior art well known to those skilled in the art.
Claims (10)
1. A barrage text similarity calculation method, characterized by comprising the following steps:
segmenting the text of barrage A and barrage B to obtain the identical lexical items of barrage A and barrage B and the minimum word frequency of each identical lexical item;
calculating the proportion of the identical lexical items in the text of barrage A and barrage B;
calculating the text similarity of barrage A and barrage B based on the word frequency of the identical lexical items;
mapping the text of barrage A and barrage B to space vectors through a word2vec model, and then calculating the text similarity of barrage A and barrage B in the text space based on the cosine similarity algorithm;
performing a weighted calculation on the text similarity of barrage A and barrage B based on the word frequency of the identical lexical items and on their text similarity in the text space, to obtain the final similarity of barrage A and barrage B.
2. The barrage text similarity calculation method according to claim 1, characterized in that the proportion of the identical lexical items in the text of barrage A and barrage B is calculated by the following formula:
Wherein $P(A, B)$ denotes the proportion of the identical lexical items in the text of barrage A and barrage B, $word_i$ denotes an identical lexical item, $n_i$ denotes the minimum word frequency of the identical lexical item, $m$ denotes the number of identical lexical items, $L_A$ denotes the text length of barrage A, and $L_B$ denotes the text length of barrage B.
3. The barrage text similarity calculation method according to claim 2, characterized in that the text similarity of barrage A and barrage B based on the word frequency of the identical lexical items is calculated by the following formula:
Wherein $Sim_{tf}(A, B)$ denotes the text similarity of barrage A and barrage B based on the word frequency of the identical lexical items.
4. The barrage text similarity calculation method according to claim 3, characterized in that the text similarity of barrage A and barrage B in the text space is calculated based on the cosine similarity algorithm by the following formula:
Wherein $Sim_{word2vec}(A, B)$ denotes the similarity of barrage A and barrage B in the text space, the two vector symbols in the formula denote the space vectors obtained by mapping barrage A and barrage B, respectively, through the word2vec model, and $p$ denotes the dimension of these vectors.
5. The barrage text similarity calculation method according to claim 4, characterized in that the weighted calculation on the text similarity of barrage A and barrage B based on the word frequency of the identical lexical items and on their text similarity in the text space, which obtains the final similarity of barrage A and barrage B, uses the following formula:
$$Sim(A, B) = \lambda \cdot Sim_{tf}(A, B) + (1 - \lambda) \cdot Sim_{word2vec}(A, B)$$
Wherein $Sim(A, B)$ denotes the final similarity of barrage A and barrage B, and $\lambda$ is an adjustment coefficient whose value range is [0.6, 0.8].
6. The barrage text similarity calculation method according to claim 1, characterized in that when the final similarity of barrage A and barrage B is greater than a set threshold, either one of barrage A and barrage B is selected and displayed on the live-stream picture, and the other, unselected barrage is not displayed on the live-stream picture.
7. A storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the following steps:
segmenting the text of barrage A and barrage B to obtain the identical lexical items of barrage A and barrage B and the minimum word frequency of each identical lexical item;
calculating the proportion of the identical lexical items in the text of barrage A and barrage B;
calculating the text similarity of barrage A and barrage B based on the word frequency of the identical lexical items;
mapping the text of barrage A and barrage B to space vectors through a word2vec model, and then calculating the text similarity of barrage A and barrage B in the text space based on the cosine similarity algorithm;
performing a weighted calculation on the text similarity of barrage A and barrage B based on the word frequency of the identical lexical items and on their text similarity in the text space, to obtain the final similarity of barrage A and barrage B.
8. An electronic device, characterized in that the electronic device comprises:
a word segmentation unit, configured to select barrage A and barrage B to be displayed, segment the text of barrage A and barrage B, and obtain the identical lexical items of barrage A and barrage B and the minimum word frequency of each identical lexical item;
a ratio calculation unit, configured to calculate the proportion of the identical lexical items in the text of barrage A and barrage B;
a first text similarity calculation unit, configured to calculate the text similarity of barrage A and barrage B based on the word frequency of the identical lexical items;
a second text similarity calculation unit, configured to map the text of barrage A and barrage B to space vectors through a word2vec model, and then calculate the text similarity of barrage A and barrage B in the text space based on the cosine similarity algorithm;
a final similarity calculation unit, configured to perform a weighted calculation on the text similarity of barrage A and barrage B based on the word frequency of the identical lexical items and on their text similarity in the text space, to obtain the final similarity of barrage A and barrage B.
9. A barrage text similarity calculation system, characterized by comprising:
a word segmentation module, configured to select barrage A and barrage B to be displayed, segment the text of barrage A and barrage B, and obtain the identical lexical items of barrage A and barrage B and the minimum word frequency of each identical lexical item;
a ratio calculation module, configured to calculate the proportion of the identical lexical items in the text of barrage A and barrage B;
a first text similarity calculation module, configured to calculate the text similarity of barrage A and barrage B based on the word frequency of the identical lexical items;
a second text similarity calculation module, configured to map the text of barrage A and barrage B to space vectors through a word2vec model, and then calculate the text similarity of barrage A and barrage B in the text space based on the cosine similarity algorithm;
a final similarity calculation module, configured to perform a weighted calculation on the text similarity of barrage A and barrage B based on the word frequency of the identical lexical items and on their text similarity in the text space, to obtain the final similarity of barrage A and barrage B.
10. The barrage text similarity calculation system according to claim 9, characterized in that the ratio calculation module calculates the proportion of the identical lexical items in the text of barrage A and barrage B by the following formula:
Wherein $P(A, B)$ denotes the proportion of the identical lexical items in the text of barrage A and barrage B, $word_i$ denotes an identical lexical item, $n_i$ denotes the minimum word frequency of the identical lexical item, $m$ denotes the number of identical lexical items, $L_A$ denotes the text length of barrage A, and $L_B$ denotes the text length of barrage B.
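The display rule of claim 6 (when two barrages are more similar than a set threshold, only one of them is shown) can be sketched in Python as follows. Extending the pairwise rule to a whole queue by greedily keeping the first of each near-duplicate group is an assumption of this sketch, as are the names `filter_duplicates` and `toy_similarity`. The toy score is a plain token-overlap ratio standing in for the final similarity of the claims.

```python
def filter_duplicates(barrages, similarity, threshold):
    """Keep a barrage only if it is not too similar to any barrage already kept."""
    shown = []
    for text in barrages:
        # Suppress the barrage if its similarity to a shown one exceeds the threshold.
        if all(similarity(text, kept) <= threshold for kept in shown):
            shown.append(text)
    return shown


def toy_similarity(a, b):
    """Token-overlap ratio; a stand-in for the final similarity Sim(A, B)."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


queue = ["nice play", "nice play !", "hello everyone"]
print(filter_duplicates(queue, toy_similarity, threshold=0.6))
```

Any function returning a score for two texts can be passed as `similarity`, so the weighted combination from claim 5 could be plugged in without changing the filter.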
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811459848.5A CN109858012B (en) | 2018-11-30 | 2018-11-30 | Barrage text similarity calculation method, storage medium, equipment and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109858012A (en) | 2019-06-07 |
CN109858012B (en) | 2023-11-28 |
Family
ID=66890555
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811459848.5A Active CN109858012B (en) | 2018-11-30 | 2018-11-30 | Barrage text similarity calculation method, storage medium, equipment and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109858012B (en) |
Also Published As
Publication number | Publication date |
---|---|
CN109858012B (en) | 2023-11-28 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
TA01 | Transfer of patent application right | Effective date of registration: 20231027. Address after: Room 7-801, Aokai City Plaza, No. 1777 Zhonghui Avenue, Huishan District, Wuxi City, Jiangsu Province, 214000 (Huishan Station Area of Urban Railway). Applicant after: Kasima Huizhi (Wuxi) Technology Co.,Ltd. Address before: 430000 East Lake Development Zone, Wuhan City, Hubei Province, No. 1 Software Park East Road 4.1 Phase B1 Building 11 Building. Applicant before: WUHAN DOUYU NETWORK TECHNOLOGY Co.,Ltd. |
GR01 | Patent grant | |