CN109858012A - Barrage Text similarity computing method, storage medium, equipment and system - Google Patents

Barrage Text similarity computing method, storage medium, equipment and system

Publication number
CN109858012A
Authority
CN
China
Prior art keywords
barrage
text
similarity
lexical item
identical lexical
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811459848.5A
Other languages
Chinese (zh)
Other versions
CN109858012B (en
Inventor
徐乐乐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kasima Huizhi Wuxi Technology Co ltd
Original Assignee
Wuhan Douyu Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Douyu Network Technology Co Ltd
Priority to CN201811459848.5A
Publication of CN109858012A
Application granted
Publication of CN109858012B
Legal status: Active
Anticipated expiration

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a barrage text similarity calculation method, storage medium, device and system, and relates to the field of big data processing. The method comprises: segmenting the texts of barrage A and barrage B into words to obtain the identical lexical items of barrage A and barrage B and the minimum word frequency of each identical lexical item; calculating the proportion of the identical lexical items in the texts of barrage A and barrage B; calculating the text similarity of barrage A and barrage B based on identical lexical item word frequency; mapping the texts of barrage A and barrage B to space vectors with a word2vec model, and then calculating the text similarity of barrage A and barrage B in text space with the cosine similarity algorithm; and weighting the text similarity based on identical lexical item word frequency and the text similarity in text space to obtain the final similarity of barrage A and barrage B. The invention can effectively ensure the accuracy of the similarity calculated between barrage texts.

Description

Barrage Text similarity computing method, storage medium, equipment and system
Technical field
The present invention relates to the field of big data processing, and in particular to a barrage text similarity calculation method, storage medium, device and system.
Background
With the rapid development of the mobile Internet, the live streaming industry is also booming, and more and more young people like to spend their free time watching live streams.
While watching a live stream, a user can interact with the streamer or with other users by sending barrage text. In the live room of a popular streamer, however, the large number of users makes the volume of barrages enormous, and if every barrage sent by users were displayed, barrages would cover the entire live picture. To protect the viewing experience, the live streaming platform calculates the similarity between barrage texts, and if two barrages are highly similar, only one of the two is displayed. Existing methods for calculating barrage similarity include cosine similarity and Euclidean distance, but these algorithms usually consider only the distance between the two barrages in space, so the similarity calculated between barrages is not accurate enough.
Summary of the invention
In view of the deficiencies in the prior art, the purpose of the present invention is to provide a barrage text similarity calculation method, storage medium, device and system that can effectively ensure the accuracy of the similarity calculated between barrage texts.
A first aspect of the present invention provides a barrage text similarity calculation method, comprising the following steps:
segmenting the texts of barrage A and barrage B into words to obtain the identical lexical items of barrage A and barrage B and the minimum word frequency of each identical lexical item;
calculating the proportion of the identical lexical items in the texts of barrage A and barrage B;
calculating the text similarity of barrage A and barrage B based on identical lexical item word frequency;
mapping the texts of barrage A and barrage B to space vectors with a word2vec model, and then calculating the text similarity of barrage A and barrage B in text space with the cosine similarity algorithm;
weighting the text similarity of barrage A and barrage B based on identical lexical item word frequency and their text similarity in text space to obtain the final similarity of barrage A and barrage B.
With reference to the first aspect, in a first possible implementation, the proportion of the identical lexical items in the texts of barrage A and barrage B is calculated as:
$$P(A,B) = \frac{\sum_{i=1}^{m} n_i}{L_A + L_B}$$
where $P(A,B)$ denotes the proportion of the identical lexical items in the texts of barrage A and barrage B, $word_i$ denotes the $i$-th identical lexical item, $n_i$ denotes the minimum word frequency of $word_i$, $m$ denotes the number of identical lexical items, $L_A$ denotes the text length of barrage A, and $L_B$ denotes the text length of barrage B.
With reference to the first possible implementation of the first aspect, in a second possible implementation, the text similarity of barrage A and barrage B based on identical lexical item word frequency is calculated as:
where $Sim_{tf}(A,B)$ denotes the text similarity of barrage A and barrage B based on identical lexical item word frequency.
With reference to the second possible implementation of the first aspect, in a third possible implementation, the text similarity of barrage A and barrage B in text space is calculated with the cosine similarity algorithm as:
$$Sim_{word2vec}(A,B) = \frac{\sum_{i=1}^{p} a_i b_i}{\sqrt{\sum_{i=1}^{p} a_i^2}\,\sqrt{\sum_{i=1}^{p} b_i^2}}$$
where $Sim_{word2vec}(A,B)$ denotes the similarity of barrage A and barrage B in text space, $\vec{a} = (a_1, \dots, a_p)$ denotes the space vector obtained by mapping barrage A with the word2vec model, $\vec{b} = (b_1, \dots, b_p)$ denotes the space vector obtained by mapping barrage B with the word2vec model, and $p$ denotes the dimension of the vectors.
With reference to the third possible implementation of the first aspect, in a fourth possible implementation, the text similarity of barrage A and barrage B based on identical lexical item word frequency and their text similarity in text space are weighted to obtain the final similarity of barrage A and barrage B, calculated as:
$$Sim(A,B) = \lambda \cdot Sim_{tf}(A,B) + (1-\lambda) \cdot Sim_{word2vec}(A,B)$$
where $Sim(A,B)$ denotes the final similarity of barrage A and barrage B, and $\lambda$ is an adjustment coefficient with a value range of [0.6, 0.8].
With reference to the third possible implementation of the first aspect, in a fifth possible implementation, when the final similarity of barrage A and barrage B is greater than a set threshold, either barrage A or barrage B is selected for display on the live picture, and the unselected barrage is not displayed on the live picture.
A second aspect of the present invention provides a storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the following steps:
segmenting the texts of barrage A and barrage B into words to obtain the identical lexical items of barrage A and barrage B and the minimum word frequency of each identical lexical item;
calculating the proportion of the identical lexical items in the texts of barrage A and barrage B;
calculating the text similarity of barrage A and barrage B based on identical lexical item word frequency;
mapping the texts of barrage A and barrage B to space vectors with a word2vec model, and then calculating the text similarity of barrage A and barrage B in text space with the cosine similarity algorithm;
weighting the text similarity of barrage A and barrage B based on identical lexical item word frequency and their text similarity in text space to obtain the final similarity of barrage A and barrage B.
A third aspect of the present invention provides an electronic device, the electronic device comprising:
a word segmentation unit, configured to select barrage A and barrage B to be displayed and segment the texts of barrage A and barrage B into words to obtain the identical lexical items of barrage A and barrage B and the minimum word frequency of each identical lexical item;
a proportion calculation unit, configured to calculate the proportion of the identical lexical items in the texts of barrage A and barrage B;
a first text similarity calculation unit, configured to calculate the text similarity of barrage A and barrage B based on identical lexical item word frequency;
a second text similarity calculation unit, configured to map the texts of barrage A and barrage B to space vectors with a word2vec model, and then calculate the text similarity of barrage A and barrage B in text space with the cosine similarity algorithm;
a final similarity calculation unit, configured to weight the text similarity of barrage A and barrage B based on identical lexical item word frequency and their text similarity in text space to obtain the final similarity of barrage A and barrage B.
A fourth aspect of the present invention provides a barrage text similarity calculation system, comprising:
a word segmentation module, configured to select barrage A and barrage B to be displayed and segment the texts of barrage A and barrage B into words to obtain the identical lexical items of barrage A and barrage B and the minimum word frequency of each identical lexical item;
a proportion calculation module, configured to calculate the proportion of the identical lexical items in the texts of barrage A and barrage B;
a first text similarity calculation module, configured to calculate the text similarity of barrage A and barrage B based on identical lexical item word frequency;
a second text similarity calculation module, configured to map the texts of barrage A and barrage B to space vectors with a word2vec model, and then calculate the text similarity of barrage A and barrage B in text space with the cosine similarity algorithm;
a final similarity calculation module, configured to weight the text similarity of barrage A and barrage B based on identical lexical item word frequency and their text similarity in text space to obtain the final similarity of barrage A and barrage B.
With reference to the fourth aspect, in a first possible implementation, the proportion calculation module calculates the proportion of the identical lexical items in the texts of barrage A and barrage B as:
$$P(A,B) = \frac{\sum_{i=1}^{m} n_i}{L_A + L_B}$$
where $P(A,B)$ denotes the proportion of the identical lexical items in the texts of barrage A and barrage B, $word_i$ denotes the $i$-th identical lexical item, $n_i$ denotes the minimum word frequency of $word_i$, $m$ denotes the number of identical lexical items, $L_A$ denotes the text length of barrage A, and $L_B$ denotes the text length of barrage B.
Compared with the prior art, the advantages of the present invention are as follows: when calculating the similarity between barrage texts, the barrage texts are first segmented into words, and the identical lexical items between the barrages and the minimum word frequency of each identical lexical item are obtained from the segmentation. From the identical lexical items and their minimum word frequencies, the text similarity between the barrages based on identical lexical item word frequency is calculated, along with the text similarity in text space. Finally, the two similarities are each assigned a weight and combined, and the resulting value is taken as the final similarity between the barrage texts. Because identical lexical items are taken into account when calculating text similarity, the accuracy of the similarity calculated between barrage texts is effectively ensured.
Brief description of the drawings
Fig. 1 is a flowchart of a barrage text similarity calculation method in an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of an electronic device in an embodiment of the present invention.
Detailed description of the embodiments
An embodiment of the present invention provides a barrage text similarity calculation method that calculates similarity based on the word frequency of the lexical items in barrage text, effectively ensuring the accuracy of similarity calculation between barrages. Embodiments of the present invention also correspondingly provide a storage medium, an electronic device and a barrage text similarity calculation system.
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
Referring to Fig. 1, an embodiment of the barrage text similarity calculation method provided by the embodiments of the present invention includes:
S1: Segment the texts of barrage A and barrage B into words to obtain the identical lexical items of barrage A and barrage B and the minimum word frequency of each identical lexical item.
In the embodiment of the present invention, barrage text can be segmented with common word segmentation software from the prior art, such as Jieba.
How the identical lexical items of barrage A and barrage B and their minimum word frequencies are obtained is illustrated with the following example. Assume the text of barrage A is "streamer play awesome" and the text of barrage B is "this streamer really awesome". After segmentation, barrage A yields "streamer", "play" and "awesome", and barrage B yields "this", "streamer", "really" and "awesome", so the identical lexical items of barrage A and barrage B are "streamer" and "awesome". "streamer" occurs 1 time in barrage A and 1 time in barrage B, so the minimum word frequency of the identical lexical item "streamer" is 1; "awesome" occurs 1 time in barrage A and 1 time in barrage B, so the minimum word frequency of "awesome" is 1. Word frequency is the number of times an identical lexical item occurs in a given barrage, and the smallest of its word frequencies across the barrages is taken as the minimum word frequency of that identical lexical item. For example, if an identical lexical item has a word frequency of 2 in barrage A and a word frequency of 1 in barrage B, its minimum word frequency is 1.
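As a minimal sketch of step S1, and assuming the barrage texts have already been segmented (for example by Jieba), the identical lexical items and their minimum word frequencies can be obtained with a multiset intersection. The function name and the English tokens, which stand in for the segmented example above, are illustrative and do not come from the patent.

```python
from collections import Counter


def shared_terms_min_freq(tokens_a, tokens_b):
    """Map each term found in both token lists to its minimum word frequency."""
    freq_a = Counter(tokens_a)
    freq_b = Counter(tokens_b)
    # Counter intersection keeps only common keys, each with the smaller count.
    return dict(freq_a & freq_b)


# Hypothetical pre-segmented inputs standing in for barrage A and barrage B.
tokens_a = ["streamer", "play", "awesome"]
tokens_b = ["this", "streamer", "really", "awesome"]

shared = shared_terms_min_freq(tokens_a, tokens_b)
print(shared)

# A term that occurs twice in one barrage and once in the other gets frequency 1.
repeated = shared_terms_min_freq(["awesome", "awesome"], ["awesome"])
print(repeated)
```

Using `Counter` keeps the minimum-frequency rule in a single expression; a plain dictionary loop over the smaller token list would work equally well.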
S2: Calculate the proportion of the identical lexical items in the texts of barrage A and barrage B. That is, barrage A and barrage B are treated as one whole, and the proportion of that whole taken up by the identical lexical items is calculated. This proportion reflects, to some degree, how related the texts are: the more identical lexical items there are and the higher their proportion of the whole, the more semantically related the barrage texts are.
S3: Calculate the text similarity of barrage A and barrage B based on identical lexical item word frequency. This similarity reflects how representative the identical lexical items are of barrage A and barrage B: the higher the word frequency of the identical lexical items, the less representative they are, and the harder it is to judge the similarity between the two barrage texts.
S4: Map the texts of barrage A and barrage B to space vectors with a word2vec model, and then calculate the text similarity of barrage A and barrage B in text space with the cosine similarity algorithm.
In the embodiment of the present invention, word2vec is a group of related models used to produce word vectors. These models are shallow, two-layer neural networks trained to reconstruct the linguistic contexts of words: the network takes a word as input and must guess the words in adjacent positions. Under the bag-of-words assumption in word2vec, the order of words is unimportant. After training, the word2vec model can be used to map each word to a vector that represents the relationships between words; this vector is the hidden layer of the neural network.
S5: Weight the text similarity of barrage A and barrage B based on identical lexical item word frequency and their text similarity in text space to obtain the final similarity of barrage A and barrage B.
In the embodiment of the present invention, two text similarities are calculated for barrage A and barrage B: the text similarity based on identical lexical item word frequency and the text similarity in text space. Each is assigned a weight, and the weighted result is taken as the final similarity of barrage A and barrage B.
In the embodiment of the present invention, when the final similarity of barrage A and barrage B is greater than a set threshold, either barrage A or barrage B is selected for display on the live picture, and the unselected barrage is not displayed on the live picture.
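The display rule above is stated for a single pair of barrages. One way it might be extended to a stream of barrages is sketched below, under the assumption that a barrage is shown only if its similarity to every barrage already shown does not exceed the threshold. The function name, the `similarity` callable, the exact-match stand-in and the threshold value 0.5 are all hypothetical; the patent does not specify them.

```python
def select_for_display(barrages, similarity, threshold):
    """Greedily keep barrages that are not too similar to any already kept."""
    shown = []
    for candidate in barrages:
        # Drop the candidate if it exceeds the threshold against any kept barrage.
        if all(similarity(candidate, kept) <= threshold for kept in shown):
            shown.append(candidate)
    return shown


def exact_match(a, b):
    """Toy stand-in for the final similarity: 1.0 for identical text, else 0.0."""
    return 1.0 if a == b else 0.0


incoming = ["awesome play", "awesome play", "hello everyone"]
shown = select_for_display(incoming, exact_match, threshold=0.5)
print(shown)
```

This greedy pass compares each incoming barrage against everything already shown, which is quadratic in the worst case; a real platform would probably bound the comparison window.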
In the barrage text similarity calculation method of the embodiment of the present invention, when calculating the similarity between barrage texts, the barrage texts are first segmented into words, and the identical lexical items between the barrages and the minimum word frequency of each identical lexical item are obtained from the segmentation. From the identical lexical items and their minimum word frequencies, the text similarity between the barrages based on identical lexical item word frequency is calculated, along with the text similarity in text space. Finally, the two similarities are each assigned a weight and combined, and the resulting value is taken as the final similarity between the barrage texts. Because identical lexical items are taken into account when calculating text similarity, the accuracy of the similarity calculated between barrage texts is effectively ensured.
Optionally, on the basis of the embodiment corresponding to Fig. 1, in a first optional embodiment of the barrage text similarity calculation method provided by the embodiments of the present invention, the proportion of the identical lexical items in the texts of barrage A and barrage B is calculated as:
$$P(A,B) = \frac{\sum_{i=1}^{m} n_i}{L_A + L_B}$$
where $P(A,B)$ denotes the proportion of the identical lexical items in the texts of barrage A and barrage B, $word_i$ denotes the $i$-th identical lexical item, $n_i$ denotes the minimum word frequency of $word_i$, $m$ denotes the number of identical lexical items, $L_A$ denotes the text length of barrage A, and $L_B$ denotes the text length of barrage B.
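The proportion can be sketched in a few lines, assuming it is the sum of the minimum word frequencies divided by the combined text length, which is what the worked example later in this description, (1 + 1)/(4 + 5), suggests. The function name is illustrative, and because the description does not say how text length is measured, the lengths are passed in directly.

```python
def shared_term_proportion(min_freqs, len_a, len_b):
    """Sum of minimum word frequencies over the combined length of both texts."""
    total_length = len_a + len_b
    if total_length == 0:
        # Two empty texts share nothing; avoid dividing by zero.
        return 0.0
    return sum(min_freqs.values()) / total_length


# Values taken from the worked example: two shared terms, lengths 4 and 5.
p_ab = shared_term_proportion({"streamer": 1, "awesome": 1}, 4, 5)
print(round(p_ab, 2))

# No shared terms, lengths 5 and 3, as in the barrage A and barrage C example.
p_ac = shared_term_proportion({}, 5, 3)
print(p_ac)
```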
The text similarity of barrage A and barrage B based on identical lexical item word frequency is calculated as:
where $Sim_{tf}(A,B)$ denotes the text similarity of barrage A and barrage B based on identical lexical item word frequency.
Optionally, on the basis of the first optional embodiment corresponding to Fig. 1, in a second optional embodiment of the barrage text similarity calculation method provided by the embodiments of the present invention, the text similarity of barrage A and barrage B in text space is calculated with the cosine similarity algorithm as:
$$Sim_{word2vec}(A,B) = \frac{\sum_{i=1}^{p} a_i b_i}{\sqrt{\sum_{i=1}^{p} a_i^2}\,\sqrt{\sum_{i=1}^{p} b_i^2}}$$
where $Sim_{word2vec}(A,B)$ denotes the similarity of barrage A and barrage B in text space, $\vec{a} = (a_1, \dots, a_p)$ denotes the space vector obtained by mapping barrage A with the word2vec model, $\vec{b} = (b_1, \dots, b_p)$ denotes the space vector obtained by mapping barrage B with the word2vec model, and $p$ denotes the dimension of the vectors.
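The text-space similarity can be illustrated with plain Python lists. The cosine function below is the standard definition; how a whole barrage is turned into one vector is not stated in the description, so averaging the word vectors is an assumption made here, and the two-dimensional embedding table is a made-up stand-in for a trained word2vec model.

```python
import math


def mean_vector(tokens, embeddings):
    """Average the vectors of known tokens (assumed pooling, not from the patent)."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings; a real system would look these up in word2vec.
toy_embeddings = {
    "streamer": [1.0, 0.0],
    "play": [1.0, 0.0],
    "awesome": [0.0, 1.0],
}

vec_a = mean_vector(["streamer", "play", "awesome"], toy_embeddings)
vec_b = mean_vector(["this", "streamer", "really", "awesome"], toy_embeddings)
sim_w2v = cosine_similarity(vec_a, vec_b)
print(sim_w2v)
```

Tokens missing from the embedding table are skipped, which mirrors how out-of-vocabulary words are commonly handled; other pooling choices, such as a frequency-weighted average, would fit the same interface.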
The text similarity of barrage A and barrage B based on identical lexical item word frequency and their text similarity in text space are weighted to obtain the final similarity of barrage A and barrage B, calculated as:
$$Sim(A,B) = \lambda \cdot Sim_{tf}(A,B) + (1-\lambda) \cdot Sim_{word2vec}(A,B)$$
where $Sim(A,B)$ denotes the final similarity of barrage A and barrage B, and $\lambda$ is an adjustment coefficient with a value range of [0.6, 0.8] and a preferred value of 0.7. The text similarity based on identical lexical item word frequency is the more important of the two and better reflects the similarity between barrage texts, so it is given the larger weight.
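The weighted combination is simple enough to sketch directly. The function name is illustrative, and rejecting values of λ outside [0.6, 0.8] is a design choice made here to mirror the stated range rather than something the description requires. The input values 0.201, 0.68 and 0.35 are taken from the worked example in this description.

```python
def final_similarity(sim_tf, sim_w2v, lam=0.7):
    """Weighted sum of the word-frequency and text-space similarities."""
    if not 0.6 <= lam <= 0.8:
        raise ValueError("lam is expected to lie in [0.6, 0.8]")
    return lam * sim_tf + (1.0 - lam) * sim_w2v


# Barrage A versus barrage B, with lam = 0.6 as in the worked example.
sim_ab = final_similarity(0.201, 0.68, lam=0.6)

# Barrage A versus barrage C, which share no lexical items.
sim_ac = final_similarity(0.0, 0.35, lam=0.6)

print(sim_ab, sim_ac)
```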
The entire process of calculating the final similarity of barrage A and barrage B in the embodiment of the present invention is illustrated below with an example.
Assume the text of barrage A is "streamer play awesome" and the text of barrage B is "this streamer really awesome". The identical lexical items of barrage A and barrage B are then "streamer" and "awesome"; the minimum word frequency of "streamer" is 1, and the minimum word frequency of "awesome" is 1.
Calculate the proportion of the identical lexical items in the texts of barrage A and barrage B:
$P(A,B) = (1+1)/(4+5) \approx 0.22$;
Calculate the text similarity of barrage A and barrage B based on identical lexical item word frequency:
$Sim_{tf}(A,B) = \log(1 + (2-0.22)/(2+1)) = 0.201$;
Calculate the text similarity of barrage A and barrage B in text space with the cosine similarity algorithm:
$Sim_{word2vec}(A,B) = 0.68$
Calculate the final similarity of barrage A and barrage B:
$Sim(A,B) = \lambda \cdot Sim_{tf}(A,B) + (1-\lambda) \cdot Sim_{word2vec}(A,B) = 0.6 \times 0.201 + 0.4 \times 0.68 \approx 0.392$
The final similarity of barrage A and barrage B is therefore 0.392.
Now assume there is also a barrage C whose text is "China is proud", and calculate the similarity between barrage A and barrage C. Barrage A and barrage C have no identical lexical items, so the minimum word frequency is 0.
Calculate the proportion of the identical lexical items in the texts of barrage A and barrage C:
$P(A,C) = 0/(5+3) = 0$;
Calculate the text similarity of barrage A and barrage C based on identical lexical item word frequency:
$Sim_{tf}(A,C) = \log(1+0) = 0$;
Calculate the text similarity of barrage A and barrage C in text space with the cosine similarity algorithm:
$Sim_{word2vec}(A,C) = 0.35$
Calculate the final similarity of barrage A and barrage C:
$Sim(A,C) = \lambda \cdot Sim_{tf}(A,C) + (1-\lambda) \cdot Sim_{word2vec}(A,C) = 0.6 \times 0 + 0.4 \times 0.35 = 0.14$
The final similarity of barrage A and barrage C is therefore 0.14. The similarity between barrage A and barrage B is clearly much greater than the similarity between barrage A and barrage C, which matches the facts and shows that the barrage text similarity calculation method of the embodiment of the present invention produces highly accurate results.
An embodiment of a storage medium provided by the embodiments of the present invention includes: a computer program is stored on the storage medium, and the computer program, when executed by a processor, implements the following steps:
segmenting the texts of barrage A and barrage B into words to obtain the identical lexical items of barrage A and barrage B and the minimum word frequency of each identical lexical item;
calculating the proportion of the identical lexical items in the texts of barrage A and barrage B;
calculating the text similarity of barrage A and barrage B based on identical lexical item word frequency;
mapping the texts of barrage A and barrage B to space vectors with a word2vec model, and then calculating the text similarity of barrage A and barrage B in text space with the cosine similarity algorithm;
weighting the text similarity of barrage A and barrage B based on identical lexical item word frequency and their text similarity in text space to obtain the final similarity of barrage A and barrage B.
Optionally, on the basis of the above storage medium embodiment, in a first optional embodiment of the storage medium provided by the embodiments of the present invention, the storage medium may be any combination of one or more computer-readable media. A computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination of the above. More specific examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection with one or more wires, a portable computer diskette, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In this document, a computer-readable storage medium may be any tangible medium that contains or stores a program, and the program may be used by or in connection with an instruction execution system, apparatus or device.
Optionally, on the basis of the above storage medium embodiment and the first optional embodiment, in a second optional embodiment of the storage medium provided by the embodiments of the present invention, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, and that computer-readable medium can send, propagate or transmit a program for use by or in connection with an instruction execution system, apparatus or device. Program code contained on a computer-readable medium may be transmitted over any suitable medium, including but not limited to wireless, wire, optical cable, RF, or any suitable combination of the above.
Optionally, on the basis of the above storage medium embodiment and the first and second optional embodiments, in a third optional embodiment of the storage medium provided by the embodiments of the present invention, the computer program code for carrying out the operations of the present invention may be written in one or more programming languages or a combination thereof. The programming languages include object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
Referring to Fig. 2, an embodiment of an electronic device provided by the embodiments of the present invention includes:
a word segmentation unit, configured to select barrage A and barrage B to be displayed and segment the texts of barrage A and barrage B into words to obtain the identical lexical items of barrage A and barrage B and the minimum word frequency of each identical lexical item;
a proportion calculation unit, configured to calculate the proportion of the identical lexical items in the texts of barrage A and barrage B;
a first text similarity calculation unit, configured to calculate the text similarity of barrage A and barrage B based on identical lexical item word frequency;
a second text similarity calculation unit, configured to map the texts of barrage A and barrage B to space vectors with a word2vec model, and then calculate the text similarity of barrage A and barrage B in text space with the cosine similarity algorithm;
a final similarity calculation unit, configured to weight the text similarity of barrage A and barrage B based on identical lexical item word frequency and their text similarity in text space to obtain the final similarity of barrage A and barrage B.
An embodiment of a barrage text similarity calculation system provided by the embodiments of the present invention includes:
a word segmentation module, configured to select barrage A and barrage B to be displayed and segment the texts of barrage A and barrage B into words to obtain the identical lexical items of barrage A and barrage B and the minimum word frequency of each identical lexical item;
a proportion calculation module, configured to calculate the proportion of the identical lexical items in the texts of barrage A and barrage B;
a first text similarity calculation module, configured to calculate the text similarity of barrage A and barrage B based on identical lexical item word frequency;
a second text similarity calculation module, configured to map the texts of barrage A and barrage B to space vectors with a word2vec model, and then calculate the text similarity of barrage A and barrage B in text space with the cosine similarity algorithm;
a final similarity calculation module, configured to weight the text similarity of barrage A and barrage B based on identical lexical item word frequency and their text similarity in text space to obtain the final similarity of barrage A and barrage B.
Optionally, on the basis of the embodiment corresponding to the above barrage text similarity computing system, in a first optional embodiment of the barrage text similarity computing system provided by the embodiments of the present invention, the ratio computing module calculates the proportion of the identical lexical items in the text of barrage A and barrage B, with the calculation formula:
[Formula not reproduced in the source text.]
where $P(A,B)$ denotes the proportion of the identical lexical items in the text of barrage A and barrage B, $word_i$ denotes the $i$-th identical lexical item, $n_i$ denotes the minimum word frequency of that identical lexical item, $m$ denotes the number of identical lexical items, $L_A$ denotes the text length of barrage A, and $L_B$ denotes the text length of barrage B.
When calculating the similarity between barrage texts, the barrage text similarity computing system of the embodiment of the present invention first segments the barrage texts, and from the segmentation obtains the identical lexical items between the barrages and the minimum word frequency of each identical lexical item. From the identical lexical items and their minimum word frequencies it calculates the text similarity between the barrages based on identical lexical item word frequency, and it also calculates their text similarity in the text space. Finally, it assigns a weight to the word-frequency-based text similarity and to the text-space text similarity respectively, and takes the weighted result as the final similarity between the barrage texts. Because identical lexical items are taken into account when calculating text similarity, the accuracy of the calculated similarity between barrage texts is effectively ensured.
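The weighting step can be sketched in Python as below, following the formula stated in claim 5, Sim(A, B) = λ · Sim_tf(A, B) + (1 − λ) · Sim_word2vec(A, B) with λ in [0.6, 0.8]. The function name `final_similarity` and the default λ = 0.7 are assumptions made for illustration only.

```python
def final_similarity(sim_tf, sim_word2vec, lam=0.7):
    """Weighted sum lam * sim_tf + (1 - lam) * sim_word2vec.

    lam is the adjustment coefficient; claim 5 gives its range as [0.6, 0.8].
    The default of 0.7 is an illustrative choice, not a value from the patent.
    """
    if not 0.6 <= lam <= 0.8:
        raise ValueError("lam is expected to lie in [0.6, 0.8]")
    return lam * sim_tf + (1.0 - lam) * sim_word2vec

# Example: a pair with moderate word-frequency similarity but high text-space similarity.
print(final_similarity(0.5, 1.0, lam=0.6))
```

Because λ is at least 0.6, the word-frequency similarity always carries more weight than the text-space similarity in the final score.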
The present invention is not limited to the above embodiments. Those of ordinary skill in the art may make several improvements and modifications without departing from the principle of the invention, and these improvements and modifications are also regarded as falling within the protection scope of the invention. Content not described in detail in this specification belongs to the prior art well known to those skilled in the art.

Claims (10)

1. A barrage text similarity computing method, characterized by comprising the following steps:
Segmenting the text of barrage A and barrage B into words, and obtaining the identical lexical items of barrage A and barrage B and the minimum word frequency of each identical lexical item;
Calculating the proportion of the identical lexical items in the text of barrage A and barrage B;
Calculating the text similarity of barrage A and barrage B based on the word frequency of the identical lexical items;
Mapping the text of barrage A and barrage B to space vectors through a word2vec model, and then calculating the text similarity of barrage A and barrage B in the text space based on the cosine similarity algorithm;
Performing a weighted calculation on the text similarity of barrage A and barrage B based on identical lexical item word frequency and their text similarity in the text space, to obtain the final similarity of barrage A and barrage B.
2. The barrage text similarity computing method according to claim 1, characterized in that the proportion of the identical lexical items in the text of barrage A and barrage B is calculated by the following formula:
[Formula not reproduced in the source text.]
where $P(A,B)$ denotes the proportion of the identical lexical items in the text of barrage A and barrage B, $word_i$ denotes the $i$-th identical lexical item, $n_i$ denotes the minimum word frequency of that identical lexical item, $m$ denotes the number of identical lexical items, $L_A$ denotes the text length of barrage A, and $L_B$ denotes the text length of barrage B.
3. The barrage text similarity computing method according to claim 2, characterized in that the text similarity of barrage A and barrage B based on identical lexical item word frequency is calculated by the following formula:
[Formula not reproduced in the source text.]
where $Sim_{tf}(A,B)$ denotes the text similarity of barrage A and barrage B based on identical lexical item word frequency.
4. The barrage text similarity computing method according to claim 3, characterized in that the text similarity of barrage A and barrage B in the text space is calculated based on the cosine similarity algorithm by the following formula:
$$Sim_{word2vec}(A,B) = \frac{\sum_{i=1}^{p} A_i B_i}{\sqrt{\sum_{i=1}^{p} A_i^{2}}\,\sqrt{\sum_{i=1}^{p} B_i^{2}}}$$
where $Sim_{word2vec}(A,B)$ denotes the similarity of barrage A and barrage B in the text space, $\vec{A} = (A_1, \dots, A_p)$ denotes the space vector obtained by mapping barrage A through the word2vec model, $\vec{B} = (B_1, \dots, B_p)$ denotes the space vector obtained by mapping barrage B through the word2vec model, and $p$ denotes the dimension of $\vec{A}$ and $\vec{B}$.
5. The barrage text similarity computing method according to claim 4, characterized in that the weighted calculation on the text similarity of barrage A and barrage B based on identical lexical item word frequency and their text similarity in the text space, which yields the final similarity of barrage A and barrage B, uses the following formula:
$$Sim(A,B) = \lambda \cdot Sim_{tf}(A,B) + (1-\lambda) \cdot Sim_{word2vec}(A,B)$$
where $Sim(A,B)$ denotes the final similarity of barrage A and barrage B, and $\lambda$ is an adjustment coefficient whose value range is $[0.6, 0.8]$.
6. The barrage text similarity computing method according to claim 1, characterized in that when the final similarity of barrage A and barrage B is greater than a set threshold, either one of barrage A and barrage B is selected and displayed on the live streaming picture, and the other, unselected barrage is not displayed on the live streaming picture.
7. A storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the following steps:
Segmenting the text of barrage A and barrage B into words, and obtaining the identical lexical items of barrage A and barrage B and the minimum word frequency of each identical lexical item;
Calculating the proportion of the identical lexical items in the text of barrage A and barrage B;
Calculating the text similarity of barrage A and barrage B based on the word frequency of the identical lexical items;
Mapping the text of barrage A and barrage B to space vectors through a word2vec model, and then calculating the text similarity of barrage A and barrage B in the text space based on the cosine similarity algorithm;
Performing a weighted calculation on the text similarity of barrage A and barrage B based on identical lexical item word frequency and their text similarity in the text space, to obtain the final similarity of barrage A and barrage B.
8. An electronic device, characterized in that the electronic device includes:
A word segmentation unit, configured to select barrage A and barrage B to be displayed, segment the text of barrage A and barrage B into words, and obtain the identical lexical items of barrage A and barrage B and the minimum word frequency of each identical lexical item;
A ratio computing unit, configured to calculate the proportion of the identical lexical items in the text of barrage A and barrage B;
A first text similarity computing unit, configured to calculate the text similarity of barrage A and barrage B based on the word frequency of the identical lexical items;
A second text similarity computing unit, configured to map the text of barrage A and barrage B to space vectors through a word2vec model, and then calculate the text similarity of barrage A and barrage B in the text space based on the cosine similarity algorithm;
A final similarity computing unit, configured to perform a weighted calculation on the text similarity of barrage A and barrage B based on identical lexical item word frequency and their text similarity in the text space, obtaining the final similarity of barrage A and barrage B.
9. A barrage text similarity computing system, characterized by comprising:
A word segmentation module, configured to select barrage A and barrage B to be displayed, segment the text of barrage A and barrage B into words, and obtain the identical lexical items of barrage A and barrage B and the minimum word frequency of each identical lexical item;
A ratio computing module, configured to calculate the proportion of the identical lexical items in the text of barrage A and barrage B;
A first text similarity computing module, configured to calculate the text similarity of barrage A and barrage B based on the word frequency of the identical lexical items;
A second text similarity computing module, configured to map the text of barrage A and barrage B to space vectors through a word2vec model, and then calculate the text similarity of barrage A and barrage B in the text space based on the cosine similarity algorithm;
A final similarity computing module, configured to perform a weighted calculation on the text similarity of barrage A and barrage B based on identical lexical item word frequency and their text similarity in the text space, obtaining the final similarity of barrage A and barrage B.
10. The barrage text similarity computing system according to claim 9, characterized in that the ratio computing module calculates the proportion of the identical lexical items in the text of barrage A and barrage B by the following formula:
[Formula not reproduced in the source text.]
where $P(A,B)$ denotes the proportion of the identical lexical items in the text of barrage A and barrage B, $word_i$ denotes the $i$-th identical lexical item, $n_i$ denotes the minimum word frequency of that identical lexical item, $m$ denotes the number of identical lexical items, $L_A$ denotes the text length of barrage A, and $L_B$ denotes the text length of barrage B.
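The display rule of claim 6 (show only one of two barrages whose final similarity exceeds a set threshold) can be sketched in Python as follows. This is an illustration under assumptions: the claim speaks of a single pair, while the sketch extends the rule to a list by comparing each incoming barrage against those already shown; `jaccard` is a simple stand-in similarity, not the final similarity defined in the claims; and the threshold 0.8 is hypothetical.

```python
def jaccard(a, b):
    """Stand-in similarity on whitespace tokens (NOT the patent's final similarity)."""
    set_a, set_b = set(a.split()), set(b.split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)

def filter_barrages(barrages, similarity, threshold):
    """Keep a barrage only if it is not too similar to any barrage already kept."""
    shown = []
    for text in barrages:
        # A barrage is suppressed when its similarity to a shown one exceeds the threshold.
        if all(similarity(text, kept) <= threshold for kept in shown):
            shown.append(text)
    return shown

incoming = ["nice play", "nice play", "what a save", "nice play indeed"]
print(filter_barrages(incoming, jaccard, threshold=0.8))
```

Any callable that takes two texts and returns a score can be passed as `similarity`, so the weighted final similarity could be plugged in without changing the filtering loop.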
CN201811459848.5A 2018-11-30 2018-11-30 Barrage text similarity calculation method, storage medium, equipment and system Active CN109858012B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811459848.5A CN109858012B (en) 2018-11-30 2018-11-30 Barrage text similarity calculation method, storage medium, equipment and system


Publications (2)

Publication Number Publication Date
CN109858012A true CN109858012A (en) 2019-06-07
CN109858012B CN109858012B (en) 2023-11-28

Family

ID=66890555

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811459848.5A Active CN109858012B (en) 2018-11-30 2018-11-30 Barrage text similarity calculation method, storage medium, equipment and system

Country Status (1)

Country Link
CN (1) CN109858012B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011118526A (en) * 2009-12-01 2011-06-16 Hitachi Ltd Device for extraction of word semantic relation
CN107992470A (en) * 2017-11-08 2018-05-04 中国科学院计算机网络信息中心 A kind of text duplicate checking method and system based on similarity


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
王嘉?等: "基于向量空间模型的文本相似度计算方法", 《科技广场》 *
石雁等: "结合统计和词间关系的文本关键词计算方法", 《计算机技术与发展》 *
詹志建等: "基于语言网络和语义信息的文本相似度计算", 《计算机工程与应用》 *

Also Published As

Publication number Publication date
CN109858012B (en) 2023-11-28


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20231027

Address after: Room 7-801, Aokai City Plaza, No. 1777 Zhonghui Avenue, Huishan District, Wuxi City, Jiangsu Province, 214000 (Huishan Station Area of Urban Railway)

Applicant after: Kasima Huizhi (Wuxi) Technology Co.,Ltd.

Address before: 430000 East Lake Development Zone, Wuhan City, Hubei Province, No. 1 Software Park East Road 4.1 Phase B1 Building 11 Building

Applicant before: WUHAN DOUYU NETWORK TECHNOLOGY Co.,Ltd.

GR01 Patent grant