CN112733520B - Text similarity calculation method, system, corresponding equipment and storage medium

Text similarity calculation method, system, corresponding equipment and storage medium

Info

Publication number
CN112733520B
CN112733520B (application CN202011604778.5A)
Authority
CN
China
Prior art keywords: word, text, vector, vectors, similarity
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application number
CN202011604778.5A
Other languages: Chinese (zh)
Other versions: CN112733520A (en)
Inventor
张俊锋
程煜华
黄俊杰
侯丹丹
翟文丽
Current Assignee: Wanghai Kangxin Beijing Technology Co., Ltd. (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Original Assignee: Wanghai Kangxin Beijing Technology Co., Ltd.
Priority date: 2020-12-30 (the priority date is an assumption and is not a legal conclusion) · Filing date: 2020-12-30 · Publication date: 2023-07-18
Application filed by Wanghai Kangxin Beijing Technology Co., Ltd.
Priority to CN202011604778.5A
Publication of CN112733520A
Application granted
Publication of CN112733520B
Legal status: Active (current)
Anticipated expiration


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/10: Text processing
    • G06F40/194: Calculation of difference between files
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G06F40/216: Parsing using statistical methods
    • G06F40/279: Recognition of textual entities
    • G06F40/284: Lexical analysis, e.g. tokenisation or collocates
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a text similarity calculation method, a text similarity calculation system, corresponding equipment, and a storage medium. The method comprises the following steps: performing Chinese word segmentation on each text to generate a word sequence; generating a fixed-length word vector for each word of each word sequence using a trained word2vec model; generating a 2-gram phrase sequence based on each word sequence; generating a word vector for each 2-gram phrase; combining the word vectors of the words of a word sequence and the word vectors of the corresponding 2-gram phrases into a text vector; calculating the tf-idf value of each word and 2-gram phrase of the text vector as its weight to obtain a text space vector; and calculating the similarity between the respective texts based on the text space vectors. The invention can improve the matching accuracy of hyponyms, synonyms, and related words.

Description

Text similarity calculation method, system, corresponding equipment and storage medium
Technical Field
The present disclosure relates to the field of electronic digital data processing, and in particular, to a text similarity calculation method, a text similarity calculation system, a corresponding device, and a storage medium.
Background
Text similarity calculation is widely used in the computer field and has important applications in text retrieval, content recommendation, intelligent question answering, term labeling, advertisement delivery, and other technical fields. Commonly used text similarity calculation approaches include the following: 1) similarity calculation based on the literal word content of the text (the general steps are: word segmentation; tf-idf weighting; applying a distance formula such as cosine, Pearson, edit distance, or Jaccard); 2) word vector or sentence vector encoding models based on word embedding technology (the general steps are: word segmentation; generating word vectors with Word2Vec/GloVe; averaging the word vectors; applying a distance formula such as cosine or Euclidean).
These two families of algorithms and their variants have respective advantages and disadvantages and suit different application scenarios. Generally, algorithms based on the literal vocabulary of the text offer high feature discrimination but cannot understand semantics, and they ignore word order (for text, different orders often carry different meanings). Word embedding encodings, learned from large corpora, usually capture general grammatical structure and semantics and perform well in classification and clustering problems. However, word embedding is in essence a dimensionality reduction algorithm: the learned word vectors do not have very high feature discrimination, so this approach performs poorly in search and matching scenarios.
Disclosure of Invention
In order to overcome the defects in the prior art, the invention provides a text similarity calculation method, a text similarity calculation system, corresponding equipment and a storage medium, which can improve the matching accuracy of hyponyms, synonyms and related words.
In a first aspect of the present invention, there is provided a text similarity calculation method, the method comprising:
chinese word segmentation is carried out on each text, and a word sequence is generated;
generating a word vector of fixed length using a trained word2vec model for each word of each word sequence;
generating a 2-gram phrase sequence based on each word sequence;
generating a word vector of each 2-gram phrase sequence;
word vectors of words of the same word sequence and word vectors of corresponding 2-gram phrases are respectively combined into text vectors;
calculating tf-idf values of each word and 2-gram phrase of the text vector as weights to obtain a text space vector;
similarity between the respective texts is calculated based on the text space vector.
In an embodiment, the word2vec model includes a Skip-gram model and a CBOW model, wherein a first word vector of a word is generated using the Skip-gram model, and a second word vector of the same word is generated using the CBOW model, wherein the first word vector and the second word vector are connected as the word vector of the word.
In an embodiment, generating a word vector for each 2-gram phrase includes: the word vectors of the two words constituting each 2-gram phrase are added as the word vector of the corresponding 2-gram phrase.
In an embodiment, calculating the similarity between the respective texts based on the text space vectors includes:
for each word vector of the first text space vector, calculating the Euclidean distance to each word vector of the second text space vector; determining the word vector with the smallest Euclidean distance in the second text space vector as the word vector matched with the corresponding word vector of the first text space vector; calculating the similarity between the two matched word vectors; and calculating the weighted similarity between the two matched word vectors from the weights of the two matched words and the similarity between the two word vectors;
and calculating the text similarity between the first text and the second text according to all the calculated weighted similarity and the weights of all the matched words.
In a second aspect of the present invention, there is provided a text similarity calculation system, the system comprising:
the word sequence generating module is used for carrying out Chinese word segmentation on each text to generate a word sequence;
a word-vector generation module for generating a word vector of a fixed length using a trained word2vec model for each word of each word sequence;
the phrase sequence generating module is used for respectively generating 2-gram phrase sequences based on each word sequence;
the phrase word vector generation module is used for generating word vectors of each 2-gram phrase sequence;
the text vector generation module is used for respectively combining word vectors of all words of the same word sequence and word vectors of corresponding 2-gram phrases into text vectors;
the text space vector generation module is used for calculating tf-idf values of each word and 2-gram phrase of the text vector as weights to obtain the text space vector;
and the text similarity calculation module is used for calculating the similarity between the corresponding texts based on the text space vector.
In a third aspect of the invention, there is provided a computer device comprising a processor, a memory, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method according to the first aspect of the invention or the functions of the system according to the second aspect of the invention.
According to a fourth aspect of the present invention there is provided a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the method according to the first aspect of the present invention or performs the functions of the system according to the second aspect of the present invention.
The invention combines the advantages of similarity calculation based on the literal quantity of the text with those of encoding-vector similarity calculation based on a word embedding model: the word embedding model remedies the poor semantic descriptive power of literal text vectors, while the feature discrimination of the generated vectors is preserved. By constructing text space vectors from Word2Vec word embeddings and performing a comprehensive distance measurement weighted by tf-idf values over the word vectors, the invention solves the problem that hyponyms, synonyms, and related words cannot be identified, and thus improves text matching accuracy.
Other features and advantages of the present invention will become more apparent from the following detailed description of embodiments of the present invention, which is to be read in connection with the accompanying drawings.
Drawings
FIG. 1 is a flow chart of an embodiment of a method according to the present invention;
fig. 2 is a block diagram of an embodiment of a system according to the present invention.
For the sake of clarity, the figures are schematic, simplified drawings that show only the details necessary for an understanding of the invention; other details are omitted.
Detailed Description
Embodiments and examples of the present invention will be described in detail below with reference to the accompanying drawings.
The scope of applicability of the present invention will become apparent from the detailed description given hereinafter. It should be understood, however, that the detailed description and specific examples, while indicating preferred embodiments of the invention, are given by way of illustration only.
Fig. 1 shows a flowchart of a preferred embodiment of a text similarity calculation method according to the present invention. The method of this embodiment comprises:
in step S102, chinese segmentation is performed on each text, generating a word sequence. Each word after segmentation may contain 1 word, 2 words, or more than 3 words. The chinese word segmentation program may be any chinese word segmentation program known in the art. The similarity between the two texts "mixed lymphocyte culture" and "mixed leukocyte reaction" is taken as an example. For example, the text "mixed lymphocyte culture" can be divided into four words, mixed, lymphoid, cell, culture. The text of "mixed leukocyte reaction" can be divided into four words of mixed, leukocyte, cell, and reaction.
In step S104, for each word of each word sequence, a word vector of fixed length is generated using the trained word2vec model.
The word2vec model is trained and built on a relevant corpus. It converts Chinese words into numerical vectors and automatically learns related words, hyponyms, and synonyms. The training corpus may be chosen according to the application domain; for example, for the medical field, medical texts are used, including content collected from the Internet. The Chinese word segmentation program used in training should be the same as the one used for subsequent segmentation (i.e., the program used in step S102).
In an embodiment, the word2vec model is a combination of a CBOW model and a skip-gram model, the two models being trained separately on the same training corpus. Skip-gram is a model that uses a word as input to predict its surrounding context; CBOW is a model that uses the context of a word as input to predict the word itself. The combination of the two models has better spatial expressiveness and discrimination.
For example, word2vec is used to generate a word vector of, say, 400 dimensions for each segmented word: for the same word, a 200-dimensional vector may be generated with the trained CBOW model and another 200-dimensional vector with the trained skip-gram model, and the two vectors are then concatenated into a 400-dimensional vector that serves as the word vector of that word.
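A minimal sketch of this word-vector generation, assuming the gensim library (4.x signatures) and a toy corpus of pre-segmented word sequences; all hyperparameters other than sg and vector_size are illustrative, and a real model would be trained on a large domain corpus as described above.

    import numpy as np
    from gensim.models import Word2Vec

    # Toy corpus of pre-segmented word sequences (a real corpus would be large).
    corpus = [["混合", "淋巴", "细胞", "培养"], ["混合", "白细胞", "反应"]]

    cbow = Word2Vec(corpus, vector_size=200, sg=0, min_count=1)      # CBOW model
    skipgram = Word2Vec(corpus, vector_size=200, sg=1, min_count=1)  # skip-gram model

    def word_vector(word: str) -> np.ndarray:
        """Concatenate the CBOW and skip-gram vectors into one 400-dim word vector."""
        return np.concatenate([cbow.wv[word], skipgram.wv[word]])

    assert word_vector("细胞").shape == (400,)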
In step S106, for each segmented word sequence, a 2-gram phrase sequence is generated, so that the order of the words constituting a sentence can also affect the similarity. For example, "mixed lymphocyte culture" forms the word sequence mixed, lymph, cell, culture after Chinese word segmentation, and 2-gram generation then yields three phrases: mixed lymph, lymphocyte, cell culture. "Mixed leukocyte reaction" forms the word sequence mixed, leukocyte, reaction after Chinese word segmentation, and 2-gram generation then yields two phrases: mixed leukocyte, leukocyte reaction.
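Step S106 reduces to pairing adjacent words, as in this short sketch (using the concatenation of the two surface forms as the phrase key is an assumption about representation, not something prescribed by the patent):

    def bigram_phrases(words: list[str]) -> list[str]:
        """Step S106: form a 2-gram phrase from each pair of adjacent words."""
        return [w1 + w2 for w1, w2 in zip(words, words[1:])]

    # ["混合", "淋巴", "细胞", "培养"] -> ["混合淋巴", "淋巴细胞", "细胞培养"]
    # (mixed lymph, lymphocyte, cell culture)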
In step S108, a word vector is generated for each 2-gram phrase of each 2-gram phrase sequence. In an embodiment, the word vectors of the two words constituting a 2-gram phrase are added to form the word vector of that phrase. For example, the word vector of the phrase "lymphocyte" is obtained by adding the word vector of "lymph" to the word vector of "cell". In another embodiment, a word2vec model trained with the 2-gram phrase sequences as training data may be used to generate the word vectors of the 2-gram phrases.
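The first embodiment of step S108 is then a single vector addition, sketched below with the word_vector function from the earlier sketch:

    import numpy as np

    def phrase_vector(w1: str, w2: str) -> np.ndarray:
        """Step S108: 2-gram phrase vector as the element-wise sum of its two word vectors."""
        return word_vector(w1) + word_vector(w2)

    # E.g. the vector of "淋巴细胞" (lymphocyte) = vector("淋巴") + vector("细胞").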
In step S110, word vectors of words of the same word sequence and word vectors of corresponding 2-gram phrases are combined into text vectors, respectively.
In step S112, the tf-idf (term frequency-inverse document frequency) value of each word and 2-gram phrase of the text vector is calculated as its weight, yielding a text space vector whose entries comprise a word vector part and a weight part. For example: mixed: 0.12, lymph: 0.23, cell: 0.43, culture: 0.32, mixed lymph: 0.82, lymphocyte: 0.56, cell culture: 0.63, where the value after each word or phrase is its tf-idf weight. In this example, each entry of the text space vector also has a corresponding word vector of length 400.
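A sketch of the tf-idf weighting of step S112, computed by hand over a small collection of texts; the patent does not fix a particular tf-idf variant, so the smoothing below is one common convention and is an assumption:

    import math
    from collections import Counter

    def tf_idf_weights(doc_terms: list[str], all_docs: list[list[str]]) -> dict[str, float]:
        """Map each word/2-gram phrase of a document to its tf-idf weight."""
        tf = Counter(doc_terms)
        n_docs = len(all_docs)
        weights = {}
        for term, count in tf.items():
            df = sum(1 for doc in all_docs if term in doc)   # document frequency
            idf = math.log((1 + n_docs) / (1 + df)) + 1      # smoothed idf (assumption)
            weights[term] = (count / len(doc_terms)) * idf   # tf * idf
        return weights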
In step S114, the text similarity between a pair of texts is calculated based on their text space vectors. For two texts a and b, the Euclidean distance between each word vector in the text space vector of a and each word vector in the text space vector of b is calculated. For each word vector of a, the word vector of b with the smallest Euclidean distance is taken as its matched word vector, and the similarity between the two matched word vectors is calculated as 1 minus the Euclidean distance between them. Then, using the tf-idf value of the word/phrase in a and the tf-idf value of the word/phrase in b as weights, the weighted similarity of each matched pair is calculated as the similarity between the two matched word vectors multiplied by the weights of the two matched words/phrases. The final similarity between texts a and b is the accumulated weighted similarity of all matched word vectors divided by the accumulated product of the tf-idf weights of the matched words/phrases.
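The whole of step S114 can be sketched as follows. Each text space vector is represented as a mapping from a word/phrase to its (word vector, tf-idf weight) pair; the sketch assumes the word vectors are scaled so that Euclidean distances fall within [0, 1], since otherwise 1 minus the distance could be negative, a detail the patent leaves open:

    import numpy as np

    def text_similarity(a: dict[str, tuple[np.ndarray, float]],
                        b: dict[str, tuple[np.ndarray, float]]) -> float:
        """Step S114: weighted similarity between two text space vectors a and b."""
        weighted_sum = 0.0     # accumulated weighted similarities
        weight_products = 0.0  # accumulated products of matched tf-idf weights
        for vec_a, w_a in a.values():
            # Match vec_a to the word vector of b with the smallest Euclidean distance.
            dist, w_b = min((float(np.linalg.norm(vec_a - vec_b)), w_b)
                            for vec_b, w_b in b.values())
            similarity = 1.0 - dist  # similarity = 1 - Euclidean distance
            weighted_sum += w_a * w_b * similarity
            weight_products += w_a * w_b
        return weighted_sum / weight_products

Applied to the worked example below, a would contain seven entries (mixed, lymph, cell, culture, mixed lymph, lymphocyte, cell culture) and b five, and the function reproduces the accumulation in the closing calculation.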
For example, text a is "mixed lymphocyte culture" and text b is "mixed leukocyte reaction". Following the above steps, the text space vector of a includes word vectors for: mixed, lymph, cell, culture, mixed lymph, lymphocyte, cell culture; the text space vector of b includes word vectors for: mixed, leukocyte, reaction, mixed leukocyte, leukocyte reaction.
First, the Euclidean distances between the "mixed" word vector of a and all word vectors of b are calculated, and the word vector of b with the smallest Euclidean distance is selected as the match for the "mixed" word vector of a. The matched vector is of course the "mixed" word vector of b: the two word vectors are identical, so the Euclidean distance is 0 and the similarity is therefore 1 (the conversion formula being: similarity = 1 - Euclidean distance). The weighted similarity is the tf-idf value of "mixed" in a multiplied by the tf-idf value of "mixed" in b multiplied by the similarity between the two, i.e., 0.13 × 0.13 × 1 = 0.0169.
Next, the Euclidean distances between the "lymph" word vector of a and all word vectors of b are calculated, and the word vector of b with the smallest Euclidean distance is selected as the match for the "lymph" word vector of a. Here the calculation finds that the "leukocyte" word vector of b has the smallest Euclidean distance to the "lymph" word vector of a, namely 0.17, so the similarity is 0.83. The weighted similarity is the tf-idf value of "lymph" in a multiplied by the tf-idf value of "leukocyte" in b multiplied by the similarity between the two, i.e., 0.2 × 0.32 × 0.83 = 0.05312.
And so on, until all word vectors of a have been processed, giving the weighted similarities and weight products used in the calculation below.
Then the text similarity of a and b is calculated as the accumulated value of all weighted similarities divided by the accumulated value of the products of the weights of the matched words/phrases, i.e.,
(0.0169 + 0.05312 + 0.07182 + 0.0576 + 0.12288 + 0.04128 + 0.162196) / (0.13 × 0.13 + 0.2 × 0.32 + 0.42 × 0.38 + 0.24 × 0.24 + 0.4 × 0.32 + 0.15 × 0.43 + 0.41 × 0.43) = 0.525796 / 0.6669 ≈ 0.7884.
For texts a and b in this example, a similarity calculation based only on the literal text content, such as the Jaccard distance, yields a similarity of no more than 0.5, which is far from the actual situation (the two are in fact very similar). With the present similarity calculation algorithm, the word2vec model, having learned from a large medical corpus, assigns a high scene co-occurrence probability to "cell culture" and "cell reaction" and treats "lymphocyte" and "leukocyte" as a pair of synonyms. The final calculated similarity is close to 0.8, consistent with the actual situation, so the matching accuracy for texts, and especially for synonyms, hyponyms, and related words, is improved.
FIG. 2 shows a block diagram of a preferred embodiment of a text similarity calculation system according to the present invention, which comprises:
a word sequence generation module 202, configured to perform Chinese word segmentation on each text and generate a word sequence;
a word-vector generation module 204 for generating a word vector of fixed length using the trained word2vec model for each word of each word sequence;
a phrase sequence generating module 206, configured to generate a 2-gram phrase sequence based on each word sequence;
a phrase word vector generation module 208, configured to generate a word vector of each 2-gram phrase sequence;
a text vector generation module 210, configured to combine word vectors of words of the same word sequence and word vectors of corresponding 2-gram phrases into text vectors;
the text space vector generation module 212 is configured to calculate tf-idf values of each word and 2-gram phrase of the text vector as weights, so as to obtain the text space vector;
the text similarity calculation module 214 is configured to calculate a similarity between the corresponding texts based on the text space vector.
In another embodiment, the present invention provides a computer readable storage medium, on which a computer program is stored, which when executed by a processor, implements the steps of the method embodiment or other corresponding method embodiments shown and described in connection with fig. 1 or implements the functions of the system embodiment or other corresponding system embodiment shown and described in connection with fig. 2, which are not described here again.
In another embodiment, the present invention provides a computer device, including a processor, a memory, and a computer program stored on the memory and capable of running on the processor, where the processor implements the steps of the method embodiment or other corresponding method embodiment shown and described in connection with fig. 1 or implements the functions of the system embodiment or other corresponding system embodiment shown and described in connection with fig. 2 when the processor executes the computer program, which is not described herein.
The various embodiments described herein, or particular features, structures, or characteristics thereof, may be combined as suitable in one or more embodiments of the invention. In addition, in some cases, the order of steps described in the flowcharts and/or flow-line processes may be modified as appropriate and need not be performed in exactly the order described. Additionally, various aspects of the invention may be implemented using software, hardware, firmware, or a combination thereof and/or other computer-implemented modules or devices that perform the described functions. A software implementation of the present invention may include executable code stored in a computer readable medium and executed by one or more processors. The computer-readable medium may include a computer hard drive, ROM, RAM, flash memory, a portable computer storage medium such as CD-ROM, DVD-ROM, flash drives and/or other devices having a Universal Serial Bus (USB) interface, and/or any other suitable tangible or non-transitory computer-readable medium or computer memory on which executable code may be stored and executed by a processor. The invention may be used in connection with any suitable operating system.
As used herein, the singular forms "a", "an" and "the" include plural referents (i.e., having the meaning of "at least one") unless otherwise indicated. It will be further understood that the terms "has," "comprises," "including" and/or "comprising," when used in this specification, specify the presence of stated features, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, elements, components, and/or groups thereof. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
While the foregoing is directed to some preferred embodiments of the present invention, it should be emphasized that the present invention is not limited to these embodiments, but may be embodied in other forms within the scope of the inventive subject matter. Various changes and modifications may be made by one skilled in the art without departing from the spirit of the invention, and these changes or modifications still fall within the scope of the invention.

Claims (6)

1. A text similarity calculation method, the method comprising:
chinese word segmentation is carried out on each text, and a word sequence is generated;
generating a word vector of fixed length using a trained word2vec model for each word of each word sequence; the word2vec model comprises a Skip-gram model and a CBOW model, wherein the Skip-gram model is used for generating a first word vector of a word, the CBOW model is used for generating a second word vector of the same word, and the first word vector and the second word vector are connected to be used as the word vector of the word;
generating a 2-gram phrase sequence based on each word sequence;
generating a word vector of each 2-gram phrase sequence;
word vectors of words of the same word sequence and word vectors of corresponding 2-gram phrases are respectively combined into text vectors;
calculating tf-idf values of each word and 2-gram phrase of the text vector as weights to obtain a text space vector;
for each word vector of the first text space vector, respectively calculating Euclidean distance between each word vector of the second text space vector, respectively determining the word vector with the minimum Euclidean distance in the second text space vector as the word vector matched with the corresponding word vector of the first text space vector, respectively calculating the similarity between the two matched word vectors, and respectively calculating the weighted similarity between the two matched word vectors according to the weight of the two matched words and the similarity between the two word vectors; wherein the weighted similarity is calculated as: the similarity between the two matched word vectors is multiplied by the weight of the two matched words;
and calculating the text similarity between the first text and the second text according to all the calculated weighted similarity and the weights of all the matched words, wherein the text similarity between the first text and the second text is calculated as follows: an accumulated value of all weighted similarities of the first and second text space vectors divided by an accumulated value of a product of weights of the matched words of the first and second text space vectors.
2. The method of claim 1, wherein generating a word vector for each 2-gram phrase comprises: the word vectors of the two words constituting each 2-gram phrase are added as the word vector of the corresponding 2-gram phrase.
3. The method of claim 1, wherein calculating the similarity between the two matched word vectors comprises calculating the similarity as: 1 minus the Euclidean distance between the two matched word vectors.
4. A text similarity calculation system, the system comprising:
the word sequence generating module is used for carrying out Chinese word segmentation on each text to generate a word sequence;
a word-vector generation module for generating a word vector of a fixed length using a trained word2vec model for each word of each word sequence; the word2vec model comprises a Skip-gram model and a CBOW model, wherein the Skip-gram model is used for generating a first word vector of a word, the CBOW model is used for generating a second word vector of the same word, and the first word vector and the second word vector are connected to be used as the word vector of the word;
the phrase sequence generating module is used for respectively generating 2-gram phrase sequences based on each word sequence;
the phrase word vector generation module is used for generating word vectors of each 2-gram phrase sequence;
the text vector generation module is used for respectively combining word vectors of all words of the same word sequence and word vectors of corresponding 2-gram phrases into text vectors;
the text space vector generation module is used for calculating tf-idf values of each word and 2-gram phrase of the text vector as weights to obtain the text space vector;
the text similarity calculation module is used for respectively calculating Euclidean distance between each word vector of the first text space vector and each word vector of the second text space vector, respectively determining the word vector with the minimum Euclidean distance in the second text space vector as the word vector matched with the corresponding word vector of the first text space vector, respectively calculating the similarity between the two matched word vectors, and respectively calculating the weighted similarity between the two matched word vectors according to the weight of the two matched words and the similarity between the two word vectors; calculating the text similarity between the first text and the second text according to all the calculated weighted similarity and the weights of all the matched words; wherein the weighted similarity is calculated as: the similarity between the two matched word vectors is multiplied by the weight of the two matched words; wherein the text similarity between the first and second text is calculated as: an accumulated value of all weighted similarities of the first and second text space vectors divided by an accumulated value of a product of weights of the matched words of the first and second text space vectors.
5. A computer device comprising a processor, a memory and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the method according to claim 1 when the computer program is executed by the processor.
6. A computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method according to claim 1.
CN202011604778.5A 2020-12-30 2020-12-30 Text similarity calculation method, system, corresponding equipment and storage medium Active CN112733520B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011604778.5A CN112733520B (en) 2020-12-30 2020-12-30 Text similarity calculation method, system, corresponding equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011604778.5A CN112733520B (en) 2020-12-30 2020-12-30 Text similarity calculation method, system, corresponding equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112733520A CN112733520A (en) 2021-04-30
CN112733520B true CN112733520B (en) 2023-07-18

Family

ID=75610645

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011604778.5A Active CN112733520B (en) 2020-12-30 2020-12-30 Text similarity calculation method, system, corresponding equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112733520B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113988045B (en) * 2021-12-28 2022-04-12 浙江口碑网络技术有限公司 Text similarity determining method, text processing method, corresponding device and equipment


Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107273352A (en) * 2017-06-07 2017-10-20 北京理工大学 A kind of word insertion learning model and training method based on Zolu functions
CN107423284A (en) * 2017-06-14 2017-12-01 中国科学院自动化研究所 Merge the construction method and system of the sentence expression of Chinese language words internal structural information
CN109190117A (en) * 2018-08-10 2019-01-11 中国船舶重工集团公司第七〇九研究所 A kind of short text semantic similarity calculation method based on term vector
CN110209809A (en) * 2018-08-27 2019-09-06 腾讯科技(深圳)有限公司 Text Clustering Method and device, storage medium and electronic device
CN109508379A (en) * 2018-12-21 2019-03-22 上海文军信息技术有限公司 A kind of short text clustering method indicating and combine similarity based on weighted words vector
CN109815493A (en) * 2019-01-09 2019-05-28 厦门大学 A kind of modeling method that the intelligence hip-hop music lyrics generate
CN110309505A (en) * 2019-05-27 2019-10-08 重庆高开清芯科技产业发展有限公司 A kind of data format self-analytic data method of word-based insertion semantic analysis
CN110941951A (en) * 2019-10-15 2020-03-31 平安科技(深圳)有限公司 Text similarity calculation method, text similarity calculation device, text similarity calculation medium and electronic equipment
CN111966831A (en) * 2020-08-18 2020-11-20 创新奇智(上海)科技有限公司 Model training method, text classification device and network model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Research on Chinese Text Classification Based on Semantic Similarity; Li Xiaojun; China Master's Theses Full-text Database, Information Science and Technology (No. 04); I138-3522 *
Research and Implementation of a Chinese Text Classification System Based on Semantic Similarity; Zhang Zhen; China Master's Theses Full-text Database, Information Science and Technology (No. 06); I138-245 *

Also Published As

Publication number Publication date
CN112733520A (en) 2021-04-30

Similar Documents

Publication Publication Date Title
EP3958145A1 (en) Method and apparatus for semantic retrieval, device and storage medium
CN109325229B (en) Method for calculating text similarity by utilizing semantic information
CN112307763B (en) Term standardization method, system and corresponding equipment and storage medium
CN116795973B (en) Text processing method and device based on artificial intelligence, electronic equipment and medium
CN112347758A (en) Text abstract generation method and device, terminal equipment and storage medium
WO2021000391A1 (en) Text intelligent cleaning method and device, and computer-readable storage medium
CN110516040B (en) Method, device and computer storage medium for semantic similarity comparison between texts
CN116028722B (en) Post recommendation method and device based on word vector and computer equipment
CN114003682A (en) Text classification method, device, equipment and storage medium
CN113569011A (en) Training method, device and equipment of text matching model and storage medium
CN112818091A (en) Object query method, device, medium and equipment based on keyword extraction
CN116483979A (en) Dialog model training method, device, equipment and medium based on artificial intelligence
CN112883199A (en) Collaborative disambiguation method based on deep semantic neighbor and multi-entity association
CN112733520B (en) Text similarity calculation method, system, corresponding equipment and storage medium
CN109241272B (en) Chinese text abstract generation method, computer readable storage medium and computer equipment
Javanmardi et al. Caps captioning: a modern image captioning approach based on improved capsule network
CN117932000A (en) Long document dense retrieval method and system based on topic clustering global features
CN117828042A (en) Question and answer processing method, device, equipment and medium for financial service
CN115269768A (en) Element text processing method and device, electronic equipment and storage medium
CN112559711A (en) Synonymous text prompting method and device and electronic equipment
CN110309278B (en) Keyword retrieval method, device, medium and electronic equipment
CN112182159A (en) Personalized retrieval type conversation method and system based on semantic representation
CN117076636A (en) Information query method, system and equipment for intelligent customer service
CN115658903B (en) Text classification method, model training method, related device and electronic equipment
Kang et al. A survey of image caption tasks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant