CN112036177A - Text semantic similarity information processing method and system based on multi-model fusion

Info

Publication number: CN112036177A
Application number: CN202010735606.5A
Authority: CN
Inventors: 杨万征; 蔡超; 程国艮
Original assignee: Global Tone Communication Technology Co ltd
Current assignee: Global Tone Communication Technology Co ltd
Priority date: 2020-07-28
Filing date: 2020-07-28
Publication date: 2020-12-04

Abstract

The invention belongs to the technical field of patent retrieval and discloses a text semantic similarity information processing method and system based on multi-model fusion. The method acquires patent data, performs word segmentation on the title, abstract, claims and specification of each patent, and processes these parts with different models to obtain the corresponding word vector features and sentence vector features. The word vector features of the title, the word vector features of the abstract, the sentence vector features of the claims and the sentence vector features of the specification are fused into a combined feature vector of the patent, and the similarity between this combined feature vector and the combined feature vectors of the other patents in the database is calculated. By using unsupervised learning models, the method greatly reduces the algorithm's need for labeled data; the sentence vectors mine the deep semantic features of the article; and the amount of real-time computation is greatly reduced, which speeds up feedback.

Description

Text semantic similarity information processing method and system based on multi-model fusion

Technical Field

The invention belongs to the technical field of patent retrieval, and particularly relates to a text semantic similarity information processing method and system based on multi-model fusion.

Background

At present, text semantic similarity calculation is an important research direction in natural language processing. Its results are widely applied in retrieval systems, duplicate-checking systems and the like. It helps users find what they want quickly, uncovers their deeper requirements, and avoids differences in results caused by different ways of expressing the same thing, so it has high value for both academic research and industrial application.

Research on text semantic similarity calculation falls roughly into two directions. One is scientific research, carried out mainly by researchers at universities and enterprises. Common methods include Siamese LSTM, RCNN and DSSM; most are deep neural networks trained by supervised learning in pursuit of a higher level of semantic understanding. Taking the simplest of these, the Siamese LSTM model, as an example: the text is first segmented into words, the words are converted into feature vectors, the feature vectors are fed into an LSTM model to extract semantic features, and similarity is then calculated from the resulting text vectors.

The other is industrial application, where text semantic similarity calculation is mainly used to improve search engine quality and to find similar texts. The scale of industrial data is far larger than that of research sample sets, and industrial applications have strict speed requirements, so the methods used in industry are often relatively simple. Examples include the LDA, PLSA and LFM models: a prior probability statistical model computes topic probability distributions from the word sets of the texts, and the similarity of two texts is then calculated from the similarity of their topic probabilities.

Most existing research methods for text semantic similarity use deep neural network models trained by supervised learning, which requires a large number of labeled samples. In industry, however, the data volume is often large while labeled data are scarce; at the start of a project in particular, labeled data are often simply unavailable. Moreover, unlike image labeling, text labeling requires a subjective understanding of the article, so the requirements on labeling personnel are higher. Large-scale supervised learning is therefore impractical in industry at the start of a project.

Deep neural network algorithms require a large amount of computation. This is feasible on small data sets, but on industrial-scale data of several GB, several TB or even several PB, searching semantically for the articles similar to one article requires running a neural network hundreds of millions of times, and the resulting feedback time is unacceptable.

Most existing semantic similarity detection algorithms in industry are character-based prior probability statistical models. They cannot capture context or word order, so they can only be regarded as shallow semantic similarity calculation.

Through the above analysis, the problems and defects of the prior art are as follows:

(1) Existing text semantic similarity calculation methods train their models by supervised learning, which requires a large number of labeled samples and a large amount of computation.

(2) Most existing semantic similarity detection algorithms are character-based prior probability statistical models, which cannot capture context or word order.

(3) Existing deep-learning-based models, such as Siamese LSTM, RCNN and DSSM, are computationally expensive, need high-specification GPU servers, and have high hardware costs.

The difficulty in solving the above problems and defects is as follows:

Solving problem (1) requires a large amount of manual labeling, which incurs hiring costs. Patents are also a highly specialized domain: the degree of similarity between two patents can only be judged accurately after review by a highly professional examiner, so the demands on personnel are high and labeling efficiency is low.

Solving problem (2) requires chain models such as RNN and LSTM, but using such models brings back the demands for hardware and labeled data, that is, it causes problems (1) and (3).

Problem (3) can only be solved by funding the purchase of high-specification servers. However, the system is developed mainly for a specific group of users, so the audience is small and usage is low, which easily wastes hardware resources.

The significance of solving the problems and defects is as follows:

Solving problem (1) relieves the pressure on, and the demand for, labeling personnel and reduces project research and development costs.

Solving problem (2) makes it possible to obtain the deep semantic features of the text and improves the overall detection quality of the system.

Solving problem (3) lowers the server configuration requirements, reduces cost, and improves equipment utilization.

Disclosure of Invention

Aiming at the problems in the prior art, the invention provides a text semantic similarity information processing method based on multi-model fusion.

The invention is realized as follows. A text semantic similarity information processing method based on multi-model fusion comprises the following steps:

step one, acquiring patent data from a patent library, performing word segmentation on the title, abstract, claims and specification of each patent in the patent data, and processing them with different models respectively to obtain the corresponding word vector features and sentence vector features;

step two, fusing the word vector features of the title, the word vector features of the abstract, the sentence vector features of the claims and the sentence vector features of the specification into the combined feature vector of the patent; at the same time, training a sentence vector model on the claims and specification data, and training a word vector model on the title and abstract data;

step three, storing the combined features of each patent in the patent library obtained in step two, together with the trained word vector model and sentence vector model;

step four, collecting the relevant data of the patent to be retrieved, and performing word segmentation on its title, abstract, claims and specification respectively to obtain the corresponding title word vector features, abstract word vector features, claim sentence vector features and specification sentence vector features;

step five, combining the obtained title word vector features, abstract word vector features, claim sentence vector features and specification sentence vector features of the patent to be retrieved to obtain the combined features of the patent to be retrieved;

step six, calculating, one by one, the similarity between the combined features of the patent to be retrieved and the combined features of each patent stored in the patent library.

Further, in the step one, the obtaining of corresponding word vector features and sentence vector features by performing word segmentation processing on titles, abstracts, claims and specifications of patents in the patent data by using different models respectively includes:

performing word segmentation on the title and the abstract, extracting keywords, and converting the extracted keywords into corresponding word vector features by using a word vector model;

performing word segmentation on the claims and the specification, and converting their contents into corresponding sentence vector features by using a sentence vector model.
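The segmentation and keyword extraction step for the title and abstract can be sketched as follows. The text names neither the word segmenter nor the keyword extraction method, so both choices here are assumptions: `segment` is a whitespace splitter standing in for a real Chinese word segmenter, and `top_keywords` ranks words by TF-IDF. Both function names are hypothetical.

```python
import math
from collections import Counter


def segment(text):
    """Stand-in segmenter (assumption): lowercase, split on whitespace, trim punctuation."""
    tokens = (tok.strip(".,;:()").lower() for tok in text.split())
    return [tok for tok in tokens if tok]


def top_keywords(doc_tokens, corpus_tokens, k=3):
    """Rank the words of one document by TF-IDF (assumed keyword criterion)."""
    n_docs = len(corpus_tokens)
    doc_freq = Counter()
    for tokens in corpus_tokens:
        doc_freq.update(set(tokens))  # count each word once per document
    term_freq = Counter(doc_tokens)
    scores = {
        word: (count / len(doc_tokens)) * math.log((1 + n_docs) / (1 + doc_freq[word]))
        for word, count in term_freq.items()
    }
    # Highest score first; ties broken alphabetically so the result is deterministic.
    return sorted(scores, key=lambda word: (-scores[word], word))[:k]


corpus = [
    segment("a method for text similarity"),
    segment("a method for image retrieval"),
    segment("a method for text retrieval of patents"),
]
print(top_keywords(corpus[0], corpus, k=2))
```

Words that occur in every title, such as "method", score zero and drop out, so only the distinguishing terms are passed on to the word vector model.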

Furthermore, the word vector model predicts the context words from the center word and uses a gradient descent algorithm to minimize the difference between the predicted context words and the actual context words; the word vectors produced by the model accurately reflect the relationships between words.
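A minimal sketch of such a word vector model, assuming a skip-gram style objective with a full softmax and plain stochastic gradient descent. The text does not fix the loss, the window size or the dimension, so `train_skipgram` and all of its parameters are illustrative only.

```python
import math
import random


def train_skipgram(sentences, dim=8, window=2, lr=0.1, epochs=50, seed=0):
    """Learn word vectors by predicting each context word from the center word."""
    rng = random.Random(seed)
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}
    size = len(vocab)
    w_in = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(size)]
    w_out = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(size)]
    losses = []
    for _ in range(epochs):
        total = 0.0
        for sent in sentences:
            for pos, center in enumerate(sent):
                lo, hi = max(0, pos - window), min(len(sent), pos + window + 1)
                for ctx in range(lo, hi):
                    if ctx == pos:
                        continue
                    target = index[sent[ctx]]
                    h = w_in[index[center]]  # center-word vector, updated in place
                    scores = [sum(a * b for a, b in zip(h, row)) for row in w_out]
                    top = max(scores)
                    exps = [math.exp(s - top) for s in scores]
                    norm = sum(exps)
                    probs = [e / norm for e in exps]
                    total -= math.log(probs[target])  # cross-entropy for this pair
                    grad_h = [0.0] * dim
                    for j in range(size):
                        # Gradient of the loss w.r.t. score j: predicted minus actual.
                        g = probs[j] - (1.0 if j == target else 0.0)
                        for k in range(dim):
                            grad_h[k] += g * w_out[j][k]
                            w_out[j][k] -= lr * g * h[k]
                    for k in range(dim):
                        h[k] -= lr * grad_h[k]
        losses.append(total)
    return {w: w_in[index[w]] for w in vocab}, losses


sentences = [
    ["text", "similarity", "method"],
    ["patent", "similarity", "method"],
    ["text", "retrieval", "system"],
]
vectors, losses = train_skipgram(sentences)
print(round(losses[0], 2), round(losses[-1], 2))
```

The falling loss is the "difference between predicted and actual context words" being minimized; a production system would more likely use negative sampling or a hierarchical softmax instead of the full softmax shown here.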

Furthermore, the sentence vector model uses a paragraph vector together with the center word vector to predict the context words. A window slides through the paragraph and the paragraph vector is carried along with the window; once the paragraph vector stabilizes, it can represent the text content.

Further, in step four, performing word segmentation on the title, abstract, claims and specification of the patent to be retrieved respectively to obtain the corresponding title word vector features, abstract word vector features, claim sentence vector features and specification sentence vector features includes:

performing word segmentation on the title and abstract of the patent to be retrieved, extracting keywords, and converting the extracted keywords into corresponding word vector features by using the trained word vector model;

performing word segmentation on the claims and the specification of the patent to be retrieved, and converting them into corresponding sentence vector features by using the trained sentence vector model.

Another object of the present invention is to provide a text semantic similarity information processing system based on multi-model fusion, which implements the text semantic similarity information processing method based on multi-model fusion, and the text semantic similarity information processing system based on multi-model fusion includes:

the data acquisition module is used for acquiring relevant data of the patent to be retrieved;

the text semantic extraction module is used for respectively extracting the title word vector features, abstract word vector features, claim sentence vector features and specification sentence vector features of the patent data based on the multiple models;

the feature fusion module is used for fusing the extracted title word vector features, abstract word vector features, claim sentence vector features and specification sentence vector features to obtain the combined features of the patent to be retrieved;

the similarity calculation module is used for calculating the similarity based on the obtained combination characteristics of the patent to be retrieved and the combination characteristics of other patents stored in the database in advance;

and the database is used for storing the related patent data, the patent combination characteristic data, the trained word vector model and the sentence vector model.
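One way to picture how the five modules fit together is the sketch below. It is an illustration under assumptions, not the structure of the claimed system: each module is reduced to a callable, the database is reduced to an in-memory dictionary of combined features (the stored patent data and trained models are omitted), and the class name `SimilaritySystem` and its methods are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Vector = List[float]
Patent = Dict[str, str]  # field name -> text, e.g. "title", "abstract"


@dataclass
class SimilaritySystem:
    acquire: Callable[[str], Patent]                 # data acquisition module
    extract: Callable[[Patent], Dict[str, Vector]]   # text semantic extraction module
    fuse: Callable[[Dict[str, Vector]], Vector]      # feature fusion module
    similarity: Callable[[Vector, Vector], float]    # similarity calculation module
    database: Dict[str, Vector] = field(default_factory=dict)  # simplified database

    def combined_features(self, patent_id: str) -> Vector:
        return self.fuse(self.extract(self.acquire(patent_id)))

    def index(self, patent_id: str) -> None:
        """Store the combined features of a library patent in advance."""
        self.database[patent_id] = self.combined_features(patent_id)

    def query(self, patent_id: str, top_k: int = 10) -> List[Tuple[str, float]]:
        """Compare the patent to be retrieved with every stored patent."""
        q = self.combined_features(patent_id)
        scored = [(pid, self.similarity(q, vec))
                  for pid, vec in self.database.items() if pid != patent_id]
        return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]


# Toy stand-ins for the modules, used only to show the data flow.
raw = {
    "p1": {"title": "ab", "abstract": "abcd"},
    "p2": {"title": "abcdefgh", "abstract": "a"},
    "q": {"title": "abc", "abstract": "abcd"},
}
system = SimilaritySystem(
    acquire=raw.__getitem__,
    extract=lambda patent: {name: [float(len(text))] for name, text in patent.items()},
    fuse=lambda feats: [x for name in sorted(feats) for x in feats[name]],
    similarity=lambda a, b: -sum(abs(x - y) for x, y in zip(a, b)),
)
system.index("p1")
system.index("p2")
print(system.query("q", top_k=1))
```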

It is a further object of the invention to provide a computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of:

acquiring patent data in a patent library, and performing word segmentation processing on titles, abstracts, claims and specifications of patents in the patent data by adopting different models respectively to obtain corresponding word vector characteristics and sentence vector characteristics;

the word vector characteristics of the title, the word vector characteristics of the abstract, the sentence vector characteristics of the claim and the sentence vector characteristics of the specification are fused to be used as combined characteristic vectors of the patent; simultaneously, training a sentence vector model by using the data of the claims and the specification, and training a word vector model by using the data of the title and the abstract;

respectively storing the obtained combination characteristics of each patent in the patent library, and the trained word vector model and sentence vector model;

collecting the relevant data of a patent to be retrieved, and performing word segmentation on its title, abstract, claims and specification respectively to obtain the corresponding title word vector features, abstract word vector features, claim sentence vector features and specification sentence vector features;

combining the obtained title word vector features, abstract word vector features, claim sentence vector features and specification sentence vector features of the patent to be retrieved to obtain the combined features of the patent to be retrieved;

and carrying out similarity calculation on the obtained combined features of the patents to be retrieved and the combined features of all the patents stored in the patent library one by one.

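It is another object of the present invention to provide a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the text semantic similarity information processing method based on multi-model fusion described above.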

The invention also aims to provide a retrieval and duplication checking terminal for implementing the text semantic similarity information processing method based on multi-model fusion.

By combining all the technical schemes, the invention has the advantages and positive effects that:

Description of the test method:

Data range: more than 20 million Chinese patents.

Retrieval mode: rejected patents are input, and the detection rate of X, Y and A documents within the top 100 results is checked, without any keyword or IPC filtering.

Description of the effects: the data comparison shows that retrieval using only shallow keyword word-vector semantics has the lowest detection rate, and that deep semantic retrieval using sentence vectors greatly improves the detection rate for X and Y documents. X and Y documents place more weight on similarity of content, whereas A documents place more weight on relatedness of content, which is a shallow correlation. By fusing the two kinds of feature vectors, the method performs clearly better than either feature representation alone.
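The text does not give a formula for the detection rate. A plausible reading, assumed here, is the fraction of cited documents of a given category (X, Y or A) that appear among the top 100 results returned for a rejected patent. The function name `detection_rate` and the sample identifiers are hypothetical.

```python
def detection_rate(ranked_ids, cited, category, k=100):
    """Share of cited documents of one category found in the top-k results.

    ranked_ids: result identifiers, best match first.
    cited: mapping of cited document id -> category ("X", "Y" or "A").
    Returns None when no document of that category was cited.
    """
    relevant = {doc_id for doc_id, cat in cited.items() if cat == category}
    if not relevant:
        return None
    found = relevant & set(ranked_ids[:k])
    return len(found) / len(relevant)


ranked = ["d3", "d1", "d9", "d2"]
cited = {"d1": "X", "d2": "Y", "d7": "X", "d9": "A"}
for category in ("X", "Y", "A"):
    print(category, detection_rate(ranked, cited, category, k=3))
```

Averaging this value over all test patents would give one rate per category, which is the kind of figure the comparison above refers to.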

By fusing several models, the invention calculates text similarity at both the shallow semantic and the deep semantic level: the character-level component ensures that the detected similar texts do not depart from a reader's subjective judgment, while the deep component also finds content with similar meaning. The model architecture uses unsupervised learning throughout and needs no labeled data. Because the patent data in the patent library are processed offline, the computation needed at query time is greatly reduced and real-time feedback is faster.

The model architecture of the invention uses unsupervised learning models, namely a word vector model and a sentence vector model, which greatly reduces the algorithm's need for labeled data, and the sentence vectors mine the deep semantic features of the article. The left part of the model can be computed offline for the patents in the patent database; at query time, the query only needs to be compared one by one with the cached combined feature vectors, which greatly reduces the real-time computation and speeds up feedback.
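The offline and real-time split can be sketched as follows. The text does not name the similarity measure, so cosine similarity is an assumption, as are the function names `build_offline_index` and `search`. The library vectors are normalized once offline, so that each real-time comparison is a single dot product.

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def build_offline_index(combined_features):
    """Offline: normalize and cache the combined feature vector of every patent."""
    return {pid: normalize(vec) for pid, vec in combined_features.items()}


def search(index, query_vec, top_k=100):
    """Real time: one dot product per cached vector, then keep the best top_k."""
    q = normalize(query_vec)
    scored = ((sum(a * b for a, b in zip(q, vec)), pid) for pid, vec in index.items())
    return [(pid, score) for score, pid in heapq.nlargest(top_k, scored)]


index = build_offline_index({"p1": [1.0, 0.0], "p2": [0.0, 2.0], "p3": [3.0, 3.0]})
print(search(index, [2.0, 0.1], top_k=2))
```

Only the query patent is run through the models at request time; everything stored in `index` was computed beforehand.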

The technical effect or experimental effect of the comparison is as follows:

Description of the test method:

Data range: more than 20 million Chinese patents.

Comparing test equipment and detection time shows that, by using model fusion, the technical scheme greatly improves the overall detection rate at a small cost in detection time. Compared with the comparison system, the overall detection effect is greatly improved, and although the detection time increases by 0.1 s, the required hardware is reduced by a factor of 8.

Drawings

To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings used in the embodiments are briefly described below. The drawings described below show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flowchart of a text semantic similarity information processing method based on multi-model fusion according to an embodiment of the present invention.

Fig. 2 is a schematic diagram of a text semantic similarity information processing method based on multi-model fusion according to an embodiment of the present invention.

Fig. 3 is a schematic structural diagram of a text semantic similarity information processing system based on multi-model fusion according to an embodiment of the present invention.

In the figure: 1. data acquisition module; 2. text semantic extraction module; 3. feature fusion module; 4. similarity calculation module; 5. database.

Fig. 4 is a schematic diagram of feature extraction of a word vector model according to an embodiment of the present invention.

Fig. 5 is a schematic diagram of sentence vector model feature extraction according to the embodiment of the present invention.

Fig. 6 is a schematic diagram of vector model construction provided in the embodiment of the present invention.

Fig. 7 is a schematic diagram of combined feature extraction provided in the embodiment of the present invention.

Fig. 8 is a schematic diagram of combination feature fusion provided by an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

Aiming at the problems in the prior art, the invention provides a text semantic similarity information processing method based on multi-model fusion, and the invention is described in detail below with reference to the accompanying drawings.

As shown in fig. 1-2, a text semantic similarity information processing method based on multi-model fusion provided by the embodiment of the present invention includes:

S101, acquiring patent data from a patent library, performing word segmentation on the title, abstract, claims and specification of each patent in the patent data, and processing them with different models respectively to obtain the corresponding word vector features and sentence vector features;

S102, fusing the word vector features of the title, the word vector features of the abstract, the sentence vector features of the claims and the sentence vector features of the specification into the combined feature vector of the patent; at the same time, training a sentence vector model on the claims and specification data, and training a word vector model on the title and abstract data;

S103, storing the combined features of each patent in the patent library obtained in step S102, together with the trained word vector model and sentence vector model;

S104, collecting the relevant data of the patent to be retrieved, and performing word segmentation on its title, abstract, claims and specification respectively to obtain the corresponding title word vector features, abstract word vector features, claim sentence vector features and specification sentence vector features;

S105, combining the obtained title word vector features, abstract word vector features, claim sentence vector features and specification sentence vector features of the patent to be retrieved to obtain the combined features of the patent to be retrieved;

S106, calculating, one by one, the similarity between the combined features of the patent to be retrieved and the combined features of each patent stored in the patent library.

In step S101, the obtaining of corresponding word vector features and sentence vector features by performing word segmentation processing on the title, abstract, claim, and specification of a patent in patent data using different models according to the embodiments of the present invention includes:

The word vector model provided by the embodiment of the invention predicts the context words from the center word and uses a gradient descent algorithm to minimize the difference between the predicted context words and the actual context words; the word vectors produced by the model accurately reflect the relationships between words.

The sentence vector model provided by the embodiment of the invention uses a paragraph vector together with the center word vector to predict the context words. A window slides through the paragraph and the paragraph vector is carried along with the window; once the paragraph vector stabilizes, it can represent the text content.

In step S104, performing word segmentation on the title, abstract, claims and specification of the patent to be retrieved according to the embodiment of the present invention to obtain the corresponding title word vector features, abstract word vector features, claim sentence vector features and specification sentence vector features includes:

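As shown in fig. 3, the text semantic similarity information processing system based on multi-model fusion provided by the embodiment of the present invention includes a data acquisition module 1, a text semantic extraction module 2, a feature fusion module 3, a similarity calculation module 4 and a database 5.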

The technical solution of the present invention is further illustrated by the following specific examples.

Example 1:

For the patent data in a patent library, the title, abstract, claims and specification are processed with different models. Because the word frequency distribution, text length and syntactic structure of the four parts differ greatly, each part needs to be processed with its own model.

The title and abstract are short, consist mostly of technical terms and the words that explain them, and are plain in style. They are therefore first segmented into words, keywords are then extracted, and the keywords are fed into the word vector model and converted into the corresponding word vectors. The word vector model is used here because it is an unsupervised model: a window slides through the article to cut out segments, and, as shown in fig. 4, the center word is used to predict the context words. The model needs no labeled data. Combined with a gradient descent algorithm, it minimizes the difference between the predicted and the actual context words, and its final product, the word vectors, accurately reflects the relationships between words. The word vector model is therefore used to mine the shallow semantics of the patent title and abstract.

The claims and the specification are long, mostly between 3,000 and 10,000 characters, and contain many cross-references between sentences. After word segmentation, sentence vectors are therefore used to mine the deep semantics of these parts. The sentence vector model is a variant of the word vector model and is likewise an unsupervised learning model. It introduces a paragraph vector on top of the word vectors: unlike the word vector model, it uses the paragraph vector together with the center word vector to predict the context words. As the window slides through the paragraph, the paragraph vector is carried along with it, and once the paragraph vector stabilizes it can represent the text content.
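The sentence vector model described above can be sketched in a few dozen lines. This follows the description in the text literally (the paragraph vector and the center word vector are averaged and used to predict each context word) and is not taken from any particular library. The full softmax, the averaging, the hyperparameters and the name `train_paragraph_vectors` are all assumptions.

```python
import math
import random


def train_paragraph_vectors(docs, dim=8, window=2, lr=0.1, epochs=40, seed=0):
    """Learn one vector per document jointly with word vectors (plain SGD)."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    index = {w: i for i, w in enumerate(vocab)}
    size = len(vocab)

    def rand_matrix(rows):
        return [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(rows)]

    doc_vecs, w_in, w_out = rand_matrix(len(docs)), rand_matrix(size), rand_matrix(size)
    losses = []
    for _ in range(epochs):
        total = 0.0
        for doc_id, doc in enumerate(docs):
            for pos, center in enumerate(doc):  # the sliding window
                c = index[center]
                lo, hi = max(0, pos - window), min(len(doc), pos + window + 1)
                for ctx in range(lo, hi):
                    if ctx == pos:
                        continue
                    target = index[doc[ctx]]
                    # Paragraph vector and center word vector predict jointly.
                    h = [(p + w) / 2.0 for p, w in zip(doc_vecs[doc_id], w_in[c])]
                    scores = [sum(a * b for a, b in zip(h, row)) for row in w_out]
                    top = max(scores)
                    exps = [math.exp(s - top) for s in scores]
                    norm = sum(exps)
                    probs = [e / norm for e in exps]
                    total -= math.log(probs[target])
                    grad_h = [0.0] * dim
                    for j in range(size):
                        g = probs[j] - (1.0 if j == target else 0.0)
                        for k in range(dim):
                            grad_h[k] += g * w_out[j][k]
                            w_out[j][k] -= lr * g * h[k]
                    for k in range(dim):
                        step = lr * grad_h[k] / 2.0  # h is the mean of two vectors
                        doc_vecs[doc_id][k] -= step
                        w_in[c][k] -= step
        losses.append(total)
    return doc_vecs, losses


docs = [
    ["claim", "one", "a", "method", "for", "text", "similarity"],
    ["the", "system", "stores", "patent", "feature", "vectors"],
]
doc_vecs, losses = train_paragraph_vectors(docs)
print(round(losses[0], 2), round(losses[-1], 2))
```

Because the same paragraph vector takes part in every window of its document, it accumulates information from the whole text, which is what allows it to stand in for the claims or the specification as a single feature vector.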

The word vector features of the title, the word vector features of the abstract, the sentence vector features of the claims and the sentence vector features of the specification are fused into the combined feature vector of the article. To calculate the similarity between two articles, only the similarity between their combined features needs to be calculated.
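The fusion operator is not specified in the text. The sketch below assumes one simple option: every part vector has the same dimension, each is scaled to unit length, and the four are combined by a weighted average, so the combined vector keeps that dimension. Cosine similarity, the equal default weights and the names `fuse` and `cosine` are likewise assumptions.

```python
import math

PARTS = ("title", "abstract", "claims", "specification")


def unit(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(features, weights=None):
    """Weighted average of the unit-normalized part vectors (assumed operator)."""
    weights = weights or {part: 1.0 for part in PARTS}
    dim = len(features[PARTS[0]])
    fused = [0.0] * dim
    for part in PARTS:
        vec = unit(features[part])
        if len(vec) != dim:
            raise ValueError("all part vectors must share one dimension")
        for i, x in enumerate(vec):
            fused[i] += weights[part] * x
    total = sum(weights[part] for part in PARTS)
    return [x / total for x in fused]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


doc_a = {"title": [1.0, 0.0], "abstract": [1.0, 0.0],
         "claims": [0.0, 1.0], "specification": [0.0, 2.0]}
doc_b = {"title": [3.0, 0.0], "abstract": [3.0, 0.0],
         "claims": [3.0, 0.0], "specification": [3.0, 0.0]}
print(cosine(fuse(doc_a), fuse(doc_b)))
```

The `weights` argument lets the shallow (title, abstract) and deep (claims, specification) parts be balanced differently; concatenating the part vectors would be another possible operator, at the price of a larger combined vector.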

In conclusion, the model architecture uses unsupervised learning models, namely a word vector model and a sentence vector model, which greatly reduces the algorithm's need for labeled data, and the sentence vectors mine the deep semantic features of the article. The left part of the model can be computed offline for the patents in the patent database; at query time, the query only needs to be compared one by one with the cached combined feature vectors, which greatly reduces the real-time computation and speeds up feedback.

The text semantic similarity information processing method based on multi-model fusion specifically comprises the following steps:

1. Segment the title, abstract, claims and specification of each patent in the patent library into words.

2. Train the word vector model using the title and abstract data.

3. Train the sentence vector model using the claims and specification data.

4. Store the trained models.

5. Perform the following calculations separately for every patent in the patent library.

6. Segment the title, abstract, claims and specification into words.

7. Extract keywords from the title and abstract.

8. Calculate the word vector features corresponding to the title keywords and abstract keywords.

9. Calculate the sentence vector features of the claims and the specification.

10. Combine the title word vector features, abstract word vector features, claim sentence vector features and specification sentence vector features.

11. Store the combined features offline.

12. Calculate the combined features of the patent to be detected.

13. Calculate, one by one, the similarity between the combined features of the patent to be detected and the precomputed combined features in the patent library.

14. Select the required data according to the similarity calculation results.

The invention may use a neural network in place of the sentence vector model.

The invention may use a Chinese word vector model or another word vector variant in place of the word vector calculation described here.

The invention may apply the same model architecture to other data sources. For example, with paper data, word vectors are used to calculate abstract feature vectors, sentence vectors are used to calculate body-text feature vectors, and semantic similarity is calculated from the combined features of the two.

Example 2

The text semantic similarity information processing method based on multi-model fusion comprises the following steps:

Step 1, performing word segmentation on the titles and abstracts of the papers in the paper database.

Step 2, training a word vector model using the segmented titles and abstracts.

Step 3, splitting the full text of each paper into major chapters, such as introduction, background, experiments, and comparison of results, and performing word segmentation on each chapter.

Step 4, training a sentence vector model using the word lists of the chapters obtained in step 3.

Step 5, storing the word vector model obtained in step 2 and the sentence vector model obtained in step 4.

Step 6, extracting features module by module from the papers in the local paper database using the word vector model and the sentence vector model.

Step 7, constructing a feature fusion method and fusing the features obtained in step 6.

Step 8, storing the extracted features together with the original text information.

Step 9, performing word segmentation on the title and abstract of the paper to be retrieved.

Step 10, converting the keyword information in the title and abstract of the paper to be retrieved into vectors using the word vector model.

Step 11, splitting the full text of the paper to be retrieved by chapter and performing word segmentation.

Step 12, extracting features using the sentence vector model.

Step 13, performing feature fusion using the feature fusion method constructed in step 7.

Step 14, calculating, one by one, the similarity between the combined features of the paper to be retrieved and the pre-computed combined features in the paper database.

Step 15, selecting the required data according to the similarity calculation results.
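The fusion and retrieval steps (7, 8, and 13 to 15) can be illustrated with a minimal sketch. It rests on several assumptions that are not stated in the source: the word vector model and the sentence vector model produce vectors of the same dimension, fusion is a weighted sum of unit-length section vectors (which keeps the dimension unchanged), and similarity is cosine similarity. The weights, the function names `fuse` and `search`, and the toy vectors are hypothetical.

```python
import math

# Hypothetical per-section weights; the source does not specify them.
WEIGHTS = {"title": 0.2, "abstract": 0.3, "chapters": 0.5}

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def fuse(features, weights=WEIGHTS):
    """Steps 7 and 13: weighted sum of unit-length per-section vectors."""
    dim = len(next(iter(features.values())))
    fused = [0.0] * dim
    for name, vec in features.items():
        unit = l2_normalize(vec)
        for i in range(dim):
            fused[i] += weights[name] * unit[i]
    return l2_normalize(fused)

def cosine(a, b):
    """For unit-length vectors, cosine similarity is the dot product."""
    return sum(x * y for x, y in zip(a, b))

def search(query_features, index, top_k=2):
    """Steps 14 and 15: compare one by one, then keep the best matches."""
    q = fuse(query_features)
    scored = [(doc_id, cosine(q, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]

# Toy 3-dimensional vectors standing in for the model outputs of step 6.
papers = {
    "paper_a": {"title": [1.0, 0.0, 0.0], "abstract": [0.9, 0.1, 0.0],
                "chapters": [0.8, 0.2, 0.0]},
    "paper_b": {"title": [0.0, 1.0, 0.0], "abstract": [0.1, 0.9, 0.0],
                "chapters": [0.0, 0.8, 0.2]},
    "paper_c": {"title": [0.0, 0.0, 1.0], "abstract": [0.0, 0.1, 0.9],
                "chapters": [0.1, 0.0, 0.9]},
}

# Step 8: fuse once offline and store the combined features.
index = {doc_id: fuse(feats) for doc_id, feats in papers.items()}

# Steps 9 to 15 for one paper to be retrieved.
query = {"title": [0.9, 0.1, 0.0], "abstract": [1.0, 0.0, 0.0],
         "chapters": [0.7, 0.3, 0.0]}
results = search(query, index)
```

Because the combined features of the database are computed ahead of time, only the query paper needs to be vectorized and fused at retrieval time; the remaining work is the one-by-one comparison.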

The effects of the invention are further described below through comparative experiments.

Operating system: CentOS 7.

Equipment: one machine with 8 cores, 16 threads, and 128 GB of memory.

Data set: 1,000,000 Chinese patents.

Comparing the experimental control groups shows that the choice of word segmentation method has a large influence on feature extraction; after overall comparison, word segmentation algorithm 2 was selected.

Model fusion tends to even out the detection rates for X, Y, and A documents. Without fusion, the detection rate for X documents is far higher than that for A documents, or the detection rate for A documents is far higher than that for X documents. The detection rates for X, Y, and A documents can be loosely understood as the model's ability to extract deep semantic and shallow semantic features, and model fusion combines deep and shallow semantics in a balanced manner.
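The source does not define how the detection rate is computed. As a minimal sketch, assuming X, Y, and A are the citation categories used in patent search reports and that the detection rate of a category is the share of cited documents of that category that appear among the top results, the per-category rates could be calculated as follows. The function name `detection_rates`, the `top_k` cutoff, and the toy data are hypothetical and are not results from the experiments.

```python
from collections import defaultdict

def detection_rates(citations, retrieved, top_k=10):
    """Share of cited documents per category found in the top_k results.

    citations: {query_id: [(cited_doc_id, category), ...]},
               with category being "X", "Y", or "A"
    retrieved: {query_id: [doc_id, ...]} ranked by descending similarity
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for query_id, cited in citations.items():
        top = set(retrieved.get(query_id, [])[:top_k])
        for doc_id, category in cited:
            totals[category] += 1
            if doc_id in top:
                hits[category] += 1
    return {cat: hits[cat] / totals[cat] for cat in totals}

# Hypothetical toy data for two query patents.
citations = {
    "q1": [("d1", "X"), ("d2", "A")],
    "q2": [("d3", "X"), ("d4", "Y"), ("d5", "A")],
}
retrieved = {
    "q1": ["d1", "d9", "d2"],
    "q2": ["d4", "d8", "d3"],
}
rates = detection_rates(citations, retrieved, top_k=2)
```

Computing the rate separately for each category makes an imbalance between categories visible, which is the effect that model fusion is described as reducing.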

Feature fusion does not change the feature dimension, so the detection time does not increase.

Because feature fusion takes both deep semantics and shallow semantics into account, its detection rate is much higher than that of a single model.

Comparing different feature dimensions and the corresponding detection times shows that, as the feature dimension increases, the detection rate of the model improves at the cost of longer detection time; the feature dimension should therefore be chosen by weighing quality against speed according to business requirements.
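This trade-off can be made concrete with a rough cost estimate. Assuming the one-by-one comparison at retrieval time is a dot product between unit-length vectors, and writing $N$ for the number of stored documents and $d$ for the feature dimension (both symbols are introduced here for illustration), the work per query is approximately:

```latex
C(N, d) \approx N \cdot d \quad \text{multiply-add operations}
```

For a data set of 1,000,000 patents, a hypothetical $d = 128$ gives about $1.28 \times 10^{8}$ operations per query, and doubling the dimension to $d = 256$ doubles that cost. These dimensions are illustrative only and are not taken from the experiments.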

In the description of the present invention, "a plurality" means two or more unless otherwise specified; the terms "upper", "lower", "left", "right", "inner", "outer", "front", "rear", "head", "tail", and the like, indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, are only for convenience in describing and simplifying the description, and do not indicate or imply that the device or element referred to must have a particular orientation, be constructed in a particular orientation, and be operated, and thus, should not be construed as limiting the invention. Furthermore, the terms "first," "second," "third," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.

The above description is only illustrative of specific embodiments of the present invention and is not to be construed as limiting its scope of protection; all modifications, equivalents, and improvements made within the spirit and principles of the invention are intended to fall within the scope defined by the appended claims.

Claims

1. A text semantic similarity information processing method based on multi-model fusion is characterized by comprising the following steps:

2. The method for processing text semantic similarity information based on multi-model fusion according to claim 1, wherein the step of performing word segmentation processing on titles, abstracts, claims and specifications of patents in patent data by using different models respectively to obtain corresponding word vector features and sentence vector features comprises the steps of:

3. The method as claimed in claim 1, wherein the word vector model predicts context words from a center word and, using a gradient descent algorithm, minimizes the difference between the predicted context words and the actual context words, so that the word vectors obtained from the word vector model accurately reflect the relationships between words.

4. The method as claimed in claim 1, wherein the sentence vector model predicts context words jointly from a paragraph vector and a center word vector; a window slides through the paragraph and the paragraph vector is carried along with the window until it tends to be stable, at which point the paragraph vector can represent the text content.

5. The method for processing text semantic similarity information based on multi-model fusion according to claim 1, wherein the step of performing word segmentation processing on the title, abstract, claim and description of the patent to be retrieved respectively to obtain corresponding entry word vector characteristics, abstract word vector characteristics, claim sentence vector characteristics and description sentence vector characteristics comprises:

6. A text semantic similarity information processing system based on multi-model fusion for implementing the text semantic similarity information processing method based on multi-model fusion according to claims 1-5, wherein the text semantic similarity information processing system based on multi-model fusion comprises:

7. A computer device, characterized in that the computer device comprises a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to carry out the steps of:

8. A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of:

9. A retrieval and duplication checking terminal for implementing the text semantic similarity information processing method based on multi-model fusion according to claims 1-5.