CN112036177A - Text semantic similarity information processing method and system based on multi-model fusion - Google Patents

Info

Publication number
CN112036177A
Authority
CN
China
Prior art keywords
vector characteristics
model
word
word vector
sentence vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010735606.5A
Other languages
Chinese (zh)
Inventor
杨万征
蔡超
程国艮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Global Tone Communication Technology Co ltd
Original Assignee
Global Tone Communication Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Global Tone Communication Technology Co ltd filed Critical Global Tone Communication Technology Co ltd
Priority to CN202010735606.5A priority Critical patent/CN112036177A/en
Publication of CN112036177A publication Critical patent/CN112036177A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention belongs to the technical field of patent retrieval, and discloses a text semantic similarity information processing method and system based on multi-model fusion, which are used for acquiring patent data, and performing word segmentation processing on titles, abstracts, claims and specifications of patents in the patent data by adopting different models respectively to obtain corresponding word vector characteristics and sentence vector characteristics; the word vector characteristics of the title, the word vector characteristics of the abstract, the sentence vector characteristics of the claim and the sentence vector characteristics of the specification are fused to be used as combined characteristic vectors of the patent; and calculating the similarity between the combined feature vector of the patent and the combined feature vectors of other patents in the database. The method greatly reduces the requirement of the algorithm model on the labeled data by using the unsupervised learning model, can deeply mine the deep semantic features of the article by using the sentence vector, greatly reduces the calculated amount of real-time calculation, and accelerates the feedback speed.

Description

Text semantic similarity information processing method and system based on multi-model fusion
Technical Field
The invention belongs to the technical field of patent retrieval, and particularly relates to a text semantic similarity information processing method and system based on multi-model fusion.
Background
At present, text semantic similarity calculation is an important research direction in the field of natural language processing. Its research results are widely applied in retrieval systems, duplicate-checking systems and the like: they help users quickly find what they want, uncover users' deeper requirements, and avoid differences in results caused by different ways of expressing the same thing. The field therefore has high academic research value and industrial application value.
Research on text semantic similarity calculation falls roughly into two directions. One is scientific research, carried out mainly by university students and enterprise researchers. Common technical methods include Siamese_LSTM, RCNN, DSSM and the like; most are deep neural networks trained by supervised learning in pursuit of better semantic understanding. Taking the simplest model, Siamese_LSTM, as an example, its structure is as follows: the text is first segmented into words, the words are converted into corresponding feature vectors, the feature vectors are fed into an LSTM model to extract semantic features, and similarity is then calculated on the resulting text vectors.
The other is industrial application, where text semantic similarity calculation is mainly used to improve search engine quality and to find similar texts. The data volumes in industrial applications are far larger than the sample sets used in scientific research, and industrial applications have strict speed requirements, so the methods used in industry are often relatively simple, for example the LDA model, the PLSA model and the LFM model. These are prior-probability statistical models: they compute probability distributions over topics from the sets of words in each text, and then calculate the similarity of two texts from the similarity of their topic probabilities.
Most existing research methods for text semantic similarity use deep neural network models trained by supervised learning, and algorithms of this type need a large number of labeled samples. In industry, however, the data volume is often large while labeled data are scarce; at the start of a project in particular, labeled data are very hard to come by. Moreover, labeling text differs from labeling images: it requires a subjective understanding of the article, so the requirements on labeling personnel are often higher. It is therefore impractical to develop large-scale supervised learning algorithms in industry at the start of a project.
Deep neural network algorithms require a large amount of computation. This is feasible on small amounts of data, but when they are applied to industrial-scale data of several GB, several TB or even several PB, semantically searching for the articles similar to one article requires a single neural network to be executed hundreds of millions of times, and the resulting feedback time is unacceptable.
Most existing semantic similarity detection algorithms in industry are character-based prior-probability statistical models. They cannot capture context or word-order relations, so they can only be regarded as shallow semantic similarity calculation.
Through the above analysis, the problems and defects of the prior art are as follows:
(1) existing text semantic similarity calculation methods train models by supervised learning, which requires a large number of labeled samples and a large amount of computation;
(2) most existing semantic similarity detection algorithms are character-based prior-probability statistical models, which cannot capture context or word-order relations;
(3) existing deep-learning-based models, such as Siamese_LSTM, RCNN and DSSM, require a large amount of computation and must be supported by a high-configuration GPU server, so the hardware cost is high.
The difficulty in solving the above problems and defects is:
Solving problem (1) requires a large amount of manual labeling and therefore spending on hired labor. Moreover, patents are highly specialized: the degree of similarity between two patents can be evaluated accurately only after review by a highly professional examiner, so the demands on personnel are high and labeling efficiency is low.
Solving problem (2) requires chain models such as RNN and LSTM, but using such models in turn creates demand for hardware and labeled data, that is, it leads back to problems (1) and (3).
Solving problem (3) simply requires funding to purchase high-configuration servers. However, the system is developed mainly for a specific group of users, so the audience is small and the usage rate is low, which easily leads to wasted hardware resources.
The significance of solving the problems and the defects is as follows:
Solving problem (1) relieves the pressure on, and the demand for, labeling personnel and reduces the cost of project research and development.
Solving problem (2) makes it possible to obtain the deep semantic features of the text and improves the overall detection quality of the system.
Solving problem (3) lowers the configuration requirements on the server, reduces cost, and improves equipment utilization.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a text semantic similarity information processing method based on multi-model fusion.
The invention is realized in this way, and a text semantic similarity information processing method based on multi-model fusion comprises the following steps:
step one, acquiring patent data in a patent library, and performing word segmentation processing on titles, abstracts, claims and specifications of patents in the patent data by adopting different models respectively to obtain corresponding word vector characteristics and sentence vector characteristics;
step two, the word vector characteristics of the title, the word vector characteristics of the abstract, the sentence vector characteristics of the claim and the sentence vector characteristics of the specification are fused to be used as combined characteristic vectors of the patent; simultaneously, training a sentence vector model by using the data of the claims and the specification, and training a word vector model by using the data of the title and the abstract;
step three, storing the combination characteristics of each patent in the patent library obtained in the step two, and the trained word vector model and sentence vector model respectively;
step four, collecting relevant data of the patent to be retrieved, and performing word segmentation processing on the title, the abstract, the claim and the specification of the patent to be retrieved respectively to obtain corresponding entry word vector characteristics, abstract word vector characteristics, claim sentence vector characteristics and specification sentence vector characteristics;
step five, combining the obtained entry word vector characteristics, abstract word vector characteristics, claim sentence vector characteristics and specification sentence vector characteristics of the patent to be retrieved to obtain combined characteristics of the patent to be retrieved;
and step six, performing similarity calculation on the obtained combined features of the patents to be retrieved and the combined features of all the patents stored in the patent library one by one.
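The six steps above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the source does not say how the four feature vectors are fused or which similarity measure is used, so concatenation of L2-normalized parts and cosine similarity are assumptions, and the names `fuse_features` and `rank_library` are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse_features(title_vec, abstract_vec, claims_vec, spec_vec):
    """Fuse the four per-section vectors into one combined feature vector.

    Assumption: fusion is plain concatenation of the L2-normalized parts.
    """
    combined = []
    for part in (title_vec, abstract_vec, claims_vec, spec_vec):
        combined.extend(l2_normalize(part))
    return combined


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def rank_library(query_vec, library):
    """Compare the query with every stored combined vector, one by one,
    and return (patent_id, score) pairs sorted by descending score."""
    scores = [(pid, cosine_similarity(query_vec, vec)) for pid, vec in library.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)
```

In this sketch the library dictionary plays the role of the stored combined features from step three, and `rank_library` corresponds to the one-by-one comparison of step six.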
Further, in the step one, the obtaining of corresponding word vector features and sentence vector features by performing word segmentation processing on titles, abstracts, claims and specifications of patents in the patent data by using different models respectively includes:
performing word segmentation processing on the title and the abstract, extracting keywords, and converting the extracted keywords into corresponding word vector characteristics by using a word vector model;
performing word segmentation processing on the claims and the specification, and converting the contents of the claims and the specification into corresponding sentence vector characteristics by using a sentence vector model.
Furthermore, the word vector model predicts the context words from the intermediate word and uses a gradient descent algorithm to minimize the difference between the predicted context words and the real context words; the word vectors obtained in this way accurately reflect the relationships between words.
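The word vector model described above (predicting context words from the intermediate word and minimizing the prediction error by gradient descent) can be illustrated with a toy skip-gram style trainer. This is a sketch under assumptions, not the implementation used in the invention: the full-softmax loss, the vector dimension, the learning rate and the function names are all chosen here for illustration only.

```python
import math
import random


def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs from a sliding window over the tokens."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def train_skipgram(tokens, dim=8, window=2, lr=0.1, epochs=50, seed=0):
    """Train toy word vectors with a full softmax and plain SGD.

    Returns (word -> vector, list of mean losses per epoch).
    """
    rng = random.Random(seed)
    vocab = sorted(set(tokens))
    index = {w: k for k, w in enumerate(vocab)}
    # Input (word) vectors and output (context) vectors.
    w_in = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]
    w_out = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]
    pairs = [(index[c], index[o]) for c, o in skipgram_pairs(tokens, window)]
    losses = []
    for _ in range(epochs):
        total = 0.0
        for c, o in pairs:
            v = w_in[c]
            scores = [sum(a * b for a, b in zip(v, u)) for u in w_out]
            top = max(scores)
            exps = [math.exp(s - top) for s in scores]
            z = sum(exps)
            probs = [e / z for e in exps]
            total += -math.log(probs[o])
            # Gradient of the cross-entropy w.r.t. the scores is (probs - one_hot).
            grad_v = [0.0] * dim
            for k, u in enumerate(w_out):
                g = probs[k] - (1.0 if k == o else 0.0)
                for d in range(dim):
                    grad_v[d] += g * u[d]   # uses u[d] before it is updated
                    u[d] -= lr * g * v[d]
            for d in range(dim):
                v[d] -= lr * grad_v[d]
        losses.append(total / len(pairs))
    return {w: w_in[index[w]] for w in vocab}, losses
```

A production system would more likely rely on an existing word-vector library and on negative sampling or hierarchical softmax instead of the full softmax shown here; the sketch only makes the "predict context, descend the gradient" loop concrete.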
Furthermore, the sentence vector model predicts the context words jointly from the paragraph vector and the center word vector. A window slides through the paragraph and the paragraph vector is carried along with the window; once the paragraph vector has become stable, it can represent the content of the text.
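To make the sliding-window idea concrete, the following sketch builds the training examples such a sentence vector model would consume: every window position inside a paragraph carries the same paragraph ID, so that paragraph's vector is updated by all windows of the text. The function names and the element-wise mean used to combine the two input vectors are assumptions for illustration; the source does not specify how the paragraph vector and the center word vector are combined.

```python
def paragraph_examples(paragraphs, window=2):
    """Build (paragraph_id, center_word, context_word) training examples.

    Each paragraph gets its own ID (and therefore its own vector); the same
    ID is attached to every window position inside that paragraph.
    """
    examples = []
    for pid, tokens in enumerate(paragraphs):
        for i, center in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    examples.append((pid, center, tokens[j]))
    return examples


def joint_input(paragraph_vec, center_vec):
    """Combine the paragraph vector and the center-word vector into one
    predictor input (assumption: element-wise mean)."""
    return [(p + c) / 2.0 for p, c in zip(paragraph_vec, center_vec)]
```

Training would then proceed as for the word vector model, except that the gradient also flows into the paragraph vector selected by `paragraph_id`.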
Further, in step four, the step of performing word segmentation processing on the title, the abstract, the claim and the specification of the patent to be retrieved respectively to obtain corresponding entry word vector characteristics, abstract word vector characteristics, claim sentence vector characteristics and specification sentence vector characteristics includes:
performing word segmentation processing on titles and abstracts of patents to be retrieved, extracting keywords, and converting the extracted keywords into corresponding word vector characteristics by using a trained word vector model;
and performing word segmentation processing on the claims and the specification of the patent to be retrieved, and converting them into corresponding sentence vector characteristics by using the trained sentence vector model.
Another object of the present invention is to provide a text semantic similarity information processing system based on multi-model fusion, which implements the text semantic similarity information processing method based on multi-model fusion, and the text semantic similarity information processing system based on multi-model fusion includes:
the data acquisition module is used for acquiring relevant data of the patent to be retrieved;
the text semantic extraction module is used for respectively extracting the entry word vector characteristics, abstract word vector characteristics, claim sentence vector characteristics and specification sentence vector characteristics of the patent data based on the multiple models;
the feature fusion module is used for fusing the extracted entry word vector features, abstract word vector features, claim sentence vector features and description sentence vector features to obtain combined features of the patent to be retrieved;
the similarity calculation module is used for calculating the similarity based on the obtained combination characteristics of the patent to be retrieved and the combination characteristics of other patents stored in the database in advance;
and the database is used for storing the related patent data, the patent combination characteristic data, the trained word vector model and the sentence vector model.
It is a further object of the invention to provide a computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of:
acquiring patent data in a patent library, and performing word segmentation processing on titles, abstracts, claims and specifications of patents in the patent data by adopting different models respectively to obtain corresponding word vector characteristics and sentence vector characteristics;
the word vector characteristics of the title, the word vector characteristics of the abstract, the sentence vector characteristics of the claim and the sentence vector characteristics of the specification are fused to be used as combined characteristic vectors of the patent; simultaneously, training a sentence vector model by using the data of the claims and the specification, and training a word vector model by using the data of the title and the abstract;
respectively storing the obtained combination characteristics of each patent in the patent library, and the trained word vector model and sentence vector model;
collecting relevant data of a patent to be retrieved, and performing word segmentation processing on a title, an abstract, a claim and a specification of the patent to be retrieved respectively to obtain corresponding entry word vector characteristics, abstract word vector characteristics, claim sentence vector characteristics and specification sentence vector characteristics;
combining the obtained entry word vector characteristics, abstract word vector characteristics, claim sentence vector characteristics and specification sentence vector characteristics of the patent to be retrieved to obtain the combined characteristics of the patent to be retrieved;
and carrying out similarity calculation on the obtained combined features of the patents to be retrieved and the combined features of all the patents stored in the patent library one by one.
It is another object of the present invention to provide a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of:
acquiring patent data in a patent library, and performing word segmentation processing on titles, abstracts, claims and specifications of patents in the patent data by adopting different models respectively to obtain corresponding word vector characteristics and sentence vector characteristics;
the word vector characteristics of the title, the word vector characteristics of the abstract, the sentence vector characteristics of the claim and the sentence vector characteristics of the specification are fused to be used as combined characteristic vectors of the patent; simultaneously, training a sentence vector model by using the data of the claims and the specification, and training a word vector model by using the data of the title and the abstract;
respectively storing the obtained combination characteristics of each patent in the patent library, and the trained word vector model and sentence vector model;
collecting relevant data of a patent to be retrieved, and performing word segmentation processing on a title, an abstract, a claim and a specification of the patent to be retrieved respectively to obtain corresponding entry word vector characteristics, abstract word vector characteristics, claim sentence vector characteristics and specification sentence vector characteristics;
combining the obtained entry word vector characteristics, abstract word vector characteristics, claim sentence vector characteristics and specification sentence vector characteristics of the patent to be retrieved to obtain the combined characteristics of the patent to be retrieved;
and carrying out similarity calculation on the obtained combined features of the patents to be retrieved and the combined features of all the patents stored in the patent library one by one.
The invention also aims to provide a retrieval and duplication checking terminal for implementing the text semantic similarity information processing method based on multi-model fusion.
By combining all the technical schemes, the invention has the advantages and positive effects that:
Figure BDA0002604929210000061
Figure BDA0002604929210000071
Description of test methods:
Data range: more than 20 million Chinese patent records.
Retrieval mode: rejected patents are input and the detection rate of X, Y and A documents within the top 100 results is checked, without any keyword or IPC filtering.
Description of the effects: the data comparison shows that retrieval using only shallow keyword word-vector semantics has the lowest detection rate, and that deep semantic retrieval using sentence vectors greatly improves the detection rate of X and Y documents. X and Y documents emphasize similarity of content, whereas A documents emphasize relatedness of content, which is a shallow correlation. Fusing the two kinds of feature vectors gives results clearly better than either feature representation alone.
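The detection rate used in this test can be illustrated as follows. The source does not give an exact formula, so this sketch assumes the detection rate is the fraction of an application's cited X, Y or A documents that appear among the top 100 retrieved results; the function name and argument names are hypothetical.

```python
def detection_rate(ranked_ids, cited_ids, top_k=100):
    """Fraction of cited documents found among the top_k retrieved results.

    ranked_ids: retrieved patent IDs, best match first.
    cited_ids:  IDs of the X/Y/A documents cited against the input patent.
    """
    if not cited_ids:
        return 0.0
    top = set(ranked_ids[:top_k])
    hits = sum(1 for doc_id in cited_ids if doc_id in top)
    return hits / len(cited_ids)
```

Averaging this value over all rejected patents used as queries, separately for each document category, would give one figure per category and per feature representation.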
By fusing multiple models, the invention calculates text similarity from both shallow semantics and deep semantics. This ensures that, at the character level, the detected similar texts do not deviate from people's subjective judgment, while also uncovering content with similar semantics. The model architecture of the invention uses unsupervised learning throughout and needs no labeled data. By computing the patent data in the patent library offline, the computation required at query time is greatly reduced and the real-time feedback speed is increased.
The model architecture of the invention uses unsupervised learning models, namely a word vector model and a sentence vector model, which greatly reduces the algorithm's demand for labeled data, and the sentence vectors allow the deep semantic features of an article to be mined. The left part of the model can be computed offline for the patents in the patent database; at query time, only a one-by-one comparison with the cached combined feature vectors is needed, which greatly reduces the amount of real-time computation and speeds up feedback.
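The split between offline calculation and real-time comparison can be sketched as below. The cache layout, the use of unit-length vectors so that the online step reduces to a dot product, and the names `build_cache` and `query_top_k` are assumptions made for illustration; the source only states that the combined feature vectors are cached and compared one by one.

```python
import heapq
import math


def _unit(vec):
    """Return the vector scaled to unit length (zero vectors are unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def build_cache(combined_vectors):
    """Offline step: normalize and store each patent's combined vector once."""
    return {pid: _unit(vec) for pid, vec in combined_vectors.items()}


def query_top_k(query_vec, cache, k=100):
    """Online step: one dot product per cached patent, keep the k best."""
    q = _unit(query_vec)
    scored = ((sum(a * b for a, b in zip(q, vec)), pid) for pid, vec in cache.items())
    return [(pid, score) for score, pid in heapq.nlargest(k, scored)]
```

Because the model inference for the library patents happens in `build_cache`, the only per-query model work is computing the combined vector of the patent to be retrieved.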
The technical effect or experimental effect of the comparison is as follows:
Figure BDA0002604929210000072
Figure BDA0002604929210000081
Comparing the test equipment and the detection time shows that the use of model fusion in this technical scheme greatly improves the overall detection rate with almost no sacrifice in detection time: compared with the comparison system, the overall detection effect is greatly improved, and although the detection time increases by 0.1 s, the required hardware equipment is reduced by a factor of 8.
Drawings
In order to illustrate the technical solutions of the embodiments of the present application more clearly, the drawings used in the embodiments are briefly described below. The drawings described below are only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a text semantic similarity information processing method based on multi-model fusion according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of a text semantic similarity information processing method based on multi-model fusion according to an embodiment of the present invention.
FIG. 3 is a schematic structural diagram of a text semantic similarity information processing system based on multi-model fusion according to an embodiment of the present invention;
in the figure: 1. a data acquisition module; 2. a text semantic extraction module; 3. a feature fusion module; 4. a similarity calculation module; 5. a database.
Fig. 4 is a schematic diagram of feature extraction of a word vector model according to an embodiment of the present invention.
Fig. 5 is a schematic diagram of sentence vector model feature extraction according to the embodiment of the present invention.
Fig. 6 is a schematic diagram of vector model construction provided in the embodiment of the present invention.
Fig. 7 is a schematic diagram of combined feature extraction provided in the embodiment of the present invention.
Fig. 8 is a schematic diagram of combination feature fusion provided by an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Aiming at the problems in the prior art, the invention provides a text semantic similarity information processing method based on multi-model fusion, and the invention is described in detail below with reference to the accompanying drawings.
As shown in fig. 1-2, a text semantic similarity information processing method based on multi-model fusion provided by the embodiment of the present invention includes:
s101, acquiring patent data in a patent library, and performing word segmentation processing on titles, abstracts, claims and specifications of patents in the patent data by adopting different models respectively to obtain corresponding word vector characteristics and sentence vector characteristics;
s102, combining the word vector characteristics of the title, the word vector characteristics of the abstract, the sentence vector characteristics of the claim and the sentence vector characteristics of the specification as combined characteristic vectors of the patent; simultaneously, training a sentence vector model by using the data of the claims and the specification, and training a word vector model by using the data of the title and the abstract;
s103, storing the combination characteristics of the patents in the patent library obtained in the step S102, and the trained word vector model and sentence vector model respectively;
s104, collecting relevant data of the patent to be retrieved, and performing word segmentation processing on the title, the abstract, the claim and the specification of the patent to be retrieved respectively to obtain corresponding entry word vector characteristics, abstract word vector characteristics, claim sentence vector characteristics and specification sentence vector characteristics;
s105, combining the acquired entry word vector characteristics, abstract word vector characteristics, claim sentence vector characteristics and description sentence vector characteristics of the patent to be retrieved to acquire combined characteristics of the patent to be retrieved;
and S106, performing similarity calculation on the obtained combined features of the patents to be retrieved and the combined features of the patents stored in the patent library one by one.
In step S101, the obtaining of corresponding word vector features and sentence vector features by performing word segmentation processing on the title, abstract, claim, and specification of a patent in patent data using different models according to the embodiments of the present invention includes:
performing word segmentation processing on the title and the abstract, extracting keywords, and converting the extracted keywords into corresponding word vector characteristics by using a word vector model;
performing word segmentation processing on the claims and the specification, and converting the contents of the claims and the specification into corresponding sentence vector characteristics by using a sentence vector model.
The word vector model provided by the embodiment of the invention predicts the context words from the intermediate word and uses a gradient descent algorithm to minimize the difference between the predicted context words and the real context words; the word vectors obtained in this way accurately reflect the relationships between words.
The sentence vector model provided by the embodiment of the invention predicts the context words jointly from the paragraph vector and the center word vector. A window slides through the paragraph and the paragraph vector is carried along with the window; once the paragraph vector has become stable, it can represent the content of the text.
In step S104, according to the embodiment of the present invention, performing word segmentation on the title, abstract, claims, and specification of the patent to be retrieved to obtain the corresponding title word vector features, abstract word vector features, claim sentence vector features, and specification sentence vector features includes:
performing word segmentation on the title and abstract of the patent to be retrieved, extracting keywords, and converting the extracted keywords into corresponding word vector features using the trained word vector model;
and performing word segmentation on the claims and the specification of the patent to be retrieved, and converting them into corresponding sentence vector features using the trained sentence vector model.
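The keyword step for the title and abstract can be sketched as follows. The source does not say how keywords are selected or how several keyword vectors become one feature, so this sketch assumes term-frequency ranking and mean pooling; `extract_keywords`, `mean_vector`, and the `stopwords` parameter are hypothetical names.

```python
from collections import Counter

def extract_keywords(tokens, stopwords=frozenset(), top_k=5):
    """Rank tokens by frequency (assumed criterion) and keep the top_k."""
    counts = Counter(t for t in tokens if t not in stopwords)
    # Sort by descending count, then alphabetically, so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_k]]

def mean_vector(words, word_vectors, dim):
    """Average the vectors of the known words (assumed pooling); zeros if none are known."""
    known = [word_vectors[w] for w in words if w in word_vectors]
    if not known:
        return [0.0] * dim
    return [sum(v[d] for v in known) / len(known) for d in range(dim)]
```

A different keyword criterion, such as TF-IDF weighting, could be substituted without changing the rest of the flow.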
As shown in fig. 3, the text semantic similarity information processing system based on multi-model fusion provided by the embodiment of the present invention includes:
a data acquisition module for acquiring the data of the patent to be retrieved;
a text semantic extraction module for extracting, based on multiple models, the title word vector features, abstract word vector features, claim sentence vector features, and specification sentence vector features of the patent data;
a feature fusion module for fusing the extracted title word vector features, abstract word vector features, claim sentence vector features, and specification sentence vector features to obtain the combined features of the patent to be retrieved;
a similarity calculation module for calculating the similarity between the combined features of the patent to be retrieved and the combined features of the other patents stored in the database in advance;
and a database for storing the related patent data, the patent combined feature data, the trained word vector model, and the trained sentence vector model.
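As an illustration of how these modules might be wired together, the sketch below treats each module as a pluggable callable. The class name `SimilaritySystem`, its method names, and the use of an in-memory dictionary as the database are assumptions for illustration, not part of the described system.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Vector = List[float]
Patent = Dict[str, str]  # stands in for the output of the data acquisition module

@dataclass
class SimilaritySystem:
    extract: Callable[[Patent], List[Vector]]      # text semantic extraction module
    fuse: Callable[[List[Vector]], Vector]         # feature fusion module
    similarity: Callable[[Vector, Vector], float]  # similarity calculation module
    database: Dict[str, Vector] = field(default_factory=dict)  # stored combined features

    def index(self, patent_id: str, patent: Patent) -> None:
        """Compute and store the combined features of one library patent."""
        self.database[patent_id] = self.fuse(self.extract(patent))

    def query(self, patent: Patent) -> List[Tuple[str, float]]:
        """Compare the patent to be retrieved with every stored patent, best first."""
        combined = self.fuse(self.extract(patent))
        scores = [(pid, self.similarity(combined, vec)) for pid, vec in self.database.items()]
        return sorted(scores, key=lambda item: item[1], reverse=True)
```

Keeping the modules as separate callables means the extraction models or the similarity measure can be swapped without touching the stored features' interface.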
The technical solution of the present invention is further illustrated by the following specific examples.
Example 1:
For patent data in a patent library, different models are used to process the title, abstract, claims, and specification. Because the word frequency distribution, text length, and syntactic structure of these parts differ greatly, the four parts are processed separately with different models.
The title and the abstract are short, consist mostly of technical terms and their explanatory vocabulary, and are written in a plain style. They are therefore first segmented into words, then keywords are extracted, and the keywords are fed into a word vector model and converted into corresponding word vectors. The word vector model is used here because it is unsupervised: a window slides through the text and cuts out segments, and, as shown in fig. 4, the center word of each segment is used to predict its context words. The model needs no labeled data. Combined with a gradient descent algorithm, it minimizes the difference between the predicted and the actual context words, and the resulting word vectors accurately reflect the relationships between words. The model is therefore used to mine the shallow semantics of the title and abstract.
The claims and the specification are long, mostly between 3,000 and 10,000 characters, and contain many cross-references between sentences. After word segmentation, sentence vectors are therefore used to mine the deep semantics of these parts. The sentence vector model is a variant of the word vector model and is likewise unsupervised. It introduces a paragraph vector on top of the word vectors: unlike the word vector model, it uses the paragraph vector and the center word vector to jointly predict the context words. As the window slides through the paragraph, the paragraph vector is carried along with it, and once the paragraph vector stabilizes it can represent the text content.
The title word vector features, abstract word vector features, claim sentence vector features, and specification sentence vector features are fused into a combined feature vector for the document. To compute the similarity between two documents, only the similarity between their combined features needs to be calculated.
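The fusion operator and the similarity measure are not specified here. The sketch below assumes a weighted average of L2-normalized vectors, chosen because the results discussion states that fusion leaves the feature dimension unchanged, and cosine similarity as the comparison. It also assumes that the word vector and sentence vector features share one dimension. The names `fuse` and `cosine` are illustrative.

```python
import math

def fuse(parts, weights=None):
    """Weighted average of equal-length vectors after L2 normalization (assumed operator)."""
    weights = weights or [1.0] * len(parts)
    dim = len(parts[0])
    if any(len(part) != dim for part in parts):
        raise ValueError("all feature vectors must share one dimension")
    fused = [0.0] * dim
    for weight, part in zip(weights, parts):
        norm = math.sqrt(sum(x * x for x in part)) or 1.0
        for d in range(dim):
            fused[d] += weight * part[d] / norm
    total = sum(weights)
    return [x / total for x in fused]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

For two documents, `cosine(fuse(features_1), fuse(features_2))` then gives a single score, where each `features_i` is the list of title, abstract, claim, and specification vectors.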
In summary, the model architecture uses unsupervised learning models, namely a word vector model and a sentence vector model, which greatly reduces the need for labeled data, and the sentence vectors allow the deep semantic features of a document to be mined. The left part of the model can be computed offline for the patents in the patent database; at query time, only a one-by-one comparison with the cached combined feature vectors is needed, which greatly reduces the amount of real-time computation and speeds up the response.
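The offline and real-time split can be illustrated with a small sketch. It assumes cosine similarity, so that vectors normalized once offline can be compared at query time with a plain dot product; `build_cache`, `search`, and `top_k` are hypothetical names.

```python
import heapq
import math

def normalize(vec):
    """Scale a vector to unit length (all-zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def build_cache(combined_features):
    """Offline step: normalize every stored combined feature vector once."""
    return {pid: normalize(vec) for pid, vec in combined_features.items()}

def search(query_vec, cache, top_k=3):
    """Online step: one dot product per cached patent, keeping the top_k scores."""
    query = normalize(query_vec)
    scored = ((sum(a * b for a, b in zip(query, vec)), pid) for pid, vec in cache.items())
    return [(pid, score) for score, pid in heapq.nlargest(top_k, scored)]
```

The per-query cost is then linear in the number of cached patents and in the feature dimension, and no model inference is needed for the library side.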
The text semantic similarity information processing method based on multi-model fusion specifically comprises the following steps:
1. Segment the title, abstract, claims, and specification of each patent in the patent library into words.
2. Train the word vector model on the title and abstract data.
3. Train the sentence vector model on the claim and specification data.
4. Store the trained models.
5. Process each patent in the patent library separately, as follows.
6. Segment the title, abstract, claims, and specification into words.
7. Extract keywords from the title and the abstract.
8. Calculate the word vector features corresponding to the title keywords and the abstract keywords.
9. Calculate the sentence vector features of the claims and the specification.
10. Combine the title word vector features, abstract word vector features, claim sentence vector features, and specification sentence vector features.
11. Store the combined features offline.
12. Calculate the combined features of the patent to be checked.
13. Calculate the similarity between the combined features of the patent to be checked and each of the pre-calculated combined features in the patent library, one by one.
14. Select the required data according to the similarity results.
A neural network may be used in place of the sentence vector model.
A Chinese word vector model or another word vector variant may be used in place of the word vector calculation described herein.
The same model architecture may be applied to other data sources. With paper data, for example, the abstract feature vector is calculated using word vectors, the body text feature vector is calculated using sentence vectors, and the semantic similarity is calculated from the combination of the two.
Example 2
The text semantic similarity information processing method based on multi-model fusion comprises the following steps:
Step 1, perform word segmentation on the titles and abstracts of the papers in the paper database.
Step 2, train a word vector model on the titles and abstracts.
Step 3, split the full text of each paper into major chapters such as introduction, background, experiment, and comparison of results, and perform word segmentation on each chapter.
Step 4, train a sentence vector model on the word lists of the chapters obtained in step 3.
Step 5, store the word vector model obtained in step 2 and the sentence vector model obtained in step 4.
Step 6, extract features module by module from the papers in the local paper database using the word vector model and the sentence vector model.
Step 7, construct a feature fusion method and fuse the features obtained in step 6.
Step 8, store the extracted features together with the original text information.
Step 9, perform word segmentation on the title and abstract of the paper to be retrieved.
Step 10, convert the keyword information in the title and the abstract into vectors using the word vector model.
Step 11, split the full text of the paper to be retrieved by chapter and perform word segmentation.
Step 12, extract features using the sentence vector model.
Step 13, fuse the features using the feature fusion method adopted in step 7.
Step 14, calculate the similarity between the combined features of the paper to be retrieved and each of the pre-calculated combined features in the paper database, one by one.
Step 15, select the required data according to the similarity results.
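The chapter splitting in steps 3 and 11 might look like the sketch below. It assumes that each chapter begins with a line matching a known heading and uses whitespace splitting as a stand-in for word segmentation; Chinese text would need a dedicated word segmenter instead. The function name `split_chapters` and the heading set are hypothetical.

```python
def split_chapters(lines, headings):
    """Group the tokens of a paper by chapter, keyed by lower-cased heading.

    lines: the full text as a list of lines.
    headings: set of lower-cased chapter headings to split on (assumed known).
    """
    chapters = {}
    current = None
    for line in lines:
        stripped = line.strip()
        if stripped.lower() in headings:
            current = stripped.lower()
            chapters[current] = []
        elif current is not None and stripped:
            # Whitespace split stands in for a real word segmentation step.
            chapters[current].extend(stripped.split())
    return chapters
```

Each chapter's token list can then be passed to the sentence vector model, so that one paper yields one sentence vector feature per chapter.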
A neural network may be used in place of the sentence vector model.
A Chinese word vector model or another word vector variant may be used in place of the word vector calculation described herein.
The same model architecture may be applied to other data sources. With paper data, for example, the abstract feature vector is calculated using word vectors, the body text feature vector is calculated using sentence vectors, and the semantic similarity is calculated from the combination of the two.
The invention is further described below by comparing the experimental results of the embodiments.
Operating system: CentOS 7.
Hardware: 1 machine with 8 cores, 16 threads, and 128 GB of memory.
Data set: 1,000,000 Chinese patents.
Figure BDA0002604929210000141
Figure BDA0002604929210000151
Comparing the experimental control groups shows that the choice of word segmentation method has a large influence on feature extraction; on balance, word segmentation algorithm 2 is selected.
Model fusion evens out the detection rates across X, Y, and A documents, avoiding cases where the detection rate for X documents is far higher than that for A documents, or the reverse. The detection rates for X, Y, and A documents can be loosely understood as reflecting the model's ability to extract deep and shallow semantic features, and model fusion combines deep and shallow semantics in a balanced way.
Using feature fusion does not change the feature dimension and does not increase the detection time.
Because feature fusion takes both deep and shallow semantics into account, its detection rate is much higher than that of a single model.
Comparing different feature dimensions and the corresponding detection times shows that the detection rate improves as the feature dimension increases, at the cost of longer detection time; the feature dimension should therefore be chosen by weighing quality against speed according to the service requirements.
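The text does not define how the detection rate is measured. As one possible reading, the sketch below assumes that X, Y, and A are the relevance categories assigned to cited documents in a search report, and that the detection rate of a category is the share of cited documents in that category that appear among the top results. The function name `detection_rates`, the data layout, and the cutoff `top_k` are all assumptions.

```python
def detection_rates(results, citations, top_k=10):
    """Per-category share of cited documents found in the top_k results.

    results: query id -> ranked list of retrieved patent ids.
    citations: query id -> {cited patent id: category "X", "Y", or "A"}.
    """
    found = {"X": 0, "Y": 0, "A": 0}
    total = {"X": 0, "Y": 0, "A": 0}
    for query_id, cited in citations.items():
        top = set(results.get(query_id, [])[:top_k])
        for patent_id, category in cited.items():
            total[category] += 1
            if patent_id in top:
                found[category] += 1
    return {c: (found[c] / total[c] if total[c] else 0.0) for c in total}
```

Under this reading, a balanced model is one whose three rates are close to each other rather than one that favors a single category.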
In the description of the present invention, "a plurality" means two or more unless otherwise specified; the terms "upper", "lower", "left", "right", "inner", "outer", "front", "rear", "head", "tail", and the like, indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, are only for convenience in describing and simplifying the description, and do not indicate or imply that the device or element referred to must have a particular orientation, be constructed in a particular orientation, and be operated, and thus, should not be construed as limiting the invention. Furthermore, the terms "first," "second," "third," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
The above description covers only specific embodiments of the present invention and does not limit its scope of protection; any modification, equivalent, or improvement made within the spirit and principles of the invention falls within the scope defined by the appended claims.

Claims (9)

1. A text semantic similarity information processing method based on multi-model fusion, characterized by comprising the following steps:
acquiring patent data from a patent library, and performing word segmentation on the title, abstract, claims, and specification of each patent in the patent data using different models, to obtain corresponding word vector features and sentence vector features;
fusing the word vector features of the title, the word vector features of the abstract, the sentence vector features of the claims, and the sentence vector features of the specification into a combined feature vector of the patent; and, at the same time, training a sentence vector model on the claim and specification data and training a word vector model on the title and abstract data;
storing the obtained combined features of each patent in the patent library together with the trained word vector model and sentence vector model;
collecting data of a patent to be retrieved, and performing word segmentation on the title, abstract, claims, and specification of the patent to be retrieved to obtain corresponding title word vector features, abstract word vector features, claim sentence vector features, and specification sentence vector features;
combining the obtained title word vector features, abstract word vector features, claim sentence vector features, and specification sentence vector features of the patent to be retrieved to obtain the combined features of the patent to be retrieved;
and calculating, one by one, the similarity between the combined features of the patent to be retrieved and the combined features of each patent stored in the patent library.
2. The text semantic similarity information processing method based on multi-model fusion according to claim 1, wherein performing word segmentation on the title, abstract, claims, and specification of each patent in the patent data using different models to obtain corresponding word vector features and sentence vector features comprises:
performing word segmentation on the title and the abstract, extracting keywords, and converting the extracted keywords into corresponding word vector features using a word vector model;
performing word segmentation on the claims and the specification, and converting their contents into corresponding sentence vector features using a sentence vector model.
3. The method as claimed in claim 1, wherein the word vector model predicts context words from the center word and uses a gradient descent algorithm to minimize the difference between the predicted context words and the actual context words, and the word vectors obtained from the word vector model accurately reflect the relationships between words.
4. The method as claimed in claim 1, wherein the sentence vector model uses a paragraph vector and the center word vector to jointly predict the context words; a window slides through the paragraph, the paragraph vector is carried along with the window, and once the paragraph vector stabilizes it represents the text content.
5. The text semantic similarity information processing method based on multi-model fusion according to claim 1, wherein performing word segmentation on the title, abstract, claims, and specification of the patent to be retrieved to obtain corresponding title word vector features, abstract word vector features, claim sentence vector features, and specification sentence vector features comprises:
performing word segmentation on the title and abstract of the patent to be retrieved, extracting keywords, and converting the extracted keywords into corresponding word vector features using the trained word vector model;
and performing word segmentation on the claims and the specification of the patent to be retrieved, and converting them into corresponding sentence vector features using the trained sentence vector model.
6. A text semantic similarity information processing system based on multi-model fusion for implementing the text semantic similarity information processing method based on multi-model fusion according to any one of claims 1 to 5, wherein the system comprises:
a data acquisition module for acquiring the data of the patent to be retrieved;
a text semantic extraction module for extracting, based on multiple models, the title word vector features, abstract word vector features, claim sentence vector features, and specification sentence vector features of the patent data;
a feature fusion module for fusing the extracted title word vector features, abstract word vector features, claim sentence vector features, and specification sentence vector features to obtain the combined features of the patent to be retrieved;
a similarity calculation module for calculating the similarity between the combined features of the patent to be retrieved and the combined features of the other patents stored in the database in advance;
and a database for storing the related patent data, the patent combined feature data, the trained word vector model, and the trained sentence vector model.
7. A computer device, characterized in that the computer device comprises a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to carry out the steps of:
acquiring patent data from a patent library, and performing word segmentation on the title, abstract, claims, and specification of each patent in the patent data using different models, to obtain corresponding word vector features and sentence vector features;
fusing the word vector features of the title, the word vector features of the abstract, the sentence vector features of the claims, and the sentence vector features of the specification into a combined feature vector of the patent; and, at the same time, training a sentence vector model on the claim and specification data and training a word vector model on the title and abstract data;
storing the obtained combined features of each patent in the patent library together with the trained word vector model and sentence vector model;
collecting data of a patent to be retrieved, and performing word segmentation on the title, abstract, claims, and specification of the patent to be retrieved to obtain corresponding title word vector features, abstract word vector features, claim sentence vector features, and specification sentence vector features;
combining the obtained title word vector features, abstract word vector features, claim sentence vector features, and specification sentence vector features of the patent to be retrieved to obtain the combined features of the patent to be retrieved;
and calculating, one by one, the similarity between the combined features of the patent to be retrieved and the combined features of each patent stored in the patent library.
8. A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of:
acquiring patent data from a patent library, and performing word segmentation on the title, abstract, claims, and specification of each patent in the patent data using different models, to obtain corresponding word vector features and sentence vector features;
fusing the word vector features of the title, the word vector features of the abstract, the sentence vector features of the claims, and the sentence vector features of the specification into a combined feature vector of the patent; and, at the same time, training a sentence vector model on the claim and specification data and training a word vector model on the title and abstract data;
storing the obtained combined features of each patent in the patent library together with the trained word vector model and sentence vector model;
collecting data of a patent to be retrieved, and performing word segmentation on the title, abstract, claims, and specification of the patent to be retrieved to obtain corresponding title word vector features, abstract word vector features, claim sentence vector features, and specification sentence vector features;
combining the obtained title word vector features, abstract word vector features, claim sentence vector features, and specification sentence vector features of the patent to be retrieved to obtain the combined features of the patent to be retrieved;
and calculating, one by one, the similarity between the combined features of the patent to be retrieved and the combined features of each patent stored in the patent library.
9. A retrieval and duplicate checking terminal for implementing the text semantic similarity information processing method based on multi-model fusion according to any one of claims 1 to 5.
CN202010735606.5A 2020-07-28 2020-07-28 Text semantic similarity information processing method and system based on multi-model fusion Pending CN112036177A (en)

Publications (1)

Publication Number Publication Date
CN112036177A (en) 2020-12-04

Family

ID=73583308

Country Status (1)

Country Link
CN (1) CN112036177A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination