CN115759068A - Self-learning-based scene text matching method and system - Google Patents

Publication number
CN115759068A
Authority
CN
China
Prior art keywords
scene
text
vector
self
corpus data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211524896.4A
Other languages
Chinese (zh)
Inventor
周婷婷
焦旭
徐圣源
梁变
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab
Publication of CN115759068A

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a self-learning-based scene text matching method and system. A pre-training word vector data set is selected, and scene corpus data is converted into scene word vectors corresponding to that data set. A threshold on the number of scene corpus samples is set by the user; while the scene corpus data is below this threshold, it is treated as a small sample set and input into an unsupervised learning model, which converts it into corresponding first scene text vectors. Once the accumulated scene corpus data exceeds the threshold, it is input into a supervised learning model, which converts it into corresponding second scene text vectors. The text similarity of the first scene text vector, the second scene text vector, and the text to be matched is calculated and ranked, and the text matching result is corrected to obtain text matching pairs. Based on the text matching pairs, the unsupervised and supervised learning models are optimized and the way the text similarity is calculated is corrected.

Description

Self-learning-based scene text matching method and system
Technical Field
The invention belongs to the technical field of semantic text matching, and particularly relates to a scene text matching method and system based on self-learning.
Background
In the prior art, text matching typically extracts text features from two pieces of text and then decides whether they match by computing a similarity between the extracted feature vectors. A deep learning model built on supervised learning must train a large number of parameters on a large text corpus to reach suitable parameter values and thereby improve its inference results in a specific scene. In practical application scenarios, however, corpus data is commonly scarce.
Where corpus data is scarce, a large labeled corpus can usually be obtained through manual annotation, but large-scale, tedious, and repetitive labeling work and its heavy labor demand raise the cost of model training, so that Natural Language Processing (NLP) technology faces a high entry threshold in real-world applications. There is therefore a need for a system solution that is acceptable to users and allows NLP technology to be deployed on the basis of only a small amount of corpus data.
In the present method, an unsupervised model gives the user text matching recommendations on the basis of a small amount of corpus data. Combined with the feedback the user returns to the system, the corpus data of the text matching library accumulates continuously, so that a large-scale corpus is eventually built up while the models are iteratively updated and system performance improves.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a self-learning-based scene text matching method and a self-learning-based scene text matching system.
In order to achieve the purpose, the technical scheme of the invention is as follows: the embodiment of the invention provides a self-learning-based scene text matching method in a first aspect, which comprises the following substeps:
selecting a pre-training word vector data set, and converting scene corpus data into scene word vectors corresponding to the pre-training word vector data set;
setting a threshold value of the number of scene corpus samples in a self-defined manner, and when the scene corpus data is smaller than the threshold value of the number of the scene corpus samples, taking the scene corpus data as a small amount of samples, inputting the small amount of samples into an unsupervised learning model, and converting scene word vectors into corresponding first scene text vectors;
after the scene corpus data accumulation exceeds the set scene corpus sample quantity threshold value, inputting the scene corpus data accumulation into a supervised learning model to convert the scene word vector into a corresponding second scene text vector;
calculating and sequencing the text similarity of the first scene text vector, the second scene text vector and the text to be matched, and correcting the text matching result to obtain a text matching pair;
and according to the text matching pair, optimizing the unsupervised learning model and the supervised learning model, and correcting the calculation mode of the text similarity.
The second aspect of the embodiments of the present invention provides a self-learning based scene text matching system, which is used to implement the above-mentioned scene text matching method, and the system includes:
the pre-training word vector generation module is used for selecting a pre-training word vector data set and converting scene corpus data into a scene word vector corresponding to the pre-training word vector data set;
the unsupervised learning module is used for converting the scene word vectors into corresponding first scene text vectors;
the supervised learning module is used for converting the scene word vector into a corresponding second scene text vector;
the text similarity calculation module is used for calculating the text similarity of the first scene text vector, the second scene text vector and the text to be matched and sequencing the text similarity;
and the human-computer interaction module is used for acquiring a corrected text matching result.
A third aspect of an embodiment of the present invention provides an electronic device, including a memory and a processor, where the memory is coupled to the processor; the memory is used for storing program data, and the processor is used for executing the program data to realize the self-learning-based scene text matching method.
A fourth aspect of the embodiments of the present invention provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the above-mentioned self-learning based scene text matching method.
Compared with the prior art, the invention has the following beneficial effects:
(1) The method first sets a user-defined threshold on the number of scene corpus samples. While the scene corpus data is below this threshold, it is treated as a small sample set; with only these few samples available, a pre-trained word vector data set is selected as a general-purpose corpus, word vectors are obtained from it, and an unsupervised learning model is applied, which keeps the start-up cost low;
(2) Based on the text matching results produced by the unsupervised and supervised learning models and on the client's corrections to those results, the method automatically extracts matching pairs to build a text matching library, thereby accumulating text corpus data;
(3) Based on the data accumulated in the text matching library, the unsupervised and supervised learning models are corrected and optimized, and the weights with which their similarities are fused are adjusted according to each model's misjudgment rate on newly added test corpus data, which improves the accuracy of the text matching results provided to the user and gives the method a self-learning capability.
Drawings
FIG. 1 is a flow chart of a method embodying the present invention.
Fig. 2 is a schematic diagram of the system module of the present invention.
Detailed Description
The embodiments described here are illustrative and are not to be construed as limiting the invention, whose claimed scope includes but is not limited to these specific embodiments. Any self-learning-based scene text matching method or system consistent with the claims of the invention falls within its scope of protection, and a person skilled in the art may, within that scope, select different pre-training word vectors and change, replace, or modify the word segmentation algorithm to suit different business scenes.
As shown in fig. 1, an embodiment of the present invention provides a self-learning based scene text matching method and system, where the method specifically includes the following steps:
(1) Select a pre-training word vector data set, and convert the scene corpus data into scene word vectors corresponding to the pre-training word vector data set.
Step (1) specifically comprises the following substeps:
(1.1) in this example, the Chinese word segmentation tool jieba is used to segment the scene corpus data into words;
(1.2) stop words are removed from the segmentation result of step (1.1) using the Harbin Institute of Technology (HIT) stop word list, yielding the segmented words w;
(1.3) in this example, the Chinese pre-training word vector data set "tencent-ailab-embedding-zh-d100-s" is selected and used to convert each segmented word w into the corresponding scene word vector $v_w$ (where w ∈ V, and V is the scene corpus data).
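Steps (1.1) to (1.3) can be sketched as a three-stage pipeline. The snippet below is only an illustration under stated assumptions: a whitespace split stands in for the jieba segmenter, and `STOP_WORDS` and `WORD_VECTORS` are tiny hypothetical tables standing in for the real stop word list and the pre-trained word vector data set.

```python
from typing import Dict, List, Sequence

# Hypothetical stand-ins: a real pipeline would load the stop word list and
# the pre-trained word vectors from files.
STOP_WORDS = {"the", "a", "of"}
WORD_VECTORS: Dict[str, List[float]] = {
    "reset": [0.9, 0.1],
    "password": [0.8, 0.3],
    "invoice": [0.1, 0.9],
}


def segment(text: str) -> List[str]:
    """Step (1.1): placeholder segmenter (whitespace split, not jieba)."""
    return text.lower().split()


def remove_stop_words(tokens: Sequence[str]) -> List[str]:
    """Step (1.2): drop every token that appears in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def to_word_vectors(tokens: Sequence[str]) -> List[List[float]]:
    """Step (1.3): look up each token; out-of-vocabulary tokens are skipped."""
    return [WORD_VECTORS[t] for t in tokens if t in WORD_VECTORS]


tokens = remove_stop_words(segment("Reset the password of a user"))
vectors = to_word_vectors(tokens)
```

Skipping out-of-vocabulary tokens is one possible policy; the source does not say how such tokens are handled.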
(2) Set a user-defined threshold on the number of scene corpus samples; in this example the threshold is 10,000. While the amount of scene corpus data is below the threshold, the scene corpus data is treated as a small sample set and input into the unsupervised learning model, which converts the scene word vectors into corresponding first scene text vectors.
(2.1) calculate the frequency p(w) with which each segmented word w appears in the small sample set, where $n_w$ is the number of occurrences of w:
$$p(w) = \frac{n_w}{\sum_{w' \in V} n_{w'}}$$
(2.2) calculate the scene text vector $v_s$ from the scene word vectors:
$$v_s = \frac{1}{|s|} \sum_{w \in s} \frac{a}{a + p(w)} \, v_w$$
where |s| is the number of words in the sentence s, and a is a hyperparameter.
(2.3) the text vectors $v_s$ of all texts s in the scene corpus data S are assembled into a matrix X, and Principal Component Analysis (PCA) is performed on X to obtain the maximum principal component vector u.
(2.4) scene corpus data TS is acquired again, and for each text ts in TS its text vector $v_{ts}$ is calculated:
$$v_{ts} = \frac{1}{|ts|} \sum_{w \in ts} \frac{a}{a + p(w)} \, v_w$$
(2.5) the maximum principal component obtained from the sample analysis is removed from the $v_{ts}$ output by step (2.4), and the result serves as the first scene text vector of the text ts:
$$v_{ts} \leftarrow v_{ts} - u u^{T} v_{ts}$$
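Steps (2.1) to (2.5) can be sketched in plain Python. This is a minimal illustration, not the patent's implementation: it assumes the word weight takes the smooth-inverse-frequency form a / (a + p(w)) suggested by the definitions of |s|, a, and p(w), it finds the dominant direction by power iteration on the uncentered matrix X (centering X first would be the strict PCA variant), and it takes |s| as the number of in-vocabulary words. All function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence

Vector = List[float]


def word_frequencies(corpus: Sequence[Sequence[str]]) -> Dict[str, float]:
    """Step (2.1): relative frequency p(w) of each token over the corpus."""
    counts = Counter(tok for sent in corpus for tok in sent)
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}


def weighted_average(sent: Sequence[str], vecs: Dict[str, Vector],
                     p: Dict[str, float], a: float = 1e-3) -> Vector:
    """Steps (2.2)/(2.4): mean of word vectors weighted by a / (a + p(w))."""
    dim = len(next(iter(vecs.values())))
    out = [0.0] * dim
    known = [tok for tok in sent if tok in vecs]  # skip unknown tokens
    for tok in known:
        weight = a / (a + p.get(tok, 0.0))
        for i, x in enumerate(vecs[tok]):
            out[i] += weight * x
    return [x / len(known) for x in out] if known else out


def first_principal_direction(rows: Sequence[Vector], iters: int = 200) -> Vector:
    """Step (2.3): dominant direction of X by power iteration on X^T X."""
    dim = len(rows[0])
    u = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iters):
        proj = [sum(r[i] * u[i] for i in range(dim)) for r in rows]  # X u
        nxt = [sum(c * r[i] for c, r in zip(proj, rows)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        u = [x / norm for x in nxt]
    return u


def remove_component(v: Vector, u: Vector) -> Vector:
    """Step (2.5): v - u u^T v for a unit vector u."""
    dot = sum(x * y for x, y in zip(v, u))
    return [x - dot * y for x, y in zip(v, u)]


vecs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
corpus = [["a", "c"], ["b", "c"], ["a", "b"]]
p = word_frequencies(corpus)
rows = [weighted_average(s, vecs, p) for s in corpus]
u = first_principal_direction(rows)
first_scene_vectors = [remove_component(r, u) for r in rows]
```

After the projection, every resulting vector is orthogonal to u, which is the property step (2.5) relies on. Power iteration from a uniform start can stall if the dominant direction happens to be orthogonal to that start; a production version would use a library SVD instead.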
(3) Continue acquiring scene corpus data and input it into the supervised learning model, which converts the scene word vectors into corresponding second scene text vectors.
With the supervised learning algorithm, after the accumulated scene corpus data exceeds the set threshold on the number of scene corpus samples, the word vectors $v_w$ of the words in a text are used as input to the supervised learning model, which outputs the second scene text vector. Supervised learning models that may be used include, without limitation, ESIM, BiMPM, DIIN, CAFE, ELMo, GPT, and BERT.
(4) Calculate and rank the text similarity of the first scene text vector, the second scene text vector, and the text to be matched; output the text with the highest similarity to the client; and obtain the corrected text matching result from the client, yielding a text matching pair. The text matching pairs are stored to build a text matching library.
The first scene text vector output by the unsupervised learning model is denoted $v_{s\_unsupervised}$; the second scene text vector output by the supervised learning model is denoted $v_{s\_supervised}$.
The text similarity between the first scene text vector, the second scene text vector, and the text vectors in the text matching library is measured in turn, using the cosine distance as the metric:
$$\cos_{\mathrm{unsupervised}} = \frac{v_{s\_unsupervised} \cdot v_{offline\_unsupervised}}{\lVert v_{s\_unsupervised} \rVert \, \lVert v_{offline\_unsupervised} \rVert}$$
$$\cos_{\mathrm{supervised}} = \frac{v_{s\_supervised} \cdot v_{offline\_supervised}}{\lVert v_{s\_supervised} \rVert \, \lVert v_{offline\_supervised} \rVert}$$
where $\cos_{\mathrm{unsupervised}}$ is the cosine similarity calculated by the unsupervised model, $v_{offline\_unsupervised}$ is the preselected text vector to be matched for the unsupervised model, $\cos_{\mathrm{supervised}}$ is the cosine similarity calculated by the supervised model, and $v_{offline\_supervised}$ is the preselected text vector to be matched for the supervised model.
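The cosine measurement and ranking against the text matching library can be sketched as below. The dictionary layout of the library and the function names are assumptions made for illustration; the source only states that cosine similarity is computed and the results are ranked.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product divided by the product of norms; 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(query: Sequence[float],
                    library: Dict[str, Sequence[float]]) -> List[Tuple[str, float]]:
    """Score every library text against the query, highest similarity first."""
    scored = [(text, cosine_similarity(query, vec)) for text, vec in library.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical library of preselected text vectors.
library = {
    "reset password": [1.0, 0.0],
    "pay invoice": [0.0, 1.0],
    "change password": [1.0, 1.0],
}
ranking = rank_candidates([1.0, 0.1], library)
best_text = ranking[0][0]
```

The top-ranked text is what would be shown to the client for confirmation or correction.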
When only a small number of scene samples is available initially, only the text vector output by the unsupervised learning model is used, and the text similarity is calculated with the cosine distance. After the accumulated scene samples exceed the set threshold on the number of scene corpus samples, the text vectors output by both the supervised and the unsupervised learning model are used, together with the small amount of initial sample data, and a text similarity is calculated for each with the preselected distance. The similarities of the unsupervised and supervised models are then weighted with the weights $\alpha_{unsupervised}$ and $\alpha_{supervised}$ respectively (before the accumulated scene samples trigger any update, $\alpha_{unsupervised} = 1$ and $\alpha_{supervised} = 0$), and the text similarity calculation formula is as follows:
$$\mathrm{sim} = \alpha_{supervised} \cdot \cos_{\mathrm{supervised}} + \alpha_{unsupervised} \cdot \cos_{\mathrm{unsupervised}}$$
where $\cos_{\mathrm{supervised}}$ is the cosine similarity calculated by the supervised model, and $\cos_{\mathrm{unsupervised}}$ is the cosine similarity calculated by the unsupervised model.
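The weighted fusion and the threshold switch can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, and a sample count exactly equal to the threshold is treated as still being in the cold-start phase, which the source leaves unspecified.

```python
from typing import Tuple


def select_weights(n_samples: int, threshold: int,
                   learned: Tuple[float, float]) -> Tuple[float, float]:
    """Return (alpha_unsupervised, alpha_supervised) for the current phase."""
    if n_samples <= threshold:
        # Cold start: only the unsupervised similarity is used.
        return 1.0, 0.0
    return learned


def fused_similarity(cos_unsup: float, cos_sup: float,
                     alpha_unsup: float, alpha_sup: float) -> float:
    """Weighted sum of the two cosine similarities."""
    return alpha_unsup * cos_unsup + alpha_sup * cos_sup


# Below the threshold of 10,000 samples the supervised score is ignored.
cold = fused_similarity(0.8, 0.6, *select_weights(500, 10_000, (0.4, 0.6)))
# Above it, the learned weights (hypothetical values here) are applied.
warm = fused_similarity(0.8, 0.6, *select_weights(20_000, 10_000, (0.4, 0.6)))
```

In the cold-start phase the fused score reduces to the unsupervised cosine similarity, matching the initial setting of 1 and 0 for the two weights.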
(5) Iteratively optimize the unsupervised learning model and the supervised learning model according to the text matching pairs in the text matching library.
Based on the established and maintained text matching library, the frequency p(w) of each segmented word w, the text vectors $v_s$ of the library, and the maximum principal component vector u are updated; the text vectors of texts subsequently input into the system are calculated with the updated u.
Based on the established and maintained text matching library, the incremental text matching data is input into the supervised learning model (including, without limitation, ESIM, BiMPM, DIIN, CAFE, ELMo, GPT, or BERT), so that the supervised model keeps optimizing its parameters on the corpus of the specific scene.
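The text matching library that drives this iterative update can be sketched as a small store that records confirmed pairs and recomputes word frequencies from everything accumulated so far. The class and method names are hypothetical, and a whitespace split stands in for the real word segmenter.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class MatchLibrary:
    """Hypothetical store of confirmed (query, matched text) pairs."""
    pairs: List[Tuple[str, str]] = field(default_factory=list)
    counts: Counter = field(default_factory=Counter)

    def record(self, query: str, recommended: str,
               corrected: Optional[str] = None) -> bool:
        """Store the pair the user confirmed; True if the recommendation stood."""
        final = recommended if corrected is None else corrected
        self.pairs.append((query, final))
        # Whitespace split is a stand-in for the real segmenter (assumption).
        self.counts.update(query.split())
        return final == recommended

    def word_frequencies(self) -> Dict[str, float]:
        """Recompute p(w) from all queries accumulated so far."""
        total = sum(self.counts.values())
        return {w: c / total for w, c in self.counts.items()} if total else {}


library = MatchLibrary()
accepted = library.record("reset my password", "password reset guide")
rejected = library.record("my invoice is wrong", "password reset guide",
                          corrected="billing dispute form")
p = library.word_frequencies()
```

The stored pairs would serve as incremental training data for the supervised model, while the refreshed frequencies feed the recomputation of the text vectors and of u.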
(6) Correct the calculation formula of the text similarity according to the text matching corrections fed back by the client, so as to gradually improve the accuracy of the unsupervised learning model and the supervised learning model.
(6.1) based on the newly added data in the text matching library, the matching results output by the unsupervised model and the supervised model are compared with the text matching results fed back by the client, and each model's misjudgment rate is recorded: $e_{unsupervised}$ for the unsupervised model and $e_{supervised}$ for the supervised model.
(6.2) the parameters $\alpha_{supervised}$ and $\alpha_{unsupervised}$ of the text similarity formula are updated as follows:
$$\alpha_{supervised} = \frac{e_{unsupervised}}{e_{unsupervised} + e_{supervised}}, \qquad \alpha_{unsupervised} = \frac{e_{supervised}}{e_{unsupervised} + e_{supervised}}$$
That is, the first weight $\alpha_{supervised}$, corresponding to the supervised model, is updated to the ratio of the unsupervised model's misjudgment rate to the total misjudgment rate, and the second weight $\alpha_{unsupervised}$, corresponding to the unsupervised model, is updated to the ratio of the supervised model's misjudgment rate to the total misjudgment rate. The model that errs less therefore receives the larger weight.
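The weight update in step (6) can be sketched as below. The function names are hypothetical, and the equal split returned when both models have a zero misjudgment rate is an assumption, since the source does not cover that case.

```python
from typing import Sequence, Tuple


def misjudgment_rate(predicted: Sequence[str], confirmed: Sequence[str]) -> float:
    """Fraction of recommendations that the client had to correct."""
    if not confirmed:
        raise ValueError("no confirmed matches to compare against")
    wrong = sum(p != c for p, c in zip(predicted, confirmed))
    return wrong / len(confirmed)


def update_weights(e_unsup: float, e_sup: float) -> Tuple[float, float]:
    """Return (alpha_unsupervised, alpha_supervised) from the two error rates."""
    total = e_unsup + e_sup
    if total == 0.0:
        return 0.5, 0.5  # assumption: split evenly when neither model erred
    alpha_sup = e_unsup / total    # supervised weight grows with unsupervised errors
    alpha_unsup = e_sup / total    # unsupervised weight grows with supervised errors
    return alpha_unsup, alpha_sup


confirmed = ["A", "B", "C", "D"]
e_unsup = misjudgment_rate(["A", "X", "Y", "D"], confirmed)  # two of four wrong
e_sup = misjudgment_rate(["A", "B", "C", "Z"], confirmed)    # one of four wrong
alpha_unsup, alpha_sup = update_weights(e_unsup, e_sup)
```

Because each weight is the other model's share of the total error, the two weights always sum to 1 and the more accurate model dominates the fused similarity.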
(7) The method further comprises: acquiring scene corpus data as a test sample and inputting it into the iteratively optimized unsupervised learning model and supervised learning model to obtain a matched text result.
On the other hand, the invention also provides a self-learning-based scene text matching system, which is used for realizing the self-learning-based scene text matching method, and the system comprises the following steps:
the pre-training word vector generation module is used for selecting a pre-training word vector data set and converting the scene corpus data into a scene word vector corresponding to the pre-training word vector data set;
the unsupervised learning module is used for converting the scene word vector into a corresponding first scene text vector;
the supervised learning module is used for converting the scene word vector into a corresponding second scene text vector;
the text similarity calculation module is used for calculating the text similarity of the first scene text vector, the second scene text vector and the text to be matched and sequencing the text similarity;
and the human-computer interaction module is used for acquiring a corrected text matching result.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above embodiments are only used for illustrating the design idea and features of the present invention, and the purpose of the present invention is to enable those skilled in the art to understand the content of the present invention and implement the present invention accordingly, and the protection scope of the present invention is not limited to the above embodiments. Therefore, all equivalent changes and modifications made in accordance with the principles and concepts disclosed herein are intended to be included within the scope of the present invention.

Claims (10)

1. A self-learning based scene text matching method is characterized by comprising the following sub-steps:
selecting a pre-training word vector data set, and converting scene corpus data into scene word vectors corresponding to the pre-training word vector data set;
setting a threshold value of the number of scene corpus samples in a self-defined manner, and when the scene corpus data is smaller than the threshold value of the number of the scene corpus samples, taking the scene corpus data as a small amount of samples, inputting the small amount of samples into an unsupervised learning model, and converting scene word vectors into corresponding first scene text vectors;
after the scene corpus data accumulation exceeds the set scene corpus sample quantity threshold value, inputting the scene corpus data accumulation into a supervised learning model to convert the scene word vector into a corresponding second scene text vector;
calculating and sequencing the text similarity of the first scene text vector, the second scene text vector and the text to be matched, and correcting the text matching result to obtain a text matching pair;
and according to the text matching pair, optimizing the unsupervised learning model and the supervised learning model, and correcting the calculation mode of the text similarity.
2. The self-learning based scene text matching method of claim 1, wherein the process of selecting a pre-training word vector data set and converting the scene corpus data into a scene word vector corresponding to the pre-training word vector data set comprises: and performing word segmentation on the scene corpus data, performing stop word elimination on word segmentation results, and converting the processed word segmentation into corresponding scene word vectors in a pre-training word vector data set.
3. The self-learning based scene text matching method of claim 1, wherein the process of converting the scene word vector into the corresponding first scene text vector comprises:
calculating the frequency of each segmented word appearing in the small number of samples;
calculating a scene text vector from the scene word vectors, wherein the formula is as follows:
$$v_s = \frac{1}{|s|} \sum_{w \in s} \frac{a}{a + p(w)} \, v_w$$
wherein |s| is the number of words in the sentence s, a is a hyperparameter, and p(w) is the frequency of the segmented word w in the small number of samples;
assembling the text vectors $v_s$ of all texts s in the scene corpus data S into a matrix X, and performing principal component analysis to obtain a maximum principal component vector u;
re-acquiring scene corpus data TS, and calculating, for each text ts in the scene corpus data TS, its text vector $v_{ts}$, wherein the calculation formula is as follows:
$$v_{ts} = \frac{1}{|ts|} \sum_{w \in ts} \frac{a}{a + p(w)} \, v_w$$
removing the maximum principal component obtained from the sample analysis from $v_{ts}$, the result serving as the text vector of the text ts, wherein the calculation formula is as follows:
$$v_{ts} \leftarrow v_{ts} - u u^{T} v_{ts}$$
4. The self-learning based scene text matching method according to claim 1, wherein the supervised learning model is selected from ESIM, BiMPM, DIIN, CAFE, ELMo, GPT or BERT.
5. The self-learning based scene text matching method of claim 1, wherein the process of calculating the text similarity of the first scene text vector, the second scene text vector and the text to be matched comprises: calculating, based on the cosine similarity, the text similarity of the first scene text vector and of the second scene text vector with the text to be matched, to obtain a first text similarity and a second text similarity; setting a first weight and a second weight, taking the first weight multiplied by the first text similarity as a first item and the second weight multiplied by the second text similarity as a second item, and adding the two items to obtain the text similarity.
6. The self-learning based scene text matching method according to claim 5, wherein the process of correcting the text matching result comprises:
comparing the matching text results output by the unsupervised model and the supervised model with the corrected text matching results respectively, and counting the unsupervised model misjudgment rate and the supervised model misjudgment rate;
the first weight is updated as: the ratio of the unsupervised model misjudgment rate to the total misjudgment rate;
the second weight is updated as: the ratio of the supervised model misjudgment rate to the total misjudgment rate.
7. The self-learning based scene text matching method according to claim 1, further comprising: re-acquiring scene corpus data as a test sample, and inputting the scene corpus data into the optimized unsupervised learning model and the optimized supervised learning model to obtain a matched text result.
8. A self-learning based scene text matching system for implementing the self-learning based scene text matching method of any one of claims 1 to 7, the system comprising:
the pre-training word vector generation module is used for selecting a pre-training word vector data set and converting scene corpus data into a scene word vector corresponding to the pre-training word vector data set;
the unsupervised learning module is used for converting the scene word vectors into corresponding first scene text vectors;
the supervised learning module is used for converting the scene word vector into a corresponding second scene text vector;
the text similarity calculation module is used for calculating and sequencing the text similarity of the first scene text vector, the second scene text vector and the text to be matched;
and the man-machine interaction module is used for acquiring a corrected text matching result.
9. An electronic device comprising a memory and a processor, wherein the memory is coupled to the processor; wherein the memory is used for storing program data, and the processor is used for executing the program data to realize the self-learning based scene text matching method of any one of the above claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out a self-learning based scene text matching method according to any one of claims 1 to 7.
CN202211524896.4A 2022-11-28 2022-11-30 Self-learning-based scene text matching method and system Pending CN115759068A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211505004 2022-11-28
CN2022115050046 2022-11-28

Publications (1)

Publication Number Publication Date
CN115759068A true CN115759068A (en) 2023-03-07

Family

ID=85341632

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211524896.4A Pending CN115759068A (en) 2022-11-28 2022-11-30 Self-learning-based scene text matching method and system

Country Status (1)

Country Link
CN (1) CN115759068A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination