CN115759068A

CN115759068A - Self-learning-based scene text matching method and system

Info

Publication number: CN115759068A
Application number: CN202211524896.4A
Authority: CN
Inventors: 周婷婷; 焦旭; 徐圣源; 梁变
Original assignee: Zhejiang Lab
Current assignee: Zhejiang Lab
Priority date: 2022-11-28
Filing date: 2022-11-30
Publication date: 2023-03-07

Abstract

The invention discloses a self-learning-based scene text matching method and system. A pre-training word vector data set is selected, and scene corpus data is converted into scene word vectors drawn from that data set. A threshold on the number of scene corpus samples is set by the user. While the amount of scene corpus data is below the threshold, the data is treated as a small sample and input into an unsupervised learning model, which converts it into corresponding first scene text vectors. Once the accumulated scene corpus data exceeds the threshold, it is input into a supervised learning model, which converts it into corresponding second scene text vectors. The text similarity between the first scene text vector, the second scene text vector, and the text to be matched is calculated and ranked, and the text matching result is corrected to obtain a text matching pair. According to the text matching pair, the unsupervised learning model and the supervised learning model are optimized, and the calculation of the text similarity is corrected.

Description

Self-learning-based scene text matching method and system

Technical Field

The invention belongs to the technical field of semantic text matching, and particularly relates to a scene text matching method and system based on self-learning.

Background

In the prior art, a text matching method usually extracts text features from two pieces of text and then judges whether they match by calculating a similarity over the extracted feature vectors. A deep learning model built on supervised learning has a large number of parameters that must be trained on a large text corpus before it yields good inference results in a specific scene. In practical application scenes, however, corpus data is commonly scarce.

When corpus data is scarce, a large labeled corpus can usually be obtained by manual labeling. But large-scale, tedious, and repetitive labeling work and the manpower it requires raise the cost of model training, so Natural Language Processing (NLP) technology faces a high entry threshold in real-world applications. Therefore, there is a need for a system solution that is acceptable to users and allows NLP technology to be deployed on the basis of a small amount of corpus data.

In the present method, an unsupervised model gives the user recommended text matching results on the basis of a small amount of corpus data, and the corpus data of a text matching library is continuously accumulated from the feedback the user gives to the system. Large-scale corpus data is thus eventually accumulated, while the models are iteratively updated and the system performance is improved at the same time.

Disclosure of Invention

Aiming at the defects of the prior art, the invention provides a self-learning-based scene text matching method and a self-learning-based scene text matching system.

To achieve this purpose, the technical scheme of the invention is as follows. In a first aspect, an embodiment of the invention provides a self-learning-based scene text matching method, which comprises the following steps:

selecting a pre-training word vector data set, and converting scene corpus data into scene word vectors corresponding to the pre-training word vector data set;

setting a user-defined threshold on the number of scene corpus samples, and when the amount of scene corpus data is smaller than the threshold, taking the scene corpus data as a small sample, inputting the small sample into an unsupervised learning model, and converting the scene word vectors into corresponding first scene text vectors;

after the accumulated scene corpus data exceeds the set threshold on the number of scene corpus samples, inputting the accumulated scene corpus data into a supervised learning model to convert the scene word vectors into corresponding second scene text vectors;

calculating and ranking the text similarity between the first scene text vector, the second scene text vector, and the text to be matched, and correcting the text matching result to obtain a text matching pair;

and according to the text matching pair, optimizing the unsupervised learning model and the supervised learning model, and correcting the calculation mode of the text similarity.

The second aspect of the embodiments of the present invention provides a self-learning based scene text matching system, which is used to implement the above-mentioned scene text matching method, and the system includes:

the pre-training word vector generation module is used for selecting a pre-training word vector data set and converting scene corpus data into a scene word vector corresponding to the pre-training word vector data set;

the unsupervised learning module is used for converting the scene word vectors into corresponding first scene text vectors;

the supervised learning module is used for converting the scene word vector into a corresponding second scene text vector;

the text similarity calculation module is used for calculating and ranking the text similarity between the first scene text vector, the second scene text vector, and the text to be matched;

and the human-computer interaction module is used for acquiring a corrected text matching result.

A third aspect of an embodiment of the present invention provides an electronic device, including a memory and a processor, where the memory is coupled to the processor; the memory is used for storing program data, and the processor is used for executing the program data to realize the self-learning-based scene text matching method.

A fourth aspect of the embodiments of the present invention provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the above-mentioned self-learning based scene text matching method.

Compared with the prior art, the invention has the following beneficial effects:

(1) The method first sets a user-defined threshold on the number of scene corpus samples. When the amount of scene corpus data is smaller than the threshold, the scene corpus data is treated as a small sample. With only this small sample available, an existing pre-training word vector data set is selected as a general corpus to supply the word vectors, and an unsupervised learning model is applied, which gives a lower start-up cost;

(2) Based on the text matching results obtained by the unsupervised learning model and the supervised learning model, together with the text matching corrections made by the client, the method automatically extracts matching pairs to build a text matching library and thereby accumulates text corpus data;

(3) Based on the data accumulated in the text matching library, the unsupervised learning model and the supervised learning model are corrected and optimized. The weights that the two models receive in the fused text similarity are adjusted according to their misjudgment rates on newly added test corpus data, so that the accuracy of the text matching results provided to the user improves and the method has a self-learning capability.

Drawings

FIG. 1 is a flow chart of a method embodying the present invention.

FIG. 2 is a schematic diagram of the system modules of the present invention.

Detailed Description

The examples described herein are illustrative and are not to be construed as limiting the present invention, whose scope includes but is not limited to the specific embodiments described. Any self-learning-based scene text matching method and system consistent with the claims of the present invention falls within its scope, and a person skilled in the art may select different pre-training word vectors and change, replace, or modify the word segmentation algorithm to suit different business scenes within that scope.

As shown in FIG. 1, an embodiment of the present invention provides a self-learning-based scene text matching method and system, where the method specifically includes the following steps:

(1) Select a pre-training word vector data set, and convert the scene corpus data into scene word vectors corresponding to the pre-training word vector data set.

The step (1) specifically comprises the following substeps:

(1.1) in this example, the Chinese word segmentation tool jieba is used to segment the scene corpus data into words;

(1.2) stop words are removed from the segmentation result of step (1.1) using the Harbin Institute of Technology (HIT) stop word list, yielding tokens w;

(1.3) in this example, the Chinese pre-training word vector data set "tencent-ailab-embedding-zh-d100-s" is selected and used to convert each token w into the corresponding scene word vector v_w (where w ∈ V, and V is the scene corpus data).
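Steps (1.1) to (1.3) can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `to_word_vectors`, the whitespace tokenizer, the stop word set, and the two-dimensional embedding table are hypothetical stand-ins for jieba, the stop word list, and the pre-training word vector data set. Skipping tokens that are missing from the embedding table is also an assumption, since the text does not say how such tokens are handled.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

Vector = List[float]


def to_word_vectors(
    text: str,
    segment: Callable[[str], Iterable[str]],
    stop_words: Set[str],
    embeddings: Dict[str, Vector],
) -> List[Tuple[str, Vector]]:
    """Segment text, drop stop words, and look up one vector per remaining token."""
    tokens = [t for t in segment(text) if t.strip() and t not in stop_words]
    # Assumption: tokens absent from the pre-trained vocabulary are skipped.
    return [(t, embeddings[t]) for t in tokens if t in embeddings]


def toy_segment(text: str) -> List[str]:
    """Hypothetical tokenizer; a real segmenter such as jieba would go here."""
    return text.split()


# Toy data for illustration only.
toy_stop_words = {"的", "了"}
toy_embeddings = {"打开": [0.1, 0.3], "空调": [0.7, 0.2]}

pairs = to_word_vectors("打开 客厅 的 空调", toy_segment, toy_stop_words, toy_embeddings)
tokens_kept = [token for token, _ in pairs]
```

Passing the segmenter in as a callable keeps the sketch independent of any particular word segmentation library, which matches the statement that the segmentation algorithm may be replaced for different business scenes.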

(2) Set a user-defined threshold on the number of scene corpus samples; in this example the threshold is set to 10,000. When the amount of scene corpus data is smaller than the threshold, the scene corpus data is treated as a small sample and input into an unsupervised learning model, which converts the scene word vectors into corresponding first scene text vectors.

(2.1) calculating the occurrence frequency p(w) of each token w in the small sample according to the following formula, where count(w) is the number of occurrences of w:

$$p(w) = \frac{\mathrm{count}(w)}{\sum_{w' \in V} \mathrm{count}(w')}$$

(2.2) calculating a scene text vector v_s from the scene word vectors according to the following formula:

$$v_s = \frac{1}{\lvert s \rvert} \sum_{w \in s} \frac{a}{a + p(w)} \, v_w$$

wherein |s| is the number of words in the sentence s, and a is a hyperparameter.

(2.3) the text vectors v_s of all texts s in the scene corpus data S form a matrix X, and Principal Component Analysis (PCA) is performed on X to obtain the maximum principal component vector u.

(2.4) re-acquiring scene corpus data, and for each text ts in the scene corpus data TS, calculating the text vector v_ts of ts according to the following formula:

$$v_{ts} = \frac{1}{\lvert ts \rvert} \sum_{w \in ts} \frac{a}{a + p(w)} \, v_w$$

(2.5) removing the maximum principal component from the vector v_ts output by step (2.4) to obtain the first scene text vector of the text ts, according to the following formula:

$$v_{ts} \leftarrow v_{ts} - u u^{T} v_{ts}$$
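Steps (2.1) to (2.5) can be sketched in plain Python as below. The sketch assumes that p(w) is the relative frequency of a token in the small sample and that the text vector is the frequency-weighted average a / (a + p(w)) described by the definitions of |s| and a. The default a = 1e-3, all function names, and the toy corpus are hypothetical. The principal direction is estimated by power iteration on the uncentered matrix X, which is one possible way to obtain u and not necessarily the one the inventors used.

```python
import math
from collections import Counter
from typing import Dict, List

Vector = List[float]


def word_frequencies(corpus: List[List[str]]) -> Dict[str, float]:
    """Relative frequency p(w) of each token over the whole (small) corpus."""
    counts = Counter(tok for sent in corpus for tok in sent)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def weighted_average(tokens: List[str], vectors: Dict[str, Vector],
                     p: Dict[str, float], a: float = 1e-3) -> Vector:
    """v_s = (1/|s|) * sum over w in s of a / (a + p(w)) * v_w."""
    if not tokens:
        raise ValueError("cannot embed an empty sentence")
    dim = len(vectors[tokens[0]])
    acc = [0.0] * dim
    for w in tokens:
        weight = a / (a + p.get(w, 0.0))  # rare words get weights close to 1
        for i, x in enumerate(vectors[w]):
            acc[i] += weight * x
    return [x / len(tokens) for x in acc]


def first_principal_direction(rows: List[Vector], iters: int = 200) -> Vector:
    """Unit vector u from power iteration on X^T X (X is not centered here)."""
    dim = len(rows[0])
    u = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iters):
        proj = [sum(r[i] * u[i] for i in range(dim)) for r in rows]  # X u
        nxt = [sum(proj[k] * rows[k][i] for k in range(len(rows)))   # X^T (X u)
               for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        u = [x / norm for x in nxt]
    return u


def remove_component(v: Vector, u: Vector) -> Vector:
    """v - u u^T v: subtract the projection of v onto the unit vector u."""
    dot = sum(x * y for x, y in zip(v, u))
    return [x - dot * y for x, y in zip(v, u)]


# Toy corpus and embeddings for illustration only.
corpus = [["open", "aircon"], ["close", "aircon"], ["open", "window"]]
vectors = {"open": [1.0, 0.0], "close": [-1.0, 0.2],
           "aircon": [0.3, 1.0], "window": [0.1, -0.8]}
p = word_frequencies(corpus)
X = [weighted_average(s, vectors, p) for s in corpus]
u = first_principal_direction(X)
first_scene_vectors = [remove_component(v, u) for v in X]
```

After the removal step every first scene text vector is orthogonal to u, which is the property the formula in step (2.5) is meant to guarantee.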

(3) Continue acquiring scene corpus data, input the scene corpus data into the supervised learning model, and convert the word vectors into corresponding second scene text vectors.

Based on a supervised learning algorithm, after the accumulated scene corpus data exceeds the set threshold on the number of scene corpus samples, the word vectors v_w of the words in a text are used as the input of the supervised learning model, and the model outputs the second scene text vector. The supervised learning model includes, without limitation, ESIM, BiMPM, DIIN, CAFE, ELMo, GPT, or BERT.

(4) Calculate and rank the text similarity between the first scene text vector, the second scene text vector, and the text to be matched, output the text with the highest similarity value to the client, and acquire the corrected text matching result from the client to obtain a text matching pair; the text matching pairs are stored to build a text matching library.

The first scene text vector output by the unsupervised learning model is denoted v_{s_unsupervised}; the second scene text vector output by the supervised learning model is denoted v_{s_supervised}.

The text similarity between the first scene text vector, the second scene text vector, and the text vectors in the text matching library is measured in turn. The metric is the cosine distance, calculated as follows:

$$\mathrm{sim}_{unsupervised} = \frac{v_{s\_unsupervised} \cdot v_{offline\_unsupervised}}{\lVert v_{s\_unsupervised} \rVert \, \lVert v_{offline\_unsupervised} \rVert}$$

$$\mathrm{sim}_{supervised} = \frac{v_{s\_supervised} \cdot v_{offline\_supervised}}{\lVert v_{s\_supervised} \rVert \, \lVert v_{offline\_supervised} \rVert}$$

wherein sim_unsupervised is the cosine similarity calculated for the unsupervised model, v_{offline_unsupervised} is the preselected text vector to be matched for the unsupervised model, sim_supervised is the cosine similarity calculated for the supervised model, and v_{offline_supervised} is the preselected text vector to be matched for the supervised model.
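The cosine measure and the ranking of candidates in the text matching library can be sketched as follows. The function names, the candidate labels, and the toy vectors are hypothetical. Returning 0.0 when either vector has zero norm is an assumption made to avoid division by zero; the text does not cover that case.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: an all-zero vector matches nothing
    return dot / (norm_u * norm_v)


def rank_candidates(query: Vector,
                    library: Dict[str, Vector]) -> List[Tuple[str, float]]:
    """Score every library text against the query, highest similarity first."""
    scored = [(text, cosine_similarity(query, vec)) for text, vec in library.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy library for illustration only.
library = {
    "turn on the air conditioner": [0.9, 0.1],
    "open the window": [0.1, 0.9],
    "close the door": [-0.8, 0.2],
}
ranking = rank_candidates([1.0, 0.0], library)
best_text, best_score = ranking[0]
```

The top entry of `ranking` corresponds to the text with the highest similarity value that step (4) outputs to the client.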

When only a few scene samples exist initially, only the text vector output by the unsupervised learning model is used, and the text similarity is calculated with the cosine distance. After the accumulated scene samples exceed the set threshold on the number of scene corpus samples, they are combined with the small amount of initial sample data, the text vectors of both the supervised learning model and the unsupervised learning model are output, and the text similarity is calculated for each with the preselected distance. The text similarities of the unsupervised model and the supervised model are then combined by weights α_unsupervised and α_supervised respectively (before scene samples have accumulated and the weights have been updated, α_unsupervised = 1 and α_supervised = 0). The text similarity is calculated as follows:

$$\mathrm{sim} = \alpha_{unsupervised} \, \mathrm{sim}_{unsupervised} + \alpha_{supervised} \, \mathrm{sim}_{supervised}$$

wherein sim_supervised is the cosine similarity calculated by the supervised model, and sim_unsupervised is the cosine similarity calculated by the unsupervised model.
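The weighted combination can be sketched as a single function. The name `fused_similarity` and the sample values are hypothetical; the default weights follow the stated initial condition in which the unsupervised weight is 1 and the supervised weight is 0.

```python
def fused_similarity(sim_unsupervised: float, sim_supervised: float,
                     alpha_unsupervised: float = 1.0,
                     alpha_supervised: float = 0.0) -> float:
    """Weighted sum of the two cosine similarities.

    The defaults reproduce the cold-start case in which only the
    unsupervised model contributes to the score.
    """
    return (alpha_unsupervised * sim_unsupervised
            + alpha_supervised * sim_supervised)


# Cold start: the supervised score is ignored.
cold_start = fused_similarity(0.8, 0.2)

# After the weights have been updated (values chosen for illustration only).
after_update = fused_similarity(0.8, 0.2, alpha_unsupervised=0.25,
                                alpha_supervised=0.75)
```

With the default weights the function reduces to the unsupervised score, so the same ranking code can be used before and after the sample threshold is reached.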

(5) Iteratively optimize the unsupervised learning model and the supervised learning model according to the text matching pairs in the text matching library.

Based on the established and maintained text matching library, the frequency p(w) of each token w, the text vectors v_s of the text matching library, and the maximum principal component vector u are updated, and the text vectors of texts subsequently input into the system are calculated with the updated values.

Based on the established and maintained text matching library, incremental text matching data is input into the supervised learning model, which includes, without limitation, ESIM, BiMPM, DIIN, CAFE, ELMo, GPT, or BERT, so that the supervised model continues to optimize its parameters on the corpus of the specific scene.

(6) Correct the calculation formula of the text similarity according to the text matching correction feedback of the client, so as to gradually improve the accuracy of the unsupervised learning model and the supervised learning model.

(6.1) based on the data newly added to the text matching library, the matched text results output by the unsupervised model and the supervised model are compared with the text matching results fed back by the client, and the misjudgment rates of the models are counted and recorded as the unsupervised model misjudgment rate e_unsupervised and the supervised model misjudgment rate e_supervised.

(6.2) the parameters α_supervised and α_unsupervised of the text similarity formula are updated by the following calculations:

$$\alpha_{supervised} = \frac{e_{unsupervised}}{e_{unsupervised} + e_{supervised}}, \qquad \alpha_{unsupervised} = \frac{e_{supervised}}{e_{unsupervised} + e_{supervised}}$$

wherein the first weight α_supervised, corresponding to the supervised model, is updated to the ratio of the unsupervised model misjudgment rate to the total misjudgment rate, and the second weight α_unsupervised, corresponding to the unsupervised model, is updated to the ratio of the supervised model misjudgment rate to the total misjudgment rate.
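Steps (6.1) and (6.2) can be sketched as follows. The sketch assumes that the total misjudgment rate is the sum of the two rates, so that each model's weight equals the other model's share of the errors. The function names and the sample predictions are hypothetical, and falling back to equal weights when neither model makes an error is an assumption, since the text does not cover that case.

```python
from typing import List, Tuple


def misjudgment_rate(predicted: List[str], corrected: List[str]) -> float:
    """Fraction of predictions that differ from the client-corrected matches."""
    if not corrected:
        return 0.0
    wrong = sum(1 for p, c in zip(predicted, corrected) if p != c)
    return wrong / len(corrected)


def update_weights(e_unsupervised: float,
                   e_supervised: float) -> Tuple[float, float]:
    """Return (alpha_unsupervised, alpha_supervised).

    Each model is weighted by the other model's share of the total error,
    so the model that errs less receives the larger weight.
    """
    total = e_unsupervised + e_supervised
    if total == 0.0:
        return 0.5, 0.5  # assumption: equal weights when no errors are observed
    alpha_supervised = e_unsupervised / total
    alpha_unsupervised = e_supervised / total
    return alpha_unsupervised, alpha_supervised


# Toy feedback batch for illustration only.
corrected = ["A", "B", "C", "D"]
e_unsup = misjudgment_rate(["A", "X", "Y", "Z"], corrected)  # 3 of 4 wrong
e_sup = misjudgment_rate(["A", "B", "C", "X"], corrected)    # 1 of 4 wrong
alpha_unsup, alpha_sup = update_weights(e_unsup, e_sup)
```

In this toy batch the supervised model makes fewer errors, so it ends up with the larger weight in the fused similarity.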

(7) The method further comprises: acquiring scene corpus data as a test sample, and inputting the test sample into the iteratively optimized unsupervised learning model and supervised learning model to obtain a matched text result.

In another aspect, the invention also provides a self-learning-based scene text matching system for implementing the above self-learning-based scene text matching method, the system comprising:

the pre-training word vector generation module is used for selecting a pre-training word vector data set and converting the scene corpus data into a scene word vector corresponding to the pre-training word vector data set;

the unsupervised learning module is used for converting the scene word vector into a corresponding first scene text vector;

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The above embodiments are only used for illustrating the design idea and features of the present invention, and the purpose of the present invention is to enable those skilled in the art to understand the content of the present invention and implement the present invention accordingly, and the protection scope of the present invention is not limited to the above embodiments. Therefore, all equivalent changes and modifications made in accordance with the principles and concepts disclosed herein are intended to be included within the scope of the present invention.

Claims

1. A self-learning based scene text matching method is characterized by comprising the following sub-steps:

2. The self-learning based scene text matching method of claim 1, wherein the process of selecting a pre-training word vector data set and converting the scene corpus data into a scene word vector corresponding to the pre-training word vector data set comprises: performing word segmentation on the scene corpus data, removing stop words from the word segmentation result, and converting the remaining tokens into the corresponding scene word vectors in the pre-training word vector data set.

3. The self-learning based scene text matching method of claim 1, wherein the process of converting the scene word vector into the corresponding first scene text vector comprises:

calculating the frequency of each token appearing in the small sample;

calculating a scene text vector v_s from the scene word vectors v_w, wherein the formula is as follows:

$$v_s = \frac{1}{\lvert s \rvert} \sum_{w \in s} \frac{a}{a + p(w)} \, v_w$$

wherein |s| is the number of words in the sentence s, and a is a hyperparameter; p(w) is the frequency of the token w in the small sample;

forming a matrix X from the text vectors v_s of all texts s in the scene corpus data S, and performing principal component analysis to obtain a maximum principal component vector u;

re-acquiring scene corpus data, and for each text ts in the scene corpus data TS, calculating the text vector v_ts of the text ts according to the following formula:

$$v_{ts} = \frac{1}{\lvert ts \rvert} \sum_{w \in ts} \frac{a}{a + p(w)} \, v_w$$

removing the maximum principal component from the text vector v_ts to obtain the first scene text vector of the text ts according to the following formula:

$$v_{ts} \leftarrow v_{ts} - u u^{T} v_{ts}$$

4. The self-learning based scene text matching method according to claim 1, wherein the supervised learning model is selected from ESIM, BiMPM, DIIN, CAFE, ELMo, GPT, or BERT.

5. The self-learning based scene text matching method of claim 1, wherein the process of calculating the text similarity between the first scene text vector, the second scene text vector, and the text to be matched comprises: calculating, based on the cosine similarity, the text similarity between the first scene text vector and the text to be matched and between the second scene text vector and the text to be matched, to obtain a first text similarity and a second text similarity; setting a first weight and a second weight, taking the first weight multiplied by the first text similarity as a first term and the second weight multiplied by the second text similarity as a second term, and adding the two terms to obtain the text similarity.

6. The self-learning based scene text matching method according to claim 5, wherein the process of correcting the text matching result comprises:

comparing the matched text results output by the unsupervised model and the supervised model with the corrected text matching results respectively, and counting the unsupervised model misjudgment rate and the supervised model misjudgment rate;

the first weight is updated as: the ratio of the unsupervised model misjudgment rate to the total misjudgment rate;

the second weight is updated as: the ratio of the supervised model misjudgment rate to the total misjudgment rate.

7. The self-learning based scene text matching method according to claim 1, further comprising: re-acquiring scene corpus data as a test sample, and inputting the test sample into the optimized unsupervised learning model and the optimized supervised learning model to obtain a matched text result.

8. A self-learning based scene text matching system for implementing the self-learning based scene text matching method of any one of claims 1 to 7, the system comprising:

the text similarity calculation module is used for calculating and ranking the text similarity between the first scene text vector, the second scene text vector, and the text to be matched;

and the man-machine interaction module is used for acquiring a corrected text matching result.

9. An electronic device comprising a memory and a processor, wherein the memory is coupled to the processor; wherein the memory is used for storing program data, and the processor is used for executing the program data to realize the self-learning based scene text matching method of any one of the above claims 1-7.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out a self-learning based scene text matching method according to any one of claims 1 to 7.