CN115017264A - Model effect verification method and device - Google Patents

Model effect verification method and device

Info

Publication number
CN115017264A
Authority
CN
China
Prior art keywords
text, texts, score, calculating, similar
Prior art date
Legal status
Pending
Application number
CN202210665280.2A
Other languages
Chinese (zh)
Inventor
范凌
王喆
蒋兆湘
Current Assignee
Tezign Shanghai Information Technology Co Ltd
Original Assignee
Tezign Shanghai Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Tezign Shanghai Information Technology Co Ltd filed Critical Tezign Shanghai Information Technology Co Ltd
Priority to CN202210665280.2A
Publication of CN115017264A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3334 Selection or weighting of terms from queries, including natural language queries
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G06F16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/383 Retrieval using metadata automatically derived from the content
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9536 Search customisation based on social or collaborative filtering

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Library & Information Science (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a model effect verification method and device. A plurality of first texts are randomly selected from text information, an equal number of second texts are coarsely recalled by collaborative filtering, and similar text pairs are constructed from the first and second texts; the similar text pairs are encoded by the original coding model and by the trained coding model to obtain a first similarity score and a second similarity score, respectively; the first similarity score is compared with the manual similarity score to obtain a first correlation score, and the second similarity score with the manual similarity score to obtain a second correlation score; and whether the trained coding model's encoded representation of a text conforms to the text semantics is judged from the first and second correlation scores. The method can simulate the online effect of the optimized coding model: it recalls approximate text pairs from a mass of unlabeled texts without manual participation, saving a large amount of time otherwise spent hand-selecting similar text pairs.

Description

Model effect verification method and device
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for verifying model effect, a computer device, and a storage medium.
Background
With the rapid expansion of the marketing field and the shift of business toward emerging media, more and more merchants in the fast-moving consumer goods (FMCG) industry are moving from traditional television advertising to content distribution in public domains such as Douyin and Xiaohongshu (Little Red Book) and in private domains such as WeChat Moments, relying on public- and private-domain content production, content management, and content operation by brands and studios to attract consumers and convert product sales. In this process, more and more advertising content and copy materials accumulate, and how to associate and manage this copy so that it can be conveniently searched downstream and reused for secondary creation becomes a problem to be solved.
When using a content management platform, a user collects marketing advertisement copy returned from various marketing platforms and places it into a content pool; for secondary creation, the user searches the content pool and selects useful, satisfactory copy. This process involves many retrieval actions to collect a batch of marketing copy that meets the requirements. If the copy in the content pool can be correlated, then when the user finds a first article that meets the requirements, more copy of the same type can be pushed to the user, greatly reducing retrieval time, improving the efficiency of secondary creation, and supporting the user's content circulation and distribution across platforms. In real scenarios, however, the original bert coding model of the prior art recommends similar texts with difficulty and inaccuracy, because content specifications differ across platforms and copy labels are lacking.
However, even after the original coding model is optimized, the problem remains of selecting a suitable verification method to simulate its online effect. Specifically, the conventional approach is to manually select sentence pairs whose relatedness is determined in advance, so as to construct a pre-launch model evaluation set. This process is very labor-intensive, since similar sample pairs cannot feasibly be hand-picked from thousands of marketing copy items; how to have a machine construct similar text pairs directly from these corpora is therefore also a problem to be considered.
For the problem in the related art of selecting a suitable verification method to simulate the online effect of an optimized coding model while saving the time of manually selecting similar evaluation text pairs, no effective solution has yet been proposed.
Disclosure of Invention
The embodiment of the invention provides a method and a device for verifying model effect, computer equipment and a storage medium, which are used for solving the problem of how to select a proper verification mode to simulate the online effect of an optimized coding model in the related art.
In order to achieve the above object, a first aspect of the embodiments of the present invention provides a method for verifying model effect, including:
acquiring text information subjected to data cleaning;
extracting keywords from the text in the text information, and using the keywords as labels of the text;
randomly selecting a plurality of first texts from the text information, roughly recalling the same number of second texts by using labels of the first texts and adopting a collaborative filtering mode, and constructing similar text pairs according to the first texts and the second texts;
coding the similar text pairs according to an original coding model to obtain first similar scores, and calculating the first similar scores and artificial similar scores to obtain first relevance scores, wherein the artificial similar scores are obtained by calculating the similar text pairs in an artificial labeling mode;
coding the similar text pair according to the trained coding model to obtain a second similarity score, and calculating the second similarity score and the artificial similarity score to obtain a second relevance score;
and judging whether the coding representation of the trained coding model to the text conforms to the text semantics or not according to the first relevance score and the second relevance score.
Optionally, in a possible implementation manner of the first aspect, the extracting a keyword from a text in the text information includes:
setting word frequency and word length, and performing word segmentation processing on the text in the text information in a mode of calculating information entropy to obtain a plurality of word segmentation results;
and extracting at least one keyword from the word segmentation results in a TF-IDF calculation mode to serve as a label of the text.
Optionally, in a possible implementation manner of the first aspect, calculating the first similarity score and an artificial similarity score to obtain a first relevance score includes:
the relevance score is calculated using the following formula:
r = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}
where d_i denotes the difference between the rank values of the i-th data pair, n denotes the total number of samples, and r denotes the correlation score of the two columns of data; the closer r is to 1, the more correlated the two columns are, and the closer r is to 0, the less correlated the two columns of scores are.
Optionally, in a possible implementation manner of the first aspect, before the obtaining the text information subjected to data cleansing, the method includes:
acquiring marketing copy information of different channels;
performing first data cleansing and second data cleansing on the marketing copy information;
the first data cleansing comprises: filtering out meaningless special symbols, uniformly replacing meaningful special symbols with text commas, and replacing all non-comma Chinese and English punctuation with text commas;
the second data cleansing comprises: performing information entropy calculation on the marketing copy information subjected to the first data cleansing, and filtering out meaningless repeated long text phrases according to word frequency and word length, to obtain the text information subjected to data cleansing.
Optionally, in a possible implementation manner of the first aspect, the performing information entropy calculation on the marketing copy information subjected to the first data cleansing to filter out meaningless repeated long text phrases according to word frequency and word length includes:
setting a word frequency and a word length, and filtering out meaningless repeated long text phrases by calculating information entropy, where the information entropy is:
H(W) = -\sum_{i} P(i \mid W) \log P(i \mid W), \quad \text{subject to} \ \mathrm{len}(W) < m, \ \mathrm{freq}(W) > k
where W denotes a phrase to be subjected to information entropy calculation, len(W) < m constrains the word length to within m, freq(W) > k requires the word to occur more than k times, and P(i|W) denotes the probability of each word i appearing adjacent to the word W.
Optionally, in a possible implementation manner of the first aspect, randomly selecting a plurality of first texts from the text information, roughly recalling a same number of second texts by using labels of the first texts and adopting a collaborative filtering manner, and constructing similar text pairs according to the first texts and the second texts, includes:
step 1: randomly selecting a preset number of sentences from the corpus as a first sentence group, and identifying the keyword labels and label count of each sentence in the first sentence group;
step 2: randomly selecting a sentence from the first sentence group as a first sentence;
step 3: respectively identifying the keyword labels and label counts of the other sentences in the corpus outside the first sentence group;
step 4: computing, for the first sentence and each other sentence, the ratio (number of overlapping labels between the first sentence and the other sentence) / (square root of the first sentence's total label count × square root of the other sentence's total label count), sorting the results from small to large, and randomly coarse-recalling one of the other sentences ranked within a preset range as a second sentence; constructing a sentence pair from the first sentence and the second sentence;
step 5: coarse-recalling an equal number of sentences for the first sentence group according to the collaborative filtering manner of steps 1-4, to construct sentence pairs.
In a second aspect of the embodiments of the present invention, there is provided a model effect verification apparatus, including:
the text information acquisition module is used for acquiring the text information subjected to data cleaning;
the keyword extraction module is used for extracting keywords from the text in the text information and taking the keywords as labels of the text;
the similar text pair construction module is used for randomly selecting a plurality of first texts from the text information, roughly recalling the same number of second texts by using the labels of the first texts and adopting a collaborative filtering mode, and constructing similar text pairs according to the first texts and the second texts;
the first relevance score calculating module is used for coding the similar text pairs according to an original coding model to obtain a first similarity score, and calculating the first similarity score and an artificial similarity score to obtain a first relevance score, wherein the artificial similarity score is obtained by calculating the similar text pairs in an artificial labeling mode;
the second relevance score calculating module is used for coding the similar text pair according to the trained coding model to obtain a second similarity score, and calculating the second similarity score and the artificial similarity score to obtain a second relevance score;
and the judging module is used for judging whether the coding representation of the trained coding model to the text conforms to the text semantics according to the first relevance score and the second relevance score.
Optionally, in a possible implementation manner of the second aspect, the keyword extraction module includes:
the word segmentation processing unit is used for setting word frequency and word length, and performing word segmentation processing on the text in the text information in a mode of calculating information entropy to obtain a plurality of word segmentation results;
and the keyword extraction unit is used for extracting at least one keyword from the word segmentation results in a TF-IDF calculation mode to be used as a label of the text.
Optionally, in a possible implementation manner of the second aspect, the first relevance score calculating module or the second relevance score calculating module is further configured to perform the following steps:
the relevance score is calculated using the following formula:
r = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}
where d_i denotes the difference between the rank values of the i-th data pair, n denotes the total number of samples, and r denotes the correlation score of the two columns of data; the closer r is to 1, the more correlated the two columns are, and the closer r is to 0, the less correlated the two columns of scores are.
In a third aspect of the embodiments of the present invention, a computer device is provided, which includes a memory and a processor, where the memory stores a computer program that is executable on the processor, and the processor implements the steps in the above method embodiments when executing the computer program.
A fourth aspect of the embodiments of the present invention provides a readable storage medium, in which a computer program is stored, which, when being executed by a processor, is adapted to carry out the steps of the method according to the first aspect of the present invention and various possible designs of the first aspect of the present invention.
According to the model effect verification method, the model effect verification device, the computer equipment and the storage medium, the text information cleaned by data is obtained; extracting keywords from the text in the text information, and using the keywords as labels of the text; randomly selecting a plurality of first texts from the text information, roughly recalling the same number of second texts by using labels of the first texts and adopting a collaborative filtering mode, and constructing similar text pairs according to the first texts and the second texts; coding the similar text pairs according to an original coding model to obtain first similar scores, and calculating the first similar scores and artificial similar scores to obtain first relevance scores, wherein the artificial similar scores are obtained by calculating the similar text pairs in an artificial labeling mode; coding the similar text pair according to the trained coding model to obtain a second similarity score, and calculating the second similarity score and the artificial similarity score to obtain a second relevance score; and judging whether the coding representation of the trained coding model to the text conforms to the text semantics or not according to the first relevance score and the second relevance score. The method can simulate the online effect of the optimized coding model, extracts keywords from all texts in a unsupervised text keyword extraction mode, extracts the similar text pairs from a large number of unmarked text pairs by using the idea of collaborative filtering after extraction, and saves a large amount of time for manually selecting similar evaluation text pairs.
Drawings
Fig. 1 and 2 are schematic flow charts of a model effect verification method according to an embodiment of the present application;
FIG. 3 is a schematic flow chart of filtering unwanted interference information using different cleansing rules;
fig. 4 is a schematic structural diagram of a model effect verification apparatus according to an embodiment of the present application;
fig. 5 and 6 are schematic flow charts of a model training method according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein.
It should be understood that, in various embodiments of the present invention, the sequence numbers of the processes do not mean the execution sequence, and the execution sequence of the processes should be determined by the functions and the internal logic of the processes, and should not constitute any limitation on the implementation process of the embodiments of the present invention.
It should be understood that in the present application, "comprising" and "having" and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that, in the present invention, "a plurality" means two or more. "And/or" merely describes an association between objects, meaning that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "Comprises A, B and C" and "comprises A, B, C" mean that all three of A, B and C are comprised; "comprises A, B or C" means that one of A, B and C is comprised; and "comprises A, B and/or C" means that any one, any two, or all three of A, B and C are comprised.
It should be understood that in the present invention, "B corresponding to a", "a corresponds to B", or "B corresponds to a" means that B is associated with a, and B can be determined from a. Determining B from a does not mean determining B from a alone, but may be determined from a and/or other information. And the matching of A and B means that the similarity of A and B is greater than or equal to a preset threshold value.
As used herein, "if" may be interpreted as "at … …" or "when … …" or "in response to a determination" or "in response to a detection", depending on the context.
The technical solution of the present invention will be described in detail below with specific examples. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments.
Example 1:
the invention provides a method for verifying model effect, which is shown in a flow chart of figures 1 and 2 and comprises the following steps:
and step S110, acquiring the text information subjected to data cleaning.
In this step, the text information includes copy information from different channels, for example marketing copy collected back from the various platforms on which clients (such as FMCG brands) advertise, including Xiaohongshu (Little Red Book) notes, Douyin, and WeChat Moments. The data cleaning divides the marketing copy information by channel (the content specifications of the platforms are inconsistent) and applies different cleaning rules to filter out useless interference information.
Step S120, extracting keywords from the texts in the text information, and using the keywords as the labels of the texts.
In step S120, the text information refers to a corpus of a multi-platform marketing scenario summary;
when extracting keywords from a text, word frequency and word length need to be set first, and word segmentation processing is performed on the text in the text information in a mode of calculating information entropy to obtain a plurality of word segmentation results; and then extracting at least one keyword from the word segmentation results in a TF-IDF calculation mode to serve as a label of the text. The information entropy calculation formula is as follows:
H(W) = -\sum_{i} P(i \mid W) \log P(i \mid W), \quad \text{subject to} \ \mathrm{len}(W) < m, \ \mathrm{freq}(W) > k
where W denotes a phrase to be subjected to information entropy calculation, len(W) < m constrains the word length to within m, freq(W) > k requires the word to occur more than k times, and P(i|W) denotes the probability of each word i appearing adjacent to the word W. The larger the information entropy, the more varied the words surrounding W, and W itself can be treated as a fixed unit of information, because it combines freely with many different words; the smaller the information entropy, the purer the surroundings, the fewer distinct neighbors W has, and the more likely W combines with its neighbors to form a new phrase, in which case the check is repeated with the ngram length increased by 1 to see whether a longer phrase can be formed.
Specifically, when word frequency and word length are set and word segmentation processing is carried out on a text in text information in a mode of calculating information entropy, word forming conditions of 1 word, 2 words and 3 to m words are constructed according to word granularity ngram, and a word segmentation user-defined word library is constructed by combined words meeting the word forming conditions.
For example: with m set to 5, k set to 3, and ngram = 2, the candidate "retino-" is counted as satisfying the m and k conditions, since its length is within 5 and its frequency is above 3. Counting the information entropy of the words appearing around "retino-", we find that the entropy of the words adjacent on its left is high, while the entropy of the words adjacent on its right is low, dominated by a few characters such as "aldehyde", "ol", and "acid"; therefore new words "retinal", "retinol", and "retinoic acid" are likely to be constructed.
After all sentences in the corpus are segmented one by one in this way and the custom segmentation lexicon is established, at least one keyword is extracted from the segments of each sentence by TF-IDF (term frequency - inverse document frequency) calculation to serve as the sentence's label.
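As an illustration only (the patent itself contains no code), the following Python sketch shows one way this unsupervised tagging step could look. The function names, the character n-gram analyzer, and the use of scikit-learn's TfidfVectorizer are assumptions for the sketch; it computes only the right-neighbor entropy, whereas the scheme above examines the surroundings on both sides and feeds the resulting lexicon into a custom word segmenter.

    import math
    from collections import Counter
    from sklearn.feature_extraction.text import TfidfVectorizer

    def right_neighbor_entropy(corpus, word):
        # H(W) = -sum_i P(i|W) log P(i|W), over characters adjacent to `word`
        neighbors = Counter()
        for text in corpus:
            pos = text.find(word)
            while pos != -1:
                end = pos + len(word)
                if end < len(text):
                    neighbors[text[end]] += 1
                pos = text.find(word, pos + 1)
        total = sum(neighbors.values())
        if total == 0:
            return 0.0
        return -sum(c / total * math.log(c / total) for c in neighbors.values())

    def lexicon_candidates(corpus, m=5, k=3, max_ngram=3):
        # keep character n-grams with len(W) < m and freq(W) > k
        freq = Counter()
        for text in corpus:
            for n in range(1, max_ngram + 1):
                for i in range(len(text) - n + 1):
                    freq[text[i:i + n]] += 1
        return {w for w, c in freq.items() if len(w) < m and c > k}

    def sentence_tags(corpus, top_n=5):
        # the top TF-IDF terms of each sentence become its labels
        vec = TfidfVectorizer(analyzer="char", ngram_range=(2, 4))
        tfidf = vec.fit_transform(corpus)
        vocab = vec.get_feature_names_out()
        return [[vocab[i] for i in row.toarray().ravel().argsort()[::-1][:top_n]]
                for row in tfidf]

A low right-neighbor entropy for a candidate suggests extending it by one character (ngram + 1) and re-testing, which is how "retino-" grows into "retinal", "retinol", and "retinoic acid" in the example above.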
Step S130, randomly selecting a plurality of first texts from the text information, roughly recalling the same number of second texts by using the labels of the first texts and adopting a collaborative filtering mode, and constructing similar text pairs according to the first texts and the second texts.
In this step, the first texts and the second texts are sentences in the corpus summarized from the multi-platform marketing copy. The keywords extracted in step S120 serve as the labels of each sentence; sentences are then randomly selected and recalled by their labels through collaborative filtering, thereby constructing the similar text pairs to be evaluated. For example, suppose the corpus of summarized marketing copy contains 5000 sentences and 100 of them are randomly sampled; the task is then to select, from the 5000 sentences, texts close to the 100 sampled ones. This process requires no manual browsing of the 5000 sentences: 100 close sentences are coarsely recalled by collaborative filtering to construct 100 sentence pairs.
The calculation mode of collaborative filtering comprises the following steps:
1. randomly selecting a preset number of sentences from the corpus to serve as a first sentence group, and identifying the keyword tag and the tag number of each sentence in the first sentence group;
2. randomly selecting a sentence from the first sentence group as a first sentence;
3. respectively identifying the keyword labels and the label quantity of other sentences except the first sentence group in the corpus set;
4. computing, for the first sentence and each other sentence, the ratio (number of overlapping labels between the first sentence and the other sentence) / (square root of the first sentence's total label count × square root of the other sentence's total label count), sorting the results from small to large, and randomly coarse-recalling one sentence from the other sentences ranked within a preset range (such as the top 30) as a second sentence; constructing a sentence pair from the first sentence and the second sentence;
5. coarse-recalling an equal number of sentences for the first sentence group according to the collaborative filtering manner above, to construct sentence pairs.
In step 4 above, it is alternatively possible to compute the same ratio, judge whether the result is greater than a preset threshold (for example 0.5, 0.6, or 0.75), and if so, randomly coarse-recall one sentence as the second sentence from among the sentences exceeding the threshold.
Or adopting other methods, which are concretely as follows:
1. randomly selecting a sentence from the corpus as a first sentence, and identifying the keyword labels and label count of the first sentence; 2. respectively identifying the keyword labels and label counts of the other sentences in the corpus; 3. computing, for the first sentence and each other sentence, the ratio (number of overlapping labels) / (square root of the first sentence's total label count × square root of the other sentence's total label count), sorting the results from small to large, and randomly coarse-recalling one of the other sentences ranked within a preset range (such as the top 30) as a second sentence, then constructing a sentence pair from the first sentence and the second sentence; 4. coarse-recalling a preset number of sentences according to the collaborative filtering manner of steps 1-3, to construct sentence pairs.
An example of the above collaborative filtering: suppose the threshold is preset to 0.5; sentence 1 has extracted keyword labels "shampoo", "anti-dandruff", "anti-itch", "fragrance", "lasting", and sentence 2 has extracted keyword labels "shampoo", "supple", "anti-dandruff", "oil control", "lasting". The collaborative filtering score is then (number of overlapping labels of sentences 1 and 2) / (square root of sentence 1's total label count × square root of sentence 2's total label count) = 3/5 = 0.6.
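A minimal sketch of this coarse-recall scoring, reproducing the 0.6 result of the example above; the helper names are illustrative, not from the patent:

    import math
    import random

    def cf_score(tags_a, tags_b):
        # (overlapping labels) / (sqrt(|tags_a|) * sqrt(|tags_b|))
        overlap = len(set(tags_a) & set(tags_b))
        return overlap / (math.sqrt(len(tags_a)) * math.sqrt(len(tags_b)))

    def coarse_recall(anchor_tags, candidates, top_range=30):
        # rank candidates by score, then pick at random within the preset range
        ranked = sorted(candidates, key=lambda tags: cf_score(anchor_tags, tags),
                        reverse=True)
        return random.choice(ranked[:top_range])

    s1 = ["shampoo", "anti-dandruff", "anti-itch", "fragrance", "lasting"]
    s2 = ["shampoo", "supple", "anti-dandruff", "oil control", "lasting"]
    print(cf_score(s1, s2))  # 3 / (sqrt(5) * sqrt(5)) = 0.6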
Step S140, coding the similar text pair according to the original coding model to obtain a first similar score, and calculating the first similar score and an artificial similar score to obtain a first relevance score, wherein the artificial similar score is obtained by calculating the similar text pair in a manual labeling mode.
Step S150, coding the similar text pairs according to the trained coding model to obtain a second similarity score, and calculating the second similarity score and the artificial similarity score to obtain a second relevance score.
Step S160, judging whether the trained coding model's coded representation of the text conforms to the text semantics according to the first relevance score and the second relevance score.
In steps S140-S160, the first or second relevance score is calculated using the following formula:
r = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}
where d_i denotes the difference between the rank values of the i-th data pair, n denotes the total number of samples, and r denotes the correlation score of the two columns of data; the closer r is to 1, the more correlated the two columns are, and the closer r is to 0, the less correlated the two columns of scores are.
Specifically, taking steps S140-S160 as an example: for the 100 sentence pairs constructed in step S130, the original bert model outputs vector encodings of the text pairs and the trained bert model outputs its own vector encodings; the 100 pairs are also manually annotated with similarity scores (on a 1-10 scale). The Spearman correlation between the original bert model's first similarity scores and the manual scores is then calculated, with a result of 0.65, and the Spearman correlation between the trained bert model's second similarity scores and the manual scores is 0.83. The second correlation score (0.83) is closer to 1 than the first (0.65), indicating that the trained bert model's scores agree with the manual scores: pairs that humans consider similar, the model also considers similar, and pairs that humans consider dissimilar, the model also considers dissimilar. This measures whether the model's scoring trend on text pairs correlates with the trend of the manual scores.
From the Spearman correlation it can thus be seen that the fine-tuned bert model scores similar texts higher and correlates better with the manual scoring; its encoded representation of a text better conforms to the text semantics, and the similarity scores it computes are more accurate. The trained bert model is therefore determined to be the final online text coding model.
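For illustration, the Spearman evaluation of steps S140-S160 could be performed with scipy as sketched below; the score lists are made-up stand-ins for the roughly 100 annotated pairs, and scipy.stats.spearmanr is an assumed convenience rather than anything specified by the patent:

    from scipy.stats import spearmanr

    # stand-in scores for the ~100 coarse-recalled sentence pairs
    human = [9, 2, 7, 4, 8, 1]                       # manual 1-10 ratings
    original = [0.71, 0.55, 0.60, 0.58, 0.66, 0.52]  # original bert cosine sims
    tuned = [0.88, 0.21, 0.74, 0.40, 0.81, 0.15]     # fine-tuned bert cosine sims

    r_orig, _ = spearmanr(human, original)
    r_tuned, _ = spearmanr(human, tuned)
    # whichever r is closer to 1 ranks the pairs more like the human annotators
    print(f"original: {r_orig:.2f}, fine-tuned: {r_tuned:.2f}")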
In one embodiment, before the acquiring the text information subjected to data cleansing, the method includes:
acquiring marketing copy information of different channels;
performing first data cleansing and second data cleansing on the marketing copy information;
the first data cleansing includes: filtering out meaningless special symbols, uniformly replacing meaningful special symbols with text commas, and replacing all non-comma Chinese and English punctuation with text commas;
the second data cleansing includes: performing information entropy calculation on the marketing copy information subjected to the first data cleansing, and filtering out meaningless repeated long text phrases according to word frequency and word length, to obtain the text information subjected to data cleansing.
In this step, as shown in fig. 3, the marketing copy information of different channels is divided by channel (the content specifications of the platforms are inconsistent), and different cleaning rules are applied to filter out useless interference information.
First, Xiaohongshu marketing copy is characterized by long articles with relatively fixed wording that contain a large number of emoji. Meaningless emoji therefore need to be filtered out, while meaningful emoji and punctuation with a pause function, such as decorative effect symbols, cross marks, exclamation marks, and question marks expressing tone, are uniformly replaced with text commas; at the same time, all non-comma Chinese and English punctuation is replaced with text commas.
Second, WeChat Moments marketing copy is characterized by short texts containing a large amount of meaningless text information. Information entropy is therefore calculated on the marketing copy information, and meaningless long phrases that appear repeatedly are filtered out according to word frequency and word length, using the entropy formula of step S120 with m set to 25 and k set to 10. In this way, high-frequency meaningless long texts can be identified; for example, in WeChat Moments material, phrases such as "add WeChat to obtain a discount code and get the latest product news" appear in large numbers of copy items, do not help to distinguish the texts, and therefore need to be removed.
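A rough Python sketch of the two cleansing passes, under stated simplifications: the punctuation and emoji character classes are illustrative rather than the patent's exact rule set, and the second pass substitutes a plain frequency test for the full entropy filter (m = 25, k = 10 as above):

    import re
    from collections import Counter

    # character classes are illustrative, not the patent's exact rule set
    PUNCT = re.compile(r"[!?;:~。！？；：…、\"'“”()（）\[\]【】]")
    EMOJI = re.compile(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]")

    def first_cleanse(text):
        text = EMOJI.sub("，", text)   # simplification: every emoji -> pause comma
        return PUNCT.sub("，", text)   # non-comma punctuation -> text comma

    def second_cleanse(corpus, m=25, k=10, min_len=8):
        # frequency stand-in for the entropy filter: comma-separated segments
        # shorter than m that recur more than k times are dropped as boilerplate
        freq = Counter(seg for text in corpus for seg in text.split("，"))
        return ["，".join(s for s in text.split("，")
                          if not (min_len < len(s) < m and freq[s] > k))
                for text in corpus]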
According to the model effect verification method provided by the invention, text information subjected to data cleaning is acquired; keywords are extracted from the texts and used as their labels; a plurality of first texts are randomly selected from the text information, an equal number of second texts are coarsely recalled by collaborative filtering on the first texts' labels, and similar text pairs are constructed from the first and second texts; the similar text pairs are encoded by the original coding model to obtain first similarity scores, which are compared with the manual similarity scores to obtain a first correlation score; the similar text pairs are encoded by the trained coding model to obtain second similarity scores, which are compared with the manual similarity scores to obtain a second correlation score; and whether the trained coding model's encoded representation of a text conforms to the text semantics is judged from the first and second correlation scores. The method can simulate the online effect of the optimized coding model: keywords are extracted from all texts by unsupervised keyword extraction, and similar text pairs are then extracted from a large number of unlabeled texts using the idea of collaborative filtering, saving a large amount of time otherwise spent manually selecting similar evaluation text pairs. In addition, the coarsely recalled similar text pairs are scored manually, by the original bert model, and by the fine-tuned bert model; the Spearman correlation between the original bert model and the manual scores and the Spearman correlation between the fine-tuned bert model and the manual scores are then computed. Verified on user data, the encoded representation produced by the model fine-tuned under this scheme better conforms to the text semantics, and the similarity scores it computes are more accurate.
Example 2:
an embodiment of the present invention further provides a model effect verification apparatus, as shown in fig. 4, including:
the text information acquisition module is used for acquiring the text information subjected to data cleaning;
the keyword extraction module is used for extracting keywords from the text in the text information and taking the keywords as labels of the text;
the similar text pair construction module is used for randomly selecting a plurality of first texts from the text information, roughly recalling the same number of second texts by using the labels of the first texts and adopting a collaborative filtering mode, and constructing similar text pairs according to the first texts and the second texts;
the first relevance score calculating module is used for coding the similar text pairs according to an original coding model to obtain a first similarity score, and calculating the first similarity score and an artificial similarity score to obtain a first relevance score, wherein the artificial similarity score is obtained by calculating the similar text pairs in an artificial labeling mode;
the second relevance score calculating module is used for coding the similar text pair according to the trained coding model to obtain a second similarity score, and calculating the second similarity score and the artificial similarity score to obtain a second relevance score;
and the judging module is used for judging whether the coding representation of the trained coding model to the text conforms to the text semantics according to the first relevance score and the second relevance score.
In one embodiment, the keyword extraction module includes:
the word segmentation processing unit is used for setting word frequency and word length, and performing word segmentation processing on the text in the text information in a mode of calculating information entropy to obtain a plurality of word segmentation results;
and the keyword extraction unit is used for extracting at least one keyword from the word segmentation results in a TF-IDF calculation mode to be used as a label of the text.
In one embodiment, the first relevance score calculating module or the second relevance score calculating module is further configured to perform the following steps:
the relevance score is calculated using the following formula:
r = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}
where d_i denotes the difference between the rank values of the i-th data pair, n denotes the total number of samples, and r denotes the correlation score of the two columns of data; the closer r is to 1, the more correlated the two columns are, and the closer r is to 0, the less correlated the two columns of scores are.
Example 3:
the embodiment of the present invention further provides a model training method, as shown in fig. 5 and 6, including:
step 1, acquiring text information to be trained which is subjected to normalization processing.
In this step, the text information to be trained includes copy information to be trained from different channels, for example marketing copy collected back from the various platforms on which clients (such as FMCG brands) advertise, including Xiaohongshu, Douyin, and WeChat Moments.
The normalization processing refers to dividing the marketing copy information by channel (the content specifications of the platforms are inconsistent) and applying different cleaning rules to filter out useless interference information.
First, Xiaohongshu marketing copy is characterized by long articles with relatively fixed wording that contain a large number of emoji. Meaningless emoji therefore need to be filtered out, while meaningful emoji and punctuation with a pause function, such as decorative effect symbols, cross marks, exclamation marks, and question marks expressing tone, are uniformly replaced with text commas; at the same time, all non-comma Chinese and English punctuation is replaced with text commas.
Second, WeChat Moments marketing copy is characterized by short texts containing a large amount of meaningless text information. Information entropy therefore needs to be calculated on the marketing copy information, filtering out meaningless repeated long text phrases according to word frequency and word length.
Step 2, randomly selecting a sentence from the text information to be trained as a first sample, performing data enhancement processing on the first sample to obtain a second sample, and taking the other sentences in the text information to be trained as third samples.
In the step 2, the process is carried out,
firstly, one sentence is randomly selected from the normalized text information to be trained as a first sample a, and another sentence serves as a third sample b;
secondly, the first sample a is subjected to data enhancement: each adjective in a is replaced with a synonym or near-synonym with probability 50% (and left unchanged with probability 50%), yielding a second sample a'.
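A minimal sketch of this augmentation, assuming a small illustrative synonym table (a real system would draw on a proper thesaurus and a part-of-speech tagger to find the adjectives):

    import random

    # illustrative synonym table; a real system would use a thesaurus
    SYNONYMS = {
        "supple": ["soft", "silky"],
        "lasting": ["long-lasting", "enduring"],
    }

    def augment(tokens, p=0.5):
        # sample a': each adjective with a synonym entry is replaced with
        # probability p (50% replaced, 50% left unchanged, as above)
        return [random.choice(SYNONYMS[t]) if t in SYNONYMS and random.random() < p
                else t
                for t in tokens]

    a = ["supple", "anti-dandruff", "shampoo"]
    a_prime = augment(a)   # e.g. ["silky", "anti-dandruff", "shampoo"]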
Step 3, respectively coding the first sample, the second sample and the third sample through the original coding model to obtain a first, a second and a third coding result, and randomly shuffling the second coding result in the word position dimension and the word coding dimension to obtain a fourth coding result.
In step 3, the samples a, a', b are encoded by the original bert coding model to obtain encoding results a1, a1', b1, and the encoding result a1' is randomly shuffled in the token dimension (word position dimension) and the embedding dimension (word encoding dimension) to obtain a2.
Specifically, "randomly shuffled in the word position dimension and the word encoding dimension" in step 3 is exemplified as follows:
the test sentence "antidandruff and antipruritic shampoo" is a sentence of 7 words, which is coded as a three-dimensional vector of (1, 7, 768), the first dimension represents several sentences here, here 1 sentence; the second dimension represents several words, here 7 words, corresponding to the sentence; the third dimension represents how many digits represent each word, here 768 digit features.
The dimension of the character position is randomly disordered, namely randomly disordered in the second dimension, for example, the disordered character may become 'shampoo itching relieving and scurf removing water'; the word encoding dimension is scrambled, that is, the scrambling is performed in the third dimension, for example, we originally use 768 numbers to represent "go" the word [0.32,0.98,0.12,0.55,0.86,0.77.... 0.32,0.65], where there are a total of 768 numbers, and the encoding of 768 numbers representing "go" the word after the scrambling may become [0.98,0.32,0.12,0.65,0.86,0.77.... 0.32,0.55 ].
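As a sketch of step 3's shuffling, assuming PyTorch tensors of shape (1, 7, 768); applying one random permutation per dimension is one plausible reading of the description:

    import torch

    def shuffle_views(enc):
        # enc: (batch, seq_len, hidden), e.g. (1, 7, 768)
        b, t, h = enc.shape
        enc = enc[:, torch.randperm(t), :]   # scramble the word position dimension
        enc = enc[:, :, torch.randperm(h)]   # scramble the word encoding dimension
        return enc

    a1_prime = torch.randn(1, 7, 768)   # stand-in for the bert encoding of a'
    a2 = shuffle_views(a1_prime)        # the fourth coding result of step 3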
Step 4, constructing the first coding result and the third coding result into a negative sample pair, and constructing the first coding result and the fourth coding result into a positive sample pair.
Step 5, performing steps 2-4 on all sentences in the text information to be trained to obtain all positive and negative sample pairs.
In step 4-5, constructing a1 and b1 as a negative sample pair, constructing a1 and a2 as a positive sample pair, and performing the above operations on N text pairs in the text information to be trained to obtain all the sample pairs.
Step 6, calculating the loss by a contrastive learning method according to the positive and negative sample pairs, so as to optimize the original coding model.
In step 6, the following contrastive learning loss is used for the calculation:
\ell_i = -\log \frac{\exp(\mathrm{sim}(r_i, r_j)/t)}{\sum_{k=1}^{N} \exp(\mathrm{sim}(r_i, r_k)/t)}
where (r_i, r_j) denotes a positive sample pair, (r_i, r_k) denotes a negative sample pair, sim(r_i, r_j) denotes the cosine similarity of the encoded vectors, and t is a temperature normalization factor;
the distance between members of a positive sample pair is pulled closer and the distance between members of a negative sample pair is pushed apart, reducing the loss and thereby optimizing the original coding model. Contrastive learning loss is adopted for the loss calculation, and the bert model is fine-tuned accordingly.
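An illustrative PyTorch rendering of this loss, assuming cosine similarity and a temperature t as above; the tensors stand in for the encodings a1 (anchor), a2 (shuffled positive) and b1 (negatives):

    import torch
    import torch.nn.functional as F

    def contrastive_loss(anchor, positive, negatives, t=0.05):
        # -log( exp(sim(ri, rj)/t) / sum_k exp(sim(ri, rk)/t) ), sim = cosine
        pos = F.cosine_similarity(anchor, positive, dim=0) / t
        neg = F.cosine_similarity(anchor.unsqueeze(0), negatives, dim=1) / t
        logits = torch.cat([pos.unsqueeze(0), neg]).unsqueeze(0)  # positive at index 0
        return F.cross_entropy(logits, torch.zeros(1, dtype=torch.long))

    a1 = torch.randn(768, requires_grad=True)   # encoding of sample a
    a2 = torch.randn(768)                       # shuffled encoding of a' (positive)
    b1 = torch.randn(4, 768)                    # encodings of other sentences (negatives)
    loss = contrastive_loss(a1, a2, b1)
    loss.backward()   # in real training, gradients fine-tune the bert encoder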
The readable storage medium may be a computer storage medium or a communication medium. Communication media includes any medium that facilitates transfer of a computer program from one place to another. Computer storage media may be any available media that can be accessed by a general purpose or special purpose computer. For example, a readable storage medium is coupled to the processor such that the processor can read information from, and write information to, the readable storage medium. Of course, the readable storage medium may also be an integral part of the processor. The processor and the readable storage medium may reside in an Application Specific Integrated Circuits (ASIC). Additionally, the ASIC may reside in user equipment. Of course, the processor and the readable storage medium may also reside as discrete components in a communication device. The readable storage medium may be a read-only memory (ROM), a random-access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
The present invention also provides a program product comprising execution instructions stored in a readable storage medium. The at least one processor of the device may read the execution instructions from the readable storage medium, and the execution of the execution instructions by the at least one processor causes the device to implement the methods provided by the various embodiments described above.
In the above embodiments of the terminal or the server, it should be understood that the Processor may be a Central Processing Unit (CPU), other general-purpose processors, a Digital Signal Processor (DSP), etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of a method disclosed in connection with the present invention may be embodied directly in a hardware processor, or in a combination of the hardware and software modules within the processor.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and these modifications or substitutions do not depart from the spirit of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for verifying model effect, comprising:
acquiring text information subjected to data cleaning;
extracting keywords from the text in the text information, and using the keywords as labels of the text;
randomly selecting a plurality of first texts from the text information, roughly recalling the same number of second texts by using labels of the first texts and adopting a collaborative filtering mode, and constructing similar text pairs according to the first texts and the second texts;
coding the similar text pairs according to an original coding model to obtain first similar scores, and calculating the first similar scores and artificial similar scores to obtain first relevance scores, wherein the artificial similar scores are obtained by calculating the similar text pairs in an artificial labeling mode;
coding the similar text pair according to the trained coding model to obtain a second similarity score, and calculating the second similarity score and the artificial similarity score to obtain a second relevance score;
and judging whether the coding representation of the trained coding model to the text conforms to the text semantics or not according to the first relevance score and the second relevance score.
2. The method for verifying model effect according to claim 1, wherein the extracting keywords from the text in the text information comprises:
setting word frequency and word length, and performing word segmentation processing on the text in the text information in a mode of calculating information entropy to obtain a plurality of word segmentation results;
and extracting at least one keyword from the word segmentation results in a TF-IDF calculation mode to serve as a label of the text.
3. The method of claim 1, wherein calculating the first similarity score and the artificial similarity score to obtain a first relevance score comprises:
the relevance score is calculated using the following formula:
r = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}
where d_i denotes the difference between the rank values of the i-th data pair, n denotes the total number of samples, and r denotes the correlation score of the two columns of data; the closer r is to 1, the more correlated the two columns are, and the closer r is to 0, the less correlated the two columns of scores are.
4. The method for verifying model effect according to claim 1, wherein before obtaining the text information subjected to data cleansing, the method comprises:
acquiring marketing pattern information of different channels;
performing first data cleaning and second data cleaning on the marketing file information;
the first data cleansing comprises: filtering the meaningless special symbols, uniformly replacing the meaningless special symbols with text commas, and replacing all Chinese and English symbols which are not commas with text commas;
the second data cleansing comprises: and calculating the information entropy of the marketing case information subjected to the first data cleaning, and filtering meaningless repeated long text phrases according to the word frequency and the word length to obtain the text information subjected to the data cleaning.
5. The method for verifying model effect according to claim 4, wherein the calculating of the entropy of the marketing copy information cleaned by the first data and filtering out meaningless repeated long text phrases according to word frequency and word length comprises:
setting word frequency and word length, and filtering out meaningless repeated long text phrases in a mode of calculating information entropy, wherein the information entropy calculation formula comprises the following steps:
H(W) = -\sum_{i} P(i \mid W) \log P(i \mid W), \quad \text{subject to} \ \mathrm{len}(W) < m, \ \mathrm{freq}(W) > k
where W denotes a phrase to be subjected to information entropy calculation, len(W) < m constrains the word length to within m, freq(W) > k requires the word to occur more than k times, and P(i|W) denotes the probability of each word i appearing adjacent to the word W.
6. The method for verifying model effect according to claim 1, wherein randomly selecting a plurality of first texts from the text information, roughly recalling a same number of second texts by using labels of the first texts and adopting a collaborative filtering manner, and constructing similar text pairs according to the first texts and the second texts comprises:
step 1: randomly selecting a preset number of sentences from the corpus as a first sentence group, and identifying the keyword labels and label count of each sentence in the first sentence group;
step 2: randomly selecting a sentence from the first sentence group as a first sentence;
and step 3: respectively identifying the keyword labels and the label quantity of other sentences except the first sentence group in the corpus set;
step 4: computing, for the first sentence and each other sentence, the ratio (number of overlapping labels between the first sentence and the other sentence) / (square root of the first sentence's total label count × square root of the other sentence's total label count), sorting the results from small to large, and randomly coarse-recalling one of the other sentences ranked within a preset range as a second sentence; constructing a sentence pair according to the first sentence and the second sentence;
and 5: and coarsely recalling the sentences with the same number in the first sentence group according to the collaborative filtering mode in the steps 1-4 to construct sentence pairs.
7. A model effect verification apparatus, comprising:
the text information acquisition module is used for acquiring the text information subjected to data cleaning;
the keyword extraction module is used for extracting keywords from the text in the text information and taking the keywords as labels of the text;
the similar text pair construction module is used for randomly selecting a plurality of first texts from the text information, roughly recalling the same number of second texts by using the labels of the first texts and adopting a collaborative filtering mode, and constructing similar text pairs according to the first texts and the second texts;
the first relevance score calculating module is used for coding the similar text pairs according to an original coding model to obtain a first similarity score, and calculating the first similarity score and an artificial similarity score to obtain a first relevance score, wherein the artificial similarity score is obtained by calculating the similar text pairs in an artificial labeling mode;
the second relevance score calculating module is used for coding the similar text pair according to the trained coding model to obtain a second similarity score, and calculating the second similarity score and the artificial similarity score to obtain a second relevance score;
and the judging module is used for judging whether the coding representation of the trained coding model to the text conforms to the text semantics according to the first relevance score and the second relevance score.
8. The model effect verification apparatus according to claim 7, wherein the keyword extraction module comprises:
the word segmentation processing unit is used for setting word frequency and word length, and performing word segmentation processing on the text in the text information in a mode of calculating information entropy to obtain a plurality of word segmentation results;
and the keyword extraction unit is used for extracting at least one keyword from the word segmentation results in a TF-IDF calculation mode to be used as the label of the text.
9. The model effect verification apparatus according to claim 7, wherein the first relevance score calculating module or the second relevance score calculating module is further configured to perform the following steps:
the relevance score is calculated using the following formula:
r = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}
where d_i denotes the difference between the rank values of the i-th data pair, n denotes the total number of samples, and r denotes the correlation score of the two columns of data; the closer r is to 1, the more correlated the two columns are, and the closer r is to 0, the less correlated the two columns of scores are.
10. A computer device comprising a memory and a processor, the memory storing a computer program operable on the processor, wherein the processor implements the steps of the model effect verification method of any one of claims 1 to 5 when executing the computer program.
CN202210665280.2A 2022-06-13 2022-06-13 Model effect verification method and device Pending CN115017264A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210665280.2A 2022-06-13 2022-06-13 Model effect verification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210665280.2A 2022-06-13 2022-06-13 Model effect verification method and device

Publications (1)

Publication Number Publication Date
CN115017264A (en) 2022-09-06

Family

ID=83075689

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210665280.2A Pending CN115017264A (en) 2022-06-13 2022-06-13 Model effect verification method and device

Country Status (1)

Country Link
CN (1) CN115017264A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117591698A (en) * 2024-01-19 2024-02-23 腾讯科技(深圳)有限公司 Training method of video retrieval model, video retrieval method, device and equipment
CN117591698B (en) * 2024-01-19 2024-04-26 腾讯科技(深圳)有限公司 Training method of video retrieval model, video retrieval method, device and equipment

Similar Documents

Publication Publication Date Title
CN106776574B (en) User comment text mining method and device
CN108628833B (en) Method and device for determining summary of original content and method and device for recommending original content
CN104881458B (en) A kind of mask method and device of Web page subject
CN115292469B (en) Question-answering method combining paragraph search and machine reading understanding
CN105975454A (en) Chinese word segmentation method and device of webpage text
CN112395395B (en) Text keyword extraction method, device, equipment and storage medium
CN112131863A (en) Comment opinion theme extraction method, electronic equipment and storage medium
CN108363691B (en) Domain term recognition system and method for power 95598 work order
CN111160019B (en) Public opinion monitoring method, device and system
CN109271524B (en) Entity linking method in knowledge base question-answering system
CN112364628B (en) New word recognition method and device, electronic equipment and storage medium
CN111198946A (en) Network news hotspot mining method and device
CN108536676B (en) Data processing method and device, electronic equipment and storage medium
CN112069312A (en) Text classification method based on entity recognition and electronic device
CN114020876A (en) Method, device and equipment for extracting keywords of text and storage medium
Nugraha et al. Typographic-based data augmentation to improve a question retrieval in short dialogue system
CN111914554A (en) Training method of field new word recognition model, field new word recognition method and field new word recognition equipment
CN115017264A (en) Model effect verification method and device
CN111191469A (en) Large-scale corpus cleaning and aligning method and device
CN110929022A (en) Text abstract generation method and system
CN110457707B (en) Method and device for extracting real word keywords, electronic equipment and readable storage medium
CN113312903B (en) Method and system for constructing word stock of 5G mobile service product
CN112035670B (en) Multi-modal rumor detection method based on image emotional tendency
CN114943285A (en) Intelligent auditing system for internet news content data
CN113934910A (en) Automatic optimization and updating theme library construction method and hot event real-time updating method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination