WO2020220539A1 - Data increment method, device, computer equipment and storage medium - Google Patents


Info

Publication number
WO2020220539A1
WO2020220539A1 (PCT/CN2019/103271)
Authority
WO
WIPO (PCT)
Prior art keywords
sample
target
synonyms
word
replaced
Prior art date
Application number
PCT/CN2019/103271
Other languages
English (en)
French (fr)
Inventor
郑立颖
徐亮
阮晓雯
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2020220539A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/237 Lexical tools
    • G06F 40/247 Thesauruses; Synonyms

Definitions

  • This application relates to the field of data increment technology, and in particular to a data increment method, device, computer equipment and storage medium.
  • the embodiments of the present application provide a data increment method, device, computer equipment, and storage medium to solve the problem that the training text data used in the current text classification model training is unbalanced and the accuracy of the model training cannot be guaranteed.
  • a data increment method including:
  • a data increment device includes:
  • the sample acquisition module is used to acquire a scene classification sample corresponding to a specific scene and a specified sample ratio, and the scene classification sample corresponds to a classification label;
  • the to-be-trained text acquisition module, configured to preprocess the scene classification samples using regular expressions to obtain the text to be trained;
  • the target word vector model acquisition module is configured to use the pre-trained original word vector model to perform incremental training on the text to be trained to acquire the target word vector model;
  • the actual sample ratio determination module is used to count the actual sample number corresponding to each classification label and the total sample number corresponding to all the scene classification samples, and to determine, based on the actual sample number and the total sample number, the actual sample ratio corresponding to the classification label;
  • a sample-to-be-incremented determination module configured to, if the actual sample ratio corresponding to the classification label is less than the specified sample ratio, use the scene classification sample corresponding to the classification label as the sample to be incremented;
  • the candidate phrase acquisition module is configured to input the sample to be incremented into the target word vector model for processing and to acquire at least one candidate phrase corresponding to the sample to be incremented, where the candidate phrase includes at least one target synonym carrying a word vector;
  • the first newly added sample acquisition module is configured to randomly select one of the target synonyms from each candidate phrase to perform replacement processing on the sample to be incremented, and obtain the first newly added sample corresponding to the classification label.
  • a computer device includes a memory, a processor, and computer-readable instructions stored in the memory and capable of running on the processor, and the processor implements the following steps when executing the computer-readable instructions:
  • One or more readable storage media storing computer readable instructions, when the computer readable instructions are executed by one or more processors, the one or more processors execute the following steps:
  • FIG. 1 is a schematic diagram of an application environment of a data increment method in an embodiment of the present application
  • FIG. 2 is a flowchart of a data increment method in an embodiment of the present application
  • FIG. 3 is a specific flowchart of step S10 in FIG. 2;
  • FIG. 4 is a specific flowchart of step S60 in FIG. 2;
  • FIG. 5 is a specific flowchart of step S70 in FIG. 2;
  • FIG. 6 is a specific flowchart of step S63 in FIG. 4;
  • FIG. 7 is a flowchart of a data increment method in an embodiment of the present application.
  • FIG. 8 is a flowchart of a data increment method in an embodiment of the present application.
  • FIG. 9 is a schematic diagram of a data increment device in an embodiment of the present application.
  • Fig. 10 is a schematic diagram of a computer device in an embodiment of the present application.
  • the data increment method provided in the embodiments of this application can be applied to a data increment tool, which performs automatic data increment on the under-represented samples when the sample distribution in text classification is uneven, so that the various types of samples are evenly distributed and the accuracy of subsequent text classification improves. Furthermore, this method can also enlarge the training set, ensuring that the training set for model training is sufficient and improving the accuracy of the model.
  • the data increment method can be applied in the application environment as shown in Fig. 1, where the computer equipment communicates with the server through the network.
  • Computer equipment can be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices.
  • the server can be implemented as an independent server.
  • a data increment method is provided, and the method is applied to the server in FIG. 1 as an example for description, including the following steps:
  • S10 Obtain a scene classification sample corresponding to a specific scene and a specified sample ratio, and the scene classification sample corresponds to a classification label.
  • the scene classification samples corresponding to a specific scene are texts obtained for different text classification scenes (such as smart interview scoring scenes), and the scene classification samples correspond to a classification label.
  • the classification labels refer to the classification labels corresponding to different categories in different text classification scenarios. For example, in the smart interview scoring, the classification labels include preference, deviation, medium, very good, and very bad.
  • the data increment tool pre-stores the text data corresponding to different scene types. The user can select the desired scene type in the data increment tool and upload self-collected corpus data as scene classification samples, so that the server obtains the scene classification samples.
  • Specified sample ratio refers to the ratio of scene classification samples corresponding to different classification labels to the total number of samples.
  • S20 Use regular expressions to perform text preprocessing on the scene classification samples to obtain the text to be trained.
  • the preprocessing of the scene classification samples includes but is not limited to English removal processing and stop word removal processing.
  • stop word removal processing means that in information retrieval, in order to save storage space and improve search efficiency, certain stop words (such as "I", "a", and "down" in the text) are automatically filtered out before or after processing natural language data (or text).
  • regular expressions can be used for filtering; for example, with [^\u4e00-\u9fa5], English characters can be filtered out to obtain text to be trained that contains only Chinese characters.
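A minimal sketch of this preprocessing step (the stop-word list and the sample sentence are illustrative assumptions, not taken from the patent):

```python
import re

# Illustrative stop-word list (assumed; the patent does not enumerate one).
STOP_WORDS = {"的", "了", "呢"}

def preprocess(text: str) -> str:
    # Drop everything outside the CJK range \u4e00-\u9fa5 (English letters,
    # digits, punctuation), then filter single-character stop words.
    chinese_only = re.sub(r"[^\u4e00-\u9fa5]", "", text)
    return "".join(ch for ch in chinese_only if ch not in STOP_WORDS)

print(preprocess("今天的面试very good！评分了90"))  # 今天面试评分
```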
  • S30 Use the pre-trained original word vector model to perform incremental training on the text to be trained to obtain the target word vector model.
  • the original word vector model is a word vector model obtained by pre-training with the word2vec training function in the gensim library.
  • gensim is a Python natural language processing library that can convert documents into vector representations based on TF-IDF, LDA, LSI and other models for further processing.
  • the gensim library also contains a word2vec training function to convert words into word vectors (word2vec). Since word vectors have good semantic characteristics, they are a common way to represent the features of a word. Representing words in the form of word vectors makes it convenient to subsequently use them to train the text classification model.
  • the word2vec training function is a training function used to train a word vector model.
  • word2vec can be efficiently trained on millions of dictionaries and hundreds of millions of data sets.
  • the training results obtained by this tool, word embeddings, can measure the similarity between words well.
  • S40 Count the actual number of samples corresponding to each classification label and the total number of samples corresponding to all scene classification samples, and determine the actual sample ratio corresponding to the classification label based on the actual number of samples and the total number of samples.
  • the total number of samples refers to the total data volume of the scene classification samples.
  • the actual number of samples refers to the actual number of samples corresponding to each classification label.
  • the server counts the actual number of samples corresponding to each classification label and the total number of samples corresponding to all scene classification samples, and can determine the actual sample ratio corresponding to the classification label based on the actual number of samples and the total number of samples; that is, the ratio of the actual number of samples to the total number of samples is taken as the actual sample ratio corresponding to the classification label.
  • the sample to be incremented is a text sample that requires data increment processing.
  • Different classification labels correspond to different classification sample ratios.
  • the sample ratio must be maintained at a certain proportion to ensure the accuracy of model training. If the sample ratio corresponding to a certain type of text is low, model training will be biased, making the accuracy of the model low. Therefore, in this embodiment, the server dynamically adjusts the scene classification samples according to the specified ratio corresponding to each classification label set by the user.
  • the user inputs the scene classification samples and the specified sample ratio corresponding to each classification label into the data increment tool, and the server by default treats the specified sample ratio input by the user as the sample ratio at which the data are balanced.
  • the server counts the actual sample ratio corresponding to each classification label and compares it with the specified sample ratio. If the actual sample ratio corresponding to the classification label is less than the specified sample ratio, the data of the classification samples input by the user are considered unbalanced, and the classification samples corresponding to that classification label are used as the samples to be incremented, so that the server performs data increment on them.
  • If the actual sample ratio corresponding to the classification label is not less than the specified sample ratio, the data of the classification samples input by the user are considered balanced, and data increment processing is not required.
  • the actual sample ratio is compared with the specified sample ratio to determine whether data increment processing needs to be performed, ensuring the effectiveness of the data increment processing.
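Steps S40 and S50 can be sketched as follows (the label names and ratios are illustrative assumptions):

```python
from collections import Counter

# Illustrative labels and specified ratios (not from the patent).
labels = ["good"] * 6 + ["medium"] * 3 + ["bad"] * 1
specified_ratio = {"good": 0.4, "medium": 0.3, "bad": 0.3}

total = len(labels)                       # total number of samples (S40)
actual_count = Counter(labels)            # actual number per classification label
actual_ratio = {lab: n / total for lab, n in actual_count.items()}

# S50: labels whose actual ratio falls below the specified ratio; their
# samples become the samples to be incremented.
to_increment = [lab for lab, r in specified_ratio.items()
                if actual_ratio.get(lab, 0.0) < r]
```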
  • S60 Input the sample to be increased into the target word vector model for processing, and obtain at least one candidate phrase corresponding to the sample to be increased, where the candidate phrase includes at least one target synonym carrying the word vector.
  • the sample to be incremented contains a number of Chinese words, and each Chinese word corresponds to a number of target synonyms, and the candidate phrase is a set of target synonyms corresponding to each Chinese word and each Chinese word in the sample to be incremented.
  • the target synonym carries the word vector, so that when the subsequent text classification model is trained with the text after data increment processing, no word vector conversion is needed; this realizes an automatic labeling function and further improves the training efficiency of the subsequent text classification model.
  • the sample to be incremented is input into the target word vector model for processing, and the candidate phrases corresponding to the sample to be incremented are obtained, providing a data source for the subsequent replacement of the sample to be incremented according to the target synonyms in the candidate phrases, that is, for the data increment.
  • S70 Randomly select a target synonym from each candidate phrase to perform replacement processing on the sample to be incremented, to obtain the first newly added sample corresponding to the classification label.
  • the first newly added sample refers to the new sample obtained by replacing words in the sample to be incremented with the target synonyms in the candidate phrases.
  • the server randomly selects a target synonym from each candidate phrase and performs replacement processing on the sample to be incremented to obtain the first newly added sample corresponding to the classification label, achieving the purpose of data increment and thereby ensuring the data balance of the samples.
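The random-replacement step can be sketched as follows (the candidate phrases and sample sentence are illustrative assumptions):

```python
import random

# Illustrative candidate phrases: each maps a replaceable word to itself
# plus its target synonyms (assumed data, not from the patent).
candidate_phrases = {
    "喜欢": ["喜欢", "喜爱", "爱好"],
    "吃饭": ["吃饭", "用餐"],
}

def make_new_sample(tokens, phrases, rng):
    # Replace each replaceable word with a randomly chosen entry of its
    # candidate phrase; word positions stay fixed, preserving the pattern.
    return [rng.choice(phrases.get(tok, [tok])) for tok in tokens]

sample = ["我", "喜欢", "吃饭"]
new_sample = make_new_sample(sample, candidate_phrases, random.Random(0))
```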
  • Data increment processing: that is, if the actual sample ratio corresponding to the classification label is less than the specified sample ratio, the scene classification samples corresponding to the classification label are used as the samples to be incremented, ensuring the effectiveness of the data increment processing.
  • In step S10, obtaining a scene classification sample corresponding to a specific scene specifically includes the following steps:
  • S11 Obtain original voice information corresponding to a specific scene, and use a voice enhancement algorithm to perform noise reduction processing on the original voice information to obtain target voice information.
  • specific scenarios include, but are not limited to, specific scenarios that require text classification, such as smart interviews.
  • the original voice information refers to the voice information collected in a specific scenario.
  • a smart interview scenario is taken as an example for description.
  • the server can receive, in real time, the interviewer's reply voice information collected by the voice collection device, that is, the original voice information.
  • the original voice collected by the voice collection device generally contains noise, including noise in the background environment and noise generated during the recording process of the voice collection device.
  • original speech information carrying noise will affect the accuracy of speech recognition. Therefore, it is necessary to perform noise reduction on the original speech to extract speech that is as pure as possible from the speech signal, making speech recognition more accurate.
  • methods for reducing noise on the original speech include but are not limited to using spectral subtraction, EEMD decomposition algorithm, and SVD singular value algorithm.
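Spectral subtraction, one of the noise-reduction options named above, can be sketched with NumPy (a minimal single-frame sketch under assumed conditions; real systems add framing, windowing, and overlap-add, and estimate the noise spectrum from noise-only segments):

```python
import numpy as np

# Build a noisy test signal (illustrative; a real system would estimate
# the noise spectrum from noise-only frames of the recording).
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 256))
noise = 0.1 * rng.standard_normal(256)
noisy = clean + noise

# Spectral subtraction: subtract the noise magnitude estimate from the
# noisy magnitude spectrum, floor at zero, and keep the noisy phase.
spectrum = np.fft.rfft(noisy)
noise_mag = np.abs(np.fft.rfft(noise))
mag = np.clip(np.abs(spectrum) - noise_mag, 0.0, None)
denoised = np.fft.irfft(mag * np.exp(1j * np.angle(spectrum)), n=256)
```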
  • the scene classification sample can be voice data or text data. If it is voice data, it needs to be converted into processable text data; if it is text data, no processing is required. This ensures the generalization of the data increment tool.
  • S12 Perform feature extraction on target voice information, and obtain target voice features corresponding to the target voice information.
  • the target voice features include but are not limited to filter features.
  • Filter-Bank (Fbank) features are voice features commonly used in speech recognition. Since the Mel feature commonly used in the prior art performs dimensionality reduction on the voice information during model recognition, part of the voice information is lost; to avoid this problem, the filter feature is used in this embodiment in place of the commonly used Mel feature.
  • S13 Use a pre-trained speech recognition model to recognize target speech features, and obtain scene classification samples corresponding to a specific scene.
  • the speech recognition model includes a pre-trained acoustic model and a language model.
  • the acoustic model is used to obtain the phoneme sequence corresponding to the target speech feature.
  • A phoneme is the smallest unit in phonetics and can be understood as the pinyin of a Chinese character. For example, the Chinese syllable ā has only one phoneme, ài has two phonemes, and dài has three phonemes.
  • the training method of the acoustic model includes but is not limited to training with GMM-HMM (Gaussian mixture model combined with hidden Markov model).
  • the language model is a model used to convert phoneme sequences into natural language text.
  • the server inputs the voice features into the pre-trained acoustic model for recognition to obtain the phoneme sequence corresponding to the target voice feature, and then inputs the obtained phoneme sequence into the pre-trained language model for conversion to obtain the corresponding recognized text.
  • the data type of the scene classification sample corresponding to the specific scene is used to determine whether conversion to text is needed; that is, if it is voice data, the voice data needs to be converted into processable text data, and if it is text data, no processing is required. This ensures the generalization of the data increment.
  • the target word vector model includes an approximation function.
  • In step S60, the sample to be incremented is input into the target word vector model for processing to obtain the candidate phrases corresponding to the sample to be incremented, where each candidate phrase includes at least one target synonym carrying a word vector; this specifically includes the following steps:
  • S61 Use regular expressions to divide the sample to be incremented, and obtain at least one sentence to be replaced corresponding to the sample to be incremented.
  • the sentence to be replaced refers to a sentence obtained by segmenting the sample to be incremented by a regular expression.
  • the maximum length MAX of sentence segmentation needs to be set; and then the sample to be incremented is segmented into at least one sentence, that is, the sentence to be replaced.
  • the segmentation method can specifically use regular expressions to split the sentence at sentence-ending punctuation (such as "？", "。", "，", and "！").
  • S62 Use a Chinese word segmentation algorithm to segment each sentence to be replaced, and obtain at least one word to be replaced corresponding to the sentence to be replaced.
  • Before performing the data increment, the server also needs to segment the sample to be incremented to obtain the word orders, so that they can subsequently be input into the word vector model for processing.
  • the Chinese word segmentation algorithm includes but is not limited to the algorithm of maximum reverse matching.
  • the maximum reverse matching algorithm is used to segment the sample to be incremented, and the first word order corresponding to the sample to be incremented is obtained.
  • the algorithm of maximum reverse matching is an algorithm used to segment Chinese words. This algorithm has the advantages of high accuracy and low algorithm complexity.
  • the Chinese thesaurus (hereinafter referred to as "thesaurus") is a thesaurus used to segment Chinese characters.
  • the specific steps of using the maximum reverse matching algorithm to segment each sentence to be replaced are: start segmenting each sentence from right to left to obtain a single character string; then compare the single character string with the lexicon; if the word is included in the thesaurus, record it to form a word order; otherwise, reduce the string by one character and continue comparing until one character remains.
  • For example, suppose the maximum sentence segmentation length MAX is 5 and the input sentence is "我一个人吃饭" ("I eat alone").
  • After confirming that "吃饭" ("eat") is the first word order in the input sentence, the remaining input becomes "我一个人". This string is not in the lexicon, so one character, "我" ("I"), is removed and the single character string becomes "一个人" ("a person"); this is still not in the thesaurus, so another character, "一" ("one"), is removed and the single character string becomes "个人" ("individual"). This word exists in the thesaurus, so it is recorded as the second word order.
  • After determining "个人" ("person") as the second word order, the remaining input becomes "我一" ("I one"). This string is not in the lexicon, so one character, "我", is removed and the single character string becomes "一"; the word "一" exists in the thesaurus, so it is recorded as the third word order.
  • Finally, only "我" remains and is recorded as the last word order, and the algorithm terminates.
  • the segmentation result of the sentence "我一个人吃饭" ("I eat alone") using the maximum reverse matching algorithm is "我/一/个人/吃饭" ("I/one/person/eat").
  • the word order position of the word to be replaced corresponding to each sentence to be replaced is fixed and corresponds to the sentence to be replaced.
  • For example, the sentence to be replaced is "我一个人吃饭" ("I eat alone"). From the above word segmentation example, it can be seen that the word orders to be replaced are "我/一/个人/吃饭".
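The maximum reverse matching procedure above can be sketched as follows, with a toy lexicon that reproduces the "I eat alone" (我一个人吃饭) example:

```python
# Toy lexicon and maximum segmentation length for the example above.
LEXICON = {"我", "一", "个人", "吃饭"}
MAX = 5

def rmm_segment(sentence: str) -> list:
    words = []
    while sentence:
        # Take the longest suffix (at most MAX characters) and shrink it
        # one character at a time until the lexicon matches; a single
        # leftover character is kept as a word (out-of-vocabulary case).
        for size in range(min(MAX, len(sentence)), 0, -1):
            piece = sentence[-size:]
            if piece in LEXICON or size == 1:
                words.append(piece)
                sentence = sentence[:-size]
                break
    words.reverse()  # matches were collected right to left
    return words

print("/".join(rmm_segment("我一个人吃饭")))  # 我/一/个人/吃饭
```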
  • S63 Input each word to be replaced corresponding to the sentence to be replaced into the approximation function for processing, to obtain at least one target synonym that carries a word vector corresponding to the word to be replaced.
  • the approximation function is a function for returning the synonyms corresponding to each word to be replaced.
  • the target word vector model corresponds to the approximation function, so that the approximation function corresponding to the target word vector model can be directly called to obtain the target synonyms corresponding to each word to be replaced.
  • the server inputs each word to be replaced corresponding to the sample to be incremented into the approximation function corresponding to the target word vector model for processing, and obtains at least one target synonym carrying a word vector corresponding to the word to be replaced returned by the approximation function, providing a data source for subsequent data increment processing.
  • the word order to be replaced and the corresponding set of at least one target synonym carrying a word vector are used as a candidate phrase, so that a target synonym can be randomly selected from the candidate phrase to replace words in the sample to be incremented, achieving the purpose of data increment.
  • the sample to be incremented is segmented using regular expressions to obtain at least one sentence to be replaced corresponding to the sample to be incremented, so that when synonym replacement is subsequently performed, the server can, according to the word segmentation result of each sentence to be replaced, keep each word to be replaced at its position in the sentence to be replaced, ensuring that each first newly added sample is consistent with the sentence pattern of the sentence to be replaced.
  • the word order to be replaced and the corresponding at least one target synonym carrying a word vector are used as the candidate phrase corresponding to that word order, so that subsequent synonym replacement is performed according to the candidate phrase corresponding to each word order to be replaced, achieving the purpose of data increment.
  • In step S70, a target synonym is randomly selected from each candidate phrase to replace words in the sample to be incremented and obtain the first newly added sample corresponding to the classification label; this specifically includes the following steps:
  • S71 Randomly select a target synonym from the candidate phrase corresponding to each word to be replaced, and determine it as the target word corresponding to the word to be replaced.
  • the target term is the target synonym randomly selected by the server from the candidate phrase.
  • the server randomly selects a target synonym from each candidate phrase as the target word order corresponding to the word order to be replaced, and then replaces at least one word order to be replaced in the sample to be incremented with its corresponding target word order, obtaining a number of first newly added samples corresponding to the classification label and achieving the purpose of data increment.
  • Understandably, the candidate phrase corresponding to each word to be replaced includes the word to be replaced itself and its target synonyms. Because a target synonym is randomly selected from each candidate phrase and determined as the target word order, the target word order may be the same as the word order to be replaced, in which case the first newly added sample is identical to the sentence to be replaced. Therefore, after the first newly added samples are obtained, all of them are deduplicated and updated to obtain the first newly added samples corresponding to the classification label, ensuring the validity of the data set.
  • For example, suppose the words to be replaced include A and B. Since the position of each word to be replaced is fixed in the sentence to be replaced, the sentence sequence is AB. Suppose the target synonyms corresponding to the words to be replaced are A-(a1) and B-(b1, b2); then the candidate phrase corresponding to A is {A, a1} and the candidate phrase corresponding to B is {B, b1, b2}. Randomly selecting one target synonym from the candidate phrase corresponding to each word to be replaced and replacing the word with it yields the following combinations: (A, B), (A, b1), (A, b2), (a1, B), (a1, b1), (a1, b2). After removing (A, B), which is identical to the original sentence, the first newly added samples are obtained.
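The combination enumeration in the A/B example can be checked with itertools:

```python
from itertools import product

# The A/B example above: each candidate phrase holds the word to be
# replaced plus its target synonyms.
candidate_A = ["A", "a1"]
candidate_B = ["B", "b1", "b2"]

combos = ["".join(pair) for pair in product(candidate_A, candidate_B)]
# Deduplicate against the original sentence "AB": only genuinely new
# sentences count as first newly added samples.
new_samples = [s for s in combos if s != "AB"]
```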
  • In step S63, each word to be replaced corresponding to the sample to be incremented is input into the approximation function for processing to obtain at least one target synonym carrying a word vector corresponding to the word to be replaced; this specifically includes the following steps:
  • S631 Input each word to be replaced corresponding to the sample to be incremented into the approximation function for processing, and obtain at least one original synonym corresponding to the word to be replaced and the similarity corresponding to each original synonym.
  • an original synonym is a synonym corresponding to the word to be replaced, obtained by inputting each word to be replaced corresponding to the sample to be incremented into the approximation function for processing. Specifically, the server inputs each word to be replaced corresponding to the sample to be incremented into the approximation function for processing, and obtains at least one original synonym corresponding to the word to be replaced and the similarity corresponding to each original synonym, providing a data basis for subsequently determining the target synonyms.
  • S632 Determine the specified sample size based on the total sample size and the specified sample ratio.
  • the designated sample size refers to the total number of samples of the classification label corresponding to the sample to be incremented under the condition of data balance. Understandably, the specified sample size can keep the data in the sample set balanced. Specifically, the designated sample number is determined based on the total sample number of the samples to be incremented and the designated sample ratio, that is, the total sample number is multiplied by the designated sample ratio to obtain the designated sample number.
  • S633 Determine the increment parameter according to the difference between the specified sample size and the actual sample size.
  • S634 Calculate based on the target synonym number calculation formula to obtain the number of target synonyms carrying word vectors, where in the target synonym number calculation formula, m is the number of words to be replaced, N is the number of target synonyms, and Z is the increment parameter.
  • the increment parameter refers to the number of samples that needs to be supplemented for the sample to be incremented. Specifically, the actual sample number is subtracted from the specified sample number to obtain the increment parameter.
  • the server performs the calculation based on the target synonym number calculation formula to obtain the number of target synonyms carrying word vectors, where m is the number of words to be replaced, N is the number of target synonyms, and Z is the increment parameter. Understandably, since the original synonyms are numerous and cannot all be used, in this embodiment the number of target synonyms needs to be determined in order to achieve data balance of the samples.
  • S635 According to the number of target synonyms, select the top N target synonyms that carry word vectors from the original synonyms in descending order of similarity.
  • the server selects the top N original synonyms as the target synonyms from the original synonyms in descending order of similarity according to the number of target synonyms.
  • the value of N can be set according to actual needs and is not limited here.
  • each word to be replaced corresponding to the sample to be incremented is input into the approximation function for processing to obtain at least one original synonym corresponding to each word to be replaced and the similarity corresponding to each original synonym; then the specified sample number is determined, so that the number of target synonyms can be determined according to the specified sample number and the target synonym number calculation formula; finally, the target synonyms are determined according to the similarity corresponding to each original synonym and the number of target synonyms, ensuring the data balance of the samples.
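A sketch of steps S632 to S635 with illustrative numbers. The patent's exact target synonym number formula is not reproduced in this text, so the sketch assumes N = Z ** (1 / m) (an m-th root heuristic: m replaceable words with N synonyms each yield on the order of N ** m replacement combinations); treat that line, and the synonym list, as assumptions:

```python
import math

total_samples = 100        # total number of samples (illustrative)
specified_ratio = 0.3      # specified sample ratio for this label
actual_samples = 5         # actual number of samples for this label
m = 2                      # number of words to be replaced

specified_samples = total_samples * specified_ratio   # S632
Z = specified_samples - actual_samples                # S633: increment parameter

# S634 (assumed formula, see lead-in), rounded down when fractional.
N = math.floor(Z ** (1 / m))

# S635: keep the top-N original synonyms by descending similarity.
original_synonyms = [("喜爱", 0.9), ("爱好", 0.8), ("中意", 0.7),
                     ("偏爱", 0.6), ("钟爱", 0.5), ("热爱", 0.4)]
ranked = sorted(original_synonyms, key=lambda p: p[1], reverse=True)
target_synonyms = [word for word, _ in ranked[:N]]
```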
  • the data increment method further includes the following steps:
  • the value of N may be a positive integer or a floating-point number, so the server needs to judge the value type of N. If the number of target synonyms carrying word vectors is a positive integer, the step of selecting, according to that number, the top N target synonyms carrying word vectors from the original synonyms in descending order of similarity can be executed directly.
  • the server rounds down the number of target synonyms carrying word vectors; for example, if N is 5.1, N is rounded down to 5.
  • the value type of the number of target synonyms is judged to ensure the smooth execution of data increment and improve fault tolerance.
  • the data increment method further includes the following steps:
  • the actual number of selected target synonyms is smaller than the sample-balancing number of target synonyms calculated by the increment parameter calculation formula, so the missing portion needs to be supplemented; that is, the number of target synonyms carrying word vectors and the number of updated synonyms are processed with the calculation formula for the number of samples to be supplemented to obtain that number, so that samples can subsequently be supplemented on its basis.
  • the value of the number of samples to be supplemented may be a floating-point number, so its value type needs to be judged: if it is a floating-point number, it is rounded down or rounded up to obtain an integer number of samples to be supplemented; if it is a positive integer, no processing is required.
  • S92: Use the first translation tool to translate the sample to be incremented into non-Chinese text, then use the first translation tool or the second translation tool to translate the non-Chinese text into Chinese text, obtaining the second new samples corresponding to the classification label; once the number of second new samples reaches the number of samples to be supplemented, store the second new samples in association with the classification label.
  • the target synonym number calculation formula involves the calculation of exponential powers, so the synonym replacement method handles data increments of large data volume, whereas this embodiment also requires an increment of small data volume, so a translation tool is used to process the sample to be incremented and achieve the data increment. Understandably, since the languages supported by a translation tool are fixed, it can be used to supplement a small portion of the samples, i.e., the data is enhanced with the translation tool to guarantee data balance.
  • the sample to be incremented is Chinese text, so the first translation tool needs to be used to translate it into text in another language (i.e., non-Chinese text), and the non-Chinese text is then translated back into Chinese text.
  • the first translation tool refers to the current existing translation tools, such as Baidu Translate or Youdao Translate or Google Translate.
  • the second translation tool refers to the existing translation tools other than the first translation tool.
  • Non-Chinese text refers to the translated text obtained by using the first translation tool to translate the sample to be incremented into non-Chinese.
  • the Chinese text refers to the translated text containing only Chinese characters obtained by translation using the first translation tool or the second translation tool.
  • the second new sample refers to the sample obtained by data increment through the translation tool.
  • the sample size of the second new sample is the number of supplementary samples that use translation tools to supplement data.
  • the translation tool includes but is not limited to the Google translation tool, which supports a wide variety of languages to obtain more samples to be supplemented.
  • in the formula A = N^m - B^m, N^m is the number of first new samples that need to be obtained, B^m is the number of first new samples already obtained, and A, the number of samples to be supplemented, is the number of second new samples that need to be obtained.
  • acquiring the second new samples is a continuous process: once the number of second new samples already acquired reaches the number to be supplemented, acquisition of second new samples stops.
  • the server can call the translation interface provided by the first translation tool to translate the sample to be incremented into non-Chinese text, then use the second translation tool to translate the non-Chinese text into Chinese text, obtaining the second new samples corresponding to the classification label; once the number of second new samples reaches the number of samples to be supplemented, the second new samples are stored in association with the classification label, yielding more Chinese ways of expression and achieving the data increment.
  • the server will also use a Chinese word segmentation algorithm to segment the second new sample and obtain the words to be labeled corresponding to it, then input those words into the target word vector model for recognition, so that each word to be labeled is labeled with its word vector and the word vectors of the second new sample are obtained without manual labeling.
  • the second new sample, its word vector and its classification label are stored in association as model training samples, so that the model training samples can later be used directly to train the text classification model without manual collection, reducing labor costs.
  • because the number of updated synonyms in the above embodiment is obtained by rounding down the number of target synonyms, the number of target synonyms actually selected is smaller than the sample-balancing number calculated by the increment parameter calculation formula, so the small missing portion needs to be supplemented; that is, translation tools are used to process the sample to be incremented to obtain more Chinese ways of expression and supplement the small missing portion of samples.
  • the target word vector model is obtained, so that the N synonyms of each word of each classification sample can be obtained from the target word vector model for data increment, and the value N can be dynamically adjusted according to the specified sample ratio input by the user to achieve data balance.
  • for the case where the value of N is a non-integer in the synonym replacement method, the server also uses translation tools to supplement the small number of missing samples, ensuring data balance and effectively collecting more samples without manual collection, saving time. Further, the server can automatically label the word vectors of the newly acquired samples through the target word vector model, without manual intervention, reducing labor costs.
  • a data increment device is provided, and the data increment device corresponds to the data increment method in the above-mentioned embodiment one-to-one.
  • the data increment device includes a sample acquisition module 10, a text to be trained acquisition module 20, a target word vector model acquisition module 30, an actual sample ratio determination module 40, a sample to be incremented determination module 50, candidate phrase acquisition Module 60 and the first newly added sample acquisition module 70.
  • the detailed description of each functional module is as follows:
  • the sample acquisition module 10 is configured to acquire a scene classification sample corresponding to a specific scene and a specified sample ratio, where the scene classification sample corresponds to a classification label;
  • the to-be-trained text obtaining module 20 is configured to use regular expressions to perform text preprocessing on the scene classification samples to obtain the to-be-trained text;
  • the target word vector model obtaining module 30 is configured to use a pre-trained original word vector model to perform incremental training on the text to be trained to obtain a target word vector model;
  • the actual sample ratio determination module 40 is configured to count the actual sample number corresponding to each of the classification labels and the total sample number corresponding to all the scene classification samples, and determine the actual sample ratio corresponding to the classification label based on the actual sample number and the total sample number;
  • the to-be-incremented sample determining module 50 is configured to, if the actual sample ratio corresponding to the classification label is less than the specified sample ratio, use the scene classification sample corresponding to the classification label as the sample to be increased;
  • the candidate phrase acquisition module 60 is configured to input the sample to be incremented into the target word vector model for processing, and acquire at least one candidate phrase corresponding to the sample to be incremented, the candidate phrase including a carrier word vector At least one target synonym of
  • the first newly added sample obtaining module 70 is configured to randomly select one of the target synonyms from each candidate phrase to perform replacement processing on the sample to be incremented, and obtain the first newly added sample corresponding to the classification label .
  • the sample acquisition module includes a target voice information acquisition unit, a target voice feature acquisition unit, and a scene classification sample acquisition unit.
  • the target voice information acquisition unit is used to acquire original voice information corresponding to a specific scene, and use a voice enhancement algorithm to perform noise reduction processing on the original voice information to obtain target voice information;
  • a target voice feature acquiring unit configured to perform feature extraction on the target voice information, and obtain a target voice feature corresponding to the target voice information
  • the scene classification sample acquisition unit is used to recognize the target voice feature using a pre-trained speech recognition model, and acquire a scene classification sample corresponding to the specific scene.
  • the target word vector model includes an approximate degree function
  • the candidate phrase acquisition module includes a sentence acquisition unit to be replaced, a word order acquisition unit to be replaced, a target synonym acquisition unit, and a candidate phrase acquisition unit.
  • a sentence-to-be-replaced acquisition unit configured to divide the sample to be incremented by using a regular expression, and acquire at least one sentence to be replaced corresponding to the sample to be incremented;
  • a word order to be replaced acquiring unit configured to use a Chinese word segmentation algorithm to segment each sentence to be replaced, and obtain at least one word order to be replaced corresponding to the sentence to be replaced;
  • a target synonym acquisition unit configured to input each word to be replaced corresponding to the sentence to be replaced into the approximation function for processing, to obtain at least one target carrying a word vector corresponding to the word to be replaced Synonym
  • the candidate phrase acquisition unit is configured to use the word order to be replaced and the corresponding at least one target synonym carrying the word vector as a candidate phrase corresponding to the word order to be replaced.
  • the first newly-added sample acquiring module includes a target word order acquiring unit and a first newly-added sample acquiring unit.
  • the target term acquisition unit is configured to randomly select at least one target synonym from the candidate phrase corresponding to each term to be replaced, and determine it as the target term corresponding to the term to be replaced;
  • the first newly-added sample acquisition unit is configured to replace each of the word times to be replaced in the sentence to be replaced with the target word times corresponding to the word times to be replaced, and obtain the first word corresponding to the classification label One new sample.
  • the target synonym acquisition unit includes an approximate degree acquisition unit, a designated sample number acquisition unit, an increment parameter acquisition unit, a target synonym number acquisition unit, and a target synonym acquisition unit.
  • the approximate degree acquisition unit is configured to input each term to be replaced corresponding to the sample to be incremented into the approximate degree function for processing, and obtain at least one original synonym and each term corresponding to the term to be replaced. 1. The degree of similarity corresponding to the original synonyms;
  • the designated sample quantity acquiring unit is configured to determine the designated sample quantity based on the total sample quantity and the designated sample ratio;
  • the increment parameter acquisition unit is configured to determine the increment parameter according to the difference between the specified sample number and the actual sample number;
  • the target synonym number obtaining unit is configured to perform calculation based on the target synonym number calculation formula to obtain the target synonym number carrying the word vector; wherein the target synonym number calculation formula includes N^m = Z, in which m is the number of words to be replaced, N is the number of target synonyms, and Z is the increment parameter;
  • the target synonym obtaining unit is configured to select the top N target synonyms carrying word vectors from the original synonyms arranged in descending order of the similarity according to the number of the target synonyms.
  • the data increment device further includes a target synonym quantity acquiring unit and an updated synonym quantity acquiring unit.
  • the target synonym quantity acquiring unit is configured to, if the number of target synonyms carrying the word vector is a positive integer, directly execute the step of selecting, according to that number, the top N target synonyms carrying word vectors from the original synonyms arranged in descending order of similarity;
  • the update synonym quantity obtaining unit is configured to, if the number of target synonyms carrying the word vector is a floating-point number, round down the number of target synonyms carrying the word vector to obtain the number of updated synonyms, and, based on the number of updated synonyms, execute the step of selecting the top N target synonyms carrying word vectors from the original synonyms arranged in descending order of similarity.
  • the data increment device further includes an acquisition unit for the number of samples to be supplemented and an update unit for the number of samples to be supplemented.
  • the number of samples to be supplemented acquiring unit is used to process the number of target synonyms carrying the word vector and the number of updated synonyms by using the calculation formula of the number of samples to be supplemented to obtain the number of samples to be supplemented; wherein the calculation formula of the number of samples to be supplemented is A = N^m - B^m, N being the number of target synonyms, B the number of updated synonyms, and A the number of samples to be supplemented;
  • the number of samples to be supplemented update unit is configured to, if the number of samples to be supplemented is a floating-point number, round down or round up the number of samples to be supplemented to obtain the number of samples to be supplemented;
  • the second newly-added sample acquisition unit is configured to use the first translation tool to translate the sample to be incremented into non-Chinese text, and then use the first translation tool or the second translation tool to translate the non-Chinese text into Chinese Text, acquiring a second newly added sample corresponding to the classification label, until the number of samples of the second newly added sample reaches the number of samples to be supplemented, and storing the second newly added sample in association with the classification label .
  • Each module in the above-mentioned data increment device can be implemented in whole or in part by software, hardware and a combination thereof.
  • the foregoing modules may be embedded in the form of hardware or independent of the processor in the computer device, or may be stored in the memory of the computer device in the form of software, so that the processor can call and execute the operations corresponding to the foregoing modules.
  • a computer device is provided.
  • the computer device may be a server, and its internal structure diagram may be as shown in FIG. 10.
  • the computer equipment includes a processor, a memory, a network interface and a database connected through a system bus.
  • the processor of the computer device is used to provide calculation and control capabilities.
  • the memory of the computer device includes a readable storage medium and an internal memory.
  • the readable storage medium stores an operating system, computer readable instructions, and a database.
  • the internal memory provides an environment for the operation of the operating system and computer readable instructions in the readable storage medium.
  • the database of the computer device is used to store data generated or acquired during the execution of the data increment method, such as the first newly added sample.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer-readable instructions are executed by the processor to implement a data increment method.
  • the readable storage medium provided in this embodiment includes a non-volatile readable storage medium and a volatile readable storage medium.
  • one or more readable storage media storing computer readable instructions are provided. When the computer readable instructions are executed by one or more processors, the one or more processors execute the steps of the data increment method in the foregoing embodiments, for example, the steps shown in FIG. 2 or in FIG. 3 to FIG. 8; or, the one or more processors realize the functions of each module/unit in the embodiment of the data increment device, for example, the functions of the modules/units shown in FIG. 9. To avoid repetition, details are not repeated here.
  • the readable storage medium provided in this embodiment includes a non-volatile readable storage medium and a volatile readable storage medium.
  • Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM), etc.

Abstract

A data increment method, device, computer equipment and storage medium. The method includes: obtaining a scene classification sample corresponding to a specific scene and a specified sample ratio (S10); performing text preprocessing on the scene classification sample using a regular expression to obtain text to be trained (S20); performing incremental training on the text to be trained using an original word vector model to obtain a target word vector model (S30); determining the actual sample ratio corresponding to each classification label based on the actual number of samples corresponding to the label and the total number of scene classification samples (S40); if the actual sample ratio is less than the specified sample ratio, taking the scene classification samples corresponding to the classification label as samples to be incremented (S50); inputting the samples to be incremented into the target word vector model for processing to obtain the candidate phrases corresponding to the samples to be incremented (S60); and randomly selecting one target synonym from each candidate phrase to perform replacement processing on the samples to be incremented, obtaining first new samples (S70). The method can effectively guarantee data balance.

Description

Data increment method, device, computer equipment and storage medium
This application is based on, and claims priority to, Chinese invention patent application No. 201910350861.5, filed on April 28, 2019 and titled "Data increment method, device, computer equipment and storage medium".
Technical Field
This application relates to the field of data increment technology, and in particular to a data increment method, device, computer equipment and storage medium.
Background
In text classification scenarios, data imbalance is a very common problem. In an intelligent interview scenario, for example, most candidates give average or fairly good answers to present themselves well and rarely give very poor answers. Therefore, when implementing automatic scoring of interviewees' answers in intelligent interviews, there are usually many average and above-average answer samples and very few below-average ones, making the samples extremely unbalanced and causing low accuracy when such samples are used for model training.
Summary
The embodiments of this application provide a data increment method, device, computer equipment and storage medium to solve the problem that the training text data currently used for text classification model training is unbalanced and the accuracy of model training cannot be guaranteed.
A data increment method, including:
obtaining a scene classification sample corresponding to a specific scene and a specified sample ratio, where the scene classification sample corresponds to a classification label;
performing text preprocessing on the scene classification sample using a regular expression to obtain text to be trained;
performing incremental training on the text to be trained using a pre-trained original word vector model to obtain a target word vector model;
counting the actual number of samples corresponding to each classification label and the total number of samples corresponding to all scene classification samples, and determining the actual sample ratio corresponding to the classification label based on the actual number of samples and the total number of samples;
if the actual sample ratio corresponding to the classification label is less than the specified sample ratio, taking the scene classification sample corresponding to the classification label as a sample to be incremented;
inputting the sample to be incremented into the target word vector model for processing to obtain at least one candidate phrase corresponding to the sample to be incremented, where the candidate phrase includes at least one target synonym carrying a word vector;
randomly selecting one target synonym from each candidate phrase to perform replacement processing on the sample to be incremented, obtaining a first new sample corresponding to the classification label.
A data increment device, including:
a sample acquisition module, configured to obtain a scene classification sample corresponding to a specific scene and a specified sample ratio, where the scene classification sample corresponds to a classification label;
a text-to-be-trained acquisition module, configured to perform text preprocessing on the scene classification sample using a regular expression to obtain text to be trained;
a target word vector model acquisition module, configured to perform incremental training on the text to be trained using a pre-trained original word vector model to obtain a target word vector model;
an actual sample ratio determination module, configured to count the actual number of samples corresponding to each classification label and the total number of samples corresponding to all scene classification samples, and determine the actual sample ratio corresponding to the classification label based on the actual number of samples and the total number of samples;
a sample-to-be-incremented determination module, configured to, if the actual sample ratio corresponding to the classification label is less than the specified sample ratio, take the scene classification sample corresponding to the classification label as a sample to be incremented;
a candidate phrase acquisition module, configured to input the sample to be incremented into the target word vector model for processing and obtain at least one candidate phrase corresponding to the sample to be incremented, where the candidate phrase includes at least one target synonym carrying a word vector;
a first new sample acquisition module, configured to randomly select one target synonym from each candidate phrase to perform replacement processing on the sample to be incremented, and obtain a first new sample corresponding to the classification label.
A computer device, including a memory, a processor, and computer readable instructions stored in the memory and executable on the processor, where the processor implements the following steps when executing the computer readable instructions:
obtaining a scene classification sample corresponding to a specific scene and a specified sample ratio, where the scene classification sample corresponds to a classification label;
performing text preprocessing on the scene classification sample using a regular expression to obtain text to be trained;
performing incremental training on the text to be trained using a pre-trained original word vector model to obtain a target word vector model;
counting the actual number of samples corresponding to each classification label and the total number of samples corresponding to all scene classification samples, and determining the actual sample ratio corresponding to the classification label based on the actual number of samples and the total number of samples;
if the actual sample ratio corresponding to the classification label is less than the specified sample ratio, taking the scene classification sample corresponding to the classification label as a sample to be incremented;
inputting the sample to be incremented into the target word vector model for processing to obtain at least one candidate phrase corresponding to the sample to be incremented, where the candidate phrase includes at least one target synonym carrying a word vector;
randomly selecting one target synonym from each candidate phrase to perform replacement processing on the sample to be incremented, obtaining a first new sample corresponding to the classification label.
One or more readable storage media storing computer readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:
obtaining a scene classification sample corresponding to a specific scene and a specified sample ratio, where the scene classification sample corresponds to a classification label;
performing text preprocessing on the scene classification sample using a regular expression to obtain text to be trained;
performing incremental training on the text to be trained using a pre-trained original word vector model to obtain a target word vector model;
counting the actual number of samples corresponding to each classification label and the total number of samples corresponding to all scene classification samples, and determining the actual sample ratio corresponding to the classification label based on the actual number of samples and the total number of samples;
if the actual sample ratio corresponding to the classification label is less than the specified sample ratio, taking the scene classification sample corresponding to the classification label as a sample to be incremented;
inputting the sample to be incremented into the target word vector model for processing to obtain at least one candidate phrase corresponding to the sample to be incremented, where the candidate phrase includes at least one target synonym carrying a word vector;
randomly selecting one target synonym from each candidate phrase to perform replacement processing on the sample to be incremented, obtaining a first new sample corresponding to the classification label.
Details of one or more embodiments of this application are set forth in the accompanying drawings and the description below; other features and advantages of this application will become apparent from the specification, drawings and claims.
Brief Description of the Drawings
To explain the technical solutions of the embodiments of this application more clearly, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of this application, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of an application environment of the data increment method in an embodiment of this application;
FIG. 2 is a flowchart of the data increment method in an embodiment of this application;
FIG. 3 is a specific flowchart of step S10 in FIG. 2;
FIG. 4 is a specific flowchart of step S60 in FIG. 2;
FIG. 5 is a specific flowchart of step S70 in FIG. 2;
FIG. 6 is a specific flowchart of step S63 in FIG. 4;
FIG. 7 is a flowchart of the data increment method in an embodiment of this application;
FIG. 8 is a flowchart of the data increment method in an embodiment of this application;
FIG. 9 is a schematic diagram of the data increment device in an embodiment of this application;
FIG. 10 is a schematic diagram of the computer device in an embodiment of this application.
Detailed Description
The technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings in the embodiments of this application. Obviously, the described embodiments are part of the embodiments of this application rather than all of them. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of this application.
The data increment method provided in the embodiments of this application can be applied in a data increment tool, which automatically performs data increment on the part of the text classification samples whose distribution is uneven, so that all classes of samples are evenly distributed and the accuracy of subsequent text classification is improved. Furthermore, the method can also enlarge the training set, ensuring that the training set for model training is sufficient and improving model accuracy. The data increment method can be applied in the application environment shown in FIG. 1, in which a computer device communicates with a server through a network. The computer device may be, but is not limited to, a personal computer, a notebook computer, a smartphone, a tablet computer or a portable wearable device. The server may be implemented as an independent server.
In an embodiment, as shown in FIG. 2, a data increment method is provided. Taking the method applied to the server in FIG. 1 as an example, it includes the following steps:
S10: Obtain a scene classification sample corresponding to a specific scene and a specified sample ratio, where the scene classification sample corresponds to a classification label.
Here, the scene classification sample corresponding to a specific scene is text obtained for a particular text classification scenario (such as an intelligent interview scoring scenario), and the scene classification sample corresponds to a classification label. Classification labels are the category labels of the different classes in different text classification scenarios; in intelligent interview scoring, for example, the classification labels include above average, below average, average, very good and very poor. Specifically, the data increment tool stores text data corresponding to different scene types in advance; the user can select the required scene type in the tool and upload self-collected corpus data as scene classification samples, so that the server obtains the scene classification samples. The specified sample ratio is the proportion of the total number of samples that the scene classification samples corresponding to each classification label should account for.
S20: Perform text preprocessing on the scene classification sample using a regular expression to obtain the text to be trained.
Preprocessing the scene classification sample includes, but is not limited to, removing English characters and removing stop words. In this embodiment, stop word removal means automatically filtering out certain stop words (such as "我", "个" and "下") before or after processing natural language data (or text), in order to save storage space and improve search efficiency in information retrieval. English removal can use a regular expression for filtering, such as [\u4e00-\u9fa5], which filters out English so that the text to be trained contains only Chinese characters. By preprocessing the scene classification sample with a regular expression to obtain the text to be trained, interference from English characters and stop words is eliminated and the efficiency of the subsequent incremental training is improved.
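As a minimal sketch of this preprocessing step (the three-word stop-word list is a hypothetical example, not the full lexicon a production system would use):

```python
import re

# Hypothetical stop-word list for illustration only.
STOP_WORDS = {"我", "个", "下"}

def preprocess(sample: str) -> str:
    """Keep only Chinese characters, then drop single-character stop words."""
    # [\u4e00-\u9fa5] matches CJK unified ideographs, as described above;
    # everything else (English letters, digits, punctuation) is removed.
    chinese_only = "".join(re.findall(r"[\u4e00-\u9fa5]", sample))
    return "".join(ch for ch in chinese_only if ch not in STOP_WORDS)

print(preprocess("我在hello公司3年了"))  # -> "在公司年了"
```

The same regular expression can also be applied with `re.sub` to strip non-Chinese characters in place; the list-and-join form is used here only to make the two filtering stages explicit.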
S30: Perform incremental training on the text to be trained using a pre-trained original word vector model to obtain a target word vector model.
Here, the original word vector model is a word vector model obtained by training with the word2vec training function in the gensim library. gensim is a Python natural language processing library that can convert documents into vector form according to models such as TF-IDF, LDA and LSI for further processing. In addition, the gensim library contains the word2vec training function, which converts words into word vectors (word2vec). Because word vectors have good semantic properties and are a common way of representing word features, representing words as word vectors makes it convenient to later train a text classification model with them.
The word2vec training function is the training function used to train the word vector model. word2vec can be trained efficiently on dictionaries of millions of entries and data sets of hundreds of millions of items, and the training result it produces, the word embedding, measures word-to-word similarity well. Specifically, well-developed original word vector models already exist in the prior art (such as Chinese word vector models trained on corpora from Baidu Baike and Weibo), but since scene classification text is added in this embodiment, the original word vector model needs to be loaded, and on top of the trained original model the text to be trained is fed directly into the word2vec training function for incremental training to obtain the target word vector model; there is no need to retrain the original word vector model, which effectively guarantees training efficiency. By incrementally training the text to be trained with a pre-trained original word vector model to obtain the target word vector model, text samples of the specific scene are added into the target word vector model, so that the target word vector model matches the specific scene and the accuracy of the text classification model subsequently trained on its basis is improved.
S40: Count the actual number of samples corresponding to each classification label and the total number of samples corresponding to all scene classification samples, and determine the actual sample ratio corresponding to the classification label based on the actual number of samples and the total number of samples.
Here, the total number of samples is the total data volume of the scene classification samples, and the actual number of samples is the actual number of samples for each classification label. Specifically, by counting the actual number of samples for each classification label and the total number of samples of all scene classification samples, the server determines the actual sample ratio corresponding to each classification label, i.e., it takes the ratio of the actual number of samples to the total number of samples as the actual sample ratio of that classification label.
S50: If the actual sample ratio corresponding to a classification label is less than the specified sample ratio, take the scene classification samples corresponding to the classification label as samples to be incremented.
Here, samples to be incremented are text samples that require data increment processing. Different classification labels have different sample proportions; during model training the sample proportions must be kept in a certain balance to guarantee training accuracy, and if the sample proportion of some class of text is too low, model training will be biased and model accuracy will suffer. Therefore, in this embodiment the server dynamically adjusts the scene classification samples according to the specified ratio that the user sets for each classification label.
Specifically, the user inputs the scene classification samples and the specified sample ratio for each classification label into the data increment tool, and the server by default treats the user-specified sample ratio as the balanced sample ratio. First, the server counts the actual sample ratio for each classification label and compares it with the specified sample ratio. If the actual sample ratio of a classification label is less than the specified sample ratio, the classification samples input by the user are considered unbalanced, and the classification samples of that label are taken as samples to be incremented so that the server can perform data increment on them. Understandably, if the actual sample ratio of a classification label is not less than the specified sample ratio, the classification samples input by the user are considered balanced and no data increment processing is needed. By comparing the actual sample ratio with the specified sample ratio to decide whether data enhancement is needed, the effectiveness of the data increment processing is guaranteed.
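Steps S40 and S50 amount to a ratio check per label, which can be sketched as follows (the helper name and the toy label list are illustrative, not from the source):

```python
from collections import Counter

def labels_to_increment(labels, specified_ratio):
    """Return the classification labels whose actual sample ratio
    falls below the specified sample ratio."""
    counts = Counter(labels)   # actual sample number per label
    total = len(labels)        # total sample number
    return sorted(
        label for label, n in counts.items() if n / total < specified_ratio
    )

labels = ["好"] * 6 + ["中"] * 3 + ["差"] * 1
print(labels_to_increment(labels, 0.2))  # -> ['差']  (actual ratio 0.1 < 0.2)
```

Labels at or above the specified ratio are left untouched, matching the "not less than" branch described above.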
S60: Input the sample to be incremented into the target word vector model for processing, and obtain at least one candidate phrase corresponding to the sample to be incremented, where the candidate phrase includes at least one target synonym carrying a word vector.
Here, the sample to be incremented contains several Chinese words, each of which corresponds to several target synonyms; a candidate phrase is the set consisting of one Chinese word in the sample to be incremented and the target synonyms corresponding to that word.
In this embodiment, the target synonyms carry word vectors, so that when the text classification model is subsequently trained on the data-enhanced text, no word vector conversion is needed, realizing automatic labeling and further improving the training efficiency of the subsequent text classification model. Specifically, the sample to be incremented is input into the target word vector model for processing to obtain the candidate phrases corresponding to the sample, so that words in the sample can later be replaced with target synonyms from the candidate phrases, providing the data source for the data increment.
S70: Randomly select one target synonym from each candidate phrase to perform replacement processing on the sample to be incremented, and obtain a first new sample corresponding to the classification label.
Here, the first new sample is the new sample obtained by replacing words in the sample to be incremented with target synonyms from the candidate phrases. Specifically, the server randomly selects one target synonym from each candidate phrase to replace words in the sample to be incremented, obtaining the first new samples corresponding to the classification label, thereby achieving the data increment and guaranteeing the data balance of the samples.
In this embodiment, the scene classification samples corresponding to the specific scene and the specified sample ratio are obtained so that the scene classification samples can be preprocessed with regular expressions to obtain the text to be trained, eliminating interference from English characters and stop words. Then, the text to be trained is incrementally trained with a pre-trained original word vector model to obtain the target word vector model, so that text samples of the specific scene are added into the model, guaranteeing the accuracy of samples labeled on the basis of the target word vector model. Next, the actual number of samples for each classification label and the total number of all scene classification samples are counted, and the actual sample ratio of each classification label is determined from them, so that whether data increment processing is needed can be decided from the actual sample ratio: if the actual sample ratio of a classification label is less than the specified sample ratio, the scene classification samples of that label are taken as samples to be incremented, guaranteeing the effectiveness of the data increment processing.
In an embodiment, as shown in FIG. 3, step S10 of obtaining the scene classification sample corresponding to the specific scene specifically includes the following steps:
S11: Obtain original voice information corresponding to the specific scene, and perform noise reduction on the original voice information using a speech enhancement algorithm to obtain target voice information.
Here, specific scenes include, but are not limited to, scenes requiring text classification, such as intelligent interviews. The original voice information is the voice information collected in the specific scene.
This embodiment is explained using the intelligent interview scene as an example. The intelligent interview scene is simulated in advance and a voice collection device (such as a microphone) is set up, so that the server receives in real time the interviewee's reply voice collected by the device, i.e., the original voice information. Specifically, the original voice collected by a voice collection device generally carries noise, including noise from the background environment and noise produced by the device during recording. Original voice information carrying noise affects the accuracy of speech recognition, so the original voice must be denoised to extract the purest possible speech from the signal and make recognition more accurate. Methods for denoising the original voice include, but are not limited to, spectral subtraction, the EEMD decomposition algorithm and the SVD singular value algorithm.
Understandably, the scene classification samples may be voice data or text data. If they are voice data, the voice data needs to be converted into processable text data; if they are text data, no processing is needed, which guarantees the generality of the data increment tool.
S12: Perform feature extraction on the target voice information to obtain the target voice features corresponding to the target voice information.
In this embodiment, the target voice features include, but are not limited to, filter-bank features. Filter-bank (Fbank) features are voice features commonly used in speech recognition. Because the Mel features commonly used in the prior art reduce the dimensionality of the voice information during model recognition, part of the voice information is lost; to avoid this problem, this embodiment uses filter-bank features instead of the usual Mel features.
S13: Recognize the target voice features using a pre-trained speech recognition model to obtain the scene classification samples corresponding to the specific scene.
Understandably, the speech recognition model includes a pre-trained acoustic model and a language model. The acoustic model is used to obtain the phoneme sequence corresponding to the target voice features. A phoneme is the smallest unit of speech and can be understood as the pinyin of a Chinese character; for example, the Chinese syllable ā (啊) has one phoneme, ài (爱) has two phonemes, and dāi (呆) has three phonemes. Training methods for the acoustic model include, but are not limited to, GMM-HMM (Gaussian mixture model) training. The language model is a model for converting phoneme sequences into natural language text. Specifically, the server inputs the voice features into the pre-trained acoustic model for recognition to obtain the phoneme sequence corresponding to the target voice features, then inputs the obtained phoneme sequence into the pre-trained language model for conversion to obtain the corresponding recognized text.
In this embodiment, the data type of the scene classification samples for the specific scene determines whether text conversion is needed: if they are voice data, the voice data is converted into processable text data; if they are text data, no processing is needed, which guarantees the generality of the data increment.
In an embodiment, as shown in FIG. 4, the target word vector model includes an approximation function, and step S60 of inputting the sample to be incremented into the target word vector model for processing to obtain the candidate phrases corresponding to the sample to be incremented, where each candidate phrase includes at least one target synonym carrying a word vector, specifically includes the following steps:
S61: Split the sample to be incremented using a regular expression, and obtain at least one sentence to be replaced corresponding to the sample to be incremented.
Here, a sentence to be replaced is a sentence obtained by splitting the sample to be incremented with a regular expression. Specifically, a maximum sentence-split length MAX is set, and the sample to be incremented is then split into at least one sentence, i.e., the sentences to be replaced; the splitting can use a regular expression that splits on sentence terminators (such as ?。,!).
S62: Segment each sentence to be replaced using a Chinese word segmentation algorithm, and obtain at least one word to be replaced corresponding to the sentence to be replaced.
Further, before performing the data increment, the server also needs to segment the sample to be incremented to obtain its words, so that they can later be input into the word vector model for processing. In this embodiment, the Chinese word segmentation algorithm includes, but is not limited to, the reverse maximum matching algorithm, which segments the sample to be incremented to obtain its words; it is an algorithm for segmenting Chinese that has the advantages of high accuracy and low algorithmic complexity.
Specifically, before segmentation, developers set up a Chinese lexicon in advance to provide technical support for the segmentation; the Chinese lexicon (hereafter "the lexicon") is the lexicon used to segment Chinese characters. The specific steps of segmenting each sentence to be replaced with the reverse maximum matching algorithm are: cut each sentence starting from the right to obtain a candidate string; compare the string against the lexicon; if the lexicon contains the word, record it to form one word; otherwise drop one character and compare again, stopping when a single character remains.
For example, with a maximum split length MAX=5 and the input sentence "我一个人吃饭", cutting starts from the right, yielding the string "一个人吃饭". The lexicon does not contain this word, so the character "一" is dropped and the string becomes "个人吃饭"; the lexicon does not contain this word, so "个" is dropped and the string becomes "人吃饭"; the lexicon does not contain this word, so "人" is dropped and the string becomes "吃饭"; the lexicon contains "吃饭", so it is recorded as the first word.
After "吃饭" is determined as the first word of the input sentence, the remaining input becomes "我一个人". The lexicon does not contain this word, so "我" is dropped and the string becomes "一个人"; the lexicon does not contain this word, so "一" is dropped and the string becomes "个人"; the lexicon contains "个人", so it is recorded as the second word.
After "个人" is determined as the second word, the remaining input becomes "我一". The lexicon does not contain this word, so "我" is dropped and the string becomes "一"; the lexicon contains "一", so it is recorded as the third word.
After "一" is determined as the third word, only the single character "我" remains and the algorithm terminates. The final segmentation of "我一个人吃饭" by the reverse maximum matching algorithm is "我/一/个人/吃饭". Understandably, the position of each word to be replaced is fixed and corresponds to its sentence to be replaced; for example, for the sentence "我一个人吃饭", the segmentation example above shows that the words to be replaced are "我/一/个人/吃饭".
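The walk-through above can be sketched as follows (the tiny lexicon is a hypothetical example; a real system would use a full dictionary):

```python
def reverse_max_match(sentence, lexicon, max_len=5):
    """Reverse maximum matching: scan from the right, try the longest
    window first, shrink it one character at a time, and keep a lone
    character by itself when nothing longer matches."""
    words = []
    i = len(sentence)
    while i > 0:
        for size in range(min(max_len, i), 0, -1):
            piece = sentence[i - size:i]
            if size == 1 or piece in lexicon:  # single chars always stand alone
                words.append(piece)
                i -= size
                break
    words.reverse()  # words were collected right-to-left
    return words

lexicon = {"吃饭", "个人", "一"}
print("/".join(reverse_max_match("我一个人吃饭", lexicon)))  # -> 我/一/个人/吃饭
```

Running it on the sentence from the example reproduces the segmentation "我/一/个人/吃饭" derived step by step above.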
S63: Input each word to be replaced of the sentence to be replaced into the approximation function for processing, and obtain at least one target synonym carrying a word vector corresponding to the word to be replaced.
Here, the approximation function is a function that returns the synonyms corresponding to each word to be replaced. Note that the target word vector model corresponds to the approximation function, so the approximation function of the target word vector model can be called directly to obtain the target synonyms of the word to be replaced. Specifically, the server inputs each word to be replaced of the sample to be incremented into the approximation function corresponding to the target word vector model, and obtains the at least one target synonym carrying a word vector that the function returns for the word to be replaced, providing the data source for the subsequent data increment processing.
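A minimal stand-in for such an approximation function, using cosine similarity over a toy embedding table (the three-word vocabulary and 2-d vectors are illustrative assumptions, not the model's real embeddings):

```python
import math

def most_similar(word, embeddings, topn=2):
    """Rank every other word by cosine similarity to `word` and return
    the top-n (word, similarity) pairs, mimicking an approximation function."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.hypot(*u) * math.hypot(*v))
    target = embeddings[word]
    ranked = sorted(
        ((w, cosine(target, v)) for w, v in embeddings.items() if w != word),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:topn]

toy = {"吃饭": (1.0, 0.1), "用餐": (0.9, 0.2), "跑步": (0.1, 1.0)}
print(most_similar("吃饭", toy, topn=1))  # "用餐" ranks first
```

Because each synonym is looked up in the embedding table, every returned word naturally "carries" its word vector, which is the property the embodiment depends on.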
S64: Take the word to be replaced and its corresponding at least one target synonym carrying a word vector as the candidate phrase corresponding to the word to be replaced.
Specifically, the set consisting of the word to be replaced and its at least one target synonym carrying a word vector is taken as the candidate phrase, so that at least one target synonym can later be randomly selected from the candidate phrase to replace words in the sample to be incremented and achieve the data increment.
In this embodiment, the sample to be incremented is split with regular expressions to obtain its at least one sentence to be replaced, so that during subsequent synonym replacement the server can replace according to the segmentation result of each sentence to be replaced, i.e., the position of each word to be replaced within its sentence, guaranteeing that every first new sample keeps the same sentence structure as the sentence to be replaced. Finally, the word to be replaced and its at least one target synonym carrying a word vector are taken as the candidate phrase of the word to be replaced, so that synonym replacement can later be performed from each word's candidate phrase to achieve the data increment.
In an embodiment, as shown in FIG. 5, step S70 of randomly selecting target synonyms from the candidate phrases to perform replacement processing on the sample to be incremented and obtain the first new samples corresponding to the classification label specifically includes the following steps:
S71: Randomly select one target synonym from the candidate phrase corresponding to each word to be replaced, and determine it as the target word corresponding to the word to be replaced.
S72: Replace each word to be replaced in the sentence to be replaced with the target word corresponding to the word to be replaced, and obtain the first new sample corresponding to the classification label.
Here, the target word is the target synonym that the server randomly selects from the candidate phrase. Specifically, the server randomly selects one target synonym from the candidate phrase as the target word for the word to be replaced, then replaces at least one word to be replaced in the sample to be incremented with its corresponding target word, obtaining several first new samples corresponding to the classification label and achieving the data increment.
Further, in this embodiment, because the candidate phrase corresponding to each word to be replaced also contains the word to be replaced itself, when one target synonym is randomly selected from the candidate phrase and determined as the target word, the target word may be identical to the word to be replaced, in which case a first new sample identical to the sentence to be replaced appears. Therefore, after the first new samples are obtained, all of them must be deduplicated and updated to obtain the first new samples corresponding to the classification label, guaranteeing the validity of the data set.
For ease of understanding, consider the following example. Suppose the words to be replaced are A and B; since the position of each word to be replaced corresponds to the sentence to be replaced, the sentence order is A-B. The target synonyms are A-(a1) and B-(b1, b2), so the candidate phrase of A is {A, a1} and that of B is {B, b1, b2}. Randomly selecting one target synonym from each word's candidate phrase as its target word can take the following forms: (A, B), (A, b1), (A, b2), (a1, B), (a1, b1) and (a1, b2). Replacing each word to be replaced in the sentence with its corresponding target word yields the first new samples (A-B), (A-b1), (A-b2), (a1-B), (a1-b1) and (a1-b2); after removing the duplicate of the original sentence, the first new samples corresponding to the classification label are (A-b1), (A-b2), (a1-B), (a1-b1) and (a1-b2).
In this embodiment, one target synonym is randomly selected from the candidate phrase of each word to be replaced and determined as the target word of that word, and each word to be replaced in the sentence is then replaced with its corresponding target word, obtaining several first new samples corresponding to the classification label and achieving the data increment.
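The A/B example above can be reproduced with a short sketch (illustrative only; the real method samples one combination at random rather than enumerating all of them):

```python
from itertools import product

def first_new_samples(tokens, candidate_groups):
    """Enumerate every way of picking one entry per candidate group,
    then drop the combination equal to the original sentence."""
    original = tuple(tokens)
    combos = {tuple(pick) for pick in product(*candidate_groups)}
    combos.discard(original)  # deduplicate against the sentence to be replaced
    return sorted("-".join(c) for c in combos)

groups = [["A", "a1"], ["B", "b1", "b2"]]
print(first_new_samples(["A", "B"], groups))
# -> ['A-b1', 'A-b2', 'a1-B', 'a1-b1', 'a1-b2']
```

With a 2-entry group and a 3-entry group there are 2 x 3 = 6 combinations; discarding the original (A-B) leaves the five first new samples listed in the example.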
In an embodiment, as shown in FIG. 6, step S63 of inputting each word to be replaced of the sample to be incremented into the approximation function for processing to obtain the at least one target synonym carrying a word vector corresponding to the word to be replaced specifically includes the following steps:
S631: Input each word to be replaced of the sample to be incremented into the approximation function for processing, and obtain at least one original synonym corresponding to the word to be replaced and the similarity corresponding to each original synonym.
Here, an original synonym is a synonym of the word to be replaced obtained by inputting each word to be replaced of the sample to be incremented into the approximation function for processing. Specifically, the server inputs each word to be replaced of the sample to be incremented into the approximation function for processing, and obtains the at least one original synonym corresponding to the word to be replaced and the similarity corresponding to each original synonym, providing the data basis for subsequently determining the target synonyms.
S632: Determine the specified number of samples based on the total number of samples and the specified sample ratio.
Here, the specified number of samples is the total number of samples of the classification label corresponding to the sample to be incremented when the data is balanced. Understandably, this specified number of samples keeps the data in the sample set balanced. Specifically, the specified number of samples is determined from the total number of samples and the specified sample ratio, i.e., the total number of samples is multiplied by the specified sample ratio to obtain the specified number of samples.
S633: Determine the increment parameter according to the difference between the specified number of samples and the actual number of samples.
S634: Perform the calculation based on the target synonym number calculation formula to obtain the number of target synonyms carrying word vectors, where the target synonym number calculation formula includes
N^m = Z,
in which m is the number of words to be replaced, N is the number of target synonyms, and Z is the increment parameter.
Here, the increment parameter is the number of samples by which the sample to be incremented must be supplemented. Specifically, subtracting the actual number of samples from the specified number of samples yields the increment parameter. The server performs the calculation based on the target synonym number calculation formula to obtain the number of target synonyms carrying word vectors. Understandably, because the number of original synonyms is large and they cannot all be used, in this embodiment the number of target synonyms must be determined in order to achieve data balance and guarantee the data balance of the samples.
S635: According to the number of target synonyms, select the top N target synonyms carrying word vectors from the original synonyms arranged in descending order of similarity.
Specifically, the server selects, according to the number of target synonyms, the top N original synonyms from the original synonyms arranged in descending order of similarity as the target synonyms, where the value of N can be set according to actual needs and is not limited here.
In this embodiment, each word to be replaced of the sample to be incremented is input into the approximation function for processing to obtain the at least one original synonym of each word to be replaced and the similarity of each original synonym; meanwhile, the specified number of samples is determined based on the total number of samples and the specified sample ratio, so that the number of target synonyms can be determined from the specified number of samples and the target synonym number calculation formula; finally, the target synonyms are determined from the similarity of each original synonym and the number of target synonyms, guaranteeing the data balance of the samples.
In an embodiment, as shown in FIG. 7, after step S635 the data increment method further includes the following steps:
S811: If the number of target synonyms carrying word vectors is a positive integer, directly execute the step of selecting, according to the number of target synonyms carrying word vectors, the top N target synonyms carrying word vectors from the original synonyms arranged in descending order of similarity.
S821: If the number of target synonyms carrying word vectors is a floating-point number, round the number of target synonyms carrying word vectors down to obtain the updated number of synonyms, and, based on the updated number of synonyms, execute the step of selecting the top N target synonyms carrying word vectors from the original synonyms arranged in descending order of similarity.
From the increment parameter calculation formula N^m = Z above, it can be seen that the value of N may be a positive integer or a floating-point number, so the server must judge the value type of N. If the number of target synonyms carrying word vectors is a positive integer, the step of selecting, according to that number, the top N target synonyms carrying word vectors from the original synonyms arranged in descending order of similarity can be executed directly.
If the number of target synonyms carrying word vectors is a floating-point number, taking the upper bound might produce too many samples, so in this embodiment the server rounds the number down to obtain the updated number of synonyms; for example, if N is 5.1, N is rounded down to 5. Finally, based on the updated number of synonyms, the top N target synonyms carrying word vectors are selected from the original synonyms arranged in descending order of similarity.
In this embodiment, judging the value type of the number of target synonyms guarantees the smooth execution of the data increment and improves fault tolerance.
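Steps S633 to S821 can be sketched by solving N^m = Z for N and flooring non-integer values (the function and variable names are illustrative):

```python
import math

def target_synonym_count(specified, actual, m):
    """Solve N from N**m = Z, where Z is the increment parameter
    (specified sample number minus actual sample number); keep a
    positive-integer N as-is, floor a floating-point N."""
    z = specified - actual    # increment parameter Z
    n = z ** (1 / m)          # N = Z^(1/m)
    return n if n == int(n) else math.floor(n)

print(target_synonym_count(108, 8, 2))  # Z=100, m=2: N = 10 exactly
print(target_synonym_count(108, 8, 3))  # Z=100, m=3: N = 4.64..., floored to 4
```

The second call shows the floating-point branch that step S821 handles: the shortfall caused by flooring is what step S91 later supplements.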
在一实施例中,如图8所示,步骤S821之后,该数据增量方法还包括如下步骤:
S91:采用待补充样本数量计算公式对携带词向量的目标同义词数量与更新同义词数量进行处理,获取待补充样本数量,其中,待补充样本数量计算公式为A=N m-B m,N表示目标同义词数量,B表示更新同义词数量,A表示待补充样本数量。
具体地,由于更新同义词数量是通过对目标同义词数量进行向下取整所得到的,故实际选取的目标同义词数量小于通过增量参数计算公式进行计算所得到的保持样本平衡的目标同义词数量,故需要补充部分缺失的数量,即通过采用待补充样本数量计算公式对携带词向量的目标同义词数量与更新同义词数量进行处理,以获取待补充样本数量,以便后续基于待补充样本数量对样本进行补充。
进一步地,若待补充样本数量为浮点数,则对待补充样本数量进行向下取整或向上取整处理,获取整数型的待补充样本数量。
具体地，由待补充样本数量计算公式A=N^m-B^m可知，待补充样本数量的取值可能为浮点数，故需要对待补充样本数量的取值类型进行判断：若待补充样本数量为浮点数，则对待补充样本数量进行向下取整或向上取整处理，以获取整数型的待补充样本数量；若待补充样本数量为正整数，则无需进行处理。
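待补充样本数量A的计算与取整可用如下示意代码表示（假设性示例，沿用上文N=5.1、向下取整得B=5的例子，m取3）：

```python
import math

def samples_to_supplement(n, b, m):
    """按待补充样本数量计算公式 A = N^m - B^m 计算A（示意实现），
    浮点结果向下取整（按需也可改为向上取整）。"""
    a = n ** m - b ** m
    return int(a) if float(a).is_integer() else math.floor(a)

print(samples_to_supplement(5.1, 5, 3))  # 7
```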
S92:采用第一翻译工具将待增量样本翻译为非中文文本,再采用第一翻译工具或第二翻译工具将非中文文本翻译为中文文本,获取与分类标签对应的第二新增样本,直至第二新增样本的样本数量达到待补充样本数量,将第二新增样本与分类标签关联存储。
具体地,根据目标同义词数量计算公式
N=Z^(1/m)
可知，目标同义词数量计算公式中涉及指数幂的计算，故替换同义词的方法适合进行大数据量的数据增量处理；而本实施例中需要小数据量的增量，故采用翻译工具对待增量样本进行处理，以达到数据增量的目的。可以理解地，由于翻译工具支持的语言固定，因此可用来补充一小部分样本，即通过采用翻译工具进行数据增强，以保证数据平衡。
可以理解地,待增量样本是中文文本,本实施例需要采用第一翻译工具将待增量样本翻译为其他语种对应的文本(即非中文文本),再将非中文文本翻译为中文文本,以得到与待增量样本本身中文语义相同,但表述不同的文本。
其中，第一翻译工具是指目前现有的翻译工具，如百度翻译、有道翻译或谷歌翻译。第二翻译工具是指目前现有的除第一翻译工具外的其他翻译工具。非中文文本是指采用第一翻译工具将待增量样本翻译为非中文所得到的翻译文本。中文文本是指采用第一翻译工具或第二翻译工具进行翻译得到的仅包含中文字符的翻译文本。第二新增样本是指通过翻译工具进行数据增量所得到的样本。第二新增样本的样本数量，即采用翻译工具进行数据补充的补充样本数量。翻译工具包括但不限于google翻译工具，该翻译工具支持的语言种类较多，可获取更多待补充样本。
可以理解地，待补充样本数量计算公式A=N^m-B^m中，N^m是指需要获取的第一新增样本的数量，B^m是指当前已获取的第一新增样本的数量，A表示待补充样本数量，即需要获取第二新增样本的数量。步骤S92中，获取第二新增样本是一个持续的过程，可以理解为，若当前已获取的第二新增样本的样本数量达到待补充样本数量，则停止获取第二新增样本。
本实施例中,服务器可调用第一翻译工具提供的翻译接口,将待增量样本翻译为非中文文本,再采用第二翻译工具将非中文文本翻译为中文文本,获取与分类标签对应的第二新增样本,直至第二新增样本的样本数量达到待补充样本数量,将第二新增样本与分类标签关联存储,以获取更多的中文表达方式,达到数据增量的目的。
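上述"中文→非中文→中文"的回译流程可用如下示意代码概括。以下为假设性示例：translate(text, src, dst)是虚构的翻译接口签名，实际应替换为所选翻译工具提供的API；示例中的翻译结果为玩具数据，仅用于演示流程。

```python
def back_translate(sample, translate, pivot_langs=("en", "ja")):
    """回译数据增量示意：将中文样本翻译为非中文文本，再译回中文，
    保留与原样本表述不同但语义相同的译文作为第二新增样本。"""
    variants = []
    for lang in pivot_langs:
        foreign = translate(sample, src="zh", dst=lang)   # 中文 → 非中文文本
        back = translate(foreign, src=lang, dst="zh")     # 非中文文本 → 中文文本
        if back != sample:                                # 仅保留表述不同的译文
            variants.append(back)
    return variants

# 玩具翻译函数，仅为演示；真实场景应调用翻译工具接口
fake = {("我想买保险", "en"): "I want to buy insurance",
        ("I want to buy insurance", "zh"): "我想购买保险",
        ("我想买保险", "ja"): "保険に入りたい",
        ("保険に入りたい", "zh"): "我想加入保险"}
toy_translate = lambda text, src, dst: fake[(text, dst)]
print(back_translate("我想买保险", toy_translate))  # ['我想购买保险', '我想加入保险']
```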
进一步地,服务器还会采用中文分词算法对第二新增样本进行分词,以获取第二新增样本对应待标注词次,再将待标注词次输入到目标词向量模型中进行识别,以实现对每一待标注词次对应的词向量标注,获取第二新增样本对应的词向量,无需人工进行标注。最后,将第二新增样本、第二新增样本对应的词向量和第二新增样本对应的分类标签作为模型训练样本关联存储,以便后续直接采用模型训练样本训练文本分类模型,无需人工采集,降低人工成本。
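对第二新增样本"分词后逐词标注词向量"的过程可用如下示意代码表示。以下为假设性示例：segment是虚构的分词函数（实际可替换为结巴分词等中文分词算法），model以字典模拟目标词向量模型的词到词向量映射。

```python
def annotate_word_vectors(sample, segment, model):
    """第二新增样本词向量自动标注的示意实现。"""
    tokens = segment(sample)                          # 分词得到待标注词次
    return [(tok, model.get(tok)) for tok in tokens]  # 逐词查询词向量完成标注

segment = lambda text: ["我", "想", "购买", "保险"]   # 玩具分词器
model = {"购买": [0.1, 0.2], "保险": [0.3, 0.4]}      # 玩具词向量映射
print(annotate_word_vectors("我想购买保险", segment, model))
# [('我', None), ('想', None), ('购买', [0.1, 0.2]), ('保险', [0.3, 0.4])]
```

未收录在模型中的词次返回None，实际系统中可按需过滤或另行处理此类词次。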
本实施例中，由于上述实施例中的更新同义词数量是通过对目标同义词数量进行向下取整所得到的，故实际选取的目标同义词数量小于通过目标同义词数量计算公式计算所得到的保持样本平衡的目标同义词数量，故需要补充少部分缺失的数量，即通过采用翻译工具对待增量样本进行处理，以获取更多的中文表达方式，达到补充少部分样本的目的。
本实施例中，通过预先加入场景分类样本进行训练，以获取目标词向量模型，以便根据目标词向量模型获取每一分类样本对应的第一词次的N个同义词，以进行数据增量，并可根据用户输入的指定样本比例动态调整数值N，以达到数据平衡的目的。进一步地，服务器还会针对替换同义词进行数据增量的方法中N值为非整数的情况，采取翻译工具的方式补充少部分缺失样本，以保证数据平衡，且可有效采集更多样本，无需人力采集，节省时间。进一步地，服务器还可通过目标词向量模型实现对获取的新增样本词向量自动标注的目的，无需人工干预，降低人力成本。
应理解,上述实施例中各步骤的序号的大小并不意味着执行顺序的先后,各过程的执行顺序应以其功能和内在逻辑确定,而不应对本申请实施例的实施过程构成任何限定。
在一实施例中,提供一种数据增量装置,该数据增量装置与上述实施例中数据增量方法一一对应。如图9所示,该数据增量装置包括样本获取模块10、待训练文本获取模块20、目标词向量模型获取模块30、实际样本比例确定模块40、待增量样本确定模块50、候选词组获取模块60和第一新增样本获取模块70。各功能模块详细说明如下:
样本获取模块10,用于获取特定场景对应的场景分类样本和指定样本比例,所述场景分类样本对应一分类标签;
待训练文本获取模块20,用于采用正则表达式对所述场景分类样本进行文本预处理,获取待训练文本;
目标词向量模型获取模块30,用于采用预先训练好的原始词向量模型对所述待训练文本进行增量训练,获取目标词向量模型;
实际样本比例确定模块40,用于统计每一所述分类标签对应的实际样本数量和所有所述场景分类样本对应的总样本数量,基于所述实际样本数量和所述总样本数量,确定所述分类标签对应的实际样本比例;
待增量样本确定模块50,用于若所述分类标签对应的实际样本比例小于所述指定样本比例,则将所述分类标签对应的场景分类样本作为待增量样本;
候选词组获取模块60,用于将所述待增量样本输入至所述目标词向量模型中进行处理,获取与所述待增量样本对应的至少一个候选词组,所述候选词组包括携带词向量的至少一个目标同义词;
第一新增样本获取模块70，用于从每一所述候选词组中随机选取一个所述目标同义词对所述待增量样本进行替换处理，获取与所述分类标签对应的第一新增样本。
具体地，样本获取模块包括目标语音信息获取单元、目标语音特征获取单元和场景分类样本获取单元。
目标语音信息获取单元，用于获取特定场景对应的原始语音信息，采用语音增强算法对所述原始语音信息进行降噪处理，获取目标语音信息；
目标语音特征获取单元,用于对所述目标语音信息进行特征提取,获取与所述目标语音信息相对应的目标语音特征;
场景分类样本获取单元,用于采用预先训练好的语音识别模型对目标语音特征进行识别,获取与所述特定场景相对应的场景分类样本。
具体地,所述目标词向量模型包括近似度函数,候选词组获取模块包括待替换语句获取单元、待替换词次获取单元、目标同义词获取单元和候选词组获取单元。
待替换语句获取单元,用于采用正则表达式对所述待增量样本进行分割,获取所述待增量样本对应的至少一个待替换语句;
待替换词次获取单元,用于采用中文分词算法对每一所述待替换语句进行分词,获取所述待替换语句对应的至少一个待替换词次;
目标同义词获取单元,用于将所述待替换语句对应的每一待替换词次输入到所述近似度函数中进行处理,得到与所述待替换词次相对应的携带词向量的至少一个目标同义词;
候选词组获取单元,用于将所述待替换词次和对应的所述携带词向量的至少一个目标同义词作为所述待替换词次对应的候选词组。
具体地,第一新增样本获取模块包括目标词次获取单元和第一新增样本获取单元。
目标词次获取单元,用于从每一所述待替换词次对应的候选词组中随机选取至少一个所述目标同义词,确定为所述待替换词次对应的目标词次;
第一新增样本获取单元,用于将所述待替换语句中的每一所述待替换词次替换为与所述待替换词次对应的目标词次,获取与所述分类标签对应的第一新增样本。
具体地,目标同义词获取单元包括近似度获取单元、指定样本数量获取单元、增量参数获取单元、目标同义词数量获取单元和目标同义词获取单元。
近似度获取单元,用于将所述待增量样本对应的每一待替换词次输入到所述近似度函数中进行处理,获取与所述待替换词次相对应的至少一个原始同义词和每一所述原始同义词对应的近似度;
指定样本数量获取单元，用于基于所述总样本数量和所述指定样本比例，确定指定样本数量；
增量参数获取单元，用于根据所述指定样本数量和所述实际样本数量的差值，确定增量参数；
目标同义词数量获取单元,用于基于所述目标同义词数量计算公式进行计算,获取携带词向量的目标同义词数量;其中,所述目标同义词数量计算公式包括
N=Z^(1/m)，
m为所述待替换词次的数量,N为所述目标同义词数量,Z为所述增量参数;
目标同义词获取单元,用于按照所述目标同义词数量,从所述近似度降序排列的所述原始同义词中选取前N位的携带词向量的目标同义词。
具体地,该数据增量装置还包括目标同义词数量获取单元和更新同义词数量获取单元。
目标同义词数量获取单元,用于若所述携带词向量的目标同义词数量为正整数,则直接执行所述按照所述携带词向量的目标同义词数量,从所述近似度降序排列的所述原始同义词中选取前N位的携带词向量的目标同义词的步骤;
更新同义词数量获取单元，用于若所述携带词向量的目标同义词数量为浮点数，则对所述携带词向量的目标同义词数量进行向下取整处理，获取更新同义词数量；并基于所述更新同义词数量，执行从所述近似度降序排列的所述原始同义词中选取前N位的携带词向量的目标同义词的步骤。
具体地，该数据增量装置还包括待补充样本数量获取单元、待补充样本数量更新单元和第二新增样本获取单元。
待补充样本数量获取单元，用于采用待补充样本数量计算公式对所述携带词向量的目标同义词数量与所述更新同义词数量进行处理，获取待补充样本数量；其中，所述待补充样本数量计算公式为A=N^m-B^m，N表示所述目标同义词数量，B表示所述更新同义词数量，A表示待补充样本数量；
待补充样本数量更新单元，用于若待补充样本数量为浮点数，则对所述待补充样本数量进行向下取整或向上取整处理，获取整数型的待补充样本数量；
第二新增样本获取单元,用于采用第一翻译工具将所述待增量样本翻译为非中文文本,再采用所述第一翻译工具或第二翻译工具将所述非中文文本翻译为中文文本,获取与所述分类标签对应的第二新增样本,直至所述第二新增样本的样本数量达到所述待补充样本数量,将所述第二新增样本与所述分类标签关联存储。
关于数据增量装置的具体限定可以参见上文中对于数据增量方法的限定,在此不再赘述。上述数据增量装置中的各个模块可全部或部分通过软件、硬件及其组合来实现。上述各模块可以硬件形式内嵌于或独立于计算机设备中的处理器中,也可以以软件形式存储于计算机设备中的存储器中,以便于处理器调用执行以上各个模块对应的操作。
在一个实施例中，提供了一种计算机设备，该计算机设备可以是服务器，其内部结构图可以如图10所示。该计算机设备包括通过系统总线连接的处理器、存储器、网络接口和数据库。其中，该计算机设备的处理器用于提供计算和控制能力。该计算机设备的存储器包括可读存储介质、内存储器。该可读存储介质存储有操作系统、计算机可读指令和数据库。该内存储器为可读存储介质中的操作系统和计算机可读指令的运行提供环境。该计算机设备的数据库用于存储执行数据增量方法过程中生成或获取的数据，如第一新增样本。该计算机设备的网络接口用于与外部的终端通过网络连接通信。该计算机可读指令被处理器执行时实现一种数据增量方法。本实施例所提供的可读存储介质包括非易失性可读存储介质和易失性可读存储介质。
在一个实施例中，提供了一个或多个存储有计算机可读指令的可读存储介质，所述计算机可读指令被一个或多个处理器执行时，使得所述一个或多个处理器执行上述实施例中的数据增量方法的步骤，例如图2所示的步骤，或者图3至图8中所示的步骤。或者，所述计算机可读指令被一个或多个处理器执行时，使得所述一个或多个处理器实现数据增量装置这一实施例中的各模块/单元的功能，例如图9所示的各模块/单元的功能，为避免重复，这里不再赘述。本实施例所提供的可读存储介质包括非易失性可读存储介质和易失性可读存储介质。
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,是可以通过计算机可读指令来指令相关的硬件来完成,所述的计算机可读指令可存储于一非易失性计算机可读取存储介质中,该计算机可读指令在执行时,可包括如上述各方法的实施例的流程。其中,本申请所提供的各实施例中所使用的对存储器、存储、数据库或其它介质的任何引用,均可包括非易失性和/或易失性存储器。非易失性存储器可包括只读存储器(ROM)、可编程ROM(PROM)、电可编程ROM(EPROM)、电可擦除可编程ROM(EEPROM)或闪存。易失性存储器可包括随机存取存储器(RAM)或者外部高速缓冲存储器。作为说明而非局限,RAM以多种形式可得,诸如静态RAM(SRAM)、动态RAM(DRAM)、同步DRAM(SDRAM)、双数据率SDRAM(DDRSDRAM)、增强型SDRAM(ESDRAM)、同步链路(Synchlink)DRAM(SLDRAM)、存储器总线(Rambus)直接RAM(RDRAM)、直接存储器总线动态RAM(DRDRAM)、以及存储器总线动态RAM(RDRAM)等。
所属领域的技术人员可以清楚地了解到,为了描述的方便和简洁,仅以上述各功能单元、模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能单元、模块完成,即将所述装置的内部结构划分成不同的功能单元或模块,以完成以上描述的全部或者部分功能。
以上所述实施例仅用以说明本申请的技术方案,而非对其限制;尽管参照前述实施例对本申请进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本申请各实施例技术方案的精神和范围,均应包含在本申请的保护范围之内。

Claims (20)

  1. 一种数据增量方法,其特征在于,包括:
    获取特定场景对应的场景分类样本和指定样本比例,所述场景分类样本对应一分类标签;
    采用正则表达式对所述场景分类样本进行文本预处理,获取待训练文本;
    采用预先训练好的原始词向量模型对所述待训练文本进行增量训练,获取目标词向量模型;
    统计每一所述分类标签对应的实际样本数量和所有所述场景分类样本对应的总样本数量,基于所述实际样本数量和所述总样本数量,确定所述分类标签对应的实际样本比例;
    若所述分类标签对应的实际样本比例小于所述指定样本比例,则将所述分类标签对应的场景分类样本作为待增量样本;
    将所述待增量样本输入至所述目标词向量模型中进行处理,获取与所述待增量样本对应的至少一个候选词组,所述候选词组包括携带词向量的至少一个目标同义词;
    从每一所述候选词组中随机选取一个所述目标同义词对所述待增量样本进行替换处理,获取与所述分类标签对应的第一新增样本。
  2. 如权利要求1所述数据增量方法,其特征在于,所述获取特定场景对应的场景分类样本,包括:
    获取特定场景对应的原始语音信息,采用语音增强算法对所述原始语音信息进行降噪处理,获取目标语音信息;
    对所述目标语音信息进行特征提取,获取与所述目标语音信息相对应的目标语音特征;
    采用预先训练好的语音识别模型对目标语音特征进行识别,获取与所述特定场景相对应的场景分类样本。
  3. 如权利要求1所述数据增量方法,其特征在于,所述目标词向量模型包括近似度函数;
    所述将所述待增量样本输入至所述目标词向量模型中进行处理,获取与所述待增量样本对应的候选词组,所述候选词组包括携带词向量的至少一个目标同义词,包括:
    采用正则表达式对所述待增量样本进行分割,获取所述待增量样本对应的至少一个待替换语句;
    采用中文分词算法对每一所述待替换语句进行分词,获取所述待替换语句对应的至少一个待替换词次;
    将所述待替换语句对应的每一待替换词次输入到所述近似度函数中进行处理,得到与所述待替换词次相对应的携带词向量的至少一个目标同义词;
    将所述待替换词次和对应的所述携带词向量的至少一个目标同义词作为所述待替换词次对应的候选词组。
  4. 如权利要求3所述数据增量方法,其特征在于,所述从所述候选词组中随机选取至少一个所述目标同义词对所述待增量样本进行替换处理,获取与所述分类标签对应的第一新增样本,包括:
    从每一所述待替换词次对应的候选词组中随机选取至少一个所述目标同义词,确定为所述待替换词次对应的目标词次;
    将所述待替换语句中的每一所述待替换词次替换为与所述待替换词次对应的目标词次,获取与所述分类标签对应的第一新增样本。
  5. 如权利要求3所述数据增量方法,其特征在于,所述将所述待增量样本对应的每一待替换词次输入到所述近似度函数中进行处理,得到与所述待替换词次相对应的携带词向量的至少一个目标同义词,包括:
    将所述待增量样本对应的每一待替换词次输入到所述近似度函数中进行处理,获取与所述待替换词次相对应的至少一个原始同义词和每一所述原始同义词对应的近似度;
    基于所述总样本数量和所述指定样本比例,确定指定样本数量;
    根据所述指定样本数量和所述实际样本数量的差值,确定增量参数;
    基于目标同义词数量计算公式进行计算,获取携带词向量的目标同义词数量;其中,所述目标同义词数量计算公式包括
    N=Z^(1/m)；
    m为所述待替换词次的数量，N为所述目标同义词数量，Z为所述增量参数；
    按照所述目标同义词数量,从所述近似度降序排列的所述原始同义词中选取前N位的携带词向量的目标同义词。
  6. 如权利要求5所述数据增量方法,其特征在于,在所述获取携带词向量的目标同义词数量之后,所述数据增量方法还包括:
    若所述携带词向量的目标同义词数量为正整数,则直接执行所述按照所述携带词向量的目标同义词数量,从所述近似度降序排列的所述原始同义词中选取前N位的携带词向量的目标同义词的步骤;
    若所述携带词向量的目标同义词数量为浮点数，则对所述携带词向量的目标同义词数量进行向下取整处理，获取更新同义词数量；基于所述更新同义词数量，执行从所述近似度降序排列的所述原始同义词中选取前N位的携带词向量的目标同义词的步骤。
  7. 如权利要求6所述数据增量方法,其特征在于,在所述获取更新同义词数量之后,所述数据增量方法包括:
    采用待补充样本数量计算公式对所述携带词向量的目标同义词数量与所述更新同义词数量进行处理，获取待补充样本数量；其中，所述待补充样本数量计算公式为A=N^m-B^m，N表示所述目标同义词数量，B表示所述更新同义词数量，A表示待补充样本数量；
    采用第一翻译工具将所述待增量样本翻译为非中文文本,再采用所述第一翻译工具或第二翻译工具将所述非中文文本翻译为中文文本,获取与所述分类标签对应的第二新增样本,直至所述第二新增样本的样本数量达到所述待补充样本数量,将所述第二新增样本与所述分类标签关联存储。
  8. 一种数据增量装置,其特征在于,包括:
    样本获取模块,用于获取特定场景对应的场景分类样本和指定样本比例,所述场景分类样本对应一分类标签;
    待训练文本获取模块,用于采用正则表达式对所述场景分类样本进行文本预处理,获取待训练文本;
    目标词向量模型获取模块,用于采用预先训练好的原始词向量模型对所述待训练文本进行增量训练,获取目标词向量模型;
    实际样本比例确定模块,用于统计每一所述分类标签对应的实际样本数量和所有所述场景分类样本对应的总样本数量,基于所述实际样本数量和所述总样本数量,确定所述分类标签对应的实际样本比例;
    待增量样本确定模块,用于若所述分类标签对应的实际样本比例小于所述指定样本比例,则将所述分类标签对应的场景分类样本作为待增量样本;
    候选词组获取模块,用于将所述待增量样本输入至所述目标词向量模型中进行处理,获取与所述待增量样本对应的至少一个候选词组,所述候选词组包括携带词向量的至少一个目标同义词;
    第一新增样本获取模块,用于从每一所述候选词组中随机选取一个所述目标同义词对所述待增量样本进行替换处理,获取与所述分类标签对应的第一新增样本。
  9. 一种计算机设备,包括存储器、处理器以及存储在所述存储器中并可在所述处理器上运行的计算机可读指令,其特征在于,所述处理器执行所述计算机可读指令时实现如下步骤:
    获取特定场景对应的场景分类样本和指定样本比例,所述场景分类样本对应一分类标签;
    采用正则表达式对所述场景分类样本进行文本预处理,获取待训练文本;
    采用预先训练好的原始词向量模型对所述待训练文本进行增量训练,获取目标词向量模型;
    统计每一所述分类标签对应的实际样本数量和所有所述场景分类样本对应的总样本数量,基于所述实际样本数量和所述总样本数量,确定所述分类标签对应的实际样本比例;
    若所述分类标签对应的实际样本比例小于所述指定样本比例,则将所述分类标签对应的场景分类样本作为待增量样本;
    将所述待增量样本输入至所述目标词向量模型中进行处理,获取与所述待增量样本对应的至少一个候选词组,所述候选词组包括携带词向量的至少一个目标同义词;
    从每一所述候选词组中随机选取一个所述目标同义词对所述待增量样本进行替换处理，获取与所述分类标签对应的第一新增样本。
  10. 如权利要求9所述的计算机设备,其特征在于,所述获取特定场景对应的场景分类样本,包括:
    获取特定场景对应的原始语音信息,采用语音增强算法对所述原始语音信息进行降噪处理,获取目标语音信息;
    对所述目标语音信息进行特征提取,获取与所述目标语音信息相对应的目标语音特征;
    采用预先训练好的语音识别模型对目标语音特征进行识别,获取与所述特定场景相对应的场景分类样本。
  11. 如权利要求9所述的计算机设备,其特征在于,所述目标词向量模型包括近似度函数;
    所述将所述待增量样本输入至所述目标词向量模型中进行处理,获取与所述待增量样本对应的候选词组,所述候选词组包括携带词向量的至少一个目标同义词,包括:
    采用正则表达式对所述待增量样本进行分割,获取所述待增量样本对应的至少一个待替换语句;
    采用中文分词算法对每一所述待替换语句进行分词,获取所述待替换语句对应的至少一个待替换词次;
    将所述待替换语句对应的每一待替换词次输入到所述近似度函数中进行处理,得到与所述待替换词次相对应的携带词向量的至少一个目标同义词;
    将所述待替换词次和对应的所述携带词向量的至少一个目标同义词作为所述待替换词次对应的候选词组。
  12. 如权利要求11所述的计算机设备,其特征在于,所述将所述待增量样本对应的每一待替换词次输入到所述近似度函数中进行处理,得到与所述待替换词次相对应的携带词向量的至少一个目标同义词,包括:
    将所述待增量样本对应的每一待替换词次输入到所述近似度函数中进行处理,获取与所述待替换词次相对应的至少一个原始同义词和每一所述原始同义词对应的近似度;
    基于所述总样本数量和所述指定样本比例,确定指定样本数量;
    根据所述指定样本数量和所述实际样本数量的差值,确定增量参数;
    基于目标同义词数量计算公式进行计算,获取携带词向量的目标同义词数量;其中,所述目标同义词数量计算公式包括
    N=Z^(1/m)；
    m为所述待替换词次的数量,N为所述目标同义词数量,Z为所述增量参数;
    按照所述目标同义词数量,从所述近似度降序排列的所述原始同义词中选取前N位的携带词向量的目标同义词。
  13. 如权利要求11所述的计算机设备,其特征在于,在所述获取携带词向量的目标同义词数量之后,所述处理器执行所述计算机可读指令时还实现如下步骤:
    若所述携带词向量的目标同义词数量为正整数,则直接执行所述按照所述携带词向量的目标同义词数量,从所述近似度降序排列的所述原始同义词中选取前N位的携带词向量的目标同义词的步骤;
    若所述携带词向量的目标同义词数量为浮点数，则对所述携带词向量的目标同义词数量进行向下取整处理，获取更新同义词数量；基于所述更新同义词数量，执行从所述近似度降序排列的所述原始同义词中选取前N位的携带词向量的目标同义词的步骤。
  14. 如权利要求13所述的计算机设备,其特征在于,在所述获取更新同义词数量之后,所述处理器执行所述计算机可读指令时还实现如下步骤:
    采用待补充样本数量计算公式对所述携带词向量的目标同义词数量与所述更新同义词数量进行处理，获取待补充样本数量；其中，所述待补充样本数量计算公式为A=N^m-B^m，N表示所述目标同义词数量，B表示所述更新同义词数量，A表示待补充样本数量；
    采用第一翻译工具将所述待增量样本翻译为非中文文本，再采用所述第一翻译工具或第二翻译工具将所述非中文文本翻译为中文文本，获取与所述分类标签对应的第二新增样本，直至所述第二新增样本的样本数量达到所述待补充样本数量，将所述第二新增样本与所述分类标签关联存储。
  15. 一个或多个存储有计算机可读指令的可读存储介质,所述计算机可读指令被一个或多个处理器执行时,使得所述一个或多个处理器执行如下步骤:
    获取特定场景对应的场景分类样本和指定样本比例,所述场景分类样本对应一分类标签;
    采用正则表达式对所述场景分类样本进行文本预处理,获取待训练文本;
    采用预先训练好的原始词向量模型对所述待训练文本进行增量训练,获取目标词向量模型;
    统计每一所述分类标签对应的实际样本数量和所有所述场景分类样本对应的总样本数量,基于所述实际样本数量和所述总样本数量,确定所述分类标签对应的实际样本比例;
    若所述分类标签对应的实际样本比例小于所述指定样本比例,则将所述分类标签对应的场景分类样本作为待增量样本;
    将所述待增量样本输入至所述目标词向量模型中进行处理,获取与所述待增量样本对应的至少一个候选词组,所述候选词组包括携带词向量的至少一个目标同义词;
    从每一所述候选词组中随机选取一个所述目标同义词对所述待增量样本进行替换处理,获取与所述分类标签对应的第一新增样本。
  16. 如权利要求15所述的可读存储介质,其特征在于,所述获取特定场景对应的场景分类样本,包括:
    获取特定场景对应的原始语音信息,采用语音增强算法对所述原始语音信息进行降噪处理,获取目标语音信息;
    对所述目标语音信息进行特征提取,获取与所述目标语音信息相对应的目标语音特征;
    采用预先训练好的语音识别模型对目标语音特征进行识别,获取与所述特定场景相对应的场景分类样本。
  17. 如权利要求15所述的可读存储介质,其特征在于,所述目标词向量模型包括近似度函数;
    所述将所述待增量样本输入至所述目标词向量模型中进行处理,获取与所述待增量样本对应的候选词组,所述候选词组包括携带词向量的至少一个目标同义词,包括:
    采用正则表达式对所述待增量样本进行分割,获取所述待增量样本对应的至少一个待替换语句;
    采用中文分词算法对每一所述待替换语句进行分词,获取所述待替换语句对应的至少一个待替换词次;
    将所述待替换语句对应的每一待替换词次输入到所述近似度函数中进行处理,得到与所述待替换词次相对应的携带词向量的至少一个目标同义词;
    将所述待替换词次和对应的所述携带词向量的至少一个目标同义词作为所述待替换词次对应的候选词组。
  18. 如权利要求17所述的可读存储介质,其特征在于,所述将所述待增量样本对应的每一待替换词次输入到所述近似度函数中进行处理,得到与所述待替换词次相对应的携带词向量的至少一个目标同义词,包括:
    将所述待增量样本对应的每一待替换词次输入到所述近似度函数中进行处理,获取与所述待替换词次相对应的至少一个原始同义词和每一所述原始同义词对应的近似度;
    基于所述总样本数量和所述指定样本比例,确定指定样本数量;
    根据所述指定样本数量和所述实际样本数量的差值,确定增量参数;
    基于目标同义词数量计算公式进行计算,获取携带词向量的目标同义词数量;其中,所述目标同义词数量计算公式包括
    N=Z^(1/m)；
    m为所述待替换词次的数量,N为所述目标同义词数量,Z为所述增量参数;
    按照所述目标同义词数量,从所述近似度降序排列的所述原始同义词中选取前N位的携带词向量的目标同义词。
  19. 如权利要求18所述的可读存储介质，其特征在于，在所述获取携带词向量的目标同义词数量之后，所述计算机可读指令被一个或多个处理器执行时，使得所述一个或多个处理器还执行如下步骤：
    若所述携带词向量的目标同义词数量为正整数,则直接执行所述按照所述携带词向量的目标同义词数量,从所述近似度降序排列的所述原始同义词中选取前N位的携带词向量的目标同义词的步骤;
    若所述携带词向量的目标同义词数量为浮点数，则对所述携带词向量的目标同义词数量进行向下取整处理，获取更新同义词数量；基于所述更新同义词数量，执行从所述近似度降序排列的所述原始同义词中选取前N位的携带词向量的目标同义词的步骤。
  20. 如权利要求19所述的可读存储介质,其特征在于,在所述获取更新同义词数量之后,所述计算机可读指令被一个或多个处理器执行时,使得所述一个或多个处理器还执行如下步骤:
    采用待补充样本数量计算公式对所述携带词向量的目标同义词数量与所述更新同义词数量进行处理，获取待补充样本数量；其中，所述待补充样本数量计算公式为A=N^m-B^m，N表示所述目标同义词数量，B表示所述更新同义词数量，A表示待补充样本数量；
    采用第一翻译工具将所述待增量样本翻译为非中文文本,再采用所述第一翻译工具或第二翻译工具将所述非中文文本翻译为中文文本,获取与所述分类标签对应的第二新增样本,直至所述第二新增样本的样本数量达到所述待补充样本数量,将所述第二新增样本与所述分类标签关联存储。
PCT/CN2019/103271 2019-04-28 2019-08-29 数据增量方法、装置、计算机设备及存储介质 WO2020220539A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910350861.5 2019-04-28
CN201910350861.5A CN110162627B (zh) 2019-04-28 2019-04-28 数据增量方法、装置、计算机设备及存储介质

Publications (1)

Publication Number Publication Date
WO2020220539A1 true WO2020220539A1 (zh) 2020-11-05

Family

ID=67640197

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/103271 WO2020220539A1 (zh) 2019-04-28 2019-08-29 数据增量方法、装置、计算机设备及存储介质

Country Status (2)

Country Link
CN (1) CN110162627B (zh)
WO (1) WO2020220539A1 (zh)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112766501A (zh) * 2021-02-26 2021-05-07 上海商汤智能科技有限公司 增量训练方法和相关产品
CN112836053A (zh) * 2021-03-05 2021-05-25 三一重工股份有限公司 用于工业领域的人机对话情感分析方法及系统
CN112989045A (zh) * 2021-03-17 2021-06-18 中国平安人寿保险股份有限公司 神经网络训练方法、装置、电子设备及存储介质
CN113360346A (zh) * 2021-06-22 2021-09-07 北京百度网讯科技有限公司 用于训练模型的方法和装置
CN113408280A (zh) * 2021-06-30 2021-09-17 北京百度网讯科技有限公司 负例构造方法、装置、设备和存储介质
CN113435188A (zh) * 2021-06-25 2021-09-24 平安科技(深圳)有限公司 基于语义相似的过敏文本样本生成方法、装置及相关设备
CN113537345A (zh) * 2021-07-15 2021-10-22 中国南方电网有限责任公司 一种通信网设备数据关联的方法及系统
CN113705683A (zh) * 2021-08-30 2021-11-26 北京达佳互联信息技术有限公司 推荐模型的训练方法、装置、电子设备及存储介质
CN113791694A (zh) * 2021-08-17 2021-12-14 咪咕文化科技有限公司 数据输入方法、装置、设备及计算机可读存储介质
CN114491076A (zh) * 2022-02-14 2022-05-13 平安科技(深圳)有限公司 基于领域知识图谱的数据增强方法、装置、设备及介质
WO2022198477A1 (zh) * 2021-03-24 2022-09-29 深圳大学 分类模型增量学习实现方法、装置、电子设备及介质
CN115408527A (zh) * 2022-11-02 2022-11-29 北京亿赛通科技发展有限责任公司 文本分类方法、装置、电子设备及存储介质
CN115455177A (zh) * 2022-08-02 2022-12-09 淮阴工学院 基于混合样本空间的不平衡化工文本数据增强方法及装置
CN115688868A (zh) * 2022-12-30 2023-02-03 荣耀终端有限公司 一种模型训练方法及计算设备
CN116227431A (zh) * 2023-03-17 2023-06-06 中科雨辰科技有限公司 一种文本数据增强方法、电子设备及存储介质

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110162627B (zh) * 2019-04-28 2022-04-15 平安科技(深圳)有限公司 数据增量方法、装置、计算机设备及存储介质
CN111401397A (zh) * 2019-11-05 2020-07-10 杭州海康威视系统技术有限公司 分类方法、装置及设备、存储介质
CN111177367B (zh) * 2019-11-11 2023-06-23 腾讯科技(深圳)有限公司 案件分类方法、分类模型训练方法及相关产品
CN111079406B (zh) * 2019-12-13 2022-01-11 华中科技大学 自然语言处理模型训练方法、任务执行方法、设备及系统
CN112989794A (zh) * 2019-12-16 2021-06-18 科沃斯商用机器人有限公司 模型训练方法、装置、智能机器人和存储介质
CN111124925B (zh) * 2019-12-25 2024-04-05 斑马网络技术有限公司 基于大数据的场景提取方法、装置、设备和存储介质
CN111291560B (zh) * 2020-03-06 2023-05-23 深圳前海微众银行股份有限公司 样本扩充方法、终端、装置及可读存储介质
CN111400431A (zh) * 2020-03-20 2020-07-10 北京百度网讯科技有限公司 一种事件论元抽取方法、装置以及电子设备
CN111814538B (zh) * 2020-05-25 2024-03-05 北京达佳互联信息技术有限公司 目标对象的类别识别方法、装置、电子设备及存储介质
CN111522570B (zh) * 2020-06-19 2023-09-05 杭州海康威视数字技术股份有限公司 目标库更新方法、装置、电子设备及机器可读存储介质
CN111563152A (zh) * 2020-06-19 2020-08-21 平安科技(深圳)有限公司 智能问答语料分析方法、装置、电子设备及可读存储介质
CN112101042A (zh) * 2020-09-14 2020-12-18 平安科技(深圳)有限公司 文本情绪识别方法、装置、终端设备和存储介质
CN112183074A (zh) * 2020-09-27 2021-01-05 中国建设银行股份有限公司 一种数据增强方法、装置、设备及介质
CN112906669A (zh) * 2021-04-08 2021-06-04 济南博观智能科技有限公司 一种交通目标检测方法、装置、设备及可读存储介质
CN113469090B (zh) * 2021-07-09 2023-07-14 王晓东 水质污染预警方法、装置及存储介质
CN114637824B (zh) * 2022-03-18 2023-12-01 马上消费金融股份有限公司 数据增强处理方法及装置
CN115131631A (zh) * 2022-07-28 2022-09-30 广州广电运通金融电子股份有限公司 图像识别模型训练方法、装置、计算机设备和存储介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106776534A (zh) * 2016-11-11 2017-05-31 北京工商大学 词向量模型的增量式学习方法
CN108509415A (zh) * 2018-03-16 2018-09-07 南京云问网络技术有限公司 一种基于词序加权的句子相似度计算方法
CN108509422A (zh) * 2018-04-04 2018-09-07 广州荔支网络技术有限公司 一种词向量的增量学习方法、装置和电子设备
US20180276507A1 (en) * 2015-10-28 2018-09-27 Hewlett-Packard Development Company, L.P. Machine learning classifiers
CN110162627A (zh) * 2019-04-28 2019-08-23 平安科技(深圳)有限公司 数据增量方法、装置、计算机设备及存储介质

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112766501A (zh) * 2021-02-26 2021-05-07 上海商汤智能科技有限公司 增量训练方法和相关产品
CN112836053A (zh) * 2021-03-05 2021-05-25 三一重工股份有限公司 用于工业领域的人机对话情感分析方法及系统
CN112989045A (zh) * 2021-03-17 2021-06-18 中国平安人寿保险股份有限公司 神经网络训练方法、装置、电子设备及存储介质
CN112989045B (zh) * 2021-03-17 2023-07-25 中国平安人寿保险股份有限公司 神经网络训练方法、装置、电子设备及存储介质
WO2022198477A1 (zh) * 2021-03-24 2022-09-29 深圳大学 分类模型增量学习实现方法、装置、电子设备及介质
CN113360346A (zh) * 2021-06-22 2021-09-07 北京百度网讯科技有限公司 用于训练模型的方法和装置
CN113360346B (zh) * 2021-06-22 2023-07-11 北京百度网讯科技有限公司 用于训练模型的方法和装置
CN113435188A (zh) * 2021-06-25 2021-09-24 平安科技(深圳)有限公司 基于语义相似的过敏文本样本生成方法、装置及相关设备
CN113408280B (zh) * 2021-06-30 2024-03-22 北京百度网讯科技有限公司 负例构造方法、装置、设备和存储介质
CN113408280A (zh) * 2021-06-30 2021-09-17 北京百度网讯科技有限公司 负例构造方法、装置、设备和存储介质
CN113537345A (zh) * 2021-07-15 2021-10-22 中国南方电网有限责任公司 一种通信网设备数据关联的方法及系统
CN113537345B (zh) * 2021-07-15 2023-01-24 中国南方电网有限责任公司 一种通信网设备数据关联的方法及系统
CN113791694A (zh) * 2021-08-17 2021-12-14 咪咕文化科技有限公司 数据输入方法、装置、设备及计算机可读存储介质
CN113705683A (zh) * 2021-08-30 2021-11-26 北京达佳互联信息技术有限公司 推荐模型的训练方法、装置、电子设备及存储介质
CN114491076A (zh) * 2022-02-14 2022-05-13 平安科技(深圳)有限公司 基于领域知识图谱的数据增强方法、装置、设备及介质
CN114491076B (zh) * 2022-02-14 2024-04-09 平安科技(深圳)有限公司 基于领域知识图谱的数据增强方法、装置、设备及介质
CN115455177A (zh) * 2022-08-02 2022-12-09 淮阴工学院 基于混合样本空间的不平衡化工文本数据增强方法及装置
CN115408527A (zh) * 2022-11-02 2022-11-29 北京亿赛通科技发展有限责任公司 文本分类方法、装置、电子设备及存储介质
CN115408527B (zh) * 2022-11-02 2023-03-10 北京亿赛通科技发展有限责任公司 文本分类方法、装置、电子设备及存储介质
CN115688868B (zh) * 2022-12-30 2023-10-20 荣耀终端有限公司 一种模型训练方法及计算设备
CN115688868A (zh) * 2022-12-30 2023-02-03 荣耀终端有限公司 一种模型训练方法及计算设备
CN116227431A (zh) * 2023-03-17 2023-06-06 中科雨辰科技有限公司 一种文本数据增强方法、电子设备及存储介质
CN116227431B (zh) * 2023-03-17 2023-08-15 中科雨辰科技有限公司 一种文本数据增强方法、电子设备及存储介质

Also Published As

Publication number Publication date
CN110162627A (zh) 2019-08-23
CN110162627B (zh) 2022-04-15

Similar Documents

Publication Publication Date Title
WO2020220539A1 (zh) 数据增量方法、装置、计算机设备及存储介质
CN109408526B (zh) Sql语句生成方法、装置、计算机设备及存储介质
US10657325B2 (en) Method for parsing query based on artificial intelligence and computer device
CN108829893B (zh) 确定视频标签的方法、装置、存储介质和终端设备
US10108607B2 (en) Method and device for machine translation
US11734514B1 (en) Automated translation of subject matter specific documents
CN110444198B (zh) 检索方法、装置、计算机设备和存储介质
CN109522393A (zh) 智能问答方法、装置、计算机设备和存储介质
US20180329894A1 (en) Language conversion method and device based on artificial intelligence and terminal
CN110415679B (zh) 语音纠错方法、装置、设备和存储介质
CN114580382A (zh) 文本纠错方法以及装置
CN111144102B (zh) 用于识别语句中实体的方法、装置和电子设备
CN110717021B (zh) 人工智能面试中获取输入文本和相关装置
CN112528681A (zh) 跨语言检索及模型训练方法、装置、设备和存储介质
CN113076431A (zh) 机器阅读理解的问答方法、装置、计算机设备及存储介质
CN113821593A (zh) 一种语料处理的方法、相关装置及设备
US11270085B2 (en) Generating method, generating device, and recording medium
CN111126084A (zh) 数据处理方法、装置、电子设备和存储介质
CN110309513B (zh) 一种文本依存分析的方法和装置
CN109684357B (zh) 信息处理方法及装置、存储介质、终端
CN115858776B (zh) 一种变体文本分类识别方法、系统、存储介质和电子设备
CN111563381A (zh) 文本处理方法和装置
CN113868389A (zh) 基于自然语言文本的数据查询方法、装置及计算机设备
CN109727591B (zh) 一种语音搜索的方法及装置
CN114430832A (zh) 数据处理方法、装置、电子设备和存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19927450

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19927450

Country of ref document: EP

Kind code of ref document: A1