WO2021073390A1 - Data screening method, device, equipment, and computer-readable storage medium - Google Patents

Data screening method, device, equipment, and computer-readable storage medium Download PDF

Info

Publication number
WO2021073390A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
answer text
interview answer
interview
scoring
Prior art date
Application number
PCT/CN2020/117418
Other languages
English (en)
French (fr)
Inventor
邓悦
金戈
徐亮
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021073390A1 publication Critical patent/WO2021073390A1/zh

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to a data screening method, device, equipment, and computer-readable storage medium.
  • At present, the industry uses the Bidirectional Encoder Representations from Transformers (BERT) model to automatically filter data that meets requirements out of a data set; for example, the BERT model is used to filter qualifying resumes or target data out of a resume data set or a target data set.
  • However, the BERT model requires a large amount of labeled data for training, labeling takes considerable time, and the labeling is done manually.
  • The inventor realized that when data is labeled manually in large quantities, labels are often inaccurate, which easily reduces the accuracy of the model and makes it impossible to accurately filter qualifying data out of the data set. How to improve the accuracy of data screening is therefore a problem that urgently needs to be solved.
  • the main purpose of this application is to provide a data screening method, device, equipment and computer-readable storage medium, aiming to improve the accuracy of data screening.
  • this application provides a data screening method.
  • the data screening method includes the following steps:
  • acquiring a target data set, where the target data set is a data set to be filtered; scoring each interview answer text in the target data set based on a preset data scoring model to obtain a score value for each interview answer text, where the data scoring model is implemented based on a multi-task deep neural network;
  • screening the target data set according to the score value of each interview answer text to obtain the interview answer texts that meet preset conditions.
  • this application also provides a data screening device, which includes:
  • An acquisition module for acquiring a target data set, where the target data set is a data set to be filtered
  • the scoring module is used to score each interview answer text in the target data set based on a preset data scoring model to obtain the score value of each interview answer text, where the data scoring model is implemented based on a multi-task deep neural network;
  • the screening module is used to screen the target data set according to the scoring value of each interview answer text to obtain the interview answer text that meets the preset conditions.
  • the present application also provides a computer device that includes a processor, a memory, and a computer program stored on the memory and executable by the processor, wherein, when the computer program is executed by the processor, the steps of the data screening method described above are realized.
  • the present application also provides a computer-readable storage medium with a computer program stored on the computer-readable storage medium, wherein when the computer program is executed by a processor, the steps of the above-mentioned data screening method are realized.
  • This application provides a data screening method, device, equipment, and computer readable storage medium.
  • This application, through a data scoring model implemented based on a multi-task deep neural network, can score each interview answer text in the data set accurately and quickly.
  • With an accurate score value for each interview answer text, the qualifying interview answer texts can be accurately filtered out of the data set, effectively improving the accuracy of data screening.
  • FIG. 1 is a schematic flowchart of a data screening method provided by an embodiment of the application
  • FIG. 2 is a schematic flowchart of sub-steps of the data screening method in FIG. 1;
  • FIG. 3 is a schematic flowchart of another data screening method provided by an embodiment of the application.
  • FIG. 4 is a schematic block diagram of a data screening device provided by an embodiment of the application.
  • FIG. 5 is a schematic block diagram of sub-modules of the data screening device in FIG. 4;
  • FIG. 6 is a schematic block diagram of another data screening device provided by an embodiment of the application.
  • FIG. 7 is a schematic block diagram of the structure of a computer device related to an embodiment of the application.
  • the embodiments of the present application provide a data screening method, device, equipment, and computer-readable storage medium.
  • the data filtering method can be applied to a server, and the server can be a single server or a server cluster composed of multiple servers.
  • FIG. 1 is a schematic flowchart of a data screening method provided by an embodiment of the application.
  • the data screening method includes steps S101 to S103.
  • The server stores a data set to be screened. The data set to be screened includes the interview answer text of each interviewee for different positions.
  • Each interview answer text records the interviewee's basic personal information, the answers to each interview question, and the like.
  • The server stores the interview answer texts of the interviewees position by position, thereby obtaining the data set corresponding to each position, and marks the screened and unscreened interview answer texts.
  • In this way the data set to be screened corresponding to each position is obtained; the interview answer texts in the data set to be screened are the unscreened interview answer texts.
  • The server can obtain the unscreened interview answer texts corresponding to each position in real time or at preset intervals and collect them position by position, obtaining the data set to be screened for each position, i.e., the target data set. It should be noted that the aforementioned preset interval can be set based on actual conditions, which is not specifically limited in this application.
  • The recruiter can select one or more positions for data screening through a terminal device. Specifically, the terminal device displays a position selection page and obtains the position identifier corresponding to the position the user selects on that page;
  • it then generates a data screening request containing the position identifier and sends the request to the server. When the server receives the data screening request, it obtains the position identifier from the request, obtains the target data set corresponding to that identifier, and then screens the target data in the target data set to obtain data that meets the requirements.
  • The position identifier is used to uniquely identify the position; it can be numbers, letters, or a combination of numbers and letters, which this application does not specifically limit.
  • The terminal device can be an electronic device such as a mobile phone, tablet, laptop, desktop computer, personal digital assistant, or wearable device.
  • Step S102: Based on a preset data scoring model, score each interview answer text in the target data set to obtain a score value for each interview answer text, where the data scoring model is implemented based on a multi-task deep neural network.
  • a data scoring model is stored in the server.
  • the data scoring model is implemented based on a multi-task deep neural network.
  • the multi-task deep neural network combines multi-task learning with language model pre-training.
  • Multi-task learning uses the useful information contained across multiple learning tasks to help each task learn, yielding a more accurate learner, while language model pre-training uses a large amount of unlabeled data to pre-train the model and then fine-tunes it for a single specific task, improving text representation learning and thus various natural language understanding tasks.
  • the multi-task deep neural network includes an input layer, a Lexicon encoding layer (word encoding layer), a Transformer encoding layer (context encoding layer), and task-specific output layers.
  • The task-specific output layers include a single-sentence classification output layer, a text similarity output layer, a paired text classification output layer, and a relevance ranking output layer.
  • the Lexicon encoding layer is used to map an input text or sentence into an embedding vector by summing the corresponding word, segment, and position embeddings.
  • the Transformer encoding layer is composed of multiple identical levels, each of which includes two different sub-levels.
  • One sub-level is a multi-head attention layer, used to learn the word dependencies within a sentence and capture its internal structure; the other sub-level is a fully connected layer, and each sub-level is connected to a residual connection layer and a normalization layer.
  • the Transformer encoding layer pre-trains deep bidirectional representations by jointly conditioning on context in all layers; that is, the Transformer encoding layer maps the embedding vectors to context embedding vectors.
  • the single-sentence classification output layer is used to judge the grammatical correctness of a sentence, or to judge the type of emotion it carries.
  • Logistic regression with the softmax function predicts the probability that sentence X is labeled as class C; the formula is P_r(C|X) = softmax(W^T * X), where W^T is the model parameter of the single-sentence classification model.
  • the text similarity output layer is used to judge the semantic similarity of two sentences.
  • the relevance ranking output layer is used to score an interview answer text: given an interview answer text as input, it calculates the similarity between the interview answer text and the standard answer text, and then scores based on that similarity.
  • the training process of the model is mainly divided into two steps: pre-training and multi-task fine-tuning.
  • Pre-training: two unsupervised prediction tasks are used to pre-train the encoding layers (the Lexicon encoding layer and the Transformer encoding layer) to learn their parameters.
  • The two unsupervised prediction tasks are Masked Language Modeling and Next Sentence Prediction.
  • Masked language model: to train a deep bidirectional representation, a simple method is adopted: part of the input tokens are randomly masked, and only the masked tokens are predicted. Rather than always replacing the selected word with [MASK], the data generator does the following: 80% of the time it replaces the word with the [MASK] tag; 10% of the time it replaces the word with a random word; 10% of the time it keeps the word unchanged.
  • Next sentence prediction: to train a model that understands the relationship between sentences, a binarized next-sentence prediction task is pre-trained; this task can be generated from any monolingual corpus. Specifically, when sentences A and B are selected as a pre-training sample, B is the actual next sentence of A 50% of the time and a random sentence from the corpus 50% of the time.
  • Multi-task fine-tuning: the mini-batch gradient descent algorithm (Mini-batch Gradient Descent) is used to learn the parameters of the model (the encoding layers and the task-specific output layers). The steps are: first, set the number of training rounds N and divide the data set into equally sized mini-batches D_1, D_2, ..., D_T; then, for each training round, merge the data sets of the four specific tasks and, for each mini-batch, update the model parameters by stochastic gradient descent, approaching the optimal solution with each iteration.
  • For the data scoring task, the model is trained in the same way as multi-task fine-tuning to learn the model parameters of the data scoring model; only a small labeled data set is required to fine-tune the data scoring model and obtain a highly accurate one.
  • the data scoring model includes an input layer, a word encoding layer (Lexicon encoding layer), a context encoding layer (Transformer encoding layer), and a data scoring layer.
  • the server can score each interview answer text in the target data set based on a preset data scoring model, and obtain the score value of each interview answer text.
  • the data scoring model can quickly and accurately score the target data, which is convenient for subsequent accurate screening of the target data set.
  • step S102 includes: sub-step S1021 to sub-step S1023.
  • In sub-step S1021, each interview answer text in the target data set is mapped in turn to its corresponding embedding vector through the word encoding layer.
  • That is, after the target data set is acquired, the word encoding layer in the data scoring model maps each interview answer text in the target data set to its corresponding embedding vector.
  • For example, the target data set includes five interview answer texts: interview answer text A, interview answer text B, interview answer text C, interview answer text D, and interview answer text E.
  • Sub-step S1022: the embedding vector corresponding to each interview answer text is mapped in turn to its corresponding context embedding vector through the context encoding layer.
  • That is, after the embedding vector of each interview answer text is obtained, the context encoding layer maps each of those embedding vectors to its corresponding context embedding vector.
  • For example, if the embedding vectors corresponding to the interview answer texts are embedding vector a, embedding vector b, embedding vector c, embedding vector d, and embedding vector e, then after the context encoding layer the corresponding context embedding vectors are obtained, i.e., embedding vector a1, embedding vector b1, embedding vector c1, embedding vector d1, and embedding vector e1.
  • Sub-step S1023: Based on the data scoring layer, determine the score value of each interview answer text according to the context embedding vector corresponding to each interview answer text.
  • In an embodiment, the server obtains the text vector corresponding to a preset standard answer text and, through the model parameters of the data scoring model, calculates the semantic similarity between the context embedding vector corresponding to each interview answer text and that text vector; the score value of each interview answer text is then determined from this semantic similarity. The server processes the standard answer text through the word encoding layer and the context encoding layer in advance to obtain and store the corresponding text vector, which is convenient for subsequent quick acquisition.
  • According to a preset mapping function, the semantic similarity between the context embedding vector corresponding to each interview answer text and the text vector is mapped to obtain the score value of each interview answer text.
  • the foregoing preset mapping function can be set based on actual conditions, which is not specifically limited in this application.
  • the preset mapping function is a sigmoid function.
  • In an embodiment, the server obtains the text vector corresponding to the answer text of each interview question in the preset standard answer text; determines the target text vector corresponding to the standard answer text from those text vectors; calculates the semantic similarity between the context embedding vector corresponding to each interview answer text and the target text vector; and determines the score value of each interview answer text from that semantic similarity.
  • The standard answer text includes the answer texts of multiple interview questions.
  • The target text vector is determined by splicing the text vectors corresponding to the answer texts of the interview questions to obtain a text splicing vector, which is used as the target text vector corresponding to the standard answer text.
  • The server processes the answer text of each interview question through the word encoding layer and the context encoding layer to obtain and store the corresponding text vectors for quick subsequent acquisition.
  • Determining the target text vector of the standard answer text from the text vectors of the individual answer texts accurately characterizes the features of the standard answer text.
  • Step S103: Screen the target data set according to the score value of each interview answer text to obtain the interview answer texts that meet preset conditions.
  • After determining the score value of each interview answer text, the server screens the target data set according to those score values to obtain the interview answer texts that meet preset conditions; that is, the score value of each interview answer text is compared with a preset score threshold to obtain a score comparison result, and the target data set is screened based on that result to obtain the interview answer texts whose score value is greater than or equal to the preset threshold.
  • In an embodiment, the interview answer texts are sorted by score value to obtain an interview answer text queue; following the order of the queue, interview answer texts are selected in turn until their number reaches a preset count, thereby obtaining interview answer texts whose score value is greater than or equal to the preset threshold.
  • The data screening method provided in the above embodiment can accurately and quickly score each interview answer text in the data set through a data scoring model implemented based on a multi-task deep neural network; with accurate score values for the interview answer texts, the qualifying interview answer texts can be accurately filtered out of the data set, effectively improving the accuracy of data screening.
  • FIG. 3 is a schematic flowchart of another data screening method provided by an embodiment of the application.
  • the data screening method includes steps S201 to S207.
  • The server stores a data set to be screened. The data set to be screened includes the interview answer text of each interviewee for different positions.
  • Each interview answer text records the interviewee's basic personal information, the answers to each interview question, and the like.
  • The server stores the interview answer texts of the interviewees position by position, thereby obtaining the data set corresponding to each position, and marks the screened and unscreened interview answer texts.
  • In this way the data set to be screened corresponding to each position is obtained; the interview answer texts in the data set to be screened are the unscreened interview answer texts.
  • The server can obtain the unscreened interview answer texts corresponding to each position in real time or at preset intervals and collect them position by position, obtaining the data set to be screened for each position, i.e., the target data set. It should be noted that the aforementioned preset interval can be set based on actual conditions, which is not specifically limited in this application.
  • Step S202: Map each interview answer text in the target data set, in turn, to its corresponding embedding vector through the word encoding layer.
  • After the target data set is acquired, the word encoding layer in the data scoring model maps each interview answer text in the target data set to its corresponding embedding vector.
  • For example, the target data set includes five interview answer texts: interview answer text A, interview answer text B, interview answer text C, interview answer text D, and interview answer text E.
  • After the embedding vector of each interview answer text is obtained, the context encoding layer maps each of those embedding vectors, in turn, to its corresponding context embedding vector.
  • For example, if the embedding vectors corresponding to the interview answer texts are embedding vector a, embedding vector b, embedding vector c, embedding vector d, and embedding vector e,
  • then after the context encoding layer the corresponding context embedding vectors are obtained, i.e., embedding vector a1, embedding vector b1, embedding vector c1, embedding vector d1, and embedding vector e1.
  • Step S204: Obtain the text vector corresponding to each standard answer text in a preset standard data set.
  • The preset standard data set includes multiple standard answer texts, and each standard answer text includes the correct answer.
  • The server processes each standard answer text in the standard data set through the word encoding layer and the context encoding layer to obtain the text vector corresponding to each standard answer text.
  • Step S205: Calculate the semantic similarity between the context embedding vector corresponding to each interview answer text and each text vector.
  • Using the model parameters of the data scoring model, the context embedding vector corresponding to each interview answer text, and the text vector corresponding to each standard answer text, the semantic similarity between each context embedding vector and each text vector is calculated.
  • Step S206 Determine the score value of each interview response text according to the semantic similarity between the context embedding vector corresponding to each interview response text and each text vector.
  • In an embodiment, the target similarity corresponding to each interview answer text is determined according to the semantic similarity between its context embedding vector and each text vector, and the score value of each interview answer text is determined according to its target similarity; that is, according to a preset mapping function, the semantic similarity corresponding to each interview answer text is mapped to obtain the score value of each interview answer text.
  • The target similarity is determined as follows: taking the interview answer text as a unit, the semantic similarities between the context embedding vector of that interview answer text and the text vector of each standard answer text are collected to form the semantic similarity set of that text, one set per interview answer text; the maximum semantic similarity in the set is taken as the target similarity of that interview answer text.
  • Step S207 Perform a screening process on the target data set according to the score value of each interview answer text to obtain an interview answer text that meets preset conditions.
  • After determining the score value of each interview answer text, the server screens the target data set according to those score values to obtain the interview answer texts that meet preset conditions; that is, the score value of each interview answer text is compared with a preset score threshold to obtain a score comparison result, and the target data set is screened based on that result to obtain the interview answer texts whose score value is greater than or equal to the preset threshold.
  • The data screening method provided in the foregoing embodiment can score interview answer texts still more accurately through a data scoring model implemented based on a multi-task deep neural network together with multiple standard answer texts; based on those scores, the qualifying interview answer texts can be accurately screened out of the data set, effectively improving the accuracy with which job candidates are selected.
  • FIG. 4 is a schematic block diagram of a data screening device provided by an embodiment of the application.
  • the data screening device 300 includes: an acquisition module 301, a scoring module 302, and a screening module 303.
  • the obtaining module 301 is configured to obtain a target data set, where the target data set is a data set to be filtered;
  • the scoring module 302 is used for scoring each interview answer text in the target data set based on a preset data scoring model to obtain the score value of each interview answer text, where the data scoring model is implemented based on a multi-task deep neural network;
  • the screening module 303 is configured to perform screening processing on the target data set according to the score value of each interview answer text to obtain interview answer texts that meet preset conditions.
  • the scoring module 302 includes:
  • the first vector determining sub-module 3021 is configured to map each interview answer text in the target data set to its corresponding embedding vector through the word encoding layer;
  • the second vector determining sub-module 3022 is configured to sequentially map the respective embedding vectors of each interview answer text to the respective corresponding context embedding vectors through the context coding layer;
  • the scoring sub-module 3023 is configured to determine the scoring value of each interview answer text based on the data scoring layer and according to the context embedding vector corresponding to each interview answer text.
  • the scoring submodule 3023 is also used to obtain the text vector corresponding to the preset standard answer text; calculate the semantic similarity between the context embedding vector corresponding to each interview answer text and the text vector; and determine the score value of each interview answer text according to that semantic similarity.
  • the scoring submodule 3023 is further configured to perform mapping processing on the semantic similarity between the context embedding vector corresponding to each interview answer text and the text vector according to a preset mapping function, Obtain the score value of each interview answer text.
  • the screening module 303 is further configured to compare the score value of each interview answer text with a preset score threshold to obtain a score comparison result; according to the score comparison result, compare the score The target data set is subjected to screening processing, and the interview answer text with the score value greater than or equal to the preset threshold is obtained.
  • FIG. 6 is a schematic block diagram of another data screening device provided by an embodiment of the application.
  • the data screening device 400 includes: an acquisition module 401, a vector determination module 402, a calculation module 403, a scoring module 404, and a screening module 405.
  • the obtaining module 401 is configured to obtain a target data set, where the target data set is a data set to be filtered;
  • the vector determining module 402 is configured to sequentially map each interview answer text in the target data set to its corresponding embedding vector through the word encoding layer;
  • the vector determining module 402 is further configured to sequentially map the respective embedding vectors of each interview answer text to the respective corresponding context embedding vectors through the context coding layer;
  • the obtaining module 401 is also used to obtain the text vector corresponding to each standard answer text in the preset standard data set;
  • the calculation module 403 is configured to calculate the semantic similarity between the context embedding vector corresponding to each interview answer text and each text vector;
  • the scoring module 404 is configured to determine the scoring value of each interview response text according to the semantic similarity between the context embedding vector corresponding to each interview response text and each text vector;
  • the screening module 405 is configured to perform screening processing on the target data set according to the score value of each interview answer text to obtain interview answer texts that meet preset conditions.
  • the scoring module 404 is further configured to determine the target similarity corresponding to each interview answer text according to the semantic similarity between the context embedding vector corresponding to each interview answer text and each text vector, and to determine the score value of each interview answer text according to the target similarity corresponding to each interview answer text.
  • the apparatus provided in the foregoing embodiment may be implemented in the form of a computer program, and the computer program may run on the computer device as shown in FIG. 7.
  • FIG. 7 is a schematic block diagram of a structure of a computer device provided by an embodiment of the application.
  • the computer device may be a server.
  • the computer device includes a processor, a memory, and a network interface connected through a system bus, where the memory may include a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium can store an operating system and a computer program.
  • the computer program includes program instructions, and when the program instructions are executed, the processor can execute any data screening method.
  • the processor is used to provide computing and control capabilities and support the operation of the entire computer equipment.
  • the internal memory provides an environment for the operation of the computer program in the non-volatile storage medium.
  • the processor can execute any data screening method.
  • the network interface is used for network communication, such as sending assigned tasks.
  • FIG. 7 is only a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution of the present application is applied.
  • A specific computer device may include more or fewer components than shown in the figure, combine some components, or have a different arrangement of components.
  • the processor may be a central processing unit (Central Processing Unit, CPU), and the processor may also be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor.
  • the processor is used to run a computer program stored in a memory to implement the following steps:
  • acquiring a target data set, where the target data set is a data set to be filtered; scoring each interview answer text in the target data set based on a preset data scoring model to obtain a score value for each interview answer text, where the data scoring model is implemented based on a multi-task deep neural network;
  • screening the target data set according to the score value of each interview answer text to obtain the interview answer texts that meet preset conditions.
  • the scoring value of each interview answer text is determined according to the context embedding vector corresponding to each interview answer text.
  • when the processor determines, based on the data scoring layer, the score value of each interview answer text according to the context embedding vector corresponding to each interview answer text, it is used to implement:
  • the scoring value of each interview answer text is determined.
  • when the processor determines the score value of each interview answer text according to the semantic similarity between the context embedding vector corresponding to each interview answer text and each text vector, it is used to implement:
  • when the processor determines, based on the data scoring layer, the score value of each interview answer text according to the context embedding vector corresponding to each interview answer text, it is also used to implement:
  • the scoring value of each interview answer text is determined according to the semantic similarity between the context embedding vector corresponding to each interview answer text and the text vector.
  • when the processor determines the score value of each interview answer text according to the semantic similarity between the context embedding vector corresponding to each interview answer text and the text vector, it is used to implement:
  • the semantic similarity between the context embedding vector corresponding to each interview answer text and the text vector is mapped to obtain the score value of each interview answer text.
  • when the processor screens the target data set according to the score value of each interview answer text to obtain interview answer texts that meet preset conditions, it is used to implement:
  • the embodiments of the present application also provide a computer-readable storage medium, and the computer-readable storage medium may be non-volatile or volatile.
  • A computer program is stored on the computer-readable storage medium, and the computer program includes program instructions; for the method implemented when the program instructions are executed, reference may be made to the various embodiments of the data screening method of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Machine Translation (AREA)

Abstract

A data screening method, device, equipment, and computer-readable storage medium, relating to the field of artificial intelligence technology and in particular to intelligent decision-making and neural network technology. The method includes: acquiring a target data set (S101); scoring each interview answer text in the target data set based on a preset data scoring model to obtain a score value for each interview answer text (S102), where the data scoring model is implemented based on a multi-task deep neural network; and screening the target data set according to the score value of each interview answer text to obtain the interview answer texts that meet preset conditions (S103). The method can effectively improve the accuracy of data screening.

Description

Data screening method, device, equipment, and computer-readable storage medium
This application claims priority to the Chinese patent application filed with the China Patent Office on October 16, 2019, with application number 201910984851.7 and invention title "Data screening method, device, equipment, and computer-readable storage medium", the entire contents of which are incorporated into this application by reference.
Technical Field
This application relates to the field of artificial intelligence technology, and in particular to a data screening method, device, equipment, and computer-readable storage medium.
Background
At present, the industry uses the Bidirectional Encoder Representations from Transformers (BERT) model to automatically filter data that meets requirements out of a data set; for example, the BERT model is used to filter qualifying resumes or target data out of a resume data set or a target data set. However, the BERT model requires a large amount of labeled data for training, labeling takes considerable time, and the labeling is done manually. The inventor realized that when data is labeled manually in large quantities, labels are often inaccurate, which easily reduces the accuracy of the model and makes it impossible to accurately filter qualifying data out of the data set. How to improve the accuracy of data screening is therefore a problem that urgently needs to be solved.
Summary
The main purpose of this application is to provide a data screening method, device, equipment, and computer-readable storage medium, aiming to improve the accuracy of data screening.
In a first aspect, this application provides a data screening method, which includes the following steps:
acquiring a target data set, where the target data set is a data set to be filtered;
scoring each interview answer text in the target data set based on a preset data scoring model to obtain a score value for each interview answer text, where the data scoring model is implemented based on a multi-task deep neural network;
screening the target data set according to the score value of each interview answer text to obtain the interview answer texts that meet preset conditions.
In a second aspect, this application also provides a data screening device, which includes:
an acquisition module for acquiring a target data set, where the target data set is a data set to be filtered;
a scoring module for scoring each interview answer text in the target data set based on a preset data scoring model to obtain the score value of each interview answer text, where the data scoring model is implemented based on a multi-task deep neural network;
a screening module for screening the target data set according to the score value of each interview answer text to obtain the interview answer texts that meet preset conditions.
In a third aspect, this application also provides a computer device that includes a processor, a memory, and a computer program stored on the memory and executable by the processor, where the steps of the data screening method described above are realized when the computer program is executed by the processor.
In a fourth aspect, this application also provides a computer-readable storage medium on which a computer program is stored, where the steps of the data screening method described above are realized when the computer program is executed by a processor.
This application provides a data screening method, device, equipment, and computer-readable storage medium. Through a data scoring model implemented based on a multi-task deep neural network, each interview answer text in the data set can be scored accurately and quickly; with an accurate score value for each interview answer text, the qualifying interview answer texts can be accurately filtered out of the data set, effectively improving the accuracy of data screening.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of a data screening method provided by an embodiment of this application;
FIG. 2 is a schematic flowchart of sub-steps of the data screening method in FIG. 1;
FIG. 3 is a schematic flowchart of another data screening method provided by an embodiment of this application;
FIG. 4 is a schematic block diagram of a data screening device provided by an embodiment of this application;
FIG. 5 is a schematic block diagram of sub-modules of the data screening device in FIG. 4;
FIG. 6 is a schematic block diagram of another data screening device provided by an embodiment of this application;
FIG. 7 is a schematic block diagram of the structure of a computer device related to an embodiment of this application.
Detailed Description
The technical solutions in the embodiments of this application will be described clearly and completely below with reference to the drawings in the embodiments of this application. Obviously, the described embodiments are only some of the embodiments of this application, not all of them. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the protection scope of this application.
The flowcharts shown in the drawings are only illustrative; they need not include all contents and operations/steps, nor must they be executed in the order described. For example, some operations/steps can be decomposed, combined, or partially merged, so the actual execution order may change according to the actual situation.
The embodiments of this application provide a data screening method, device, equipment, and computer-readable storage medium. The data screening method can be applied to a server, which can be a single server or a server cluster composed of multiple servers.
Some embodiments of this application are described in detail below with reference to the drawings. In the absence of conflict, the following embodiments and the features in them can be combined with each other.
Please refer to FIG. 1, which is a schematic flowchart of a data screening method provided by an embodiment of this application.
As shown in FIG. 1, the data screening method includes steps S101 to S103.
Step S101: Acquire a target data set, where the target data set is a data set to be filtered.
The server stores a data set to be screened. The data set to be screened includes the interview answer text of each interviewee for different positions; each interview answer text records the interviewee's basic personal information, the answers to each interview question, and the like. The server stores the interview answer texts of the interviewees position by position, thereby obtaining the data set corresponding to each position, and marks the screened and unscreened interview answer texts, obtaining the data set to be screened corresponding to each position; the interview answer texts in the data set to be screened are the unscreened interview answer texts.
The server can obtain the unscreened interview answer texts corresponding to each position in real time or at preset intervals and collect them position by position, obtaining the data set to be screened for each position, i.e., the target data set. It should be noted that the aforementioned preset interval can be set based on actual conditions, which is not specifically limited in this application.
In an embodiment, the recruiter can select one or more positions for data screening through a terminal device. Specifically, the terminal device displays a position selection page and obtains the position identifier corresponding to the position the user selects on that page; it generates a data screening request containing the position identifier and sends the request to the server. When the server receives the data screening request, it obtains the position identifier from the request, obtains the target data set corresponding to that identifier, and then screens the target data in the target data set to obtain data that meets the requirements. The position identifier is used to uniquely identify the position; it can be numbers, letters, or a combination of numbers and letters, which this application does not specifically limit. The terminal device can be an electronic device such as a mobile phone, tablet, laptop, desktop computer, personal digital assistant, or wearable device.
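For illustration only, this request-handling flow might look like the following minimal Python sketch; the field name "post_id" and the in-memory data set store are assumptions introduced for the example, not part of this application:

```python
# Illustrative sketch of the screening-request flow; "post_id" and the
# in-memory "datasets" store are hypothetical names, not from the patent.
datasets = {
    "POST-001": ["interview answer text A", "interview answer text B"],
    "POST-002": ["interview answer text C"],
}

def handle_screening_request(request: dict) -> list:
    """Extract the position identifier and return the matching target data set."""
    post_id = request["post_id"]       # uniquely identifies the position
    return datasets.get(post_id, [])   # the data set to be screened

print(handle_screening_request({"post_id": "POST-001"}))
```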
Step S102: Based on a preset data scoring model, score each interview answer text in the target data set to obtain a score value for each interview answer text, where the data scoring model is implemented based on a multi-task deep neural network.
The server stores a data scoring model implemented based on a multi-task deep neural network. The multi-task deep neural network combines multi-task learning with language model pre-training. Multi-task learning uses the useful information contained across multiple learning tasks to help each task learn, yielding a more accurate learner, while language model pre-training uses a large amount of unlabeled data to pre-train the model and then fine-tunes it for a single specific task, improving text representation learning and thus various natural language understanding tasks.
After the multi-task deep neural network is pre-trained with a large amount of unlabeled data and then fine-tuned for a single specific task, the data scoring model is obtained. The multi-task deep neural network includes an input layer, a Lexicon encoding layer (word encoding layer), a Transformer encoding layer (context encoding layer), and task-specific output layers; the task-specific output layers include a single-sentence classification output layer, a text similarity output layer, a paired text classification output layer, and a relevance ranking output layer. The Lexicon encoding layer is used to map an input text or sentence into an embedding vector by summing the corresponding word, segment, and position embeddings.
The Transformer encoding layer is composed of multiple identical levels, each of which includes two different sub-levels: one sub-level is a multi-head attention layer, used to learn the word dependencies within a sentence and capture its internal structure; the other is a fully connected layer, and each sub-level is connected to a residual connection layer and a normalization layer. The Transformer encoding layer pre-trains deep bidirectional representations by jointly conditioning on context in all layers; that is, it maps the embedding vectors to context embedding vectors.
The single-sentence classification output layer is used to judge the grammatical correctness of a sentence, or to judge the type of emotion it carries. Logistic regression with the softmax function predicts the probability that sentence X is labeled as class C; the formula is P_r(C|X) = softmax(W^T * X), where W^T is the model parameter of the single-sentence classification model.
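As a minimal illustrative sketch of this classification head (the vector dimension, random data, and two-class setup are assumptions made for the example, not taken from this application):

```python
import numpy as np

def softmax(z):
    z = z - z.max()                 # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def classify_sentence(x, W):
    """P_r(C|X) = softmax(W^T * X): class probabilities for sentence vector x."""
    return softmax(W.T @ x)

rng = np.random.default_rng(0)
x = rng.normal(size=8)              # context embedding of sentence X (dim 8, assumed)
W = rng.normal(size=(8, 2))         # head parameters for 2 classes, e.g. grammatical / not
print(classify_sentence(x, W))      # two probabilities summing to 1
```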
The text similarity output layer is used to judge the semantic similarity of two sentences. The formula is Sim(X_1, X_2) = g(W^T * x), where W^T is the model parameter of the text similarity model, x is the vector of the two sentences, and g(x) is the sigmoid function: the semantic similarity of the two sentences is computed first and then mapped into the interval 0-1 by the sigmoid function.
The paired text classification output layer is used to infer the logical relationship between two sentences, such as entailment, neutrality, or contradiction. Suppose the two sentences are P = (p_1, ..., p_m) and H = (h_1, ..., h_n); the goal is to infer the logical relationship R between P and H. M_p and M_h are the outputs of P and H after the encoding layers, respectively.
The relevance ranking output layer is used to score an interview answer text: given an interview answer text as input, it calculates the similarity between the interview answer text and the standard answer text and then scores based on that similarity. The formula is Rel(Q, A) = g(W^T * x), where W^T is the model parameter of the relevance ranking model, g(x) is the sigmoid function, and x is the concatenated vector of the answer text and the candidate answer: the semantic similarity between the answer text and the candidate answer is computed first, and the sigmoid function then maps the similarity output to 0-1.
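A minimal sketch of the relevance-ranking computation just described; the vector dimensions and random parameters are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relevance(answer_vec, candidate_vec, w):
    """Rel(Q, A) = g(w^T * x): x is the concatenation of the two text vectors,
    g is the sigmoid, so the relevance score lands in (0, 1)."""
    x = np.concatenate([answer_vec, candidate_vec])
    return sigmoid(w @ x)

rng = np.random.default_rng(1)
answer, candidate = rng.normal(size=8), rng.normal(size=8)
w = rng.normal(size=16)             # parameters of the relevance-ranking head (assumed)
print(relevance(answer, candidate, w))
```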
The training process of the model is divided into two main steps: pre-training and multi-task fine-tuning.
Pre-training: two unsupervised prediction tasks are used to pre-train the encoding layers (the Lexicon encoding layer and the Transformer encoding layer) to learn their parameters. The two unsupervised prediction tasks are Masked Language Modeling and Next Sentence Prediction. Masked language model: to train a deep bidirectional representation, a simple method is adopted: part of the input tokens are randomly masked, and only the masked tokens are predicted. Rather than always replacing the selected word with [MASK], the data generator does the following: 80% of the time it replaces the word with the [MASK] tag; 10% of the time it replaces the word with a random word; 10% of the time it keeps the word unchanged. Next sentence prediction: to train a model that understands the relationship between sentences, a binarized next-sentence prediction task is pre-trained; this task can be generated from any monolingual corpus. Specifically, when sentences A and B are selected as a pre-training sample, B is the actual next sentence of A 50% of the time and a random sentence from the corpus 50% of the time.
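The two pre-training data-generation rules can be sketched as follows; the toy vocabulary and corpus handling are assumptions introduced for the example:

```python
import random

def mask_tokens(tokens, mask_prob=0.15, vocab=("the", "a", "answer", "score")):
    """Apply the 80/10/10 masking rule to roughly mask_prob of the tokens."""
    out, targets = [], {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            targets[i] = tok                      # only masked positions are predicted
            r = random.random()
            if r < 0.8:
                out.append("[MASK]")              # 80%: replace with the [MASK] tag
            elif r < 0.9:
                out.append(random.choice(vocab))  # 10%: replace with a random word
            else:
                out.append(tok)                   # 10%: keep the word unchanged
        else:
            out.append(tok)
    return out, targets

def next_sentence_pair(corpus, i):
    """50%: (A, true next sentence, label 1); 50%: (A, random sentence, label 0)."""
    if random.random() < 0.5 and i + 1 < len(corpus):
        return corpus[i], corpus[i + 1], 1
    return corpus[i], random.choice(corpus), 0

print(mask_tokens("the interview answer text is scored".split()))
```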
Multi-task fine-tuning: the mini-batch gradient descent algorithm (Mini-batch Gradient Descent) is used to learn the parameters of the model (the encoding layers and the task-specific output layers). The steps are as follows:
1. Set the number of training rounds N and divide the data set into equally sized mini-batches D_1, D_2, ..., D_T.
2. For each training round, merge the data sets of the four specific tasks; for each mini-batch, update the model parameters by stochastic gradient descent, approaching the optimal solution with each iteration.
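A minimal sketch of this fine-tuning loop, assuming a toy model object whose fake gradients stand in for real backpropagation:

```python
import random

class ToyMultiTaskModel:
    """Stand-in with one shared parameter and one head parameter; the fake
    gradients exist only so the loop runs - a real model would backpropagate."""
    def __init__(self):
        self.params = {"shared": 0.0, "head": 0.0}
    def gradients(self, task, batch):
        return {"shared": random.uniform(-1, 1), "head": random.uniform(-1, 1)}

def multitask_finetune(task_datasets, model, epochs=2, batch_size=2, lr=0.01):
    for _ in range(epochs):                       # 1. number of training rounds N
        batches = []
        for task, data in task_datasets.items():  # 2. merge the task data sets
            random.shuffle(data)
            batches += [(task, data[i:i + batch_size])
                        for i in range(0, len(data), batch_size)]
        random.shuffle(batches)                   # equally sized mini-batches D_1..D_T
        for task, batch in batches:               # one stochastic gradient step per batch
            for name, g in model.gradients(task, batch).items():
                model.params[name] -= lr * g

model = ToyMultiTaskModel()
multitask_finetune({"similarity": [1, 2, 3, 4], "ranking": [5, 6]}, model)
print(model.params)
```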
For the data scoring task, the model is trained in the same way as multi-task fine-tuning to learn the model parameters of the data scoring model; only a small labeled data set is required to fine-tune the data scoring model and obtain a highly accurate one. The data scoring model includes an input layer, a word encoding layer (Lexicon encoding layer), a context encoding layer (Transformer encoding layer), and a data scoring layer.
After acquiring the target data set, the server can score each interview answer text in the target data set based on the preset data scoring model to obtain the score value of each interview answer text. The data scoring model can score the target data quickly and accurately, facilitating subsequent accurate screening of the target data set.
In an embodiment, specifically, referring to FIG. 2, step S102 includes sub-steps S1021 to S1023.
Sub-step S1021: Map each interview answer text in the target data set, in turn, to its corresponding embedding vector through the word encoding layer.
After the target data set is acquired, the word encoding layer in the data scoring model maps each interview answer text in the target data set, in turn, to its corresponding embedding vector. For example, the target data set includes five interview answer texts: interview answer text A, interview answer text B, interview answer text C, interview answer text D, and interview answer text E; after they are input to the word encoding layer, the corresponding embedding vectors are obtained, i.e., embedding vector a, embedding vector b, embedding vector c, embedding vector d, and embedding vector e.
Sub-step S1022: Map the embedding vector corresponding to each interview answer text, in turn, to its corresponding context embedding vector through the context encoding layer.
After the embedding vector of each interview answer text is obtained, the context encoding layer maps each of those embedding vectors, in turn, to its corresponding context embedding vector. For example, if the embedding vectors corresponding to the interview answer texts are embedding vector a, embedding vector b, embedding vector c, embedding vector d, and embedding vector e, then after they are input to the context encoding layer the corresponding context embedding vectors are obtained, i.e., embedding vector a1, embedding vector b1, embedding vector c1, embedding vector d1, and embedding vector e1.
Sub-step S1023: Based on the data scoring layer, determine the score value of each interview answer text according to the context embedding vector corresponding to each interview answer text.
After the context embedding vector of each interview answer text is obtained, the data scoring layer determines the score value of each interview answer text according to those context embedding vectors. The score value of each interview answer text can be determined from its context embedding vector and the model parameters of the data scoring model.
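The three sub-steps can be sketched end to end as follows; the seeded "encoders" are toy stand-ins for the trained word and context encoding layers, and cosine similarity followed by a sigmoid is one plausible reading of the data scoring layer, assumed for illustration:

```python
import numpy as np

def word_encode(text, dim=16):
    """Toy stand-in for the Lexicon (word) encoding layer: text -> embedding."""
    seed = sum(ord(c) for c in text)    # deterministic toy seed per text
    rng = np.random.default_rng(seed)
    return rng.normal(size=dim)

def context_encode(emb):
    """Toy stand-in for the Transformer (context) encoding layer."""
    return np.tanh(emb)

def score(context_vec, standard_vec):
    """Assumed data scoring layer: cosine similarity to the standard answer's
    text vector, mapped into (0, 1) with a sigmoid."""
    cos = context_vec @ standard_vec / (
        np.linalg.norm(context_vec) * np.linalg.norm(standard_vec))
    return 1.0 / (1.0 + np.exp(-cos))

texts = ["interview answer text A", "interview answer text B", "interview answer text C"]
standard = context_encode(word_encode("standard answer text"))
scores = {t: round(score(context_encode(word_encode(t)), standard), 3) for t in texts}
print(scores)
```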
In an embodiment, the server obtains the text vector corresponding to a preset standard answer text and, through the model parameters of the data scoring model, calculates the semantic similarity between the context embedding vector corresponding to each interview answer text and that text vector; the score value of each interview answer text is then determined from this semantic similarity. The server processes the standard answer text through the word encoding layer and the context encoding layer in advance to obtain and store the corresponding text vector, which is convenient for subsequent quick acquisition.
In an embodiment, according to a preset mapping function, the semantic similarity between the context embedding vector corresponding to each interview answer text and the text vector is mapped to obtain the score value of each interview answer text. It should be noted that the preset mapping function can be set based on actual conditions, which is not specifically limited in this application. Optionally, the preset mapping function is the sigmoid function.
In an embodiment, the server obtains the text vector corresponding to the answer text of each interview question in the preset standard answer text; determines the target text vector corresponding to the standard answer text from those text vectors; calculates the semantic similarity between the context embedding vector corresponding to each interview answer text and the target text vector; and determines the score value of each interview answer text from that semantic similarity.
The standard answer text includes the answer texts of multiple interview questions. The target text vector is determined by splicing the text vectors corresponding to the answer texts of the interview questions to obtain a text splicing vector, which is used as the target text vector corresponding to the standard answer text. The server processes the answer text of each interview question through the word encoding layer and the context encoding layer to obtain and store the corresponding text vectors for quick subsequent acquisition. Determining the target text vector of the standard answer text from the text vectors of the individual answer texts accurately characterizes the features of the standard answer text.
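A minimal sketch of the splicing step, with assumed per-question vector dimensions:

```python
import numpy as np

def target_text_vector(answer_vectors):
    """Splice (concatenate) the text vector of each interview question's answer
    into the target text vector of the standard answer text."""
    return np.concatenate(answer_vectors)

# Three questions with dim-4 answer-text vectors (values and sizes assumed).
per_question = [np.ones(4), np.zeros(4), 2 * np.ones(4)]
print(target_text_vector(per_question).shape)  # -> (12,)
```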
Step S103: Screen the target data set according to the score value of each interview answer text to obtain the interview answer texts that meet preset conditions.
After determining the score value of each interview answer text, the server screens the target data set according to those score values to obtain the interview answer texts that meet preset conditions; that is, the score value of each interview answer text is compared with a preset score threshold to obtain a score comparison result, and the target data set is screened based on that result to obtain the interview answer texts whose score value is greater than or equal to the preset threshold.
In an embodiment, it is determined whether the number of target data whose score value is greater than or equal to the preset score threshold is greater than or equal to a preset count; if so, the interview answer texts are sorted by score value to obtain an interview answer text queue, and, following the order of the queue, interview answer texts are selected in turn until their number reaches the preset count, thereby obtaining interview answer texts whose score value is greater than or equal to the preset threshold.
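The threshold comparison and queue-based selection can be sketched as follows; the threshold and count values are illustrative:

```python
def screen(scored_texts, threshold=0.6, max_count=None):
    """Keep texts whose score >= threshold; sort them into the interview answer
    text queue and optionally cap the number selected at a preset count."""
    kept = [(text, s) for text, s in scored_texts.items() if s >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)   # highest scores first
    return kept[:max_count] if max_count is not None else kept

print(screen({"A": 0.91, "B": 0.42, "C": 0.77}, threshold=0.6, max_count=2))
```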
The data screening method provided in the above embodiment can accurately and quickly score each interview answer text in the data set through a data scoring model implemented based on a multi-task deep neural network; with accurate score values for the interview answer texts, the qualifying interview answer texts can be accurately filtered out of the data set, effectively improving the accuracy of data screening.
Please refer to FIG. 3, which is a schematic flowchart of another data screening method provided by an embodiment of this application.
As shown in FIG. 3, the data screening method includes steps S201 to S207.
Step S201: Acquire a target data set, where the target data set is a data set to be filtered.
The server stores a data set to be screened. The data set to be screened includes the interview answer text of each interviewee for different positions; each interview answer text records the interviewee's basic personal information, the answers to each interview question, and the like. The server stores the interview answer texts of the interviewees position by position, thereby obtaining the data set corresponding to each position, and marks the screened and unscreened interview answer texts, obtaining the data set to be screened corresponding to each position; the interview answer texts in the data set to be screened are the unscreened interview answer texts.
The server can obtain the unscreened interview answer texts corresponding to each position in real time or at preset intervals and collect them position by position, obtaining the data set to be screened for each position, i.e., the target data set. It should be noted that the aforementioned preset interval can be set based on actual conditions, which is not specifically limited in this application.
Step S202: Map each interview answer text in the target data set, in turn, to its corresponding embedding vector through the word encoding layer.
After the target data set is acquired, the word encoding layer in the data scoring model maps each interview answer text in the target data set, in turn, to its corresponding embedding vector. For example, the target data set includes five interview answer texts: interview answer text A, interview answer text B, interview answer text C, interview answer text D, and interview answer text E; after they are input to the word encoding layer, the corresponding embedding vectors are obtained, i.e., embedding vector a, embedding vector b, embedding vector c, embedding vector d, and embedding vector e.
Step S203: Map the embedding vector corresponding to each interview answer text, in turn, to its corresponding context embedding vector through the context encoding layer.
After the embedding vector of each interview answer text is obtained, the context encoding layer maps each of those embedding vectors, in turn, to its corresponding context embedding vector. For example, if the embedding vectors corresponding to the interview answer texts are embedding vector a, embedding vector b, embedding vector c, embedding vector d, and embedding vector e, then after they are input to the context encoding layer the corresponding context embedding vectors are obtained, i.e., embedding vector a1, embedding vector b1, embedding vector c1, embedding vector d1, and embedding vector e1.
Step S204: Obtain the text vector corresponding to each standard answer text in a preset standard data set.
The preset standard data set includes multiple standard answer texts, and each standard answer text includes the correct answer. The server processes each standard answer text in the standard data set through the word encoding layer and the context encoding layer to obtain the text vector corresponding to each standard answer text.
Step S205: Calculate the semantic similarity between the context embedding vector corresponding to each interview answer text and each text vector.
Using the model parameters of the data scoring model, the context embedding vector corresponding to each interview answer text, and the text vector corresponding to each standard answer text, the semantic similarity between each context embedding vector and each text vector is calculated.
Step S206: Determine the score value of each interview answer text according to the semantic similarity between the context embedding vector corresponding to each interview answer text and each text vector.
After the semantic similarity between the context embedding vector of each interview answer text and each text vector is obtained, the score value of each interview answer text is determined from those semantic similarities. Through the similarity between the target data and multiple standard answer texts, the score value of each interview answer text can be determined more accurately.
In an embodiment, the target similarity corresponding to each interview answer text is determined according to the semantic similarity between its context embedding vector and each text vector; the score value of each interview answer text is then determined according to its target similarity, i.e., according to a preset mapping function, the semantic similarity corresponding to each interview answer text is mapped to obtain the score value of each interview answer text.
The target similarity is determined as follows: taking the interview answer text as a unit, the semantic similarities between the context embedding vector of that interview answer text and the text vector of each standard answer text are collected to form the semantic similarity set of that text, one set per interview answer text; the maximum semantic similarity in the set is taken as the target similarity of that interview answer text.
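A minimal sketch of the target-similarity rule, with made-up similarity values:

```python
def target_similarities(similarity_sets):
    """similarity_sets maps each interview answer text to its semantic
    similarity set (one similarity per standard answer text); the maximum
    of each set is that text's target similarity."""
    return {text: max(sims) for text, sims in similarity_sets.items()}

print(target_similarities({"A": [0.41, 0.83, 0.37], "B": [0.30, 0.25, 0.52]}))
```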
Step S207: Screen the target data set according to the score value of each interview answer text to obtain the interview answer texts that meet preset conditions.
After determining the score value of each interview answer text, the server screens the target data set according to those score values to obtain the interview answer texts that meet preset conditions; that is, the score value of each interview answer text is compared with a preset score threshold to obtain a score comparison result, and the target data set is screened based on that result to obtain the interview answer texts whose score value is greater than or equal to the preset threshold.
The data screening method provided in the foregoing embodiment can score interview answer texts still more accurately through a data scoring model implemented based on a multi-task deep neural network together with multiple standard answer texts; based on those scores, the qualifying interview answer texts can be accurately screened out of the data set, effectively improving the accuracy with which job candidates are selected.
Please refer to FIG. 4, which is a schematic block diagram of a data screening device provided by an embodiment of this application.
As shown in FIG. 4, the data screening device 300 includes an acquisition module 301, a scoring module 302, and a screening module 303.
The acquisition module 301 is used to acquire a target data set, where the target data set is a data set to be filtered;
the scoring module 302 is used to score each interview answer text in the target data set based on a preset data scoring model to obtain the score value of each interview answer text, where the data scoring model is implemented based on a multi-task deep neural network;
the screening module 303 is used to screen the target data set according to the score value of each interview answer text to obtain the interview answer texts that meet preset conditions.
In an embodiment, as shown in FIG. 5, the scoring module 302 includes:
a first vector determination sub-module 3021, used to map each interview answer text in the target data set, in turn, to its corresponding embedding vector through the word encoding layer;
a second vector determination sub-module 3022, used to map the embedding vector corresponding to each interview answer text, in turn, to its corresponding context embedding vector through the context encoding layer;
a scoring sub-module 3023, used to determine, based on the data scoring layer, the score value of each interview answer text according to the context embedding vector corresponding to each interview answer text.
In an embodiment, the scoring sub-module 3023 is also used to obtain the text vector corresponding to a preset standard answer text; calculate the semantic similarity between the context embedding vector corresponding to each interview answer text and the text vector; and determine the score value of each interview answer text according to that semantic similarity.
In an embodiment, the scoring sub-module 3023 is also used to map, according to a preset mapping function, the semantic similarity between the context embedding vector corresponding to each interview answer text and the text vector, obtaining the score value of each interview answer text.
In an embodiment, the screening module 303 is also used to compare the score value of each interview answer text with a preset score threshold to obtain a score comparison result, and to screen the target data set according to the score comparison result, obtaining the interview answer texts whose score value is greater than or equal to the preset threshold.
Please refer to FIG. 6, which is a schematic block diagram of another data screening device provided by an embodiment of this application.
As shown in FIG. 6, the data screening device 400 includes an acquisition module 401, a vector determination module 402, a calculation module 403, a scoring module 404, and a screening module 405.
The acquisition module 401 is used to acquire a target data set, where the target data set is a data set to be filtered;
the vector determination module 402 is used to map each interview answer text in the target data set, in turn, to its corresponding embedding vector through the word encoding layer;
the vector determination module 402 is also used to map the embedding vector corresponding to each interview answer text, in turn, to its corresponding context embedding vector through the context encoding layer;
the acquisition module 401 is also used to obtain the text vector corresponding to each standard answer text in a preset standard data set;
the calculation module 403 is used to calculate the semantic similarity between the context embedding vector corresponding to each interview answer text and each text vector;
the scoring module 404 is used to determine the score value of each interview answer text according to the semantic similarity between the context embedding vector corresponding to each interview answer text and each text vector;
the screening module 405 is used to screen the target data set according to the score value of each interview answer text to obtain the interview answer texts that meet preset conditions.
In an embodiment, the scoring module 404 is also used to determine the target similarity corresponding to each interview answer text according to the semantic similarity between the context embedding vector corresponding to each interview answer text and each text vector, and to determine the score value of each interview answer text according to its target similarity.
It should be noted that those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the device and of the modules and units described above can refer to the corresponding processes in the foregoing embodiments of the data screening method and are not repeated here.
The device provided in the foregoing embodiments can be implemented in the form of a computer program, and the computer program can run on the computer device shown in FIG. 7.
Please refer to FIG. 7, which is a schematic block diagram of the structure of a computer device provided by an embodiment of this application. The computer device can be a server.
As shown in FIG. 7, the computer device includes a processor, a memory, and a network interface connected through a system bus, where the memory can include a non-volatile storage medium and an internal memory.
The non-volatile storage medium can store an operating system and a computer program. The computer program includes program instructions which, when executed, can cause the processor to execute any of the data screening methods.
The processor is used to provide computing and control capabilities and to support the operation of the entire computer device.
The internal memory provides an environment for running the computer program in the non-volatile storage medium; when the computer program is executed by the processor, it can cause the processor to execute any of the data screening methods.
The network interface is used for network communication, such as sending assigned tasks. Those skilled in the art can understand that the structure shown in FIG. 7 is only a block diagram of part of the structure related to the solution of this application and does not constitute a limitation on the computer device to which the solution of this application is applied; a specific computer device may include more or fewer components than shown in the figure, combine some components, or have a different arrangement of components.
It should be understood that the processor may be a central processing unit (Central Processing Unit, CPU), and the processor may also be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
In an embodiment, the processor is used to run a computer program stored in the memory to implement the following steps:
acquiring a target data set, where the target data set is a data set to be filtered;
scoring each interview answer text in the target data set based on a preset data scoring model to obtain a score value for each interview answer text, where the data scoring model is implemented based on a multi-task deep neural network;
screening the target data set according to the score value of each interview answer text to obtain the interview answer texts that meet preset conditions.
Optionally, the data scoring model includes a word encoding layer, a context encoding layer, and a data scoring layer; when the processor scores each interview answer text in the target data set based on the preset data scoring model to obtain the score value of each interview answer text, it is used to implement:
mapping each interview answer text in the target data set, in turn, to its corresponding embedding vector through the word encoding layer;
mapping the embedding vector corresponding to each interview answer text, in turn, to its corresponding context embedding vector through the context encoding layer;
determining, based on the data scoring layer, the score value of each interview answer text according to the context embedding vector corresponding to each interview answer text.
In an embodiment, when the processor determines, based on the data scoring layer, the score value of each interview answer text according to the context embedding vector corresponding to each interview answer text, it is used to implement:
obtaining the text vector corresponding to each standard answer text in a preset standard data set;
calculating the semantic similarity between the context embedding vector corresponding to each interview answer text and each text vector;
determining the score value of each interview answer text according to the semantic similarity between the context embedding vector corresponding to each interview answer text and each text vector.
In an embodiment, when the processor determines the score value of each interview answer text according to the semantic similarity between the context embedding vector corresponding to each interview answer text and each text vector, it is used to implement:
determining the target similarity corresponding to each interview answer text according to the semantic similarity between the context embedding vector corresponding to each interview answer text and each text vector;
determining the score value of each interview answer text according to the target similarity corresponding to each interview answer text.
In an embodiment, when the processor determines, based on the data scoring layer, the score value of each interview answer text according to the context embedding vector corresponding to each interview answer text, it is also used to implement:
obtaining the text vector corresponding to a preset standard answer text;
calculating the semantic similarity between the context embedding vector corresponding to each interview answer text and the text vector;
determining the score value of each interview answer text according to the semantic similarity between the context embedding vector corresponding to each interview answer text and the text vector.
In an embodiment, when the processor determines the score value of each interview answer text according to the semantic similarity between the context embedding vector corresponding to each interview answer text and the text vector, it is used to implement:
mapping, according to a preset mapping function, the semantic similarity between the context embedding vector corresponding to each interview answer text and the text vector, obtaining the score value of each interview answer text.
In an embodiment, when the processor screens the target data set according to the score value of each interview answer text to obtain interview answer texts that meet preset conditions, it is used to implement:
comparing the score value of each interview answer text with a preset score threshold to obtain a score comparison result;
screening the target data set according to the score comparison result, obtaining the interview answer texts whose score value is greater than or equal to the preset threshold.
It should be noted that those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working process of the computer device described above can refer to the corresponding processes in the foregoing embodiments of the data screening method and is not repeated here.
The embodiments of this application also provide a computer-readable storage medium, which may be non-volatile or volatile. A computer program is stored on the computer-readable storage medium, and the computer program includes program instructions; for the method implemented when the program instructions are executed, reference may be made to the various embodiments of the data screening method of this application.
The computer-readable storage medium may be the internal storage unit of the computer device described in the foregoing embodiments, such as the hard disk or memory of the computer device. The computer-readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, or a flash card (Flash Card) equipped on the computer device.
The above are only specific implementations of this application, but the protection scope of this application is not limited to them; any person familiar with this technical field can easily think of changes or substitutions within the technical scope disclosed by this application, and they should all be covered within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims (20)

  1. A data screening method, wherein the method comprises:
    acquiring a target data set, wherein the target data set is a data set to be filtered;
    scoring each interview answer text in the target data set based on a preset data scoring model to obtain a score value for each interview answer text, wherein the data scoring model is implemented based on a multi-task deep neural network;
    screening the target data set according to the score value of each interview answer text to obtain the interview answer texts that meet preset conditions.
  2. The data screening method according to claim 1, wherein the data scoring model comprises a word encoding layer, a context encoding layer, and a data scoring layer; and scoring each interview answer text in the target data set based on the preset data scoring model to obtain the score value of each interview answer text comprises:
    mapping each interview answer text in the target data set, in turn, to its corresponding embedding vector through the word encoding layer;
    mapping the embedding vector corresponding to each interview answer text, in turn, to its corresponding context embedding vector through the context encoding layer;
    determining, based on the data scoring layer, the score value of each interview answer text according to the context embedding vector corresponding to each interview answer text.
  3. The data screening method according to claim 2, wherein determining, based on the data scoring layer, the score value of each interview answer text according to the context embedding vector corresponding to each interview answer text comprises:
    obtaining the text vector corresponding to each standard answer text in a preset standard data set;
    calculating the semantic similarity between the context embedding vector corresponding to each interview answer text and each text vector;
    determining the score value of each interview answer text according to the semantic similarity between the context embedding vector corresponding to each interview answer text and each text vector.
  4. The data screening method according to claim 3, wherein determining the score value of each interview answer text according to the semantic similarity between the context embedding vector corresponding to each interview answer text and each text vector comprises:
    determining the target similarity corresponding to each interview answer text according to the semantic similarity between the context embedding vector corresponding to each interview answer text and each text vector;
    determining the score value of each interview answer text according to the target similarity corresponding to each interview answer text.
  5. The data screening method according to claim 2, wherein determining, based on the data scoring layer, the score value of each interview answer text according to the context embedding vector corresponding to each interview answer text comprises:
    obtaining the text vector corresponding to a preset standard answer text;
    calculating the semantic similarity between the context embedding vector corresponding to each interview answer text and the text vector;
    determining the score value of each interview answer text according to the semantic similarity between the context embedding vector corresponding to each interview answer text and the text vector.
  6. The data screening method according to claim 5, wherein determining the score value of each interview answer text according to the semantic similarity between the context embedding vector corresponding to each interview answer text and the text vector comprises:
    mapping, according to a preset mapping function, the semantic similarity between the context embedding vector corresponding to each interview answer text and the text vector, to obtain the score value of each interview answer text.
  7. The data screening method according to any one of claims 1 to 6, wherein screening the target data set according to the score value of each interview answer text to obtain interview answer texts that meet preset conditions comprises:
    comparing the score value of each interview answer text with a preset score threshold to obtain a score comparison result;
    screening the target data set according to the score comparison result to obtain the interview answer texts whose score value is greater than or equal to the preset threshold.
  8. A data screening device, wherein the data screening device comprises:
    an acquisition module for acquiring a target data set, wherein the target data set is a data set to be filtered;
    a scoring module for scoring each interview answer text in the target data set based on a preset data scoring model to obtain the score value of each interview answer text, wherein the data scoring model is implemented based on a multi-task deep neural network;
    a screening module for screening the target data set according to the score value of each interview answer text to obtain the interview answer texts that meet preset conditions.
  9. A computer device, wherein the computer device comprises a memory and a processor connected to each other, the memory is used to store a computer program configured to be executed by the processor, and the computer program is configured to execute a data screening method:
    wherein the method comprises:
    acquiring a target data set, wherein the target data set is a data set to be filtered;
    scoring each interview answer text in the target data set based on a preset data scoring model to obtain a score value for each interview answer text, wherein the data scoring model is implemented based on a multi-task deep neural network;
    screening the target data set according to the score value of each interview answer text to obtain the interview answer texts that meet preset conditions.
  10. The computer device according to claim 9, wherein the data scoring model comprises a word encoding layer, a context encoding layer, and a data scoring layer; and scoring each interview answer text in the target data set based on the preset data scoring model to obtain the score value of each interview answer text comprises:
    mapping each interview answer text in the target data set, in turn, to its corresponding embedding vector through the word encoding layer;
    mapping the embedding vector corresponding to each interview answer text, in turn, to its corresponding context embedding vector through the context encoding layer;
    determining, based on the data scoring layer, the score value of each interview answer text according to the context embedding vector corresponding to each interview answer text.
  11. The computer device according to claim 10, wherein determining, based on the data scoring layer, the score value of each interview answer text according to the context embedding vector corresponding to each interview answer text comprises:
    obtaining the text vector corresponding to each standard answer text in a preset standard data set;
    calculating the semantic similarity between the context embedding vector corresponding to each interview answer text and each text vector;
    determining the score value of each interview answer text according to the semantic similarity between the context embedding vector corresponding to each interview answer text and each text vector.
  12. The computer device according to claim 11, wherein determining the score value of each interview answer text according to the semantic similarity between the context embedding vector corresponding to each interview answer text and each text vector comprises:
    determining the target similarity corresponding to each interview answer text according to the semantic similarity between the context embedding vector corresponding to each interview answer text and each text vector;
    determining the score value of each interview answer text according to the target similarity corresponding to each interview answer text.
  13. The computer device according to claim 10, wherein determining, based on the data scoring layer, the score value of each interview answer text according to the context embedding vector corresponding to each interview answer text comprises:
    obtaining the text vector corresponding to a preset standard answer text;
    calculating the semantic similarity between the context embedding vector corresponding to each interview answer text and the text vector;
    determining the score value of each interview answer text according to the semantic similarity between the context embedding vector corresponding to each interview answer text and the text vector.
  14. The computer device according to claim 13, wherein determining the score value of each interview answer text according to the semantic similarity between the context embedding vector corresponding to each interview answer text and the text vector comprises:
    mapping, according to a preset mapping function, the semantic similarity between the context embedding vector corresponding to each interview answer text and the text vector, to obtain the score value of each interview answer text.
  15. The computer device according to any one of claims 9 to 14, wherein screening the target data set according to the score value of each interview answer text to obtain interview answer texts that meet preset conditions comprises:
    comparing the score value of each interview answer text with a preset score threshold to obtain a score comparison result;
    screening the target data set according to the score comparison result to obtain the interview answer texts whose score value is greater than or equal to the preset threshold.
  16. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program which, when executed by a processor, is used to implement a data screening method comprising the following steps:
    acquiring a target data set, wherein the target data set is a data set to be filtered;
    scoring each interview answer text in the target data set based on a preset data scoring model to obtain a score value for each interview answer text, wherein the data scoring model is implemented based on a multi-task deep neural network;
    screening the target data set according to the score value of each interview answer text to obtain the interview answer texts that meet preset conditions.
  17. The computer-readable storage medium according to claim 16, wherein the data scoring model comprises a word encoding layer, a context encoding layer, and a data scoring layer; and scoring each interview answer text in the target data set based on the preset data scoring model to obtain the score value of each interview answer text comprises:
    mapping each interview answer text in the target data set, in turn, to its corresponding embedding vector through the word encoding layer;
    mapping the embedding vector corresponding to each interview answer text, in turn, to its corresponding context embedding vector through the context encoding layer;
    determining, based on the data scoring layer, the score value of each interview answer text according to the context embedding vector corresponding to each interview answer text.
  18. The computer-readable storage medium according to claim 17, wherein determining, based on the data scoring layer, the score value of each interview answer text according to the context embedding vector corresponding to each interview answer text comprises:
    obtaining the text vector corresponding to each standard answer text in a preset standard data set;
    calculating the semantic similarity between the context embedding vector corresponding to each interview answer text and each text vector;
    determining the score value of each interview answer text according to the semantic similarity between the context embedding vector corresponding to each interview answer text and each text vector.
  19. The computer-readable storage medium according to claim 18, wherein determining the score value of each interview answer text according to the semantic similarity between the context embedding vector corresponding to each interview answer text and each text vector comprises:
    determining the target similarity corresponding to each interview answer text according to the semantic similarity between the context embedding vector corresponding to each interview answer text and each text vector;
    determining the score value of each interview answer text according to the target similarity corresponding to each interview answer text.
  20. The computer-readable storage medium according to claim 17, wherein determining, based on the data scoring layer, the score value of each interview answer text according to the context embedding vector corresponding to each interview answer text comprises:
    obtaining the text vector corresponding to a preset standard answer text;
    calculating the semantic similarity between the context embedding vector corresponding to each interview answer text and the text vector;
    determining the score value of each interview answer text according to the semantic similarity between the context embedding vector corresponding to each interview answer text and the text vector.
PCT/CN2020/117418 2019-10-16 2020-09-24 Data screening method, device, equipment, and computer-readable storage medium WO2021073390A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910984851.7A CN110929524A (zh) 2019-10-16 2019-10-16 Data screening method, device, equipment, and computer-readable storage medium
CN201910984851.7 2019-10-16

Publications (1)

Publication Number Publication Date
WO2021073390A1 true WO2021073390A1 (zh) 2021-04-22

Family

ID=69849238

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/117418 WO2021073390A1 (zh) 2019-10-16 2020-09-24 Data screening method, device, equipment, and computer-readable storage medium

Country Status (2)

Country Link
CN (1) CN110929524A (zh)
WO (1) WO2021073390A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116226481A (zh) * 2022-12-30 2023-06-06 北京视友科技有限责任公司 EEG-based experimental data screening method, system, and storage medium

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110929524A (zh) * 2019-10-16 2020-03-27 平安科技(深圳)有限公司 Data screening method, device, equipment, and computer-readable storage medium
CN111694937A (zh) * 2020-04-26 2020-09-22 平安科技(深圳)有限公司 Artificial intelligence-based interview method and device, computer device, and storage medium
CN111695591B (zh) * 2020-04-26 2024-05-10 平安科技(深圳)有限公司 AI-based interview corpus classification method and device, computer device, and medium
CN112084764B (zh) * 2020-09-02 2022-06-17 北京字节跳动网络技术有限公司 Data detection method and device, storage medium, and equipment
CN112686020B (zh) * 2020-12-29 2024-06-04 科大讯飞股份有限公司 Essay scoring method and device, electronic device, and storage medium
CN113609121A (zh) * 2021-08-17 2021-11-05 平安资产管理有限责任公司 Artificial intelligence-based target data processing method and device, equipment, and medium
CN116469448B (zh) * 2022-02-18 2024-02-02 武汉置富半导体技术有限公司 Flash memory particle screening method and device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170270409A1 (en) * 2016-03-16 2017-09-21 Maluuba Inc. Parallel-hierarchical model for machine comprehension on small data
CN109670168A (zh) * 2018-11-14 2019-04-23 华南师范大学 Feature learning-based automatic short-answer scoring method, system, and storage medium
CN109933661A (zh) * 2019-04-03 2019-06-25 上海乐言信息科技有限公司 Semi-supervised question-answer pair induction method and system based on a deep generative model
CN110046244A (zh) * 2019-04-24 2019-07-23 中国人民解放军国防科技大学 Answer selection method for question answering systems
CN110929524A (zh) * 2019-10-16 2020-03-27 平安科技(深圳)有限公司 Data screening method, device, equipment, and computer-readable storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170270409A1 (en) * 2016-03-16 2017-09-21 Maluuba Inc. Parallel-hierarchical model for machine comprehension on small data
CN109670168A (zh) * 2018-11-14 2019-04-23 华南师范大学 Feature learning-based automatic short-answer scoring method, system, and storage medium
CN109933661A (zh) * 2019-04-03 2019-06-25 上海乐言信息科技有限公司 Semi-supervised question-answer pair induction method and system based on a deep generative model
CN110046244A (zh) * 2019-04-24 2019-07-23 中国人民解放军国防科技大学 Answer selection method for question answering systems
CN110929524A (zh) * 2019-10-16 2020-03-27 平安科技(深圳)有限公司 Data screening method, device, equipment, and computer-readable storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116226481A (zh) * 2022-12-30 2023-06-06 北京视友科技有限责任公司 EEG-based experimental data screening method, system, and storage medium
CN116226481B (zh) * 2022-12-30 2023-11-21 北京视友科技有限责任公司 EEG-based experimental data screening method, system, and storage medium

Also Published As

Publication number Publication date
CN110929524A (zh) 2020-03-27 Data screening method, device, equipment, and computer-readable storage medium

Similar Documents

Publication Publication Date Title
WO2021073390A1 (zh) Data screening method, device, equipment, and computer-readable storage medium
CN112270196B (zh) Entity relation recognition method and device, and electronic device
US20190347571A1 (en) Classifier training
US20180068221A1 (en) System and Method of Advising Human Verification of Machine-Annotated Ground Truth - High Entropy Focus
US20180068222A1 (en) System and Method of Advising Human Verification of Machine-Annotated Ground Truth - Low Entropy Focus
US20170017716A1 (en) Generating Probabilistic Annotations for Entities and Relations Using Reasoning and Corpus-Level Evidence
US11669740B2 (en) Graph-based labeling rule augmentation for weakly supervised training of machine-learning-based named entity recognition
US11216739B2 (en) System and method for automated analysis of ground truth using confidence model to prioritize correction options
US10824808B2 (en) Robust key value extraction
CN112101042A (zh) Text emotion recognition method and device, terminal device, and storage medium
CN112380421A (zh) Resume search method and device, electronic device, and computer storage medium
CN113505786A (zh) Method and device for grading photographed test questions, and electronic device
CN110929532B (zh) Data processing method, device, equipment, and storage medium
CN114722832A (zh) Abstract extraction method and device, equipment, and storage medium
US20230047800A1 (en) Artificial intelligence-assisted non-pharmaceutical intervention data curation
US20140207712A1 (en) Classifying Based on Extracted Information
WO2021174814A1 (zh) Answer verification method and device for crowdsourcing tasks, computer device, and storage medium
CN114898426B (zh) Synonymous label aggregation method and device, equipment, and storage medium
CN112529743B (zh) Contract element extraction method and device, electronic device, and medium
WO2023000725A1 (zh) Named entity recognition method and device for electric power metering, and computer device
US20220277197A1 (en) Enhanced word embedding
CN113988085B (zh) Text semantic similarity matching method and device, electronic device, and storage medium
CN113722477B (zh) Multi-task learning-based netizen emotion recognition method and system, and electronic device
US20220129784A1 (en) Predicting topic sentiment using a machine learning model trained with observations in which the topics are masked
JP6026036B1 (ja) Data analysis system, control method therefor, program, and recording medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20877578

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20877578

Country of ref document: EP

Kind code of ref document: A1