WO2021027029A1 - Data processing method, apparatus, computer device and storage medium - Google Patents
Data processing method, apparatus, computer device and storage medium
- Publication number
- WO2021027029A1 (international application PCT/CN2019/107727)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data
- interviewer
- emotion
- voice
- speech
- Prior art date
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/40—Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
- G06F16/45—Clustering; Classification
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
Definitions
- This application relates to a data processing method, device, computer equipment and storage medium.
- Traditional intelligent interview systems mostly recognize facial micro-expressions to find abnormal expressions of the interviewee, which serve as one of the bases for risk assessment.
- Micro-expression is a psychological term. People express their inner feelings to each other by making facial expressions, but between the different expressions a person makes, or within a single expression, the face can "leak" other information. The shortest micro-expression can last as little as 1/25 of a second. Although such a subconscious expression may last only a moment, it sometimes expresses the opposite emotion.
- a data processing method, device, computer equipment, and storage medium are provided.
- a data processing method includes:
- the grammar analysis network is obtained by training on second sample text data; and
- the interview result of the interviewer is determined.
- a data processing device includes:
- the acquisition module is used to acquire interviewer audio data and interviewer video data;
- the first extraction module is used to extract the interviewer's micro-speech feature based on the interviewer's audio data, and obtain the first voice emotion data according to the micro-speech feature;
- the first processing module is used to convert the interviewer's audio data into text data, split the text data into multiple sentences, perform word segmentation on the sentences, search the preset dictionary corresponding to the trained emotion classification network according to each word in each sentence, determine the confidence that the text data belongs to each preset emotion category according to the search and matching results, and obtain the second speech emotion data; the emotion classification network is trained on the first sample text data;
- the second processing module is used to input text data into the trained grammatical analysis network to obtain the grammatical score of each sentence in the text data, calculate the average value of the grammatical score of each sentence, and obtain the grammatical score of the text data.
- the second extraction module is used to randomly intercept video frames from the interviewer's video data, extract the interviewer's micro-expression features according to the video frames, and obtain the confidence level of the video data according to the micro-expression features;
- the analysis module is used to determine the interview result of the interviewer according to the first voice emotion data, the second voice emotion data, the grammar score and the confidence of the video data.
- a computer device includes a memory and one or more processors, the memory storing computer-readable instructions which, when executed by the one or more processors, cause the one or more processors to perform the following steps:
- the grammar analysis network is obtained by training on second sample text data; and
- the interview result of the interviewer is determined.
- One or more non-volatile computer-readable storage media store computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:
- the grammar analysis network is obtained by training on second sample text data; and
- the interview result of the interviewer is determined.
- Fig. 1 is an application scenario diagram of a data processing method according to one or more embodiments.
- Fig. 2 is a schematic flowchart of a data processing method according to one or more embodiments.
- FIG. 3 is a schematic sub-flow diagram of step S204 in FIG. 2 according to one or more embodiments.
- FIG. 4 is a schematic sub-flow diagram of step S204 in FIG. 2 according to one or more embodiments.
- Fig. 5 is a schematic sub-flow diagram of step S204 in Fig. 2 according to one or more embodiments.
- FIG. 6 is a schematic sub-flow diagram of step S206 in FIG. 2 according to one or more embodiments.
- FIG. 7 is a schematic sub-flow diagram of step S212 in FIG. 2 according to one or more embodiments.
- Fig. 8 is a block diagram of a data processing device according to one or more embodiments.
- Figure 9 is a block diagram of a computer device according to one or more embodiments.
- the data processing method provided in this application can be applied to the application environment as shown in FIG. 1.
- the terminal 102 and the server 104 communicate through the network.
- the server 104 obtains the interviewer's audio data and the interviewer's video data, extracts the interviewer's micro-speech features from the interviewer's audio data, and obtains the first voice emotion data according to the micro-speech features.
- the server converts the interviewer's audio data into text data, splits the text data into multiple sentences, and performs word segmentation on the sentences; according to the words in each sentence, it searches the preset dictionary corresponding to the trained emotion classification network, determines from the search and matching results the confidence that the text data belongs to each preset emotion category, and obtains the second speech emotion data; the emotion classification network is trained on the first sample text data.
- the server inputs the text data into the trained grammar analysis network to obtain the grammar score of each sentence in the text data, and calculates the average of the sentence grammar scores to obtain the grammar score of the text data; the grammar analysis network is trained on the second sample text data.
- the server randomly intercepts video frames from the interviewer's video data, extracts the interviewer's micro-expression features from the video frames, and obtains the video data confidence according to the micro-expression features.
- finally, the server determines the interviewer's interview result according to the first voice emotion data, the second voice emotion data, the grammar score, and the video data confidence, and pushes the result to the terminal 102.
- the terminal 102 may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices.
- the server 104 may be implemented by an independent server or a server cluster composed of multiple servers.
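- As a rough illustration of the flow just described, the Python sketch below wires the stages together. Every stage is passed in as a callable because the application does not prescribe concrete implementations; the stage names, the dictionary keys, and the final weighted combination are assumptions made only for illustration.

```python
# Minimal sketch of the overall pipeline described above; every stage is an
# injected callable because the application does not fix concrete implementations.
def process_interview(audio, video, stages, confidence_weight=0.5):
    micro_speech = stages["extract_micro_speech"](audio)       # speech rate, pitch, MFCCs
    first_emotion = stages["speech_emotion"](micro_speech)     # confidences per emotion category

    text = stages["speech_to_text"](audio)
    sentences = stages["split_sentences"](text)
    second_emotion = stages["text_emotion"](sentences)         # confidences per emotion category
    grammar_scores = [stages["grammar_score"](s) for s in sentences]
    grammar_score = sum(grammar_scores) / len(grammar_scores) if grammar_scores else 0.0

    frames = stages["sample_frames"](video)                    # frames sampled at preset intervals
    expression_conf = stages["micro_expression"](frames)       # confidences per emotion category
    video_conf = max(expression_conf)                          # video data confidence = maximum confidence

    audio_conf = stages["audio_fusion"](first_emotion, second_emotion, grammar_score)
    # The final combination uses a preset confidence parameter; a weighted sum is
    # one plausible reading, not the method prescribed by the application.
    return confidence_weight * audio_conf + (1 - confidence_weight) * video_conf
```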
- a data processing method is provided. Taking the method applied to the server in FIG. 1 as an example for description, the method includes the following steps:
- Step S202 Obtain interviewer audio data and interviewer video data.
- the interviewer's video data refers to the video data recorded by the interviewer during the interview
- the interviewer's audio data refers to the interviewer's audio data during the interview.
- the interviewer's audio data can be extracted from the interviewer's video data.
- Step S204 Extract the interviewer's micro-speech feature according to the interviewer's audio data, and obtain the first voice emotion data according to the micro-speech feature.
- the server can extract the interviewer's micro-speech features from the interviewer's audio data by calling the speech feature extraction tool.
- the micro-speech features include speech rate features, pitch features, and Mel frequency cepstral coefficients.
- Speaking rate refers to the number of words per second in the voice data.
- the words can be Chinese or English.
- Pitch refers to the level of voice frequency.
- The Mel frequency cepstrum is obtained by taking the logarithmic energy spectrum on the non-linear Mel scale of audio frequency and applying a linear transform (a discrete cosine transform); the Mel frequency cepstral coefficients are the coefficients that make up the Mel frequency cepstrum.
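- As a small illustration of the speech-rate feature defined above, the snippet below counts words per second; treating each Chinese character as one word is an assumption made for the example, not something the application specifies.

```python
def speaking_rate(transcript: str, duration_seconds: float) -> float:
    """Words per second for a transcribed segment."""
    if duration_seconds <= 0:
        return 0.0
    # For English, split on whitespace; for Chinese, count characters (illustrative convention).
    tokens = transcript.split() if " " in transcript else [c for c in transcript if not c.isspace()]
    return len(tokens) / duration_seconds
```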
- the server inputs the micro-speech features into the voice emotion classification model matching the interviewer’s gender information in the trained voice emotion classification model set, and can obtain the first speech emotion data corresponding to the micro-speech feature.
- the first speech emotion data refers to the confidence that the micro-speech features belong to each preset emotion category.
- the set of trained speech emotion classification models includes speech emotion classification models trained on sample data of interviewers of different genders, that is, an emotion classification model for analyzing male speech data and an emotion classification model for analyzing female speech data.
- the server will obtain the interviewer's gender information, match the trained voice emotion classification model set according to the interviewer's gender information, and obtain a voice emotion classification model matching the interviewer's gender information from the trained voice emotion classification model set.
- the voice emotion classification model is trained from sample voice data carrying annotation information.
- the annotation information includes emotion category information and gender information.
- the server divides the sample voice data according to gender information, and performs model training according to the divided sample voice data to obtain a set of voice emotion classification models.
- Step S206: Convert the interviewer's audio data into text data, split the text data into multiple sentences, and perform word segmentation on the sentences; according to the words in each sentence, search the preset dictionary corresponding to the trained emotion classification network, determine from the search and matching results the confidence that the text data belongs to each preset emotion category, and obtain the second voice emotion data; the emotion classification network is trained on the first sample text data.
- the emotion classification network can be a BERT-based network with a superimposed classification layer containing N neurons (assuming N preset emotion categories).
- the server splits the text data into multiple sentences, performs word segmentation on each sentence, searches the BERT dictionary according to the words in each sentence, converts each word into its corresponding serial number in the BERT dictionary, and inputs the serial-number sequence of the entire sentence into BERT to obtain the confidence that each sentence belongs to each preset emotion category; then, according to the confidence that each sentence belongs to each preset emotion category, the confidence that the text data belongs to each preset emotion category is determined, and the second voice emotion data is obtained.
- the emotion classification network can be obtained by training on the first sample text data. Each sample sentence in the first sample text data carries label information, and the label information is the emotion category information of each sample sentence.
- this processing also optimizes the use of the server's cache space.
- Step S208 Input the text data into the trained grammatical analysis network to obtain the grammar score of each sentence in the text data, calculate the average of the grammatical scores of each sentence, and obtain the grammar score of the text data.
- the grammatical analysis network is obtained by training on the second sample text data.
- the second sample text data may be the CoLA (Corpus of Linguistic Acceptability) data set.
- the data set includes multiple single sentences with annotations marking whether each sentence is grammatically correct (0 for incorrect, 1 for correct).
- the grammatical analysis network can be used to determine the grammatical accuracy of a sentence.
- the grammar score ranges from 0 to 1, where 0 represents a grammatical error and 1 represents grammatical correctness; a value between 0 and 1 can be understood as the degree of grammatical accuracy.
- the server calculates the average value of the grammar score of each sentence to obtain the grammar score of the text data.
- the grammatical analysis network learns automatically from the text data, without the need to split and match the grammatical structure of each sentence in the text data.
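- The following sketch shows the averaging step; grammar_net stands for any trained acceptability scorer that returns a value in [0, 1] per sentence (for example, a classifier fine-tuned on CoLA), which is an assumption of this example rather than a component named by the application.

```python
import re

def grammar_score_of_text(text, grammar_net):
    """Split the text into sentences and average the per-sentence grammar scores in [0, 1]."""
    sentences = [s for s in re.split(r"[。！？.!?]+", text) if s.strip()]
    scores = [grammar_net(s) for s in sentences]   # grammar_net: any acceptability scorer
    return sum(scores) / len(scores) if scores else 0.0
```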
- Step S210 randomly intercepting video frames from the interviewer's video data, extracting the interviewer's micro-expression features from the video frames, and obtaining the video data confidence level according to the micro-expression features.
- the server randomly intercepts video frames from the interviewer's video data at a preset time interval, obtains the interviewer's micro-expression features from the video frames, and inputs the micro-expression features into the trained micro-expression model to obtain the confidence that the micro-expression features belong to each preset emotion category; the confidences that the micro-expression features belong to the preset emotion categories are sorted, and the maximum confidence is taken as the video data confidence.
- the micro-expression model is trained on sample micro-expression data.
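- A minimal OpenCV-based sketch of this step is shown below. The expression_model callable, the sampling interval, and taking the maximum over all sampled frames (in addition to the maximum over emotion categories) are assumptions of the example, not details fixed by the application.

```python
import random
import cv2

def video_data_confidence(video_path, expression_model, interval_seconds=2.0):
    """Randomly pick one frame per interval and keep the maximum emotion-category confidence."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    step = max(1, int(fps * interval_seconds))
    confidences = []
    for start in range(0, total_frames, step):
        frame_idx = random.randint(start, min(start + step, total_frames) - 1)  # random frame in interval
        cap.set(cv2.CAP_PROP_POS_FRAMES, frame_idx)
        ok, frame = cap.read()
        if not ok:
            continue
        category_conf = expression_model(frame)   # confidences per preset emotion category
        confidences.append(max(category_conf))
    cap.release()
    return max(confidences) if confidences else 0.0
```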
- Step S212 Determine the interview result of the interviewer according to the first voice emotion data, the second voice emotion data, the grammar score, and the confidence of the video data.
- the server can obtain the audio data confidence by inputting the first voice emotion data, the second voice emotion data, and the grammar score into the trained audio classification model, and then determines the interviewer's interview result according to the audio data confidence, the video data confidence, and the confidence parameter.
- the input parameters of the audio classification model include the confidence that the audio data in the first voice emotion data belongs to each preset emotion category, the confidence that the text data in the second voice emotion data belongs to each preset emotion category, and the grammar score.
- sample voice data and sample text data that carry annotation information can be used as a training set.
- the annotation information is used to mark whether the interviewer corresponding to the sample voice data and sample text data is lying.
- the confidence parameter can be set according to needs, and the confidence parameter is an adjustable parameter.
- the above data processing method extracts micro-speech features from the interviewer's audio data and obtains the first voice emotion data based on the micro-speech features; converts the interviewer's audio data into text data and analyzes the text data to obtain the second voice emotion data and the grammar score; extracts micro-expression features from the interviewer's video data and obtains the video data confidence based on the micro-expression features; and determines the interviewer's interview result based on the first voice emotion data, the second voice emotion data, the grammar score, and the video data confidence.
- step S204 includes:
- Step S302 Invoking a voice feature extraction tool, and extracting the interviewer's micro-speech features based on the interviewer's audio data.
- the micro-speech features include speech rate features, Mel frequency cepstral coefficients, and pitch features;
- Step S304 Input the micro-speech feature into the matched voice emotion classification model to obtain first voice emotion data corresponding to the micro-speech feature.
- to extract the Mel frequency cepstral coefficients, the voice feature extraction tool performs a fast Fourier transform on the interviewer's audio data to obtain the spectrum, maps the spectrum to the Mel scale, takes the logarithm, and performs a discrete cosine transform to obtain the Mel frequency cepstral coefficients.
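- The librosa library is used below as one possible voice feature extraction tool; the application does not name a specific tool, so this is only an illustrative sketch of the FFT → Mel filterbank → log → DCT chain.

```python
import librosa
import numpy as np

def extract_mfcc(wav_path, n_mfcc=13):
    """MFCCs via FFT -> Mel-scale filterbank -> log -> discrete cosine transform."""
    y, sr = librosa.load(wav_path, sr=None)                   # keep the original sample rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)    # shape: (n_mfcc, n_frames)
    # Summarizing each coefficient over the segment is one common convention.
    return np.mean(mfcc, axis=1), np.std(mfcc, axis=1)
```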
- the pitch features include the current segment pitch average, current segment pitch standard deviation, historical pitch average, and historical pitch standard deviation.
- the method of extracting the current segment pitch average is: a fast Fourier transform is performed on the interviewer's audio data to obtain a spectrogram of the audio data, the variance of each frequency band relative to the center of the spectrum is then calculated, and the variances are summed and the square root is taken.
- the historical pitch average and standard deviation refer to the average and standard deviation of the interviewer's pitch from the beginning of the interview up to the current segment. These data are stored on the server after the interview begins. For convenience of calculation, an exponential moving average can be used as an approximation.
- the update formula is:
- Historical pitch average = α*historical pitch average + (1-α)*current segment pitch average
- Historical pitch standard deviation = α*historical pitch standard deviation + (1-α)*current segment pitch standard deviation
- α is a weight parameter ranging from 0 to 1, which can be set according to need. The default here is 0.9.
- Speaking rate features include current speaking rate, historical speaking rate average, and historical speaking rate standard deviation.
- the historical speaking rate average and standard deviation are calculated and memorized by the server after the interview begins.
- an exponential moving average can be used for approximate calculation.
- the update formula is:
- Historical speaking rate average = α*historical speaking rate average + (1-α)*current speaking rate
- Historical speaking rate mean square deviation = α*historical speaking rate mean square deviation + (1-α)*(current speaking rate - historical speaking rate average)²
- Historical speaking rate standard deviation = square root of the historical speaking rate mean square deviation
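- The updates above can be implemented directly; the sketch below follows the formulas, with α defaulting to 0.9 as stated. Whether the mean square deviation uses the old or the already-updated historical average is not spelled out, so the ordering here is an assumption.

```python
def update_historical_stats(hist, cur_pitch_avg, cur_pitch_std, cur_rate, alpha=0.9):
    """Exponential-moving-average updates of the historical pitch and speaking-rate statistics.
    `hist` is a dict with keys: pitch_avg, pitch_std, rate_avg, rate_msd."""
    hist["pitch_avg"] = alpha * hist["pitch_avg"] + (1 - alpha) * cur_pitch_avg
    hist["pitch_std"] = alpha * hist["pitch_std"] + (1 - alpha) * cur_pitch_std
    hist["rate_avg"] = alpha * hist["rate_avg"] + (1 - alpha) * cur_rate
    # Mean square deviation of the speaking rate, then its square root as the standard deviation.
    hist["rate_msd"] = alpha * hist["rate_msd"] + (1 - alpha) * (cur_rate - hist["rate_avg"]) ** 2
    hist["rate_std"] = hist["rate_msd"] ** 0.5
    return hist
```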
- in this way, the voice feature extraction tool is called and the interviewer's micro-speech features are extracted from the interviewer's audio data, realizing the extraction of the interviewer's micro-speech features.
- step S204 includes:
- Step S402 Obtain gender information of the interviewer, and obtain a voice emotion classification model matching the interviewer's gender information from the trained voice emotion classification model set.
- the voice emotion classification model is obtained by training the sample voice data carrying the annotation information, and the annotation information includes Emotion category information and gender information;
- Step S404: Acquire the pitch feature, the Mel frequency cepstral coefficients, and the speech rate feature from the micro-speech features;
- Step S406: Input the pitch feature, the Mel frequency cepstral coefficients, and the speech rate feature into the matched speech emotion classification model, obtain the confidence that the micro-speech features belong to each preset emotion category, and obtain the first voice emotion data of the micro-speech features.
- the set of trained speech emotion classification models includes speech emotion classification models trained on sample data of interviewers of different genders, that is, an emotion classification model for analyzing male speech data and an emotion classification model for analyzing female speech data.
- the server will obtain the interviewer's gender information, match the trained voice emotion classification model set according to the interviewer's gender information, and obtain a voice emotion classification model matching the interviewer's gender information from the trained voice emotion classification model set.
- the voice emotion classification model is trained from sample voice data carrying annotation information.
- the annotation information includes emotion category information and gender information.
- the server divides the sample voice data according to gender information, and performs model training according to the divided sample voice data to obtain a set of voice emotion classification models.
- Pitch features include the current segment pitch average, current segment pitch standard deviation, historical pitch average, and historical pitch standard deviation.
- Speech rate features include the current speech rate, historical speech rate average, and historical speech rate standard deviation. The server inputs all of the features contained in these three feature groups as parameters into the matched speech emotion classification model; the convolutional neural network in the speech emotion classification model synthesizes all the features to give the confidence that the micro-speech features belong to each preset emotion category.
- in this way, the matched speech emotion classification model is obtained according to the interviewer's gender information; the pitch feature, Mel frequency cepstral coefficients, and speech rate feature are input into the matched speech emotion classification model; the confidence that the micro-speech features belong to each preset emotion category is obtained; and the first voice emotion data of the micro-speech features is obtained, realizing the acquisition of the first voice emotion data.
- the method further includes:
- Step S502 Obtain sample voice data carrying label information
- Step S504 dividing the sample voice data into a training set and a verification set
- Step S506 Perform model training according to the training set and the initial speech emotion classification model to obtain a speech emotion classification model set;
- Step S508 Perform model verification according to the verification set, and adjust each voice emotion classification model in the voice emotion classification model set.
- after obtaining the sample voice data carrying the annotation information, the server first divides the sample voice data into a first sample voice data set and a second sample voice data set according to the gender information in the annotation information, and then divides the first sample voice data set and the second sample voice data set into a training set and a validation set, respectively.
- model training is performed according to the training sets of the first sample voice data set and the second sample voice data set to obtain a first voice emotion classification model and a second voice emotion classification model; model verification is then performed according to the validation sets of the first and second sample voice data sets, and the first and second voice emotion classification models are adjusted accordingly.
- Both the first sample voice data set and the second sample voice data set only include sample voice data of interviewers of the same gender.
- in this way, sample voice data carrying annotation information is obtained, the sample voice data is divided into a training set and a verification set, model training is performed based on the training set, and model verification is performed based on the verification set to obtain each voice emotion classification model in the voice emotion classification model set, realizing the acquisition of the voice emotion classification model set.
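- A compact sketch of this training and verification flow is given below using scikit-learn; the MLP classifier is only a stand-in for whatever speech emotion model (for example, the convolutional network mentioned elsewhere) is actually used, and the feature layout and split ratio are assumptions of the example.

```python
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

def train_gender_specific_models(features, emotion_labels, genders):
    """Train and verify one speech-emotion classifier per gender from annotated samples."""
    models = {}
    for gender in sorted(set(genders)):
        idx = [i for i, g in enumerate(genders) if g == gender]
        X = [features[i] for i in idx]
        y = [emotion_labels[i] for i in idx]
        X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
        model = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500)
        model.fit(X_train, y_train)                                            # training set
        print(f"{gender} validation accuracy: {model.score(X_val, y_val):.3f}")  # verification set
        models[gender] = model
    return models
```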
- step S206 includes:
- Step S602 searching and matching a preset dictionary corresponding to the trained emotion classification network according to each word in each sentence, and determining the corresponding serial number of each word in each sentence in the dictionary;
- Step S604 input the sequence number of each word in each sentence in the dictionary into the emotion classification network to obtain the confidence that each sentence in the text data belongs to each preset emotion category;
- Step S606 Obtain the average value of the confidence that each sentence in the text data belongs to each preset emotion category, and obtain the confidence that the text data belongs to each preset emotion category according to the average value of the confidence.
- the emotion classification network can be a BERT-based network with a superimposed classification layer containing N neurons (assuming N preset emotion categories).
- the server splits the text data into multiple sentences, performs word segmentation on each sentence, searches the BERT dictionary according to the words in each sentence, converts each word into its corresponding serial number in the BERT dictionary, and inputs the serial-number sequence of the entire sentence into BERT to obtain the confidence that each sentence belongs to each preset emotion category; then, according to the confidence that each sentence belongs to each preset emotion category, the confidence that the text data belongs to each preset emotion category is determined, and the second voice emotion data is obtained.
- the emotion classification network can be obtained by training on the first sample text data. Each sample sentence in the first sample text data carries label information, and the label information is the emotion category information of each sample sentence.
- in this way, the serial numbers of the words of each sentence in the dictionary are input into the emotion classification network to obtain the confidence that each sentence in the text data belongs to each preset emotion category; then, according to the confidences of the individual sentences, the confidence that the text data belongs to each preset emotion category is obtained, realizing the acquisition of the confidence that the text data belongs to each preset emotion category.
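- A sketch of steps S602–S606 using the Hugging Face transformers library is shown below. It assumes a BERT sequence classifier with one output per preset emotion category has already been fine-tuned on the first sample text data and saved to model_dir; the library choice and the directory name are assumptions of the example, not requirements of the application.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

def text_emotion_confidences(sentences, model_dir):
    """Convert each sentence to vocabulary serial numbers, classify it with BERT,
    and average the per-sentence confidences over the whole text."""
    tokenizer = BertTokenizer.from_pretrained(model_dir)                 # the "dictionary"
    model = BertForSequenceClassification.from_pretrained(model_dir)     # N emotion categories
    model.eval()
    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")  # step S602
    with torch.no_grad():
        logits = model(**inputs).logits                                  # (num_sentences, N)
    per_sentence = torch.softmax(logits, dim=-1)                         # step S604
    return per_sentence.mean(dim=0)                                      # step S606: averaged confidence
```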
- step S212 includes:
- Step S702 Obtain the audio data confidence level according to the first voice emotion data, the second voice emotion data, and the grammar score;
- Step S704 Determine the interview result of the interviewer according to the audio data confidence level, the video data confidence level and the preset confidence level parameters.
- the server can obtain the audio data confidence by inputting the first voice emotion data, the second voice emotion data, and the grammar score into the trained audio classification model, and then determines the interviewer's interview result according to the audio data confidence, the video data confidence, and the confidence parameter.
- the input parameters of the audio classification model include the confidence that the audio data in the first voice emotion data belongs to each preset emotion category, the confidence that the text data in the second voice emotion data belongs to each preset emotion category, and the grammar score.
- sample voice data and sample text data that carry annotation information can be used as a training set.
- the annotation information is used to mark whether the interviewer corresponding to the sample voice data and sample text data is lying.
- the confidence parameter can be set according to needs, and the confidence parameter is an adjustable parameter.
- the interview result can be obtained from the interview score.
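- The exact rule for combining the audio data confidence, the video data confidence, and the preset confidence parameter is not given here; the sketch below shows one plausible reading in which the confidence parameter acts as a weight and the interview score is thresholded. The weight and threshold values are purely illustrative.

```python
def interview_result(audio_confidence, video_confidence, confidence_weight=0.6, threshold=0.5):
    """Combine audio and video confidences into an interview score and a pass/fail result."""
    score = confidence_weight * audio_confidence + (1 - confidence_weight) * video_confidence
    return {"interview_score": score, "passed": score >= threshold}
```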
- before step S206, the method further includes: obtaining the first sample text data;
- each sample sentence in the first sample text data carries emotion category information
- the first sample text data is used as the training set for model training to obtain the emotion classification network.
- a data processing device including: an acquisition module 802, a first extraction module 804, a first processing module 806, a second processing module 808, and a second extraction module 810 And analysis module 812, where:
- the obtaining module 802 is used to obtain interviewer audio data and interviewer video data
- the first extraction module 804 is configured to extract the interviewer's micro-speech feature according to the interviewer's audio data, and obtain first speech emotion data according to the micro-speech feature;
- the first processing module 806 is used to convert the interviewer's audio data into text data, split the text data into multiple sentences, perform word segmentation on the sentences, search the preset dictionary corresponding to the trained emotion classification network according to each word in each sentence, determine the confidence that the text data belongs to each preset emotion category according to the search and matching results, and obtain the second speech emotion data; the emotion classification network is trained on the first sample text data;
- the second processing module 808 is used to input the text data into the trained grammatical analysis network to obtain the grammar score of each sentence in the text data, calculate the average of the sentence grammar scores, and obtain the grammar score of the text data; the grammatical analysis network is obtained by training on the second sample text data;
- the second extraction module 810 is configured to randomly intercept video frames from the interviewer's video data, extract the interviewer's micro-expression features according to the video frames, and obtain the video data confidence level according to the micro-expression features;
- the analysis module 812 is configured to determine the interview result of the interviewer according to the first voice emotion data, the second voice emotion data, the grammar score, and the confidence of the video data.
- the above data processing device extracts micro-speech features from the interviewer's audio data and obtains the first voice emotion data based on the micro-speech features; converts the interviewer's audio data into text data and analyzes the text data to obtain the second voice emotion data and the grammar score; extracts micro-expression features from the interviewer's video data and obtains the video data confidence based on the micro-expression features; and determines the interviewer's interview result based on the first voice emotion data, the second voice emotion data, the grammar score, and the video data confidence.
- the first extraction module is also used to call the voice feature extraction tool to extract the interviewer’s micro-speech features based on the interviewer’s audio data.
- the micro-speech features include speech rate features, Mel frequency cepstral coefficients, and pitch feature.
- the first extraction module is also used to obtain gender information of the interviewer, and obtain a voice emotion classification model that matches the interviewer's gender information from the set of trained voice emotion classification models.
- the voice emotion classification model is trained on sample voice data carrying annotation information.
- the annotation information includes emotion category information and gender information.
- the pitch feature, Mel frequency cepstral coefficients, and speech rate feature in the micro-speech features are obtained, and the pitch feature, the Mel frequency cepstral coefficients, and the speech rate feature are input into the matched speech emotion classification model; the confidence that the micro-speech features belong to each preset emotion category is obtained, and the first speech emotion data of the micro-speech features is obtained.
- the first extraction module is also used to obtain sample voice data carrying annotation information, divide the sample voice data into a training set and a validation set, and perform model training according to the training set and the initial voice emotion classification model to obtain the voice Emotion classification model set, perform model verification based on the validation set, and adjust each voice emotion classification model in the voice emotion classification model set.
- the first processing module is further configured to search the preset dictionary corresponding to the trained emotion classification network according to each word in each sentence and determine the corresponding serial number of each word of each sentence in the dictionary; input the serial numbers of the words of each sentence in the dictionary into the emotion classification network to obtain the confidence that each sentence in the text data belongs to each preset emotion category; obtain the average of the confidences that the sentences in the text data belong to each preset emotion category; and obtain, according to the average confidence, the confidence that the text data belongs to each preset emotion category.
- the analysis module is further configured to obtain the audio data confidence according to the first voice emotion data, the second voice emotion data, and the grammar score, and to determine the interview result of the interviewer according to the audio data confidence, the video data confidence, and the preset confidence parameter.
- the first processing module is also used to obtain first sample text data, each sample sentence in the first sample text data carries emotion category information, and the first sample text data is used as a training set for the model Train to get the emotion classification network.
- Each module in the above-mentioned data processing device may be implemented in whole or in part by software, hardware, and a combination thereof.
- the foregoing modules may be embedded in the form of hardware or independent of the processor in the computer device, or may be stored in the memory of the computer device in the form of software, so that the processor can call and execute the operations corresponding to the foregoing modules.
- a computer device is provided.
- the computer device may be a server, and its internal structure diagram may be as shown in FIG. 9.
- the computer equipment includes a processor, a memory, and a network interface connected through a system bus.
- the processor of the computer device is used to provide calculation and control capabilities.
- the memory of the computer device includes a non-volatile storage medium and an internal memory.
- the non-volatile storage medium stores an operating system and computer readable instructions.
- the internal memory provides an environment for the operation of the operating system and computer-readable instructions in the non-volatile storage medium.
- the network interface of the computer device is used to communicate with an external terminal through a network connection.
- the computer-readable instructions are executed by the processor to realize a data processing method.
- FIG. 9 is only a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution of the present application is applied.
- the specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different component arrangement.
- a computer device includes a memory and one or more processors.
- the memory stores computer readable instructions.
- when the computer-readable instructions are executed by the one or more processors, the one or more processors perform the following steps:
- the grammar analysis network is obtained by training on second sample text data; and
- the interview result of the interviewer is determined.
- one or more non-volatile computer-readable storage media store computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:
- the grammar analysis network is obtained by training on second sample text data; and
- the interview result of the interviewer is determined.
- Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
- Volatile memory may include random access memory (RAM) or external cache memory.
- RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Multimedia (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Acoustics & Sound (AREA)
- Human Computer Interaction (AREA)
- Signal Processing (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Child & Adolescent Psychology (AREA)
- General Health & Medical Sciences (AREA)
- Hospice & Palliative Care (AREA)
- Psychiatry (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
Claims (20)
- A data processing method, comprising: acquiring interviewer audio data and interviewer video data; extracting the interviewer's micro-speech features from the interviewer audio data, and obtaining first voice emotion data according to the micro-speech features; converting the interviewer audio data into text data, splitting the text data into multiple sentences, performing word segmentation on the multiple sentences, searching a preset dictionary corresponding to a trained emotion classification network according to each word in each sentence, determining, according to the search and matching results, the confidence that the text data belongs to each preset emotion category, and obtaining second voice emotion data, the emotion classification network being trained on first sample text data; inputting the text data into a trained grammar analysis network to obtain a grammar score of each sentence in the text data, and calculating the average of the grammar scores of the sentences to obtain the grammar score of the text data, the grammar analysis network being trained on second sample text data; randomly intercepting video frames from the interviewer video data, extracting the interviewer's micro-expression features from the video frames, and obtaining a video data confidence according to the micro-expression features; and determining the interview result of the interviewer according to the first voice emotion data, the second voice emotion data, the grammar score, and the video data confidence.
- The method according to claim 1, wherein extracting the interviewer's micro-speech features from the interviewer audio data comprises: invoking a voice feature extraction tool to extract the interviewer's micro-speech features from the interviewer audio data, the micro-speech features including a speech rate feature, Mel frequency cepstral coefficients, and a pitch feature.
- The method according to claim 1, wherein obtaining the first voice emotion data according to the micro-speech features comprises: obtaining gender information of the interviewer, and obtaining, from a set of trained voice emotion classification models, a voice emotion classification model matching the gender information of the interviewer, the voice emotion classification model being trained on sample voice data carrying annotation information, the annotation information including emotion category information and gender information; obtaining the pitch feature, the Mel frequency cepstral coefficients, and the speech rate feature in the micro-speech features; and inputting the pitch feature, the Mel frequency cepstral coefficients, and the speech rate feature into the matched voice emotion classification model, obtaining the confidence that the micro-speech features belong to each preset emotion category, and obtaining the first voice emotion data of the micro-speech features.
- The method according to claim 3, wherein before obtaining, from the set of trained voice emotion classification models, the voice emotion classification model matching the gender information of the interviewer, the method further comprises: obtaining sample voice data carrying annotation information; dividing the sample voice data into a training set and a verification set; performing model training according to the training set and an initial voice emotion classification model to obtain the set of voice emotion classification models; and performing model verification according to the verification set, and adjusting each voice emotion classification model in the set of voice emotion classification models.
- The method according to claim 1, wherein searching the preset dictionary corresponding to the trained emotion classification network according to each word in each sentence, determining, according to the search and matching results, the confidence that the text data belongs to each preset emotion category, and obtaining the second voice emotion data comprises: searching the preset dictionary corresponding to the trained emotion classification network according to each word in each sentence, and determining the serial number corresponding to each word of each sentence in the dictionary; inputting the serial numbers corresponding to the words of each sentence in the dictionary into the emotion classification network to obtain the confidence that each sentence in the text data belongs to each preset emotion category; and obtaining the average of the confidences that the sentences in the text data belong to each preset emotion category, and obtaining, according to the average confidence, the confidence that the text data belongs to each preset emotion category.
- The method according to claim 1, wherein determining the interview result of the interviewer according to the first voice emotion data, the second voice emotion data, the grammar score, and the video data confidence comprises: obtaining an audio data confidence according to the first voice emotion data, the second voice emotion data, and the grammar score; and determining the interview result of the interviewer according to the audio data confidence, the video data confidence, and a preset confidence parameter.
- The method according to claim 1, wherein before searching the preset dictionary corresponding to the trained emotion classification network according to each word in each sentence, the method further comprises: obtaining the first sample text data, each sample sentence in the first sample text data carrying emotion category information; and performing model training using the first sample text data as a training set to obtain the emotion classification network.
- A data processing apparatus, comprising: an acquisition module configured to acquire interviewer audio data and interviewer video data; a first extraction module configured to extract the interviewer's micro-speech features from the interviewer audio data and obtain first voice emotion data according to the micro-speech features; a first processing module configured to convert the interviewer audio data into text data, split the text data into multiple sentences, perform word segmentation on the multiple sentences, search a preset dictionary corresponding to a trained emotion classification network according to each word in each sentence, determine, according to the search and matching results, the confidence that the text data belongs to each preset emotion category, and obtain second voice emotion data, the emotion classification network being trained on first sample text data; a second processing module configured to input the text data into a trained grammar analysis network to obtain a grammar score of each sentence in the text data, and calculate the average of the grammar scores of the sentences to obtain the grammar score of the text data, the grammar analysis network being trained on second sample text data; a second extraction module configured to randomly intercept video frames from the interviewer video data, extract the interviewer's micro-expression features from the video frames, and obtain a video data confidence according to the micro-expression features; and an analysis module configured to determine the interview result of the interviewer according to the first voice emotion data, the second voice emotion data, the grammar score, and the video data confidence.
- The apparatus according to claim 8, wherein the first extraction module is further configured to invoke a voice feature extraction tool to extract the interviewer's micro-speech features from the interviewer audio data, the micro-speech features including a speech rate feature, Mel frequency cepstral coefficients, and a pitch feature.
- A computer device, comprising a memory and one or more processors, the memory storing computer-readable instructions which, when executed by the one or more processors, cause the one or more processors to perform the following steps: acquiring interviewer audio data and interviewer video data; extracting the interviewer's micro-speech features from the interviewer audio data, and obtaining first voice emotion data according to the micro-speech features; converting the interviewer audio data into text data, splitting the text data into multiple sentences, performing word segmentation on the multiple sentences, searching a preset dictionary corresponding to a trained emotion classification network according to each word in each sentence, determining, according to the search and matching results, the confidence that the text data belongs to each preset emotion category, and obtaining second voice emotion data, the emotion classification network being trained on first sample text data; inputting the text data into a trained grammar analysis network to obtain a grammar score of each sentence in the text data, and calculating the average of the grammar scores of the sentences to obtain the grammar score of the text data, the grammar analysis network being trained on second sample text data; randomly intercepting video frames from the interviewer video data, extracting the interviewer's micro-expression features from the video frames, and obtaining a video data confidence according to the micro-expression features; and determining the interview result of the interviewer according to the first voice emotion data, the second voice emotion data, the grammar score, and the video data confidence.
- The computer device according to claim 10, wherein the processor, when executing the computer-readable instructions, further performs the following steps: invoking a voice feature extraction tool to extract the interviewer's micro-speech features from the interviewer audio data, the micro-speech features including a speech rate feature, Mel frequency cepstral coefficients, and a pitch feature.
- The computer device according to claim 10, wherein the processor, when executing the computer-readable instructions, further performs the following steps: obtaining gender information of the interviewer, and obtaining, from a set of trained voice emotion classification models, a voice emotion classification model matching the gender information of the interviewer, the voice emotion classification model being trained on sample voice data carrying annotation information, the annotation information including emotion category information and gender information; obtaining the pitch feature, the Mel frequency cepstral coefficients, and the speech rate feature in the micro-speech features; and inputting the pitch feature, the Mel frequency cepstral coefficients, and the speech rate feature into the matched voice emotion classification model, obtaining the confidence that the micro-speech features belong to each preset emotion category, and obtaining the first voice emotion data of the micro-speech features.
- The computer device according to claim 10, wherein the processor, when executing the computer-readable instructions, further performs the following steps: obtaining sample voice data carrying annotation information; dividing the sample voice data into a training set and a verification set; performing model training according to the training set and an initial voice emotion classification model to obtain a set of voice emotion classification models; and performing model verification according to the verification set, and adjusting each voice emotion classification model in the set of voice emotion classification models.
- The computer device according to claim 10, wherein the processor, when executing the computer-readable instructions, further performs the following steps: searching the preset dictionary corresponding to the trained emotion classification network according to each word in each sentence, and determining the serial number corresponding to each word of each sentence in the dictionary; inputting the serial numbers corresponding to the words of each sentence in the dictionary into the emotion classification network to obtain the confidence that each sentence in the text data belongs to each preset emotion category; and obtaining the average of the confidences that the sentences in the text data belong to each preset emotion category, and obtaining, according to the average confidence, the confidence that the text data belongs to each preset emotion category.
- The computer device according to claim 10, wherein the processor, when executing the computer-readable instructions, further performs the following steps: obtaining an audio data confidence according to the first voice emotion data, the second voice emotion data, and the grammar score; and determining the interview result of the interviewer according to the audio data confidence, the video data confidence, and a preset confidence parameter.
- One or more non-volatile computer-readable storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps: acquiring interviewer audio data and interviewer video data; extracting the interviewer's micro-speech features from the interviewer audio data, and obtaining first voice emotion data according to the micro-speech features; converting the interviewer audio data into text data, splitting the text data into multiple sentences, performing word segmentation on the multiple sentences, searching a preset dictionary corresponding to a trained emotion classification network according to each word in each sentence, determining, according to the search and matching results, the confidence that the text data belongs to each preset emotion category, and obtaining second voice emotion data, the emotion classification network being trained on first sample text data; inputting the text data into a trained grammar analysis network to obtain a grammar score of each sentence in the text data, and calculating the average of the grammar scores of the sentences to obtain the grammar score of the text data, the grammar analysis network being trained on second sample text data; randomly intercepting video frames from the interviewer video data, extracting the interviewer's micro-expression features from the video frames, and obtaining a video data confidence according to the micro-expression features; and determining the interview result of the interviewer according to the first voice emotion data, the second voice emotion data, the grammar score, and the video data confidence.
- The storage media according to claim 16, wherein the computer-readable instructions, when executed by the processor, further cause the following steps to be performed: invoking a voice feature extraction tool to extract the interviewer's micro-speech features from the interviewer audio data, the micro-speech features including a speech rate feature, Mel frequency cepstral coefficients, and a pitch feature.
- The storage media according to claim 16, wherein the computer-readable instructions, when executed by the processor, further cause the following steps to be performed: obtaining gender information of the interviewer, and obtaining, from a set of trained voice emotion classification models, a voice emotion classification model matching the gender information of the interviewer, the voice emotion classification model being trained on sample voice data carrying annotation information, the annotation information including emotion category information and gender information; obtaining the pitch feature, the Mel frequency cepstral coefficients, and the speech rate feature in the micro-speech features; and inputting the pitch feature, the Mel frequency cepstral coefficients, and the speech rate feature into the matched voice emotion classification model, obtaining the confidence that the micro-speech features belong to each preset emotion category, and obtaining the first voice emotion data of the micro-speech features.
- The storage media according to claim 16, wherein the computer-readable instructions, when executed by the processor, further cause the following steps to be performed: obtaining sample voice data carrying annotation information; dividing the sample voice data into a training set and a verification set; performing model training according to the training set and an initial voice emotion classification model to obtain a set of voice emotion classification models; and performing model verification according to the verification set, and adjusting each voice emotion classification model in the set of voice emotion classification models.
- The storage media according to claim 16, wherein the computer-readable instructions, when executed by the processor, further cause the following steps to be performed: searching the preset dictionary corresponding to the trained emotion classification network according to each word in each sentence, and determining the serial number corresponding to each word of each sentence in the dictionary; inputting the serial numbers corresponding to the words of each sentence in the dictionary into the emotion classification network to obtain the confidence that each sentence in the text data belongs to each preset emotion category; and obtaining the average of the confidences that the sentences in the text data belong to each preset emotion category, and obtaining, according to the average confidence, the confidence that the text data belongs to each preset emotion category.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
SG11202004543PA SG11202004543PA (en) | 2019-08-13 | 2019-09-25 | Data processing method and apparatus, computer device, and storage medium |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910745443.6 | 2019-08-13 | ||
- CN201910745443.6A CN110688499A (zh) | 2019-08-13 | Data processing method, apparatus, computer device and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021027029A1 true WO2021027029A1 (zh) | 2021-02-18 |
Family
ID=69108262
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
- PCT/CN2019/107727 WO2021027029A1 (zh) | Data processing method, apparatus, computer device and storage medium | 2019-08-13 | 2019-09-25 |
Country Status (3)
Country | Link |
---|---|
CN (1) | CN110688499A (zh) |
SG (1) | SG11202004543PA (zh) |
WO (1) | WO2021027029A1 (zh) |
Families Citing this family (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
- CN111739559B (zh) * | 2020-05-07 | 2023-02-28 | 北京捷通华声科技股份有限公司 | Speech early-warning method, apparatus, device and storage medium |
- CN112818740A (zh) * | 2020-12-29 | 2021-05-18 | 南京智能情资创新科技研究院有限公司 | Psychological quality dimension evaluation method and apparatus for intelligent interviews |
- CN112884326A (zh) * | 2021-02-23 | 2021-06-01 | 无锡爱视智能科技有限责任公司 | Multi-modal video interview evaluation method, apparatus and storage medium |
- CN112786054B (zh) * | 2021-02-25 | 2024-06-11 | 深圳壹账通智能科技有限公司 | Voice-based intelligent interview evaluation method, apparatus, device and storage medium |
- CN112990301A (zh) * | 2021-03-10 | 2021-06-18 | 深圳市声扬科技有限公司 | Emotion data labeling method and apparatus, computer device and storage medium |
- CN112836691A (zh) * | 2021-03-31 | 2021-05-25 | 中国工商银行股份有限公司 | Intelligent interview method and apparatus |
- CN113506586B (zh) * | 2021-06-18 | 2023-06-20 | 杭州摸象大数据科技有限公司 | Method and system for user emotion recognition |
- CN113724697A (zh) * | 2021-08-27 | 2021-11-30 | 北京百度网讯科技有限公司 | Model generation method, emotion recognition method, apparatus, device and storage medium |
- CN113808709B (zh) * | 2021-08-31 | 2024-03-22 | 天津师范大学 | Psychological resilience prediction method and system based on text analysis |
- CN114627218B (zh) * | 2022-05-16 | 2022-08-12 | 成都市谛视无限科技有限公司 | Virtual engine-based facial micro-expression capture method and apparatus |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
- CN106570496A (zh) * | 2016-11-22 | 2017-04-19 | 上海智臻智能网络科技股份有限公司 | Emotion recognition method and device, and intelligent interaction method and device |
- CN108305642A (zh) * | 2017-06-30 | 2018-07-20 | 腾讯科技(深圳)有限公司 | Method and apparatus for determining emotion information |
- US20180376001A1 (en) * | 2016-11-02 | 2018-12-27 | International Business Machines Corporation | System and Method for Monitoring and Visualizing Emotions in Call Center Dialogs at Call Centers |
- CN109766917A (zh) * | 2018-12-18 | 2019-05-17 | 深圳壹账通智能科技有限公司 | Interview video data processing method and apparatus, computer device and storage medium |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
- CN106503646B (zh) * | 2016-10-19 | 2020-07-10 | 竹间智能科技(上海)有限公司 | Multi-modal emotion recognition system and method |
- EP3729419A1 (en) * | 2017-12-19 | 2020-10-28 | Wonder Group Technologies Ltd. | Method and apparatus for emotion recognition from speech |
- CN109829363A (zh) * | 2018-12-18 | 2019-05-31 | 深圳壹账通智能科技有限公司 | Expression recognition method and apparatus, computer device and storage medium |
- CN109902158A (zh) * | 2019-01-24 | 2019-06-18 | 平安科技(深圳)有限公司 | Voice interaction method and apparatus, computer device and storage medium |
- CN109948438A (zh) * | 2019-02-12 | 2019-06-28 | 平安科技(深圳)有限公司 | Automatic interview scoring method, apparatus, system, computer device and storage medium |
- CN109905381A (zh) * | 2019-02-15 | 2019-06-18 | 北京大米科技有限公司 | Self-service interview method, related apparatus and storage medium |
-
2019
- 2019-08-13 CN CN201910745443.6A patent/CN110688499A/zh active Pending
- 2019-09-25 WO PCT/CN2019/107727 patent/WO2021027029A1/zh active Application Filing
- 2019-09-25 SG SG11202004543PA patent/SG11202004543PA/en unknown
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180376001A1 (en) * | 2016-11-02 | 2018-12-27 | International Business Machines Corporation | System and Method for Monitoring and Visualizing Emotions in Call Center Dialogs at Call Centers |
- CN106570496A (zh) * | 2016-11-22 | 2017-04-19 | 上海智臻智能网络科技股份有限公司 | Emotion recognition method and device, and intelligent interaction method and device |
- CN108305642A (zh) * | 2017-06-30 | 2018-07-20 | 腾讯科技(深圳)有限公司 | Method and apparatus for determining emotion information |
- CN109766917A (zh) * | 2018-12-18 | 2019-05-17 | 深圳壹账通智能科技有限公司 | Interview video data processing method and apparatus, computer device and storage medium |
Also Published As
Publication number | Publication date |
---|---|
SG11202004543PA (en) | 2021-03-30 |
CN110688499A (zh) | 2020-01-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
- WO2021027029A1 (zh) | Data processing method, apparatus, computer device and storage medium | |
- CN110276259B (zh) | Lip-reading recognition method and apparatus, computer device and storage medium | |
- WO2021068321A1 (zh) | Information pushing method and apparatus based on human-computer interaction, and computer device | |
- EP3832519A1 (en) | Method and apparatus for evaluating translation quality | |
- US10176804B2 (en) | Analyzing textual data | |
- WO2020244153A1 (zh) | Conference voice data processing method and apparatus, computer device and storage medium | |
- WO2020177230A1 (zh) | Machine learning-based medical data classification method and apparatus, computer device and storage medium | |
- US9558741B2 (en) | Systems and methods for speech recognition | |
- WO2021000497A1 (zh) | Retrieval method and apparatus, computer device and storage medium | |
- WO2020147395A1 (zh) | Emotion-based text classification processing method and apparatus, and computer device | |
- CN113094578B (zh) | Deep learning-based content recommendation method, apparatus, device and storage medium | |
- CN113707125B (zh) | Training method and apparatus for a multilingual speech synthesis model | |
- JP2017058674A (ja) | Apparatus and method for speech recognition, apparatus and method for learning transformation parameters, computer program, and electronic device | |
- CN111833845A (zh) | Multilingual speech recognition model training method, apparatus, device and storage medium | |
- US20230089308A1 (en) | Speaker-Turn-Based Online Speaker Diarization with Constrained Spectral Clustering | |
- US11961515B2 (en) | Contrastive Siamese network for semi-supervised speech recognition | |
- CN113254613B (zh) | Dialogue question-answering method, apparatus, device and storage medium | |
- CN110717021B (zh) | Obtaining input text in an artificial intelligence interview, and related apparatus | |
- US11893813B2 (en) | Electronic device and control method therefor | |
- CN110047469A (zh) | Voice data emotion labeling method and apparatus, computer device and storage medium | |
- CN111126084B (zh) | Data processing method and apparatus, electronic device and storage medium | |
- CN115796653A (zh) | Interview speech evaluation method and system | |
- US20220392439A1 (en) | Rescoring Automatic Speech Recognition Hypotheses Using Audio-Visual Matching | |
- JP2015175859A (ja) | Pattern recognition apparatus, pattern recognition method, and pattern recognition program | |
- CN111933187B (zh) | Training method and apparatus for an emotion recognition model, computer device and storage medium | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 19941414 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 19941414 Country of ref document: EP Kind code of ref document: A1 |
|
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 03.08.2022) |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 19941414 Country of ref document: EP Kind code of ref document: A1 |