WO2023222088A1 - Method and apparatus for speech recognition and classification - Google Patents

Method and apparatus for speech recognition and classification (original title: Procédé et appareil de reconnaissance vocale et de classification)

Info

Publication number
WO2023222088A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
speech
text data
real
text
Prior art date
Application number
PCT/CN2023/095080
Other languages
English (en)
Chinese (zh)
Inventor
曾谁飞
孔令磊
张景瑞
李敏
刘卫强
Original Assignee
青岛海尔电冰箱有限公司
海尔智家股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 青岛海尔电冰箱有限公司, 海尔智家股份有限公司
Publication of WO2023222088A1

Classifications

    • G06F 16/353 — Information retrieval of unstructured textual data; clustering; classification into predefined classes
    • G06F 40/30 — Handling natural language data; semantic analysis
    • G10L 15/063 — Speech recognition; creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/16 — Speech recognition; speech classification or search using artificial neural networks
    • G10L 15/26 — Speech recognition; speech to text systems

Definitions

  • the present invention relates to the field of computer technology, and in particular to a speech recognition and classification method and device.
  • the object of the present invention is to provide a speech recognition and classification method and device.
  • the invention provides a speech recognition and classification method, which includes the steps:
  • the transcribing the real-time voice data into voice text data specifically includes:
  • after the output text data is combined through a fully connected layer, it is output to a classification function to calculate a score and obtain the speech text data.
  • the transcribing the real-time voice data into voice text data further includes the steps:
  • inputting the speech feature vector into a speech recognition convolutional neural network to obtain output text data specifically includes:
  • the speech feature vector is input into a multi-size and multi-channel multi-layer speech recognition convolutional neural network to obtain output text data.
  • the extraction of real-time voice data features specifically includes:
  • the extraction of text features of the real-time speech data and the historical text data specifically includes:
  • the word vector is input into a bidirectional long short-term memory (BiLSTM) network model to obtain a context feature vector containing context feature information based on the speech text data and the historical text data.
  • extracting text features of the real-time voice data and the historical text data also includes:
  • the context feature vector is input into the attention mechanism model to obtain an attention feature vector containing weight information.
  • inputting the context feature vector into the attention mechanism model to obtain the attention feature vector containing weight information specifically includes:
  • the first attention feature vector is input into the mutual attention mechanism model, and a second attention feature vector containing association weight information between different words of the text data is obtained.
  • the text features are output to a classifier to calculate scores to obtain classification result information, which specifically includes:
  • after the attention feature vector is combined through the fully connected layer, it is output to the Softmax function, which calculates the text-semantic scores of the speech text data and the historical text data and their normalized score results to obtain the classification result information.
  • the obtaining of real-time voice data specifically includes:
  • the real-time voice data transmitted from the client terminal is obtained.
  • the acquisition of historical text data specifically includes:
  • Preprocessing the real-time voice data includes: framing and windowing the real-time voice data.
  • Preprocessing the historical text data includes: cleaning, annotating, word segmenting, and removing stop words on the speech text data.
  • the outputting the classification result information includes:
  • obtaining the context information and weight information of the real-time voice data and the historical text data specifically includes:
  • obtaining the configuration data stored in the external cache, performing deep neural network calculations on the speech text data and the historical text data based on the configuration data, and obtaining the context information and weight information of the real-time speech data and the historical text data.
  • the invention also provides a speech recognition and classification device, including:
  • Data acquisition module used to acquire real-time voice data and historical text data
  • a transcription module used to transcribe the real-time voice data into voice text data
  • a feature extraction module used to extract text features of the real-time voice data and the historical text data
  • the result calculation module is used to combine the text features through the fully connected layer and output it to the classifier to calculate the score to obtain the classification result information;
  • An output module is used to output the classification result information.
  • the method provided by the present invention completes the task of recognizing and classifying the acquired speech data, and by acquiring historical text data and using it as part of the data set of the pre-training and prediction model, the text semantic feature information is obtained more comprehensively.
  • the historical text data is used as supplementary data, which compensates for the limited semantic information in the speech text and effectively improves the accuracy of text classification.
  • the accuracy of real-time speech recognition is improved by constructing a neural network model that combines convolutional neural networks, the connectionist temporal classification method, and attention mechanisms; and a neural network model that integrates a context information mechanism, a self-attention mechanism, and a mutual attention mechanism is built to more fully extract text semantic feature information.
  • the overall model structure has excellent deep learning representation capabilities, high speech recognition accuracy, and high accuracy in classifying speech text.
  • Figure 1 is a structural block diagram of a model involved in a speech recognition and classification method in an embodiment of the present invention.
  • Figure 2 is a schematic diagram of the steps of a speech recognition and classification method in an embodiment of the present invention.
  • Figure 3 is a schematic diagram of the steps of acquiring real-time voice data and acquiring historical text data in an embodiment of the present invention.
  • Figure 4 is a schematic diagram of the steps of converting the real-time voice data into voice text data in one embodiment of the invention.
  • Figure 5 is a schematic diagram of the steps of extracting text features of the real-time voice data and the historical text data in an embodiment of the invention.
  • Figure 6 is a schematic structural diagram of a speech recognition and classification device in an embodiment of the present invention.
  • Referring to FIG. 1, it is a structural block diagram of a model involved in the speech recognition and classification method provided by the present invention.
  • Referring to Figure 2, it is a schematic diagram of the steps of the speech recognition and classification method, which includes:
  • S1 Get real-time voice data and historical text data.
  • S2 Transcribe the real-time voice data into voice text data.
  • S3 Extract text features of the real-time voice data and the historical text data.
  • S4 Combine the text features through a fully connected layer and output them to a classifier to calculate a score, obtaining classification result information.
  • S5 Output the classification result information.
  • the method provided by the present invention can be used by an intelligent electronic device to implement functions such as real-time interaction or message push with the user based on the user's real-time voice input.
  • a smart refrigerator is taken as an example, and the method is explained in combination with a pre-trained deep learning model.
  • Based on the user's voice input, the smart refrigerator classifies the text content corresponding to the user's voice and determines the information that needs to be output according to the classification result information.
  • in step S1, it specifically includes:
  • the real-time voice data transmitted from the client terminal is obtained.
  • the real-time voice mentioned here refers to the inquiry or instructional statements currently spoken by the user to the intelligent electronic device or to the client terminal device that is communicatively connected to the intelligent electronic device.
  • the user can ask questions or give commands such as "What vegetables are in the refrigerator today?" and "What fruits are in season?".
  • the processor of the smart refrigerator performs voice recognition through the method provided by the present invention, and then conducts real-time voice interaction with the user or pushes relevant information to the user.
  • the historical text data mentioned here refers to the speech text data transcribed from the user's real-time speech during previous use. Furthermore, it may also include historical text data input by the user himself. Specifically, in this embodiment, it may include: text transcribed from the questions asked or instructions issued by the user in the past; text transcribed from explanatory voice remarks made by the user about items put in during past use, such as "I put in a watermelon today" or "There are 3 bottles of yogurt left in the refrigerator"; text transcribed from the user's past comments on ingredients, such as "The chili I put in today is very spicy" or "A certain brand of yogurt tastes good"; or other text data entered by the user during past use. In different implementations, one or more of the above historical texts can be selected as needed as the historical text data required by this method.
  • the user's real-time voice can be collected through voice collection devices, such as pickups and microphone arrays, installed in the smart refrigerator when the user speaks to the smart refrigerator.
  • the transmitted real-time voice of the user can also be obtained through a client terminal connected to the smart refrigerator based on a wireless communication protocol.
  • the client terminal is an electronic device with an information sending function, such as a mobile phone, tablet computer, smart speaker, smart bracelet, or Bluetooth headset.
  • the user speaks directly to the client terminal, and the client terminal collects the voice and transmits it to the smart refrigerator through wireless communication methods such as Wi-Fi or Bluetooth.
  • when users have interaction needs, they can send real-time voice through whichever channel is convenient, which can significantly improve user convenience.
  • one or more of the above real-time voice acquisition methods may also be used, or the real-time voice may be acquired through other channels based on existing technologies, which will not be described again here.
  • the historical text data can be obtained by reading the historical text stored in the internal memory of the smart refrigerator. Moreover, the historical text data can also be obtained by reading the historical text stored in the external storage device configured in the smart refrigerator.
  • the external storage device is, for example, a USB flash drive, an SD card, or a similar device.
  • the storage space of the smart refrigerator can be further expanded by providing an external storage device.
  • the historical text data stored in client terminals such as mobile phones and tablet computers, or on application software servers, can also be obtained and, when needed, transmitted to the smart refrigerator for processing through client terminal communication.
  • providing multiple channels for acquiring historical text can greatly increase the data volume of historical text information, thus improving the accuracy of subsequent speech recognition.
  • one or more of the above methods for obtaining historical text data may also be used, or the historical text data may be obtained through other channels based on existing technologies, which will not be described in detail here.
  • the smart refrigerator is configured with an external cache, and at least part of the historical text data is stored in the external cache. As usage time increases, the historical text data grows; by storing part of the data in the external cache, the internal storage space of the smart refrigerator can be saved, and when performing neural network calculations, the historical text data stored in the external cache can be read directly, which improves algorithm efficiency.
  • the Redis component is used as the external cache
  • Redis is a currently widely used distributed caching system with a key/value storage structure, which can serve as a database, cache, and message-queue broker.
  • Other external caches such as Memcached may also be used in other embodiments of the present invention, and the present invention places no specific limitations on this.
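  • As an illustration (not part of the patent) of how historical text data might be stored in and read from such a cache, the following Python sketch uses the redis-py client; the key scheme, host, and port are hypothetical placeholders.

```python
import json

import redis  # redis-py client; assumes a reachable Redis server

# Connect to the external cache (host and port are placeholders).
cache = redis.Redis(host="localhost", port=6379, decode_responses=True)

def store_history(user_id: str, texts: list[str]) -> None:
    """Store a user's historical text data under a hypothetical key scheme."""
    cache.set(f"history:{user_id}", json.dumps(texts))

def load_history(user_id: str) -> list[str]:
    """Read historical text data back from the cache; empty list if absent."""
    raw = cache.get(f"history:{user_id}")
    return json.loads(raw) if raw else []
```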
  • through steps S11 and S12, real-time voice data and historical text data can be flexibly obtained through multiple channels, which not only improves the user experience but also ensures the amount of data and effectively improves algorithm efficiency.
  • step S1 also includes the step of preprocessing the data, which includes:
  • S13 Preprocess the real-time voice data, including: performing frame processing and windowing processing on the real-time voice data.
  • S14 Preprocess the historical text data, including cleaning, annotating, word segmenting, and removing stop words on the speech text data.
  • in step S13, the speech is segmented according to a specified length (time period or number of samples) and structured into a processable data structure, completing the framing of the speech and yielding speech signal data. Then, the speech signal data is multiplied by a window function so that the originally non-periodic speech signal exhibits some characteristics of a periodic function, completing the windowing process. Furthermore, pre-emphasis can be performed before framing to emphasize the high-frequency part of the speech and eliminate the influence of lip radiation during voicing, thereby compensating for the high-frequency components of the speech signal that are suppressed by the articulation system and highlighting the high-frequency formants.
  • steps such as filtering audio noise and enhancing the vocals can then be performed to complete the enhancement of the real-time voice data and extract its characteristic parameters, so that the real-time voice data meets the input requirements of the subsequent neural network models.
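  • The following numpy sketch illustrates the pre-emphasis, framing, and windowing described above; the frame length, hop size, and pre-emphasis coefficient are common illustrative values rather than values specified by the patent.

```python
import numpy as np

def preprocess_speech(signal: np.ndarray, sr: int = 16000,
                      frame_ms: float = 25.0, hop_ms: float = 10.0,
                      alpha: float = 0.97) -> np.ndarray:
    """Pre-emphasis, framing, and Hamming windowing of a speech signal.

    Assumes the signal is at least one frame long.
    """
    # Pre-emphasis boosts the high-frequency components suppressed
    # by the articulation system.
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])

    frame_len = int(sr * frame_ms / 1000)  # samples per frame
    hop_len = int(sr * hop_ms / 1000)      # hop between frame starts
    n_frames = 1 + (len(emphasized) - frame_len) // hop_len

    # Slice the signal into overlapping frames (framing).
    frames = np.stack([emphasized[i * hop_len: i * hop_len + frame_len]
                       for i in range(n_frames)])

    # Multiply each frame by a window function so the originally
    # non-periodic signal behaves quasi-periodically (windowing).
    return frames * np.hamming(frame_len)
```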
  • in step S14, irrelevant data and duplicate data in the historical text data set are deleted, outlier and missing-value data are handled, information irrelevant to the classification is initially screened out, and the cleaning of the historical text data is completed. Then, the historical text data is labeled with category labels using rule- and statistics-based methods, and the text is segmented into words using word segmentation methods based on string matching, on understanding, on statistics, or on rules. After that, stop words are removed, completing the preprocessing of the historical text data so that it meets the input requirements of the subsequent neural network model.
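  • As a rough sketch of the cleaning, word segmentation, and stop-word removal pipeline, the example below uses the jieba library for string-matching-based Chinese word segmentation; the cleaning rule and stop-word list are toy placeholders, not the patent's actual resources.

```python
import re

import jieba  # common Chinese word-segmentation library

# A tiny illustrative stop-word list; a real system would load a full one.
STOP_WORDS = {"的", "了", "是", "我", "今天"}

def preprocess_text(text: str) -> list[str]:
    """Clean, segment, and remove stop words from one text entry."""
    cleaned = re.sub(r"[^\w]+", " ", text)  # drop punctuation and noise
    tokens = jieba.lcut(cleaned)            # string-matching segmentation
    return [t for t in tokens if t.strip() and t not in STOP_WORDS]

print(preprocess_text("今天放入了一个西瓜"))  # e.g. ['放入', '一个', '西瓜']
```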
  • in step S13 and step S14, the specific algorithms used to preprocess the real-time voice data and the historical text data may refer to current technology in the field and will not be described again here.
  • in step S2, it specifically includes the following steps:
  • S21 Extract the features of the real-time voice data.
  • S22 Input the speech features into the speech recognition convolutional neural network to obtain output text data.
  • S23 Combine the output text data through a fully connected layer and output it to a classification function to calculate a score, obtaining the speech text data.
  • in step S21, extracting the real-time voice data features specifically includes: extracting the Mel-scale Frequency Cepstral Coefficient (MFCC) features of the real-time voice data. Specifically, step S21 may include:
  • the preprocessed real-time speech data is subjected to a fast Fourier transform to obtain the energy spectrum of each frame of the real-time speech data signal, and the energy spectrum is passed through a set of Mel-scale triangular filter banks to smooth the spectrum and eliminate the effect of harmonics, highlighting the formants of the real-time speech; the MFCC features are then obtained through further logarithmic operations and discrete cosine transforms.
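  • For illustration, MFCC features following the chain described above (fast Fourier transform → Mel-scale filter bank → logarithm → discrete cosine transform) can be computed with the librosa library; the file path and parameter values below are placeholders.

```python
import librosa
import numpy as np

# Load an utterance (the path is a placeholder) at a 16 kHz sampling rate.
signal, sr = librosa.load("utterance.wav", sr=16000)

# librosa internally performs the FFT, Mel filter-bank smoothing,
# logarithm, and DCT steps described above.
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13,
                            n_fft=512, hop_length=160)  # shape (13, n_frames)

# Transpose so that each row is the feature vector of one frame.
features = np.ascontiguousarray(mfcc.T)
```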
  • characteristic parameters such as perceptual linear prediction features (Perceptual Linear Predictive, referred to as PLP) or linear predictive coefficient features (Linear Predictive Coding, referred to as LPC) of the real-time speech data can also be obtained through different algorithm steps.
  • in step S22, it specifically includes:
  • the speech feature vector is input into a multi-size and multi-channel multi-layer speech recognition convolutional neural network to obtain output text data.
  • a multi-size and multi-channel multi-layer convolutional neural network is constructed to improve the network width of the model.
  • the width of the convolutional neural network refers to the number of channels in the convolutional neural network model.
  • the network width is increased by increasing the number of convolutional layer channels.
  • a wider network allows each layer of the convolutional neural network to learn richer features, thereby improving the performance of the speech recognition convolutional neural network model and making up for the short effective duration of real-time speech data.
  • in this embodiment, the convolution kernel size is 3×3, the number of channels in each convolution layer is 32, and max pooling is used to reduce the size of the speech recognition convolutional neural network model and improve calculation speed while also improving the robustness of the extracted features.
  • setting the number of channels to 32 ensures, on the one hand, the network width of the speech recognition convolutional neural network and, on the other hand, avoids the excessive amount of calculation that too large a network width would cause, which would reduce the efficiency of the speech recognition convolutional neural network.
  • the parameters of the speech recognition convolutional neural network model can also be adjusted according to actual model performance and the practical application field of the method.
  • the present invention does not impose specific limitations on this.
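  • A possible PyTorch sketch of such a multi-layer, multi-channel speech recognition network with 3×3 kernels, 32 channels per convolution layer, and max pooling, as in this embodiment; the number of layers, the frequency pooling, and the output head are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class SpeechCNN(nn.Module):
    """Multi-layer CNN over MFCC feature maps (3x3 kernels, 32 channels)."""

    def __init__(self, n_classes: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),  # max pooling shrinks the model and speeds it up
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, n_mfcc, n_frames)
        h = self.features(x)               # (batch, 32, freq', time')
        h = h.mean(dim=2).transpose(1, 2)  # pool frequency -> (batch, time', 32)
        return self.classifier(h)          # per-frame symbol scores

scores = SpeechCNN(n_classes=28)(torch.randn(4, 1, 13, 100))
```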
  • the Connectionist Temporal Classification (CTC) model is a completely end-to-end acoustic model training method: it expands the label set by adding a blank element, labels sequences with the expanded label set, and treats every predicted sequence that can be converted into the real sequence through the mapping function as a correct prediction result. By introducing it, the problem of label-sequence alignment between the speech features and the output text data is solved.
  • the attention mechanism can guide the neural network to focus on more critical information and suppress other, non-critical information. Therefore, by introducing the attention mechanism, the local key features or weight information of the output text data can be obtained, which further reduces irregular misalignment of the sequence during model training. The CTC model lacks connections between preceding and following speech features and therefore relies more heavily on correction by the language model, while a pure attention mechanism model is independent of the frame order of the input real-time speech data: each decoding unit generates the current result from the decoding result of the previous unit and the overall speech features, ignoring the monotonic timing of speech. To balance the advantages and disadvantages of the CTC model and the attention model, the two are used in combination in this implementation.
  • in step S2, rich high-level speech feature information of the real-time speech data can be obtained by constructing a deep neural network model that integrates a convolutional neural network, the connectionist temporal classification method, and an attention mechanism, thereby improving the model's speech recognition capability and accuracy.
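  • To illustrate how CTC removes the need for frame-level alignment, the sketch below applies PyTorch's nn.CTCLoss to dummy network outputs; the vocabulary size, sequence lengths, and blank index are hypothetical.

```python
import torch
import torch.nn as nn

# Hypothetical sizes: 50 frames, batch of 4, 28 symbols (blank + alphabet),
# target sequences of length 12.
T, B, C, S = 50, 4, 28, 12
log_probs = torch.randn(T, B, C).log_softmax(dim=2)      # network outputs
targets = torch.randint(1, C, (B, S), dtype=torch.long)  # label sequences
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), S, dtype=torch.long)

# CTC adds a blank (index 0 here) to the label set and sums over all
# predicted sequences that map onto the target sequence, so no explicit
# alignment between frames and labels is required.
ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
```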
  • in other embodiments, the real-time speech data can also be transcribed into the speech text data by constructing a shallow neural network model or a model such as a Gaussian mixture model, as long as the real-time speech data can be transcribed into the speech text data.
  • the speech text data and the historical text data corresponding to real-time speech are obtained through steps S1 and S2.
  • in step S3, it specifically includes:
  • S31 Convert the speech text data and the historical text data into word vectors.
  • S32 Input the word vectors into the bidirectional long short-term memory network model to obtain a context feature vector containing context feature information based on the speech text data and the historical text data.
  • after step S32, the following step is also included: S33 Input the context feature vector into the attention mechanism model to obtain an attention feature vector containing weight information.
  • in step S31, in order to convert the text data into a vectorized form that the computer can recognize and process, the historical text data and the speech text data can be converted into word vectors through the Word2Vec algorithm, or other methods can be used.
  • the word vectors can also be obtained using other existing algorithms in the field, such as the GloVe algorithm, and the present invention does not impose specific limitations on this.
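  • A small gensim sketch of the Word2Vec conversion in step S31; the toy corpus and vector dimension are illustrative only.

```python
from gensim.models import Word2Vec

# Tokenized corpus: segmented speech text plus historical text (toy data).
sentences = [["冰箱", "里", "有", "西瓜"], ["酸奶", "还", "剩", "三", "瓶"]]

# Train a small skip-gram Word2Vec model; the dimensions are illustrative.
w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

vector = w2v.wv["西瓜"]  # a 100-dimensional word vector
```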
  • the Bi-directional Long Short-Term Memory (BiLSTM) network is composed of a forward Long Short-Term Memory (LSTM) network and a backward LSTM network.
  • the LSTM model can better capture long-distance dependencies in text semantics, and building on it, the BiLSTM model can better capture the bidirectional semantics of text.
  • multiple word vectors are input into the BiLSTM model respectively, the effective information output at each time step is obtained, the hidden layer states of this information are output, and a context feature vector carrying contextual information is produced.
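  • A minimal PyTorch sketch of step S32, assuming 100-dimensional word vectors and 128 hidden units per direction: each time step of the BiLSTM output concatenates the forward and backward hidden states, giving a context feature vector that sees both left and right context.

```python
import torch
import torch.nn as nn

bilstm = nn.LSTM(input_size=100, hidden_size=128,
                 batch_first=True, bidirectional=True)

word_vectors = torch.randn(4, 20, 100)  # (batch, seq_len, embed_dim)
context, _ = bilstm(word_vectors)       # (batch, seq_len, 2 * 128)
# Each row of `context` is a context feature vector combining the
# forward and backward hidden states for that time step.
```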
  • a common recurrent network model in the field such as a Gated Recurrent Unit (GRU) network can also be used to extract contextual feature information, and the present invention does not impose specific limitations on this.
  • in step S33, the context feature vector is used as the input of the attention mechanism model to obtain the output attention feature vector. Further, in this embodiment, step S33 specifically includes:
  • S331 Input the context feature vector into the self-attention mechanism model, and obtain the first attention feature vector containing the weight information of the text semantics of the text data.
  • S332 Input the first attention feature vector into the mutual attention mechanism model, and obtain a second attention feature vector containing association weight information between different words of the text data.
  • the input context feature vector is given its own weight information through the self-attention mechanism model to obtain the first attention feature vector, thereby obtaining the internal semantic-feature weight information of the speech text data and the historical text data. The first attention feature vector is then given the association weight information between different words of the text through the mutual attention mechanism model to obtain the second attention feature vector, thereby obtaining the association weight information between different words in the speech text data and the historical text data.
  • the second attention feature vector finally obtained integrates the context information of the text semantics, the internal weight information of words, and the association weight information between different words; it carries rich semantic feature information and thus has excellent text and speech representation ability.
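  • The sketch below shows one plausible reading of steps S331 and S332 using scaled dot-product attention: self-attention within each text first, then mutual attention from the speech text to the historical text. The patent does not pin down this exact formulation; it is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def attention(query: torch.Tensor, key: torch.Tensor,
              value: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product attention; returns weighted value vectors."""
    scores = query @ key.transpose(-2, -1) / key.size(-1) ** 0.5
    return F.softmax(scores, dim=-1) @ value

speech_ctx = torch.randn(4, 20, 256)   # BiLSTM output for speech text
history_ctx = torch.randn(4, 50, 256)  # BiLSTM output for historical text

# Self-attention: each text attends to itself (internal weight information).
speech_self = attention(speech_ctx, speech_ctx, speech_ctx)
history_self = attention(history_ctx, history_ctx, history_ctx)

# Mutual attention: the speech text attends to the historical text,
# capturing association weights between words of the two texts.
speech_mutual = attention(speech_self, history_self, history_self)
```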
  • the text feature enhancement of the context feature vector can also be completed based only on the self-attention mechanism model, or through other algorithm models.
  • step S3 may also include:
  • obtaining the configuration data stored in the external cache, performing deep neural network calculations on the speech text data and the historical text data based on the configuration data, and obtaining the context information and weight information of the real-time speech data and the historical text data.
  • configuring an external cache improves the calculation efficiency of the algorithm and effectively solves the time-response and space-complexity problems caused by the large amount of historical text data.
  • in other embodiments, the order of the layers of the deep neural network can be adjusted as needed, or some layers can be omitted, as long as the text classification of the speech text data and the historical text data can be completed; the present invention does not impose specific limitations on this.
  • in step S4, it specifically includes:
  • after the attention feature vector is combined through the fully connected layer, it is output to the Softmax function, which calculates the text-semantic scores of the speech text data and the historical text data and their normalized score results to obtain the classification result information.
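  • A short PyTorch sketch of step S4 as described: the attention feature vectors are combined, passed through a fully connected layer, and normalized with Softmax to obtain the classification scores; the mean-pooling choice and the dimensions are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

fc = nn.Linear(256, 5)  # hypothetical: 256-dim features, 5 classes

attn_features = torch.randn(4, 20, 256)  # second attention feature vectors
pooled = attn_features.mean(dim=1)       # combine over the sequence
scores = fc(pooled)                      # unnormalized text-semantic scores
probs = F.softmax(scores, dim=-1)        # normalized score results
predicted = probs.argmax(dim=-1)         # classification result per sample
```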
  • through the above steps, the method provided by the present invention sequentially completes the recognition and classification of the acquired speech data; by acquiring the historical text data and using it as part of the data set of the pre-training and prediction model, the text semantic feature information is obtained more comprehensively.
  • the historical text data is used as supplementary data to make up for the limited semantic information in the speech text, and the accuracy of text classification is effectively improved.
  • the accuracy of real-time speech recognition is improved by building a neural network model that combines convolutional neural networks, the connectionist temporal classification method, and attention mechanisms; building a neural network model that integrates the context information mechanism, the self-attention mechanism, and the mutual attention mechanism allows text semantic feature information to be extracted more fully.
  • the overall model structure has excellent deep learning representation capabilities, high speech recognition accuracy, and high accuracy in classifying speech text.
  • in step S5, it specifically includes:
  • the classification result information can be converted into voice and broadcast through the sound playback device built into the smart refrigerator, thereby performing voice interaction with the user directly; or the classification result information can be converted into text and displayed directly through the display device configured in the smart refrigerator.
  • the classification result information can also be transmitted to a client terminal via wireless communication for output, where the client terminal is an electronic device with an information receiving function, such as a mobile phone, smart speaker, or Bluetooth headset to which the voice is transmitted.
  • a multi-channel and multi-type classification result information output method is realized.
  • the user is not limited to obtaining relevant information only near the smart refrigerator; combined with the multi-channel, multi-type real-time voice acquisition method provided by the present invention, the user can interact with the smart refrigerator directly and remotely, which is extremely convenient and greatly improves the user experience.
  • only one or more of the above classification result information output methods may be used, or the classification result information may be output through other channels based on existing technology, and the present invention does not impose specific limitations on this.
  • the present invention provides a speech recognition and classification method that obtains real-time speech data through multiple channels; after transcribing the real-time speech data into text, it combines historical text data with a deep neural network model to fully extract text semantic features, and after obtaining the classification result information, outputs it through multiple channels. This significantly improves speech recognition accuracy and text classification accuracy while making the interaction more convenient and diverse, greatly improving the user experience.
  • the present invention also provides a speech recognition and classification device 6, which includes:
  • Data acquisition module 61 used to acquire real-time voice data and historical text data
  • Transcription module 62 used to transcribe the real-time voice data into voice text data
  • Feature extraction module 63 used to extract text features of the real-time voice data and the historical text data
  • the result calculation module 64 is used to combine the text features through the fully connected layer and output them to the classifier to calculate the score to obtain the classification result information;
  • the output module 65 is used to output the classification result information.
  • the present invention also provides an electrical device, which includes:
  • Memory used to store executable instructions
  • the processor is used to implement the above speech recognition and classification method when running executable instructions stored in the memory.
  • the present invention also provides a refrigerator, which includes:
  • Memory used to store executable instructions
  • the processor is used to implement the above speech recognition and classification method when running executable instructions stored in the memory.
  • the present invention also provides a computer-readable storage medium that stores executable instructions.
  • when the executable instructions are executed by a processor, the above-mentioned speech recognition and classification method is implemented.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The present invention relates to a speech recognition and classification method and apparatus in the field of computer technology. The method includes the steps of: acquiring real-time speech data and historical text data (S1); transcribing the real-time speech data into speech text data (S2); extracting text features of the real-time speech data and the historical text data (S3); combining the text features through a fully connected layer, then outputting them to a classifier and calculating a score to obtain classification result information (S4); and outputting the classification result information (S5).
PCT/CN2023/095080 2022-05-20 2023-05-18 Procédé et appareil de reconnaissance vocale et de classification WO2023222088A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210556181.0 2022-05-20
CN202210556181.0A CN115062143A (zh) 2022-05-20 2022-05-20 语音识别与分类方法、装置、设备、冰箱及存储介质

Publications (1)

Publication Number Publication Date
WO2023222088A1 true WO2023222088A1 (fr) 2023-11-23

Family

ID=83199399

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/095080 WO2023222088A1 (fr) 2022-05-20 2023-05-18 Procédé et appareil de reconnaissance vocale et de classification

Country Status (2)

Country Link
CN (1) CN115062143A (fr)
WO (1) WO2023222088A1 (fr)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115098765A (zh) * 2022-05-20 2022-09-23 青岛海尔电冰箱有限公司 基于深度学习的信息推送方法、装置、设备及存储介质
CN114944156A (zh) * 2022-05-20 2022-08-26 青岛海尔电冰箱有限公司 基于深度学习的物品分类方法、装置、设备及存储介质
CN115062143A (zh) * 2022-05-20 2022-09-16 青岛海尔电冰箱有限公司 语音识别与分类方法、装置、设备、冰箱及存储介质
CN116741151B (zh) * 2023-08-14 2023-11-07 成都筑猎科技有限公司 一种基于呼叫中心的用户呼叫实时监测系统
CN116975301A (zh) * 2023-09-22 2023-10-31 腾讯科技(深圳)有限公司 文本聚类方法、装置、电子设备和计算机可读存储介质

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108305641A (zh) * 2017-06-30 2018-07-20 腾讯科技(深圳)有限公司 情感信息的确定方法和装置
CN109146066A (zh) * 2018-11-01 2019-01-04 重庆邮电大学 一种基于语音情感识别的虚拟学习环境自然交互方法
CN109523994A (zh) * 2018-11-13 2019-03-26 四川大学 一种基于胶囊神经网络的多任务语音分类方法
US10468019B1 (en) * 2017-10-27 2019-11-05 Kadho, Inc. System and method for automatic speech recognition using selection of speech models based on input characteristics
CN113053366A (zh) * 2021-03-12 2021-06-29 中国电子科技集团公司第二十八研究所 一种基于多模态融合的管制话音复述一致性校验方法
CN113808622A (zh) * 2021-09-17 2021-12-17 青岛大学 基于中文语音和文本的情感识别系统及方法
CN115062143A (zh) * 2022-05-20 2022-09-16 青岛海尔电冰箱有限公司 语音识别与分类方法、装置、设备、冰箱及存储介质


Also Published As

Publication number Publication date
CN115062143A (zh) 2022-09-16

Similar Documents

Publication Publication Date Title
WO2023222088A1 (fr) Procédé et appareil de reconnaissance vocale et de classification
CN111968679B (zh) 情感识别方法、装置、电子设备及存储介质
CN111933129A (zh) 音频处理方法、语言模型的训练方法、装置及计算机设备
WO2023222089A1 (fr) Procédé et appareil de classification d'éléments sur la base d'un apprentissage profond
CN113408385A (zh) 一种音视频多模态情感分类方法及系统
CN109509470A (zh) 语音交互方法、装置、计算机可读存储介质及终端设备
US11763801B2 (en) Method and system for outputting target audio, readable storage medium, and electronic device
CN114818649A (zh) 基于智能语音交互技术的业务咨询处理方法及装置
CN115602165A (zh) 基于金融系统的数字员工智能系统
WO2023222090A1 (fr) Procédé et appareil de poussée d'informations basés sur un apprentissage profond
CN114566189A (zh) 基于三维深度特征融合的语音情感识别方法及系统
CN116431806A (zh) 自然语言理解方法及冰箱
WO2024114303A1 (fr) Procédé et appareil de reconnaissance de phonèmes, dispositif électronique et support de stockage
CN114399995A (zh) 语音模型的训练方法、装置、设备及计算机可读存储介质
CN115798459B (zh) 音频处理方法、装置、存储介质及电子设备
CN112199498A (zh) 一种养老服务的人机对话方法、装置、介质及电子设备
CN116186258A (zh) 基于多模态知识图谱的文本分类方法、设备及存储介质
CN116108176A (zh) 基于多模态深度学习的文本分类方法、设备及存储介质
CN116070020A (zh) 基于知识图谱的食材推荐方法、设备及存储介质
CN112150103B (zh) 一种日程设置方法、装置和存储介质
CN112397053B (zh) 语音识别方法、装置、电子设备及可读存储介质
US11277304B1 (en) Wireless data protocol
CN114373443A (zh) 语音合成方法和装置、计算设备、存储介质及程序产品
Sartiukova et al. Remote Voice Control of Computer Based on Convolutional Neural Network
WO2024140434A1 (fr) Procédé de classification de texte basé sur un graphe de connaissances multimodal, et dispositif et support de stockage

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23807038

Country of ref document: EP

Kind code of ref document: A1