WO2023222089A1 - Item classification method and apparatus based on deep learning - Google Patents


Info

Publication number
WO2023222089A1
WO2023222089A1 (PCT/CN2023/095081)
Authority
WO
WIPO (PCT)
Prior art keywords: text, data, historical, text data, features
Application number
PCT/CN2023/095081
Other languages
French (fr)
Chinese (zh)
Inventor
曾谁飞
孔令磊
张景瑞
刘卫强
李敏
Original Assignee
青岛海尔电冰箱有限公司
海尔智家股份有限公司
Application filed by 青岛海尔电冰箱有限公司 and 海尔智家股份有限公司
Publication of WO2023222089A1

Classifications

    • G10L 15/16: Speech recognition; speech classification or search using artificial neural networks
    • G06F 18/253: Pattern recognition; analysing; fusion techniques of extracted features
    • G06F 40/284: Handling natural language data; natural language analysis; lexical analysis, e.g. tokenisation or collocates
    • G06F 40/30: Handling natural language data; semantic analysis
    • G06N 3/049: Neural networks; architecture; temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N 3/08: Neural networks; learning methods
    • G10L 15/02: Speech recognition; feature extraction for speech recognition; selection of recognition unit
    • G10L 15/26: Speech recognition; speech-to-text systems
    • G10L 25/24: Speech or voice analysis techniques not restricted to groups G10L 15/00-G10L 21/00; extracted parameters being the cepstrum
    • G10L 25/30: Speech or voice analysis techniques not restricted to groups G10L 15/00-G10L 21/00; analysis technique using neural networks

Definitions

  • the present invention relates to the field of computer technology, and in particular to an item classification method and device based on deep learning.
  • the purpose of the present invention is to provide an item classification method and device based on deep learning.
  • the present invention provides an item classification method based on deep learning, including the steps:
  • after the joint features are combined through the fully connected layer, they are output to the classifier to calculate scores, obtain classification result information, and determine the item category information;
  • the acquisition of historical text data specifically includes:
  • transcribing the real-time speech data into speech text data and extracting the text features of the speech text data specifically includes:
  • the first speech text vector is input into a speech recognition convolutional neural network for encoding to obtain a second speech text vector.
  • the extraction of real-time voice data features specifically includes:
  • extracting text features of the historical text data specifically includes:
  • the food material review word vector is input into a bidirectional long short-term memory network model to obtain a food material review context feature vector containing contextual feature information based on the historical food material review text data.
  • the text features of the speech text data and the historical ingredient review text data are enhanced.
  • the text features of the speech text data and historical ingredient review text data are enhanced, specifically including:
  • Obtain a voice text attention feature vector that includes the weight information of the voice text data itself and the weight information between the voice text data and the historical ingredient review text data;
  • obtain the food review text attention feature vector, which includes the weight information of the historical food review text data itself and the weight information between the historical food review text context feature vector and the voice text data.
  • the joint representation of the text features of the real-time speech data and the text features of the historical text data to obtain a joint feature vector specifically includes:
  • the voice text attention feature vector and the food review text attention feature vector are jointly mapped to a unified multi-modal vector space for joint representation to obtain the joint feature vector.
  • the text features are output to a classifier to calculate scores to obtain classification result information, which specifically includes:
  • after the joint feature vector is combined through the fully connected layer, it is output to the Softmax function, which calculates the semantic scores of the speech text data and the historical food review text data and their normalized score results to obtain the classification result information.
  • obtaining real-time voice data containing item information specifically includes:
  • the real-time voice data transmitted from the client terminal is obtained.
  • the acquisition of historical ingredient review text data as the historical text data specifically includes:
  • preprocessing the real-time voice data includes: framing and windowing the real-time voice data;
  • preprocessing the historical text data includes: cleaning, annotating, word-segmenting, and removing stop words from the speech text data.
  • the outputting the item category information includes:
  • the step of transcribing the real-time voice data into voice text data, extracting text features of the voice text data, and extracting text features of the historical text data also includes:
  • obtain the configuration data stored in the external cache, perform deep neural network calculations on the real-time voice data and the historical food review text data based on that configuration data, and carry out text transcription and text feature extraction.
  • the present invention also provides an item classification device based on deep learning, including:
  • a data acquisition module, used to acquire real-time voice data and historical text data;
  • a transcription module, used to transcribe the real-time voice data into voice text data;
  • a feature extraction module, used to extract text features of the speech text data and text features of the historical text data;
  • a joint representation module, used to jointly represent the text features of the real-time speech data and the text features of the historical text data to obtain joint features;
  • a result calculation module, used to combine the joint features through the fully connected layer, output them to the classifier to calculate scores and obtain the classification result information, and determine the item category information;
  • an output module, used to output the item category information.
  • the method provided by the present invention completes the task of recognizing and classifying the acquired voice data; by obtaining historical food material review text data and using it as part of the data set for pre-training and prediction models, text semantic feature information is obtained more comprehensively.
  • the historical ingredient review text data serves as supplementary data that compensates for the limited semantic information in the transcribed voice text, effectively improving text classification accuracy and thereby the accuracy of item classification.
  • building a network model that fuses a deep neural network with a convolutional neural network improves the accuracy of real-time speech recognition; building a neural network model that combines a context information mechanism, a self-attention mechanism, and a mutual attention mechanism extracts text semantic feature information more fully.
  • the overall model structure has excellent deep learning representation capability, high speech recognition accuracy, and high accuracy in classifying speech text, greatly improving the accuracy and generalization ability of item category classification.
  • Figure 1 is a structural block diagram of a model involved in an item classification method based on deep learning in an embodiment of the present invention.
  • Figure 2 is a schematic diagram of the steps of an item classification method based on deep learning in an embodiment of the present invention.
  • Figure 3 is a schematic diagram of the steps of acquiring real-time voice data and acquiring historical text data in an embodiment of the present invention.
  • Figure 4 is a schematic diagram of the steps of transcribing the real-time voice data into voice text data and extracting text features of the voice text data in an embodiment of the present invention.
  • Figure 5 is a schematic diagram of the steps of extracting text features of the historical text data in an embodiment of the invention.
  • Figure 6 is a schematic structural diagram of an item classification device based on deep learning in an embodiment of the present invention.
  • As shown in FIG. 1, a structural block diagram of the model involved in the item classification method based on deep learning provided by the present invention.
  • As shown in FIG. 2, a schematic diagram of the steps of the item classification method based on deep learning, which includes:
  • S1: Obtain real-time voice data containing item information, and obtain historical text data.
  • S2: Transcribe the real-time voice data into voice text data, and extract text features of the voice text data.
  • S3: Extract text features of the historical text data.
  • S4: Jointly represent the text features of the real-time speech data and the text features of the historical text data to obtain joint features.
  • S5: Combine the joint features through the fully connected layer, then output them to the classifier to calculate scores and obtain classification result information.
  • S6: Output the item category information.
  • the method provided by the present invention can be used by an intelligent electronic device to implement functions such as real-time interaction or message push with the user based on the user's real-time voice input.
  • a smart refrigerator is taken as an example, and the method is explained in combination with a pre-trained deep learning model.
  • the smart refrigerator classifies the text content corresponding to the user's voice, thereby judging the category of items involved in the voice, and pushing relevant classification information based on the item classification results.
  • the classification of food materials in a smart refrigerator is taken as an example.
  • the method provided by the present invention can also be applied to classifying other items that need to be stored in the refrigerator, such as medicines and cosmetics.
  • in step S1, the method specifically includes:
  • S11: Obtain the real-time voice data collected by a voice collection device, and/or obtain the real-time voice data transmitted from a client terminal.
  • S12: Obtain internally stored, externally stored, and/or client-terminal-transmitted historical food material review text as the historical food material review text data.
  • the real-time voice mentioned here refers to the inquiry or instructional statements currently spoken by the user to the intelligent electronic device or to the client terminal device that is communicatively connected to the intelligent electronic device.
  • in this embodiment, the real-time voice is a sentence containing relevant information such as the category of items stored in the smart refrigerator.
  • the user can ask questions such as "What vegetables are in the refrigerator today" or "What beef ingredients are in the refrigerator today", or issue commands such as "remind me of the types of beverages left in the refrigerator".
  • based on this information, the processor of the smart refrigerator judges the relevant item categories through the method provided by the present invention, and then interacts with the user by real-time voice or pushes relevant information.
  • obtaining historical text data includes:
  • the historical ingredient review text data described here refers to transcribed text of the user's past comments on ingredients, such as "the chili I put in today is very spicy" or "the yogurt of a certain brand I put in yesterday was delicious"; it may further include food review text data entered directly by the user.
  • the historical food review text usually contains item information that the user is interested in. Selecting it as the historical text data can effectively supplement information such as item categories.
  • the user's real-time voice can be collected through voice collection devices such as pickups and microphone arrays installed in the smart refrigerator.
  • during use, when the user needs to interact with the smart refrigerator, speaking to it directly is sufficient.
  • the user's real-time voice can also be obtained through a client terminal connected to the smart refrigerator over a wireless communication protocol.
  • the client terminal is an electronic device with an information sending function, such as a mobile phone, tablet computer, smart speaker, smart bracelet, or Bluetooth headset.
  • the user speaks directly to the client terminal, which collects the voice and transmits it to the smart refrigerator through wireless communication such as Wi-Fi or Bluetooth.
  • when users have interaction needs, they can thus send real-time voice through any convenient channel, which significantly improves convenience of use.
  • one or more of the above real-time voice acquisition methods may be used, or the real-time voice may be acquired through other channels based on existing technology; the present invention places no specific limitation on this.
  • the historical food material review text stored in the internal memory of the smart refrigerator can be read to obtain the historical food material review text data.
  • the historical food material review text data can also be obtained by reading the historical food material review text stored in an external storage device configured for the smart refrigerator.
  • the external storage device is a mobile storage device such as a USB flash drive or SD card; configuring an external storage device further expands the storage space of the smart refrigerator.
  • the historical food review text data stored on a client terminal such as a mobile phone or tablet computer, or on an application software server, can also be obtained.
  • realizing multi-channel historical text acquisition can greatly increase the data volume of historical text information, thereby improving the accuracy of subsequent speech recognition.
  • one or more of the above methods for obtaining historical food review text data may be used, or the data may be obtained through other channels based on existing technology; the present invention places no specific restriction on this.
  • the smart refrigerator is configured with an external cache, and at least part of the historical food material review text data is stored in it. As usage time increases, the historical food material review text data grows; storing part of it in the external cache saves the internal storage space of the smart refrigerator, and directly reading the cached data when performing neural network calculations improves algorithm efficiency.
  • in this embodiment, the Redis component is used as the external cache.
  • Redis is a widely used distributed cache system with a key/value storage structure; it can serve as a database, cache, and message queue.
  • other external caches such as Memcached may be used in other embodiments of the present invention; the present invention places no specific limitation on this.
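As a concrete illustration of this external-cache arrangement, the following is a minimal sketch using the redis-py client; the key naming and JSON serialization are illustrative assumptions, not part of the patented method.

```python
# Minimal sketch of caching historical review text in Redis (redis-py).
# Key names and the JSON serialization scheme are illustrative assumptions.
import json

import redis

cache = redis.Redis(host="localhost", port=6379, db=0)

def cache_reviews(device_id: str, reviews: list[str]) -> None:
    # Store the review list under a per-device key so the neural network
    # pipeline can read it back without touching internal storage.
    cache.set(f"reviews:{device_id}", json.dumps(reviews, ensure_ascii=False))

def load_reviews(device_id: str) -> list[str]:
    raw = cache.get(f"reviews:{device_id}")
    return json.loads(raw) if raw else []
```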
  • through steps S11 and S12, real-time voice data containing item information and historical ingredient review text data can be flexibly obtained through multiple channels, which not only improves the user experience but also ensures sufficient data volume and effectively improves algorithm efficiency.
  • step S1 also includes the step of preprocessing the data, which includes:
  • S13 Preprocess the real-time voice data, including: performing frame processing and windowing processing on the real-time voice data.
  • S14 Preprocess the historical text data, including cleaning, annotating, word segmenting, and removing stop words on the speech text data.
  • in step S13, the speech is segmented according to a specified length (a time period or a number of samples) and structured into a processable data structure, completing the framing of the speech and yielding the speech signal data. The speech signal data is then multiplied by a window function so that the originally non-periodic speech signal exhibits some characteristics of a periodic function, completing the windowing process. Furthermore, pre-emphasis can be performed before framing to emphasize the high-frequency part of the speech, eliminating the influence of lip radiation during voicing, compensating for the high-frequency components of the speech signal suppressed by the articulation system, and highlighting the high-frequency formants.
  • steps such as filtering audio noise and enhancing the vocal signal can also be performed to complete the enhancement of the real-time voice data, extract its characteristic parameters, and make the real-time voice data meet the input requirements of the subsequent neural network models.
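The pre-emphasis, framing, and windowing described above can be sketched with NumPy as follows; the frame parameters (25 ms frames, 10 ms shift, 0.97 pre-emphasis) are typical assumed values, not values prescribed by the patent.

```python
# Sketch of pre-emphasis, framing, and Hamming windowing with NumPy.
import numpy as np

def frame_and_window(signal: np.ndarray, sample_rate: int,
                     frame_ms: float = 25.0, shift_ms: float = 10.0,
                     pre_emphasis: float = 0.97) -> np.ndarray:
    # Pre-emphasis boosts the high-frequency part suppressed by the
    # articulation system and highlights high-frequency formants.
    emphasized = np.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])
    frame_len = int(sample_rate * frame_ms / 1000)
    frame_shift = int(sample_rate * shift_ms / 1000)
    # Pad so even a short signal yields at least one full frame.
    emphasized = np.pad(emphasized, (0, max(0, frame_len - len(emphasized))))
    n_frames = 1 + (len(emphasized) - frame_len) // frame_shift
    frames = np.stack([emphasized[i * frame_shift: i * frame_shift + frame_len]
                       for i in range(n_frames)])
    # Multiplying by a window function makes each non-periodic frame
    # exhibit some characteristics of a periodic function.
    return frames * np.hamming(frame_len)
```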
  • in step S14, irrelevant and duplicate data in the historical food material review text data set are deleted, outliers and missing values are handled, and information irrelevant to classification is screened out, completing the cleaning of the historical food review text data. The data is then annotated with category labels using rule- and statistics-based methods, and segmented into words using methods such as string-matching-based, understanding-based, statistics-based, or rule-based word segmentation. Finally, stop words are removed, completing the preprocessing of the historical food review text data so that it meets the input requirements of the subsequent neural network model.
  • for the specific algorithms used in steps S13 and S14 to preprocess the real-time voice data and the historical food review text data, reference can be made to current technology in the field, and they are not described again here.
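A minimal sketch of the cleaning, word segmentation, and stop-word removal pipeline is given below; jieba is one common string-matching-based Chinese segmenter, and the regular expression and stop-word list are illustrative assumptions.

```python
# Sketch of text cleaning, word segmentation, and stop-word removal.
import re

import jieba  # one common string-matching-based segmenter

STOP_WORDS = {"的", "了", "是", "我"}  # placeholder stop-word list

def preprocess_review(text: str) -> list[str]:
    # Cleaning: strip characters irrelevant to classification.
    text = re.sub(r"[^\w\u4e00-\u9fff]+", " ", text)
    tokens = jieba.lcut(text)
    return [t for t in tokens if t.strip() and t not in STOP_WORDS]
```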
  • in step S2, the method specifically includes the following steps:
  • S21: Extract features of the real-time voice data to obtain voice features.
  • S22: Input the voice features into a speech recognition deep neural network model for transcription to obtain a first speech text vector.
  • S23: Input the first speech text vector into a speech recognition convolutional neural network for encoding to obtain a second speech text vector.
  • in step S21, extracting the real-time voice data features specifically includes obtaining their Mel-frequency cepstral coefficient (MFCC) features.
  • specifically, step S21 may include: the preprocessed real-time speech data is subjected to a fast Fourier transform to obtain the energy spectrum of each frame of the real-time speech signal; the energy spectrum is passed through a set of Mel-scale triangular filter banks to smooth the spectrum and eliminate the effect of harmonics, highlighting the formants of the speech; the MFCC features are then obtained through further logarithmic operations and a discrete cosine transform.
  • characteristic parameters such as Perceptual Linear Prediction (PLP) or Linear Predictive Coding (LPC) features of the real-time speech data can also be obtained through different algorithm steps.
  • PLP: Perceptual Linear Prediction.
  • LPC: Linear Predictive Coding.
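The MFCC pipeline above (FFT, Mel filter bank, logarithm, DCT) can be sketched with librosa, which performs those steps internally; the 16 kHz sampling rate and coefficient count are assumptions for illustration.

```python
# Sketch of MFCC extraction; librosa internally performs the STFT,
# Mel filtering, log compression, and discrete cosine transform.
import librosa

def extract_mfcc(wav_path: str, n_mfcc: int = 13):
    signal, sr = librosa.load(wav_path, sr=16000)
    return librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, n_frames)
```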
  • in step S22, the text content of the real-time speech data is transcribed through the pre-trained speech recognition deep neural network to obtain the first speech text vector.
  • speech recognition is completed directly through a deep neural network model.
  • the deep neural network model avoids the assumption that acoustic features must obey an independent and identical distribution, and, unlike the Gaussian mixture hybrid model, its network input is formed by splicing and overlapping several adjacent frames, so it can better utilize context information, obtain more speech feature information, and achieve higher speech recognition accuracy.
  • the algorithm steps involved in step S21 can be incorporated into the deep neural network model to make the overall model structure more balanced.
  • after the first speech text vector is obtained, it is encoded through a speech recognition convolutional neural network. Since the convolutional neural network is translation-invariant in time and space, modeling the acoustic features with a CNN can smooth over the variability of the speech signal and complete the encoding; the resulting second speech text vector contains high-level semantic feature information of the real-time speech data.
  • the real-time speech data can also be transcribed into the speech text data by constructing neural network models of other structures or by using models such as Gaussian mixture models, as long as the real-time speech data can be transcribed into the speech text data.
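The convolutional encoding step can be sketched as a 1-D CNN over the first speech text vector sequence; the use of PyTorch and all dimensions here are illustrative assumptions rather than the patent's prescribed architecture.

```python
# Sketch of encoding the first speech text vector with a CNN to obtain
# the second speech text vector carrying higher-level features.
import torch
import torch.nn as nn

class SpeechTextEncoder(nn.Module):
    def __init__(self, embed_dim: int = 256, hidden_dim: int = 256):
        super().__init__()
        # The convolution's translation invariance smooths over local
        # variability in the speech-derived sequence.
        self.conv = nn.Conv1d(embed_dim, hidden_dim, kernel_size=3, padding=1)
        self.act = nn.ReLU()

    def forward(self, first_vec: torch.Tensor) -> torch.Tensor:
        # first_vec: (batch, seq_len, embed_dim)
        x = first_vec.transpose(1, 2)              # (batch, embed_dim, seq_len)
        return self.act(self.conv(x)).transpose(1, 2)
```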
  • the text transcription and feature extraction of the real-time voice data are thus completed through step S2.
  • in step S3, the method specifically includes:
  • S31: Convert the historical food material review text data into food material review word vectors.
  • S32: Input the food material review word vectors into a bidirectional long short-term memory network model to obtain a food material review context feature vector containing contextual feature information based on the historical food material review text data.
  • in step S31, in order to convert the text data into a vectorized form that a computer can recognize and process, the historical food material review text data can be converted into the food material review word vectors through the Word2Vec algorithm, or through other existing algorithms in the field such as GloVe; the present invention places no specific restriction on this.
  • the bidirectional long short-term memory (BiLSTM) network is composed of a forward long short-term memory (LSTM) network and a backward LSTM network.
  • the LSTM model can better capture long-distance dependencies in text semantics, and the BiLSTM model built on it can better capture the bidirectional semantics of text.
  • the food review word vectors are input into the BiLSTM model respectively; the hidden-layer states representing effective information at each time step are obtained, and the food review context feature vectors carrying contextual information are output.
  • a common recurrent network model in the field such as a Gated Recurrent Unit (GRU) network can also be used to extract contextual feature information, and the present invention does not impose specific limitations on this.
  • GRU: Gated Recurrent Unit.
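Steps S31-S32 can be sketched as an embedding lookup (standing in for Word2Vec-style word vectors) followed by a BiLSTM; the framework and dimensions are simplifying assumptions.

```python
# Sketch of steps S31-S32: word vectors fed through a BiLSTM that
# concatenates forward and backward hidden states at every time step.
import torch
import torch.nn as nn

class ReviewContextEncoder(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 128, hidden_dim: int = 128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)  # word2vec-style lookup
        self.bilstm = nn.LSTM(embed_dim, hidden_dim,
                              batch_first=True, bidirectional=True)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        vectors = self.embedding(token_ids)        # (batch, seq_len, embed_dim)
        context, _ = self.bilstm(vectors)          # (batch, seq_len, 2 * hidden_dim)
        return context
```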
  • the following step may also be added to step S3:
  • S33: Input the second speech text vector into the speech recognition bidirectional long short-term memory network model to obtain a speech text context feature vector containing contextual feature information based on the speech text data.
  • through steps S2 and S3, feature extraction of the voice text data and the historical ingredient review text data is completed separately, different semantic feature information is obtained, and useful text information is extracted, which improves the accuracy of item classification, avoids the loss or filtering of useful information, and improves the performance of the model.
  • after step S3, there is also the step:
  • S3a Based on the attention mechanism model, enhance the text features of the speech text data and the historical ingredient review text data.
  • specifically, step S3a includes:
  • inputting the second speech text vector and the food material review context feature vector into a self-attention mechanism model and a mutual attention mechanism model respectively;
  • obtaining a voice text attention feature vector that includes the weight information of the voice text data itself and the weight information between the voice text data and the historical ingredient review text data;
  • obtaining a food review text attention feature vector that includes the weight information of the historical food review text data itself and the weight information between the historical food review text context feature vector and the voice text data.
  • the attention mechanism can guide the neural network to focus on more critical information and suppress non-critical information; therefore, by introducing the attention mechanism, local key features or weight information of the output text data can be obtained, further reducing irregular error alignment of time series during model training.
  • the input second voice text vector and the food review context feature vector are each given their own weight information through the self-attention mechanism model, thereby obtaining the internal weight information of the text semantic features of the voice text data and the historical food review text data. The mutual attention mechanism model further assigns the second voice text vector and the food review context feature vector their mutual correlation weights, thereby obtaining the association weight information between the voice text data and the historical food review text data.
  • the finally obtained speech text attention feature vector and food review text attention feature vector enhance the importance of different parts of the text semantic feature information, further improving the interpretability of the model.
  • the text feature enhancement of the speech text data and the historical ingredient review text data can also be completed based only on the self-attention mechanism model, or through other algorithm models.
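One way to realize the combined self-attention and mutual-attention enhancement is sketched below with torch.nn.MultiheadAttention; sharing the attention modules across the two sequences and all dimensions are simplifying assumptions, not the patent's prescribed design.

```python
# Sketch of step S3a: self-attention weights each sequence internally;
# mutual (cross) attention weights each sequence against the other.
import torch
import torch.nn as nn

class DualAttention(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, speech: torch.Tensor, review: torch.Tensor):
        # Internal weight information of each sequence itself.
        speech_self, _ = self.self_attn(speech, speech, speech)
        review_self, _ = self.self_attn(review, review, review)
        # Association weight information between the two sequences.
        speech_attn, _ = self.cross_attn(speech_self, review_self, review_self)
        review_attn, _ = self.cross_attn(review_self, speech_self, speech_self)
        return speech_attn, review_attn
```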
  • steps S2, S3, and S3a may also include:
  • obtain the configuration data stored in the external cache, perform deep neural network calculations on the voice text data and the historical ingredient review text data based on that configuration data, and carry out text transcription and extraction of the text features of the voice text data and the historical ingredient review text data.
  • configuring an external cache improves the calculation efficiency of the algorithm and effectively solves the time-response and space-complexity problems caused by the large amount of historical food review text data.
  • the order of the layers of the deep neural network can be adjusted or some layers can be omitted as needed, as long as the text classification of the voice text data and the historical food review text data can be completed.
  • the invention places no specific limitations on this.
  • in step S4, the method specifically includes:
  • the voice text attention feature vector and the food review text attention feature vector are jointly mapped into a unified multi-modal vector space for joint representation to obtain the joint feature vector. The multi-modal joint feature vector integrates the contextual information of the text semantics, useful feature information, high-level features, and the differing importance of useful features; it carries rich semantic feature information and achieves excellent text and speech representation capability.
  • step S4 may also be:
  • the speech text attention feature vector and the food review text attention feature vector are fused to obtain a fusion feature vector.
  • Multi-modal joint feature representation and multi-modal fusion are intended to combine the real-time voice data and the historical food review text to better extract and represent the feature information of both.
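The joint mapping into a unified multi-modal vector space can be sketched as two linear projections into a shared space followed by concatenation; the mean pooling over time and all dimensions are illustrative choices, not the patent's prescribed operations.

```python
# Sketch of step S4: project both attention feature vectors into one
# shared vector space and combine them into the joint feature vector.
import torch
import torch.nn as nn

class JointRepresentation(nn.Module):
    def __init__(self, speech_dim: int = 256, review_dim: int = 256,
                 joint_dim: int = 256):
        super().__init__()
        self.speech_proj = nn.Linear(speech_dim, joint_dim)
        self.review_proj = nn.Linear(review_dim, joint_dim)

    def forward(self, speech_attn: torch.Tensor, review_attn: torch.Tensor):
        # Mean-pool over time, map both modalities into the same space,
        # then concatenate to form the joint feature vector.
        s = self.speech_proj(speech_attn.mean(dim=1))
        r = self.review_proj(review_attn.mean(dim=1))
        return torch.cat([s, r], dim=-1)           # (batch, 2 * joint_dim)
```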
  • in step S5, the method specifically includes:
  • after the joint feature vector is combined through the fully connected layer, it is output to the Softmax function, which calculates the semantic scores of the speech text data and the historical food review text data and their normalized score results, obtaining the classification result information.
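A sketch of the fully connected combination plus Softmax scoring follows, consistent with the joint vector produced in the previous sketch; the layer sizes and the ten-category output are assumptions for illustration.

```python
# Sketch of step S5: a fully connected combination of the joint features
# followed by Softmax normalization of the category scores.
import torch
import torch.nn as nn

classifier = nn.Sequential(
    nn.Linear(512, 128),   # 512 matches 2 * joint_dim from the sketch above
    nn.ReLU(),
    nn.Linear(128, 10),    # 10 hypothetical item categories
)

def classify(joint_vec: torch.Tensor) -> torch.Tensor:
    scores = classifier(joint_vec)                 # raw semantic scores
    return torch.softmax(scores, dim=-1)           # normalized score results
```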
  • through the above steps, the method provided by the present invention sequentially completes the recognition and classification of the acquired voice data; by obtaining historical food material review text data and using it as part of the data set for pre-training and prediction models, text semantic feature information is obtained more comprehensively.
  • the historical ingredient review text data serves as supplementary data that compensates for the limited semantic information in the voice text, effectively improving text classification accuracy and thereby the accuracy of item classification.
  • building a network model that fuses a deep neural network with a convolutional neural network improves the accuracy of real-time speech recognition; building a neural network model that combines a context information mechanism, a self-attention mechanism, and a mutual attention mechanism extracts text semantic feature information more fully.
  • the overall model structure has excellent deep learning representation capability and high accuracy in classifying speech text, greatly improving the accuracy and generalization ability of item category classification.
  • in step S6, the item category information can be output by converting it into voice, and/or converting it into voice and transmitting it to a client terminal, and/or converting it into text, and/or converting it into text and transmitting it to a client terminal.
  • in this embodiment, after the classification result information is obtained through the previous steps and the item category information is determined, the item category information can be converted into voice and broadcast through the sound playback device built into the smart refrigerator, allowing direct voice interaction with the user; or it can be converted into text and displayed directly through the display device configured on the smart refrigerator. Moreover, the voice or text of the item category information can also be transmitted to a client terminal for output.
  • the client terminal is an electronic device with an information receiving function; for example, the voice can be transmitted to a mobile phone, smart speaker, Bluetooth headset, or other device for broadcast, or the classification result text can be transmitted through text messages, e-mails, and the like to client terminals such as mobile phones and tablets, or to application software installed on the client terminal, for the user to review.
  • a multi-channel and multi-type classification result information output method is realized.
  • the user is not limited to only obtaining relevant information near the smart refrigerator.
  • combined with the multi-channel, multi-type real-time voice acquisition methods provided by the present invention, the user can interact with the smart refrigerator remotely and directly obtain relevant information, which is extremely convenient and greatly improves the user experience.
  • one or more of the above classification result information output methods may be used, or the classification result information may be output through other channels based on existing technology; the present invention places no specific restriction on this.
  • in summary, the present invention provides an item classification method based on deep learning: real-time voice data containing item information is obtained through multiple channels; after the real-time voice data is transcribed into text, it is combined with historical ingredient review text data, and text semantic features are fully extracted through the deep neural network model; item category information is then obtained and output through multiple channels. The method significantly improves the accuracy of speech recognition and item category judgment while making interaction more convenient and diverse, greatly improving the user experience.
  • the present invention also provides an item classification device 7 based on deep learning, which includes:
  • a data acquisition module 71, used to acquire real-time voice data and historical text data;
  • a transcription module 72, used to transcribe the real-time voice data into voice text data;
  • a feature extraction module 73, used to extract text features of the voice text data and text features of the historical text data;
  • a joint representation module 74, used to jointly represent the text features of the real-time speech data and the text features of the historical text data to obtain joint features;
  • a result calculation module 75, used to combine the joint features through the fully connected layer, output them to the classifier to calculate scores and obtain the classification result information, and determine the item category information;
  • an output module 76, used to output the item category information.
  • the present invention also provides an electrical device, which includes:
  • a memory, used to store executable instructions;
  • a processor, configured to implement the above deep learning-based item classification method when running the executable instructions stored in the memory.
  • the present invention also provides a refrigerator, which includes:
  • a memory, used to store executable instructions;
  • a processor, configured to implement the above deep learning-based item classification method when running the executable instructions stored in the memory.
  • the present invention also provides a computer-readable storage medium storing executable instructions, characterized in that, when the executable instructions are executed by a processor, the above deep learning-based item classification method is implemented.


Abstract

The present invention provides an item classification method and apparatus based on deep learning. The method comprises the steps of: obtaining real-time speech data containing item information, and obtaining historical text data; transcribing the real-time speech data into speech text data, and extracting text features of the speech text data; extracting text features of the historical text data; jointly representing the real-time speech data text features and the historical text data text features to obtain joint features; combining the joint features via a fully connected layer, outputting them to a classifier, and calculating a score to obtain classification result information and determine item category information; and outputting the item category information.

Description

Item classification method and device based on deep learning
Technical field
The present invention relates to the field of computer technology, and in particular to an item classification method and device based on deep learning.
Background
With the maturing application of speech recognition technology, the following problems are common when applying it to food material content in refrigerator scenarios: the classification accuracy of food material content is low, and the importance information contained in food material reviews is neither combined nor extracted, resulting in a poor food-push experience or even poor pushed content. Therefore, how to use deep learning to build an intelligent voice-based ingredient classification model has become a key technology and solution for improving the refrigerator experience. Moreover, smart refrigerator interaction is inseparable from multi-source heterogeneous data such as voice, text, and images, so how to maximize the use and fusion of the most useful multi-modal data feature information, thereby optimizing the accuracy of intelligent voice ingredient classification and improving the refrigerator user experience, is a problem for which the industry has not yet proposed an effective solution.
Summary of the invention
The purpose of the present invention is to provide an item classification method and device based on deep learning.
The present invention provides an item classification method based on deep learning, including the steps of:
obtaining real-time voice data containing item information, and obtaining historical text data;
transcribing the real-time voice data into voice text data, and extracting text features of the voice text data;
extracting text features of the historical text data;
jointly representing the text features of the real-time voice data and the text features of the historical text data to obtain joint features;
combining the joint features through the fully connected layer, outputting them to the classifier to calculate scores and obtain classification result information, and determining the item category information;
outputting the item category information.
As a further improvement of the present invention, obtaining the historical text data specifically includes:
obtaining historical food material review text data as the historical text data.
As a further improvement of the present invention, transcribing the real-time voice data into voice text data and extracting text features of the voice text data specifically includes:
extracting features of the real-time voice data to obtain voice features;
inputting the voice features into a speech recognition deep neural network model for transcription to obtain a first speech text vector;
inputting the first speech text vector into a speech recognition convolutional neural network for encoding to obtain a second speech text vector.
As a further improvement of the present invention, extracting the features of the real-time voice data specifically includes:
extracting the features of the real-time voice data to obtain their Mel-frequency cepstral coefficient features.
As a further improvement of the present invention, extracting text features of the historical text data specifically includes:
converting the historical food material review text data into food material review word vectors;
inputting the food material review word vectors into a bidirectional long short-term memory network model to obtain a food material review context feature vector containing contextual feature information based on the historical food material review text data.
As a further improvement of the present invention, the method further includes the step of:
enhancing the text features of the voice text data and the historical food material review text data based on an attention mechanism model.
As a further improvement of the present invention, enhancing the text features of the voice text data and the historical food material review text data based on the attention mechanism model specifically includes:
inputting the second speech text vector and the food material review context feature vector into a self-attention mechanism model and a mutual attention mechanism model respectively;
obtaining a voice text attention feature vector that includes the weight information of the voice text data itself and the weight information between the voice text data and the historical food material review text data;
obtaining a food material review text attention feature vector that includes the weight information of the historical food material review text data itself and the weight information between the historical food material review text context feature vector and the voice text data.
As a further improvement of the present invention, jointly representing the text features of the real-time voice data and the text features of the historical text data to obtain a joint feature vector specifically includes:
jointly mapping the voice text attention feature vector and the food material review text attention feature vector into a unified multi-modal vector space for joint representation to obtain the joint feature vector.
As a further improvement of the present invention, combining the text features through the fully connected layer and outputting them to the classifier to calculate scores and obtain classification result information specifically includes:
combining the joint feature vector through the fully connected layer and outputting it to the Softmax function, and calculating the semantic scores of the voice text data and the historical food material review text data and their normalized score results to obtain the classification result information.
As a further improvement of the present invention, obtaining real-time voice data containing item information specifically includes:
obtaining the real-time voice data collected by a voice collection device, and/or
obtaining the real-time voice data transmitted from a client terminal.
As a further improvement of the present invention, obtaining historical food material review text data as the historical text data specifically includes:
obtaining internally stored historical food material review text as the historical food material review text data, and/or
obtaining externally stored historical food material review text as the historical food material review text data, and/or
obtaining historical food material review text transmitted by a client terminal as the historical food material review text data.
As a further improvement of the present invention, the method further includes the steps of:
preprocessing the real-time voice data, including framing and windowing the real-time voice data; and
preprocessing the historical text data, including cleaning, annotating, word-segmenting, and removing stop words from the speech text data.
As a further improvement of the present invention, outputting the item category information includes:
converting the item category information into voice for output, and/or
converting the item category information into voice and transmitting it to a client terminal for output, and/or
converting the item category information into text for output, and/or
converting the item category information into text and transmitting it to a client terminal for output.
As a further improvement of the present invention, transcribing the real-time voice data into voice text data, extracting text features of the voice text data, and extracting text features of the historical text data further includes:
obtaining configuration data stored in an external cache, performing deep neural network calculations on the real-time voice data and the historical food material review text data based on the configuration data, and carrying out text transcription and text feature extraction.
The present invention also provides an item classification device based on deep learning, including:
a data acquisition module, used to acquire real-time voice data and historical text data;
a transcription module, used to transcribe the real-time voice data into voice text data;
a feature extraction module, used to extract text features of the voice text data and text features of the historical text data;
a joint representation module, used to jointly represent the text features of the real-time voice data and the text features of the historical text data to obtain joint features;
a result calculation module, used to combine the joint features through the fully connected layer, output them to the classifier to calculate scores and obtain classification result information, and determine the item category information;
an output module, used to output the item category information.
The beneficial effects of the present invention are as follows: the method provided by the present invention completes the recognition and classification of the acquired voice data, and by obtaining historical food material review text data and using it as part of the data set for pre-training and prediction models, text semantic feature information is obtained more comprehensively. By comprehensively using the voice text data and the historical food material review text data, with the latter serving as supplementary data, the limited semantic information of the voice text is compensated for, effectively improving text classification accuracy and thereby the accuracy of item classification. Moreover, building a network model that fuses a deep neural network with a convolutional neural network improves the accuracy of real-time speech recognition, and building a neural network model that combines a context information mechanism, a self-attention mechanism, and a mutual attention mechanism extracts text semantic feature information more fully. The overall model structure has excellent deep learning representation capability, high speech recognition accuracy, and high accuracy in classifying speech text, greatly improving the accuracy and generalization ability of item category classification.
Description of the drawings
Figure 1 is a structural block diagram of the model involved in the item classification method based on deep learning in an embodiment of the present invention.
Figure 2 is a schematic diagram of the steps of the item classification method based on deep learning in an embodiment of the present invention.
Figure 3 is a schematic diagram of the steps of obtaining real-time voice data and obtaining historical text data in an embodiment of the present invention.
Figure 4 is a schematic diagram of the steps of transcribing the real-time voice data into voice text data and extracting text features of the voice text data in an embodiment of the present invention.
Figure 5 is a schematic diagram of the steps of extracting text features of the historical text data in an embodiment of the present invention.
Figure 6 is a schematic structural diagram of the item classification device based on deep learning in an embodiment of the present invention.
Detailed description
To make the purpose, technical solutions, and advantages of the present application clearer, the technical solutions of the present application are described clearly and completely below in conjunction with specific embodiments of the present application and the corresponding drawings. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the scope of protection of the present application.
Embodiments of the present invention are described in detail below; examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements with the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they are only used to explain the present invention and are not to be construed as limiting it.
As shown in Figure 1, a structural block diagram of the model involved in the item classification method based on deep learning provided by the present invention, and as shown in Figure 2, a schematic diagram of the steps of the method, the method includes:
S1: Obtain real-time voice data containing item information, and obtain historical text data.
S2: Transcribe the real-time voice data into voice text data, and extract text features of the voice text data.
S3: Extract text features of the historical text data.
S4: Jointly represent the text features of the real-time voice data and the text features of the historical text data to obtain joint features.
S5: Combine the joint features through the fully connected layer, then output them to the classifier to calculate scores and obtain classification result information.
S6: Output the item category information.
The method provided by the present invention enables an intelligent electronic device to implement functions such as real-time interaction or message push with the user based on the user's real-time voice input. Illustratively, in this embodiment, a smart refrigerator is taken as an example, and the method is explained in combination with a pre-trained deep learning model. Based on the user's voice input, the smart refrigerator classifies the text content corresponding to the user's voice, judges the category of items involved in the voice, and pushes relevant classification information based on the item classification results. Further, in this embodiment, the classification of food materials in a smart refrigerator is taken as an example; in other embodiments, the method provided by the present invention can also be applied to classifying other items that need to be stored in the refrigerator, such as medicines and cosmetics.
As shown in Figure 3, step S1 specifically includes:
S11: Obtain the real-time voice data collected by a voice collection device, and/or
obtain the real-time voice data transmitted from a client terminal.
S12: Obtain internally stored historical food material review text as the historical food material review text data, and/or
obtain externally stored historical food material review text as the historical food material review text data, and/or
obtain historical food material review text transmitted by a client terminal as the historical food material review text data.
The real-time voice mentioned here refers to inquiry or command sentences currently spoken by the user to the intelligent electronic device or to a client terminal device communicatively connected to it. In this embodiment, the real-time voice is a sentence containing relevant information such as the category of items stored in the smart refrigerator; the user can ask questions such as "What vegetables are in the refrigerator today" or "What beef ingredients are in the refrigerator today", or issue commands such as "remind me of the types of beverages left in the refrigerator". Based on this information, the processor of the smart refrigerator judges the relevant item categories through the method provided by the present invention, and then interacts with the user by real-time voice or pushes relevant information.
具体的,在本实施方式中,获取历史文本数据包括:Specifically, in this implementation, obtaining historical text data includes:
获取历史食材评论文本数据作为所述历史文本数据。Obtain historical food ingredient review text data as the historical text data.
这里所述的历史食材评论文本数据指的是以往使用过程中用户对食材进行的评论所转写的文本，如“今天放进去的辣椒很辣”“昨天放入的某种品牌的酸奶很好喝”等，进一步的，其还可包括用户直接自行输入的相关食材评论文本数据等。所述历史食材评论文本通常会包含用户感兴趣的物品信息，选择其作为所述历史文本数据，能够对物品类别等信息做出有效补充。The historical food review text data mentioned here refers to the transcribed text of comments the user made on food materials during past use, such as "the chili I put in today is very spicy" or "the yogurt of a certain brand I put in yesterday tastes great"; further, it may also include relevant food review text data directly input by the user. The historical food review text usually contains information on items the user is interested in, so selecting it as the historical text data can effectively supplement information such as item categories.
在本发明的其他实施方式中,也可获取诸如以往用户提问或发出指令后,相关问题和指令所转写成的文本、或以往使用过程中用户依据放入的物品发出的说明性语音所转写的文本等其他历史文本数据,具体在此不再赘述。In other embodiments of the present invention, it is also possible to obtain texts such as the transcribed text of relevant questions and instructions after the user asked questions or issued instructions in the past, or the explanatory voice that the user issued based on the items he put in during the previous use. The text and other historical text data will not be described in detail here.
如步骤S11所述，在本实施方式中，可通过设置于智能冰箱内的拾音器、麦克风阵列等语音采集装置采集用户实时语音，在使用过程中，当用户需要与智能冰箱进行交互时，直接对智能冰箱发出语音即可。并且，也可通过与智能冰箱基于无线通信协议连接的客户终端获取传输而来的用户实时语音，客户终端为具有信息发送功能的电子设备，如手机、平板电脑、智能音响、智能手环或蓝牙耳机等智能电子设备，在使用过程中，用户直接对客户终端发出语音，客户终端采集语音后通过wifi或蓝牙等无线通信方式传输至智能冰箱。从而实现多渠道的实时语音获取方式，并不局限于必须面向智能冰箱发出语音。当用户有交互需求时，通过任意便捷渠道发出实时语音即可，从而能够显著提高用户的使用便捷度。在本发明的其他实施方式中，也可采用上述实时语音获取方法中一种或任意多种，或者也可基于现有技术通过其他渠道获取所述实时语音，本发明对此不作具体限制。As described in step S11, in this embodiment, the user's real-time voice can be collected through voice collection devices such as pickups and microphone arrays installed in the smart refrigerator; during use, when the user needs to interact with the smart refrigerator, the user simply speaks to the smart refrigerator directly. In addition, the user's real-time voice can also be obtained through a client terminal connected to the smart refrigerator via a wireless communication protocol. The client terminal is an electronic device with an information sending function, such as a mobile phone, tablet computer, smart speaker, smart bracelet or Bluetooth headset; during use, the user speaks directly to the client terminal, which collects the voice and transmits it to the smart refrigerator through wireless communication such as Wi-Fi or Bluetooth. This enables multi-channel real-time voice acquisition, not limited to speaking toward the smart refrigerator: when the user needs to interact, real-time voice can be sent through any convenient channel, which significantly improves user convenience. In other embodiments of the present invention, one or more of the above real-time voice acquisition methods may be used, or the real-time voice may be acquired through other channels based on the existing technology; the present invention does not impose specific limitations on this.
如步骤S12所述，在本实施方式中，可通过读取智能冰箱的内部存储器所存储的历史食材评论文本来获取所述历史食材评论文本数据。并且，也可通过读取智能冰箱配置的外部存储装置所存储的历史食材评论文本来获取所述历史食材评论文本数据，外部存储装置为诸如U盘、SD卡等移动存储设备，通过设置外部存储装置可进一步拓展智能冰箱的存储空间。并且，也可通过获取存储在诸如手机、平板电脑等客户终端或应用软件服务器端等处的所述历史食材评论文本数据。实现多渠道的历史文本获取渠道，能够大幅提高历史文本信息的数据量，从而提高后续语音识别的准确度。在本发明的其他实施方式中，也可采用上述历史食材评论文本数据获取方法中的一种或任意多种，或者也可基于现有技术通过其他渠道获取所述历史食材评论文本数据，本发明对此不作具体限制。As described in step S12, in this embodiment, the historical food review text data can be obtained by reading the historical food review texts stored in the internal memory of the smart refrigerator. The historical food review text data can also be obtained by reading the historical food review texts stored in an external storage device configured for the smart refrigerator; the external storage device is a removable storage device such as a USB flash drive or SD card, and configuring an external storage device can further expand the storage space of the smart refrigerator. In addition, the historical food review text data stored on a client terminal such as a mobile phone or tablet computer, or on an application software server, can also be obtained. Implementing multi-channel historical text acquisition can greatly increase the data volume of historical text information, thereby improving the accuracy of subsequent speech recognition. In other embodiments of the present invention, one or more of the above methods for obtaining historical food review text data may be used, or the historical food review text data may be obtained through other channels based on the existing technology; the present invention does not impose specific limitations on this.
进一步的，在本实施方式中，智能冰箱配置有外部缓存，至少有部分所述历史食材评论文本数据被储存在所述外部缓存中，随着使用时间增加，历史食材评论文本数据增多，通过将部分数据存储在外部缓存中，能够节省智能冰箱内部存储空间，并且在进行神经网络计算时，直接读取存储于外部缓存中的所述历史食材评论文本数据，能够提高算法效率。Further, in this embodiment, the smart refrigerator is configured with an external cache, and at least part of the historical food review text data is stored in the external cache. As the time of use increases, the historical food review text data grows; storing part of the data in the external cache saves internal storage space of the smart refrigerator, and when performing neural network computations, directly reading the historical food review text data stored in the external cache can improve algorithm efficiency.
具体的，在本实施方式中，采用Redis组件作为所述外部缓存，Redis组件为当前一种使用较为广泛的key/value存储结构的分布式缓存系统，其可用作数据库、高速缓存和消息队列代理。在本发明的其他实施方式中也可采用诸如Memcached等其他外部缓存，本发明对此不作具体限制。Specifically, in this embodiment, a Redis component is used as the external cache. Redis is a currently widely used distributed caching system with a key/value storage structure, which can serve as a database, cache and message-queue broker. Other external caches such as Memcached may also be used in other embodiments of the present invention, and the present invention places no specific limitations on this.
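Illustratively, a minimal sketch of caching and reading back historical review texts through Redis is given below. The connection parameters, the key name history:reviews and the sample reviews are illustrative assumptions only, not details prescribed by the present embodiment.

```python
import redis

# Connect to a local Redis instance used as the external cache
# (host/port/db are assumed defaults, not values from the embodiment).
r = redis.Redis(host="localhost", port=6379, db=0, decode_responses=True)

def cache_reviews(reviews):
    # Append review texts to a Redis list so that later neural-network
    # computations can read them without touching internal storage.
    if reviews:
        r.rpush("history:reviews", *reviews)

def load_reviews():
    # Read the whole cached list back for preprocessing.
    return r.lrange("history:reviews", 0, -1)

cache_reviews(["今天放进去的辣椒很辣", "昨天放入的酸奶很好喝"])
print(load_reviews())
```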
综上所述，在步骤S11和步骤S12中，能够通过多渠道灵活获取包含物品信息的实时语音数据和历史食材评论文本数据，在提升了用户体验的同时，保证了数据量，并有效提升了算法效率。To sum up, in steps S11 and S12, real-time voice data containing item information and historical food review text data can be flexibly obtained through multiple channels, which improves the user experience while ensuring the amount of data and effectively improving algorithm efficiency.
进一步的,步骤S1还包括对数据进行预处理的步骤,其包括:Further, step S1 also includes the step of preprocessing the data, which includes:
S13:对所述实时语音数据进行预处理,包括:对所述实时语音数据进行分帧处理和加窗处理。S13: Preprocess the real-time voice data, including: performing frame processing and windowing processing on the real-time voice data.
S14:对所述历史文本数据进行预处理，包括：对所述历史文本数据进行清洗处理、标注、分词、去停用词。S14: Preprocess the historical text data, including: cleaning, annotating, word segmentation, and stop-word removal on the historical text data.
具体的，在步骤S13中，将语音根据指定的长度（时间段或者采样数）进行分段，结构化为可编程的数据结构，完成对语音的分帧处理得到语音信号数据。接着，将语音信号数据与一个窗函数相乘，使原本没有周期性的语音信号呈现出周期函数的部分特征，完成加窗处理。进一步的，还可在分帧处理之前进行预加重处理，对语音的高频部分进行加重，以消除发声过程中口唇辐射的影响，从而补偿语音信号受到发音系统所压抑的高频部分，并能突显高频的共振峰。并且，在加窗处理之后还可进行过滤音频噪音点处理和增强人声处理等步骤，从而完成对所述实时语音数据的加强，提取得到所述实时语音的特征参数，使所述实时语音数据符合后续神经网络模型的输入要求。Specifically, in step S13, the speech is segmented according to a specified length (time period or number of samples) and structured into a programmable data structure, completing the framing of the speech to obtain speech signal data. Then, the speech signal data is multiplied by a window function, so that the originally non-periodic speech signal exhibits some characteristics of a periodic function, completing the windowing process. Further, pre-emphasis processing can be performed before framing to emphasize the high-frequency part of the speech, eliminating the influence of lip radiation during voicing, thereby compensating for the high-frequency part of the speech signal suppressed by the articulation system and highlighting the high-frequency formants. In addition, after windowing, steps such as filtering audio noise points and enhancing the human voice can be performed, thereby completing the enhancement of the real-time voice data, extracting the characteristic parameters of the real-time voice, and making the real-time voice data meet the input requirements of the subsequent neural network models.
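Illustratively, a minimal sketch of the pre-emphasis, framing and windowing described above is given below, assuming a 16 kHz mono signal; the frame length (25 ms), hop (10 ms) and pre-emphasis coefficient are common illustrative values rather than parameters fixed by the present embodiment.

```python
import numpy as np

def preprocess_speech(signal, frame_len=400, hop=160, alpha=0.97):
    # Pre-emphasis: boost high frequencies suppressed by the articulation system.
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # Framing: split into overlapping fixed-length segments.
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop)
    frames = np.stack([emphasized[i * hop: i * hop + frame_len]
                       for i in range(n_frames)])
    # Windowing: multiply each frame by a Hamming window so the
    # non-periodic signal behaves locally like a periodic one.
    return frames * np.hamming(frame_len)

frames = preprocess_speech(np.random.randn(16000))  # one second at 16 kHz
print(frames.shape)  # (98, 400)
```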
具体的，在步骤S14中，删除历史食材评论文本数据集中的无关数据、重复数据以及处理异常值和缺失值数据等，初步筛选掉与分类无关的信息，对所述历史食材评论文本数据进行清洗处理。接着，基于规则统计的方法等对所述历史食材评论文本数据进行类别标签标注，以及基于字符串匹配的分词方法、基于理解的分词方法、基于统计的分词方法和基于规则的分词方法等对所述历史食材评论文本数据进行分词处理。之后，去除停用词，完成对所述历史食材评论文本数据的预处理，从而使所述历史食材评论文本数据符合后续神经网络模型的输入要求。Specifically, in step S14, irrelevant data and duplicate data in the historical food review text data set are deleted, and outliers and missing values are handled, initially screening out information irrelevant to classification and thus cleaning the historical food review text data. Next, the historical food review text data is annotated with category labels based on rule-statistics methods and the like, and is segmented into words using word segmentation methods based on string matching, understanding, statistics or rules. Afterwards, stop words are removed, completing the preprocessing of the historical food review text data so that it meets the input requirements of the subsequent neural network models.
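Illustratively, a minimal sketch of the cleaning, word segmentation and stop-word removal for one review sentence is given below, using the jieba segmenter as one string-matching/statistical segmentation option; the tiny stop-word list is an illustrative assumption, not the list used in the present embodiment.

```python
import jieba

STOPWORDS = {"的", "了", "很"}  # illustrative stop-word list

def preprocess_review(text):
    # Cleaning: normalize whitespace; dropping duplicates, outliers and
    # missing values is assumed to happen upstream on the whole data set.
    text = text.strip()
    # Word segmentation with jieba (string-matching + statistical).
    tokens = jieba.lcut(text)
    # Stop-word removal keeps only classification-relevant tokens.
    return [t for t in tokens if t not in STOPWORDS and t.strip()]

print(preprocess_review("今天放进去的辣椒很辣"))
```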
在步骤S13和步骤S14中，对所述实时语音数据和所述历史食材评论文本数据预处理所采用的具体算法可参考当前本领域现有技术，具体在此不再赘述。In steps S13 and S14, the specific algorithms used to preprocess the real-time voice data and the historical food review text data can be found in the existing technology in the field, and will not be described in detail here.
如图4所示,在步骤S2中,其具体包括步骤:As shown in Figure 4, in step S2, it specifically includes the following steps:
S21:提取所述实时语音数据特征,得到语音特征。S21: Extract the real-time voice data features to obtain voice features.
S22:将所述语音特征输入语音识别深度神经网络模型转写得到第一语音文本向量。S22: Enter the speech feature into the speech recognition deep neural network model and transcribe it to obtain the first speech text vector.
S23:将所述第一语音文本向量输入语音识别卷积神经网络进行编码,得到第二语音文本向量。S23: Input the first speech text vector into a speech recognition convolutional neural network for encoding to obtain a second speech text vector.
在步骤S21中,提取所述实时语音数据特征具体包括:In step S21, extracting the real-time voice data features specifically includes:
提取所述实时语音数据特征，获取其梅尔频率倒谱系数特征(Mel-scale Frequency Cepstral Coefficients，简称MFCC)。MFCC是一种语音信号中具有辨识性的成分，是在Mel标度频率域提取出来的倒谱参数，其中，Mel标度描述了人耳频率的非线性特性，MFCC的参数考虑到了人耳对不同频率的感受程度，特别适用于语音辨别和语者辨识。The characteristics of the real-time speech data are extracted to obtain their Mel-scale Frequency Cepstral Coefficients (MFCC). MFCC is a discriminative component of the speech signal, a cepstral parameter extracted in the Mel-scale frequency domain, where the Mel scale describes the nonlinear characteristics of human auditory frequency perception; MFCC parameters take into account the human ear's sensitivity to different frequencies and are especially suitable for speech recognition and speaker identification.
示例性的,步骤S21可包括:For example, step S21 may include:
将预处理后的所述实时语音数据经过快速傅里叶变换后得到各帧实时语音数据信号的能量谱，并将能量谱通过一组Mel尺度的三角形滤波器组来对频谱进行平滑化，消除谐波的作用，突显实时语音的共振峰，之后再进一步通过对数运算和离散余弦变换后得到MFCC系数特征。The preprocessed real-time speech data is subjected to a fast Fourier transform to obtain the energy spectrum of each frame of the real-time speech signal, and the energy spectrum is passed through a set of Mel-scale triangular filter banks to smooth the spectrum, eliminate the effect of harmonics and highlight the formants of the real-time speech; the MFCC coefficient features are then obtained through further logarithmic operations and a discrete cosine transform.
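Illustratively, the FFT, Mel filter bank, logarithm and DCT chain above is the standard MFCC pipeline, so a minimal sketch can delegate to librosa; the file name, sampling rate and frame parameters below are illustrative assumptions.

```python
import librosa

# Load one utterance; "utterance.wav" is a placeholder file name.
y, sr = librosa.load("utterance.wav", sr=16000)
# librosa applies the STFT, Mel-scale triangular filter bank, log and DCT
# internally and returns one MFCC vector per frame (25 ms window, 10 ms hop).
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=400, hop_length=160)
print(mfcc.shape)  # (13, n_frames)
```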
在本发明的其他实施方式中，也可通过不同算法步骤获取所述实时语音数据的感知线性预测特征(Perceptual Linear Predictive，简称PLP)或线性预测系数特征(Linear Predictive Coding，简称LPC)等特征参数来取代MFCC特征，具体可基于实际模型参数和本方法实际应用的领域而进行具体选择，本发明对此不做具体限制。In other embodiments of the present invention, characteristic parameters such as Perceptual Linear Predictive (PLP) features or Linear Predictive Coding (LPC) features of the real-time speech data can also be obtained through different algorithm steps to replace the MFCC features; the specific selection can be made based on the actual model parameters and the field in which this method is actually applied, and the present invention does not impose specific restrictions on this.
上述步骤中所涉及的具体的算法步骤可参考当前本领域现有技术,具体在此不再赘述。For the specific algorithm steps involved in the above steps, reference can be made to the current state of the art in the field, and details will not be described again here.
在步骤S22中,通过预先训练的所述语音识别深度神经网络完成对所述实时语音数据的文本内容转写,得到所述第一语音文本向量。In step S22, the text content of the real-time speech data is transcribed through the pre-trained speech recognition deep neural network to obtain the first speech text vector.
在本实施方式中，直接通过深度神经网络模型来完成语音识别，相比于现有技术中常用的高斯混合模型等模型，深度神经网络模型避免了声学特征需要服从独立同分布的假设，与高斯混合模型中的网络输入不同，深度神经网络模型的输入由相邻的若干帧拼接重叠得到，从而能够更好地利用上下文的信息，获取更多语音特征信息，具有更高的语音识别精度。In this embodiment, speech recognition is completed directly through a deep neural network model. Compared with models such as the Gaussian mixture model commonly used in the prior art, the deep neural network model avoids the assumption that acoustic features must obey an independent and identical distribution. Unlike the network input of a Gaussian mixture model, the input of the deep neural network model is obtained by splicing and overlapping several adjacent frames, so that it can better utilize context information, obtain more speech feature information, and achieve higher speech recognition accuracy.
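Illustratively, a minimal sketch of the frame splicing described above is given below: each frame is stacked with its neighbouring frames to form the network input; the context width of ±5 frames is an illustrative assumption.

```python
import numpy as np

def splice_frames(features, context=5):
    # Stack each frame with its +/- context neighbours so the DNN input
    # carries local acoustic context, padding at the edges by repetition.
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    return np.stack([padded[i:i + 2 * context + 1].reshape(-1)
                     for i in range(len(features))])

feats = np.random.randn(100, 13)   # 100 frames of 13-dim MFCCs
spliced = splice_frames(feats)     # each row is 11 * 13 = 143 dims
print(spliced.shape)               # (100, 143)
```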
进一步的,在本实施方式中,步骤S21所涉及算法步骤可以结合在所述深度神经网络模型中,以使整体模型结构更加均衡。Furthermore, in this embodiment, the algorithm steps involved in step S21 can be combined into the deep neural network model to make the overall model structure more balanced.
在得到所述第一语音文本向量后，通过语音识别卷积神经网络对其进行编码，由于卷积神经网络在时间和空间上具有平移不变性，所以基于CNN对语音识别的声学特征进行建模，能够消除语音信号的多样性，完成对其的编码工作，最终得到的所述第二语音文本向量包含实时语音数据的高层特征语义信息。After the first speech text vector is obtained, it is encoded by the speech recognition convolutional neural network. Since the convolutional neural network has translation invariance in time and space, modeling the acoustic features of speech recognition based on the CNN can eliminate the diversity of the speech signal and complete its encoding; the second speech text vector finally obtained contains high-level semantic feature information of the real-time speech data.
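Illustratively, a minimal PyTorch sketch of a convolutional encoder over the first speech text vectors is given below; the layer sizes and depth are illustrative assumptions and do not reproduce the exact network of the present embodiment.

```python
import torch
import torch.nn as nn

class SpeechTextEncoder(nn.Module):
    # Convolutional encoding of the first speech text vector sequence;
    # dimensions are illustrative, not taken from the patent.
    def __init__(self, dim=256, hidden=256, kernel=3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(dim, hidden, kernel, padding=kernel // 2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel, padding=kernel // 2), nn.ReLU(),
        )

    def forward(self, x):                 # x: (batch, seq_len, dim)
        h = self.conv(x.transpose(1, 2))  # Conv1d expects (batch, dim, seq_len)
        return h.transpose(1, 2)          # second speech-text vector sequence

enc = SpeechTextEncoder()
print(enc(torch.randn(2, 50, 256)).shape)  # (2, 50, 256)
```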
在本发明的其他实施方式中，也可通过构建其他结构神经网络模型或者通过高斯混合模型等模型来将所述实时语音数据转写为所述语音文本数据，只要能够将所述实时语音数据转写为所述语音文本数据即可。In other embodiments of the present invention, the real-time speech data can also be transcribed into the speech text data by constructing neural network models of other structures or by using models such as Gaussian mixture models, as long as the real-time speech data can be transcribed into the speech text data.
综上所述,通过步骤S2完成了对所述实时语音数据的文本转写及特征提取。To sum up, the text transcription and feature extraction of the real-time voice data are completed through step S2.
如图5所示,在步骤S3中,其具体包括:As shown in Figure 5, in step S3, it specifically includes:
S31:将所述历史食材评论文本数据转化为食材评论词向量。S31: Convert the historical food material review text data into food material review word vectors.
S32:将所述食材评论词向量输入双向长短记忆网络模型,获取包含基于所述历史食材评论文本数据上下文特征信息的食材评论上下文特征向量。S32: Input the food material review word vector into a two-way long and short memory network model to obtain an food material review context feature vector containing contextual feature information based on the historical food material review text data.
在步骤S31中，为了将文本数据转化为计算机能够识别和处理的向量化形式，可通过Word2Vec算法，将所述历史食材评论文本数据转化为所述食材评论词向量，或者也可通过其他诸如Glove算法等本领域现有算法转化得到所述词向量，本发明对此不做具体限制。In step S31, in order to convert the text data into a vectorized form that a computer can recognize and process, the historical food review text data can be converted into the food review word vectors through the Word2Vec algorithm, or the word vectors can be obtained through other existing algorithms in the field such as the GloVe algorithm; the present invention does not impose specific restrictions on this.
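Illustratively, a minimal sketch of training review word vectors with the gensim implementation of Word2Vec is given below; the toy corpus and the hyperparameters (vector_size, window, min_count) are illustrative assumptions.

```python
from gensim.models import Word2Vec

# Tokenized historical review sentences (output of the segmentation step).
sentences = [["辣椒", "很辣"], ["酸奶", "好喝"]]
# Train skip-gram Word2Vec; hyperparameters are illustrative only.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)
vec = model.wv["辣椒"]  # 100-dimensional food review word vector
print(vec.shape)
```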
在步骤S32中，双向长短记忆网络(Bi-directional Long Short-Term Memory，简写BiLSTM)由前向长短记忆网络(Long Short-Term Memory，简写LSTM)和后向长短记忆网络组合而成，LSTM模型能够更好地获取文本语义长距离的依赖关系，而在其基础上，BiLSTM模型能更好地获取文本双向语义。将多个所述食材评论词向量分别输入BiLSTM模型中，经过前向LSTM和后向LSTM后，得到每个时间步输出的表示有效信息的隐藏层状态，输出带有语境上下文信息的所述食材评论上下文特征向量。In step S32, the Bi-directional Long Short-Term Memory network (BiLSTM) is composed of a forward Long Short-Term Memory network (LSTM) and a backward LSTM. The LSTM model can better capture long-distance dependencies in text semantics, and on this basis the BiLSTM model can better capture bidirectional text semantics. The multiple food review word vectors are respectively input into the BiLSTM model; after the forward LSTM and the backward LSTM, the hidden-layer states representing effective information at each time step are obtained, and the food review context feature vectors carrying contextual information are output.
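Illustratively, a minimal PyTorch sketch of the BiLSTM context encoder is given below; the embedding and hidden sizes are illustrative assumptions. The per-time-step outputs concatenate the forward and backward hidden states, corresponding to the context feature vectors described above.

```python
import torch
import torch.nn as nn

class ReviewContextEncoder(nn.Module):
    # Bidirectional LSTM over review word vectors; the concatenated forward
    # and backward hidden states carry the context feature information.
    def __init__(self, emb_dim=100, hidden=128):
        super().__init__()
        self.bilstm = nn.LSTM(emb_dim, hidden,
                              batch_first=True, bidirectional=True)

    def forward(self, word_vectors):       # (batch, seq_len, emb_dim)
        outputs, _ = self.bilstm(word_vectors)
        return outputs                     # (batch, seq_len, 2 * hidden)

enc = ReviewContextEncoder()
print(enc(torch.randn(2, 20, 100)).shape)  # (2, 20, 256)
```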
在本发明的其他实施方式中,也可采用诸如门控循环单元(Gated Recurrent Unit,简写GRU)网络等本领域常见的循环网络模型来提取上下文特征信息,本发明对此不作具体限制。In other embodiments of the present invention, a common recurrent network model in the field such as a Gated Recurrent Unit (GRU) network can also be used to extract contextual feature information, and the present invention does not impose specific limitations on this.
在本发明的另一些实施方式中,也可在步骤S3中增加步骤:In other embodiments of the present invention, steps may also be added to step S3:
S33:将所述第二语音文本向量输入语音识别双向长短记忆网络模型,获取包含基于所述语音文本数据上下文特征信息的语音文本上下文特征向量。S33: Input the second speech text vector into the speech recognition bidirectional long and short memory network model, and obtain a speech text context feature vector containing context feature information based on the speech text data.
从而进一步增加所述语音文本数据的上下文特征信息，但基于整体模型结构考虑，在本实施方式中，不增加语音识别双向长短记忆网络模型，从而使得整体模型结构更加对称和均衡，本领域技术人员可依据实际模型结构对模型层数进行具体调整，本发明对此不做具体限制。This further increases the contextual feature information of the speech text data. However, in consideration of the overall model structure, in this embodiment the speech recognition bidirectional long short-term memory network model is not added, so that the overall model structure is more symmetrical and balanced. Those skilled in the art can make specific adjustments to the number of model layers based on the actual model structure, and the present invention does not impose specific restrictions on this.
从而，通过步骤S2和S3分别完成了对所述语音文本数据和所述历史食材评论文本数据的特征提取，分别得到了不同的语义特征信息并进而提取了有用的文本信息，提升了物品分类的准确性，避免有用信息的丢失或过滤，提升了模型的性能。Thus, through steps S2 and S3, feature extraction of the speech text data and the historical food review text data is completed respectively, different semantic feature information is obtained, and useful text information is extracted, which improves the accuracy of item classification, avoids the loss or filtering of useful information, and improves the performance of the model.
进一步的,在本发明一些实施方式中,在步骤S3之后,还包括步骤:Further, in some embodiments of the present invention, after step S3, there are also steps:
S3a:基于注意力机制模型,增强所述语音文本数据和所述历史食材评论文本数据的文本特征。S3a: Based on the attention mechanism model, enhance the text features of the speech text data and the historical ingredient review text data.
具体的,步骤S3a包括:Specifically, step S3a includes:
分别将所述第二语音文本向量和所述食材评论上下文特征向量输入自注意力机制模型和互注意力机制模型；Input the second speech text vector and the food review context feature vector into the self-attention mechanism model and the mutual-attention mechanism model respectively;
获取包含所述语音文本数据自身权重信息以及所述语音文本数据与所述历史食材评论文本数据之间权重信息的语音文本注意力特征向量;Obtain a voice text attention feature vector that includes the weight information of the voice text data itself and the weight information between the voice text data and the historical ingredient review text data;
获取包含所述历史食材评论文本数据自身权重信息以及所述历史食材评论文本数据上下文特征向量与所述语音文本数据之间权重信息的食材评论文本注意力特征向量。Obtain a food review text attention feature vector that includes the weight information of the historical food review text data itself and the weight information between the context feature vector of the historical food review text data and the voice text data.
注意力机制可以引导神经网络去关注更为关键的信息而抑制其他非关键的信息，因此，通过引入注意力机制，能够得到所述输出文本数据的局部关键特征或权重信息，从而进一步减少模型训练时序列的不规则误差对齐现象。The attention mechanism can guide the neural network to focus on more critical information and suppress other non-critical information. Therefore, by introducing the attention mechanism, the local key features or weight information of the output text data can be obtained, thereby further reducing the irregular error alignment of sequences during model training.
这里，通过自注意力机制模型将输入的所述第二语音文本向量和所述食材评论上下文特征向量赋予其自身权重信息，从而获得所述语音文本数据和所述历史食材评论文本数据文本语义特征的内部权重信息。并进一步通过互注意力机制模型将输入的所述第二语音文本向量和所述食材评论上下文特征向量赋予其相互之间的关联权重信息，从而获得所述语音文本数据和所述历史食材评论文本数据之间的关联权重信息。最终得到的所述语音文本注意力特征向量和所述食材评论文本注意力特征向量增强了文本语义特征信息不同部分的重要性，使得模型的可解释性进一步优化。Here, the input second speech text vector and the food review context feature vector are each given their own weight information through the self-attention mechanism model, thereby obtaining the internal weight information of the text semantic features of the speech text data and the historical food review text data. Further, through the mutual-attention mechanism model, the input second speech text vector and the food review context feature vector are given mutual association weight information, thereby obtaining the association weight information between the speech text data and the historical food review text data. The speech text attention feature vector and the food review text attention feature vector finally obtained enhance the importance of different parts of the text semantic feature information, further improving the interpretability of the model.
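Illustratively, a minimal PyTorch sketch of the self-attention and mutual (cross) attention computation is given below, using nn.MultiheadAttention for both; the dimensions and head count are illustrative assumptions, and sharing one cross-attention module for both directions is a simplification of this sketch rather than a detail of the present embodiment.

```python
import torch
import torch.nn as nn

dim = 256
self_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

speech = torch.randn(2, 50, dim)  # second speech text vectors
review = torch.randn(2, 20, dim)  # food review context feature vectors

# Self-attention: each modality weights its own tokens (internal weights).
speech_self, _ = self_attn(speech, speech, speech)
# Mutual attention: speech queries attend over review keys/values and vice
# versa, yielding the inter-modality association weight information.
speech_attn, _ = cross_attn(speech_self, review, review)
review_attn, _ = cross_attn(review, speech, speech)
print(speech_attn.shape, review_attn.shape)
```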
在本发明的其他实施方式中,也可仅基于自注意力机制模型,或通过其他算法模型完成对所述语音文本数据和所述历史食材评论文本数据的文本特征增强。In other embodiments of the present invention, the text feature enhancement of the speech text data and the historical ingredient review text data can also be completed based only on the self-attention mechanism model, or through other algorithm models.
进一步的,在本发明的一些实施方式中,步骤S2、S3、S3a还可包括:Further, in some embodiments of the present invention, steps S2, S3, and S3a may also include:
获取存储于外部缓存的配置数据，将所述语音文本数据和所述历史食材评论文本数据基于所述配置数据执行深度神经网络计算，进行文本转写和提取所述语音文本数据和所述历史食材评论文本数据的文本特征。Obtain the configuration data stored in the external cache, perform deep neural network computations on the speech text data and the historical food review text data based on the configuration data, and carry out text transcription and extraction of the text features of the speech text data and the historical food review text data.
这里，通过配置外部缓存提高了算法计算效率，有效解决了所述历史食材评论文本数据量较大带来的时间响应和空间计算复杂度等问题。Here, configuring the external cache improves the computational efficiency of the algorithm and effectively solves problems such as time response and spatial computational complexity caused by the large amount of historical food review text data.
在本发明的其他实施方式中,可以根据需要调整深度神经网络各层的排列顺序或省略部分层,只要能够完成对所述语音文本数据和所述历史食材评论文本数据的文本分类即可,本发明对此不作具体限制。In other embodiments of the present invention, the order of the layers of the deep neural network can be adjusted or some layers can be omitted as needed, as long as the text classification of the voice text data and the historical food review text data can be completed. The invention places no specific limitations on this.
在步骤S4中,其具体包括:In step S4, it specifically includes:
将所述语音文本注意力特征向量和所述食材评论文本注意力特征向量共同映射到一个统一多模态向量空间进行联合表示得到所述联合特征向量，多模态联合的所述联合特征向量融合了文本语义的上下文信息、特征有用信息、高层特征、有用特征的不同重要性等最优表征能力，具有丰富的语义特征信息，从而能够获得优秀的文本、语音表征能力。The speech text attention feature vector and the food review text attention feature vector are jointly mapped into a unified multi-modal vector space for joint representation to obtain the joint feature vector. The multi-modal joint feature vector fuses optimal representation capabilities such as the contextual information of text semantics, useful feature information, high-level features and the different importance of useful features; it carries rich semantic feature information and therefore achieves excellent text and speech representation capability.
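Illustratively, a minimal sketch of the joint representation is given below: both attention feature vectors are linearly projected into one shared vector space, pooled, and concatenated into the joint feature; the projection sizes and the mean pooling are illustrative assumptions.

```python
import torch
import torch.nn as nn

class JointRepresentation(nn.Module):
    # Project both attention feature vectors into one shared space and
    # concatenate their pooled summaries as the joint feature.
    def __init__(self, speech_dim=256, text_dim=256, joint_dim=256):
        super().__init__()
        self.speech_proj = nn.Linear(speech_dim, joint_dim)
        self.text_proj = nn.Linear(text_dim, joint_dim)

    def forward(self, speech_feat, text_feat):  # (batch, seq, dim) each
        s = self.speech_proj(speech_feat).mean(dim=1)  # pooled speech summary
        t = self.text_proj(text_feat).mean(dim=1)      # pooled review summary
        return torch.cat([s, t], dim=-1)               # (batch, 2 * joint_dim)

joint = JointRepresentation()(torch.randn(2, 50, 256), torch.randn(2, 20, 256))
print(joint.shape)  # (2, 512)
```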
需要说明的是,在目前的神经网络模型中,多模态的联合特征表示和多模态融合之间已经没有明确的界限,因此,在本发明的一些实施方式中,步骤S4也可为:将所述语音文本注意力特征向量和所述食材评论文本注意力特征向量融合表示得到融合特征向量。多模态联合特征表示以及多模态融合均是为了将所述实时语音数据和所述历史食材评论文本组合,更好地提取和表示两者的特征信息。It should be noted that in the current neural network model, there is no clear boundary between multi-modal joint feature representation and multi-modal fusion. Therefore, in some embodiments of the present invention, step S4 may also be: The speech text attention feature vector and the food review text attention feature vector are fused to obtain a fusion feature vector. Multi-modal joint feature representation and multi-modal fusion are intended to combine the real-time voice data and the historical food review text to better extract and represent the feature information of both.
在步骤S5中,其具体包括:In step S5, it specifically includes:
将所述联合特征向量经全连接层组合后，输出至Softmax函数，计算所述语音文本数据和所述历史食材评论文本数据文本语义的得分及其归一化得分结果，得到分类结果信息。After the joint feature vector is combined through the fully connected layer, it is output to the Softmax function, which calculates the text semantic scores of the speech text data and the historical food review text data and their normalized score results to obtain the classification result information.
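Illustratively, a minimal sketch of the fully connected combination followed by Softmax scoring is given below; the layer sizes and the number of item categories are illustrative assumptions.

```python
import torch
import torch.nn as nn

n_classes = 8  # illustrative number of item categories
head = nn.Sequential(nn.Linear(512, 128), nn.ReLU(), nn.Linear(128, n_classes))

joint = torch.randn(2, 512)            # joint feature vectors
scores = head(joint)                   # fully connected combination
probs = torch.softmax(scores, dim=-1)  # normalized category scores
pred = probs.argmax(dim=-1)            # item category information
print(probs.shape, pred)
```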
在本发明的其他实施方式中，也可根据模型结构选择其他激活函数，本发明对此不做具体限制。In other embodiments of the present invention, other activation functions can also be selected according to the model structure, and the present invention does not impose specific restrictions on this.
综上所述，本发明所提供的方法依次通过上述步骤，完成了对所获取的语音数据的识别与分类任务，并且通过获取历史食材评论文本数据，将历史食材评论文本数据作为预训练和预测模型的数据集的一部分，更全面地获取了文本语义特征信息，通过综合运用语音文本数据和历史食材评论文本数据，将历史食材评论文本数据作为补充数据，弥补了语音数据文本语义信息较少的问题，有效提高了文本分类准确度，从而提高了对物品进行分类的准确率。并且，通过构建融合了深度神经网络和卷积神经网络的网络模型提高了实时语音识别的精度；通过构建融合了上下文信息机制、自注意力机制和互注意力机制的神经网络模型，更充分地提取文本语义特征信息。整体模型结构具有优秀的深度学习表征能力，对语音文本分类的准确率高，大幅提升了对物品类别进行分类的准确率和泛化能力。In summary, the method provided by the present invention completes the recognition and classification of the acquired voice data through the above steps in sequence. By obtaining historical food review text data and using it as part of the data set for pre-training and for the prediction model, text semantic feature information is obtained more comprehensively; by comprehensively using the speech text data and the historical food review text data, with the latter as supplementary data, the problem of the voice data text carrying relatively little semantic information is remedied, effectively improving text classification accuracy and thus the accuracy of item classification. Moreover, building a network model that fuses a deep neural network and a convolutional neural network improves the accuracy of real-time speech recognition, and building a neural network model that fuses a context information mechanism, a self-attention mechanism and a mutual-attention mechanism extracts text semantic feature information more fully. The overall model structure has excellent deep learning representation capability and high accuracy in classifying speech text, greatly improving the accuracy and generalization ability of item category classification.
在步骤S6中,其具体包括:In step S6, it specifically includes:
将所述物品类别信息转换为语音进行输出,和/或Convert the item category information into speech for output, and/or
将所述物品类别信息转换为语音传输至客户终端输出,和/或Convert the item category information into voice and transmit it to the client terminal for output, and/or
将所述物品类别信息转换为文本进行输出,和/或Convert the item category information into text for output, and/or
将所述物品类别信息转换为文本传输至客户终端输出。Convert the item category information into text and transmit it to the client terminal for output.
如步骤S6所述，在本实施方式中，在通过前述步骤获得分类结果信息并判断得到物品类别信息后，可将其转换为语音，通过智能冰箱内置的声音播放设备播报所述物品类别信息，从而直接与用户进行语音交互，或者也可将所述物品类别信息转换为文本，直接通过智能冰箱配置的显示装置显示。并且，也可将物品类别信息语音通信传输至客户终端输出，这里，客户终端为具有信息接收功能的电子设备，如将语音传输至手机、智能音响、蓝牙耳机等设备进行播报，或将分类结果信息文本通过短信、邮件等方式通讯传输至诸如手机、平板电脑等客户终端或客户终端安装的应用软件，供用户查阅。从而实现多渠道多种类的分类结果信息输出方式，用户并不局限于只能在智能冰箱附近处获得相关信息，配合本发明所提供的多渠道多种类实时语音获取方式，使得用户能够直接在远程与智能冰箱进行交互，具有极高的便捷性，大幅提高了用户使用体验。在本发明的其他实施方式中，也可仅采用上述分类结果信息输出方式中的一种或几种，或者也可基于现有技术通过其他渠道输出分类结果信息，本发明对此不作具体限制。As described in step S6, in this embodiment, after the classification result information is obtained through the preceding steps and the item category information is determined, it can be converted into voice and broadcast through the sound playback device built into the smart refrigerator, thereby interacting with the user directly by voice; or the item category information can be converted into text and displayed directly through the display device configured on the smart refrigerator. The item category information can also be transmitted by voice communication to a client terminal for output. Here, the client terminal is an electronic device with an information receiving function: for example, the voice can be transmitted to a mobile phone, smart speaker, Bluetooth headset or other device for broadcast, or the classification result text can be transmitted by SMS, email or other communication means to a client terminal such as a mobile phone or tablet computer, or to application software installed on the client terminal, for the user to review. This realizes multi-channel, multi-type output of classification result information, so the user is not limited to obtaining relevant information only near the smart refrigerator; combined with the multi-channel, multi-type real-time voice acquisition provided by the present invention, the user can interact with the smart refrigerator directly and remotely, which is extremely convenient and greatly improves the user experience. In other embodiments of the present invention, only one or several of the above classification result information output methods may be used, or the classification result information may be output through other channels based on the existing technology; the present invention does not impose specific restrictions on this.
综上所述，本发明提供的一种基于深度学习的物品分类方法，其通过多渠道获取包含物品信息的实时语音数据，在将实时语音数据进行文本转写后，结合历史食材评论文本数据通过深度神经网络模型充分提取了文本语义特征，获得物品类别信息后通过多渠道进行输出，显著改善语音识别精度和物品类别判断准确率的同时，使得交互方式更加便捷多元，大幅提高用户体验。To sum up, the present invention provides an item classification method based on deep learning, which obtains real-time voice data containing item information through multiple channels; after the real-time voice data is transcribed into text, it is combined with historical food review text data and the text semantic features are fully extracted through the deep neural network model, and the item category information obtained is output through multiple channels. This significantly improves speech recognition accuracy and the accuracy of item category judgment, while making the interaction more convenient and diverse and greatly improving the user experience.
如图6所示,基于同一发明构思,本发明还提供一种基于深度学习的物品分类装置7,其包括:As shown in Figure 6, based on the same inventive concept, the present invention also provides an item classification device 7 based on deep learning, which includes:
数据获取模块71,用于获取实时语音数据和获取历史文本数据;Data acquisition module 71, used to acquire real-time voice data and acquire historical text data;
转写模块72,用于转写所述实时语音数据为语音文本数据;Transcription module 72, used to transcribe the real-time voice data into voice text data;
特征提取模块73,用于提取所述语音文本数据文本特征和提取所述历史文本数据的文本特征;Feature extraction module 73, used to extract text features of the voice text data and extract text features of the historical text data;
联合表示模块74,用于将所述实时语音数据文本特征和所述历史文本数据文本特征联合表示得到联合特征;The joint representation module 74 is used to jointly represent the text features of the real-time speech data and the text features of the historical text data to obtain joint features;
结果计算模块75,用于将所述联合特征经全连接层组合后,输出至分类器计算得分得到分类结果信息,并判断得到物品类别信息;The result calculation module 75 is used to combine the joint features through the fully connected layer and output it to the classifier to calculate the score to obtain the classification result information, and determine to obtain the item category information;
输出模块76,用于输出所述物品类别信息。The output module 76 is used to output the item category information.
基于同一发明构思,本发明还提供一种电器设备,其包括:Based on the same inventive concept, the present invention also provides an electrical device, which includes:
存储器,用于存储可执行指令;Memory, used to store executable instructions;
处理器,用于运行所述存储器存储的可执行指令时,实现上述的基于深度学习的物品分类方法。The processor is configured to implement the above deep learning-based item classification method when running executable instructions stored in the memory.
基于同一发明构思,本发明还提供一种冰箱,其包括:Based on the same inventive concept, the present invention also provides a refrigerator, which includes:
存储器,用于存储可执行指令;Memory, used to store executable instructions;
处理器,用于运行所述存储器存储的可执行指令时,实现上述的基于深度学习的物品分类方法。The processor is configured to implement the above deep learning-based item classification method when running executable instructions stored in the memory.
基于同一发明构思，本发明还提供一种计算机可读存储介质，其存储有可执行指令，其特征在于，所述可执行指令被处理器执行时实现上述的基于深度学习的物品分类方法。Based on the same inventive concept, the present invention also provides a computer-readable storage medium storing executable instructions, characterized in that when the executable instructions are executed by a processor, the above-mentioned deep learning-based item classification method is implemented.
应当理解，虽然本说明书按照实施方式加以描述，但并非每个实施方式仅包含一个独立的技术方案，说明书的这种叙述方式仅仅是为清楚起见，本领域技术人员应当将说明书作为一个整体，各实施方式中的技术方案也可以经适当组合，形成本领域技术人员可以理解的其他实施方式。It should be understood that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution; this manner of description is adopted only for clarity. Those skilled in the art should regard the specification as a whole, and the technical solutions in the various embodiments can also be appropriately combined to form other embodiments understandable to those skilled in the art.
上文所列出的一系列的详细说明仅仅是针对本发明的可行性实施方式的具体说明，并非用以限制本发明的保护范围，凡未脱离本发明技艺精神所作的等效实施方式或变更均应包含在本发明的保护范围之内。The series of detailed descriptions listed above are only specific descriptions of feasible embodiments of the present invention and are not intended to limit the scope of protection of the present invention; any equivalent embodiments or changes made without departing from the technical spirit of the present invention shall be included within the scope of protection of the present invention.

Claims (15)

  1. 一种基于深度学习的物品分类方法,其特征在于,包括步骤:An item classification method based on deep learning, which is characterized by including the steps:
    获取包含物品信息的实时语音数据,获取历史文本数据;Obtain real-time voice data containing item information and obtain historical text data;
    转写所述实时语音数据为语音文本数据,提取所述语音文本数据文本特征;Transcribe the real-time voice data into voice text data, and extract text features of the voice text data;
    提取所述历史文本数据的文本特征;Extract text features of the historical text data;
    将所述实时语音数据文本特征和所述历史文本数据文本特征联合表示得到联合特征;jointly represent the text features of the real-time speech data and the text features of the historical text data to obtain joint features;
    将所述联合特征经全连接层组合后,输出至分类器计算得分得到分类结果信息,并判断得到物品类别信息;After the joint features are combined through the fully connected layer, they are output to the classifier to calculate the score to obtain the classification result information, and determine the item category information;
    输出所述物品类别信息。Output the item category information.
  2. 根据权利要求1所述的基于深度学习的物品分类方法,其特征在于,所述获取历史文本数据,具体包括:The item classification method based on deep learning according to claim 1, characterized in that said obtaining historical text data specifically includes:
    获取历史食材评论文本数据作为所述历史文本数据。Obtain historical food ingredient review text data as the historical text data.
  3. 根据权利要求1所述的基于深度学习的物品分类方法,其特征在于,所述转写所述实时语音数据为语音文本数据,提取所述语音文本数据文本特征,具体包括:The item classification method based on deep learning according to claim 1, wherein the transcribing the real-time speech data is speech text data, and extracting the text features of the speech text data specifically includes:
    提取所述实时语音数据特征,得到语音特征;Extract the real-time voice data features to obtain voice features;
    将所述语音特征输入语音识别深度神经网络模型转写得到第一语音文本向量;Enter the speech feature into a speech recognition deep neural network model and transcribe it to obtain a first speech text vector;
    将所述第一语音文本向量输入语音识别卷积神经网络进行编码,得到第二语音文本向量。The first speech text vector is input into a speech recognition convolutional neural network for encoding to obtain a second speech text vector.
  4. 根据权利要求3所述的基于深度学习的物品分类方法,其特征在于,所述提取所述实时语音数据特征,具体包括:The item classification method based on deep learning according to claim 3, wherein the extracting the real-time voice data features specifically includes:
    提取所述实时语音数据特征,获取其梅尔频率倒谱系数特征。Extract the characteristics of the real-time speech data and obtain its Mel frequency cepstrum coefficient characteristics.
  5. 根据权利要求3所述的基于深度学习的物品分类方法,其特征在于,提取所述历史文本数据的文本特征,具体包括:The item classification method based on deep learning according to claim 3, characterized in that extracting text features of the historical text data specifically includes:
    将所述历史食材评论文本数据转化为食材评论词向量;Convert the historical food material review text data into food material review word vectors;
    将所述食材评论词向量输入双向长短记忆网络模型，获取包含基于所述历史食材评论文本数据上下文特征信息的食材评论上下文特征向量。Input the food review word vectors into a bidirectional long short-term memory network model to obtain a food review context feature vector containing context feature information based on the historical food review text data.
  6. 根据权利要求5所述的基于深度学习的物品分类方法,其特征在于,还包括步骤:The deep learning-based item classification method according to claim 5, further comprising the steps of:
    基于注意力机制模型,增强所述语音文本数据和所述历史食材评论文本数据的文本特征。Based on the attention mechanism model, the text features of the speech text data and the historical ingredient review text data are enhanced.
  7. 根据权利要求6所述的基于深度学习的物品分类方法,其特征在于,所述基于注意力机制模型,增强所述语音文本数据和历史食材评论文本数据的文本特征,具体包括:The item classification method based on deep learning according to claim 6, characterized in that the attention mechanism model is used to enhance the text features of the voice text data and historical ingredient review text data, specifically including:
    分别将所述第二语音文本向量和所述食材评论上下文特征向量输入自注意力机制模型和互注意力机制模型；Input the second speech text vector and the food review context feature vector into the self-attention mechanism model and the mutual-attention mechanism model respectively;
    获取包含所述语音文本数据自身权重信息以及所述语音文本数据与所述历史食材评论文本数据之间权重信息的语音文本注意力特征向量;Obtain a voice text attention feature vector that includes the weight information of the voice text data itself and the weight information between the voice text data and the historical ingredient review text data;
    获取包含所述历史食材评论文本数据自身权重信息以及所述历史食材评论文本数据上下文特征向量与所述语音文本数据之间权重信息的食材评论文本注意力特征向量。Obtain a food review text attention feature vector that includes the weight information of the historical food review text data itself and the weight information between the context feature vector of the historical food review text data and the voice text data.
  8. 根据权利要求7所述的基于深度学习的物品分类方法,其特征在于,所述将所述实时语音数据文本特征和所述历史文本数据文本特征联合表示得到联合特征向量,具体包括:The item classification method based on deep learning according to claim 7, wherein the joint representation of the text features of the real-time voice data and the text features of the historical text data to obtain a joint feature vector specifically includes:
    将所述语音文本注意力特征向量和所述食材评论文本注意力特征向量共同映射到一个统一多模态向量空间进行联合表示得到所述联合特征向量。The voice text attention feature vector and the food review text attention feature vector are jointly mapped to a unified multi-modal vector space for joint representation to obtain the joint feature vector.
  9. 根据权利要求7所述的基于深度学习的物品分类方法，其特征在于，所述将所述联合特征经全连接层组合后，输出至分类器计算得分得到分类结果信息，具体包括：The item classification method based on deep learning according to claim 7, characterized in that combining the joint features through the fully connected layer and outputting them to the classifier to calculate scores to obtain classification result information specifically includes:
    将所述联合特征向量经全连接层组合后,输出至Softmax函数,计算所述语音文本数据和所述历史食材评论文本数据文本语义的得分及其归一化得分结果,得到分类结果信息。After the joint feature vector is combined through the fully connected layer, it is output to the Softmax function, and the scores of the textual semantics of the speech text data and the historical food review text data and their normalized score results are calculated to obtain classification result information.
  10. 根据权利要求1所述的基于深度学习的物品分类方法，其特征在于，所述获取包含物品信息的实时语音数据，具体包括：The item classification method based on deep learning according to claim 1, characterized in that obtaining the real-time voice data containing item information specifically includes:
    获取语音采集装置所采集的所述实时语音数据,和/或Obtain the real-time voice data collected by the voice collection device, and/or
    获取自客户终端传输的所述实时语音数据。The real-time voice data transmitted from the client terminal is obtained.
  11. 根据权利要求2所述的基于深度学习的物品分类方法,其特征在于,所述获取历史食材评论文本数据作为所述历史文本数据,具体包括:The item classification method based on deep learning according to claim 2, characterized in that said obtaining historical ingredient review text data as the historical text data specifically includes:
    获取内部存储的历史食材评论文本作为历史食材评论文本数据,和/或Obtain the internally stored historical ingredient review text as historical ingredient review text data, and/or
    获取外部存储的历史食材评论文本作为历史食材评论文本数据,和/或Obtain the externally stored historical ingredient review text as historical ingredient review text data, and/or
    获取客户终端传输的历史食材评论文本作为历史食材评论文本数据。Obtain the historical ingredient review text transmitted by the client terminal as historical ingredient review text data.
  12. 根据权利要求1所述的基于深度学习的物品分类方法,其特征在于,还包括步骤:The deep learning-based item classification method according to claim 1, further comprising the steps of:
    对所述实时语音数据进行预处理,包括:对所述实时语音数据进行分帧处理和加窗处理,Preprocessing the real-time voice data includes: framing and windowing the real-time voice data,
    对所述历史文本数据进行预处理，包括：对所述历史文本数据进行清洗处理、标注、分词、去停用词。Preprocessing the historical text data includes: cleaning, annotating, word segmentation, and stop-word removal on the historical text data.
  13. 根据权利要求1所述的基于深度学习的物品分类方法,其特征在于,所述输出所述物品类别信息包括:The item classification method based on deep learning according to claim 1, wherein the outputting the item category information includes:
    将所述物品类别信息转换为语音进行输出,和/或Convert the item category information into speech for output, and/or
    将所述物品类别信息转换为语音传输至客户终端输出,和/或Convert the item category information into voice and transmit it to the client terminal for output, and/or
    将所述物品类别信息转换为文本进行输出,和/或Convert the item category information into text for output, and/or
    将所述物品类别信息转换为文本传输至客户终端输出。Convert the item category information into text and transmit it to the client terminal for output.
  14. 根据权利要求1所述的基于深度学习的物品分类方法,其特征在于,所述转写所述实时语音数据为语音文本数据,提取所述语音文本数据文本特征;提取所述历史文本数据的文本特征,还包括:The object classification method based on deep learning according to claim 1, characterized in that the transcribing the real-time speech data is speech text data, extracting text features of the speech text data; extracting the text of the historical text data Features, also include:
    获取存储于外部缓存的配置数据，将所述实时语音数据和所述历史食材评论文本数据基于所述配置数据执行深度神经网络计算，进行文本转写和提取文本特征。Obtain the configuration data stored in the external cache, perform deep neural network computations on the real-time voice data and the historical food review text data based on the configuration data, and carry out text transcription and text feature extraction.
  15. 一种基于深度学习的物品分类装置,其特征在于,包括: An item classification device based on deep learning, which is characterized by including:
    数据获取模块,用于获取实时语音数据和获取历史文本数据;Data acquisition module, used to acquire real-time voice data and historical text data;
    转写模块,用于转写所述实时语音数据为语音文本数据;A transliteration module, used to transcribe the real-time voice data into voice text data;
    特征提取模块,用于提取所述语音文本数据文本特征和提取所述历史文本数据的文本特征;A feature extraction module, used to extract text features of the speech text data and extract text features of the historical text data;
    联合表示模块,用于将所述实时语音数据文本特征和所述历史文本数据文本特征联合表示得到联合特征;A joint representation module, used to jointly represent the text features of the real-time speech data and the text features of the historical text data to obtain joint features;
    结果计算模块,用于将所述联合特征经全连接层组合后,输出至分类器计算得分得到分类结果信息,并判断得到物品类别信息;The result calculation module is used to combine the joint features through the fully connected layer and output it to the classifier to calculate the score to obtain the classification result information, and to determine the item category information;
    输出模块,用于输出所述物品类别信息。 An output module is used to output the item category information.
PCT/CN2023/095081 2022-05-20 2023-05-18 Item classification method and apparatus based on deep learning WO2023222089A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210554861.9A CN114944156A (en) 2022-05-20 2022-05-20 Article classification method, device and equipment based on deep learning and storage medium
CN202210554861.9 2022-05-20

Publications (1)

Publication Number Publication Date
WO2023222089A1 true WO2023222089A1 (en) 2023-11-23

Family

ID=82908762

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/095081 WO2023222089A1 (en) 2022-05-20 2023-05-18 Item classification method and apparatus based on deep learning

Country Status (2)

Country Link
CN (1) CN114944156A (en)
WO (1) WO2023222089A1 (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114944156A (en) * 2022-05-20 2022-08-26 青岛海尔电冰箱有限公司 Article classification method, device and equipment based on deep learning and storage medium
CN115098765A (en) * 2022-05-20 2022-09-23 青岛海尔电冰箱有限公司 Information pushing method, device and equipment based on deep learning and storage medium
CN116186258A (en) * 2022-12-31 2023-05-30 青岛海尔电冰箱有限公司 Text classification method, equipment and storage medium based on multi-mode knowledge graph
CN116108176A (en) * 2022-12-31 2023-05-12 青岛海尔电冰箱有限公司 Text classification method, equipment and storage medium based on multi-modal deep learning
CN116431805A (en) * 2023-03-15 2023-07-14 青岛海尔电冰箱有限公司 Text classification method and refrigeration equipment system
CN117475199A (en) * 2023-10-16 2024-01-30 深圳市泰洲科技有限公司 Intelligent classification method for customs declaration commodity

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170293687A1 (en) * 2016-04-12 2017-10-12 Abbyy Infopoisk Llc Evaluating text classifier parameters based on semantic features
US20190034823A1 (en) * 2017-07-27 2019-01-31 Getgo, Inc. Real time learning of text classification models for fast and efficient labeling of training data and customization
CN107993134A (en) * 2018-01-23 2018-05-04 北京知行信科技有限公司 A kind of smart shopper exchange method and system based on user interest
CN113111954A (en) * 2021-04-20 2021-07-13 网易(杭州)网络有限公司 User category judgment method and device, storage medium and server
CN113887410A (en) * 2021-09-30 2022-01-04 杭州电子科技大学 Deep learning-based multi-category food material identification system and method
CN114121018A (en) * 2021-12-06 2022-03-01 中国科学技术大学 Voice document classification method, system, device and storage medium
CN114944156A (en) * 2022-05-20 2022-08-26 青岛海尔电冰箱有限公司 Article classification method, device and equipment based on deep learning and storage medium
CN115062143A (en) * 2022-05-20 2022-09-16 青岛海尔电冰箱有限公司 Voice recognition and classification method, device, equipment, refrigerator and storage medium

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118209866A (en) * 2024-03-20 2024-06-18 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Lithium battery state of charge estimation system and method with rapid migration capability

Also Published As

Publication number Publication date
CN114944156A (en) 2022-08-26

Similar Documents

Publication Publication Date Title
WO2023222089A1 (en) Item classification method and apparatus based on deep learning
WO2023222088A1 (en) Voice recognition and classification method and apparatus
WO2023222090A1 (en) Information pushing method and apparatus based on deep learning
CN113408385B (en) Audio and video multi-mode emotion classification method and system
CN111933129B (en) Audio processing method, language model training method and device and computer equipment
CN111968679B (en) Emotion recognition method and device, electronic equipment and storage medium
WO2020253509A1 (en) Situation- and emotion-oriented chinese speech synthesis method, device, and storage medium
WO2024140434A1 (en) Text classification method based on multi-modal knowledge graph, and device and storage medium
WO2024140430A1 (en) Text classification method based on multimodal deep learning, device, and storage medium
CN112233680A (en) Speaker role identification method and device, electronic equipment and storage medium
CN114566189B (en) Speech emotion recognition method and system based on three-dimensional depth feature fusion
WO2024193596A1 (en) Natural language understanding method and refrigerator
CN115798459B (en) Audio processing method and device, storage medium and electronic equipment
WO2024140432A1 (en) Ingredient recommendation method based on knowledge graph, and device and storage medium
CN117077787A (en) Text generation method and device, refrigerator and storage medium
CN113129867A (en) Training method of voice recognition model, voice recognition method, device and equipment
CN112581937A (en) Method and device for acquiring voice instruction
WO2024114303A1 (en) Phoneme recognition method and apparatus, electronic device and storage medium
Ozseven Evaluation of the effect of frame size on speech emotion recognition
US11991511B2 (en) Contextual awareness in dynamic device groups
US11277304B1 (en) Wireless data protocol
Sartiukova et al. Remote Voice Control of Computer Based on Convolutional Neural Network
KR20210085182A (en) System, server and method for determining user utterance intention
CN118486305B (en) Event triggering processing method based on voice recognition
CN118428343B (en) Full-media interactive intelligent customer service interaction method and system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23807039

Country of ref document: EP

Kind code of ref document: A1