WO2023093074A1 - Voice data processing method and apparatus, electronic device, and storage medium - Google Patents
- Publication number: WO2023093074A1 (application PCT/CN2022/105326)
- Authority: WIPO (PCT)
Classifications
- G10L15/26 — Speech recognition; speech-to-text systems
- G10L15/063 — Speech recognition; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/16 — Speech classification or search using artificial neural networks
Definitions
- the present disclosure relates to voice interaction technology, in particular to a voice data processing method, device, electronic equipment, and storage medium.
- smart home appliances are usually controlled through voice interaction.
- voice data in the voice interaction will be sent to a variety of smart home appliances.
- command recognition models in different domains each recognize whether the voice data applies to the smart home appliances in their domain.
- This voice data recognition and processing method will cause excessive load pressure on the server, thereby increasing the risk of server downtime.
- the present disclosure provides a voice data processing method, device, electronic equipment, and storage medium, to solve the problem that, in the smart home appliance field, voice interaction control of home appliances may place excessive load pressure on servers and increase the risk of server downtime.
- the present disclosure provides a voice data processing method, including:
- An execution instruction is generated after identifying the text information based on the target instruction recognition model, and the execution instruction is used to instruct the target household electrical appliance to execute the operation indicated by the voice data;
- an electronic device including: a processor, and a memory communicatively connected to the processor;
- the memory stores computer-executable instructions
- the processor executes the computer-executable instructions stored in the memory, so as to implement the voice data processing method described in the first aspect.
- a computer-readable storage medium wherein computer-executable instructions are stored in the computer-readable storage medium, and when the instructions are executed, the computer executes the voice data processing method as described in the first aspect.
- a computer program product including a computer program, and when the computer program is executed by a processor, the speech data processing method according to the first aspect is implemented.
- the method provided by the present disclosure converts the voice data into text information after the voice data is acquired, and then recognizes the field category of the target home appliance indicated by the text information, and determines the target instruction recognition model according to the field category of the target home appliance.
- An execution instruction is generated after identifying the text information based on the target instruction recognition model.
- An execution instruction is sent to the target household electrical appliance, so that the target household electrical appliance executes the operation indicated by the voice data. Therefore, when voice data is obtained, only the domain classification model and a single instruction recognition model are called to process it; it is not necessary to call all the instruction recognition models, which reduces the load pressure on the server and in turn reduces the risk of server downtime.
- FIG. 1 is a schematic diagram of an application scenario of a speech data processing method provided by the present disclosure.
- Fig. 2 is a schematic flowchart of a voice data processing method provided by an embodiment of the present disclosure.
- Fig. 3 is a schematic flowchart of a voice data processing method provided by another embodiment of the present disclosure.
- Fig. 4 is a partial schematic diagram of the process of creating a domain classification model provided by an embodiment of the present disclosure.
- Fig. 5 is a schematic diagram of a speech data processing device provided by an embodiment of the present disclosure.
- Fig. 6 is a schematic diagram of an electronic device provided by an embodiment of the present disclosure.
- smart home appliances are usually controlled through voice interaction.
- voice interaction will be received by smart home appliances in different fields.
- Smart home appliances in each field have their own independent instruction recognition models on the server.
- the server sends the voice data simultaneously, through a tree-shaped flow, to the command recognition model of the smart home appliances in each domain, and the command recognition models in the different domains recognize whether the voice data applies to the smart home appliances in their domain.
- the simultaneous operation of multiple instruction recognition models will put pressure on the server, resulting in excessive load pressure on the server, thereby increasing the risk of server downtime.
- the present disclosure provides a voice data processing method, device, electronic equipment, and storage medium.
- the voice data is converted into text information, and the domain category of the home appliance indicated by the text information is then identified.
- a target instruction recognition model matching the domain category of the target home appliance is determined according to that domain category, and an execution instruction is generated after the text information is recognized by the target instruction recognition model.
- An execution instruction is sent to the target household electrical appliance, so that the target household electrical appliance executes the operation indicated by the voice data.
- the voice data processing method provided by the present disclosure is applied to electronic equipment, such as a computer used in a smart home system, a server used in a smart home system, and the like.
- Figure 1 is a schematic diagram of the application of the voice data processing method provided by the present disclosure.
- after acquiring externally input voice data, the electronic device converts the voice data into text information, inputs the text information into the domain classification model to determine the domain category of the target home appliance, and then determines the corresponding target instruction recognition model according to that domain category.
- an execution instruction is output to the target household electrical appliance.
- Embodiment 1 of the present disclosure provides a voice data processing method, including:
- the voice data is produced by the user; for example, the user saying "turn on the washing machine" is a piece of voice data.
- after receiving the voice data, the electronic device converts it into text information that can be processed; for example, the voice data "turn on the washing machine" is converted into the text information "turn on the washing machine".
- the voice data may be a short sentence or a long sentence.
- the text information may be a sentence with a small number of words or a sentence with a large number of words.
- the voice data may contain one operation, or may contain multiple operations, for example, voice data "Start washing clothes after turning on the washing machine".
- the domain classification model is used to determine the domain category to which the target household electrical appliance indicated by the text information belongs after identifying the text information. For example, the text information is "start washing clothes after turning on the washing machine", and after inputting the text information into the domain classification model, the obtained domain category of the target household appliance is "washing machine”.
- the domain classification model is used to predict and output the domain category of the target home appliance according to the text information.
- the domain classification model calculates the probability that the text information corresponds to each domain category in the model, and the domain category with the highest of these probabilities is taken as the domain category of the target household electrical appliance corresponding to the text information.
- the domain categories in the domain classification model are, for example, washing machine, television, and sweeping robot. After the text information "turn on the washing machine and start washing clothes" is processed by the domain classification model, the probability of belonging to washing machine may be 90%, the probability of belonging to television may be 20%, and the probability of belonging to sweeping robot may be 60%; the predicted domain category of the target home appliance corresponding to this text information is therefore washing machine.
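As a sketch (not part of the disclosure), the highest-probability selection described above can be illustrated as a softmax over per-category scores followed by an argmax. The category names and score values below are illustrative assumptions only:

```python
import math

# Hypothetical domain categories; the logits passed in below are made-up
# illustration values, not outputs of the disclosed model.
DOMAINS = ["washing machine", "television", "sweeping robot"]

def softmax(logits):
    """Convert raw per-category scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_domain(logits):
    """Select the domain category with the highest probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return DOMAINS[best], probs[best]

# e.g. scores produced for "turn on the washing machine and start washing clothes"
domain, prob = predict_domain([2.2, -1.5, 0.7])
```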
- before the text information is input into the domain classification model, the domain classification model needs to be created.
- an initial domain classification model is first obtained, and the initial domain classification model is then trained to obtain the domain classification model.
- a training set for the initial domain classification model is first obtained; the training set includes a plurality of pieces of text information, and each piece of text information is labelled with a standard original domain category.
- the initial domain classification model is trained on this training set.
- when the initial domain classification model outputs the predicted domain category of each piece of text information in the training set, and the output loss between each predicted domain category and the corresponding original domain category is within the preset loss, the training of the initial domain classification model is complete and the domain classification model is obtained.
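The training-completion criterion above can be sketched as a simple check, assuming (as an illustration, not from the disclosure) that the per-sample output loss is available as a list of numbers:

```python
def training_complete(output_losses, preset_loss):
    """True when the output loss between every predicted domain category
    and its labelled original domain category is within the preset loss."""
    return all(loss < preset_loss for loss in output_losses)

# Illustrative loss values; a real training loop would alternate
# optimization steps with this check.
done = training_complete([0.03, 0.01, 0.02], preset_loss=0.05)
not_done = training_complete([0.03, 0.08], preset_loss=0.05)
```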
- the initial domain classification model is a fusion model of the pre-training model ALBERT and the convolutional neural network model, after fusion, the output of the pre-training model ALBERT is the input of the convolutional neural network model.
- the convolutional neural network model TextCNN is set in the initial domain classification model.
- the convolutional neural network model TextCNN is strong at extracting shallow text features and performs very well on short text classification.
- the pre-training model ALBERT is introduced into the initial domain classification model.
- the pre-training model ALBERT has learned from a large amount of corpus data, such as encyclopedia Q&A, daily conversations, and news reports, during pre-training.
- the feature information learned there greatly enhances the model's ability to understand non-command corpus.
- the pre-training model ALBERT has a simple structure and fast inference speed, so it adapts well to the deployment environment after being deployed online.
- each domain category has an independent command recognition model on the electronic device. After the domain category of the target household electrical appliance is determined in step S220, the target command recognition model corresponding to that domain category can be determined.
- the target instruction recognition model can determine the information of the target household electrical device indicated by the text information and the operation to be performed by the target household electrical device. After identifying the information of the target household electrical appliance and the operation to be performed by the target household electrical appliance, the information is packaged as an execution instruction.
- after receiving the execution instruction, the target household electrical appliance executes the operation indicated in the execution instruction, thereby completing the process of controlling the target household electrical appliance through voice interaction.
- the method provided in this embodiment converts the voice data into text information after the voice data is acquired, recognizes the domain category of the target home appliance indicated by the text information, and determines the target instruction recognition model according to that domain category.
- an execution instruction is then generated after the text information is recognized by the target instruction recognition model.
- the execution instruction is sent to the target household electrical appliance, so that the target household electrical appliance executes the operation indicated by the voice data. Therefore, when voice data is obtained, only the domain classification model and a single instruction recognition model are called to process it; it is not necessary to call all the instruction recognition models, which reduces the load pressure on the server and in turn reduces the risk of server downtime.
- Embodiment 2 of the present disclosure also provides a voice data processing method, and further describes the process of creating a domain classification model on the basis of Embodiment 1.
- the method provided in this embodiment includes:
- the convolutional neural network model TextCNN has a strong ability to extract shallow text features, and has a good effect in short text classification.
- the pre-training model ALBERT has learned a large number of feature information in corpora such as encyclopedia questions and answers, daily conversations, and news reports during pre-training, which can well enhance the ability to understand non-control corpus.
- the pre-training model ALBERT has a simple structure and fast inference speed, so it adapts well to the deployment environment after being deployed online.
- the original field category marked on the text information may be defined by the research and development personnel.
- the amount of text information in the training set should be as large as possible to improve the training effect of the initial domain classification model.
- the first text information is any text information in the training set of the initial domain classification model, and the preset domain category is, for example, "washing machine", “television”, “sweeping robot” and so on.
- the first text information is actually input as a sequence of numerical values.
- the first text information therefore needs to be supplemented or deleted to a preset numerical length before being input into the pre-training model ALBERT in the initial domain classification model.
- when the numerical length corresponding to the first text information exceeds the preset numerical length, words in the first text information are deleted; when it is less than the preset numerical length, the first text information is supplemented with words.
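The supplement-or-delete step above can be sketched as padding or truncating a token-id sequence to the preset length. The `PAD_ID` value is a hypothetical placeholder; the disclosure does not name one:

```python
PAD_ID = 0  # hypothetical padding token id, assumed for illustration

def pad_or_truncate(token_ids, preset_length):
    """Delete trailing tokens when the text exceeds the preset numerical
    length, or supplement with padding tokens when it falls short."""
    if len(token_ids) > preset_length:
        return token_ids[:preset_length]
    return token_ids + [PAD_ID] * (preset_length - len(token_ids))

padded = pad_or_truncate([5, 9, 3], preset_length=6)
trimmed = pad_or_truncate([5, 9, 3, 7, 1, 4, 8], preset_length=6)
```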
- the first text information is first input into the pre-training model ALBERT, which outputs the first feature vector of each word in the first text information. The first feature vector of each word is then input, in sequence, into the convolutional neural network model.
- the convolutional neural network model includes N different convolutional layers (Figure 4 takes 6 different convolutional layers as an example), where N is an integer greater than 3, and each convolutional layer has a different receptive field.
- the first feature vector of each word in the first text information is input, in sequence, to the first through Nth of the N different convolutional layers.
- the first through Nth convolutional layers can simultaneously take the first text information as a processing sample and extract from the first feature vectors of its words; that is, the first text information is processed by every convolutional layer.
- each convolutional layer extracts the first feature vectors of the words in sequence based on the size of its receptive field, yielding feature vectors of N dimensions for the first text information.
- the preset domain category set includes, for example, domain categories such as "washing machine”, “television”, and "sweeping robot".
- the third convolutional layer Conv-5 has a receptive field of 5 and a stride of 1; it extracts the first feature vectors of 5 consecutive words at a time, in word order, and outputs the third-dimension feature vector of the first text information.
- the fourth convolutional layer Conv-6 has a receptive field of 6 and a stride of 1; it extracts the first feature vectors of 6 consecutive words at a time and outputs the fourth-dimension feature vector of the first text information.
- the fifth convolutional layer Conv-7 has a receptive field of 7 and a stride of 1; it extracts the first feature vectors of 7 consecutive words at a time and outputs the fifth-dimension feature vector of the first text information.
- the sixth convolutional layer Conv-8 has a receptive field of 8 and a stride of 1; it extracts the first feature vectors of 8 consecutive words at a time and outputs the sixth-dimension feature vector of the first text information.
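The sliding-window behaviour of the parallel convolutional branches can be sketched as follows. Real TextCNN layers apply learned filter weights; the plain sum used here is only a stand-in that keeps the receptive-field and stride logic visible, and the word vectors are made-up values:

```python
def conv1d_over_words(word_vectors, receptive_field, stride=1):
    """Slide a window of `receptive_field` consecutive word vectors over
    the sequence (stride 1 by default) and reduce each window to one value.
    A learned filter would replace the plain sum used here."""
    n = len(word_vectors)
    out = []
    for start in range(0, n - receptive_field + 1, stride):
        window = word_vectors[start:start + receptive_field]
        out.append(sum(sum(vec) for vec in window))
    return out

# 10 words, each with a 4-dimensional feature vector (illustrative values).
words = [[0.1 * (i + j) for j in range(4)] for i in range(10)]

# Six parallel branches with receptive fields 3..8, mirroring the
# Conv-3 .. Conv-8 layers in the description above.
feature_maps = {k: conv1d_over_words(words, k) for k in range(3, 9)}
```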
- the spliced feature vectors are input to the fully connected layer (Dense) of the convolutional neural network model.
- the fully connected layer of the convolutional neural network model extracts global semantic information from the concatenated one-dimensional feature vector, and outputs the probability value that the first text information belongs to each domain category in the preset domain category set.
- the feature vectors of all N dimensions of the first text information may also be max-pooled to obtain denoised feature vectors of the N dimensions.
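Max pooling plus splicing into the Dense layer's one-dimensional input can be sketched as follows. This is a minimal illustration with made-up feature-map values; real implementations pool per filter channel:

```python
def max_pool(feature_map):
    """Keep the strongest activation from one convolutional branch,
    discarding positional noise."""
    return max(feature_map)

def splice_for_dense(feature_maps):
    """Splice the pooled value of each of the N branches into the single
    one-dimensional vector fed to the fully connected (Dense) layer."""
    return [max_pool(fm) for fm in feature_maps]

# Three illustrative branches of different lengths.
dense_input = splice_for_dense([[0.2, 0.9, 0.1], [0.4, 0.3], [0.8]])
```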
- a random deactivation (Dropout) layer in each convolutional layer, combined with the GELU activation function, adds non-linearity to the domain classification model and prevents the initial domain classification model from overfitting during training (overfitting: over-learning the training text information, including mislabelled or problematic samples, at the expense of generalization).
- the first feature vector of each word in the first text information is also normalized by layer normalization (LayerNorm) in the convolutional layer, so that the output feature vector of the first text information follows a normal-like distribution, improving the stability of the feature vectors output by the convolutional layer.
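The normalization step can be sketched as follows: shifting a feature vector to zero mean and rescaling to unit variance, which is what keeps the output distribution stable. Learned scale and shift parameters of real LayerNorm are omitted for brevity:

```python
import math

def layer_norm(vector, eps=1e-5):
    """Normalize a feature vector to zero mean and (near-)unit variance
    so the convolutional layer's output follows a stable distribution."""
    mean = sum(vector) / len(vector)
    var = sum((x - mean) ** 2 for x in vector) / len(vector)
    return [(x - mean) / math.sqrt(var + eps) for x in vector]

normed = layer_norm([1.0, 2.0, 3.0, 4.0])
```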
- the probability value that the first text information belongs to each domain category in the preset domain category set is then calculated by the softmax layer of the convolutional neural network, and the domain category corresponding to the highest probability value is determined as the domain category of the first text information.
- S350: the step of inputting the first text information in the training set of the initial domain classification model into the initial domain classification model is executed in a loop until the predicted domain category of every piece of text information in the training set is obtained.
- steps S330 to S340 are performed on each piece of text information in the training set until the predicted domain category of each piece of text information in the training set is obtained.
- S360 Calculate an output loss between the predicted domain category of each piece of text information output by the initial domain classification model and the original domain category of each piece of text information.
- in step S370, the initial domain classification model is continuously optimized according to the comparison between the output loss of each piece of text information and the preset loss, until the output loss of every piece of text information is less than the preset loss.
- at that point the initial domain classification model can predict the domain category of text information at the required accuracy; training ends, and the domain classification model is obtained.
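The disclosure does not specify the form of the output loss in steps S360–S370; a common choice for this kind of classification, shown here only as an illustration, is the cross-entropy between the predicted distribution and the labelled original category:

```python
import math

def output_loss(predicted_probs, true_index):
    """Cross-entropy style output loss: -log of the probability the model
    assigned to the labelled original domain category."""
    return -math.log(predicted_probs[true_index])

# A confident, correct prediction yields a small loss (illustrative values).
confident = output_loss([0.9, 0.05, 0.05], true_index=0)
# An unsure prediction yields a larger loss.
unsure = output_loss([0.4, 0.3, 0.3], true_index=0)
```

Training would then continue (step S370) until this loss falls below the preset loss for every sample.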
- the more text information in the training set of the initial domain classification model, the better the training effect of the initial domain classification model.
- step S380 corresponds to the content described in steps S210 to S250; for details, refer to the relevant description of steps S210 to S250, which is not repeated here.
- the voice data processing method provided in this embodiment further describes the creation process of the domain classification model on the basis of the first embodiment.
- when classifying short texts in this domain, the pre-training model ALBERT is used to enhance the final classification model's understanding of non-command corpus; ALBERT's simple structure and fast inference speed also let it adapt well to the deployment environment after being deployed online.
- in the convolutional neural network model TextCNN, more convolutional layers and larger convolution kernels are used to increase the capacity of the convolutional layers, better preserving the contextual feature information of the text and making text analysis more accurate.
- the random deactivation (Dropout) layer is also used to prevent possible overfitting during training of the initial domain classification model, and layer normalization is used to keep the distribution of output feature vectors stable, improving the performance of the domain classification model.
- the domain classification model created by the method of this embodiment therefore has a higher recognition rate for the domain category of text information and is itself more stable, so it is well suited to analyzing the domain category corresponding to the text information of voice data, improving both the accuracy and the speed of domain category prediction.
- Embodiment 3 of the present disclosure also provides a voice data processing device 10, including:
- the acquiring module 11 is configured to acquire voice data and convert the voice data into text information.
- the processing module 12 is used to input the text information into the domain classification model to obtain the domain category of the target household electrical appliance;
- the model determining module 13 is configured to determine a target instruction recognition model matching the domain category of the target household electrical appliance from among multiple instruction recognition models according to the domain category of the target household electrical appliance.
- the instruction generation module 14 is configured to generate an execution instruction after identifying the text information based on the target instruction recognition model, and the execution instruction is used to instruct the target household electrical appliance to execute the operation indicated by the voice data.
- the communication module 15 is configured to send the execution instruction to the target household electrical appliance.
- the voice data processing device 10 also includes a model creation module 16. The model creation module 16 is used to: obtain an initial domain classification model, the initial domain classification model being a fusion model of the pre-training model ALBERT and a convolutional neural network model in which the output of ALBERT is the input of the convolutional neural network model; obtain the training set of the initial domain classification model, the training set including multiple pieces of text information, each labelled with an original domain category; train the initial domain classification model on this training set to obtain the predicted domain category of each piece of text information; calculate the output loss between the predicted domain category output by the initial domain classification model and the original domain category for each piece of text information; and, when the output loss corresponding to every piece of text information is less than the preset loss, determine that training of the initial domain classification model is complete and obtain the domain classification model.
- the model creation module 16 is specifically configured to: input the first text information in the training set of the initial domain classification model into the initial domain classification model to obtain the probability value that the first text information belongs to each preset domain category in the preset domain category set; select the preset domain category corresponding to the maximum of these probability values as the predicted domain category of the first text information; and execute the step of inputting the first text information into the initial domain classification model in a loop until the predicted domain category of every piece of text information in the training set is obtained.
- the convolutional neural network model includes N different convolutional layers, and N is an integer greater than 3.
- the model creation module 16 is specifically used to: input the first text information in the training set into the pre-training model ALBERT to obtain the first feature vector of each word in the first text information; input the first feature vector of each word, in sequence, to the first through Nth of the N different convolutional layers; extract the first feature vectors of the words in sequence based on the receptive field sizes of the first through Nth convolutional layers, obtaining feature vectors of N dimensions for the first text information; splice the N-dimensional feature vectors into a one-dimensional feature vector; and input the spliced feature vector to the fully connected layer of the convolutional neural network model to obtain the probability value that the first text information belongs to each domain category in the preset domain category set.
- after each convolutional layer receives the first feature vector of each word in the first text information, it activates the random deactivation layer and, based on layer normalization, outputs the feature vector of the first text information in the form of a normal-like distribution.
- the model creation module 16 is specifically used to supplement or delete the first text information in the training set of the initial domain classification model to a preset numerical length, and then input it into the pre-training model ALBERT to obtain the first feature vector of each word in the first text information.
- the model creation module 16 is also configured to perform maximum pooling processing on all feature vectors of N dimensions of the first text information.
- Embodiment 4 of the present disclosure further provides an electronic device 20, which includes a processor 21 and a memory 22 communicatively connected to the processor. The memory 22 stores computer-executable instructions, and the processor 21 executes the computer-executable instructions stored in the memory 22 to implement the speech data processing method described in any one of the above embodiments.
- The present disclosure also provides a computer-readable storage medium in which computer-executable instructions are stored; when executed by a processor, the computer-executable instructions implement the speech data processing method provided by any one of the above embodiments.
- The present disclosure also provides a computer program product, including a computer program; when the computer program is executed by a processor, the speech data processing method described in any one of the above embodiments is implemented.
- The above-mentioned computer-readable storage medium may be a read-only memory (Read-Only Memory, ROM), a programmable read-only memory (Programmable Read-Only Memory, PROM), an erasable programmable read-only memory (Erasable Programmable Read-Only Memory, EPROM), an electrically erasable programmable read-only memory (Electrically Erasable Programmable Read-Only Memory, EEPROM), a ferromagnetic random access memory (Ferromagnetic Random Access Memory, FRAM), a flash memory (Flash Memory), a magnetic surface memory, an optical disc, a compact disc read-only memory (Compact Disc Read-Only Memory, CD-ROM), or other such memory. It may also be any of various electronic devices including one or any combination of the above memories, such as a mobile phone, computer, tablet device, or personal digital assistant.
- These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising an instruction apparatus, the instruction apparatus implementing the functions specified in one or more procedures of the flowchart and/or one or more blocks of the block diagram.
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A speech data processing method, apparatus (10), electronic device (20), and storage medium, belonging to the technical field of voice interaction. The method includes: acquiring speech data and converting the speech data into text information (S210); inputting the text information into a domain classification model to obtain the domain category of a target home appliance (S220); determining, according to the domain category of the target home appliance, a target instruction recognition model matching the domain category of the target home appliance from among multiple instruction recognition models (S230); generating an execution instruction after recognizing the text information on the basis of the target instruction recognition model, the execution instruction being used to instruct the target home appliance to perform the operation indicated by the speech data (S240); and sending the execution instruction to the target home appliance (S250).
Description
The present disclosure claims priority to the Chinese patent application filed with the China National Intellectual Property Administration on November 24, 2021, with application number 202111404186.3 and entitled "Speech data processing method and apparatus, electronic device, and storage medium", the entire contents of which are incorporated herein by reference.
The present disclosure relates to voice interaction technology, and in particular to a speech data processing method and apparatus, an electronic device, and a storage medium.
In the field of smart home appliances, appliances are usually controlled through voice interaction. In a household scenario, multiple smart appliances may be present, and the speech data from a voice interaction is delivered to all of them; the instruction recognition models of the different domains then each determine whether the speech data is applicable to the appliances of their own domain.
This way of recognizing and processing speech data places an excessive load on the server, which increases the risk of server downtime.
Summary
The present disclosure provides a speech data processing method and apparatus, an electronic device, and a storage medium, to solve the problem that voice-controlled operation of appliances in the smart home field may place excessive load on the server and increase the risk of server downtime.
In one aspect, the present disclosure provides a speech data processing method, including:
acquiring speech data, and converting the speech data into text information;
inputting the text information into a domain classification model to obtain the domain category of a target home appliance;
determining, according to the domain category of the target home appliance, a target instruction recognition model matching the domain category of the target home appliance from among multiple instruction recognition models;
generating an execution instruction after recognizing the text information on the basis of the target instruction recognition model, the execution instruction being used to instruct the target home appliance to perform the operation indicated by the speech data;
sending the execution instruction to the target home appliance.
In another aspect, an electronic device is provided, including: a processor, and a memory communicatively connected to the processor;
the memory stores computer-executable instructions;
the processor executes the computer-executable instructions stored in the memory to implement the speech data processing method described in the first aspect.
In another aspect, a computer-readable storage medium is provided, in which computer-executable instructions are stored; when the instructions are executed, a computer is caused to perform the speech data processing method described in the first aspect.
In another aspect, a computer program product is provided, including a computer program which, when executed by a processor, implements the speech data processing method described in the first aspect.
In the method provided by the present disclosure, after speech data is acquired, it is converted into text information; the domain category of the target home appliance indicated by the text information is then identified, a target instruction recognition model is determined according to that domain category, and an execution instruction is generated after the text information is recognized by the target instruction recognition model. The execution instruction is sent to the target home appliance, so that the target home appliance performs the operation indicated by the speech data. Thus, when speech data is received, only the domain classification model and a single instruction recognition model are invoked to process it, rather than all instruction recognition models, which reduces the load on the server and in turn reduces the risk of server downtime.
Fig. 1 is a schematic diagram of an application scenario of the speech data processing method provided by the present disclosure.
Fig. 2 is a schematic flowchart of the speech data processing method provided by an embodiment of the present disclosure.
Fig. 3 is a schematic flowchart of the speech data processing method provided by another embodiment of the present disclosure.
Fig. 4 is a partial schematic diagram of the process of creating the domain classification model provided by an embodiment of the present disclosure.
Fig. 5 is a schematic diagram of the speech data processing apparatus provided by an embodiment of the present disclosure.
Fig. 6 is a schematic diagram of the electronic device provided by an embodiment of the present disclosure.
Exemplary embodiments are described in detail here, examples of which are illustrated in the accompanying drawings. Where the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure. On the contrary, they are merely examples of apparatuses and methods consistent with some aspects of the present disclosure as detailed in the appended claims.
In the field of smart home appliances, appliances are usually controlled through voice interaction. In a household scenario, smart appliances belonging to different domains may be present, and the speech data from a voice interaction is received by appliances of different domains. The smart appliances of each domain have their own independent instruction recognition model on the server; the server therefore distributes the speech data simultaneously, in a tree-like flow, to the instruction recognition model of each domain's appliances, and the instruction recognition models of the different domains each determine whether the speech data is applicable to the appliances of their own domain. In this way of recognizing and processing speech data, the simultaneous operation of multiple instruction recognition models puts pressure on the server, producing an excessive load and thereby increasing the risk of server downtime.
In view of this, the present disclosure provides a speech data processing method and apparatus, an electronic device, and a storage medium. After speech data is acquired, it is converted into text information; the appliance domain category indicated by the text information is then identified, a target instruction recognition model matching the domain category of the target home appliance is determined according to the indicated domain category, and an execution instruction is generated after the text information is recognized by the target instruction recognition model. The execution instruction is sent to the target home appliance, so that the target home appliance performs the operation indicated by the speech data. Thus, when speech data is received, only the domain classification model and a single instruction recognition model are invoked to process it, rather than all instruction recognition models, which reduces the load on the server and in turn reduces the risk of server downtime.
The speech data processing method provided by the present disclosure is applied to an electronic device, such as a computer or a server used by a smart home system. Fig. 1 is a schematic diagram of an application of the method. In the figure, after acquiring externally input speech data, the electronic device converts the speech data into text information, inputs the text information into the domain classification model to determine the domain category of the target home appliance, and then determines the corresponding target instruction recognition model according to that domain category. After the text information is processed on the basis of the target instruction recognition model, an execution instruction is output to the target home appliance.
Referring to Fig. 2, Embodiment 1 of the present disclosure provides a speech data processing method, including:
S210: acquire speech data and convert the speech data into text information.
The speech data is defined by the user; for example, the user saying "turn on the washing machine" is one piece of speech data. After receiving the speech data, the electronic device converts it into text information that can be processed; for example, the speech data "turn on the washing machine" is converted into the text information "turn on the washing machine".
The speech data may be a short sentence or a long sentence; correspondingly, the text information may be a sentence with few characters or one with many. The speech data may contain a single operation or multiple operations, for example "turn on the washing machine and then start washing clothes".
S220: input the text information into a domain classification model to obtain the domain category of the target home appliance.
The domain classification model recognizes the text information and determines the domain category to which the target home appliance indicated by the text information belongs. For example, if the text information is "turn on the washing machine and then start washing clothes", inputting it into the domain classification model yields "washing machine" as the domain category of the target home appliance.
The domain classification model predicts and outputs the domain category of the target home appliance from the text information. During prediction, the model computes the probability of the text information for every domain category in the model, and then selects the category with the largest probability as the domain category of the target home appliance corresponding to the text information. The domain categories in the model are, for example, washing machine, television, and sweeping robot. After the text information "turn on the washing machine and then start washing clothes" is processed by the model, the probability of it belonging to "washing machine" might be 90%, to "television" 20%, and to "sweeping robot" 60%; the predicted domain category of the target home appliance for this text information is therefore "washing machine".
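As a minimal illustration of this selection step (a sketch, not the implementation of the disclosure; the category names and probability values are hypothetical), the domain category with the largest probability can be picked as follows:

```python
def pick_domain(probabilities):
    """Return the domain category with the largest probability value."""
    return max(probabilities, key=probabilities.get)

# Hypothetical scores for "turn on the washing machine and then start washing clothes".
scores = {"washing_machine": 0.90, "television": 0.20, "sweeping_robot": 0.60}
print(pick_domain(scores))  # washing_machine
```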
Before the text information is input into the domain classification model, the model must be created. To create it, an initial domain classification model is first obtained and then trained to yield the domain classification model. Before training, a training set for the initial model is obtained; the training set contains multiple pieces of text information, each labeled with an original domain category. The initial model is then trained on this training set. When, for every piece of text information in the training set, the output loss between the predicted domain category output by the initial model and that piece's original domain category falls within a preset loss, training of the initial model is complete and the domain classification model is obtained.
Optionally, the initial domain classification model is a fusion of the pre-trained model ALBERT and a convolutional neural network model; after fusion, the output of the pre-trained model ALBERT serves as the input of the convolutional neural network model.
Speech data has two characteristics: first, instruction texts are short; second, the scope of the instructions is uncertain, since users may utter all kinds of sentences. Therefore, when creating the domain classification model, it must both fully learn the local and global features of short texts and generalize well over text, so as to understand the text converted from speech data more thoroughly. For this reason, the convolutional neural network model TextCNN is used in the initial domain classification model: TextCNN is strong at extracting shallow text features and performs well on short-text classification. To increase the text generalization capability of the domain classification model, the pre-trained model ALBERT is introduced into the initial model. During pre-training, ALBERT has already learned feature information from large corpora such as encyclopedia Q&A, everyday dialogue, and news reports, which substantially strengthens its understanding of non-control utterances. Moreover, ALBERT has a simple structure and fast inference, so it adapts well to the deployment environment once deployed online.
S230: determine, according to the domain category of the target home appliance, a target instruction recognition model matching that domain category from among multiple instruction recognition models.
Each domain category has an independent instruction recognition model on the electronic device; once the domain category of the target home appliance is determined in step S220, the corresponding target instruction recognition model can be determined.
S240: generate an execution instruction after recognizing the text information on the basis of the target instruction recognition model, the execution instruction being used to instruct the target home appliance to perform the operation indicated by the speech data.
After recognizing the text information, the target instruction recognition model can determine the information of the target home appliance indicated by the text and the operation that the appliance is to perform. Once these are identified, they are packaged into an execution instruction.
S250: send the execution instruction to the target home appliance.
After receiving the execution instruction, the target home appliance performs the operation indicated in it, completing the process of controlling the target home appliance by voice.
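The overall dispatch flow of steps S210 to S250 can be sketched as follows (a simplified illustration under assumed interfaces; the classifier and recognizer functions here are hypothetical stand-ins, not the models of the disclosure):

```python
def process_speech(text, classify_domain, recognizers):
    """Route text to exactly one domain-specific recognizer.

    classify_domain: callable mapping text -> domain category
    recognizers:     dict mapping domain category -> recognizer callable
    """
    domain = classify_domain(text)           # S220: one classification call
    recognize = recognizers[domain]          # S230: pick a single model
    return recognize(text)                   # S240: build the instruction

# Hypothetical stand-ins for the trained models.
classifier = lambda text: "washing_machine" if "wash" in text else "television"
recognizers = {
    "washing_machine": lambda t: {"device": "washing_machine", "op": "start"},
    "television": lambda t: {"device": "television", "op": "on"},
}
print(process_speech("start washing", classifier, recognizers))
```

Only one recognizer callable runs per utterance, which mirrors how the method avoids invoking every domain's model.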
In summary, in the method provided by this embodiment, after speech data is acquired it is converted into text information; the domain category of the target home appliance indicated by the text information is identified; a target instruction recognition model is determined according to that domain category; and an execution instruction is generated after the text information is recognized by the target instruction recognition model. The execution instruction is sent to the target home appliance, so that it performs the operation indicated by the speech data. Thus, when speech data is received, only the domain classification model and a single instruction recognition model are invoked to process it, rather than all instruction recognition models, which reduces the load on the server and in turn reduces the risk of server downtime.
Referring to Fig. 3, Embodiment 2 of the present disclosure further provides a speech data processing method that, on the basis of Embodiment 1, further describes the creation process of the domain classification model. The method provided by this embodiment includes:
S310: obtain an initial domain classification model, which is a fusion of the pre-trained model ALBERT and a convolutional neural network model, with the output of the pre-trained model ALBERT serving as the input of the convolutional neural network model.
As described in step S220, the convolutional neural network model TextCNN is strong at extracting shallow text features and performs well on short-text classification. During pre-training, ALBERT has already learned feature information from large corpora such as encyclopedia Q&A, everyday dialogue, and news reports, which substantially strengthens its understanding of non-control utterances. Moreover, ALBERT has a simple structure and fast inference, so it adapts well to the deployment environment once deployed online.
S320: obtain a training set for the initial domain classification model; the training set contains multiple pieces of text information, each labeled with an original domain category.
The original domain category labeling a piece of text information may be defined by the developers. The training set should contain as many pieces of text information as possible, to improve the training effect of the initial domain classification model.
S330: input a first piece of text information from the training set of the initial domain classification model into the initial model, obtaining the probability values that the first text information belongs to each preset domain category in a preset domain category set.
The first text information is any piece of text information in the training set of the initial domain classification model, and the preset domain categories are, for example, "washing machine", "television", and "sweeping robot". When the first text information is input into the initial model, it is actually input as numerical values. To keep the numerical length of every piece of text information the same and thereby improve training, the first text information is padded or truncated to a preset numerical length before being input into the pre-trained model ALBERT within the initial model: when its numerical length exceeds the preset length, characters are deleted from it; when its numerical length falls short of the preset length, characters are padded.
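A minimal sketch of this length normalization (the pad token id and the preset length of 10 here are assumed for illustration, not values specified by the disclosure):

```python
def fix_length(token_ids, target_len=10, pad_id=0):
    """Pad or truncate a token-id sequence to a fixed length."""
    if len(token_ids) >= target_len:
        return token_ids[:target_len]          # delete surplus characters
    return token_ids + [pad_id] * (target_len - len(token_ids))  # pad

print(fix_length([5, 7, 9], target_len=5))        # [5, 7, 9, 0, 0]
print(fix_length(list(range(12)), target_len=5))  # [0, 1, 2, 3, 4]
```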
After the first text information is input into the initial domain classification model, it first goes to the pre-trained model ALBERT, which outputs a first feature vector for each character of the first text information. The first feature vectors of the characters are then input, in order, into the convolutional neural network model.
As shown in Fig. 4, in this embodiment the convolutional neural network model includes N different convolutional layers (Fig. 4 takes six as an example), where N is an integer greater than 3, and each convolutional layer has a different receptive field. The first feature vectors of the characters of the first text information are input, in order, into the first through Nth of the N different convolutional layers. The first through Nth convolutional layers can all take the first text information as their processing sample simultaneously and extract from the first feature vectors of its characters; that is, the first text information is processed once by every convolutional layer. During extraction, the first feature vectors of the characters are extracted in sequence according to the receptive field size of each convolutional layer, yielding the feature vectors of N dimensions of the first text information.
The feature vectors of the N dimensions of the first text information are then concatenated into a one-dimensional feature vector, and the concatenated feature vector is input into the fully connected layer of the convolutional neural network model, obtaining the probability values that the first text information belongs to each domain category in the preset domain category set. The preset domain category set contains, for example, the categories "washing machine", "television", and "sweeping robot".
As shown in Fig. 4, there are six different convolutional layers: Conv-3, Conv-4, Conv-5, Conv-6, Conv-7, and Conv-8, with receptive field sizes of 3 through 8 respectively and a stride of 1 each. In character order, Conv-3 extracts the first feature vectors of 3 characters at a time and, after processing them, outputs the feature vector of the first dimension of the first text information; likewise, Conv-4 (receptive field 4) outputs the second dimension, Conv-5 the third, Conv-6 the fourth, Conv-7 the fifth, and Conv-8 (extracting the first feature vectors of 8 characters at a time) the sixth dimension of the feature vector of the first text information.
After the six-dimensional feature vectors of the first text information are concatenated into a one-dimensional feature vector, the concatenated vector is input into the fully connected layer (Dense) of the convolutional neural network model. The fully connected layer extracts global semantic information from the concatenated vector and outputs the probability values that the first text information belongs to each domain category in the preset domain category set.
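The multi-branch extraction and concatenation described above can be sketched in plain Python (a toy stand-in that averages each window instead of applying learned filters; the kernel sizes 3 through 8 follow Fig. 4, everything else is illustrative):

```python
def conv_branch(vectors, k):
    """Slide a window of size k (stride 1) over per-character vectors,
    reducing each window to one number (here: its mean) as a toy filter."""
    windows = [vectors[i:i + k] for i in range(len(vectors) - k + 1)]
    return [sum(sum(w, [])) / (k * len(vectors[0])) for w in windows]

def extract_features(vectors, kernel_sizes=range(3, 9)):
    """Run all branches and concatenate their outputs into one vector."""
    features = []
    for k in kernel_sizes:
        features.extend(conv_branch(vectors, k))  # one branch per kernel size
    return features

# Ten characters, each with a 4-dimensional "first feature vector".
chars = [[float(i)] * 4 for i in range(10)]
print(len(extract_features(chars)))  # 8 + 7 + 6 + 5 + 4 + 3 = 33
```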
Optionally, after the feature vectors of the N dimensions of the first text information are obtained, that is, before the six-dimensional feature vectors are concatenated into one dimension, max pooling may also be applied to each of the N-dimensional feature vectors, to obtain denoised versions of the N-dimensional feature vectors.
Optionally, after receiving the first feature vectors of the characters of the first text information, each convolutional layer activates its random-deactivation layer (Dropout) through the GELU activation function, which adds non-linear features to the domain classification model and prevents the initial model from overfitting during training (overfitting: over-learning text information whose domain category labels are problematic). In addition, the first feature vectors of the characters are normalized on the basis of the layer normalization (Layer Norm) in the convolutional layer, so that the output feature vectors of the first text information follow a normal distribution, improving the stability of the feature vectors output by the convolutional layers.
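The layer normalization step can be illustrated as follows (a plain sketch that standardizes one feature vector to zero mean and unit variance; the learnable gain and bias of a full layer-normalization implementation are omitted):

```python
import math

def layer_norm(vector, eps=1e-5):
    """Standardize a feature vector to zero mean and unit variance."""
    mean = sum(vector) / len(vector)
    var = sum((x - mean) ** 2 for x in vector) / len(vector)
    return [(x - mean) / math.sqrt(var + eps) for x in vector]

normed = layer_norm([1.0, 2.0, 3.0, 4.0])
print(round(abs(sum(normed)), 6))  # 0.0 (zero mean after normalization)
```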
S340: select, from the probability values that the first text information belongs to each preset domain category in the preset domain category set, the preset domain category with the largest probability value as the predicted domain category of the first text information.
The probability values of the first text information over the preset domain category set are further computed by the Softmax of the convolutional neural network, and the domain category with the largest of those probability values is determined as the domain category corresponding to the first text information.
S350: repeat the step of inputting a first piece of text information from the training set of the initial domain classification model into the initial model, until the predicted domain category of every piece of text information in the training set is obtained.
That is, all pieces of text information in the training set are traversed, and steps S330 to S340 are performed for each, until the predicted domain category of every piece of text information in the training set is obtained.
S360: compute the output loss between the predicted domain category output by the initial domain classification model for each piece of text information and that piece's original domain category.
During training of the initial domain classification model, the predicted domain category of the first text information is compared with its labeled original domain category to determine the output loss of the initial model when predicting the domain category of the first text information, and the initial model is iteratively optimized by updating the model with the output loss. The iterative optimization proceeds as in step S370: the initial domain model is continually optimized according to the comparison between each piece's output loss and a preset loss, until the output loss of every piece of text information output by the model is smaller than the preset loss.
S370: when the output loss corresponding to every piece of text information is smaller than the preset loss, determine that training of the initial domain classification model has ended, obtaining the domain classification model.
When the output loss corresponding to every piece of text information is smaller than the preset loss, the initial domain classification model can predict the domain category of text information at the required accuracy; training then ends and the domain classification model is obtained. The more pieces of text information the training set contains, the better the training effect of the initial domain classification model.
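The stopping criterion of steps S350 to S370 can be sketched as a loop that keeps updating until every sample's loss falls below the preset threshold (the model, loss function, and update rule here are hypothetical toy stand-ins, not the ALBERT-TextCNN training of the disclosure):

```python
def train_until_converged(samples, predict, loss_fn, update, preset_loss, max_epochs=100):
    """Iterate over the whole training set until every per-sample loss
    is below preset_loss (or max_epochs is reached as a safety stop)."""
    for _ in range(max_epochs):
        losses = [loss_fn(predict(text), label) for text, label in samples]
        if all(l < preset_loss for l in losses):   # S370 stopping condition
            return True
        for (text, label), l in zip(samples, losses):
            update(text, label, l)                 # S360: optimize from the loss
    return False

# Toy stand-in: a "model" that memorizes labels after one update pass.
memory = {}
predict = lambda t: memory.get(t)
loss_fn = lambda pred, label: 0.0 if pred == label else 1.0
update = lambda t, label, l: memory.__setitem__(t, label)
data = [("turn on washer", "washing_machine"), ("tv on", "television")]
print(train_until_converged(data, predict, loss_fn, update, preset_loss=0.5))  # True
```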
S380: acquire speech data, process it, and send an execution instruction to the target home appliance.
The content of step S380 is that of steps S210 to S250; refer to the related descriptions of steps S210 to S250, which are not repeated here.
On the basis of Embodiment 1, the speech data processing method provided by this embodiment further describes the creation process of the domain classification model. For classifying open-domain short texts, the pre-trained model ALBERT is used to strengthen the final domain classification model's understanding of non-control utterances; moreover, ALBERT has a simple structure and fast inference, so it adapts well to the deployment environment once deployed online. In the convolutional neural network model TextCNN, more convolutional layers and larger convolution kernels are used to enlarge the capacity of the convolutional layers and better preserve the feature information between the text and its context, making the analysis of text information more accurate. In addition, the random-deactivation layer Dropout is used to prevent the overfitting that may occur while the initial domain classification model is learning, and layer normalization is used to keep the distribution of the output feature vectors stable, improving the performance of the domain classification model. In summary, the domain classification model created in the method of this embodiment has a higher recognition rate for the domain category of text information and is itself more stable, so it can be well applied to the analysis of the domain categories corresponding to the text information of speech data, improving the accuracy and speed of predicting the domain category of text information.
Referring to Fig. 5, Embodiment 3 of the present disclosure further provides a speech data processing apparatus 10, including:
an acquisition module 11, configured to acquire speech data and convert the speech data into text information;
a processing module 12, configured to input the text information into a domain classification model to obtain the domain category of a target home appliance;
a model determination module 13, configured to determine, according to the domain category of the target home appliance, a target instruction recognition model matching that domain category from among multiple instruction recognition models;
an instruction generation module 14, configured to generate an execution instruction after recognizing the text information on the basis of the target instruction recognition model, the execution instruction being used to instruct the target home appliance to perform the operation indicated by the speech data;
a communication module 15, configured to send the execution instruction to the target home appliance.
The speech data processing apparatus 10 further includes a model creation module 16, configured to: obtain an initial domain classification model, the initial domain classification model being a fusion of the pre-trained model ALBERT and a convolutional neural network model, with the output of the pre-trained model ALBERT serving as the input of the convolutional neural network model; obtain a training set for the initial domain classification model, the training set containing multiple pieces of text information, each labeled with an original domain category; train the initial domain classification model on its training set to obtain the predicted domain category of each piece of text information; compute the output loss between the predicted domain category output by the initial model for each piece of text information and that piece's original domain category; and, when the output loss corresponding to every piece of text information is smaller than a preset loss, determine that training of the initial domain classification model has ended, obtaining the domain classification model.
The model creation module 16 is specifically configured to: input a first piece of text information from the training set of the initial domain classification model into the initial model, obtaining the probability values that the first text information belongs to each preset domain category in a preset domain category set; select the preset domain category with the largest of those probability values as the predicted domain category of the first text information; and repeat the step of inputting a first piece of text information from the training set into the initial model until the predicted domain category of every piece of text information in the training set is obtained.
The convolutional neural network model includes N different convolutional layers, N being an integer greater than 3. The model creation module 16 is specifically configured to: input the first text information from the training set into the pre-trained model ALBERT to obtain a first feature vector for each character of the first text information; input the first feature vectors of the characters, in order, into the first through Nth of the N different convolutional layers, extracting from them in sequence according to the receptive field sizes of the first through Nth convolutional layers respectively, to obtain the feature vectors of N dimensions of the first text information; and concatenate the feature vectors of the N dimensions into a one-dimensional feature vector and input the concatenated vector into the fully connected layer of the convolutional neural network model, obtaining the probability values that the first text information belongs to each domain category in the preset domain category set.
After receiving the first feature vectors of the characters of the first text information, each convolutional layer activates its random-deactivation layer and, on the basis of layer normalization, outputs the feature vectors of the first text information in the form of a normal distribution.
The model creation module 16 is specifically configured to pad or truncate the first text information in the training set of the initial domain classification model to a preset numerical length before inputting it into the pre-trained model ALBERT, obtaining a first feature vector for each character of the first text information.
The model creation module 16 is further configured to apply max pooling to each of the feature vectors of the N dimensions of the first text information.
Referring to Fig. 6, Embodiment 4 of the present disclosure further provides an electronic device 20, including a processor 21 and a memory 22 communicatively connected to the processor. The memory 22 stores computer-executable instructions, and the processor 21 executes the computer-executable instructions stored in the memory 22 to implement the speech data processing method described in any one of the above embodiments.
The present disclosure further provides a computer-readable storage medium in which computer-executable instructions are stored; when executed by a processor, the computer-executable instructions implement the speech data processing method provided by any one of the above embodiments.
The present disclosure further provides a computer program product, including a computer program which, when executed by a processor, implements the speech data processing method described in any one of the above embodiments.
It should be noted that the above computer-readable storage medium may be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), a ferromagnetic random access memory (FRAM), a flash memory, a magnetic surface memory, an optical disc, a compact disc read-only memory (CD-ROM), or other such memory. It may also be any of various electronic devices including one or any combination of the above memories, such as a mobile phone, computer, tablet device, or personal digital assistant.
It should be noted that, herein, the terms "comprise", "include", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or apparatus that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of additional identical elements in the process, method, article, or apparatus that includes it.
The serial numbers of the above embodiments of the present disclosure are for description only and do not indicate the superiority or inferiority of the embodiments.
Through the description of the above implementations, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus the necessary general-purpose hardware platform, and of course also by hardware, although in many cases the former is the better implementation. On the basis of this understanding, the technical solution of the present disclosure, in essence or in the part contributing over the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as a ROM/RAM, magnetic disk, or optical disc) and includes a number of instructions for causing a smart home appliance (which may be a mobile phone, computer, server, air conditioner, network device, or the like) to perform the methods described in the various embodiments of the present disclosure.
The present disclosure is described with reference to flowcharts and/or block diagrams of the methods, devices (systems), and computer program products according to the embodiments of the present disclosure. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
The above are merely preferred embodiments of the present disclosure and do not thereby limit its patent scope. Any equivalent structural or flow transformation made using the contents of the specification and drawings of the present disclosure, or any direct or indirect application in other related technical fields, is likewise included within the patent protection scope of the present disclosure.
Claims (16)
- A speech data processing method, including: acquiring speech data and converting the speech data into text information; inputting the text information into a domain classification model to obtain the domain category of a target home appliance; determining, according to the domain category of the target home appliance, a target instruction recognition model matching the domain category of the target home appliance from among multiple instruction recognition models; generating an execution instruction after recognizing the text information on the basis of the target instruction recognition model, the execution instruction being used to instruct the target home appliance to perform the operation indicated by the speech data; and sending the execution instruction to the target home appliance.
- The method according to claim 1, further including: obtaining an initial domain classification model, the initial domain classification model being a fusion of the pre-trained model ALBERT and a convolutional neural network model, with the output of the pre-trained model ALBERT serving as the input of the convolutional neural network model; obtaining a training set for the initial domain classification model, the training set containing multiple pieces of text information, each labeled with an original domain category; training the initial domain classification model on its training set to obtain the predicted domain category of each piece of text information; computing the output loss between the predicted domain category output by the initial domain classification model for each piece of text information and that piece's original domain category; and, when the output loss corresponding to every piece of text information is smaller than a preset loss, determining that training of the initial domain classification model has ended, obtaining the domain classification model.
- The method according to claim 2, wherein training the initial domain classification model on its training set to obtain the predicted domain category of each piece of text information includes: inputting a first piece of text information from the training set of the initial domain classification model into the initial domain classification model, obtaining the probability values that the first text information belongs to each preset domain category in a preset domain category set; selecting the preset domain category with the largest of those probability values as the predicted domain category of the first text information; and repeating the step of inputting a first piece of text information from the training set of the initial domain classification model into the initial domain classification model until the predicted domain category of every piece of text information in the training set is obtained.
- The method according to claim 3, wherein the convolutional neural network model includes N different convolutional layers, N being an integer greater than 3, and inputting the first text information from the training set of the initial domain classification model into the initial domain classification model to obtain the probability values that the first text information belongs to each preset domain category in the preset domain category set includes: inputting the first text information from the training set into the pre-trained model ALBERT to obtain a first feature vector for each character of the first text information; inputting the first feature vectors of the characters of the first text information, in order, into the first through Nth of the N different convolutional layers, and extracting from them in sequence according to the receptive field sizes of the first through Nth convolutional layers respectively, to obtain the feature vectors of N dimensions of the first text information; and concatenating the feature vectors of the N dimensions of the first text information into a one-dimensional feature vector and inputting the concatenated feature vector into the fully connected layer of the convolutional neural network model, obtaining the probability values that the first text information belongs to each domain category in the preset domain category set.
- The method according to claim 4, wherein, after receiving the first feature vectors of the characters of the first text information, each convolutional layer activates a random-deactivation layer and, on the basis of layer normalization, outputs the feature vectors of the first text information in the form of a normal distribution.
- The method according to claim 4, wherein inputting the first text information from the training set into the pre-trained model ALBERT to obtain a first feature vector for each character of the first text information includes: padding or truncating the first text information from the training set to a preset numerical length before inputting it into the pre-trained model ALBERT, obtaining a first feature vector for each character of the first text information.
- The method according to claim 4, further including, after inputting the first feature vectors of the characters of the first text information into the first through Nth of the N different convolutional layers and extracting from them in sequence according to the receptive field sizes of the first through Nth convolutional layers respectively to obtain the feature vectors of N dimensions of the first text information: applying max pooling to each of the feature vectors of the N dimensions of the first text information.
- A speech data processing apparatus, including: an acquisition module, configured to acquire speech data and convert the speech data into text information; a processing module, configured to input the text information into a domain classification model to obtain the domain category of a target home appliance; a model determination module, configured to determine, according to the domain category of the target home appliance, a target instruction recognition model matching the domain category of the target home appliance from among multiple instruction recognition models; an instruction generation module, configured to generate an execution instruction after recognizing the text information on the basis of the target instruction recognition model, the execution instruction being used to instruct the target home appliance to perform the operation indicated by the speech data; and a communication module, configured to send the execution instruction to the target home appliance.
- The apparatus according to claim 8, further including a model creation module, configured to obtain an initial domain classification model, the initial domain classification model being a fusion of the pre-trained model ALBERT and a convolutional neural network model, with the output of the pre-trained model ALBERT serving as the input of the convolutional neural network model; the model creation module is further configured to obtain a training set for the initial domain classification model, the training set containing multiple pieces of text information, each labeled with an original domain category; the model creation module is further configured to train the initial domain classification model on its training set to obtain the predicted domain category of each piece of text information; the model creation module is further configured to compute the output loss between the predicted domain category output by the initial domain classification model for each piece of text information and that piece's original domain category; and the model creation module is further configured to determine, when the output loss corresponding to every piece of text information is smaller than a preset loss, that training of the initial domain classification model has ended, obtaining the domain classification model.
- The apparatus according to claim 9, wherein the model creation module is specifically configured to: input a first piece of text information from the training set of the initial domain classification model into the initial domain classification model, obtaining the probability values that the first text information belongs to each preset domain category in a preset domain category set; select the preset domain category with the largest of those probability values as the predicted domain category of the first text information; and repeat the step of inputting a first piece of text information from the training set of the initial domain classification model into the initial domain classification model until the predicted domain category of every piece of text information in the training set is obtained.
- The apparatus according to claim 10, wherein the convolutional neural network model includes N different convolutional layers, N being an integer greater than 3, and the model creation module is specifically configured to: input the first text information from the training set into the pre-trained model ALBERT to obtain a first feature vector for each character of the first text information; and input the first feature vectors of the characters of the first text information, in order, into the first through Nth of the N different convolutional layers, extracting from them in sequence according to the receptive field sizes of the first through Nth convolutional layers respectively, to obtain the feature vectors of N dimensions of the first text information;
- The apparatus according to claim 11, wherein, after receiving the first feature vectors of the characters of the first text information, each convolutional layer activates a random-deactivation layer and, on the basis of layer normalization, outputs the feature vectors of the first text information in the form of a normal distribution.
- The apparatus according to claim 11, wherein the model creation module is specifically configured to: pad or truncate the first text information from the training set to a preset numerical length before inputting it into the pre-trained model ALBERT, obtaining a first feature vector for each character of the first text information.
- The apparatus according to claim 11, wherein the model creation module is further configured to apply max pooling to each of the feature vectors of the N dimensions of the first text information.
- An electronic device, including: a processor, and a memory communicatively connected to the processor; the memory stores computer-executable instructions; and the processor executes the computer-executable instructions stored in the memory to implement the speech data processing method according to any one of claims 1 to 7.
- A computer-readable storage medium in which computer-executable instructions are stored; when the instructions are executed, a computer is caused to perform the speech data processing method according to any one of claims 1 to 7.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202111404186.3A | 2021-11-24 | 2021-11-24 | Speech data processing method and apparatus, electronic device, and storage medium (published as CN114141248A)
CN202111404186.3 | | |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023093074A1 | 2023-06-01 |
Family
ID=80391258
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2022/105326 (WO2023093074A1) | Speech data processing method and apparatus, electronic device, and storage medium | | 2022-07-13 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN114141248A (zh) |
WO (1) | WO2023093074A1 (zh) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
- CN114141248A | 2021-11-24 | 2022-03-04 | 青岛海尔科技有限公司 | Speech data processing method and apparatus, electronic device, and storage medium |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
- US20150058018A1 | 2013-08-23 | 2015-02-26 | Nuance Communications, Inc. | Multiple pass automatic speech recognition methods and apparatus |
- CN109347708A | 2018-10-15 | 2019-02-15 | 珠海格力电器股份有限公司 | Speech recognition method and apparatus, home appliance, cloud server, and medium |
- CN110648664A | 2019-10-11 | 2020-01-03 | 广东美的白色家电技术创新中心有限公司 | Home appliance control method and apparatus, and apparatus having a storage function |
- CN111933135A | 2020-07-31 | 2020-11-13 | 深圳Tcl新技术有限公司 | Terminal control method and apparatus, smart terminal, and computer-readable storage medium |
- US20210050017A1 | 2019-08-13 | 2021-02-18 | Samsung Electronics Co., Ltd. | System and method for modifying speech recognition result |
- CN113011533A | 2021-04-30 | 2021-06-22 | 平安科技(深圳)有限公司 | Text classification method and apparatus, computer device, and storage medium |
- CN113051374A | 2021-06-02 | 2021-06-29 | 北京沃丰时代数据科技有限公司 | Text matching optimization method and apparatus |
- CN114141248A | 2021-11-24 | 2022-03-04 | 青岛海尔科技有限公司 | Speech data processing method and apparatus, electronic device, and storage medium |
- 2021-11-24: CN application CN202111404186.3A filed; published as CN114141248A (status: pending)
- 2022-07-13: PCT application PCT/CN2022/105326 filed; published as WO2023093074A1
Also Published As
Publication number | Publication date |
---|---|
CN114141248A (zh) | 2022-03-04 |
Legal Events
Date | Code | Title | Description
---|---|---|---
| 121 | Ep: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 22897170; Country of ref document: EP; Kind code of ref document: A1
| NENP | Non-entry into the national phase | Ref country code: DE