CN111710337B - Voice data processing method and device, computer readable medium and electronic equipment

Voice data processing method and device, computer readable medium and electronic equipment

Info

Publication number
CN111710337B
Authority
CN
China
Prior art keywords
voice
keywords
keyword
frame
recognized
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010549158.XA
Other languages
Chinese (zh)
Other versions
CN111710337A (en)
Inventor
元涛
兰泽华
林昱
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ringslink Xiamen Network Communication Technologies Co ltd
Original Assignee
Ringslink Xiamen Network Communication Technologies Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ringslink Xiamen Network Communication Technologies Co ltd filed Critical Ringslink Xiamen Network Communication Technologies Co ltd
Priority to CN202010549158.XA
Publication of CN111710337A
Application granted
Publication of CN111710337B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L 2015/025 - Phonemes, fenemes or fenones being the recognition units
    • G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 - Training
    • G10L 15/08 - Speech classification or search
    • G10L 15/16 - Speech classification or search using artificial neural networks
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223 - Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The embodiments of the present application provide a voice data processing method and apparatus, a computer readable medium, and an electronic device. The voice data processing method includes the following steps: acquiring voice input information in real time; framing the voice input information to obtain the voice frames corresponding to the voice input information; performing phoneme recognition on the voice frames with a pre-trained acoustic model to recognize the phonemes contained in each voice frame; for the result of each phoneme recognition, performing keyword recognition on the currently recognized phonemes to determine the keywords contained in the voice input information; and, if the number of times the same keyword is continuously recognized is greater than or equal to a predetermined number, determining the keyword as the target keyword and performing the corresponding action according to the target keyword. The technical solution of the embodiments of the present application improves the efficiency of voice recognition and thereby ensures the response speed of the voice control device.

Description

Voice data processing method and device, computer readable medium and electronic equipment
Technical Field
The present invention relates to the field of speech recognition technology, and in particular, to a method and apparatus for processing speech data, a computer readable medium, and an electronic device.
Background
With the development of speech recognition technology, its applications are becoming increasingly broad, for example voice control of automotive devices, smart toys, or smart homes. In current technical schemes, the voice control device obtains the voice information input by a user and performs voice recognition on it in order to control the actions of the device. How to improve the efficiency of voice recognition, and thereby ensure the response speed of the voice control device, has therefore become a technical problem to be solved.
Disclosure of Invention
The embodiments of the present application provide a voice data processing method and apparatus, a computer readable medium, and an electronic device, so that the efficiency of voice recognition can be improved at least to a certain extent, thereby ensuring the response speed of the voice control device.
Other features and advantages of the present application will be apparent from the following detailed description, or may be learned in part by the practice of the application.
According to an aspect of the embodiments of the present application, there is provided a method for processing voice data, including:
acquiring voice input information in real time;
framing the voice input information to obtain the voice frames corresponding to the voice input information;
performing phoneme recognition on the voice frames with a pre-trained acoustic model so as to recognize the phonemes contained in the voice frames;
for the result of each phoneme recognition, performing keyword recognition on the currently recognized phonemes to determine the keywords contained in the voice input information;
and, if the number of times the same keyword is continuously recognized is greater than or equal to a predetermined number, determining the keyword as the target keyword and performing the corresponding action according to the target keyword.
According to an aspect of the embodiments of the present application, there is provided a processing apparatus for voice data, including:
the acquisition module is used for acquiring voice input information in real time;
the framing module is used for framing the voice input information to obtain a voice frame corresponding to the voice input information;
the first recognition module is used for carrying out phoneme recognition on the voice frame by adopting a pre-trained acoustic model so as to recognize phonemes contained in the voice frame;
the second recognition module is used for performing keyword recognition on the currently recognized phonemes for the result of each phoneme recognition, so as to determine the keywords contained in the voice input information;
and the processing module is used for determining the keyword as the target keyword if the number of times the same keyword is continuously recognized is greater than or equal to a predetermined number, so as to perform the corresponding action according to the target keyword.
Based on the foregoing, in some embodiments of the present application, the processing module is configured to: determine the voice frame in which the keyword is first recognized as the start frame; and, if no other keyword is recognized in the predetermined number of voice frames after the start frame, determine the keyword as the target keyword.
Based on the foregoing, in some embodiments of the present application, the first recognition module is configured to: extract features from the voice frame to obtain the voice features corresponding to the voice frame; and input the voice features into the acoustic model so that the acoustic model outputs the phonemes contained in the voice frame.
Based on the foregoing, in some embodiments of the present application, the second recognition module is configured to: acquire a first weight of a keyword and a second weight of a near-voice word whose pronunciation is close to the keyword; and input the currently recognized phonemes, the first weight, and the second weight into a language model so that the language model outputs the keyword corresponding to the currently recognized phonemes.
Based on the foregoing, in some embodiments of the present application, the second recognition module includes:
a vocabulary recognition unit, used to input the currently recognized phonemes into a plurality of language models respectively, so that the language models respectively output the pending vocabularies corresponding to the currently recognized phonemes;
and a keyword determining unit, used to determine the keyword corresponding to the currently recognized phonemes according to the pending vocabularies.
Based on the foregoing, in some embodiments of the present application, the processing module is further configured to: display a keyword editing interface if a modification request for a keyword is received; and store the modification information of the keyword received through the keyword editing interface.
According to an aspect of the embodiments of the present application, there is provided a computer readable medium having stored thereon a computer program which, when executed by a processor, implements a method of processing speech data as described in the above embodiments.
According to an aspect of an embodiment of the present application, there is provided an electronic device including: one or more processors; and a storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method of processing speech data as described in the above embodiments.
In the technical solutions provided in some embodiments of the present application, voice input information is acquired in real time and framed to obtain the corresponding voice frames. A pre-trained acoustic model performs phoneme recognition on each voice frame to identify the phonemes it contains, and for the result of each phoneme recognition, keyword recognition is performed on the currently recognized phonemes to determine the keywords contained in the voice input information. If the number of times the same keyword is continuously recognized is greater than or equal to a predetermined number, the keyword is determined as the target keyword and the corresponding action is performed according to it. Because the voice input information is acquired and recognized in real time, recognition can proceed while the user is performing the voice control operation, and a keyword that is continuously recognized is confirmed as the target keyword. This improves the efficiency of voice recognition and thereby ensures the response speed of the voice control device.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application. It is apparent that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained from these drawings without inventive effort for a person of ordinary skill in the art. In the drawings:
FIG. 1 shows a schematic diagram of an exemplary system architecture to which the technical solutions of embodiments of the present application may be applied;
FIG. 2 illustrates a flow diagram of a method of processing voice data according to one embodiment of the present application;
FIG. 3 is a flow chart illustrating step S250 in the method of processing speech data of FIG. 2 according to one embodiment of the present application;
fig. 4 is a flowchart illustrating step S230 in the processing method of voice data of fig. 2 according to an embodiment of the present application;
FIG. 5 is a flow chart illustrating step S240 in the method of processing speech data of FIG. 2 according to one embodiment of the present application;
fig. 6 is a flowchart illustrating step S240 in the method for processing voice data of fig. 2 according to another embodiment of the present application;
FIG. 7 is a flow chart of editing keywords further included in a method of processing voice data according to one embodiment of the present application;
FIG. 8 illustrates a block diagram of a processing device for voice data according to one embodiment of the present application;
fig. 9 shows a schematic diagram of a computer system suitable for use in implementing the electronic device of the embodiments of the present application.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the present application. One skilled in the relevant art will recognize, however, that the aspects of the application can be practiced without one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known methods, devices, implementations, or operations are not shown or described in detail to avoid obscuring aspects of the application.
The block diagrams depicted in the figures are merely functional entities and do not necessarily correspond to physically separate entities. That is, the functional entities may be implemented in software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The flow diagrams depicted in the figures are exemplary only, and do not necessarily include all of the elements and operations/steps, nor must they be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the order of actual execution may be changed according to actual situations.
Fig. 1 shows a schematic diagram of an exemplary system architecture to which the technical solutions of the embodiments of the present application may be applied.
As shown in fig. 1, the system architecture may include a terminal device (such as one or more of the smartphone 101, tablet 102, and portable computer 103 shown in fig. 1, though it may also be a desktop computer, an embedded device, or the like), a network 104, and a server 105. The network 104 is the medium used to provide communication links between the terminal devices and the server 105, and may include various connection types, such as wired or wireless communication links.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. For example, the server 105 may be a server cluster formed by a plurality of servers.
A user may use a terminal device to interact with the server 105 via the network 104 to receive or transmit information. For example, the user performs voice input with the terminal device 101 (or terminal device 102 or 103): the terminal device 101 may acquire the voice input information in real time and frame it to obtain the corresponding voice frames; perform phoneme recognition on the voice frames with a pre-trained acoustic model to recognize the phonemes they contain; perform keyword recognition on the currently recognized phonemes for the result of each phoneme recognition to determine the keywords contained in the voice input information; and, if the number of times the same keyword is continuously recognized is greater than or equal to the predetermined number, determine the keyword as the target keyword so that the corresponding action is performed according to it.
It should be noted that, the processing method of voice data provided in the embodiments of the present application is generally executed by a terminal device, and accordingly, an input device of voice data is generally disposed in the terminal device. However, in other embodiments of the present application, the server 105 may also have a similar function to the terminal device, so as to execute the scheme of the processing method of voice data provided in the embodiments of the present application.
The implementation details of the technical solutions of the embodiments of the present application are described in detail below:
Fig. 2 shows a flow diagram of a method of processing voice data according to an embodiment of the present application. Referring to fig. 2, the processing method of voice data includes at least steps S210 to S250, described in detail as follows:
in step S210, voice input information is acquired in real time.
In one embodiment of the present application, the user may provide voice input through a voice input means (e.g., a microphone) configured on the terminal device. When the terminal device detects that the user is speaking, it can acquire the user's voice input information in real time.
In an example, the terminal device may acquire the user's voice input information at intervals of a predetermined time period configured in advance by a person skilled in the art, for example 0.5 s, 1 s, or 2 s; these are merely exemplary values and the present application is not limited to them.
In step S220, the voice input information is framed to obtain the voice frames corresponding to the voice input information.
In this embodiment, as the terminal device acquires the user's voice input information in real time, it may frame the acquired information in real time, dividing the voice input information into at least one voice frame.
It should be understood that, to ensure recognition accuracy, the duration of the voice input information is positively correlated with the number of corresponding voice frames: the longer the voice input information, the more voice frames it yields, which ensures the accuracy of the subsequent voice recognition result.
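For illustration, the sketch below splits a raw PCM signal into overlapping frames. The 16 kHz sample rate, 25 ms window, and 10 ms hop are conventional values assumed for this example; the application itself does not fix them.

```python
import numpy as np

def frame_signal(samples: np.ndarray, sample_rate: int = 16000,
                 frame_ms: float = 25.0, hop_ms: float = 10.0) -> np.ndarray:
    """Split a 1-D PCM signal into overlapping frames; a longer input yields
    more frames, matching the positive correlation described above."""
    frame_len = int(sample_rate * frame_ms / 1000)  # samples per frame
    hop_len = int(sample_rate * hop_ms / 1000)      # samples between frame starts
    if len(samples) < frame_len:
        return np.empty((0, frame_len))
    n_frames = 1 + (len(samples) - frame_len) // hop_len
    return np.stack([samples[i * hop_len: i * hop_len + frame_len]
                     for i in range(n_frames)])

# One second of audio yields 98 frames with these settings.
print(frame_signal(np.zeros(16000)).shape)  # (98, 400)
```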
In step S230, a pre-trained acoustic model is used to perform phoneme recognition on the speech frame, so as to recognize phonemes included in the speech frame.
A phoneme is the smallest phonetic unit divided according to the natural properties of speech.
In one embodiment of the present application, a pre-trained acoustic model performs phoneme recognition on the divided voice frames to identify the phonemes contained in each voice frame. It should be appreciated that a voice frame may contain one phoneme, any number of phonemes, or no phoneme at all (for example, when the frame falls within a pause in the user's speech).
In step S240, for each phoneme recognition result, keyword recognition is performed on the currently recognized phonemes to determine keywords included in the voice input information.
The keywords may be specific vocabularies used to control device actions, for example "open door" or "turn off light". According to the keywords contained in the voice input information, the device can be controlled to perform the corresponding actions, such as opening a door or turning off a light.
In one embodiment of the present application, after each phoneme recognition on a voice frame finishes, a pre-trained language model may perform keyword recognition on the currently recognized phonemes to identify the corresponding keyword. Specifically, the language model analyzes the phonemes extracted by the acoustic model to identify the vocabulary corresponding to the currently recognized phonemes, and then compares that vocabulary against a preset keyword list containing at least one preset keyword. If the vocabulary recognized by the language model matches a keyword in the preset keyword list, the vocabulary corresponding to the currently recognized phonemes is determined to be a keyword.
It should be noted that after each phoneme recognition on a voice frame, one keyword recognition is performed according to the phonemes recognized from that frame and the phonemes recognized before it (i.e., the currently recognized phonemes): one voice frame corresponds to one phoneme recognition, and one phoneme recognition corresponds to one keyword recognition. This achieves "real-time" keyword recognition on the phonemes and improves the efficiency of keyword recognition.
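The per-frame pipeline can be pictured as follows; recognize_phonemes and recognize_keyword are placeholder stand-ins for the acoustic model and the language model, not functions named in the application.

```python
from typing import Callable, Iterable, Iterator, List, Optional

def streaming_keyword_results(
    frames: Iterable,
    recognize_phonemes: Callable[[object], List[str]],
    recognize_keyword: Callable[[List[str]], Optional[str]],
) -> Iterator[Optional[str]]:
    """Yield one keyword-recognition result per voice frame: each frame
    triggers one phoneme recognition, and each phoneme recognition triggers
    one keyword recognition over all phonemes recognized so far."""
    phonemes: List[str] = []
    for frame in frames:
        phonemes.extend(recognize_phonemes(frame))  # a frame may yield zero or more phonemes
        yield recognize_keyword(phonemes)           # None when no keyword is matched yet
```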
In step S250, if the number of times the same keyword is continuously recognized is greater than or equal to the predetermined number, the keyword is determined as the target keyword, so that the corresponding action is performed according to the target keyword.
In this embodiment, if the same keyword appears in multiple consecutive keyword recognition results and the number of consecutive recognitions is greater than or equal to the predetermined number, it is highly probable that the keyword is the specific vocabulary for which the user wants the corresponding action performed. The keyword can therefore be determined as the target keyword, and the device controlled to perform the corresponding action accordingly.
The predetermined number may be preset, for example 10, 15, or 25; these are merely exemplary values and the present application does not specially limit them. A person skilled in the art may set the predetermined number according to implementation requirements, for example choosing a larger value to ensure the accuracy of target keyword recognition.
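A minimal sketch of this confirmation rule, assuming one keyword-recognition result per voice frame as described above (the default of 10 is one of the exemplary values):

```python
from typing import Iterable, Optional

def find_target_keyword(results: Iterable[Optional[str]],
                        predetermined_number: int = 10) -> Optional[str]:
    """Return a keyword once it appears in `predetermined_number` consecutive
    per-frame results; None means recognition is still pending."""
    current, streak = None, 0
    for keyword in results:
        if keyword is not None and keyword == current:
            streak += 1
        else:
            current, streak = keyword, 0 if keyword is None else 1
        if current is not None and streak >= predetermined_number:
            return current
    return None

# Ten consecutive "open door" results confirm the target keyword:
print(find_target_keyword(["open door"] * 10))  # open door
```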
In the embodiment shown in fig. 2, the user's voice input information is acquired and framed in real time for voice recognition, so recognition can begin before the user has finished the voice input.
If the number of times the same keyword is continuously recognized is greater than or equal to the predetermined number, the keyword is determined as the target keyword: when the result of one keyword recognition persists for a certain time, it is highly probable that the keyword is the specific vocabulary for which the user wants the corresponding action performed, which further ensures the accuracy of the voice recognition result. The device can thus be controlled to perform the corresponding action before the user finishes the voice input, ensuring the response speed of the voice control device.
Based on the embodiment shown in fig. 2, fig. 3 shows a flow chart of step S250 in the processing method of voice data of fig. 2 according to an embodiment of the present application. Referring to fig. 3, step S250 includes at least steps S310 to S320, and is described in detail as follows:
in step S310, the speech frame in which the keyword is first recognized is determined as a start frame.
In this embodiment, since one voice frame corresponds to one phoneme recognition, which corresponds to one keyword recognition, each voice frame has one keyword recognition result. When a keyword is recognized from a voice frame and no keyword was recognized in the voice frames within a predetermined range before it, that voice frame may be determined as the start frame. For example, if the keyword "open door" is recognized from the 12th voice frame, the predetermined number is 10 frames, and no keyword was recognized from the 2nd through the 11th voice frames, the 12th voice frame may be determined as the start frame.
In step S320, if no other keyword is recognized in the predetermined number of voice frames after the start frame, the keyword is determined as the target keyword.
In this embodiment, if no other keyword is recognized in the predetermined number of voice frames after the start frame, i.e., the number of times the same keyword is continuously recognized is greater than or equal to the predetermined number, it is highly probable that the keyword recognized in the start frame is the specific vocabulary for which the user wants the corresponding action performed, so it can be determined as the target keyword.
In one example, if other keywords are recognized within the predetermined number of voice frames after the start frame, one keyword may be randomly selected from the several keywords as the target keyword. The several keywords may number two or more, and a person skilled in the art may configure this according to actual implementation needs; the present application does not specially limit it.
It should be noted that recognizing the same keyword again does not count as recognizing an "other" keyword. For example, if the keyword "open door" is recognized in the 12th voice frame and again in the 13th, 14th, 15th, ..., 19th voice frames, while no keyword is recognized in the other voice frames, the keyword "open door" is determined as the target keyword, indicating that the user wants the door-opening operation performed.
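The start-frame logic can be sketched as follows; the per-frame result list and the window arithmetic are illustrative assumptions consistent with the example above.

```python
import random
from typing import List, Optional

def decide_from_start_frame(results: List[Optional[str]], start: int,
                            predetermined_number: int = 10) -> Optional[str]:
    """results[start] holds the keyword recognized in the start frame.
    Re-recognizing the same keyword later does not count as an "other"
    keyword; if a different one appears in the window, one candidate is
    selected at random, as the text above describes."""
    first = results[start]
    window = results[start + 1: start + 1 + predetermined_number]
    others = {kw for kw in window if kw is not None and kw != first}
    if not others:
        return first  # no other keyword in the window: target keyword confirmed
    return random.choice([first, *sorted(others)])

# Frames 12-19 yield "open door" and no other keyword follows in the window:
results = [None] * 12 + ["open door"] * 8 + [None] * 3
print(decide_from_start_frame(results, start=12))  # open door
```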
In one embodiment of the present application, if a plurality of target keywords are determined from the voice input information, the device may be controlled to perform the corresponding actions according to each target keyword. For example, when the target keywords are determined to be "turn off light" and "close door", the light-off and door-closing actions are performed in the order in which the target keywords were recognized.
In an example, importance levels may be set for the keywords in advance by a person skilled in the art to divide the keywords into different levels; for example, the importance levels of "open door" and "close door" may be greater than those of "turn off light" and "heat water temperature". When a plurality of target keywords are determined from the voice input information, the actions corresponding to the target keywords can then be executed in descending order of their importance levels.
Specifically, a person skilled in the art may add an importance identifier to each keyword in the keyword list in advance, for example "1" meaning very important, "2" meaning important, and "3" meaning general. The importance identifier may be any form of identification information, including but not limited to numeric, letter, or graphic identifiers; these are merely exemplary and the present application is not limited to them.
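A sketch of ordering multiple target keywords by importance identifier; the keyword-to-level mapping below is assumed for illustration.

```python
from typing import List

# "1" = very important, "2" = important, "3" = general, as in the example above.
KEYWORD_IMPORTANCE = {"open door": 1, "close door": 1,
                      "turn off light": 2, "heat water temperature": 3}

def order_target_keywords(targets: List[str]) -> List[str]:
    """Sort target keywords so the most important actions run first; Python's
    stable sort keeps the recognition order within equal importance levels."""
    return sorted(targets, key=lambda kw: KEYWORD_IMPORTANCE.get(kw, 3))

print(order_target_keywords(["turn off light", "close door"]))
# ['close door', 'turn off light']
```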
In the embodiment shown in fig. 3, the target keyword is determined by taking the voice frame in which the keyword is first recognized as the start frame and checking whether another keyword is recognized in the predetermined number of voice frames after it. When a user performs voice control, the probability that a single sentence contains several keywords is low, so this keyword recognition is consistent with the actual situation and the accuracy of the keyword recognition result is ensured.
Based on the embodiment shown in fig. 2, fig. 4 shows a flowchart of step S230 in the processing method of voice data of fig. 2 according to an embodiment of the present application. Referring to fig. 4, step S230 includes at least steps S410 to S420, and is described in detail as follows:
in step S410, feature extraction is performed on the speech frame, so as to obtain a speech feature corresponding to the speech frame.
In this embodiment, a pre-configured speech feature extraction module may extract features from the voice frame to obtain the voice features corresponding to the voice frame. The module may be any existing speech feature extraction module, for example one based on linear prediction analysis, perceptual linear analysis, or mel-frequency cepstral coefficient analysis; the present application does not specially limit this.
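As one concrete possibility, the sketch below computes mel-frequency cepstral coefficients for a single frame; the choice of the librosa library and the 13-coefficient setting are assumptions of this example, since any existing extractor may be used.

```python
import numpy as np
import librosa

def extract_features(frame: np.ndarray, sample_rate: int = 16000) -> np.ndarray:
    """Return one MFCC feature vector for a single voice frame."""
    mfcc = librosa.feature.mfcc(y=frame.astype(np.float32), sr=sample_rate,
                                n_mfcc=13, n_fft=len(frame),
                                hop_length=len(frame))
    return mfcc[:, 0]  # 13-dimensional feature vector for the frame

print(extract_features(np.random.randn(400)).shape)  # 25 ms frame at 16 kHz -> (13,)
```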
In step S420, the speech features are input into an acoustic model such that the acoustic model outputs phonemes contained in the speech frame.
The acoustic model may be a model that identifies the phonemes corresponding to given speech features.
In this embodiment, the voice features corresponding to the voice frames serve as the input of the acoustic model, which outputs the phonemes corresponding to those features. It should be appreciated that phonemes are the smallest phonetic units divided according to the natural properties of a language; for example, the English international phonetic alphabet contains 48 phonemes. Dividing the voice input information into phonemes converts an unlimited class into a limited one: assuming three phonemes constitute the pronunciation of a word, there are at most 48 × 48 × 48 = 110,592 possible combinations to identify within this limited category, which reduces the difficulty of voice recognition.
In one embodiment of the application, the acoustic model can be built based on LVCSR (large-vocabulary continuous speech recognition) technology. For an embedded platform, the number of neural network layers in the acoustic model can be reduced accordingly, which preserves recognition accuracy while reducing the storage space the model occupies, facilitating application on embedded platforms.
Based on the embodiment shown in fig. 2, fig. 5 shows a flowchart of step S240 in the processing method of voice data of fig. 2 according to an embodiment of the present application. Referring to fig. 5, step S240 includes at least steps S510 to S520, and is described in detail as follows:
in step S510, a first weight of a keyword and a second weight of a near-word that is close to the keyword pronunciation are obtained.
The first weight and the second weight may be values for respectively representing occurrence probabilities of the keyword and the near-voice word, and it should be understood that the higher the occurrence probability is, the greater the corresponding weight is, so the first weight should be greater than the second weight, for example, the first weight is 0.7, the second weight is 0.3, and so on.
In this embodiment, the person skilled in the art may configure the weight of each vocabulary, i.e. the probability of occurrence, through an input device (e.g. an input keyboard, a touch-sensitive touch screen, etc.) configured by the terminal device, and store the configured weight into a storage location of the terminal device for subsequent acquisition.
By setting different weights for different words, the occurrence probability of each word can be clarified, and words with higher occurrence probability can be selected as recognition results according to the recognized pronunciation.
In step S520, the currently recognized phonemes, the first weight, and the second weight are input into a language model, so that the language model outputs the keyword corresponding to the currently recognized phonemes.
In this embodiment, the currently recognized phonemes, the first weight, and the second weight serve as the input of the language model, which recognizes the vocabulary corresponding to the phonemes according to them.
In the embodiment shown in fig. 5, by presetting the weights of keywords and near-voice words, the language model can select the vocabulary with the greater weight, i.e. the higher occurrence probability, as its recognition result. This prevents a near-voice word whose pronunciation is similar to a keyword from being taken as the recognition result and ensures the accuracy of the voice recognition result.
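To make the weighting concrete, here is a minimal sketch: the acoustic match scores, the hypothetical near-voice word "open drawer", and the multiply-score-by-weight rule are assumptions of this example rather than details fixed by the application.

```python
from typing import Dict, List, Tuple

def rank_candidates(matches: List[Tuple[str, float]],
                    weights: Dict[str, float]) -> str:
    """Combine each candidate's acoustic match score with its configured
    weight (occurrence probability) and return the best-scoring vocabulary."""
    return max(matches, key=lambda m: m[1] * weights.get(m[0], 0.0))[0]

# First weight 0.7 for the keyword, second weight 0.3 for the near-voice word,
# as in the example above; "open drawer" is a hypothetical near-voice word.
weights = {"open door": 0.7, "open drawer": 0.3}
print(rank_candidates([("open door", 0.8), ("open drawer", 0.9)], weights))
# open door: 0.8 * 0.7 = 0.56 beats 0.9 * 0.3 = 0.27
```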
In one embodiment of the present application, when training the language model, predetermined vocabularies whose pronunciations differ considerably from the keywords may be added to the training so as to enlarge the vocabulary the language model can recognize. This prevents the false triggering that occurs when the recognizable vocabulary is too small and the speech recognition result always coincides with a keyword. For example, if the language model can recognize only the single vocabulary "open door", it will output "open door" no matter what phoneme combination the acoustic model produces; training the language model with added predetermined vocabularies whose pronunciations clearly differ from the keywords therefore improves the accuracy of its recognition results.
Based on the embodiment shown in fig. 2, fig. 6 shows a flowchart of step S240 in the processing method of voice data of fig. 2 according to another embodiment of the present application. Referring to fig. 6, step S240 includes at least steps S610 to S620, and is described in detail as follows:
in step S610, the currently recognized phonemes are input into a plurality of language models, respectively, so that the plurality of language models output the pending vocabulary corresponding to the currently recognized phonemes, respectively.
In one embodiment of the present application, a plurality of language models may be trained in advance. During speech recognition, the currently recognized phonemes are input into each language model respectively, and each model outputs the pending vocabulary corresponding to the phonemes. It should be understood that the pending vocabularies output by the language models may be the same or different.
In one embodiment of the present application, different sets of training voice frames may be used to train the language models, reducing the correlation between them and thereby ensuring the accuracy of recognition based on multiple models. In other embodiments, a classification model trained with another classification algorithm, such as an SVM (Support Vector Machine), may replace one of the language models to reduce the correlation between models; a person skilled in the art may configure this according to actual implementation needs, and the present application does not limit it.
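As a sketch of combining the models' outputs, the majority vote below is one plausible comparison rule; the application states only that the results of several independently trained models (or a substituted classifier such as an SVM) are compared.

```python
from collections import Counter
from typing import List, Optional

def ensemble_vocabulary(model_outputs: List[Optional[str]]) -> Optional[str]:
    """Majority vote over the pending vocabularies produced by independently
    trained models; disagreement yields None so no action is falsely triggered."""
    votes = Counter(word for word in model_outputs if word is not None)
    if not votes:
        return None
    word, count = votes.most_common(1)[0]
    return word if count > len(model_outputs) // 2 else None

print(ensemble_vocabulary(["open door", "open door", "opera"]))  # open door
```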
In step S620, the keyword corresponding to the currently recognized phonemes is determined according to the pending vocabularies.
In one embodiment of the present application, the pending vocabularies output by the language models are compared with each keyword in the keyword list. If exactly one keyword matches one of the pending vocabularies, that keyword is determined as the final recognition result; if several keywords match, the keyword with the higher importance level may be selected as the final recognition result.
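A minimal sketch of this resolution step; the keyword list contents and the importance mapping are assumed for illustration (smaller identifier = more important, as in the example given earlier).

```python
from typing import Dict, List, Optional

KEYWORD_LIST = ["open door", "close door", "turn off light"]  # assumed contents

def resolve_keyword(pending_words: List[str],
                    importance: Dict[str, int]) -> Optional[str]:
    """Compare the models' pending vocabularies against the keyword list;
    a single match is returned directly, several distinct matches are
    resolved in favour of the more important keyword."""
    matches = {word for word in pending_words if word in KEYWORD_LIST}
    if not matches:
        return None
    return min(matches, key=lambda word: importance.get(word, 3))

# Two language models agree and a third outputs a non-keyword vocabulary:
print(resolve_keyword(["open door", "open door", "o pen"], {"open door": 1}))
# open door
```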
In the embodiment shown in fig. 6, several language models are pre-trained for recognition and their recognition results are compared, which ensures the accuracy of the recognition result output by the language models and prevents false triggering.
Fig. 7 shows a schematic flow chart of editing keywords further included in the method for processing voice data according to an embodiment of the present application, based on the embodiment shown in fig. 2. Referring to fig. 7, the editing keyword includes at least steps S710 to S720, and is described in detail as follows:
in step S710, if a modification request for a keyword is received, a keyword editing interface is displayed.
The modification request for a keyword may be information requesting that the keyword be edited. In an example, the user may generate the modification request by clicking a particular area on the interface (e.g., an "edit keyword" button) or by pressing a physical key configured on the terminal device (e.g., an "edit" key on a physical keyboard).
The keyword editing interface may be an interface for editing keywords, and the keyword editing interface may include keyword editing options, such as an add keyword option and a delete keyword option. Specifically, the keyword editing interface may display an existing keyword list, and the user may perform corresponding adding and deleting operations by selecting a corresponding keyword editing option, so as to complete editing of the keyword list.
In step S720, the modification information is stored according to the modification information of the keyword received by the keyword editing interface.
In this embodiment, the modification information of the keyword obtained from the user through the keyword editing interface is stored and synchronized into the existing keyword list, completing the update of the keyword list so that the language model can compare against it in subsequent speech recognition.
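A minimal persistence sketch, assuming the keyword list is stored as a JSON file; the file name and the add/delete action strings are assumptions of this example, not details from the application.

```python
import json
from pathlib import Path

KEYWORDS_FILE = Path("keywords.json")  # assumed storage location

def apply_modification(action: str, keyword: str) -> None:
    """Apply an add or delete request from the keyword editing interface and
    persist the updated list for subsequent recognitions to compare against."""
    words = set()
    if KEYWORDS_FILE.exists():
        words = set(json.loads(KEYWORDS_FILE.read_text(encoding="utf-8")))
    if action == "add":
        words.add(keyword)
    elif action == "delete":
        words.discard(keyword)
    KEYWORDS_FILE.write_text(json.dumps(sorted(words), ensure_ascii=False),
                             encoding="utf-8")

apply_modification("add", "open window")  # hypothetical new keyword
```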
In the embodiment shown in fig. 7, by providing keyword editing options, the user can add or delete keywords according to actual needs, which broadens the range of application of the voice data recognition method.
The following describes an embodiment of an apparatus of the present application, which may be used to perform the processing method of voice data in the foregoing embodiment of the present application. For details not disclosed in the embodiments of the apparatus of the present application, please refer to the embodiments of the method for processing voice data described in the present application.
Fig. 8 shows a block diagram of a processing device for speech data according to an embodiment of the present application.
Referring to fig. 8, a processing apparatus for voice data according to an embodiment of the present application includes:
an acquiring module 810, configured to acquire voice input information in real time;
a framing module 820, configured to perform framing processing on the voice input information, so as to obtain a voice frame corresponding to the voice input information;
a first recognition module 830, configured to perform phoneme recognition on the speech frame by using a pre-trained acoustic model, so as to recognize phonemes included in the speech frame;
a second recognition module 840, configured to perform keyword recognition on the currently recognized phonemes for the result of each phoneme recognition, so as to determine the keywords contained in the voice input information;
and a processing module 850, configured to determine the keyword as the target keyword if the number of times the same keyword is continuously recognized is greater than or equal to a predetermined number, so as to perform the corresponding action according to the target keyword.
Based on the foregoing, in some embodiments of the present application, the processing module 850 is configured to: determine the voice frame in which the keyword is first recognized as the start frame; and, if no other keyword is recognized in the predetermined number of voice frames after the start frame, determine the keyword as the target keyword.
Based on the foregoing, in some embodiments of the present application, the first recognition module 830 is configured to: extract features from the voice frame to obtain the voice features corresponding to the voice frame; and input the voice features into the acoustic model so that the acoustic model outputs the phonemes contained in the voice frame.
Based on the foregoing, in some embodiments of the present application, the second recognition module 840 is configured to: acquire a first weight of a keyword and a second weight of a near-voice word whose pronunciation is close to the keyword; and input the currently recognized phonemes, the first weight, and the second weight into a language model so that the language model outputs the keyword corresponding to the currently recognized phonemes.
Based on the foregoing, in some embodiments of the present application, the second recognition module 840 includes:
a vocabulary recognition unit, used to input the currently recognized phonemes into a plurality of language models respectively, so that the language models respectively output the pending vocabularies corresponding to the currently recognized phonemes;
and a keyword determining unit, used to determine the keyword corresponding to the currently recognized phonemes according to the pending vocabularies.
Based on the foregoing, in some embodiments of the present application, the processing module 850 is further configured to: display a keyword editing interface if a modification request for a keyword is received; and store the modification information of the keyword received through the keyword editing interface.
Fig. 9 shows a schematic diagram of a computer system suitable for use in implementing the electronic device of the embodiments of the present application.
It should be noted that, the computer system of the electronic device shown in fig. 9 is only an example, and should not impose any limitation on the functions and the application scope of the embodiments of the present application.
As shown in fig. 9, the computer system includes a central processing unit (Central Processing Unit, CPU) 901 which can perform various appropriate actions and processes according to a program stored in a Read-Only Memory (ROM) 902 or a program loaded from a storage section 908 into a random access Memory (Random Access Memory, RAM) 903, for example, execute the method described in the above embodiment. In the RAM 903, various programs and data required for system operation are also stored. The CPU 901, ROM 902, and RAM 903 are connected to each other through a bus 904. An Input/Output (I/O) interface 905 is also connected to bus 904.
The following components are connected to the I/O interface 905: an input section 906 including a keyboard, a mouse, and the like; an output section 907 including a display such as a cathode ray tube (CRT) or a liquid crystal display (LCD), a speaker, and the like; a storage section 908 including a hard disk or the like; and a communication section 909 including a network interface card such as a LAN (Local Area Network) card or a modem. The communication section 909 performs communication processing via a network such as the internet. The drive 910 is also connected to the I/O interface 905 as needed. A removable medium 911, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is installed on the drive 910 as needed, so that a computer program read from it is installed into the storage section 908 as needed.
In particular, according to embodiments of the present application, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising a computer program for performing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from the network via the communication portion 909 and/or installed from the removable medium 911. When the computer program is executed by a Central Processing Unit (CPU) 901, various functions defined in the system of the present application are performed.
It should be noted that, the computer readable medium shown in the embodiments of the present application may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-Only Memory (ROM), an erasable programmable read-Only Memory (Erasable Programmable Read Only Memory, EPROM), flash Memory, an optical fiber, a portable compact disc read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present application, however, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with a computer-readable computer program embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. A computer program embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. Where each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments of the present application may be implemented by means of software, or may be implemented by means of hardware, and the described units may also be provided in a processor. Wherein the names of the units do not constitute a limitation of the units themselves in some cases.
As another aspect, the present application also provides a computer-readable medium that may be contained in the electronic device described in the above embodiment; or may exist alone without being incorporated into the electronic device. The computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to implement the methods described in the above embodiments.
It should be noted that although in the above detailed description several modules or units of a device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functions of two or more modules or units described above may be embodied in one module or unit, in accordance with embodiments of the present application. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a usb disk, a mobile hard disk, etc.) or on a network, and includes several instructions to cause a computing device (may be a personal computer, a server, a touch terminal, or a network device, etc.) to perform the method according to the embodiments of the present application.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the embodiments disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains.
It is to be understood that the present application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (8)

1. A method for processing voice data, comprising:
acquiring voice input information in real time;
framing the voice input information to obtain the voice frames corresponding to the voice input information;
performing phoneme recognition on the voice frame by adopting a pre-trained acoustic model so as to recognize phonemes contained in the voice frame;
for the result of each phoneme recognition, performing keyword recognition on the currently recognized phonemes to determine the keywords contained in the voice input information, the keywords being specific vocabularies used to control device actions; this specifically comprises: acquiring a first weight of a keyword and a second weight of a near-voice word whose pronunciation is similar to the keyword, wherein the first weight and the second weight are values representing the occurrence probabilities of the keyword and the near-voice word respectively, a higher occurrence probability corresponds to a greater weight, and the first weight is greater than the second weight; and inputting the currently recognized phonemes, the first weight, and the second weight into a language model so that the language model outputs the keyword corresponding to the currently recognized phonemes; with the weights of keywords and near-voice words preset, during recognition the language model selects the vocabulary with the greater weight, i.e. the higher occurrence probability, as its recognition result according to the weights of the keyword and the near-voice word; after each phoneme recognition on a voice frame, one keyword recognition is performed according to the phonemes recognized from that voice frame and the phonemes recognized before it, i.e. one voice frame corresponds to one phoneme recognition, and one phoneme recognition corresponds to one keyword recognition;
determining the voice frame in which the keyword is first recognized as a start frame, that is, when a keyword is recognized from a voice frame, if no keyword was recognized in the voice frames within a predetermined range before that voice frame, determining the voice frame in which the keyword is recognized as the start frame; judging whether another keyword is recognized in the predetermined number of voice frames after the start frame; if not, that is, if no other keyword is recognized in the predetermined number of voice frames after the start frame, the number of times the same keyword is continuously recognized is greater than or equal to the predetermined number, indicating that the keyword recognized in the start frame is highly probably the specific vocabulary for which the user wants the corresponding action performed, and determining the keyword as the target keyword so that the corresponding action is performed according to the target keyword; if another keyword is recognized in the predetermined number of voice frames after the start frame, randomly selecting one keyword from the several keywords as the target keyword; and, if a plurality of target keywords are determined from the voice input information, controlling the device to perform the corresponding actions according to each target keyword respectively, specifically: performing the corresponding actions in the order in which the target keywords were recognized, or presetting importance levels of the keywords by adding an importance identifier to each keyword in the keyword list in advance so as to divide the keywords into different importance levels, and executing the actions corresponding to the target keywords in descending order of the importance level corresponding to each target keyword.
2. The processing method according to claim 1, wherein performing phoneme recognition on the speech frame using a pre-trained acoustic model to recognize phonemes contained in the speech frame comprises:
extracting features from the speech frame to obtain the speech features corresponding to the speech frame;
inputting the speech features into the acoustic model so that the acoustic model outputs the phonemes contained in the speech frame.
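As an illustration of claim 2, a sketch using MFCCs as the speech features; the claim does not mandate a particular feature type, librosa is an assumed feature-extraction library, and acoustic_model.predict is an assumed interface returning per-frame phoneme posteriors.

```python
import librosa  # assumed feature-extraction library; the patent names none

def frame_to_phonemes(frame, sr, acoustic_model):
    """Extract speech features from one speech frame and let a pre-trained
    acoustic model output the phonemes it contains (sketch)."""
    # 13-dimensional MFCCs as the speech features; small n_fft/hop_length
    # because a single frame holds only a few hundred samples.
    feats = librosa.feature.mfcc(y=frame, sr=sr, n_mfcc=13,
                                 n_fft=256, hop_length=128).T
    # Assumed model interface: feature vectors in, per-phoneme posterior
    # probabilities out, shape (time_steps, n_phonemes).
    posteriors = acoustic_model.predict(feats)
    return posteriors.argmax(axis=1)  # most likely phoneme id per step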
3. The processing method according to claim 1, wherein, for each phoneme recognition result, performing keyword recognition on the currently recognized phonemes to determine the keywords contained in the speech input information comprises:
respectively inputting the currently recognized phonemes into a plurality of language models, so that the language models respectively output candidate words corresponding to the currently recognized phonemes;
and determining the keyword corresponding to the currently recognized phonemes according to the candidate words.
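A sketch of claim 3's two steps under stated assumptions: each language model exposes a decode method mapping a phoneme sequence to a candidate word (an assumed interface), and disagreement is resolved by majority vote among candidates that are registered keywords; the claim itself only says the keyword is determined according to the candidate words.

```python
from collections import Counter

def determine_keyword(phonemes, language_models, keyword_list):
    """Feed the current phonemes to several language models and determine
    the keyword from their candidate words (sketch)."""
    candidates = [lm.decode(phonemes) for lm in language_models]  # assumed API
    hits = [w for w in candidates if w in keyword_list]           # keep keywords only
    return Counter(hits).most_common(1)[0][0] if hits else None   # majority vote
```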
4. The processing method according to claim 3, further comprising:
if a modification request for the keywords is received, displaying a keyword editing interface;
and modifying and storing the keywords according to the keyword modification information received via the keyword editing interface.
5. A processing apparatus for voice data, comprising:
an acquisition module configured to acquire speech input information in real time;
a framing module configured to frame the speech input information to obtain the speech frames corresponding to the speech input information;
a first recognition module configured to perform phoneme recognition on the speech frames using a pre-trained acoustic model to recognize the phonemes contained in the speech frames;
a second recognition module configured to perform, for each phoneme recognition result, keyword recognition on the currently recognized phonemes to determine the keywords contained in the speech input information, wherein a keyword is a specific word used to control an action of a device; specifically: obtaining a first weight of a keyword and a second weight of a near-homophone, i.e. a word whose pronunciation is similar to that of the keyword, wherein the first weight and the second weight are numerical values representing the occurrence probabilities of the keyword and of the near-homophone respectively, a higher occurrence probability corresponding to a larger weight, and the first weight being larger than the second weight; inputting the currently recognized phonemes, the first weight, and the second weight into a language model so that the language model outputs the keyword corresponding to the currently recognized phonemes; by presetting the weights of the keywords and of the near-homophones, the language model selects during recognition the word with the larger weight, i.e. the higher occurrence probability, as its recognition result; and after each phoneme recognition performed on a speech frame, performing one keyword recognition based on the phonemes recognized from that speech frame together with the phonemes recognized before it, i.e. one speech frame corresponds to one phoneme recognition, and one phoneme recognition corresponds to one keyword recognition;
a processing module configured to determine the speech frame in which a keyword is recognized for the first time as a start frame: that is, when a keyword is recognized from a speech frame and that keyword was not recognized in the speech frames within a predetermined range before it, determine the speech frame in which the keyword is recognized as the start frame; judge whether other keywords are recognized within a preset number of speech frames after the start frame; if no other keyword is recognized within the preset number of speech frames after the start frame and the number of times the same keyword is recognized consecutively is greater than or equal to a preset count, which indicates a high probability that the keyword recognized at the start frame is a specific word for an action the user intends the device to perform, determine that keyword as a target keyword and perform the corresponding action according to the target keyword; if other keywords are recognized within the preset number of speech frames after the start frame, randomly select one of the keywords as the target keyword; and if a plurality of target keywords are determined from the speech input information, control the device to perform the corresponding action for each target keyword respectively, specifically: perform the corresponding actions in the order in which the target keywords were recognized, or preset importance levels for the keywords by adding an importance identifier to each keyword in a keyword list in advance so as to divide the keywords into different importance levels, and execute the actions corresponding to the target keywords in descending order of their importance levels.
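Composing the claimed modules end to end, one possible arrangement is sketched below, reusing frame_to_phonemes and confirm_target_keyword from the sketches above; decode_weighted is an assumed language-model method applying the first and second weights, and all names and interfaces are illustrative, not taken from the patent.

```python
class VoiceDataProcessor:
    """One possible composition of the claimed modules (sketch)."""

    def __init__(self, acoustic_model, language_model, keyword_weights):
        self.acoustic_model = acoustic_model    # used by the first recognition module
        self.language_model = language_model    # used by the second recognition module
        self.keyword_weights = keyword_weights  # first/second weights per word

    def process(self, audio, sr=16000, frame_ms=25):
        """Framing module + both recognition modules + processing module."""
        hop = int(sr * frame_ms / 1000)
        hits = []
        for start in range(0, len(audio) - hop + 1, hop):
            frame = audio[start:start + hop]                 # framing module
            phonemes = frame_to_phonemes(frame, sr,          # first recognition
                                         self.acoustic_model)
            kw = self.language_model.decode_weighted(        # second recognition
                phonemes, self.keyword_weights)              # (assumed API)
            hits.append(kw)
        return confirm_target_keyword(hits)                  # processing module
```

A 25 ms frame length at 16 kHz is a common choice in speech processing, not a value taken from the patent.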
6. The apparatus of claim 5, wherein the second recognition module comprises:
a word recognition unit configured to respectively input the currently recognized phonemes into a plurality of language models so that the plurality of language models respectively output candidate words corresponding to the currently recognized phonemes;
and a keyword determination unit configured to determine the keyword corresponding to the currently recognized phonemes according to the candidate words.
7. A computer readable medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the voice data processing method according to any one of claims 1 to 4.
8. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the voice data processing method according to any one of claims 1 to 4.
CN202010549158.XA 2020-06-16 2020-06-16 Voice data processing method and device, computer readable medium and electronic equipment Active CN111710337B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010549158.XA CN111710337B (en) 2020-06-16 2020-06-16 Voice data processing method and device, computer readable medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN111710337A CN111710337A (en) 2020-09-25
CN111710337B (en) 2023-07-07

Family

ID=72540405

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010549158.XA Active CN111710337B (en) 2020-06-16 2020-06-16 Voice data processing method and device, computer readable medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN111710337B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112259077B (en) * 2020-10-20 2024-04-09 网易(杭州)网络有限公司 Speech recognition method, device, terminal and storage medium
CN112331213A (en) * 2020-11-06 2021-02-05 深圳市欧瑞博科技股份有限公司 Intelligent household equipment control method and device, electronic equipment and storage medium
CN112102812B (en) * 2020-11-19 2021-02-05 成都启英泰伦科技有限公司 Anti-false wake-up method based on multiple acoustic models
CN113096649B (en) * 2021-03-31 2023-12-22 平安科技(深圳)有限公司 Voice prediction method, device, electronic equipment and storage medium
CN113724698B (en) * 2021-09-01 2024-01-30 马上消费金融股份有限公司 Training method, device, equipment and storage medium of voice recognition model
CN115101064A (en) * 2022-07-20 2022-09-23 安克创新科技股份有限公司 Instruction word recognition method and device, electronic equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015021844A1 (en) * 2013-08-15 2015-02-19 Tencent Technology (Shenzhen) Company Limited Keyword detection for speech recognition
CN105981099A (en) * 2014-02-06 2016-09-28 三菱电机株式会社 Speech search device and speech search method
CN106875939A (en) * 2017-01-13 2017-06-20 佛山市父母通智能机器人有限公司 To the Chinese dialects voice recognition processing method and intelligent robot of wide fluctuations
CN107644638A (en) * 2017-10-17 2018-01-30 北京智能管家科技有限公司 Audio recognition method, device, terminal and computer-readable recording medium
CN109119074A (en) * 2017-06-22 2019-01-01 上海智建电子工程有限公司 Voice recognition controller
CN111128172A (en) * 2019-12-31 2020-05-08 达闼科技成都有限公司 Voice recognition method, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN111710337A (en) 2020-09-25

Similar Documents

Publication Publication Date Title
CN111710337B (en) Voice data processing method and device, computer readable medium and electronic equipment
CN109817213B (en) Method, device and equipment for performing voice recognition on self-adaptive language
CN110838289B (en) Wake-up word detection method, device, equipment and medium based on artificial intelligence
CN110675870A (en) Voice recognition method and device, electronic equipment and storage medium
CN111428010B (en) Man-machine intelligent question-answering method and device
US20220180872A1 (en) Electronic apparatus and method for controlling thereof
WO2022178969A1 (en) Voice conversation data processing method and apparatus, and computer device and storage medium
CN111833845A (en) Multi-language speech recognition model training method, device, equipment and storage medium
US10504512B1 (en) Natural language speech processing application selection
CN111653274B (en) Wake-up word recognition method, device and storage medium
US11036996B2 (en) Method and apparatus for determining (raw) video materials for news
JP7178394B2 (en) Methods, apparatus, apparatus, and media for processing audio signals
CN118043885A (en) Contrast twin network for semi-supervised speech recognition
CN111916088A (en) Voice corpus generation method and device and computer readable storage medium
CN112771607A (en) Electronic device and control method thereof
CN113611316A (en) Man-machine interaction method, device, equipment and storage medium
CN112542173A (en) Voice interaction method, device, equipment and medium
CN114913859B (en) Voiceprint recognition method, voiceprint recognition device, electronic equipment and storage medium
CN115132195B (en) Voice wakeup method, device, equipment, storage medium and program product
CN115132170A (en) Language classification method and device and computer readable storage medium
CN115242927A (en) Customer service object distribution method and device, computer equipment and storage medium
CN114373443A (en) Speech synthesis method and apparatus, computing device, storage medium, and program product
CN114218356A (en) Semantic recognition method, device, equipment and storage medium based on artificial intelligence
CN111506701A (en) Intelligent query method and related device
CN111783892B (en) Robot instruction identification method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant