CN111710337A - Voice data processing method and device, computer readable medium and electronic equipment - Google Patents

Voice data processing method and device, computer readable medium and electronic equipment

Info

Publication number
CN111710337A
CN111710337A
Authority
CN
China
Prior art keywords
keyword
voice
keywords
recognition
input information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010549158.XA
Other languages
Chinese (zh)
Other versions
CN111710337B (en)
Inventor
元涛
兰泽华
林昱
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ringslink Xiamen Network Communication Technologies Co., Ltd.
Original Assignee
Ringslink Xiamen Network Communication Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ringslink Xiamen Network Communication Technologies Co ltd filed Critical Ringslink Xiamen Network Communication Technologies Co ltd
Priority to CN202010549158.XA
Publication of CN111710337A
Application granted
Publication of CN111710337B
Legal status: Active

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 - Training
    • G10L 15/08 - Speech classification or search
    • G10L 15/16 - Speech classification or search using artificial neural networks
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/025 - Phonemes, fenemes or fenones being the recognition units
    • G10L 2015/223 - Execution procedure of a spoken command

Abstract

The embodiments of the present application provide a voice data processing method and apparatus, a computer-readable medium, and an electronic device. The voice data processing method includes: acquiring voice input information in real time; framing the voice input information to obtain the corresponding voice frames; performing phoneme recognition on the voice frames using a pre-trained acoustic model to identify the phonemes they contain; after each phoneme recognition, performing keyword recognition on the currently recognized phonemes to determine the keywords contained in the voice input information; and if the number of times the same keyword is consecutively recognized is greater than or equal to a predetermined number, determining the keyword as a target keyword, so that a corresponding action is performed according to it. This technical solution improves the efficiency of speech recognition and thereby ensures the response speed of voice-controlled devices.

Description

Voice data processing method and device, computer readable medium and electronic equipment
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a method and an apparatus for processing speech data, a computer-readable medium, and an electronic device.
Background
With the development of speech recognition technology, its applications have become increasingly widespread, for example voice control of automotive equipment, smart toys, and smart homes. In current technical solutions, a voice-controlled device acquires the voice information input by a user and performs speech recognition on it in order to control the device's actions. How to improve speech recognition efficiency, and thereby ensure the response speed of voice-controlled devices, has therefore become an urgent technical problem.
Disclosure of Invention
Embodiments of the present application provide a voice data processing method and apparatus, a computer-readable medium, and an electronic device, which can improve speech recognition efficiency at least to a certain extent and thereby ensure the response speed of a voice-controlled device.
Other features and advantages of the present application will be apparent from the following detailed description, or may be learned by practice of the application.
According to an aspect of an embodiment of the present application, there is provided a method for processing voice data, the method including:
acquiring voice input information in real time;
performing framing processing on the voice input information to obtain a voice frame corresponding to the voice input information;
adopting a pre-trained acoustic model to perform phoneme recognition on the voice frame so as to recognize phonemes contained in the voice frame;
after each phoneme recognition, performing keyword recognition on the currently recognized phonemes to determine the keywords contained in the voice input information;
and if the number of times the same keyword is consecutively recognized is greater than or equal to a predetermined number, determining the keyword as a target keyword, so that a corresponding action is performed according to the target keyword.
According to an aspect of an embodiment of the present application, there is provided a processing apparatus for voice data, the processing apparatus including:
the acquisition module is used for acquiring voice input information in real time;
the framing module is used for framing the voice input information to obtain a voice frame corresponding to the voice input information;
the first recognition module is used for performing phoneme recognition on the voice frames by adopting a pre-trained acoustic model, so as to identify the phonemes contained in the voice frames;
the second recognition module is used for performing keyword recognition on the currently recognized phonemes after each phoneme recognition, so as to determine the keywords contained in the voice input information;
and the processing module is used for determining a keyword as the target keyword if the number of times the same keyword is consecutively recognized is greater than or equal to a predetermined number, so as to perform a corresponding action according to the target keyword.
Based on the foregoing, in some embodiments of the present application, the processing module is configured to: determine the voice frame in which the keyword is first recognized as a start frame; and if no other keyword is recognized within the predetermined number of voice frames after the start frame, determine the keyword as the target keyword.
Based on the foregoing, in some embodiments of the present application, the first recognition module is configured to: extract features from the voice frames to obtain the speech features corresponding to them; and input the speech features into an acoustic model, so that the acoustic model outputs the phonemes contained in the voice frames.
Based on the foregoing, in some embodiments of the present application, the second recognition module is configured to: acquire a first weight of a keyword and a second weight of a near-sound word whose pronunciation is similar to that of the keyword; and input the currently recognized phonemes, the first weight, and the second weight into a language model, so that the language model outputs the keyword corresponding to the currently recognized phonemes.
Based on the foregoing solution, in some embodiments of the present application, the second identification module includes:
the word recognition unit is used for inputting the currently recognized phonemes into a plurality of language models respectively, so that each language model outputs a candidate word corresponding to the currently recognized phonemes;
and the keyword determining unit is used for determining the keyword corresponding to the currently recognized phonemes according to the candidate words.
Based on the foregoing, in some embodiments of the present application, the processing module is further configured to: display a keyword editing interface if a modification request for a keyword is received; and store the keyword modification information received through the keyword editing interface.
According to an aspect of embodiments of the present application, there is provided a computer-readable medium on which a computer program is stored, the computer program, when executed by a processor, implementing a method of processing speech data as described in the above embodiments.
According to an aspect of an embodiment of the present application, there is provided an electronic device including: one or more processors; a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the processing method of voice data as described in the above embodiments.
In the technical solutions provided in some embodiments of the present application, voice input information is acquired in real time and framed to obtain the corresponding voice frames; a pre-trained acoustic model performs phoneme recognition on the voice frames to identify the phonemes they contain; after each phoneme recognition, keyword recognition is performed on the currently recognized phonemes to determine the keywords contained in the voice input information; and if the number of times the same keyword is consecutively recognized is greater than or equal to a predetermined number, the keyword is determined as a target keyword, so that a corresponding action is performed according to it. Because the voice input information is acquired and recognized in real time, speech recognition proceeds while the user is still performing the voice control operation, and a keyword is promoted to target keyword only once it has been recognized repeatedly. This improves speech recognition efficiency and thereby ensures the response speed of the voice-controlled device.
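By way of illustration, the control flow just summarized can be sketched in a few lines of Python; the frame sizes, the predetermined number, and the two recognizer stubs below are assumptions standing in for the trained models, not part of the original disclosure:

```python
import numpy as np

SAMPLE_RATE = 16000          # assumed sampling rate
FRAME_MS, HOP_MS = 25, 10    # assumed frame size / hop; the patent leaves these open
PREDETERMINED_NUMBER = 10    # assumed value for the predetermined number

def recognize_phonemes(frame):
    """Stand-in for the pre-trained acoustic model (hypothetical)."""
    return []

def recognize_keyword(phonemes):
    """Stand-in for the language-model keyword matcher (hypothetical).
    Returns a keyword string or None."""
    return None

def process_stream(chunks):
    """chunks: iterable of 1-D numpy sample arrays arriving in real time.
    Yields each confirmed target keyword as soon as the same keyword has
    been recognized a predetermined number of times in a row."""
    frame_len = SAMPLE_RATE * FRAME_MS // 1000
    hop_len = SAMPLE_RATE * HOP_MS // 1000
    buffer = np.empty(0, dtype=np.float32)
    phonemes, last_kw, run = [], None, 0
    for chunk in chunks:
        buffer = np.concatenate([buffer, chunk.astype(np.float32)])
        while len(buffer) >= frame_len:            # one voice frame at a time
            frame, buffer = buffer[:frame_len], buffer[hop_len:]
            phonemes.extend(recognize_phonemes(frame))
            kw = recognize_keyword(phonemes)       # one keyword pass per frame
            run = run + 1 if kw is not None and kw == last_kw else (1 if kw else 0)
            last_kw = kw
            if run >= PREDETERMINED_NUMBER:
                yield kw                           # target keyword confirmed
                last_kw, run = None, 0
```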
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application. It is obvious that the drawings in the following description are only some embodiments of the application, and that for a person skilled in the art, other drawings can be derived from them without inventive effort. In the drawings:
FIG. 1 shows a schematic diagram of an exemplary system architecture to which aspects of embodiments of the present application may be applied;
FIG. 2 shows a flow diagram of a method of processing voice data according to an embodiment of the present application;
FIG. 3 shows a flowchart of step S250 of the method of processing speech data of FIG. 2 according to one embodiment of the present application;
FIG. 4 shows a flowchart of step S230 of the method for processing speech data of FIG. 2 according to one embodiment of the present application;
FIG. 5 shows a flowchart of step S240 of the method of processing speech data of FIG. 2 according to one embodiment of the present application;
fig. 6 shows a flowchart of step S240 in the processing method of voice data of fig. 2 according to another embodiment of the present application;
FIG. 7 shows a flowchart of keyword editing, which is further included in a method of processing voice data according to an embodiment of the present application;
FIG. 8 shows a block diagram of a processing device of speech data according to an embodiment of the present application;
FIG. 9 illustrates a schematic structural diagram of a computer system suitable for use in implementing the electronic device of an embodiment of the present application.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the application. One skilled in the relevant art will recognize, however, that the subject matter of the present application can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations, or operations have not been shown or described in detail to avoid obscuring aspects of the application.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
Fig. 1 shows a schematic diagram of an exemplary system architecture to which the technical solution of the embodiments of the present application can be applied.
As shown in fig. 1, the system architecture may include a terminal device (e.g., one or more of the smart phone 101, the tablet computer 102, and the portable computer 103 shown in fig. 1, and may also be a desktop computer or an embedded device, etc.), a network 104, and a server 105. The network 104 serves as a medium for providing communication links between terminal devices and the server 105. Network 104 may include various connection types, such as wired communication links, wireless communication links, and so forth.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. For example, server 105 may be a server cluster comprised of multiple servers, or the like.
A user may use a terminal device to interact with the server 105 over the network 104 to receive or transmit information. For example, when a user performs voice input with the terminal device 101 (or 102 or 103), the terminal device 101 may acquire the voice input information in real time and frame it to obtain the corresponding voice frames. It then performs phoneme recognition on the voice frames using a pre-trained acoustic model to identify the phonemes they contain, performs keyword recognition on the currently recognized phonemes after each phoneme recognition to determine the keywords contained in the voice input information, and, if the number of times the same keyword is consecutively recognized is greater than or equal to a predetermined number, determines the keyword as a target keyword, so that a corresponding action is performed according to it.
It should be noted that the processing method of the voice data provided in the embodiment of the present application is generally executed by the terminal device, and accordingly, the input device of the voice data is generally disposed in the terminal device. However, in other embodiments of the present application, the server 105 may also have a similar function as the terminal device, so as to execute the scheme of the processing method of voice data provided in the embodiments of the present application.
The implementation details of the technical solution of the embodiment of the present application are set forth in detail below:
fig. 2 shows a flow diagram of a method of processing speech data according to an embodiment of the application. Referring to fig. 2, the processing method of the voice data at least includes steps S210 to S250, which are described in detail as follows:
in step S210, voice input information is acquired in real time.
In one embodiment of the present application, a user may perform voice input through a voice input device (e.g., a microphone) configured on the terminal device. When the terminal device detects that the user is speaking, it can acquire the user's voice input information in real time.
In an example, the terminal device may acquire the user's voice input information once every predetermined time period. The predetermined time period may be configured in advance by a person skilled in the art, for example 0.5 s, 1 s, or 2 s; these values are merely exemplary and are not limited here.
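A minimal capture loop along those lines might look as follows; this sketch assumes the third-party sounddevice package and a 16 kHz mono microphone, neither of which the text specifies:

```python
import sounddevice as sd  # assumes the third-party sounddevice package

PERIOD_S = 0.5        # predetermined time period from the example above
SAMPLE_RATE = 16000   # assumed sampling rate

def acquire_chunks():
    """Yield one chunk of microphone samples every PERIOD_S seconds."""
    frames_per_chunk = int(SAMPLE_RATE * PERIOD_S)
    with sd.InputStream(samplerate=SAMPLE_RATE, channels=1,
                        dtype="float32") as stream:
        while True:
            chunk, _overflowed = stream.read(frames_per_chunk)
            yield chunk[:, 0]  # mono samples as a 1-D array
```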
In step S220, the voice input information is subjected to framing processing to obtain a voice frame corresponding to the voice input information.
In this embodiment, as the terminal device acquires the user's voice input information in real time, it may frame the acquired information in real time to divide it into at least one voice frame.
It should be understood that, to ensure recognition accuracy, the duration of the voice input information is positively correlated with the number of corresponding voice frames: the longer the voice input information, the more voice frames it yields, which secures the accuracy of the subsequent speech recognition result.
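A minimal framing sketch, assuming common 25 ms frames with a 10 ms hop (values the text does not fix); note that a longer input naturally yields more frames:

```python
import numpy as np

def split_into_frames(samples, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Split a buffer of PCM samples into overlapping voice frames.
    A longer input yields proportionally more frames."""
    frame_len = sample_rate * frame_ms // 1000
    hop_len = sample_rate * hop_ms // 1000
    if len(samples) < frame_len:
        return np.empty((0, frame_len), dtype=np.float32)
    n_frames = 1 + (len(samples) - frame_len) // hop_len
    return np.stack([samples[i * hop_len: i * hop_len + frame_len]
                     for i in range(n_frames)])
```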
In step S230, a pre-trained acoustic model is used to perform phoneme recognition on the speech frame to identify phonemes contained in the speech frame.
Here, a phoneme is the smallest speech unit divided according to the natural attributes of speech.
In an embodiment of the present application, a pre-trained acoustic model performs phoneme recognition on the divided voice frames to identify the phonemes contained in each frame. It should be understood that a voice frame may contain one phoneme, several phonemes, or none at all (for example, when the frame falls in a gap in the user's speech).
In step S240, after each phoneme recognition, keyword recognition is performed on the currently recognized phonemes to determine the keywords contained in the voice input information.
The keyword may be a specific word for controlling a device action, for example "open door" or "turn off light". According to the keywords contained in the voice input information, the device can be controlled to perform the corresponding action, such as opening a door or turning off a light.
In an embodiment of the present application, each time phoneme recognition on a voice frame completes, a pre-trained language model may perform keyword recognition on the currently recognized phonemes. Specifically, the language model analyzes the phonemes extracted by the acoustic model to identify the vocabulary word they correspond to, and then compares that word against a preset keyword list containing at least one keyword. If the word recognized by the language model matches a keyword in the list, the word corresponding to the currently recognized phonemes is determined to be a keyword.
It should be noted that after each voice frame undergoes phoneme recognition, keyword recognition is performed on the phonemes recognized from that frame together with those recognized before it (i.e., the currently recognized phonemes); that is, one voice frame corresponds to one phoneme recognition, and one phoneme recognition corresponds to one keyword recognition. This achieves real-time keyword recognition on the phonemes and improves keyword recognition efficiency.
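The per-frame matching described above can be illustrated as follows; the keyword list and its phoneme spellings are invented for the example:

```python
KEYWORD_PHONEMES = {                       # hypothetical phoneme spellings
    "open door":      ["OW", "P", "AH", "N", "D", "AO", "R"],
    "turn off light": ["T", "ER", "N", "AO", "F", "L", "AY", "T"],
}

def match_keyword(phoneme_history):
    """Compare the tail of the currently recognized phonemes against each
    entry in the keyword list; return the keyword whose phoneme sequence
    has just been completed, or None if nothing matches."""
    for keyword, spelling in KEYWORD_PHONEMES.items():
        if phoneme_history[-len(spelling):] == spelling:
            return keyword
    return None
```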
In step S250, if the number of times of continuously identifying the same keyword is greater than or equal to the predetermined number, the keyword is determined as a target keyword, and a corresponding action is performed according to the target keyword.
In this embodiment, if the same keyword is recognized in a number of consecutive keyword-recognition results and that number is greater than or equal to the predetermined number, the keyword is very likely the specific word for the action the user intends. The keyword can therefore be determined as the target keyword, and the device can be controlled to perform the corresponding action according to it.
The predetermined number may be preset, for example to 10, 15, or 25; these values are merely exemplary and are not limited in this application. A skilled person can set the predetermined number according to implementation needs; for example, a larger value may be chosen to ensure the accuracy of target-keyword recognition.
In the embodiment shown in fig. 2, because the voice input information is acquired and framed in real time, speech recognition can take place before the user has finished speaking. Compared with recognizing the voice input information only after the user finishes speaking, the voice data processing method provided by this application therefore improves recognition efficiency.
Moreover, a keyword is determined as the target keyword only if it is recognized consecutively at least the predetermined number of times, i.e., only if the recognition persists for a certain duration, indicating a high probability that it is the specific word for the action the user intends; this ensures the accuracy of the recognition result. The device can thus be controlled to perform the corresponding action before the user even finishes the voice input, which ensures the response speed of the voice-controlled device.
Based on the embodiment shown in fig. 2, fig. 3 shows a flowchart of step S250 in the processing method of voice data of fig. 2 according to an embodiment of the present application. Referring to fig. 3, step S250 at least includes steps S310 to S320, which are described in detail as follows:
in step S310, the speech frame in which the keyword is recognized for the first time is determined as a start frame.
In this embodiment, since one voice frame corresponds to one phoneme recognition and one phoneme recognition corresponds to one keyword recognition, each voice frame corresponds to one keyword-recognition result. When a keyword is recognized from a voice frame and no keyword was recognized in the voice frames within a predetermined range before it, that frame may be determined as the start frame. For example, if the keyword "open door" is recognized from the 12th voice frame, the predetermined number is 10, and no keyword was recognized from the 2nd through the 11th voice frames, the 12th voice frame may be determined as the start frame.
In step S320, if no other keyword is recognized in the predetermined number of voice frames after the start frame, the keyword is determined as a target keyword.
In this embodiment, if no other keyword is recognized within the predetermined number of voice frames after the start frame, i.e., the same keyword has been recognized consecutively at least the predetermined number of times, the keyword recognized at the start frame is very likely the specific word for the action the user intends, so it can be determined as the target keyword.
In one example, if other keywords are recognized within the predetermined number of voice frames after the start frame, one keyword may be randomly selected from them as the target keyword. The "plurality" mentioned here may be two or more and can be configured according to actual implementation needs; this application is not limited in this respect.
It should be noted that "no other keyword being recognized" includes the case where the same keyword continues to be recognized. For example, if the keyword "open door" is recognized in the 12th voice frame, recognized again in the 13th through 19th frames, and no keyword at all is recognized in the remaining frames of the window, then "open door" is determined as the target keyword, indicating that the user intends to open the door.
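A sketch of the start-frame rule of steps S310 to S320, under the assumption that each voice frame contributes exactly one keyword-recognition result:

```python
def confirm_target_keyword(per_frame_keywords, predetermined_number=10):
    """per_frame_keywords holds one keyword-recognition result per voice
    frame (a keyword string or None). A keyword is confirmed when the
    predetermined number of frames after its start frame contains no
    *other* keyword (the same keyword, or nothing, is allowed)."""
    for i, kw in enumerate(per_frame_keywords):
        if kw is None:
            continue                      # not a candidate start frame
        window = per_frame_keywords[i + 1: i + 1 + predetermined_number]
        if len(window) == predetermined_number and all(
                w in (None, kw) for w in window):
            return kw
    return None

# Frames 12..19 recognize "open door", nothing else follows -> confirmed:
results = [None] * 11 + ["open door"] * 8 + [None] * 3
print(confirm_target_keyword(results))   # -> "open door"
```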
In an embodiment of the present application, if a plurality of target keywords are determined from the voice input information, the device may be controlled to perform the action corresponding to each of them. For example, if the target keywords "turn off light" and "close door" are determined, the light is turned off and the door is closed in the order in which the target keywords were recognized.
In one example, the importance of keywords may be set in advance by a person skilled in the art, dividing them into different importance levels; for example, "open door" and "close door" may be given a higher importance level than "turn off light" or "heat water". Then, when a plurality of target keywords are determined from the voice input information, the actions corresponding to them can be executed in descending order of importance, as sketched below.
Specifically, a person skilled in the art may add an importance marker to each keyword in the keyword list in advance, for example "1" for very important, "2" for important, and "3" for ordinary. The marker may be identification information of any form, including but not limited to a numeric, letter, or graphic marker; these are merely examples, and the application is not limited to them.
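A brief sketch of such importance-ordered execution; the importance values and actions are placeholders:

```python
IMPORTANCE = {"open door": 1, "close door": 1, "turn off light": 3}  # 1 = very important

def execute_in_order(target_keywords, actions):
    """Run the action for each target keyword, most important first;
    sorted() is stable, so equally important keywords keep their
    recognition order."""
    for kw in sorted(target_keywords, key=lambda k: IMPORTANCE.get(k, 3)):
        actions[kw]()

execute_in_order(
    ["turn off light", "close door"],
    {"turn off light": lambda: print("light off"),
     "close door": lambda: print("door closed")},
)  # prints "door closed" first, despite being recognized second
```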
In the embodiment shown in fig. 3, the target keyword is determined by taking the voice frame in which a keyword is first recognized as the start frame and checking whether any other keyword is recognized within the predetermined number of voice frames that follow. It should be understood that, when a user issues a voice command, a single sentence rarely contains several keywords, so this rule keeps keyword recognition consistent with actual usage and ensures the accuracy of the keyword recognition result.
Based on the embodiment shown in fig. 2, fig. 4 shows a flowchart of step S230 in the processing method of voice data of fig. 2 according to an embodiment of the present application. Referring to fig. 4, step S230 at least includes steps S410 to S420, which are described in detail as follows:
in step S410, feature extraction is performed on the speech frame to obtain speech features corresponding to the speech frame.
In this embodiment, a pre-configured speech feature extraction module may be used to extract features from a voice frame and obtain the speech features corresponding to it. The module may use any existing speech feature extraction method, for example one based on linear predictive analysis, perceptual linear prediction, or Mel-frequency cepstral coefficient analysis; this application is not particularly limited in this respect.
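As one concrete possibility, Mel-frequency cepstral coefficients (one of the analyses named above) can serve as the per-frame speech features; this sketch assumes the librosa package and common parameter values not taken from the patent:

```python
import librosa  # assumes the third-party librosa package
import numpy as np

def extract_features(frame, sample_rate=16000, n_mfcc=13):
    """Return one MFCC vector as the speech features of a single voice
    frame; parameter values are common defaults, not from the patent."""
    mfcc = librosa.feature.mfcc(
        y=np.asarray(frame, dtype=np.float32), sr=sample_rate,
        n_mfcc=n_mfcc, n_fft=len(frame), hop_length=len(frame), center=False)
    return mfcc[:, 0]  # shape (n_mfcc,)
```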
In step S420, the speech features are input into an acoustic model, so that the acoustic model outputs phonemes included in the speech frame.
The acoustic model may be a model for recognizing a phoneme corresponding to the speech feature according to the speech feature.
In this embodiment, the speech features corresponding to a voice frame are used as the input of the acoustic model, so that the acoustic model outputs the phonemes corresponding to those features. It should be understood that a phoneme may be the smallest speech unit divided according to the natural attributes of the language; for example, the English international phonetic alphabet has 48 phonemes. If the pronunciation of a word consists of three phonemes, there are at most 48 × 48 × 48 possibilities, so recognition takes place within a limited set of categories, which reduces the difficulty of speech recognition.
In an embodiment of the application, the acoustic model can be built on LVCSR (Large Vocabulary Continuous Speech Recognition) technology. For an embedded platform, the number of neural network layers in the acoustic model can be reduced accordingly, so that the model occupies less storage space while its recognition accuracy is maintained, which facilitates deployment on embedded platforms.
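In that spirit, a reduced-layer acoustic model might be sketched as follows in PyTorch; the architecture, feature size, and 48-phoneme output are illustrative assumptions, as the patent does not disclose the actual model:

```python
import torch
import torch.nn as nn

N_FEATURES = 13   # assumed MFCC front end
N_PHONEMES = 48   # e.g. the 48 phonemes mentioned above

class SmallAcousticModel(nn.Module):
    """A deliberately shallow network in the spirit of trimming layers
    for an embedded platform; the real architecture is not disclosed."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_FEATURES, 64), nn.ReLU(),
            nn.Linear(64, N_PHONEMES),        # per-frame phoneme scores
        )

    def forward(self, features):              # features: (batch, N_FEATURES)
        return self.net(features).log_softmax(dim=-1)

model = SmallAcousticModel()
posteriors = model(torch.randn(1, N_FEATURES))  # log-probabilities over phonemes
```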
Based on the embodiment shown in fig. 2, fig. 5 shows a flowchart of step S240 in the processing method of voice data of fig. 2 according to an embodiment of the present application. Referring to fig. 5, step S240 at least includes steps S510 to S520, which are described in detail as follows:
in step S510, a first weight of a keyword and a second weight of a word with a similar pronunciation to the keyword are obtained.
The first weight and the second weight may be numerical values respectively representing occurrence probabilities of the keywords and the nearing words, and it should be understood that the higher the occurrence probability is, the greater the corresponding weight is, so the first weight should be greater than the second weight, for example, the first weight is 0.7, the second weight is 0.3, and so on.
In this embodiment, a person skilled in the art may configure the weight, i.e., the occurrence probability, of each vocabulary through an input device (e.g., an input keyboard, a touch-sensitive touch screen, etc.) configured by the terminal device, and store the configured weight in a storage location of the terminal device for subsequent retrieval.
It should be noted that, by setting different weights for different vocabularies, the occurrence probability of each vocabulary can be clarified, and according to the recognized pronunciation, a vocabulary with a higher occurrence probability can be selected as the recognition result.
In step S520, the currently recognized phoneme, the first weight and the second weight are input into a language model, so that the language model outputs a keyword corresponding to the currently recognized phoneme.
In this embodiment, the currently recognized phonemes, the first weight, and the second weight are used as the input of the language model, so that the language model recognizes the corresponding vocabulary word on the basis of all three.
In the embodiment shown in fig. 5, because the weights of the keyword and of its near-sound words are preset, the language model can select the word with the higher weight, i.e., the higher occurrence probability, as its recognition result. This prevents a near-sound word whose pronunciation resembles the keyword from being taken as the recognition result and thus ensures the accuracy of the speech recognition result.
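A toy version of this weighted selection; the multiplicative scoring rule is an assumption, since the patent states only that the weights encode occurrence probabilities:

```python
def pick_weighted(candidates):
    """candidates: (word, acoustic_score, weight) tuples, where the weight
    is the configured occurrence probability (first weight for the keyword,
    second weight for its near-sound word)."""
    return max(candidates, key=lambda c: c[1] * c[2])[0]

# Equally good acoustic scores: the keyword's higher weight wins.
print(pick_weighted([("open door", 0.9, 0.7), ("open drawer", 0.9, 0.3)]))
```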
In an embodiment of the present application, predetermined words whose pronunciation differs greatly from the keywords may be added when training the language model, increasing the model's vocabulary and preventing false triggering caused by a too-small vocabulary forcing every recognition result onto a keyword. For example, if the language model can recognize only the single word "open door", it will output "open door" no matter which phoneme combination the acoustic model produces; training it with additional predetermined words that sound very different from the keywords therefore improves the accuracy of its recognition results.
Based on the embodiment shown in fig. 2, fig. 6 shows a flowchart of step S240 in the processing method of voice data of fig. 2 according to another embodiment of the present application. Referring to fig. 6, step S240 at least includes steps S610 to S620, which are described in detail as follows:
in step S610, the currently recognized phonemes are respectively input into a plurality of language models, so that the language models respectively output pending words corresponding to the currently recognized phonemes.
In an embodiment of the present application, a plurality of language models may be trained in advance, and when performing speech recognition, a currently recognized phoneme is input into each of the plurality of language models, so that each language model outputs an undetermined word corresponding to the phoneme according to the phoneme. It should be understood that the words to be determined output by the language models may be the same or different.
In an embodiment of the present application, different sets of training speech frames may be used to train multiple language models, so as to reduce the correlation between the multiple language models, thereby ensuring the accuracy of recognition according to the multiple language models. In other embodiments, a set of classification models may also be trained by using other classification algorithms such as an SVM (Support Vector Machine) to replace one of the plurality of language models, so as to reduce the correlation between the two language models.
In step S620, a keyword corresponding to the currently recognized phoneme is determined according to the word to be determined.
In one embodiment of the application, the words to be determined output by the plurality of language models are compared with each keyword in the keyword list, and if only one keyword matched with one of the words to be determined exists, the keyword is determined as a final recognition result; if a plurality of matched keywords exist, the keywords with higher importance degree can be selected as the final recognition result,
in the embodiment shown in fig. 6, the plurality of language models are pre-trained for recognition, and the recognition results of the plurality of language models are compared, so that the accuracy of the recognition result output by the language model can be ensured, and the occurrence of false triggering can be prevented.
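A sketch of this voting step, assuming an invented keyword list and the importance levels described earlier:

```python
KEYWORD_LIST = ["open door", "close door", "turn off light"]
IMPORTANCE = {"open door": 1, "close door": 1, "turn off light": 3}

def decide_keyword(candidate_words):
    """candidate_words: one output word per language model for the
    currently recognized phonemes. Words outside the keyword list are
    dropped; a single match is returned directly, several matches are
    resolved in favor of the most important keyword."""
    matches = {w for w in candidate_words if w in KEYWORD_LIST}
    if not matches:
        return None
    return min(matches, key=lambda k: IMPORTANCE.get(k, 99))
```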
Based on the embodiment shown in fig. 2, fig. 7 shows a flowchart of keyword editing, which is further included in a processing method of voice data according to an embodiment of the present application. Referring to fig. 7, keyword editing at least includes steps S710 to S720, which are described in detail as follows:
in step S710, if a modification request for the keyword is received, a keyword editing interface is displayed.
The modification request for the keyword may be information requesting that a keyword be edited. In an example, the user may generate a modification request by clicking a specific region on the interface (e.g., an "edit keyword" button) or pressing a physical key configured on the terminal device (e.g., an "edit" key on a physical keyboard).
The keyword editing interface is an interface for editing keywords and may include keyword editing options, such as an add-keyword option and a delete-keyword option. Specifically, the interface may display the existing keyword list, and the user completes edits to the list by selecting the corresponding option and performing the add or delete operation.
In step S720, the keyword modification information received through the keyword editing interface is stored.
In this embodiment, the keyword modification information obtained from the user through the keyword editing interface is stored and synchronized to the existing keyword list, so that the list is updated and the language model compares against the up-to-date list in subsequent speech recognition.
In the embodiment shown in fig. 7, providing keyword editing options lets the user add or delete keywords according to actual needs, widening the range of application of the voice data processing method.
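A minimal sketch of persisting such edits, assuming JSON storage at a hypothetical keywords.json path:

```python
import json
from pathlib import Path

KEYWORD_FILE = Path("keywords.json")  # hypothetical storage location

def load_keywords():
    """Return the stored keyword list, or an empty list if none exists."""
    return json.loads(KEYWORD_FILE.read_text()) if KEYWORD_FILE.exists() else []

def apply_modification(action, keyword):
    """Apply one edit from the keyword editing interface and persist it,
    so that subsequent recognitions compare against the updated list."""
    keywords = load_keywords()
    if action == "add" and keyword not in keywords:
        keywords.append(keyword)
    elif action == "delete" and keyword in keywords:
        keywords.remove(keyword)
    KEYWORD_FILE.write_text(json.dumps(keywords, ensure_ascii=False, indent=2))
```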
The following describes apparatus embodiments of the present application, which may be used to perform the voice data processing method in the above embodiments. For details not disclosed in the apparatus embodiments, please refer to the embodiments of the voice data processing method described above.
Fig. 8 shows a block diagram of a processing device of speech data according to an embodiment of the application.
Referring to fig. 8, a speech data processing apparatus according to an embodiment of the present application includes:
an obtaining module 810, configured to obtain voice input information in real time;
a framing module 820, configured to perform framing processing on the voice input information to obtain a voice frame corresponding to the voice input information;
a first recognition module 830, configured to perform phoneme recognition on the speech frame by using a pre-trained acoustic model to recognize phonemes included in the speech frame;
a second recognition module 840, configured to perform keyword recognition on a currently recognized phoneme according to a result of each phoneme recognition, so as to determine a keyword included in the speech input information;
and the processing module 850 is configured to determine a keyword as the target keyword if the number of times the same keyword is consecutively recognized is greater than or equal to a predetermined number, so as to perform a corresponding action according to the target keyword.
Based on the foregoing, in some embodiments of the present application, the processing module 850 is configured to: determine the voice frame in which the keyword is first recognized as a start frame; and if no other keyword is recognized within the predetermined number of voice frames after the start frame, determine the keyword as the target keyword.
Based on the foregoing, in some embodiments of the present application, the first recognition module 830 is configured to: extract features from the voice frames to obtain the speech features corresponding to them; and input the speech features into an acoustic model, so that the acoustic model outputs the phonemes contained in the voice frames.
Based on the foregoing, in some embodiments of the present application, the second recognition module 840 is configured to: acquire a first weight of a keyword and a second weight of a near-sound word whose pronunciation is similar to that of the keyword; and input the currently recognized phonemes, the first weight, and the second weight into a language model, so that the language model outputs the keyword corresponding to the currently recognized phonemes.
Based on the foregoing solutions, in some embodiments of the present application, the second identification module 840 includes:
the word recognition unit is used for inputting the currently recognized phonemes into a plurality of language models respectively, so that each language model outputs a candidate word corresponding to the currently recognized phonemes;
and the keyword determining unit is used for determining the keyword corresponding to the currently recognized phonemes according to the candidate words.
Based on the foregoing, in some embodiments of the present application, the processing module 850 is further configured to: display a keyword editing interface if a modification request for a keyword is received; and store the keyword modification information received through the keyword editing interface.
FIG. 9 illustrates a schematic structural diagram of a computer system suitable for use in implementing the electronic device of an embodiment of the present application.
It should be noted that the computer system of the electronic device shown in fig. 9 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 9, the computer system includes a Central Processing Unit (CPU) 901, which can perform various appropriate actions and processes, such as performing the method described in the above embodiments, according to a program stored in a Read-Only Memory (ROM) 902 or a program loaded from a storage portion 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data necessary for system operation are also stored. The CPU 901, ROM 902, and RAM 903 are connected to each other via a bus 904. An Input/Output (I/O) interface 905 is also connected to bus 904.
The following components are connected to the I/O interface 905: an input portion 906 including a keyboard, a mouse, and the like; an output section 907 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, a speaker, and the like; a storage portion 908 including a hard disk and the like; and a communication section 909 including a Network interface card such as a LAN (Local Area Network) card, a modem, or the like. The communication section 909 performs communication processing via a network such as the internet. The drive 910 is also connected to the I/O interface 905 as necessary. A removable medium 911 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 910 as necessary, so that a computer program read out therefrom is mounted into the storage section 908 as necessary.
In particular, according to embodiments of the application, the processes described above with reference to the flow diagrams may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer-readable medium, the computer program comprising program code for performing the method illustrated by the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 909, and/or installed from the removable medium 911. When executed by the Central Processing Unit (CPU) 901, the computer program executes the various functions defined in the system of the present application.
It should be noted that the computer readable medium shown in the embodiments of the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM), a flash Memory, an optical fiber, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with a computer program embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. The computer program embodied on the computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. Each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software or by hardware, and the described units may also be disposed in a processor. The names of the units do not in any way limit the units themselves.
As another aspect, the present application also provides a computer-readable medium, which may be contained in the electronic device described in the above embodiments; or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by an electronic device, cause the electronic device to implement the method described in the above embodiments.
It should be noted that although several modules or units of the device for action execution are mentioned in the above detailed description, such a division is not mandatory. Indeed, according to embodiments of the application, the features and functionality of two or more of the modules or units described above may be embodied in a single module or unit, and conversely the features and functions of one module or unit may be subdivided among several modules or units.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which can be a personal computer, a server, a touch terminal, or a network device, etc.) to execute the method according to the embodiments of the present application.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the embodiments disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (10)

1. A method for processing voice data, comprising:
acquiring voice input information in real time;
performing framing processing on the voice input information to obtain a voice frame corresponding to the voice input information;
adopting a pre-trained acoustic model to perform phoneme recognition on the voice frame so as to recognize phonemes contained in the voice frame;
after each phoneme recognition, performing keyword recognition on the currently recognized phonemes to determine the keywords contained in the voice input information;
and if the number of times the same keyword is consecutively recognized is greater than or equal to a predetermined number, determining the keyword as a target keyword, so that a corresponding action is performed according to the target keyword.
2. The processing method of claim 1, wherein determining the keyword as a target keyword if the number of times the same keyword is consecutively recognized is greater than or equal to the predetermined number comprises:
determining the voice frame in which the keyword is first recognized as a start frame;
and if no other keyword is recognized within the predetermined number of voice frames after the start frame, determining the keyword as the target keyword.
3. The processing method of claim 1, wherein performing phoneme recognition on the speech frame by using a pre-trained acoustic model to recognize phonemes contained in the speech frame comprises:
extracting the characteristics of the voice frame to obtain the voice characteristics corresponding to the voice frame;
inputting the speech features into an acoustic model so that the acoustic model outputs phonemes contained in the speech frame.
4. The processing method of claim 1, wherein performing keyword recognition on the currently recognized phonemes according to the result of each phoneme recognition to determine a keyword contained in the voice input information comprises:
acquiring a first weight of a keyword and a second weight of a near-sound word whose pronunciation is similar to that of the keyword;
and inputting the currently recognized phonemes, the first weight, and the second weight into a language model, so that the language model outputs the keyword corresponding to the currently recognized phonemes.
5. The processing method of claim 1, wherein performing keyword recognition on the currently recognized phonemes according to the result of each phoneme recognition to determine a keyword contained in the voice input information comprises:
inputting the currently recognized phonemes into a plurality of language models respectively, so that the language models each output a candidate word corresponding to the currently recognized phonemes;
and determining the keyword corresponding to the currently recognized phonemes according to the candidate words.
6. The processing method according to claim 5, characterized in that it further comprises:
if a modification request for a keyword is received, displaying a keyword editing interface;
and storing the keyword modification information received through the keyword editing interface.
7. An apparatus for processing voice data, comprising:
the acquisition module is used for acquiring voice input information in real time;
the framing module is used for framing the voice input information to obtain a voice frame corresponding to the voice input information;
the first recognition module is used for performing phoneme recognition on the voice frames by adopting a pre-trained acoustic model, so as to identify the phonemes contained in the voice frames;
the second recognition module is used for performing keyword recognition on the currently recognized phonemes after each phoneme recognition, so as to determine the keywords contained in the voice input information;
and the processing module is used for determining a keyword as the target keyword if the number of times the same keyword is consecutively recognized is greater than or equal to a predetermined number, so as to perform a corresponding action according to the target keyword.
8. The apparatus of claim 7, wherein the second recognition module comprises:
the word recognition unit, used for inputting the currently recognized phonemes into a plurality of language models respectively, so that each of the language models outputs a candidate word corresponding to the currently recognized phonemes;
and the keyword determining unit, used for determining the keyword corresponding to the currently recognized phonemes according to the candidate words.
9. A computer-readable medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method of processing voice data according to any one of claims 1 to 6.
10. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method of processing voice data according to any one of claims 1 to 6.
CN202010549158.XA 2020-06-16 2020-06-16 Voice data processing method and device, computer readable medium and electronic equipment Active CN111710337B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010549158.XA CN111710337B (en) 2020-06-16 2020-06-16 Voice data processing method and device, computer readable medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN111710337A (en) 2020-09-25
CN111710337B (en) 2023-07-07

Family

ID=72540405

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010549158.XA Active CN111710337B (en) 2020-06-16 2020-06-16 Voice data processing method and device, computer readable medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN111710337B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015021844A1 (en) * 2013-08-15 2015-02-19 Tencent Technology (Shenzhen) Company Limited Keyword detection for speech recognition
CN105981099A (en) * 2014-02-06 2016-09-28 三菱电机株式会社 Speech search device and speech search method
CN106875939A (en) * 2017-01-13 2017-06-20 佛山市父母通智能机器人有限公司 To the Chinese dialects voice recognition processing method and intelligent robot of wide fluctuations
CN109119074A (en) * 2017-06-22 2019-01-01 上海智建电子工程有限公司 Voice recognition controller
CN107644638A (en) * 2017-10-17 2018-01-30 北京智能管家科技有限公司 Audio recognition method, device, terminal and computer-readable recording medium
CN111128172A (en) * 2019-12-31 2020-05-08 达闼科技成都有限公司 Voice recognition method, electronic equipment and storage medium

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112259077A (en) * 2020-10-20 2021-01-22 网易(杭州)网络有限公司 Voice recognition method, device, terminal and storage medium
CN112259077B (en) * 2020-10-20 2024-04-09 网易(杭州)网络有限公司 Speech recognition method, device, terminal and storage medium
CN112331213A (en) * 2020-11-06 2021-02-05 深圳市欧瑞博科技股份有限公司 Intelligent household equipment control method and device, electronic equipment and storage medium
CN112102812A (en) * 2020-11-19 2020-12-18 成都启英泰伦科技有限公司 Anti-false-wake-up method based on multiple acoustic models and voice recognition module
CN112102812B (en) * 2020-11-19 2021-02-05 成都启英泰伦科技有限公司 Anti-false wake-up method based on multiple acoustic models
CN113096649A (en) * 2021-03-31 2021-07-09 平安科技(深圳)有限公司 Voice prediction method, device, electronic equipment and storage medium
CN113096649B (en) * 2021-03-31 2023-12-22 平安科技(深圳)有限公司 Voice prediction method, device, electronic equipment and storage medium
CN113724698A (en) * 2021-09-01 2021-11-30 马上消费金融股份有限公司 Training method, device and equipment of speech recognition model and storage medium
CN113724698B (en) * 2021-09-01 2024-01-30 马上消费金融股份有限公司 Training method, device, equipment and storage medium of voice recognition model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant