CN115713929A - Voice data processing method and device and electronic equipment - Google Patents

Info

Publication number: CN115713929A
Application number: CN202211184669.1A
Authority: CN (China)
Prior art keywords: voice, speech, detection, sample, initial sample
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Other languages: Chinese (zh)
Inventors: 王炳乾, 刘童
Current Assignee: BOE Technology Group Co Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Original Assignee: BOE Technology Group Co Ltd
Application filed by BOE Technology Group Co Ltd
Priority to CN202211184669.1A
Publication of CN115713929A

Abstract

The embodiment of the present application provides a voice data processing method and apparatus, an electronic device, and a computer-readable storage medium, relating to the technical field of speech recognition. The method comprises the following steps: receiving a first voice, performing voice recognition on the first voice based on a voice recognition model, and determining recognition information corresponding to the first voice; and executing a processing operation corresponding to the first voice according to the recognition information. The voice recognition model is trained based on target sample voice, and the target sample voice is obtained by performing voice detection on initial sample voice; the resulting target sample voice therefore has higher voice quality, so the voice recognition model trained on it has higher recognition precision, improving the accuracy of voice recognition.

Description

Voice data processing method and device and electronic equipment
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a method and an apparatus for processing speech data, an electronic device, and a computer-readable storage medium.
Background
At present, artificial intelligence technology has been widely applied in various speech processing scenarios, for example, voice wake-up, speech recognition, and speech synthesis. However, the recognition precision of existing speech recognition models is still low, and recognition errors or recognition failures occur easily, which is inconvenient for users.
Disclosure of Invention
The present application aims to solve at least one of the above technical drawbacks, and in particular, the technical drawback that the accuracy of speech recognition is low due to the low speech recognition precision of the speech recognition model.
According to an aspect of the present application, there is provided a voice data processing method, including: receiving a first voice;
performing voice recognition on the first voice based on a voice recognition model, and determining recognition information corresponding to the first voice; the voice recognition model is obtained by training based on target sample voice in a sample voice database, and the target sample voice is obtained by performing voice detection on initial sample voice; wherein the voice detection comprises at least one of speech rate detection, keyword frequency detection, and blank detection;
and executing the processing operation corresponding to the first voice according to the recognition information.
Optionally, before performing speech recognition on the first speech, the method further includes:
receiving the initial sample speech;
and performing the voice detection on the initial sample voice, and screening the initial sample voice meeting the preset sample requirement as target sample voice according to the detection result of the voice detection.
Optionally, the performing the voice detection on the initial sample voice includes:
dividing the initial sample voice into a plurality of voice frames, and determining effective voice frames in the voice frames;
screening, as effective speech segments, speech segments in which the number of consecutive effective speech frames is greater than a preset number;
performing the speech detection on the initial sample speech including the valid speech segment.
Optionally, the determining a valid speech frame in the speech frames includes:
extracting acoustic features of the voice frame;
and under the condition that the acoustic characteristics of the voice frame are determined to accord with the preset characteristic conditions, determining the voice frame as the effective voice frame.
Optionally, in a case that the voice detection includes blank detection, the performing the voice detection on the initial sample voice includes:
determining the interval duration between the adjacent effective speech segments;
it is determined whether the interval duration is within a first standard duration range.
Optionally, in a case that the speech detection includes speech rate detection, the performing the speech detection on the initial sample speech includes:
and comparing the speech segment duration of the effective speech segment with a second standard duration range, and determining whether the speech segment duration is within the second standard duration range.
Optionally, in a case that the voice detection includes keyword frequency detection, the performing the voice detection on the initial sample voice includes:
and performing voice recognition on the effective speech segments, and determining whether the effective speech segments contain target keywords and whether the number of the effective speech segments containing the target keywords is greater than a preset threshold value.
Optionally, the screening, according to the detection result of the voice detection, the initial sample voice meeting the preset sample requirement as a target sample voice includes:
under the condition that the voice detection comprises single detection, determining the initial sample voice of which the detection result of the single detection meets the requirement of a preset sample as target sample voice;
and under the condition that the voice detection comprises a plurality of detections, determining the initial sample voice with at least a preset number of detection results meeting preset sample requirements in the plurality of detections as target sample voice.
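As a minimal sketch of the single-versus-multiple screening rule above (the boolean representation of detection results and the default pass threshold are assumptions, not specified by the patent):

```python
def meets_sample_requirement(detection_results, min_passes=None):
    """detection_results: list of booleans, one per voice detection performed.
    With a single detection, that result alone decides; with multiple
    detections, at least `min_passes` of them must pass."""
    if len(detection_results) == 1:
        return detection_results[0]
    if min_passes is None:
        min_passes = len(detection_results)  # assumed default: all must pass
    return sum(detection_results) >= min_passes
```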
Optionally, before the receiving the initial sample speech, the method further includes:
receiving a second voice;
sending a prompt voice and/or displaying first prompt information in a case that the second voice contains a target wake-up word;
the prompt voice and the first prompt information indicate that the initial sample voice starts to be collected; wherein the first prompt message includes at least one of the following:
a target keyword;
collecting times of the target keywords;
the interval duration between target keywords.
Optionally, the method further includes:
displaying second prompt information under the condition that the initial sample voice does not meet the preset sample condition;
the second prompt indicates to re-collect the initial sample speech.
According to another aspect of the present application, there is provided a voice data processing apparatus including:
the receiving module is used for receiving first voice;
the recognition module is used for performing voice recognition on the first voice based on a voice recognition model and determining recognition information corresponding to the first voice; the voice recognition model is obtained by training based on target sample voice in a sample voice database, and the target sample voice is obtained by performing voice detection on initial sample voice; wherein the voice detection comprises at least one of speech rate detection, keyword frequency detection, and blank detection;
and the execution module is used for executing the processing operation corresponding to the first voice according to the recognition information.
According to another aspect of the present application, there is provided an electronic device including:
one or more processors;
a memory;
one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more applications being configured to perform the voice data processing method according to any one of the first aspect of the present application.
For example, in a third aspect of the present application, there is provided a computing device comprising a processor, a memory, a communication interface, and a communication bus, through which the processor, the memory, and the communication interface communicate with one another;
the memory is used for storing at least one executable instruction, and the executable instruction enables the processor to execute the operation corresponding to the voice data processing method as shown in the first aspect of the application.
According to yet another aspect of the present application, there is provided a computer-readable storage medium storing a computer program which, when executed by a processor, implements the voice data processing method of any one of the first aspects of the present application.
For example, in a fourth aspect of the embodiments of the present application, a computer-readable storage medium is provided, on which a computer program is stored, and the program, when executed by a processor, implements the voice data processing method shown in the first aspect of the present application.
According to an aspect of the application, a computer program product or computer program is provided, comprising computer instructions, the computer instructions being stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the method provided in the various alternative implementations of the first aspect described above.
The beneficial effects of the technical solution provided by the present application are as follows:
in the embodiment of the application, in the process of voice recognition, receiving first voice, performing voice recognition on the first voice based on a voice recognition model, and executing processing operation corresponding to the first voice; the target sample voice adopted in the voice recognition model training process is obtained through voice detection, and the voice detection comprises at least one of voice speed detection, keyword frequency detection and blank space detection; therefore, the target sample voice with higher voice quality is obtained through voice detection, and the voice recognition precision of the voice recognition model is improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.
Fig. 1 is a schematic system architecture diagram of a voice data processing method according to an embodiment of the present application;
fig. 2 is a flowchart illustrating a voice data processing method according to an embodiment of the present application;
fig. 3 is a schematic view of an application scenario of a voice data processing method according to an embodiment of the present application;
fig. 4 is a second schematic view of an application scenario of a voice data processing method according to an embodiment of the present application;
fig. 5 is a third schematic view of an application scenario of a speech data processing method according to an embodiment of the present application;
fig. 6 is a fourth schematic view of an application scenario of a voice data processing method according to an embodiment of the present application;
fig. 7 is a fifth schematic view of an application scenario of a voice data processing method according to an embodiment of the present application;
fig. 8 is a sixth schematic view illustrating an application scenario of a voice data processing method according to an embodiment of the present application;
fig. 9 is a second flowchart illustrating a voice data processing method according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of a speech data processing apparatus according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of an electronic device for processing voice data according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below in conjunction with the drawings in the present application. It should be understood that the embodiments set forth below in connection with the drawings are exemplary descriptions for explaining technical solutions of the embodiments of the present application, and do not limit the technical solutions of the embodiments of the present application.
As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the terms "comprises" and/or "comprising", when used in this specification in connection with embodiments of the present application, specify the presence of stated features, information, data, steps, operations, elements, and/or components, but do not preclude the presence or addition of other features, information, data, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. The term "and/or" as used herein indicates at least one of the items it joins; e.g., "A and/or B" can be implemented as "A", as "B", or as "A and B".
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
At least part of the content of the voice data processing method provided by the embodiment of the present application relates to fields within artificial intelligence such as machine learning, and also to various fields of cloud technology, such as cloud computing, cloud services, and related data computing and processing in the field of big data.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
The artificial intelligence technology is a comprehensive subject, and relates to the field of extensive technology, namely the technology of a hardware level and the technology of a software level. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
Machine Learning (ML for short) is a multi-domain cross subject, and relates to multiple subjects such as probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and the like. The method specially studies how a computer simulates or realizes the learning behavior of human beings so as to acquire new knowledge or skills and reorganize the existing knowledge structure to continuously improve the performance of the computer. Machine learning is the core of artificial intelligence, is the fundamental approach to make computers have intelligence, and is applied in various fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and formula learning.
To further illustrate the technical solutions provided by the embodiments of the present application, the following detailed description is made with reference to the accompanying drawings and the detailed description. Although the embodiments of the present application provide method steps as shown in the following embodiments or figures, more or fewer steps may be included in the method based on conventional or non-inventive efforts. In steps where no necessary causal relationship exists logically, the order of execution of these steps is not limited to the order of execution provided by the embodiments of the present application.
Fig. 1 is a system architecture diagram of a voice data processing method according to an embodiment of the present application. The system may include a server 101 and a terminal cluster, wherein the server 101 may be regarded as a background server for performing speech recognition.
The terminal cluster may include terminal 102, terminal 103, terminal 104, and so on, each of which has a client supporting image display installed. There may be communication connections between the terminals, for example, between terminal 102 and terminal 103, and between terminal 103 and terminal 104.
Meanwhile, the server 101 may provide services for the terminal cluster through communication connections, and any terminal in the terminal cluster may have a communication connection with the server 101; for example, a communication connection exists between terminal 102 and the server 101, and between terminal 103 and the server 101. The connection manner is not limited: terminals may be connected directly or indirectly through wired communication, directly or indirectly through wireless communication, or in other manners.
The communicatively coupled network may be a wide area network or a local area network, or a combination thereof. The application is not limited thereto.
The voice data processing method in the embodiment of the present application may be executed on a server side or a terminal side, and an execution subject is not limited in the embodiment of the present application. The method provided by the embodiment of the present application may be executed by a computer device, which includes but is not limited to a terminal (also includes the above-mentioned user terminal) or a server (also includes the above-mentioned server 101). The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, middleware service, a domain name service, a security service, a CDN, a big data and artificial intelligence platform, and the like. The terminal may be, but is not limited to, a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, and the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, and the application is not limited herein.
Certainly, the method provided in the embodiment of the present application is not limited to be used in the application scenario shown in fig. 1, and may also be used in other possible application scenarios, and the embodiment of the present application is not limited. The functions that can be implemented by each device in the application scenario shown in fig. 1 will be described in the following method embodiments, and will not be described in detail herein.
The embodiment of the present application provides a possible implementation manner, and the scheme may be executed by any electronic device, and optionally, any electronic device may be a server device having a voice data processing capability, or may also be a device or a chip integrated on these devices. As shown in fig. 2, which is a schematic flow chart of a voice data processing method according to an embodiment of the present application, the method includes the following steps:
step S201: a first voice is received.
Optionally, the embodiment of the present application may be applied to the technical field of speech recognition, and specifically, to an application scenario for recognizing speech information.
The voice information is the first voice in the embodiment of the present application; by receiving the first voice, subsequent recognition processing can be performed on it.
In the embodiment of the application, the first voice can be the voice of the user; for example, in an actual voice recognition scenario, a user may control to turn on a home appliance through a voice instruction (i.e., a first voice of the embodiment of the present application), may perform voice interaction with a robot through an interactive voice (i.e., a first voice of the embodiment of the present application), and the like. In addition, the first voice may also be a voice of the device, such as a voice of an intelligent appliance, a voice of an electronic device, and so on. For example, in some scenarios, the household appliance may also be controlled to be turned on by the voice of the electronic device (i.e., the first voice of the embodiment of the present application).
Step S202: performing voice recognition on the first voice based on a voice recognition model, and determining recognition information corresponding to the first voice.
The voice recognition model is obtained by training based on target sample voice in a sample voice database, and the target sample voice is obtained by performing voice detection on initial sample voice; the voice detection comprises at least one of speech rate detection, keyword frequency detection, and blank detection.
Specifically, the voice recognition model is trained based on target sample voice, which is obtained through voice detection such as speech rate detection, keyword frequency detection, and blank detection; that is to say, the target sample voice of the present application has undergone voice detection, and its detection result meets the preset sample requirement.
Speech rate detection means detecting the speech rate of the voice data. For example, in an actual implementation, speech rate detection may examine the duration of the voice data, e.g., detect whether the duration is within a preset duration range; a longer duration indicates a slower speech rate, and a shorter duration indicates a faster speech rate.
The keyword frequency detection means detecting the number of keywords contained in the voice data within a preset time length. For example, if the keyword is "increase volume", the number of occurrences of the keyword "increase volume" contained in the voice data may be detected.
Blank detection means detecting the blank duration of the interval between adjacent speech segments in the voice data, i.e., the time without voice data formed when the user pauses after uttering a certain speech segment (for example, a keyword). As an example, the user may pause for 2 seconds after uttering the keyword "increase volume" and then utter it again; the 2-second pause is the blank duration, i.e., the blank.
In the embodiment of the present application, the voice detection result of the target sample voice meets the preset sample requirement. The preset sample requirement may include that the speech rate is within a preset speech rate range, that is, the duration of the voice data is within a preset duration range; it may also include that the keyword frequency is within a preset frequency range, that is, the number of keywords contained in the voice data is within a preset number range; and it may further include that the blank is within a preset blank duration range, that is, the blank duration of the interval between adjacent speech segments in the voice data is within the preset blank duration range.
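As a minimal sketch of these three checks (not the patent's implementation; the segment representation as (start, end) timestamps and all threshold values are assumptions):

```python
def passes_sample_checks(segments, transcripts, keyword,
                         dur_range=(0.4, 2.0),      # assumed speech-rate bounds (seconds)
                         count_range=(3, 10),       # assumed keyword-count bounds
                         blank_range=(1.0, 3.0)):   # assumed blank-duration bounds (seconds)
    """segments: list of (start, end) times of valid speech segments.
    transcripts: recognized text per segment.
    Returns True only if speech rate, keyword frequency, and blank
    detection all meet the preset sample requirement."""
    # Speech rate detection: every segment's duration within the preset range.
    if not all(dur_range[0] <= end - start <= dur_range[1]
               for start, end in segments):
        return False
    # Keyword frequency detection: number of segments containing the keyword.
    n_kw = sum(keyword in t for t in transcripts)
    if not (count_range[0] <= n_kw <= count_range[1]):
        return False
    # Blank detection: every gap between adjacent segments within the range.
    gaps = [segments[i + 1][0] - segments[i][1]
            for i in range(len(segments) - 1)]
    return all(blank_range[0] <= g <= blank_range[1] for g in gaps)
```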
Step S203: executing the processing operation corresponding to the first voice according to the recognition information.
Specifically, after performing voice recognition on the first voice to obtain the recognition information corresponding to it, the processing operation corresponding to the first voice may be executed according to the recognition information.
For example, in an actual scene, if the recognition information corresponding to the first voice is "turn on the air conditioner", the corresponding processing operation may be executed, that is, the air conditioner is controlled to be turned on.
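A minimal dispatch sketch of this step (the command strings and handler return values are illustrative assumptions, not part of the patent):

```python
def handle_recognition(info, actions):
    """info: recognition information (recognized text);
    actions: mapping from command text to a callable processing operation."""
    op = actions.get(info)
    if op is None:
        return "unrecognized command"
    return op()

# Hypothetical command table mapping recognition information to operations.
actions = {
    "turn on the air conditioner": lambda: "air conditioner on",
    "increase volume": lambda: "volume increased",
}
```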
In the embodiment of the present application, during voice recognition, a first voice is received, voice recognition is performed on the first voice based on a voice recognition model, and a processing operation corresponding to the first voice is executed. The target sample voice used to train the voice recognition model is obtained through voice detection, which comprises at least one of speech rate detection, keyword frequency detection, and blank detection; target sample voice with high voice quality is thus obtained through voice detection, improving the recognition accuracy of the voice recognition model.
In an embodiment of the present application, before performing speech recognition on the first speech, a step of screening a target sample speech may be further included, so as to construct a sample speech database through the target sample speech, where the specific steps are as follows:
receiving the initial sample speech;
and carrying out voice detection on the initial sample voice, and screening the initial sample voice meeting the preset sample requirement as target sample voice according to the detection result of the voice detection.
Specifically, the initial sample voice is the sample voice to be screened, and the initial sample voice may be all the received sample voices. For example, the initial sample voice may be voice data obtained by recording predetermined keywords at different collection distances for persons of different genders, different age groups, and different sound characteristics.
In an actual implementation scenario, before receiving the initial sample voice, the method may further include the step of waking up the voice collecting device:
specifically, a second voice may be received;
sending a prompt voice and/or displaying first prompt information under the condition that the second voice contains a target awakening word;
the prompt voice and the first prompt information indicate that the initial sample voice starts to be collected; wherein the first prompt message includes at least one of the following:
a target keyword;
collecting times of the target keywords;
the interval duration between target keywords.
The second voice can be a wake-up voice containing a wake-up word; when the second voice contains the target wake-up word, a prompt voice can be sent and the first prompt information displayed to the user.
As an example, in the scenario of collecting the initial sample voice, as shown in fig. 3 and fig. 4, the user may first wake up the voice collecting device through voice interaction; for example, the wake-up voice may be "hello, small A, start data collection". Through voice recognition and semantic understanding of the wake-up voice, in the case that the wake-up voice contains the target wake-up word "hello, small A", a prompt voice such as "hello, please record audio according to the on-screen prompt" is synthesized and played through natural language generation, speech synthesis, and similar technologies.
In addition, first prompt information can be displayed on a prompt screen, and may include the target keyword, the collection times of the target keyword, the interval duration between target keywords, and the like. For example, as shown in fig. 5, the prompt screen may display: please say "light up the screen" five times at a normal speech rate, with a pause of 2 seconds between each time. Here the target keyword is "light up the screen"; the collection times of the target keyword is five; and the interval duration between target keywords is 2 seconds.
In addition, in the above scenario, the initial sample voice may be collected through a collection device such as a microphone, a mobile phone, a tablet computer, or a microphone array. The number of collected persons may be set, for example, to 700-800; the ages of the collected users may be distributed from 18 to 65; and the collection distance may be set to, for example, 1 meter, 3 meters, or 5 meters.
After the initial sample voice is obtained, the method in this embodiment of the present application may further include a step of screening an effective speech segment in the initial sample voice, which specifically includes:
dividing the initial sample voice into a plurality of voice frames, and determining effective voice frames in the voice frames;
screening, as effective speech segments, speech segments in which the number of consecutive effective speech frames is greater than a preset number;
performing the voice detection on the initial sample voice containing the effective speech segments.
Specifically, the initial sample speech may be segmented into a plurality of speech frames, and then valid speech frames of the speech frames may be determined.
A valid speech frame may be understood as a speech frame containing the sound of the target object; for example, in a real scene, a valid speech frame may be a speech frame containing the voice of a real user, i.e. a speech frame containing a human voice.
In one embodiment, the specific step of determining a valid speech frame of the speech frames may include:
extracting acoustic features of the voice frame;
and determining the voice frame as the effective voice frame under the condition that the acoustic characteristics of the voice frame are determined to meet the preset characteristic conditions.
In an actual implementation process, in the embodiment of the present application, a Voice Activity Detection (VAD) technique may be used to determine a valid Voice frame.
In VAD techniques, a DNN is usually used as a classifier to calculate the probability that a speech frame belongs to a valid speech frame, which can be regarded as a binary classification problem. In the embodiment of the present application, the VAD uses ResNet34 as a feature extractor to extract the acoustic features of the speech frames; the numbers of convolution channels may be {32, 64, 128, 256}, and the convolution kernel size is 3. The acoustic feature may be an 80-dimensional log-Mel filterbank (Fbank) feature, with a frame length of 25 ms and a frame shift of 10 ms.
Pooling is then performed once every S frames through a pooling layer, and post-processing is finally performed through two BiLSTM layers, two fully connected layers, and a sigmoid activation function to obtain the probability that each speech frame is a valid speech frame.
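The 25 ms / 10 ms framing described above can be sketched as follows. This is a minimal pure-Python illustration rather than the patent's implementation; the 16 kHz sample rate and the function name are assumptions.

```python
# Minimal framing sketch. The 25 ms frame length and 10 ms frame shift come
# from the text; the 16 kHz sample rate and the function name are assumptions.
def split_into_frames(samples, sample_rate=16000, frame_ms=25, shift_ms=10):
    frame_len = int(sample_rate * frame_ms / 1000)  # 400 samples at 16 kHz
    shift = int(sample_rate * shift_ms / 1000)      # 160 samples at 16 kHz
    frames = []
    start = 0
    while start + frame_len <= len(samples):        # drop the ragged tail
        frames.append(samples[start:start + frame_len])
        start += shift
    return frames
```

Each frame would then be mapped to an 80-dimensional log-Mel Fbank vector by the feature extractor; that step is omitted here.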
After the valid speech frames are determined, valid speech segments can be screened; a valid speech segment is a run of multiple consecutive valid speech frames, i.e. one complete utterance.
In the actual implementation process, the speech frames can be detected one after another; when multiple consecutive speech frames are all valid, those consecutive valid frames form a valid speech segment. That is, the first valid speech frame is the "beginning" of the valid speech segment, and the last valid speech frame is its "end".
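The grouping of consecutive valid frames into valid speech segments described above can be sketched as follows; it is a pure-Python illustration in which the per-frame VAD decisions are assumed to be given, and `min_frames` stands in for the "preset number" of the text.

```python
# Pure-Python sketch: group runs of consecutive valid frames into segments.
# frame_flags holds the per-frame VAD decisions (assumed already computed);
# min_frames stands in for the "preset number" of the text.
def find_valid_segments(frame_flags, min_frames=5):
    segments, start = [], None
    for i, valid in enumerate(frame_flags):
        if valid and start is None:
            start = i                            # first valid frame: segment begins
        elif not valid and start is not None:
            if i - start > min_frames:           # keep only long-enough runs
                segments.append((start, i - 1))  # last valid frame: segment ends
            start = None
    if start is not None and len(frame_flags) - start > min_frames:
        segments.append((start, len(frame_flags) - 1))
    return segments
```

Each returned pair is the (first frame, last frame) of one valid speech segment, matching the "beginning"/"end" description above.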
In this embodiment, after the valid speech segments are determined, the voice detection may be performed on the initial sample voice containing the valid speech segments.
Specifically, in a case where the voice detection includes keyword frequency detection, the performing the voice detection on the initial sample voice includes:
performing voice recognition on the valid speech segments, and determining whether the valid speech segments contain the target keyword and whether the number of valid speech segments containing the target keyword is greater than a preset threshold.
Optionally, the keywords in the valid speech segments may be detected through a speech recognition technique to determine whether each valid speech segment contains the target keyword and to count the number of valid speech segments that do. For example, if the initial sample voice includes 3 valid speech segments, it can be determined whether each of them contains the target keyword.
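A sketch of this keyword frequency check: the ASR transcripts of the valid speech segments are assumed to be available already, and the function and parameter names are illustrative.

```python
# Illustrative keyword frequency check. transcripts holds the ASR output for
# each valid speech segment; the recognizer itself is not sketched here.
def keyword_frequency_check(transcripts, target_keyword, threshold):
    # a segment "contains" the keyword if it appears in its transcript
    hits = [t for t in transcripts if target_keyword in t]
    return len(hits), len(hits) > threshold
```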
In addition, in a case where the voice detection includes speech rate detection, the performing the voice detection on the initial sample voice may include:
and comparing the speech segment duration of the effective speech segment with a second standard duration range, and determining whether the speech segment duration is within the second standard duration range.
Specifically, the speech segment duration of each valid speech segment may be detected through the VAD technique; with reference to the schematic diagram of speech rate detection shown in fig. 6, the durations denoted t1 and t2 in fig. 6 are the speech segment durations of the corresponding valid speech segments.
It is then determined whether each speech segment duration is within the second standard duration range, where the standard duration range of a valid speech segment (i.e. the second standard duration range) may be preset, for example, 3 seconds to 4 seconds.
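The speech rate check thus reduces to a range comparison on the segment duration; a minimal sketch follows, with the 3-4 second range of the example as the default (an assumption, since the range is configurable).

```python
# Speech rate check as a duration range comparison; the 3-4 second default
# mirrors the example in the text but is an assumed, configurable value.
def speech_rate_check(segment_duration_s, standard_range=(3.0, 4.0)):
    low, high = standard_range
    return low <= segment_duration_s <= high
```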
Further, in a case where the voice detection includes blank detection, the performing the voice detection on the initial sample voice may include:
determining the interval duration between the adjacent effective speech segments;
it is determined whether the interval duration is within a first standard duration range.
With reference to the schematic diagram of blank detection shown in fig. 7, the embodiment of the present application may detect the interval duration between two adjacent valid speech segments through the VAD technique; this interval duration is the blank, i.e. the durations denoted t3, t4 and t5 in fig. 7. It is then determined whether each interval duration is within a first standard duration range, where the standard duration range of the blank (i.e. the first standard duration range) may be preset, for example, 1 second to 2 seconds.
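A sketch of the blank check, under the assumption that segments are given as frame index pairs and that gaps are converted to seconds via the 10 ms frame shift; all names are illustrative.

```python
# Blank (interval) check. Segments are assumed to be (start_frame, end_frame)
# index pairs; the gap is converted to seconds via the 10 ms frame shift.
def blank_check(segments, standard_range=(1.0, 2.0), shift_s=0.010):
    low, high = standard_range
    gaps = [(s2 - e1 - 1) * shift_s                      # frames between segments
            for (_, e1), (s2, _) in zip(segments, segments[1:])]
    return gaps, all(low <= g <= high for g in gaps)
```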
In an embodiment of the present application, the screening, according to a detection result of the voice detection, the initial sample voice meeting a preset sample requirement as a target sample voice includes:
under the condition that the voice detection comprises single detection, determining the initial sample voice of which the detection result of the single detection meets the requirement of a preset sample as target sample voice;
and under the condition that the voice detection comprises a plurality of detections, determining the initial sample voice with at least a preset number of detection results meeting preset sample requirements in the plurality of detections as target sample voice.
Specifically, when the voice detection includes a single detection (for example, only speech rate detection, or only blank detection), the initial sample voice may be determined as the target sample voice when the detection result of that single detection satisfies the preset sample requirement.
When the voice detection includes a plurality of detections (for example, speech rate detection, keyword frequency detection and blank detection), the initial sample voice may be determined as the target sample voice when at least a preset number of the detection results satisfy the preset sample requirement. For example, when the voice detection includes all three detections, the initial sample voice may be determined as the target sample voice when the detection results of at least two of them satisfy the preset sample requirement.
The preset sample requirement may include: the speech rate being within a preset speech rate range, i.e. the duration of the voice data being within a preset duration range; the keyword frequency being within a preset frequency range, i.e. the number of keywords contained in the voice data being within a preset number range; and the blank duration being within a preset duration range, i.e. the interval between adjacent speech segments in the voice data being within a preset duration range.
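The single-detection / multiple-detection screening rule above can be sketched as follows; the detection names are illustrative, and `min_passed=2` follows the two-out-of-three example in the text.

```python
# Screening rule sketch: with one detection it alone must pass; with several,
# at least min_passed of the detection results must pass.
def screen_sample(detection_results, min_passed=2):
    passed = sum(detection_results.values())  # count of passing detections
    if len(detection_results) == 1:           # single-detection case
        return passed == 1
    return passed >= min_passed               # multiple-detection case
```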
In an embodiment of the present application, when the initial sample speech does not satisfy the preset sample condition, the embodiment of the present application may further include the following processing steps:
displaying second prompt information under the condition that the initial sample voice does not meet the preset sample condition;
the second prompt indicates to re-collect the initial sample speech.
Specifically, in an actual scene, when the initial sample voice does not meet the preset sample condition, the user may be prompted to re-collect the initial sample voice through the prompt screen.
The following describes a complete process for acquiring an initial sample voice and determining a target sample voice in this embodiment with reference to fig. 8 and 9:
as shown in fig. 8, it is a schematic diagram of a system architecture for voice acquisition, and the system architecture includes a voice interaction module, a large screen prompt module, a quality detection module, and a microphone module.
The voice interaction module is used for interacting with a user, prompting the user to perform voice acquisition, voice re-acquisition and the like; the large screen word-lifting module is used for displaying target keywords, the speed of speech, the left white and the like in the collected speech; the quality detection module is used for carrying out voice detection on the collected initial sample voice; the microphone module is used for collecting initial sample voice.
As shown in fig. 9, after collection starts, a prompt message such as "please say 'light up the screen' five times at a normal speech rate, pausing 2 seconds between each repetition" is displayed on the prompt screen, and a prompt voice "please record audio according to the on-screen prompt" may be played accordingly to prompt the user to perform voice collection.
After the initial sample voice of the user is received, the valid speech segments are determined through voice activity detection. Voice recognition is then performed on each valid speech segment to determine whether it contains the target keyword; if not, the user may be prompted to speak the correct target keyword. If the target keyword is contained, speech rate detection may be performed on the valid speech segment to determine whether the speech rate meets the requirement; if not, the user is prompted to pronounce at the prompted speech rate. If the speech rate meets the requirement, the blanks between valid speech segments are detected; if a blank does not meet the requirement, the user may be prompted to mind the pauses in pronunciation. If the blanks meet the requirement, it may be detected whether the number of valid speech segments containing the target keyword meets the requirement, i.e. whether the number of recordings is sufficient; if not, the user is prompted to continue recording the current target keyword. If the recording-count requirement is met, the recording can be saved, and collection can then switch to another target keyword.
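The sequential quality-check flow of fig. 9 can be sketched as follows; the prompt strings, parameter names and default values are illustrative assumptions, and the ASR transcripts, per-segment durations and inter-segment gaps are assumed to be given.

```python
# Sketch of the sequential quality-check flow of fig. 9. Prompt strings,
# parameter names and defaults are illustrative; the ASR transcripts,
# per-segment durations and inter-segment gaps are assumed to be given.
def quality_check(transcripts, durations, gaps, target_kw,
                  need_count=5, rate_rng=(3.0, 4.0), gap_rng=(1.0, 2.0)):
    hits = [i for i, t in enumerate(transcripts) if target_kw in t]
    if not hits:                                          # keyword check
        return "please speak the correct target keyword"
    if not all(rate_rng[0] <= durations[i] <= rate_rng[1] for i in hits):
        return "please pronounce at the prompted speech rate"
    if not all(gap_rng[0] <= g <= gap_rng[1] for g in gaps):
        return "please mind the pauses between utterances"
    if len(hits) < need_count:                            # recording count check
        return "please continue recording the current keyword"
    return "ok"                                           # recording can be saved
```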
In the embodiment of the application, in the process of voice recognition, first voice is received, voice recognition is carried out on the first voice on the basis of a voice recognition model, and processing operation corresponding to the first voice is executed; the target sample voice adopted in the voice recognition model training process is obtained through voice detection, and the voice detection comprises at least one of voice speed detection, keyword frequency detection and blank detection; therefore, the target sample voice with higher voice quality is obtained through voice detection, and the voice recognition precision of the voice recognition model is improved.
An embodiment of the present application provides a voice data processing apparatus, and as shown in fig. 10, the voice data processing apparatus 100 may include: a receiving module 1001, an identifying module 1002, and an executing module 1003, wherein,
a receiving module 1001, configured to receive a first voice;
a recognition module 1002, configured to perform speech recognition on the first speech based on a speech recognition model, and determine recognition information corresponding to the first speech; the voice recognition model is obtained by training based on target sample voice in a sample voice database, and the target sample voice is obtained by performing voice detection on initial sample voice; wherein the voice detection comprises: at least one of speech rate detection, keyword frequency detection and whiteout detection;
an executing module 1003, configured to execute, according to the identification information, a processing operation corresponding to the first voice.
In one embodiment of the present application, the apparatus further comprises: a screening module, configured to, before the voice recognition is performed on the first voice,
receiving the initial sample speech;
and performing the voice detection on the initial sample voice, and screening the initial sample voice meeting the preset sample requirement as target sample voice according to the detection result of the voice detection.
In an embodiment of the present application, the screening module is specifically configured to segment the initial sample speech into a plurality of speech frames, and determine the valid speech frames among the speech frames;
screen, as valid speech segments, segments in which the number of consecutive valid speech frames is greater than a preset number;
perform the voice detection on the initial sample speech containing the valid speech segments.
In an embodiment of the present application, the screening module is specifically configured to extract acoustic features of the speech frame;
and determining the voice frame as the effective voice frame under the condition that the acoustic characteristics of the voice frame are determined to meet the preset characteristic conditions.
In an embodiment of the present application, in a case that the voice detection includes whiteout detection, the screening module is specifically configured to determine an interval duration between adjacent effective speech segments;
it is determined whether the interval duration is within a first standard duration range.
In an embodiment of the application, in a case that the voice detection includes speech rate detection, the screening module is specifically configured to compare a speech segment duration of the effective speech segment with a second standard duration range, and determine whether the speech segment duration is within the second standard duration range.
In an embodiment of the application, under the condition that the voice detection includes keyword frequency detection, the screening module is specifically configured to perform voice recognition on the valid speech segments, determine whether the valid speech segments include target keywords, and determine whether the number of the valid speech segments including the target keywords is greater than a preset threshold.
In an embodiment of the application, the screening module is specifically configured to determine, as the target sample speech, the initial sample speech of which the detection result of the single detection meets a preset sample requirement, when the speech detection includes the single detection;
and under the condition that the voice detection comprises a plurality of detections, determining the initial sample voice with at least a preset number of detection results meeting preset sample requirements in the plurality of detections as target sample voice.
In one embodiment of the present application, the apparatus further comprises: a first hinting module to, prior to said receiving the initial sample speech,
receiving a second voice;
sending a prompt voice and/or displaying first prompt information under the condition that the second voice contains a target awakening word;
the prompt voice and the first prompt information indicate that the initial sample voice starts to be collected; wherein the first prompt message includes at least one of the following:
a target keyword;
collecting times of the target keywords;
the interval duration between target keywords.
In one embodiment of the present application, the apparatus further comprises: the second prompt module is used for displaying second prompt information under the condition that the initial sample voice does not meet the preset sample condition;
the second prompt indicates to re-collect the initial sample speech.
The apparatus of the embodiment of the present application may execute the method provided by the embodiment of the present application, and the implementation principle is similar, the actions executed by the modules in the apparatus of the embodiments of the present application correspond to the steps in the method of the embodiments of the present application, and for the detailed functional description of the modules of the apparatus, reference may be specifically made to the description in the corresponding method shown in the foregoing, and details are not repeated here.
In the embodiment of the application, in the process of voice recognition, receiving first voice, performing voice recognition on the first voice based on a voice recognition model, and executing processing operation corresponding to the first voice; the target sample voice adopted in the voice recognition model training process is obtained through voice detection, and the voice detection comprises at least one of voice speed detection, keyword frequency detection and blank detection; therefore, the target sample voice with higher voice quality is obtained through voice detection, and the voice recognition precision of the voice recognition model is improved.
An embodiment of the present application provides an electronic device, including: a memory and a processor; at least one program stored in the memory for execution by the processor, which when executed by the processor, implements: in the embodiment of the application, in the process of voice recognition, first voice is received, voice recognition is carried out on the first voice on the basis of a voice recognition model, and processing operation corresponding to the first voice is executed; the target sample voice adopted in the voice recognition model training process is obtained through voice detection, and the voice detection comprises at least one of voice speed detection, keyword frequency detection and blank detection; therefore, the target sample voice with higher voice quality is obtained through voice detection, and the voice recognition precision of the voice recognition model is improved.
In an alternative embodiment, an electronic device is provided, as shown in fig. 11, the electronic device 4000 shown in fig. 11 comprising: a processor 4001 and a memory 4003. Processor 4001 is coupled to memory 4003, such as via bus 4002. Optionally, the electronic device 4000 may further include a transceiver 4004, and the transceiver 4004 may be used for data interaction between the electronic device and other electronic devices, such as transmission of data and/or reception of data. In addition, the transceiver 4004 is not limited to one in practical applications, and the structure of the electronic device 4000 is not limited to the embodiment of the present application.
The Processor 4001 may be a CPU (Central Processing Unit), a general-purpose Processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other Programmable logic device, a transistor logic device, a hardware component, or any combination thereof. Which may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the disclosure. The processor 4001 may also be a combination that performs a computational function, including, for example, a combination of one or more microprocessors, a combination of a DSP and a microprocessor, or the like.
Bus 4002 may include a path that carries information between the aforementioned components. The bus 4002 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 4002 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 11, but this is not intended to represent only one bus or type of bus.
The Memory 4003 may be a ROM (Read Only Memory) or other types of static storage devices that can store static information and instructions, a RAM (Random Access Memory) or other types of dynamic storage devices that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read Only Memory), a CD-ROM (Compact Disc Read Only Memory) or other optical Disc storage, optical Disc storage (including Compact Disc, laser Disc, optical Disc, digital versatile Disc, blu-ray Disc, etc.), a magnetic Disc storage medium or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to these.
The memory 4003 is used for storing application program codes (computer programs) for executing the present scheme, and is controlled by the processor 4001 to execute. Processor 4001 is configured to execute application code stored in memory 4003 to implement what is shown in the foregoing method embodiments.
Among them, electronic devices include but are not limited to: mobile phones, notebook computers, multimedia players, desktop computers, and the like.
The present application provides a computer-readable storage medium, on which a computer program is stored, which, when running on a computer, enables the computer to execute the corresponding content in the foregoing method embodiments.
In the embodiment of the application, in the process of voice recognition, receiving first voice, performing voice recognition on the first voice based on a voice recognition model, and executing processing operation corresponding to the first voice; the target sample voice adopted in the voice recognition model training process is obtained through voice detection, and the voice detection comprises at least one of voice speed detection, keyword frequency detection and blank space detection; therefore, target sample voice with high voice quality is obtained through voice detection, and the voice recognition accuracy of the voice recognition model is improved.
The terms "first," "second," "third," "fourth," "1," "2," and the like in the description and in the claims of the present application and in the above-described drawings (if any) are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used are interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in other sequences than illustrated or otherwise described herein.
It should be understood that, although each operation step is indicated by an arrow in the flowchart of the embodiment of the present application, the implementation order of the steps is not limited to the order indicated by the arrow. In some implementation scenarios of the embodiments of the present application, the implementation steps in the flowcharts may be performed in other sequences as desired, unless explicitly stated otherwise herein. In addition, some or all of the steps in each flowchart may include multiple sub-steps or multiple stages based on an actual implementation scenario. Some or all of these sub-steps or stages may be performed at the same time, or each of these sub-steps or stages may be performed at different times. Under the scenario that the execution time is different, the execution sequence of the sub-steps or phases may be flexibly configured according to the requirement, which is not limited in the embodiment of the present application.
The foregoing is only an optional implementation manner of a part of implementation scenarios in this application, and it should be noted that, for those skilled in the art, other similar implementation means based on the technical idea of this application are also within the protection scope of the embodiments of this application without departing from the technical idea of this application.

Claims (13)

1. A method for processing voice data, comprising:
receiving a first voice;
performing voice recognition on the first voice based on a voice recognition model, and determining recognition information corresponding to the first voice; the voice recognition model is obtained by training based on target sample voice in a sample voice database, and the target sample voice is obtained by performing voice detection on initial sample voice; wherein the voice detection comprises: at least one of speech rate detection, keyword frequency detection and whiteout detection;
and executing processing operation corresponding to the first voice according to the identification information.
2. The method of processing speech data according to claim 1, wherein prior to said performing speech recognition on said first speech, said method further comprises:
receiving the initial sample speech;
and performing the voice detection on the initial sample voice, and screening the initial sample voice meeting the preset sample requirement as target sample voice according to the detection result of the voice detection.
3. The method of claim 2, wherein the performing the voice detection on the initial sample voice comprises:
dividing the initial sample voice into a plurality of voice frames, and determining effective voice frames in the voice frames;
screening, as effective speech segments, segments in which the number of consecutive effective speech frames is greater than a preset number;
performing the speech detection on the initial sample speech including the effective speech segments.
4. The method of claim 3, wherein the determining a valid one of the speech frames comprises:
extracting acoustic features of the voice frame;
and determining the voice frame as the effective voice frame under the condition that the acoustic characteristics of the voice frame are determined to meet the preset characteristic conditions.
5. The method according to claim 3, wherein in the case where the speech detection includes a white space detection, the performing the speech detection on the initial sample speech includes:
determining the interval duration between the adjacent effective speech segments;
it is determined whether the interval duration is within a first standard duration range.
6. The method according to claim 3, wherein in the case where the speech detection includes speech rate detection, the performing the speech detection on the initial sample speech includes:
and comparing the speech segment duration of the effective speech segment with a second standard duration range, and determining whether the speech segment duration is within the second standard duration range.
7. The method according to claim 3, wherein in the case that the speech detection includes keyword frequency detection, the performing the speech detection on the initial sample speech includes:
and carrying out voice recognition on the effective speech segments, and determining whether the effective speech segments contain target keywords and whether the number of the effective speech segments containing the target keywords is greater than a preset threshold.
8. The method according to claim 2, wherein the screening the initial sample speech satisfying a predetermined sample requirement as the target sample speech according to the detection result of the speech detection comprises:
under the condition that the voice detection comprises single detection, determining the initial sample voice of which the detection result of the single detection meets the requirement of a preset sample as target sample voice;
and under the condition that the voice detection comprises a plurality of detections, determining the initial sample voice with at least a preset number of detection results meeting preset sample requirements in the plurality of detections as target sample voice.
9. The method of processing speech data according to claim 2, wherein prior to said receiving the initial sample speech, the method further comprises:
receiving a second voice;
sending a prompt voice and/or displaying first prompt information under the condition that the second voice contains a target awakening word;
the prompt voice and the first prompt information indicate that the initial sample voice starts to be collected; wherein the first prompt message includes at least one of the following:
a target keyword;
collecting times of the target keywords;
the interval duration between target keywords.
10. The speech data processing method of claim 2, wherein the method further comprises:
under the condition that the initial sample voice does not meet the preset sample condition, displaying second prompt information;
the second prompt indicates to re-collect the initial sample speech.
11. A speech data processing apparatus, comprising:
the receiving module is used for receiving first voice;
the recognition module is used for carrying out voice recognition on the first voice based on a voice recognition model and determining recognition information corresponding to the first voice; the voice recognition model is obtained by training based on target sample voice in a sample voice database, and the target sample voice is obtained by performing voice detection on initial sample voice; wherein the voice detection comprises: at least one of speech rate detection, keyword frequency detection and whiteout detection;
and the execution module is used for executing the processing operation corresponding to the first voice according to the identification information.
12. An electronic device, characterized in that the electronic device comprises:
one or more processors;
a memory;
one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more programs configured to: -performing a speech data processing method according to any of claims 1 to 10.
13. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the speech data processing method of any one of claims 1 to 10.
CN202211184669.1A 2022-09-27 2022-09-27 Voice data processing method and device and electronic equipment Pending CN115713929A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211184669.1A CN115713929A (en) 2022-09-27 2022-09-27 Voice data processing method and device and electronic equipment


Publications (1)

Publication Number Publication Date
CN115713929A true CN115713929A (en) 2023-02-24

Family

ID=85230839



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination