CN113205800A - Audio recognition method and device, computer equipment and storage medium - Google Patents

Audio recognition method and device, computer equipment and storage medium

Info

Publication number
CN113205800A
CN113205800A (application CN202110436379.0A)
Authority
CN
China
Prior art keywords
audio stream
segment
audio
acoustic model
inputting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110436379.0A
Other languages
Chinese (zh)
Other versions
CN113205800B (en)
Inventor
赵晴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
JD Digital Technology Holdings Co Ltd
Original Assignee
JD Digital Technology Holdings Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by JD Digital Technology Holdings Co Ltd filed Critical JD Digital Technology Holdings Co Ltd
Priority to CN202110436379.0A priority Critical patent/CN113205800B/en
Publication of CN113205800A publication Critical patent/CN113205800A/en
Application granted granted Critical
Publication of CN113205800B publication Critical patent/CN113205800B/en
Status: Active

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G10L15/04 - Segmentation; Word boundary detection
    • G10L19/00 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 - Speech or audio signals analysis-synthesis techniques for redundancy reduction using predictive techniques
    • G10L19/16 - Vocoder architecture
    • G10L19/167 - Audio streaming, i.e. formatting and decoding of an encoded audio signal representation into a data stream for transmission or storage purposes
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques specially adapted for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The application relates to an audio recognition method and apparatus, a computer device, and a storage medium. The method comprises the following steps: receiving audio stream information, the audio stream information comprising an audio stream sampling rate; acquiring input control parameters of an acoustic model and decoding parameters of a decoder according to the audio stream sampling rate; receiving an audio stream segment; inputting the audio stream segment into the acoustic model according to the input control parameters to obtain a score list; and inputting the score list into the decoder according to the decoding parameters to obtain the recognition result of the audio stream segment. In the embodiments of the application, the input control parameters of the acoustic model and the decoding parameters of the decoder are obtained according to the sampling rate of the received audio stream, and the recognition result is obtained from the acoustic model and the decoder, so that multiple sets of systems do not need to be maintained for different sampling rates, which reduces cost.

Description

Audio recognition method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of data processing, and in particular, to an audio recognition method and apparatus, a computer device, and a storage medium.
Background
With the continuous progress of technology, voice interaction is applied more and more widely, for example in intelligent outbound-call robots and intelligent customer-service quality inspection.
In the voice interaction process, in order to improve the user experience, the user's voice input needs to be processed promptly to reduce response delay. For example, in an intelligent outbound-call scenario, a product must be able to accurately and quickly recognize speech into text through a speech recognition server, derive the user's intent through natural language processing, and then make a corresponding reply to complete a round of conversation.
However, different application scenarios often place different requirements on the voice stream: outbound calls typically produce 8 kHz audio streams, conferences typically produce 16 kHz audio streams, and different application scenarios and devices generate different audio streams. To support these different services, a voice service must maintain multiple sets of similar systems, which results in significant resource consumption and labor maintenance costs.
Disclosure of Invention
To solve the above technical problem or at least partially solve the above technical problem, the present application provides an audio recognition method, apparatus, computer device and storage medium.
In a first aspect, the present application provides an audio recognition method, including:
receiving audio stream information, wherein the audio stream information comprises: an audio stream sampling rate;
acquiring input control parameters of an acoustic model and decoding parameters of a decoder according to the audio stream sampling rate;
receiving an audio stream segment;
inputting the audio stream segment into an acoustic model according to the input control parameters to obtain a score list;
and inputting the score list into a decoder according to the decoding parameters to obtain the recognition result of the audio stream segment.
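The five steps above can be sketched end to end as follows. This is a minimal illustration only: `get_params`, `recognize`, the parameter values, and the callable `acoustic_model`/`decoder` interfaces are assumptions, since the patent does not specify concrete data structures.

```python
# Minimal sketch of the claimed flow: derive parameters from the sampling
# rate, then run each incoming segment through acoustic model and decoder.

def get_params(sampling_rate):
    """Derive input control parameters and decoding parameters from the
    audio stream sampling rate (the values here are illustrative only)."""
    preset_threshold = sampling_rate * 2       # bytes in ~1 s of 16-bit audio
    preset_data_volume = sampling_rate * 2
    decoding_params = {"rate": sampling_rate}
    control_params = {"threshold": preset_threshold,
                      "data_volume": preset_data_volume}
    return control_params, decoding_params

def recognize(stream_info, segments, acoustic_model, decoder):
    """Recognize a sequence of audio stream segments."""
    control, decoding = get_params(stream_info["sampling_rate"])
    results = []
    for seg in segments:                       # segments arrive continuously
        scores = acoustic_model(seg, control)  # score list for this segment
        results.append(decoder(scores, decoding))
    return results
```

A single acoustic model and decoder serve both 8 kHz and 16 kHz streams; only the derived parameters differ.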
In an embodiment of the present application, the input control parameters include: a preset threshold and a preset data volume;
the inputting the audio stream segment into the acoustic model according to the input control parameters to obtain a score list comprises:
acquiring all audio stream segments in a memory;
judging whether the total data volume of all the audio stream segments is greater than the preset threshold;
if the total data volume of all the audio stream segments is greater than the preset threshold, sequentially acquiring a plurality of audio stream segments from back to front in order of timestamp, and taking the sum of the plurality of audio stream segments as the audio stream to be recognized, such that the total data volume of the audio stream to be recognized equals the preset data volume, wherein the audio stream segment corresponding to the last timestamp is the current audio stream segment;
acquiring a first score list corresponding to the audio stream to be recognized according to the acoustic model;
and screening out, from the first score list, a second score list corresponding to the current audio stream segment.
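The back-to-front window selection described above can be sketched as follows; `select_window` and the `(timestamp, bytes)` buffer layout are assumed names, not taken from the patent.

```python
# Sketch: when the buffered total exceeds the preset threshold, walk the
# segments from the latest timestamp backwards until the accumulated data
# volume reaches the preset data volume.

def select_window(segments, preset_threshold, preset_data_volume):
    """segments: list of (timestamp, data_bytes) ordered by timestamp.
    Returns the segments forming the audio stream to be recognized."""
    total = sum(len(data) for _, data in segments)
    if total <= preset_threshold:
        return list(segments)                  # below-threshold branch
    window, acc = [], 0
    for ts, data in reversed(segments):        # back to front by timestamp
        window.insert(0, (ts, data))
        acc += len(data)
        if acc >= preset_data_volume:
            break
    return window
```

The segment with the last timestamp in the returned window is the current audio stream segment whose scores are later screened out.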
In this embodiment of the present application, the inputting the score list into the decoder according to the decoding parameters to obtain the recognition result comprises:
inputting the second score list and the decoding parameters corresponding to the last timestamp into the decoder to obtain the recognition result of the current audio stream segment;
after obtaining the recognition result of the current audio stream segment, the method further comprises:
generating and storing the decoding parameters corresponding to the current timestamp.
In this embodiment of the application, before receiving the audio stream information, the method further comprises:
receiving a long-connection request;
establishing a long connection with the client according to the request;
receiving authentication information sent by the client over the long connection;
and authenticating the identity of the client according to the authentication information; if authentication passes, the audio stream information is allowed to be received.
In the embodiment of the application, the audio stream information received for the first time further comprises a single-transmission data volume;
the acquiring the input control parameters of the acoustic model and the decoding parameters of the decoder according to the audio stream sampling rate comprises:
acquiring an initialized decoding parameter according to the audio stream sampling rate and the single-transmission data volume;
and acquiring the preset threshold and the preset data volume according to the audio stream sampling rate and the single-transmission data volume.
In this embodiment of the application, after receiving the audio stream segment and before inputting the audio stream segment into the acoustic model, the method further includes:
storing the audio stream segment in a memory, wherein the stored audio stream segment carries a timestamp.
In this embodiment of the present application, if the total data volume of all audio stream segments is less than or equal to the preset threshold, the method further comprises:
all audio stream segments in the memory are used as audio streams to be identified;
acquiring a third score list corresponding to the audio stream to be identified according to the acoustic model;
inputting the third score list and the initialized decoding parameters into the decoder to obtain the recognition result of the audio stream to be recognized, and taking that result as the recognition result of the current audio stream segment;
and initializing the decoding parameters.
In an embodiment of the present application, the method further includes:
judging whether the recognition results of all received audio stream segments have been sent;
if sending is finished, judging whether an audio-stream-transmission end flag sent by the client has been received;
and if the audio-stream-transmission end flag has been received, disconnecting the long connection with the client.
In this embodiment of the application, before acquiring all audio stream segments in the memory, the method further includes:
judging whether the received audio stream segment is a mute segment;
if it is not a mute segment, judging whether the audio stream segment is complete;
and if the audio stream segment is complete, storing the audio stream segment in the memory.
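The two pre-storage checks can be sketched as follows. The energy threshold, the fixed expected segment size, and the function names are assumptions; the patent does not define how muteness or completeness is tested.

```python
# Sketch: a simple energy-based mute test and a length-based completeness
# test applied before a segment is stored in memory.
import array

def is_mute(pcm_bytes, energy_threshold=10.0):
    """Treat a 16-bit PCM segment as mute if its mean absolute amplitude
    falls below the (assumed) energy threshold."""
    samples = array.array("h", pcm_bytes)      # signed 16-bit samples
    if not samples:
        return True
    return sum(abs(s) for s in samples) / len(samples) < energy_threshold

def store_if_valid(memory, timestamp, pcm_bytes, expected_size):
    """Store the segment with its timestamp only if it passes both checks."""
    if is_mute(pcm_bytes):
        return False                           # discard mute segments
    if len(pcm_bytes) != expected_size:
        return False                           # discard incomplete segments
    memory.append((timestamp, pcm_bytes))
    return True
```

Completeness here simply means the segment matches the single-transmission data volume, since every segment is supposed to carry exactly that amount.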
In a second aspect, there is provided an audio recognition apparatus, the apparatus comprising:
a receiving unit, configured to receive audio stream information, where the audio stream information includes: an audio stream sampling rate;
a processing unit, configured to acquire input control parameters of an acoustic model and decoding parameters of a decoder according to the audio stream sampling rate;
the receiving unit is further configured to receive an audio stream segment;
a model unit, configured to input the audio stream segment into the acoustic model according to the input control parameters to obtain a score list;
and a decoding unit, configured to input the score list into the decoder according to the decoding parameters to obtain the recognition result of the audio stream segment.
In a third aspect, a computer device is provided, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of any of the above methods when executing the computer program.
In a fourth aspect, a computer-readable storage medium is provided, having stored thereon a computer program which, when being executed by a processor, carries out the steps of the method of any of the above.
The audio recognition method and apparatus, computer device, and storage medium described above comprise the following steps: receiving audio stream information, the audio stream information comprising an audio stream sampling rate; acquiring input control parameters of an acoustic model and decoding parameters of a decoder according to the audio stream sampling rate; receiving an audio stream segment; inputting the audio stream segment into the acoustic model according to the input control parameters to obtain a score list; and inputting the score list into the decoder according to the decoding parameters to obtain the recognition result of the audio stream segment. In the embodiments of the application, the input control parameters of the acoustic model and the decoding parameters of the decoder are obtained according to the sampling rate of the received audio stream, and the recognition result is obtained from the acoustic model and the decoder, so that multiple sets of systems do not need to be maintained for different sampling rates, which reduces cost.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive exercise.
FIG. 1 is a diagram illustrating an application environment of an audio recognition method according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating an audio recognition method according to an embodiment of the invention;
FIG. 3 is a flowchart illustrating an audio recognition method according to an embodiment of the invention;
FIG. 4 is a flowchart illustrating an audio recognition method according to an embodiment of the invention;
FIG. 5 is a flowchart illustrating an audio recognition method according to an embodiment of the invention;
FIG. 6 is a block diagram of an audio recognition apparatus according to an embodiment of the present invention;
FIG. 7 is a diagram of the internal structure of a computer device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
FIG. 1 is a diagram of an exemplary audio recognition system. Referring to FIG. 1, the audio recognition method is applied to an audio recognition system comprising a client 110 and a server 120, which are connected through a network. The client 110 may be a desktop client or a mobile client; the mobile client may be at least one of a mobile phone, a tablet computer, a notebook computer, and the like. The server 120 may be implemented as a stand-alone server or as a server cluster composed of a plurality of servers.
In one embodiment, as shown in FIG. 2, an audio recognition method is provided. This embodiment is illustrated mainly by applying the method to the server 120 in FIG. 1. Referring to FIG. 2, the audio recognition method includes:
step 210, receiving audio stream information, where the audio stream information includes: audio stream sampling rate.
Before step 210, that is, before receiving the audio stream information, the method further includes:
receiving a long-connection request;
establishing a long connection with the client according to the request;
receiving authentication information sent by the client over the long connection;
and authenticating the identity of the client according to the authentication information; if authentication passes, the audio stream information is allowed to be received.
If authentication fails, the audio stream information is not allowed to be received.
And step 220, acquiring input control parameters of the acoustic model and decoding parameters of a decoder according to the audio stream sampling rate.
In the embodiment of the application, audio streams may have different sampling rates, but the same acoustic model is used for all of them. After the acoustic model is initialized and established, it can be trained on audio streams with different sampling rates, so that it can recognize audio streams with different sampling rates.
The input control parameters of the acoustic model include a preset threshold and a preset data volume. The decoding parameters of the decoder are explained in detail below.
In the embodiment of the present application, the acoustic model refers to a model obtained by training through a training set, where the training set may be audio and/or a label.
In the prior art, a commonly used model structure is CNN + Transformer + CTC. Many models adopt this structure; the training method used by the acoustic model in this embodiment of the application may be as follows:
acquiring first linear spectra corresponding to audio to be trained at different sampling rates, wherein the abscissa of a first linear spectrum is the frame index, the ordinate is the frequency-bin index, and the value at a coordinate point determined by the abscissa and ordinate is the original amplitude of the audio to be trained; determining the maximum sampling rate among the different sampling rates and the other sampling rates besides the maximum; determining the maximum frequency-bin index of the first linear spectra corresponding to the other sampling rates as a first frequency-bin index; determining the maximum frequency-bin index of the first linear spectrum corresponding to the maximum sampling rate as a second frequency-bin index; in the first linear spectra corresponding to the other sampling rates, setting the amplitude at every frequency-bin index greater than the first frequency-bin index and less than or equal to the second frequency-bin index to zero, to obtain second linear spectra corresponding to the other sampling rates; determining first speech features of the audio to be trained at the maximum sampling rate from first Mel-spectrum features of the first linear spectrum corresponding to the maximum sampling rate; determining second speech features of the audio to be trained at the other sampling rates from second Mel-spectrum features of the second linear spectra corresponding to the other sampling rates; and training a machine learning model using the first speech features and the second speech features.
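The zeroing step above (padding lower-rate spectra with zero-valued high-frequency bins so all inputs share the highest-rate shape) can be illustrated with a small NumPy sketch. The shapes, FFT sizes, and the function name `pad_spectrum` are assumptions, not taken from the patent.

```python
# Sketch: extend a lower-sampling-rate linear spectrum to the shape of the
# maximum-rate spectrum, with the extra high-frequency bins set to zero.
import numpy as np

def pad_spectrum(spec, first_freq_idx, second_freq_idx):
    """spec: (num_frames, first_freq_idx + 1) linear spectrum of lower-rate
    audio. Returns a (num_frames, second_freq_idx + 1) spectrum whose bins
    above first_freq_idx (up to second_freq_idx) are zero."""
    frames = spec.shape[0]
    padded = np.zeros((frames, second_freq_idx + 1), dtype=spec.dtype)
    padded[:, : first_freq_idx + 1] = spec      # copy the real bins
    return padded                               # the rest stays zero
```

Mel-spectrum features would then be computed from the padded spectra, giving one model a consistent input shape across sampling rates.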
In this embodiment of the present application, the acoustic model may also have other structures, and other training methods may be used; the particular acoustic model structure and training method adopted do not affect the method of the present application and are not described again here.
Step 230, an audio stream segment is received.
In the embodiment of the present application, the client continuously sends the audio stream segment, and thus the server also continuously receives the audio stream segment.
The audio stream information received for the first time further comprises a single-transmission data volume.
Acquiring the input control parameters of the acoustic model and the decoding parameters of the decoder according to the audio stream sampling rate includes:
acquiring an initialized decoding parameter according to the audio stream sampling rate and the single-transmission data volume;
and acquiring the preset threshold and the preset data volume according to the audio stream sampling rate and the single-transmission data volume.
The client transmits the audio stream segments according to the single-transmission data volume, that is, the data volume of each audio stream segment equals the single-transmission data volume.
In this application, the preset threshold is determined according to the audio stream sampling rate (for example 16 kHz or 8 kHz) and the single-transmission data volume.
The sampling rate, also referred to as the sampling speed or sampling frequency, refers to the number of samples per second taken from a continuous signal and made up into a discrete signal, in hertz (Hz). The higher the sampling rate, the greater the amount of data collected per unit time. That is, in the embodiment of the present application, the higher the sampling rate is, the larger the data volume collected in the unit time of the user side is, and the larger the data volume that needs to be identified is.
The collected data volume is different from the single-transmission data volume: although a client with a high sampling rate collects a large data volume per unit time, the data transmission speed from the client to the server is limited in various ways, so a high sampling rate does not necessarily mean a large single-transmission data volume. Meanwhile, the single-transmission data volume is also related to the system's transmission interval; for example, the interval between transmissions may be 0.1 second or 1 second.
The audio stream sampling rate is determined by the client, while the single-transmission data volume is determined by both the client and the server: it may be preset in the system, agreed upon when the client and server perform the communication handshake, or determined in other ways, which are not described again here.
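The relationship between sampling rate, per-second data volume, and single-transmission volume can be made concrete with a small sketch. It assumes 16-bit mono PCM, which the patent does not state; the function names are illustrative.

```python
# Sketch: data volume produced per unit time grows with the sampling rate,
# while a single transmission's volume is fixed by the agreed interval.

def bytes_per_second(sampling_rate, bits_per_sample=16, channels=1):
    """Raw PCM data volume collected per second (assumed 16-bit mono)."""
    return sampling_rate * bits_per_sample // 8 * channels

def single_transmission_volume(sampling_rate, interval_seconds):
    """Data volume of one audio stream segment for a given interval."""
    return int(bytes_per_second(sampling_rate) * interval_seconds)
```

For instance, an 8 kHz stream with a 0.1-second interval yields 1600-byte segments, while a 16 kHz stream with the same interval yields 3200-byte segments, which is why the preset threshold and preset data volume must be derived from both values.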
Step 240, inputting the audio stream segment into an acoustic model according to the input control parameter, and obtaining a score list.
In this embodiment of the application, after receiving the audio stream segment, before inputting the audio stream segment into the acoustic model, that is, after step 230, before step 240, the method further includes:
storing the audio stream segment in a memory, wherein the stored audio stream segment carries a timestamp. In the embodiment of the present application, the timestamp may be any one of a transmission timestamp, a reception timestamp, or a storage timestamp, but the timestamp type must be consistent across all audio stream segments.
A timestamp is a mark indicating chronological order and may simply be a number, for example 1, 2, 3, 4. Taking the transmission timestamp as an example, the timestamp of the first transmitted audio stream segment is 1, and the timestamp of the next transmitted audio stream segment is 2.
Step 250, inputting the score list into a decoder according to the decoding parameters, and acquiring the recognition result of the audio stream segment.
In the embodiments of the application, the input control parameters of the acoustic model and the decoding parameters of the decoder are obtained according to the sampling rate of the received audio stream, and the recognition result is obtained from the acoustic model and the decoder, so that multiple sets of systems do not need to be maintained for different sampling rates, which reduces cost.
In this embodiment of the present application, in step 240, inputting the audio stream segment into the acoustic model according to the input control parameters to obtain a score list includes:
acquiring all audio stream segments in the memory;
judging whether the total data volume of all the audio stream segments is greater than the preset threshold;
if the total data volume of all the audio stream segments is greater than the preset threshold, sequentially acquiring a plurality of audio stream segments from back to front in order of timestamp, and taking the sum of the plurality of audio stream segments as the audio stream to be recognized, such that the total data volume of the audio stream to be recognized equals the preset data volume, wherein the audio stream segment corresponding to the last timestamp is the current audio stream segment;
acquiring a first score list corresponding to the audio stream to be recognized according to the acoustic model;
and screening out, from the first score list, a second score list corresponding to the current audio stream segment.
In this embodiment of the present application, inputting the score list into the decoder according to the decoding parameters to obtain the recognition result includes:
inputting the second score list and the decoding parameters corresponding to the last timestamp into the decoder to obtain the recognition result of the current audio stream segment;
after obtaining the recognition result of the current audio stream segment, the method further includes:
generating and storing the decoding parameters corresponding to the current timestamp.
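The carrying of decoder state across segments (use the parameters stored for the previous timestamp, then store new parameters under the current one) can be sketched as below. `StatefulDecoding` and the tuple-returning `decoder` callable are hypothetical stand-ins, since the patent does not specify the decoder interface.

```python
# Sketch: recognition of segment N consumes the decoding parameters saved
# for timestamp N-1 and saves new parameters under timestamp N.

class StatefulDecoding:
    def __init__(self, init_params):
        self.init_params = init_params
        self.saved = {}                        # timestamp -> decoding params

    def params_for(self, prev_timestamp):
        # Fall back to the initialized parameters when nothing was saved
        # (e.g. the previous segment took the below-threshold branch).
        return self.saved.get(prev_timestamp, self.init_params)

    def decode(self, score_list, timestamp, decoder):
        params = self.params_for(timestamp - 1)
        result, new_params = decoder(score_list, params)
        self.saved[timestamp] = new_params     # keep for the next segment
        return result
```

This mirrors the worked example later in the description, where segment 7 is decoded with the parameters generated for timestamp 6 rather than the initialized ones.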
In the embodiment of the application, when the recognition result of the current audio stream segment is obtained, the sum of multiple audio stream segments is input into the acoustic model; because the input data volume is large (the duration is long), recognition accuracy and precision can be improved.
In this embodiment of the present application, if the total data volume of all audio stream segments is less than or equal to the preset threshold, the method further includes:
all audio stream segments in the memory are used as audio streams to be identified;
acquiring a third score list corresponding to the audio stream to be identified according to the acoustic model;
inputting the third score list and the initialized decoding parameters into the decoder to obtain the recognition result of the audio stream to be recognized, and taking that result as the recognition result of the current audio stream segment;
and initializing the decoding parameters.
In an embodiment of the present application, the timestamps are 1, 2, 3, 4, 5, and so on, and the data volume of each audio stream segment is 1 K. The data volume of each audio stream segment may also be expressed as a duration; for example, in this embodiment the duration of each audio stream segment is 0.2 second.
In this embodiment, the preset threshold may be set as a data volume or, equivalently, as the corresponding duration; here the preset threshold is set to 1 second.
After initialization, the server receives the audio stream segments sent by the client, each of which is 0.2 second.
In the embodiment of the application, the server receives and stores the audio stream segment with timestamp 1 (audio stream segment 1 for short). During recognition, it determines that the total duration of the audio stream segments in the memory is less than the preset threshold of 1 second, so the segment with timestamp 1 is taken as the audio stream to be recognized. That segment is input into the acoustic model to obtain a score list; the score list and the initialized decoding parameters are input into the decoder to obtain the recognition result of the audio stream to be recognized, which is taken as the recognition result of the segment with timestamp 1; and the decoding parameters are initialized.
In this case the total data volume of all audio stream segments in the memory is less than or equal to the preset threshold (only the segment with timestamp 1 is in the memory), so the decoding parameters are initialized after the recognition result is obtained.
In the embodiment of the application, after the server receives and stores the audio stream segment with timestamp 2 (audio stream segment 2 for short), the memory holds audio stream segments 1 and 2. During recognition, the total duration of the audio stream segments in the memory is 0.4 second, which is less than the preset threshold of 1 second, so the sum of audio stream segments 1 and 2 is taken as the audio stream to be recognized. That sum is input into the acoustic model to obtain a score list; the score list and the initialized decoding parameters are input into the decoder to obtain the recognition result of the audio stream to be recognized, which is taken as the recognition result of the segment with timestamp 2; and the decoding parameters are initialized.
In this embodiment of the application, the server obtains the recognition results of the audio stream segments with timestamps 3, 4, and 5 in a manner similar to that for audio stream segment 2, which is not repeated here.
In this embodiment of the application, after receiving and storing the audio stream segment with timestamp 6 (audio stream segment 6 for short), the server holds audio stream segments 1 to 6 in the memory. At recognition time, the total duration of the audio stream segments in the memory is determined to be 1.2 seconds, greater than the preset threshold of 1 second; the preset data amount here is 1 second, so a plurality of audio stream segments are acquired sequentially from back to front in timestamp order, that is, the concatenation of audio stream segments 6, 5, 4, 3, and 2 is taken as the audio stream to be recognized, and a first score list corresponding to this audio stream is obtained from the acoustic model. A second score list corresponding to audio stream segment 6 is then screened out of the first score list. The second score list and the decoding parameters corresponding to timestamp 5 are input into the decoder to obtain the recognition result of the current audio stream segment, and the decoding parameters corresponding to timestamp 6 are generated and stored.
The decoding parameters corresponding to timestamp 5 were initialized after the recognition result of audio stream segment 5 was obtained, so the decoding parameters used here are the initialized decoding parameters.
When the recognition result of audio stream segment 7 is obtained, the decoding parameters of timestamp 6 are used, and those are no longer the initialized parameters.
In the above embodiment, when the recognition result of the audio stream segment with timestamp 6 is obtained, the concatenation of audio stream segments 6, 5, 4, 3, and 2 is input into the acoustic model; because the input covers a larger amount of data (a longer duration), the accuracy and precision of recognition are improved.
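The screening of the second score list out of the first score list can be made concrete with a minimal Python sketch. It assumes (hypothetically) that the acoustic model emits one score row per fixed-length frame in input order, so the rows for the newest segment are simply the tail of the first score list; `screen_second_score_list` and `frames_per_segment` are illustrative names, not from the patent.

```python
# Hypothetical sketch: keep only the score rows of the newest segment.
# Assumes the acoustic model emits one score row per fixed-length frame,
# in input order, so the current segment's rows are the tail of the list.

def screen_second_score_list(first_score_list, frames_per_segment):
    """Return the rows of the first score list that belong to the last segment."""
    return first_score_list[-frames_per_segment:]

# Five frames of scores for segments 2..6; the last two frames belong to segment 6.
first = [[0.1], [0.2], [0.3], [0.4], [0.5]]
assert screen_second_score_list(first, 2) == [[0.4], [0.5]]
```

Only this tail, together with the cached decoding parameters of timestamp 5, reaches the decoder, so the earlier segments contribute acoustic context without being decoded again.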
In the above embodiment, the preset data amount is 1 second, which holds a whole number of audio stream segments. If the preset data amount were 0.9 seconds, it could not be divided evenly into 0.2-second segments, so such a setting would be unreasonable.
In the above embodiments, the preset data amount may also be set to 1.2 seconds or 0.8 seconds; the factors affecting this choice are discussed above and are not repeated here.
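The divisibility constraint discussed above can be checked mechanically. The sketch below is a small illustration (the function name and the millisecond comparison are my own, not the patent's): a preset data amount is reasonable only if it is a whole multiple of the 0.2-second segment length.

```python
# Illustrative check: a preset data amount must hold a whole number of segments.
def preset_amount_is_reasonable(preset_seconds, segment_seconds=0.2):
    # Compare in integer milliseconds to avoid float rounding surprises.
    return round(preset_seconds * 1000) % round(segment_seconds * 1000) == 0

assert preset_amount_is_reasonable(1.0)      # 5 whole segments
assert preset_amount_is_reasonable(1.2)      # 6 whole segments
assert preset_amount_is_reasonable(0.8)      # 4 whole segments
assert not preset_amount_is_reasonable(0.9)  # 4.5 segments: unreasonable
```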
In an embodiment of the present application, the method further includes:
determining whether the recognition results of all received audio stream segments have been completely sent;
if so, determining whether an audio stream transmission end flag sent by the user side has been received;
and if the audio stream transmission end flag is received, disconnecting the long connection with the user side.
In this embodiment of the application, before acquiring all audio stream segments in the memory, the method further includes:
determining whether the received audio stream segment is a silent segment;
if it is not a silent segment, determining whether the audio stream segment is complete;
and if the audio stream segment is complete, storing the audio stream segment into the memory.
In this embodiment of the application, a silent segment is audio that contains no effectively recognizable speech. Silent segments and incomplete audio stream segments are filtered out, and only complete, non-silent audio stream segments are stored, which improves the efficiency of recognizing subsequent audio stream segments.
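The filtering just described can be sketched as follows. The energy check (`max` over raw bytes) is a toy stand-in for a real voice-activity detector, and `expected_len` stands in for whatever completeness check the server actually applies; both are assumptions for illustration.

```python
# Hypothetical pre-storage filter: drop silent or incomplete segments so only
# complete, non-silent audio reaches the recognizer.

def should_store(segment: bytes, expected_len: int, silence_threshold: int = 0) -> bool:
    if max(segment, default=0) <= silence_threshold:  # crude "silent segment" test
        return False
    if len(segment) != expected_len:                  # incomplete segment
        return False
    return True

memory = []
for seg in (bytes(4), b"\x00\x05\x07\x02", b"\x09\x01"):  # silent, complete, truncated
    if should_store(seg, expected_len=4):
        memory.append(seg)
assert memory == [b"\x00\x05\x07\x02"]
```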
Fig. 3 is a flowchart illustrating an audio recognition method according to an embodiment of the present application, and as shown in fig. 3, the method is applied to a client and a server, and includes:
step 310, sending a long connection application by a user side;
step 320, the server establishes a long connection with the user side;
step 330, the server receives the verification information sent by the user terminal through long connection;
in step 340, the server authenticates the user terminal, and if the authentication is passed, the process goes to step 350, and if the authentication is not passed, the process goes to step 394.
Step 350, the user end sends the audio stream information, and after the audio stream is sent, the end mark is sent.
In step 360, the server obtains the input control parameters of the acoustic model and the decoding parameters of the decoder according to the sampling rate of the audio stream.
Step 370, the server receives the audio stream segment sent by the user end;
step 380, the server inputs the audio stream fragments into an acoustic model to obtain a score list.
In step 390, the server inputs the score list into the decoder to obtain the recognition result of the audio stream segment.
In step 391, the server determines whether the transmission of the identification result is finished, and if so, goes to step 392.
Step 392, determine whether an end flag has been received, and if so, go to step 393.
Step 393, the server disconnects the long connection with the client.
In step 394, the server feeds back the verification failure information, and the process is ended.
In this embodiment of the application, the input control parameters of the acoustic model and the decoding parameters of the decoder are obtained according to the sampling rate of the received audio stream, and the recognition result is obtained using the acoustic model and the decoder; there is no need to deploy a separate system for each sampling rate, which reduces cost.
Fig. 4 is a flowchart illustrating an audio recognition method according to an embodiment of the present application, where as shown in fig. 4, the method includes:
step 410, receiving audio stream information for the first time;
step 420, obtaining an initialized decoding parameter according to the sampling rate of the audio stream and the single transmission data volume;
step 430, obtaining a preset threshold and/or a preset data volume according to the audio stream sampling rate and the single transmission data volume.
Step 440, receiving the audio stream segment and determining whether it is a silent segment; if it is not a silent segment (no), go to step 450, and if it is a silent segment (yes), go to step 460.
Step 450, judging whether the system is in a working state, if so, turning to step 480, and if not, turning to step 470.
Step 460, discard the audio stream segment, go to step 440.
Step 470, activate the system, go to step 480.
Step 480, determine whether the audio stream segment is complete, if so, go to step 490, and if not, go to step 460.
Step 490, store the audio stream segment in memory.
Step 491, audio recognition is performed.
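Steps 420 and 430 above derive the decoding parameters, preset threshold, and preset data amount from the sampling rate and the single transmission data volume, but the patent does not state the concrete mapping. The sketch below is one hypothetical mapping, assuming 16-bit PCM samples and a five-segment window; every constant in it is an assumption, chosen only so that the result matches the values used in the embodiments.

```python
# Hypothetical mapping for steps 420-430; the formulas are illustrative only.

def derive_control_parameters(sample_rate_hz, chunk_bytes, bytes_per_sample=2):
    # Duration of one transmitted chunk, i.e. one audio stream segment.
    segment_seconds = chunk_bytes / (sample_rate_hz * bytes_per_sample)
    # Assume a window of five whole segments, matching the embodiment's
    # 1-second threshold and preset data amount for 0.2-second segments.
    preset_threshold = 5 * segment_seconds
    preset_data_amount = 5 * segment_seconds
    return segment_seconds, preset_threshold, preset_data_amount

seg, threshold, amount = derive_control_parameters(16000, 6400)
assert (seg, threshold, amount) == (0.2, 1.0, 1.0)
```

For 16 kHz, 16-bit audio sent in 6400-byte chunks this reproduces the 0.2-second segments and the 1-second threshold used in the embodiments above.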
Fig. 5 is a flowchart illustrating an audio recognition method according to an embodiment of the present application, where as shown in fig. 5, the method includes:
at step 510, an audio stream segment is received and stored.
Step 520, determining whether the duration of the audio stream in the memory is greater than a preset threshold, if so, going to step 530, and if not, going to step 570.
Step 530, sequentially acquiring a plurality of audio stream fragments from back to front according to the sequence of the timestamps, taking the sum of the plurality of audio stream fragments as the audio stream to be identified, and enabling the total data volume of the audio stream to be identified to be equal to the preset data volume;
step 540, acquiring a first score list corresponding to the audio stream to be identified according to the acoustic model;
step 550, screening out a second score list corresponding to the current audio stream segment from the first score list;
step 560, update the decoding parameters, go to step 520.
Step 570, using all the audio stream segments in the memory as the audio streams to be identified;
step 580, according to the acoustic model, obtaining a third score list corresponding to the audio stream to be recognized;
step 590, inputting the third score list and the initialized decoding parameters into a decoder to obtain the recognition result of the audio stream to be recognized, and using the recognition result of the audio stream to be recognized as the recognition result of the current audio stream segment;
in step 591, decoding parameters are initialized, and the process goes to step 520.
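The whole Fig. 5 loop can be condensed into a short Python sketch. The acoustic model and decoder are stubbed out with placeholder functions, and the parameter cache `params_by_ts` is an illustrative name; only the control flow (threshold test, back-to-front window, score screening, parameter caching versus re-initialization) follows the figure.

```python
# Sketch of the Fig. 5 loop with stubbed model and decoder. Segment length,
# threshold, and data amount follow the embodiment; everything else is assumed.
SEGMENT_SECONDS = 0.2
PRESET_THRESHOLD = 1.0
PRESET_DATA_AMOUNT = 1.0

def acoustic_model(segments):       # stub: one "score row" per segment
    return [f"score({s})" for s in segments]

def decode(scores, params):         # stub: returns (result, updated parameters)
    return f"text{scores}|{params}", f"params-after-{len(scores)}"

def recognize(memory, params_by_ts, init_params="init"):
    total = len(memory) * SEGMENT_SECONDS
    ts = len(memory)                                   # timestamp of newest segment
    if total > PRESET_THRESHOLD:                       # steps 530-560
        n = round(PRESET_DATA_AMOUNT / SEGMENT_SECONDS)
        first = acoustic_model(memory[-n:])            # newest n segments
        second = first[-1:]                            # rows for the newest one
        result, new_params = decode(second, params_by_ts[ts - 1])
        params_by_ts[ts] = new_params                  # update decoding parameters
    else:                                              # steps 570-591
        third = acoustic_model(memory)
        result, _ = decode(third, init_params)
        params_by_ts[ts] = init_params                 # re-initialize
    return result

params, memory = {}, []
for ts in range(1, 7):                                 # segments 1..6 arrive in turn
    memory.append(ts)
    recognize(memory, params)
assert params[5] == "init"                # segment 5 still used the short branch
assert params[6] == "params-after-1"      # segment 6 crossed the threshold
```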
Fig. 2 to 5 are schematic flow charts of an audio recognition method according to an embodiment. It should be understood that although the steps in the flowcharts of fig. 2 to 5 are displayed in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise, the order of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in fig. 2 to 5 may include multiple sub-steps or stages that are not necessarily performed at the same moment but may be performed at different moments; these sub-steps or stages are not necessarily performed sequentially, and may be performed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
Corresponding to the above audio recognition method, the present application further provides an audio recognition apparatus, as shown in fig. 6, the apparatus includes:
a receiving unit 610, configured to receive audio stream information, where the audio stream information includes: an audio stream sampling rate;
a processor 620, configured to obtain an input control parameter of an acoustic model and a decoding parameter of a decoder according to the audio stream sampling rate;
the receiving unit 610 is further configured to receive an audio stream segment;
a model unit 630, configured to input the audio stream segment into an acoustic model, and obtain a score list;
and the decoding unit 640 is configured to input the score list into a decoder according to the decoding parameters, and obtain an identification result of the audio stream segment.
In an embodiment of the present application, the inputting of the control parameter includes: presetting a threshold and a preset data volume; the model unit 630 is further configured to:
acquiring all audio stream fragments in a memory;
judging whether the total data volume of all the audio stream fragments is larger than a preset threshold value or not;
if the total data volume of all the audio stream fragments is larger than a preset threshold value, sequentially acquiring a plurality of audio stream fragments from back to front according to the sequence of the timestamps, taking the sum of the plurality of audio stream fragments as the audio stream to be identified, and enabling the total data volume of the audio stream to be identified to be equal to the preset data volume, wherein the audio stream fragment corresponding to the last timestamp is the current audio stream fragment;
acquiring a first score list corresponding to the audio stream to be identified according to the acoustic model;
and screening out a second score list corresponding to the current audio stream fragment from the first score list.
In this embodiment of the application, the decoding unit 640 is further configured to:
inputting the decoding parameters corresponding to the second score list and the last timestamp into the decoder to obtain the identification result of the current audio stream segment;
after obtaining the identification result of the current audio stream segment, the method further includes:
and generating and storing the decoding parameters corresponding to the current time stamp.
In an embodiment of the present application, the apparatus further includes a long connection establishing unit, configured to:
receiving a long connection application;
establishing long connection with a user side according to the long connection application;
receiving verification information sent by the user side through the long connection;
and performing identity authentication on the user side according to the authentication information, and if the authentication is passed, allowing the audio stream information to be received.
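The long connection establishment above amounts to a gatekeeper check before any audio is accepted. A toy sketch follows, with a static token table standing in for whatever verification the server really performs; `VALID_TOKENS` and both parameter names are hypothetical.

```python
# Hypothetical verification step of the long-connection handshake.
VALID_TOKENS = {"user-1": "secret"}

def handle_connection(user_id, token):
    """Return True (keep the long connection, allow audio) iff verification passes."""
    if VALID_TOKENS.get(user_id) != token:
        return False   # feed back verification failure; connection is refused
    return True        # connection stays open for audio stream segments

assert handle_connection("user-1", "secret")
assert not handle_connection("user-1", "wrong")
```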
In the embodiment of the application, the audio stream information received for the first time also comprises single transmission data volume;
in an embodiment of the present application, the processor is further configured to:
obtaining an initialized decoding parameter according to the sampling rate of the audio stream and the single transmission data volume;
and acquiring a preset threshold value and a preset data volume according to the sampling rate of the audio stream and the single transmission data volume.
In an embodiment of the present application, the apparatus further includes a storage unit, configured to: and storing the audio stream fragments to a memory, wherein the stored audio stream fragments carry time stamps.
In an embodiment of the present application, the decoding unit 640 is further configured to:
if the total data volume of all the audio stream segments in the memory is less than or equal to the preset threshold, take all the audio stream segments in the memory as the audio stream to be recognized;
acquire a third score list corresponding to the audio stream to be recognized according to the acoustic model;
input the third score list and the initialized decoding parameters into the decoder to obtain the recognition result of the audio stream to be recognized, and take that result as the recognition result of the current audio stream segment;
and initialize the decoding parameters.
In this embodiment of the present application, the long connection establishing unit is further configured to:
determining whether the recognition results of all received audio stream segments have been completely sent;
if so, determining whether an audio stream transmission end flag sent by the user side has been received;
and if the audio stream transmission end flag is received, disconnecting the long connection with the user side.
In an embodiment of the present application, the storage unit is further configured to: before all audio stream segments in a memory are acquired, judging whether the received audio stream segments are silent segments or not;
if the audio stream is not a mute segment, judging whether the audio stream segment is complete;
and if the audio stream segment is complete, storing the audio stream segment into the memory.
In this embodiment of the application, the input control parameters of the acoustic model and the decoding parameters of the decoder are obtained according to the sampling rate of the received audio stream, and the recognition result is obtained using the acoustic model and the decoder; there is no need to deploy a separate system for each sampling rate, which reduces cost.
FIG. 7 is a diagram illustrating an internal structure of a computer device in one embodiment. The computer device may specifically be the server 120 in fig. 1. As shown in fig. 7, the computer apparatus includes a processor, a memory, a network interface, an input device, and a display screen connected through a system bus. Wherein the memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program that, when executed by the processor, causes the processor to implement the audio recognition method. The internal memory may also have stored therein a computer program that, when executed by the processor, causes the processor to perform the audio recognition method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
Those skilled in the art will appreciate that the architecture shown in fig. 7 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program: receiving audio stream information, wherein the audio stream information comprises an audio stream sampling rate; acquiring input control parameters of an acoustic model and decoding parameters of a decoder according to the audio stream sampling rate; receiving an audio stream segment; inputting the audio stream segment into the acoustic model according to the input control parameters to obtain a score list; and inputting the score list into the decoder according to the decoding parameters to obtain the recognition result of the audio stream segment.
In an embodiment, the processor executes the computer program to implement the steps of the above method, which are not described herein again.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor performs the steps of: receiving audio stream information, wherein the audio stream information comprises an audio stream sampling rate; acquiring input control parameters of an acoustic model and decoding parameters of a decoder according to the audio stream sampling rate; receiving an audio stream segment; inputting the audio stream segment into the acoustic model according to the input control parameters to obtain a score list; and inputting the score list into the decoder according to the decoding parameters to obtain the recognition result of the audio stream segment. In an embodiment, the computer program, when executed by the processor, further implements the steps of the above method, which are not described herein again.
In one embodiment, a computer program product or computer program is provided that includes computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the steps of: receiving audio stream information, wherein the audio stream information comprises an audio stream sampling rate; acquiring input control parameters of an acoustic model and decoding parameters of a decoder according to the audio stream sampling rate; receiving an audio stream segment; inputting the audio stream segment into the acoustic model according to the input control parameters to obtain a score list; and inputting the score list into the decoder according to the decoding parameters to obtain the recognition result of the audio stream segment. In an embodiment, the computer program product or the computer program, when executed, further implements the steps of the above method, which are not described herein again.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus DRAM (RDRAM), and direct Rambus DRAM (DRDRAM).
It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The foregoing are merely exemplary embodiments of the present invention, which enable those skilled in the art to understand or practice the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (12)

1. A method for audio recognition, the method comprising:
receiving audio stream information, wherein the audio stream information comprises: an audio stream sampling rate;
acquiring input control parameters of an acoustic model and decoding parameters of a decoder according to the audio stream sampling rate;
receiving an audio stream segment;
inputting the audio stream fragments into an acoustic model according to the input control parameters to obtain a score list;
and inputting the score list into a decoder according to the decoding parameters to obtain the identification result of the audio stream segment.
2. The method of claim 1, wherein the input control parameters comprise: a preset threshold value and a preset data volume,
the inputting the audio stream segment into the acoustic model according to the input control parameter to obtain a score list, comprising:
acquiring all audio stream fragments in a memory;
judging whether the total data volume of all the audio stream fragments is larger than a preset threshold value or not;
if the total data volume of all the audio stream fragments is larger than a preset threshold value, sequentially acquiring a plurality of audio stream fragments from back to front according to the sequence of the timestamps, taking the sum of the plurality of audio stream fragments as the audio stream to be identified, and enabling the total data volume of the audio stream to be identified to be equal to the preset data volume, wherein the audio stream fragment corresponding to the last timestamp is the current audio stream fragment;
acquiring a first score list corresponding to the audio stream to be identified according to the acoustic model;
and screening out a second score list corresponding to the current audio stream fragment from the first score list.
3. The method of claim 2, wherein inputting the score list into a decoder according to the decoding parameters to obtain the recognition result comprises:
inputting the decoding parameters corresponding to the second score list and the last timestamp into the decoder to obtain the identification result of the current audio stream segment;
after obtaining the identification result of the current audio stream segment, the method further includes:
and generating and storing the decoding parameters corresponding to the current time stamp.
4. The method of claim 1, wherein prior to receiving the audio stream information, the method further comprises:
receiving a long connection application;
establishing long connection with a user side according to the long connection application;
receiving verification information sent by the user side through the long connection;
and performing identity authentication on the user side according to the authentication information, and if the authentication is passed, allowing the audio stream information to be received.
5. The method of claim 1, wherein the first received audio stream information further includes a single transmission data amount,
the obtaining of the input control parameters of the acoustic model and the decoding parameters of the decoder according to the sampling rate of the audio stream includes:
obtaining an initialized decoding parameter according to the sampling rate of the audio stream and the single transmission data volume;
and acquiring a preset threshold value and a preset data volume according to the sampling rate of the audio stream and the single transmission data volume.
6. The method of claim 1, wherein after receiving the audio stream segment and before inputting the audio stream segment into an acoustic model, the method further comprises:
and storing the audio stream fragments to a memory, wherein the stored audio stream fragments carry time stamps.
7. The method according to claim 1, wherein if the total data volume of all the audio stream segments in the memory is less than or equal to a preset threshold value, the method further comprises:
all audio stream segments in the memory are used as audio streams to be identified;
acquiring a third score list corresponding to the audio stream to be identified according to the acoustic model;
inputting the third score list and the initialized decoding parameters into a decoder to obtain an identification result of the audio stream to be identified, and taking the identification result of the audio stream to be identified as an identification result of the current audio stream segment;
the decoding parameters are initialized.
8. The method of claim 1, further comprising:
judging whether the recognition results of all received audio stream segments have been completely sent;
if the sending is finished, judging whether an audio stream transmission end flag sent by the user side has been received;
and if the audio stream transmission end flag is received, disconnecting the long connection with the user side.
9. The method of claim 2, wherein prior to retrieving all audio stream segments in memory, the method further comprises:
judging whether the received audio stream segment is a silent segment;
if it is not a silent segment, judging whether the audio stream segment is complete;
and if the audio stream segment is complete, storing the audio stream segment into the memory.
10. An audio recognition apparatus, characterized in that the apparatus comprises:
a receiving unit, configured to receive audio stream information, where the audio stream information includes: an audio stream sampling rate;
the processor is used for acquiring input control parameters of an acoustic model and decoding parameters of a decoder according to the sampling rate of the audio stream;
the receiving unit is further configured to receive an audio stream segment;
the model unit is used for inputting the audio stream fragments into an acoustic model to obtain a score list;
and the decoding unit is used for inputting the score list into a decoder according to the decoding parameters and acquiring the identification result of the audio stream segment.
11. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method of any of claims 1 to 9 are implemented when the computer program is executed by the processor.
12. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 9.
CN202110436379.0A 2021-04-22 2021-04-22 Audio identification method, device, computer equipment and storage medium Active CN113205800B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110436379.0A CN113205800B (en) 2021-04-22 2021-04-22 Audio identification method, device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110436379.0A CN113205800B (en) 2021-04-22 2021-04-22 Audio identification method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113205800A true CN113205800A (en) 2021-08-03
CN113205800B CN113205800B (en) 2024-03-01

Family

ID=77027963

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110436379.0A Active CN113205800B (en) 2021-04-22 2021-04-22 Audio identification method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113205800B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101021854A (en) * 2006-10-11 2007-08-22 鲍东山 Audio analysis system based on content
CN103714837A (en) * 2013-12-18 2014-04-09 福州瑞芯微电子有限公司 Electronic device and method for playing audio files
WO2017113973A1 (en) * 2015-12-29 2017-07-06 北京搜狗科技发展有限公司 Method and device for audio identification
CN110111775A (en) * 2019-05-17 2019-08-09 腾讯科技(深圳)有限公司 A kind of Streaming voice recognition methods, device, equipment and storage medium
CN110288981A (en) * 2019-07-03 2019-09-27 百度在线网络技术(北京)有限公司 Method and apparatus for handling audio data
US20200027444A1 (en) * 2018-07-20 2020-01-23 Google Llc Speech recognition with sequence-to-sequence models
CN110995943A (en) * 2019-12-25 2020-04-10 携程计算机技术(上海)有限公司 Multi-user streaming voice recognition method, system, device and medium
CN111415667A (en) * 2020-03-25 2020-07-14 极限元(杭州)智能科技股份有限公司 Stream-type end-to-end speech recognition model training and decoding method
CN112202720A (en) * 2020-09-04 2021-01-08 中移雄安信息通信科技有限公司 Audio and video recognition method and device, electronic equipment and computer storage medium
US20210020175A1 (en) * 2019-07-17 2021-01-21 Baidu Online Network Technology (Beijing) Co., Ltd. Method, apparatus, device and computer readable storage medium for recognizing and decoding voice based on streaming attention model

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101021854A (en) * 2006-10-11 2007-08-22 鲍东山 Audio analysis system based on content
CN103714837A (en) * 2013-12-18 2014-04-09 福州瑞芯微电子有限公司 Electronic device and method for playing audio files
WO2017113973A1 (en) * 2015-12-29 2017-07-06 北京搜狗科技发展有限公司 Method and device for audio identification
US20200027444A1 (en) * 2018-07-20 2020-01-23 Google Llc Speech recognition with sequence-to-sequence models
CN110111775A (en) * 2019-05-17 2019-08-09 腾讯科技(深圳)有限公司 A kind of Streaming voice recognition methods, device, equipment and storage medium
CN110288981A (en) * 2019-07-03 2019-09-27 百度在线网络技术(北京)有限公司 Method and apparatus for handling audio data
US20210020175A1 (en) * 2019-07-17 2021-01-21 Baidu Online Network Technology (Beijing) Co., Ltd. Method, apparatus, device and computer readable storage medium for recognizing and decoding voice based on streaming attention model
CN110995943A (en) * 2019-12-25 2020-04-10 Ctrip Computer Technology (Shanghai) Co., Ltd. Multi-user streaming speech recognition method, system, device, and medium
CN111415667A (en) * 2020-03-25 2020-07-14 极限元 (Hangzhou) Intelligent Technology Co., Ltd. Streaming end-to-end speech recognition model training and decoding method
CN112202720A (en) * 2020-09-04 2021-01-08 China Mobile Xiong'an Information and Communication Technology Co., Ltd. Audio and video recognition method and apparatus, electronic device, and computer storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Zhang Xiaolong et al.: "Audio recognition method based on residual network and random forest", Computer Engineering and Science, vol. 41, no. 4, pages 727-732 *
Du Gang et al.: "Research on objectionable speech recognition technology based on acoustic models", Telecom Engineering Technics and Standardization, no. 12, pages 23-27 *

Also Published As

Publication number Publication date
CN113205800B (en) 2024-03-01

Similar Documents

Publication Publication Date Title
CN109451188B (en) Method and device for differential self-help response, computer equipment and storage medium
US11210461B2 (en) Real-time privacy filter
CN107517207A (en) Server, identity authentication method, and computer-readable storage medium
CN111105782B (en) Session interaction processing method and device, computer equipment and storage medium
EP3291224A1 (en) Method and apparatus for inputting information
WO2020077885A1 (en) Identity authentication method and apparatus, computer device and storage medium
CN111858892B (en) Voice interaction method, device, equipment and medium based on knowledge graph
CN111883140A (en) Authentication method, device, equipment and medium based on knowledge graph and voiceprint recognition
EP4105848A1 (en) Method and apparatus for evaluating joint training model
CN110505504B (en) Video program processing method and device, computer equipment and storage medium
CN114514577A (en) Method and system for generating and transmitting a text recording of a verbal communication
WO2020014890A1 (en) Accent-based voice recognition processing method, electronic device and storage medium
CN110766442A (en) Client information verification method, device, computer equipment and storage medium
CN112309372A (en) Tone-based intention identification method, device, equipment and storage medium
CN115050372A (en) Audio segment clustering method and device, electronic equipment and medium
WO2020057014A1 (en) Dialogue analysis and evaluation method and apparatus, computer device and storage medium
CN113948090A (en) Voice detection method, session recording product and computer storage medium
CN112712793A (en) ASR (error correction) method based on pre-training model under voice interaction and related equipment
CN114125494A (en) Content auditing auxiliary processing method and device and electronic equipment
CN113205800B (en) Audio identification method, device, computer equipment and storage medium
US11741964B2 (en) Transcription generation technique selection
CN113051924A (en) Method and system for segmented quality inspection of recorded data
CN110491366B (en) Audio smoothing method and device, computer equipment and storage medium
CN113257242B (en) Voice broadcasting suspension method, device, equipment and medium in self-service voice service
CN114067842B (en) Customer satisfaction degree identification method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: Room 221, 2/F, Block C, 18 Kechuang 11th Street, Daxing District, Beijing, 100176

Applicant after: Jingdong Technology Holding Co.,Ltd.

Address before: Room 221, 2/F, Block C, 18 Kechuang 11th Street, Daxing District, Beijing, 100176

Applicant before: Jingdong Digital Technology Holding Co.,Ltd.

GR01 Patent grant