CN112863526B - Speech processing method based on automatic selection of speech decoding playing format - Google Patents


Info

Publication number
CN112863526B
CN112863526B (application CN202110454832.0A)
Authority
CN
China
Prior art keywords
voice
format
playing
byte length
played
Prior art date
Legal status
Active
Application number
CN202110454832.0A
Other languages
Chinese (zh)
Other versions
CN112863526A
Inventor
Wang Xia (王霞)
Chen Yongci (陈永慈)
Shi Dongge (时东各)
Current Assignee
Beijing Jinganjia New Technology Co ltd
Original Assignee
Beijing Jinganjia New Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Jinganjia New Technology Co ltd
Priority to CN202110454832.0A
Publication of CN112863526A
Application granted
Publication of CN112863526B
Status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L 19/16 Vocoder architecture
    • G10L 19/167 Audio streaming, i.e. formatting and decoding of an encoded audio signal representation into a data stream for transmission or storage purposes
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/002 Dynamic bit allocation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L 19/16 Vocoder architecture
    • G10L 19/18 Vocoders using multiple modes
    • G10L 19/24 Variable rate codecs, e.g. for generating different qualities using a scalable representation such as hierarchical encoding or layered encoding
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00 Network arrangements or protocols for supporting network services or applications
    • H04L 67/01 Protocols
    • H04L 67/10 Protocols in which an application is distributed across nodes in the network
    • H04L 67/1097 Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]

Abstract

The invention relates to a speech processing method based on automatic selection of the speech decoding playing format, comprising the following steps: acquiring a voice to be played, where the voice is stored in the cloud in a frame format; retrieving the voice from the cloud and locating its file header and voice frames; reading the information in the file header and the voice frames, and determining from that information the playing format in which the voice is to be played; and playing the voice in that format. After the voice is decoded, the playing format is obtained from the information in the file header and the voice frames, and the voice is played in the corresponding format.

Description

Speech processing method based on automatic selection of speech decoding playing format
Technical Field
The invention relates to the technical field of data processing, in particular to a voice processing method based on automatic selection of a voice decoding playing format.
Background
In a mobile network such as a 4G or 5G network, a complete voice conversation generally contains speech in two or more formats. The existing voice processing method cuts a complete conversation into several voice files of different formats and then plays them one after another in sequence, each file being played in its own format.
However, because the voice is split into multiple files of different formats, the playing format must be switched while the files are played in sequence, and each switch degrades the fluency of playback.
Disclosure of Invention
Therefore, the invention provides a speech processing method based on automatic selection of the speech decoding playing format, which solves the problem of choppy playback caused by format switching during voice playback.
In order to achieve the above object, the present invention provides a speech processing method based on automatic selection of speech decoding playing format, which includes:
acquiring a voice to be played, wherein the voice to be played is stored to a cloud terminal in a frame format;
calling the voice to be played from the cloud end, and searching a file header and a voice frame of the voice to be played;
reading the information in the file header and the voice frame, and acquiring a playing format for playing the voice to be played according to the information in the file header and the voice frame;
playing the voice to be played in a playing format;
acquiring a voice to be played, wherein storing the voice to be played in the cloud in a frame format comprises the following steps:
acquiring the byte length l of the voice to be played;
a first byte length l1, a second byte length l2 and a third byte length l3 are preset in the processor, and the data processing rate of the cloud is selected according to the actual length of the voice to be played:
if l ≤ l1, the first data processing rate is selected;
if l1 < l ≤ l2, the second data processing rate is selected;
if l2 < l ≤ l3, the third data processing rate is selected;
if l > l3, the fourth data processing rate is selected.
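The rate-selection rule above can be sketched in a few lines of Python. This is purely illustrative: the patent contains no code, and the function name, threshold values and rate tuple are all hypothetical.

```python
def select_processing_rate(l, l1, l2, l3, rates):
    """Pick a cloud data-processing rate for a voice of byte length l.

    rates is (m1, m2, m3, m4); the thresholds satisfy l1 < l2 < l3.
    """
    if l <= l1:
        return rates[0]   # first data processing rate
    elif l <= l2:
        return rates[1]   # second data processing rate
    elif l <= l3:
        return rates[2]   # third data processing rate
    return rates[3]       # fourth data processing rate, for l > l3
```

The four branches are mutually exclusive and cover every non-negative byte length, so exactly one rate is always selected.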
Further, reading the information in the file header and the voice frames, and obtaining the playing format in which the voice is to be played according to that information, comprises:
obtaining, from the information in the file header, the n voice formats contained in the voice to be played, segmenting the voice into n segments and determining the byte length z of each segment;
the byte length of the first segment is z1, that of the second segment is z2, that of the third segment is z3, and that of the n-th segment is zn; the lengths z1 to zn may or may not be equal. The byte lengths z1 to zn are compared, the longest, zi, and the next-longest, zj, are selected, and the fluency fi of playing in the voice format of zi and the fluency fj of playing in the voice format of zj under the current network are determined;
if fi is greater than fj, playing the voice to be played by adopting a zi voice format;
if fi is less than fj, playing the voice to be played by adopting a voice format of zj;
and if fi = fj, playing the voice to be played by adopting a zi voice format or a zj voice format.
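The comparison of the two longest segments can be sketched as follows; this is an illustrative reading of the steps above, not code from the patent, and all names are hypothetical.

```python
def choose_play_format(segments, fluency):
    """segments: list of (voice_format, byte_length) pairs.
    fluency: callable giving the fluency of a format on the current network.

    Returns the format of the longest segment (zi) unless the format of
    the next-longest segment (zj) plays more fluently; a tie goes to zi,
    since the text allows either format in that case.
    """
    ordered = sorted(segments, key=lambda s: s[1], reverse=True)
    (fmt_i, _), (fmt_j, _) = ordered[0], ordered[1]
    return fmt_i if fluency(fmt_i) >= fluency(fmt_j) else fmt_j
```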
Further, determining the fluency f of playing the voice in any voice format under the current network comprises:
determining the current network condition: a first network condition indicates a good network with a high transmission rate and is assigned the value 1; a general network condition with a medium transmission rate is the second network condition and is assigned the value 2; a poor network condition with a low transmission rate is the third network condition and is assigned the value 3. The fluency f is given by f = t × d0i, where t is the assignment of the network condition and d0i is the standard byte-length increase corresponding to the i-th voice format.
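The fluency expression f = t × d0i can be written directly; the mapping from condition names to assignments follows the text, while the function and variable names are illustrative.

```python
# Assignment t for each network condition, as defined in the description:
# good network -> 1, medium -> 2, poor -> 3.
NETWORK_ASSIGNMENT = {"good": 1, "medium": 2, "poor": 3}

def fluency(network_condition, d0i):
    """f = t * d0i, where t is the network-condition assignment and d0i
    is the standard byte-length increase of the i-th voice format."""
    return NETWORK_ASSIGNMENT[network_condition] * d0i
```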
Further, before the voice to be played is retrieved from the cloud and its file header and voice frames are located, the cloud responds to the retrieval instruction at a standard response speed v0;
when the cloud responds to the retrieval instruction at the standard response speed v0, the response speed is corrected based on the actual network condition. A first correction coefficient k1, a second correction coefficient k2 and a third correction coefficient k3 are preset in the central control unit, with k1 > k2 > k3 ≥ 1. If the network condition on the network side belongs to the first network condition, the first correction coefficient is used and the response speed of the cloud is adjusted to v10′ = v0 × k1;
if the network condition on the network side belongs to the second network condition, the second correction coefficient is used and the response speed of the cloud is adjusted to v20′ = v0 × k2;
if the network condition on the network side belongs to the third network condition, the third correction coefficient is used and the response speed of the cloud is adjusted to v30′ = v0 × k3.
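The response-speed adjustment above is a single multiplication selected by the network condition. The sketch below uses hypothetical coefficient values chosen only to satisfy k1 > k2 > k3 ≥ 1; the patent does not fix them.

```python
def corrected_response_speed(v0, network_condition, k1=1.5, k2=1.2, k3=1.0):
    """Adjust the cloud's standard response speed v0 for the current
    network condition (1, 2 or 3): v' = v0 * k_t, with k1 > k2 > k3 >= 1."""
    coefficient = {1: k1, 2: k2, 3: k3}[network_condition]
    return v0 * coefficient
```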
Further, if the byte lengths of the n segments are ordered z1 > z2 > … > zn, the first correction coefficient is k1 = z1/zn + (z4 + z5 + … + zn)/(z1 + z2 + … + zn);
the second correction coefficient is k2 = z2/zn + (z4 + z5 + … + zn)/(z1 + z2 + … + zn);
the third correction coefficient is k3 = z3/zn + (z4 + z5 + … + zn)/(z1 + z2 + … + zn).
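The three coefficients share the same tail term, so they can be computed together; this sketch assumes at least four segments and descending order, as the formulas require, and the function name is illustrative.

```python
def correction_coefficients(z):
    """z: segment byte lengths sorted descending (z1 > z2 > ... > zn, n >= 4).

    Returns [k1, k2, k3] with ki = z[i-1]/zn + (z4 + ... + zn)/(z1 + ... + zn).
    """
    tail = sum(z[3:]) / sum(z)           # (z4 + ... + zn) / (z1 + ... + zn)
    return [z[i] / z[-1] + tail for i in range(3)]
```

With z = [8, 6, 4, 2] the tail term is 2/20 = 0.1, giving k1 = 4.1, k2 = 3.1, k3 = 2.1, which satisfies k1 > k2 > k3 ≥ 1 as the description requires.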
Further, playing in the i-th playing format increases the byte length of the original voice file by ΔLi: playing in the first playing format increases it by ΔL1, the second by ΔL2, the third by ΔL3, the fourth by ΔL4, the fifth by ΔL5, the sixth by ΔL6, the seventh by ΔL7 and the eighth by ΔL8;
the data processing rate is then corrected according to the byte-length increase.
Further, correcting the data processing rate according to the byte-length increase comprises:
if the byte-length increase is ΔL1, the first data processing rate m1 is corrected to m1′ = m1 × ΔL1/L;
if the byte-length increase is ΔL2, the second data processing rate m2 is corrected to m2′ = m2 × ΔL2/L;
if the byte-length increase is ΔL3, the third data processing rate m3 is corrected to m3′ = m3 × ΔL3/L;
if the byte-length increase is ΔL4, the fourth data processing rate m4 is corrected to m4′ = m4 × ΔL4/L;
where L = (ΔL1 + ΔL2 + … + ΔL8)/8.
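The rate correction is a scaling by the ratio of the actual byte-length increase to the mean increase L over the eight formats. A minimal sketch, with illustrative names:

```python
def corrected_rate(m, delta_l, deltas):
    """m' = m * ΔLi / L, where L is the mean of the eight per-format
    byte-length increases ΔL1..ΔL8 (passed in as `deltas`)."""
    L = sum(deltas) / len(deltas)
    return m * delta_l / L
```

Note that a format whose increase equals the mean leaves the rate unchanged, while an above-average increase scales the rate up proportionally.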
Further, the current network is a local area network or the Internet; the first playing format is the amr format, the second playing format the vol format, the third the evs format, the fourth the pcm format, the fifth the ghr format, the sixth the gfr format, the seventh the ehr format, and the eighth the efr format.
Further, the voice frame comprises a first part, a second part and a third part. The first part is 2 bytes (16 bits): the high 4 bits represent the format of the voice frame and the remaining 12 bits represent the length of the voice frame, whose value is the sum of the lengths of the frame's three parts; this part is stored in host byte order. The second part is a 4-byte relative timestamp in network byte order, used to compute the time difference between two voice frames. The third part is the actual speech frame.
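The bit layout of the first two parts can be parsed with Python's `struct` module; this is a sketch under the layout described above (the function name is illustrative, and "=" is used for host byte order, "!" for network byte order).

```python
import struct

def parse_frame_header(buf):
    """Parse the first two parts of a voice frame.

    Part 1 (2 bytes, host order): high 4 bits = frame format,
    low 12 bits = total frame length (sum of all three parts).
    Part 2 (4 bytes, network order): relative timestamp.
    Returns (frame_format, frame_length, timestamp).
    """
    (word,) = struct.unpack("=H", buf[:2])        # host byte order
    frame_format = (word >> 12) & 0xF
    frame_length = word & 0x0FFF
    (timestamp,) = struct.unpack("!I", buf[2:6])  # network (big-endian) order
    return frame_format, frame_length, timestamp
```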
Further, the file header is 6 bytes: the first two bytes represent the header length (default 6), in host byte order; the last four bytes are the voice start time in seconds, in host byte order.
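The 6-byte file header described above can be read in one `struct.unpack` call; the function name is illustrative.

```python
import struct

def parse_file_header(buf):
    """6-byte file header: bytes 0-1 = header length (host order,
    default 6), bytes 2-5 = voice start time in seconds (host order).
    "=" gives standard sizes (H = 2 bytes, I = 4 bytes) with no padding."""
    header_len, start_time = struct.unpack("=HI", buf[:6])
    return header_len, start_time
```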
Compared with the prior art, the invention has the advantage that, after the voice is decoded, the playing format is obtained from the information in the file header and the voice frames, and the voice is played in the corresponding format.
In particular, when the voice is stored, the data processing rate of the cloud is selected according to the byte length of the voice to be played, which allows the voice to be stored quickly and improves processing speed.
In particular, the voice formats contained in the voice to be played are enumerated and the byte length corresponding to each format is determined; playing in the format with the longest byte length speeds up playback, and reselecting the playing format according to the network state during playback improves processing speed, playback fluency and overall processing efficiency.
In particular, the fluency of playback is computed from the network condition and the voice format to be transmitted, which improves the fluency of voice-data transmission under different network states, increases transmission speed and improves the user experience.
In particular, the data processing rate of the cloud is corrected according to the byte-length increase: different pre-playing formats produce different byte-length increases, and each data processing rate is corrected accordingly, further increasing the processing speed of actual voice data and the playback efficiency.
In particular, the file header in the embodiment of the invention makes voice information quick to locate, improving processing speed and the convenience of playing-format selection.
Drawings
Fig. 1 is a schematic flowchart of a voice processing method based on automatically selecting a voice decoding playing format according to an embodiment of the present invention;
FIG. 2 is a block diagram of a speech processing system for automatically selecting a speech decoding playback format according to an embodiment of the present invention.
Detailed Description
In order that the objects and advantages of the invention will be more clearly understood, the invention is further described below with reference to examples; it should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood by those skilled in the art that these embodiments are only for explaining the technical principle of the present invention, and do not limit the scope of the present invention.
Referring to fig. 1, a voice processing method based on automatically selecting a voice decoding playing format according to an embodiment of the present invention includes:
s100, acquiring a voice to be played, wherein the voice to be played is stored to a cloud end in a frame format;
step S200, calling the voice to be played from the cloud, and searching a file header and a voice frame of the voice to be played;
step S300, reading the information in the file header and the voice frame, and acquiring a playing format for playing the voice to be played according to the information in the file header and the voice frame;
and S400, playing the voice to be played in a playing format.
Specifically, in the voice processing method based on automatic selection of the voice decoding playing format provided by the embodiment of the present invention, after the voice is decoded, the playing format is obtained from the information in the file header and the voice frames, and the voice is played in the corresponding format.
Specifically, acquiring the voice to be played and storing it in the cloud in a frame format includes:
acquiring the byte length l of the voice to be played;
a first byte length l1, a second byte length l2 and a third byte length l3 are preset in the processor, and the data processing rate of the cloud is selected according to the actual length of the voice to be played:
if l ≤ l1, the first data processing rate is selected;
if l1 < l ≤ l2, the second data processing rate is selected;
if l2 < l ≤ l3, the third data processing rate is selected;
if l > l3, the fourth data processing rate is selected.
Specifically, in the voice processing method based on automatic selection of the voice decoding playing format in the embodiment of the present invention, when the voice is stored, the data processing speed of the cloud is selected according to the byte length of the voice to be played, which is convenient for fast storage of the voice to be played, so as to improve the processing speed of the voice.
Specifically, reading the information in the file header and the voice frames, and obtaining the playing format in which the voice is to be played according to that information, includes:
obtaining, from the information in the file header, the n voice formats contained in the voice to be played, segmenting the voice into n segments and determining the byte length z of each segment;
the byte length of the first segment is z1, that of the second segment is z2, that of the third segment is z3, and that of the n-th segment is zn; the lengths z1 to zn may or may not be equal. The byte lengths z1 to zn are compared, the longest, zi, and the next-longest, zj, are selected, and the fluency fi of playing in the voice format of zi and the fluency fj of playing in the voice format of zj under the current network are determined;
if fi is greater than fj, playing the voice to be played by adopting a zi voice format;
if fi is less than fj, playing the voice to be played by adopting a voice format of zj;
if fi = fj, the voice to be played can be played in the zi voice format or the zj voice format.
Specifically, the voice formats contained in the voice to be played are enumerated and the byte length corresponding to each format is determined; playing in the format with the longest byte length speeds up playback, and reselecting the playing format according to the network state during playback improves processing speed, playback fluency and overall processing efficiency.
Specifically, determining the fluency f of playing the voice in an arbitrary voice format under the current network includes:
determining the current network condition: a first network condition indicates a good network with a high transmission rate and is assigned the value 1; a general network condition with a medium transmission rate is the second network condition and is assigned the value 2; a poor network condition with a low transmission rate is the third network condition and is assigned the value 3. The fluency f is given by f = t × d0i, where t is the assignment of the network condition and d0i is the standard byte-length increase corresponding to the i-th voice format.
Specifically, the embodiment of the invention computes the fluency of playback from the network condition and the voice format to be transmitted, which improves the fluency of voice-data transmission under different network states, increases transmission speed and improves the user experience.
Specifically, before the voice to be played is retrieved from the cloud and its file header and voice frames are located, the cloud responds to the retrieval instruction at a standard response speed v0;
when the cloud responds to the retrieval instruction at the standard response speed v0, the response speed is corrected based on the actual network condition. A first correction coefficient k1, a second correction coefficient k2 and a third correction coefficient k3 are preset in the central control unit, with k1 > k2 > k3 ≥ 1. If the network condition on the network side belongs to the first network condition, the first correction coefficient is used and the response speed of the cloud is adjusted to v10′ = v0 × k1;
if the network condition on the network side belongs to the second network condition, the second correction coefficient is used and the response speed of the cloud is adjusted to v20′ = v0 × k2;
if the network condition on the network side belongs to the third network condition, the third correction coefficient is used and the response speed of the cloud is adjusted to v30′ = v0 × k3.
Specifically, the embodiment of the invention adjusts the response speed of the cloud according to the network condition on the network side so that the two match: when the network condition is good, the response speed is increased and voice is transmitted while conditions are good, which ensures stable voice transmission and good playback.
Specifically, if the byte lengths of the n segments are ordered z1 > z2 > … > zn, the first correction coefficient is k1 = z1/zn + (z4 + z5 + … + zn)/(z1 + z2 + … + zn);
the second correction coefficient is k2 = z2/zn + (z4 + z5 + … + zn)/(z1 + z2 + … + zn);
the third correction coefficient is k3 = z3/zn + (z4 + z5 + … + zn)/(z1 + z2 + … + zn).
Specifically, the correction coefficients in the embodiment of the present invention are positively correlated with the byte lengths of the n segments, which makes the response-speed correction more accurate and ensures efficient voice transmission.
Specifically, the current network is a local area network or the internet.
Specifically, the voice file in the embodiment of the present invention may carry voice data from a local area network or from the Internet. Handling both kinds of voice data increases the processing speed of voice data in the network and adapts the method to more usage scenarios, improving its applicability and compatibility.
Specifically, playing in the i-th playing format increases the byte length of the original voice file by ΔLi: playing in the first playing format increases it by ΔL1, the second by ΔL2, the third by ΔL3, the fourth by ΔL4, the fifth by ΔL5, the sixth by ΔL6, the seventh by ΔL7 and the eighth by ΔL8;
the data processing rate is corrected according to the byte-length increase:
if the byte-length increase is ΔL1, the first data processing rate m1 is corrected to m1′ = m1 × ΔL1/L;
if the byte-length increase is ΔL2, the second data processing rate m2 is corrected to m2′ = m2 × ΔL2/L;
if the byte-length increase is ΔL3, the third data processing rate m3 is corrected to m3′ = m3 × ΔL3/L;
if the byte-length increase is ΔL4, the fourth data processing rate m4 is corrected to m4′ = m4 × ΔL4/L;
where L = (ΔL1 + ΔL2 + … + ΔL8)/8.
Specifically, the embodiment of the invention corrects the data processing rate of the cloud according to the byte-length increase: different pre-playing formats produce different byte-length increases, and each data processing rate is corrected accordingly, further increasing the processing speed of actual voice data and the playback efficiency.
Specifically, the first playing format is the amr format, the second playing format the vol format, the third the evs format, the fourth the pcm format, the fifth the ghr format, the sixth the gfr format, the seventh the ehr format, and the eighth the efr format.
Specifically, if the first playing format is the amr format, the suffix of the voice file is .amr, denoting adaptive multi-rate coding and decoding; a .vol suffix denotes 4G VoLTE high-definition voice; a .evs suffix denotes 5G EVS voice; a .pcm suffix denotes A-law PCM (pulse code modulation); a .ghr suffix denotes GSM half-rate; a .gfr suffix denotes GSM full-rate; a .ehr suffix denotes enhanced GSM half-rate; and a .efr suffix denotes enhanced GSM full-rate.
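The suffix-to-format mapping above is a simple lookup; the sketch below is illustrative (note that "4G VoLTE" is an assumed expansion of the source's "4G Volt", and the function name is hypothetical).

```python
# Suffix -> description, per the mapping in the text.
PLAY_FORMATS = {
    "amr": "adaptive multi-rate codec",
    "vol": "4G VoLTE high-definition voice",
    "evs": "5G EVS voice",
    "pcm": "A-law PCM",
    "ghr": "GSM half-rate",
    "gfr": "GSM full-rate",
    "ehr": "enhanced GSM half-rate",
    "efr": "enhanced GSM full-rate",
}

def describe_voice_file(filename):
    """Look up the playing format from the file suffix (case-insensitive)."""
    suffix = filename.rsplit(".", 1)[-1].lower()
    return PLAY_FORMATS.get(suffix, "unknown format")
```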
Specifically, the voice frame comprises a first part, a second part and a third part. The first part is 2 bytes (16 bits): the high 4 bits represent the format of the voice frame and the remaining 12 bits represent the length of the voice frame, whose value is the sum of the lengths of the frame's three parts; this part is stored in host byte order. The second part is a 4-byte relative timestamp in network byte order, used to compute the time difference between two voice frames. The third part is the actual speech frame.
Specifically, for example: with a voice sampling rate of 8000 Hz and 50 frames per second, the timestamp difference between two adjacent voice frames is 8000 / 50 = 160.
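The worked example above is the sample rate divided by the frame rate; as a one-line sketch (illustrative name):

```python
def timestamp_step(sample_rate_hz, frames_per_second):
    """Relative-timestamp increment between adjacent voice frames,
    in samples: the sample rate divided by the frame rate."""
    return sample_rate_hz // frames_per_second
```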
Specifically, the file header is 6 bytes: the first two bytes represent the header length (default 6), in host byte order; the last four bytes are the voice start time in seconds, in host byte order. The format is shown in the following table:
Bytes 0-1: header length (host byte order, default 6)
Bytes 2-5: voice start time in seconds (host byte order)
Specifically, the file header in the embodiment of the invention makes voice information quick to locate, improving processing speed and the convenience of playing-format selection.
As shown in fig. 2, the voice processing method based on automatic selection of the voice decoding playing format is applied in a voice processing system comprising a cloud and a terminal side. The cloud stores the voice to be played; the terminal sends a playing request to the cloud; the cloud responds to the request and returns the voice information to the terminal; after reading the voice information, the terminal selects the playing format of the returned voice according to its content.
According to the embodiment of the invention, the playing format is selected according to the voice-information content in the cloud, namely the information in the file header and the voice frames. This makes the selection of the playing format more accurate; the selected format is used directly, without having to determine the actual format of the voice, which greatly improves the continuity of playback and the user's playing experience.
So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention; various modifications and alterations to this invention will become apparent to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (9)

1. A speech processing method based on automatic selection of speech decoding playing format is characterized by comprising the following steps: acquiring a voice to be played, wherein the voice to be played is stored to a cloud terminal in a frame format;
calling the voice to be played from the cloud end, and searching a file header and a voice frame of the voice to be played;
reading the information in the file header and the voice frame, and acquiring a playing format for playing the voice to be played according to the information in the file header and the voice frame;
playing the voice to be played in the acquired playing format;
the step of acquiring a voice to be played, wherein the voice to be played is stored to the cloud in a frame format, comprises the following steps:
acquiring the byte length l of the voice to be played;
a first byte length l1, a second byte length l2 and a third byte length l3 are preset in the processor, and the data processing rate of the cloud is selected according to the actual byte length of the voice to be played;
if the byte length l of the voice to be played is less than or equal to the first byte length l1, selecting a first data processing rate;
if the byte length l of the voice to be played is greater than the first byte length l1 and less than or equal to the second byte length l2, selecting a second data processing rate;
if the byte length l of the voice to be played is greater than the second byte length l2 and less than or equal to the third byte length l3, selecting a third data processing rate;
if the byte length l of the voice to be played is greater than the third byte length l3, selecting a fourth data processing rate;
the reading the information in the file header and the voice frame, and obtaining the playing format for playing the voice to be played according to the information in the file header and the voice frame comprises:
acquiring the n segments of voice formats contained in the voice to be played according to the information in the file header, the segments being the first segment n1, …, up to the nth segment, and determining the byte length z of each segment's voice format;
the byte length of the first segment n1 is z1, the byte length of the second segment is z2, the byte length of the third segment is z3, and the byte length of the nth segment is zn; the lengths z1-zn may be equal or different. Comparing the byte lengths z1-zn, selecting zi with the longest byte length and zj with the next longest byte length, and determining the fluency fi of playing the voice in the zi voice format and the fluency fj of playing the voice in the zj voice format under the current network;
if fi is greater than fj, playing the voice to be played by adopting a zi voice format;
if fi is less than fj, playing the voice to be played by adopting a voice format of zj;
and if fi = fj, playing the voice to be played by adopting a zi voice format or a zj voice format.
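The selection logic of claim 1 — pick a processing rate from the byte length, then compare the fluency of the two longest-byte-length formats — can be sketched as follows. The function names, threshold values and format names are illustrative assumptions, not part of the patent.

```python
def select_processing_rate(l: int, l1: int, l2: int, l3: int,
                           rates: tuple) -> float:
    """Pick the cloud data-processing rate from the byte length l.

    Thresholds l1 < l2 < l3 partition the length axis into four bands,
    one band per rate, as in claim 1 (sketch).
    """
    m1, m2, m3, m4 = rates
    if l <= l1:
        return m1
    if l <= l2:
        return m2
    if l <= l3:
        return m3
    return m4


def pick_playing_format(lengths: dict, fluency: dict):
    """Choose a playing format from per-format byte lengths (sketch).

    lengths maps format name -> byte length z; fluency maps format
    name -> fluency f under the current network.  The formats with the
    longest (zi) and next-longest (zj) byte lengths are compared and
    the one with the higher fluency wins; a tie goes to either (here zi).
    """
    ordered = sorted(lengths, key=lengths.get, reverse=True)
    zi, zj = ordered[0], ordered[1]    # longest and next-longest
    return zi if fluency[zi] >= fluency[zj] else zj
```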
2. The speech processing method based on automatically selecting a playback format for speech decoding according to claim 1,
determining the fluency f of playing the voice in any voice format under the current network comprises:
determining the current network condition: a first network condition indicates that the network condition is good and the transmission rate is high, with a value of 1; if the current network condition is average and the transmission rate is medium, it belongs to a second network condition, with a value of 2; if the current network condition is poor and the transmission rate is low, it belongs to a third network condition, with a value of 3; the expression of the fluency f is f = t × d0i, where t represents the value of the network condition and d0i represents the byte length increment corresponding to the i-th voice format.
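A minimal sketch of the fluency expression in claim 2; the function name is an illustrative assumption.

```python
def fluency(network_condition: int, d0i: float) -> float:
    """f = t * d0i, where t is the network-condition value (1 = good,
    2 = average, 3 = poor) and d0i is the byte length increment of the
    i-th voice format (sketch of the formula in claim 2)."""
    assert network_condition in (1, 2, 3)
    return network_condition * d0i
```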
3. The speech processing method based on automatically selecting the speech decoding playing format according to claim 2, wherein before retrieving the speech to be played from the cloud and searching for the file header and the speech frames of the speech to be played, the cloud responds to the retrieval command with a standard response speed v 0;
when the cloud responds to the calling instruction at the standard response speed v0, the response speed is corrected based on the actual network condition; a first correction coefficient k1, a second correction coefficient k2 and a third correction coefficient k3 are preset in the central control unit, with k1 > k2 > k3 ≥ 1: if the network condition belongs to the first network condition, the response speed is adjusted with the first correction coefficient, the cloud response speed becoming v10′ = v0 × k1;
if the network condition belongs to the second network condition, the response speed is adjusted with the second correction coefficient, the cloud response speed becoming v20′ = v0 × k2;
if the network condition belongs to the third network condition, the response speed is adjusted with the third correction coefficient, the cloud response speed becoming v30′ = v0 × k3.
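The response-speed correction of claim 3 can be sketched as below; the coefficient values are assumed example values (any values with k1 > k2 > k3 ≥ 1 satisfy the claim).

```python
def corrected_response_speed(v0: float, network_condition: int,
                             k: tuple = (1.5, 1.2, 1.0)) -> float:
    """v' = v0 * k_t for network condition t in {1, 2, 3}, with
    k1 > k2 > k3 >= 1 (sketch of claim 3; coefficients are assumed)."""
    k1, k2, k3 = k
    assert k1 > k2 > k3 >= 1
    return v0 * (k1, k2, k3)[network_condition - 1]
```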
4. The method of claim 3, wherein if the byte lengths of the n segments are ordered z1 > z2 > … > zn, the first correction coefficient k1 = z1/zn + (z4 + z5 + … + zn)/(z1 + z2 + … + zn);
the second correction coefficient k2 = z2/zn + (z4 + z5 + … + zn)/(z1 + z2 + … + zn);
the third correction coefficient k3 = z3/zn + (z4 + z5 + … + zn)/(z1 + z2 + … + zn).
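The three coefficient formulas of claim 4 share the tail term (z4 + … + zn)/(z1 + … + zn), so they can be computed together; the function name is an illustrative assumption.

```python
def correction_coefficients(z: list) -> tuple:
    """k_i = z_i / z_n + (z4 + ... + zn) / (z1 + ... + zn) for i = 1..3,
    with byte lengths ordered z1 > z2 > ... > zn (sketch of claim 4)."""
    assert len(z) >= 4 and all(a > b for a, b in zip(z, z[1:]))
    tail = sum(z[3:]) / sum(z)      # (z4 + ... + zn) / (z1 + ... + zn)
    return tuple(z[i] / z[-1] + tail for i in range(3))
```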
5. The speech processing method based on automatically selecting a playback format for speech decoding according to claim 1,
if the first playing format is used for playing, the byte length increment of the original voice file is ΔL1; if the second playing format is used, the increment is ΔL2; if the third, ΔL3; if the fourth, ΔL4; if the fifth, ΔL5; if the sixth, ΔL6; if the seventh, ΔL7; and if the eighth playing format is used for playing, the byte length increment of the original voice file is ΔL8;
and correcting the data processing rate according to the byte length increment.
6. The method of claim 5, wherein the modifying the data processing rate according to the byte length increment comprises:
if the byte length increment is ΔL1, the first data processing rate m1 is corrected to m1′ = m1 × ΔL1/L;
if the byte length increment is ΔL1, the second data processing rate m2 is corrected to m2′ = m2 × ΔL1/L;
if the byte length increment is ΔL1, the third data processing rate m3 is corrected to m3′ = m3 × ΔL1/L;
if the byte length increment is ΔL1, the fourth data processing rate m4 is corrected to m4′ = m4 × ΔL1/L;
wherein L = (ΔL1 + ΔL2 + ΔL3 + … + ΔL8)/8.
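The rate correction of claim 6 — scale each rate by ΔL1/L, with L the mean of the eight per-format increments — can be sketched as below. The function name is an illustrative assumption, and ΔL1 is used as the applicable increment per the claim text.

```python
def corrected_rates(rates: tuple, deltas: tuple) -> tuple:
    """m_i' = m_i * dL1 / L, with L = (dL1 + ... + dL8) / 8 the mean of
    the eight per-format byte length increments (sketch of claim 6)."""
    assert len(deltas) == 8
    L = sum(deltas) / 8
    return tuple(m * deltas[0] / L for m in rates)
```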
7. The speech processing method according to claim 5, wherein the current network is a local area network or the Internet, the first playback format is an amr format, the second playback format is a vol format, the third playback format is an evs format, the fourth playback format is a pcm format, the fifth playback format is a ghr format, the sixth playback format is a gfr format, the seventh playback format is a ehr format, and the eighth playback format is a efr format.
8. The speech processing method based on automatic selection of the speech decoding playing format according to claim 1, wherein the voice frame comprises a first part, a second part and a third part: the first part comprises 2 bytes (16 bits in total), of which the high 4 bits represent the format of the voice frame and the remaining 12 bits represent the length of the voice frame, whose value is the sum of the lengths of the 3 parts of the voice frame, in host byte order; the second part is a 4-byte relative timestamp in network byte order, used to calculate the time difference between two voice frames; and the third part is the actual speech frame data.
9. The speech processing method according to any one of claims 1-7, wherein the file header has 6 bytes in total: the first two bytes represent the header length (default 6, host byte order), and the last four bytes are the voice start time (host byte order, in seconds).
CN202110454832.0A 2021-04-26 2021-04-26 Speech processing method based on automatic selection of speech decoding playing format Active CN112863526B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110454832.0A CN112863526B (en) 2021-04-26 2021-04-26 Speech processing method based on automatic selection of speech decoding playing format


Publications (2)

Publication Number Publication Date
CN112863526A CN112863526A (en) 2021-05-28
CN112863526B true CN112863526B (en) 2021-07-16

Family

ID=75992925

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110454832.0A Active CN112863526B (en) 2021-04-26 2021-04-26 Speech processing method based on automatic selection of speech decoding playing format

Country Status (1)

Country Link
CN (1) CN112863526B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113573369A (en) * 2021-07-26 2021-10-29 申瓯通信设备有限公司 Integrated access equipment based on digital voice
CN113689854B (en) * 2021-08-12 2024-01-23 深圳追一科技有限公司 Voice conversation method, device, computer equipment and storage medium
CN113691595A (en) * 2021-08-12 2021-11-23 深圳追一科技有限公司 Interactive interface generation method and device, computer equipment and storage medium

Citations (3)

Publication number Priority date Publication date Assignee Title
CN101452723A (en) * 2008-10-16 2009-06-10 北京光线传媒有限公司 Media file playing method, playing system and media player
CN102440067A (en) * 2011-09-16 2012-05-02 华为终端有限公司 File read/write method and mobile terminal
CN109040777A (en) * 2018-08-17 2018-12-18 江苏华腾智能科技有限公司 A kind of Internet of Things broadcast audio transmission delay minishing method

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US20190036648A1 (en) * 2014-05-13 2019-01-31 Datomia Research Labs Ou Distributed secure data storage and transmission of streaming media content


Also Published As

Publication number Publication date
CN112863526A (en) 2021-05-28

Similar Documents

Publication Publication Date Title
CN112863526B (en) Speech processing method based on automatic selection of speech decoding playing format
US20020111812A1 (en) Method and apparatus for encoding and decoding pause informantion
KR102419595B1 (en) Playout delay adjustment method and Electronic apparatus thereof
US7764613B2 (en) Communication control method and system
EP1885130A2 (en) Method and apparatus for video telephony in portable terminal
CN106464942B (en) Downloading method and device of streaming media resource and terminal equipment
WO1995031055A1 (en) Method and apparatus for inserting signaling in a communication system
CN111343098B (en) Data interaction method and device, computer equipment and storage medium
CN111768790B (en) Method and device for transmitting voice data
US7912974B2 (en) Transmitting over a network
US8369456B2 (en) Data processing apparatus and method and encoding device
EP2077683A2 (en) Method, system and apparatus for updating phonebook information
EP2629298A2 (en) Method and apparatus for seeking a frame in multimedia contents
CN100589195C (en) Orientation playing method of MP3 files according to bit rate change
CN112511702B (en) Media frame pushing method, server, electronic equipment and storage medium
CN110913421B (en) Method and device for determining voice packet number
CN111381973B (en) Voice data processing method and device and computer readable storage medium
KR100668247B1 (en) Speech transmission system
EP2512052A1 (en) Method and device for determining in-band signalling decoding mode
US20060133344A1 (en) Method for reproducing AMR message in mobile telecommunication terminal
US20220416949A1 (en) Reception terminal and method
CN114710692B (en) Multimedia file processing method and device
KR100874023B1 (en) Method and apparatus for managing input data buffer of MP3 decoder
KR101060490B1 (en) Method and device for calculating average bitrate of a file of variable bitrate, and audio device comprising said device
JP3249012B2 (en) Audio coding device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant