CN112863526B - Speech processing method based on automatic selection of speech decoding playing format - Google Patents


Info

Publication number
CN112863526B
CN112863526B (application CN202110454832.0A)
Authority
CN
China
Prior art keywords
voice
format
playing
byte length
played
Prior art date
Legal status
Active
Application number
CN202110454832.0A
Other languages
Chinese (zh)
Other versions
CN112863526A
Inventor
Wang Xia (王霞)
Chen Yongci (陈永慈)
Shi Dongge (时东各)
Current Assignee
Beijing Jinganjia New Technology Co ltd
Original Assignee
Beijing Jinganjia New Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Jinganjia New Technology Co ltd
Priority to CN202110454832.0A
Publication of CN112863526A
Application granted
Publication of CN112863526B
Status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L 19/16 Vocoder architecture
    • G10L 19/167 Audio streaming, i.e. formatting and decoding of an encoded audio signal representation into a data stream for transmission or storage purposes
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/002 Dynamic bit allocation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L 19/16 Vocoder architecture
    • G10L 19/18 Vocoders using multiple modes
    • G10L 19/24 Variable rate codecs, e.g. for generating different qualities using a scalable representation such as hierarchical encoding or layered encoding
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00 Network arrangements or protocols for supporting network services or applications
    • H04L 67/01 Protocols
    • H04L 67/10 Protocols in which an application is distributed across nodes in the network
    • H04L 67/1097 Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]

Abstract

The invention relates to a speech processing method based on automatic selection of the speech decoding playing format, comprising the following steps: acquiring a voice to be played, where the voice is stored in the cloud in a frame format; retrieving the voice from the cloud and locating its file header and voice frames; reading the information in the file header and the voice frames, and determining from that information the playing format in which the voice is to be played; and playing the voice in that format. After the voice is decoded, the playing format is obtained from the information in the file header and the voice frames, and the voice is played in the corresponding format.

Description

Speech processing method based on automatic selection of speech decoding playing format
Technical Field
The invention relates to the technical field of data processing, in particular to a voice processing method based on automatic selection of a voice decoding playing format.
Background
In a mobile network such as a 4G or 5G network, a complete voice conversation generally contains speech in two or more formats. The existing voice processing method cuts a complete conversation into several voice files of different formats and then plays them one after another in sequence, each file being played in its own format.
However, because the voice is split into multiple files of different formats, the playing format must be switched while the files are played in sequence, and each switch degrades the fluency of playback.
Disclosure of Invention
Therefore, the invention provides a speech processing method based on automatic selection of the speech decoding playing format, which solves the problem of choppy playback caused by format switching during voice playback.
In order to achieve the above object, the present invention provides a speech processing method based on automatic selection of speech decoding playing format, which includes:
acquiring a voice to be played, wherein the voice to be played is stored to a cloud terminal in a frame format;
calling the voice to be played from the cloud end, and searching a file header and a voice frame of the voice to be played;
reading the information in the file header and the voice frame, and acquiring a playing format for playing the voice to be played according to the information in the file header and the voice frame;
playing the voice to be played in a playing format;
acquiring a voice to be played, wherein storing the voice to be played in the cloud in a frame format comprises the following steps:
acquiring the byte length l of the voice to be played;
a first byte length l1, a second byte length l2 and a third byte length l3 are preset in the processor, and the data processing rate of the cloud is selected according to the actual length of the voice to be played:
if l ≤ l1, the first data processing rate is selected;
if l1 < l ≤ l2, the second data processing rate is selected;
if l2 < l ≤ l3, the third data processing rate is selected;
if l > l3, the fourth data processing rate is selected.
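The rate-selection rule above can be sketched in a few lines of Python. This is purely illustrative: the patent contains no code, and the function name, threshold values and rate tuple are all hypothetical.

```python
def select_processing_rate(l, l1, l2, l3, rates):
    """Pick a cloud data-processing rate for a voice of byte length l.

    rates is (m1, m2, m3, m4); the thresholds satisfy l1 < l2 < l3.
    """
    if l <= l1:
        return rates[0]   # first data processing rate
    elif l <= l2:
        return rates[1]   # second data processing rate
    elif l <= l3:
        return rates[2]   # third data processing rate
    return rates[3]       # fourth data processing rate, for l > l3
```

The four branches are mutually exclusive and cover every non-negative byte length, so exactly one rate is always selected.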
Further, reading the information in the file header and the voice frames, and obtaining the playing format in which the voice is to be played according to that information, comprises:
obtaining, from the information in the file header, the n voice formats contained in the voice to be played, segmenting the voice into n segments and determining the byte length z of each segment;
the byte length of the first segment is z1, that of the second segment is z2, that of the third segment is z3, and that of the n-th segment is zn; the lengths z1 to zn may or may not be equal. The byte lengths z1 to zn are compared, the longest, zi, and the next-longest, zj, are selected, and the fluency fi of playing in the voice format of zi and the fluency fj of playing in the voice format of zj under the current network are determined;
if fi is greater than fj, playing the voice to be played by adopting a zi voice format;
if fi is less than fj, playing the voice to be played by adopting a voice format of zj;
and if fi = fj, playing the voice to be played by adopting a zi voice format or a zj voice format.
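The comparison of the two longest segments can be sketched as follows; this is an illustrative reading of the steps above, not code from the patent, and all names are hypothetical.

```python
def choose_play_format(segments, fluency):
    """segments: list of (voice_format, byte_length) pairs.
    fluency: callable giving the fluency of a format on the current network.

    Returns the format of the longest segment (zi) unless the format of
    the next-longest segment (zj) plays more fluently; a tie goes to zi,
    since the text allows either format in that case.
    """
    ordered = sorted(segments, key=lambda s: s[1], reverse=True)
    (fmt_i, _), (fmt_j, _) = ordered[0], ordered[1]
    return fmt_i if fluency(fmt_i) >= fluency(fmt_j) else fmt_j
```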
Further, determining the fluency f of playing the voice in any voice format under the current network comprises:
determining the current network condition: a first network condition indicates a good network with a high transmission rate and is assigned the value 1; a general network condition with a medium transmission rate is the second network condition and is assigned the value 2; a poor network condition with a low transmission rate is the third network condition and is assigned the value 3. The fluency f is given by f = t × d0i, where t is the assignment of the network condition and d0i is the standard byte-length increase corresponding to the i-th voice format.
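The fluency expression f = t × d0i can be written directly; the mapping from condition names to assignments follows the text, while the function and variable names are illustrative.

```python
# Assignment t for each network condition, as defined in the description:
# good network -> 1, medium -> 2, poor -> 3.
NETWORK_ASSIGNMENT = {"good": 1, "medium": 2, "poor": 3}

def fluency(network_condition, d0i):
    """f = t * d0i, where t is the network-condition assignment and d0i
    is the standard byte-length increase of the i-th voice format."""
    return NETWORK_ASSIGNMENT[network_condition] * d0i
```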
Further, before the voice to be played is retrieved from the cloud and its file header and voice frames are located, the cloud responds to the retrieval instruction at a standard response speed v0;
when the cloud responds to the retrieval instruction at the standard response speed v0, the response speed is corrected based on the actual network condition. A first correction coefficient k1, a second correction coefficient k2 and a third correction coefficient k3 are preset in the central control unit, with k1 > k2 > k3 ≥ 1. If the network condition on the network side belongs to the first network condition, the first correction coefficient is used and the response speed of the cloud is adjusted to v10′ = v0 × k1;
if the network condition on the network side belongs to the second network condition, the second correction coefficient is used and the response speed of the cloud is adjusted to v20′ = v0 × k2;
if the network condition on the network side belongs to the third network condition, the third correction coefficient is used and the response speed of the cloud is adjusted to v30′ = v0 × k3.
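The response-speed adjustment above is a single multiplication selected by the network condition. The sketch below uses hypothetical coefficient values chosen only to satisfy k1 > k2 > k3 ≥ 1; the patent does not fix them.

```python
def corrected_response_speed(v0, network_condition, k1=1.5, k2=1.2, k3=1.0):
    """Adjust the cloud's standard response speed v0 for the current
    network condition (1, 2 or 3): v' = v0 * k_t, with k1 > k2 > k3 >= 1."""
    coefficient = {1: k1, 2: k2, 3: k3}[network_condition]
    return v0 * coefficient
```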
Further, if the byte lengths of the n segments are ordered z1 > z2 > … > zn, the first correction coefficient is k1 = z1/zn + (z4 + z5 + … + zn)/(z1 + z2 + … + zn);
the second correction coefficient is k2 = z2/zn + (z4 + z5 + … + zn)/(z1 + z2 + … + zn);
the third correction coefficient is k3 = z3/zn + (z4 + z5 + … + zn)/(z1 + z2 + … + zn).
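The three coefficients share the same tail term, so they can be computed together; this sketch assumes at least four segments and descending order, as the formulas require, and the function name is illustrative.

```python
def correction_coefficients(z):
    """z: segment byte lengths sorted descending (z1 > z2 > ... > zn, n >= 4).

    Returns [k1, k2, k3] with ki = z[i-1]/zn + (z4 + ... + zn)/(z1 + ... + zn).
    """
    tail = sum(z[3:]) / sum(z)           # (z4 + ... + zn) / (z1 + ... + zn)
    return [z[i] / z[-1] + tail for i in range(3)]
```

With z = [8, 6, 4, 2] the tail term is 2/20 = 0.1, giving k1 = 4.1, k2 = 3.1, k3 = 2.1, which satisfies k1 > k2 > k3 ≥ 1 as the description requires.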
Further, playing in the i-th playing format increases the byte length of the original voice file by ΔLi: playing in the first playing format increases it by ΔL1, the second by ΔL2, the third by ΔL3, the fourth by ΔL4, the fifth by ΔL5, the sixth by ΔL6, the seventh by ΔL7 and the eighth by ΔL8;
the data processing rate is then corrected according to the byte-length increase.
Further, correcting the data processing rate according to the byte-length increase comprises:
if the byte-length increase is ΔL1, the first data processing rate m1 is corrected to m1′ = m1 × ΔL1/L;
if the byte-length increase is ΔL2, the second data processing rate m2 is corrected to m2′ = m2 × ΔL2/L;
if the byte-length increase is ΔL3, the third data processing rate m3 is corrected to m3′ = m3 × ΔL3/L;
if the byte-length increase is ΔL4, the fourth data processing rate m4 is corrected to m4′ = m4 × ΔL4/L;
where L = (ΔL1 + ΔL2 + … + ΔL8)/8.
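The rate correction is a scaling by the ratio of the actual byte-length increase to the mean increase L over the eight formats. A minimal sketch, with illustrative names:

```python
def corrected_rate(m, delta_l, deltas):
    """m' = m * ΔLi / L, where L is the mean of the eight per-format
    byte-length increases ΔL1..ΔL8 (passed in as `deltas`)."""
    L = sum(deltas) / len(deltas)
    return m * delta_l / L
```

Note that a format whose increase equals the mean leaves the rate unchanged, while an above-average increase scales the rate up proportionally.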
Further, the current network is a local area network or the Internet; the first playing format is the amr format, the second playing format the vol format, the third the evs format, the fourth the pcm format, the fifth the ghr format, the sixth the gfr format, the seventh the ehr format, and the eighth the efr format.
Further, the voice frame comprises a first part, a second part and a third part. The first part is 2 bytes (16 bits): the high 4 bits represent the format of the voice frame and the remaining 12 bits represent the length of the voice frame, whose value is the sum of the lengths of the frame's three parts; this part is stored in host byte order. The second part is a 4-byte relative timestamp in network byte order, used to compute the time difference between two voice frames. The third part is the actual speech frame.
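The bit layout of the first two parts can be parsed with Python's `struct` module; this is a sketch under the layout described above (the function name is illustrative, and "=" is used for host byte order, "!" for network byte order).

```python
import struct

def parse_frame_header(buf):
    """Parse the first two parts of a voice frame.

    Part 1 (2 bytes, host order): high 4 bits = frame format,
    low 12 bits = total frame length (sum of all three parts).
    Part 2 (4 bytes, network order): relative timestamp.
    Returns (frame_format, frame_length, timestamp).
    """
    (word,) = struct.unpack("=H", buf[:2])        # host byte order
    frame_format = (word >> 12) & 0xF
    frame_length = word & 0x0FFF
    (timestamp,) = struct.unpack("!I", buf[2:6])  # network (big-endian) order
    return frame_format, frame_length, timestamp
```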
Further, the file header is 6 bytes: the first two bytes represent the header length (default 6), in host byte order; the last four bytes are the voice start time in seconds, in host byte order.
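The 6-byte file header described above can be read in one `struct.unpack` call; the function name is illustrative.

```python
import struct

def parse_file_header(buf):
    """6-byte file header: bytes 0-1 = header length (host order,
    default 6), bytes 2-5 = voice start time in seconds (host order).
    "=" gives standard sizes (H = 2 bytes, I = 4 bytes) with no padding."""
    header_len, start_time = struct.unpack("=HI", buf[:6])
    return header_len, start_time
```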
Compared with the prior art, the invention has the advantage that, after the voice is decoded, the playing format is obtained from the information in the file header and the voice frames, and the voice is played in the corresponding format.
In particular, when the voice is stored, the data processing rate of the cloud is selected according to the byte length of the voice to be played, which allows the voice to be stored quickly and improves processing speed.
In particular, the voice formats contained in the voice to be played are enumerated and the byte length corresponding to each format is determined; playing in the format with the longest byte length speeds up playback, and reselecting the playing format according to the network state during playback improves processing speed, playback fluency and overall processing efficiency.
In particular, the fluency of playback is computed from the network condition and the voice format to be transmitted, which improves the fluency of voice-data transmission under different network states, increases transmission speed and improves the user experience.
In particular, the data processing rate of the cloud is corrected according to the byte-length increase: different pre-playing formats produce different byte-length increases, and each data processing rate is corrected accordingly, further increasing the processing speed of actual voice data and the playback efficiency.
In particular, the file header in the embodiment of the invention makes voice information quick to locate, improving processing speed and the convenience of playing-format selection.
Drawings
Fig. 1 is a schematic flowchart of a voice processing method based on automatically selecting a voice decoding playing format according to an embodiment of the present invention;
FIG. 2 is a block diagram of a speech processing system for automatically selecting a speech decoding playback format according to an embodiment of the present invention.
Detailed Description
In order that the objects and advantages of the invention will be more clearly understood, the invention is further described below with reference to examples; it should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood by those skilled in the art that these embodiments are only for explaining the technical principle of the present invention, and do not limit the scope of the present invention.
Referring to fig. 1, a voice processing method based on automatically selecting a voice decoding playing format according to an embodiment of the present invention includes:
s100, acquiring a voice to be played, wherein the voice to be played is stored to a cloud end in a frame format;
step S200, calling the voice to be played from the cloud, and searching a file header and a voice frame of the voice to be played;
step S300, reading the information in the file header and the voice frame, and acquiring a playing format for playing the voice to be played according to the information in the file header and the voice frame;
and S400, playing the voice to be played in a playing format.
Specifically, in the voice processing method based on automatic selection of the voice decoding playing format provided by the embodiment of the present invention, after the voice is decoded, the playing format is obtained from the information in the file header and the voice frames, and the voice is played in the corresponding format.
Specifically, acquiring the voice to be played and storing it in the cloud in a frame format includes:
acquiring the byte length l of the voice to be played;
a first byte length l1, a second byte length l2 and a third byte length l3 are preset in the processor, and the data processing rate of the cloud is selected according to the actual length of the voice to be played:
if l ≤ l1, the first data processing rate is selected;
if l1 < l ≤ l2, the second data processing rate is selected;
if l2 < l ≤ l3, the third data processing rate is selected;
if l > l3, the fourth data processing rate is selected.
Specifically, in the voice processing method based on automatic selection of the voice decoding playing format in the embodiment of the present invention, when the voice is stored, the data processing speed of the cloud is selected according to the byte length of the voice to be played, which is convenient for fast storage of the voice to be played, so as to improve the processing speed of the voice.
Specifically, reading the information in the file header and the voice frames, and obtaining the playing format in which the voice is to be played according to that information, includes:
obtaining, from the information in the file header, the n voice formats contained in the voice to be played, segmenting the voice into n segments and determining the byte length z of each segment;
the byte length of the first segment is z1, that of the second segment is z2, that of the third segment is z3, and that of the n-th segment is zn; the lengths z1 to zn may or may not be equal. The byte lengths z1 to zn are compared, the longest, zi, and the next-longest, zj, are selected, and the fluency fi of playing in the voice format of zi and the fluency fj of playing in the voice format of zj under the current network are determined;
if fi is greater than fj, playing the voice to be played by adopting a zi voice format;
if fi is less than fj, playing the voice to be played by adopting a voice format of zj;
if fi = fj, the voice to be played can be played in the zi voice format or the zj voice format.
Specifically, the voice formats contained in the voice to be played are enumerated and the byte length corresponding to each format is determined; playing in the format with the longest byte length speeds up playback, and reselecting the playing format according to the network state during playback improves processing speed, playback fluency and overall processing efficiency.
Specifically, determining the fluency f of playing the voice in an arbitrary voice format under the current network includes:
determining the current network condition: a first network condition indicates a good network with a high transmission rate and is assigned the value 1; a general network condition with a medium transmission rate is the second network condition and is assigned the value 2; a poor network condition with a low transmission rate is the third network condition and is assigned the value 3. The fluency f is given by f = t × d0i, where t is the assignment of the network condition and d0i is the standard byte-length increase corresponding to the i-th voice format.
Specifically, the embodiment of the invention computes the fluency of playback from the network condition and the voice format to be transmitted, which improves the fluency of voice-data transmission under different network states, increases transmission speed and improves the user experience.
Specifically, before the voice to be played is retrieved from the cloud and its file header and voice frames are located, the cloud responds to the retrieval instruction at a standard response speed v0;
when the cloud responds to the retrieval instruction at the standard response speed v0, the response speed is corrected based on the actual network condition. A first correction coefficient k1, a second correction coefficient k2 and a third correction coefficient k3 are preset in the central control unit, with k1 > k2 > k3 ≥ 1. If the network condition on the network side belongs to the first network condition, the first correction coefficient is used and the response speed of the cloud is adjusted to v10′ = v0 × k1;
if the network condition on the network side belongs to the second network condition, the second correction coefficient is used and the response speed of the cloud is adjusted to v20′ = v0 × k2;
if the network condition on the network side belongs to the third network condition, the third correction coefficient is used and the response speed of the cloud is adjusted to v30′ = v0 × k3.
Specifically, the embodiment of the invention adjusts the response speed of the cloud according to the network condition on the network side so that the two match: when the network condition is good, the response speed is increased and voice is transmitted while conditions are good, which ensures stable voice transmission and good playback.
Specifically, if the byte lengths of the n segments are ordered z1 > z2 > … > zn, the first correction coefficient is k1 = z1/zn + (z4 + z5 + … + zn)/(z1 + z2 + … + zn);
the second correction coefficient is k2 = z2/zn + (z4 + z5 + … + zn)/(z1 + z2 + … + zn);
the third correction coefficient is k3 = z3/zn + (z4 + z5 + … + zn)/(z1 + z2 + … + zn).
Specifically, the correction coefficients in the embodiment of the present invention are positively correlated with the byte lengths of the n segments, which makes the response-speed correction more accurate and ensures efficient voice transmission.
Specifically, the current network is a local area network or the internet.
Specifically, the voice file in the embodiment of the present invention may carry voice data from a local area network or from the Internet. Handling both kinds of voice data increases the processing speed of voice data in the network and adapts the method to more usage scenarios, improving its applicability and compatibility.
Specifically, playing in the i-th playing format increases the byte length of the original voice file by ΔLi: playing in the first playing format increases it by ΔL1, the second by ΔL2, the third by ΔL3, the fourth by ΔL4, the fifth by ΔL5, the sixth by ΔL6, the seventh by ΔL7 and the eighth by ΔL8;
the data processing rate is corrected according to the byte-length increase:
if the byte-length increase is ΔL1, the first data processing rate m1 is corrected to m1′ = m1 × ΔL1/L;
if the byte-length increase is ΔL2, the second data processing rate m2 is corrected to m2′ = m2 × ΔL2/L;
if the byte-length increase is ΔL3, the third data processing rate m3 is corrected to m3′ = m3 × ΔL3/L;
if the byte-length increase is ΔL4, the fourth data processing rate m4 is corrected to m4′ = m4 × ΔL4/L;
where L = (ΔL1 + ΔL2 + … + ΔL8)/8.
Specifically, the embodiment of the invention corrects the data processing rate of the cloud according to the byte-length increase: different pre-playing formats produce different byte-length increases, and each data processing rate is corrected accordingly, further increasing the processing speed of actual voice data and the playback efficiency.
Specifically, the first playing format is the amr format, the second playing format the vol format, the third the evs format, the fourth the pcm format, the fifth the ghr format, the sixth the gfr format, the seventh the ehr format, and the eighth the efr format.
Specifically, if the first playing format is the amr format, the suffix of the voice file is .amr, denoting adaptive multi-rate coding and decoding; a .vol suffix denotes 4G VoLTE high-definition voice; a .evs suffix denotes 5G EVS voice; a .pcm suffix denotes A-law PCM (pulse code modulation); a .ghr suffix denotes GSM half-rate; a .gfr suffix denotes GSM full-rate; a .ehr suffix denotes enhanced GSM half-rate; and a .efr suffix denotes enhanced GSM full-rate.
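The suffix-to-format mapping above is a simple lookup; the sketch below is illustrative (note that "4G VoLTE" is an assumed expansion of the source's "4G Volt", and the function name is hypothetical).

```python
# Suffix -> description, per the mapping in the text.
PLAY_FORMATS = {
    "amr": "adaptive multi-rate codec",
    "vol": "4G VoLTE high-definition voice",
    "evs": "5G EVS voice",
    "pcm": "A-law PCM",
    "ghr": "GSM half-rate",
    "gfr": "GSM full-rate",
    "ehr": "enhanced GSM half-rate",
    "efr": "enhanced GSM full-rate",
}

def describe_voice_file(filename):
    """Look up the playing format from the file suffix (case-insensitive)."""
    suffix = filename.rsplit(".", 1)[-1].lower()
    return PLAY_FORMATS.get(suffix, "unknown format")
```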
Specifically, the voice frame comprises a first part, a second part and a third part. The first part is 2 bytes (16 bits): the high 4 bits represent the format of the voice frame and the remaining 12 bits represent the length of the voice frame, whose value is the sum of the lengths of the frame's three parts; this part is stored in host byte order. The second part is a 4-byte relative timestamp in network byte order, used to compute the time difference between two voice frames. The third part is the actual speech frame.
Specifically, for example: with a voice sampling rate of 8000 Hz and 50 frames per second, the timestamp difference between two adjacent voice frames is 8000 / 50 = 160.
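The worked example above is the sample rate divided by the frame rate; as a one-line sketch (illustrative name):

```python
def timestamp_step(sample_rate_hz, frames_per_second):
    """Relative-timestamp increment between adjacent voice frames,
    in samples: the sample rate divided by the frame rate."""
    return sample_rate_hz // frames_per_second
```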
Specifically, the file header is 6 bytes: the first two bytes represent the header length (default 6), in host byte order; the last four bytes are the voice start time in seconds, in host byte order. The format is shown in the following table:
Bytes 0-1: header length (host byte order, default 6)
Bytes 2-5: voice start time in seconds (host byte order)
Specifically, the file header in the embodiment of the invention makes voice information quick to locate, improving processing speed and the convenience of playing-format selection.
As shown in fig. 2, the voice processing method based on automatic selection of the voice decoding playing format is applied in a voice processing system comprising a cloud and a terminal side. The cloud stores the voice to be played; the terminal sends a playing request to the cloud; the cloud responds to the request and returns the voice information to the terminal; after reading the voice information, the terminal selects the playing format of the returned voice according to its content.
According to the embodiment of the invention, the playing format is selected according to the voice-information content in the cloud, namely the information in the file header and the voice frames. This makes the selection of the playing format more accurate; the selected format is used directly, without having to determine the actual format of the voice, which greatly improves the continuity of playback and the user's playing experience.
So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention; various modifications and alterations to this invention will become apparent to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (9)

1. A speech processing method based on automatic selection of speech decoding playing format is characterized by comprising the following steps: acquiring a voice to be played, wherein the voice to be played is stored to a cloud terminal in a frame format;
calling the voice to be played from the cloud end, and searching a file header and a voice frame of the voice to be played;
reading the information in the file header and the voice frame, and acquiring a playing format for playing the voice to be played according to the information in the file header and the voice frame;
playing the voice to be played in the acquired playing format;
the step of acquiring a voice to be played, wherein the voice to be played is stored to the cloud in a frame format, comprises the following steps:
acquiring the byte length l of the voice to be played;
a first byte length l1, a second byte length l2 and a third byte length l3 are preset in the processor, and the data processing rate of the cloud is selected according to the actual byte length of the voice to be played;
if the byte length l of the voice to be played is less than or equal to the first byte length l1, selecting a first data processing rate;
if the byte length l of the voice to be played is greater than the first byte length l1 and less than or equal to the second byte length l2, selecting a second data processing rate;
if the byte length l of the voice to be played is greater than the second byte length l2 and less than or equal to the third byte length l3, selecting a third data processing rate;
if the byte length l of the voice to be played is greater than the third byte length l3, selecting a fourth data processing rate;
the reading the information in the file header and the voice frame, and obtaining the playing format for playing the voice to be played according to the information in the file header and the voice frame comprises:
acquiring the n segments of voice formats contained in the voice to be played according to the information in the file header, the segments being the first segment n1, …, up to the nth segment, and determining the byte length z of each segment's voice format;
the byte length of the first segment n1 is z1, the byte length of the second segment is z2, the byte length of the third segment is z3, and the byte length of the nth segment is zn; the lengths z1-zn may be equal or different. Comparing the byte lengths z1-zn, selecting zi with the longest byte length and zj with the next longest byte length, and determining the fluency fi of playing the voice in the zi voice format and the fluency fj of playing the voice in the zj voice format under the current network;
if fi is greater than fj, playing the voice to be played by adopting a zi voice format;
if fi is less than fj, playing the voice to be played by adopting a voice format of zj;
and if fi = fj, playing the voice to be played by adopting a zi voice format or a zj voice format.
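The selection logic of claim 1 — pick a processing rate from the byte length, then compare the fluency of the two longest-byte-length formats — can be sketched as follows. The function names, threshold values and format names are illustrative assumptions, not part of the patent.

```python
def select_processing_rate(l: int, l1: int, l2: int, l3: int,
                           rates: tuple) -> float:
    """Pick the cloud data-processing rate from the byte length l.

    Thresholds l1 < l2 < l3 partition the length axis into four bands,
    one band per rate, as in claim 1 (sketch).
    """
    m1, m2, m3, m4 = rates
    if l <= l1:
        return m1
    if l <= l2:
        return m2
    if l <= l3:
        return m3
    return m4


def pick_playing_format(lengths: dict, fluency: dict):
    """Choose a playing format from per-format byte lengths (sketch).

    lengths maps format name -> byte length z; fluency maps format
    name -> fluency f under the current network.  The formats with the
    longest (zi) and next-longest (zj) byte lengths are compared and
    the one with the higher fluency wins; a tie goes to either (here zi).
    """
    ordered = sorted(lengths, key=lengths.get, reverse=True)
    zi, zj = ordered[0], ordered[1]    # longest and next-longest
    return zi if fluency[zi] >= fluency[zj] else zj
```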
2. The speech processing method based on automatically selecting a playback format for speech decoding according to claim 1,
determining the fluency f of playing the voice in any voice format under the current network comprises:
determining the current network condition: a first network condition indicates that the network condition is good and the transmission rate is high, with a value of 1; if the current network condition is average and the transmission rate is medium, it belongs to a second network condition, with a value of 2; if the current network condition is poor and the transmission rate is low, it belongs to a third network condition, with a value of 3; the expression of the fluency f is f = t × d0i, where t represents the value of the network condition and d0i represents the byte length increment corresponding to the i-th voice format.
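A minimal sketch of the fluency expression in claim 2; the function name is an illustrative assumption.

```python
def fluency(network_condition: int, d0i: float) -> float:
    """f = t * d0i, where t is the network-condition value (1 = good,
    2 = average, 3 = poor) and d0i is the byte length increment of the
    i-th voice format (sketch of the formula in claim 2)."""
    assert network_condition in (1, 2, 3)
    return network_condition * d0i
```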
3. The speech processing method based on automatically selecting the speech decoding playing format according to claim 2, wherein before retrieving the speech to be played from the cloud and searching for the file header and the speech frames of the speech to be played, the cloud responds to the retrieval command with a standard response speed v 0;
when the cloud responds to the calling instruction at the standard response speed v0, the response speed is corrected based on the actual network condition; a first correction coefficient k1, a second correction coefficient k2 and a third correction coefficient k3 are preset in the central control unit, with k1 > k2 > k3 ≥ 1: if the network condition belongs to the first network condition, the response speed is adjusted with the first correction coefficient, the cloud response speed becoming v10′ = v0 × k1;
if the network condition belongs to the second network condition, the response speed is adjusted with the second correction coefficient, the cloud response speed becoming v20′ = v0 × k2;
if the network condition belongs to the third network condition, the response speed is adjusted with the third correction coefficient, the cloud response speed becoming v30′ = v0 × k3.
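The response-speed correction of claim 3 can be sketched as below; the coefficient values are assumed example values (any values with k1 > k2 > k3 ≥ 1 satisfy the claim).

```python
def corrected_response_speed(v0: float, network_condition: int,
                             k: tuple = (1.5, 1.2, 1.0)) -> float:
    """v' = v0 * k_t for network condition t in {1, 2, 3}, with
    k1 > k2 > k3 >= 1 (sketch of claim 3; coefficients are assumed)."""
    k1, k2, k3 = k
    assert k1 > k2 > k3 >= 1
    return v0 * (k1, k2, k3)[network_condition - 1]
```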
4. The method of claim 3, wherein if the byte lengths of the n segments are ordered z1 > z2 > … > zn, the first correction coefficient k1 = z1/zn + (z4 + z5 + … + zn)/(z1 + z2 + … + zn);
the second correction coefficient k2 = z2/zn + (z4 + z5 + … + zn)/(z1 + z2 + … + zn);
the third correction coefficient k3 = z3/zn + (z4 + z5 + … + zn)/(z1 + z2 + … + zn).
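The three coefficient formulas of claim 4 share the tail term (z4 + … + zn)/(z1 + … + zn), so they can be computed together; the function name is an illustrative assumption.

```python
def correction_coefficients(z: list) -> tuple:
    """k_i = z_i / z_n + (z4 + ... + zn) / (z1 + ... + zn) for i = 1..3,
    with byte lengths ordered z1 > z2 > ... > zn (sketch of claim 4)."""
    assert len(z) >= 4 and all(a > b for a, b in zip(z, z[1:]))
    tail = sum(z[3:]) / sum(z)      # (z4 + ... + zn) / (z1 + ... + zn)
    return tuple(z[i] / z[-1] + tail for i in range(3))
```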
5. The speech processing method based on automatically selecting a playback format for speech decoding according to claim 1,
if the first playing format is used for playing, the byte length increment of the original voice file is ΔL1; if the second playing format is used, the increment is ΔL2; if the third, ΔL3; if the fourth, ΔL4; if the fifth, ΔL5; if the sixth, ΔL6; if the seventh, ΔL7; and if the eighth playing format is used for playing, the byte length increment of the original voice file is ΔL8;
and correcting the data processing rate according to the byte length increment.
6. The method of claim 5, wherein the modifying the data processing rate according to the byte length increment comprises:
if the byte length increment is ΔL1, the first data processing rate m1 is corrected to m1′ = m1 × ΔL1/L;
if the byte length increment is ΔL1, the second data processing rate m2 is corrected to m2′ = m2 × ΔL1/L;
if the byte length increment is ΔL1, the third data processing rate m3 is corrected to m3′ = m3 × ΔL1/L;
if the byte length increment is ΔL1, the fourth data processing rate m4 is corrected to m4′ = m4 × ΔL1/L;
wherein L = (ΔL1 + ΔL2 + ΔL3 + … + ΔL8)/8.
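The rate correction of claim 6 — scale each rate by ΔL1/L, with L the mean of the eight per-format increments — can be sketched as below. The function name is an illustrative assumption, and ΔL1 is used as the applicable increment per the claim text.

```python
def corrected_rates(rates: tuple, deltas: tuple) -> tuple:
    """m_i' = m_i * dL1 / L, with L = (dL1 + ... + dL8) / 8 the mean of
    the eight per-format byte length increments (sketch of claim 6)."""
    assert len(deltas) == 8
    L = sum(deltas) / 8
    return tuple(m * deltas[0] / L for m in rates)
```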
7. The speech processing method according to claim 5, wherein the current network is a local area network or the Internet, the first playback format is an amr format, the second playback format is a vol format, the third playback format is an evs format, the fourth playback format is a pcm format, the fifth playback format is a ghr format, the sixth playback format is a gfr format, the seventh playback format is a ehr format, and the eighth playback format is a efr format.
8. The speech processing method based on automatic selection of the speech decoding playing format according to claim 1, wherein the voice frame comprises a first part, a second part and a third part: the first part comprises 2 bytes (16 bits in total), of which the high 4 bits represent the format of the voice frame and the remaining 12 bits represent the length of the voice frame, whose value is the sum of the lengths of the 3 parts of the voice frame, in host byte order; the second part is a 4-byte relative timestamp in network byte order, used to calculate the time difference between two voice frames; and the third part is the actual speech frame data.
9. The speech processing method according to any one of claims 1-7, wherein the file header has 6 bytes in total: the first two bytes represent the header length (default 6, host byte order), and the last four bytes are the voice start time (host byte order, in seconds).
CN202110454832.0A 2021-04-26 2021-04-26 Speech processing method based on automatic selection of speech decoding playing format Active CN112863526B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110454832.0A CN112863526B (en) 2021-04-26 2021-04-26 Speech processing method based on automatic selection of speech decoding playing format


Publications (2)

Publication Number Publication Date
CN112863526A CN112863526A (en) 2021-05-28
CN112863526B true CN112863526B (en) 2021-07-16

Family

ID=75992925

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110454832.0A Active CN112863526B (en) 2021-04-26 2021-04-26 Speech processing method based on automatic selection of speech decoding playing format

Country Status (1)

Country Link
CN (1) CN112863526B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113573369A (en) * 2021-07-26 2021-10-29 申瓯通信设备有限公司 Integrated access equipment based on digital voice
CN113689854B (en) * 2021-08-12 2024-01-23 深圳追一科技有限公司 Voice conversation method, device, computer equipment and storage medium
CN113691595A (en) * 2021-08-12 2021-11-23 深圳追一科技有限公司 Interactive interface generation method and device, computer equipment and storage medium

Citations (3)

Publication number Priority date Publication date Assignee Title
CN101452723A (en) * 2008-10-16 2009-06-10 北京光线传媒有限公司 Media file playing method, playing system and media player
CN102440067A (en) * 2011-09-16 2012-05-02 华为终端有限公司 File read/write method and mobile terminal
CN109040777A (en) * 2018-08-17 2018-12-18 江苏华腾智能科技有限公司 A kind of Internet of Things broadcast audio transmission delay minishing method

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US20190036648A1 (en) * 2014-05-13 2019-01-31 Datomia Research Labs Ou Distributed secure data storage and transmission of streaming media content


Also Published As

Publication number Publication date
CN112863526A (en) 2021-05-28

Similar Documents

Publication Publication Date Title
CN112863526B (en) Speech processing method based on automatic selection of speech decoding playing format
US20020111812A1 (en) Method and apparatus for encoding and decoding pause informantion
KR102419595B1 (en) Playout delay adjustment method and Electronic apparatus thereof
US7764613B2 (en) Communication control method and system
EP1885130A2 (en) Method and apparatus for video telephony in portable terminal
CN106464942B (en) Downloading method and device of streaming media resource and terminal equipment
WO1995031055A1 (en) Method and apparatus for inserting signaling in a communication system
CN111343098B (en) Data interaction method and device, computer equipment and storage medium
CN111768790B (en) Method and device for transmitting voice data
US7912974B2 (en) Transmitting over a network
US8369456B2 (en) Data processing apparatus and method and encoding device
EP2077683A2 (en) Method, system and apparatus for updating phonebook information
EP2629298A2 (en) Method and apparatus for seeking a frame in multimedia contents
CN100589195C (en) Orientation playing method of MP3 files according to bit rate change
CN112511702B (en) Media frame pushing method, server, electronic equipment and storage medium
CN110913421B (en) Method and device for determining voice packet number
CN111381973B (en) Voice data processing method and device and computer readable storage medium
KR100668247B1 (en) Speech transmission system
EP2512052A1 (en) Method and device for determining in-band signalling decoding mode
US20060133344A1 (en) Method for reproducing AMR message in mobile telecommunication terminal
US20220416949A1 (en) Reception terminal and method
CN114710692B (en) Multimedia file processing method and device
KR100874023B1 (en) Method and apparatus for managing input data buffer of MP3 decoder
KR101060490B1 (en) Method and device for calculating average bitrate of a file of variable bitrate, and audio device comprising said device
JP3249012B2 (en) Audio coding device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant