WO2017206256A1

WO2017206256A1 - Method for automatically adjusting speaking speed and terminal

Info

Publication number: WO2017206256A1
Application number: PCT/CN2016/087741
Authority: WO
Inventors: 王晓军
Original assignee: 宇龙计算机通信科技(深圳)有限公司
Priority date: 2016-05-31
Filing date: 2016-06-29
Publication date: 2017-12-07
Also published as: CN105869626A; CN105869626B

Abstract

Disclosed in the present invention is a method for automatically adjusting speaking speed, comprising: acquiring inputted speech information; extracting the speech characteristic information of the speech information; querying from a speech database the playing speed of the speech information corresponding to the speech characteristic information; adjusting, according to the playing speed, the speed at which the speech information is played. It can be seen that the method can determine, according to the speech characteristic information of the speech information inputted in real time, a predetermined playing speed corresponding to the speech characteristic information, and adjust, according to the playing speed, the speaking speed of the inputted speech information, to accommodate the needs of various users; i.e. the present invention realizes the adaptive adjustment of the playing speed according to the content of the speech information, and can be used for a call, programmed playing, etc. with good adaptability. Further disclosed in the present invention is a terminal, which can adaptively adjust the playing speed according to the content of speech information.

Description

Method and terminal for automatically adjusting speech rate

Technical field

The present invention relates to the field of communications technologies, and in particular, to a method and terminal for automatically adjusting speech rate.

Background art

Because hearing ability differs from person to person, content played at the same speech rate will seem too fast for some listeners to follow, while others will find it so slow that they feel their time is being wasted. The speech rate of content played on a terminal therefore needs to be set according to the listener's actual needs.

In the prior art, a speech rate control is added to the client application on the user's mobile phone. The user selects a speech rate level, and the mobile phone plays the speech content at that level. This approach has the following disadvantages. First, although the speech rate is divided into several levels, the level must be preset manually and cannot be adjusted dynamically; that is, the speech rate cannot be adjusted adaptively. Second, the adjustment is limited to content played by the mobile client software, so the speech rate cannot be adjusted in real time during a call. Finally, the approach cannot adapt to different languages or adjust the speech rate according to the languages used by the two parties. How to adjust the speech rate adaptively is therefore a technical problem that a person skilled in the art needs to solve.

Summary of the invention

An object of the present invention is to provide a method and terminal for automatically adjusting speech rate, which can determine, according to the voice feature information of voice information input in real time, a predetermined playing speed corresponding to that voice feature information, and adjust the speech rate of the input voice information according to the playing speed, so that the playing speed is adaptively adjusted according to the content of the voice information.

In order to solve the above technical problem, the present invention provides a method for automatically adjusting speech rate, including:

Acquiring input voice information;

Extracting voice feature information of the voice information;

Querying, from the voice database, a play speed of the voice information corresponding to the voice feature information;

Adjusting the speed at which the voice information is played according to the playback speed.

The extracting the voice feature information of the voice information includes:

Identifying language feature information of the voice information; and/or,

Extracting at least one of speech rate information, feature word information, and audio information of the voice information.

The voice information is voice information of the local user, and the method further includes:

Acquiring physical information of the local user;

The querying, from the voice database, of the playing speed of the voice information corresponding to the voice feature information includes:

Querying, from the voice database, a playing speed of the voice information corresponding to the voice feature information and the physical information.

After the playing speed of the voice information corresponding to the voice feature information and the physical information is queried from the voice database, the method further includes:

Using the voice feature information and the physical information to update, according to a machine learning algorithm, the correspondence with the playing speeds stored in the voice database.

The adjusting of the speed at which the voice information is played according to the playing speed includes:

Resampling the digital signal of the voice information by interpolation or decimation, so that the time scale of the voice information is adjusted to reach the playing speed.

The invention also provides a terminal, comprising:

a voice information acquiring module, configured to obtain input voice information;

a voice feature extraction module, configured to extract voice feature information of the voice information;

a playing speed determining module, configured to query, from the voice database, a playing speed of the voice information corresponding to the voice feature information;

a playing speed adjustment module, configured to adjust the speed at which the voice information is played according to the playing speed.

The voice feature extraction module includes:

a first speech feature extraction unit, configured to identify language feature information of the voice information; and/or,

a second voice feature extraction unit, configured to extract at least one of speech rate information, feature word information, and audio information of the voice information.

The voice information is voice information of the local user, and the terminal further includes:

a physical information acquisition module, configured to acquire physical information of the local user.

The terminal further includes:

a machine learning module, configured to use the voice feature information and the physical information to update, according to a machine learning algorithm, the correspondence with the playing speeds stored in the voice database.

The playing speed adjustment module is specifically a module that resamples the digital signal of the voice information by interpolation or decimation, so that the time scale of the voice information is adjusted to reach the playing speed.

The method for automatically adjusting the speech rate provided by the present invention comprises: acquiring input voice information; extracting voice feature information of the voice information; querying, from the voice database, the playing speed of the voice information corresponding to the voice feature information; and adjusting the speed at which the voice information is played according to the playing speed.

It can be seen that the method can determine, according to the voice feature information of voice information input in real time, a predetermined playing speed corresponding to that voice feature information, and adjust the speech rate of the input voice information according to the playing speed to suit the needs of various users. That is, the playing speed is adaptively adjusted according to the content of the voice information, and the method can be used in situations such as user calls and program playback, which improves its adaptability. The present invention also provides a terminal, which has the same beneficial effects; the details are not repeated here.

Brief description of the drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only embodiments of the present invention, and those skilled in the art can obtain other drawings from the provided drawings without any creative work.

FIG. 1 is a flowchart of a method for automatically adjusting a speech rate according to an embodiment of the present invention;

FIG. 2 is a structural block diagram of a terminal according to an embodiment of the present invention;

FIG. 3 is a structural block diagram of another terminal according to an embodiment of the present invention;

FIG. 4 is a structural block diagram of still another terminal according to an embodiment of the present invention.

Detailed description

The core of the present invention is to provide a method and terminal for automatically adjusting the speech rate, which can determine, according to the voice feature information of voice information input in real time, a predetermined playing speed corresponding to that voice feature information, and adjust the speech rate of the input voice information according to the playing speed, so that the playing speed is adaptively adjusted according to the content of the voice information.

The technical solutions in the embodiments of the present invention are described clearly and completely below in conjunction with the drawings in the embodiments of the present invention. The described embodiments are only some of the embodiments of the invention, not all of them. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort fall within the scope of the present invention.

Please refer to FIG. 1. FIG. 1 is a flowchart of a method for automatically adjusting a speech rate according to an embodiment of the present invention; the execution subject in the embodiment is a terminal, and the terminal may be a mobile phone; the method may include:

S100. Acquire input voice information.

The voice information can be obtained by monitoring the call service and applications capable of voice playback. That is, the voice information may be that of the local user when making or answering a call, that of the peer user when making or answering a call, or voice information played by an application with a voice playing function.

S110. Extract voice feature information of the voice information.

The types of voice feature information to extract, and how many types, can be chosen according to the user's actual needs, as long as the speech rate of the voice information can be adjusted against a preset standard based on the extracted features, so that automatic speech rate adjustment is achieved. For example, the voice feature information here may include feature information such as emotion, language, phonetic features, speech rate, and intonation.

S120. Query, from a voice database, a playback speed of the voice information corresponding to the voice feature information.

Once the voice feature information to be extracted has been chosen, the user may preset a playing speed for each item of voice feature information, or have several items of voice feature information jointly determine one playing speed. The voice database may store these correspondences as a correspondence list or as a mapping table. The user can also modify, delete, or add correspondences saved in the voice database according to the actual situation, so that the playing speed associated with each item of voice feature information stays up to date and meets the user's actual needs.

Querying the voice database may further include comparing the value of the extracted voice feature information with the ranges defined for that feature in the voice database, determining which range the value falls in, and then taking the preset playing speed corresponding to that range. The user can modify both the ranges and the preset playing speed of each range according to actual needs, which accommodates personal preferences and improves the user experience.
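The range comparison described above can be sketched as a sorted-boundary lookup. This is only an illustration: the boundary values, the unit (syllables per second), the speed factors, and the names `RATE_BOUNDS`, `PRESET_SPEEDS`, and `query_playing_speed` are all assumptions, not values or identifiers from the application.

```python
from bisect import bisect_right

# Hypothetical voice-database entry for one feature (speech rate).
# RATE_BOUNDS holds the boundaries between ranges, in syllables per second;
# PRESET_SPEEDS holds one playing-speed factor per range (1.0 = unchanged).
RATE_BOUNDS = [3.0, 5.0, 7.0]            # ranges: <3, [3,5), [5,7), >=7
PRESET_SPEEDS = [1.2, 1.0, 0.9, 0.8]     # slow talkers sped up, fast talkers slowed


def query_playing_speed(speech_rate, bounds=RATE_BOUNDS, speeds=PRESET_SPEEDS):
    """Return the preset playing speed of the range that contains speech_rate."""
    if len(speeds) != len(bounds) + 1:
        raise ValueError("need exactly one speed per range")
    return speeds[bisect_right(bounds, speech_rate)]


speed = query_playing_speed(7.5)   # falls in the ">= 7" range
```

Because the ranges and speeds are plain lists, a user-facing settings screen could edit either one independently, which matches the text's point that both the ranges and their preset speeds are user-modifiable.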

S130. Adjust a speed of playing the voice information according to the playing speed.

The voice information is adjusted according to the obtained playing speed so that it is played at that speed. The specific adjustment method is not limited here, as long as the acquired voice information can be adjusted to the corresponding playing speed for playback. One specific speech rate adjustment process is as follows: the digital signal of the voice information is resampled by interpolation or decimation, which lengthens or shortens the time scale of the speech and thereby changes the speech rate to reach the playing speed.

For example, making calls is a basic and important function of a mobile phone. However, some people speak quickly and some people have poor hearing, which makes communication difficult. While the user is on a call, the method collects voice feature information such as the emotion, language, and phonetic features of the two parties from the acquired input voice information and compares it with the information in the voice database to make a judgment. If the speech rate is too fast, or there is abnormal feedback from the peer, the playing speed corresponding to that speech rate or to that abnormal feedback is determined, and the digital signal is resampled by interpolation or decimation to lengthen or shorten the time scale of the speech and thereby change the speech rate. In this way, the speed of the sound played from the earpiece is automatically adjusted according to factors such as the language used by the local user or the peer user and changes in their mood, to suit the needs of different groups of people.
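A minimal sketch of resampling by linear interpolation is given below, assuming the voice signal is a list of floating-point samples that will be played back at its original sample rate. The function name `resample` and the `speed` convention (values below 1 lengthen the speech, values above 1 shorten it) are assumptions made for illustration; the application does not specify an interpolation scheme.

```python
def resample(samples, speed):
    """Resample so the result plays `speed` times faster at the same sample rate.

    speed < 1 inserts interpolated samples (longer, slower speech);
    speed > 1 steps over input samples (shorter, faster speech).
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    if not samples:
        return []
    last = len(samples) - 1
    n_out = max(1, round(len(samples) / speed))
    out = []
    for i in range(n_out):
        pos = i * speed                  # fractional read position in the input
        lo = min(int(pos), last)
        hi = min(lo + 1, last)
        frac = pos - lo
        # Linear interpolation between the two neighbouring input samples.
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out


slowed = resample([0.0, 1.0, 2.0, 3.0], 0.5)            # twice as many samples
faster = resample([0.0, 1.0, 2.0, 3.0, 4.0, 5.0], 2.0)  # every second sample
```

Two caveats apply to this sketch. Plain resampling played at the original sample rate shifts the pitch along with the duration, and dropping samples without a low-pass filter first can introduce aliasing; a practical implementation would likely need to address both.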

Optionally, the learning and updating of the voice database is performed by using a machine learning algorithm.

The voice database is maintained in the terminal and can store the user's voice feature information parameters, so that the machine learning algorithm takes these parameters as input and learns from them to update the voice database. The adjustment can then follow the long-term usage habits of different user groups rather than only the original preset data, and so adapts better.

The specific implementation process of the above example can be as follows:

When the local user, that is, the calling user, is eager to express something or is emotional, and the words used in the voice information match the definition of "impatient" in the database, the speech rate of the acquired input voice information is reduced according to the playing speed corresponding to "impatient". This has a soothing effect and lets users make mobile phone calls more efficiently and amicably.

As another example, when the calling user speaks English, the language is determined to be English from the voice feature information, and the speech rate of the input voice information is adjusted according to the playing speed corresponding to English. After this adjustment, the called user, that is, the peer user, hears the slowed voice information, which to some extent eases the difficulty of communicating with a non-native speaker.

Based on the above technical solution, the method for automatically adjusting the speech rate according to the embodiment of the present invention can determine, according to the voice feature information of voice information input in real time, a predetermined playing speed corresponding to that voice feature information, and adjust the speech rate of the input voice information according to the playing speed to suit the needs of various users. That is, the playing speed is adaptively adjusted according to the content of the voice information, and the method can be used in situations such as user calls and program playback, which improves its adaptability. Different users can adapt the voice playing speed to their own needs, which enhances the user experience.

Based on the above embodiment, this embodiment can adaptively adjust the playing speed of the voice information according to the language type of the input voice information; that is, the playing speed can be adaptively adjusted according to the language type. Preferably, extracting the voice feature information of the voice information is specifically:

Identifying language feature information of the voice information.

The language feature information of the voice information may be obtained by recognizing the acquired input voice information. The language feature information may include audio parameters and feature word information, and the speed at which the voice information is played is determined from the preset playing speed corresponding to the language feature information. The user can set a playing speed for every language separately, set a playing speed for a predetermined number of languages only, or divide the languages into several categories and set a playing speed for each category. In the last case, the language feature information may be the category itself, or the category to which the recognized language belongs is determined first and the corresponding playing speed is then looked up. The correspondence between language and playing speed may be implemented as a correspondence list or a mapping table.
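The three configuration options above (per-language speeds, speeds for a few languages only, or per-category speeds) can be combined in one lookup with fallbacks. The sketch below is illustrative only: the language codes, category names, speed values, and the function `speed_for_language` are assumptions and do not come from the application.

```python
# Hypothetical mapping tables held in the voice database.
LANGUAGE_SPEED = {"en": 0.85}                      # speeds set for individual languages
LANGUAGE_CATEGORY = {"zh": "native", "en": "foreign", "fr": "foreign"}
CATEGORY_SPEED = {"native": 1.0, "foreign": 0.8}   # speeds set per category
DEFAULT_SPEED = 1.0                                # used when nothing is configured


def speed_for_language(language):
    """Look up the playing speed: per-language entry first, then category, then default."""
    if language in LANGUAGE_SPEED:
        return LANGUAGE_SPEED[language]
    category = LANGUAGE_CATEGORY.get(language)
    return CATEGORY_SPEED.get(category, DEFAULT_SPEED)


english_speed = speed_for_language("en")   # individual setting wins over the category
french_speed = speed_for_language("fr")    # falls back to the "foreign" category
```

Checking the per-language table before the category table is a design choice of this sketch; the text does not say which should take precedence when both are configured.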

Language feature information can be identified through a user language recognition system and a language text translation system, drawing on synthesized "reference speech" for each of the user's languages, Markov models based on segments and syllables, pitch contours, formant vectors, acoustic features, dialect-specific phoneme and prosodic features, and the acoustic characteristics of the original speech. The classification methods used may include HMMs, expert systems, clustering algorithms, secondary classification, and artificial neural networks.

This embodiment is described below through specific application scenarios:

When the input voice information comes from an application in the terminal, the acquired voice information is recognized. If the language feature information is determined to be English, the playing speed preset by the user for English is determined, and the speech rate of the voice information is adjusted to that playing speed. English is only an example.

When the user is in a call, the language of only the local user's voice information may be detected, or only that of the peer user, or that of both the local user and the peer user. The last case is described below as an example:

At the beginning, the mobile phone is in a normal call state, with the calling and called parties connected. The voice information acquiring module acquires the input voice information. The voice feature extraction module extracts the audio parameters and the keywords and sentences of both parties. The playing speed determining module parses the extracted audio parameters, queries the voice database, determines the language, and determines the playing speed preset by the user for that language. The playing speed adjustment module lengthens or shortens the voice information in time. The earpiece plays the processed voice information. Both parties hang up and the call is completed.

In this embodiment, users can judge how well they understand each language according to their actual situation and set the playing speed accordingly, which eases the difficulty of communicating with a non-native speaker.

Based on any of the above embodiments, this embodiment is mainly intended for voice calls between users, where one party may speak quickly, be emotionally agitated, and so on. So that users can communicate smoothly in these situations, the state of the user is determined from the voice feature information of the user's voice information, and the playing speed set for that state is determined; that is, the playing speed can be adaptively adjusted according to the user's speaking state. Preferably, extracting the voice feature information of the voice information is specifically:

Extracting at least one of speech rate information, feature word information, and audio information of the voice information.

Here, it is first necessary to determine which user state each item of voice feature information corresponds to or reflects, and what playing speed should be set for that state. The determination may be made from the speech rate information alone or from the feature word information alone; that is, the speech rate information, the feature word information, and the audio information may be used individually or in any combination.

When used alone, each type of voice feature information is divided into cases, and a playing speed is set for each case. Take speech rate information as an example: users usually speak too fast when they are in a hurry, so when the speech rate exceeds a certain value the user can be considered anxious, and the voice information is played at a predetermined speed. The speech rate can of course also be divided into several ranges, with a playing speed set for each range.

To improve the accuracy of the speech rate adjustment, the speech rate information, the feature word information, and the audio information are preferably used in combination; that is, the playing speed is determined from all three features together. For example, when users are in a hurry, they generally speak too fast, use certain specific words (users can configure the words they habitually use when in a hurry), and raise their voice. If all three conditions, or at least two of them, are present, the user can be considered impatient and the voice information is played at a predetermined speed.
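The "all three, or at least two" rule can be sketched as a simple vote across the three features. Everything concrete here is an assumption for illustration: the thresholds, the trigger words, the `level` field (a normalized stand-in for how raised the voice is), and the names `VoiceFeatures`, `is_impatient`, and `playing_speed`.

```python
from dataclasses import dataclass, field


@dataclass
class VoiceFeatures:
    speech_rate: float                           # e.g. syllables per second
    words: list = field(default_factory=list)    # recognized words
    level: float = 0.0                           # normalized voice level, 0.0 to 1.0


# Hypothetical user-configured words that signal impatience.
TRIGGER_WORDS = frozenset({"hurry", "quick", "now"})


def is_impatient(features, rate_limit=6.0, level_limit=0.8,
                 trigger_words=TRIGGER_WORDS, min_votes=2):
    """Count how many of the three indicators fire and compare with min_votes."""
    votes = (
        features.speech_rate > rate_limit,
        any(w.lower() in trigger_words for w in features.words),
        features.level > level_limit,
    )
    return sum(votes) >= min_votes


def playing_speed(features, calm_speed=0.8, normal_speed=1.0):
    """Play at the predetermined slower speed only when the user seems impatient."""
    return calm_speed if is_impatient(features) else normal_speed


speed = playing_speed(VoiceFeatures(7.0, ["Hurry", "up"], 0.5))  # two of three fire
```

Setting `min_votes=3` gives the stricter "all three" variant mentioned in the text, at the cost of missing cases where one indicator is unreliable.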

The speech rate information, the feature word information, and the audio information in this embodiment can also be used in combination with the language feature information. For example, one playing speed can be set for each speech rate range in English and another for each speech rate range in Chinese.

With the above embodiment, the speech rate can be adjusted adaptively according to the user's state, so that different users can change the voice playing speed according to their own needs, which enhances the user experience.

Based on any of the foregoing embodiments, this embodiment mainly aims to determine the state of the local user more accurately, determine the playing speed for the local user in that state, and adjust the playing speed according to the local user's speaking state. That is, the voice information is voice information of the local user, and the method may further include:

Acquiring physical information of the local user;

Correspondingly, querying the playing speed of the voice information corresponding to the voice feature information from the voice database includes:

Querying, from the voice database, a playing speed of the voice information corresponding to the voice feature information and the physical information.

The foregoing embodiment determines the state of the user from the speech rate information, the feature word information, and the audio information. To determine more accurately whether the local user is in that state, physical information of the local user may also be acquired. The physical information may include the local user's body temperature, pulse, and the like, and can be collected by a smart wearable device paired with the terminal, such as a smart bracelet.

For example, when the local user, that is, the calling user, is eager to express something or is emotional, the words and phrases used in the voice information match the definition of "impatient" in the database, and information such as the user's pulse is collected from the smart bracelet. If the user is thereby determined to be in an impatient state, the speech rate of the acquired input voice information is reduced according to the playing speed corresponding to "impatient". This has a soothing effect and lets users make mobile phone calls more efficiently and amicably. The specific process can be as follows:

The mobile phone is in a normal call state, with the calling and called parties connected. The user's voice information is collected, and information such as body temperature and pulse during the call is collected through the smart bracelet. The voice database is queried, and the user's body temperature, pulse changes, and keywords and sentences (that is, the feature word information) are used to determine whether the user is emotionally excited. Whether adjustment is needed is then determined based on the speech rate information. If the adjustment condition is satisfied, the new playing speed is determined from the preset value in the voice database. The voice information data is stretched or shortened in time. The earpiece plays the processed voice data. The user's emotional change information and characteristic sentences can be written into the voice database to optimize subsequent emotion judgments.
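The decision part of this process (emotion judgment from physical information and feature words, followed by a speech rate check) can be sketched in two small functions. The thresholds, the choice to require both a physical signal and a feature word, and the names `is_excited` and `decide_playing_speed` are assumptions for illustration; the application leaves the exact rule open.

```python
def is_excited(pulse_rise_bpm, temp_rise_c, words, trigger_words,
               pulse_limit=20.0, temp_limit=0.5):
    """Judge emotional excitement from physical changes plus feature words.

    pulse_rise_bpm and temp_rise_c are changes relative to the user's
    baseline, as reported by a wearable such as a smart bracelet.
    """
    body_signal = pulse_rise_bpm >= pulse_limit or temp_rise_c >= temp_limit
    word_signal = any(w in trigger_words for w in words)
    return body_signal and word_signal


def decide_playing_speed(excited, speech_rate, rate_limit=6.0,
                         calm_speed=0.8, normal_speed=1.0):
    """Adjust only when the user is excited and is also speaking too fast."""
    if excited and speech_rate > rate_limit:
        return calm_speed
    return normal_speed


excited = is_excited(25.0, 0.1, ["hurry", "please"], {"hurry", "now"})
new_speed = decide_playing_speed(excited, speech_rate=7.2)
```

The resulting factor would then be passed to whatever time-scale adjustment the terminal uses before the earpiece plays the audio.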

Based on any of the foregoing embodiments, this embodiment mainly improves the accuracy of the voice database. Therefore, the method further includes:

Using the voice feature information and the physical information to update, according to a machine learning algorithm, the correspondence with the playing speeds stored in the voice database.

The voice database is maintained in the terminal and can store the user's audio information parameters, which gives the terminal a learning function for speech rate adjustment. The adjustment can then follow the long-term usage habits of different user groups rather than only the original preset data, and so adapts better. With the learning function, the keywords used by the user, namely the feature word information, are continually updated to optimize subsequent judgments of the user's emotion.
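The application does not name a specific machine learning algorithm. As a deliberately simple stand-in, the sketch below updates the feature word set by counting words that recur in utterances already judged as excited; the class `FeatureWordLearner`, its `min_count` threshold, and the counting rule are all assumptions, not the method of the application.

```python
from collections import Counter


class FeatureWordLearner:
    """Frequency-based stand-in for the unspecified learning algorithm."""

    def __init__(self, feature_words=(), min_count=3):
        self.feature_words = set(feature_words)   # current feature word information
        self.min_count = min_count                # occurrences needed to promote a word
        self._counts = Counter()

    def observe(self, words, excited):
        """Record one utterance; only utterances judged excited contribute."""
        if not excited:
            return
        for word in set(words):                   # count each word once per utterance
            self._counts[word] += 1
            if self._counts[word] >= self.min_count:
                self.feature_words.add(word)


learner = FeatureWordLearner({"hurry"})
for _ in range(3):
    learner.observe(["now", "please"], excited=True)
learner.observe(["calm"], excited=False)
```

A counting rule like this will also promote common filler words, so a practical version would likely compare against word frequencies in calm speech before adding a word.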

Based on the above technical solution, the method for automatically adjusting the speech rate according to the embodiment of the present invention can determine, according to the voice feature information of voice information input in real time, a predetermined playing speed corresponding to that voice feature information, and adjust the speech rate of the input voice information according to the playing speed to suit the needs of various users. That is, the playing speed is adaptively adjusted according to the content of the voice information, and the method can be used in situations such as user calls and program playback, which improves its adaptability. Different users can adapt the voice playing speed to their own needs, which enhances the user experience.

The embodiment of the invention provides a method for automatically adjusting the speech rate, which can determine, according to the voice feature information of voice information input in real time, a predetermined playing speed corresponding to that voice feature information, and adjust the speech rate of the input voice information according to the playing speed.

The terminal provided by the embodiment of the present invention is introduced below, and the terminal described below and the method for automatically adjusting the speech rate described above can refer to each other.

Referring to FIG. 2, FIG. 2 is a structural block diagram of a terminal according to an embodiment of the present invention; the terminal may include:

The voice information acquiring module 100 is configured to acquire the input voice information.

The voice feature extraction module 200 is configured to extract voice feature information of the voice information;

The playing speed determining module 300 is configured to query, from the voice database, a playing speed of the voice information corresponding to the voice feature information;

The play speed adjustment module 400 is configured to adjust the speed of playing the voice information according to the play speed.
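The way modules 100 to 400 hand data to one another can be sketched as a small pipeline. The class name `Terminal`, the method `play_once`, the callable-based wiring, and the stub behaviours in the usage example are all hypothetical; they only illustrate the order of the four steps.

```python
from typing import Callable, List

Samples = List[float]


class Terminal:
    """Chains the four modules in the order the embodiment describes."""

    def __init__(self,
                 acquire: Callable[[], Samples],
                 extract: Callable[[Samples], dict],
                 query_speed: Callable[[dict], float],
                 adjust: Callable[[Samples, float], Samples]):
        self.acquire = acquire            # voice information acquiring module 100
        self.extract = extract            # voice feature extraction module 200
        self.query_speed = query_speed    # playing speed determining module 300
        self.adjust = adjust              # playing speed adjustment module 400

    def play_once(self) -> Samples:
        voice = self.acquire()
        features = self.extract(voice)
        speed = self.query_speed(features)
        return self.adjust(voice, speed)


# Stub modules, for illustration only: a slow speaker is played at double speed
# by keeping every second sample.
terminal = Terminal(
    acquire=lambda: [0.0, 1.0, 2.0, 3.0],
    extract=lambda voice: {"speech_rate": 2.5},
    query_speed=lambda features: 2.0 if features["speech_rate"] < 3.0 else 1.0,
    adjust=lambda voice, speed: voice[::int(speed)],
)
processed = terminal.play_once()
```

Passing the modules in as callables keeps each one replaceable, which fits the later statement that the functions may be realized in hardware, software, or a combination of both.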

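Optionally, the voice feature extraction module 200 includes:

a first voice feature extraction unit, configured to identify language feature information of the voice information; and/or,

a second voice feature extraction unit, configured to extract at least one of speech rate information, feature word information, and audio information of the voice information.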

Optionally, referring to FIG. 3, the voice information is voice information of the local user, and the terminal further includes:

The physical information acquisition module 500 is configured to acquire physical information of the local user.

The playing speed determining module 300 is specifically configured to query, from the voice database, a playing speed of the voice information corresponding to the voice feature information and the physical information.

Optionally, referring to FIG. 4, the terminal further includes:

The machine learning module 600 is configured to use the voice feature information and the physical information to update, according to a machine learning algorithm, the correspondence with the playing speeds stored in the voice database.

Optionally, the playing speed adjustment module 400 is specifically a module that resamples the digital signal of the voice information by interpolation or decimation, so that the time scale of the voice information is adjusted to reach the playing speed.

Based on any of the foregoing embodiments, the terminal may specifically be a mobile phone.

The various embodiments in the specification are described in a progressive manner, and each embodiment focuses on differences from other embodiments, and the same similar parts between the various embodiments may be referred to each other. For the device disclosed in the embodiment, since it corresponds to the method disclosed in the embodiment, the description is relatively simple, and the relevant parts can be referred to the method part.

A person skilled in the art will further appreciate that the elements and algorithm steps of the various examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software or a combination of both, in order to clearly illustrate the hardware and software. Interchangeability, the composition and steps of the various examples have been generally described in terms of function in the above description. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the solution. A person skilled in the art can use different methods for implementing the described functions for each particular application, but such implementation should not be considered to be beyond the scope of the present invention.

The steps of a method or algorithm described in connection with the embodiments disclosed herein can be implemented directly in hardware, in a software module executed by a processor, or in a combination of both. The software module can reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.

The method and terminal for automatically adjusting speech rate provided by the present invention have been described in detail above. The principles and embodiments of the present invention have been explained herein with reference to specific examples, and the description of the above embodiments is intended only to assist in understanding the method of the present invention and its core idea. It should be noted that those skilled in the art can make various modifications and changes to the present invention without departing from its spirit and scope.

Claims

1. A method for automatically adjusting speech rate, characterized by comprising:

obtaining input voice information;

extracting voice feature information of the voice information;

querying, from a voice database, a playing speed of the voice information corresponding to the voice feature information; and

adjusting the speed at which the voice information is played according to the playing speed.

2. The method for automatically adjusting speech rate according to claim 1, wherein extracting the voice feature information of the voice information comprises:

identifying language feature information of the voice information; and/or

extracting at least one of speech rate information, feature word information, and audio information of the voice information.

3. The method for automatically adjusting speech rate according to claim 1 or 2, wherein the voice information is voice information of a local user, the method further comprising:

obtaining physical condition information of the local user;

wherein querying, from the voice database, the playing speed of the voice information corresponding to the voice feature information comprises:

querying, from the voice database, a playing speed of the voice information corresponding to the voice feature information and the physical condition information.

4. The method for automatically adjusting speech rate according to claim 3, wherein querying, from the voice database, the playing speed of the voice information corresponding to the voice feature information and the physical condition information further comprises:

using the voice feature information and the physical condition information to update, according to a machine learning algorithm, their correspondence with playing speeds in the voice database.

5. The method for automatically adjusting speech rate according to claim 1, wherein adjusting the speed at which the voice information is played according to the playing speed comprises:

resampling the digital signal of the voice information by interpolation or cropping, and adjusting the time scale of the voice information to reach the playing speed.

6. A terminal, comprising:

a voice information acquiring module, configured to obtain input voice information;

a voice feature extraction module, configured to extract voice feature information of the voice information;

a playing speed determining module, configured to query, from a voice database, a playing speed of the voice information corresponding to the voice feature information; and

a play speed adjustment module, configured to adjust the speed at which the voice information is played according to the playing speed.

7. The terminal according to claim 6, wherein the voice feature extraction module comprises:

a first voice feature extraction unit, configured to identify language feature information of the voice information; and/or

a second voice feature extraction unit, configured to extract at least one of speech rate information, feature word information, and audio information of the voice information.

8. The terminal according to claim 6 or 7, wherein the voice information is voice information of a local user, and the terminal further comprises:

a physical information acquisition module, configured to obtain physical condition information of the local user.

9. The terminal according to claim 8, further comprising:

a machine learning module, configured to use the voice feature information and the physical condition information to update, according to a machine learning algorithm, their correspondence with playing speeds in the voice database.

10. The terminal according to claim 6, wherein the play speed adjustment module is specifically a module that resamples the digital signal of the voice information by interpolation or cropping and adjusts the time scale of the voice information to reach the playing speed.