CN108597522B - Voice processing method and device - Google Patents


Info

Publication number: CN108597522B (application CN201810443395.0A)
Authority: CN (China)
Prior art keywords: voice, content, model, display content, target display
Legal status: Active (the status listed is an assumption, not a legal conclusion; Google has not performed a legal analysis)
Application number: CN201810443395.0A
Original language: Chinese (zh)
Other versions: CN108597522A
Inventors: 王睿宇, 段效晨, 余景逸
Current and original assignee: Beijing QIYI Century Science and Technology Co Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Application filed by Beijing QIYI Century Science and Technology Co Ltd
Priority to CN201810443395.0A
Publication of application CN108597522A; application granted and published as CN108597522B


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The embodiment of the invention provides a voice processing method and a voice processing device, wherein the method comprises the following steps: acquiring voice content at a preset voice input entry; determining a voice processing model arranged at the browser end; converting the voice content into target display content through the voice processing model; and displaying the target display content in a preset display area. Because the voice processing model is determined at the browser end and the browser end performs the conversion of the voice content, the conversion places no pressure on the server, and a user can publish voice comment content in the browser.

Description

Voice processing method and device
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a method and an apparatus for processing speech.
Background
With the development of society, people can comment on interesting videos, texts, pictures and the like on the internet.
In the prior art, audio files occupy a large amount of storage space, so if a user makes a voice comment, the voice file must first be converted into text by the server; the text is then stored on the server, and the text comment is displayed in the browser.
However, in studying the above technical solution, the inventors found the following disadvantage: the server must perform a voice-file conversion every time a user publishes a voice comment, and the number of voice comments is usually large, which puts great pressure on the server. For this reason, a client browser usually provides the user only with an input entry for text comments and sets no voice comment entry, so the user cannot publish comments by voice input in the browser.
Disclosure of Invention
The embodiment of the invention provides a voice processing method and a voice processing device, which solve the problem that a user cannot publish comments by voice input because voice comments place excessive pressure on the server.
According to a first aspect of the present invention, there is provided a speech processing method applied to a browser, the method comprising:
acquiring voice content at a preset voice input entry;
determining a voice processing model arranged at a browser end;
converting the voice content into target display content through the voice processing model;
and displaying the target display content in a preset display area.
According to a second aspect of the present invention, there is provided a speech processing apparatus applied to a browser side, the apparatus comprising:
the voice content acquisition module is used for acquiring voice content at a preset voice input entry;
the voice processing model determining module is used for determining a voice processing model arranged at the browser end;
the target display content conversion module is used for converting the voice content into target display content through the voice processing model;
and the target display content display module is used for displaying the target display content in a preset display area.
The embodiment of the invention has the following advantages: the voice processing model is determined at the browser end, and the browser end performs the conversion of the voice content, so the conversion places no pressure on the server, and a user can publish voice comment content in the browser. Specifically, a voice input entry is preset in the browser; after voice content is acquired at the entry, a voice processing model arranged at the browser end is determined, the voice content is converted into target display content, and the target display content is displayed in a preset display area of the browser. The server does not need to convert the voice content, so server pressure is reduced.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a flow chart of a method for processing speech according to an embodiment of the present invention;
FIG. 2a is a detailed flowchart of a speech processing method according to an embodiment of the present invention;
FIG. 2b is a diagram of a display interface provided by an embodiment of the present invention;
fig. 3 is a block diagram of a speech processing apparatus according to an embodiment of the present invention;
fig. 4 is a detailed block diagram of a speech processing apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit it; they are only some, not all, of the possible embodiments.
Example one
Referring to fig. 1, a flow chart of a method of speech processing is shown.
It can be understood that the embodiment of the present invention may be applied to a browser end, and the browser end may specifically be a client provided with a browser. A browser is an application that displays files from a web server or file system and lets the user interact with those files. It is used to display text, images and other information on the world wide web or a local area network. The text or images can contain hyperlinks to other websites, so the user can browse various information quickly and easily. The client may be a computer or another electronic device with a GPU, and the embodiment of the present invention is not limited in this respect.
The method specifically comprises the following steps:
step 101: and acquiring voice content at a preset voice input entrance.
In the embodiment of the invention, a voice input entry can be set up in advance in the user interface of the browser, for example by adding a script or a control; the voice input entry can access a recording device such as the microphone of the client on which the browser runs. When the user triggers the voice input entry and inputs voice content through it, the voice content input by the user is acquired at the voice input entry.
Step 102: and determining a voice processing model arranged at the browser end.
In the embodiment of the invention, the voice processing model can be arranged at the server side, with the browser calling the preset voice processing model from the server; alternatively, the preset voice processing model can be arranged at the client where the browser is located, with the browser calling it from the client.
In practice, the speech processing model may be a speech recognition model. Specifically, the speech recognition model may be created by:
firstly, voice of a user reading sample is collected at a client side, and a user voice sample is obtained. The sample for the user to read can be a static sample, such as a Chinese phonetic table, an English alphabet, a numeral table, an easily-confused word table and the like; the samples presented to the user may also be dynamic samples, e.g. containing speech content that the user has been misrecognized, such as syllables of confusing pronunciation, mispronunciations, etc.
Then, the server extracts the characteristics of the collected user voice sample, and creates a voice recognition model according to the extracted characteristics. Of course, if a more optimized speech recognition model is to be obtained, the above steps of creating the speech recognition model may be repeated, and after a plurality of training, the more optimized speech recognition model is selected.
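The "repeat training and keep the best model" step above can be sketched in JavaScript (the browser-side language suggested by the Keras.js import module of the second embodiment); `selectBestModel` and the caller-supplied score function are illustrative names, not part of the patent:

```javascript
// Hypothetical selection of the best of several trained candidate models,
// using a scoring function (e.g. validation accuracy) supplied by the caller.
function selectBestModel(candidates, score) {
  let best = null;
  let bestScore = -Infinity;
  for (const model of candidates) {
    const s = score(model);
    if (s > bestScore) {
      best = model;
      bestScore = s;
    }
  }
  return best;
}
```

For example, after three rounds of training produced models with accuracies 0.80, 0.92 and 0.85, `selectBestModel` would keep the second one.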
In practical applications, the speech processing model may also be a speech emotion analysis model. Specifically, the speech emotion analysis model can be created by the following method:
firstly, acquiring a large number of voice files as training samples, extracting voice emotion characteristics of the voice files, and forming a voice emotion characteristic vector; the speech emotional characteristics comprise short-time zero crossing rate, short-time energy, fundamental tone frequency, formant, harmonic noise ratio and the like.
And secondly, classifying the voice emotion feature vectors by a voice emotion classifier. The determined emotion categories can include anger, happiness, sadness, surprise, disgust, fear, peace and silence, and the like.
And finally, creating a speech emotion analysis model according to the judgment result. Of course, if a more optimized speech emotion analysis model is to be obtained, the above steps of creating the speech emotion analysis model may be repeated, and after a plurality of training, the more optimized speech emotion analysis model is selected.
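Two of the short-time features named above, the zero-crossing rate and the short-time energy, can be sketched in JavaScript for a single analysis frame of audio samples; the function names are illustrative, not code from the patent:

```javascript
// Short-time zero-crossing rate: fraction of adjacent sample pairs within
// one analysis frame whose signs differ.
function zeroCrossingRate(frame) {
  let crossings = 0;
  for (let i = 1; i < frame.length; i++) {
    if ((frame[i - 1] >= 0) !== (frame[i] >= 0)) crossings++;
  }
  return crossings / (frame.length - 1);
}

// Short-time energy: mean squared amplitude of the frame.
function shortTimeEnergy(frame) {
  let sum = 0;
  for (const s of frame) sum += s * s;
  return sum / frame.length;
}

// A per-frame feature vector could then concatenate such features
// (a full vector would also include pitch, formants, etc.).
function frameFeatures(frame) {
  return [zeroCrossingRate(frame), shortTimeEnergy(frame)];
}
```

An alternating frame such as `[1, -1, 1, -1]` has a zero-crossing rate of 1, while a constant-sign frame has a rate of 0.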
It is to be understood that the voice processing model, and its training method, may also be chosen by a person skilled in the art according to the actual application scenario, for example training the speech recognition model with the artificial intelligence learning system Keras together with an LSTM (Long Short-Term Memory, a recurrent neural network for temporal data); the embodiment of the present invention is not limited in this respect.
Step 103: and converting the voice content into target display content through the voice processing model.
In the embodiment of the invention, the voice processing model can convert the voice content into target display content such as text, a color, a picture, or an expression, which partially or completely reflects the voice content input by the user. It can be understood that text, colors, pictures, expressions and the like occupy much less storage space than a voice file, so fewer storage resources are used.
Step 104: and displaying the target display content in a preset display area.
In this embodiment of the present invention, the preset display area may be a comment area of the browser user interface. For example, if the browser interface contains an area for playing a video or displaying news, the preset display area may be any of the areas above, below, to the left of, or to the right of that area, and target display contents may be displayed there one by one.
In summary, in the embodiment of the present invention, the voice processing model is determined at the browser end, and the browser end converts the voice content, so the conversion places no pressure on the server, and the user can publish voice comment content in the browser. Specifically, a voice input entry is preset in the browser; after voice content is acquired at the entry, a voice processing model arranged at the browser end is determined, the voice content is converted into target display content, and the target display content is displayed in a preset display area of the browser. The server does not need to convert the voice content, so server pressure is reduced.
Example two
Referring to fig. 2a, a specific flowchart of a speech processing method is shown, which is applied to a browser, and specifically includes the following steps:
step 201: and acquiring voice content at a preset voice input entrance.
Step 202: obtain a model configuration file through a model import module; the browser is provided with the model import module.
In the embodiment of the invention, the voice processing model is pre-stored at the server side, and the browser is provided with a model import module. Through the model import module, the browser can determine a model configuration file with which the voice processing model stored at the server side can be imported into the client where the browser is located. When voice content input by the user through the voice input entry is acquired, the browser starts the model import module and determines the model configuration file.
Preferably, the model import module is built on the model-construction framework Keras.js.
In a specific application, Keras is an application programming interface (API) that can be written in Python. Keras.js can run independently in the browser background and perform large amounts of computation, so a model import module built on the Keras.js framework has the advantages of high operating efficiency and ease of implementation.
Step 203: and importing the voice processing model of the server side into the browser side according to the model configuration file.
In the embodiment of the invention, after the voice processing model of the server side has been imported, according to the model configuration file, into the client where the browser is located, the voice content can be analyzed and processed at the client, which reduces the computational pressure on the server.
Step 204: and saving the voice processing model in the browser end.
In the embodiment of the invention, if a user inputs voice content with the browser, the voice processing model is imported into the client of that browser and saved at the browser end. The next time the user inputs voice content with this browser, the voice processing model can be invoked directly at the client instead of being imported from the server again; the steps of determining the model configuration file and importing the voice processing model are skipped, which improves the efficiency of voice processing.
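The save-and-reuse behavior described above can be sketched as a small client-side cache; `getSpeechModel` and the `importFromServer` loader callback are hypothetical names, not APIs from the patent or from Keras.js:

```javascript
// Hypothetical cache of imported speech-processing models, keyed by name.
const modelCache = new Map();

// Returns the cached model if present; otherwise fetches the model
// configuration and imports the model via the supplied loader (e.g. a
// Keras.js-based import module), then caches it for subsequent inputs.
async function getSpeechModel(name, importFromServer) {
  if (modelCache.has(name)) {
    return modelCache.get(name); // skip config lookup and import entirely
  }
  const model = await importFromServer(name);
  modelCache.set(name, model);
  return model;
}
```

On the second call with the same name, the loader is never invoked, which is exactly the saving described in the text.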
Step 205: and determining a voice processing model arranged at the browser end.
It can be understood that, in the embodiment of the present invention, steps 202 to 204 may be executed after step 201: after the user's voice content is obtained at the browser end, the voice processing model is fetched from the server and established at the browser end, and step 205 then determines the voice processing model at the browser end. Steps 202 to 204 may also be executed before step 201: the voice processing model of the server is fetched in advance and established at the browser end, so that after the user's voice content is acquired, step 205 can determine the voice processing model at the browser end directly.
Step 206: and converting the voice content into target display content through the voice processing model.
As a preferred solution of the embodiment of the present invention, the speech processing model includes: a speech recognition model and/or a speech emotion analysis model; the target display content comprises: textual content, and/or emotional display content.
When the voice processing model is a voice recognition model, the steps of determining the voice processing model arranged at the browser end and converting the voice content into target display content through the voice processing model comprise:
determining a voice recognition model arranged at the browser end; and converting the voice content into text content through the voice recognition model, and determining the text content as target display content.
In the embodiment of the invention, the voice content is converted into the text content through the voice recognition model, and the text content is determined as the target display content to be displayed on the browser user interface. The user using the browser can know the voice content of the specific content issued by the user through the text content.
When the voice processing model is a voice emotion analysis model, the steps of determining the voice processing model arranged at the browser end and converting the voice content into target display content through the voice processing model comprise:
determining a voice emotion analysis model arranged at the browser end; analyzing the emotion type of the voice content through the voice emotion analysis model; obtaining emotion display content corresponding to the emotion types, and determining the emotion display content as target display content.
In the embodiment of the invention, the emotion type of the voice content is analyzed by the voice emotion analysis model. For example, if the user speaks in an angry tone, say complaining that some work was done carelessly, the user's emotion type can be analyzed as anger; if the user speaks in a happy tone, say expressing satisfaction, the emotion type can be analyzed as happy, and so on.
In a specific application, the correspondence between emotion types and emotion display content can be determined in advance. For example, if the emotion display content is a color, the emotion type "anger" may correspond to red, "happy" to green, "sad" to blue, and so on. After the emotion type of the voice content is determined, the emotion display content corresponding to that emotion type is found through this correspondence and used as the target display content.
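The predetermined correspondence given as an example above can be expressed as a simple lookup table; the names and the fallback color below are assumptions for illustration only:

```javascript
// Hypothetical mapping from emotion type to a display color, following the
// example correspondences in the text (anger: red, happy: green, sad: blue).
const EMOTION_COLORS = {
  anger: 'red',
  happy: 'green',
  sad: 'blue',
};

// Resolve the emotion display content for a classified emotion type,
// falling back to a neutral color for unmapped categories.
function emotionDisplayContent(emotionType, fallback = 'gray') {
  return EMOTION_COLORS[emotionType] ?? fallback;
}
```

A real implementation might map to expressions or pictures instead of colors, as the next paragraph notes.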
In a specific application, the emotion display content may include: one or more of background color, expression or picture, for example, the emotion display content can be expression representing emotion types of happiness, anger, sadness and the like; or a picture representing an emotional type of happiness, anger, sadness, etc.; combinations of expressions and colors representing emotional types such as happy, angry, and sad, etc., and the embodiment of the present invention is not particularly limited thereto. Through the display of the emotion display content, a user using the browser can know what emotion type of voice content is issued by the user.
Step 207: and displaying the target display content in a preset display area.
In the embodiment of the invention, the preset display area, such as a comment area, may display only the text content converted by the voice recognition model, so the user learns the specific content of the comment from the text; or it may display only the emotion display content determined by the voice emotion analysis model, such as a color, expression, or picture, so the user learns the emotion of the commenter from the emotion display content. The embodiment of the present invention does not specifically limit this.
As a preferred scheme of the embodiment of the invention, the voice processing model comprises a voice recognition model and a voice emotion analysis model at the same time.
Taking a color as the target display content of the speech emotion analysis model as an example, the color can be used as the background of the text content. As shown in fig. 2b, the preset display area is a comment area for displaying comments. After the user inputs speech in the voice input area, the voice content is processed by both the voice recognition model and the voice emotion analysis model: the text content of the voice content is displayed in the comment area, and the emotion type of the voice content is used as the background color of that text. A user can thus see at a glance both the comment content and the emotion type of the commenter, which increases the interest and intuitiveness of the comment area.
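Combining the two models' outputs as described, the recognized text with the emotion color as its background could be rendered roughly as follows; this is a hypothetical sketch, not the patent's implementation:

```javascript
// Combine the two models' outputs into one comment entry: the recognized
// text plus the emotion color as its background.
function buildComment(textContent, emotionColor) {
  return {
    text: textContent,
    style: { backgroundColor: emotionColor },
  };
}

// Minimal render of a comment entry to an HTML string for the comment area.
function renderComment(comment) {
  return `<div style="background-color:${comment.style.backgroundColor}">` +
         `${comment.text}</div>`;
}
```

In a browser, the resulting string (or an equivalently styled DOM node) would be appended to the comment area element.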
Step 208: and sending the target display content to a server side so that the server side stores the target display content.
In the embodiment of the invention, the target display content is sent to the server side, and the server side stores the target display content, so that the target display content can be displayed for a long time in a browser which uses the server side to provide service support.
Preferably, after step 208, the speech processing model may also be deleted from the client.
In the embodiment of the invention, the voice processing model can be deleted from the client after the target display content is displayed, avoiding occupation of client resources.
In the embodiment of the invention, the voice processing model is determined at the browser end, and the browser end performs the conversion of the voice content, so the conversion places no pressure on the server, and the user can publish voice comment content in the browser. Specifically, a voice input entry is preset in the browser; after voice content is acquired at the entry, a voice processing model arranged at the browser end is determined, the voice content is converted into target display content, and the target display content is displayed in a preset display area of the browser. The server does not need to convert the voice content, so server pressure is reduced.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.
Example three
Referring to fig. 3, a block diagram of a speech processing apparatus, which is applied to a browser end, is shown, and specifically may include:
the voice content obtaining module 310 is configured to obtain voice content at a preset voice input entry.
And the voice processing model determining module 320 is configured to determine a voice processing model provided at the browser end.
And a target display content conversion module 330, configured to convert the voice content into a target display content through the voice processing model.
And a target display content displaying module 340, configured to display the target display content in a preset display area.
Preferably, referring to fig. 4, on the basis of fig. 3, the speech processing model comprises: a speech recognition model and/or a speech emotion analysis model; the target display content comprises: text content, and/or emotion display content;
the voice processing model determining module 320 and the target display content converting module 330 include:
the voice recognition model determining submodule is used for determining a voice recognition model arranged at the browser end;
the text content conversion sub-module is used for converting the voice content into text content through the voice recognition model and determining the text content as target display content;
and/or the presence of a gas in the gas,
the voice emotion analysis model determining submodule is used for determining a voice emotion analysis model arranged at the browser end;
the emotion type analysis submodule is used for analyzing the emotion type of the voice content through the voice emotion analysis model;
and the emotion display content acquisition submodule is used for acquiring emotion display content corresponding to the emotion type and determining the emotion display content as target display content.
Preferably, a model import module is arranged in the browser;
the device further comprises:
and the model configuration file determining module 360 is used for obtaining the model configuration file through the model importing module.
And an importing module 370, configured to import the voice processing model of the server side into the browser side according to the model configuration file.
Preferably, the method further comprises the following steps:
and the storage module is used for storing the voice processing model in the browser end.
Preferably, the model import module is built on the model-construction framework Keras.js; the device further comprises:
a sending module 350, configured to send the target display content to a server, so that the server stores the target display content.
In the embodiment of the invention, the voice processing model is determined at the browser end, and the browser end performs the conversion of the voice content, so the conversion places no pressure on the server, and the user can publish voice comment content in the browser. Specifically, a voice input entry is preset in the browser; after the voice content obtaining module 310 obtains the voice content through the voice input entry, the voice processing model determining module 320 determines the preset voice processing model, the target display content conversion module 330 converts the voice content into the target display content, and the target display content display module 340 displays the target display content in the preset display area of the browser.
For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
In a typical configuration, the computer device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory. The memory may include volatile memory in a computer-readable medium, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium. Computer-readable media, both permanent and non-permanent, removable and non-removable, can store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable speech processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable speech processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable speech processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable speech processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it should also be noted that, herein, relational terms such as first and second are used solely to distinguish one entity or action from another, and do not necessarily require or imply any actual such relationship or order between those entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements includes not only those elements but may also include other elements not expressly listed, or elements inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in the process, method, article, or terminal that comprises the element.
The foregoing describes a speech processing method and a speech processing apparatus in detail. Specific examples are used herein to explain the principles and embodiments of the present invention, and the descriptions of the foregoing embodiments are only intended to help understand the method and its core idea. Meanwhile, a person skilled in the art may, following the idea of the present invention, make changes to the specific embodiments and the application scope. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (6)

1. A method of speech processing, the method comprising:
acquiring voice content at a preset voice input inlet;
determining a voice processing model arranged at a browser end;
converting the voice content into target display content through the voice processing model;
displaying the target display content in a preset display area;
the browser end is provided with a model importing module; the voice processing model is stored in a server side in advance;
before the determining the voice processing model provided on the browser end, the method further comprises:
when voice content input by a user through the voice input inlet is obtained, starting a model import module of the browser end, and obtaining a model configuration file through the model import module; importing a voice processing model of a server side into the browser side according to the model configuration file;
the speech processing model comprises: a speech recognition model and a speech emotion analysis model; the target display content comprises: text content and emotion display content;
the steps of determining a voice processing model arranged at the browser end and converting the voice content into target display content through the voice processing model comprise:
determining a voice recognition model arranged at the browser end;
converting the voice content into text content through the voice recognition model, and determining the text content as target display content;
and,
determining a voice emotion analysis model arranged at the browser end;
analyzing the emotion type of the voice content through the voice emotion analysis model;
obtaining emotion display content corresponding to the emotion types, and determining the emotion display content as target display content.
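The flow of claim 1 — fetching a model configuration file, importing the server side's models into the browser end, and converting voice content into both kinds of target display content — can be sketched as below. All names and the stub inference logic are hypothetical illustrations, not from the patent; a real browser end might, for example, load TensorFlow.js models from the configured URLs.

```javascript
// Hypothetical model configuration file as the server side might serve it:
// it tells the browser-end import module where each model lives.
const modelConfig = {
  speechRecognition: { url: "/models/asr/model.json" },
  emotionAnalysis: { url: "/models/emotion/model.json" },
};

// Import step (stubbed): a real browser end would fetch and instantiate
// the models named in modelConfig; simple functions stand in for them here.
function importModels(config) {
  return {
    // Speech recognition model: voice content -> text content.
    recognize: (voice) => voice.samplesAsText,
    // Speech emotion analysis model: voice content -> emotion type.
    analyzeEmotion: (voice) => (voice.energy > 0.5 ? "excited" : "calm"),
  };
}

// Emotion display content corresponding to each emotion type.
const emotionDisplayContent = { excited: "!!", calm: ":)" };

// Conversion step of claim 1: produce both kinds of target display content.
function convertToTargetDisplayContent(voice, models) {
  return {
    textContent: models.recognize(voice),
    emotionContent: emotionDisplayContent[models.analyzeEmotion(voice)],
  };
}

const models = importModels(modelConfig);
const voiceContent = { samplesAsText: "great episode", energy: 0.9 };
const display = convertToTargetDisplayContent(voiceContent, models);
// display.textContent === "great episode", display.emotionContent === "!!"
```

The two models run independently over the same voice content, which is why the claim lists their steps in parallel rather than as a pipeline.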
2. The method of claim 1, further comprising:
and saving the voice processing model in the browser end.
3. The method of claim 1, wherein after the step of converting the speech content into the target display content by the speech processing model, further comprising:
and sending the target display content to a server side so that the server side stores the target display content.
4. A speech processing apparatus, characterized in that the apparatus comprises:
the voice content acquisition module is used for acquiring voice content at a preset voice input inlet;
the voice processing model determining module is used for determining a preset voice processing model arranged at the browser end;
the target display content conversion module is used for converting the voice content into target display content through the voice processing model;
the target display content display module is used for displaying the target display content in a preset display area;
the browser end is provided with a model importing module; the voice processing model is stored in a server side in advance;
the device further comprises:
the model configuration file determining module is used for starting the model importing module of the browser end when the voice content input by the user through the voice input inlet is obtained, and obtaining a model configuration file through the model importing module;
the import module is used for importing the voice processing model of the server end into the browser end according to the model configuration file;
the speech processing model comprises: a speech recognition model and a speech emotion analysis model; the target display content comprises: text content and emotion display content;
the voice processing model determining module and the target display content converting module comprise:
the voice recognition model determining submodule is used for determining a voice recognition model arranged at the browser end;
the text content conversion sub-module is used for converting the voice content into text content through the voice recognition model and determining the text content as target display content;
and,
the voice emotion analysis model determining submodule is used for determining a voice emotion analysis model arranged at the browser end;
the emotion type analysis submodule is used for analyzing the emotion type of the voice content through the voice emotion analysis model;
and the emotion display content acquisition submodule is used for acquiring emotion display content corresponding to the emotion type and determining the emotion display content as target display content.
5. The apparatus of claim 4, further comprising:
and the storage module is used for storing the voice processing model in the browser end.
6. The apparatus of claim 4, further comprising:
and the sending module is used for sending the target display content to a server so that the server stores the target display content.
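The persistence steps in claims 2, 3, 5, and 6 — saving the imported model at the browser end and sending the target display content to the server side for storage — might be sketched as follows. The storage interfaces are hypothetical stand-ins (a Map for a browser-side cache such as IndexedDB, an array for the server-side store); none of the names come from the patent.

```javascript
// Browser-end model cache (claims 2/5): a real page might persist the
// imported model in IndexedDB so later visits skip the server round trip.
const browserModelCache = new Map();

function saveModelInBrowser(name, model) {
  browserModelCache.set(name, model);
}

function loadCachedModel(name) {
  return browserModelCache.has(name) ? browserModelCache.get(name) : null;
}

// Server-side store for target display content (claims 3/6): a real page
// would POST the content, e.g. with fetch(), after conversion finishes.
const serverStore = [];

function sendToServer(targetDisplayContent) {
  serverStore.push(targetDisplayContent);
  return serverStore.length; // hypothetical acknowledgement: stored count
}

saveModelInBrowser("speechRecognition", { kind: "stub model" });
const ack = sendToServer({ textContent: "nice", emotionContent: ":)" });
// loadCachedModel("speechRecognition") returns the stub; ack === 1
```

Caching the model browser-side and archiving the display content server-side are independent optimizations, which is why the patent claims them separately rather than as one step.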
CN201810443395.0A 2018-05-10 2018-05-10 Voice processing method and device Active CN108597522B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810443395.0A CN108597522B (en) 2018-05-10 2018-05-10 Voice processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810443395.0A CN108597522B (en) 2018-05-10 2018-05-10 Voice processing method and device

Publications (2)

Publication Number Publication Date
CN108597522A CN108597522A (en) 2018-09-28
CN108597522B CN108597522B (en) 2021-10-15

Family

ID=63637016

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810443395.0A Active CN108597522B (en) 2018-05-10 2018-05-10 Voice processing method and device

Country Status (1)

Country Link
CN (1) CN108597522B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111354362A (en) * 2020-02-14 2020-06-30 北京百度网讯科技有限公司 Method and device for assisting hearing-impaired communication
CN112419471B (en) * 2020-11-19 2024-04-26 腾讯科技(深圳)有限公司 Data processing method and device, intelligent equipment and storage medium
CN113408736B (en) * 2021-04-29 2024-04-12 中国邮政储蓄银行股份有限公司 Processing method and device of voice semantic model
CN116939091A (en) * 2023-07-20 2023-10-24 维沃移动通信有限公司 Voice call content display method and device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1764945A (en) * 2003-03-25 2006-04-26 法国电信 Distributed speech recognition system
CN103685393A (en) * 2012-09-13 2014-03-26 大陆汽车投资(上海)有限公司 Vehicle-borne voice control terminal, voice control system and data processing system
CN104183237A (en) * 2014-09-04 2014-12-03 百度在线网络技术(北京)有限公司 Speech processing method and device for portable terminal

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102215233A (en) * 2011-06-07 2011-10-12 盛乐信息技术(上海)有限公司 Information system client and information publishing and acquisition methods
CN103020165B (en) * 2012-11-26 2016-06-22 北京奇虎科技有限公司 Browser and the processing method of voice recognition processing can be carried out
CN104125483A (en) * 2014-07-07 2014-10-29 乐视网信息技术(北京)股份有限公司 Audio comment information generating method and device and audio comment playing method and device
US10048934B2 (en) * 2015-02-16 2018-08-14 International Business Machines Corporation Learning intended user actions
US9865280B2 (en) * 2015-03-06 2018-01-09 Apple Inc. Structured dictation using intelligent automated assistants
CN104714937B (en) * 2015-03-30 2018-08-07 北京奇艺世纪科技有限公司 A kind of comment information dissemination method and device
JP2016189158A (en) * 2015-03-30 2016-11-04 富士フイルム株式会社 Image processing apparatus, image processing method, program, and recording medium
CN107180041A (en) * 2016-03-09 2017-09-19 广州市动景计算机科技有限公司 Web page content review method and system
CN105847099B (en) * 2016-05-30 2019-12-06 北京百度网讯科技有限公司 Internet of things implementation system and method based on artificial intelligence
CN107967104A (en) * 2017-12-20 2018-04-27 北京时代脉搏信息技术有限公司 The method and electronic equipment of voice remark are carried out to information entity

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1764945A (en) * 2003-03-25 2006-04-26 法国电信 Distributed speech recognition system
CN103685393A (en) * 2012-09-13 2014-03-26 大陆汽车投资(上海)有限公司 Vehicle-borne voice control terminal, voice control system and data processing system
CN104183237A (en) * 2014-09-04 2014-12-03 百度在线网络技术(北京)有限公司 Speech processing method and device for portable terminal

Also Published As

Publication number Publication date
CN108597522A (en) 2018-09-28

Similar Documents

Publication Publication Date Title
US10770062B2 (en) Adjusting a ranking of information content of a software application based on feedback from a user
CN114401438B (en) Video generation method and device for virtual digital person, storage medium and terminal
CN108597522B (en) Voice processing method and device
US10210867B1 (en) Adjusting user experience based on paralinguistic information
US20240021202A1 (en) Method and apparatus for recognizing voice, electronic device and medium
CN111292720A (en) Speech synthesis method, speech synthesis device, computer readable medium and electronic equipment
CN111883107B (en) Speech synthesis and feature extraction model training method, device, medium and equipment
CN111292719A (en) Speech synthesis method, speech synthesis device, computer readable medium and electronic equipment
WO2021169825A1 (en) Speech synthesis method and apparatus, device and storage medium
CN112365878A (en) Speech synthesis method, device, equipment and computer readable storage medium
CN112364653A (en) Text analysis method, apparatus, server and medium for speech synthesis
CN117121099B (en) Adaptive Visual Speech Recognition
Janokar et al. Text-to-Speech and Speech-to-Text Converter—Voice Assistant
CN116528015A (en) Digital human video generation method and device, electronic equipment and storage medium
CN113920987B (en) Voice recognition method, device, equipment and storage medium
CN114048714A (en) Method and device for standardizing reverse text
CN110232911B (en) Singing following recognition method and device, storage medium and electronic equipment
CN112951274A (en) Voice similarity determination method and device, and program product
CN113744712A (en) Intelligent outbound voice splicing method, device, equipment, medium and program product
CN113221514A (en) Text processing method and device, electronic equipment and storage medium
CN113763961B (en) Text processing method and device
CN112349272A (en) Speech synthesis method, speech synthesis device, storage medium and electronic device
KR102689260B1 (en) Server and method for operating a lecture translation platform based on real-time speech recognition
CN113505612B (en) Multi-user dialogue voice real-time translation method, device, equipment and storage medium
US20230274732A1 (en) Applications and services for enhanced prosody instruction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant