CN111768755A - Information processing method, information processing apparatus, vehicle, and computer storage medium - Google Patents
- Publication number
- CN111768755A (application CN202010589862.8A)
- Authority
- CN
- China
- Prior art keywords
- information
- vehicle
- tts engine
- target
- styles
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/033—Voice editing, e.g. manipulating the voice of the synthesiser
- G10L13/0335—Pitch control
- G10L13/04—Details of speech synthesis systems, e.g. synthesiser structure or memory management
- G10L13/047—Architecture of speech synthesisers
Abstract
The application discloses an information processing method applied to a vehicle in which a vehicle-mounted application is installed and a speech synthesis (TTS) engine is provided. The method comprises the following steps: the vehicle-mounted application obtains text information to be converted and converts it into a plurality of pieces of information carrying target sound styles, wherein the sound styles carried by different pieces of information are different; the vehicle-mounted application then sends the plurality of pieces of information carrying the target sound styles to the TTS engine, so that the TTS engine sequentially performs audio synthesis on them and outputs the synthesized audio information.
Description
Technical Field
The present application relates to the field of audio processing, and in particular, to an information processing method, apparatus, vehicle, and computer storage medium.
Background
With the development of intelligent vehicles, vehicle-mounted applications that raise the degree of intelligence have been added to vehicles, including intelligent scenarios in which a vehicle-mounted application controls sound production. However, how to make the sound-production effect more personalized through the control of the vehicle-mounted application, so that audio playing scenarios become richer, remains a problem to be solved.
Disclosure of Invention
In order to solve at least one of the above problems in the prior art, embodiments of the present application provide an information processing method, an information processing apparatus, a vehicle, and a computer storage medium.
In a first aspect, an embodiment of the present application provides an information processing method applied to a vehicle in which a vehicle-mounted application is installed and a speech synthesis (TTS) engine is provided, the method including:
the vehicle-mounted application obtains text information to be converted and converts it into a plurality of pieces of information carrying target sound styles, wherein the sound styles carried by different pieces of information are different;
and the vehicle-mounted application sends the plurality of pieces of information carrying the target sound styles to the TTS engine, so that the TTS engine sequentially performs audio synthesis on them and outputs the synthesized audio information.
In a second aspect, an embodiment of the present application provides an information processing apparatus, including:
the conversion module is used for acquiring text information to be converted and converting it into a plurality of pieces of information carrying target sound styles, wherein the sound styles carried by different pieces of information are different;
the TTS calling module is used for sending the plurality of pieces of information carrying the target sound styles to a TTS engine, so that the TTS engine sequentially performs audio synthesis on them and outputs the synthesized audio information.
In a third aspect, an embodiment of the present application provides a vehicle, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to cause the at least one processor to perform a method provided by any one of the embodiments of the present application.
In a fourth aspect, embodiments of the present application provide a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform a method provided by any one of the embodiments of the present application.
One embodiment in the above application has the following advantages or benefits: the text information to be converted can be converted by the vehicle-mounted application into a plurality of pieces of information carrying target sound styles, after which a TTS engine is called to synthesize the audio information. The vehicle-mounted application thus gains richer audio playing scenarios: during audio playing, a plurality of pieces of information in different personalized sound styles can be output one by one based on the same text, meeting personalized requirements and improving the user's listening experience.
Other effects of the above-described alternative will be described below with reference to specific embodiments.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
FIG. 1 is a schematic flow chart diagram of an information processing method according to an embodiment of the present application;
FIG. 2 is a schematic flow chart of generating information carrying a target sound style according to the present application;
FIG. 3 is a schematic diagram of another scenario processed according to the information processing method of the present application;
FIG. 4 is a first diagram illustrating an information processing apparatus according to an embodiment of the present application;
FIG. 5 is a diagram illustrating a second exemplary structure of an information processing apparatus according to an embodiment of the present application;
fig. 6 is a block diagram of a vehicle for implementing the information processing method of the embodiment of the present application.
Detailed Description
The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments to aid understanding; these details are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
An embodiment of the present application provides an information processing method applied to a vehicle in which a vehicle-mounted application is installed and a speech synthesis (TTS) engine is provided. As shown in fig. 1, the method includes the following steps:
S101: the vehicle-mounted application obtains text information to be converted and converts it into a plurality of pieces of information carrying target sound styles, wherein the sound styles carried by different pieces of information are different;
S102: the vehicle-mounted application sends the plurality of pieces of information carrying the target sound styles to the TTS engine, so that the TTS engine sequentially performs audio synthesis on them and outputs the synthesized audio information.
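Steps S101 and S102 can be summarized in a minimal Python sketch. All names here (`StyledInfo`, `convert_text`, `send_to_tts`, and the placeholder `synthesize` callback) are hypothetical illustrations, not part of the patent; the sketch only shows the shape of the data flow: one text in, several style-tagged documents out, each handed to the engine in turn.

```python
from dataclasses import dataclass

@dataclass
class StyledInfo:
    """One piece of information carrying a target sound style (hypothetical structure)."""
    style: str
    ssml: str

def convert_text(text, target_styles):
    """S101: convert one text into several pieces of information, one per target style."""
    return [
        StyledInfo(style=s,
                   ssml=f'<speak><voice style="{s}">{text}</voice></speak>')
        for s in target_styles
    ]

def send_to_tts(styled_infos, synthesize):
    """S102: hand each styled document to the TTS engine in sequence."""
    return [synthesize(info.ssml) for info in styled_infos]

infos = convert_text("Hello from the vehicle", ["serious", "humorous", "lively"])
```

The point of the structure is that the same input text fans out into several independently styled documents before the engine is ever involved.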
In S101, the in-vehicle application may be one of a plurality of applications installed in a vehicle.
The text information to be converted may be acquired in either of the following ways.
First mode:
The vehicle-mounted application acquires the text to be converted from text information.
That is, the user may input text information at the information input interface of the vehicle-mounted application; alternatively, the user may select a stored text from locally stored texts as the input information.
The vehicle-mounted application displays the text information input by the user on the display interface; if it detects that the user clicks a preview key (which may be a virtual key or a physical key), it may display the text information input by the user through the display interface.
Second mode:
The vehicle-mounted application acquires text information input by the user and takes that text information as the text to be converted.
The difference from the first mode is that here the user is required to input the text information, and the text information input by the user is converted directly as the text to be converted.
Besides the above two modes, the text information may also be obtained from the cloud; for example, the text information to be converted may be a piece of text information randomly obtained from the cloud.
In the above S101, converting the text information to be converted into a plurality of pieces of information carrying target sound styles includes:
the vehicle-mounted application marks audio-related attributes of the text information to be converted according to a plurality of target sound styles to obtain a plurality of target SSML (Speech Synthesis Markup Language) documents, which serve as the plurality of pieces of information carrying target sound styles. That is, the plurality of pieces of information carrying target sound styles may be a plurality of target SSML documents.
The target sound styles may be determined according to user settings. For example, the vehicle-mounted application may provide a selection-menu interface for choosing sound styles, and the styles the user selects on that interface become the plurality of target sound styles for this conversion. In the embodiment of the present application, each target sound style may be one of multiple preset styles (for example, six preset styles), several of which may be selected by the user as the target sound styles. The pieces of information carrying target sound styles may stand in a many-to-one relationship with the sound styles; for example, if 9 sound styles are preset, the converted pieces of information may correspond to 3 different target sound styles.
In addition, the vehicle-mounted application may also be provided with a plurality of default sound styles. Accordingly, the plurality of target sound styles may be determined as follows: if the user does not select any target sound styles, the plurality of default sound styles may be used directly as the target sound styles (these may be related to the type of the vehicle-mounted application); if the user selects a plurality of sound styles for this processing, the styles selected by the user are used as the target sound styles; if the user selects only one or a few sound styles, the style(s) selected by the user together with the default sound styles may be used as the target sound styles.
The default sound styles of the vehicle-mounted application may be preset, or may be determined for the vehicle-mounted application according to a preset mapping relationship.
For example, a plurality of default sound styles may be pre-configured in the vehicle-mounted application: a news-type application may include a serious sound style among its defaults, an entertainment-type vehicle-mounted application may include a light-hearted sound style among its defaults, and so on.
When the vehicle-mounted application determines the plurality of default sound styles according to a preset mapping relationship, it may first determine a plurality of default scenarios corresponding to the vehicle-mounted application according to the identifier of the vehicle-mounted application and a preset mapping between identifiers and scenarios; then, according to the mapping between scenarios and styles, the corresponding plurality of default sound styles can be determined.
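The two-level mapping just described (application identifier to scenario, scenario to default styles) can be sketched as follows; the concrete identifiers, scenario names, and style names are invented for illustration only.

```python
# Hypothetical mapping tables: application identifier -> scenario,
# and scenario -> default sound styles (all names are illustrative).
APP_TO_SCENE = {"news_app": "news", "joke_app": "entertainment"}
SCENE_TO_STYLES = {
    "news": ["serious", "calm"],
    "entertainment": ["lively", "humorous"],
}

def default_styles(app_id):
    """Resolve an application's default sound styles via the two mappings."""
    scene = APP_TO_SCENE.get(app_id)
    # Fall back to a neutral style when no mapping exists for this application.
    return SCENE_TO_STYLES.get(scene, ["neutral"])
```

The two tables keep application-to-scenario and scenario-to-style concerns separate, so new applications only need an entry in the first table.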
In one implementation, a plurality of pieces of information carrying target sound styles for the same text to be converted may be generated according to the sound styles selected by the user; among these pieces of information, one may use the default style described above.
In one embodiment, the information processing method includes steps S101 and S102 shown in fig. 1, and on the basis, the information processing method further includes:
the vehicle-mounted application detects a playing instruction and sends a call request to the TTS engine;
and the vehicle-mounted application determines, based on the information fed back by the TTS engine, whether to send the plurality of pieces of information carrying the target sound styles to the TTS engine.
The playing instruction may be triggered, once the user has confirmed the text information to be converted, by pressing a virtual key for controlling playing in the control-key area of the vehicle-mounted application's display interface, or by pressing a designated physical key in the vehicle.
The playing instruction may also be issued as a voice message once the user has confirmed the text information to be converted. After the user speaks, the user's voice information is collected through a sound collection unit; speech recognition is then performed to obtain voice instruction information, and if the voice instruction information indicates that the input text information should be played, it is interpreted as a playing instruction.
For example, assuming that a drama-star birth module (a function or option) exists in the vehicle-mounted application, after the user clicks to enter this module, an information input interface may be displayed in which the user can input text information; for example, the user may compose a poem or an article.
After finishing the input, the user can click a 'play type broadcast' button to generate a playing instruction.
The vehicle-mounted application can respond to the playing instruction and determine the text information currently input by the user as the text information to be converted. Information in a variety of different target sound styles is then generated from the text to be converted, and broadcasting the information in the different styles presents different broadcast effects for the same sentence.
In a specific implementation of the foregoing embodiment, the plurality of target sound styles may include: serious, humorous, recreational, and so on. After the vehicle-mounted application sends the call request to the TTS engine, it determines, based on the feedback to the call request from the TTS engine, whether to send the plurality of pieces of information carrying the target sound styles to the TTS engine, so that the TTS engine can perform audio synthesis on them and output the synthesized audio information.
Further, when sending the call request, the vehicle-mounted application may select a corresponding audio output channel based on the vehicle-mounted application, then send a call request for calling that audio output channel to the TTS engine; if the audio output channel can output audio, the vehicle-mounted application sends the plurality of pieces of information carrying the target sound styles to the TTS engine.
Furthermore, in the processing of the TTS engine, the audio corresponding to the plurality of pieces of information carrying target sound styles may be added to the output queue of the corresponding audio output channel, and the target audio information corresponding to the plurality of target sound styles in the output queue is then output through that audio output channel.
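The output-queue behaviour can be sketched as below; `AudioChannel` and the lambda standing in for the engine's synthesis call are hypothetical, since the text does not specify the engine's interface.

```python
from collections import deque

class AudioChannel:
    """Hypothetical audio output channel with a FIFO output queue."""
    def __init__(self):
        self.queue = deque()
        self.played = []

    def enqueue(self, styled_ssml):
        """Add one style-carrying document to the channel's output queue."""
        self.queue.append(styled_ssml)

    def drain(self, synthesize):
        """Synthesize and output the queued items in submission order."""
        while self.queue:
            self.played.append(synthesize(self.queue.popleft()))

channel = AudioChannel()
for doc in ["<speak>serious</speak>", "<speak>lively</speak>"]:
    channel.enqueue(doc)
channel.drain(lambda ssml: ssml.encode("utf-8"))  # stand-in for real synthesis
```

A FIFO queue is what guarantees the "sequential" synthesis the method calls for: items leave the channel in exactly the order the application enqueued them.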
It should be noted that the processing of the TTS engine may include performing speech synthesis in combination with speech models corresponding to the plurality of target sound styles. A speech model may be preset locally in the vehicle, or may reside in the cloud.
For example, the TTS engine of the vehicle may send the plurality of pieces of information carrying target sound styles to the cloud; the cloud selects the corresponding speech models to synthesize them and sequentially feeds the synthesized audio back to the vehicle, where it is output through the audio output channel corresponding to the vehicle's TTS engine. This mode may be adopted when the vehicle can connect to the cloud, when the communication quality between the vehicle and the cloud is good, or when the user has set the vehicle to connect to the cloud provided the communication quality exceeds a preset threshold.
As another example, the TTS engine of the vehicle may perform speech synthesis on the plurality of pieces of information carrying target sound styles directly using local speech models. In this case, a local speech model may be one that is updated whenever the cloud can be connected, and/or one preset locally.
In addition, the method further includes: when the vehicle-mounted application detects a pause-playing instruction, it sends the pause-playing instruction to the TTS engine to control the TTS engine to pause audio synthesis.
The pause-playing instruction may be generated as follows: after the vehicle-mounted application detects a click on a pause virtual key in the currently displayed playing interface or control interface; or after the vehicle-mounted application detects a press of a designated pause physical key in the vehicle; or after the vehicle-mounted application detects a voice command to pause playing.
In one embodiment, sending the call request to the TTS engine further includes:
the vehicle-mounted application determines a sound output channel and sends a call request for that sound output channel to the TTS engine.
In general, a vehicle may be provided with a plurality of sound output channels, which can be used to output a variety of different sounds. For example, a vehicle may have 9 channels, 5 of which output the sounds of the in-vehicle computer's speakers inside the vehicle, while the other 4 output alarm sounds and the like.
In a specific implementation, the vehicle-mounted application may always call the same one or more channels.
In step S102, the processing by which the vehicle-mounted application calls the TTS engine may include: the vehicle-mounted application sends a call request to the TTS engine and receives feedback information from the TTS engine; if the feedback information indicates that the TTS call succeeded, the vehicle-mounted application sends the plurality of pieces of information carrying the target sound styles to the TTS engine.
Finally, the TTS engine can synthesize the plurality of pieces of information carrying the target sound styles into a PCM voice stream, so that the PCM voice stream can be output.
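Sentence-by-sentence synthesis into one PCM stream can be sketched like this; the `engine` callback is a stand-in, since the real TTS engine's interface is not specified in the text.

```python
def synthesize_to_pcm(ssml_docs, engine):
    """Concatenate the engine's per-document PCM output, in submission order."""
    pcm = bytearray()
    for doc in ssml_docs:
        pcm.extend(engine(doc))  # engine returns raw PCM bytes for one document
    return bytes(pcm)

# Stand-in engine: "synthesizes" a document into its UTF-8 bytes so the
# ordering behaviour is visible without a real synthesizer.
stream = synthesize_to_pcm(["<speak>a</speak>", "<speak>b</speak>"],
                           lambda doc: doc.encode("utf-8"))
```

Because each styled document is synthesized independently and appended in order, pausing between documents (as described above) only needs to stop the loop between iterations.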
In one example, the process of converting the text to be converted into a plurality of pieces of information carrying the target sound style may be as shown in fig. 2, and includes the following steps:
S201: a plurality of SSML documents are created.
This may include adding information such as version, language, URI (Uniform Resource Identifier), and output voice to the created pieces of information carrying target sound styles. For example, the canonical version used to interpret each document's markup may be specified, the language of the root document may be specified, and the URIs of the documents defining the markup vocabulary of the pieces of information carrying target sound styles may be specified.
S202: and adjusting the selected voice according to each target sound style in the plurality of target sound styles, and selecting service information corresponding to each target sound style.
For example, the speech corresponding to the text to be converted is selected. The selected voice is adjusted according to the plurality of target voice styles, so that the style of the selected voice can be the plurality of target voice styles.
A 'service' may be understood as the audio-attribute-related information of the present application; that is, the plurality of target sound styles ultimately correspond to audio-attribute-related information.
It is to be understood that different styles may correspond to different services. A service may include at least one of speech rate, intonation, pitch, pauses, and the like.
For example, the service information may include at least one of: adding or deleting breaks/pauses in the speech; specifying paragraphs and sentences in the speech; using phonemes to improve pronunciation; using a user-defined dictionary to improve pronunciation; adjusting prosody; changing the speech rate; changing the volume; changing the pitch; changing the pitch contour; adding recorded audio; adding background audio; and so on.
S203: a plurality of target SSMLs are generated. A plurality of information carrying the target sound style may be generated for the result based on the aforementioned selection and adjustment. In the embodiment of the present application, the plurality of information carrying the target sound style may include a tag of certain audio related attributes, and the audio related attributes may be understood as service information, such as a speech rate, a tone, a pitch, background music, and the like. The service information has some fixed settings for each style, for example, for the serious style and for the entertainment style, the speech speed, tone, pitch, background music, etc. may be different in at least one of the different service information.
Further, in another embodiment, on the basis of S101 and S102 shown in fig. 1, the scheme provided in this embodiment may further include mixing and playing the background audio information and the synthesized audio.
In one embodiment, the sending, by the vehicle-mounted application, of the plurality of pieces of information carrying the target sound styles to a TTS engine includes:
sequentially sending the plurality of pieces of information carrying the target sound styles to the TTS engine in a first order;
wherein the first order is a preset order or a random order.
For example, if the plurality of pieces of information carry sound styles A1, A2, A3, A4 and A5, the first order may be determined as a random ordering of A1, A2, A3, A4 and A5.
As another example, if the pieces of information carry serious, lively, sad, happy and cool sound styles, then, to heighten the dramatic effect of the broadcast, sound styles with larger contrast can be placed adjacent in the order; for example, the first order may be set as: serious, lively, sad, happy, cool.
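The choice of first order can be sketched as: honour a preset order when one is supplied, otherwise shuffle. The `seed` parameter is not part of the described method; it is added only so the random branch of this sketch is reproducible.

```python
import random

def order_styles(styles, preset=None, seed=None):
    """Return the styles in a preset order if one is given, else a random order."""
    if preset is not None:
        # Keep only styles actually available, in the preset's order.
        return [s for s in preset if s in styles]
    rng = random.Random(seed)  # seeded only for reproducibility in this sketch
    shuffled = list(styles)
    rng.shuffle(shuffled)
    return shuffled
```

A preset order is how the high-contrast adjacency described above would be expressed, while the random branch covers the other case the text names.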
An example of the embodiment of the present application, as shown in fig. 3, may include:
the in-vehicle application supports a smart mode. The vehicle application has an AI (Artificial Intelligence) sound skill, and the hardware base of the operation may be a chip (IDCM) of the vehicle application.
The vehicle-mounted application may have a number of optional functions (or optional modules, such as the 'other skills' and 'drama-star birth' modules in the figure). Specifically, for example, in the AI voice skill application of the IDCM, the drama-star birth module may be entered according to the user's selection.
After entering the page of the drama-star birth module, the user can set the text to be broadcast manually or by voice. There are two setting modes: selecting a text from a text library, or inputting one manually; the text library may be local or in the cloud.
After the text is set, the user can manually click the 'play type broadcast' button, or confirm it with a voice command.
After the 'play type broadcast' control takes effect, the vehicle-mounted application requests the relevant drama-star-birth text from the local store or the cloud.
After the vehicle-mounted application receives the requested text, the drama-star-birth text is converted to obtain a plurality of pieces of information carrying target sound styles, and the vehicle-mounted application (such as the AI voice skill application) applies to call the TTS engine.
If the TTS engine can currently be called, it feeds back to the vehicle-mounted application that the call succeeded; otherwise it feeds back that the call failed. After receiving the success feedback, the vehicle-mounted application sends the pieces of drama-star-birth text carrying target sound styles to be broadcast to the TTS engine sentence by sentence, and the TTS engine synthesizes them into audio sentence by sentence and broadcasts the audio. Specifically, for example, after the AI voice skill application receives the success feedback, it prepares 7 copies of the text to be broadcast, each carrying a different target sound style, sends them to the TTS engine in succession, and the TTS engine synthesizes them sentence by sentence into audio and broadcasts it.
In addition, during the sentence-by-sentence broadcast, the vehicle-mounted application can also support pausing and resuming the recital. The specific processing has already been described in the foregoing embodiments and is not repeated here.
In the embodiment of the application, the text information to be converted can be converted by the vehicle-mounted application to obtain a plurality of pieces of information carrying different target sound styles, after which the TTS engine is called to synthesize the audio information in the plurality of target sound styles. The vehicle-mounted application thus combines a plurality of target sound styles with a specific audio playing scenario, can output more personalized information carrying target sound styles during audio playing, and can present different broadcast effects for the same sentence or text.
An embodiment of the present application further provides an information processing apparatus, a main component of which is shown in fig. 4, including:
the conversion module 51 is configured to acquire text information to be converted, and convert the text information to be converted into a plurality of pieces of information carrying a target sound style; wherein, the sound styles corresponding to different information carrying the target sound style are different;
the TTS invoking module 52 is configured to send the multiple pieces of information with the target sound style to a TTS engine, so as to sequentially perform audio synthesis on the multiple pieces of information with the target sound style through the TTS engine and output synthesized audio information.
In one embodiment, the conversion module is further configured to,
acquiring the text to be converted from the text information;
or,
and acquiring text information input by a user, and taking the text information input by the user as the text to be converted.
In one embodiment, the conversion module is further configured to:
and marking audio related attributes of the text information to be converted according to a plurality of target sound styles to obtain a plurality of target SSMLs, and taking the plurality of target SSMLs as a plurality of information carrying the target sound styles.
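Marking one text with several target sound styles to obtain multiple target SSML documents can be sketched as below. The style names and prosody attributes are illustrative assumptions; real TTS engines define their own voice names and may support a different subset of W3C SSML attributes.

```python
# Illustrative target sound styles; voice names and attribute values
# here are assumptions, not values specified by this application.
TARGET_STYLES = [
    {"voice": "female-warm",  "rate": "medium", "pitch": "+5%"},
    {"voice": "male-deep",    "rate": "slow",   "pitch": "-10%"},
    {"voice": "child-lively", "rate": "fast",   "pitch": "+20%"},
]

def to_target_ssmls(text):
    """Mark one text with each target style, yielding one SSML per style."""
    ssmls = []
    for style in TARGET_STYLES:
        ssmls.append(
            '<speak version="1.0">'
            f'<voice name="{style["voice"]}">'
            f'<prosody rate="{style["rate"]}" pitch="{style["pitch"]}">'
            f"{text}"
            "</prosody></voice></speak>"
        )
    return ssmls
```

Each returned SSML document is one "piece of information carrying a target sound style": the same underlying text, wrapped in different audio-related attributes.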
In one embodiment, as shown in fig. 5, the apparatus further comprises:
the call request module is used for detecting a playing instruction and sending a call request to the TTS engine;
and the sending module is used for determining whether to send the information carrying the target voice styles to the TTS engine based on the information fed back by the TTS engine.
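The call-request handshake performed by these two modules can be sketched as follows. The `request_call()` and `send()` methods and the `"success"` feedback value are assumed names for illustration; the application only specifies that sending is gated on the engine's feedback.

```python
def try_broadcast(tts_engine, styled_infos, channel="media"):
    """Send a call request; forward the styled texts only on success.

    request_call(), send(), and the "success" feedback value are
    hypothetical, chosen for this sketch.
    """
    feedback = tts_engine.request_call(channel=channel)  # request the voice output channel
    if feedback != "success":
        return False            # engine busy or unavailable: do not send
    for info in styled_infos:
        tts_engine.send(info)   # sent one piece at a time
    return True
```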
In one embodiment, the invocation request module is further operable to,
determining a voice output channel, and sending a call request aiming at the voice output channel to the TTS engine.
In one embodiment, the sending module is further configured to,
and if the information fed back by the TTS engine indicates that the call succeeded, sending the information carrying the target voice style to the TTS engine.
In one embodiment, the TTS calling module is further configured to,
sequentially sending the information carrying the target voice styles to a TTS engine according to a first sequence;
wherein the first sequence is a preset sequence or a random sequence.
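Sending the styled information in a "first sequence" that is either preset or random can be sketched as below; the function name and index-list representation of the preset order are assumptions for illustration.

```python
import random

def order_for_sending(styled_infos, preset_order=None):
    """Return the styled texts in the 'first sequence': either a preset
    order of indices supplied by the caller, or a random shuffle."""
    if preset_order is not None:
        return [styled_infos[i] for i in preset_order]
    shuffled = list(styled_infos)
    random.shuffle(shuffled)    # random sequence: same items, random order
    return shuffled
```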
Fig. 6 is a block diagram of a vehicle according to the information processing method of an embodiment of the present application. The vehicle is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The vehicle may also represent various forms of mobile devices, such as personal digital processing devices, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only, and are not meant to limit implementations of the present application described and/or claimed herein.
As shown in fig. 6, the vehicle includes: one or more processors 1201, a memory 1202, and interfaces for connecting the various components, including a high-speed interface and a low-speed interface. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the vehicle, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output device (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories, as desired. Also, multiple vehicles may be connected, with each device providing some of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In fig. 6, one processor 1201 is taken as an example.
The memory 1202, as a non-transitory computer-readable storage medium, may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions/modules corresponding to the information processing method in the embodiments of the present application (e.g., the units shown in figs. 4 and 5). The processor 1201 executes various functional applications and data processing of the vehicle by running the non-transitory software programs, instructions, and modules stored in the memory 1202, that is, implements the information processing method in the above-described method embodiments.
The memory 1202 may include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required for at least one function, and the data storage area may store data created according to the use of the vehicle, and the like. Further, the memory 1202 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory 1202 optionally includes memory remotely located from the processor 1201, which may be connected to the vehicle via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The vehicle implementing the information processing method may further include: an input device 1203 and an output device 1204. The processor 1201, the memory 1202, the input device 1203, and the output device 1204 may be connected by a bus or in other ways; in fig. 6, connection by a bus is taken as an example.
The input device 1203 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the vehicle, and may be, for example, a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, a joystick, or another input device. The output device 1204 may include a display device, auxiliary lighting devices (e.g., LEDs), tactile feedback devices (e.g., vibrating motors), and the like. The display device may include, but is not limited to, a liquid crystal display (LCD), a light-emitting diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, and the present application is not limited thereto as long as the desired results of the technical solutions disclosed in the present application can be achieved.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.
Claims (16)
1. An information processing method applied to a vehicle in which an in-vehicle application is installed and in which a speech synthesis TTS engine is installed, characterized by comprising:
the method comprises the steps that a vehicle-mounted application obtains text information to be converted and converts the text information to be converted into a plurality of pieces of information carrying target sound styles; wherein, the sound styles corresponding to different information carrying the target sound style are different;
and the vehicle-mounted application sends the information carrying the target voice style to a TTS engine so as to sequentially carry out audio synthesis on the information carrying the target voice style through the TTS engine and output the synthesized audio information.
2. The method of claim 1, wherein the obtaining text information to be converted comprises:
the vehicle-mounted application acquires the text to be converted from the text information;
or,
and the vehicle-mounted application acquires text information input by a user, and takes the text information input by the user as the text to be converted.
3. The method of claim 1, wherein the converting the text information to be converted into a plurality of pieces of information carrying target sound styles comprises:
and the vehicle-mounted application marks audio-related attributes of the text information to be converted according to a plurality of target sound styles to obtain a plurality of target SSMLs, and the plurality of target SSMLs are used as a plurality of pieces of information carrying the target sound styles.
4. The method of claim 1, wherein the method further comprises:
the vehicle-mounted application detects a playing instruction and sends a calling request to a TTS engine;
and the vehicle-mounted application determines whether to send the information carrying the target voice styles to the TTS engine based on the information fed back by the TTS engine.
5. The method of claim 4, wherein said sending a call request to a TTS engine further comprises:
the vehicle-mounted application determines a voice output channel and sends a calling request aiming at the voice output channel to the TTS engine.
6. The method of claim 4, wherein the determining, by the in-vehicle application, whether to send the plurality of information carrying the target voice style to the TTS engine based on the information fed back by the TTS engine comprises:
and if the information fed back by the TTS engine indicates that the call succeeded, sending the information carrying the target voice style to the TTS engine.
7. The method of claim 1, wherein the in-vehicle application sending the plurality of information carrying the target voice style to a TTS engine comprises:
sequentially sending the information carrying the target voice styles to a TTS engine according to a first sequence;
wherein the first sequence is a preset sequence or a random sequence.
8. An information processing apparatus comprising:
the conversion module is used for acquiring text information to be converted and converting the text information to be converted into a plurality of pieces of information carrying target sound styles; wherein, the sound styles corresponding to different information carrying the target sound style are different;
the TTS calling module is used for sending the information carrying the target voice style to a TTS engine so as to sequentially carry out audio synthesis on the information carrying the target voice style through the TTS engine and output the synthesized audio information.
9. The apparatus of claim 8, wherein the conversion module is further configured to,
acquiring the text to be converted from the text information;
or,
and acquiring text information input by a user, and taking the text information input by the user as the text to be converted.
10. The apparatus of claim 8, wherein the conversion module is further configured to:
and marking audio related attributes of the text information to be converted according to a plurality of target sound styles to obtain a plurality of target SSMLs, and taking the plurality of target SSMLs as a plurality of information carrying the target sound styles.
11. The apparatus of claim 8, wherein the apparatus further comprises:
the call request module is used for detecting a playing instruction and sending a call request to the TTS engine;
and the sending module is used for determining whether to send the information carrying the target voice styles to the TTS engine based on the information fed back by the TTS engine.
12. The apparatus of claim 11, wherein the call request module is further configured to,
determining a voice output channel, and sending a call request aiming at the voice output channel to the TTS engine.
13. The apparatus of claim 11, wherein the means for transmitting is further configured to,
and if the information fed back by the TTS engine indicates that the call succeeded, sending the information carrying the target voice style to the TTS engine.
14. The apparatus of claim 8, wherein the TTS invocation module is further to,
sequentially sending the information carrying the target voice styles to a TTS engine according to a first sequence;
wherein the first sequence is a preset sequence or a random sequence.
15. A vehicle, characterized by comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
16. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010589862.8A CN111768755A (en) | 2020-06-24 | 2020-06-24 | Information processing method, information processing apparatus, vehicle, and computer storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010589862.8A CN111768755A (en) | 2020-06-24 | 2020-06-24 | Information processing method, information processing apparatus, vehicle, and computer storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111768755A true CN111768755A (en) | 2020-10-13 |
Family
ID=72721898
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010589862.8A Pending CN111768755A (en) | 2020-06-24 | 2020-06-24 | Information processing method, information processing apparatus, vehicle, and computer storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111768755A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112365877A (en) * | 2020-11-27 | 2021-02-12 | 北京百度网讯科技有限公司 | Speech synthesis method, speech synthesis device, electronic equipment and storage medium |
CN115273808A (en) * | 2021-04-14 | 2022-11-01 | 上海博泰悦臻网络技术服务有限公司 | Sound processing method, storage medium and electronic device |
US12008289B2 (en) | 2021-07-07 | 2024-06-11 | Honeywell International Inc. | Methods and systems for transcription playback with variable emphasis |
Citations (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
AUPR380401A0 (en) * | 2001-03-19 | 2001-04-12 | Famoice Technology Pty Ltd | Data template structure |
US20020188449A1 (en) * | 2001-06-11 | 2002-12-12 | Nobuo Nukaga | Voice synthesizing method and voice synthesizer performing the same |
CN1938756A (en) * | 2004-03-05 | 2007-03-28 | 莱塞克技术公司 | Prosodic speech text codes and their use in computerized speech systems |
CN102201233A (en) * | 2011-05-20 | 2011-09-28 | 北京捷通华声语音技术有限公司 | Mixed and matched speech synthesis method and system thereof |
CN102222501A (en) * | 2011-06-15 | 2011-10-19 | 中国科学院自动化研究所 | Method for generating duration parameter in speech synthesis |
US20110282668A1 (en) * | 2010-05-14 | 2011-11-17 | General Motors Llc | Speech adaptation in speech synthesis |
CN104200803A (en) * | 2014-09-16 | 2014-12-10 | 北京开元智信通软件有限公司 | Voice broadcasting method, device and system |
CN105304080A (en) * | 2015-09-22 | 2016-02-03 | 科大讯飞股份有限公司 | Speech synthesis device and speech synthesis method |
US20160071509A1 (en) * | 2014-09-05 | 2016-03-10 | General Motors Llc | Text-to-speech processing based on network quality |
CN106782494A (en) * | 2016-09-13 | 2017-05-31 | 乐视控股(北京)有限公司 | Phonetic synthesis processing method and processing device |
WO2018171257A1 (en) * | 2017-03-21 | 2018-09-27 | Beijing Didi Infinity Technology And Development Co., Ltd. | Systems and methods for speech information processing |
CN108962217A (en) * | 2018-07-28 | 2018-12-07 | 华为技术有限公司 | Phoneme synthesizing method and relevant device |
WO2019005625A1 (en) * | 2017-06-26 | 2019-01-03 | Zya, Inc. | System and method for automatically generating media |
CN109147760A (en) * | 2017-06-28 | 2019-01-04 | 阿里巴巴集团控股有限公司 | Synthesize method, apparatus, system and the equipment of voice |
KR20190104941A (en) * | 2019-08-22 | 2019-09-11 | 엘지전자 주식회사 | Speech synthesis method based on emotion information and apparatus therefor |
KR20190106890A (en) * | 2019-08-28 | 2019-09-18 | 엘지전자 주식회사 | Speech synthesis method based on emotion information and apparatus therefor |
CN110688834A (en) * | 2019-08-22 | 2020-01-14 | 阿里巴巴集团控股有限公司 | Method and equipment for rewriting intelligent manuscript style based on deep learning model |
Patent Citations (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
AUPR380401A0 (en) * | 2001-03-19 | 2001-04-12 | Famoice Technology Pty Ltd | Data template structure |
US20020188449A1 (en) * | 2001-06-11 | 2002-12-12 | Nobuo Nukaga | Voice synthesizing method and voice synthesizer performing the same |
CN1938756A (en) * | 2004-03-05 | 2007-03-28 | 莱塞克技术公司 | Prosodic speech text codes and their use in computerized speech systems |
US20110282668A1 (en) * | 2010-05-14 | 2011-11-17 | General Motors Llc | Speech adaptation in speech synthesis |
CN102201233A (en) * | 2011-05-20 | 2011-09-28 | 北京捷通华声语音技术有限公司 | Mixed and matched speech synthesis method and system thereof |
CN102222501A (en) * | 2011-06-15 | 2011-10-19 | 中国科学院自动化研究所 | Method for generating duration parameter in speech synthesis |
US20160071509A1 (en) * | 2014-09-05 | 2016-03-10 | General Motors Llc | Text-to-speech processing based on network quality |
CN104200803A (en) * | 2014-09-16 | 2014-12-10 | 北京开元智信通软件有限公司 | Voice broadcasting method, device and system |
CN105304080A (en) * | 2015-09-22 | 2016-02-03 | 科大讯飞股份有限公司 | Speech synthesis device and speech synthesis method |
CN106782494A (en) * | 2016-09-13 | 2017-05-31 | 乐视控股(北京)有限公司 | Phonetic synthesis processing method and processing device |
WO2018171257A1 (en) * | 2017-03-21 | 2018-09-27 | Beijing Didi Infinity Technology And Development Co., Ltd. | Systems and methods for speech information processing |
WO2019005625A1 (en) * | 2017-06-26 | 2019-01-03 | Zya, Inc. | System and method for automatically generating media |
CN109147760A (en) * | 2017-06-28 | 2019-01-04 | 阿里巴巴集团控股有限公司 | Synthesize method, apparatus, system and the equipment of voice |
CN108962217A (en) * | 2018-07-28 | 2018-12-07 | 华为技术有限公司 | Phoneme synthesizing method and relevant device |
KR20190104941A (en) * | 2019-08-22 | 2019-09-11 | 엘지전자 주식회사 | Speech synthesis method based on emotion information and apparatus therefor |
CN110688834A (en) * | 2019-08-22 | 2020-01-14 | 阿里巴巴集团控股有限公司 | Method and equipment for rewriting intelligent manuscript style based on deep learning model |
KR20190106890A (en) * | 2019-08-28 | 2019-09-18 | 엘지전자 주식회사 | Speech synthesis method based on emotion information and apparatus therefor |
Non-Patent Citations (1)
Title |
---|
REN Pingping: "Intelligent Customer Service Robots" (《智能客服机器人》), Chengdu Times Press, pages: 99 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11468889B1 (en) | Speech recognition services | |
JP7181332B2 (en) | Voice conversion method, device and electronic equipment | |
CN110633419B (en) | Information pushing method and device | |
CN111768755A (en) | Information processing method, information processing apparatus, vehicle, and computer storage medium | |
TWI249729B (en) | Voice browser dialog enabler for a communication system | |
EP2112650B1 (en) | Speech synthesis apparatus, speech synthesis method, speech synthesis program, portable information terminal, and speech synthesis system | |
CN107040452B (en) | Information processing method and device and computer readable storage medium | |
US11586344B1 (en) | Synchronizing media content streams for live broadcasts and listener interactivity | |
CN112533041A (en) | Video playing method and device, electronic equipment and readable storage medium | |
US8340797B2 (en) | Method and system for generating and processing digital content based on text-to-speech conversion | |
CN112165648B (en) | Audio playing method, related device, equipment and storage medium | |
US11449301B1 (en) | Interactive personalized audio | |
CN111142667A (en) | System and method for generating voice based on text mark | |
CN110718221A (en) | Voice skill control method, voice equipment, client and server | |
CN111935551A (en) | Video processing method and device, electronic equipment and storage medium | |
CN111259125A (en) | Voice broadcasting method and device, intelligent sound box, electronic equipment and storage medium | |
KR20210038278A (en) | Speech control method and apparatus, electronic device, and readable storage medium | |
CN111818279A (en) | Subtitle generating method, display method and interaction method | |
CN111739510A (en) | Information processing method, information processing apparatus, vehicle, and computer storage medium | |
WO2023040820A1 (en) | Audio playing method and apparatus, and computer-readable storage medium and electronic device | |
CN111754974B (en) | Information processing method, device, equipment and computer storage medium | |
CN110633357A (en) | Voice interaction method, device, equipment and medium | |
CN111768756B (en) | Information processing method, information processing device, vehicle and computer storage medium | |
US20210392394A1 (en) | Method and apparatus for processing video, electronic device and storage medium | |
CN113160782B (en) | Audio processing method and device, electronic equipment and readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||