CN111243570A

CN111243570A - Voice acquisition method and device and computer readable storage medium

Info

Publication number: CN111243570A
Application number: CN202010060939.2A
Authority: CN
Inventors: 李永强; 雷欣; 李志飞
Original assignee: Mobvoi Information Technology Co Ltd
Current assignee: Volkswagen China Investment Co Ltd; Mobvoi Innovation Technology Co Ltd
Priority date: 2020-01-19
Filing date: 2020-01-19
Publication date: 2020-06-05
Anticipated expiration: 2040-01-19
Also published as: CN111243570B

Abstract

The invention discloses a voice acquisition method, a voice acquisition device and a computer readable storage medium, wherein the voice acquisition method comprises the following steps: acquiring text information; splitting the acquired text information to obtain a plurality of split texts; judging whether the obtained split texts exist in a first voice cache one by one; and if the split text is judged to exist in the first voice cache, extracting the voice information corresponding to the split text from the first voice cache. Therefore, after the text information is received, the corresponding voice information can be fed back quickly, the calculation cost of the voice synthesis system is reduced, and the voice acquisition efficiency is greatly improved.

Description

Voice acquisition method and device and computer readable storage medium

Technical Field

The present invention relates to the field of speech synthesis technologies, and in particular, to a speech acquisition method and apparatus, and a computer-readable storage medium.

Background

A TTS (text-to-speech, i.e. speech synthesis) system converts text information into speech information, which incurs a certain time overhead. If the TTS system adopts a state-of-the-art neural network model (such as Tacotron, WaveNet, WaveRNN, LPCNet and the like), synthesis is very slow, and rapid synthesis cannot be achieved in many cases.

Disclosure of Invention

Embodiments of the present invention provide a method and an apparatus for obtaining speech, and a computer-readable storage medium, which can reduce the computation overhead of a speech synthesis system and improve the efficiency of obtaining speech.

One aspect of the present invention provides a method for acquiring a voice, including: acquiring text information; splitting the acquired text information to obtain a plurality of split texts; judging whether the obtained split texts exist in a first voice cache one by one; and if the split text is judged to exist in the first voice cache, extracting the voice information corresponding to the split text from the first voice cache.

In an embodiment, the method further comprises: if the split text is judged not to exist in the first voice cache, performing voice synthesis on the split text to obtain corresponding voice information.

In an embodiment, the method further comprises: and storing the text information and the corresponding voice information into a second voice cache.

In an implementation manner, before splitting the acquired text information to obtain a plurality of split texts, the method further includes: judging whether the text information exists in a second voice cache; if the text information is judged to exist in the second voice cache, acquiring the voice information corresponding to the text information from the second voice cache; and if the text information is judged not to exist in the second voice cache, splitting the acquired text information to obtain a plurality of split texts.

In an embodiment, before judging whether the text information exists in the second voice cache, the method further comprises: judging whether the text information exists in a voice database; if the text information is judged to exist in the voice database, acquiring the voice information corresponding to the text information from the voice database; and if the text information is judged not to exist in the voice database, judging whether the text information exists in the second voice cache.

In one embodiment, the voice database is a shared resource, and the first voice cache and the second voice cache are exclusive resources.

Another aspect of the present invention provides a speech acquisition apparatus, including: the text acquisition module is used for acquiring text information; the text splitting module is used for splitting the acquired text information to obtain a plurality of split texts; the split text judgment module is used for judging whether the obtained split texts exist in the first voice cache one by one; and the voice extraction module is used for extracting the voice information corresponding to the split text from the first voice cache if the split text is judged to exist in the first voice cache by the split text judgment module.

In an implementation manner, before the text splitting module splits the acquired text information to obtain a plurality of split texts, the apparatus further includes: a cache judging module, used for judging whether the text information exists in a second voice cache; if the text information is judged to exist in the second voice cache, acquiring the voice information corresponding to the text information from the second voice cache; and if the text information is judged not to exist in the second voice cache, splitting the acquired text information to obtain a plurality of split texts.

In an embodiment, before the cache judging module judges whether the text information exists in the second voice cache, the apparatus further includes: a database judging module, used for judging whether the text information exists in a voice database; if the text information is judged to exist in the voice database, acquiring the voice information corresponding to the text information from the voice database; and if the text information is judged not to exist in the voice database, judging whether the text information exists in the second voice cache.

Another aspect of the invention provides a computer-readable storage medium comprising a set of computer-executable instructions that, when executed, perform any of the speech acquisition methods described above.

In the embodiment of the invention, some phrases and corresponding voice information are stored in the first voice cache in advance. Therefore, after the text information is received, the text is split to obtain a plurality of split texts, whether the split texts exist in the first voice cache or not is judged, if the split texts exist in the first voice cache, the corresponding voice information is directly extracted, and the split texts which do not exist in the first voice cache are subjected to subsequent processing.

Therefore, after the text information is received, the corresponding voice information can be fed back quickly, the calculation cost of the voice synthesis system is reduced, and the voice acquisition efficiency is greatly improved.

Drawings

The above and other objects, features and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:

in the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.

Fig. 1 is a schematic flow chart illustrating an implementation of a speech acquisition method according to an embodiment of the present invention;

Fig. 2 is a schematic flow chart of a specific implementation of a speech acquisition method according to an embodiment of the present invention;

Fig. 3 is a schematic structural diagram of a speech acquisition apparatus according to an embodiment of the present invention.

Detailed Description

In order to make the objects, features and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Fig. 1 is a schematic flow chart illustrating an implementation of a speech acquisition method according to an embodiment of the present invention.

As shown in fig. 1, an aspect of the present invention provides a method for acquiring a voice, where the method includes:

step 101, acquiring text information;

step 102, splitting the acquired text information to obtain a plurality of split texts;

step 103, judging whether the obtained split texts exist in a first voice cache one by one;

step 104, if the split text is judged to exist in the first voice cache, extracting the voice information corresponding to the split text from the first voice cache.

In this embodiment, first, text information is acquired, where the text information may be provided by a user.

The acquired text information is then split to obtain a plurality of split texts. The text information may be split at its punctuation marks; alternatively, an existing word segmentation tool may be used to perform word segmentation on the text information.
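By way of illustration only, punctuation-based splitting might be sketched as follows. The function name `split_text` and the particular set of delimiters are assumptions made for this example; the description does not fix either, and a word segmentation tool could be substituted.

```python
import re
from typing import List

# Assumed delimiter set (ASCII and full-width punctuation); the description
# only says that the text may be split "according to punctuations".
_PUNCTUATION = r"[,.;:!?，。；：！？、]"


def split_text(text: str) -> List[str]:
    """Split text at punctuation marks and drop empty fragments."""
    parts = re.split(_PUNCTUATION, text)
    return [part.strip() for part in parts if part.strip()]


print(split_text("Hello, I am Alice."))
```

Each returned fragment is one "split text" that can then be looked up in the first voice cache.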

Each of the obtained split texts is then checked, one by one, against a first voice cache. The first voice cache stores common phrases and their corresponding voice information. For example, in the two sentences "Hello, I am xxx" and "Hello, may I ask xxx", "Hello" is a frequently occurring phrase; the part shared by the two sentences is extracted in advance, corresponding voice information is generated for it, and the phrase and its voice information are stored in the first voice cache.

If a split text is judged to exist in the first voice cache, the voice information corresponding to that split text is extracted.
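Steps 103 and 104 can be sketched, purely as an example, with a dictionary standing in for the first voice cache. The names `first_voice_cache` and `lookup_split_texts`, and the placeholder audio bytes, are hypothetical and not taken from the description.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical first voice cache: common phrase -> pre-generated audio bytes.
first_voice_cache: Dict[str, bytes] = {"Hello": b"<audio for Hello>"}


def lookup_split_texts(
    split_texts: List[str], cache: Dict[str, bytes]
) -> List[Tuple[str, Optional[bytes]]]:
    """Check each split text against the cache one by one.

    A hit yields the cached voice information; a miss yields None so that
    the caller can hand the split text to subsequent processing.
    """
    return [(text, cache.get(text)) for text in split_texts]


for text, audio in lookup_split_texts(["Hello", "I am xxx"], first_voice_cache):
    print(text, "hit" if audio is not None else "miss")
```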

In an embodiment, the method further comprises:

and if the split text is judged not to exist in the first voice cache, performing voice synthesis on the split text to obtain corresponding voice information.

In this embodiment, if one or more of the split texts does not exist in the first voice cache, the split texts are converted by using a voice synthesis technology to obtain corresponding voice information, and the obtained voice information is fed back to the user.
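A minimal sketch of this fallback is given below. It assumes the synthesis system is available as a callable `synthesize(text) -> bytes`; that signature, and the function name `voices_for_split_texts`, are assumptions for the example rather than part of the disclosed method.

```python
from typing import Callable, Dict, List


def voices_for_split_texts(
    split_texts: List[str],
    first_voice_cache: Dict[str, bytes],
    synthesize: Callable[[str], bytes],
) -> List[bytes]:
    """Return one voice segment per split text, synthesizing only cache misses."""
    segments: List[bytes] = []
    for text in split_texts:
        audio = first_voice_cache.get(text)
        if audio is None:
            # Not in the first voice cache: fall back to speech synthesis.
            audio = synthesize(text)
        segments.append(audio)
    return segments
```

Only the split texts that miss the cache reach the synthesizer, which is where the reduction in computation overhead comes from. How the segments are joined into a single audio stream depends on the audio format and is not specified in the description.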

In an embodiment, the method further comprises:

and storing the text information and the corresponding voice information into a second voice cache.

In this embodiment, after the above steps, the entire text information and the corresponding voice information are stored in the second voice cache. The second voice cache is mainly used for storing the text information that users have recently requested to be synthesized, together with the corresponding voice information.

In an implementation manner, before splitting the acquired text information to obtain a plurality of split texts, the method further includes:

judging whether the text information exists in a second voice cache;

if the text information is judged to be in the second voice cache, acquiring the voice information corresponding to the text information from the second voice cache;

and if the text information is judged not to exist in the second voice cache, splitting the acquired text information to obtain a plurality of split texts.

In this embodiment, before splitting the acquired text information to obtain a plurality of split texts, it is determined whether the text information exists in the second voice cache, if the text information exists in the second voice cache, the voice information corresponding to the text information is directly extracted, and if it is determined that the text information does not exist in the second voice cache, the text information is split to obtain a plurality of split texts, and then the subsequent steps are continued.

Therefore, when the text information exists in the second voice cache, the corresponding voice information can be directly extracted and fed back to the user, and the voice acquisition efficiency is improved.
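The whole-text check that precedes splitting might look like the sketch below. A plain dictionary stands in for the second voice cache, and `split_and_acquire` is a hypothetical callable representing the splitting and subsequent steps; both are assumptions for illustration.

```python
from typing import Callable, Dict


def get_voice(
    text: str,
    second_voice_cache: Dict[str, bytes],
    split_and_acquire: Callable[[str], bytes],
) -> bytes:
    """Return cached voice for the whole text, else run the splitting path."""
    audio = second_voice_cache.get(text)
    if audio is not None:
        # Whole text was requested recently: no splitting or synthesis needed.
        return audio
    audio = split_and_acquire(text)
    # Store the entire text and its voice so that a repeat request is a hit.
    second_voice_cache[text] = audio
    return audio
```

A second request for the same text is then answered from the cache without invoking the splitting path again.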

In an embodiment, before determining whether the text information exists in the second voice cache, the method further comprises:

judging whether the text information exists in a voice database;

if the text information is judged to be in the voice database, acquiring the voice information corresponding to the text information from the voice database;

and if the text information is judged not to exist in the voice database, judging whether the text information exists in the second voice cache.

In this embodiment, before determining whether the text information exists in the second voice cache, it is determined whether the text information exists in the voice database, where the voice database is used to store a large amount of text information with high use frequency and corresponding voice information. During storage, all historical text information is sorted from high to low according to frequency, tens of thousands of pieces of text information with high frequency and corresponding voice information are selected and stored into a voice database.
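The frequency-based selection described above can be illustrated with `collections.Counter`. The function name `select_frequent_texts` and the parameter `top_n` are hypothetical; the description states only that tens of thousands of high-frequency entries are chosen.

```python
from collections import Counter
from typing import List


def select_frequent_texts(history: List[str], top_n: int) -> List[str]:
    """Rank historical texts by request frequency and keep the top_n."""
    counts = Counter(history)
    # most_common() returns (text, count) pairs sorted from high to low.
    return [text for text, _ in counts.most_common(top_n)]


history = ["a", "b", "a", "c", "a", "b"]
print(select_frequent_texts(history, 2))
```

The voice information for each selected text would then be generated once and stored in the voice database alongside the text.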

After a user inputs text information, whether the text information exists in the voice database is judged firstly, if the text information exists in the voice database, corresponding voice information is extracted from the voice database, and if the text information does not exist in the voice database, whether the text information exists in the second voice cache is judged continuously.

Therefore, when the text information exists in the voice database, the corresponding voice information can be directly extracted and fed back to the user, and the voice acquisition efficiency is improved.

In this embodiment, the voice database is a shared resource that is accessed by the distributed servers; the first voice cache and the second voice cache are independent resources, and each server holds its own.

Fig. 2 is a schematic flow chart of a specific implementation of a speech acquisition method according to an embodiment of the present invention.

As shown in Fig. 2, after the user inputs text information at the local end, the local end transmits the text information to a server, and the server first forwards it to a Redis (Remote Dictionary Server) storage system, which is the above-mentioned voice database. The storage system stores key-value pairs, where the key is text information and the value is the corresponding voice information. The storage system is searched for the text information; if the text information exists in the storage system, the corresponding voice information is extracted and fed back to the server, and the server feeds it back to the local end.

If the text information does not exist in the storage system, the storage system returns a not-found indication to the server, and the server then looks up the text information in an LRU Cache (a cache managed by the least recently used algorithm). The LRU Cache is the second voice cache mentioned above and is used for storing recently accessed text information and the corresponding voice information. If the server judges that the text information exists in the LRU Cache, the voice information corresponding to the text information is extracted and fed back to the local end.
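One common way to realize a least-recently-used cache is with `collections.OrderedDict`, as sketched below. The class name `LRUVoiceCache` and the `capacity` parameter are assumptions for this example; the description does not state how the LRU Cache is built or sized.

```python
from collections import OrderedDict
from typing import Optional


class LRUVoiceCache:
    """Maps text to voice bytes, evicting the least recently used entry."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._capacity = capacity
        self._items: "OrderedDict[str, bytes]" = OrderedDict()

    def get(self, text: str) -> Optional[bytes]:
        audio = self._items.get(text)
        if audio is not None:
            # Mark as most recently used.
            self._items.move_to_end(text)
        return audio

    def put(self, text: str, audio: bytes) -> None:
        self._items[text] = audio
        self._items.move_to_end(text)
        if len(self._items) > self._capacity:
            # Drop the entry that has gone longest without being accessed.
            self._items.popitem(last=False)
```

For instance, with a capacity of 2, putting "a" and "b", reading "a", and then putting "c" evicts "b", because "a" was touched more recently.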

If the text information does not exist in the LRU Cache, the text information is split to obtain a plurality of split texts, and it is then determined whether each of the split texts exists in a Prefix Cache in the server, where the Prefix Cache is the first voice cache mentioned above and is mainly used for storing phrases and corresponding voice information. If a split text exists in the Prefix Cache, the corresponding voice information is extracted from it; if a split text does not exist in the Prefix Cache, it is converted into corresponding voice information by a voice synthesis system. The obtained voice information is fed back to the local end.

By providing Redis, the LRU Cache and the Prefix Cache, the corresponding voice information can be found quickly, the computation overhead of the voice synthesis system is reduced, the voice acquisition efficiency is greatly improved, and the user experiences a near-instant response.
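The lookup order of Fig. 2 can be summarized in a short sketch. Plain dictionaries stand in for Redis, the LRU Cache and the Prefix Cache, `synthesize` is a hypothetical callable for the speech synthesis system, and joining segments by byte concatenation is a simplification assumed here; none of these details are prescribed by the description.

```python
import re
from typing import Callable, Dict, List


def split_text(text: str) -> List[str]:
    """Split at punctuation (assumed delimiter set) and drop empty parts."""
    return [p.strip() for p in re.split(r"[,.;:!?]", text) if p.strip()]


def acquire_voice(
    text: str,
    voice_db: Dict[str, bytes],      # stand-in for the shared Redis store
    lru_cache: Dict[str, bytes],     # stand-in for the per-server LRU Cache
    prefix_cache: Dict[str, bytes],  # stand-in for the per-server Prefix Cache
    synthesize: Callable[[str], bytes],
) -> bytes:
    # 1. Shared voice database, keyed by the whole text.
    audio = voice_db.get(text)
    if audio is not None:
        return audio
    # 2. Recently requested whole texts.
    audio = lru_cache.get(text)
    if audio is not None:
        return audio
    # 3. Split, reuse cached phrases, and synthesize only the misses.
    segments: List[bytes] = []
    for part in split_text(text):
        segment = prefix_cache.get(part)
        if segment is None:
            segment = synthesize(part)
        segments.append(segment)
    audio = b"".join(segments)  # simplification: real audio needs proper joining
    lru_cache[text] = audio     # remember the whole text for repeat requests
    return audio
```

In this arrangement the synthesizer is reached only when all three lookups miss, and then only for the fragments that are not already cached.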

As shown in fig. 3, another aspect of the embodiment of the present invention provides a speech acquiring apparatus, including:

a text acquisition module 201, configured to acquire text information;

the text splitting module 202 is configured to split the acquired text information to obtain a plurality of split texts;

the split text judgment module 203 is configured to judge whether the obtained multiple split texts exist in the first voice cache one by one;

the voice extraction module 204 is configured to, if it is determined by the split text judgment module 203 that the split text exists in the first voice cache, extract the voice information corresponding to the split text from the first voice cache.

In this embodiment, the text information is first acquired by the text acquisition module 201, where the text information may be provided by a user.

The acquired text information is then split by the text splitting module 202 to obtain a plurality of split texts. The text information may be split at its punctuation marks; alternatively, an existing word segmentation tool may be used to perform word segmentation on the text information.

Then, the split text judgment module 203 judges, one by one, whether the obtained split texts exist in the first voice cache. The first voice cache stores common phrases and their corresponding voice information. For example, in the two sentences "Hello, I am xxx" and "Hello, may I ask xxx", "Hello" is a frequently occurring phrase; the part shared by the two sentences is extracted in advance, corresponding voice information is generated for it, and the phrase and its voice information are stored in the first voice cache.

If the split text judgment module 203 determines that a split text exists in the first voice cache, the voice extraction module 204 extracts the voice information corresponding to that split text.
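As an illustrative mapping only, the four modules could be expressed as methods of a single class. The class name `SpeechAcquisitionApparatus`, the method names and the delimiter set are assumptions; the description defines the modules functionally and does not prescribe a software structure.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SpeechAcquisitionApparatus:
    # Hypothetical first voice cache: phrase -> voice bytes.
    first_voice_cache: Dict[str, bytes] = field(default_factory=dict)

    def acquire_text(self, source: str) -> str:
        """Text acquisition module 201."""
        return source.strip()

    def split(self, text: str) -> List[str]:
        """Text splitting module 202 (punctuation-based, assumed delimiters)."""
        return [p.strip() for p in re.split(r"[,.;:!?]", text) if p.strip()]

    def exists(self, split_text: str) -> bool:
        """Split text judgment module 203."""
        return split_text in self.first_voice_cache

    def extract(self, split_text: str) -> Optional[bytes]:
        """Voice extraction module 204: returns None when module 203 says no."""
        if self.exists(split_text):
            return self.first_voice_cache[split_text]
        return None
```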

In an implementation manner, before the text splitting module 202 splits the acquired text information to obtain a plurality of split texts, the apparatus further includes:

a cache determination module 2012, configured to determine whether the text information exists in the second voice cache;

In this embodiment, before the text splitting module 202 splits the acquired text information to obtain a plurality of split texts, the cache determination module 2012 determines whether the text information exists in the second voice cache, if the text information exists in the second voice cache, the voice information corresponding to the text information is directly extracted, and if the text information does not exist in the second voice cache, the text information is split by the text splitting module 202 to obtain a plurality of split texts, and then the subsequent steps are continued.

In an embodiment, before the cache determination module 2012 determines whether the text information exists in the second voice cache, the apparatus further includes:

a database determination module 2011, configured to determine whether the text information exists in the voice database;

In this embodiment, before the cache determination module 2012 determines whether the text information exists in the second voice cache, the database determination module 2011 determines whether the text information exists in a voice database, where the voice database is used for storing a large amount of text information with a high use frequency and corresponding voice information. During storage, all historical text information is sorted from high to low according to frequency, tens of thousands of pieces of text information with high frequency and corresponding voice information are selected and stored into a voice database.

After the user inputs the text message, the database determination module 2011 first determines whether the text message exists in the voice database, if it is determined that the text message exists in the voice database, extracts the corresponding voice message from the voice database, and if it is determined that the text message does not exist in the voice database, it continues to determine whether the text message exists in the second voice cache through the cache determination module 2012.

In an embodiment of the present invention, a computer-readable storage medium includes a set of computer-executable instructions that, when executed, are used for: acquiring text information; splitting the acquired text information to obtain a plurality of split texts; judging whether the obtained split texts exist in a first voice cache one by one; and if the split text is judged to exist in the first voice cache, extracting the voice information corresponding to the split text from the first voice cache.

In the description herein, reference to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples, and features of different embodiments or examples described in this specification, can be combined by one skilled in the art without contradiction.

Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.

The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A method for speech acquisition, the method comprising:

acquiring text information;

splitting the acquired text information to obtain a plurality of split texts;

judging whether the obtained split texts exist in a first voice cache one by one;

and if the split text is judged to exist in the first voice cache, extracting the voice information corresponding to the split text from the first voice cache.

2. The method of claim 1, further comprising:

and if the first voice cache does not have the split text, carrying out voice synthesis on the split text to obtain corresponding voice information.

3. The method of claim 2, further comprising:

storing the text information and the corresponding voice information into a second voice cache.

4. The method according to claim 1 or 3, wherein before splitting the acquired text information into a plurality of split texts, the method further comprises:

judging whether the text information exists in a second voice cache;

5. The method of claim 4, wherein before judging whether the text information exists in the second voice cache, the method further comprises:

judging whether the text information exists in a voice database;

and if the text information is judged not to exist in the voice database, judging whether the text information exists in the second voice cache.

6. The method of claim 5, wherein the voice database is a shared resource and the first voice cache and the second voice cache are exclusive resources.

7. A speech acquisition apparatus, characterized in that the apparatus comprises:

the text acquisition module is used for acquiring text information;

the text splitting module is used for splitting the acquired text information to obtain a plurality of split texts;

the split text judgment module is used for judging whether the obtained split texts exist in the first voice cache one by one;

and the voice extraction module is used for extracting the voice information corresponding to the split text from the first voice cache if the split text is judged to exist in the first voice cache by the split text judgment module.

8. The apparatus according to claim 7, wherein before the text splitting module splits the acquired text information to obtain a plurality of split texts, the apparatus further comprises:

the cache judging module is used for judging whether the text information exists in a second voice cache;

9. The apparatus of claim 8, wherein before the cache judging module judges whether the text information exists in the second voice cache, the apparatus further comprises:

the database judging module is used for judging whether the text information exists in a voice database;

10. A computer-readable storage medium comprising a set of computer-executable instructions that, when executed, perform a speech acquisition method according to any one of claims 1-6.