CN110942765B - Method, device, server and storage medium for constructing corpus - Google Patents


Info

Publication number
CN110942765B
Authority
CN
China
Prior art keywords
voice; feature; pitch; corpus; difference
Prior art date
Legal status
Active
Application number
CN201911095120.3A
Other languages
Chinese (zh)
Other versions
CN110942765A (en)
Inventor
李阳
Current Assignee
Gree Electric Appliances Inc of Zhuhai
Original Assignee
Gree Electric Appliances Inc of Zhuhai
Application filed by Gree Electric Appliances Inc of Zhuhai
Priority to CN201911095120.3A
Publication of CN110942765A
Application granted
Publication of CN110942765B
Legal status: Active

Classifications

    • G10L 15/06: Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G06F 16/61: Information retrieval of audio data; indexing; data structures therefor; storage structures
    • G10L 15/30: Distributed recognition, e.g. in client-server systems, for mobile phones or network applications

Abstract

The application relates to the technical field of intelligent voice, and in particular to a method, a device, a server and a storage medium for constructing a corpus. The method comprises: reading each voice resource in sequence and, for each voice resource read, extracting the corresponding voice features from the voice resource; when it is determined that the voice features do not successfully match the existing reference voice features, establishing a new voice feature library corresponding to the voice features and storing the voice features in the new voice feature library as reference voice features; constructing a new corpus corresponding to the new voice feature library; and converting the voice resource into corresponding text corpora and, when it is determined that the text corpora do not successfully match the existing reference text corpora, adding them to the new corpus. The method improves the efficiency of constructing a corpus.

Description

Method, device, server and storage medium for constructing corpus
Technical Field
The present application relates to the field of intelligent speech technologies, and in particular, to a method, an apparatus, a server, and a storage medium for constructing a corpus.
Background
With the development of information technology, intelligent voice technology has become one of the most convenient and effective technical means for people to acquire and communicate information.
The intelligent voice technology is a means for realizing man-machine language interaction, and voice recognition and voice synthesis are two main branches of the intelligent voice technology. The realization of speech recognition and speech synthesis requires the pre-construction of a corpus, and speech recognition or synthesis is performed based on the corpus.
In the prior art, a corpus is constructed as follows: a large number of volunteers record the corpora, after which staff collect, label and maintain the recorded corpus information.
Because corpus collection and construction in this approach depend heavily on manual operation, substantial labor is occupied, manual collection is inefficient, the time cost of corpus collection is high, and the corpus is therefore built with low efficiency.
In view of the above, the process needs to be redesigned to overcome these drawbacks.
Disclosure of Invention
The embodiments of the present application provide a method, a device, a server and a storage medium for constructing a corpus, which solve the technical problem of low corpus-construction efficiency in the prior art.
The embodiment of the application provides the following specific technical scheme:
in a first aspect of the embodiments of the present application, a method for constructing a corpus is provided, including:
acquiring existing voice resources in a network;
reading each voice resource in sequence, and executing the following operations when each voice resource is read:
extracting corresponding voice features based on a voice resource; when it is determined, based on a first difference and a second difference corresponding to the voice features, that the voice features do not successfully match any existing reference voice features, establishing a new voice feature library corresponding to the voice features, and storing the voice features in the new voice feature library as reference voice features; wherein the voice features include at least a fundamental tone feature and a tone feature, the first difference being derived from the fundamental tone feature and the second difference being derived from the tone feature;
constructing a new corpus corresponding to the new voice feature library;
and converting the voice resource into a corresponding text corpus, and adding the text corpus to the new corpus when the text corpus is determined to be unsuccessfully matched with each existing reference text corpus.
Optionally, before reading each voice resource in sequence, the method further includes:
constructing a Mandarin feature library, extracting the fundamental tone feature and the tone feature of Mandarin, and storing the extracted features in the Mandarin feature library as initial reference voice features;
and constructing a mandarin corpus corresponding to the mandarin feature library.
Optionally, extracting the corresponding voice features based on a voice resource and determining, based on a first difference and a second difference corresponding to the voice features, that the voice features do not successfully match any existing reference voice feature specifically includes:
acquiring a first fundamental tone value and a first tone value, wherein the first fundamental tone value is converted from the fundamental tone feature in any reference voice feature, and the first tone value is converted from the tone feature in that reference voice feature;
extracting the corresponding fundamental tone feature and tone feature based on the voice resource, converting the extracted fundamental tone feature into a second fundamental tone value, and converting the extracted tone feature into a second tone value;
calculating a first difference between the second fundamental tone value and the preset first fundamental tone value, and a second difference between the second tone value and the preset first tone value;
and, when the first difference is greater than a preset first threshold and the second difference is greater than a preset second threshold, determining that the voice features do not successfully match the reference voice features in the Mandarin feature library.
Optionally, further comprising:
after extracting the corresponding voice features based on a voice resource, if it is determined that the voice features successfully match at least one existing reference voice feature, storing the voice features, as reference voice features, in the voice feature library corresponding to the at least one reference voice feature;
and, when it is determined that the text corpus does not successfully match at least one existing reference text corpus, adding the text corpus to the corpus corresponding to the at least one reference text corpus.
In a second aspect of the embodiments of the present application, there is also provided an apparatus for constructing a corpus, including:
an acquisition unit, configured to acquire the existing voice resources in the network;
the processing unit is used for reading each voice resource in sequence, and executing the following operations when each voice resource is read:
extracting corresponding voice features based on a voice resource; when it is determined, based on a first difference and a second difference corresponding to the voice features, that the voice features do not successfully match any existing reference voice features, establishing a new voice feature library corresponding to the voice features, and storing the voice features in the new voice feature library as reference voice features; wherein the voice features include at least a fundamental tone feature and a tone feature, the first difference being derived from the fundamental tone feature and the second difference being derived from the tone feature;
constructing a new corpus corresponding to the new voice feature library;
and converting the voice resource into a corresponding text corpus, and adding the text corpus to the new corpus when the text corpus is determined to be unsuccessfully matched with each existing reference text corpus.
Optionally, before reading each voice resource in sequence, the processing unit is further configured to:
constructing a Mandarin feature library, extracting the fundamental tone feature and the tone feature of Mandarin, and storing the extracted features in the Mandarin feature library as initial reference voice features;
and constructing a mandarin corpus corresponding to the mandarin feature library.
Optionally, when extracting the corresponding voice features based on a voice resource and determining, based on a first difference and a second difference corresponding to the voice features, that the voice features do not successfully match any existing reference voice feature, the processing unit is specifically configured for:
acquiring a first fundamental tone value and a first tone value, wherein the first fundamental tone value is converted from the fundamental tone feature in any reference voice feature, and the first tone value is converted from the tone feature in that reference voice feature;
extracting the corresponding fundamental tone feature and tone feature based on the voice resource, converting the extracted fundamental tone feature into a second fundamental tone value, and converting the extracted tone feature into a second tone value;
calculating a first difference between the second fundamental tone value and the preset first fundamental tone value, and a second difference between the second tone value and the preset first tone value;
and, when the first difference is greater than a preset first threshold and the second difference is greater than a preset second threshold, determining that the voice features do not successfully match the reference voice features in the Mandarin feature library.
Optionally, the processing unit is further configured to:
after extracting the corresponding voice features based on a voice resource, if it is determined that the voice features successfully match at least one existing reference voice feature, storing the voice features, as reference voice features, in the voice feature library corresponding to the at least one reference voice feature;
and, when it is determined that the text corpus does not successfully match at least one existing reference text corpus, adding the text corpus to the corpus corresponding to the at least one reference text corpus.
In a third aspect of the embodiments of the present application, a server is provided, including a memory and a processor, wherein:
a memory for storing executable instructions;
a processor for reading and executing executable instructions stored in the memory to implement a method as claimed in any one of the preceding claims.
In a fourth aspect of the embodiments of the present application, there is also provided a storage medium, wherein instructions of the storage medium, when executed by a processor, enable execution of the method according to any one of the above.
In the embodiments of the application, the existing voice resources in the network are acquired and read one by one. For each voice resource read, the corresponding voice features are extracted; when it is determined that the voice features do not successfully match the existing reference voice features, a new voice feature library is established corresponding to the voice features, and the voice features are stored in the new library as reference voice features; a corpus corresponding to the new voice feature library is constructed; the voice resource is converted into corresponding text corpora, and the text corpora are added to the corpus when they do not successfully match the existing reference text corpora. In this way, the existing voice resources are acquired directly from the network and sorted into the corresponding voice feature libraries through voice feature recognition, so the corpora can be classified automatically; and when corpus material is not yet recorded in a corpus, it is added to the corresponding corpus, so corpus entries are added automatically. Compared with manually collecting voices, adding them to a corpus and classifying them by manual recognition, this greatly improves the efficiency of corpus construction and maintenance and saves operation and maintenance costs.
Drawings
FIG. 1 is a schematic flowchart of an embodiment of a method for constructing a corpus according to the present application;
FIG. 2 is a schematic structural diagram of an embodiment of an apparatus for constructing a corpus according to the present application;
fig. 3 is a schematic structural diagram of a server according to the present application.
Detailed Description
To solve the technical problem of low corpus-construction efficiency in the prior art, the embodiments of this application acquire the existing voice resources in a network, extract voice features from each voice resource, and match them against the existing reference voice features. When the voice features cannot be matched to any existing reference voice features, a new voice feature library is established and a corpus corresponding to that feature library is constructed; the voice resource is then converted into text corpora, and when it is determined that the text corpora cannot be matched to the existing text corpora, they are added to the corresponding corpus.
Alternative embodiments of the present application will now be described in further detail with reference to the accompanying drawings:
in the speech recognition and speech synthesis technology, dialects and mandarin are generally recognized or synthesized separately, so in the embodiment of the present application, when a corpus is established, the dialects and mandarin should respectively construct different corpora, and correspondingly, in the feature recognition process, a speech feature library should be respectively established to store various speech features.
Therefore, as one implementable approach, at least one voice feature library is first constructed as the basic voice feature library. Specifically, a Mandarin feature library is constructed in advance, the fundamental tone feature and tone feature of Mandarin are extracted and stored in the Mandarin feature library as reference voice features, and a Mandarin corpus is constructed corresponding to the Mandarin feature library.
After the fundamental tone feature and tone feature of Mandarin are extracted, the extracted fundamental tone feature is converted into a first fundamental tone value and the extracted tone feature into a first tone value, and the first fundamental tone value and the first tone value are stored in the Mandarin feature library as the basic reference voice features.
It should be noted that the basic voice feature library is not limited to the Mandarin feature library; it may also be another dialect feature library, such as a Cantonese feature library or a Sichuan dialect feature library, and the language type of the basic feature library may be determined according to the actual target users.
Referring to fig. 1, a specific process of the method for constructing a corpus provided in the embodiment of the present application is as follows:
s101: and acquiring the existing voice resources in the network.
The existing voice resources in the network, including voice resources such as audio and video programs on the network, can be obtained by crawling of a web crawler.
Optionally, the acquired voice resource should be preprocessed to remove noise and background noise.
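The patent does not specify how the crawler decides which crawled links are voice resources. The following sketch is a hypothetical illustration only: the function name and the extension-based filtering rule are assumptions made for this example.

```python
# Hypothetical helper: the patent only states that existing voice resources
# (audio and video programs) are crawled from the network. Filtering crawled
# links by file extension is an assumption made for illustration.
AUDIO_EXTENSIONS = (".wav", ".mp3", ".flac", ".m4a", ".ogg")

def select_voice_resource_urls(urls):
    """Keep only links that look like downloadable audio files."""
    return [u for u in urls if u.lower().endswith(AUDIO_EXTENSIONS)]

crawled_links = [
    "http://example.com/news/clip1.WAV",
    "http://example.com/article.html",
    "http://example.com/radio/show.mp3",
]
print(select_voice_resource_urls(crawled_links))
# → ['http://example.com/news/clip1.WAV', 'http://example.com/radio/show.mp3']
```

A real implementation would then download each selected resource and pass it to the preprocessing step described above.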
S102: read one voice resource from the acquired voice resources.
S103: extract the corresponding voice features from the read voice resource.
In the embodiments of the application, extracting the corresponding voice features includes extracting the fundamental tone feature and the tone feature.
Generally, a sound is composed of a series of vibrations with different frequencies and amplitudes emitted by a sounding body; among these vibrations there is one with the lowest frequency, whose sound is the fundamental tone, while the rest are overtones. The fundamental tone feature refers to a speech signal, extracted from the voice resource, that carries fundamental tone information.
In the embodiments of the present application, the tone feature is a speech signal, extracted from the voice resource, that carries information about how high or low the sound frequency is.
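The patent does not prescribe an extraction algorithm for the fundamental tone. One common approach is to estimate the fundamental frequency from the autocorrelation peak of the signal; the sketch below (NumPy, all names assumed) recovers the fundamental of a synthetic 220 Hz tone.

```python
import numpy as np

def estimate_f0(signal, sample_rate, fmin=50.0, fmax=500.0):
    """Estimate the fundamental frequency (Hz) of a voiced frame by finding
    the autocorrelation peak inside the plausible pitch-period range."""
    sig = signal - signal.mean()
    corr = np.correlate(sig, sig, mode="full")[len(sig) - 1:]
    lag_min = int(sample_rate / fmax)   # shortest plausible pitch period
    lag_max = int(sample_rate / fmin)   # longest plausible pitch period
    best_lag = lag_min + int(np.argmax(corr[lag_min:lag_max]))
    return sample_rate / best_lag

sr = 16000
t = np.arange(sr) / sr                  # one second of samples
tone = np.sin(2 * np.pi * 220.0 * t)    # synthetic voiced signal, 220 Hz
print(round(estimate_f0(tone, sr), 1))  # close to 220 Hz
```

The estimate is quantized to whole sample lags, so it lands near (not exactly on) 220 Hz; production systems typically interpolate around the peak or use a dedicated pitch tracker.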
S104: determine whether the voice features fail to match each of the existing reference voice features. If so, go to S105; otherwise, go to S107.
Optionally, when step S104 is executed, the current voice features are matched against the reference voice features in the following manner:
acquiring a first fundamental tone value and a first tone value, wherein the first fundamental tone value is converted from the fundamental tone feature in any reference voice feature, and the first tone value is converted from the tone feature in that reference voice feature;
extracting the corresponding fundamental tone feature and tone feature based on the voice resource, converting the extracted fundamental tone feature into a second fundamental tone value, and converting the extracted tone feature into a second tone value;
calculating a first difference between the second fundamental tone value and the preset first fundamental tone value, and a second difference between the second tone value and the preset first tone value;
and, when the first difference is greater than the preset first threshold and the second difference is greater than the preset second threshold, determining that the voice features do not successfully match the reference voice features in the Mandarin feature library.
When the voice features successfully match at least one existing reference voice feature, the voice features are stored, as a reference voice feature, in the voice feature library corresponding to that reference voice feature.
For example, assume the first fundamental tone value of Mandarin is 65 and the first tone value of Mandarin, taken as the average of the tone values of the four Mandarin tones, is 98, with a first threshold of 0.5 and a second threshold of 0.8. When the second fundamental tone value is 67, the first difference 67 - 65 = 2 is greater than the first threshold 0.5, so a mismatch is determined; when the second tone value is 99, the second difference 99 - 98 = 1 is greater than the second threshold 0.8, so a mismatch is likewise determined.
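Using the example values above (Mandarin reference 65 / 98, thresholds 0.5 / 0.8), the decision rule of S104 can be sketched as follows. The function name and defaults are assumptions, and taking the absolute value of each difference is an interpretation, since the patent only says the differences are compared against the thresholds.

```python
def matches_reference(second_fundamental, second_tone,
                      first_fundamental=65.0, first_tone=98.0,
                      threshold1=0.5, threshold2=0.8):
    """Per S104, the voice features fail to match the reference only when
    BOTH the fundamental tone difference and the tone difference exceed
    their thresholds; otherwise they are treated as a match."""
    first_difference = abs(second_fundamental - first_fundamental)
    second_difference = abs(second_tone - first_tone)
    return not (first_difference > threshold1 and second_difference > threshold2)

print(matches_reference(67, 99))      # differences 2 and 1: mismatch → False
print(matches_reference(65.2, 98.3))  # differences within thresholds → True
```

Note the asymmetry this rule implies: if only one of the two differences exceeds its threshold, the resource is still treated as matching the library.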
S105: establish a new voice feature library corresponding to the voice features, and store the voice features in the new library as reference voice features.
Specifically, take the Mandarin voice features in the Mandarin feature library as the reference voice features.
First, the fundamental tone feature and tone feature extracted from the currently read voice resource are compared with the reference voice features in the Mandarin feature library.
If at least one matches, the currently read voice resource corresponds to the Mandarin feature library, so no new voice feature library needs to be established. Optionally, the voice feature library matched by the currently read voice resource is marked at this time.
If none of the reference voice features match, the currently read voice resource does not belong to the Mandarin category. In that case, a new voice feature library corresponding to this voice resource needs to be created (the first dialect feature library), and the fundamental tone feature and tone feature extracted from this voice resource are stored in it as reference voice features.
For the next voice resource, when the flow reaches S104 again via S101-S103, the voice features extracted from that resource are compared with the Mandarin reference voice features in the constructed Mandarin feature library and with the reference voice features in the first dialect feature library. If they match the voice features in the first dialect feature library, the next voice resource belongs to the first dialect and no new voice feature library needs to be built; if neither matches, the next voice resource is neither Mandarin nor the first dialect, and if it is determined to be a second dialect, a second dialect feature library is constructed, and so on.
For example, assume the currently read voice resource is Sichuan dialect. After the fundamental tone feature and tone feature are extracted from the voice resource, they are compared with the reference voice features in the Mandarin feature library; if they do not match, a Sichuan dialect feature library is established. The next voice resource read is then compared with the reference voice features in both the Mandarin feature library and the Sichuan dialect feature library; if that resource is Henan dialect and is determined to match neither, a Henan dialect feature library is established.
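The iterative library construction described above can be sketched as a loop. This is a simplified model under stated assumptions: each resource is reduced to a (fundamental tone value, tone value) pair, each library is matched only against its first (seed) reference, the seed values 65/98 and thresholds 0.5/0.8 come from the worked example, and all names are hypothetical.

```python
def build_feature_libraries(resources, threshold1=0.5, threshold2=0.8):
    """Sketch of S102-S105: assign each (fundamental tone value, tone value)
    pair to the first library whose seed reference it matches, or open a
    new dialect library when every comparison fails on both dimensions."""
    libraries = {"mandarin": [(65.0, 98.0)]}   # pre-built Mandarin library
    dialect_count = 0
    for f0, tone in resources:
        for refs in libraries.values():
            seed_f0, seed_tone = refs[0]
            mismatch = (abs(f0 - seed_f0) > threshold1
                        and abs(tone - seed_tone) > threshold2)
            if not mismatch:
                refs.append((f0, tone))        # matched: store as reference
                break
        else:                                  # no library matched
            dialect_count += 1
            libraries[f"dialect_{dialect_count}"] = [(f0, tone)]
    return libraries

libs = build_feature_libraries([(65.1, 98.2), (72.0, 104.0), (72.3, 104.5)])
print({name: len(refs) for name, refs in libs.items()})
# → {'mandarin': 2, 'dialect_1': 2}
```

The first resource matches the Mandarin seed, the second opens a first dialect library, and the third matches that new library, mirroring the Sichuan/Henan example above.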
S106: and constructing a corpus corresponding to the new speech feature library.
Optionally, each time a new voice feature library is constructed, a new corpus is constructed correspondingly.
The corpus, which is the basic audio material for speech recognition and speech synthesis, may be individual words, phrases or idioms, or may be a sentence.
S107: convert the voice resource into corresponding text corpora.
S108: judge whether the converted text corpora fail to match each of the existing reference text corpora. If so, go to S109; otherwise, go to S110.
S109: add the text corpora to the corpus.
For example, suppose a voice resource is the Mandarin phrase 'with you without melon'. It is determined in S104 that the voice resource matches the reference voice features in the Mandarin feature library, i.e. the voice resource is preliminarily classified as Mandarin corpus material; if it is then determined in S108 that it does not successfully match any existing reference text corpus, it is added in S109 to the corpus corresponding to the Mandarin feature library.
S110: determine whether there is a next voice resource. If so, return to S102; otherwise, the flow ends.
By executing S102-S110 in a loop, multiple voice feature libraries and corpora can be constructed. The corresponding reference voice features and reference text corpora accumulate continuously through automatic matching of voice features, and corpora with sufficient material are obtained through autonomous learning, which is of significant reference value for speech synthesis and speech recognition.
A complete embodiment of the method of constructing a corpus is listed below:
a mandarin feature library is constructed in advance, and a mandarin corpus is correspondingly constructed. The mandarin feature library stores a mandarin pitch value (corresponding to the first pitch value) and a mandarin pitch value (corresponding to the first pitch value).
And acquiring the existing voice resources from the network.
The nth voice resource (n an integer, n ≥ 1) is read. If its content is 'I love my country' spoken in Sichuan dialect, the fundamental tone feature and tone feature are extracted from the voice resource, the extracted fundamental tone feature is converted into a second fundamental tone value, and the extracted tone feature is converted into a second tone value.
A first difference between the second fundamental tone value and the fundamental tone value of Mandarin is calculated, and a second difference between the second tone value and the tone value of Mandarin.
It is then judged whether the first difference is greater than the first threshold and whether the second difference is greater than the second threshold. If both hold simultaneously, the current voice resource 'I love my country' is judged not to belong to the Mandarin feature library; therefore a new voice feature library is constructed, the second fundamental tone value and second tone value corresponding to 'I love my country' are stored in it, and it is marked as the Sichuan dialect feature library.
Corresponding to the Sichuan dialect feature library, a new corpus is constructed and marked as the Sichuan dialect corpus.
The current voice resource is converted into the text corpora 'I', 'love', 'my' and 'country'.
It is then judged whether the Sichuan dialect corpus contains reference text corpora matching 'I love my country'.
Since the newly built corpus obviously stores no reference text corpora yet, it is directly judged that there is no match with existing reference text corpora, and 'I', 'love', 'my' and 'country' are added to the Sichuan dialect corpus.
Thus, a processing flow of the voice resource is completed.
Then it is judged whether there is a next voice resource; if so, n is incremented to n + 1, the nth voice resource is read, and the above flow is executed again; otherwise, the flow ends.
By reference to the processing of the nth voice resource, the processing of the (n+1)th, (n+2)th and subsequent voice resources follows correspondingly and is not described again.
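The text-corpus side of this embodiment (S107-S109, with duplicate entries skipped) can be sketched in the same spirit; the set-based storage and the function name are assumptions made for illustration.

```python
def add_corpus_entries(corpus, text_corpora):
    """Sketch of S107-S109: add only text corpora that are not already
    recorded in the target corpus; return what was newly added."""
    added = []
    for item in text_corpora:
        if item not in corpus:
            corpus.add(item)
            added.append(item)
    return added

sichuan_corpus = set()                          # newly built, empty corpus
print(add_corpus_entries(sichuan_corpus, ["I", "love", "my", "country"]))
# → ['I', 'love', 'my', 'country']
print(add_corpus_entries(sichuan_corpus, ["love", "China"]))
# → ['China']  (the duplicate 'love' is skipped)
```

The first call mirrors the newly built Sichuan dialect corpus above: with no reference text corpora stored yet, every converted corpus entry is added.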
Referring to fig. 2, an embodiment of the present application provides an apparatus for constructing a corpus, including:
an obtaining unit 201, configured to obtain existing voice resources in a network;
a processing unit 202, configured to read each voice resource in sequence, and execute the following operations for each read voice resource:
extracting corresponding voice features based on a voice resource; when it is determined, based on a first difference and a second difference corresponding to the voice features, that the voice features do not successfully match any existing reference voice features, establishing a new voice feature library corresponding to the voice features, and storing the voice features in the new voice feature library as reference voice features; wherein the voice features include at least a fundamental tone feature and a tone feature, the first difference being derived from the fundamental tone feature and the second difference being derived from the tone feature;
constructing a new corpus corresponding to the new voice feature library;
and converting the voice resource into a corresponding text corpus, and adding the text corpus to the new corpus when the text corpus is determined to be unsuccessfully matched with each existing reference text corpus.
Optionally, before reading each voice resource in sequence, the processing unit 202 is further configured to:
constructing a Mandarin feature library, extracting the fundamental tone feature and the tone feature of Mandarin, and storing the extracted features in the Mandarin feature library as initial reference voice features;
and constructing a mandarin corpus corresponding to the mandarin feature library.
Optionally, in extracting corresponding voice features based on a voice resource and determining, based on the first difference and the second difference corresponding to the voice features, that the voice features do not successfully match any existing reference voice feature, the processing unit 202 is specifically configured to:
acquire a first fundamental tone value and a first tone value, wherein the first fundamental tone value is obtained by converting the fundamental tone feature in any one reference voice feature, and the first tone value is obtained by converting the tone feature in that reference voice feature;
extract the corresponding fundamental tone feature and tone feature based on a voice resource, convert the extracted fundamental tone feature into a second fundamental tone value, and convert the extracted tone feature into a second tone value;
calculate a first difference between the second fundamental tone value and the first fundamental tone value, and a second difference between the second tone value and the first tone value;
and, when the first difference is greater than a preset first threshold and the second difference is greater than a preset second threshold, determine that the voice features do not successfully match the reference voice features in the Mandarin feature library.
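The matching rule above can be sketched as follows. The absolute-difference form and the concrete threshold values are assumptions of this sketch; the patent only states that a match fails when both differences exceed their preset thresholds:

```python
PITCH_THRESHOLD = 20.0  # preset first threshold (illustrative value)
TONE_THRESHOLD = 10.0   # preset second threshold (illustrative value)

def features_match(reference, candidate,
                   pitch_threshold=PITCH_THRESHOLD,
                   tone_threshold=TONE_THRESHOLD):
    """Per the rule above, matching fails only when BOTH the fundamental
    tone difference exceeds the first threshold AND the tone difference
    exceeds the second threshold; otherwise the features match."""
    first_pitch_value, first_tone_value = reference
    second_pitch_value, second_tone_value = candidate
    first_difference = abs(second_pitch_value - first_pitch_value)
    second_difference = abs(second_tone_value - first_tone_value)
    return not (first_difference > pitch_threshold
                and second_difference > tone_threshold)
```

Note the asymmetry this rule implies: exceeding only one of the two thresholds still counts as a successful match.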
Optionally, the processing unit 202 is further configured to:
after extracting corresponding voice features based on a voice resource, if it is determined that the voice features successfully match at least one existing reference voice feature, store the voice features as reference voice features in the voice feature library corresponding to the at least one reference voice feature;
and, when it is determined that the text corpus does not successfully match at least one existing reference text corpus, add the text corpus to the corpus corresponding to the at least one reference text corpus.
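Putting both branches together, each voice resource is either routed to an existing feature library and its corpus, or used to found a new library-and-corpus pair. The sketch below reuses the both-thresholds-exceeded rule; comparing against only the first reference feature of each library is a simplifying assumption of this example:

```python
def classify_resource(libraries, corpora_map, pitch_value, tone_value, text,
                      pitch_threshold=20.0, tone_threshold=10.0):
    """Store the features in the first library they match and add the text
    corpus there if it is not yet recorded; otherwise establish a new
    library and a new corpus for this voice profile."""
    for lib_id, references in libraries.items():
        ref_pitch, ref_tone = references[0]
        mismatch = (abs(pitch_value - ref_pitch) > pitch_threshold
                    and abs(tone_value - ref_tone) > tone_threshold)
        if not mismatch:
            references.append((pitch_value, tone_value))
            if text not in corpora_map[lib_id]:  # add only unrecorded corpora
                corpora_map[lib_id].append(text)
            return lib_id
    new_id = "lib-%d" % (len(libraries) + 1)
    libraries[new_id] = [(pitch_value, tone_value)]
    corpora_map[new_id] = [text]
    return new_id

libs = {"lib-1": [(180.0, 62.0)]}
corp = {"lib-1": []}
classify_resource(libs, corp, 185.0, 64.0, "turn on the light")   # matches lib-1
classify_resource(libs, corp, 260.0, 95.0, "turn off the light")  # founds lib-2
```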
Based on the same inventive concept, referring to fig. 3, an embodiment of the present application further provides a server, including: a memory 301 and a processor 302, wherein,
a memory 301 for storing executable instructions;
a processor 302, configured to read and execute executable instructions stored in the memory, so as to implement any one of the methods for constructing a corpus described above.
Based on the same inventive concept, the present application further provides a storage medium; when instructions in the storage medium are executed by a processor, any one of the methods for constructing a corpus described above can be performed.
To sum up, in the embodiments of the present application, existing voice resources in the network are acquired and read in sequence; when it is determined that the extracted voice features do not match any existing reference voice feature, a new voice feature library is established for those voice features, together with a corpus corresponding to the new library. The voice resource is then converted into a corresponding text corpus, and the text corpus is added to that corpus when it does not successfully match any existing reference text corpus. In this way, existing voice resources in the network are matched by their voice features, and different voice features correspond to different corpora, so corpora are classified automatically based on feature matching; when a text corpus is not yet recorded in the corpus, it is added to the corresponding corpus, so corpus entries are added automatically, which improves the efficiency of corpus construction and maintenance and reduces operation and maintenance costs.
Furthermore, the fundamental tone feature and the tone feature are typical features of speech that can be represented quantitatively and can reflect the differences between different voices; extracting them from voice resources for feature matching therefore yields a good matching effect and a high recognition rate.
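As a concrete illustration of how a quantifiable fundamental tone value can be obtained from raw speech samples, the sketch below estimates the fundamental frequency by autocorrelation. This particular estimator, and the plain-list representation of the signal, are assumptions of this example rather than anything specified by the patent:

```python
import math

def estimate_pitch_hz(samples, sample_rate, fmin=50.0, fmax=500.0):
    """Estimate the fundamental frequency (Hz) of a voiced frame by
    picking the lag with the strongest autocorrelation inside the
    plausible pitch range [fmin, fmax]."""
    mean = sum(samples) / len(samples)
    x = [s - mean for s in samples]          # remove DC offset
    lag_min = int(sample_rate / fmax)        # shortest period considered
    lag_max = min(int(sample_rate / fmin), len(x) - 1)
    best_lag, best_corr = lag_min, float("-inf")
    for lag in range(lag_min, lag_max):
        corr = sum(x[i] * x[i + lag] for i in range(len(x) - lag))
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return sample_rate / best_lag

# A 200 Hz sine sampled at 8 kHz should be estimated near 200 Hz.
sr = 8000
samples = [math.sin(2 * math.pi * 200.0 * i / sr) for i in range(800)]
estimated = estimate_pitch_hz(samples, sr)
```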
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various changes and modifications may be made in the embodiments of the present application without departing from the spirit and scope of the embodiments of the present application. Thus, if such modifications and variations of the embodiments of the present application fall within the scope of the claims of the present application and their equivalents, the present application is also intended to encompass such modifications and variations.

Claims (8)

1. A method of constructing a corpus, comprising:
acquiring existing voice resources in a network;
reading each voice resource in sequence, and executing the following operations when each voice resource is read:
extracting corresponding voice features based on a voice resource, establishing a new voice feature library corresponding to the voice features when it is determined, based on a first difference and a second difference corresponding to the voice features, that the voice features do not successfully match any existing reference voice feature, and storing the voice features as reference voice features in the new voice feature library; wherein the voice features include at least a fundamental tone feature and a tone feature, the first difference being derived from the fundamental tone feature and the second difference from the tone feature;
constructing a new corpus corresponding to the new voice feature library;
converting the voice resource into a corresponding text corpus, and adding the text corpus to the new corpus when it is determined that the text corpus does not successfully match any existing reference text corpus;
wherein extracting corresponding voice features based on a voice resource and determining, based on the first difference and the second difference corresponding to the voice features, that the voice features do not successfully match any existing reference voice feature specifically comprises:
acquiring a first fundamental tone value and a first tone value, wherein the first fundamental tone value is obtained by converting the fundamental tone feature in any one reference voice feature, and the first tone value is obtained by converting the tone feature in that reference voice feature;
extracting the corresponding fundamental tone feature and tone feature based on a voice resource, converting the extracted fundamental tone feature into a second fundamental tone value, and converting the extracted tone feature into a second tone value;
calculating a first difference between the second fundamental tone value and the first fundamental tone value, and calculating a second difference between the second tone value and the first tone value;
and when the first difference is greater than a preset first threshold and the second difference is greater than a preset second threshold, determining that the voice features do not successfully match the reference voice features in the Mandarin feature library.
2. The method of claim 1, wherein prior to reading each voice resource in sequence, further comprising:
constructing a Mandarin feature library, extracting the fundamental tone feature and the tone feature of Mandarin, and storing the extracted features in the Mandarin feature library as initial reference voice features;
and constructing a Mandarin corpus corresponding to the Mandarin feature library.
3. The method of claim 1, further comprising:
after extracting corresponding voice features based on a voice resource, if it is determined that the voice features successfully match at least one existing reference voice feature, storing the voice features as reference voice features in the voice feature library corresponding to the at least one reference voice feature;
and when it is determined that the text corpus does not successfully match at least one existing reference text corpus, adding the text corpus to the corpus corresponding to the at least one reference text corpus.
4. An apparatus for constructing a corpus, comprising:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring the existing voice resources in the network;
the processing unit is used for reading each voice resource in sequence, and executing the following operations when each voice resource is read:
extracting corresponding voice features based on a voice resource, establishing a new voice feature library corresponding to the voice features when it is determined, based on a first difference and a second difference corresponding to the voice features, that the voice features do not successfully match any existing reference voice feature, and storing the voice features as reference voice features in the new voice feature library; wherein the voice features include at least a fundamental tone feature and a tone feature, the first difference being derived from the fundamental tone feature and the second difference from the tone feature;
constructing a new corpus corresponding to the new voice feature library;
converting the voice resource into a corresponding text corpus, and adding the text corpus to the new corpus when it is determined that the text corpus does not successfully match any existing reference text corpus;
wherein extracting corresponding voice features based on a voice resource and determining, based on the first difference and the second difference corresponding to the voice features, that the voice features do not successfully match any existing reference voice feature specifically comprises:
acquiring a first fundamental tone value and a first tone value, wherein the first fundamental tone value is obtained by converting the fundamental tone feature in any one reference voice feature, and the first tone value is obtained by converting the tone feature in that reference voice feature;
extracting the corresponding fundamental tone feature and tone feature based on a voice resource, converting the extracted fundamental tone feature into a second fundamental tone value, and converting the extracted tone feature into a second tone value;
calculating a first difference between the second fundamental tone value and the first fundamental tone value, and calculating a second difference between the second tone value and the first tone value;
and when the first difference is greater than a preset first threshold and the second difference is greater than a preset second threshold, determining that the voice features do not successfully match the reference voice features in the Mandarin feature library.
5. The device of claim 4, wherein prior to reading each voice resource in turn, the processing unit is further configured to:
constructing a Mandarin feature library, extracting the fundamental tone feature and the tone feature of Mandarin, and storing the extracted features in the Mandarin feature library as initial reference voice features;
and constructing a Mandarin corpus corresponding to the Mandarin feature library.
6. The device of claim 4, wherein the processing unit is further to:
after extracting corresponding voice features based on a voice resource, if it is determined that the voice features successfully match at least one existing reference voice feature, storing the voice features as reference voice features in the voice feature library corresponding to the at least one reference voice feature;
and when it is determined that the text corpus does not successfully match at least one existing reference text corpus, adding the text corpus to the corpus corresponding to the at least one reference text corpus.
7. A server, comprising: a memory and a processor; wherein,
a memory for storing executable instructions;
a processor for reading and executing executable instructions stored in the memory to implement the method of any one of claims 1-3.
8. A storage medium, characterized in that instructions in the storage medium, when executed by a processor, enable execution of the method according to any one of claims 1-3.
CN201911095120.3A 2019-11-11 2019-11-11 Method, device, server and storage medium for constructing corpus Active CN110942765B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911095120.3A CN110942765B (en) 2019-11-11 2019-11-11 Method, device, server and storage medium for constructing corpus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911095120.3A CN110942765B (en) 2019-11-11 2019-11-11 Method, device, server and storage medium for constructing corpus

Publications (2)

Publication Number Publication Date
CN110942765A CN110942765A (en) 2020-03-31
CN110942765B true CN110942765B (en) 2022-05-27

Family

ID=69906444

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911095120.3A Active CN110942765B (en) 2019-11-11 2019-11-11 Method, device, server and storage medium for constructing corpus

Country Status (1)

Country Link
CN (1) CN110942765B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111356022A (en) * 2020-04-18 2020-06-30 徐琼琼 Video file processing method based on voice recognition
CN113593556A (en) * 2021-07-26 2021-11-02 深圳市捌零零在线科技有限公司 Human-computer interaction method and device for vehicle-mounted voice operating system
CN115810345A (en) * 2022-11-23 2023-03-17 北京伽睿智能科技集团有限公司 Intelligent speech technology recommendation method, system, equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003029774A (en) * 2001-07-19 2003-01-31 Matsushita Electric Ind Co Ltd Voice waveform dictionary distribution system, voice waveform dictionary preparing device, and voice synthesizing terminal equipment
CN1604182A (en) * 2003-09-29 2005-04-06 摩托罗拉公司 Method for voice synthesizing
CN109616131A (en) * 2018-11-12 2019-04-12 南京南大电子智慧型服务机器人研究院有限公司 A digital real-time voice-changing method

Family Cites Families (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102881283B (en) * 2011-07-13 2014-05-28 三星电子(中国)研发中心 Method and system for processing voice
CN202584695U (en) * 2011-12-30 2012-12-05 深圳市车音网科技有限公司 Mapping display system and device thereof
EP3077919A4 (en) * 2013-12-02 2017-05-10 Qbase LLC Method for disambiguating features in unstructured text
US20160262974A1 (en) * 2015-03-10 2016-09-15 Strathspey Crown Holdings, LLC Autonomic nervous system balancing device and method of use
US10511712B2 (en) * 2016-08-19 2019-12-17 Andrew Horton Caller identification in a secure environment using voice biometrics
CN106202380B (en) * 2016-07-08 2019-12-24 中国科学院上海高等研究院 Method and system for constructing classified corpus and server with system
CN106649278B (en) * 2016-12-30 2019-11-15 三星电子(中国)研发中心 Extend the method and system of spoken dialogue system corpus
CN106935248B (en) * 2017-02-14 2021-02-05 广州孩教圈信息科技股份有限公司 Voice similarity detection method and device
CN108764010A (en) * 2018-03-23 2018-11-06 姜涵予 Emotional state determines method and device
CN109065028B (en) * 2018-06-11 2022-12-30 平安科技(深圳)有限公司 Speaker clustering method, speaker clustering device, computer equipment and storage medium
CN109036424A (en) * 2018-08-30 2018-12-18 出门问问信息科技有限公司 Audio recognition method, device, electronic equipment and computer readable storage medium
CN109215638B (en) * 2018-10-19 2021-07-13 珠海格力电器股份有限公司 Voice learning method and device, voice equipment and storage medium
CN109215636B (en) * 2018-11-08 2020-10-30 广东小天才科技有限公司 Voice information classification method and system
CN109801628B (en) * 2019-02-11 2020-02-21 龙马智芯(珠海横琴)科技有限公司 Corpus collection method, apparatus and system
CN110046261B (en) * 2019-04-22 2022-01-21 山东建筑大学 Construction method of multi-modal bilingual parallel corpus of construction engineering
CN110134799B (en) * 2019-05-29 2022-03-01 四川长虹电器股份有限公司 BM25 algorithm-based text corpus construction and optimization method
CN110413723A (en) * 2019-06-06 2019-11-05 福建奇点时空数字科技有限公司 A kind of corpus automated construction method of data-driven
CN110265028B (en) * 2019-06-20 2020-10-09 百度在线网络技术(北京)有限公司 Method, device and equipment for constructing speech synthesis corpus


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Pang Wei. A survey of research on bilingual corpus construction. Information Technology and Informatization, 2015, (03). *

Also Published As

Publication number Publication date
CN110942765A (en) 2020-03-31

Similar Documents

Publication Publication Date Title
CN110377716B (en) Interaction method and device for conversation and computer readable storage medium
CN109065031B (en) Voice labeling method, device and equipment
CN110942765B (en) Method, device, server and storage medium for constructing corpus
CN101076851B (en) Spoken language identification system and method for training and operating the said system
CN100449611C (en) Lexical stress prediction
US20090254349A1 (en) Speech synthesizer
CN105206258A (en) Generation method and device of acoustic model as well as voice synthetic method and device
US8019605B2 (en) Reducing recording time when constructing a concatenative TTS voice using a reduced script and pre-recorded speech assets
CN105185372A (en) Training method for multiple personalized acoustic models, and voice synthesis method and voice synthesis device
CN103823867A (en) Humming type music retrieval method and system based on note modeling
KR20080069990A (en) Speech index pruning
CN110459202B (en) Rhythm labeling method, device, equipment and medium
CN106302987A (en) A kind of audio frequency recommends method and apparatus
CN105161116A (en) Method and device for determining climax fragment of multimedia file
JP2020166839A (en) Sentence recommendation method and apparatus based on associated points of interest
CN111178081B (en) Semantic recognition method, server, electronic device and computer storage medium
CN108364655B (en) Voice processing method, medium, device and computing equipment
CN111724769A (en) Production method of intelligent household voice recognition model
CN109492126B (en) Intelligent interaction method and device
CN113609264B (en) Data query method and device for power system nodes
CN114550718A (en) Hot word speech recognition method, device, equipment and computer readable storage medium
CN111063337B (en) Large-scale voice recognition method and system capable of rapidly updating language model
CN114783424A (en) Text corpus screening method, device, equipment and storage medium
CN113516963B (en) Audio data generation method and device, server and intelligent sound box
JP4504469B2 (en) Method for determining reliability of data composed of audio signals

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant