CN110942765B - Method, device, server and storage medium for constructing corpus - Google Patents


Info

Publication number
CN110942765B
Authority
CN
China
Prior art keywords
voice; feature; pitch; corpus; difference
Prior art date
Legal status
Active
Application number
CN201911095120.3A
Other languages
Chinese (zh)
Other versions
CN110942765A (en)
Inventor
李阳
Current Assignee
Gree Electric Appliances Inc of Zhuhai
Original Assignee
Gree Electric Appliances Inc of Zhuhai
Application filed by Gree Electric Appliances Inc of Zhuhai
Priority to CN201911095120.3A
Publication of CN110942765A
Application granted
Publication of CN110942765B
Legal status: Active

Classifications

    • G10L 15/06: Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G06F 16/61: Information retrieval of audio data; indexing; data structures therefor; storage structures
    • G10L 15/30: Distributed recognition, e.g. in client-server systems, for mobile phones or network applications

Abstract

The application relates to the technical field of intelligent voice, and in particular to a method, a device, a server and a storage medium for constructing a corpus. The method comprises: reading each voice resource in sequence and, for each voice resource read, extracting the corresponding voice features from the voice resource; when it is determined that the voice features do not successfully match the existing reference voice features, establishing a new voice feature library corresponding to the voice features and storing the voice features in the new voice feature library as reference voice features; constructing a new corpus corresponding to the new voice feature library; and converting the voice resource into corresponding text corpora and, when it is determined that the text corpora do not successfully match the existing reference text corpora, adding them to the new corpus. The method improves the efficiency of constructing a corpus.

Description

Method, device, server and storage medium for constructing corpus
Technical Field
The present application relates to the field of intelligent speech technologies, and in particular, to a method, an apparatus, a server, and a storage medium for constructing a corpus.
Background
With the development of information technology, intelligent voice technology has become one of the most convenient and effective technical means for people to acquire and communicate information.
The intelligent voice technology is a means for realizing man-machine language interaction, and voice recognition and voice synthesis are two main branches of the intelligent voice technology. The realization of speech recognition and speech synthesis requires the pre-construction of a corpus, and speech recognition or synthesis is performed based on the corpus.
In the prior art, a corpus is constructed as follows: a large number of volunteers record the corpora, after which staff collect, label and maintain the recorded corpus information.
Because corpus collection and construction in this approach depend heavily on manual operation, substantial labor is occupied, manual collection is inefficient, the time cost of corpus collection is high, and the corpus is therefore built with low efficiency.
In view of the above, the process needs to be redesigned to overcome these drawbacks.
Disclosure of Invention
The embodiments of the present application provide a method, a device, a server and a storage medium for constructing a corpus, which solve the technical problem of low corpus-construction efficiency in the prior art.
The embodiment of the application provides the following specific technical scheme:
in a first aspect of the embodiments of the present application, a method for constructing a corpus is provided, including:
acquiring existing voice resources in a network;
reading each voice resource in sequence, and executing the following operations when each voice resource is read:
extracting corresponding voice features based on a voice resource; when it is determined, based on a first difference and a second difference corresponding to the voice features, that the voice features do not successfully match any existing reference voice features, establishing a new voice feature library corresponding to the voice features, and storing the voice features in the new voice feature library as reference voice features; wherein the voice features include at least a fundamental tone feature and a tone feature, the first difference being derived from the fundamental tone feature and the second difference being derived from the tone feature;
constructing a new corpus corresponding to the new voice feature library;
and converting the voice resource into a corresponding text corpus, and adding the text corpus to the new corpus when the text corpus is determined to be unsuccessfully matched with each existing reference text corpus.
Optionally, before reading each voice resource in sequence, the method further includes:
constructing a Mandarin feature library, extracting the fundamental tone feature and the tone feature of Mandarin, and storing the extracted features in the Mandarin feature library as initial reference voice features;
and constructing a mandarin corpus corresponding to the mandarin feature library.
Optionally, extracting the corresponding voice features based on a voice resource and determining, based on a first difference and a second difference corresponding to the voice features, that the voice features do not successfully match any existing reference voice feature specifically includes:
acquiring a first fundamental tone value and a first tone value, wherein the first fundamental tone value is converted from the fundamental tone feature in any reference voice feature, and the first tone value is converted from the tone feature in that reference voice feature;
extracting the corresponding fundamental tone feature and tone feature based on the voice resource, converting the extracted fundamental tone feature into a second fundamental tone value, and converting the extracted tone feature into a second tone value;
calculating a first difference between the second fundamental tone value and the preset first fundamental tone value, and a second difference between the second tone value and the preset first tone value;
and, when the first difference is greater than a preset first threshold and the second difference is greater than a preset second threshold, determining that the voice features do not successfully match the reference voice features in the Mandarin feature library.
Optionally, further comprising:
after extracting the corresponding voice features based on a voice resource, if it is determined that the voice features successfully match at least one existing reference voice feature, storing the voice features, as reference voice features, in the voice feature library corresponding to the at least one reference voice feature;
and, when it is determined that the text corpus does not successfully match at least one existing reference text corpus, adding the text corpus to the corpus corresponding to the at least one reference text corpus.
In a second aspect of the embodiments of the present application, there is also provided an apparatus for constructing a corpus, including:
an acquisition unit, configured to acquire the existing voice resources in the network;
the processing unit is used for reading each voice resource in sequence, and executing the following operations when each voice resource is read:
extracting corresponding voice features based on a voice resource; when it is determined, based on a first difference and a second difference corresponding to the voice features, that the voice features do not successfully match any existing reference voice features, establishing a new voice feature library corresponding to the voice features, and storing the voice features in the new voice feature library as reference voice features; wherein the voice features include at least a fundamental tone feature and a tone feature, the first difference being derived from the fundamental tone feature and the second difference being derived from the tone feature;
constructing a new corpus corresponding to the new voice feature library;
and converting the voice resource into a corresponding text corpus, and adding the text corpus to the new corpus when the text corpus is determined to be unsuccessfully matched with each existing reference text corpus.
Optionally, before reading each voice resource in sequence, the processing unit is further configured to:
constructing a Mandarin feature library, extracting the fundamental tone feature and the tone feature of Mandarin, and storing the extracted features in the Mandarin feature library as initial reference voice features;
and constructing a mandarin corpus corresponding to the mandarin feature library.
Optionally, when extracting the corresponding voice features based on a voice resource and determining, based on a first difference and a second difference corresponding to the voice features, that the voice features do not successfully match any existing reference voice feature, the processing unit is specifically configured for:
acquiring a first fundamental tone value and a first tone value, wherein the first fundamental tone value is converted from the fundamental tone feature in any reference voice feature, and the first tone value is converted from the tone feature in that reference voice feature;
extracting the corresponding fundamental tone feature and tone feature based on the voice resource, converting the extracted fundamental tone feature into a second fundamental tone value, and converting the extracted tone feature into a second tone value;
calculating a first difference between the second fundamental tone value and the preset first fundamental tone value, and a second difference between the second tone value and the preset first tone value;
and, when the first difference is greater than a preset first threshold and the second difference is greater than a preset second threshold, determining that the voice features do not successfully match the reference voice features in the Mandarin feature library.
Optionally, the processing unit is further configured to:
after extracting the corresponding voice features based on a voice resource, if it is determined that the voice features successfully match at least one existing reference voice feature, storing the voice features, as reference voice features, in the voice feature library corresponding to the at least one reference voice feature;
and, when it is determined that the text corpus does not successfully match at least one existing reference text corpus, adding the text corpus to the corpus corresponding to the at least one reference text corpus.
In a third aspect of the embodiments of the present application, a server is provided, including a memory and a processor, wherein:
a memory for storing executable instructions;
a processor for reading and executing executable instructions stored in the memory to implement a method as claimed in any one of the preceding claims.
In a fourth aspect of the embodiments of the present application, there is also provided a storage medium, wherein instructions of the storage medium, when executed by a processor, enable execution of the method according to any one of the above.
In the embodiments of the application, the existing voice resources in the network are acquired and read one by one. For each voice resource read, the corresponding voice features are extracted; when it is determined that the voice features do not successfully match the existing reference voice features, a new voice feature library is established corresponding to the voice features, and the voice features are stored in the new library as reference voice features; a corpus corresponding to the new voice feature library is constructed; the voice resource is converted into corresponding text corpora, and the text corpora are added to the corpus when they do not successfully match the existing reference text corpora. In this way, the existing voice resources are acquired directly from the network and sorted into the corresponding voice feature libraries through voice feature recognition, so the corpora can be classified automatically; and when corpus material is not yet recorded in a corpus, it is added to the corresponding corpus, so corpus entries are added automatically. Compared with manually collecting voices, adding them to a corpus and classifying them by manual recognition, this greatly improves the efficiency of corpus construction and maintenance and saves operation and maintenance costs.
Drawings
FIG. 1 is a schematic flowchart of an embodiment of a method for constructing a corpus according to the present application;
FIG. 2 is a schematic structural diagram of an embodiment of an apparatus for constructing a corpus according to the present application;
fig. 3 is a schematic structural diagram of a server according to the present application.
Detailed Description
To solve the technical problem of low corpus-construction efficiency in the prior art, the embodiments of this application acquire the existing voice resources in a network, extract voice features from each voice resource, and match them against the existing reference voice features. When the voice features cannot be matched to any existing reference voice features, a new voice feature library is established and a corpus corresponding to that feature library is constructed; the voice resource is then converted into text corpora, and when it is determined that the text corpora cannot be matched to the existing text corpora, they are added to the corresponding corpus.
Alternative embodiments of the present application will now be described in further detail with reference to the accompanying drawings:
in the speech recognition and speech synthesis technology, dialects and mandarin are generally recognized or synthesized separately, so in the embodiment of the present application, when a corpus is established, the dialects and mandarin should respectively construct different corpora, and correspondingly, in the feature recognition process, a speech feature library should be respectively established to store various speech features.
Therefore, as one implementable approach, at least one voice feature library is first constructed as the basic voice feature library. Specifically, a Mandarin feature library is constructed in advance, the fundamental tone feature and tone feature of Mandarin are extracted and stored in the Mandarin feature library as reference voice features, and a Mandarin corpus is constructed corresponding to the Mandarin feature library.
After the fundamental tone feature and tone feature of Mandarin are extracted, the extracted fundamental tone feature is converted into a first fundamental tone value and the extracted tone feature into a first tone value, and the first fundamental tone value and the first tone value are stored in the Mandarin feature library as the basic reference voice features.
It should be noted that the basic voice feature library is not limited to the Mandarin feature library; it may also be another dialect feature library, such as a Cantonese feature library or a Sichuan dialect feature library, and the language type of the basic feature library may be determined according to the actual target users.
Referring to fig. 1, a specific process of the method for constructing a corpus provided in the embodiment of the present application is as follows:
s101: and acquiring the existing voice resources in the network.
The existing voice resources in the network, including voice resources such as audio and video programs on the network, can be obtained by crawling of a web crawler.
Optionally, the acquired voice resource should be preprocessed to remove noise and background noise.
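The patent does not specify how the crawler decides which crawled links are voice resources. The following sketch is a hypothetical illustration only: the function name and the extension-based filtering rule are assumptions made for this example.

```python
# Hypothetical helper: the patent only states that existing voice resources
# (audio and video programs) are crawled from the network. Filtering crawled
# links by file extension is an assumption made for illustration.
AUDIO_EXTENSIONS = (".wav", ".mp3", ".flac", ".m4a", ".ogg")

def select_voice_resource_urls(urls):
    """Keep only links that look like downloadable audio files."""
    return [u for u in urls if u.lower().endswith(AUDIO_EXTENSIONS)]

crawled_links = [
    "http://example.com/news/clip1.WAV",
    "http://example.com/article.html",
    "http://example.com/radio/show.mp3",
]
print(select_voice_resource_urls(crawled_links))
# → ['http://example.com/news/clip1.WAV', 'http://example.com/radio/show.mp3']
```

A real implementation would then download each selected resource and pass it to the preprocessing step described above.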
S102: read one voice resource from the acquired voice resources.
S103: extract the corresponding voice features from the read voice resource.
In the embodiments of the application, extracting the corresponding voice features includes extracting the fundamental tone feature and the tone feature.
Generally, a sound is composed of a series of vibrations with different frequencies and amplitudes emitted by a sounding body; among these vibrations there is one with the lowest frequency, whose sound is the fundamental tone, while the rest are overtones. The fundamental tone feature refers to a speech signal, extracted from the voice resource, that carries fundamental tone information.
In the embodiments of the present application, the tone feature is a speech signal, extracted from the voice resource, that carries information about how high or low the sound frequency is.
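The patent does not prescribe an extraction algorithm for the fundamental tone. One common approach is to estimate the fundamental frequency from the autocorrelation peak of the signal; the sketch below (NumPy, all names assumed) recovers the fundamental of a synthetic 220 Hz tone.

```python
import numpy as np

def estimate_f0(signal, sample_rate, fmin=50.0, fmax=500.0):
    """Estimate the fundamental frequency (Hz) of a voiced frame by finding
    the autocorrelation peak inside the plausible pitch-period range."""
    sig = signal - signal.mean()
    corr = np.correlate(sig, sig, mode="full")[len(sig) - 1:]
    lag_min = int(sample_rate / fmax)   # shortest plausible pitch period
    lag_max = int(sample_rate / fmin)   # longest plausible pitch period
    best_lag = lag_min + int(np.argmax(corr[lag_min:lag_max]))
    return sample_rate / best_lag

sr = 16000
t = np.arange(sr) / sr                  # one second of samples
tone = np.sin(2 * np.pi * 220.0 * t)    # synthetic voiced signal, 220 Hz
print(round(estimate_f0(tone, sr), 1))  # close to 220 Hz
```

The estimate is quantized to whole sample lags, so it lands near (not exactly on) 220 Hz; production systems typically interpolate around the peak or use a dedicated pitch tracker.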
S104: determine whether the voice features fail to match each of the existing reference voice features. If so, go to S105; otherwise, go to S107.
Optionally, when step S104 is executed, the current voice features are matched against the reference voice features in the following manner:
acquiring a first fundamental tone value and a first tone value, wherein the first fundamental tone value is converted from the fundamental tone feature in any reference voice feature, and the first tone value is converted from the tone feature in that reference voice feature;
extracting the corresponding fundamental tone feature and tone feature based on the voice resource, converting the extracted fundamental tone feature into a second fundamental tone value, and converting the extracted tone feature into a second tone value;
calculating a first difference between the second fundamental tone value and the preset first fundamental tone value, and a second difference between the second tone value and the preset first tone value;
and, when the first difference is greater than the preset first threshold and the second difference is greater than the preset second threshold, determining that the voice features do not successfully match the reference voice features in the Mandarin feature library.
When the voice features successfully match at least one existing reference voice feature, the voice features are stored, as a reference voice feature, in the voice feature library corresponding to that reference voice feature.
For example, assume the first fundamental tone value of Mandarin is 65 and the first tone value of Mandarin, taken as the average of the tone values of the four Mandarin tones, is 98, with a first threshold of 0.5 and a second threshold of 0.8. When the second fundamental tone value is 67, the first difference 67 - 65 = 2 is greater than the first threshold 0.5, so a mismatch is determined; when the second tone value is 99, the second difference 99 - 98 = 1 is greater than the second threshold 0.8, so a mismatch is likewise determined.
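Using the example values above (Mandarin reference 65 / 98, thresholds 0.5 / 0.8), the decision rule of S104 can be sketched as follows. The function name and defaults are assumptions, and taking the absolute value of each difference is an interpretation, since the patent only says the differences are compared against the thresholds.

```python
def matches_reference(second_fundamental, second_tone,
                      first_fundamental=65.0, first_tone=98.0,
                      threshold1=0.5, threshold2=0.8):
    """Per S104, the voice features fail to match the reference only when
    BOTH the fundamental tone difference and the tone difference exceed
    their thresholds; otherwise they are treated as a match."""
    first_difference = abs(second_fundamental - first_fundamental)
    second_difference = abs(second_tone - first_tone)
    return not (first_difference > threshold1 and second_difference > threshold2)

print(matches_reference(67, 99))      # differences 2 and 1: mismatch → False
print(matches_reference(65.2, 98.3))  # differences within thresholds → True
```

Note the asymmetry this rule implies: if only one of the two differences exceeds its threshold, the resource is still treated as matching the library.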
S105: establish a new voice feature library corresponding to the voice features, and store the voice features in the new library as reference voice features.
Specifically, take the Mandarin voice features in the Mandarin feature library as the reference voice features.
First, the fundamental tone feature and tone feature extracted from the currently read voice resource are compared with the reference voice features in the Mandarin feature library.
If at least one matches, the currently read voice resource corresponds to the Mandarin feature library, so no new voice feature library needs to be established. Optionally, the voice feature library matched by the currently read voice resource is marked at this time.
If none of the reference voice features match, the currently read voice resource does not belong to the Mandarin category. In that case, a new voice feature library corresponding to this voice resource needs to be created (the first dialect feature library), and the fundamental tone feature and tone feature extracted from this voice resource are stored in it as reference voice features.
For the next voice resource, when the flow reaches S104 again via S101-S103, the voice features extracted from that resource are compared with the Mandarin reference voice features in the constructed Mandarin feature library and with the reference voice features in the first dialect feature library. If they match the voice features in the first dialect feature library, the next voice resource belongs to the first dialect and no new voice feature library needs to be built; if neither matches, the next voice resource is neither Mandarin nor the first dialect, and if it is determined to be a second dialect, a second dialect feature library is constructed, and so on.
For example, assume the currently read voice resource is Sichuan dialect. After the fundamental tone feature and tone feature are extracted from the voice resource, they are compared with the reference voice features in the Mandarin feature library; if they do not match, a Sichuan dialect feature library is established. The next voice resource read is then compared with the reference voice features in both the Mandarin feature library and the Sichuan dialect feature library; if that resource is Henan dialect and is determined to match neither, a Henan dialect feature library is established.
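The iterative library construction described above can be sketched as a loop. This is a simplified model under stated assumptions: each resource is reduced to a (fundamental tone value, tone value) pair, each library is matched only against its first (seed) reference, the seed values 65/98 and thresholds 0.5/0.8 come from the worked example, and all names are hypothetical.

```python
def build_feature_libraries(resources, threshold1=0.5, threshold2=0.8):
    """Sketch of S102-S105: assign each (fundamental tone value, tone value)
    pair to the first library whose seed reference it matches, or open a
    new dialect library when every comparison fails on both dimensions."""
    libraries = {"mandarin": [(65.0, 98.0)]}   # pre-built Mandarin library
    dialect_count = 0
    for f0, tone in resources:
        for refs in libraries.values():
            seed_f0, seed_tone = refs[0]
            mismatch = (abs(f0 - seed_f0) > threshold1
                        and abs(tone - seed_tone) > threshold2)
            if not mismatch:
                refs.append((f0, tone))        # matched: store as reference
                break
        else:                                  # no library matched
            dialect_count += 1
            libraries[f"dialect_{dialect_count}"] = [(f0, tone)]
    return libraries

libs = build_feature_libraries([(65.1, 98.2), (72.0, 104.0), (72.3, 104.5)])
print({name: len(refs) for name, refs in libs.items()})
# → {'mandarin': 2, 'dialect_1': 2}
```

The first resource matches the Mandarin seed, the second opens a first dialect library, and the third matches that new library, mirroring the Sichuan/Henan example above.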
S106: and constructing a corpus corresponding to the new speech feature library.
Optionally, each time a new voice feature library is constructed, a new corpus is constructed correspondingly.
The corpus, which is the basic audio material for speech recognition and speech synthesis, may be individual words, phrases or idioms, or may be a sentence.
S107: convert the voice resource into corresponding text corpora.
S108: judge whether the converted text corpora fail to match each of the existing reference text corpora. If so, go to S109; otherwise, go to S110.
S109: add the text corpora to the corpus.
For example, suppose a voice resource is the Mandarin phrase 'with you without melon'. It is determined in S104 that the voice resource matches the reference voice features in the Mandarin feature library, i.e. the voice resource is preliminarily classified as Mandarin corpus material; if it is then determined in S108 that it does not successfully match any existing reference text corpus, it is added in S109 to the corpus corresponding to the Mandarin feature library.
S110: determine whether there is a next voice resource. If so, return to S102; otherwise, the flow ends.
By executing S102-S110 in a loop, multiple voice feature libraries and corpora can be constructed. The corresponding reference voice features and reference text corpora accumulate continuously through automatic matching of voice features, and corpora with sufficient material are obtained through autonomous learning, which is of significant reference value for speech synthesis and speech recognition.
A complete embodiment of the method of constructing a corpus is listed below:
a mandarin feature library is constructed in advance, and a mandarin corpus is correspondingly constructed. The mandarin feature library stores a mandarin pitch value (corresponding to the first pitch value) and a mandarin pitch value (corresponding to the first pitch value).
And acquiring the existing voice resources from the network.
The nth voice resource (n an integer, n ≥ 1) is read. If its content is 'I love my country' spoken in Sichuan dialect, the fundamental tone feature and tone feature are extracted from the voice resource, the extracted fundamental tone feature is converted into a second fundamental tone value, and the extracted tone feature is converted into a second tone value.
A first difference between the second fundamental tone value and the fundamental tone value of Mandarin is calculated, and a second difference between the second tone value and the tone value of Mandarin.
It is then judged whether the first difference is greater than the first threshold and whether the second difference is greater than the second threshold. If both hold simultaneously, the current voice resource 'I love my country' is judged not to belong to the Mandarin feature library; therefore a new voice feature library is constructed, the second fundamental tone value and second tone value corresponding to 'I love my country' are stored in it, and it is marked as the Sichuan dialect feature library.
Corresponding to the Sichuan dialect feature library, a new corpus is constructed and marked as the Sichuan dialect corpus.
The current voice resource is converted into the text corpora 'I', 'love', 'my' and 'country'.
It is then judged whether the Sichuan dialect corpus contains reference text corpora matching 'I love my country'.
Since the newly built corpus obviously stores no reference text corpora yet, it is directly judged that there is no match with existing reference text corpora, and 'I', 'love', 'my' and 'country' are added to the Sichuan dialect corpus.
Thus, a processing flow of the voice resource is completed.
Then it is judged whether there is a next voice resource; if so, n is incremented to n + 1, the nth voice resource is read, and the above flow is executed again; otherwise, the flow ends.
By reference to the processing of the nth voice resource, the processing of the (n+1)th, (n+2)th and subsequent voice resources follows correspondingly and is not described again.
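The text-corpus side of this embodiment (S107-S109, with duplicate entries skipped) can be sketched in the same spirit; the set-based storage and the function name are assumptions made for illustration.

```python
def add_corpus_entries(corpus, text_corpora):
    """Sketch of S107-S109: add only text corpora that are not already
    recorded in the target corpus; return what was newly added."""
    added = []
    for item in text_corpora:
        if item not in corpus:
            corpus.add(item)
            added.append(item)
    return added

sichuan_corpus = set()                          # newly built, empty corpus
print(add_corpus_entries(sichuan_corpus, ["I", "love", "my", "country"]))
# → ['I', 'love', 'my', 'country']
print(add_corpus_entries(sichuan_corpus, ["love", "China"]))
# → ['China']  (the duplicate 'love' is skipped)
```

The first call mirrors the newly built Sichuan dialect corpus above: with no reference text corpora stored yet, every converted corpus entry is added.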
Referring to fig. 2, an embodiment of the present application provides an apparatus for constructing a corpus, including:
an obtaining unit 201, configured to obtain existing voice resources in a network;
a processing unit 202, configured to read each voice resource in sequence, and execute the following operations for each read voice resource:
extracting corresponding voice features based on a voice resource; when it is determined, based on a first difference and a second difference corresponding to the voice features, that the voice features do not successfully match any existing reference voice features, establishing a new voice feature library corresponding to the voice features, and storing the voice features in the new voice feature library as reference voice features; wherein the voice features include at least a fundamental tone feature and a tone feature, the first difference being derived from the fundamental tone feature and the second difference being derived from the tone feature;
constructing a new corpus corresponding to the new voice feature library;
and converting the voice resource into a corresponding text corpus, and adding the text corpus to the new corpus when the text corpus is determined to be unsuccessfully matched with each existing reference text corpus.
Optionally, before reading each voice resource in sequence, the processing unit 202 is further configured to:
constructing a Mandarin feature library, extracting the fundamental tone feature and the tone feature of Mandarin, and storing the extracted features in the Mandarin feature library as initial reference voice features;
and constructing a mandarin corpus corresponding to the mandarin feature library.
Optionally, in extracting corresponding voice features based on a voice resource and determining, based on the first difference and the second difference corresponding to the voice features, that the voice features do not successfully match any existing reference voice feature, the processing unit 202 is specifically configured to:
acquire a first fundamental tone value and a first tone value, wherein the first fundamental tone value is obtained by converting the fundamental tone feature in any one reference voice feature, and the first tone value is obtained by converting the tone feature in that reference voice feature;
extract the corresponding fundamental tone feature and tone feature based on a voice resource, convert the extracted fundamental tone feature into a second fundamental tone value, and convert the extracted tone feature into a second tone value;
calculate a first difference between the second fundamental tone value and the first fundamental tone value, and a second difference between the second tone value and the first tone value;
and, when the first difference is greater than a preset first threshold and the second difference is greater than a preset second threshold, determine that the voice features do not successfully match the reference voice features in the Mandarin feature library.
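The matching rule above can be sketched as follows. The absolute-difference form and the concrete threshold values are assumptions of this sketch; the patent only states that a match fails when both differences exceed their preset thresholds:

```python
PITCH_THRESHOLD = 20.0  # preset first threshold (illustrative value)
TONE_THRESHOLD = 10.0   # preset second threshold (illustrative value)

def features_match(reference, candidate,
                   pitch_threshold=PITCH_THRESHOLD,
                   tone_threshold=TONE_THRESHOLD):
    """Per the rule above, matching fails only when BOTH the fundamental
    tone difference exceeds the first threshold AND the tone difference
    exceeds the second threshold; otherwise the features match."""
    first_pitch_value, first_tone_value = reference
    second_pitch_value, second_tone_value = candidate
    first_difference = abs(second_pitch_value - first_pitch_value)
    second_difference = abs(second_tone_value - first_tone_value)
    return not (first_difference > pitch_threshold
                and second_difference > tone_threshold)
```

Note the asymmetry this rule implies: exceeding only one of the two thresholds still counts as a successful match.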
Optionally, the processing unit 202 is further configured to:
after extracting corresponding voice features based on a voice resource, if it is determined that the voice features successfully match at least one existing reference voice feature, store the voice features as reference voice features in the voice feature library corresponding to the at least one reference voice feature;
and, when it is determined that the text corpus does not successfully match at least one existing reference text corpus, add the text corpus to the corpus corresponding to the at least one reference text corpus.
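Putting both branches together, each voice resource is either routed to an existing feature library and its corpus, or used to found a new library-and-corpus pair. The sketch below reuses the both-thresholds-exceeded rule; comparing against only the first reference feature of each library is a simplifying assumption of this example:

```python
def classify_resource(libraries, corpora_map, pitch_value, tone_value, text,
                      pitch_threshold=20.0, tone_threshold=10.0):
    """Store the features in the first library they match and add the text
    corpus there if it is not yet recorded; otherwise establish a new
    library and a new corpus for this voice profile."""
    for lib_id, references in libraries.items():
        ref_pitch, ref_tone = references[0]
        mismatch = (abs(pitch_value - ref_pitch) > pitch_threshold
                    and abs(tone_value - ref_tone) > tone_threshold)
        if not mismatch:
            references.append((pitch_value, tone_value))
            if text not in corpora_map[lib_id]:  # add only unrecorded corpora
                corpora_map[lib_id].append(text)
            return lib_id
    new_id = "lib-%d" % (len(libraries) + 1)
    libraries[new_id] = [(pitch_value, tone_value)]
    corpora_map[new_id] = [text]
    return new_id

libs = {"lib-1": [(180.0, 62.0)]}
corp = {"lib-1": []}
classify_resource(libs, corp, 185.0, 64.0, "turn on the light")   # matches lib-1
classify_resource(libs, corp, 260.0, 95.0, "turn off the light")  # founds lib-2
```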
Based on the same inventive concept, referring to fig. 3, an embodiment of the present application further provides a server, including: a memory 301 and a processor 302, wherein,
a memory 301 for storing executable instructions;
a processor 302, configured to read and execute executable instructions stored in the memory, so as to implement any one of the methods for constructing a corpus described above.
Based on the same inventive concept, the present application further provides a storage medium; when instructions in the storage medium are executed by a processor, any one of the methods for constructing a corpus described above can be performed.
To sum up, in the embodiments of the present application, existing voice resources in the network are acquired and read in sequence; when it is determined that the extracted voice features do not match any existing reference voice feature, a new voice feature library is established for those voice features, together with a corpus corresponding to the new library. The voice resource is then converted into a corresponding text corpus, and the text corpus is added to that corpus when it does not successfully match any existing reference text corpus. In this way, existing voice resources in the network are matched by their voice features, and different voice features correspond to different corpora, so corpora are classified automatically based on feature matching; when a text corpus is not yet recorded in the corpus, it is added to the corresponding corpus, so corpus entries are added automatically, which improves the efficiency of corpus construction and maintenance and reduces operation and maintenance costs.
Furthermore, the fundamental tone feature and the tone feature are typical features of speech that can be represented quantitatively and can reflect the differences between different voices; extracting them from voice resources for feature matching therefore yields a good matching effect and a high recognition rate.
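As a concrete illustration of how a quantifiable fundamental tone value can be obtained from raw speech samples, the sketch below estimates the fundamental frequency by autocorrelation. This particular estimator, and the plain-list representation of the signal, are assumptions of this example rather than anything specified by the patent:

```python
import math

def estimate_pitch_hz(samples, sample_rate, fmin=50.0, fmax=500.0):
    """Estimate the fundamental frequency (Hz) of a voiced frame by
    picking the lag with the strongest autocorrelation inside the
    plausible pitch range [fmin, fmax]."""
    mean = sum(samples) / len(samples)
    x = [s - mean for s in samples]          # remove DC offset
    lag_min = int(sample_rate / fmax)        # shortest period considered
    lag_max = min(int(sample_rate / fmin), len(x) - 1)
    best_lag, best_corr = lag_min, float("-inf")
    for lag in range(lag_min, lag_max):
        corr = sum(x[i] * x[i + lag] for i in range(len(x) - lag))
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return sample_rate / best_lag

# A 200 Hz sine sampled at 8 kHz should be estimated near 200 Hz.
sr = 8000
samples = [math.sin(2 * math.pi * 200.0 * i / sr) for i in range(800)]
estimated = estimate_pitch_hz(samples, sr)
```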
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various changes and modifications may be made in the embodiments of the present application without departing from the spirit and scope of the embodiments of the present application. Thus, if such modifications and variations of the embodiments of the present application fall within the scope of the claims of the present application and their equivalents, the present application is also intended to encompass such modifications and variations.

Claims (8)

1. A method of constructing a corpus, comprising:
acquiring existing voice resources in a network;
reading each voice resource in sequence, and executing the following operations when each voice resource is read:
extracting corresponding voice features based on a voice resource, establishing a new voice feature library corresponding to the voice features when it is determined, based on a first difference and a second difference corresponding to the voice features, that the voice features do not successfully match any existing reference voice feature, and storing the voice features as reference voice features in the new voice feature library; wherein the voice features include at least a fundamental tone feature and a tone feature, the first difference being derived from the fundamental tone feature and the second difference from the tone feature;
constructing a new corpus corresponding to the new voice feature library;
converting the voice resource into a corresponding text corpus, and adding the text corpus to the new corpus when it is determined that the text corpus does not successfully match any existing reference text corpus;
wherein extracting corresponding voice features based on a voice resource and determining, based on the first difference and the second difference corresponding to the voice features, that the voice features do not successfully match any existing reference voice feature specifically comprises:
acquiring a first fundamental tone value and a first tone value, wherein the first fundamental tone value is obtained by converting the fundamental tone feature in any one reference voice feature, and the first tone value is obtained by converting the tone feature in that reference voice feature;
extracting the corresponding fundamental tone feature and tone feature based on a voice resource, converting the extracted fundamental tone feature into a second fundamental tone value, and converting the extracted tone feature into a second tone value;
calculating a first difference between the second fundamental tone value and the first fundamental tone value, and calculating a second difference between the second tone value and the first tone value;
and when the first difference is greater than a preset first threshold and the second difference is greater than a preset second threshold, determining that the voice features do not successfully match the reference voice features in the Mandarin feature library.
2. The method of claim 1, wherein prior to reading each voice resource in sequence, further comprising:
constructing a Mandarin feature library, extracting the fundamental tone feature and the tone feature of Mandarin, and storing the extracted features in the Mandarin feature library as initial reference voice features;
and constructing a Mandarin corpus corresponding to the Mandarin feature library.
3. The method of claim 1, further comprising:
after extracting corresponding voice features based on a voice resource, if it is determined that the voice features successfully match at least one existing reference voice feature, storing the voice features as reference voice features in the voice feature library corresponding to the at least one reference voice feature;
and when it is determined that the text corpus does not successfully match at least one existing reference text corpus, adding the text corpus to the corpus corresponding to the at least one reference text corpus.
4. An apparatus for constructing a corpus, comprising:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring the existing voice resources in the network;
the processing unit is used for reading each voice resource in sequence, and executing the following operations when each voice resource is read:
extracting corresponding voice features based on a voice resource, establishing a new voice feature library corresponding to the voice features when it is determined, based on a first difference and a second difference corresponding to the voice features, that the voice features do not successfully match any existing reference voice feature, and storing the voice features as reference voice features in the new voice feature library; wherein the voice features include at least a fundamental tone feature and a tone feature, the first difference being derived from the fundamental tone feature and the second difference from the tone feature;
constructing a new corpus corresponding to the new voice feature library;
converting the voice resource into a corresponding text corpus, and adding the text corpus to the new corpus when it is determined that the text corpus does not successfully match any existing reference text corpus;
wherein extracting corresponding voice features based on a voice resource and determining, based on the first difference and the second difference corresponding to the voice features, that the voice features do not successfully match any existing reference voice feature specifically comprises:
acquiring a first fundamental tone value and a first tone value, wherein the first fundamental tone value is obtained by converting the fundamental tone feature in any one reference voice feature, and the first tone value is obtained by converting the tone feature in that reference voice feature;
extracting the corresponding fundamental tone feature and tone feature based on a voice resource, converting the extracted fundamental tone feature into a second fundamental tone value, and converting the extracted tone feature into a second tone value;
calculating a first difference between the second fundamental tone value and the first fundamental tone value, and calculating a second difference between the second tone value and the first tone value;
and when the first difference is greater than a preset first threshold and the second difference is greater than a preset second threshold, determining that the voice features do not successfully match the reference voice features in the Mandarin feature library.
5. The device of claim 4, wherein prior to reading each voice resource in turn, the processing unit is further configured to:
constructing a Mandarin feature library, extracting the fundamental tone feature and the tone feature of Mandarin, and storing the extracted features in the Mandarin feature library as initial reference voice features;
and constructing a Mandarin corpus corresponding to the Mandarin feature library.
6. The device of claim 4, wherein the processing unit is further to:
after extracting corresponding voice features based on a voice resource, if it is determined that the voice features successfully match at least one existing reference voice feature, storing the voice features as reference voice features in the voice feature library corresponding to the at least one reference voice feature;
and when it is determined that the text corpus does not successfully match at least one existing reference text corpus, adding the text corpus to the corpus corresponding to the at least one reference text corpus.
7. A server, comprising: a memory and a processor; wherein,
a memory for storing executable instructions;
a processor for reading and executing executable instructions stored in the memory to implement the method of any one of claims 1-3.
8. A storage medium, characterized in that instructions in the storage medium, when executed by a processor, enable execution of the method according to any one of claims 1-3.
CN201911095120.3A 2019-11-11 2019-11-11 Method, device, server and storage medium for constructing corpus Active CN110942765B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911095120.3A CN110942765B (en) 2019-11-11 2019-11-11 Method, device, server and storage medium for constructing corpus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911095120.3A CN110942765B (en) 2019-11-11 2019-11-11 Method, device, server and storage medium for constructing corpus

Publications (2)

Publication Number Publication Date
CN110942765A CN110942765A (en) 2020-03-31
CN110942765B true CN110942765B (en) 2022-05-27

Family

ID=69906444

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911095120.3A Active CN110942765B (en) 2019-11-11 2019-11-11 Method, device, server and storage medium for constructing corpus

Country Status (1)

Country Link
CN (1) CN110942765B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111356022A (en) * 2020-04-18 2020-06-30 徐琼琼 Video file processing method based on voice recognition
CN113593556A (en) * 2021-07-26 2021-11-02 深圳市捌零零在线科技有限公司 Human-computer interaction method and device for vehicle-mounted voice operating system
CN115810345A (en) * 2022-11-23 2023-03-17 北京伽睿智能科技集团有限公司 Intelligent speech technology recommendation method, system, equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003029774A (en) * 2001-07-19 2003-01-31 Matsushita Electric Ind Co Ltd Voice waveform dictionary distribution system, voice waveform dictionary preparing device, and voice synthesizing terminal equipment
CN1604182A (en) * 2003-09-29 2005-04-06 摩托罗拉公司 Method for voice synthesizing
CN109616131A (en) * 2018-11-12 2019-04-12 南京南大电子智慧型服务机器人研究院有限公司 A digital real-time voice-changing method

Family Cites Families (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102881283B (en) * 2011-07-13 2014-05-28 三星电子(中国)研发中心 Method and system for processing voice
CN202584695U (en) * 2011-12-30 2012-12-05 深圳市车音网科技有限公司 Mapping display system and device thereof
EP3077919A4 (en) * 2013-12-02 2017-05-10 Qbase LLC Method for disambiguating features in unstructured text
US20160262974A1 (en) * 2015-03-10 2016-09-15 Strathspey Crown Holdings, LLC Autonomic nervous system balancing device and method of use
US10511712B2 (en) * 2016-08-19 2019-12-17 Andrew Horton Caller identification in a secure environment using voice biometrics
CN106202380B (en) * 2016-07-08 2019-12-24 中国科学院上海高等研究院 Method and system for constructing classified corpus and server with system
CN106649278B (en) * 2016-12-30 2019-11-15 三星电子(中国)研发中心 Extend the method and system of spoken dialogue system corpus
CN106935248B (en) * 2017-02-14 2021-02-05 广州孩教圈信息科技股份有限公司 Voice similarity detection method and device
CN108764010A (en) * 2018-03-23 2018-11-06 姜涵予 Emotional state determines method and device
CN109065028B (en) * 2018-06-11 2022-12-30 平安科技(深圳)有限公司 Speaker clustering method, speaker clustering device, computer equipment and storage medium
CN109036424A (en) * 2018-08-30 2018-12-18 出门问问信息科技有限公司 Audio recognition method, device, electronic equipment and computer readable storage medium
CN109215638B (en) * 2018-10-19 2021-07-13 珠海格力电器股份有限公司 Voice learning method and device, voice equipment and storage medium
CN109215636B (en) * 2018-11-08 2020-10-30 广东小天才科技有限公司 Voice information classification method and system
CN109801628B (en) * 2019-02-11 2020-02-21 龙马智芯(珠海横琴)科技有限公司 Corpus collection method, apparatus and system
CN110046261B (en) * 2019-04-22 2022-01-21 山东建筑大学 Construction method of multi-modal bilingual parallel corpus of construction engineering
CN110134799B (en) * 2019-05-29 2022-03-01 四川长虹电器股份有限公司 BM25 algorithm-based text corpus construction and optimization method
CN110413723A (en) * 2019-06-06 2019-11-05 福建奇点时空数字科技有限公司 A kind of corpus automated construction method of data-driven
CN110265028B (en) * 2019-06-20 2020-10-09 百度在线网络技术(北京)有限公司 Method, device and equipment for constructing speech synthesis corpus


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Pang Wei. A survey of research on bilingual corpus construction. Information Technology and Informatization, 2015, (03). *

Also Published As

Publication number Publication date
CN110942765A (en) 2020-03-31

Similar Documents

Publication Publication Date Title
CN110377716B (en) Interaction method and device for conversation and computer readable storage medium
CN109065031B (en) Voice labeling method, device and equipment
CN110942765B (en) Method, device, server and storage medium for constructing corpus
CN101076851B (en) Spoken language identification system and method for training and operating the said system
CN100449611C (en) Lexical stress prediction
US20090254349A1 (en) Speech synthesizer
CN105206258A (en) Generation method and device of acoustic model as well as voice synthetic method and device
US8019605B2 (en) Reducing recording time when constructing a concatenative TTS voice using a reduced script and pre-recorded speech assets
CN105185372A (en) Training method for multiple personalized acoustic models, and voice synthesis method and voice synthesis device
CN103823867A (en) Humming type music retrieval method and system based on note modeling
KR20080069990A (en) Speech index pruning
CN110459202B (en) Rhythm labeling method, device, equipment and medium
CN106302987A (en) A kind of audio frequency recommends method and apparatus
CN105161116A (en) Method and device for determining climax fragment of multimedia file
JP2020166839A (en) Sentence recommendation method and apparatus based on associated points of interest
CN111178081B (en) Semantic recognition method, server, electronic device and computer storage medium
CN108364655B (en) Voice processing method, medium, device and computing equipment
CN111724769A (en) Production method of intelligent household voice recognition model
CN109492126B (en) Intelligent interaction method and device
CN113609264B (en) Data query method and device for power system nodes
CN114550718A (en) Hot word speech recognition method, device, equipment and computer readable storage medium
CN111063337B (en) Large-scale voice recognition method and system capable of rapidly updating language model
CN114783424A (en) Text corpus screening method, device, equipment and storage medium
CN113516963B (en) Audio data generation method and device, server and intelligent sound box
JP4504469B2 (en) Method for determining reliability of data composed of audio signals

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant