CN110942765A - Method, device, server and storage medium for constructing corpus - Google Patents
- Publication number: CN110942765A (application CN201911095120.3A)
- Authority
- CN
- China
- Prior art keywords
- voice
- feature
- pitch
- resource
- corpus
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/60—Information retrieval; Database structures therefor; File system structures therefor of audio data
- G06F16/61—Indexing; Data structures therefor; Storage structures
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/28—Constructional details of speech recognition systems
- G10L15/30—Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
Abstract
The application relates to the technical field of intelligent speech, and in particular to a method, a device, a server, and a storage medium for constructing a corpus. The method comprises: reading each speech resource in turn and, for each speech resource read, performing the following operations: extracting a corresponding speech feature based on the speech resource; when it is determined that the speech feature is not successfully matched with any existing reference speech feature, establishing a new speech feature library corresponding to the speech feature, and storing the speech feature into the new speech feature library as a reference speech feature; constructing a new corpus corresponding to the new speech feature library; and converting the speech resource into a corresponding text corpus, and adding the text corpus to the new corpus when it is determined that the text corpus is not successfully matched with any existing reference text corpus. The method improves the efficiency of corpus construction.
Description
Technical Field
The present application relates to the field of intelligent speech technologies, and in particular, to a method, an apparatus, a server, and a storage medium for constructing a corpus.
Background
With the development of information technology, intelligent speech technology has become one of the most convenient and effective means for people to acquire and exchange information.
Intelligent speech technology is a means of realizing human-machine spoken interaction; speech recognition and speech synthesis are its two main branches. Realizing speech recognition and speech synthesis requires a corpus to be constructed in advance, on the basis of which recognition or synthesis is performed.
In the prior art, a corpus is constructed as follows: corpora are recorded by a large number of volunteers, and staff then collect, label, and maintain the recorded corpus information at a later stage.
In this way of building a corpus, collection and construction depend heavily on manual work: a large amount of labor is tied up, manual collection is inefficient, the time cost of corpus collection is high, and the corpus is therefore built slowly.
In view of this, the process needs to be redesigned to overcome the above drawbacks.
Disclosure of Invention
The embodiment of the application provides a method, a device, a server, and a storage medium for constructing a corpus, which are used to solve the technical problem of low corpus-construction efficiency in the prior art.
The embodiment of the application provides the following specific technical scheme:
in a first aspect of the embodiments of the present application, a method for constructing a corpus is provided, including:
acquiring existing voice resources in a network;
reading each voice resource in sequence, and executing the following operations when each voice resource is read:
extracting a corresponding speech feature based on a speech resource; when it is determined that the speech feature is not successfully matched with any existing reference speech feature, establishing a new speech feature library corresponding to the speech feature, and storing the speech feature into the new speech feature library as a reference speech feature;
constructing a new corpus corresponding to the new speech feature library;
and converting the speech resource into a corresponding text corpus, and adding the text corpus to the new corpus when it is determined that the text corpus is not successfully matched with any existing reference text corpus.
Optionally, extracting a corresponding speech feature based on a speech resource specifically includes:
extracting the corresponding fundamental tone feature and pitch feature based on the speech resource.
Optionally, before reading each voice resource in sequence, the method further includes:
constructing a Mandarin feature library, extracting the fundamental tone feature and pitch feature of Mandarin, and storing the extracted fundamental tone feature and pitch feature into the Mandarin feature library as initial reference speech features;
and constructing a mandarin corpus corresponding to the mandarin feature library.
Optionally, extracting a corresponding speech feature based on a speech resource and determining that the speech feature is not successfully matched with any existing reference speech feature specifically includes:
acquiring a first fundamental tone value and a first pitch value, wherein the first fundamental tone value is obtained by converting the fundamental tone feature in any reference speech feature, and the first pitch value is obtained by converting the pitch feature in that reference speech feature;
extracting the corresponding fundamental tone feature and pitch feature based on a speech resource, converting the extracted fundamental tone feature into a second fundamental tone value, and converting the extracted pitch feature into a second pitch value;
calculating a first difference between the second fundamental tone value and the first fundamental tone value, and calculating a second difference between the second pitch value and the first pitch value;
and when the first difference is judged to be greater than a preset first threshold and the second difference is judged to be greater than a preset second threshold, determining that the speech feature is not successfully matched with the reference speech feature in the Mandarin feature library.
Optionally, further comprising:
after extracting a corresponding speech feature based on a speech resource, if it is determined that the speech feature is successfully matched with at least one existing reference speech feature, storing the speech feature, as a reference speech feature, into the speech feature library corresponding to that reference speech feature;
and when it is determined that the text corpus is not successfully matched with at least one existing reference text corpus, adding the text corpus to the corpus corresponding to that reference text corpus.
In a second aspect of the embodiments of the present application, there is also provided an apparatus for constructing a corpus, including:
an acquiring unit, configured to acquire the existing speech resources in a network;
the processing unit is used for reading each voice resource in sequence, and executing the following operations when each voice resource is read:
extracting a corresponding speech feature based on a speech resource; when it is determined that the speech feature is not successfully matched with any existing reference speech feature, establishing a new speech feature library corresponding to the speech feature, and storing the speech feature into the new speech feature library as a reference speech feature;
constructing a new corpus corresponding to the new speech feature library;
and converting the speech resource into a corresponding text corpus, and adding the text corpus to the new corpus when it is determined that the text corpus is not successfully matched with any existing reference text corpus.
Optionally, when extracting a corresponding speech feature based on a speech resource, the processing unit is specifically configured to:
extract the corresponding fundamental tone feature and pitch feature based on the speech resource.
Optionally, before reading each voice resource in sequence, the processing unit is further configured to:
construct a Mandarin feature library, extract the fundamental tone feature and pitch feature of Mandarin, and store the extracted fundamental tone feature and pitch feature into the Mandarin feature library as initial reference speech features;
and constructing a mandarin corpus corresponding to the mandarin feature library.
Optionally, when extracting a corresponding speech feature based on a speech resource and determining that the speech feature is not successfully matched with any existing reference speech feature, the processing unit is specifically configured to:
acquire a first fundamental tone value and a first pitch value, wherein the first fundamental tone value is obtained by converting the fundamental tone feature in any reference speech feature, and the first pitch value is obtained by converting the pitch feature in that reference speech feature;
extract the corresponding fundamental tone feature and pitch feature based on a speech resource, convert the extracted fundamental tone feature into a second fundamental tone value, and convert the extracted pitch feature into a second pitch value;
calculate a first difference between the second fundamental tone value and the first fundamental tone value, and calculate a second difference between the second pitch value and the first pitch value;
and when the first difference is judged to be greater than a preset first threshold and the second difference is judged to be greater than a preset second threshold, determine that the speech feature is not successfully matched with the reference speech feature in the Mandarin feature library.
Optionally, the processing unit is further configured to:
after extracting a corresponding speech feature based on a speech resource, if it is determined that the speech feature is successfully matched with at least one existing reference speech feature, store the speech feature, as a reference speech feature, into the speech feature library corresponding to that reference speech feature;
and when it is determined that the text corpus is not successfully matched with at least one existing reference text corpus, add the text corpus to the corpus corresponding to that reference text corpus.
In a third aspect of the embodiments of the present application, a server is provided, including: a memory, a processor; wherein,
a memory for storing executable instructions;
a processor for reading and executing executable instructions stored in the memory to implement a method as claimed in any one of the preceding claims.
In a fourth aspect of the embodiments of the present application, there is also provided a storage medium, wherein instructions of the storage medium, when executed by a processor, enable execution of the method according to any one of the above.
In the embodiment of the application, existing speech resources in a network are acquired and read in turn. For each speech resource read, the corresponding speech feature is extracted; when the speech feature is not successfully matched with any existing reference speech feature, a new speech feature library is established corresponding to it, and the speech feature is stored there as a reference speech feature; a corpus is constructed corresponding to the new speech feature library; the speech resource is then converted into a corresponding text corpus, which is added to that corpus when it does not match any existing reference text corpus. In this way, existing speech resources are obtained directly from the network and sorted into the corresponding speech feature libraries through speech-feature recognition, so the corpora are classified automatically; and when a text corpus is not yet recorded in the corresponding corpus, it is added automatically. Compared with manually collecting speech, adding it to a corpus, and classifying it by manual listening, this greatly improves the efficiency of corpus construction and maintenance and saves operation and maintenance costs.
Drawings
FIG. 1 is a schematic flowchart of an embodiment of a method for constructing a corpus according to the present application;
FIG. 2 is a schematic structural diagram of an embodiment of an apparatus for constructing a corpus according to the present application;
FIG. 3 is a schematic structural diagram of a server according to the present application.
Detailed Description
To solve the technical problem of low corpus-construction efficiency in the prior art, the embodiment of the application acquires existing speech resources in a network and extracts speech features from them. Each speech feature is matched against the existing reference speech features; when it matches none of them, a new speech feature library is established and a corpus corresponding to that feature library is constructed. The speech resource is then converted into text corpora, and when a text corpus is determined not to match any existing text corpus, it is added to the corresponding corpus.
Alternative embodiments of the present application will now be described in further detail with reference to the accompanying drawings:
In speech recognition and speech synthesis, dialects and Mandarin are generally recognized or synthesized separately. Therefore, in the embodiment of the application, when corpora are established, different corpora should be constructed for the dialects and for Mandarin; correspondingly, in the feature recognition process, a separate speech feature library should be established to store each kind of speech feature.
Therefore, as one implementable manner, at least one speech feature library is first constructed as a base speech feature library. Specifically, a Mandarin feature library is pre-constructed, the fundamental tone feature and pitch feature of Mandarin are extracted and stored into the Mandarin feature library as reference speech features, and a Mandarin corpus is constructed corresponding to the Mandarin feature library.
After the fundamental tone feature and pitch feature of Mandarin are extracted, the fundamental tone feature is converted into a first fundamental tone value and the pitch feature into a first pitch value, and the two values are stored into the Mandarin feature library as the basic reference speech features.
It should be noted that the basic speech feature library is not limited to a Mandarin feature library; it may instead be a dialect feature library, such as a Cantonese or Sichuan dialect feature library. The language of the basic feature library can be determined according to the actual target users.
Referring to fig. 1, a specific process of the method for constructing a corpus provided in the embodiment of the present application is as follows:
s101: and acquiring the existing voice resources in the network.
The existing speech resources in the network, including audio and video programs, can be collected by a web crawler.
Optionally, each acquired speech resource should be preprocessed to remove noise and background interference.
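The preprocessing step is not specified in the patent. As a crude, purely illustrative sketch (the frame length and energy threshold below are arbitrary assumptions, not values from the patent), an energy-based gate can strip leading and trailing silence from an acquired resource:

```python
import numpy as np

def trim_silence(signal: np.ndarray, frame_len: int = 256,
                 threshold: float = 0.01) -> np.ndarray:
    """Drop leading and trailing frames whose RMS energy is below the
    threshold -- a stand-in for the noise-removal preprocessing step."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt((frames ** 2).mean(axis=1))          # per-frame energy
    keep = np.nonzero(rms >= threshold)[0]
    if keep.size == 0:
        return signal[:0]                              # everything was silence
    return frames[keep[0]:keep[-1] + 1].reshape(-1)

# Half a buffer of silence, then signal, then silence again.
resource = np.concatenate([np.zeros(512), 0.5 * np.ones(512), np.zeros(512)])
trimmed = trim_silence(resource)
```

A production system would use proper denoising rather than a hard gate, but the shape of the step is the same: audio in, cleaned audio out.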
S102: and reading one voice resource from the obtained voice resources.
S103: and extracting corresponding voice features based on the read voice resource.
In the embodiment of the application, extracting the corresponding speech feature comprises extracting the fundamental tone feature and the pitch feature.
Generally, a sound is composed of a series of vibrations with different frequencies and amplitudes emitted by a sounding body. Among these vibrations is the one with the lowest frequency; the sound it produces is the fundamental tone, and the rest are overtones. The fundamental tone feature refers to the part of the speech signal, extracted from a speech resource, that carries fundamental tone information.
In the embodiment of the application, the pitch feature is the part of the speech signal, extracted from a speech resource, that carries information about how high or low the sound frequency is.
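The patent does not say how the fundamental tone feature is computed. As one conventional, illustrative approach (an assumption, not taken from the patent), the fundamental frequency can be estimated from the autocorrelation peak; the 50-500 Hz search range is a typical assumption for speech:

```python
import numpy as np

def estimate_fundamental(signal: np.ndarray, sample_rate: int,
                         fmin: float = 50.0, fmax: float = 500.0) -> float:
    """Estimate the fundamental frequency of a mono signal by picking the
    autocorrelation peak among lags corresponding to fmin..fmax Hz."""
    signal = signal - signal.mean()
    corr = np.correlate(signal, signal, mode="full")[len(signal) - 1:]
    lag_min = int(sample_rate / fmax)      # shortest period considered
    lag_max = int(sample_rate / fmin)      # longest period considered
    best_lag = lag_min + int(np.argmax(corr[lag_min:lag_max]))
    return sample_rate / best_lag

# A pure 100 Hz tone should yield an estimate close to 100 Hz.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
f0 = estimate_fundamental(np.sin(2 * np.pi * 100.0 * t), sample_rate)
```

Real systems usually refine this with windowing and voicing detection; the sketch only shows where a "second fundamental tone value" could come from.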
S104: is the speech feature determined to be not successfully matched with each of the existing reference speech features? If so, the process proceeds to S105, otherwise, the process proceeds to S107.
Optionally, when step S104 is executed, the current speech feature is matched with the reference speech feature in the following manner:
acquiring a first fundamental tone value and a first pitch value, wherein the first fundamental tone value is obtained by converting the fundamental tone feature in any reference speech feature, and the first pitch value is obtained by converting the pitch feature in that reference speech feature;
extracting the corresponding fundamental tone feature and pitch feature based on a speech resource, converting the extracted fundamental tone feature into a second fundamental tone value, and converting the extracted pitch feature into a second pitch value;
calculating a first difference between the second fundamental tone value and the first fundamental tone value, and calculating a second difference between the second pitch value and the first pitch value;
and when the first difference is judged to be greater than a preset first threshold and the second difference is judged to be greater than a preset second threshold, determining that the speech feature is not successfully matched with the reference speech feature in the Mandarin feature library.
When the speech feature is successfully matched with at least one existing reference speech feature, the speech feature is stored, as a reference speech feature, into the speech feature library corresponding to that reference speech feature.
For example, suppose the first fundamental tone value of Mandarin is 65 and the first pitch value of Mandarin, taken as the average pitch value over the four Mandarin tones, is 98; let the first threshold be 0.5 and the second threshold be 0.8. When the second fundamental tone value is 67, the difference 67 - 65 = 2 is greater than the first threshold 0.5, so a mismatch is determined; when the second pitch value is 99, the difference 99 - 98 = 1 is greater than the second threshold 0.8, so a mismatch is determined.
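As a minimal sketch of this comparison (the default reference values 65 and 98 and the thresholds 0.5 and 0.8 come from the example above; the complete embodiment later declares a mismatch only when both differences exceed their thresholds, which is the rule implemented here):

```python
def is_mismatch(second_fundamental: float, second_pitch: float,
                first_fundamental: float = 65.0, first_pitch: float = 98.0,
                first_threshold: float = 0.5, second_threshold: float = 0.8) -> bool:
    """Declare a mismatch when both the fundamental tone difference and
    the pitch difference exceed their respective thresholds."""
    first_difference = abs(second_fundamental - first_fundamental)
    second_difference = abs(second_pitch - first_pitch)
    return first_difference > first_threshold and second_difference > second_threshold

# The worked example: (67, 99) exceeds both thresholds, so it is a mismatch;
# a resource close to the Mandarin reference values is not.
```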
S105: and establishing a new voice feature library corresponding to the voice features, and storing the voice features serving as reference voice features into the new voice feature library.
Specifically, take the Mandarin speech feature in the Mandarin feature library as the reference speech feature.
First, the fundamental tone feature and pitch feature extracted from the currently read speech resource are compared with the reference speech feature in the Mandarin feature library.
If at least one of them matches, the currently read speech resource corresponds to the Mandarin feature library, so there is no need to establish a new speech feature library. Optionally, the speech feature library matched by the currently read speech resource is marked at this point.
If neither matches any of the reference speech features, the currently read speech resource does not belong to the category of Mandarin. A new speech feature library corresponding to this speech resource then needs to be created, and the fundamental tone feature and pitch feature extracted from it are stored, as reference speech features, into this first dialect feature library.
For the next speech resource, when the process loops through S101-S103 back to S104, the speech features extracted from it are compared with the Mandarin reference speech features in the constructed Mandarin feature library and with the reference speech features in the first dialect feature library. If they match the features in the first dialect feature library, the next speech resource belongs to the first dialect and no new speech feature library is built; if they match neither, the next speech resource is neither Mandarin nor the first dialect, so if it is determined to be a second dialect, a second dialect feature library is constructed, and so on.
For example, suppose the currently read speech resource is Sichuan dialect. After the fundamental tone feature and pitch feature are extracted from it and compared with the reference speech features in the Mandarin feature library, a mismatch is found, so a Sichuan dialect feature library is established. The next speech resource read is compared with the reference speech features in both the Mandarin feature library and the Sichuan dialect feature library; if it is Henan dialect and is determined to match neither, a Henan dialect feature library is established.
S106: and constructing a corpus corresponding to the new speech feature library.
Optionally, whenever a new speech feature library is constructed, a new corpus is constructed to correspond to it.
The corpus, which is the basic audio material for speech recognition and speech synthesis, may be individual words, phrases or idioms, or may be a sentence.
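As a purely illustrative sketch (the class and field names below are assumptions, not terms from the patent), the pairing between speech feature libraries and their corresponding corpora can be modeled like this, seeded with the Mandarin values from the worked example:

```python
from dataclasses import dataclass, field

@dataclass
class FeatureLibrary:
    name: str                                   # e.g. "Mandarin", "Sichuan dialect"
    fundamental_values: list = field(default_factory=list)  # reference fundamental tone values
    pitch_values: list = field(default_factory=list)        # reference pitch values

@dataclass
class Corpus:
    name: str
    entries: set = field(default_factory=set)   # reference text corpora: words, phrases, sentences

# The pre-built base library and its (initially empty) corpus.
libraries = {"Mandarin": FeatureLibrary("Mandarin", [65.0], [98.0])}
corpora = {"Mandarin": Corpus("Mandarin")}
```

Keeping both maps keyed by the same name makes S105 (new library) and S106 (new corpus) a pair of insertions under one key.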
S107: and converting the voice resource into a corresponding text corpus.
S108: is it judged that the converted text corpus has not been successfully matched with the existing reference text corpora? If so, the process proceeds to S109, otherwise, the process proceeds to S110.
S109: adding the text corpus to the corpus.
For example, suppose a speech resource is the Mandarin phrase "with you without melon". In S104 it is determined that the speech resource matches the reference speech features in the Mandarin feature library, i.e., it is initially classified as Mandarin corpus material; then, in S108, it is determined that it does not successfully match any existing reference text corpus, so in S109 it is added to the corpus corresponding to the Mandarin feature library.
S110: is there the next voice resource determined? If yes, the process returns to the step S102, otherwise, the process is ended.
By executing S102-S110 in a loop, multiple speech feature libraries and corpora can be constructed. The automatic matching of speech features continuously accumulates reference speech features and reference text corpora, and a corpus with sufficient material is obtained through autonomous learning, which is of significant reference value for speech synthesis and speech recognition.
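The S102-S110 loop can be sketched as follows. This is only a schematic reading of the flow, not the patented implementation: `extract_features` and `transcribe` are hypothetical stand-ins for the feature extraction (S103) and speech-to-text (S107) steps, the matching rule follows the thresholds above, and a newly created library simply receives a generated placeholder name:

```python
def build_corpora(resources, feature_libs, corpus_sets, extract_features, transcribe,
                  first_threshold=0.5, second_threshold=0.8):
    """One pass of S102-S110. `feature_libs` maps a library name to its
    (fundamental tone value, pitch value); `corpus_sets` maps the same
    name to a set of reference text corpora."""
    for index, resource in enumerate(resources):
        fundamental, pitch = extract_features(resource)            # S103
        matched = None                                             # S104
        for name, (ref_fundamental, ref_pitch) in feature_libs.items():
            # Mismatch only when BOTH differences exceed their thresholds.
            mismatch = (abs(fundamental - ref_fundamental) > first_threshold
                        and abs(pitch - ref_pitch) > second_threshold)
            if not mismatch:
                matched = name
                break
        if matched is None:                                        # S105/S106
            matched = f"dialect_{index}"       # placeholder name for the new library
            feature_libs[matched] = (fundamental, pitch)
            corpus_sets[matched] = set()
        for text in transcribe(resource):                          # S107
            corpus_sets[matched].add(text)     # S108/S109: set semantics skip duplicates
    return feature_libs, corpus_sets

# Toy run: each resource carries pre-computed features and a transcript.
resources = [("r1", (65.2, 98.3), ["I", "love"]),
             ("r2", (72.0, 105.0), ["my", "country"])]
feature_libs = {"Mandarin": (65.0, 98.0)}
corpus_sets = {"Mandarin": set()}
build_corpora(resources, feature_libs, corpus_sets,
              extract_features=lambda r: r[1], transcribe=lambda r: r[2])
```

In the toy run the first resource stays within the Mandarin thresholds, while the second exceeds both and therefore spawns a new dialect library and corpus.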
A complete embodiment of the method of constructing a corpus is listed below:
A Mandarin feature library is constructed in advance, and a Mandarin corpus is constructed to correspond to it. The Mandarin feature library stores a Mandarin fundamental tone value (corresponding to the first fundamental tone value) and a Mandarin pitch value (corresponding to the first pitch value).
The existing speech resources are acquired from the network.
The nth (n is an integer and n ≥ 1) speech resource is read; suppose its content is "I love my country" spoken in Sichuan dialect. The fundamental tone feature and pitch feature in the speech resource are extracted, the extracted fundamental tone feature is converted into a second fundamental tone value, and the extracted pitch feature is converted into a second pitch value.
A first difference between the second fundamental tone value and the fundamental tone value of Mandarin is calculated, and a second difference between the second pitch value and the pitch value of Mandarin is calculated.
It is then judged whether the first difference is greater than the first threshold and whether the second difference is greater than the second threshold. If both conditions hold at the same time, it is judged that the current speech resource "I love my country" does not belong to the Mandarin feature library. A new speech feature library is therefore constructed, the second fundamental tone value and second pitch value corresponding to "I love my country" are stored into it, and it is marked as the Sichuan dialect feature library.
And corresponding to the Szechwan dialect feature library, constructing a new corpus, and marking the new corpus as the Szechwan dialect corpus.
The current speech resource is converted into the text corpora "I", "love", "my", and "country".
It is then judged whether reference text corpora matching "I love my country" exist in the Sichuan dialect corpus.
Since the newly built corpus obviously stores no reference text corpus yet, it is directly judged that there is no match with any existing reference text corpus, and "I", "love", "my", and "country" are added to the Sichuan dialect corpus.
Thus, a processing flow of the voice resource is completed.
It is then judged whether a next speech resource exists. If so, n is set to n + 1, the nth speech resource after assignment is read, and the above process is executed again; otherwise, the process ends.
By reference to the processing of the nth speech resource, the processing of the (n+1)th, (n+2)th, and subsequent speech resources follows correspondingly and is not described in detail.
Referring to fig. 2, an embodiment of the present application provides an apparatus for constructing a corpus, including:
an obtaining unit 201, configured to obtain existing voice resources in a network;
a processing unit 202, configured to read each voice resource in sequence, and execute the following operations for each read voice resource:
extract a corresponding speech feature based on a speech resource; when it is determined that the speech feature is not successfully matched with any existing reference speech feature, establish a new speech feature library corresponding to the speech feature, and store the speech feature into the new speech feature library as a reference speech feature; construct a new corpus corresponding to the new speech feature library;
and convert the speech resource into a corresponding text corpus, and add the text corpus to the new corpus when it is determined that the text corpus is not successfully matched with any existing reference text corpus.
Optionally, when extracting a corresponding speech feature based on a speech resource, the processing unit 202 is specifically configured to:
extract the corresponding fundamental tone feature and pitch feature based on the speech resource.
Optionally, before reading each voice resource in sequence, the processing unit 202 is further configured to:
construct a Mandarin feature library, extract the fundamental tone feature and pitch feature of Mandarin, and store the extracted fundamental tone feature and pitch feature into the Mandarin feature library as initial reference speech features;
and constructing a mandarin corpus corresponding to the mandarin feature library.
Optionally, when extracting a corresponding voice feature based on a voice resource and determining that the voice feature does not successfully match any existing reference voice feature, the processing unit 202 is specifically configured to:
acquire a first fundamental tone value and a first pitch value, wherein the first fundamental tone value is obtained by converting the fundamental tone feature in any one reference voice feature, and the first pitch value is obtained by converting the pitch feature in that reference voice feature;
extract a corresponding fundamental tone feature and pitch feature based on the voice resource, convert the extracted fundamental tone feature into a second fundamental tone value, and convert the extracted pitch feature into a second pitch value;
calculate a first difference between the second fundamental tone value and the first fundamental tone value, and calculate a second difference between the second pitch value and the first pitch value;
and when the first difference is judged to be larger than a preset first threshold and the second difference is judged to be larger than a preset second threshold, determine that the voice feature does not successfully match the reference voice features in the Mandarin feature library.
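The threshold test above translates directly into code; the concrete threshold values here are illustrative, since the embodiment treats them as preset parameters:

```python
def matches_reference(second_tone, second_pitch, first_tone, first_pitch,
                      tone_threshold=20.0, pitch_threshold=15.0):
    """Return True when the extracted feature matches the reference.
    Per the rule above, the match fails only when BOTH the fundamental
    tone difference and the pitch difference exceed their thresholds."""
    first_diff = abs(second_tone - first_tone)
    second_diff = abs(second_pitch - first_pitch)
    return not (first_diff > tone_threshold and second_diff > pitch_threshold)
```

With these illustrative thresholds, a feature deviating in only one of the two dimensions still matches; both differences must exceed their thresholds for the match to fail.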
Optionally, the processing unit 202 is further configured to:
after extracting a corresponding voice feature based on a voice resource, if it is determined that the voice feature successfully matches at least one existing reference voice feature, storing the voice feature as a reference voice feature into the voice feature library corresponding to the at least one reference voice feature;
and when it is determined that the text corpus does not successfully match the at least one existing reference text corpus, adding the text corpus to the corpus corresponding to the at least one reference text corpus.
Based on the same inventive concept, referring to fig. 3, an embodiment of the present application further provides a server, including: a memory 301 and a processor 302, wherein,
a memory 301 for storing executable instructions;
a processor 302, configured to read and execute executable instructions stored in the memory, so as to implement any one of the methods for constructing a corpus described above.
Based on the same inventive concept, the present application further provides a storage medium storing instructions which, when executed by a processor, enable execution of any one of the methods for constructing a corpus described above.
To sum up, in the embodiments of the present application, existing voice resources in the network are acquired and read in sequence; when it is determined that the voice feature extracted from a voice resource does not match any existing reference voice feature, a new voice feature library is established corresponding to that voice feature, and a corpus corresponding to the new voice feature library is constructed; the voice resource is then converted into a corresponding text corpus, and the text corpus is added to the corpus when it does not successfully match any existing reference text corpus. In this way, the existing voice resources in the network are matched by their voice features, and different voice features correspond to different corpora, so that automatic classification of corpora based on feature matching is realized; when a text corpus is not yet recorded, it is added to the corresponding corpus, so that automatic expansion of the corpora is realized, the construction and maintenance efficiency of the corpora is improved, and operation and maintenance costs are saved.
Furthermore, the fundamental tone feature and the pitch feature are typical features of speech that can be represented quantitatively and reflect the differences between different voices; extracting the fundamental tone feature and the pitch feature from voice resources for feature matching therefore yields a good matching effect and a high recognition rate.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various changes and modifications may be made in the embodiments of the present application without departing from the spirit and scope of the embodiments of the present application. Thus, if such modifications and variations of the embodiments of the present application fall within the scope of the claims of the present application and their equivalents, the present application is also intended to encompass such modifications and variations.
Claims (12)
1. A method of constructing a corpus, comprising:
acquiring existing voice resources in a network;
reading each voice resource in sequence, and executing the following operations when each voice resource is read:
extracting a corresponding voice feature based on a voice resource; when it is determined that the voice feature does not successfully match any existing reference voice feature, establishing a new voice feature library corresponding to the voice feature, and storing the voice feature into the new voice feature library as a reference voice feature;
constructing a new corpus corresponding to the new voice feature library;
and converting the voice resource into a corresponding text corpus, and adding the text corpus to the new corpus when it is determined that the text corpus does not successfully match any existing reference text corpus.
2. The method of claim 1, wherein extracting the corresponding speech feature based on a speech resource specifically comprises:
extracting a corresponding fundamental tone feature and pitch feature based on the voice resource.
3. The method of claim 2, wherein prior to reading each voice resource in sequence, further comprising:
constructing a Mandarin feature library, extracting the fundamental tone feature and the pitch feature of Mandarin, and storing the extracted fundamental tone feature and pitch feature in the Mandarin feature library as initial reference voice features;
and constructing a Mandarin corpus corresponding to the Mandarin feature library.
4. The method of claim 2, wherein extracting a corresponding voice feature based on a voice resource and determining that the voice feature does not successfully match any existing reference voice feature specifically comprises:
acquiring a first fundamental tone value and a first pitch value, wherein the first fundamental tone value is obtained by converting the fundamental tone feature in any one reference voice feature, and the first pitch value is obtained by converting the pitch feature in that reference voice feature;
extracting a corresponding fundamental tone feature and pitch feature based on the voice resource, converting the extracted fundamental tone feature into a second fundamental tone value, and converting the extracted pitch feature into a second pitch value;
calculating a first difference between the second fundamental tone value and the first fundamental tone value, and calculating a second difference between the second pitch value and the first pitch value;
and when the first difference is judged to be larger than a preset first threshold and the second difference is judged to be larger than a preset second threshold, determining that the voice feature does not successfully match the reference voice features in the Mandarin feature library.
5. The method of claim 1 or 2, further comprising:
after extracting a corresponding voice feature based on a voice resource, if it is determined that the voice feature successfully matches at least one existing reference voice feature, storing the voice feature as a reference voice feature into the voice feature library corresponding to the at least one reference voice feature;
and when it is determined that the text corpus does not successfully match the at least one existing reference text corpus, adding the text corpus to the corpus corresponding to the at least one reference text corpus.
6. An apparatus for constructing a corpus, comprising:
an acquisition unit, configured to acquire existing voice resources in a network;
the processing unit is used for reading each voice resource in sequence, and executing the following operations when each voice resource is read:
extracting a corresponding voice feature based on a voice resource; when it is determined that the voice feature does not successfully match any existing reference voice feature, establishing a new voice feature library corresponding to the voice feature, and storing the voice feature into the new voice feature library as a reference voice feature;
constructing a new corpus corresponding to the new voice feature library;
and converting the voice resource into a corresponding text corpus, and adding the text corpus to the new corpus when it is determined that the text corpus does not successfully match any existing reference text corpus.
7. The device according to claim 6, wherein, when extracting the corresponding speech feature based on a speech resource, the processing unit is specifically configured to:
extracting a corresponding fundamental tone feature and pitch feature based on the voice resource.
8. The device of claim 7, wherein prior to reading each voice resource in turn, the processing unit is further configured to:
constructing a Mandarin feature library, extracting the fundamental tone feature and the pitch feature of Mandarin, and storing the extracted fundamental tone feature and pitch feature in the Mandarin feature library as initial reference voice features;
and constructing a Mandarin corpus corresponding to the Mandarin feature library.
9. The device according to claim 7, wherein when extracting a corresponding voice feature based on a voice resource and determining that the voice feature does not successfully match any existing reference voice feature, the processing unit is specifically configured to:
acquire a first fundamental tone value and a first pitch value, wherein the first fundamental tone value is obtained by converting the fundamental tone feature in any one reference voice feature, and the first pitch value is obtained by converting the pitch feature in that reference voice feature;
extract a corresponding fundamental tone feature and pitch feature based on the voice resource, convert the extracted fundamental tone feature into a second fundamental tone value, and convert the extracted pitch feature into a second pitch value;
calculate a first difference between the second fundamental tone value and the first fundamental tone value, and calculate a second difference between the second pitch value and the first pitch value;
and when the first difference is judged to be larger than a preset first threshold and the second difference is judged to be larger than a preset second threshold, determine that the voice feature does not successfully match the reference voice features in the Mandarin feature library.
10. The device of claim 6 or 7, wherein the processing unit is further to:
after extracting a corresponding voice feature based on a voice resource, if it is determined that the voice feature successfully matches at least one existing reference voice feature, storing the voice feature as a reference voice feature into the voice feature library corresponding to the at least one reference voice feature;
and when it is determined that the text corpus does not successfully match the at least one existing reference text corpus, adding the text corpus to the corpus corresponding to the at least one reference text corpus.
11. A server, comprising: a memory, a processor; wherein,
a memory for storing executable instructions;
a processor for reading and executing executable instructions stored in the memory to implement the method of any one of claims 1-5.
12. A storage medium, wherein instructions in the storage medium, when executed by a processor, enable performance of the method of any one of claims 1-5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911095120.3A CN110942765B (en) | 2019-11-11 | 2019-11-11 | Method, device, server and storage medium for constructing corpus |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110942765A true CN110942765A (en) | 2020-03-31 |
CN110942765B CN110942765B (en) | 2022-05-27 |
Family
ID=69906444
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911095120.3A Active CN110942765B (en) | 2019-11-11 | 2019-11-11 | Method, device, server and storage medium for constructing corpus |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110942765B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111356022A (en) * | 2020-04-18 | 2020-06-30 | 徐琼琼 | Video file processing method based on voice recognition |
CN113593556A (en) * | 2021-07-26 | 2021-11-02 | 深圳市捌零零在线科技有限公司 | Human-computer interaction method and device for vehicle-mounted voice operating system |
CN115810345A (en) * | 2022-11-23 | 2023-03-17 | 北京伽睿智能科技集团有限公司 | Intelligent speech technology recommendation method, system, equipment and storage medium |
Citations (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2003029774A (en) * | 2001-07-19 | 2003-01-31 | Matsushita Electric Ind Co Ltd | Voice waveform dictionary distribution system, voice waveform dictionary preparing device, and voice synthesizing terminal equipment |
CN1604182A (en) * | 2003-09-29 | 2005-04-06 | 摩托罗拉公司 | Method for voice synthesizing |
CN202584695U (en) * | 2011-12-30 | 2012-12-05 | 深圳市车音网科技有限公司 | Mapping display system and device thereof |
CN102881283A (en) * | 2011-07-13 | 2013-01-16 | 三星电子(中国)研发中心 | Method and system for processing voice |
US20160262974A1 (en) * | 2015-03-10 | 2016-09-15 | Strathspey Crown Holdings, LLC | Autonomic nervous system balancing device and method of use |
CN106164890A (en) * | 2013-12-02 | 2016-11-23 | 丘贝斯有限责任公司 | For the method eliminating the ambiguity of the feature in non-structured text |
CN106202380A (en) * | 2016-07-08 | 2016-12-07 | 中国科学院上海高等研究院 | The construction method of a kind of corpus of classifying, system and there is the server of this system |
CN106649278A (en) * | 2016-12-30 | 2017-05-10 | 三星电子(中国)研发中心 | Method and system for extending spoken language dialogue system corpora |
CN106935248A (en) * | 2017-02-14 | 2017-07-07 | 广州孩教圈信息科技股份有限公司 | A kind of voice similarity detection method and device |
US20180205823A1 (en) * | 2016-08-19 | 2018-07-19 | Andrew Horton | Caller identification in a secure environment using voice biometrics |
CN108764010A (en) * | 2018-03-23 | 2018-11-06 | 姜涵予 | Emotional state determines method and device |
CN109036424A (en) * | 2018-08-30 | 2018-12-18 | 出门问问信息科技有限公司 | Audio recognition method, device, electronic equipment and computer readable storage medium |
CN109065028A (en) * | 2018-06-11 | 2018-12-21 | 平安科技(深圳)有限公司 | Speaker clustering method, device, computer equipment and storage medium |
CN109215636A (en) * | 2018-11-08 | 2019-01-15 | 广东小天才科技有限公司 | Voice information classification method and system |
CN109215638A (en) * | 2018-10-19 | 2019-01-15 | 珠海格力电器股份有限公司 | Voice learning method and device, voice equipment and storage medium |
CN109616131A (en) * | 2018-11-12 | 2019-04-12 | 南京南大电子智慧型服务机器人研究院有限公司 | A kind of number real-time voice is changed voice method |
CN109801628A (en) * | 2019-02-11 | 2019-05-24 | 龙马智芯(珠海横琴)科技有限公司 | A kind of corpus collection method, apparatus and system |
CN110046261A (en) * | 2019-04-22 | 2019-07-23 | 山东建筑大学 | A kind of construction method of the multi-modal bilingual teaching mode of architectural engineering |
CN110134799A (en) * | 2019-05-29 | 2019-08-16 | 四川长虹电器股份有限公司 | A kind of text corpus based on BM25 algorithm build and optimization method |
CN110265028A (en) * | 2019-06-20 | 2019-09-20 | 百度在线网络技术(北京)有限公司 | Construction method, device and the equipment of corpus of speech synthesis |
CN110413723A (en) * | 2019-06-06 | 2019-11-05 | 福建奇点时空数字科技有限公司 | A kind of corpus automated construction method of data-driven |
Non-Patent Citations (2)

Title |
---|
Pang Wei, "A Survey of Research on Bilingual Corpus Construction", Information Technology and Informatization, no. 03, 15 March 2015 |
Zhang Sen et al., "Several Issues of Large-Scale Speech Corpora and Their Application in TTS", Chinese Journal of Computers, no. 04, 15 April 2010 |
Also Published As
Publication number | Publication date |
---|---|
CN110942765B (en) | 2022-05-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110377716B (en) | Interaction method and device for conversation and computer readable storage medium | |
US10410627B2 (en) | Automatic language model update | |
CN102176310B (en) | Speech recognition system with huge vocabulary | |
CN110942765B (en) | Method, device, server and storage medium for constructing corpus | |
CN106570180B (en) | Voice search method and device based on artificial intelligence | |
CN103823867B (en) | Humming type music retrieval method and system based on note modeling | |
US20090254349A1 (en) | Speech synthesizer | |
CN105206258A (en) | Generation method and device of acoustic model as well as voice synthetic method and device | |
CN110428819B (en) | Decoding network generation method, voice recognition method, device, equipment and medium | |
KR20080069990A (en) | Speech index pruning | |
CN101076851A (en) | Spoken language identification system and method for training and operating the said system | |
JP2020166839A (en) | Sentence recommendation method and apparatus based on associated points of interest | |
CN105161116A (en) | Method and device for determining climax fragment of multimedia file | |
CN111199732A (en) | Emotion-based voice interaction method, storage medium and terminal equipment | |
CN113609264B (en) | Data query method and device for power system nodes | |
CN109492126B (en) | Intelligent interaction method and device | |
CN106302987A (en) | A kind of audio frequency recommends method and apparatus | |
CN111178081A (en) | Semantic recognition method, server, electronic device and computer storage medium | |
CN108364655B (en) | Voice processing method, medium, device and computing equipment | |
CN114550718A (en) | Hot word speech recognition method, device, equipment and computer readable storage medium | |
CN114783424A (en) | Text corpus screening method, device, equipment and storage medium | |
CN108153875B (en) | Corpus processing method and device, intelligent sound box and storage medium | |
CN111724769A (en) | Production method of intelligent household voice recognition model | |
CN112883718B (en) | Spelling error correction method and device based on Chinese character sound-shape similarity and electronic equipment | |
CN109559752B (en) | Speech recognition method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||