CN112463921B - Prosody hierarchy dividing method, prosody hierarchy dividing device, computer device and storage medium - Google Patents

Prosody hierarchy dividing method, prosody hierarchy dividing device, computer device and storage medium

Info

Publication number
CN112463921B
Authority
CN
China
Prior art keywords
text
word
speech
training
random field
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011339547.6A
Other languages
Chinese (zh)
Other versions
CN112463921A (en)
Inventor
李俊杰 (Li Junjie)
陈闽川 (Chen Minchuan)
马骏 (Ma Jun)
王少军 (Wang Shaojun)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202011339547.6A priority Critical patent/CN112463921B/en
Publication of CN112463921A publication Critical patent/CN112463921A/en
Application granted granted Critical
Publication of CN112463921B publication Critical patent/CN112463921B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to the technical field of artificial intelligence and provides a prosody hierarchy dividing method, a prosody hierarchy dividing device, computer equipment and a storage medium. A text whose prosody hierarchy is to be divided is acquired; part-of-speech recognition is performed on the text to obtain the part of speech of each word in the text; the text after part-of-speech recognition is input into a preset conditional random field model to obtain a prosody level label for each word of the text. The conditional random field model includes feature functions that respectively count the part-of-speech structure and the text structure of each word's context and determine the prosody level label of each word according to those structures. By adopting a conditional random field model and combining the part of speech of each word in the text with the part-of-speech structure of its context, the method performs prosody hierarchy division on the text and avoids the overly fine granularity that currently results when only parts of speech are considered.

Description

Prosody hierarchy dividing method, prosody hierarchy dividing device, computer device and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a prosody hierarchy dividing method, apparatus, computer device, and storage medium.
Background
Prosody hierarchy division is of great significance in the field of speech synthesis: good prosodic division makes synthesized speech more natural, whereas unreasonable prosodic division not only degrades the synthesized audio but may also leave the listener with ambiguity.
At present, one approach is to count, between parts of speech, whether a prosody level boundary is needed. This effectively reduces the amount of labeled data required, but because only part-of-speech information is considered, the division result is often too fine. For example, for the sentence "The boy asked: 'Don't you like me?'", division based purely on part-of-speech statistics may give "boy #1 asked #3: 'do you #1 not #1 like #1 me #1?'". In this approach, the result is constrained by the word-segmentation granularity: because part-of-speech labels are based on word segmentation results, and word segmentation is finer-grained than typical prosody hierarchy division, the current method suffers from overly fine division results.
Disclosure of Invention
The main purpose of the present application is to provide a prosody hierarchy dividing method, a prosody hierarchy dividing device, computer equipment and a storage medium, aiming to overcome the overly fine granularity that currently results when prosody hierarchy division is performed based on parts of speech alone.
In order to achieve the above object, the present application provides a prosody hierarchy dividing method, including the steps of:
acquiring a text of a prosody level to be divided;
part of speech recognition is carried out on the text, and part of speech of each word in the text is obtained;
inputting the text after part-of-speech recognition into a preset conditional random field model to obtain prosody level labels of each word of the text; the conditional random field model comprises a characteristic function, wherein the characteristic function is used for respectively counting part-of-speech structures and text structures of the context of each word, and determining prosody level labels of each word according to the part-of-speech structures and the text structures;
and performing prosodic hierarchy division on the text according to prosodic hierarchy labels of each word in the text.
Further, before the step of acquiring the text of the prosody level to be divided, the method includes:
acquiring a training data set; the training data comprises a plurality of training texts, wherein the training texts carry part of speech of each word in the training texts and prosody level labels of each word;
inputting the training text into an initial conditional random field model for training to obtain the preset conditional random field model; the initial conditional random field model comprises a feature template, wherein the feature template is used for respectively counting part-of-speech structures and text structures of the context of each word in the training text, and determining model parameters in the feature template according to the part-of-speech structures, the text structures and prosody level labels of each word in a training sample so as to obtain the feature function.
Further, the training text includes three columns:
the first column is the vertical arrangement of each word in the training text, the second column is the part of speech corresponding to each word in the training text, and the third column is the prosody level label corresponding to each word in the training text.
Further, the feature templates included in the initial conditional random field model are:
(1)
(2)
(3)
where x_{i,1} represents the data in row i, column 2 of the training sample; w_{i-a} represents the corresponding part of speech and m_{i-a} the corresponding text; a and b define the context window; n is a preset hyperparameter; and y_i is the prosody level label corresponding to the word.
Further, after the step of inputting the training text into the initial conditional random field model for training to obtain the preset conditional random field model, the method includes:
acquiring test text in the test data; wherein the test text comprises part of speech of each word in the test text;
inputting the test text into the preset conditional random field model to obtain a predicted prosody level label of each word in the test text;
obtaining a correct prosody level tag of the test text, and comparing the predicted prosody level tag with the correct prosody level tag to obtain the prediction accuracy of the preset conditional random field model;
and if the prediction accuracy is higher than a threshold value, determining that the training of the preset conditional random field model is completed.
Further, the step of acquiring a training data set includes:
acquiring a text sample;
part of speech recognition is carried out on the text sample, and part of speech of each word in the text sample is obtained;
acquiring prosody level labels of each word in the text sample;
constructing each training text based on the text sample, the part of speech of each word in the text sample and the prosody level label of each word in the text sample;
the training data set is obtained based on a plurality of training texts.
Further, the method further comprises:
and storing the preset conditional random field model in a blockchain.
The application also provides a prosody hierarchy dividing device, comprising:
a first acquisition unit configured to acquire a text of a prosody level to be divided;
the recognition unit is used for recognizing the part of speech of the text to obtain the part of speech of each word in the text;
the tag obtaining unit is used for inputting the text subjected to part-of-speech recognition into a preset conditional random field model to obtain prosody level tags of each word of the text; the conditional random field model comprises a characteristic function, wherein the characteristic function is used for respectively counting part-of-speech structures and text structures of the context of each word, and determining prosody level labels of each word according to the part-of-speech structures and the text structures;
and the dividing unit is used for dividing the prosody level of the text according to the prosody level label of each word in the text.
The present application also provides a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of any of the methods described above when the computer program is executed.
The present application also provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method of any of the above.
With the prosody hierarchy dividing method, device, computer equipment and storage medium described above, a text whose prosody hierarchy is to be divided is acquired; part-of-speech recognition is performed on the text to obtain the part of speech of each word in the text; the text after part-of-speech recognition is input into a preset conditional random field model to obtain a prosody level label for each word of the text; and the feature functions included in the conditional random field model respectively count the part-of-speech structure and the text structure of each word's context and determine the prosody level label of each word according to those structures. By adopting a conditional random field model and combining the part of speech of each word in the text with the part-of-speech structure of its context, the method avoids the overly fine granularity that currently results when prosody hierarchy division considers parts of speech alone.
Drawings
FIG. 1 is a schematic diagram showing steps of a prosody hierarchy dividing method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of training text in an embodiment of the present application;
FIG. 3 is a block diagram showing a structure of a prosody hierarchy dividing device according to an embodiment of the present application;
fig. 4 is a block diagram schematically illustrating a structure of a computer device according to an embodiment of the present application.
The implementation, functional features and advantages of the present application will be further described with reference to the embodiments and the accompanying drawings.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
Referring to fig. 1, in one embodiment of the present application, a prosody hierarchy dividing method is provided, including the following steps:
step S1, acquiring a text of a prosody level to be divided;
s2, identifying the part of speech of the text to obtain the part of speech of each word in the text;
s3, inputting the text subjected to part-of-speech recognition into a preset conditional random field model to obtain prosody level labels of each word of the text; the conditional random field model comprises a characteristic function, wherein the characteristic function is used for respectively counting part-of-speech structures and text structures of the context of each word, and determining prosody level labels of each word according to the part-of-speech structures and the text structures;
and S4, performing prosodic hierarchy division on the text according to prosodic hierarchy labels of each word in the text.
In this embodiment, the method is applied to prosodic hierarchy division of a text: once the text has been reasonably divided into prosodic levels, it can be converted into speech according to that hierarchy, so that the resulting speech is more natural. The method can also be applied in the field of smart cities to promote smart city construction.
As described in step S1 above, the text is the text whose prosody hierarchy is to be divided; it is typically entered by a user.
As described in step S2 above, part-of-speech recognition is performed on the text to obtain the part of speech of each word in the text. Specifically, tools such as jieba or HanLP can be used. Part-of-speech recognition here means segmenting the text into words and identifying the part of speech of each word, such as verbs, nouns and modal particles. In this embodiment, the part-of-speech result of each word therefore carries not only part-of-speech information but also word segmentation information; since common prosodic break points all occur between different words (at word segmentation boundaries), introducing part-of-speech information is beneficial to prosody prediction.
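For illustration only, a minimal sketch of this tagging step using the jieba toolkit might look as follows; the example sentence, the helper name pos_tag and the printed tags are assumptions for the sketch, and HanLP or another tagger (with a possibly different tag set) could be used instead:

```python
# Illustrative sketch of the part-of-speech recognition step (step S2) using jieba.
# Assumptions: the example sentence and the helper name pos_tag are illustrative;
# HanLP or another tagger could be substituted, possibly with a different tag set.
import jieba.posseg as pseg


def pos_tag(text):
    """Segment the text into words and return (word, part_of_speech) pairs."""
    return [(p.word, p.flag) for p in pseg.cut(text)]


if __name__ == "__main__":
    for word, flag in pos_tag("致以诚挚的问候"):
        print(f"{word}/{flag}")
```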
For example, in one embodiment, the text 致以诚挚的问候 ("to extend sincere greetings") may be labeled as "致以/v 诚挚/a 的/ude1 问候/vn" after part-of-speech recognition.
As described in step S3, the text after part-of-speech recognition is input into the preset conditional random field model to obtain the prosody level label of each word of the text; the conditional random field model counts the part-of-speech structure of each word's context and determines the prosody level label of each word according to that structure. For example, the prosody level labels include 0, 1, 2, 3 and 5, where "0", "1", "2" and "3" respectively represent the first, second, third and fourth prosody levels, and "5" represents that no prosody level boundary follows the corresponding word. For the above example "致以/v 诚挚/a 的/ude1 问候/vn", performing prosody level division directly according to the part-of-speech labels would give "致/5 以/1 诚/5 挚/1 的/0 问/5 候/3". It will be appreciated that dividing in this way results in too fine a granularity: as normal speech shows, no prosodic boundary should be inserted after "诚挚" in the above text, i.e. no pause is needed there. Therefore, in this embodiment, the conditional random field model is adopted: based on the part of speech of each word of the text, the words in the context of each word are obtained together with their part-of-speech structure and text structure, and the prosody level label of the text is then predicted according to the prosody level labels that correspond to those part-of-speech structures in the conditional random field model. The finally predicted prosody level labels of the text are "致/5 以/1 诚/5 挚/5 的/0 问/5 候/3".
As described in step S4, prosodic hierarchy division is then performed on the text according to the prosody level label of each word: for each word it is determined whether a pause is needed when the text is converted into speech, and if a pause is needed, the pause level is obtained from the prosody level label.
In one embodiment, if prosody level prediction used only part-of-speech information, a phrase that should be treated as a whole would be split up, because wherever the parts of speech before and after differ (for example a verb followed by a particle), some prosodic boundary would be inserted. Likewise, in sentences where a verb is followed by a modal particle, a prosodic boundary would be placed between the verb and the particle, which makes the synthesized speech insufficiently natural. In this embodiment, this is effectively avoided by additionally introducing the text information and the part-of-speech structure, because in the training data such particles are in most cases connected to what precedes them. Combining the information of the text itself with the part-of-speech information therefore makes the division of the prosodic hierarchy more reasonable.
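By way of illustration only, the following is a minimal sketch of this kind of prediction using a general-purpose CRF library (sklearn-crfsuite); the feature names, the ±1 context window, the character-level toy sample and the label values are assumptions made for the sketch and are not taken from the patent:

```python
# Illustrative sketch: a CRF over token + part-of-speech context features.
# Assumptions: sklearn-crfsuite as the CRF implementation, a +/-1 context window,
# and a character-level toy sample based on the example above; none of this is
# mandated by the patent.
import sklearn_crfsuite


def token_features(sent, i):
    """Features for position i of sent, where sent is a list of (token, pos) pairs."""
    token, pos = sent[i]
    feats = {"token": token, "pos": pos}
    if i > 0:
        feats["-1:token"], feats["-1:pos"] = sent[i - 1]
    else:
        feats["BOS"] = True
    if i < len(sent) - 1:
        feats["+1:token"], feats["+1:pos"] = sent[i + 1]
    else:
        feats["EOS"] = True
    return feats


def sent_features(sent):
    return [token_features(sent, i) for i in range(len(sent))]


# Toy character-level training sample: (character, part of speech of its word).
train_sent = [("致", "v"), ("以", "v"), ("诚", "a"), ("挚", "a"),
              ("的", "ude1"), ("问", "vn"), ("候", "vn")]
train_labels = ["5", "1", "5", "5", "0", "5", "3"]  # illustrative prosody labels

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit([sent_features(train_sent)], [train_labels])

print(crf.predict([sent_features(train_sent)])[0])
```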
In an embodiment, before the step S1 of obtaining the text of the prosody level to be divided, the method includes:
step S10, acquiring a training data set; the training data comprises a plurality of training texts, wherein the training texts carry part of speech of each word in the training texts and prosody level labels of each word;
step S11, inputting the training text into an initial conditional random field model for training to obtain the preset conditional random field model; the initial conditional random field model comprises a feature template, wherein the feature template is used for respectively counting part-of-speech structures and text structures of the context of each word in the training text, and determining model parameters in the feature template according to the part-of-speech structures, the text structures and prosody level labels of each word in a training sample so as to obtain the feature function.
In this embodiment, there are many ways to construct the feature functions of a conditional random field; a common approach is to construct feature templates first and then derive the corresponding feature functions from the training data. Feature functions constructed in this way are simple, have far fewer parameters than neural network methods, and train faster. Once the feature functions are obtained, the prosody level labels of a text can be predicted with the conditional random field model.
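To make this construction concrete, the sketch below expands a single part-of-speech context template over a toy training sample and keeps only the patterns whose occurrence count exceeds n as candidate feature functions; the window size, the threshold n and the toy data are illustrative assumptions rather than the patent's configuration:

```python
# Illustrative sketch: expanding one part-of-speech context template over training
# data and keeping the (pattern, label) pairs that occur more than n times.
# Assumptions: a (-1, 0, +1) window, n = 0 and the toy sample are for illustration.
from collections import Counter


def expand_pos_template(rows, a=1, b=1):
    """rows: list of (token, pos, label); yield ((pos_{i-a..i+b}), label_i) patterns."""
    for i in range(len(rows)):
        window = tuple(
            rows[j][1] if 0 <= j < len(rows) else "PAD"
            for j in range(i - a, i + b + 1)
        )
        yield window, rows[i][2]


rows = [("致", "v", "5"), ("以", "v", "1"), ("诚", "a", "5"), ("挚", "a", "5"),
        ("的", "ude1", "0"), ("问", "vn", "5"), ("候", "vn", "3")]

counts = Counter(expand_pos_template(rows))
n = 0  # occurrence threshold; a real corpus would use a larger value
feature_patterns = {pattern for pattern, c in counts.items() if c > n}
for pattern in sorted(feature_patterns):
    print(pattern)
```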
In one embodiment, referring to FIG. 2, the training text includes three columns:
the first column is the vertical arrangement of each word in the training text, the second column is the part of speech corresponding to each word in the training text, and the third column is the prosody level label corresponding to each word in the training text.
In this embodiment, the feature templates included in the initial conditional random field model are:
(1)
(2)
(3)
where x_{i,1} represents the data in row i, column 2 of the training sample; w_{i-a} represents the corresponding part of speech and m_{i-a} the corresponding text; a and b define the context window; n is a preset hyperparameter; and y_i is the prosody level label corresponding to the word. For a training text x, the element in row i of the j-th column is denoted x_{i,j}.
Formula (1) expresses a function that returns 1 when a certain part-of-speech structure occurs more than n times, and 0 otherwise. For example, when a = 1 and b = 1, the template considers the previous, current and following parts of speech, i.e. three positions in total. If a part-of-speech structure such as "x_{i-1,1} = noun, x_{i,1} = noun, x_{i+1,1} = verb, y_i = #1" occurs more than n times in the training data, template (1) defines a feature function that returns 1 whenever "x_{i-1,1} = noun, x_{i,1} = noun, x_{i+1,1} = verb, y_i = #1" holds, and 0 otherwise. Based on the training text and the feature templates, the corresponding model parameters can then be obtained through training.
Formula (2) is similar to formula (1), except that it considers the structure of the text (the words) itself. Formula (3) captures the influence of the previous prosody level label on the prediction of the current prosody level label.
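The formula images themselves are not reproduced above; one possible form consistent with this description, written as indicator-style feature functions, is sketched below (the notation and exact form are assumptions, not the patent's original formulas):

```latex
% Sketch only: indicator-style feature functions consistent with the description.
% The notation and exact form are assumptions, not the patent's original formulas.
f^{(1)}_{k}(y_i, x) =
\begin{cases}
  1, & \text{if } \big(x_{i-a,1},\dots,x_{i+b,1},\, y_i\big) \text{ equals the $k$-th
        part-of-speech pattern, which occurs more than } n \text{ times in training} \\
  0, & \text{otherwise}
\end{cases}
\qquad
f^{(2)}_{k}(y_i, x):\ \text{the same form over the word context } x_{i-a,0},\dots,x_{i+b,0}
\qquad
f^{(3)}(y_{i-1}, y_i):\ \text{a transition feature on the previous and current prosody labels}
```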
The feature templates not only introduce part-of-speech structural features but also combine the text structural features of the training data. Combining text information with part-of-speech information makes the division of prosody levels more reasonable, and the constructed feature functions make full use of textual context, part-of-speech changes and word segmentation information. Compared with existing neural-network-based methods, this method has much lower time complexity, model complexity and training data requirements.
In an embodiment, after the step S11 of inputting the training text into the initial conditional random field model to perform training to obtain the preset conditional random field model, the method includes:
acquiring test text in the test data; wherein the test text comprises part of speech of each word in the test text;
inputting the test text into the preset conditional random field model to obtain a predicted prosody level label of each word in the test text;
obtaining a correct prosody level tag of the test text, and comparing the predicted prosody level tag with the correct prosody level tag to obtain the prediction accuracy of the preset conditional random field model;
and if the prediction accuracy is higher than a threshold value, determining that the training of the preset conditional random field model is completed.
In this embodiment, the trained conditional random field model further needs to be tested. The process of inputting the test text into the preset conditional random field model is the same as for a training sample, except that the test text carries only the part-of-speech labels and no prosody level labels.
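As an illustration, the evaluation described above could be sketched as follows, assuming the predicted and correct label sequences are available as lists; the helper name, the sample data and the threshold value are assumptions rather than values fixed by the patent:

```python
# Illustrative sketch of the test step: per-word accuracy of predicted prosody
# level labels against the correct labels. The threshold value is an assumption;
# the patent does not fix a concrete number.
def prosody_label_accuracy(predicted, reference):
    """predicted, reference: lists of label sequences, one sequence per test text."""
    total = correct = 0
    for pred_seq, ref_seq in zip(predicted, reference):
        for p, r in zip(pred_seq, ref_seq):
            total += 1
            correct += int(p == r)
    return correct / total if total else 0.0


accuracy = prosody_label_accuracy(
    predicted=[["5", "1", "5", "5", "0", "5", "3"]],
    reference=[["5", "1", "5", "5", "0", "5", "3"]],
)
THRESHOLD = 0.95  # illustrative threshold
print("training complete" if accuracy > THRESHOLD else "continue training")
```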
In an embodiment, the step S10 of acquiring a training data set includes:
acquiring a text sample;
part of speech recognition is carried out on the text sample, and part of speech of each word in the text sample is obtained;
acquiring prosody level labels of each word in the text sample;
constructing each training text based on the text sample, the part of speech of each word in the text sample and the prosody level label of each word in the text sample;
the training data set is obtained based on a plurality of training texts.
In this embodiment, when the training text is constructed, three columns of data are required, where the first column is a vertical arrangement of each word in the training text, the second column is a part of speech corresponding to each word in the training text, and the third column is a prosodic hierarchical label corresponding to each word in the training text.
For example, for the training text 致以诚挚的问候 ("to extend sincere greetings"), the part-of-speech recognition result is "致以/v 诚挚/a 的/ude1 问候/vn". When the training text is constructed, this part-of-speech annotation fills the second column, and the corresponding prosody level labels, "致/5 以/1 诚/5 挚/5 的/0 问/5 候/3", fill the third column.
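A minimal sketch of assembling the three-column training rows for this example is shown below; the character-level layout, the helper name and the concrete labels follow the example above and are illustrative assumptions:

```python
# Illustrative sketch: building three-column training rows (token, part of speech,
# prosody level label) for one sentence. The character-level layout and labels
# follow the example above and are assumptions for illustration.
def build_training_rows(tagged_words, labels):
    """tagged_words: [(word, pos), ...]; labels: one prosody label per character."""
    chars = [(ch, pos) for word, pos in tagged_words for ch in word]
    assert len(chars) == len(labels), "one label is expected per character"
    return [f"{ch}\t{pos}\t{label}" for (ch, pos), label in zip(chars, labels)]


rows = build_training_rows(
    [("致以", "v"), ("诚挚", "a"), ("的", "ude1"), ("问候", "vn")],
    ["5", "1", "5", "5", "0", "5", "3"],
)
print("\n".join(rows))
```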
In an embodiment, the method further comprises:
and storing the preset conditional random field model in a blockchain. The blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, an encryption algorithm and the like. The Blockchain (Blockchain), which is essentially a decentralised database, is a string of data blocks that are generated by cryptographic means in association, each data block containing a batch of information of network transactions for verifying the validity of the information (anti-counterfeiting) and generating the next block. The blockchain may include a blockchain underlying platform, a platform product services layer, and an application services layer.
Referring to fig. 3, in an embodiment of the present application, there is further provided a prosody hierarchy dividing device, including:
a first acquisition unit 10 for acquiring a text of a prosody level to be divided;
the recognition unit 20 is configured to perform part-of-speech recognition on the text, so as to obtain a part-of-speech of each word in the text;
a tag obtaining unit 30, configured to input the text after part-of-speech recognition into a preset conditional random field model, and obtain a prosody level tag of each word of the text; the conditional random field model comprises a characteristic function, wherein the characteristic function is used for respectively counting part-of-speech structures and text structures of the context of each word, and determining prosody level labels of each word according to the part-of-speech structures and the text structures;
a dividing unit 40 for performing prosodic hierarchy division on the text according to prosodic hierarchy tags of each word in the text.
In an embodiment, the apparatus further includes:
a second acquisition unit configured to acquire a training data set; the training data comprises a plurality of training texts, wherein the training texts carry part of speech of each word in the training texts and prosody level labels of each word;
inputting the training text into an initial conditional random field model for training to obtain the preset conditional random field model; the initial conditional random field model comprises a feature template, wherein the feature template is used for respectively counting part-of-speech structures and text structures of the context of each word in the training text, and determining model parameters in the feature template according to the part-of-speech structures, the text structures and prosody level labels of each word in a training sample so as to obtain the feature function.
In one embodiment, the training text includes three columns:
the first column is the vertical arrangement of each word in the training text, the second column is the part of speech corresponding to each word in the training text, and the third column is the prosody level label corresponding to each word in the training text.
In an embodiment, the feature templates included in the initial conditional random field model are:
(1)
(2)
(3)
where x_{i,1} represents the data in row i, column 2 of the training sample; w_{i-a} represents the corresponding part of speech and m_{i-a} the corresponding text; a and b define the context window; n is a preset hyperparameter; and y_i is the prosody level label corresponding to the text.
In an embodiment, the apparatus further includes:
the third acquisition unit is used for acquiring test texts in the test data; wherein the test text comprises part of speech of each word in the test text;
the prediction unit is used for inputting the test text into the preset conditional random field model to obtain a predicted prosody level label of each word in the test text;
the comparison unit is used for obtaining a correct prosody level label of the test text and comparing the predicted prosody level label with the correct prosody level label to obtain the prediction accuracy of the preset conditional random field model;
and the determining unit is used for determining that the training of the preset conditional random field model is completed if the prediction accuracy is higher than a threshold value.
In an embodiment, the second obtaining unit is specifically configured to:
acquiring a text sample;
part of speech recognition is carried out on the text sample, and part of speech of each word in the text sample is obtained;
acquiring prosody level labels of each word in the text sample;
constructing each training text based on the text sample, the part of speech of each word in the text sample and the prosody level label of each word in the text sample;
the training data set is obtained based on a plurality of training texts.
In an embodiment, the apparatus further includes:
and the storage unit is used for storing the preset conditional random field model in a block chain.
In this embodiment, for specific implementation of each unit in the above embodiment of the apparatus, please refer to the description in the above embodiment of the method, and no further description is given here.
Referring to fig. 4, an embodiment of the present application further provides a computer device, which may be a server; its internal structure may be as shown in fig. 4. The computer device includes a processor, a memory, a network interface and a database connected by a system bus, where the processor is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used to store the conditional random field model and the like. The network interface of the computer device is used to communicate with an external terminal through a network connection. The computer program, when executed by the processor, implements a prosody hierarchy dividing method.
Those skilled in the art will appreciate that the architecture shown in fig. 4 is merely a block diagram of part of the structure related to the present application and does not limit the computer device to which the present application is applied.
An embodiment of the present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a prosody hierarchy dividing method. It is understood that the computer-readable storage medium in this embodiment may be a volatile readable storage medium or a non-volatile readable storage medium.
In summary, in the prosody hierarchy dividing method, device, computer equipment and storage medium provided in the embodiments of the present application, a text whose prosody hierarchy is to be divided is acquired; part-of-speech recognition is performed on the text to obtain the part of speech of each word in the text; the text after part-of-speech recognition is input into a preset conditional random field model to obtain a prosody level label for each word of the text; and the feature functions included in the conditional random field model count the part-of-speech structure of each word's context and determine the prosody level label of each word accordingly. By adopting a conditional random field model and combining the part of speech of each word in the text with the part-of-speech structure of its context, the method avoids the overly fine granularity that currently results when prosody hierarchy division considers parts of speech alone.
Those skilled in the art will appreciate that all or part of the methods described above may be implemented by a computer program stored on a non-volatile computer-readable storage medium, which, when executed, may include the procedures of the embodiments of the methods described above. Any reference to memory, storage, a database or another medium provided herein and used in the embodiments may include non-volatile and/or volatile memory. The non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. The volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM).
It should be noted that, in this document, the terms "comprise", "comprising" and any variations thereof are intended to cover a non-exclusive inclusion, so that a process, apparatus, article or method that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, apparatus, article or method. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, apparatus, article or method that comprises that element.
The foregoing description is only of the preferred embodiments of the present application and is not intended to limit the scope of the claims, and all equivalent structures or equivalent processes using the descriptions and drawings of the present application, or direct or indirect application in other related technical fields are included in the scope of the claims of the present application.

Claims (7)

1. A prosody hierarchy dividing method, characterized by comprising the steps of:
acquiring a text of a prosody level to be divided;
part of speech recognition is carried out on the text, and part of speech of each word in the text is obtained;
inputting the text after part-of-speech recognition into a preset conditional random field model to obtain prosody level labels of each word of the text; the conditional random field model comprises a characteristic function, wherein the characteristic function is used for respectively counting part-of-speech structures and text structures of the context of each word, and determining prosody level labels of each word according to the part-of-speech structures and the text structures;
performing prosodic hierarchy division on the text according to prosodic hierarchy tags of each word in the text;
before the step of obtaining the text of the prosody level to be divided, the method comprises the following steps:
acquiring a training data set; the training data comprises a plurality of training texts, wherein the training texts carry part of speech of each word in the training texts and prosody level labels of each word;
inputting the training text into an initial conditional random field model for training to obtain the preset conditional random field model; the initial conditional random field model comprises a characteristic template, wherein the characteristic template is used for respectively counting part-of-speech structures and text structures of the context of each word in the training text, and determining model parameters in the characteristic template according to the part-of-speech structures, the text structures and prosody level labels of each word in a training sample so as to obtain the characteristic function;
the training text includes three columns:
the first column is the vertical arrangement of each word in the training text, the second column is the part of speech corresponding to each word in the training text, and the third column is the prosodic hierarchy label corresponding to each word in the training text;
the feature templates included in the initial conditional random field model are:
(1)
(2)
(3)
where x_{i,1} represents the data in row i, column 2 of the training sample; w_{i-a} represents the corresponding part of speech and m_{i-a} the corresponding text; a and b define the context window; n is a preset hyperparameter; and y_i is the prosody level label corresponding to the text.
2. The prosody level division method according to claim 1, wherein after the step of inputting the training text into an initial conditional random field model for training to obtain the preset conditional random field model, the method comprises:
acquiring test text in the test data; wherein the test text comprises part of speech of each word in the test text;
inputting the test text into the preset conditional random field model to obtain a predicted prosody level label of each word in the test text;
obtaining a correct prosody level tag of the test text, and comparing the predicted prosody level tag with the correct prosody level tag to obtain the prediction accuracy of the preset conditional random field model;
and if the prediction accuracy is higher than a threshold value, determining that the training of the preset conditional random field model is completed.
3. The prosodic hierarchy dividing method according to claim 1, wherein the step of acquiring a training data set comprises:
acquiring a text sample;
part of speech recognition is carried out on the text sample, and part of speech of each word in the text sample is obtained;
acquiring prosody level labels of each word in the text sample;
constructing each training text based on the text sample, the part of speech of each word in the text sample and the prosody level label of each word in the text sample;
the training data set is obtained based on a plurality of training texts.
4. The prosody level dividing method according to claim 1, further comprising:
and storing the preset conditional random field model in a blockchain.
5. A prosodic hierarchy dividing device employing the method according to any one of claims 1 to 4, comprising:
a first acquisition unit configured to acquire a text of a prosody level to be divided;
the recognition unit is used for recognizing the part of speech of the text to obtain the part of speech of each word in the text;
the tag obtaining unit is used for inputting the text subjected to part-of-speech recognition into a preset conditional random field model to obtain prosody level tags of each word of the text; the conditional random field model comprises a characteristic function, wherein the characteristic function is used for respectively counting part-of-speech structures and text structures of the context of each word, and determining prosody level labels of each word according to the part-of-speech structures and the text structures;
a dividing unit for performing prosodic hierarchy division on the text according to prosodic hierarchy tags of each word in the text;
before the step of obtaining the text of the prosody level to be divided, the method comprises the following steps:
acquiring a training data set; the training data comprises a plurality of training texts, wherein the training texts carry part of speech of each word in the training texts and prosody level labels of each word;
inputting the training text into an initial conditional random field model for training to obtain the preset conditional random field model; the initial conditional random field model comprises a characteristic template, wherein the characteristic template is used for respectively counting part-of-speech structures and text structures of the context of each word in the training text, and determining model parameters in the characteristic template according to the part-of-speech structures, the text structures and prosody level labels of each word in a training sample so as to obtain the characteristic function;
the training text includes three columns:
the first column is the vertical arrangement of each word in the training text, the second column is the part of speech corresponding to each word in the training text, and the third column is the prosodic hierarchy label corresponding to each word in the training text;
the feature templates included in the initial conditional random field model are:
(1)
(2)
(3)
where x_{i,1} represents the data in row i, column 2 of the training sample; w_{i-a} represents the corresponding part of speech and m_{i-a} the corresponding text; a and b define the context window; n is a preset hyperparameter; and y_i is the prosody level label corresponding to the text.
6. A computer device comprising a memory and a processor, the memory having stored therein a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 4.
7. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 4.
CN202011339547.6A 2020-11-25 2020-11-25 Prosody hierarchy dividing method, prosody hierarchy dividing device, computer device and storage medium Active CN112463921B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011339547.6A CN112463921B (en) 2020-11-25 2020-11-25 Prosody hierarchy dividing method, prosody hierarchy dividing device, computer device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011339547.6A CN112463921B (en) 2020-11-25 2020-11-25 Prosody hierarchy dividing method, prosody hierarchy dividing device, computer device and storage medium

Publications (2)

Publication Number Publication Date
CN112463921A CN112463921A (en) 2021-03-09
CN112463921B true CN112463921B (en) 2024-03-19

Family

ID=74807889

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011339547.6A Active CN112463921B (en) 2020-11-25 2020-11-25 Prosody hierarchy dividing method, prosody hierarchy dividing device, computer device and storage medium

Country Status (1)

Country Link
CN (1) CN112463921B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023102931A1 (en) * 2021-12-10 2023-06-15 广州虎牙科技有限公司 Method for predicting prosodic structure, and electronic device, program product and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110534087A (en) * 2019-09-04 2019-12-03 清华大学深圳研究生院 A kind of text prosody hierarchy Structure Prediction Methods, device, equipment and storage medium
CN110782870A (en) * 2019-09-06 2020-02-11 腾讯科技(深圳)有限公司 Speech synthesis method, speech synthesis device, electronic equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7263488B2 (en) * 2000-12-04 2007-08-28 Microsoft Corporation Method and apparatus for identifying prosodic word boundaries

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110534087A (en) * 2019-09-04 2019-12-03 清华大学深圳研究生院 A kind of text prosody hierarchy Structure Prediction Methods, device, equipment and storage medium
CN110782870A (en) * 2019-09-06 2020-02-11 腾讯科技(深圳)有限公司 Speech synthesis method, speech synthesis device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Prosodic phrase recognition based on syntactic dependency and conditional random fields; Qian Yili et al.; Journal of Tsinghua University (Science and Technology); Vol. 59, No. 7; pp. 530-536 *

Also Published As

Publication number Publication date
CN112463921A (en) 2021-03-09

Similar Documents

Publication Publication Date Title
CN105244020B (en) Prosodic hierarchy model training method, text-to-speech method and text-to-speech device
CN111145718B (en) Chinese mandarin character-voice conversion method based on self-attention mechanism
CN109461437B (en) Verification content generation method and related device for lip language identification
US20220020355A1 (en) Neural text-to-speech synthesis with multi-level text information
CN113297366B (en) Emotion recognition model training method, device, equipment and medium for multi-round dialogue
CN112562640B (en) Multilingual speech recognition method, device, system, and computer-readable storage medium
CN111369974A (en) Dialect pronunciation labeling method, language identification method and related device
WO2021134591A1 (en) Speech synthesis method, speech synthesis apparatus, smart terminal and storage medium
CN116778967B (en) Multi-mode emotion recognition method and device based on pre-training model
KR20170090127A (en) Apparatus for comprehending speech
CN115116428B (en) Prosodic boundary labeling method, device, equipment, medium and program product
CN111223476A (en) Method and device for extracting voice feature vector, computer equipment and storage medium
Ali Multi-dialect Arabic broadcast speech recognition
CN111370001B (en) Pronunciation correction method, intelligent terminal and storage medium
CN112463921B (en) Prosody hierarchy dividing method, prosody hierarchy dividing device, computer device and storage medium
US20140249800A1 (en) Language processing method and electronic device
CN116597809A (en) Multi-tone word disambiguation method, device, electronic equipment and readable storage medium
CN116665639A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN112464649A (en) Pinyin conversion method and device for polyphone, computer equipment and storage medium
CN113053409B (en) Audio evaluation method and device
Park et al. Jejueo datasets for machine translation and speech synthesis
CN112735379B (en) Speech synthesis method, device, electronic equipment and readable storage medium
CN115099222A (en) Punctuation mark misuse detection and correction method, device, equipment and storage medium
CN114492382A (en) Character extraction method, text reading method, dialog text generation method, device, equipment and storage medium
CN113012685B (en) Audio recognition method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant