CN112463921B - Prosody hierarchy dividing method, prosody hierarchy dividing device, computer device and storage medium - Google Patents

Prosody hierarchy dividing method, prosody hierarchy dividing device, computer device and storage medium

Info

Publication number
CN112463921B
Authority
CN
China
Prior art keywords
text
word
speech
training
random field
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011339547.6A
Other languages
Chinese (zh)
Other versions
CN112463921A (en)
Inventor
李俊杰 (Li Junjie)
陈闽川 (Chen Minchuan)
马骏 (Ma Jun)
王少军 (Wang Shaojun)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202011339547.6A priority Critical patent/CN112463921B/en
Publication of CN112463921A publication Critical patent/CN112463921A/en
Application granted granted Critical
Publication of CN112463921B publication Critical patent/CN112463921B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to the technical field of artificial intelligence and provides a prosody hierarchy dividing method, a prosody hierarchy dividing device, computer equipment and a storage medium. A text whose prosody hierarchy is to be divided is acquired; part-of-speech recognition is performed on the text to obtain the part of speech of each word in the text; the text after part-of-speech recognition is input into a preset conditional random field model to obtain a prosody level label for each word of the text. The conditional random field model includes feature functions that respectively count the part-of-speech structure and the text structure of each word's context and determine the prosody level label of each word according to those structures. By adopting a conditional random field model and combining the part of speech of each word in the text with the part-of-speech structure of its context, the method performs prosody hierarchy division on the text and avoids the overly fine granularity that currently results when only parts of speech are considered.

Description

Prosody hierarchy dividing method, prosody hierarchy dividing device, computer device and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a prosody hierarchy dividing method, apparatus, computer device, and storage medium.
Background
Prosody hierarchy division is of great significance in the field of speech synthesis: good prosodic division makes synthesized speech more natural, whereas unreasonable prosodic division not only degrades the synthesized audio but may also leave the listener with ambiguity.
At present, one approach is to count, between parts of speech, whether a prosody level boundary is needed. This effectively reduces the amount of labeled data required, but because only part-of-speech information is considered, the division result is often too fine. For example, for the sentence "The boy asked: 'Don't you like me?'", division based purely on part-of-speech statistics may give "boy #1 asked #3: 'do you #1 not #1 like #1 me #1?'". In this approach, the result is constrained by the word-segmentation granularity: because part-of-speech labels are based on word segmentation results, and word segmentation is finer-grained than typical prosody hierarchy division, the current method suffers from overly fine division results.
Disclosure of Invention
The main purpose of the present application is to provide a prosody hierarchy dividing method, a prosody hierarchy dividing device, computer equipment and a storage medium, aiming to overcome the overly fine granularity that currently results when prosody hierarchy division is performed based on parts of speech alone.
In order to achieve the above object, the present application provides a prosody hierarchy dividing method, including the steps of:
acquiring a text of a prosody level to be divided;
part of speech recognition is carried out on the text, and part of speech of each word in the text is obtained;
inputting the text after part-of-speech recognition into a preset conditional random field model to obtain prosody level labels of each word of the text; the conditional random field model comprises a characteristic function, wherein the characteristic function is used for respectively counting part-of-speech structures and text structures of the context of each word, and determining prosody level labels of each word according to the part-of-speech structures and the text structures;
and performing prosodic hierarchy division on the text according to prosodic hierarchy labels of each word in the text.
Further, before the step of acquiring the text of the prosody level to be divided, the method includes:
acquiring a training data set; the training data comprises a plurality of training texts, wherein the training texts carry part of speech of each word in the training texts and prosody level labels of each word;
inputting the training text into an initial conditional random field model for training to obtain the preset conditional random field model; the initial conditional random field model comprises a feature template, wherein the feature template is used for respectively counting part-of-speech structures and text structures of the context of each word in the training text, and determining model parameters in the feature template according to the part-of-speech structures, the text structures and prosody level labels of each word in a training sample so as to obtain the feature function.
Further, the training text includes three columns:
the first column is the vertical arrangement of each word in the training text, the second column is the part of speech corresponding to each word in the training text, and the third column is the prosody level label corresponding to each word in the training text.
Further, the feature templates included in the initial conditional random field model are:
(1)
(2)
(3)
where x_{i,1} represents the data in row i, column 2 of the training sample; w_{i-a} represents the corresponding part of speech and m_{i-a} the corresponding text; a and b define the context window; n is a preset hyperparameter; and y_i is the prosody level label corresponding to the word.
Further, after the step of inputting the training text into the initial conditional random field model for training to obtain the preset conditional random field model, the method includes:
acquiring test text in the test data; wherein the test text comprises part of speech of each word in the test text;
inputting the test text into the preset conditional random field model to obtain a predicted prosody level label of each word in the test text;
obtaining a correct prosody level tag of the test text, and comparing the predicted prosody level tag with the correct prosody level tag to obtain the prediction accuracy of the preset conditional random field model;
and if the prediction accuracy is higher than a threshold value, determining that the training of the preset conditional random field model is completed.
Further, the step of acquiring a training data set includes:
acquiring a text sample;
part of speech recognition is carried out on the text sample, and part of speech of each word in the text sample is obtained;
acquiring prosody level labels of each word in the text sample;
constructing each training text based on the text sample, the part of speech of each word in the text sample and the prosody level label of each word in the text sample;
the training data set is obtained based on a plurality of training texts.
Further, the method further comprises:
and storing the preset conditional random field model in a blockchain.
The application also provides a prosody hierarchy dividing device, comprising:
a first acquisition unit configured to acquire a text of a prosody level to be divided;
the recognition unit is used for recognizing the part of speech of the text to obtain the part of speech of each word in the text;
the tag obtaining unit is used for inputting the text subjected to part-of-speech recognition into a preset conditional random field model to obtain prosody level tags of each word of the text; the conditional random field model comprises a characteristic function, wherein the characteristic function is used for respectively counting part-of-speech structures and text structures of the context of each word, and determining prosody level labels of each word according to the part-of-speech structures and the text structures;
and the dividing unit is used for dividing the prosody level of the text according to the prosody level label of each word in the text.
The present application also provides a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of any of the methods described above when the computer program is executed.
The present application also provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method of any of the above.
With the prosody hierarchy dividing method, device, computer equipment and storage medium described above, a text whose prosody hierarchy is to be divided is acquired; part-of-speech recognition is performed on the text to obtain the part of speech of each word in the text; the text after part-of-speech recognition is input into a preset conditional random field model to obtain a prosody level label for each word of the text; and the feature functions included in the conditional random field model respectively count the part-of-speech structure and the text structure of each word's context and determine the prosody level label of each word according to those structures. By adopting a conditional random field model and combining the part of speech of each word in the text with the part-of-speech structure of its context, the method avoids the overly fine granularity that currently results when prosody hierarchy division considers parts of speech alone.
Drawings
FIG. 1 is a schematic diagram showing steps of a prosody hierarchy dividing method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of training text in an embodiment of the present application;
FIG. 3 is a block diagram showing a structure of a prosody hierarchy dividing device according to an embodiment of the present application;
fig. 4 is a block diagram schematically illustrating a structure of a computer device according to an embodiment of the present application.
The implementation, functional features and advantages of the present application will be further described with reference to the embodiments and the accompanying drawings.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
Referring to fig. 1, in one embodiment of the present application, a prosody hierarchy dividing method is provided, including the following steps:
step S1, acquiring a text of a prosody level to be divided;
s2, identifying the part of speech of the text to obtain the part of speech of each word in the text;
s3, inputting the text subjected to part-of-speech recognition into a preset conditional random field model to obtain prosody level labels of each word of the text; the conditional random field model comprises a characteristic function, wherein the characteristic function is used for respectively counting part-of-speech structures and text structures of the context of each word, and determining prosody level labels of each word according to the part-of-speech structures and the text structures;
and S4, performing prosodic hierarchy division on the text according to prosodic hierarchy labels of each word in the text.
In this embodiment, the method is applied to prosodic hierarchy division of a text: once the text has been reasonably divided into prosodic levels, it can be converted into speech according to that hierarchy, so that the resulting speech is more natural. The method can also be applied in the field of smart cities to promote smart city construction.
As described in step S1 above, the text is the text whose prosody hierarchy is to be divided; it is typically entered by a user.
As described in step S2 above, part-of-speech recognition is performed on the text to obtain the part of speech of each word in the text. Specifically, tools such as jieba or HanLP can be used. Part-of-speech recognition here means segmenting the text into words and identifying the part of speech of each word, such as verbs, nouns and modal particles. In this embodiment, the part-of-speech result of each word therefore carries not only part-of-speech information but also word segmentation information; since common prosodic break points all occur between different words (at word segmentation boundaries), introducing part-of-speech information is beneficial to prosody prediction.
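For illustration only, a minimal sketch of this tagging step using the jieba toolkit might look as follows; the example sentence, the helper name pos_tag and the printed tags are assumptions for the sketch, and HanLP or another tagger (with a possibly different tag set) could be used instead:

```python
# Illustrative sketch of the part-of-speech recognition step (step S2) using jieba.
# Assumptions: the example sentence and the helper name pos_tag are illustrative;
# HanLP or another tagger could be substituted, possibly with a different tag set.
import jieba.posseg as pseg


def pos_tag(text):
    """Segment the text into words and return (word, part_of_speech) pairs."""
    return [(p.word, p.flag) for p in pseg.cut(text)]


if __name__ == "__main__":
    for word, flag in pos_tag("致以诚挚的问候"):
        print(f"{word}/{flag}")
```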
For example, in one embodiment, the text 致以诚挚的问候 ("to extend sincere greetings") may be labeled as "致以/v 诚挚/a 的/ude1 问候/vn" after part-of-speech recognition.
As described in step S3, the text after part-of-speech recognition is input into the preset conditional random field model to obtain the prosody level label of each word of the text; the conditional random field model counts the part-of-speech structure of each word's context and determines the prosody level label of each word according to that structure. For example, the prosody level labels include 0, 1, 2, 3 and 5, where "0", "1", "2" and "3" respectively represent the first, second, third and fourth prosody levels, and "5" represents that no prosody level boundary follows the corresponding word. For the above example "致以/v 诚挚/a 的/ude1 问候/vn", performing prosody level division directly according to the part-of-speech labels would give "致/5 以/1 诚/5 挚/1 的/0 问/5 候/3". It will be appreciated that dividing in this way results in too fine a granularity: as normal speech shows, no prosodic boundary should be inserted after "诚挚" in the above text, i.e. no pause is needed there. Therefore, in this embodiment, the conditional random field model is adopted: based on the part of speech of each word of the text, the words in the context of each word are obtained together with their part-of-speech structure and text structure, and the prosody level label of the text is then predicted according to the prosody level labels that correspond to those part-of-speech structures in the conditional random field model. The finally predicted prosody level labels of the text are "致/5 以/1 诚/5 挚/5 的/0 问/5 候/3".
As described in step S4, prosodic hierarchy division is then performed on the text according to the prosody level label of each word: for each word it is determined whether a pause is needed when the text is converted into speech, and if a pause is needed, the pause level is obtained from the prosody level label.
In one embodiment, if prosody level prediction used only part-of-speech information, a phrase that should be treated as a whole would be split up, because wherever the parts of speech before and after differ (for example a verb followed by a particle), some prosodic boundary would be inserted. Likewise, in sentences where a verb is followed by a modal particle, a prosodic boundary would be placed between the verb and the particle, which makes the synthesized speech insufficiently natural. In this embodiment, this is effectively avoided by additionally introducing the text information and the part-of-speech structure, because in the training data such particles are in most cases connected to what precedes them. Combining the information of the text itself with the part-of-speech information therefore makes the division of the prosodic hierarchy more reasonable.
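By way of illustration only, the following is a minimal sketch of this kind of prediction using a general-purpose CRF library (sklearn-crfsuite); the feature names, the ±1 context window, the character-level toy sample and the label values are assumptions made for the sketch and are not taken from the patent:

```python
# Illustrative sketch: a CRF over token + part-of-speech context features.
# Assumptions: sklearn-crfsuite as the CRF implementation, a +/-1 context window,
# and a character-level toy sample based on the example above; none of this is
# mandated by the patent.
import sklearn_crfsuite


def token_features(sent, i):
    """Features for position i of sent, where sent is a list of (token, pos) pairs."""
    token, pos = sent[i]
    feats = {"token": token, "pos": pos}
    if i > 0:
        feats["-1:token"], feats["-1:pos"] = sent[i - 1]
    else:
        feats["BOS"] = True
    if i < len(sent) - 1:
        feats["+1:token"], feats["+1:pos"] = sent[i + 1]
    else:
        feats["EOS"] = True
    return feats


def sent_features(sent):
    return [token_features(sent, i) for i in range(len(sent))]


# Toy character-level training sample: (character, part of speech of its word).
train_sent = [("致", "v"), ("以", "v"), ("诚", "a"), ("挚", "a"),
              ("的", "ude1"), ("问", "vn"), ("候", "vn")]
train_labels = ["5", "1", "5", "5", "0", "5", "3"]  # illustrative prosody labels

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit([sent_features(train_sent)], [train_labels])

print(crf.predict([sent_features(train_sent)])[0])
```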
In an embodiment, before the step S1 of obtaining the text of the prosody level to be divided, the method includes:
step S10, acquiring a training data set; the training data comprises a plurality of training texts, wherein the training texts carry part of speech of each word in the training texts and prosody level labels of each word;
step S11, inputting the training text into an initial conditional random field model for training to obtain the preset conditional random field model; the initial conditional random field model comprises a feature template, wherein the feature template is used for respectively counting part-of-speech structures and text structures of the context of each word in the training text, and determining model parameters in the feature template according to the part-of-speech structures, the text structures and prosody level labels of each word in a training sample so as to obtain the feature function.
In this embodiment, there are many ways to construct the feature functions of a conditional random field; a common approach is to construct feature templates first and then derive the corresponding feature functions from the training data. Feature functions constructed in this way are simple, have far fewer parameters than neural network methods, and train faster. Once the feature functions are obtained, the prosody level labels of a text can be predicted with the conditional random field model.
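To make this construction concrete, the sketch below expands a single part-of-speech context template over a toy training sample and keeps only the patterns whose occurrence count exceeds n as candidate feature functions; the window size, the threshold n and the toy data are illustrative assumptions rather than the patent's configuration:

```python
# Illustrative sketch: expanding one part-of-speech context template over training
# data and keeping the (pattern, label) pairs that occur more than n times.
# Assumptions: a (-1, 0, +1) window, n = 0 and the toy sample are for illustration.
from collections import Counter


def expand_pos_template(rows, a=1, b=1):
    """rows: list of (token, pos, label); yield ((pos_{i-a..i+b}), label_i) patterns."""
    for i in range(len(rows)):
        window = tuple(
            rows[j][1] if 0 <= j < len(rows) else "PAD"
            for j in range(i - a, i + b + 1)
        )
        yield window, rows[i][2]


rows = [("致", "v", "5"), ("以", "v", "1"), ("诚", "a", "5"), ("挚", "a", "5"),
        ("的", "ude1", "0"), ("问", "vn", "5"), ("候", "vn", "3")]

counts = Counter(expand_pos_template(rows))
n = 0  # occurrence threshold; a real corpus would use a larger value
feature_patterns = {pattern for pattern, c in counts.items() if c > n}
for pattern in sorted(feature_patterns):
    print(pattern)
```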
In one embodiment, referring to FIG. 2, the training text includes three columns:
the first column is the vertical arrangement of each word in the training text, the second column is the part of speech corresponding to each word in the training text, and the third column is the prosody level label corresponding to each word in the training text.
In this embodiment, the feature templates included in the initial conditional random field model are:
(1)
(2)
(3)
where x_{i,1} represents the data in row i, column 2 of the training sample; w_{i-a} represents the corresponding part of speech and m_{i-a} the corresponding text; a and b define the context window; n is a preset hyperparameter; and y_i is the prosody level label corresponding to the word. For a training text x, the element in row i of the j-th column is denoted x_{i,j}.
Formula (1) expresses a function that returns 1 when a certain part-of-speech structure occurs more than n times, and 0 otherwise. For example, when a = 1 and b = 1, the template considers the previous, current and following parts of speech, i.e. three positions in total. If a part-of-speech structure such as "x_{i-1,1} = noun, x_{i,1} = noun, x_{i+1,1} = verb, y_i = #1" occurs more than n times in the training data, template (1) defines a feature function that returns 1 whenever "x_{i-1,1} = noun, x_{i,1} = noun, x_{i+1,1} = verb, y_i = #1" holds, and 0 otherwise. Based on the training text and the feature templates, the corresponding model parameters can then be obtained through training.
Formula (2) is similar to formula (1), except that it considers the structure of the text (the words) itself. Formula (3) captures the influence of the previous prosody level label on the prediction of the current prosody level label.
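The formula images themselves are not reproduced above; one possible form consistent with this description, written as indicator-style feature functions, is sketched below (the notation and exact form are assumptions, not the patent's original formulas):

```latex
% Sketch only: indicator-style feature functions consistent with the description.
% The notation and exact form are assumptions, not the patent's original formulas.
f^{(1)}_{k}(y_i, x) =
\begin{cases}
  1, & \text{if } \big(x_{i-a,1},\dots,x_{i+b,1},\, y_i\big) \text{ equals the $k$-th
        part-of-speech pattern, which occurs more than } n \text{ times in training} \\
  0, & \text{otherwise}
\end{cases}
\qquad
f^{(2)}_{k}(y_i, x):\ \text{the same form over the word context } x_{i-a,0},\dots,x_{i+b,0}
\qquad
f^{(3)}(y_{i-1}, y_i):\ \text{a transition feature on the previous and current prosody labels}
```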
The feature templates not only introduce part-of-speech structural features but also combine the text structural features of the training data. Combining text information with part-of-speech information makes the division of prosody levels more reasonable, and the constructed feature functions make full use of textual context, part-of-speech changes and word segmentation information. Compared with existing neural-network-based methods, this method has much lower time complexity, model complexity and training data requirements.
In an embodiment, after the step S11 of inputting the training text into the initial conditional random field model to perform training to obtain the preset conditional random field model, the method includes:
acquiring test text in the test data; wherein the test text comprises part of speech of each word in the test text;
inputting the test text into the preset conditional random field model to obtain a predicted prosody level label of each word in the test text;
obtaining a correct prosody level tag of the test text, and comparing the predicted prosody level tag with the correct prosody level tag to obtain the prediction accuracy of the preset conditional random field model;
and if the prediction accuracy is higher than a threshold value, determining that the training of the preset conditional random field model is completed.
In this embodiment, the trained conditional random field model further needs to be tested. The process of inputting the test text into the preset conditional random field model is the same as for a training sample, except that the test text carries only the part-of-speech labels and no prosody level labels.
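As an illustration, the evaluation described above could be sketched as follows, assuming the predicted and correct label sequences are available as lists; the helper name, the sample data and the threshold value are assumptions rather than values fixed by the patent:

```python
# Illustrative sketch of the test step: per-word accuracy of predicted prosody
# level labels against the correct labels. The threshold value is an assumption;
# the patent does not fix a concrete number.
def prosody_label_accuracy(predicted, reference):
    """predicted, reference: lists of label sequences, one sequence per test text."""
    total = correct = 0
    for pred_seq, ref_seq in zip(predicted, reference):
        for p, r in zip(pred_seq, ref_seq):
            total += 1
            correct += int(p == r)
    return correct / total if total else 0.0


accuracy = prosody_label_accuracy(
    predicted=[["5", "1", "5", "5", "0", "5", "3"]],
    reference=[["5", "1", "5", "5", "0", "5", "3"]],
)
THRESHOLD = 0.95  # illustrative threshold
print("training complete" if accuracy > THRESHOLD else "continue training")
```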
In an embodiment, the step S10 of acquiring a training data set includes:
acquiring a text sample;
part of speech recognition is carried out on the text sample, and part of speech of each word in the text sample is obtained;
acquiring prosody level labels of each word in the text sample;
constructing each training text based on the text sample, the part of speech of each word in the text sample and the prosody level label of each word in the text sample;
the training data set is obtained based on a plurality of training texts.
In this embodiment, when the training text is constructed, three columns of data are required, where the first column is a vertical arrangement of each word in the training text, the second column is a part of speech corresponding to each word in the training text, and the third column is a prosodic hierarchical label corresponding to each word in the training text.
For example, for the training text 致以诚挚的问候 ("to extend sincere greetings"), the part-of-speech recognition result is "致以/v 诚挚/a 的/ude1 问候/vn". When the training text is constructed, this part-of-speech annotation fills the second column, and the corresponding prosody level labels, "致/5 以/1 诚/5 挚/5 的/0 问/5 候/3", fill the third column.
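A minimal sketch of assembling the three-column training rows for this example is shown below; the character-level layout, the helper name and the concrete labels follow the example above and are illustrative assumptions:

```python
# Illustrative sketch: building three-column training rows (token, part of speech,
# prosody level label) for one sentence. The character-level layout and labels
# follow the example above and are assumptions for illustration.
def build_training_rows(tagged_words, labels):
    """tagged_words: [(word, pos), ...]; labels: one prosody label per character."""
    chars = [(ch, pos) for word, pos in tagged_words for ch in word]
    assert len(chars) == len(labels), "one label is expected per character"
    return [f"{ch}\t{pos}\t{label}" for (ch, pos), label in zip(chars, labels)]


rows = build_training_rows(
    [("致以", "v"), ("诚挚", "a"), ("的", "ude1"), ("问候", "vn")],
    ["5", "1", "5", "5", "0", "5", "3"],
)
print("\n".join(rows))
```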
In an embodiment, the method further comprises:
and storing the preset conditional random field model in a blockchain. The blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, an encryption algorithm and the like. The Blockchain (Blockchain), which is essentially a decentralised database, is a string of data blocks that are generated by cryptographic means in association, each data block containing a batch of information of network transactions for verifying the validity of the information (anti-counterfeiting) and generating the next block. The blockchain may include a blockchain underlying platform, a platform product services layer, and an application services layer.
Referring to fig. 3, in an embodiment of the present application, there is further provided a prosody hierarchy dividing device, including:
a first acquisition unit 10 for acquiring a text of a prosody level to be divided;
the recognition unit 20 is configured to perform part-of-speech recognition on the text, so as to obtain a part-of-speech of each word in the text;
a tag obtaining unit 30, configured to input the text after part-of-speech recognition into a preset conditional random field model, and obtain a prosody level tag of each word of the text; the conditional random field model comprises a characteristic function, wherein the characteristic function is used for respectively counting part-of-speech structures and text structures of the context of each word, and determining prosody level labels of each word according to the part-of-speech structures and the text structures;
a dividing unit 40 for performing prosodic hierarchy division on the text according to prosodic hierarchy tags of each word in the text.
In an embodiment, the apparatus further includes:
a second acquisition unit configured to acquire a training data set; the training data comprises a plurality of training texts, wherein the training texts carry part of speech of each word in the training texts and prosody level labels of each word;
inputting the training text into an initial conditional random field model for training to obtain the preset conditional random field model; the initial conditional random field model comprises a feature template, wherein the feature template is used for respectively counting part-of-speech structures and text structures of the context of each word in the training text, and determining model parameters in the feature template according to the part-of-speech structures, the text structures and prosody level labels of each word in a training sample so as to obtain the feature function.
In one embodiment, the training text includes three columns:
the first column is the vertical arrangement of each word in the training text, the second column is the part of speech corresponding to each word in the training text, and the third column is the prosody level label corresponding to each word in the training text.
In an embodiment, the feature templates included in the initial conditional random field model are:
(1)
(2)
(3)
where x_{i,1} represents the data in row i, column 2 of the training sample; w_{i-a} represents the corresponding part of speech and m_{i-a} the corresponding text; a and b define the context window; n is a preset hyperparameter; and y_i is the prosody level label corresponding to the text.
In an embodiment, the apparatus further includes:
the third acquisition unit is used for acquiring test texts in the test data; wherein the test text comprises part of speech of each word in the test text;
the prediction unit is used for inputting the test text into the preset conditional random field model to obtain a predicted prosody level label of each word in the test text;
the comparison unit is used for obtaining a correct prosody level label of the test text and comparing the predicted prosody level label with the correct prosody level label to obtain the prediction accuracy of the preset conditional random field model;
and the determining unit is used for determining that the training of the preset conditional random field model is completed if the prediction accuracy is higher than a threshold value.
In an embodiment, the second obtaining unit is specifically configured to:
acquiring a text sample;
part of speech recognition is carried out on the text sample, and part of speech of each word in the text sample is obtained;
acquiring prosody level labels of each word in the text sample;
constructing each training text based on the text sample, the part of speech of each word in the text sample and the prosody level label of each word in the text sample;
the training data set is obtained based on a plurality of training texts.
In an embodiment, the apparatus further includes:
and the storage unit is used for storing the preset conditional random field model in a block chain.
In this embodiment, for specific implementation of each unit in the above embodiment of the apparatus, please refer to the description in the above embodiment of the method, and no further description is given here.
Referring to fig. 4, an embodiment of the present application further provides a computer device, which may be a server; its internal structure may be as shown in fig. 4. The computer device includes a processor, a memory, a network interface and a database connected by a system bus, where the processor is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used to store the conditional random field model and the like. The network interface of the computer device is used to communicate with an external terminal through a network connection. The computer program, when executed by the processor, implements a prosody hierarchy dividing method.
Those skilled in the art will appreciate that the architecture shown in fig. 4 is merely a block diagram of part of the structure related to the present application and does not limit the computer device to which the present application is applied.
An embodiment of the present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a prosody hierarchy dividing method. It is understood that the computer-readable storage medium in this embodiment may be a volatile readable storage medium or a non-volatile readable storage medium.
In summary, in the prosody hierarchy dividing method, device, computer equipment and storage medium provided in the embodiments of the present application, a text whose prosody hierarchy is to be divided is acquired; part-of-speech recognition is performed on the text to obtain the part of speech of each word in the text; the text after part-of-speech recognition is input into a preset conditional random field model to obtain a prosody level label for each word of the text; and the feature functions included in the conditional random field model count the part-of-speech structure of each word's context and determine the prosody level label of each word accordingly. By adopting a conditional random field model and combining the part of speech of each word in the text with the part-of-speech structure of its context, the method avoids the overly fine granularity that currently results when prosody hierarchy division considers parts of speech alone.
Those skilled in the art will appreciate that all or part of the methods described above may be implemented by a computer program stored on a non-volatile computer-readable storage medium, which, when executed, may include the procedures of the embodiments of the methods described above. Any reference to memory, storage, a database or another medium provided herein and used in the embodiments may include non-volatile and/or volatile memory. The non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. The volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM).
It should be noted that, in this document, the terms "comprise", "comprising" and any variations thereof are intended to cover a non-exclusive inclusion, so that a process, apparatus, article or method that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, apparatus, article or method. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, apparatus, article or method that comprises that element.
The foregoing description is only of the preferred embodiments of the present application and is not intended to limit the scope of the claims, and all equivalent structures or equivalent processes using the descriptions and drawings of the present application, or direct or indirect application in other related technical fields are included in the scope of the claims of the present application.

Claims (7)

1. A prosody hierarchy dividing method, characterized by comprising the steps of:
acquiring a text of a prosody level to be divided;
part of speech recognition is carried out on the text, and part of speech of each word in the text is obtained;
inputting the text after part-of-speech recognition into a preset conditional random field model to obtain prosody level labels of each word of the text; the conditional random field model comprises a characteristic function, wherein the characteristic function is used for respectively counting part-of-speech structures and text structures of the context of each word, and determining prosody level labels of each word according to the part-of-speech structures and the text structures;
performing prosodic hierarchy division on the text according to prosodic hierarchy tags of each word in the text;
before the step of obtaining the text of the prosody level to be divided, the method comprises the following steps:
acquiring a training data set; the training data comprises a plurality of training texts, wherein the training texts carry part of speech of each word in the training texts and prosody level labels of each word;
inputting the training text into an initial conditional random field model for training to obtain the preset conditional random field model; the initial conditional random field model comprises a characteristic template, wherein the characteristic template is used for respectively counting part-of-speech structures and text structures of the context of each word in the training text, and determining model parameters in the characteristic template according to the part-of-speech structures, the text structures and prosody level labels of each word in a training sample so as to obtain the characteristic function;
the training text includes three columns:
the first column is the vertical arrangement of each word in the training text, the second column is the part of speech corresponding to each word in the training text, and the third column is the prosodic hierarchy label corresponding to each word in the training text;
the feature templates included in the initial conditional random field model are:
(1)
(2)
(3)
where x_{i,1} represents the data in row i, column 2 of the training sample; w_{i-a} represents the corresponding part of speech and m_{i-a} the corresponding text; a and b define the context window; n is a preset hyperparameter; and y_i is the prosody level label corresponding to the text.
2. The prosody level division method according to claim 1, wherein after the step of inputting the training text into an initial conditional random field model for training to obtain the preset conditional random field model, the method comprises:
acquiring test text in the test data; wherein the test text comprises part of speech of each word in the test text;
inputting the test text into the preset conditional random field model to obtain a predicted prosody level label of each word in the test text;
obtaining a correct prosody level tag of the test text, and comparing the predicted prosody level tag with the correct prosody level tag to obtain the prediction accuracy of the preset conditional random field model;
and if the prediction accuracy is higher than a threshold value, determining that the training of the preset conditional random field model is completed.
3. The prosodic hierarchy dividing method according to claim 1, wherein the step of acquiring a training data set comprises:
acquiring a text sample;
part of speech recognition is carried out on the text sample, and part of speech of each word in the text sample is obtained;
acquiring prosody level labels of each word in the text sample;
constructing each training text based on the text sample, the part of speech of each word in the text sample and the prosody level label of each word in the text sample;
the training data set is obtained based on a plurality of training texts.
4. The prosody level dividing method according to claim 1, further comprising:
and storing the preset conditional random field model in a blockchain.
5. A prosodic hierarchy dividing device employing the method according to any one of claims 1 to 4, comprising:
a first acquisition unit configured to acquire a text of a prosody level to be divided;
the recognition unit is used for recognizing the part of speech of the text to obtain the part of speech of each word in the text;
the tag obtaining unit is used for inputting the text subjected to part-of-speech recognition into a preset conditional random field model to obtain prosody level tags of each word of the text; the conditional random field model comprises a characteristic function, wherein the characteristic function is used for respectively counting part-of-speech structures and text structures of the context of each word, and determining prosody level labels of each word according to the part-of-speech structures and the text structures;
a dividing unit for performing prosodic hierarchy division on the text according to prosodic hierarchy tags of each word in the text;
before the step of obtaining the text of the prosody level to be divided, the method comprises the following steps:
acquiring a training data set; the training data comprises a plurality of training texts, wherein the training texts carry part of speech of each word in the training texts and prosody level labels of each word;
inputting the training text into an initial conditional random field model for training to obtain the preset conditional random field model; the initial conditional random field model comprises a characteristic template, wherein the characteristic template is used for respectively counting part-of-speech structures and text structures of the context of each word in the training text, and determining model parameters in the characteristic template according to the part-of-speech structures, the text structures and prosody level labels of each word in a training sample so as to obtain the characteristic function;
the training text includes three columns:
the first column is the vertical arrangement of each word in the training text, the second column is the part of speech corresponding to each word in the training text, and the third column is the prosodic hierarchy label corresponding to each word in the training text;
the feature templates included in the initial conditional random field model are:
(1)
(2)
(3)
where x_{i,1} represents the data in row i, column 2 of the training sample; w_{i-a} represents the corresponding part of speech and m_{i-a} the corresponding text; a and b define the context window; n is a preset hyperparameter; and y_i is the prosody level label corresponding to the text.
6. A computer device comprising a memory and a processor, the memory having stored therein a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 4.
7. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 4.
CN202011339547.6A 2020-11-25 2020-11-25 Prosody hierarchy dividing method, prosody hierarchy dividing device, computer device and storage medium Active CN112463921B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011339547.6A CN112463921B (en) 2020-11-25 2020-11-25 Prosody hierarchy dividing method, prosody hierarchy dividing device, computer device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011339547.6A CN112463921B (en) 2020-11-25 2020-11-25 Prosody hierarchy dividing method, prosody hierarchy dividing device, computer device and storage medium

Publications (2)

Publication Number Publication Date
CN112463921A CN112463921A (en) 2021-03-09
CN112463921B true CN112463921B (en) 2024-03-19

Family

ID=74807889

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011339547.6A Active CN112463921B (en) 2020-11-25 2020-11-25 Prosody hierarchy dividing method, prosody hierarchy dividing device, computer device and storage medium

Country Status (1)

Country Link
CN (1) CN112463921B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023102931A1 (en) * 2021-12-10 2023-06-15 广州虎牙科技有限公司 Method for predicting prosodic structure, and electronic device, program product and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110534087A (en) * 2019-09-04 2019-12-03 清华大学深圳研究生院 A kind of text prosody hierarchy Structure Prediction Methods, device, equipment and storage medium
CN110782870A (en) * 2019-09-06 2020-02-11 腾讯科技(深圳)有限公司 Speech synthesis method, speech synthesis device, electronic equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7263488B2 (en) * 2000-12-04 2007-08-28 Microsoft Corporation Method and apparatus for identifying prosodic word boundaries

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110534087A (en) * 2019-09-04 2019-12-03 清华大学深圳研究生院 A kind of text prosody hierarchy Structure Prediction Methods, device, equipment and storage medium
CN110782870A (en) * 2019-09-06 2020-02-11 腾讯科技(深圳)有限公司 Speech synthesis method, speech synthesis device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Prosodic phrase recognition based on syntactic dependency and conditional random fields; Qian Yili et al.; Journal of Tsinghua University (Science and Technology); Vol. 59, No. 7; pp. 530-536 *

Also Published As

Publication number Publication date
CN112463921A (en) 2021-03-09

Similar Documents

Publication Publication Date Title
CN105244020B (en) Prosodic hierarchy model training method, text-to-speech method and text-to-speech device
CN111145718B (en) Chinese mandarin character-voice conversion method based on self-attention mechanism
CN109461437B (en) Verification content generation method and related device for lip language identification
US20220020355A1 (en) Neural text-to-speech synthesis with multi-level text information
CN113297366B (en) Emotion recognition model training method, device, equipment and medium for multi-round dialogue
CN112562640B (en) Multilingual speech recognition method, device, system, and computer-readable storage medium
CN111369974A (en) Dialect pronunciation labeling method, language identification method and related device
WO2021134591A1 (en) Speech synthesis method, speech synthesis apparatus, smart terminal and storage medium
CN116778967B (en) Multi-mode emotion recognition method and device based on pre-training model
KR20170090127A (en) Apparatus for comprehending speech
CN115116428B (en) Prosodic boundary labeling method, device, equipment, medium and program product
CN111223476A (en) Method and device for extracting voice feature vector, computer equipment and storage medium
Ali Multi-dialect Arabic broadcast speech recognition
CN111370001B (en) Pronunciation correction method, intelligent terminal and storage medium
CN112463921B (en) Prosody hierarchy dividing method, prosody hierarchy dividing device, computer device and storage medium
US20140249800A1 (en) Language processing method and electronic device
CN116597809A (en) Multi-tone word disambiguation method, device, electronic equipment and readable storage medium
CN116665639A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN112464649A (en) Pinyin conversion method and device for polyphone, computer equipment and storage medium
CN113053409B (en) Audio evaluation method and device
Park et al. Jejueo datasets for machine translation and speech synthesis
CN112735379B (en) Speech synthesis method, device, electronic equipment and readable storage medium
CN115099222A (en) Punctuation mark misuse detection and correction method, device, equipment and storage medium
CN114492382A (en) Character extraction method, text reading method, dialog text generation method, device, equipment and storage medium
CN113012685B (en) Audio recognition method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant