CN111078937B - Voice information retrieval method, device, equipment and computer readable storage medium


Info

Publication number
CN111078937B
Authority
CN
China
Prior art keywords
text
candidate
voice information
retrieved
feature
Prior art date
Legal status
Active
Application number
CN201911374503.4A
Other languages
Chinese (zh)
Other versions
CN111078937A (en)
Inventor
郭立钊
谷悦
杨嵩
刘子韬
Current Assignee
Beijing Century TAL Education Technology Co Ltd
Original Assignee
Beijing Century TAL Education Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Century TAL Education Technology Co Ltd filed Critical Beijing Century TAL Education Technology Co Ltd
Priority to CN201911374503.4A priority Critical patent/CN111078937B/en
Publication of CN111078937A publication Critical patent/CN111078937A/en
Application granted granted Critical
Publication of CN111078937B publication Critical patent/CN111078937B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/68Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/683Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/685Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using automatically derived transcript of audio data, e.g. lyrics

Abstract

The application provides a voice information retrieval method, device, equipment and computer-readable storage medium. The method comprises the following steps: acquiring a candidate text set of the voice information to be retrieved, the candidate text set comprising a plurality of candidate texts; retrieving, among the plurality of candidate texts, a first candidate text matching a target text; acquiring a first prosodic feature of the first candidate text in the voice information to be retrieved; acquiring an intrinsic prosodic feature of the target text; and determining whether the target text exists in the voice information to be retrieved according to the first prosodic feature of the first candidate text and the intrinsic prosodic feature of the target text. The method and device can reduce the misjudgment rate of voice information retrieval.

Description

Voice information retrieval method, device, equipment and computer readable storage medium
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a method, an apparatus, a device, and a computer-readable storage medium for retrieving speech information.
Background
With the rapid development of the mobile internet and hardware devices, the accumulation of voice information has shown explosive growth. How to accurately find the content a user cares about within voice information has become a research focus. With the development of deep neural network technology, keyword search (KWS) technology is widely applied to the retrieval of voice information. The purpose of voice information retrieval is to recognize specific text in continuous speech, for example to search voice information for keywords that interest the user.
At present, commonly used speech keyword retrieval technology generally recognizes a speech file to obtain its corresponding text content and the occurrence time of each word or phrase in that text, then searches the text content for the target keyword the user is interested in, and the system returns the found keyword together with its start and end times.
However, existing speech recognition models, especially models trained in noisy environments or with little training data, have limited recognition accuracy and often misrecognize voice information. As a result, keyword retrieval over voice information frequently produces misjudgments, which brings much inconvenience to practical applications.
Disclosure of Invention
The embodiments of the present application provide a voice information retrieval method, apparatus, device and computer-readable storage medium to solve the above problems in the related art. The technical solution is as follows:
in a first aspect, an embodiment of the present application provides a method for retrieving voice information, including:
acquiring a candidate text set of voice information to be retrieved, wherein the candidate text set comprises a plurality of candidate texts;
retrieving, among the plurality of candidate texts, a first candidate text matching a target text;
acquiring a first prosodic feature of the first candidate text in the voice information to be retrieved;
acquiring an intrinsic prosodic feature of the target text;
and determining whether the target text exists in the voice information to be retrieved according to the first prosodic feature of the first candidate text and the intrinsic prosodic feature of the target text.
In one embodiment, the plurality of candidate texts have corresponding confidence levels;
retrieving a first candidate text matching the target text among the plurality of candidate texts comprises:
when candidate texts identical to the target text are retrieved from the plurality of candidate texts, selecting, from the retrieved candidate texts, a candidate text whose confidence is greater than or equal to a preset threshold as the first candidate text.
In one embodiment, acquiring the intrinsic prosodic feature of the target text comprises:
acquiring a prosodic feature curve corresponding to the target text from a pre-stored prosodic feature curve template library as the intrinsic prosodic feature of the target text, wherein
the prosodic feature curve template library comprises a plurality of texts and a set of prosodic feature curves corresponding to each text, each set of prosodic feature curves comprises one prosodic feature curve for each of a plurality of position areas, and the plurality of position areas are the position areas in which the single text corresponding to the set of prosodic feature curves is located in different voice information.
In one embodiment, acquiring the prosodic feature curve corresponding to the target text from the pre-stored prosodic feature curve template library as the intrinsic prosodic feature of the target text includes:
determining the first position area in which the first candidate text is located in the voice information to be retrieved according to the moment at which the first candidate text appears in the voice information to be retrieved;
and determining the set of prosodic feature curves corresponding to the target text in the prosodic feature curve template library, and selecting, from the set of prosodic feature curves, the prosodic feature curve corresponding to the first position area as the intrinsic prosodic feature of the target text.
In one embodiment, obtaining a first prosodic feature of a first candidate text in the speech information to be retrieved includes:
determining the duration of a voice segment corresponding to the first candidate text in the voice information to be retrieved according to the starting time and the ending time of the first candidate text in the voice information to be retrieved;
acquiring the duration corresponding to the intrinsic prosodic feature;
processing the voice segment so that its duration equals the duration corresponding to the intrinsic prosodic feature;
and performing curve fitting on the prosodic features of the processed voice segment, and taking the fitted prosodic feature curve as the first prosodic feature of the first candidate text in the voice information to be retrieved.
In one embodiment, determining whether the target text exists in the voice information to be retrieved according to the first prosodic feature of the first candidate text and the intrinsic prosodic feature of the target text comprises:
determining a correlation coefficient between the first prosodic feature of the first candidate text and the intrinsic prosodic feature of the target text;
and if the correlation coefficient is larger than or equal to a preset threshold value, determining that the target text exists in the voice information to be retrieved.
In one embodiment, the plurality of candidate texts have corresponding confidence levels;
determining whether the target text exists in the voice information to be retrieved according to the first prosodic feature of the first candidate text and the intrinsic prosodic feature of the target text comprises:
determining a correlation coefficient between the first prosodic feature of the first candidate text and the intrinsic prosodic feature of the target text;
according to a preset confidence weight and a preset correlation coefficient weight, performing a weighted summation of the confidence and the correlation coefficient of the first candidate text to obtain the score of the first candidate text;
and if the score of the first candidate text is greater than or equal to a preset threshold value, determining that the target text exists in the voice information to be retrieved.
In one embodiment, the first prosodic feature comprises a first fundamental frequency curve and/or a first voiced probability curve; the intrinsic prosodic features include an intrinsic fundamental frequency curve and/or an intrinsic voiced probability curve.
In a second aspect, an embodiment of the present application provides a speech information retrieval apparatus, including:
the recognition module is used for acquiring a candidate text set of the voice information to be retrieved, wherein the candidate text set comprises a plurality of candidate texts;
the retrieval module is used for retrieving a first candidate text matched with the target text from the candidate texts;
the first acquisition module is used for acquiring first prosodic features of the first candidate text in the voice information to be retrieved;
the second acquisition module is used for acquiring the intrinsic prosodic feature of the target text;
and the determining module is used for determining whether the target text exists in the voice information to be retrieved according to the first prosodic feature of the first candidate text and the intrinsic prosodic feature of the target text.
In a third aspect, an embodiment of the present application provides a voice information retrieval device, including a memory and a processor that communicate with each other via an internal connection path. The memory is configured to store instructions, and the processor is configured to execute the instructions stored in the memory so as to perform the method of any one of the above aspects.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium, which stores a computer program, and when the computer program runs on a computer, the method in any one of the above-mentioned aspects is executed.
The advantages or beneficial effects of the above technical solution include at least the following:
According to the voice information retrieval scheme, a candidate text set of the voice information to be retrieved is acquired; after a first candidate text matching the target text is retrieved from the candidate text set, the first prosodic feature of the first candidate text in the voice information to be retrieved and the intrinsic prosodic feature of the target text are acquired, and whether the target text exists in the voice information to be retrieved is determined from these two features. The embodiments of the present application exploit the observation that voices corresponding to the same text should have similar prosody: by introducing prosodic features, retrieval results with large prosodic differences are excluded, which reduces the misjudgment rate of voice information retrieval.
The foregoing summary is provided for the purpose of description only and is not intended to be limiting in any way. In addition to the illustrative aspects, embodiments, and features described above, further aspects, embodiments, and features of the present application will be readily apparent by reference to the drawings and following detailed description.
Drawings
In the drawings, like reference numerals refer to the same or similar parts or elements throughout the several views unless otherwise specified. The figures are not necessarily to scale. It is appreciated that these drawings depict only some embodiments in accordance with the disclosure and are therefore not to be considered limiting of its scope.
FIG. 1 is a word graph diagram of an exemplary speech to be retrieved;
fig. 2 is a flowchart of a voice information retrieval method according to an embodiment of the present application;
fig. 3 is a flowchart of a voice information retrieval method according to an embodiment of the present application;
fig. 4 is a flowchart of a voice information retrieval method according to an embodiment of the present application;
fig. 5 is a block diagram of a voice information retrieval apparatus according to an embodiment of the present application;
fig. 6 is a block diagram of a voice information retrieval apparatus according to an embodiment of the present application;
fig. 7 is a block diagram of a voice information retrieval apparatus according to an embodiment of the present application;
fig. 8 is a block diagram of a device according to an embodiment of the present application.
Detailed Description
In the following, only certain exemplary embodiments are briefly described. As those skilled in the art will recognize, the described embodiments may be modified in various different ways, all without departing from the spirit or scope of the present application. Accordingly, the drawings and description are to be regarded as illustrative in nature, and not as restrictive.
The technology of retrieving voice information to recognize specific text in it can be widely applied in daily life. For example, by recording in a teaching environment and identifying sensitive words in the recorded voice information, one can judge whether the instructor used sensitive words during teaching. For another example, by recording during a ride-hailing trip and identifying specific keywords in the recorded voice information, one can judge whether the passenger's personal safety is at risk during the trip.
In this field, voice information is retrieved using speech keyword retrieval technology, which may include speech recognition and keyword retrieval. In the speech recognition stage, a trained speech recognition model can be used to recognize the voice information. At each recognition node, the reliability of the recognition result is measured by a confidence score, several candidate words with the highest confidence are retained, and recognition proceeds step by step with these candidate words as the next recognition nodes until the whole voice information is recognized, yielding a word graph (lattice) corresponding to the voice information. The word graph may include candidate words on a plurality of candidate paths, each candidate word having a corresponding confidence. The word graph may also include the start time of each candidate word in the voice information.
As an example, fig. 1 shows the word graph of a piece of voice information whose content is "give me some sound". The word graph comprises a plurality of candidate paths, shown as the lines connecting the black dots in the figure. Each candidate path includes a number of candidate words w; for example, the path from the leftmost black dot to the rightmost black dot along the top line includes "!NULL (silence marker)", "<enter>", "give me", "some", "sound" and "!NULL (silence marker)". As shown in fig. 1, the candidate recognition results of the speech include "give me some sound", "roadside accident", "roadside business" and the like. The word graph also includes the start time t of each candidate word in the voice information, in seconds (s).
In the keyword retrieval stage, given target keywords, texts identical to the target keywords are quickly retrieved on the word graph, and the time information and confidence of the target keywords in the voice information are obtained. However, the word graph includes a plurality of candidate paths, and not all candidate words reflect the actual speech content; moreover, existing speech recognition models are limited in accuracy. Therefore, the confidence of a target keyword retrieved from the word graph may be low, which easily causes misjudgment. For example, if the target keyword is "business" and the word graph of the speech "give me some sound" includes the candidate word "business" (in Chinese, "sound" 声音 and "business" 生意 are near-homophones), the conclusion is drawn that the target keyword "business" was retrieved in the speech "give me some sound", although the target text does not actually exist in the speech, resulting in a misjudgment. The limited accuracy of the speech recognition model, the pronunciation differences between speakers and the noise in the recording environment can all cause misjudgments in voice information retrieval.
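To make the structure concrete, the following is a minimal Python sketch of such a word graph and of keyword lookup over it; the LatticeWord dataclass, its field names and the toy confidence values are illustrative assumptions, not part of the patented embodiment:

```python
from dataclasses import dataclass

@dataclass
class LatticeWord:
    """One candidate word on a word graph (lattice) path."""
    text: str          # recognized text, e.g. "business"
    start: float       # start time of the word in the audio, in seconds
    end: float         # end time of the word in the audio, in seconds
    confidence: float  # score the recognizer assigned to this word

def search_lattice(lattice, target_text):
    """Return every candidate word whose text equals the target keyword,
    together with its time information and confidence."""
    return [w for w in lattice if w.text == target_text]

# Toy fragment of the Fig. 1 word graph for "give me some sound":
lattice = [
    LatticeWord("sound",    0.97, 0.98, 0.82),
    LatticeWord("business", 0.97, 0.98, 0.41),  # competing near-homophone path
]
hits = search_lattice(lattice, "business")  # a hit, yet a likely misjudgment
```

A plain text match of this kind is exactly the step that the prosodic verification described below is meant to double-check.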
The embodiment of the application provides a voice information retrieval method, and fig. 2 shows a flow chart of the method. As shown in fig. 2, the method includes:
step S101, a candidate text set of the voice information to be retrieved is obtained, and the candidate text set comprises a plurality of candidate texts.
In the embodiments of the present application, the voice information may be an audio file carrying sound, such as music or people speaking. The voice information may include a waveform signal varying with time, from which time-varying acoustic features such as fundamental frequency, amplitude, harmonic ratio and voiced probability can be extracted; fitting these acoustic features over time yields characteristic curves.
The voice information to be retrieved may include voice information that needs to be confirmed whether a specific text exists, for example, a classroom teaching recording that needs to be confirmed whether a sensitive vocabulary exists.
Illustratively, the candidate text set of the voice information to be retrieved may be generated using a speech recognition model, which may include an acoustic model and/or a language model. The voice information to be retrieved is input into the speech recognition model, which extracts speech recognition features such as filter bank (FBank) features or Mel-frequency cepstral coefficient (MFCC) features. The model then recognizes the content of the voice information step by step using a search algorithm, such as the Viterbi algorithm, to obtain the candidate text set, which may include a plurality of candidate texts. A candidate text may be a short sentence, a word and/or a single character.
The candidate text set may be a word graph. For example, if the voice information to be retrieved is "give me some sound" and is input into a speech recognition model, the word graph shown in fig. 1 may be generated, containing candidate texts such as "give me", "some", "sound", "me", "roadside", "thing", "nothing", "change", "business", "other" and the like. In some embodiments, the candidate text set may also be a set of candidate texts in other forms.
And step S102, searching a first candidate text matched with the target text in the candidate texts.
The target text may include some specific text preset according to the retrieval requirement. For example, it is necessary to confirm whether the instructor utters some sensitive words in the classroom instruction recording, and the target text may be set to include the sensitive words.
Illustratively, the candidate text set may be traversed, using the target text as the index, to search for a first candidate text matching the target text. The first candidate text matching the target text may be a candidate text identical to the target text. For example, if the target text is "sound", searching the word graph shown in fig. 1 finds the text "sound" at the upper right of the figure as the first candidate text.
As described above, due to pronunciation differences, environmental noise and limited recognition accuracy, the speech segment corresponding to a candidate text identical to the target text may not actually express the target text. For example, when the target text "business" is retrieved in the speech to be retrieved "give me some sound", the candidate text "business" appearing at t = 0.97 s is found; however, the speech segment at t = 0.97 s actually expresses "sound". In some embodiments, filtering conditions may therefore be set: the first candidate text matching the target text must be identical to the target text and also satisfy the filtering conditions, so as to exclude candidate texts that are identical to the target text but whose corresponding speech segments do not express it.
For example, each candidate text in the word graph has a corresponding confidence. The "confidence" described herein can be understood as the score the speech recognition model assigns to each text in the recognition result; the higher the score, the higher the confidence and the more likely the recognition is accurate. In step S102, the filtering condition may be that a candidate text's confidence is greater than or equal to a preset threshold, and a candidate text that is identical to the target text and whose confidence reaches the threshold is taken as the first candidate text, which can improve retrieval accuracy to some extent.
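A sketch of this confidence filter, continuing the toy lattice above (the 0.5 threshold is an arbitrary illustrative value, not one prescribed by the embodiment):

```python
def filter_by_confidence(matches, threshold=0.5):
    """Keep only matched candidate words whose confidence reaches the
    preset threshold (the filtering condition of step S102)."""
    return [w for w in matches if w.confidence >= threshold]

# Continuing the toy lattice above: "business" (confidence 0.41) is dropped.
# first_candidates = filter_by_confidence(hits)
```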
Step S103, acquiring a first prosodic feature of the first candidate text in the voice information to be retrieved.
Prosodic features carry the prosodic information of a person's speech, such as intonation and pitch. Typical features of voice information include the fundamental frequency and the voiced probability: the fundamental frequency represents pitch, i.e. the commonly spoken tone, while the voiced probability reflects the cadence, the rise and fall, of speech. Thus, the prosodic features of voice information may include the fundamental frequency and/or the voiced probability. In step S103, the fundamental frequency and/or the voiced probability of the first candidate text at each time point in the voice information to be retrieved may be used directly as the first prosodic feature; alternatively, the fundamental frequency and/or the voiced probability may each be fitted as a curve over time, yielding a first fundamental frequency curve and/or a first voiced probability curve as the first prosodic feature. For example, as shown in fig. 1, when the target text "business" is retrieved in the speech to be retrieved "give me some sound", the word "business" in the word graph is the first candidate text, its position in the voice information to be retrieved is from t = 0.97 s to t = 0.98 s, and the first prosodic feature may be the fundamental frequency, or the fundamental frequency curve, between t = 0.97 s and t = 0.98 s in the speech to be retrieved.
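As a sketch of step S103, the segment's fundamental frequency and voiced probability tracks could be extracted as follows; the use of the librosa library and its pyin pitch tracker is an assumption made for illustration, since the embodiment does not prescribe an extraction tool:

```python
import librosa
import numpy as np

def first_prosodic_feature(wav_path, start, end, sr=16000):
    """Extract the fundamental frequency (f0) and voiced probability tracks
    of the speech segment [start, end] of the speech to be retrieved."""
    y, sr = librosa.load(wav_path, sr=sr, offset=start, duration=end - start)
    f0, voiced_flag, voiced_prob = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr)
    f0 = np.nan_to_num(f0)  # pyin marks unvoiced frames as NaN
    return f0, voiced_prob
```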
For the same text, the prosodic features of different speakers' voices are similar. For example, whether an adult or a child reads the word "future" (未来), the tone combination is "fourth tone + second tone". Considering that voices corresponding to the same text should have similar prosody, the embodiments of the present application can exclude wrong retrieval results by exploiting prosodic differences, thereby reducing the misjudgment rate of voice retrieval. For example, the first prosodic feature of the first candidate text "business" in the voice information to be retrieved is in fact the prosodic feature of the speech segment "sound"; in the subsequent steps, the prosodic difference between "sound" and "business" can be used to reduce the misjudgment rate of voice retrieval.
And step S104, acquiring the intrinsic prosodic features of the target text.
The intrinsic prosodic feature is the prosodic feature that speech corresponding to a text should have; it represents the prosody with which most people would utter the text in a common context. For example, a plurality of texts such as "sound", "business" and "thing" may be prepared in advance, voice information recorded for each text, and prosodic features extracted from it and processed into the intrinsic prosodic features of those texts.
To make the intrinsic prosodic feature of a single text approximate the prosodic features of the corresponding speech in most cases, multiple utterances of the text by different people, such as adults and children, in different contexts may be pre-recorded. The text may also be placed in different sentences or paragraphs; multiple recordings of these sentences or paragraphs spoken by different people in different contexts are made, and the utterances of the text are extracted from them. For example, "sound" may be placed in the sentences "give me some sound", "her sound is really nice" and "where does the sound come from"; recordings of these three sentences spoken by different people in different contexts are made, and the utterances of "sound" are extracted from them. Prosodic features are extracted from the multiple utterances of the text and then processed to obtain its intrinsic prosodic feature. For example, the mean or mode of the prosodic features of the multiple utterances may be taken as the intrinsic prosodic feature; alternatively, a prosodic feature curve may be fitted to the prosodic features of the multiple utterances and used as the intrinsic prosodic feature of the text.
As an example, for each of the plurality of texts, a prosodic feature curve is fitted from the prosodic features of its multiple utterances, giving the intrinsic prosodic feature of each text, and a prosodic feature curve template library containing the prosodic feature curves of the plurality of texts may be established. Step S104, acquiring the intrinsic prosodic feature of the target text, may then include:
acquiring the prosodic feature curve corresponding to the target text from the pre-stored prosodic feature curve template library as the intrinsic prosodic feature of the target text. The prosodic feature curve template library may include a plurality of texts and one or more prosodic feature curves corresponding to each text.
In each of the above examples, a fundamental frequency curve can be fitted from the fundamental frequency of the speech, and a voiced probability curve can be fitted from its voiced probability. The intrinsic prosodic feature of a text may include an intrinsic fundamental frequency curve and/or an intrinsic voiced probability curve.
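The template library construction could be sketched as follows; averaging resampled per-recording tracks is one of the processing options named above (the mean of the prosodic features), and the scipy resampling call and the dictionary layout are illustrative assumptions:

```python
import numpy as np
from scipy.signal import resample

def intrinsic_curve(tracks):
    """Average the prosodic tracks of many recordings of one text into a
    single intrinsic feature curve, after resampling to a common length."""
    n = min(len(t) for t in tracks)
    return np.stack([resample(np.asarray(t, float), n) for t in tracks]).mean(axis=0)

# template_library maps text -> position area -> intrinsic curve, e.g.:
# template_library["sound"]["end"] = intrinsic_curve(f0_tracks_at_end_position)
```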
Step S105, determining whether the target text exists in the voice information to be retrieved according to the first prosodic feature of the first candidate text and the intrinsic prosodic feature of the target text.
Illustratively, the first prosodic feature of the first candidate text in the voice information to be retrieved may be compared with the intrinsic prosodic feature of the target text. Based on their similarity or correlation, it is judged whether the speech segment corresponding to the first candidate text actually utters the target text, and hence whether the target text exists in the voice information to be retrieved. For example, the first prosodic feature of the first candidate text "business" in the voice information to be retrieved is in fact the prosodic feature of the speech segment "sound"; comparing it with the intrinsic prosodic feature of the target text "business", the two are easily found to be dissimilar, so it is determined that the target text "business" does not exist in the voice information to be retrieved "give me some sound".
According to the above technical solution, a candidate text set is obtained by recognizing the voice information to be retrieved; after a first candidate text matching the target text is retrieved from the set, the first prosodic feature of the first candidate text in the voice information to be retrieved and the intrinsic prosodic feature of the target text are acquired, and whether the target text exists in the voice information to be retrieved is determined from these two features. The solution exploits the fact that voices corresponding to the same text should have similar prosody, excludes retrieval results with large prosodic differences and reduces the misjudgment rate of voice information retrieval.
For the same text, the prosodic feature curve differs with context, because prosodic features are influenced by context. Therefore, the same text can be subdivided by context: voices in different contexts are pre-recorded, and a prosodic feature curve is fitted for each context.
When a text lies in different position areas of the voice information, its prosody is affected differently by the surrounding content, so its prosodic features differ. Because the prosodic features of a text in different position areas reflect the influence of context, the position area can serve as context information. For example, suppose the target text to be retrieved is "天真" ("innocent") and the voice information is "夏天真热!" ("It is really hot in summer!") or "天真的孩子" ("an innocent child"): since the target text lies in different position areas of the different voice information, its characteristic curve changes with the part of speech of the neighboring words and with the overall sentence type (exclamation, question). Therefore, a plurality of prosodic feature curves can be fitted using speech segments of the same text in different position areas of different voice information. For example, the prosodic feature curve of "sound" at the speech start position can be fitted with the segment corresponding to "sound" in the voice information "the sound is coming"; the curve of "sound" in the speech middle position can be fitted with the segment corresponding to "sound" in "her sound is really nice"; and the curve of "sound" at the speech end position can be fitted with the segment corresponding to "sound" in "give me some sound". The prosodic feature curve template library comprises a set of prosodic feature curves for each text. Each set includes one prosodic feature curve for each of a plurality of position areas, these being the position areas in which the single text corresponding to the set is located in different voice information. For example, each set may include the prosodic feature curves of the text at the speech start, middle and end positions.
As an exemplary embodiment, as shown in the flowchart of the voice information retrieval method shown in fig. 3, in step S104, acquiring a prosodic feature curve corresponding to the target text from a pre-stored prosodic feature curve template library as an intrinsic prosodic feature of the target text, including:
step S1041, determining the first position area in which the first candidate text is located in the voice information to be retrieved according to the moment at which the first candidate text appears in the voice information to be retrieved; the moment may be the start time, a middle time or the end time;
step S1042, determining the set of prosodic feature curves corresponding to the target text in the prosodic feature curve template library, and selecting, from the set of prosodic feature curves, the prosodic feature curve corresponding to the first position area as the intrinsic prosodic feature of the target text.
For example, referring to fig. 1, the word graph shows that the start time of the first candidate text "business" is t = 0.97 s, so it can be determined that "business" is at the speech end position of the whole utterance, whose duration is 0.98 s. The set of prosodic feature curves corresponding to the target text "business" is then located in the prosodic feature curve template library, and the curve corresponding to the speech end position is selected from the set as the intrinsic prosodic feature of the target text "business".
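A sketch of the position area lookup of steps S1041-S1042 follows; dividing the utterance into thirds is an assumed heuristic, as the embodiment does not fix how position areas are delimited, and template_library refers to the illustrative structure sketched earlier:

```python
def position_area(start_time, total_duration):
    """Map a candidate word's start time to a coarse position area."""
    ratio = start_time / total_duration
    if ratio < 1 / 3:
        return "start"
    if ratio < 2 / 3:
        return "middle"
    return "end"

area = position_area(0.97, 0.98)  # -> "end", as in the Fig. 1 example
# intrinsic = template_library["business"][area]   (illustrative lookup)
```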
In addition, the context information may also include the language scenario. In different language scenarios, speakers differ in attitude, environmental atmosphere and rhythm. The prosodic feature curve template library may therefore include, for each of a plurality of texts, prosodic feature curves under a plurality of language scenarios. In step S104, acquiring the prosodic feature curve corresponding to the target text from the pre-stored template library as the intrinsic prosodic feature of the target text may include: acquiring the language scenario of the voice information to be retrieved; and acquiring, from the prosodic feature curve template library, the prosodic feature curve of the target text under that language scenario as the intrinsic prosodic feature of the target text.
As an exemplary implementation, as shown in the flowchart of the voice information retrieval method shown in fig. 4, the step S103 of obtaining the prosodic feature of the first candidate text in the voice information to be retrieved may include:
step S1031, determining the duration of a voice segment corresponding to the first candidate text in the voice information to be retrieved according to the starting time and the ending time of the first candidate text in the voice information to be retrieved;
step S1032, acquiring the duration corresponding to the intrinsic prosodic feature; since the intrinsic prosodic feature is a prosodic feature curve selected from the prosodic feature curve template library, the corresponding duration may be the length of the curve's time axis;
step S1033, processing the voice segment so that its duration equals the duration corresponding to the intrinsic prosodic feature;
step S1034, performing curve fitting on the prosodic features of the processed voice segment, and taking the fitted prosodic feature curve as the first prosodic feature of the first candidate text in the voice information to be retrieved.
For example, processing the voice segment may mean up-sampling or down-sampling the segment corresponding to the first candidate text. In the embodiments of the present application, the first prosodic feature of the first candidate text is compared with the intrinsic prosodic feature of the target text, and their correlation or similarity is measured to determine whether the target text exists in the voice information to be retrieved. Since the intrinsic prosodic feature is a prosodic feature curve in the template library, and the first prosodic feature of the first candidate text is compared with it, curve fitting is performed on the speech segment corresponding to the first candidate text. The processing described above makes the duration of that speech segment equal to the duration of the intrinsic prosodic feature curve, so that the two curves being compared span the same duration, which improves the accuracy of the correlation and similarity measurements.
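A sketch of the duration alignment of steps S1031-S1033, using resampling as one form of the up-/down-sampling mentioned above (the scipy call is an illustrative choice):

```python
from scipy.signal import resample

def align_to_intrinsic(segment_track, intrinsic_curve):
    """Resample the candidate segment's prosodic track so that its length
    matches the intrinsic curve, making the two comparable point by point."""
    return resample(segment_track, len(intrinsic_curve))
```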
In some embodiments, when the prosodic feature curve template library is built in advance, curve fitting may proceed as follows: first, the prosodic feature of each syllable in the voice information is obtained, forming a prosodic feature sequence of the syllables and their features in order; then polynomial curve fitting, such as cubic polynomial fitting, is applied to the sequence to obtain the prosodic feature curve. Correspondingly, step S1034 may fit the prosodic features of the speech segment corresponding to the first candidate text in the same or a similar way.
Taking the fundamental frequency as the prosodic feature as an example, the curve fitting steps may be as follows:
step A, extracting the fundamental frequency value of each syllable in the voice information according to the start and end times of each syllable;
step B, forming the fundamental frequency sequence F0 = {(t1, f1), (t2, f2), ..., (tn, fn)}, where n is the total number of fundamental frequency points over the syllables of the voice information, ti is the time of the i-th fundamental frequency point and fi is its fundamental frequency value; after normalization, a new fundamental frequency sequence F'0 = {(t'1, f'1), (t'2, f'2), ..., (t'n, f'n)} is obtained;
step C, performing cubic curve fitting on the normalized fundamental frequency sequence to obtain four parameters a0, a1, a2 and a3, the coefficients of the fitted polynomial, which characterize the shape of the curve: f0(t) = a0 + a1·t + a2·t² + a3·t³. This polynomial gives the fundamental frequency curve.
Fitting may also be done for voiced probability curves using the exemplary approach described above.
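Steps A-C might look as follows in code; min-max normalization is assumed here, since the text calls for normalization without fixing a scheme:

```python
import numpy as np

def fit_prosody_curve(times, values):
    """Normalize a prosodic feature sequence and fit the cubic polynomial
    f0(t) = a0 + a1*t + a2*t^2 + a3*t^3, returning (a0, a1, a2, a3)."""
    t = np.asarray(times, dtype=float)
    f = np.asarray(values, dtype=float)
    t = (t - t.min()) / (t.max() - t.min())  # assumed min-max normalization
    f = (f - f.min()) / (f.max() - f.min())
    a3, a2, a1, a0 = np.polyfit(t, f, 3)     # highest-degree coefficient first
    return a0, a1, a2, a3
```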
As an exemplary implementation, the step S105 of determining whether the target text exists in the speech information to be retrieved according to the first prosodic feature of the first candidate text and the intrinsic prosodic feature of the target text may include:
determining a correlation coefficient between the first prosodic feature of the first candidate text and the intrinsic prosodic feature of the target text;
and if the correlation coefficient is larger than or equal to a preset threshold value, determining that the target text exists in the voice information to be retrieved.
Correlation reflects the degree of association between two variables and can be measured by a correlation coefficient: the higher the correlation coefficient of two curves, the closer their trends. In the embodiments of the present application, both the first prosodic feature and the intrinsic prosodic feature can be prosodic feature curves, and the correlation coefficient quantifies the difference between the first prosodic feature of the first candidate text and the intrinsic prosodic feature of the target text, making it convenient to determine whether the target text exists in the voice information to be retrieved.
As an example, representing the two characteristic curves by two variables (X, Y), the correlation coefficient ρ(X, Y) equals the covariance cov(X, Y) between them divided by the product σX·σY of their respective standard deviations. It can be calculated with the following formula, where the symbol E() represents the mathematical expectation:

ρ(X, Y) = cov(X, Y) / (σX·σY) = (E(XY) − E(X)·E(Y)) / (σX·σY)
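In code, the coefficient can be computed directly; np.corrcoef implements exactly the Pearson formula above:

```python
import numpy as np

def prosody_correlation(curve_a, curve_b):
    """Pearson correlation rho(X, Y) = cov(X, Y) / (sigma_X * sigma_Y)
    between two prosodic feature curves of equal length."""
    return float(np.corrcoef(np.asarray(curve_a, float),
                             np.asarray(curve_b, float))[0, 1])
```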
the embodiment of the application can also combine the confidence of the candidate text and the correlation coefficient between the prosodic features of the candidate text and the inherent prosodic features of the target text to comprehensively judge whether the voice segment corresponding to the candidate text pronounces the target text, so as to determine whether the target text exists in the voice information to be retrieved. There are examples of comprehensive decisions as follows:
Example one: when retrieving the first candidate text matching the target text among the plurality of candidate texts, candidate texts that are identical to the target text but whose confidence is below a preset threshold may be excluded, keeping only those at or above the threshold as first candidate texts. That is, step S102, retrieving a first candidate text matching the target text among the plurality of candidate texts, includes:
when candidate texts identical to the target text are retrieved from the plurality of candidate texts, selecting, from the retrieved candidate texts, a candidate text whose confidence is greater than or equal to the preset threshold as the first candidate text.
Filtering the retrieval results by confidence reduces the computation of the subsequent steps, such as prosodic feature processing and correlation coefficient calculation, and improves the efficiency of voice information retrieval.
Example two: after the prosodic feature correlation coefficient is obtained, the retrieval result can be judged in combination with the confidence. That is, step S105, determining whether the target text exists in the voice information to be retrieved according to the first prosodic feature of the first candidate text and the intrinsic prosodic feature of the target text, may include:
determining a correlation coefficient between a first prosodic feature of the first candidate text and an intrinsic prosodic feature of the target text;
according to a preset confidence weight and a preset correlation coefficient weight, performing a weighted summation of the confidence and the correlation coefficient of the first candidate text to obtain the score of the first candidate text;
and determining whether the target text exists in the voice information to be retrieved according to whether the score of the first candidate text is greater than or equal to a preset threshold.
In this example, the confidence weight and the correlation coefficient weight may be tuned; the weights that maximize the recall rate of voice information retrieval are taken as the optimal weight coefficients.
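A sketch of the weighted decision of example two; the weights and threshold below are illustrative starting points that would be tuned for maximal recall as just described:

```python
def keyword_score(confidence, correlation, w_conf=0.4, w_corr=0.6):
    """Weighted sum of recognition confidence and prosodic correlation."""
    return w_conf * confidence + w_corr * correlation

score = keyword_score(confidence=0.41, correlation=0.12)
target_present = score >= 0.5  # preset threshold; False for this candidate
```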
In particular implementations, the prosodic features may include the fundamental frequency, the voiced probability, or both. Correspondingly, the prosodic feature curves of the speech segment corresponding to the first candidate text may include both a fundamental frequency curve and a voiced probability curve; the curves in the prosodic feature curve template library likewise include a fundamental frequency curve and a voiced probability curve, so the intrinsic prosodic feature of the target text may also include both. The correlation coefficients of the two prosodic features can be computed separately and then combined to judge whether the target text exists in the voice information to be retrieved.
It should be noted that although the above examples describe the voice information retrieval method, those skilled in the art will understand that the present application is not limited thereto; in fact, the method can be configured flexibly according to personal preference and/or the actual application scenario.
Thus, according to the voice information retrieval method of the embodiments of the present application, a candidate text set is obtained by recognizing the voice information to be retrieved; after the first candidate text matching the target text is retrieved from the candidate text set, the first prosodic feature of the first candidate text in the voice information to be retrieved and the intrinsic prosodic feature of the target text are acquired, and whether the target text exists in the voice information to be retrieved is determined from these two features. The method exploits the fact that voices corresponding to the same text should have similar prosody, excludes retrieval results with large prosodic differences by introducing prosodic features, and reduces the misjudgment rate of voice information retrieval.
Fig. 5 is a block diagram showing the structure of a voice information retrieval apparatus according to an embodiment of the present application. As shown in fig. 5, the apparatus 600 may include:
the recognition module 610 is configured to obtain a candidate text set of the voice information to be retrieved, where the candidate text set includes a plurality of candidate texts;
a retrieving module 620, configured to retrieve a first candidate text matching the target text from the plurality of candidate texts;
a first obtaining module 630, configured to obtain a first prosodic feature of the first candidate text in the to-be-retrieved voice information;
a second obtaining module 640, configured to obtain an intrinsic prosodic feature of the target text;
and the determining module 650 is configured to determine whether the target text exists in the voice information to be retrieved according to the first prosody feature of the first candidate text and the intrinsic prosody feature of the target text.
In one embodiment, the plurality of candidate texts have corresponding confidence levels; the retrieving module 620 is configured to, when a candidate text that is the same as the target text is retrieved from the plurality of candidate texts, select, as the first candidate text, a candidate text with a confidence level that is greater than or equal to a preset threshold from the retrieved candidate texts.
In one embodiment, the second obtaining module 640 is configured to obtain the prosodic feature curve corresponding to the target text from a pre-stored prosodic feature curve template library as the intrinsic prosodic feature of the target text, wherein
the prosodic feature curve template library comprises a plurality of texts and a set of prosodic feature curves corresponding to each text, each set of prosodic feature curves comprises one prosodic feature curve for each of a plurality of position areas, and the plurality of position areas are the position areas in which the single text corresponding to the set of prosodic feature curves is located in different voice information.
In one embodiment, as shown in the block diagram of the voice information retrieving apparatus in fig. 6, the second obtaining module 640 includes:
the position determining unit 641 is configured to determine that the first candidate text is in a first position area in the to-be-retrieved voice information according to a time when the first candidate text appears in the to-be-retrieved voice information;
the selecting unit 642 is configured to determine the set of prosodic feature curves corresponding to the target text in the prosodic feature curve template library, and to select the prosodic feature curve corresponding to the first position area from the set as the intrinsic prosodic feature of the target text.
In one embodiment, as shown in fig. 7, the first obtaining module 630 includes:
the duration determining unit 631 is configured to determine, according to a start time and an end time of the first candidate text appearing in the to-be-retrieved voice information, a duration of a voice segment corresponding to the first candidate text in the to-be-retrieved voice information;
a duration obtaining unit 632, configured to obtain a duration corresponding to the intrinsic prosodic feature;
a duration processing unit 633, configured to process the voice segment so that a duration of the voice segment is equal to a duration corresponding to the intrinsic prosodic feature;
the curve fitting unit 634 is configured to perform curve fitting on the prosody features of the processed voice segments, and use a prosody feature curve obtained through the curve fitting as a first prosody feature of the first candidate text in the voice information to be retrieved.
In one embodiment, the determining module 650 includes:
a correlation coefficient determining unit, configured to determine a correlation coefficient of a first prosodic feature of the first candidate text and an intrinsic prosodic feature of the target text;
and a first determining unit, configured to determine that the target text exists in the voice information to be retrieved when the correlation coefficient is greater than or equal to a preset threshold.
In one embodiment, the plurality of candidate texts have corresponding confidences; the determining module 650 includes:
a correlation coefficient determining unit, configured to determine a correlation coefficient of a first prosodic feature of the first candidate text and an intrinsic prosodic feature of the target text;
the weighted summation unit is used for carrying out weighted summation on the confidence coefficient and the correlation coefficient of the first candidate text according to the preset confidence coefficient weight and the correlation coefficient weight to obtain the score of the first candidate text;
and the second determining unit is used for determining that the target text exists in the voice information to be retrieved under the condition that the score of the first candidate text is greater than or equal to a preset threshold value.
In one embodiment, the first prosodic feature comprises a first fundamental frequency curve and/or a first voiced probability curve; the intrinsic prosodic features include an intrinsic fundamental frequency curve and/or an intrinsic voiced probability curve.
Fig. 8 shows a block diagram of the structure of a device according to an embodiment of the present application. As shown in fig. 8, the device includes a memory 910 and a processor 920, the memory 910 storing a computer program executable on the processor 920. The processor 920 implements the voice information retrieval method of the above embodiments when executing the computer program. There may be one or more memories 910 and processors 920.
The apparatus further comprises:
and a communication interface 930 for communicating with an external device to perform data interactive transmission.
If the memory 910, the processor 920 and the communication interface 930 are implemented independently, the memory 910, the processor 920 and the communication interface 930 may be connected to each other through a bus and perform communication with each other. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 8, but this is not intended to represent only one bus or type of bus.
Optionally, in an implementation, if the memory 910, the processor 920 and the communication interface 930 are integrated on a chip, the memory 910, the processor 920 and the communication interface 930 may complete communication with each other through an internal interface.
Embodiments of the present invention provide a computer-readable storage medium, which stores a computer program, and when the program is executed by a processor, the computer program implements the method provided in the embodiments of the present application.
The embodiment of the present application further provides a chip, where the chip includes a processor, and is configured to call and execute the instruction stored in the memory from the memory, so that the communication device in which the chip is installed executes the method provided in the embodiment of the present application.
An embodiment of the present application further provides a chip, including: the system comprises an input interface, an output interface, a processor and a memory, wherein the input interface, the output interface, the processor and the memory are connected through an internal connection path, the processor is used for executing codes in the memory, and when the codes are executed, the processor is used for executing the method provided by the embodiment of the application.
It should be understood that the processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor or any conventional processor. Note that the processor may be a processor supporting the advanced RISC machine (ARM) architecture.
Further, optionally, the memory may include a read-only memory and a random access memory, and may further include a non-volatile random access memory. The memory may be volatile or non-volatile, or include both. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM) or flash memory. Volatile memory may include random access memory (RAM), which acts as an external cache. By way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate synchronous SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM) and direct rambus RAM (DR RAM).
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions according to the present application are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another.
In the description herein, references to the terms "one embodiment", "some embodiments", "an example", "a specific example", "some examples" and the like mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, those skilled in the art may combine different embodiments or examples, and features of different embodiments or examples, described in this specification, provided that they do not contradict each other.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "a plurality" means two or more unless specifically limited otherwise.
Any process or method description in a flowchart, or otherwise described herein, may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing specific logical functions or steps of the process. The scope of the preferred embodiments of the present application also includes implementations in which functions are performed out of the order shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved.
The logic and/or steps represented in the flowcharts or otherwise described herein, for example, an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus, or device, such as a computer-based system, a processor-containing system, or another system that can fetch instructions from the instruction execution system, apparatus, or device and execute them.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. All or part of the steps of the methods of the above embodiments may be carried out by a program instructing the relevant hardware; the program may be stored in a computer-readable storage medium and, when executed, performs one of the steps of the method embodiments or a combination thereof.
In addition, the functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium. The storage medium may be a read-only memory, a magnetic disk, an optical disk, or the like.
While the present invention has been described with reference to the preferred embodiments, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention as defined by the appended claims. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (9)

1. A method for retrieving voice information, comprising:
acquiring a candidate text set of voice information to be retrieved, wherein the candidate text set comprises a plurality of candidate texts;
retrieving a first candidate text matching the target text from the plurality of candidate texts;
acquiring a first prosodic feature of the first candidate text in the voice information to be retrieved;
acquiring intrinsic prosodic features of the target text;
determining whether the target text exists in the voice information to be retrieved according to the first prosodic feature of the first candidate text and the intrinsic prosodic feature of the target text;
wherein the acquiring the intrinsic prosodic features of the target text comprises:
determining, according to the moment at which the first candidate text appears in the voice information to be retrieved, that the first candidate text is in a first position area of the voice information to be retrieved;
determining, in a prosodic feature curve template library, a set of prosodic feature curves corresponding to the target text, and selecting, from the set of prosodic feature curves, the prosodic feature curve corresponding to the first position area as the intrinsic prosodic feature of the target text; wherein the prosodic feature curve template library comprises a plurality of texts and a respective set of prosodic feature curves corresponding to each of the texts, each set of prosodic feature curves comprises a respective prosodic feature curve for each of a plurality of position areas, and the plurality of position areas are the position areas in which the single text corresponding to that set of prosodic feature curves is located in different voice information.
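As an illustrative aside (not part of the claims), the template-library lookup above can be sketched in a few lines of Python; the three-region position split, the dictionary layout, and every name such as TEMPLATE_LIBRARY are assumptions made for exposition only:

```python
import numpy as np

# Hypothetical template library: text -> {position region -> prosodic feature curve}.
# Three regions and four-point curves are assumed purely for illustration.
TEMPLATE_LIBRARY = {
    "hello": {
        "begin":  np.array([120.0, 135.0, 150.0, 140.0]),
        "middle": np.array([110.0, 125.0, 130.0, 120.0]),
        "end":    np.array([100.0, 115.0, 105.0,  90.0]),
    },
}

def position_region(occurrence_time: float, utterance_duration: float) -> str:
    """Map the moment the candidate text appears to a coarse position region."""
    ratio = occurrence_time / utterance_duration
    if ratio < 1 / 3:
        return "begin"
    if ratio < 2 / 3:
        return "middle"
    return "end"

def intrinsic_prosodic_feature(target_text: str, occurrence_time: float,
                               utterance_duration: float) -> np.ndarray:
    """Select the template curve for the target text in the matched region."""
    return TEMPLATE_LIBRARY[target_text][
        position_region(occurrence_time, utterance_duration)]
```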
2. The method of claim 1, wherein the plurality of candidate texts have corresponding confidence levels;
the retrieving, from the plurality of candidate texts, a first candidate text that matches the target text comprises:
when candidate texts identical to the target text are retrieved from the plurality of candidate texts, selecting, from the retrieved candidate texts, a candidate text whose confidence level is greater than or equal to a preset threshold as the first candidate text.
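A minimal sketch of this confidence-gated selection (illustrative only), assuming candidates are dictionaries with text and confidence fields and an assumed threshold of 0.6, since the patent leaves the preset threshold open:

```python
def select_first_candidate(candidates, target_text, threshold=0.6):
    """Return the first candidate equal to the target text whose ASR
    confidence meets the (assumed) threshold, or None if there is none."""
    for c in candidates:
        if c["text"] == target_text and c["confidence"] >= threshold:
            return c
    return None
```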
3. The method according to claim 1, wherein the acquiring the first prosodic feature of the first candidate text in the voice information to be retrieved comprises:
determining the duration of a voice segment corresponding to the first candidate text in the voice information to be retrieved according to the starting time and the ending time of the first candidate text in the voice information to be retrieved;
acquiring the duration corresponding to the intrinsic prosodic features;
processing the voice segment so that the duration of the voice segment equals the duration corresponding to the intrinsic prosodic features;
performing curve fitting on the prosodic features of the processed voice segment, and taking the fitted prosodic feature curve as the first prosodic feature of the first candidate text in the voice information to be retrieved.
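One plausible realization of the duration alignment and curve fitting of claim 3, sketched in Python; Fourier resampling via scipy.signal.resample and a degree-4 polynomial fit are assumed choices, since the claim fixes neither the processing step nor the fitting method:

```python
import numpy as np
from scipy.signal import resample

def first_prosodic_feature(segment_feature: np.ndarray,
                           template_length: int,
                           degree: int = 4) -> np.ndarray:
    """Time-normalize the segment's prosodic track (e.g. an F0 contour) to
    the template duration, then fit a smooth curve to it."""
    stretched = resample(segment_feature, template_length)  # equalize durations
    x = np.arange(template_length)
    coeffs = np.polyfit(x, stretched, degree)               # curve fitting
    return np.polyval(coeffs, x)                            # fitted prosodic curve
```

Resampling preserves the overall contour shape while equalizing lengths, which is what the point-wise curve comparison in the later claims requires.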
4. The method according to claim 1, wherein the determining whether the target text exists in the voice information to be retrieved according to the first prosodic feature of the first candidate text and the intrinsic prosodic feature of the target text comprises:
determining a correlation coefficient of a first prosodic feature of the first candidate text and an intrinsic prosodic feature of the target text;
if the correlation coefficient is greater than or equal to a preset threshold, determining that the target text exists in the voice information to be retrieved.
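The correlation test of claim 4 can be sketched with a Pearson coefficient; the 0.8 threshold below is an assumed value, as the patent leaves the preset threshold open:

```python
import numpy as np

def target_text_present(first_curve: np.ndarray,
                        intrinsic_curve: np.ndarray,
                        threshold: float = 0.8) -> bool:
    """Compare two equal-length prosodic curves by Pearson correlation."""
    r = np.corrcoef(first_curve, intrinsic_curve)[0, 1]
    return r >= threshold
```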
5. The method of claim 1, wherein the plurality of candidate texts have corresponding confidence levels;
the determining whether the target text exists in the voice information to be retrieved according to the first prosodic feature of the first candidate text and the intrinsic prosodic feature of the target text comprises:
determining a correlation coefficient of a first prosodic feature of the first candidate text and an intrinsic prosodic feature of the target text;
performing, according to a preset confidence weight and a preset correlation coefficient weight, a weighted summation of the confidence level and the correlation coefficient of the first candidate text to obtain a score of the first candidate text;
if the score of the first candidate text is greater than or equal to a preset threshold, determining that the target text exists in the voice information to be retrieved.
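A sketch of the weighted scoring of claim 5; the weights (0.4/0.6) and the decision threshold (0.7) are illustrative assumptions only:

```python
def candidate_score(confidence: float, correlation: float,
                    w_conf: float = 0.4, w_corr: float = 0.6) -> float:
    """Weighted sum of ASR confidence and prosody correlation."""
    return w_conf * confidence + w_corr * correlation

# Decision rule: the target text is deemed present when the score clears
# an assumed threshold.
present = candidate_score(0.82, 0.91) >= 0.7
```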
6. The method of claim 1, wherein the first prosodic feature comprises a first fundamental frequency curve and/or a first voiced probability curve; the intrinsic prosodic features include an intrinsic fundamental frequency curve and/or an intrinsic voiced probability curve.
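Claim 6 names fundamental-frequency and voiced-probability curves as the prosodic features; one common way to obtain both (not prescribed by the patent) is librosa's pYIN tracker, shown here as an assumed front end:

```python
import librosa

def prosody_curves(wav_path: str):
    """Extract an F0 curve and a voiced-probability curve with pYIN."""
    y, sr = librosa.load(wav_path, sr=16000)
    f0, voiced_flag, voiced_prob = librosa.pyin(
        y,
        fmin=librosa.note_to_hz("C2"),   # ~65 Hz
        fmax=librosa.note_to_hz("C7"),   # ~2093 Hz
        sr=sr,
    )
    return f0, voiced_prob
```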
7. A speech information retrieval apparatus, comprising:
the recognition module is used for acquiring a candidate text set of the voice information to be retrieved, wherein the candidate text set comprises a plurality of candidate texts;
the retrieval module is used for retrieving a first candidate text matched with the target text from the candidate texts;
the first obtaining module is used for obtaining a first prosodic feature of the first candidate text in the voice information to be retrieved;
the second acquisition module is used for acquiring the intrinsic prosodic features of the target text;
the determining module is used for determining whether the target text exists in the voice information to be retrieved according to the first prosodic feature of the first candidate text and the intrinsic prosodic feature of the target text;
wherein the second obtaining module comprises:
the position determining unit is used for determining, according to the moment at which the first candidate text appears in the voice information to be retrieved, that the first candidate text is in a first position area of the voice information to be retrieved;
the selecting unit is used for determining, in a prosodic feature curve template library, a set of prosodic feature curves corresponding to the target text, and selecting, from the set of prosodic feature curves, the prosodic feature curve corresponding to the first position area as the intrinsic prosodic feature of the target text; wherein the prosodic feature curve template library comprises a plurality of texts and a respective set of prosodic feature curves corresponding to each of the texts, each set of prosodic feature curves comprises a respective prosodic feature curve for each of a plurality of position areas, and the plurality of position areas are the position areas in which the single text corresponding to that set of prosodic feature curves is located in different voice information.
8. A voice information retrieval device, comprising a processor and a memory, the memory storing instructions that are loaded and executed by the processor to implement the method of any one of claims 1 to 6.
9. A computer-readable storage medium, in which a computer program is stored which, when executed by a processor, implements the method according to any one of claims 1 to 6.
CN201911374503.4A 2019-12-27 2019-12-27 Voice information retrieval method, device, equipment and computer readable storage medium Active CN111078937B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911374503.4A CN111078937B (en) 2019-12-27 2019-12-27 Voice information retrieval method, device, equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911374503.4A CN111078937B (en) 2019-12-27 2019-12-27 Voice information retrieval method, device, equipment and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN111078937A CN111078937A (en) 2020-04-28
CN111078937B (en) 2021-08-10

Family

ID=70319017

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911374503.4A Active CN111078937B (en) 2019-12-27 2019-12-27 Voice information retrieval method, device, equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN111078937B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111863026A (en) * 2020-07-27 2020-10-30 Beijing Century TAL Education Technology Co., Ltd. Processing method and device for playing music by keyboard instrument and electronic device

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6738745B1 (en) * 2000-04-07 2004-05-18 International Business Machines Corporation Methods and apparatus for identifying a non-target language in a speech recognition system
US6999932B1 (en) * 2000-10-10 2006-02-14 Intel Corporation Language independent voice-based search system
CN101000765B * 2007-01-09 2011-03-30 Heilongjiang University Speech synthesis method based on prosodic characteristics
CN106297776B * 2015-05-22 2019-07-09 Institute of Acoustics, Chinese Academy of Sciences Voice keyword retrieval method based on audio templates
US9754580B2 (en) * 2015-10-12 2017-09-05 Technologies For Voice Interface System and method for extracting and using prosody features
CN106611597B * 2016-12-02 2019-11-08 Baidu Online Network Technology (Beijing) Co., Ltd. Voice wake-up method and device based on artificial intelligence
CN106601259B * 2016-12-13 2021-04-06 Beijing Qihoo Technology Co., Ltd. Information recommendation method and device based on voiceprint search
CN107679042B * 2017-11-15 2021-02-05 Beijing Lingban Instant Intelligent Technology Co., Ltd. Multi-level dialogue analysis method for intelligent voice dialogue systems
CN109979436B * 2019-04-12 2020-11-13 Nanjing Institute of Technology BP neural network speech recognition system and method based on spectrum adaptation

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101192405A (en) * 2006-12-01 2008-06-04 Sony Corporation Speech processing device, method and program
CN102194454A (en) * 2010-03-05 2011-09-21 Fujitsu Limited Equipment and method for detecting keywords in continuous speech
CN104681036A (en) * 2014-11-20 2015-06-03 Suzhou Chivox Information Technology Co., Ltd. System and method for detecting language audio
CN109522392A (en) * 2018-10-11 2019-03-26 Ping An Technology (Shenzhen) Co., Ltd. Voice-based search method, server and computer-readable storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Utterance Verification Using Prosodic Information for Mandarin Telephone Speech Keyword Spotting; Yeou-Jiunn Chen et al.; International Conference on Acoustics, Speech, and Signal Processing (ICASSP); 1999-03-19; full text *
Basic techniques and research progress of keyword spotting; Chen Kaijiang et al.; Computer Science; 2002-02-15; Vol. 29, No. 2; pp. 115-119 *

Also Published As

Publication number Publication date
CN111078937A (en) 2020-04-28

Similar Documents

Publication Publication Date Title
US11455995B2 (en) User recognition for speech processing systems
US11270685B2 (en) Speech based user recognition
US11854545B2 (en) Privacy mode based on speaker identifier
US11496582B2 (en) Generation of automated message responses
US11594215B2 (en) Contextual voice user interface
US11170776B1 (en) Speech-processing system
US11373633B2 (en) Text-to-speech processing using input voice characteristic data
US20200184967A1 (en) Speech processing system
US11837225B1 (en) Multi-portion spoken command framework
US11302329B1 (en) Acoustic event detection
US11676572B2 (en) Instantaneous learning in text-to-speech during dialog
US11715472B2 (en) Speech-processing system
US20240071385A1 (en) Speech-processing system
US6963834B2 (en) Method of speech recognition using empirically determined word candidates
Mary et al. Searching speech databases: features, techniques and evaluation measures
CN111078937B (en) Voice information retrieval method, device, equipment and computer readable storage medium
KR102113879B1 The method and apparatus for recognizing speaker's voice by using reference database
Diwakar et al. Improving speech to text alignment based on repetition detection for dysarthric speech
Colton Confidence and rejection in automatic speech recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant