CN114900590A - Tracking display method, prompter, electronic device and storage medium - Google Patents

Tracking display method, prompter, electronic device and storage medium

Info

Publication number
CN114900590A
Authority
CN
China
Prior art keywords
text
candidate
determining
tracking
positions
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210293071.XA
Other languages
Chinese (zh)
Other versions
CN114900590B (en)
Inventor
司马华鹏
李慧水
汤毅平
汪成
张亚萍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Silicon Intelligence Technology Co Ltd
Original Assignee
Nanjing Silicon Intelligence Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Silicon Intelligence Technology Co Ltd
Priority to CN202210293071.XA
Publication of CN114900590A
Application granted
Publication of CN114900590B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 5/00 Details of television systems
    • H04N 5/222 Studio circuitry; Studio devices; Studio equipment
    • H04N 5/2222 Prompting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/332 Query formulation
    • G06F 16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/3331 Query processing
    • G06F 16/334 Query execution
    • G06F 16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems

Abstract

The application provides a tracking display method, a prompter, an electronic device and a storage medium, addressing the technical problem in the related art that a prompter cannot achieve ideal voice tracking during operation. The tracking display method is applicable to different languages and speaking scenarios, and the corresponding tracking display scheme and language logic are determined by detection. Specifically, a positioning-based voice tracking approach is adopted: voice tracking is achieved by screening candidate positions and comparing them with the recognized speech by similarity. On the one hand, the positioning process does not depend strictly on the accuracy of speech recognition. On the other hand, candidate positions are screened iteratively around the last tracked position, so the comparison is not made against the entire text to be tracked. This shortens processing time, and accurate positioning according to the user's actual speech can still be achieved when speech recognition errors occur or the user misspeaks, thereby realizing fast and accurate voice tracking.

Description

Tracking display method, prompter, electronic device and storage medium
Technical Field
The present application relates to the field of electronic device technologies, and in particular, to a tracking display method, a prompter, an electronic device, and a storage medium.
Background
During recording or live broadcasting, a user needs an electronic device to display the relevant target text as a prompt, so that the user can work smoothly. Generally, the prompting function can be implemented by a dedicated electronic device, such as a prompter, or in the form of an application program on an electronic device.
While the user uses the prompting function, the target text is displayed on the prompter screen in real time for the user to read. At the same time, based on the user's speech, the prompter marks the user's current speaking position in the displayed target text in real time, so that the user can see the prompt text needed at the next moment. In this way, voice tracking is performed while the prompting function is provided.
However, to implement the voice tracking function, the prompter typically performs speech recognition using an Automatic Speech Recognition (ASR) model. The recognition accuracy of the ASR model itself is limited; for example, when the user's speaking rate changes, the user pauses, or the user misspeaks, the user's speaking position cannot be recognized accurately, which degrades the user experience. For the problem in the related art that a prompter cannot achieve ideal voice tracking during operation, no effective solution has yet been proposed.
Disclosure of Invention
The embodiment of the application provides a tracking display method, a prompter, electronic equipment and a storage medium, so as to at least solve the technical problem that ideal voice tracking cannot be realized in the related technology.
In one embodiment of the present application, a tracking display method is provided, the method including: obtaining a recognition text according to voice data input by a user at the current moment, and determining an initial position in the recognition text, where the initial position is the text position determined by voice tracking at the previous moment; determining a plurality of candidate positions and a candidate text segment corresponding to each candidate position according to the initial position, a preset total number of candidate positions and a preset text length, where the total number of candidate positions indicates the number of candidate positions and the text length indicates the text length of each candidate text segment; performing similarity matching between the candidate text segment corresponding to each candidate position and the recognition text to determine a text matching result; and determining the voice tracking position at the current moment according to the text matching result, and presenting the voice tracking position in a user interface.
In an embodiment of the present application, in the case that the recognition text is of the first language type, determining a plurality of candidate positions and the candidate text segment corresponding to each candidate position according to the initial position, the preset total number of candidate positions, and the preset text length includes: selecting a candidate character set according to the total number of candidate positions, where the candidate character set includes a plurality of first characters; the first characters are a plurality of consecutive characters after the initial position, the number of first characters is equal to the total number of candidate positions, and each first character corresponds to one candidate position; and determining the candidate text segments according to the text length and the first characters, where each candidate text segment includes a number of consecutive characters equal to the text length, and the corresponding first character is located at the end of its candidate text segment.
In an embodiment of the present application, determining a plurality of candidate positions and the candidate text segment corresponding to each candidate position according to the initial position, the preset total number of candidate positions, and the preset text length includes: selecting a candidate character set according to the total number of candidate positions, where the candidate character set includes a plurality of first characters and second characters; the first characters are a plurality of consecutive characters after the initial position, the second characters include the character corresponding to the initial position and a plurality of consecutive characters before the initial position, and the total number of first and second characters is equal to the total number of candidate positions; each first character and each second character corresponds to one candidate position; and determining the candidate text segments according to the text length, the first characters and the second characters, where each candidate text segment includes a number of consecutive characters equal to the text length, and each first or second character is located at the end of its corresponding candidate text segment.
In one embodiment of the application, similarity matching is carried out on the candidate text segment corresponding to each candidate position and the recognition text, and a text matching result is determined; the method comprises the following steps: and determining a first text matching value corresponding to each candidate text segment according to the similarity between each candidate text segment and the recognition text. And determining a distance penalty value corresponding to each candidate text segment according to the candidate positions, wherein the distance penalty value is used for representing the text distance between the candidate positions and the initial position. And determining a second text matching value corresponding to each candidate text segment according to the first text matching value and the distance penalty value corresponding to each candidate text segment. And determining a text matching result according to the second text matching values corresponding to the candidate text segments.
In one embodiment of the application, the user interface displays a text to be tracked, and the voice tracking position is the position, in the text to be tracked, of the text at the end of the matched candidate text segment. Presenting the voice tracking position in the user interface includes: marking the voice tracking position at a display tracking position of the text to be tracked, and presenting the display tracking position in the user interface, where the display tracking position is a text position after the voice tracking position.
In an embodiment of the present application, in the case that the recognition text is of the second language type, determining a plurality of candidate positions and the candidate text segment corresponding to each candidate position according to the initial position, the preset total number of candidate positions, and the preset text length includes: selecting a candidate word set according to the total number of candidate positions, where the candidate word set includes a plurality of first words; the first words are a plurality of consecutive words after the initial position, the number of first words is equal to the total number of candidate positions, and each first word corresponds to one candidate position; and determining the candidate text segments according to the text length and the first words, where each candidate text segment includes a number of consecutive words equal to the text length, and the corresponding first word is located at the end of its candidate text segment.
In an embodiment of the present application, determining a plurality of candidate positions and the candidate text segment corresponding to each candidate position according to the initial position, the preset total number of candidate positions, and the preset text length includes: selecting a candidate word set according to the total number of candidate positions, where the candidate word set includes a plurality of first words and second words; the first words are a plurality of consecutive words after the initial position, the second words include the word corresponding to the initial position and a plurality of consecutive words before the initial position, and the total number of first and second words is equal to the total number of candidate positions; each first word and each second word corresponds to one candidate position; and determining the candidate text segments according to the text length, the first words and the second words, where each candidate text segment includes a number of consecutive words equal to the text length, and each first or second word is located at the end of its corresponding candidate text segment.
In an embodiment of the present application, performing similarity matching between the candidate text segment corresponding to each candidate position and the recognition text and determining a text matching result includes: determining a third text matching value corresponding to each candidate text segment according to the similarity between each candidate text segment and the recognition text; determining a first comparison value and a second comparison value corresponding to each candidate text segment according to the third text matching value, where the first comparison value characterizes, for two words at the same position in the candidate text segment and the recognition text, the proportion of letter positions at which the letters are the same, and the second comparison value characterizes the proportion of letters shared by the two words out of the total number of letters; determining a distance penalty value corresponding to each candidate text segment according to the plurality of candidate positions, where the distance penalty value characterizes the text distance between the candidate position and the initial position; determining a fourth text matching value corresponding to each candidate text segment according to the first comparison value, the second comparison value and the distance penalty value corresponding to each candidate text segment; and determining the text matching result according to the fourth text matching values corresponding to the plurality of candidate text segments.
In one embodiment of the present application, the method further comprises: the total number of candidate positions set when the recognition text is of the first language type is smaller than the total number of candidate positions set when the recognition text is of the second language type or of a mixed type of the first and second language types.
In an embodiment of the present application, before determining the initial position in the recognition text, the method further includes: detecting the language type of the recognition text, so that the corresponding initial position is determined according to the detected language type.
In one embodiment of the present application, there is also provided a prompter, including:
the ASR module is configured to obtain a recognition text according to the voice data input by the user at the current moment;
a speech tracking module configured to determine an initial position in the recognized text; the initial position is a text position determined by voice tracking at the last moment; determining a plurality of candidate positions and candidate text segments corresponding to each candidate position according to the initial position, the preset total number of the candidate positions and the preset text length; the total number of the candidate positions is used for indicating the number of the candidate positions, and the text length is used for indicating the text length of each candidate text fragment; carrying out similarity matching on the candidate text segment corresponding to each candidate position and the recognition text to determine a text matching result; determining the voice tracking position at the current moment according to the text matching result;
A display module configured to present the voice-tracking location in a user interface.
In one embodiment of the application, a user interface displays a text to be tracked, and the voice tracking position is a position of a candidate text segment terminal text corresponding to the text to be tracked; presenting the voice tracking location in a user interface, the display module further configured to:
marking the voice tracking position at a display tracking position of the text to be tracked, and presenting the display tracking position in a user interface; wherein the display tracking position is a text position after the voice tracking position.
In an embodiment of the present application, a computer-readable storage medium is also proposed, in which a computer program is stored, wherein the computer program is configured to perform the steps of the above-mentioned method when executed.
In an embodiment of the present application, there is also provided an electronic device, including a memory and a processor, the memory having a computer program stored therein, the processor being configured to execute the computer program to perform the steps of the above method.
The application provides a tracking display method, a prompter, an electronic device and a storage medium, addressing the technical problem in the related art that a prompter cannot achieve ideal voice tracking during operation. The tracking display method is applicable to different languages and speaking scenarios, and the corresponding tracking display scheme and language logic are determined by detection. Specifically, a positioning-based voice tracking approach is adopted: voice tracking is achieved by screening candidate positions and comparing them with the recognized speech by similarity. On the one hand, the positioning process does not depend strictly on the accuracy of speech recognition. On the other hand, candidate positions are screened iteratively around the last tracked position, so the comparison is not made against the entire text to be tracked. This shortens processing time, and accurate positioning according to the user's actual speech can still be achieved when speech recognition errors occur or the user misspeaks, thereby realizing fast and accurate voice tracking.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a flow chart of an alternative tracking display method according to an embodiment of the present application;
FIG. 2 is a flow chart of an alternative method for determining candidate text segments according to an embodiment of the present application;
FIG. 3 is a flow chart of an alternative method for determining candidate text segments according to an embodiment of the present application;
FIG. 4 is a schematic diagram of an alternative prompter according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of an alternative electronic device according to an embodiment of the present application.
Detailed Description
The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
First, an application scenario of the embodiments of the present application will be described. When a user hosts, performs, records or live-streams, an electronic device is needed to prompt the user with the relevant target text, so that the user can carry out the corresponding work smoothly while the prompting function is provided. Generally, the prompting function can be realized by a dedicated electronic device, such as a prompter, or in the form of an application program on an electronic device.
While the user uses the prompting function, the target text is displayed in real time in the user interface of the prompter screen for the user to read. At the same time, based on the user's speech, the prompter marks the user's current speaking position in the displayed target text in real time, so that the user can conveniently obtain the prompt text needed at the next moment. In this way, voice tracking is performed while the prompting function is realized. For convenience of the following description, the target text used by the user is referred to as the text to be tracked.
Illustratively, the text to be tracked is the sentence (an example originally in Chinese, rendered here in translation) "i feel like a whole family you go on your cumin to find a temple …… that is completely absent in the world." The user reads along with the displayed characters, and voice tracking marks are made sentence by sentence or character by character during reading. For example, when the user reads the "one" in "i feel you like", the position of "one" is marked in real time in the user interface. Likewise, after the user reads "temple" at the end of the sentence, the position of "temple" is marked in real time in the user interface and tracking automatically moves on to the next sentence. If the user reaches the last word of the current page, the prompter is controlled to display the next page of the currently displayed text and continues marking in real time.
When a tokenizer implements a speech tracking function in the related art, an Automatic Speech Recognition (ASR) model is typically employed to perform speech recognition on a target text. Meanwhile, the prompter usually only supports Chinese and the context of uniform pronunciation of the user. Generally, in the process of speech recognition, the recognition error of the ASR model itself is limited, and the prompter generally adopts a uniform-speed playing mode, for example, when the speech of the user is other than the chinese language and/or the user has a scene of speech speed change, speech pause or speech mistake during speech, the speech position of the user cannot be accurately recognized, and the tracking position cannot be updated in time, so as to reduce the user experience.
In order to optimize the above problem, the present application provides a tracking display method to at least solve the technical problem in the related art that ideal voice tracking cannot be realized.
Fig. 1 is a flowchart of an alternative tracking display method according to an embodiment of the present application, and referring to fig. 1, a tracking display method provided by the present application includes the following specific processes:
s1: obtaining an identification text according to voice data input by a user at the current moment, and determining an initial position in the identification text; the initial position is a text position determined by voice tracking at the last moment.
In this embodiment, the recognition text is the text recognized from the voice data input by the user, that is, the text content currently spoken by the user. The text to be tracked is the text created and set by the user, such as lines or a script that prompts the user in the scenario. As the user reads the displayed text to be tracked, the recognition text needs to be associated with the text to be tracked so that the position the user is reading can be tracked in the text to be tracked in real time, thereby realizing the voice tracking and prompting functions.
In this embodiment, the position located by voice tracking at the previous moment is defined as the initial position at the current moment. It should be noted that the tracking display method provided in the present application advances iteratively: the tracking result of the previous moment serves as the initial position of the current moment, the tracking result of the current moment in turn serves as the initial position of the next moment, and so on.
For example, if the text content spoken by the user at the previous moment is the word "tired" in "i feel that you are just like a whole family would like cumin you go on the road", then the voice tracking result at the previous moment is the position of "tired". Accordingly, the initial position at the current moment is the position of "tired".
In this embodiment, before determining the initial position in the recognition text, the method further includes: and detecting the language type of the recognized text so as to determine the corresponding different initial positions according to different language types.
In this embodiment, the tracking display method is suitable for different language types, illustratively a first language type, a second language type and a third language type. The first language type refers to languages whose unit is the character, such as Chinese, Japanese and Korean. The second language type refers to Latin-script languages whose unit is the letter, such as English, French and German. The third language type refers to a hybrid language type, such as a mixture of the first and second language types, e.g. mixed Chinese and English. It should be noted that the languages covered by each language type include, but are not limited to, those listed above and can be chosen according to the actual situation.
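Before moving on to S2, the iterative advance described above (the previous tracking result becomes the next initial position, after language-type detection) can be summarized as a simple loop. The Python sketch below is illustrative only: the function names asr_recognize, detect_language and track_once and the TrackerState structure are assumptions introduced here, not part of the embodiment.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class TrackerState:
    position: int = 0   # index of the last tracked position in the text to be tracked

def tracking_loop(target_text: str,
                  audio_chunks: Iterable[bytes],
                  asr_recognize: Callable[[bytes], str],
                  detect_language: Callable[[str], str],
                  track_once: Callable[[str, str, int, str], int]):
    """Iterative voice tracking: the result at each moment becomes the
    initial position for the next moment (a sketch, not the patented code)."""
    state = TrackerState()
    for chunk in audio_chunks:
        recognized = asr_recognize(chunk)            # recognition text at the current moment
        lang = detect_language(recognized)           # first / second / mixed language type
        state.position = track_once(target_text, recognized, state.position, lang)
        yield state.position                         # presented (with an advance offset) in the UI
```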
S2: determining a plurality of candidate positions and candidate text segments corresponding to each candidate position according to the initial position, the preset total number of the candidate positions and the preset text length; wherein the total number of candidate positions is used for indicating the number of candidate positions, and the text length is used for indicating the text length of each candidate text segment.
In this embodiment, a candidate position list is constructed based on the initial position, where the candidate position list involves a total number of candidate positions, a preset text length, and the candidate text segments. Specifically, the total number of candidate positions characterizes the number of candidate positions; each candidate position corresponds to one candidate text segment; and the text length indicates the text length of each candidate text segment. It should be noted that the present application does not specifically limit the total number of candidate positions, the preset text length, or the setting of the candidate text segments, which can be designed according to the actual situation. For example, different text types may use different values of the total number N of candidate positions, and the value of N can be adjusted automatically by determining, from the text to be tracked and the initial position, the type of the text that follows.
Fig. 2 is a flow chart of an alternative process for determining candidate text segments according to an embodiment of the present application, and with reference to fig. 2, the specific process is as follows:
In the case that the recognition text is of the first language type, determining a plurality of candidate positions and the candidate text segment corresponding to each candidate position according to the initial position, the preset total number of candidate positions and the preset text length includes the following steps:
S21, selecting a candidate character set according to the total number of candidate positions, where the candidate character set includes a plurality of first characters; the first characters are a plurality of consecutive characters after the initial position, the number of first characters is equal to the total number of candidate positions, and each first character corresponds to one candidate position.
S22, determining the candidate text segments according to the text length and the first characters; each candidate text segment includes a number of consecutive characters equal to the text length, and the corresponding first character is located at the end of its candidate text segment.
Illustratively, the total number N of candidate positions is 10, and the text length K corresponding to each candidate position is 5. Take the following text as an example: "I feel you like a whole life all have to go on your way like cumin of Xiangke search for a whole world's temple …… that does not exist at all." The initial position is "fatigue". Starting from "fatigue", the 10 characters after "fatigue" are selected as the first characters, respectively corresponding to "ground, search, find, one, seat, complete, incomplete, absent, present". These 10 characters correspond to 10 candidate positions. Then, taking each of these characters as an end point, the K-1 characters immediately before it are extracted to obtain the candidate text segment for that candidate position.
That is, on the basis of each character, the 4 preceding characters are taken: if the characters before "ground" are "cumin, not, tired", the candidate text segment corresponding to "ground" is "cumin not tired"; if the characters before "seat" are "ground, search, find, one", the candidate text segment corresponding to "seat" is "ground search one". By analogy, all the candidate positions are as follows:
candidate position 1: cumin not tired;
candidate position 2: searching cumin for the people without fatigue;
candidate position 3: searching without fatigue;
candidate position 4: searching for one in the tired place;
……
candidate position 10: is completely absent.
Therefore, each candidate position corresponds to one candidate text segment, and the text length of each candidate text segment is the same.
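As a concrete reading of the selection just described, the sketch below builds the N candidate end positions after the initial position and the K-character segment ending at each of them. The parameter names and the clamping at the text boundaries are assumptions made for illustration; this is not the patented implementation.

```python
def candidate_segments(target_text: str, init_pos: int, n: int = 10, k: int = 5):
    """For each of the n characters after init_pos, return (position, segment),
    where segment is the k-character span ending at that position."""
    candidates = []
    for offset in range(1, n + 1):
        pos = init_pos + offset
        if pos >= len(target_text):
            break                        # stop at the end of the text to be tracked
        start = max(0, pos - k + 1)      # the k-1 preceding characters plus the character itself
        candidates.append((pos, target_text[start:pos + 1]))
    return candidates
```

With n = 10 and k = 5 this mirrors the ten 5-character candidate text segments of the worked example above.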
It should be noted that, during the practical operation of the prompter, there may be some cases where the user repeatedly reads the previous content, and for this reason, the present application also provides an implementation manner for determining a plurality of candidate positions and candidate text segments corresponding to each candidate position. Fig. 3 is a flowchart of an alternative process for determining candidate text segments according to an embodiment of the present application, and with reference to fig. 3, the specific process is as follows:
S31, selecting a candidate character set according to the total number of the candidate positions, wherein the candidate character set comprises a plurality of first characters and second characters; the first characters are a plurality of continuous characters after the initial position, the second characters comprise characters corresponding to the initial position and a plurality of continuous characters before the initial position, and the total number of the first characters and the second characters is equal to the total number of the candidate positions; each of the first character and the second character corresponds to a candidate position.
S32, determining candidate text segments according to the text length, the first character and the second character; the candidate text segments comprise a plurality of continuous characters with the corresponding number of text lengths; the first character is located at the last of the corresponding candidate text segment and the second character is located at the last of the corresponding candidate text segment.
Illustratively, the total number N of candidate positions is again 10 and the text length K corresponding to each candidate position is 5, with the same example text as above and the initial position "fatigue". Let the number of second characters be M, so that the number of first characters is N-M. Still following the example above, M is set to 4: "fatigue" itself and the three consecutive characters before it are selected as the second characters, and the six consecutive characters after "fatigue" are selected as the first characters; the corresponding candidate positions are then determined on the basis of each of these characters. Generally, M is set much smaller than N-M.
Then, taking each character as an end point, the K-1 characters immediately before it are extracted to obtain the candidate text segment for each candidate position.
Here the candidate character set is "cumin, not, tired, ground, seek, find, one, seat, finish". On the basis of each character, the 4 preceding characters are taken: if the characters before "ground" are "cumin, not, tired", the candidate text segment corresponding to "ground" is "cumin not tired"; if the characters before "seat" are "ground, search, find, one", the candidate text segment corresponding to "seat" is "ground search one". By analogy, all the candidate positions are as follows (a sketch of this construction follows the list):
candidate position 1: shangxi Cumin;
candidate position 2: cumin, zikeli;
candidate position 3: cuminbu of Xiangke;
candidate position 4: cumin is not tired;
……
candidate position 10: and finding one seat.
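The re-reading case just illustrated can be covered by also admitting the initial-position character and the characters before it as candidate end positions. The sketch below follows the same assumptions as the previous one, with M look-back characters and N-M look-ahead characters.

```python
def candidate_segments_with_lookback(target_text: str, init_pos: int,
                                     n: int = 10, m: int = 4, k: int = 5):
    """Candidate end positions: init_pos and the m-1 characters before it
    (second characters) plus the n-m characters after it (first characters)."""
    ends = list(range(init_pos - m + 1, init_pos + 1))          # second characters
    ends += list(range(init_pos + 1, init_pos + 1 + (n - m)))   # first characters
    candidates = []
    for pos in ends:
        if 0 <= pos < len(target_text):
            start = max(0, pos - k + 1)
            candidates.append((pos, target_text[start:pos + 1]))
    return candidates
```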
In the case that the text is identified as the second language type, the present application further provides two implementation manners for determining a plurality of candidate positions and a candidate text segment corresponding to each candidate position, where the first implementation manner specifically includes the following steps:
determining a plurality of candidate positions and candidate text segments corresponding to each candidate position according to the initial position, the preset total number of candidate positions and the preset text length, wherein the method comprises the following steps:
Selecting a candidate word set according to the total number of the candidate positions, wherein the candidate word set comprises a plurality of first words; the first words are a plurality of continuous words after the initial position, and the number of the first words is equal to the total number of the candidate positions; each first word corresponds to a candidate position respectively;
determining candidate text segments according to the text length and the first word; the candidate text segments comprise a plurality of continuous words with the corresponding number of text lengths; the first word is located at the last position of the corresponding candidate text segment.
In this embodiment, the second method specifically includes the following steps:
selecting a candidate word set according to the total number of the candidate positions, wherein the candidate word set comprises a plurality of first words and second words; the first words are a plurality of continuous words after the initial position, the second words comprise words corresponding to the initial position and a plurality of continuous words before the initial position, and the total number of the first words and the second words is equal to the total number of the candidate positions; each first word and each second word respectively correspond to a candidate position;
determining candidate text segments according to the text length, the first word and the second word; the candidate text segments comprise a plurality of continuous words with the corresponding number of text lengths; the first word is located at the last position of the corresponding candidate text segment, and the second word is located at the last position of the corresponding candidate text segment.
It should be noted that, in the case that the recognized text is of the second language type, the initial position, the preset total number of candidate positions, the preset text length, and the determination manner of the candidate text segment corresponding to each candidate position are all the same as those of the recognized text of the first language type, and are not described herein again.
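For the second language type, the same construction is therefore applied with the word rather than the character as the unit. A minimal sketch, assuming the word list is obtained by simple whitespace tokenization (the patent does not fix the tokenizer):

```python
def candidate_word_segments(words: list, init_word_idx: int, n: int = 10, k: int = 4):
    """Same construction as for characters, but over a list of words
    (e.g. words = target_text.split())."""
    candidates = []
    for offset in range(1, n + 1):
        idx = init_word_idx + offset
        if idx >= len(words):
            break
        start = max(0, idx - k + 1)
        candidates.append((idx, words[start:idx + 1]))
    return candidates
```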
In this embodiment, the total number of candidate positions set when the recognized text is of the first type of language is smaller than the total number of candidate positions set when the recognized text is of the second type of language or a mixed type of the first type of language and the second type of language.
S3: and performing similarity matching on the candidate text segment corresponding to each candidate position and the recognition text, and determining a text matching result.
Under the condition that the recognized text is of the first language type, performing similarity matching on the candidate text segment corresponding to each candidate position and the recognized text to determine a text matching result; the method comprises the following steps:
determining a first text matching value corresponding to each candidate text segment according to the similarity between each candidate text segment and the recognition text;
determining a distance penalty value corresponding to each candidate text segment according to the candidate positions, wherein the distance penalty value is used for representing the text distance between the candidate positions and the initial position;
Determining a second text matching value corresponding to each candidate text segment according to the first text matching value and the distance penalty value corresponding to each candidate text segment;
and determining a text matching result according to the second text matching values corresponding to the candidate text segments.
In a specific implementation, a first formula is adopted to obtain the first text matching value $m_j$ for the j-th candidate position. The first formula is as follows:

$m_j = \sum_{i=1}^{K} w_i \cdot \mathrm{eq}\left(p^{rec}_i,\ p^{j}_i\right)$

where $p^{rec}_i$ denotes the pinyin of the i-th character in the recognition text, $p^{j}_i$ denotes the pinyin of the i-th character in the j-th candidate position, $\mathrm{eq}(\cdot,\cdot)$ compares the pinyin of the i-th character in the recognition text with the pinyin of the i-th character in the j-th candidate position and outputs 1 if they are the same and 0 otherwise, and $w_i$ is the position weight described below.
In this embodiment, according to a plurality of candidate positions, a distance penalty value corresponding to each candidate text segment is determined, and the distance penalty value is used for characterizing a text distance between the candidate position and the initial position.
In this case, the speech of the user is generally consecutive, so the position where the user speaks at the current time should not be too far away from the initial position in theory. Further, a candidate position with a higher priority (a candidate position with a smaller j) located at a position earlier in the candidate position list is given, and a candidate position with a lower priority (a candidate position with a larger j) located at a position later is given. That is, for a candidate position with a smaller j, the distance penalty is smaller, and the influence on the similarity result is smaller. For candidate positions with larger j, the distance penalty value is larger, and the influence on the similarity result is large. It can also be understood that the position candidates corresponding to the closer to the initial position a have smaller distance penalty values, and the position candidates corresponding to the farther from the initial position have larger penalty values.
In the embodiment of the application, a second text matching value corresponding to each candidate text segment is determined according to a first text matching value and a distance penalty value corresponding to each candidate text segment; the second text matching value is obtained by adopting a second formula, and the second formula is as follows:
$\mathrm{sim}_j = \dfrac{m_j}{K} - \alpha \cdot \dfrac{j}{j_{max}}$

where j ranges over the candidate positions from 0 to $j_{max}$; $j_{max}$ denotes the largest candidate position, i.e. the last candidate position in the candidate position list; K denotes the text length corresponding to the j-th candidate position; $\alpha$ denotes the maximum penalty allowed, typically 0.1; and $\alpha \cdot j / j_{max}$ is the distance penalty value.
In the embodiment of the application, a text matching result is determined according to the second text matching values corresponding to the candidate text segments. The text matching result is obtained by adopting a third formula, wherein the third formula is as follows:
$j_{best} = \arg\max_j\left(\mathrm{sim}_j\right)$

where $j_{best}$ denotes the candidate position that best matches the target, i.e. the candidate position most similar to the recognition text.
In the embodiment of the application, each character in the text corresponding to the j-th candidate position is compared by pinyin with the character at the corresponding position in the recognition text; 1 is output if they are the same and 0 if they are different. A position weight $w_i$ is then introduced on top of this output, where the position weight is the weight assigned according to the position of the character within the text corresponding to the candidate position. For voice tracking, whether the end of the text is similar matters more than similarity elsewhere in the text; specifically, the closer a character is to the end of the text, the higher its weight, and vice versa.
The index i above denotes the position of the character in the recognition text or in the candidate text, counted from the end of the text. For example, in candidate position 1 of the example above, the i corresponding to "ground" is 1 and the i corresponding to "not" is 3. It should be noted that the position i in each candidate text segment starts from the end of that text.
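Putting the pinyin comparison, the position weight and the distance penalty together, a minimal scoring sketch for the first language type might look as follows. The pinyin conversion is passed in as a function, and the reciprocal position weight 1/i and the exact combination of match value and penalty are assumptions made for illustration: the patent fixes the roles of these quantities, not their precise form.

```python
def score_candidates_zh(recognized: str, candidates, to_pinyin, alpha: float = 0.1):
    """Pick the candidate segment most similar to the recognition text.
    candidates: list of (position, segment) ordered as in the candidate position list."""
    j_max = len(candidates) - 1
    best = (None, float("-inf"))
    for j, (pos, segment) in enumerate(candidates):
        k = len(segment)
        match = 0.0
        # i counts characters from the end of the text (i = 1 is the last character)
        for i in range(1, min(k, len(recognized)) + 1):
            same = to_pinyin(recognized[-i]) == to_pinyin(segment[-i])
            match += (1.0 / i) if same else 0.0            # assumed weight: higher near the end
        penalty = alpha * (j / j_max if j_max > 0 else 0.0)  # distance penalty grows with j
        sim = match / k - penalty
        if sim > best[1]:
            best = (pos, sim)
    return best   # (voice tracking position, similarity score)
```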
Under the condition that the recognized text is of the second language type, similarity matching is carried out on the candidate text segment corresponding to each candidate position and the recognized text, and a text matching result is determined, wherein the method comprises the following steps:
determining a third text matching value corresponding to each candidate text segment according to the similarity between each candidate text segment and the recognition text;
determining a first comparison value and a second comparison value corresponding to each candidate text segment according to the third text matching value; the first comparison value is used for representing the same proportion of letters corresponding to each letter position in two words at the same position of the candidate text segment and the recognition text; the second comparison value is used for representing the proportion of the same letter in two words at the same position of the candidate text segment and the recognition text to the total number of the letters;
determining a distance penalty value corresponding to each candidate text segment according to the candidate positions, wherein the distance penalty value is used for representing the text distance between the candidate positions and the initial position;
Determining a fourth text matching value corresponding to each candidate text segment according to the first comparison value, the second comparison value and the distance penalty value corresponding to each candidate text segment;
and determining a text matching result according to the fourth text matching values corresponding to the candidate text segments.
In the embodiment of the present application, when the text is recognized as the second language type, the text is distinguished from the first language type by using the word as the minimum unit for positioning, intercepting and similarity comparison. Specifically, according to the similarity between each candidate text segment and the recognition text, determining a third text matching value corresponding to each candidate text segment;
in concrete implementation, a fourth formula is adopted to obtain a third text matching value
Figure BDA0003561014910000091
The fourth formula is as follows:
Figure BDA0003561014910000092
wherein the content of the first and second substances,
Figure BDA0003561014910000093
indicates the ith word in the jth candidate position (hereinafter
Figure BDA0003561014910000094
) The g-th letter;
Figure BDA0003561014910000095
indicating that the ith word in the text is recognized (hereinafter
Figure BDA0003561014910000096
) The g-th letter. It should be noted that the ith character and the g-th letter are derived from the last character/letter of the text or word as the starting point, i.e., i or g corresponding to the last character/letter of the text or word is recorded as 1, i or g corresponding to the second last character/letter is recorded as 2, and so on.
Figure BDA0003561014910000097
The expression is that the ith word in the recognized text is compared with the ith word in the text corresponding to the jth candidate position, and 1 is output if the g-th letters in the two words are the same, and 0 is output if the g-th letters in the two words are different.
In the embodiment of the application, a first comparison value and a second comparison value corresponding to each candidate text segment are determined according to the third text matching value; the first comparison value is used for representing the same proportion of letters corresponding to each letter position in two words at the same position of the candidate text segment and the recognition text; the second comparison value is used for representing the proportion of the same letter in two words at the same position of the candidate text segment and the recognition text to the total number of the letters;
In a specific implementation, a fifth formula is adopted to obtain the first comparison value and the second comparison value. The fifth formula is as follows:

$l_1(w_1, w_j) = \dfrac{1}{z}\sum_{g=1}^{z} \mathrm{eq}\left(w_{1,g},\ w_{j,g}\right), \qquad l_2(w_1, w_j) = \dfrac{\mathrm{Intersection}(w_1, w_j)}{\mathrm{Len}(w_1, w_j)}$

where $w_1$ denotes the i-th word in the recognition text (with i and g counted from the end of the word and text), and $w_j$ denotes the i-th word in the j-th candidate position. The first comparison value $l_1(w_1, w_j)$ gives, for the i-th word in the recognition text and the i-th word in the text corresponding to the j-th candidate position, the proportion of letter positions at which the letters of the two words are the same, where z denotes the total number of letters of the i-th word. The second comparison value $l_2(w_1, w_j)$ gives the proportion of letters shared by the two words out of the total number of letters, i.e. the letters contained in the two words are built into two sets and the overlap proportion of the letters in the two sets is compared; the Intersection function gives the number of coinciding letters of the letter sets of $w_1$ and $w_j$, and the Len function gives the number of letters in the corresponding letter set.
In the embodiment of the application, a distance penalty value corresponding to each candidate text segment is determined according to a plurality of candidate positions, and the distance penalty value is used for representing the text distance between the candidate positions and the initial position;
determining a fourth text matching value corresponding to each candidate text segment according to the first comparison value, the second comparison value and the distance penalty value corresponding to each candidate text segment;
in concrete implementation, a sixth formula is adopted to obtain a fourth text matching value, and the sixth formula is as follows:
$\mathrm{sim}_j = \dfrac{1}{K}\sum_{i=1}^{K} l\left(w^{rec}_i,\ w^{j}_i\right) - \alpha \cdot \dfrac{j}{j_{max}}$

where j ranges over the candidate positions from 0 to $j_{max}$; $j_{max}$ denotes the largest candidate position, i.e. the last candidate position in the candidate position list (when the subsequent text mixes more English with the Chinese, $j_{max}$ takes a larger value); $l(w^{rec}_i, w^{j}_i)$ is the word-level similarity of the i-th words obtained by weighting the first and second comparison values; K denotes the text length (number of words) corresponding to the j-th candidate position; $\alpha$ denotes the maximum penalty allowed, typically 0.1; and $\alpha \cdot j / j_{max}$ is the distance penalty value.
In the embodiment of the application, a text matching result is determined according to fourth text matching values corresponding to a plurality of candidate text segments.
In concrete implementation, a seventh formula is adopted to obtain a text matching result, and the seventh formula is as follows:
$j_{best} = \arg\max_j\left(\mathrm{sim}_j\right)$

where $j_{best}$ denotes the position that best matches the target, i.e. the candidate position most similar to the recognition text.
It should be noted that, because English is affected by grammar such as tense and singular/plural forms, the following situation arises in practical use of a prompter product: when speaking, the user misreads a word in the text to be tracked as another word, so that the recognition text differs from the text to be tracked. From the perspective of voice tracking, however, the position to be tracked is still the position of the corresponding word in the text to be tracked. Illustratively, the user reads "throw" in the text to be tracked as "threw". Although the words are different, the difference is caused by the user's slip, and the position that needs to be tracked is still the position corresponding to "throw" in the text to be tracked.
To handle this, when comparing each word in the text corresponding to the j-th candidate position letter by letter with the word at the corresponding position in the recognition text, the present application adopts a combined calculation of sequential comparison and non-sequential comparison rather than a simple one-to-one comparison, where the sequential comparison refers to the first comparison value obtained by the formula above, and the non-sequential comparison refers to the second comparison value obtained by the formula above.
Specifically, when calculating $l(w_1, w_j)$, the i-th word in the recognition text (called word 1) is first compared letter by letter, in letter order, with the i-th word in the text corresponding to the j-th candidate position (called word 2). If the letters at the same position in word 1 and word 2 are the same, $\mathrm{eq}(\cdot,\cdot)$ outputs 1; otherwise it outputs 0. Accumulating these results over all the letters in words 1 and 2 determines the similarity of word 1 and word 2 under the sequential comparison, i.e. the first comparison value $l_1(w_1, w_j)$. On this basis, all the letters in word 1 and all the letters in word 2 are further built into two sets, and the overlap proportion of the letters in the two sets is compared, i.e. the second comparison value $l_2(w_1, w_j)$, which determines the similarity of word 1 and word 2 under the non-sequential comparison. Finally, the two values are combined with different weights, which finally determines the similarity $l(w_1, w_j)$ of word 1 and word 2.
In the above embodiment, the word is used as the minimum unit for positioning, truncation and similarity comparison, and the text length of the recognition text and of each candidate position text may be 4, i.e. four words, so that the calculated similarity of the two texts is expressed as an overall word similarity.
On this basis, the present application may alternatively use the letter as the minimum unit for positioning, truncation and similarity comparison; the specific calculation is the same as the scheme for recognition text of the first language type and is not repeated here. In this scheme, the text length of the recognition text and of each candidate position text, i.e. the number of letters, can generally be 10 to 20, and the calculated similarity of the two texts is expressed as a coincidence similarity.
Further, for the case where the recognized text is of the second language type, the above-mentioned overall word similarity and the coincidence similarity may be calculated simultaneously in the actual calculation process, and finally the maximum value or the average value is taken as the final similarity result.
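For the second language type, the per-word similarity thus combines the sequential comparison and the letter-set overlap before the same distance penalty is applied. The sketch below assumes an equal weighting of the two comparison values and the same penalty form as above; the patent only states that the values are weighted and combined.

```python
def word_similarity(w1: str, w2: str, weight: float = 0.5) -> float:
    """Combine the sequential letter comparison and the letter-set overlap
    (the 50/50 weighting is an assumption)."""
    w1, w2 = w1.lower(), w2.lower()
    if not w1 or not w2:
        return 0.0
    z = max(len(w1), len(w2))
    # sequential comparison: same letter at the same position, counted from the word end
    seq = sum(1 for g in range(1, min(len(w1), len(w2)) + 1) if w1[-g] == w2[-g]) / z
    # non-sequential comparison: overlap of the two letter sets
    s1, s2 = set(w1), set(w2)
    overlap = len(s1 & s2) / len(s1 | s2)
    return weight * seq + (1.0 - weight) * overlap

def score_candidates_en(recognized_words: list, candidates, alpha: float = 0.1):
    """candidates: list of (position, word_list); returns the best (position, score)."""
    j_max = len(candidates) - 1
    best = (None, float("-inf"))
    for j, (pos, words) in enumerate(candidates):
        pairs = zip(reversed(recognized_words), reversed(words))   # align from the end
        total = sum(word_similarity(rw, cw) for rw, cw in pairs)
        sim = total / len(words) - alpha * (j / j_max if j_max > 0 else 0.0)
        if sim > best[1]:
            best = (pos, sim)
    return best
```

Under these assumptions, the misread pair "throw"/"threw" still scores highly (4/5 on the sequential comparison and 4/6 on letter overlap), so the tracked position stays on "throw" in the text to be tracked.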
In the embodiment of the present application, when the text contains numbers, the way they are spoken differs with the kind of number. For example, for a telephone number or other digits without the concepts of magnitude, order and number of digits, the user reads each digit directly. For numbers carrying the concepts of amounts of money, counts and the like, the user generally adds other content such as "ten, hundred, million" when speaking. For this, the text to be tracked can be detected in advance by a preset NLP model to identify the number type in advance: if the number is of the telephone class, the second language type scheme is adopted, and if the number is of the money class, the first language type scheme is adopted.
In the embodiment of the present application, when the recognition text is of the third language type, the first language type scheme and the second language type scheme can be applied in combination. The similarity calculation, the candidate positions, the total number of candidate positions, the text length and the candidate text segment corresponding to each candidate position are the same as described above and are not repeated here. The present application does not specifically limit this, and it can be configured according to the actual situation. In this way, voice tracking can be achieved when the recognition text mixes Chinese and English, which suits application scenarios where the text to be tracked mixes Chinese and English, and the accuracy of the voice tracking display is improved by the corresponding calculation and adjustment. In addition, the voice tracking in the present application proceeds iteratively, so the overall positioning and tracking effect always remains in a good state and occasional positioning and tracking errors are avoided.
S4: and determining the voice tracking position at the current moment according to the text matching result, and presenting the voice tracking position in a user interface.
Illustratively, if "candidate position 2" gives the best match, the voice tracking position at the current moment is the position, in the text to be tracked, of the last character of the candidate text segment corresponding to candidate position 2, and this voice tracking position is presented in the user interface.
In the embodiment of the application, the user interface displays the text to be tracked, and the voice tracking position is the position, in the text to be tracked, of the text at the end of the corresponding candidate text segment. Presenting the voice tracking position in the user interface further comprises: marking the voice tracking position at a display tracking position of the text to be tracked, and presenting the display tracking position in the user interface; wherein the display tracking position is a text position after the voice tracking position.
In a specific implementation, after the voice tracking position has been determined, it is taken into account that the prompter needs a certain amount of time to perform speech recognition and the above similarity calculation. During that time the user may already have uttered subsequent content, so when the voice tracking position is presented in the user interface, the position marked for display lies after the voice tracking position. Illustratively, if the voice tracking position is the last text unit of the best-matching candidate text segment, the user has most likely continued reading by the time the calculation finishes, so the tracking position finally presented in the user interface is shifted backwards from that text unit by a preset number of characters. The preset number of characters can be set according to the actual situation. Through this look-ahead mechanism, the actual speaking position of the user does not run ahead of the voice tracking position displayed on the user interface during actual use, so the user does not perceive any delay.
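A minimal sketch of this look-ahead adjustment, where lookahead_chars stands for the 'preset number of characters' mentioned above and its default value is an assumption:

```python
def display_tracking_position(voice_tracking_position, text_length, lookahead_chars=3):
    # Shift the displayed position a few characters past the voice tracking
    # position to compensate for the time spent on speech recognition and
    # similarity calculation, without running past the end of the text.
    return min(voice_tracking_position + lookahead_chars, text_length - 1)
```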
It should be understood that, although the steps in the flowcharts of fig. 1 to 3 are shown in the sequence indicated by the arrows, they are not necessarily performed in that sequence. Unless explicitly stated otherwise herein, these steps are not bound to a strict order and may be performed in other orders. Moreover, at least some of the steps in fig. 1 to 3 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times; their order of execution is not necessarily sequential, and they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In some embodiments, fig. 4 is a schematic structural diagram of an optional prompter according to embodiments of the present application. Referring to fig. 4, the prompter includes:
the ASR module is configured to obtain a recognition text according to the voice data input by the user at the current moment;
a speech tracking module configured to determine an initial position in the recognized text; the initial position is a text position determined by voice tracking at the last moment; determining a plurality of candidate positions and candidate text segments corresponding to each candidate position according to the initial position, the preset total number of the candidate positions and the preset text length; the total number of the candidate positions is used for indicating the number of the candidate positions, and the text length is used for indicating the text length of each candidate text fragment; carrying out similarity matching on the candidate text segment corresponding to each candidate position and the recognition text to determine a text matching result; determining the voice tracking position at the current moment according to the text matching result;
A display module configured to present the voice-tracking location in a user interface.
It should be noted that the user interface displays the text to be tracked, and the speech tracking position is a position of the text at the end of the candidate text segment corresponding to the text to be tracked.
In presenting the voice tracking location in the user interface, the display module is further configured to:
marking the voice tracking position at a display tracking position of the text to be tracked, and presenting the display tracking position in a user interface; wherein the display tracking position is a text position after the voice tracking position.
Further, the prompter provided by the present application is further configured to perform all the steps in the tracking display method, which is not described herein again.
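To make the module split of fig. 4 concrete, a structural sketch follows; the method names (recognize, track, present) are assumptions, and each module's internals are placeholders for the steps described above:

```python
class Prompter:
    # Mirrors the three modules of fig. 4; only the data flow is shown.
    def __init__(self, asr_module, tracking_module, display_module):
        self.asr = asr_module            # voice data -> recognition text
        self.tracker = tracking_module   # recognition text -> voice tracking position
        self.display = display_module    # tracking position -> user interface

    def on_voice_data(self, voice_data, last_position):
        recognition_text = self.asr.recognize(voice_data)
        position = self.tracker.track(recognition_text, initial_position=last_position)
        self.display.present(position)
        return position
```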
According to another aspect of the embodiments of the present application, there is also provided an electronic device for implementing the above tracking display method, which may be applied in a server, but is not limited thereto. As shown in fig. 5, the electronic device comprises a memory 502 and a processor 504, the memory 502 having stored therein a computer program, the processor 504 being arranged to perform the steps of any of the above-described method embodiments by means of the computer program.
Optionally, in this embodiment, the electronic device may be located in at least one network device of a plurality of network devices of a computer network.
It should be noted that, in this embodiment, the processor may be configured to execute all the steps in the tracking display method through a computer program, and details are not described herein again.
Alternatively, it can be understood by those skilled in the art that the structure shown in fig. 5 is only illustrative, and the electronic device may also be a terminal device such as a smart phone (e.g., an Android phone or an iOS phone), a tablet computer, a palmtop computer, a Mobile Internet Device (MID), or a PAD. Fig. 5 does not limit the structure of the electronic device. For example, the electronic device may include more or fewer components (e.g., network interfaces) than shown in fig. 5, or have a different configuration from that shown in fig. 5.
The memory 502 may be used to store software programs and modules, such as program instructions/modules corresponding to the voice tracking method and apparatus in the embodiment of the present application, and the processor 504 executes various functional applications and data processing by running the software programs and modules stored in the memory 502, so as to implement the voice tracking method. The memory 502 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 502 may further include memory located remotely from the processor 504, which may be connected to the terminal over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof. The memory 502 may be used for storing program steps of the voice tracking method, but is not limited thereto.
Optionally, the transmission device 506 is used for receiving or sending data via a network. Examples of the network may include a wired network and a wireless network. In one example, the transmission device 506 includes a network adapter (Network Interface Controller, NIC) that can be connected to other network devices and a router via a network cable so as to communicate with the internet or a local area network. In another example, the transmission device 506 is a Radio Frequency (RF) module, which is used for communicating with the internet in a wireless manner. In addition, the electronic device further includes: a display 508 for displaying the process of the voice tracking method; and a connection bus 510 for connecting the respective module parts in the above-described electronic device.
Embodiments of the present application further provide a computer-readable storage medium having a computer program stored therein, wherein the computer program is configured to perform the steps of any of the above method embodiments when executed.
In this embodiment, the storage medium may be configured to store a computer program for executing all steps in the tracking display method, which is not described herein again.
Optionally, the storage medium is further configured to store a computer program for executing the steps included in the method in the foregoing embodiment, which is not described in detail in this embodiment.
Alternatively, in this embodiment, a person skilled in the art may understand that all or part of the steps in the methods of the foregoing embodiments may be implemented by a program instructing hardware associated with the terminal device, where the program may be stored in a computer-readable storage medium, and the storage medium may include: flash disks, Read-Only memories (ROMs), Random Access Memories (RAMs), magnetic or optical disks, and the like.
The present application provides a tracking display method, a prompter, an electronic device, and a storage medium, applicable to different languages and context scenarios. A corresponding tracking display scheme and language logic are determined by judging the language type. Specifically, a positioning-based voice tracking mode is adopted, and voice tracking is realized by screening candidate positions and comparing similarities. On the one hand, the voice tracking and positioning process does not strictly depend on the quality of speech recognition, so the positioning precision is not affected by speech recognition errors. On the other hand, during positioning, possible positions are screened based on an iterative approach, and the comparison object is not the entire text to be tracked, which shortens processing time while still achieving accurate positioning according to the content the user actually speaks, even in the presence of speech recognition errors and misspeaking. Further, the calculation method and formula, the positions and distances of characters or letters, and the like are optimized, and corresponding weights are assigned and adjusted, thereby realizing fast and accurate voice tracking and avoiding the problem that the user's speaking position cannot be accurately identified when the user changes speed, pauses, or misspeaks while reading.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments. The integrated unit in the above embodiments, if implemented in the form of a software functional unit and sold or used as an independent product, may be stored in the above computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing one or more computer devices (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods of the embodiments of the present application.
In the above embodiments of the present application, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed client may be implemented in other manners. The above-described embodiments of the apparatus are merely illustrative, and for example, a division of a unit is merely a division of a logic function, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The foregoing is only a preferred embodiment of the present application and it should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the present application, and these improvements and modifications should also be considered as the protection scope of the present application.

Claims (14)

1. A method of tracking display, the method comprising:
obtaining an identification text according to voice data input by a user at the current moment, and determining an initial position in the identification text; the initial position is a text position determined by voice tracking at the last moment;
Determining a plurality of candidate positions and a candidate text segment corresponding to each candidate position according to the initial position, the preset total number of candidate positions and a preset text length; wherein the total number of candidate positions is used for indicating the number of candidate positions, and the text length is used for indicating the text length of each candidate text segment;
performing similarity matching on the candidate text segment corresponding to each candidate position and the recognition text, and determining a text matching result;
and determining the voice tracking position at the current moment according to the text matching result, and presenting the voice tracking position in a user interface.
2. The method according to claim 1, wherein in a case that the recognized text is of the first language type, determining a plurality of candidate positions and a candidate text segment corresponding to each candidate position according to the initial position, a preset total number of candidate positions and a preset text length comprises:
selecting a candidate character set according to the total number of the candidate positions, wherein the candidate character set comprises a plurality of first characters; wherein the first character is a plurality of continuous characters after the initial position, and the number of the first character is equal to the total number of the candidate positions; each first character corresponds to one candidate position respectively;
Determining the candidate text fragment according to the text length and the first character; wherein the candidate text segment comprises a plurality of continuous characters with the number corresponding to the text length; the first character is located at the last position of the corresponding candidate text segment.
3. The method of claim 1, wherein determining a plurality of candidate positions and a candidate text segment corresponding to each candidate position according to the initial position, a preset total number of candidate positions and a preset text length comprises:
selecting a candidate character set according to the total number of the candidate positions, wherein the candidate character set comprises a plurality of first characters and second characters; the first characters are a plurality of continuous characters after the initial position, the second characters comprise characters corresponding to the initial position and a plurality of continuous characters before the initial position, and the total number of the first characters and the second characters is equal to the total number of the candidate positions; each first character and each second character respectively correspond to one candidate position;
determining the candidate text segments according to the text length, the first character and the second character; wherein the candidate text segment comprises a plurality of continuous characters with the number corresponding to the text length; the first character is located at the last position of the corresponding candidate text segment, and the second character is located at the last position of the corresponding candidate text segment.
4. The method according to claim 2 or 3, characterized in that the candidate text segment corresponding to each candidate position is subjected to similarity matching with the recognition text, and a text matching result is determined; the method comprises the following steps:
determining a first text matching value corresponding to each candidate text segment according to the similarity between each candidate text segment and the recognition text;
determining a distance penalty value corresponding to each candidate text segment according to the plurality of candidate positions, wherein the distance penalty value is used for characterizing the text distance between the candidate positions and the initial position;
determining a second text matching value corresponding to each candidate text segment according to the first text matching value corresponding to each candidate text segment and the distance penalty value;
and determining the text matching result according to the second text matching values corresponding to the candidate text fragments.
5. The method according to claim 1, wherein the user interface displays a text to be tracked, and the voice tracking position is a position of the candidate text segment end text corresponding to the text to be tracked; the presenting the voice tracking location in a user interface includes:
Marking the voice tracking location at a display tracking location of the text to be tracked, and presenting the display tracking location in the user interface; wherein the display tracking location is a text location after the voice tracking location.
6. The method according to claim 1, wherein in a case that the recognized text is of a second language type, determining a plurality of candidate positions and a candidate text segment corresponding to each candidate position according to the initial position, a preset total number of candidate positions and a preset text length comprises:
selecting a candidate word set according to the total number of the candidate positions, wherein the candidate word set comprises a plurality of first words; wherein the first words are a plurality of continuous words after the initial position, and the number of the first words is equal to the total number of the candidate positions; each first word corresponds to one candidate position;
determining the candidate text segment according to the text length and the first word; wherein the candidate text segment comprises a plurality of continuous words with the number corresponding to the text length; the first word is located at the last position of the corresponding candidate text segment.
7. The method of claim 1, wherein determining a plurality of candidate positions and a candidate text segment corresponding to each candidate position according to the initial position, a preset total number of candidate positions and a preset text length comprises:
selecting a candidate word set according to the total number of the candidate positions, wherein the candidate word set comprises a plurality of first words and second words; wherein the first words are a plurality of continuous words after the initial position, the second words comprise words corresponding to the initial position and a plurality of continuous words before the initial position, and the total number of the first words and the second words is equal to the total number of the candidate positions; each of the first word and the second word respectively corresponds to one of the candidate positions;
determining the candidate text segment according to the text length, the first word and the second word; wherein the candidate text segment comprises a plurality of continuous words with the number corresponding to the text length; the first word is located at the last position of the corresponding candidate text segment, and the second word is located at the last position of the corresponding candidate text segment.
8. The method according to claim 6 or 7, wherein performing similarity matching between the candidate text segment corresponding to each candidate position and the recognition text, and determining a text matching result comprises:
determining a third text matching value corresponding to each candidate text fragment according to the similarity between each candidate text fragment and the recognition text;
determining a first comparison value and a second comparison value corresponding to each candidate text segment according to the third text matching value; the first comparison value is used for representing the same proportion of letters corresponding to each letter position in two words at the same position of the candidate text segment and the recognition text; the second comparison value is used for representing the proportion of the same letter in two words at the same position of the candidate text segment and the recognition text to the total number of letters;
determining a distance penalty value corresponding to each candidate text segment according to the plurality of candidate positions, wherein the distance penalty value is used for characterizing the text distance between the candidate positions and the initial position;
determining a fourth text matching value corresponding to each candidate text segment according to the first comparison value, the second comparison value and the distance penalty value corresponding to each candidate text segment;
And determining the text matching result according to the fourth text matching values corresponding to the candidate text segments.
9. The method of claim 1, further comprising:
the total number of the candidate positions set when the recognition text is of the first type of language is smaller than the total number of the candidate positions set when the recognition text is of the second type of language or the mixed type of the first type of language and the second type of language.
10. The method of claim 1, wherein prior to determining the initial position in the recognized text, further comprising:
and detecting the language type of the recognized text so as to determine the corresponding different initial positions according to different language types.
11. A prompter, characterized in that it comprises:
an ASR module configured to derive a recognition text from speech data input by a user at a current time;
a speech tracking module configured to determine an initial position in the recognized text; the initial position is a text position determined by voice tracking at the last moment; determining a plurality of candidate positions and a candidate text segment corresponding to each candidate position according to the initial position, the preset total number of candidate positions and the preset text length; wherein the total number of candidate positions is used for indicating the number of candidate positions, and the text length is used for indicating the text length of each candidate text segment; carrying out similarity matching on the candidate text segment corresponding to each candidate position and the recognition text to determine a text matching result; determining the voice tracking position at the current moment according to the text matching result;
A display module configured to present the voice-tracking location in a user interface.
12. The prompter of claim 11, wherein the user interface displays a text to be tracked, and the voice tracking position is a position of the candidate text segment end text corresponding to the text to be tracked; the presenting the voice-tracking location in a user interface, the display module further configured to:
marking the voice tracking location at a display tracking location of the text to be tracked and presenting the display tracking location in the user interface; wherein the display tracking location is a text location after the speech tracking location.
13. A computer-readable storage medium, in which a computer program is stored, wherein the computer program is configured to carry out the method of any one of claims 1 to 10 when executed.
14. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, and wherein the processor is arranged to execute the computer program to perform the method of any of claims 1 to 10.
CN202210293071.XA 2022-03-23 2022-03-23 Tracking display method, prompter, electronic device and storage medium Active CN114900590B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210293071.XA CN114900590B (en) 2022-03-23 2022-03-23 Tracking display method, prompter, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210293071.XA CN114900590B (en) 2022-03-23 2022-03-23 Tracking display method, prompter, electronic device and storage medium

Publications (2)

Publication Number Publication Date
CN114900590A true CN114900590A (en) 2022-08-12
CN114900590B CN114900590B (en) 2023-08-29

Family

ID=82715145

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210293071.XA Active CN114900590B (en) 2022-03-23 2022-03-23 Tracking display method, prompter, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN114900590B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1920819A (en) * 2006-09-14 2007-02-28 浙江大学 Writing brush calligraphy character seach method
US20120033948A1 (en) * 2010-08-06 2012-02-09 Enrique Rodriguez Teleprompter tool for voice-over tool
CN110634487A (en) * 2019-10-24 2019-12-31 科大讯飞股份有限公司 Bilingual mixed speech recognition method, device, equipment and storage medium
US20210358570A1 (en) * 2018-10-15 2021-11-18 Patsnap Limited Method and system for claim scope labeling, retrieval and information labeling of gene sequence
CN113724709A (en) * 2021-08-31 2021-11-30 北京字跳网络技术有限公司 Text content matching method and device, electronic equipment and storage medium
CN114077649A (en) * 2020-08-19 2022-02-22 上海折耳根信息科技有限公司 Voice prompt method, system, equipment and storage medium
CN114117031A (en) * 2021-12-03 2022-03-01 建信金融科技有限责任公司 List screening method, device, equipment and computer storage medium


Also Published As

Publication number Publication date
CN114900590B (en) 2023-08-29


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant