CN113225624A - Time-consuming determination method and device for voice recognition - Google Patents

Time-consuming determination method and device for voice recognition

Info

Publication number
CN113225624A
Authority
CN
China
Prior art keywords
voice
text
target
video
time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110379684.0A
Other languages
Chinese (zh)
Inventor
张�杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202110379684.0A priority Critical patent/CN113225624A/en
Publication of CN113225624A publication Critical patent/CN113225624A/en
Pending legal-status Critical Current

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/85 Assembly of content; Generation of multimedia applications
    • H04N 21/854 Content authoring
    • H04N 21/8547 Content authoring involving timestamps for synchronizing content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F 16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F 16/7834 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using audio features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F 16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F 16/7844 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/78 Detection of presence or absence of voice signals
    • G10L 25/87 Detection of discrete points within a voice signal

Abstract

The embodiment of the invention discloses a method and a device for determining the time consumed by voice recognition. A video to be detected is acquired; the text information in adjacent video frames is compared to determine the text difference information between the adjacent video frames; a target video frame containing a target text is determined from the video frames according to the text difference information, and the first display time of the target text is determined according to the timestamp of the target video frame. The user interaction voice is then extracted from the video to be detected, the first utterance time of the target text in the user interaction voice is detected, and the time consumed by the voice recognition response corresponding to the video to be detected is determined from the first display time and the first utterance time of the same target text. In this way, the dependence on manpower in determining the time consumed by the voice recognition response is reduced, human resources are saved, and, on the basis of ensuring accuracy, the efficiency of determining the time consumed by the voice recognition response is improved.

Description

Time-consuming determination method and device for voice recognition
Technical Field
The invention relates to the technical field of communication, in particular to a method and a device for determining time consumption of voice recognition.
Background
With the continuous progress of artificial intelligence technology, people can interact with electronic devices through voice commands. When a user interacts with a screen-equipped electronic device by voice, the device can display the content of the user's speech as text on the screen; in this case, the recognition performance and display speed of the device greatly affect the user experience.
In order to study the speech recognition performance and display speed of products, it is often necessary to evaluate electronic devices. The current evaluation scheme generally relies on direct manual operation: a tester repeatedly listens to the audio in a recorded video with video analysis software to determine the start time and end time of a voice command, and then inspects the video frame by frame to determine the time consumed by speech recognition. This requires substantial human resources, leads to a long evaluation cycle, and makes the evaluation result highly susceptible to human factors.
Disclosure of Invention
The embodiment of the invention provides a method and a device for determining the time consumed by voice recognition, which can reduce the dependence on manpower, save human resources, and, on the basis of ensuring accuracy, improve the efficiency of determining the time consumed by a voice recognition response.
The embodiment of the invention provides a method for determining time consumption of voice recognition, which comprises the following steps:
acquiring a video to be detected, wherein video frames in the video to be detected are provided with time stamps, the video to be detected comprises user interaction voice, text information is displayed on the video frames, and the text information on the video frames is obtained by performing voice recognition on voice information corresponding to the video frames in the user interaction voice;
comparing the text information in the adjacent video frames to determine the text difference information between the adjacent video frames;
determining a target video frame containing a target text in the video frame according to text difference information of adjacent video frames, and determining the first display time of the target text according to a timestamp of the target video frame, wherein the target video frame is the first video frame containing the target text in the video to be detected;
extracting the user interaction voice from the video to be detected, and detecting the first utterance time of the target text in the user interaction voice;
and determining the time consumed by the voice recognition response corresponding to the video to be detected according to the first display time and the first utterance time corresponding to the same target text.
Correspondingly, an embodiment of the present invention provides a device for determining time consumed for speech recognition, including:
the video acquisition unit is used for acquiring a video to be detected, video frames in the video to be detected are provided with time stamps, the video to be detected comprises user interaction voice, text information is displayed on the video frames, and the text information on the video frames is obtained by performing voice recognition on voice information corresponding to the video frames in the user interaction voice;
the text comparison unit is used for comparing the text information in the adjacent video frames and determining the text difference information between the adjacent video frames;
a first display time determining unit, configured to determine, according to the text difference information of adjacent video frames, a target video frame containing a target text in the video frames, and determine, according to a timestamp of the target video frame, the first display time of the target text, where the target video frame is the first video frame containing the target text in the video to be detected;
a first utterance time determining unit, configured to extract the user interaction voice from the video to be detected and detect the first utterance time of the target text in the user interaction voice;
and a time consumption calculation unit, configured to determine the time consumed by the voice recognition response corresponding to the video to be detected according to the first display time and the first utterance time corresponding to the same target text.
In an optional example, the text comparison unit includes a text acquisition unit and a text comparison subunit, where the text acquisition unit may be configured to perform text detection on the video frames to acquire text information in each video frame;
the text comparison subunit may be configured to calculate a text similarity between text information in adjacent video frames, where the text similarity is text difference information between adjacent video frames.
In an optional example, the target text of the embodiment of the present invention includes the first character and the last character in the user interaction speech, and the target video frames include a target first character frame, which is the first video frame in the video to be detected that contains the first character, and a target last character frame, which is the first video frame that contains the last character;
the first display time determining unit may include a target frame determining unit and a first time determining unit, where the target frame determining unit is configured to determine the target first character frame and the target last character frame in the video frames according to the text similarity between adjacent video frames and the association relationship satisfied between the target first character frame and the target last character frame of the user interactive voice and the adjacent video frames;
the first time determining unit is configured to determine, according to the timestamp of the target first character frame, the first display time of the first character in the user interaction speech, and determine, according to the timestamp of the target last character frame, the first display time of the last character in the user interaction speech;
the first utterance time determining unit is configured to extract the user interaction voice from the video to be detected, and detect the utterance time of the first character and the utterance time of the last character in the user interaction voice.
In an optional example, the target frame determining unit may be configured to determine, based on the text similarity between the adjacent video frames, a video frame whose text similarity with the video frame of the previous frame is smaller than a first similarity threshold and whose text similarity with the video frame of the next frame is not smaller than the first similarity threshold as a target first character frame of the video to be detected;
and to determine, based on the text similarity between adjacent video frames, a video frame whose text similarity with the previous video frame is not smaller than the first similarity threshold and whose text similarity with the next video frame is not smaller than a second similarity threshold as the target last character frame of the video to be detected, where the second similarity threshold is not smaller than the first similarity threshold, and the text similarity between adjacent video frames among the video frames between the target first character frame and the target last character frame is not smaller than the first similarity threshold.
In an optional example, a specific display area is set in the video frame according to the embodiment of the present invention, and the specific display area is configured to display the text information obtained by performing speech recognition on the user interaction speech;
the text acquisition unit may be configured to determine an area display image of the specific display area in the video frame;
and performing text detection on the area display image of each video frame to acquire text information displayed in a specific display area in each video frame.
In an optional example, the target frame determining unit may be configured to perform speech recognition on the video to be detected before determining, according to the timestamp of the target end word frame, an end word first display time of an end word in the user interaction speech, so as to obtain a user interaction text of the user interaction speech;
acquiring text information displayed in the target tail word frame;
if the text information is the same as the user interaction text, determining that the target end word frame is a correct target end word frame, and continuing to execute the step of determining the first display time of the end word in the user interaction speech according to the timestamp of the target end word frame;
and if the text information is different from the user interaction text, not executing the step of determining the first display time of the end word according to the timestamp of the target end word frame, and instead returning to execute the step of determining, in the video frames, the target first word frame containing the first word of the user interaction speech and the target end word frame containing the end word of the user interaction speech according to the text detection result and the text similarity between adjacent video frames.
In an optional example, the first utterance time determining unit may be configured to extract the user interaction voice from the video to be detected, and perform voice endpoint detection on the user interaction voice;
determining an effective voice area of the user interactive voice according to a voice endpoint detection result;
and determining the initial word sounding time and the final word sounding time of the user interaction voice based on the starting time and the ending time of the effective voice area.
In an optional example, the first utterance time determining unit includes a voice obtaining subunit and a second time determining unit, where the voice obtaining subunit may be configured to obtain a reference voice corresponding to the user interaction voice, the voice content in the reference voice being the same as the voice content in the user interaction voice;
extracting user interaction voice from the video to be detected;
the second time determination unit may be configured to determine, according to the reference voice and the user interaction voice, a first word utterance time and a last word utterance time of the user interaction voice.
In an optional example, the second time determining unit may be configured to determine a speech acquisition starting point in the user interaction speech, and select, from the speech acquisition starting point, speech with a preset duration as a candidate speech, where the duration of the candidate speech is equal to the duration of the reference speech;
calculating the voice similarity between the reference voice and the candidate voice;
selecting a new voice acquisition starting point after the current voice acquisition starting point in the user interactive voice;
returning to the step of selecting, from the voice acquisition starting point, the voice with the preset duration as the candidate voice, until the end point of the user interaction voice has participated in the calculation of the voice similarity;
determining candidate voice with the highest voice similarity with the reference voice as target voice;
and determining the initial word sounding time and the final word sounding time of the user interaction voice according to the initial position and the end position of the target voice in the user interaction voice.
Correspondingly, the embodiment of the invention also provides the electronic equipment, which comprises a memory and a processor; the memory stores an application program, and the processor is configured to run the application program in the memory to perform any of the steps of the speech recognition time-consumption determination method provided by the embodiments of the present invention.
In addition, an embodiment of the present invention further provides a storage medium, where the storage medium stores a plurality of instructions, and the instructions are suitable for being loaded by a processor to perform any of the steps in the method for determining consumed speech recognition time provided by the embodiment of the present invention.
By adopting the scheme of the embodiment of the invention, the video to be detected can be obtained, the text information in adjacent video frames can be compared to determine the text difference information between the adjacent video frames, the target video frame containing the target text can be determined from the video frames according to the text difference information, and the first display time of the target text can be determined according to the timestamp of the target video frame; the user interaction voice can then be extracted from the video to be detected, the first utterance time of the target text in the user interaction voice can be detected, and the time consumed by the voice recognition response corresponding to the video to be detected can be determined according to the first display time and the first utterance time corresponding to the same target text. In other words, by comparing the differences between the text information in adjacent video frames, the embodiment of the invention can determine, from the resulting text difference information, the target video frame in which the target text is displayed for the first time in the video to be detected, determine the first utterance time of the target text from the video to be detected, and calculate the time consumed by the voice recognition response corresponding to the video to be detected based on the first display time and the first utterance time of the target text.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic view of a scene of a speech recognition time-consumption determination method according to an embodiment of the present invention;
FIG. 2 is a flow chart of a method for determining time consumption for speech recognition according to an embodiment of the present invention;
FIG. 3 is a flowchart of a method for determining time consumption for speech recognition according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a page during a voice recognition response according to an embodiment of the present invention;
FIG. 5 is another flowchart of a method for determining time consumption for speech recognition according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a speech recognition elapsed-time determination apparatus according to an embodiment of the present invention;
fig. 7 is another schematic structural diagram of a speech recognition elapsed-time determining apparatus according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention provides a method and a device for determining time consumption of voice recognition. Specifically, the embodiment of the present invention provides a time-consuming speech recognition determination method suitable for a time-consuming speech recognition determination apparatus, which may be integrated in an electronic device.
The electronic device may be a terminal or other devices, including but not limited to a mobile terminal and a fixed terminal, for example, the mobile terminal includes but is not limited to a smart phone, a smart watch, a tablet computer, a notebook computer, a smart car, and the like, wherein the fixed terminal includes but is not limited to a desktop computer, a smart television, and the like.
The electronic device may also be a device such as a server, and the server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, middleware service, a domain name service, a security service, a CDN, and a big data and artificial intelligence platform, but is not limited thereto.
The method for determining the time consumption of voice recognition in the embodiment of the invention can be realized by a server, and can also be realized by a terminal and the server together.
The method is described below by taking an example in which the terminal and the server jointly implement the method for determining the time consumed for speech recognition.
As shown in fig. 1, the speech recognition elapsed time determination system provided by the embodiment of the present invention includes a terminal 10, a server 20, and the like; the terminal 10 and the server 20 are connected via a network, for example, through a wired or wireless network connection, and the terminal 10 may serve as the terminal with which a user sends the video to be detected to the server 20.
The terminal 10 may be a terminal that generates and/or uploads a video to be detected, and is configured to send the video to be detected to the server 20.
The server 20 may be configured to obtain a video to be detected, where video frames in the video to be detected have time stamps, compare text information in adjacent video frames, determine text difference information between adjacent video frames, determine a target video frame containing a target text in the video frames according to the text difference information of the adjacent video frames, determine a first display time of the target text according to the time stamp of the target video frame, extract user interaction speech from the video to be detected, detect a first utterance time of the target text in the user interaction speech, and determine time consumed by a speech recognition response corresponding to the video to be detected according to the first display time and the first utterance time corresponding to the same target text.
The following are detailed below. It should be noted that the following description of the embodiments is not intended to limit the preferred order of the embodiments.
The embodiment of the present invention will be described from the perspective of a speech recognition time-consuming determination device, which may be specifically integrated in a server or a terminal.
As shown in fig. 2, a specific flow of the speech recognition time-consumption determining method of this embodiment may be as follows:
201. the method comprises the steps of obtaining a video to be detected, wherein video frames in the video to be detected are provided with time stamps, the video to be detected comprises user interaction voice, text information is displayed on the video frames, and the text information on the video frames is obtained by carrying out voice recognition on voice information corresponding to the video frames in the user interaction voice.
The video to be detected may be in any video format capable of including image display and voice/audio, which is not limited in the present invention.
In an actual application process, the video to be detected may be generated by performing screen recording and voice acquisition on a device that needs to determine time consumption of voice recognition response, and optionally, before step 201, the embodiment of the present invention may further include:
collecting user interaction voice, and carrying out voice recognition on the user interaction voice;
displaying text information obtained by voice recognition of user interaction voice in real time in a display area;
in the process of collecting user interaction voice, acquiring a display image of a display area as a video frame of a video to be detected;
and acquiring user interaction voice as audio information of the video to be detected, and obtaining the video to be detected based on the display image and the audio information.
For example, collection of the user interaction voice may be started while there is no speech in the surrounding environment, and the user begins the voice interaction only after collection is confirmed to have started.
It can be understood that the text information displayed on each video frame in the video to be detected is obtained by presenting the recognized text information on the display area of the electronic device after the electronic device performs voice recognition on the collected user interaction voice in real time. As the voice content of the user interaction changes, the text information displayed on the video frame also changes.
The text information on the video frame may be obtained by any speech recognition technology, such as Automatic Speech Recognition (ASR). The key technologies of speech technology are automatic speech recognition (ASR), text-to-speech synthesis (TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the development direction of future human-computer interaction, and voice is expected to become one of the most promising human-computer interaction modes.
202. And comparing the text information in the adjacent video frames to determine the text difference information between the adjacent video frames.
In the embodiment of the present invention, since the text information in the video frames is obtained by performing real-time speech recognition on the user interaction voice, the text information in adjacent video frames may differ as the content of the user interaction voice progresses. For example, if the user interaction voice is "the weather is nice today", the 3rd video frame of the video to be detected may present only part of the sentence, such as "the weather", while the 5th video frame presents "the weather is nice today".
Specifically, a text picture can be generated by intercepting areas with texts in adjacent video frames, and the picture similarity of the text picture is compared; the text difference information between adjacent video frames may also be obtained by first obtaining the text information in each video frame and then performing similarity calculation on the text information, that is, step 202 may include:
performing text detection on the video frames to acquire text information in each video frame;
and calculating text similarity between text information in adjacent video frames, wherein the text similarity is text difference information between the adjacent video frames.
The Text detection of the video frame can be realized by an Efficient and Accurate Scene Text (EAST) model, or by a technology capable of realizing Text Recognition, such as Optical Character Recognition (OCR), and the embodiment of the invention does not limit the implementation manner of the Text detection.
When the text similarity between text information is calculated, the text similarity can be realized by calculating a literal distance, semantic matching, a distance between text vectors or a text matching model based on deep learning and the like.
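For illustration only, the literal-distance option can be sketched as follows in Python; the function names and the use of the standard difflib library are assumptions of this sketch rather than part of the claimed method.

from difflib import SequenceMatcher

def text_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; 1.0 means the two texts are identical."""
    if not a and not b:          # two empty texts are treated as identical
        return 1.0
    return SequenceMatcher(None, a, b).ratio()

def adjacent_similarities(frame_texts):
    """Text difference information: similarity between each pair of adjacent frames."""
    return [text_similarity(frame_texts[i], frame_texts[i + 1])
            for i in range(len(frame_texts) - 1)]

# Example: the recognition result grows as the user keeps speaking.
print(adjacent_similarities(["", "the weather", "the weather is nice today", "the weather is nice today"]))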
In an actual application process, a product or a device may perform voice recognition response according to the collected user interaction voice, and simultaneously display information associated with the text information of the response in real time, for example, as shown in fig. 4, the voice content of the user interaction voice is "good weather today", and the display page includes similar text information associated with the text information in addition to the text information recognized according to the user interaction voice. In order to avoid interference of similar text information on the process of calculating the text similarity, optionally, a specific display area may be set in the product or the electronic device for displaying the text information, and the similar text information may be displayed in other display areas except the specific display area.
Specifically, a specific display area is set in the video frame, and the specific display area is used for displaying the text information obtained by performing speech recognition on the user interaction voice. In this case, the step of performing text detection on the video frames to acquire the text information in each video frame includes:
determining a regional display image of a specific display region in a video frame;
and performing text detection on the area display image of each video frame to acquire text information displayed in a specific display area in each video frame.
When the area display image of the specific display area in the video frame is determined, the area display image may be obtained by cropping only the specific display area from the video frame; alternatively, after the video frame is obtained, the areas of the video frame outside the specific display area may be masked, blurred, or otherwise processed, and the processed video frame may be used as the area display image.
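A minimal sketch of this region-cropping and text-detection step is given below, assuming OpenCV for frame extraction and pytesseract for text recognition; the fixed region coordinates, the OCR language setting, and the helper name are illustrative assumptions.

import cv2                      # assumed available: pip install opencv-python
import pytesseract              # assumed available: pip install pytesseract

# Assumed coordinates of the specific display area: (x, y, width, height).
REGION = (100, 600, 880, 120)

def frame_region_texts(video_path: str):
    """Yield (timestamp_ms, text) recognized from the specific display area of each frame."""
    cap = cv2.VideoCapture(video_path)
    x, y, w, h = REGION
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        ts_ms = cap.get(cv2.CAP_PROP_POS_MSEC)        # timestamp of the current frame
        region = frame[y:y + h, x:x + w]              # crop the specific display area
        text = pytesseract.image_to_string(region, lang="chi_sim").strip()
        yield ts_ms, text
    cap.release()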
203. Determining a target video frame containing a target text in the video frame according to text difference information of adjacent video frames, and determining the first display time of the target text according to a timestamp of the target video frame, wherein the target video frame is the first video frame containing the target text in the video to be detected.
The target text may be a text preset by a developer, for example, the developer may preset a keyword as the target text, and when the keyword is detected to appear in a video frame, a video frame including the keyword at the head is used as the target video frame.
In practical applications, the voice used by the user or by the test is diverse; if a fixed keyword is set, the keyword may never appear throughout the user interaction voice. The target text may therefore be selected not according to specific voice content but as the content at a certain position in the user interaction voice, which enhances the applicability of the embodiment of the present invention.
Alternatively, as shown in fig. 3, the step "determining a target video frame containing the target text in the video frame according to text difference information of adjacent video frames, and determining a first display time of the target text according to a timestamp of the target video frame" may include:
determining a target first character frame and/or a target tail character frame in the video frames according to the text similarity between adjacent video frames and the incidence relation which is satisfied between the target first character frame and/or the target tail character frame of the user interactive voice and the adjacent video frames;
according to the time stamp of the target first character frame, the first display time of the first character in the user interactive voice is determined, and according to the time stamp of the target last character frame, the first display time of the last character in the user interactive voice is determined.
The association relationship may be set by a developer according to a specific usage scenario, for example, the step "determining a target first word frame and/or a target last word frame in a video frame according to a text similarity between adjacent video frames and an association relationship satisfied between the target first word frame and/or the target last word frame of the user interactive voice and the adjacent video frames" may include:
determining, based on the text similarity between adjacent video frames, a video frame whose text similarity with the previous video frame is smaller than a first similarity threshold and whose text similarity with the next video frame is not smaller than the first similarity threshold as a target first character frame of the video to be detected; and/or,
determining, based on the text similarity between adjacent video frames, a video frame whose text similarity with the previous video frame is not smaller than the first similarity threshold and whose text similarity with the next video frame is not smaller than a second similarity threshold as the target end word frame of the video to be detected, where the second similarity threshold is not smaller than the first similarity threshold, and the text similarity between adjacent video frames among the video frames between the target first character frame and the target end word frame is not smaller than the first similarity threshold.
In some cases, after the user interaction voice is finished, the text information in the video frame of the video to be detected may disappear immediately, or only similar text information is presented, and when the target first character frame and the target last character frame are to be determined, the following steps may be performed: and if the text information exists in the currently detected video frame, no text information exists in the next frame of the currently detected video frame, and the text similarity between the currently detected video frame and the previous frame is greater than a preset second threshold value, determining the currently detected video frame as a target end word frame.
The first similarity threshold and the second similarity threshold may be set by a technician according to an actually used text recognition algorithm, a text similarity comparison algorithm, and the like. For example, after calculating the text similarity between text information in adjacent videos, binarization processing may be performed on the text similarity, so that the text similarity between adjacent video frames greater than a certain threshold is modified to 1, and the text similarity between adjacent video frames not greater than the certain threshold is modified to 0. When determining the target first word frame and the target last word frame, specifically:
determining a video frame with the text similarity of 0 and the text similarity of 1 to a previous video frame as a target first character frame of the video to be detected based on the text similarity between adjacent video frames;
and determining the video frame with the text similarity of 1 to the previous frame video frame and the text similarity of 1 to the next frame video frame as the target end word frame of the video to be detected based on the text similarity between the adjacent video frames.
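The selection rule above can be sketched as follows; the list of binarized similarities and the index conventions are assumptions made for illustration, and real recordings may require additional checks.

def find_target_frames(binary_sim):
    """
    binary_sim[i] is the binarized text similarity (0 or 1) between frame i and frame i + 1,
    so frame i has similarity binary_sim[i - 1] with its previous frame and binary_sim[i]
    with its next frame.  Returns (first_word_frame, end_word_frame):
      - target first word frame: similarity 0 with the previous frame, 1 with the next frame;
      - target end word frame:   the first later frame with similarity 1 to both neighbours.
    """
    first_word_frame = None
    for i in range(1, len(binary_sim)):
        if binary_sim[i - 1] == 0 and binary_sim[i] == 1:
            first_word_frame = i
            break
    if first_word_frame is None:
        return None, None
    for i in range(first_word_frame + 1, len(binary_sim)):
        if binary_sim[i - 1] == 1 and binary_sim[i] == 1:
            return first_word_frame, i
    return first_word_frame, None

# Example: text first appears at frame 2 and keeps changing until it stabilizes; prints (2, 5).
print(find_target_frames([1, 0, 1, 0, 1, 1, 1]))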
The association relationship may be set by a technician according to an actual situation, and is not limited in the embodiment of the present invention.
In one example, to ensure the accuracy of the determined target end word frame, the text content in the target end word frame may be compared with the content in the user interactive speech, and if the text content in the target end word frame is completely consistent with the content in the user interactive speech, the video frame is taken as the correct target end word frame. Before the step of determining the first display time of the final word in the user interactive voice according to the timestamp of the target final word frame, the method further comprises the following steps:
performing voice recognition on a video to be detected to obtain a user interaction text of user interaction voice;
acquiring text information displayed in a target tail word frame;
if the text information is the same as the user interactive text, determining that the target end word frame is a correct target end word frame, and continuing to execute the step of determining the initial display time of the end word in the user interactive voice according to the timestamp of the target end word frame;
and if the text information is different from the user interaction text, the step of determining the first display time of the last word in the user interaction voice according to the timestamp of the target end word frame is not executed; instead, the method returns to the step of determining the target first word frame and the target end word frame in the video frames according to the text similarity between adjacent video frames and the association relationship satisfied between the target first word frame and the target end word frame of the user interaction voice and the adjacent video frames.
The content of the user interaction text is generally the same as the content of the user interaction voice. The speech recognition method used to obtain the user interaction text may be the same as, or different from, the speech recognition method used to obtain the text information in the video frames, as long as the correctness of the speech recognition can be ensured.
204. And extracting user interaction voice from the video to be detected, and detecting the first sounding time of the target text in the user interaction voice.
In one example, if the target text is a keyword, the time when the keyword appears first in the user interaction voice is detected as the first utterance time.
In another example, if the target text is a first word and a last word in the user interaction speech, step 204 may include:
and extracting user interaction voice from the video to be detected, and detecting the sounding time of the first character in the user interaction voice and the sounding time of the tail character in the user interaction voice.
In daily application scenarios or when no supervision file is available for the test, the audio can be extracted from the original video and segmented using voice endpoint detection (VAD) technology to separate the effective speech area; according to the segmentation result, the start time and end time of the effective speech area can be output directly as the initial word utterance time and final word utterance time of the user interaction voice. That is, the step of "extracting the user interaction voice from the video to be detected, detecting the initial word utterance time in the user interaction voice, and detecting the final word utterance time in the user interaction voice" may include:
extracting user interaction voice from a video to be detected, and carrying out voice endpoint detection on the user interaction voice;
determining an effective voice area of the user interactive voice according to the voice endpoint detection result;
based on the start time and the end time of the effective speech area, the initial word utterance time and the final word utterance time of the user interactive speech are determined.
VAD, also known as voice activity detection or silence suppression, can identify silent periods in an audio signal. Commonly used VAD methods include: discrimination based on a combination of time-domain and frequency-domain features of the sound, discrimination based on the signal-to-noise ratio, and the use of models such as HMM (Hidden Markov Model), MLP (Multi-Layer Perceptron), and DNN (Deep Neural Network).
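A minimal sketch of the endpoint-detection step is given below using the webrtcvad package, which is an assumed choice (the embodiment does not prescribe a particular VAD implementation); it assumes 16 kHz, 16-bit mono PCM audio.

import webrtcvad                 # assumed available: pip install webrtcvad

def effective_speech_region(pcm: bytes, sample_rate: int = 16000, frame_ms: int = 30):
    """Return (start_seconds, end_seconds) of the effective speech area, or None if no speech."""
    vad = webrtcvad.Vad(2)                                      # aggressiveness 0-3
    bytes_per_frame = int(sample_rate * frame_ms / 1000) * 2    # 16-bit mono samples
    voiced = []
    for i in range(0, len(pcm) - bytes_per_frame + 1, bytes_per_frame):
        if vad.is_speech(pcm[i:i + bytes_per_frame], sample_rate):
            voiced.append(i // bytes_per_frame)
    if not voiced:
        return None
    start = voiced[0] * frame_ms / 1000.0           # initial word utterance time
    end = (voiced[-1] + 1) * frame_ms / 1000.0      # end of the final word utterance
    return start, end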
In another example, most evaluation is currently performed with a supervision file available. When a supervision file exists, the steps of "extracting the user interaction voice from the video to be detected, detecting the initial word utterance time in the user interaction voice, and detecting the final word utterance time in the user interaction voice" include:
acquiring reference voice corresponding to the user interaction voice, wherein the voice content in the reference voice is the same as the voice content in the user interaction voice;
extracting user interaction voice from a video to be detected;
and determining the initial word sounding time and the final word sounding time of the user interactive voice according to the reference voice and the user interactive voice.
The supervision file refers to the original audio of the evaluation instruction used in the test. If a person simply speaks the evaluation text, there is no original audio; if a recording of the evaluation text is played back by a player, the original audio, i.e. the supervision file, is available.
Specifically, the step of determining the initial word speaking time and the final word speaking time of the user interaction voice according to the reference voice and the user interaction voice may include:
determining a voice acquisition starting point in user interactive voice, and selecting voice with preset duration as candidate voice from the voice acquisition starting point, wherein the duration of the candidate voice is equal to that of the reference voice;
calculating the voice similarity between the reference voice and the candidate voice;
selecting a new voice acquisition starting point after the current voice acquisition starting point in the user interactive voice;
returning to the step of selecting, from the voice acquisition starting point, the voice with the preset duration as the candidate voice, until the end point of the user interaction voice has participated in the calculation of the voice similarity;
determining candidate voice with highest voice similarity with reference voice as target voice;
and determining the initial word sounding time and the final word sounding time of the user interactive voice according to the initial position and the end position of the target voice in the user interactive voice.
The voice similarity may be calculated with the Pearson correlation coefficient, also known as the Pearson product-moment correlation coefficient (PPMCC), which measures the (linear) correlation between two variables X and Y and takes a value between -1 and 1.
Alternatively, the voice similarity may be calculated by extracting feature values from the reference voice and the user interaction voice and then performing a similarity calculation based on the feature values; the method for calculating the voice similarity is not limited in the embodiment of the present invention.
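The sliding-window matching against the supervision file can be sketched with NumPy as follows; the sample rate, hop size, and the use of the Pearson correlation over raw samples are illustrative assumptions of this sketch.

import numpy as np

def locate_reference(user_speech: np.ndarray, reference: np.ndarray,
                     sample_rate: int = 16000, hop: int = 160):
    """
    Slide a window the length of the reference voice over the user interaction voice,
    score each candidate voice with the Pearson correlation coefficient, and return
    the (start_seconds, end_seconds) of the best-matching target voice.
    """
    win = len(reference)
    best_score, best_start = -np.inf, 0
    for start in range(0, len(user_speech) - win + 1, hop):
        candidate = user_speech[start:start + win]
        score = np.corrcoef(candidate, reference)[0, 1]   # Pearson correlation coefficient
        if np.isnan(score):                               # e.g. a silent (constant) candidate
            continue
        if score > best_score:
            best_score, best_start = score, start
    first_word_time = best_start / sample_rate            # initial word utterance time
    last_word_time = (best_start + win) / sample_rate     # final word utterance time
    return first_word_time, last_word_time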
205. And determining the time consumed by the voice recognition response corresponding to the video to be detected according to the first display time and the first sounding time corresponding to the same target text.
If the target text is a first word and a last word in the user interactive voice, in the process of determining the time consumed by the voice recognition response corresponding to the video to be detected, calculation may be performed according to the first display time of the first word, the first speaking time, the first display time of the last word, and the speaking time of the last word, that is, step 205 may include:
determining a first voice recognition response time according to the first display time of the first word and the utterance time of the first word, and/or determining a second voice recognition response time according to the first display time of the last word and the utterance time of the last word;
and determining the time consumed by the voice recognition response corresponding to the video to be detected based on the first voice recognition response time and/or the second voice recognition response time.
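Putting the two measurements together, the response time is simply the difference between the first display time and the corresponding utterance time once both are expressed on the same time base; the numbers and the averaging of the two measurements below are illustrative assumptions.

def recognition_latency(first_display_s: float, utterance_s: float) -> float:
    """Speech recognition response time in seconds (display time minus utterance time)."""
    if first_display_s < utterance_s:
        raise ValueError("text displayed before the word was spoken; check the time base")
    return first_display_s - utterance_s

# Example: first word spoken 2.40 s into the recording and first shown at 2.95 s.
first_word_latency = recognition_latency(2.95, 2.40)    # 0.55 s
last_word_latency = recognition_latency(6.10, 5.80)     # 0.30 s
overall = (first_word_latency + last_word_latency) / 2  # one possible way to combine both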
Therefore, the embodiment of the invention can reduce the dependence on manpower in the process of determining the time consumption of the voice recognition response, save the manpower resource and improve the time consumption efficiency of determining the voice recognition response on the basis of ensuring the accuracy.
The method according to the preceding embodiment is illustrated in further detail below by way of example.
This embodiment is described by taking the system in fig. 1 and the case where a supervision file exists as an example.
As shown in fig. 5, the specific process of the speech recognition time-consumption determining method of this embodiment may be as follows:
301. and the terminal detects the voice signal, collects the interactive voice and generates a video to be detected.
After the terminal generates the video to be detected, the video to be detected can be directly sent to the server, the video to be detected can also be stored in a preset storage area of the video to be detected, and when the server needs to perform a time-consuming evaluation task of voice recognition response, the video to be detected can be directly obtained from the storage area.
The storage area may be set in a memory of the terminal, may be set in the server, may be set in another cloud server independent of the terminal and the server, and the like. The embodiment of the present invention is not limited thereto.
Specifically, the terminal can start to collect user interaction voice after detecting a voice signal, perform voice recognition on the user interaction voice, display text information obtained by the user interaction voice through the voice recognition in real time in the display area, acquire a display image of the display area as a video frame of the video to be detected, acquire the user interaction voice as audio information of the video to be detected, and obtain the video to be detected based on the display image and the audio information.
In another example, the terminal may be in a voice capture state at all times, rather than beginning capture after detecting a voice signal.
302. The server acquires a video to be detected.
It can be understood that the server may actively acquire the video to be detected from the storage area, and may also receive the video to be detected sent by the terminal.
303. The server compares the text information in the adjacent video frames to determine the text similarity between the adjacent video frames.
The text similarity between adjacent video frames can be obtained by acquiring the text information in each video frame and then performing similarity calculation on the text information. That is, step 303 may include:
performing text detection on the video frames to acquire text information in each video frame;
the text similarity between the text information in adjacent video frames is calculated.
When the text similarity between the text information is calculated, the text similarity can be realized by calculating a literal distance, semantic matching, a distance between text vectors, or a text matching model based on deep learning, and the like, which is not limited in the embodiment of the present invention.
304. And determining a target first character frame and a target tail character frame in the video frames according to the text similarity between the adjacent video frames and the incidence relation which is satisfied between the target first character frame and the target tail character frame of the user interactive voice and the adjacent video frames.
After calculating the text similarity between the text information in the adjacent videos, binarization processing can be performed on the text similarity, the text similarity between the adjacent video frames larger than a certain threshold value is modified to be 1, and the text similarity between the adjacent video frames not larger than the certain threshold value is modified to be 0. When determining the target first word frame and the target last word frame, specifically:
determining a video frame with the text similarity of 0 and the text similarity of 1 to a previous video frame as a target first character frame of the video to be detected based on the text similarity between adjacent video frames;
and determining the video frame with the text similarity of 1 to the previous frame video frame and the text similarity of 1 to the next frame video frame as the target end word frame of the video to be detected based on the text similarity between the adjacent video frames.
305. According to the time stamp of the target first character frame, the first display time of the first character in the user interactive voice is determined, and according to the time stamp of the target last character frame, the first display time of the last character in the user interactive voice is determined.
When the first display time of the first character and the first display time of the last character are determined, if the time information carried by the timestamps is absolute (wall-clock) time, the time in the timestamps should be converted. For example, if the timestamp of the target first word frame is n hours and m seconds of wall-clock time while the utterance time of the first word is the k-th second of the recording, either the timestamp or the utterance time should be converted so that both are on the same time base.
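For illustration, converting a wall-clock timestamp into seconds relative to the start of the recording can look like the following sketch; the timestamp format string is an assumption.

from datetime import datetime

def to_relative_seconds(frame_wallclock: str, recording_start: str,
                        fmt: str = "%H:%M:%S.%f") -> float:
    """Convert a frame's wall-clock timestamp to seconds since the start of the recording."""
    t = datetime.strptime(frame_wallclock, fmt)
    t0 = datetime.strptime(recording_start, fmt)
    return (t - t0).total_seconds()

# Example: frame stamped 14:02:05.480, recording started at 14:02:03.000 -> 2.48 s.
print(to_relative_seconds("14:02:05.480", "14:02:03.000"))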
306. Obtaining reference voice corresponding to user interaction voice, determining a voice obtaining starting point in the user interaction voice, selecting voice with preset duration as candidate voice from the voice obtaining starting point, and calculating voice similarity between the reference voice and the candidate voice.
The duration of the candidate voice may be preset when the duration of the supervision file (the reference voice) is known in advance; alternatively, the duration of the reference voice may be detected, and the duration of the candidate voice determined each time according to the result of that duration detection, and so on.
307. And after the current voice acquisition starting point in the user interactive voice, selecting a new voice acquisition starting point, returning to the step of executing the step of selecting the voice with preset duration as the candidate voice from the voice acquisition starting point.
In an example, the voice obtaining starting point may be sequentially selected from the user interactive voice until the end point of the user interactive voice participates in the calculation of the voice similarity, and then step 308 is executed, or after the end point of the user interactive voice participates in the calculation of the voice similarity, the end point of the candidate voice is kept the same as the end point of the user interactive voice, the duration of the candidate voice is modified until the end point of the candidate voice is not used as the voice obtaining starting point, and then step 308 is executed.
308. And determining candidate voices with the highest voice similarity with the reference voices to be used as target voices, and determining the initial word speaking time and the final word speaking time of the user interaction voices according to the initial position and the end position of the target voices in the user interaction voices.
The starting position of the target voice in the user interaction voice is the position of the voice acquisition starting point of the target voice, and the ending position of the target voice in the user interaction voice can be determined according to the voice acquisition starting point of the target voice and the duration of the target voice.
309. Determining the first voice recognition response time according to the first display time of the first word and the utterance time of the first word, determining the second voice recognition response time according to the first display time of the last word and the utterance time of the last word, and determining the time consumed by the voice recognition response corresponding to the video to be detected based on the first voice recognition response time and the second voice recognition response time.
If the time standard of the first character display time and the first character sounding time is the same, the first voice recognition response time consumption can be obtained by subtracting the first character sounding time from the first character display time; if the time standard of the first character display time and the first character sounding time are different, the first speech recognition response time consumption can be obtained by firstly converting the time standard of the first character display time or the first character sounding time and then subtracting the first character sounding time from the first character display time. The determination method of the time consumption of the second speech recognition response is similar and will not be described herein.
Therefore, the embodiment of the invention can reduce the dependence on manpower in the process of determining the time consumption of the voice recognition response, save the manpower resource and improve the time consumption efficiency of determining the voice recognition response on the basis of ensuring the accuracy.
In order to better implement the method, correspondingly, the embodiment of the invention also provides a device for determining the time consumption of voice recognition.
Referring to fig. 6, the apparatus includes:
the video acquisition unit 601 is configured to acquire a video to be detected, where video frames in the video to be detected all have time stamps, the video to be detected includes user interaction voice, text information is displayed on the video frames, and the text information on the video frames is obtained by performing voice recognition on voice information corresponding to the video frames in the user interaction voice;
a text comparing unit 602, configured to compare text information in adjacent video frames, and determine text difference information between the adjacent video frames;
a first time display determining unit 603, configured to determine, according to text difference information of adjacent video frames, a target video frame including a target text in the video frames, and determine, according to a timestamp of the target video frame, a first time display time of the target text, where the target video frame is a first video frame including the target text in a video to be detected;
a first utterance time determining unit 604, configured to extract the user interaction voice from the video to be detected, and detect the first utterance time of the target text in the user interaction voice;
and a time consumption calculation unit 605, configured to determine time consumption for voice recognition response corresponding to the video to be detected according to the first display time and the first utterance time corresponding to the same target text.
In an alternative example, as shown in fig. 7, the text comparing unit 602 includes a text obtaining unit 606 and a text comparing sub-unit 607, where the text obtaining unit 606 may be configured to perform text detection on video frames to obtain text information in each video frame;
the text comparison subunit 607 may be configured to calculate a text similarity between text information in adjacent video frames, where the text similarity is text difference information between adjacent video frames.
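For illustration, one possible way to compute such a text similarity between adjacent frames is sketched below; the use of difflib's SequenceMatcher ratio is an assumption, since the embodiment does not prescribe a specific similarity measure.

```python
from difflib import SequenceMatcher

def adjacent_text_similarity(frame_texts):
    """Return the text similarity between each pair of adjacent frames' text."""
    return [SequenceMatcher(None, prev, curr).ratio()
            for prev, curr in zip(frame_texts, frame_texts[1:])]

# The frame in which new text first appears shows a drop in similarity with
# its previous frame, while stable text gives similarities close to 1.
print(adjacent_text_similarity(['', '', '今天', '今天天气', '今天天气', '今天天气']))
```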
In an alternative example, the target text of the embodiment of the present invention includes a first character and a last character in the user interaction voice, and the target video frame includes a first target first character frame containing the first character and a first target last character frame containing the last character in the video to be detected;
the first-time-of-display determining unit 603 may include a target frame determining unit 608 and a first time determining unit 609, the target frame determining unit 608 configured to determine a target first character frame and a target last character frame in the video frames according to a text similarity between adjacent video frames and an association relationship satisfied between the target first character frame and the target last character frame of the user interactive voice and the adjacent video frames;
a first time determining unit 609, configured to determine, according to the timestamp of the target first character frame, a first time display time of a first character in the user interaction speech, and determine, according to the timestamp of the target last character frame, a last character first display time of a last character in the user interaction speech;
the first time of utterance determining unit 604 is configured to extract a user interaction voice from a video to be detected, and detect utterance time of a first word in the user interaction voice and utterance time of a last word in the user interaction voice.
In an optional example, the target frame determining unit 608 may be configured to determine, based on the text similarity between adjacent video frames, a video frame whose text similarity with a video frame of a previous frame is smaller than a first similarity threshold and whose text similarity with a video frame of a next frame is not smaller than the first similarity threshold as a target first character frame of the video to be detected;
and to determine, based on the text similarity between adjacent video frames, a video frame whose text similarity with the previous video frame is not smaller than the first similarity threshold and whose text similarity with the next video frame is not smaller than a second similarity threshold as a target last character frame of the video to be detected, where the second similarity threshold is not smaller than the first similarity threshold, and the text similarity between adjacent video frames among the video frames between the target first character frame and the target last character frame is not smaller than the first similarity threshold.
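The frame-selection rule above could be sketched as follows; the concrete threshold values (0.5 and 0.95) are illustrative assumptions only.

```python
def find_target_frames(sims, t1=0.5, t2=0.95):
    """sims[i] is the text similarity between frame i and frame i + 1.

    The target first character frame differs strongly from its previous frame
    but not from its next frame; the target last character frame is the later
    frame after which the on-screen text stops changing (its similarity with
    the next frame reaches the higher threshold t2)."""
    first_frame = last_frame = None
    for i in range(1, len(sims)):
        prev_sim, next_sim = sims[i - 1], sims[i]
        if first_frame is None and prev_sim < t1 and next_sim >= t1:
            first_frame = i
        elif first_frame is not None and prev_sim >= t1 and next_sim >= t2:
            last_frame = i
            break
    return first_frame, last_frame
```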
In an optional example, a specific display area is set in a video frame of the embodiment of the present invention, and the specific display area is used for displaying text information obtained by performing speech recognition on user interaction speech;
a text acquisition unit 606, which may be configured to determine a region display image of a specific display region in a video frame;
and performing text detection on the area display image of each video frame to acquire text information displayed in a specific display area in each video frame.
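A possible realization of this region-based text detection is sketched below; pytesseract is used only as an example OCR engine and the region coordinates are hypothetical, neither being prescribed by the embodiment.

```python
from PIL import Image
import pytesseract

# (left, top, right, bottom) of the assumed specific display area
DISPLAY_REGION = (0, 1600, 1080, 1800)

def frame_region_text(frame_path: str) -> str:
    """Crop the specific display area of one video frame and OCR only that region."""
    region_image = Image.open(frame_path).crop(DISPLAY_REGION)
    return pytesseract.image_to_string(region_image, lang='chi_sim').strip()
```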
In an optional example, the target frame determining unit 608 may be configured to, before the first display time of the end word in the user interaction speech is determined according to the timestamp of the target end word frame, perform speech recognition on the video to be detected to obtain a user interaction text of the user interaction speech;
acquiring text information displayed in a target tail word frame;
if the text information is the same as the user interactive text, determining that the target end word frame is a correct target end word frame, and continuing to execute the step of determining the initial display time of the end word in the user interactive voice according to the timestamp of the target end word frame;
and if the text information is different from the user interactive text, the step of determining the initial display time of the tail word in the user interactive voice according to the timestamp of the target tail word frame is not executed, and the step of determining the target first word frame including the initial word of the user interactive voice and the target tail word frame including the user interactive voice tail word in the video frames according to the text detection result and the text similarity between adjacent video frames is returned.
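A minimal sketch of this verification step, with a whitespace-insensitive comparison as an assumed normalization, might look as follows.

```python
def is_correct_last_word_frame(frame_text: str, interaction_transcript: str) -> bool:
    """Check whether the text shown in the candidate target last-word frame
    matches the recognized transcript of the whole interaction speech."""
    normalize = lambda s: ''.join(s.split())   # whitespace-insensitive comparison (assumption)
    return normalize(frame_text) == normalize(interaction_transcript)

if is_correct_last_word_frame('今天天气怎么样', '今天天气怎么样'):
    print('use this frame timestamp as the last-word first display time')
else:
    print('re-run the search for the target first-word and last-word frames')
```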
In an optional example, the first-time-of-utterance determining unit 604 may be configured to extract a user interaction voice from a video to be detected, and perform voice endpoint detection on the user interaction voice;
determining an effective voice area of the user interactive voice according to the voice endpoint detection result;
based on the start time and the end time of the effective speech area, the initial word utterance time and the final word utterance time of the user interactive speech are determined.
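As an illustration, a simple short-time-energy voice endpoint detector that yields the effective speech area is sketched below; the energy threshold and frame length are assumptions, and any voice activity detection method could be substituted.

```python
import numpy as np

def effective_speech_area(audio, sample_rate: int,
                          frame_ms: int = 20, threshold: float = 0.02):
    """Return (start_time_s, end_time_s) of the effective speech area, or None
    if no frame exceeds the energy threshold."""
    audio = np.asarray(audio, dtype=np.float64)
    frame_len = sample_rate * frame_ms // 1000
    n_frames = len(audio) // frame_len
    energies = np.array([
        np.sqrt(np.mean(audio[i * frame_len:(i + 1) * frame_len] ** 2))
        for i in range(n_frames)
    ])
    voiced = np.where(energies > threshold)[0]
    if voiced.size == 0:
        return None
    start_time = voiced[0] * frame_ms / 1000.0          # first-word utterance time
    end_time = (voiced[-1] + 1) * frame_ms / 1000.0     # last-word utterance time
    return start_time, end_time
```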
In an optional example, the first utterance time determining unit 604 includes a speech acquiring subunit 610 and a second time determining unit 611, where the speech acquiring subunit 610 may be configured to acquire a reference voice corresponding to the user interaction voice, the voice content in the reference voice being the same as the voice content in the user interaction voice;
extracting user interaction voice from a video to be detected;
the second time determination unit may be configured to determine, according to the reference voice and the user interaction voice, a first word utterance time and a last word utterance time of the user interaction voice.
In an optional example, the second time determining unit 611 may be configured to determine a speech acquisition starting point in the user interaction speech, and select, from the speech acquisition starting point, speech with a preset duration as a candidate speech, where the duration of the candidate speech is equal to the duration of the reference speech;
calculating the voice similarity between the reference voice and the candidate voice;
selecting a new voice acquisition starting point after the current voice acquisition starting point in the user interactive voice;
returning to the step of executing the steps of starting from the voice acquisition starting point and selecting the voice with preset duration as the candidate voice until the end point of the user interaction voice participates in the calculation of the voice similarity;
determining candidate voice with highest voice similarity with reference voice as target voice;
and determining the initial word sounding time and the final word sounding time of the user interactive voice according to the initial position and the end position of the target voice in the user interactive voice.
In an optional example, the apparatus for determining the time consumed by voice recognition further includes, before the video acquiring unit 601, a video recording unit 612, which may be configured to collect user interaction voice and perform voice recognition on the user interaction voice;
displaying text information obtained by voice recognition of user interaction voice in real time in a display area;
in the process of collecting user interaction voice, acquiring a display image of a display area as a video frame of a video to be detected;
and acquiring user interaction voice as audio information of the video to be detected, and obtaining the video to be detected based on the display image and the audio information.
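For illustration only, the video to be detected could be assembled from the captured display images and the recorded interaction audio as in the following sketch; the container class and its fields are hypothetical and not prescribed by the embodiment.

```python
from dataclasses import dataclass, field
from typing import Any, List, Tuple

@dataclass
class VideoToDetect:
    """Illustrative container: timestamped display images plus the recorded
    user interaction audio together form the video to be detected."""
    frames: List[Tuple[float, Any]] = field(default_factory=list)  # (timestamp_s, image)
    audio: Any = None                                               # interaction speech samples
    sample_rate: int = 16000

    def add_frame(self, timestamp_s: float, image: Any) -> None:
        # Every captured display image keeps the timestamp at which it was grabbed.
        self.frames.append((timestamp_s, image))

    def set_audio(self, samples: Any, sample_rate: int) -> None:
        self.audio = samples
        self.sample_rate = sample_rate
```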
Therefore, with the above apparatus for determining the time consumed by voice recognition, the dependence on manpower can be reduced in the process of determining the voice recognition response time consumption, human resources are saved, and the efficiency of determining the voice recognition response time consumption is improved while accuracy is ensured.
In addition, an embodiment of the present invention further provides an electronic device, where the electronic device may be a terminal or a server, as shown in fig. 8, which shows a schematic structural diagram of the electronic device according to the embodiment of the present invention, specifically:
the electronic device may include Radio Frequency (RF) circuitry 801, memory 802 including one or more computer-readable storage media, input unit 803, display unit 804, sensor 805, audio circuitry 806, Wireless Fidelity (WiFi) module 807, processor 808 including one or more processing cores, and power supply 809. Those skilled in the art will appreciate that the terminal structure shown in fig. 8 is not intended to be limiting and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components. Wherein:
the RF circuit 801 may be used for receiving and transmitting signals during information transmission and reception or during a call, and in particular, receive downlink information from a base station and then send the received downlink information to one or more processors 808 for processing; in addition, data relating to uplink is transmitted to the base station. In general, the RF circuitry 801 includes, but is not limited to, an antenna, at least one Amplifier, a tuner, one or more oscillators, a Subscriber Identity Module (SIM) card, a transceiver, a coupler, a Low Noise Amplifier (LNA), a duplexer, and the like. In addition, the RF circuitry 801 may also communicate with networks and other devices via wireless communications. The wireless communication may use any communication standard or protocol, including but not limited to Global System for Mobile communications (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, Short Message Service (SMS), and the like.
The memory 802 may be used to store software programs and modules, and the processor 808 may execute various functional applications and data processing by operating the software programs and modules stored in the memory 802. The memory 802 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data (such as audio data, a phonebook, etc.) created according to the use of the terminal, etc. Further, the memory 802 may include high speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. Accordingly, the memory 802 may also include a memory controller to provide the processor 808 and the input unit 803 access to the memory 802.
The input unit 803 may be used to receive input numeric or textual information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control. In particular, in a particular embodiment, the input unit 803 may include a touch-sensitive surface as well as other input devices. The touch-sensitive surface, also referred to as a touch display screen or a touch pad, may collect touch operations by a user (e.g., operations by a user on or near the touch-sensitive surface using a finger, a stylus, or any other suitable object or attachment) thereon or nearby, and drive the corresponding connection device according to a predetermined program. Alternatively, the touch sensitive surface may comprise two parts, a touch detection means and a touch controller. The touch detection device detects the touch direction of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch sensing device, converts the touch information into touch point coordinates, sends the touch point coordinates to the processor 808, and can receive and execute commands sent by the processor 808. In addition, touch sensitive surfaces may be implemented using various types of resistive, capacitive, infrared, and surface acoustic waves. The input unit 803 may include other input devices in addition to the touch-sensitive surface. In particular, other input devices may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like.
The display unit 804 may be used to display information input by or provided to a user and various graphical user interfaces of the terminal, which may be made up of graphics, text, icons, video, and any combination thereof. The Display unit 804 may include a Display panel, and optionally, the Display panel may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like. Further, the touch-sensitive surface may overlay the display panel, and when a touch operation is detected on or near the touch-sensitive surface, the touch operation is transmitted to the processor 808 to determine the type of touch event, and the processor 808 then provides a corresponding visual output on the display panel according to the type of touch event. Although in FIG. 8 the touch sensitive surface and the display panel are two separate components to implement input and output functions, in some embodiments the touch sensitive surface may be integrated with the display panel to implement input and output functions.
The terminal may also include at least one sensor 805, such as light sensors, motion sensors, and other sensors. Specifically, the light sensor may include an ambient light sensor that may adjust the brightness of the display panel according to the brightness of ambient light, and a proximity sensor that may turn off the display panel and/or the backlight when the terminal is moved to the ear. As one of the motion sensors, the gravity acceleration sensor can detect the magnitude of acceleration in each direction (generally, three axes), can detect the magnitude and direction of gravity when the mobile phone is stationary, and can be used for applications of recognizing the posture of the mobile phone (such as horizontal and vertical screen switching, related games, magnetometer posture calibration), vibration recognition related functions (such as pedometer and tapping), and the like; as for other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which can be configured in the terminal, detailed description is omitted here.
Audio circuitry 806, a speaker, and a microphone may provide an audio interface between the user and the terminal. The audio circuit 806 may transmit the electrical signal converted from the received audio data to a speaker, and convert the electrical signal into an audio signal for output; on the other hand, the microphone converts the collected sound signal into an electric signal, converts the electric signal into audio data after being received by the audio circuit 806, and then outputs the audio data to the processor 808 for processing, and then passes through the RF circuit 801 to be transmitted to, for example, another terminal, or outputs the audio data to the memory 802 for further processing. The audio circuit 806 may also include an earbud jack to provide peripheral headset communication with the terminal.
WiFi belongs to short distance wireless transmission technology, and the terminal can help the user to send and receive e-mail, browse web page and access streaming media etc. through WiFi module 807, which provides wireless broadband internet access for the user. Although fig. 8 shows the WiFi module 807, it is understood that it does not belong to the essential constitution of the terminal, and may be omitted entirely as needed within the scope not changing the essence of the invention.
The processor 808 is a control center of the terminal, connects various parts of the entire mobile phone by using various interfaces and lines, and performs various functions of the terminal and processes data by operating or executing software programs and/or modules stored in the memory 802 and calling data stored in the memory 802, thereby integrally monitoring the mobile phone. Optionally, processor 808 may include one or more processing cores; preferably, the processor 808 may integrate an application processor, which primarily handles operating systems, user interfaces, applications, etc., and a modem processor, which primarily handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into processor 808.
The terminal also includes a power supply 809 (e.g., a battery) for powering the various components, which may preferably be logically coupled to the processor 808 via a power management system to manage charging, discharging, and power consumption via the power management system. The power supply 809 may also include one or more dc or ac power sources, recharging systems, power failure detection circuitry, power converters or inverters, power status indicators, or any other component.
Although not shown, the terminal may further include a camera, a bluetooth module, and the like, which will not be described herein. Specifically, in this embodiment, the processor 808 in the terminal loads the executable file corresponding to the process of one or more application programs into the memory 802 according to the following instructions, and the processor 808 runs the application programs stored in the memory 802, thereby implementing various functions as follows:
acquiring a video to be detected, wherein video frames in the video to be detected are provided with time stamps, the video to be detected comprises user interaction voice, text information is displayed on the video frames, and the text information on the video frames is obtained by performing voice recognition on voice information corresponding to the video frames in the user interaction voice;
comparing the text information in the adjacent video frames to determine the text difference information between the adjacent video frames;
determining a target video frame containing a target text in the video frame according to text difference information of adjacent video frames, and determining the first display time of the target text according to a timestamp of the target video frame, wherein the target video frame is the first video frame containing the target text in a video to be detected;
extracting user interaction voice from a video to be detected, and detecting the first sounding time of a target text in the user interaction voice;
and determining the time consumed by the voice recognition response corresponding to the video to be detected according to the first display time and the first sounding time corresponding to the same target text.

It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions, or by associated hardware controlled by the instructions, and the instructions may be stored in a computer-readable storage medium and loaded and executed by a processor.
To this end, embodiments of the present invention provide a storage medium, in which a plurality of instructions are stored, where the instructions can be loaded by a processor to execute steps in any one of the speech recognition time-consuming determination methods provided by the embodiments of the present invention. For example, the instructions may perform the steps of:
acquiring a video to be detected, wherein video frames in the video to be detected are provided with time stamps, the video to be detected comprises user interaction voice, text information is displayed on the video frames, and the text information on the video frames is obtained by performing voice recognition on voice information corresponding to the video frames in the user interaction voice;
comparing the text information in the adjacent video frames to determine the text difference information between the adjacent video frames;
determining a target video frame containing a target text in the video frame according to text difference information of adjacent video frames, and determining the first display time of the target text according to a timestamp of the target video frame, wherein the target video frame is the first video frame containing the target text in a video to be detected;
extracting user interaction voice from a video to be detected, and detecting the first sounding time of a target text in the user interaction voice;
and determining the time consumed by the voice recognition response corresponding to the video to be detected according to the first display time and the first sounding time corresponding to the same target text.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
Wherein the storage medium may include: read Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.
Since the instructions stored in the storage medium may execute the steps in any one of the speech recognition time-consuming determination methods provided in the embodiments of the present invention, beneficial effects that can be achieved by any one of the speech recognition time-consuming determination methods provided in the embodiments of the present invention may be achieved, which are detailed in the foregoing embodiments and will not be described herein again.
According to an aspect of the application, there is also provided a computer program product or a computer program comprising computer instructions stored in a computer readable storage medium. The processor of the electronic device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the electronic device executes the method provided in the various alternative implementations in the above embodiments.
The method and apparatus for determining the time consumed by speech recognition provided by the embodiments of the invention are described in detail above. Specific examples are used herein to explain the principles and implementations of the invention, and the description of the embodiments is only intended to help understand the method and its core idea. Meanwhile, those skilled in the art may make changes to the specific implementations and the application scope according to the idea of the invention. In summary, the content of this specification should not be construed as limiting the invention.

Claims (10)

1. A method for determining time consumed by speech recognition is characterized by comprising the following steps:
acquiring a video to be detected, wherein video frames in the video to be detected are provided with time stamps, the video to be detected comprises user interaction voice, text information is displayed on the video frames, and the text information on the video frames is obtained by performing voice recognition on voice information corresponding to the video frames in the user interaction voice;
comparing the text information in the adjacent video frames to determine the text difference information between the adjacent video frames;
determining a target video frame containing a target text in the video frame according to text difference information of adjacent video frames, and determining the first display time of the target text according to a timestamp of the target video frame, wherein the target video frame is the first video frame containing the target text in the video to be detected;
extracting the user interaction voice from the video to be detected, and detecting the first sounding time of the target text in the user interaction voice;
and determining the time consumed by the voice recognition response corresponding to the video to be detected according to the first display time and the first sounding time corresponding to the same target text.
2. The method of claim 1, wherein comparing the text information in the adjacent video frames to determine the text difference information between the adjacent video frames comprises:
performing text detection on the video frames to acquire text information in each video frame;
and calculating text similarity between text information in adjacent video frames, wherein the text similarity is text difference information between the adjacent video frames.
3. The method according to claim 2, wherein the target text comprises a first character and a last character in the user interaction voice, and the target video frames comprise a first target first character frame containing the first character and a first target last character frame containing the last character in the video to be detected;
the determining a target video frame containing a target text in the video frame according to the text difference information of the adjacent video frames and determining the first display time of the target text according to the timestamp of the target video frame includes:
determining a target first character frame and a target tail character frame in the video frames according to the text similarity between the adjacent video frames and the incidence relation which is satisfied between the target first character frame and the target tail character frame of the user interactive voice and the adjacent video frames;
determining the first display time of the first character in the user interactive voice according to the timestamp of the target first character frame, and determining the first display time of the last character in the user interactive voice according to the timestamp of the target last character frame;
the extracting the user interaction voice from the video to be detected and detecting the first sounding time of the target text in the user interaction voice comprise:
and extracting the user interaction voice from the video to be detected, and detecting the initial word sounding time in the user interaction voice and the tail word sounding time in the user interaction voice.
4. The method according to claim 3, wherein the determining the target first character frame and the target last character frame in the video frames according to the text similarity between the adjacent video frames and the association relationship satisfied between the target first character frame and the target last character frame of the user interactive voice and the adjacent video frames comprises:
determining a video frame, as a target first character frame of the video to be detected, of which the text similarity with a previous video frame is smaller than a first similarity threshold and the text similarity with a next video frame is not smaller than the first similarity threshold based on the text similarity between the adjacent video frames;
and determining, based on the text similarity between the adjacent video frames, a video frame whose text similarity with the video frame of the previous frame is not smaller than the first similarity threshold and whose text similarity with the video frame of the next frame is not smaller than a second similarity threshold as the target tail character frame of the video to be detected, wherein the second similarity threshold is not smaller than the first similarity threshold, and the text similarity between adjacent video frames among the video frames between the target first character frame and the target tail character frame is not smaller than the first similarity threshold.
5. The method according to claim 2, wherein a specific display area is set in the video frame, and the specific display area is used for displaying the text information obtained by performing speech recognition on the user interaction speech;
the performing text detection on the video frames to obtain text information in each video frame includes:
determining a regional display image of the specific display region in the video frame;
and performing text detection on the area display image of each video frame to acquire text information displayed in a specific display area in each video frame.
6. The method of claim 3, wherein before the determining of the first display time of the end word in the user interactive voice according to the timestamp of the target end word frame, the method further comprises:
performing voice recognition on the video to be detected to obtain a user interaction text of the user interaction voice;
acquiring text information displayed in the target tail word frame;
if the text information is the same as the user interactive text, determining that the target end word frame is a correct target end word frame, and continuing to execute the step of determining the initial display time of the end word in the user interactive voice according to the timestamp of the target end word frame;
and if the text information is different from the user interactive text, the step of determining the initial display time of the last word in the user interactive voice according to the timestamp of the target last word frame is not executed, and the step of determining the target first word frame and the target last word frame in the video frames according to the text similarity between the adjacent video frames and the incidence relation which is satisfied between the target first word frame and the target last word frame of the user interactive voice and the adjacent video frames is executed.
7. The method according to claim 3, wherein the extracting the user interaction voice from the video to be detected, and detecting the speaking time of the first word in the user interaction voice and the speaking time of the last word in the user interaction voice comprises:
extracting user interaction voice from the video to be detected, and carrying out voice endpoint detection on the user interaction voice;
determining an effective voice area of the user interactive voice according to a voice endpoint detection result;
and determining the initial word sounding time and the final word sounding time of the user interaction voice based on the starting time and the ending time of the effective voice area.
8. The method according to claim 3, wherein the extracting the user interaction voice from the video to be detected, and detecting the speaking time of the first word in the user interaction voice and the speaking time of the last word in the user interaction voice comprises:
acquiring reference voice corresponding to the user interaction voice, wherein the voice content in the reference voice is the same as the voice content in the user interaction voice;
extracting user interaction voice from the video to be detected;
and determining the initial word sounding time and the final word sounding time of the user interaction voice according to the reference voice and the user interaction voice.
9. The method of claim 8, wherein determining the initial utterance time and the final utterance time of the user-interactive voice according to the reference voice and the user-interactive voice comprises:
determining a voice acquisition starting point in the user interaction voice, and selecting a voice with preset duration as a candidate voice from the voice acquisition starting point, wherein the duration of the candidate voice is equal to the duration of the reference voice;
calculating the voice similarity between the reference voice and the candidate voice;
selecting a new voice acquisition starting point after the current voice acquisition starting point in the user interactive voice;
returning to the step of executing the step of selecting the voice with preset duration as the candidate voice from the voice acquisition starting point until the end point of the user interaction voice participates in the calculation of the voice similarity;
determining candidate voice with the highest voice similarity with the reference voice as target voice;
and determining the initial word sounding time and the final word sounding time of the user interaction voice according to the initial position and the end position of the target voice in the user interaction voice.
10. A speech recognition elapsed time determination apparatus, comprising:
the video acquisition unit is used for acquiring a video to be detected, video frames in the video to be detected are provided with time stamps, the video to be detected comprises user interaction voice, text information is displayed on the video frames, and the text information on the video frames is obtained by performing voice recognition on voice information corresponding to the video frames in the user interaction voice;
the text comparison unit is used for comparing the text information in the adjacent video frames and determining the text difference information between the adjacent video frames;
a first time display determining unit, configured to determine, according to text difference information of adjacent video frames, a target video frame containing a target text in the video frames, and determine, according to a timestamp of the target video frame, a first time display time of the target text, where the target video frame is a first video frame containing the target text in the video to be detected;
a first-time-to-sound determining unit, configured to extract the user interaction voice from the video to be detected, and detect a first-time sound-to-sound time of the target text in the user interaction voice;
and the time consumption calculation unit is used for determining the time consumption of the voice recognition response corresponding to the video to be detected according to the first display time and the first sounding time corresponding to the same target text.
CN202110379684.0A 2021-04-08 2021-04-08 Time-consuming determination method and device for voice recognition Pending CN113225624A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110379684.0A CN113225624A (en) 2021-04-08 2021-04-08 Time-consuming determination method and device for voice recognition

Publications (1)

Publication Number Publication Date
CN113225624A true CN113225624A (en) 2021-08-06

Family

ID=77086926

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110379684.0A Pending CN113225624A (en) 2021-04-08 2021-04-08 Time-consuming determination method and device for voice recognition

Country Status (1)

Country Link
CN (1) CN113225624A (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105529028A (en) * 2015-12-09 2016-04-27 百度在线网络技术(北京)有限公司 Voice analytical method and apparatus
CN108900776A (en) * 2018-08-22 2018-11-27 北京百度网讯科技有限公司 Method and apparatus for determining the response time
CN109803046A (en) * 2018-12-28 2019-05-24 惠州Tcl移动通信有限公司 Method, intelligent terminal and the storage device of intelligent terminal performance test
CN110335590A (en) * 2019-07-04 2019-10-15 中国联合网络通信集团有限公司 Speech recognition test method, apparatus and system
CN112542157A (en) * 2019-09-23 2021-03-23 北京声智科技有限公司 Voice processing method and device, electronic equipment and computer readable storage medium
CN111916061A (en) * 2020-07-22 2020-11-10 北京地平线机器人技术研发有限公司 Voice endpoint detection method and device, readable storage medium and electronic equipment
CN112420079A (en) * 2020-11-18 2021-02-26 青岛海尔科技有限公司 Voice endpoint detection method and device, storage medium and electronic equipment
CN112101329A (en) * 2020-11-19 2020-12-18 腾讯科技(深圳)有限公司 Video-based text recognition method, model training method and model training device

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114360530A (en) * 2021-11-30 2022-04-15 北京罗克维尔斯科技有限公司 Voice test method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40050020

Country of ref document: HK

SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination