CN115482809A - Keyword search method, keyword search device, electronic equipment and storage medium

Info

Publication number: CN115482809A
Application number: CN202211137975.XA
Authority: CN
Inventors: 张辉; 熊新雷; 周羊; 黄宇鑫; 陈泽裕; 文灿
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2022-09-19
Filing date: 2022-09-19
Publication date: 2022-12-16
Anticipated expiration: 2042-09-19
Also published as: CN115482809B

Abstract

The disclosure provides a keyword retrieval method, a keyword retrieval device, electronic equipment and a storage medium, relates to the technical field of voice recognition, in particular to the technical field of voice keyword retrieval, and can be applied to scenes such as customer service quality inspection. The scheme comprises the following steps: decoding the voice data to obtain a text of the voice data and decoded frame data, wherein the decoded frame data comprises a decoded frame corresponding to each character in the text; for each character in the text, calculating time information of the character based on the time stamp of the decoded frame of the character and the time stamps of the decoded frames of the adjacent characters of the character; performing keyword retrieval on the text, responding to the text containing preset target keywords, and determining time information of the target keywords based on the time information of characters in the target keywords; and generating a retrieval result containing the target keyword and the time information of the target keyword. The method can accurately acquire the time information of the target keyword on the basis of not introducing excessive additional models.

Description

Keyword search method, keyword search device, electronic equipment and storage medium

Technical Field

The disclosure relates to the technical field of voice recognition, in particular to the technical field of voice keyword retrieval, and can be applied to scenes such as customer service quality inspection.

Background

In some services for keyword retrieval of voice data, a text of the voice data is generally obtained based on a voice recognition technology, and then it is determined whether the text contains a target keyword in a preset keyword recognition manner. Once the text is found to contain the target keywords, the staff member needs to play the audio of the voice data in order to check whether the voice data has voice content matching the target keywords.

In order to facilitate a worker to quickly locate a voice section corresponding to a target keyword in voice data, the related technology can estimate time information of each character in a text based on the average speech speed of a speaker in the voice data after the text of the voice data is acquired, but the accuracy of the time information acquired in this way is low; alternatively, the related art may also introduce an alignment model to estimate the time information of each word in the text, but this approach requires the introduction of a new model, which significantly increases the cost.

Disclosure of Invention

The disclosure provides a keyword retrieval method, a keyword retrieval device, an electronic device and a storage medium.

According to a first aspect of the present disclosure, there is provided a keyword retrieval method, the method including:

decoding the voice data to obtain a text of the voice data and decoded frame data, wherein the decoded frame data comprise a decoded frame corresponding to each character in the text;

for each character in the text, calculating time information of the character based on the timestamp of the decoded frame of the character and the timestamps of the decoded frames of adjacent characters of the character;

performing keyword retrieval on the text, responding to the text containing preset target keywords, and determining time information of the target keywords based on the time information of characters in the target keywords;

and generating a retrieval result containing the target keyword and the time information of the target keyword.

In the embodiment of the present disclosure, for each word in a text, calculating time information of the word based on a time stamp of a decoded frame of the word and a time stamp of a decoded frame of an adjacent word of the word includes:

for each character in the text, determining a representative decoding frame of the character from the decoding frames of the character, wherein the representative decoding frame is the decoding frame with the highest probability of containing a phoneme of the character;

the time information of the character is calculated based on the time stamp of the representative decoded frame of the character and the time stamps of the representative decoded frames of the adjacent characters of the character.

In the embodiment of the present disclosure, calculating the time information of the word based on the time stamp of the representative decoded frame of the word and the time stamp of the representative decoded frame of the adjacent word of the word includes:

calculating the starting time of the character based on the time stamp of the representative decoding frame of the character and the time stamp of the representative decoding frame of the previous character of the character;

calculating the ending time of the character based on the timestamp of the representative decoded frame of the character and the timestamp of the representative decoded frame of the next character of the character;

when the character is the first character in the text, the representative decoding frame of the previous character of the character is the first decoding frame in the decoding frame data, and the first decoding frame is before the representative decoding frame of the character;

and in the case that the character is the last character in the text, the representative decoding frame of the character next to the character is the second decoding frame in the decoding frame data, and the second decoding frame is behind the representative decoding frame of the character.

In the embodiment of the present disclosure, calculating the start time of the word based on the timestamp of the representative decoded frame of the word and the timestamp of the representative decoded frame of the previous word of the word includes:

and calculating the average value of the time stamp of the representative decoding frame of the character and the time stamp of the representative decoding frame of the character before the character as the starting time of the character.

In the embodiment of the present disclosure, calculating the end time of the word based on the timestamp of the representative decoded frame of the word and the timestamp of the representative decoded frame of the word next to the word includes:

and calculating the average value of the time stamp of the representative decoding frame of the character and the time stamp of the representative decoding frame of the character which is next to the character to be used as the ending time of the character.

In the embodiment of the present disclosure, determining time information of a target keyword based on time information of a word in the target keyword includes:

taking the starting time of the first character in the target keyword as the starting time of the target keyword;

and taking the end time of the last character in the target keyword as the end time of the target keyword.

In the embodiment of the present disclosure, the timestamp of each decoded frame in the decoded frame data is calculated based on the frame number of the decoded frame and the duration of the decoded frame;

the duration of each decoded frame in the decoded frame data is the sum of the durations of all speech frames corresponding to the decoded frame.

In the disclosed embodiment, the speech data is decoded by a speech recognition model;

the number of speech frames corresponding to each decoded frame in the decoded frame data is proportional to the number of layers of the convolutional neural network in the speech recognition model and the step size of each layer of the convolutional neural network.

According to a second aspect of the present disclosure, there is provided a keyword retrieval apparatus including a voice decoding module, a time information calculation module, a keyword retrieval module, and a retrieval result generation module.

The voice decoding module is used for decoding the voice data to obtain a text of the voice data and decoded frame data, wherein the decoded frame data comprises a decoded frame corresponding to each character in the text;

the time information calculation module is used for calculating the time information of each character in the text based on the time stamp of the decoded frame of the character and the time stamps of the decoded frames of the adjacent characters of the character;

the keyword retrieval module is used for performing keyword retrieval on the text, responding to the text containing preset target keywords, and determining time information of the target keywords based on the time information of characters in the target keywords;

the search result generation module is used for generating a search result containing the target keyword and the time information of the target keyword.

In this embodiment of the present disclosure, the time information calculating module, when being configured to calculate, for each word in the text, time information of the word based on a time stamp of a decoded frame of the word and a time stamp of a decoded frame of an adjacent word of the word, is specifically configured to:

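for each word in the text, determine a representative decoded frame of the word from the decoded frames of the word, wherein the representative decoded frame is the decoded frame with the highest probability of containing a phoneme of the word; and

calculate the time information of the word based on the time stamp of the representative decoded frame of the word and the time stamps of the representative decoded frames of adjacent words of the word.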

In an embodiment of the present disclosure, the time information calculating module, when configured to calculate the time information of the word based on the timestamp of the representative decoded frame of the word and the timestamps of the representative decoded frames of adjacent words of the word, is specifically configured to:

calculating a start time of the word based on a time stamp of the representative decoded frame of the word and a time stamp of a representative decoded frame of a previous word of the word;

calculating the ending time of the character based on the time stamp of the representative decoding frame of the character and the time stamp of the representative decoding frame of the next character of the character;

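when the character is the first character in the text, the representative decoding frame of the previous character of the character is a first decoding frame in the decoding frame data, and the first decoding frame is before the representative decoding frame of the character;

and when the character is the last character in the text, the representative decoding frame of the next character of the character is a second decoding frame in the decoding frame data, and the second decoding frame is after the representative decoding frame of the character.

In this embodiment of the present disclosure, the time information calculating module, when being configured to calculate the start time of the word based on the timestamp of the representative decoded frame of the word and the timestamp of the representative decoded frame of the previous word of the word, is specifically configured to:

calculate the average value of the timestamp of the representative decoded frame of the word and the timestamp of the representative decoded frame of the previous word of the word as the start time of the word.

In this embodiment of the present disclosure, the time information calculating module, when being configured to calculate the end time of the word based on the time stamp of the representative decoded frame of the word and the time stamp of the representative decoded frame of the next word of the word, is specifically configured to:

calculate the average value of the timestamp of the representative decoded frame of the word and the timestamp of the representative decoded frame of the next word of the word as the end time of the word.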

In an embodiment of the present disclosure, when the keyword search module is configured to determine time information of a target keyword based on time information of a word in the target keyword, the keyword search module is specifically configured to:

taking the starting time of the first character in the target keyword as the starting time of the target keyword; and taking the end time of the last character in the target keyword as the end time of the target keyword.

According to a third aspect of the present disclosure, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first aspect.

According to a fourth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method according to the first aspect.

According to a fifth aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method according to the first aspect.

It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.

The technical scheme provided by the disclosure has the following beneficial effects:

According to the technical scheme of the present disclosure, the voice data is decoded to obtain the text of the voice data and the decoded frame corresponding to each character in the text; time information representing the position, in the voice data, of the voice segment corresponding to each character is accurately calculated based on the timestamps of the decoded frames; and after a target keyword is retrieved from the text, time information representing the position of the voice segment corresponding to the target keyword in the voice data is determined from the time information of the characters contained in the target keyword. In this way, the time information of the target keyword can be accurately acquired without introducing excessive additional models, the voice segment corresponding to the target keyword can be conveniently and quickly located in the voice data, and the accuracy of the time information is balanced against low cost.

In addition, because the probabilities of the decoded frames obtained by decoding voice data exhibit a delay, the time information of a character is calculated based not only on the timestamp of the decoded frame of the character but also on the timestamps of the decoded frames of its adjacent characters, which makes the calculated time information of the character more accurate.

Drawings

The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:

FIG. 1 is a schematic flowchart illustrating a keyword retrieval method provided by an embodiment of the present disclosure;

FIG. 2 illustrates a flowchart of one implementation of S120 in FIG. 1 according to an embodiment of the present disclosure;

FIG. 3 is a schematic diagram illustrating a keyword retrieval apparatus according to an embodiment of the present disclosure;

FIG. 4 shows a schematic block diagram of an example electronic device 400 that may be used to implement embodiments of the present disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of embodiments of the present disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

It should be understood that in the embodiments of the present disclosure, the character "/" generally indicates that the former and latter associated objects are in an "or" relationship. The terms "first", "second", etc. are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated.

In some services for keyword retrieval of voice data, a text of the voice data is generally obtained based on a voice recognition technology, and then whether the text contains a target keyword is determined by a preset keyword recognition mode. Once the text is found to contain the target keywords, the staff member needs to play the audio of the voice data in order to check whether the voice data has voice content matching the target keywords.

Taking customer service quality inspection as an example, to ensure that customer service staff commit no violations during service, a text of the speech data of the customer service staff is obtained based on a speech recognition technology, and whether the text contains a target keyword (such as an offending word or a non-standard word) is then determined in a preset keyword recognition manner. Once the text is found to contain a target keyword, the staff member needs to play the audio of the voice data to check whether the voice data has voice content matching the target keyword.

The present disclosure provides a keyword retrieval method that can accurately acquire the time information of a target keyword without introducing excessive additional models, makes it convenient to quickly locate the voice segment corresponding to the target keyword in the voice data, and balances the accuracy of the time information against low cost.

The execution subject of the method may be a terminal device, a computer, a server, or another device with data processing capabilities. The execution subject of the method is not limited here.

Optionally, the terminal device may be a mobile phone, or may be a tablet computer, a wearable device, an in-vehicle device, an Augmented Reality (AR)/Virtual Reality (VR) device, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a Personal Digital Assistant (PDA), or the like, and the specific type of the terminal device is not limited in the embodiment of the present disclosure.

In some embodiments, the server may be a single server, or may be a server cluster composed of a plurality of servers. In some embodiments, the server cluster may also be a distributed cluster. The present disclosure is also not limited to a specific implementation of the server.

The keyword search method is exemplified below.

Fig. 1 shows a schematic flowchart of a keyword search method provided in an embodiment of the present disclosure, and as shown in fig. 1, the method may include:

S110: decoding the voice data to obtain a text of the voice data and decoded frame data.

In the embodiment of the present disclosure, the speech data is decoded by a speech recognition model. For example, the speech recognition model may be a Connectionist Temporal Classification (CTC) model; CTC is an algorithm that can be used to train deep neural networks for speech recognition. Of course, the speech data may also be decoded by other types of speech recognition models, which is not limited by the present disclosure. After the speech data is decoded, a text corresponding to the speech data and decoded frame data are obtained. It is understood that the text contains the text content of the speech data, and the decoded frame data includes the decoded frames corresponding to each character in the text, where each character typically corresponds to a plurality of decoded frames.
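As a non-limiting illustration of how decoding can yield both a text and the decoded frames belonging to each character, the following sketch applies a greedy CTC-style collapse to per-frame labels. It is an assumption made for illustration only: the disclosure does not specify the decoder, and the blank symbol `BLANK` and the function `greedy_ctc_decode` are hypothetical names.

```python
from typing import List, Tuple

BLANK = "_"  # hypothetical blank symbol; real models define their own


def greedy_ctc_decode(frame_labels: List[str]) -> Tuple[str, List[List[int]]]:
    """Collapse per-frame labels into text.

    Returns the text and, for each character, the 1-based frame numbers
    of the decoded frames that produced it.
    """
    chars: List[str] = []
    frames_per_char: List[List[int]] = []
    previous = BLANK
    for frame_number, label in enumerate(frame_labels, start=1):
        if label != BLANK:
            if label == previous:
                # Same label repeated without a blank: same character.
                frames_per_char[-1].append(frame_number)
            else:
                # A new character starts at this frame.
                chars.append(label)
                frames_per_char.append([frame_number])
        previous = label
    return "".join(chars), frames_per_char
```

In this sketch a repeated label separated by a blank is treated as a new character, which is the usual CTC convention.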

It will be appreciated that each decoded frame in the decoded frame data has a timestamp, which is calculated based on the frame number of the decoded frame and the duration of a decoded frame and which indicates the time point of the decoded frame in the speech data. Specifically, the frame number of a decoded frame is its sequence number among all decoded frames; for example, the frame number of the first decoded frame is 1 and that of the second decoded frame is 2. The duration of a decoded frame is the duration of the speech corresponding to the decoded frame; for example, it may be 40 ms. For each decoded frame, the total duration from the first frame up to that decoded frame is calculated, which gives the timestamp of the decoded frame. The frame number is available once the voice data has been decoded, and the duration of a decoded frame is determined directly by the parameters of the decoding tool, so neither parameter requires additional calculation. The timestamp is obtained by multiplying the duration of a single decoded frame by the frame number, so the calculation is simple and fast.

It should be noted that each decoded frame in the decoded frame data generally corresponds to at least one speech frame. The number of speech frames corresponding to each decoded frame is proportional to the number of layers of the Convolutional Neural Network (CNN) in the speech recognition model and to the step size (stride) of each layer of the convolutional neural network. For example, if the convolutional neural network of the speech recognition model has 2 layers and the stride of each layer is 2, each decoded frame corresponds to 4 speech frames. The duration of each decoded frame in the decoded frame data is the sum of the durations of all speech frames corresponding to the decoded frame. Taking 4 speech frames per decoded frame as an example, if the duration of a speech frame is 10 ms, the duration of a decoded frame is 40 ms.

Optionally, the voice data in S110 may be extracted from original voice data. To eliminate redundant data (e.g., long periods of silence) from the original voice data, the voice data can be extracted from the original voice data based on Voice Activity Detection (VAD) technology, which significantly reduces the redundant data in the extracted voice data.
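The timestamp calculation described above can be sketched as follows. This is a minimal illustration that assumes frame numbers start at 1 and that all decoded frames share one duration; the function name `decoded_frame_timestamp_ms` and the 40 ms default are illustrative choices taken from the example in the text.

```python
def decoded_frame_timestamp_ms(frame_number: int,
                               frame_duration_ms: float = 40.0) -> float:
    """Return the timestamp of a decoded frame in milliseconds.

    The timestamp is the total duration from the first decoded frame up to
    and including this frame, i.e. frame number times frame duration.
    """
    if frame_number < 1:
        raise ValueError("frame numbers start at 1")
    return frame_number * frame_duration_ms
```

For instance, with 40 ms decoded frames the 25th decoded frame is placed at 1000 ms.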
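The arithmetic in the example above can be sketched as follows. The sketch assumes that the number of speech frames per decoded frame equals the product of the strides of the convolutional layers, which reproduces the 2-layer, stride-2 example; the text itself only states proportionality, so this rule and the function names are assumptions.

```python
from math import prod
from typing import Sequence


def speech_frames_per_decoded_frame(strides: Sequence[int]) -> int:
    """Assumed subsampling factor: product of the per-layer strides."""
    return prod(strides)


def decoded_frame_duration_ms(strides: Sequence[int],
                              speech_frame_ms: float = 10.0) -> float:
    """Duration of a decoded frame as the sum of its speech-frame durations."""
    return speech_frames_per_decoded_frame(strides) * speech_frame_ms
```

With two layers of stride 2 and 10 ms speech frames, this gives 4 speech frames and 40 ms per decoded frame, matching the example.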

S120: for each word in the text, time information for the word is calculated based on the time stamp of the decoded frame for the word and the time stamps of the decoded frames for adjacent words of the word.

In the embodiment of the present disclosure, the time information of a character indicates the position, in the speech data, of the speech segment corresponding to the character. As described above, the timestamp of a decoded frame indicates the time point of the decoded frame in the voice data, so the time information of a character can generally be calculated based on the timestamp of the decoded frame of the character. Because the probabilities of the decoded frames obtained by decoding voice data exhibit a delay, the time information of a character is calculated based not only on the timestamp of the decoded frame of the character but also on the timestamps of the decoded frames of its adjacent characters, which makes the calculated time information of the character more accurate.

S130: performing keyword retrieval on the text and, in response to the text containing a preset target keyword, determining the time information of the target keyword based on the time information of the characters in the target keyword.

The embodiment of the disclosure can preset a keyword list, and the keyword list contains target keywords which need to be retrieved in the text. When the keyword search is performed on the text, each target keyword in the keyword list can be matched with the text content in the text, and whether the text contains the target keyword is determined. If the text is determined to contain the target keyword, the time information of the target keyword can be determined based on the time information of the characters in the target keyword in the text, wherein the time information of the target keyword represents the position of the voice segment corresponding to the target keyword in the voice data.
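The matching of a preset keyword list against the text can be sketched as follows. This is a minimal illustration using plain substring search; the disclosure does not prescribe a matching algorithm, and the function name `find_keywords` and the returned tuple layout are assumptions.

```python
from typing import List, Sequence, Tuple


def find_keywords(text: str,
                  keywords: Sequence[str]) -> List[Tuple[str, int, int]]:
    """Find every occurrence of each keyword in the text.

    Returns (keyword, first_char_index, last_char_index) tuples with
    0-based, inclusive character indices into the text.
    """
    matches: List[Tuple[str, int, int]] = []
    for keyword in keywords:
        if not keyword:
            continue  # skip empty keywords, which would match everywhere
        start = text.find(keyword)
        while start != -1:
            matches.append((keyword, start, start + len(keyword) - 1))
            start = text.find(keyword, start + 1)
    return matches
```

The character indices returned here can then be used to look up the time information of the first and last characters of each matched keyword.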

S140: generating a retrieval result containing the target keyword and the time information of the target keyword.

It is understood that, if the execution subject of the embodiment of the present disclosure is a server, the server may return the search result to the terminal device after generating the search result, or store the search result in a preset location. If the execution main body of the embodiment of the disclosure is the terminal device, the terminal device can display the target keyword and the time information thereof after generating the search result, can store the search result in a preset position, and can send the search result to other terminal devices. Optionally, the user may play the voice in the voice data indicated by the time information according to the time information of the target keyword, and determine whether the voice content is consistent with the target keyword.

In the embodiment of the present disclosure, each word generally corresponds to a plurality of decoded frames, and for this situation, fig. 2 illustrates an implementation flow diagram of S120 in fig. 1 provided in the embodiment of the present disclosure, and as shown in fig. 2, the flow mainly includes the following steps:

S210: for each word in the text, a representative decoded frame for the word is determined from the decoded frames for the word.

It can be understood that each decoded frame of a word has a certain probability that the decoded frame contains a phoneme of the word, and the probability corresponding to the decoded frame is obtained in the process of decoding the voice data, wherein the representative decoded frame of each word of the text is the decoded frame with the highest probability of containing the phoneme of the word.
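Selecting the representative decoded frame can be sketched as follows. The sketch assumes that, for each character, the decoder exposes pairs of frame number and probability; the function name `representative_frame` and this input layout are illustrative assumptions.

```python
from typing import Sequence, Tuple


def representative_frame(frames: Sequence[Tuple[int, float]]) -> int:
    """Return the frame number with the highest probability.

    `frames` holds (frame_number, probability) pairs for one character.
    On ties, the earliest listed frame is returned.
    """
    if not frames:
        raise ValueError("a character needs at least one decoded frame")
    return max(frames, key=lambda frame: frame[1])[0]
```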

S220: the time information of the word is calculated based on the time stamp of the representative decoded frame of the word and the time stamps of the representative decoded frames of adjacent words of the word.

Because the representative decoded frame of a character is the decoded frame with the highest probability of containing a phoneme of the character, its timestamp is the time point at which the audio content of the character most probably occurs, and the time range given by the time information of the character therefore most probably contains that time point. Consequently, the time information calculated based on the timestamp of the representative decoded frame of the character and the timestamps of the representative decoded frames of its adjacent characters is highly accurate.

The time information of each word may include a start time and an end time, and the adjacent words of each word include a previous word and a next word of the word. In S120, a start time of the word may be calculated based on the time stamp of the representative decoded frame of the word and the time stamp of the representative decoded frame of the previous word of the word; the end time of the word is calculated based on the timestamp of the decoded frame representative of the word and the timestamp of the decoded frame representative of the word subsequent to the word.

It will be appreciated that in the case where a word is the first word in the text, since the word has no preceding word adjacent to it, the decoded frame representative of the preceding word of the word can be defined as the first decoded frame in the decoded frame data, which precedes the decoded frame representative of the word, that is, the frame number of the first decoded frame is less than the frame number of the decoded frame representative of the word. Here, the specific position of the first decoded frame may be determined according to actual design requirements, for example, a fifth decoded frame before the representative decoded frame of the first word may be selected as the first decoded frame.

If a word is the last word in the text, since the word has no next word adjacent to the word, the decoded frame representative of the next word of the word may be defined as a second decoded frame in the decoded frame data, where the second decoded frame is subsequent to the decoded frame representative of the word, that is, the frame number of the second decoded frame is greater than the frame number of the decoded frame representative of the word. Here, the specific position of the second decoded frame may be determined according to actual design requirements, for example, a fifth decoded frame after the representative decoded frame of the last word may be selected as the second decoded frame.
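The choice of the first and second decoded frames for the first and last words can be sketched as follows. The offset of five frames follows the example in the text; clamping to the valid frame range and the function name `boundary_frames` are assumptions added for illustration.

```python
from typing import Sequence, Tuple


def boundary_frames(rep_frames: Sequence[int],
                    total_frames: int,
                    offset: int = 5) -> Tuple[int, int]:
    """Pick stand-in frames for the missing neighbours of the first and last word.

    `rep_frames` holds the representative frame numbers in text order.
    Results are clamped to [1, total_frames]; in the degenerate case where
    a representative frame already sits at the edge, the stand-in frame
    coincides with it instead of lying strictly before or after it.
    """
    if not rep_frames:
        raise ValueError("the text must contain at least one word")
    first = max(1, rep_frames[0] - offset)
    second = min(total_frames, rep_frames[-1] + offset)
    return first, second
```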

In the embodiment of the present disclosure, when the start time of the character is calculated based on the time stamp of the representative decoded frame of the character and the time stamp of the representative decoded frame of the character immediately preceding the character, an average value of the time stamp of the representative decoded frame of the character and the time stamp of the representative decoded frame of the character immediately preceding the character may be calculated as the start time of the character. The average value of the timestamps of the representative decoding frames of the two adjacent characters can be used as the starting time of one character and the ending time of the other character, the time data of the two characters can be obtained through one-time calculation, the program running time can be shortened, and the efficiency can be improved.

Alternatively, the average may be a weighted average. Specifically, when calculating the average of the timestamp of the representative decoded frame of a character and the timestamp of the representative decoded frame of the previous character, a weight coefficient may be configured for each of the two timestamps, and the weighted average is calculated from the two timestamps and their corresponding weight coefficients.

In the embodiment of the present disclosure, when the end time of the character is calculated based on the time stamp of the representative decoded frame of the character and the time stamp of the representative decoded frame of the character next to the character, an average value of the time stamp of the representative decoded frame of the character and the time stamp of the representative decoded frame of the character next to the character may be calculated as the end time of the character. The average value of the timestamps of the representative decoding frames of the two adjacent characters can be used as the starting time of one character and the ending time of the other character, the time data of the two characters can be obtained through one-time calculation, the program running time can be shortened, and the efficiency can be improved.
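The start and end time calculation can be sketched as follows. Each boundary between two adjacent characters is computed once and used as the end time of one character and the start time of the next, as described above. The function name `character_times` and the millisecond units are assumptions; `first_ts` and `last_ts` stand for the timestamps of the first and second decoded frames used for the first and last characters.

```python
from typing import List, Sequence, Tuple


def character_times(rep_timestamps: Sequence[float],
                    first_ts: float,
                    last_ts: float) -> List[Tuple[float, float]]:
    """Return (start, end) for each character from representative timestamps.

    Every boundary is the plain average of two neighbouring timestamps.
    """
    padded = [first_ts, *rep_timestamps, last_ts]
    # One boundary per adjacent pair, shared by the two characters.
    boundaries = [(a + b) / 2 for a, b in zip(padded, padded[1:])]
    return [(boundaries[i], boundaries[i + 1])
            for i in range(len(rep_timestamps))]
```

For representative timestamps of 480, 800 and 1240 ms with stand-in timestamps of 280 and 1440 ms, the middle character spans 640 ms to 1020 ms.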

Alternatively, the average may be a weighted average. Specifically, when averaging the timestamp of the representative decoded frame of a character with the timestamp of the representative decoded frame of the next character, a weighting coefficient may be configured for each of the two timestamps, and the weighted average may be calculated from the two timestamps and their corresponding weighting coefficients.
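Because each averaged boundary is shared by two adjacent characters, the start and end times of all characters can be derived in one pass. The sketch below is illustrative only: the function name `char_spans` and the parameters `lead_ts` and `tail_ts` (timestamps of decoded frames assumed to lie before the first character's representative frame and after the last character's representative frame) are hypothetical, and an unweighted average is assumed.

```python
from typing import List, Tuple


def char_spans(rep_ts: List[float], lead_ts: float,
               tail_ts: float) -> List[Tuple[float, float]]:
    """Return (start, end) for each character from representative-frame timestamps.

    rep_ts:  one representative decoded-frame timestamp per character, in text order.
    lead_ts: assumed timestamp of a decoded frame before the first character.
    tail_ts: assumed timestamp of a decoded frame after the last character.
    """
    padded = [lead_ts] + list(rep_ts) + [tail_ts]
    # Each boundary is the average of two neighbouring timestamps. It is
    # computed once and reused as the end of one character and the start
    # of the next.
    boundaries = [(a + b) / 2 for a, b in zip(padded, padded[1:])]
    return list(zip(boundaries, boundaries[1:]))


spans = char_spans([1.0, 2.0, 4.0], lead_ts=0.0, tail_ts=5.0)
# spans == [(0.5, 1.5), (1.5, 3.0), (3.0, 4.5)]
```

For n characters this computes n + 1 boundaries rather than 2n separate averages, which is the saving the paragraph above refers to.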

In the embodiment of the present disclosure, the time information of each character may include a start time and an end time. When the time information of the target keyword is determined based on the time information of the characters in the target keyword, the start time of the first character in the target keyword may be used as the start time of the target keyword, and the end time of the last character in the target keyword may be used as the end time of the target keyword. Here, the first character is the character at the beginning of the target keyword and the last character is the character at its end; when the target keyword contains only one character, that character is both the first character and the last character.
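The mapping from per-character time information to keyword time information can be sketched as follows. This is a minimal example under stated assumptions: the function name `keyword_hits`, the dictionary layout of each hit, and the use of plain substring matching are choices made for illustration and are not specified by the disclosure.

```python
from typing import Dict, List, Tuple


def keyword_hits(text: str, spans: List[Tuple[float, float]],
                 keyword: str) -> List[Dict]:
    """Locate every occurrence of keyword in text and attach its time span.

    spans[i] is the (start, end) time of text[i]; one entry per character.
    """
    if not keyword:
        raise ValueError("keyword must be non-empty")
    if len(spans) != len(text):
        raise ValueError("spans must have one entry per character")
    hits = []
    pos = text.find(keyword)
    while pos != -1:
        last = pos + len(keyword) - 1
        hits.append({
            "keyword": keyword,
            "start": spans[pos][0],   # start time of the first character
            "end": spans[last][1],    # end time of the last character
        })
        pos = text.find(keyword, pos + 1)
    return hits


demo_spans = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0), (3.0, 4.0), (4.0, 5.0)]
demo_hits = keyword_hits("abcab", demo_spans, "ab")
```

A one-character keyword takes both its start and its end from the same character, matching the single-character case described above.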

Based on the same principle as the keyword retrieval method described above, the embodiment of the present disclosure provides a keyword retrieval apparatus. Fig. 3 shows a schematic diagram of the keyword retrieval apparatus provided by the embodiment of the present disclosure. As shown in fig. 3, the keyword retrieval apparatus 300 includes a voice decoding module 310, a time information calculation module 320, a keyword retrieval module 330, and a retrieval result generation module 340.

The voice decoding module 310 is configured to decode the voice data to obtain a text of the voice data and decoded frame data, where the decoded frame data includes a decoded frame corresponding to each character in the text;

the time information calculation module 320 is configured to calculate, for each character in the text, time information of the character based on the timestamp of the decoded frame of the character and the timestamps of the decoded frames of the adjacent characters of the character;

the keyword retrieval module 330 is configured to perform keyword retrieval on the text and, in response to the text containing a preset target keyword, determine time information of the target keyword based on the time information of the characters in the target keyword;

the retrieval result generation module 340 is configured to generate a retrieval result including the target keyword and the time information of the target keyword.

According to the keyword retrieval apparatus provided by the embodiment of the present disclosure, the voice data is decoded to obtain a text of the voice data and a decoded frame corresponding to each character in the text; time information representing the position in the voice data of the voice segment corresponding to each character is calculated based on the timestamps of the decoded frames; and, after a target keyword is retrieved from the text, time information representing the position in the voice data of the voice segment corresponding to the target keyword is determined from the time information of the characters contained in the target keyword. In this way, the time information of the target keyword can be acquired accurately without introducing excessive additional models, the voice segment corresponding to the target keyword can be located conveniently and quickly in the voice data, and accurate time information is obtained at low cost.
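The division of work among the four modules can be sketched end to end as follows. This is an illustrative outline only, not the implementation of the apparatus: `DecodeResult`, `calc_time_info`, `retrieve`, `run_pipeline`, and the `decode` callable are hypothetical names, the decoder is supplied by the caller, the boundaries use an unweighted average, and sorting the results by start time is an assumption of this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class DecodeResult:
    text: str             # recognized text
    rep_ts: List[float]   # representative decoded-frame timestamp per character (s)
    lead_ts: float        # assumed frame timestamp before the first character
    tail_ts: float        # assumed frame timestamp after the last character


def calc_time_info(d: DecodeResult) -> List[Tuple[float, float]]:
    """Per-character (start, end) from averaged neighbouring timestamps."""
    padded = [d.lead_ts] + d.rep_ts + [d.tail_ts]
    bounds = [(a + b) / 2 for a, b in zip(padded, padded[1:])]
    return list(zip(bounds, bounds[1:]))


def retrieve(text: str, spans: List[Tuple[float, float]],
             keywords: List[str]) -> List[Dict]:
    """Find each preset keyword and take its span from its first and last characters."""
    results = []
    for kw in keywords:
        pos = text.find(kw) if kw else -1  # skip empty keywords
        while pos != -1:
            results.append({"keyword": kw,
                            "start": spans[pos][0],
                            "end": spans[pos + len(kw) - 1][1]})
            pos = text.find(kw, pos + 1)
    return results


def run_pipeline(audio: bytes, decode: Callable[[bytes], DecodeResult],
                 keywords: List[str]) -> List[Dict]:
    decoded = decode(audio)                          # role of module 310
    spans = calc_time_info(decoded)                  # role of module 320
    hits = retrieve(decoded.text, spans, keywords)   # role of module 330
    return sorted(hits, key=lambda h: h["start"])    # role of module 340
```

Passing the decoder in as a callable keeps the sketch independent of any particular speech recognition model, which matches the point that no additional alignment model is needed beyond the decoder itself.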

In addition, because the probabilities in the decoded frames obtained by decoding voice data tend to be delayed, the time information of a character is calculated not only from the timestamp of the decoded frame of that character but also from the timestamps of the decoded frames of its adjacent characters, which makes the calculated time information of the character more accurate.

In this embodiment of the present disclosure, the time information calculation module 320, when configured to calculate, for each character in the text, the time information of the character based on the timestamp of the decoded frame of the character and the timestamps of the decoded frames of the adjacent characters of the character, is specifically configured to:

In this embodiment of the present disclosure, the time information calculation module 320, when configured to calculate the time information of the character based on the timestamp of the representative decoded frame of the character and the timestamps of the representative decoded frames of the adjacent characters of the character, is specifically configured to:

In this embodiment of the present disclosure, the time information calculation module 320, when configured to calculate the start time of the character based on the timestamp of the representative decoded frame of the character and the timestamp of the representative decoded frame of the character preceding the character, is specifically configured to:

In this embodiment of the present disclosure, the time information calculation module 320, when configured to calculate the end time of the character based on the timestamp of the representative decoded frame of the character and the timestamp of the representative decoded frame of the character following the character, is specifically configured to:

In the embodiment of the present disclosure, the keyword retrieval module 330, when configured to determine the time information of the target keyword based on the time information of the characters in the target keyword, is specifically configured to:

It can be understood that each module of the keyword retrieval apparatus in the embodiment of the present disclosure has the function of implementing the corresponding step of the keyword retrieval method. The function can be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the functions described above. The modules can be software and/or hardware, and each module can be implemented independently or integrated with other modules. For the functional description of each module of the keyword retrieval apparatus, reference may be made to the corresponding description of the keyword retrieval method, which is not repeated here.

In the technical solution of the present disclosure, the acquisition, storage, and use of the personal information of the users involved all comply with the relevant laws and regulations and do not violate public order and good customs.

The present disclosure also provides an electronic device, a readable storage medium, a computer program product, and an autonomous vehicle according to embodiments of the present disclosure.

In an exemplary embodiment, an electronic device includes: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the above embodiments. The electronic device may be the computer or the server described above.

In an exemplary embodiment, the readable storage medium may be a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method according to the above embodiments.

In an exemplary embodiment, the computer program product comprises a computer program which, when being executed by a processor, carries out the method according to the above embodiments.

In an exemplary embodiment, an autonomous vehicle includes the electronic device described above.

FIG. 4 shows a schematic block diagram of an example electronic device 400 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital processors, cellular telephones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not intended to limit implementations of the disclosure described and/or claimed herein.

As shown in fig. 4, the electronic device 400 includes a computing unit 401 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 402 or a computer program loaded from a storage unit 408 into a Random Access Memory (RAM) 403. In the RAM 403, various programs and data required for the operation of the device 400 can also be stored. The computing unit 401, ROM 402, and RAM 403 are connected to each other via a bus 404. An input/output (I/O) interface 405 is also connected to bus 404.

A number of components in the electronic device 400 are connected to the I/O interface 405, including: an input unit 406 such as a keyboard, a mouse, or the like; an output unit 407 such as various types of displays, speakers, and the like; a storage unit 408 such as a magnetic disk, optical disk, or the like; and a communication unit 409 such as a network card, modem, wireless communication transceiver, etc. The communication unit 409 allows the electronic device 400 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.

Computing unit 401 may be a variety of general and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 401 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 401 executes the respective methods and processes described above, such as the keyword retrieval method. For example, in some embodiments, the keyword retrieval method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 408. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 400 via the ROM 402 and/or the communication unit 409. When the computer program is loaded into RAM 403 and executed by computing unit 401, one or more steps of the keyword retrieval method described above may be performed. Alternatively, in other embodiments, the computing unit 401 may be configured to perform the keyword retrieval method by any other suitable means (e.g., by means of firmware).

Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user may provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.

The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.

It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.

The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims

1. A keyword retrieval method, the method comprising:

decoding voice data to obtain a text of the voice data and decoded frame data, wherein the decoded frame data comprises a decoded frame corresponding to each character in the text;

for each character in the text, calculating time information of the character based on the time stamp of the decoded frame of the character and the time stamps of the decoded frames of the adjacent characters of the character;

performing keyword retrieval on the text and, in response to the text containing a preset target keyword, determining time information of the target keyword based on the time information of the characters in the target keyword;

2. The method of claim 1, wherein said calculating, for each character in the text, time information of the character based on the timestamp of the decoded frame of the character and the timestamps of the decoded frames of the adjacent characters of the character comprises:

for each character in the text, determining a representative decoded frame of the character from the decoded frames of the character, wherein the representative decoded frame is the decoded frame having the highest probability of containing the phoneme of the character;

3. The method of claim 2, wherein said calculating the time information of the character based on the timestamp of the representative decoded frame of the character and the timestamps of the representative decoded frames of the adjacent characters of the character comprises:

and, in the case where the character is the last character in the text, the representative decoded frame of the character following the character is a second decoded frame in the decoded frame data, the second decoded frame being located after the representative decoded frame of the character.

4. The method of claim 3, wherein calculating the start time of the character based on the timestamp of the representative decoded frame of the character and the timestamp of the representative decoded frame of the character preceding the character comprises:

5. The method of claim 3, wherein calculating the end time of the character based on the timestamp of the representative decoded frame of the character and the timestamp of the representative decoded frame of the character following the character comprises:

6. The method of claim 3, wherein the determining the time information of the target keyword based on the time information of the characters in the target keyword comprises:

7. The method according to any one of claims 1-6, wherein the timestamp of each decoded frame in the decoded frame data is calculated based on the frame number of the decoded frame and the duration of the decoded frame;

and the duration of each decoded frame in the decoded frame data is the sum of the durations of all speech frames corresponding to the decoded frame.

8. The method of claim 7, wherein the voice data is decoded by a voice recognition model;

the number of voice frames corresponding to each decoded frame in the decoded frame data is directly proportional to the number of layers of the convolutional neural network in the voice recognition model and to the stride of each layer of the convolutional neural network.

9. A keyword retrieval apparatus, the apparatus comprising:

a voice decoding module, configured to decode voice data to obtain a text of the voice data and decoded frame data, wherein the decoded frame data comprises a decoded frame corresponding to each character in the text;

a time information calculation module, configured to calculate, for each character in the text, time information of the character based on the timestamp of the decoded frame of the character and the timestamps of the decoded frames of the adjacent characters of the character;

a keyword retrieval module, configured to perform keyword retrieval on the text and, in response to the text containing a preset target keyword, determine time information of the target keyword based on the time information of the characters in the target keyword;

and a retrieval result generation module, configured to generate a retrieval result containing the target keyword and the time information of the target keyword.

10. The apparatus of claim 9, wherein the time information calculation module, when configured to calculate, for each character in the text, the time information of the character based on the timestamp of the decoded frame of the character and the timestamps of the decoded frames of the adjacent characters of the character, is specifically configured to:

11. The apparatus according to claim 10, wherein the time information calculation module, when configured to calculate the time information of the character based on the timestamp of the representative decoded frame of the character and the timestamps of the representative decoded frames of the adjacent characters of the character, is specifically configured to:

12. The apparatus according to claim 11, wherein the time information calculation module, when configured to calculate the start time of the character based on the timestamp of the representative decoded frame of the character and the timestamp of the representative decoded frame of the character preceding the character, is specifically configured to:

13. The apparatus according to claim 11, wherein the time information calculation module, when configured to calculate the end time of the character based on the timestamp of the representative decoded frame of the character and the timestamp of the representative decoded frame of the character following the character, is specifically configured to:

14. The apparatus of claim 11, wherein the keyword retrieval module, when configured to determine the time information of the target keyword based on the time information of the characters in the target keyword, is specifically configured to:

15. The apparatus according to any one of claims 9-14, wherein the timestamp of each decoded frame in the decoded frame data is calculated based on the frame number of the decoded frame and the duration of the decoded frame;

16. The apparatus of claim 15, wherein the voice data is decoded by a voice recognition model;

the number of voice frames corresponding to each decoded frame in the decoded frame data is directly proportional to the number of layers of the convolutional neural network in the voice recognition model and to the stride of each layer of the convolutional neural network.

17. An electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor;

wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.

18. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-8.

19. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-8.