CN109657094B - Audio processing method and terminal equipment - Google Patents

Audio processing method and terminal equipment

Info

Publication number
CN109657094B
CN109657094B (Application CN201811423356.0A)
Authority
CN
China
Prior art keywords
text
entry
target
searched
audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811423356.0A
Other languages
Chinese (zh)
Other versions
CN109657094A (en)
Inventor
彭捷
黄欣新
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201811423356.0A priority Critical patent/CN109657094B/en
Publication of CN109657094A publication Critical patent/CN109657094A/en
Application granted granted Critical
Publication of CN109657094B publication Critical patent/CN109657094B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention is applicable to the technical field of computer applications, and provides an audio processing method, a terminal device and a computer readable storage medium, comprising the following steps: acquiring an audio file to be processed; parsing the audio file to obtain original text information, where the original text information comprises the entry text of the audio file and the playing time of each entry in the entry text; acquiring a text to be searched input by a user, and determining, in the entry text, a target entry matching the text to be searched and a target playing time at which the target entry is played; and playing the audio corresponding to the target entry according to the target entry and the target playing time. Because the position and playing time of an entry in the entry text are determined from the entry input by the user, the audio file can be presented flexibly in the manner the user selects, which improves both the intelligence of audio playback in audio software and the user experience.

Description

Audio processing method and terminal equipment
Technical Field
The present invention relates to the field of computer applications, and in particular, to an audio processing method, a terminal device, and a computer readable storage medium.
Background
With the development of computer multimedia technology, many kinds of audio and video playing software have appeared; users can play music and watch video through such software, enriching everyday entertainment. In existing audio playing software, however, an audio file can only be played under a set of preset playing modes; playback cannot follow the user's own playing requirements, so flexibility is low.
Disclosure of Invention
In view of the above, embodiments of the present invention provide an audio processing method, a terminal device, and a computer readable storage medium, so as to solve the prior-art problem that audio playback cannot follow a user's playing requirement and is therefore inflexible.
A first aspect of an embodiment of the present invention provides an audio processing method, including:
acquiring an audio file to be processed;
analyzing the audio file to obtain original text information, the original text information comprising an entry text of the audio file and a playing time at which each entry in the entry text is played;
acquiring a text to be searched input by a user, and determining, in the entry text, a target entry matching the text to be searched and a target playing time at which the target entry is played;
and playing the audio corresponding to the target entry according to the target entry and the target playing time.
A second aspect of an embodiment of the present invention provides a terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the following steps when executing the computer program:
acquiring an audio file to be processed;
analyzing the audio file to obtain original text information, the original text information comprising an entry text of the audio file and a playing time at which each entry in the entry text is played;
acquiring a text to be searched input by a user, and determining, in the entry text, a target entry matching the text to be searched and a target playing time at which the target entry is played;
and playing the audio corresponding to the target entry according to the target entry and the target playing time.
A third aspect of an embodiment of the present invention provides a terminal device, including:
the acquisition unit is used for acquiring the audio file to be processed;
the parsing unit is used for parsing the audio file to obtain original text information, the original text information comprising an entry text of the audio file and a playing time at which each entry in the entry text is played;
the matching unit is used for acquiring a text to be searched input by a user and determining, in the entry text, a target entry matched with the text to be searched and a target playing time at which the target entry is played;
and the playing unit is used for playing the audio corresponding to the target entry according to the target entry and the target playing time.
A fourth aspect of an embodiment of the invention provides a computer readable storage medium storing a computer program comprising program instructions which, when executed by a processor, cause the processor to perform the method of the first aspect described above.
Compared with the prior art, the embodiment of the invention has the beneficial effects that:
According to the embodiment of the invention, an audio file to be processed is acquired; the audio file is parsed to obtain original text information, which comprises the entry text of the audio file and the playing time of each entry in the entry text; a text to be searched input by a user is acquired, and a target entry matching the text to be searched and a target playing time at which the target entry is played are determined in the entry text; and the audio corresponding to the target entry is played according to the target entry and the target playing time. Because the position and playing time of an entry in the entry text are determined from the entry input by the user, the audio file can be presented flexibly in the manner the user selects, which improves both the intelligence of audio playback in audio software and the user experience.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the embodiments or in the description of the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and a person of ordinary skill in the art may obtain other drawings from them without inventive effort.
Fig. 1 is a flowchart of an audio processing method according to a first embodiment of the present invention;
Fig. 2 is a flowchart of an audio processing method according to a second embodiment of the present invention;
Fig. 3 is a schematic diagram of a terminal device according to a third embodiment of the present invention;
Fig. 4 is a schematic diagram of a terminal device according to a fourth embodiment of the present invention.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth such as the particular system architecture, techniques, etc., in order to provide a thorough understanding of the embodiments of the present invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present invention with unnecessary detail.
In order to illustrate the technical scheme of the invention, the following description is made by specific examples.
Referring to fig. 1, fig. 1 is a flowchart of an audio processing method according to the first embodiment of the present invention. In this embodiment, the audio processing method is executed by a terminal. The terminal includes, but is not limited to, a mobile terminal such as a smart phone, a tablet computer, or a wearable device, and may also be a desktop computer or the like. The audio processing method shown in the figure may include the following steps:
S101: acquiring an audio file to be processed.
The audio file is acquired before being processed; it may be acquired via wireless transmission, a wired network, or the like, which is not limited here. Audio files generally fall into two categories: sound files and Musical Instrument Digital Interface (MIDI) files. A sound file records original sound captured by a recording device, storing binary samples of the real sound directly, whereas a MIDI file is a sequence of musical performance instructions that can be played by a sound output device or an electronic instrument connected to a computer. The audio file in this embodiment is a MIDI file.
Audio files are an important type of file in Internet multimedia. The format of the audio file in this embodiment may include, but is not limited to: Waveform Audio File Format (WAVE), Audio Interchange File Format (AIFF), Audio (AU), Moving Picture Experts Group (MPEG) audio, RealAudio (RAM), and Musical Instrument Digital Interface (MIDI). The WAVE format is a sound file format developed by Microsoft that conforms to the Resource Interchange File Format (RIFF) specification; it is used to store audio resources on the WINDOWS platform and is supported by that platform and its applications. WAVE supports a variety of bit depths, sampling frequencies, and channel counts, and is a popular sound file format on personal computers; its files are relatively large, so it is mostly used to store brief sound clips. An AU file is a compressed digital audio format commonly used in web applications. An MPEG file follows the moving-picture compression standard; the audio format here refers to the audio part of the MPEG standard, i.e., an MPEG audio layer, which offers a good trade-off between sound quality and storage space. A RealAudio file is mainly used to transmit audio in real time over low-rate wide area networks; as network connection rates differ, so does the sound quality received by clients. MIDI is a unified international standard for digital music and synthesizers: it defines the way computer music programs, synthesizers, and other electronic devices exchange music signals, as well as the cable and hardware protocols that let electronic instruments and computers from different manufacturers connect. It can be used to create digital sounds for different instruments, such as violin or piano. Each audio file carries its own audio file information, which may include the original text information of the file, the file format, the number of frames, the playing and ending time of each piece of text, text duration, and so on. For example, the audio file information of a song may include its lyrics, duration, composer, lyricist, singer, etc.
S102: analyzing the audio file to obtain original text information; the original text information comprises an entry text of the audio file and a playing time for playing each entry in the entry text.
After the audio file to be processed is acquired, it is parsed to obtain the original text information. Specifically, because audio files come in different types, each has its own encoding scheme and a corresponding parsing scheme. In this solution, the format and encoding of the audio file are determined first, and the original text information is then parsed out according to that encoding. The original text information comprises the entry text of the audio file and the playing time of each entry in the entry text.
Illustratively, the audio file of a song includes at least the song information and the lyric text. The original text information is obtained by reading and parsing the lyric file information of the audio; for a song, the original text information consists of the lyrics and their playing times, where the times before and after each lyric line are its begin and end times respectively. The parsed text takes the following format:
32.34 The ancient street passing through the alley 35.89
35.89 Near the green wall you look into the distant sunset 39.03
39.03 At a glance, just because of being careless 42.79
42.79 Disturbing my thoughts day and night 46.07
46.07 The circling thoughts become a butterfly 49.57
49.57 The lotus passes through the reed leaves 52.74
52.74 Even if the mountains are stacked layer upon layer 56.56
56.56 Still the flow is not cut off here 59.82
In this embodiment, the playing time indicates when each sentence, entry, or word in the audio text begins to be played, measured from the 0th second of playback of the audio file: the moment at which the first word of an entry begins to play is that entry's playing time. In the example above, the entry "Even if the mountains are stacked layer upon layer" has a playing time of 52.74 seconds, i.e., it begins at the 52.74th second of playback. Further, to represent the timing of each word more precisely, the playing time of every individual word in the audio text can also be determined, the playing time of a word being the moment at which that word begins to play.
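For illustration, the following minimal Python sketch parses lyric lines in the "begin-time, text, end-time" layout shown above into (begin, end, text) records; the exact line layout is assumed from the example, and real lyric files may use other layouts:

import re

# Parse lyric lines of the form "32.34 lyric text 35.89" into
# (begin_seconds, end_seconds, entry_text) records.
LINE_PATTERN = re.compile(r"^\s*(\d+(?:\.\d+)?)\s+(.+?)\s+(\d+(?:\.\d+)?)\s*$")

def parse_entry_lines(raw_text):
    entries = []
    for line in raw_text.splitlines():
        match = LINE_PATTERN.match(line)
        if match:
            begin, text, end = match.groups()
            entries.append((float(begin), float(end), text))
        # lines without timing information are skipped
    return entries

print(parse_entry_lines("52.74 Even if the mountains are stacked layer upon layer 56.56"))
# -> [(52.74, 56.56, 'Even if the mountains are stacked layer upon layer')]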
Many audio compression methods exist, all trying to make digital audio occupy less storage while preserving sound quality. MPEG compression is lossy, meaning that part of the audio information is necessarily discarded when this method is used. Because of how the compression method controls what is discarded, the loss is hard to perceive: several sophisticated and demanding mathematical algorithms ensure that only barely audible parts of the original audio are lost, leaving more room for the important information. In this way a compression factor of about 12 can be achieved, and it is this quality that has made MPEG audio popular. A Moving Picture Experts Group Audio Layer III (MP3) file is divided into three parts: the TAG_V2 (ID3V2) tag, the audio data frames, and the TAG_V1 (ID3V1) tag. The ID3V2 tag sits at the beginning of the file, contains information such as the author, composer, and album, has a variable length, and extends the amount of information ID3V1 can hold. The frames occupy the middle of the file; their number is determined by the file size and the frame length. The length of each frame may be fixed or variable, depending on the bit rate, and each frame is divided into a frame header and a data body; the frame header records the bit rate, sampling rate, version, and other information of the MP3 stream, and every frame is independent of the others.
Illustratively, an MP3 file is composed of frames, the smallest constituent unit of an MP3 file. MPEG audio is divided into three layers of increasing compression quality and encoding complexity, corresponding to the MP1, MP2, and MP3 sound files, with different layers used for different purposes. The higher the MPEG audio layer, the more complex the encoder and the higher the compression ratio: MP1 and MP2 achieve ratios of 4:1 and 6:1-8:1 respectively, while MP3 reaches 10:1-12:1. One minute of uncompressed CD-quality music needs about 10 MB of storage, but only about 1 MB after MP3 encoding. MP3, however, compresses the audio signal lossily; to reduce audible distortion, MP3 encoding first performs a spectral analysis of the audio, filters out components below the noise threshold, then quantizes and encodes the remaining bits, finally producing an MP3 file with a high compression ratio whose playback remains close to the original source.
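As a concrete illustration of the MP3 layout described above (ID3V2 tag, audio data frames, ID3V1 tag), the sketch below computes the length of a leading ID3v2 tag so that the first audio data frame can be located; it follows the public ID3v2 header layout ("ID3" marker, version, flags, 4-byte syncsafe size) rather than anything specific to this patent:

def id3v2_tag_size(mp3_bytes):
    # An ID3v2 tag begins with the ASCII marker "ID3", two version bytes and
    # one flags byte, followed by a 4-byte "syncsafe" size (7 useful bits per byte).
    if mp3_bytes[:3] != b"ID3":
        return 0  # no ID3v2 tag: the audio data frames start immediately
    size = 0
    for b in mp3_bytes[6:10]:
        size = (size << 7) | (b & 0x7F)
    return 10 + size  # the 10-byte tag header plus the tag body

# Usage: skip the variable-length tag to reach the first audio data frame.
# with open("song.mp3", "rb") as f:      # "song.mp3" is a placeholder name
#     data = f.read()
# first_frame_offset = id3v2_tag_size(data)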
WMA is a streaming media file format defined by Microsoft. The first 16 bytes of every WMA file are fixed, hexadecimal "30 26 B2 75 8E 66 CF 11 A6 D9 00 AA 00 62 CE 6C", and are used to identify the file as WMA. The next 8 bytes form an integer, low byte first, giving the size of the whole WMA file header; the header contains all non-audio information, such as tag information, and is followed by the audio data. Starting at file offset 31, the header stores a number of frames, among them the standard tag information, the extended tag information, and the WMA file control information we need. The frames are not of equal length, but each frame header is a fixed 24 bytes: the first 16 bytes name the frame and the last 8 bytes give its size. Since only the tag information needs to be read and written, and it is stored in two frames (the standard tag frame and the extended tag frame), only those two frames need to be processed; any other frame can be skipped entirely using the frame length just read.

The standard tag frame contains only four items: song title, artist, copyright, and remarks. Its frame name is hexadecimal "33 26 B2 75 8E 66 CF 11 A6 D9 00 AA 00 62 CE 6C". After the 24-byte frame header come five 2-byte integers, the first four giving the sizes of the song title, artist, copyright, and remarks respectively; after those 10 bytes, the contents of the five items are stored in order. All strings in a WMA file are stored in wide-character encoding, and each string is followed by a 0 terminator.

The number of items in the extended tag frame is not fixed, and each item is organized in the same way as a frame. The frame name of the extended tag frame is hexadecimal "40 A4 D0 D2 07 E3 D2 11 97 F0 00 A0 C9 5E A8 50"; after the 24-byte frame header, a 2-byte integer gives the number of extended items in the frame, followed by the items themselves. Each extended item consists of a name and a value: a 2-byte integer for the size of the name, the name itself, a 2-byte integer identifier, a 2-byte integer for the size of the value, and then the value. When the item name is WMFSDKVersion, the value is the WMA file version; when it is WM/AlbumTitle, the value is the album name; when it is WM/Genre, the value is the genre; similarly, the use of each value is easily seen from its name. The names and values of these extended items are almost all stored as wide-character strings. The integer identifier matters only for the names WM/TrackNumber and WM/Track: when the identifier is 3, the value is a 4-byte integer carrying the track information, and when it is 0, the value is an ordinary string.
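A minimal sketch of the identification step just described: it checks the fixed 16-byte value quoted in the text and reads the following 8 bytes as the header size. The little-endian byte order is an assumption based on the "low byte first" description:

import struct

# The fixed 16-byte value quoted above that identifies a WMA file.
WMA_HEADER_GUID = bytes.fromhex("30 26 B2 75 8E 66 CF 11 A6 D9 00 AA 00 62 CE 6C")

def read_wma_header_size(data):
    if data[:16] != WMA_HEADER_GUID:
        raise ValueError("not a WMA file")
    # The following 8 bytes give the size of the whole file header;
    # little-endian ("low byte first") order is assumed here.
    (header_size,) = struct.unpack_from("<Q", data, 16)
    return header_size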
AMR (Adaptive Multi-Rate) is an audio coding format optimized for speech, dedicated to compressing voice frequencies efficiently. AMR audio is mainly used for audio compression on mobile devices; its compression ratio is very high but its sound quality is poor, so it is suited to speech-type audio rather than music with high sound-quality requirements. It encodes at 8 different bit rates, and AMR defines 16 frame types in total: types 0-7 correspond to the 8 coding modes, each with a different rate, and types 8-15 are used for noise description or are reserved. The AMR header magic number is 6 bytes, and in one of the coding modes, for example, each frame is 21 bytes. The header of AMR audio differs between the mono and multi-channel cases: in the mono case it contains only the magic number, while in the multi-channel case it contains both the magic number and a 32-bit channel description field, whose first 28 bits are reserved and must be set to 0 and whose last 4 bits indicate the number of channels used. The header is followed by temporally consecutive speech frame blocks; each block contains one 8-bit-aligned speech frame per channel, arranged consecutively starting from the first channel. Each speech frame starts with an 8-bit frame header in which the padding bit P must be set to 0, and every frame is 8-bit aligned.
It should be noted that, in different coding modes, the audio frame size differs because the bit rate differs. The frame size is computed as follows: AMR produces 50 frames of audio data per second, one frame per 20 ms, so the amount of data per frame depends on the bit rate. At a bit rate of 12.2 kbit/s, the number of audio data bits per frame is 12200 / 50 = 244 bits = 30.5 bytes, rounded up to 31 bytes; adding the one-byte frame header gives a frame size of 32 bytes.
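The frame-size arithmetic can be written out directly. The following sketch reproduces the 244-bit / 31-byte / 32-byte figures above for the 12.2 kbit/s mode; other modes would substitute their own bit rates:

import math

def amr_frame_size(bit_rate_bps):
    # AMR produces one frame per 20 ms, i.e. 50 frames per second.
    bits_per_frame = bit_rate_bps / 50           # 12200 / 50 = 244 bits
    data_bytes = math.ceil(bits_per_frame / 8)   # 244 bits -> 30.5 -> 31 bytes
    return data_bytes + 1                        # plus the one-byte frame header

print(amr_frame_size(12200))  # -> 32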
S103: obtaining a text to be searched input by a user, determining a target entry matched with the text to be searched in the entry text, and playing a target playing time of the target entry.
After the entry text in the original text information has been parsed out, the text to be searched input by the user is acquired, and the target entry matching it, together with the target playing time at which that entry is played, is determined in the entry text. The entry text may be, for example, the lyrics of a song audio file, and the text to be searched input by the user may be a word or a sentence; the corresponding playing time is then looked up in the original lyric file using that word or sentence. The object searched for in this embodiment is the target entry input by the user: since every entry and its playing time are stored in the original text information, the target playing time of the target entry is determined by finding the corresponding entry and its playing time there.
In practical applications, the text to be searched may be obtained by the user typing an entry into a window of the audio player, or by the user placing the cursor at some position in the entry text. When matching the text to be searched against the entry text, the part of the entry text with the highest similarity coefficient to the text to be searched can be found by computing similarity coefficients, thereby determining the target entry and the target playing time at which it is played.
Specifically, when computing the similarity coefficient between the text to be searched and the entry text, both objects can first be segmented into at least one entry each. The matching deviation between the two can then be determined by computing either a distance between the two sequences or a similarity coefficient. The distance between the text to be searched and the entry text may be computed as the Euclidean distance, the normalized Euclidean distance, the Mahalanobis distance, the Manhattan distance, the Chebyshev distance, the Minkowski distance, or the Hamming distance; alternatively, the similarity coefficient between them may be computed as a cosine similarity or adjusted cosine similarity, the Pearson correlation coefficient, log-likelihood similarity, mutual information gain, or a word-pair similarity coefficient.
By way of example, the matching deviation between the text to be searched and the entry text may be calculated with the Jaccard similarity coefficient J(X, Y) = |X ∩ Y| / |X ∪ Y|, where X and Y respectively denote the sets of quantized entries of the text to be searched and of the entry text; computing the Jaccard similarity coefficient between the two thus determines the matching deviation between the text to be searched and the entry text.
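A minimal sketch of this Jaccard matching, treating the text to be searched and each candidate entry as sets of terms; the term sets are assumed to come from a prior segmentation step such as the one described in the second embodiment:

def jaccard(terms_x, terms_y):
    # J(X, Y) = |X intersect Y| / |X union Y|
    x, y = set(terms_x), set(terms_y)
    union = x | y
    return len(x & y) / len(union) if union else 0.0

def best_entry(search_terms, entries):
    # entries: iterable of (entry_terms, play_time) pairs from the parsed text
    return max(entries, key=lambda entry: jaccard(search_terms, entry[0]))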
S104: playing the audio corresponding to the target entry according to the target entry and the target playing time.
After the target entry matching the text to be searched and the target playing time at which it is played have been determined in the entry text, the audio corresponding to the target entry is played according to the target entry and the target playing time. In practice, for example, after the user selects a certain line of lyrics or inputs a certain word, playback starts from the position of that word in the entry text and its target playing time.
Further, in this embodiment, after the target entry and the target playing time are determined, the target entry may be played in several ways: the target entry may be played continuously in a loop, played only once, or played once and followed by the audio immediately after it. Other playing modes are also possible; this embodiment is not limited in this respect.
According to this scheme, an audio file to be processed is acquired; the audio file is parsed to obtain original text information comprising the entry text of the audio file and the playing time at which each entry in the entry text is played; a text to be searched input by a user is acquired, and a target entry matching the text to be searched and a target playing time at which the target entry is played are determined in the entry text; and the audio corresponding to the target entry is played accordingly. Because the position and playing time of an entry in the entry text are determined from the entry input by the user, the audio file can be presented flexibly in the manner the user selects, improving both the intelligence of audio playback in audio software and the user experience.
Referring to fig. 2, fig. 2 is a flowchart of an audio processing method according to the second embodiment of the present invention. In this embodiment, the audio processing method is executed by a terminal. The terminal includes, but is not limited to, a mobile terminal such as a smart phone, a tablet computer, or a wearable device, and may also be a desktop computer or the like. The audio processing method shown in the figure may include the following steps:
S201: acquiring an audio file to be processed.
In this embodiment, the implementation manner of S201 is identical to that of S101 in the embodiment corresponding to fig. 1, and specific reference may be made to the description related to S101 in the embodiment corresponding to fig. 1, which is not repeated here.
S202: analyzing the audio file to obtain original text information; the original text information comprises an entry text of the audio file and a playing time for playing each entry in the entry text.
In this embodiment, the implementation manner of S202 is identical to that of S102 in the embodiment corresponding to fig. 1, and specific reference may be made to the description related to S102 in the embodiment corresponding to fig. 1, which is not repeated here.
S203: acquiring a text to be searched input by a user, and extracting at least one keyword from the text to be searched.
After the audio file to be processed has been acquired and the original text information obtained from it, the text to be searched input by the user is acquired; it may be a word or a sentence. At least one keyword is then extracted from the text to be searched.
Further, step S203 may specifically include steps S2031 to S2032:
S2031: acquiring a text to be searched input by a user, and preprocessing the text to be searched to obtain a preprocessed text.
In practical applications, the text to be searched input by the user can be acquired in real time or at timed intervals; with timed acquisition, the time period can be set by the user. Acquiring the text to be searched periodically helps guarantee the running quality of the audio processing and gives better controllability.
After the text to be searched is acquired, it is preprocessed. In this embodiment the preprocessing may include deleting redundant entries, correcting data, or filtering data, without limitation. Specifically, the entry the user inputs or selects often contains meaningless text such as punctuation marks; in that case redundant entries such as punctuation can be deleted, improving the audio processing efficiency. The entry text input by the user also often contains mistyped characters, so preprocessing can recognize wrong characters in the text to be searched, predict the intended meaning and the correct characters the user wanted to input, and correct the text, improving the accuracy of the entry search. Users may also input many repeated entries, which would increase the search time and error rate; by recognizing repeated entries and filtering the data, the amount of entry data in the text to be searched is reduced and both the efficiency and the accuracy of the entry search are improved.
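The three preprocessing operations just described might be sketched as follows; the punctuation list is an illustrative subset, and correct_term is a hypothetical stub because the patent does not fix a concrete correction algorithm:

import string

# An illustrative subset of punctuation to strip; a real system would use a
# fuller list of Chinese and Western punctuation marks.
PUNCTUATION = set(string.punctuation) | set("，。！？、；：（）")

def correct_term(term):
    # Stub: a real implementation would predict and fix mistyped characters.
    return term

def preprocess(terms):
    cleaned, seen = [], set()
    for term in terms:
        term = "".join(ch for ch in term if ch not in PUNCTUATION)  # delete punctuation
        term = correct_term(term)                                   # correct wrong characters
        if term and term not in seen:                               # filter repeated entries
            seen.add(term)
            cleaned.append(term)
    return cleaned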
S2032: segmenting the preprocessed text according to a pre-trained word segmentation model to obtain at least one keyword.
For any language, words are the most basic units, and for a computer to understand and analyze natural language, the original long text must first be segmented into words. Word segmentation is the technique of automatically recognizing the words in a text by computer. For Latin-script languages, represented by English, segmentation has a natural advantage: words are separated by spaces by default. Chinese word segmentation is much more complex and difficult, because the smallest unit in Chinese text is the character and there is no obvious separator between words.
In this embodiment, the text to be matched is first segmented and at least one keyword is extracted. Optionally, segmentation can use a string-matching algorithm: dictionary data is loaded into a suitable data structure, the input text string is scanned in a certain order under a matching strategy, and substrings are matched against the words in the dictionary; each successful match recognizes one word. Dictionary-based segmentation has a clear idea and a simple, easily implemented principle. Alternatively, understanding-based segmentation can simulate the human process of understanding a sentence and analyze the text from the semantic and grammatical angles, but a large amount of linguistic and grammatical information and knowledge must be prepared in advance.
The word segmentation model is trained in advance on historical entry-text data. For training, manually annotated training set data are obtained in which the segmentation positions are marked, determining the position label of each character: the beginning, middle, or end of a word, or a single-character word. Next, the training set data are preprocessed and features are extracted. Non-target characters are screened out: given a Chinese character, it is judged whether it is a punctuation mark, a digit, a Chinese numeral, or a letter; if not, the word positions where the character appears in the training corpus are counted and denoted by B, M, E, and S, where B indicates the character begins a word, M that it occupies a middle position in a word, E that it ends a word, and S that it forms a word by itself. The positions of matched characters are counted by rule, the position distribution of each character is tallied, and the dominant position category of the character is judged; illustratively, this scheme uses a threshold of 90%: as long as one position accounts for more than 90% of a character's total occurrences, the character is considered to mostly take that position within words.
Next, the positions of the characters are predicted by the word segmentation model. The features used by the model in this embodiment may include N-gram features: unigram features ci for i = -2, -1, 0, 1, 2 (5 features), where ci denotes the character i positions away from the current one; adjacent bigram features cici+1 for i = -2, -1, 0, 1 (4 features); and skip bigram features cici+2, combining characters separated by one character, for i = -1, 0 (2 features). The model may further use character-repetition features, computed as duplication(c0, ci) for i = -2, -1 (2 features), indicating whether the current character repeats one of the preceding characters, and character-class features giving the types of the three characters preceding the current one.
Finally, in this embodiment the model parameters are learned by a grid search, the main axes being the learning rate, the number of training iterations, the batch size, and the termination error. Conditions for terminating training include, but are not limited to, the number of iterations reaching a set count or the error reaching a set level. In the parameter search, the candidate values of each axis include, but are not limited to, the following: learning rate 0.01, 0.02, or 0.03; training iterations 500, 1000, or 2000; batch size 100, 200, or 500; termination error 0.05, 0.01, or 0.5. Specific parameter combinations are obtained through the search, and model training yields a set of candidate models {params1, params2, params3, …, paramsn}, where paramsn denotes one parameter combination obtained in training. After training, each candidate model is tested, its accuracy is determined, and the most accurate model is selected as the word segmentation model; it is then used to segment the preprocessed text to be searched into at least one keyword representing the entries in the text, with which the entry search is performed.
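The parameter grid quoted above can be enumerated directly. In the sketch below, train_and_evaluate is a hypothetical callback standing in for the unspecified model training and accuracy test:

from itertools import product

LEARNING_RATES = (0.01, 0.02, 0.03)
ITERATIONS = (500, 1000, 2000)
BATCH_SIZES = (100, 200, 500)
STOP_ERRORS = (0.05, 0.01, 0.5)

def grid_search(train_and_evaluate):
    # train_and_evaluate(params) -> accuracy on a test set; supplied by the caller.
    best_params, best_accuracy = None, -1.0
    for lr, iters, batch, stop in product(LEARNING_RATES, ITERATIONS, BATCH_SIZES, STOP_ERRORS):
        params = {"learning_rate": lr, "iterations": iters,
                  "batch_size": batch, "stop_error": stop}
        accuracy = train_and_evaluate(params)
        if accuracy > best_accuracy:
            best_params, best_accuracy = params, accuracy
    return best_params, best_accuracy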
S204: performing fuzzy matching in the original text information according to the keywords to obtain the target entry matched with the text to be searched.
After determining the keywords in the text to be searched, fuzzy matching is carried out in the entry text of the original text information according to the keywords, and the target entries matched with the text to be searched are obtained.
Further, step S204 may specifically include steps S2041 to S2044.
S2041: generating a first word vector corresponding to the keyword according to the keyword.
When the acquired text to be matched consists of one or more keywords, the position and playing time of the target entry can in principle be determined by searching for the keywords directly in the target text, so that the audio part corresponding to the target entry can be played. In practice, however, a song often contains repeated text, so the user may need to input at least two keywords, and matching a larger number of keywords in the target text pins down the target entry precisely. In this embodiment, the keyword is quantized to obtain its corresponding first word vector.
Assuming that the keywords of the text to be matched and of the original text information are mutually independent, each keyword in a text is represented by a vector, which simplifies the complex relations between keywords in the text. The text is regarded as composed of a group of mutually independent terms (T1, T2, T3, …, Ti, …, Tn); each term Ti is given a weight wi according to its importance in the text; and (T1, T2, T3, …, Ti, …, Tn) is regarded as the coordinate axes of an n-dimensional coordinate system, with w1, w2, …, wi, …, wn as the corresponding coordinate values, so that the orthogonal term vectors obtained by decomposing (T1, T2, T3, …, Ti, …, Tn) form a text vector space.
S2042: dividing the original text information into single sentences, and determining a second word vector of each single sentence.
Based on the text vector space constructed in step S2041, the text to be matched is reduced to a vector of keyword weights A = (wa1, wa2, …, wai, …, wam)^T, and each single sentence of the original text information is reduced to a vector of keyword weights B = (wb1, wb2, …, wbi, …, wbm)^T.
S2043: calculating the single-sentence matching degree between each entry sentence in the original text information and the keywords according to the first word vector and each second word vector.
In practical applications, the single-sentence matching degree between each entry sentence in the original text information and the keywords can be computed through the distance or the similarity between the two vectors. Optionally, the distance may be computed as the Euclidean distance, the normalized Euclidean distance, the Mahalanobis distance, the Manhattan distance, the Chebyshev distance, the Minkowski distance, or the Hamming distance; alternatively, the similarity coefficient may be computed as a cosine similarity or adjusted cosine similarity, the Pearson correlation coefficient, log-likelihood similarity, mutual information gain, or a word-pair similarity coefficient.
The single-sentence matching degree of the two is calculated from the keyword vectors as the cosine similarity
sim(A, B) = (A · B) / (|A| |B|) = Σi (wai · wbi) / (√(Σi wai²) · √(Σi wbi²)),
where A = (wa1, wa2, …, wai, …, wam)^T denotes the keyword weight vector of the text to be matched and B = (wb1, wb2, …, wbi, …, wbm)^T denotes the keyword weight vector of a single sentence of the original text information.
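Under the cosine reading of the formula above, the single-sentence matching degree can be sketched as follows:

import math

def cosine_match(a, b):
    # a, b: keyword weight vectors over the same m keyword axes
    dot = sum(wa * wb for wa, wb in zip(a, b))
    norm = math.sqrt(sum(wa * wa for wa in a)) * math.sqrt(sum(wb * wb for wb in b))
    return dot / norm if norm else 0.0

# The target entry is the single sentence whose vector B maximises cosine_match(A, B).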
S2044: identifying the entry with the highest single-sentence matching degree as the target entry.
The entry of the original text information with the highest matching degree to the keywords of the text to be matched is identified as the target entry corresponding to the text to be matched. If the audio file is a song, the fuzzy keyword search over the audio lyric file quickly locates the lyric position: the user can glance over the search result list, click an item to position the played lyrics, and jump straight to the playing time of the current lyric line. When the audio part corresponding to the target entry is played, the audio source and the lyric text can be scrolled and played simultaneously, i.e., the user can point at the corresponding original lyrics with the mouse to control the audio to play along with them.
Furthermore, a matching degree threshold can be preset, so that the sentences whose matching degree is greater than or equal to the threshold are screened out and presented to the user, who then selects the corresponding target entry; this improves the user's control.
S205: determining the target playing time of the target entry according to the target entry and the playing time of each entry in the original text information.
In this embodiment, the implementation manner of S205 is identical to that of S103 in the embodiment corresponding to fig. 1, and specific reference may be made to the description related to S103 in the embodiment corresponding to fig. 1, which is not repeated here.
S206: playing the audio corresponding to the target entry according to the target entry and the target playing time.
In this embodiment, the implementation manner of S206 is identical to that of S104 in the embodiment corresponding to fig. 1, and specific reference may be made to the description related to S104 in the embodiment corresponding to fig. 1, which is not repeated here.
S207: acquiring the target audio played at the current playing time, and identifying the text content of the target audio.
While the audio is playing, the target audio currently being played is acquired, i.e., a piece of audio within a preset duration starting at the current playing time. The speech in the target audio is recognized to obtain its text content. Optionally, an audio recognition model can be built by analyzing historical audio files, and the text content of the target audio can then be recognized through this model. Specifically, building the audio recognition model involves two stages: training and recognition. In the training stage, the text content of each audio file is determined manually, and the feature vector of the audio file is stored in a template library as a template; in the recognition stage, the feature vector of the input audio is compared against each template in the library in turn, and the text of the most similar template is output as the recognition result.
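In the recognition stage this reduces to a nearest-template search. A minimal sketch, in which the feature extraction and the similarity measure are left to the caller because the patent does not specify them:

def recognize(input_features, template_library, similarity):
    # template_library: (feature_vector, text_content) pairs built in the
    # training stage; similarity: any measure such as cosine_match above.
    best_text, best_score = None, float("-inf")
    for features, text in template_library:
        score = similarity(input_features, features)
        if score > best_score:
            best_text, best_score = text, score
    return best_text  # text content of the most similar template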
S208: correcting, according to the text content and the current playing time, the target playing time recorded in the original text information for the target entry text matched with the text content.
After the text content corresponding to the currently played target audio has been recognized, the entry text and playing time in the original text information are corrected according to that text content and the current playing time. For example, while a music player plays a song it scrolls the lyrics, but the lyrics the user hears may differ from the scrolled lyrics, or the playing times may not line up; in that case the text content of the currently played audio needs to be recognized in order to correct the original lyrics.
The audio file information is thus corrected while the audio file plays, including both the text content and the playing time of each piece of text, where the playing time can be refined down to each sentence and each word. To correct the text content, the content played at the current moment is acquired and speech recognition is performed on it; the recognized text is compared with the text in the audio file information, and if they are inconsistent, the playing time and the content in the audio file information are corrected. When correcting the target playing time recorded in the original text information for the target entry text, an Angular timed task can monitor the audio playback progress component and control the synchronized scrolling of the text.
Specifically, when correcting the playing time of a piece of text, the current moment at which the text is actually played is compared with the playing time recorded for it in the original text information. If they are inconsistent, the mismatching sentences, entries, or single words are determined, the moment at which the text is actually played is recorded as the correct playing time, and the playing time recorded in the original text information is modified accordingly. After the entries or playing times of the whole audio file have been corrected, the old file is backed up under a name formed from the lyric name and a timestamp, and the corrected audio file is saved, realizing error-corrected playback of the text file after speech recognition.
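The correction rule amounts to overwriting the recorded time with the observed one when they disagree. A minimal sketch over the (begin, end, text) records parsed earlier, where the matching by exact text and the tolerance value are assumptions, not details fixed by the patent:

def correct_play_time(entries, recognized_text, observed_time, tolerance=0.2):
    # entries: mutable [begin, end, text] records parsed from the lyric file.
    # The 0.2-second tolerance is an assumed value.
    for entry in entries:
        if entry[2] == recognized_text:
            if abs(entry[0] - observed_time) > tolerance:
                entry[0] = observed_time  # record the observed moment as the correct time
            return entry
    return None  # no matching entry text found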
According to this scheme, an audio file to be processed is acquired; the audio file is parsed to obtain original text information comprising the entry text of the audio file and the playing time at which each entry is played; a text to be searched input by a user is acquired and at least one keyword is extracted from it; fuzzy matching is performed in the original text information according to the keywords to obtain the target entry matching the text to be searched; the target playing time of the target entry is determined from the target entry and the playing times of the entries in the original text information; and the audio corresponding to the target entry is played accordingly. The target audio played at the current playing time is then acquired and its text content recognized, and the target playing time recorded in the original text information for the entry text matching that content is corrected according to the text content and the current playing time. By playing the entry the user selects directly, the playing position is determined by the user and the player's progress is controlled quickly, while the entry text is calibrated and corrected during playback, improving both the intelligence of audio playback in audio software and the user experience.
Referring to fig. 3, fig. 3 is a schematic diagram of a terminal device according to a third embodiment of the present invention. The terminal device includes units for executing the steps of the embodiments corresponding to fig. 1 and fig. 2; refer to the related descriptions of those embodiments for details. For convenience of explanation, only the portions related to the present embodiment are shown. The terminal device 300 of this embodiment includes:
an acquiring unit 301, configured to acquire an audio file to be processed;
the parsing unit 302 is configured to parse the audio file to obtain original text information; the original text information comprises an entry text of the audio file and a playing time for playing each entry in the entry text;
a matching unit 303, configured to obtain a text to be searched input by a user, and to determine, in the entry text, a target entry matched with the text to be searched and a target playing time at which the target entry is played;
and a playing unit 304, configured to play audio corresponding to the target entry according to the target entry and the target playing time.
Further, the matching unit 303 may include:
the extraction unit is used for acquiring a text to be searched input by a user and extracting at least one keyword from the text to be searched;
the searching unit is used for performing fuzzy matching in the original text information according to the keywords to obtain the target entry matched with the text to be searched;
and the determining unit is used for determining the target playing time of the target entry according to the target entry and the playing time of each entry in the original text information.
Further, the extracting unit may include:
the preprocessing unit is used for acquiring a text to be searched input by a user and preprocessing the text to be searched to obtain a preprocessed text;
and the word segmentation unit is used for segmenting the preprocessed text according to a pre-trained word segmentation model to obtain at least one keyword.
Further, the search unit may include:
a first vector unit, configured to generate a first word vector corresponding to the keyword according to the keyword;
a second vector unit, configured to divide the original text information into single sentences and determine a second word vector of each single sentence;
a computing unit, configured to calculate the single-sentence matching degree between each entry sentence in the original text information and the keywords according to the first word vector and each second word vector;
and an identification unit, configured to identify the entry with the highest single-sentence matching degree as the target entry.
In addition, the terminal device may further include:
the content identification unit is used for acquiring the target audio played at the current playing time and identifying the text content of the target audio;
and the correcting unit is used for correcting, according to the text content and the current playing time, the target playing time recorded in the original text information for the target entry text matched with the text content.
According to this scheme, an audio file to be processed is acquired; the audio file is parsed to obtain original text information comprising the entry text of the audio file and the playing time at which each entry is played; a text to be searched input by a user is acquired and at least one keyword is extracted from it; fuzzy matching is performed in the original text information according to the keywords to obtain the target entry matching the text to be searched; the target playing time of the target entry is determined from the target entry and the playing times of the entries in the original text information; and the audio corresponding to the target entry is played accordingly. The target audio played at the current playing time is then acquired and its text content recognized, and the target playing time recorded in the original text information for the entry text matching that content is corrected according to the text content and the current playing time. By playing the entry the user selects directly, the playing position is determined by the user and the player's progress is controlled quickly, while the entry text is calibrated and corrected during playback, improving both the intelligence of audio playback in audio software and the user experience.
Fig. 4 is a schematic diagram of a terminal device according to a fourth embodiment of the present invention. As shown in fig. 4, the terminal device 4 of this embodiment includes a processor 40, a memory 41, and a computer program 42 stored in the memory 41 and executable on the processor 40. When executing the computer program 42, the processor 40 implements the steps of the audio processing method embodiments described above, such as steps S101 to S104 shown in fig. 1; alternatively, when executing the computer program 42, the processor 40 realizes the functions of the modules/units of the device embodiments described above, such as the functions of units 301 to 304 shown in fig. 3.
Illustratively, the computer program 42 may be partitioned into one or more modules/units that are stored in the memory 41 and executed by the processor 40 to complete the present invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions for describing the execution of the computer program 42 in the terminal device 4.
The terminal device 4 may be a computing device such as a desktop computer, a notebook computer, a palm computer, a cloud server, etc. The terminal device may include, but is not limited to, a processor 40, a memory 41. It will be appreciated by those skilled in the art that fig. 4 is merely an example of the terminal device 4 and does not constitute a limitation of the terminal device 4, and may include more or less components than illustrated, or may combine certain components, or different components, e.g., the terminal device may further include an input-output device, a network access device, a bus, etc.
The processor 40 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or any conventional processor.
The memory 41 may be an internal storage unit of the terminal device 4, such as a hard disk or memory of the terminal device 4. The memory 41 may also be an external storage device of the terminal device 4, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card (FC) provided on the terminal device 4. Further, the memory 41 may include both an internal storage unit and an external storage device of the terminal device 4. The memory 41 is used to store the computer program and the other programs and data required by the terminal device, and may also be used to temporarily store data that has been or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above division of functional units and modules is illustrated. In practical applications, the functions may be distributed among different functional units and modules as needed; that is, the internal structure of the apparatus may be divided into different functional units or modules to perform all or part of the functions described above. The functional units and modules in the embodiments may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit; the integrated units may be implemented in the form of hardware or in the form of software functional units. In addition, the specific names of the functional units and modules are only for distinguishing them from each other and do not limit the protection scope of the present application. For the specific working process of the units and modules in the above system, reference may be made to the corresponding process in the foregoing method embodiments, which is not repeated here.
In the foregoing embodiments, each embodiment is described with its own emphasis. For parts that are not described or illustrated in detail in one embodiment, reference may be made to the related descriptions of the other embodiments.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
The integrated modules/units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on this understanding, the present invention may implement all or part of the flow of the methods of the above embodiments by instructing the relevant hardware through a computer program, and the computer program may be stored in a computer readable storage medium.
The above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention and are intended to be included in the scope of the present invention.

Claims (7)

1. An audio processing method, comprising:
acquiring an audio file to be processed;
Analyzing the audio file to obtain original text information; the original text information comprises an entry text of the audio file and a playing time for playing each entry in the entry text;
Obtaining a text to be searched input by a user, and determining, by calculating similarity coefficients, the part of the entry text that has the highest similarity coefficient with the text to be searched, so as to determine a target entry matched with the text to be searched in the entry text and a target playing time of the target entry;
Playing audio corresponding to the target entry according to the target entry and the target playing time;
The obtaining a text to be searched input by a user, and determining, by calculating similarity coefficients, the part of the entry text that has the highest similarity coefficient with the text to be searched, so as to determine a target entry matched with the text to be searched in the entry text and a target playing time of the target entry, comprises:
acquiring a text to be searched input by a user, and extracting at least one keyword from the text to be searched;
fuzzy matching is carried out in the original text information according to the keywords, so that the target entry matched with the text to be searched is obtained;
Determining a target playing time of the target entry according to the target entry and the playing time of each entry in the original text information;
The performing fuzzy matching in the original text information according to the keywords to obtain the target entry matched with the text to be searched comprises:
Generating a first word vector corresponding to the keyword according to the keyword;
dividing the original text information into single sentences, and determining a second word vector of each single sentence;
Calculating a single sentence matching degree between each single sentence in the original text information and the keywords according to the first word vector and each of the second word vectors;
identifying the entry with the highest single sentence matching degree as the target entry;
The single sentence matching degree between the two is calculated from the keyword vectors as the similarity coefficient
\mathrm{sim}(a,b)=\frac{\sum_{i=1}^{m} w_{ai}\,w_{bi}}{\sqrt{\sum_{i=1}^{m} w_{ai}^{2}}\,\sqrt{\sum_{i=1}^{m} w_{bi}^{2}}}
wherein a=(w_{a1},w_{a2},\ldots,w_{ai},\ldots,w_{am})^{T} represents the vector composed of the keywords extracted from the text to be matched, and b=(w_{b1},w_{b2},\ldots,w_{bi},\ldots,w_{bm})^{T} represents the vector composed of the keywords extracted from the original text information.
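As an illustration of this matching step (a sketch under assumptions, not the authoritative implementation), the Python fragment below builds bag-of-words keyword weight vectors and scores each single sentence with the similarity coefficient in the cosine form given above; the tokenization and all function names are choices made for the example.

```python
import math
from collections import Counter


def word_vector(tokens: list[str], vocab: list[str]) -> list[float]:
    """First/second word vector: bag-of-words keyword weights over a
    shared vocabulary."""
    counts = Counter(tokens)
    return [float(counts[w]) for w in vocab]


def matching_degree(a: list[float], b: list[float]) -> float:
    """Similarity coefficient sim(a, b) between two keyword vectors,
    in the cosine form given above."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_single_sentence(keywords: list[str],
                         sentences: list[list[str]]) -> int:
    """Index of the single sentence with the highest single sentence
    matching degree against the keywords."""
    vocab = sorted(set(keywords).union(*sentences))
    first = word_vector(keywords, vocab)
    degrees = [matching_degree(first, word_vector(s, vocab)) for s in sentences]
    return max(range(len(degrees)), key=degrees.__getitem__)
```

For example, best_single_sentence(["audio", "search"], [["weather", "report"], ["audio", "search", "demo"]]) returns 1, so the second sentence would be identified as the target entry.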
2. The audio processing method as claimed in claim 1, wherein the obtaining a text to be searched input by a user and extracting at least one keyword from the text to be searched comprises:
Acquiring a text to be searched which is input by a user, and preprocessing the text to be searched to obtain a preprocessed text;
and according to a pre-trained word segmentation model, segmenting the preprocessed text to obtain at least one keyword.
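A minimal sketch of this preprocessing and segmentation step, assuming the open-source jieba segmenter as a stand-in for the pre-trained word segmentation model (the claim does not name a particular model, so this choice is illustrative):

```python
import re

import jieba  # illustrative segmenter standing in for the claim's
              # pre-trained word segmentation model


def preprocess(search_text: str) -> str:
    """Preprocess the text to be searched: strip punctuation and
    collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", "", search_text)
    return re.sub(r"\s+", " ", cleaned).strip()


def extract_keywords(search_text: str) -> list[str]:
    """Segment the preprocessed text into at least one keyword."""
    tokens = [t for t in jieba.lcut(preprocess(search_text)) if t.strip()]
    return tokens or [preprocess(search_text)]
```

jieba.lcut returns the segmentation as a plain list, which is convenient for building the keyword vectors used in the fuzzy matching step.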
3. The audio processing method according to claim 1 or 2, wherein, after the audio corresponding to the target entry is played according to the target entry and the target playing time, the method further comprises:
acquiring target audio played at the current playing time, and identifying text content of the target audio;
and correcting, according to the text content and the current playing time, the target playing time that is recorded in the original text information and corresponds to the target entry text matched with the text content.
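A minimal sketch of this correction step, reusing the Entry class from the earlier pipeline sketch; the recognized text is assumed to come from running the recognizer over the target audio at the current playing time:

```python
def correct_play_time(entries: list[Entry], recognized_text: str,
                      current_time: float) -> bool:
    """Find the entry text matched by the recognized text content and
    overwrite its recorded target playing time with the current
    playing time. Returns True if a correction was applied."""
    for entry in entries:
        if recognized_text in entry.text or entry.text in recognized_text:
            if abs(entry.play_time - current_time) > 1e-6:
                entry.play_time = current_time  # fix drifted timestamp
                return True
            return False
    return False
```

Repeating this check while the audio plays keeps the recorded playing times aligned with the audio actually heard, which is the calibration effect described in the claim.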
4. A terminal device comprising a memory and a processor, said memory storing a computer program executable on said processor, characterized in that said processor, when executing said computer program, performs the steps of:
acquiring an audio file to be processed;
Analyzing the audio file to obtain original text information; the original text information comprises an entry text of the audio file and a playing time for playing each entry in the entry text;
Obtaining a text to be searched input by a user, and determining, by calculating similarity coefficients, the part of the entry text that has the highest similarity coefficient with the text to be searched, so as to determine a target entry matched with the text to be searched in the entry text and a target playing time of the target entry;
Playing audio corresponding to the target entry according to the target entry and the target playing time;
The obtaining a text to be searched input by a user, and determining, by calculating similarity coefficients, the part of the entry text that has the highest similarity coefficient with the text to be searched, so as to determine a target entry matched with the text to be searched in the entry text and a target playing time of the target entry, comprises:
acquiring a text to be searched input by a user, and extracting at least one keyword from the text to be searched;
fuzzy matching is carried out in the original text information according to the keywords, so that the target entry matched with the text to be searched is obtained;
Determining a target playing time of the target entry according to the target entry and the playing time of each entry in the original text information;
The performing fuzzy matching in the original text information according to the keywords to obtain the target entry matched with the text to be searched comprises:
Generating a first word vector corresponding to the keyword according to the keyword;
dividing the original text information into single sentences, and determining a second word vector of each single sentence;
Calculating a single sentence matching degree between each single sentence in the original text information and the keywords according to the first word vector and each of the second word vectors;
identifying the entry with the highest single sentence matching degree as the target entry;
The single sentence matching degree between the two is calculated from the keyword vectors as the similarity coefficient
\mathrm{sim}(a,b)=\frac{\sum_{i=1}^{m} w_{ai}\,w_{bi}}{\sqrt{\sum_{i=1}^{m} w_{ai}^{2}}\,\sqrt{\sum_{i=1}^{m} w_{bi}^{2}}}
wherein a=(w_{a1},w_{a2},\ldots,w_{ai},\ldots,w_{am})^{T} represents the vector composed of the keywords extracted from the text to be matched, and b=(w_{b1},w_{b2},\ldots,w_{bi},\ldots,w_{bm})^{T} represents the vector composed of the keywords extracted from the original text information.
5. The terminal device of claim 4, wherein the obtaining text to be searched entered by a user and extracting at least one keyword from the text to be searched comprises:
Acquiring a text to be searched which is input by a user, and preprocessing the text to be searched to obtain a preprocessed text;
and according to a pre-trained word segmentation model, segmenting the preprocessed text to obtain at least one keyword.
6. A terminal device, comprising:
The acquisition unit is used for acquiring the audio file to be processed;
the analysis unit is used for analyzing the audio file to obtain original text information; the original text information comprises an entry text of the audio file and a playing time for playing each entry in the entry text;
the matching unit is used for obtaining a text to be searched input by a user and determining, by calculating similarity coefficients, the part of the entry text that has the highest similarity coefficient with the text to be searched, so as to determine a target entry matched with the text to be searched in the entry text and a target playing time of the target entry;
The playing unit is used for playing the audio corresponding to the target entry according to the target entry and the target playing time;
The matching unit includes:
the extraction unit is used for acquiring a text to be searched input by a user and extracting at least one keyword from the text to be searched;
the searching unit is used for carrying out fuzzy matching in the original text information according to the keywords to obtain the target entry matched with the text to be searched;
The determining unit is used for determining the target playing time of the target entry according to the target entry and the playing time of each entry in the original text information;
The search unit includes:
A first vector unit, configured to generate a first word vector corresponding to the keyword according to the keyword;
A second vector unit for dividing the original text information into single sentences and determining a second word vector of each of the single sentences;
The computing unit is used for computing the single sentence matching degree between each single sentence in the original text information and the keywords according to the first word vector and each of the second word vectors;
the recognition unit is used for recognizing the entry with the highest single sentence matching degree as the target entry;
The single sentence matching degree between the two is calculated from the keyword vectors as the similarity coefficient
\mathrm{sim}(a,b)=\frac{\sum_{i=1}^{m} w_{ai}\,w_{bi}}{\sqrt{\sum_{i=1}^{m} w_{ai}^{2}}\,\sqrt{\sum_{i=1}^{m} w_{bi}^{2}}}
wherein a=(w_{a1},w_{a2},\ldots,w_{ai},\ldots,w_{am})^{T} represents the vector composed of the keywords extracted from the text to be matched, and b=(w_{b1},w_{b2},\ldots,w_{bi},\ldots,w_{bm})^{T} represents the vector composed of the keywords extracted from the original text information.
7. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method according to any one of claims 1 to 3.
CN201811423356.0A 2018-11-27 2018-11-27 Audio processing method and terminal equipment Active CN109657094B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811423356.0A CN109657094B (en) 2018-11-27 2018-11-27 Audio processing method and terminal equipment

Publications (2)

Publication Number Publication Date
CN109657094A CN109657094A (en) 2019-04-19
CN109657094B true CN109657094B (en) 2024-05-07

Family

ID=66111614

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811423356.0A Active CN109657094B (en) 2018-11-27 2018-11-27 Audio processing method and terminal equipment

Country Status (1)

Country Link
CN (1) CN109657094B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110750230A (en) * 2019-09-30 2020-02-04 北京淇瑀信息科技有限公司 Voice interface display method and device and electronic equipment
CN111161738A (en) * 2019-12-27 2020-05-15 苏州欧孚网络科技股份有限公司 Voice file retrieval system and retrieval method thereof
CN114115674B (en) * 2022-01-26 2022-07-22 荣耀终端有限公司 Method for positioning sound recording and document content, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104301771A (en) * 2013-07-15 2015-01-21 中兴通讯股份有限公司 Method and device for adjusting playing progress of video file
CN104409087A (en) * 2014-11-18 2015-03-11 广东欧珀移动通信有限公司 Method and system of playing song documents
CN107071542A (en) * 2017-04-18 2017-08-18 百度在线网络技术(北京)有限公司 Video segment player method and device
CN107798143A (en) * 2017-11-24 2018-03-13 珠海市魅族科技有限公司 A kind of information search method, device, terminal and readable storage medium storing program for executing
CN108399150A (en) * 2018-02-07 2018-08-14 深圳壹账通智能科技有限公司 Text handling method, device, computer equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103165131A (en) * 2011-12-17 2013-06-19 富泰华工业(深圳)有限公司 Voice processing system and voice processing method

Similar Documents

Publication Publication Date Title
Liu et al. Audioldm: Text-to-audio generation with latent diffusion models
CN110148427B (en) Audio processing method, device, system, storage medium, terminal and server
CN108304375B (en) Information identification method and equipment, storage medium and terminal thereof
US7346512B2 (en) Methods for recognizing unknown media samples using characteristics of known media samples
US7634407B2 (en) Method and apparatus for indexing speech
CN109840287A (en) A kind of cross-module state information retrieval method neural network based and device
CN106486128B (en) Method and device for processing double-sound-source audio data
CN109657094B (en) Audio processing method and terminal equipment
CN109493881B (en) Method and device for labeling audio and computing equipment
US20180158469A1 (en) Audio processing method and apparatus, and terminal
CN114357989B (en) Video title generation method and device, electronic equipment and storage medium
CN113327603A (en) Speech recognition method, speech recognition device, electronic equipment and computer-readable storage medium
Ando et al. Construction of a large-scale Japanese ASR corpus on TV recordings
CN111428074A (en) Audio sample generation method and device, computer equipment and storage medium
CN113010730A (en) Music file generation method, device, equipment and storage medium
CN113823323A (en) Audio processing method and device based on convolutional neural network and related equipment
US8069044B1 (en) Content matching using phoneme comparison and scoring
CN114125506B (en) Voice auditing method and device
CN113536029B (en) Method and device for aligning audio and text, electronic equipment and storage medium
CN114363531A (en) H5-based case comment video generation method, device, equipment and medium
WO2021017302A1 (en) Data extraction method and apparatus, and computer system and readable storage medium
CN113032616A (en) Audio recommendation method and device, computer equipment and storage medium
Htun Analytical approach to MFCC based space-saving audio fingerprinting system
CN112287160A (en) Audio data sorting method and device, computer equipment and storage medium
Owen Multiple media correlation: theory and applications

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant