CN114329043A - Audio essence fragment determination method, electronic equipment and computer-readable storage medium


Info

Publication number
CN114329043A
Authority
CN
China
Prior art keywords: text content, sentence, highlights, audio, essence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111611647.4A
Other languages
Chinese (zh)
Inventor
毛绮雯
陈肇康
吴斌
雷兆恒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Music Entertainment Technology Shenzhen Co Ltd
Original Assignee
Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Music Entertainment Technology Shenzhen Co Ltd filed Critical Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority to CN202111611647.4A
Publication of CN114329043A


Abstract

The embodiment of the application discloses an audio essence segment determination method, an electronic device, and a computer-readable storage medium, wherein the method includes the following steps: acquiring audio data and text content corresponding to the audio data; inputting the text content into a first supervised model to determine a first highlight sentence set in the text content, and inputting the text content into a first unsupervised model to determine a second highlight sentence set in the text content; determining an essence paragraph in the text content based on the first and second highlight sentence sets; and determining the audio data corresponding to the essence paragraph in the audio data as an audio essence segment. The method can be applied to the technical field of audio processing. By combining supervised learning and unsupervised learning to determine the essence segments in audio, it reduces the manual labeling of essence segments and lowers labor cost compared with prior-art approaches that rely on supervised learning alone.

Description

Audio essence fragment determination method, electronic equipment and computer-readable storage medium
Technical Field
The invention relates to the technical field of audio processing, in particular to an audio essence fragment determining method, electronic equipment and a computer-readable storage medium.
Background
With the rapid development of multimedia technology, audio, as a carrier of rich semantic information, is listened to and enjoyed by more and more users. As the volume of audio data grows dramatically, a user searching for favorite audio files (e.g., songs, radio programs, etc.) may want to quickly browse the essence segments (i.e., the highlights) in those files.
Currently, machine learning models based on supervised learning are mainly used to determine the essence segments of audio files. Training such a model requires a large amount of manual labeling of the essence segments contained in the full text content of long audio in advance (that is, all possible essence segments must be found manually), which incurs high labor cost.
Disclosure of Invention
The embodiment of the application provides an audio essence segment determination method, an electronic device, and a computer-readable storage medium, which reduce the manual labeling of essence segments and lower labor cost.
In a first aspect, an embodiment of the present application provides an audio essence segment determining method, including:
acquiring audio data and text content corresponding to the audio data;
inputting the text content into a first supervised model to determine a first highlight sentence set in the text content, and inputting the text content into a first unsupervised model to determine a second highlight sentence set in the text content;
determining an essence paragraph in the text content based on the first highlight sentence set and the second highlight sentence set;
and determining the audio data corresponding to the essence paragraph in the audio data as an audio essence segment.
By implementing the method described in the first aspect, the essence segments in audio can be determined by combining supervised learning and unsupervised learning; compared with prior-art approaches that use supervised learning alone, this reduces the manual labeling of essence segments and lowers labor cost.
In one possible embodiment, the inputting the text content into a first supervised model to determine a first highlight sentence set in the text content includes:
inputting the text content into the first supervised model to obtain a first index value of each sentence in the text content, and determining the first highlight sentence set based on the first index value;
the inputting the text content into a first unsupervised model to determine a second highlight sentence set in the text content includes:
inputting the text content into the first unsupervised model to obtain a second index value of each sentence in the text content, and determining the second highlight sentence set based on the second index value;
wherein the first index value or the second index value is any one or more of the following: the highlight score of each sentence in the text content, the semantic similarity between each sentence and the text content as a whole, and the probability that each sentence is a summary of the text content.
In this manner, whether a sentence is a highlight sentence can be accurately judged through the index values.
In one possible embodiment, the inputting the text content into a first supervised model to determine a first highlight sentence set in the text content includes:
inputting the text content and the audio signal of the audio data into the first supervised model to determine the first highlight sentence set in the text content;
the inputting the text content into a first unsupervised model to determine a second highlight sentence set in the text content includes:
inputting the text content and the audio signal of the audio data into the first unsupervised model to determine the second highlight sentence set in the text content.
In this manner, the highlight sentence sets are determined by combining the text content with the audio signal, which improves the accuracy of the determination.
In one possible embodiment, the inputting the text content into a first supervised model to determine a first highlight sentence set in the text content includes:
inputting the text content into the first supervised model and inputting the audio signal of the audio data into a second supervised model to determine the first highlight sentence set in the text content;
the inputting the text content into a first unsupervised model to determine a second highlight sentence set in the text content includes:
inputting the text content into the first unsupervised model and inputting the audio signal of the audio data into a second unsupervised model to determine the second highlight sentence set in the text content.
In this manner, the text content and the audio signal are input into different supervised or unsupervised models, which decouples the dependence between the two kinds of features and increases the number of highlight sentences determined.
In one possible embodiment, the inputting the text content into a first supervised model and the audio signal of the audio data into a second supervised model to determine a first highlight sentence set in the text content includes:
inputting the text content into the first supervised model to obtain a first index value of each sentence in the text content, inputting the audio signal of the audio data into the second supervised model to obtain a third index value of each sentence, and determining the first highlight sentence set based on the first index value and the third index value;
the inputting the text content into a first unsupervised model and the audio signal of the audio data into a second unsupervised model to determine a second highlight sentence set in the text content includes:
inputting the text content into the first unsupervised model to obtain a second index value of each sentence in the text content, inputting the audio signal of the audio data into the second unsupervised model to obtain a fourth index value of each sentence, and determining the second highlight sentence set based on the second index value and the fourth index value;
wherein the first, second, third, or fourth index value is any one or more of the following: the highlight score of each sentence in the text content, the semantic similarity between each sentence and the text content as a whole, and the probability that each sentence is a summary of the text content.
In this manner, whether a sentence is a highlight sentence can be accurately judged through multiple index values.
In one possible embodiment, the determining an essence paragraph in the text content based on the first highlight sentence set and the second highlight sentence set includes:
determining a union of the first highlight sentence set and the second highlight sentence set;
determining an essence paragraph in the text content based on the highlight sentence union.
In this manner, the highlight sentence sets determined by the two models are merged, and essence paragraphs are then determined from the merged highlight sentences, which improves the accuracy of essence paragraph determination.
In one possible embodiment, the determining an essence paragraph in the text content based on the first highlight sentence set and the second highlight sentence set includes:
determining a first essence paragraph based on the first highlight sentence set and a second essence paragraph based on the second highlight sentence set;
determining the union of the first essence paragraph and the second essence paragraph as the essence paragraph in the text content.
In this manner, two sets of essence paragraphs are determined from the highlight sentence sets of the two models respectively, and the final essence paragraphs are obtained by merging them, which improves the accuracy of essence paragraph determination.
In one possible embodiment, after determining the essence paragraph in the text content and before determining the audio data corresponding to the essence paragraph as an audio essence segment, the method further includes:
inputting the essence paragraph and its context-adjacent sentences in the text content into a first deep learning model to obtain the probability that each context-adjacent sentence and the essence paragraph belong to the same language segment;
and if the probability is greater than a probability threshold, adding the context-adjacent sentence to the essence paragraph.
In this manner, context information is supplemented for the audio essence segment, so that the processed segment is semantically more complete and grammatically more fluent.
In one possible implementation, if there are a plurality of pieces of audio data, the audio essence segments include the audio essence segments corresponding to the respective pieces of audio data, and the method further includes:
respectively determining the highlight score of the audio essence segment of each piece of audio data;
sorting the plurality of audio essence segments based on the highlight scores to obtain a sorting result;
and recommending audio essence segments to the user based on the sorting result.
In this manner, the audio essence segments are scored and sorted; the sorting result benefits subsequent recommendation and distribution services and helps recommend to each user the audio segments they are likely to enjoy.
In a second aspect, an embodiment of the present application provides an audio essence segment determination apparatus, including:
the acquisition module is used for acquiring audio data and text content corresponding to the audio data;
the processing module is used for inputting the text content into a first supervised model to determine a first highlight sentence set in the text content;
inputting the text content into a first unsupervised model to determine a second highlight sentence set in the text content;
determining an essence paragraph in the text content based on the first highlight sentence set and the second highlight sentence set;
and determining the audio data corresponding to the essence paragraph in the audio data as an audio essence segment.
In one possible embodiment, the processing module inputs the text content into the first supervised model to determine the first highlight sentence set specifically by: inputting the text content into the first supervised model to obtain a first index value of each sentence in the text content, and determining the first highlight sentence set based on the first index value;
the processing module inputs the text content into the first unsupervised model to determine the second highlight sentence set specifically by: inputting the text content into the first unsupervised model to obtain a second index value of each sentence in the text content, and determining the second highlight sentence set based on the second index value;
wherein the first index value or the second index value is any one or more of the following: the highlight score of each sentence in the text content, the semantic similarity between each sentence and the text content as a whole, and the probability that each sentence is a summary of the text content.
In one possible embodiment, the processing module inputs the text content into the first supervised model to determine the first highlight sentence set specifically by: inputting the text content and the audio signal of the audio data into the first supervised model to determine the first highlight sentence set in the text content;
the processing module inputs the text content into the first unsupervised model to determine the second highlight sentence set specifically by: inputting the text content and the audio signal of the audio data into the first unsupervised model to determine the second highlight sentence set in the text content.
In one possible embodiment, the processing module inputs the text content into the first supervised model to determine the first highlight sentence set specifically by: inputting the text content into the first supervised model and inputting the audio signal of the audio data into a second supervised model to determine the first highlight sentence set in the text content;
the processing module inputs the text content into the first unsupervised model to determine the second highlight sentence set specifically by: inputting the text content into the first unsupervised model and inputting the audio signal of the audio data into a second unsupervised model to determine the second highlight sentence set in the text content.
In one possible embodiment, the processing module inputs the text content into the first supervised model and the audio signal of the audio data into the second supervised model to determine the first highlight sentence set specifically by:
inputting the text content into the first supervised model to obtain a first index value of each sentence in the text content, inputting the audio signal of the audio data into the second supervised model to obtain a third index value of each sentence, and determining the first highlight sentence set based on the first index value and the third index value;
the processing module inputs the text content into the first unsupervised model and the audio signal of the audio data into the second unsupervised model to determine the second highlight sentence set specifically by: inputting the text content into the first unsupervised model to obtain a second index value of each sentence in the text content, inputting the audio signal of the audio data into the second unsupervised model to obtain a fourth index value of each sentence, and determining the second highlight sentence set based on the second index value and the fourth index value;
wherein the first, second, third, or fourth index value is any one or more of the following: the highlight score of each sentence in the text content, the semantic similarity between each sentence and the text content as a whole, and the probability that each sentence is a summary of the text content.
In one possible embodiment, the processing module determines the essence paragraph in the text content based on the first and second highlight sentence sets by:
determining a union of the first highlight sentence set and the second highlight sentence set;
and determining an essence paragraph in the text content based on the highlight sentence union.
In one possible embodiment, the processing module determines the essence paragraph in the text content based on the first and second highlight sentence sets by:
determining a first essence paragraph based on the first highlight sentence set and a second essence paragraph based on the second highlight sentence set;
and determining the union of the first essence paragraph and the second essence paragraph as the essence paragraph in the text content.
In one possible implementation, the processing module is further configured to input the essence paragraph and its context-adjacent sentences in the text content into a first deep learning model to obtain the probability that each context-adjacent sentence and the essence paragraph belong to the same language segment;
and if the probability is greater than a probability threshold, to add the context-adjacent sentence to the essence paragraph.
In one possible implementation, if there are multiple pieces of audio data, the audio essence segments include the segments corresponding to each piece of audio data, and the processing module is further configured to respectively determine the highlight scores of the audio essence segments;
sort the audio essence segments based on their highlight scores to obtain a sorting result;
and recommend audio essence segments to the user based on the sorting result.
In a third aspect, an embodiment of the present application provides an electronic device, including:
a memory for storing a computer program;
a processor for calling the computer program from the memory to perform the method according to any of the above first aspects.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium having a computer program stored therein, the computer program comprising program instructions that, when executed by a processor, cause the processor to perform the method according to any of the first aspect.
For the beneficial effects of each possible implementation manner in the second aspect to the fourth aspect, reference may be made to the corresponding description in the first aspect, which is not repeated herein.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings used in the description of the embodiments will be briefly introduced below.
Fig. 1 is a schematic diagram of a communication system provided in an embodiment of the present application;
fig. 2 is a schematic flowchart of an audio essence segment determination method provided by an embodiment of the present application;
fig. 3 is a schematic flowchart of another audio essence segment determination method provided by the embodiment of the present application;
fig. 4 is a schematic structural diagram of an audio essence segment determination apparatus according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first" and "second," and the like, in the description, claims, and drawings of the present application are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
For a better understanding of the present application, the technical fields and terms involved in the present application are first introduced below:
automatic Speech Recognition (ASR)
Automatic speech recognition is a cross-disciplinary field closely connected with acoustics, phonetics, linguistics, digital signal processing theory, information theory, computer science, and other disciplines. It converts input audio data into corresponding text data through an acoustic model and a language model. Due to the diversity and complexity of audio signals, however, different acoustic models and language models yield recognized text of different accuracy.
Second, Machine Learning (ML)
Machine learning is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specializes in studying how computers simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve performance. Machine learning is the core of artificial intelligence and the fundamental way to endow computers with intelligence; it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
By one classification, machine learning includes supervised learning and unsupervised learning. Supervised learning learns from labeled training samples so as to classify or predict data outside the training sample set as accurately as possible. In unsupervised learning, the training samples carry no labels; by analyzing the feature relationships within the training samples, data outside the training sample set can eventually be classified or predicted. Consequently, supervised learning requires collecting a large number of labels for the training data, and the setting of those labels is heavily influenced by the annotator's subjectivity.
The following describes a communication system according to an embodiment of the present application:
referring to fig. 1, fig. 1 is a schematic diagram of a communication system according to an embodiment of the present disclosure. As shown in fig. 1, the communication system includes a terminal device 101 and an audio essence section determination device 102. The terminal device 101 is a device where a client of the playing platform is located, and is a device having an audio playing function, including but not limited to: smart phones, tablet computers, notebook computers, and the like. The audio essence segment determining apparatus 102 is a background device of the playing platform or a chip in the background device, and can determine essence segments in the audio data. The cloud server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, a middleware service, a domain name service, a security service, a CDN, a big data and artificial intelligence platform, and the like.
After one or more audio files (audio data) are uploaded to the audio essence segment determination device 102, the device can extract and store the essence segments in the audio files. When the terminal device 101 browses and listens to an audio file, the audio essence segment determination device 102 sends the stored essence segments of that file to the terminal device 101, which outputs them through an output device such as a speaker. The terminal device 101 and the audio essence segment determination device 102 may be directly or indirectly connected through wired or wireless communication, which is not limited in the present application.
It should be noted that, in the communication system shown in fig. 1, the number of terminal devices 101 and of audio essence segment determination devices 102 may each be one or more, which is not limited in this application. For convenience of description, the audio essence segment determination device 102 is taken to be a server in the following, and the audio essence segment determination method provided by the embodiment of the present application is further described below.
Referring to fig. 2, a schematic flow chart of an audio essence segment determination method provided in the embodiment of the present application is shown. The audio essence fragment determining method comprises the steps 201 to 204.
201. The server obtains audio data.
In the embodiment of the present application, the audio data may be long audio such as podcasts, radio programs, songs, and the like.
The server may obtain a single piece of audio data or a batch of audio data. For example, when it detects that a piece of audio data comes online, the server may obtain it in order to determine its essence segments. Alternatively, the server may obtain, at preset intervals, a batch of newly released audio data in order to determine the essence segments in the batch. The server may also perform essence-segment extraction on audio data it already stores; the data source of the audio data is not limited in the embodiments of the present application.
The audio data may be a complete audio file, or a portion clipped from an audio file. For example, if an audio file is 20 minutes long, the audio data may be the audio from the 10th to the 20th minute of that file.
202. The server determines the text content corresponding to the audio data.
In one possible implementation, the server may use ASR techniques to convert the audio data into text content aligned with the timestamps of the audio data. The text content may be sentence-level text: the ASR technique can determine the pause point of each sentence according to audio features such as speech pauses in the audio data, and convert the audio data into sentence-level text content. The server may also use other speech recognition techniques to determine the text content corresponding to the audio data, which is not limited in the embodiments of the present application.
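For illustration only (this structure is not part of the disclosure), the sketch below shows one possible sentence-level, timestamp-aligned transcript representation in Python; the class and field names are assumptions introduced here.

```python
from dataclasses import dataclass

@dataclass
class Sentence:
    """One recognized sentence, aligned to the source audio by timestamps."""
    index: int        # 1-based position of the sentence in the text content
    text: str         # recognized sentence text
    start_sec: float  # start timestamp within the audio data, in seconds
    end_sec: float    # end timestamp within the audio data, in seconds

# The text content corresponding to the audio data is then an ordered list:
transcript = [
    Sentence(1, "Welcome to the program.", 0.0, 2.1),
    Sentence(2, "Today we talk about long audio.", 2.1, 5.4),
]
```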
In another possible implementation, the server may obtain the text corresponding to the audio data without using speech recognition. For example, a song has corresponding lyric information, and subtitle information may be prepared when recording podcast products such as network audio programs. Such lyric or subtitle information can be uploaded to the server together with the audio data as its corresponding text content.
203. The server determines a passage of essence in the text content based on the first supervised model and the first unsupervised model.
In this embodiment of the present application, the server may determine one or more essence paragraphs in the text content. Three specific manners in which the server determines the essence paragraphs based on the first supervised model and the first unsupervised model are described below:
the first method is as follows:
the server inputs the text content into a first supervised model to determine a first set of highlights in the text content; the server inputs the text content into a first unsupervised model to determine a second set of highlights in the text content; the server determines a refined paragraph in the text content based on the first set of highlights and the second set of highlights.
Wherein the first set of highlights includes N1 highlights and the second set of highlights includes N2 highlights. N1 and N2 are positive integers greater than or equal to 1. The highlights in the first set of highlights may be the same or different than the highlights in the second set of highlights. Each elite paragraph in the textual content may consist of several consecutive ones of N1 highlights and N2 highlights.
Optionally, the server may determine a highlight sentence union of the first highlight sentence set and the second highlight sentence set, and then determine the essence paragraph according to the highlight sentence union.
Illustratively, suppose the text content includes 50 sentences, X = {x1, x2, x3, x4, ..., x50}, where xi denotes the i-th sentence in the text content. Suppose the first highlight sentence set obtained through the first supervised model is X1 = {x1, x2, x3, x4, x16, x17, x30, x31, x35, x36, x40, x41}, and the second highlight sentence set obtained through the first unsupervised model is X2 = {x1, x2, x3, x4, x5, x18, x25, x28, x31, x32, x33, x34, x35}. The union of X1 and X2 is {x1, x2, x3, x4, x5, x16, x17, x18, x25, x28, x30, x31, x32, x33, x34, x35, x36, x40, x41}. From this union, four essence paragraphs can be determined: {x1, x2, x3, x4, x5}, {x16, x17, x18}, {x30, x31, x32, x33, x34, x35, x36}, and {x40, x41}.
Optionally, the minimum number of highlight sentences that an essence paragraph must contain may be set, for example, to 2, 3, 4, or 5, which is not limited in this application.
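As a minimal sketch of this first mode (assuming sentences are identified by their 1-based indices and an essence paragraph must contain at least two highlight sentences), the union of the two highlight sentence sets can be grouped into runs of consecutive sentences as follows; it reproduces the four paragraphs of the example above.

```python
def essence_paragraphs(set1, set2, min_len=2):
    """Union the two highlight-sentence index sets, then group runs of
    consecutive indices into essence paragraphs of at least min_len sentences."""
    union = sorted(set1 | set2)
    paragraphs, run = [], []
    for i in union:
        if run and i != run[-1] + 1:   # a gap ends the current run
            if len(run) >= min_len:
                paragraphs.append(run)
            run = []
        run.append(i)
    if len(run) >= min_len:
        paragraphs.append(run)
    return paragraphs

# Reproduces the example above:
s1 = {1, 2, 3, 4, 16, 17, 30, 31, 35, 36, 40, 41}
s2 = {1, 2, 3, 4, 5, 18, 25, 28, 31, 32, 33, 34, 35}
print(essence_paragraphs(s1, s2))
# -> [[1, 2, 3, 4, 5], [16, 17, 18], [30, 31, 32, 33, 34, 35, 36], [40, 41]]
```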
Alternatively, the server may first determine first essence paragraphs based on the first highlight sentence set and second essence paragraphs based on the second highlight sentence set, and then determine the union of the first essence paragraphs and the second essence paragraphs as the essence paragraphs in the text content.
Illustratively, for the text content X in the foregoing example, if the first highlight sentence set is X1 = {x1, x2, x3, x4, x16, x17, x18, x19, x35, x36, x40, x41}, the essence paragraphs obtained from X1 are {x1, x2, x3, x4}, {x16, x17, x18, x19}, {x35, x36}, and {x40, x41}. If the second highlight sentence set is X2 = {x1, x2, x3, x4, x20, x21, x22}, the essence paragraphs obtained from X2 are {x1, x2, x3, x4} and {x20, x21, x22}. The final essence paragraphs are the union of all the above paragraphs: {x1, x2, x3, x4}, {x16, x17, x18, x19}, {x20, x21, x22}, {x35, x36}, and {x40, x41}.
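This alternative order of operations can be sketched the same way (again an illustrative reading, with a configurable minimum paragraph length): each highlight sentence set is grouped into essence paragraphs first, and the union of the resulting paragraphs is taken afterwards.

```python
def group_runs(indices, min_len=2):
    """Group sorted 1-based sentence indices into runs of consecutive indices."""
    runs, run = [], []
    for i in sorted(indices):
        if run and i != run[-1] + 1:      # a gap closes the current run
            if len(run) >= min_len:
                runs.append(tuple(run))
            run = []
        run.append(i)
    if len(run) >= min_len:
        runs.append(tuple(run))
    return runs

def paragraphs_then_union(set1, set2, min_len=2):
    """Form essence paragraphs per highlight sentence set, then take their union."""
    return sorted(set(group_runs(set1, min_len)) | set(group_runs(set2, min_len)))

# Reproduces the five paragraphs of the example above:
x1 = {1, 2, 3, 4, 16, 17, 18, 19, 35, 36, 40, 41}
x2 = {1, 2, 3, 4, 20, 21, 22}
print(paragraphs_then_union(x1, x2))
```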
In one possible embodiment, the server inputs the text content into the first supervised model to obtain a first index value of each sentence in the text content and determines the first highlight sentence set based on the first index value; likewise, the server inputs the text content into the first unsupervised model to obtain a second index value of each sentence in the text content and determines the second highlight sentence set based on the second index value. The first index value or the second index value is one or more of the following: the highlight score of each sentence in the text content, the semantic similarity between each sentence and the text content as a whole, and the probability that each sentence is a summary of the text content.
A summary is another form of describing a highlight sentence: the probability that a sentence is a summary of the text content is the probability that it is a highlight sentence. The first supervised model and the first unsupervised model are pre-trained models that can judge the highlight degree of each sentence in the text content. The training data used by the first supervised model carries a highlight label for each sentence, whereas the first unsupervised model requires no such labels during training.
When the server determines a highlight sentence set according to the index values: if the first index value of a sentence is not smaller than a first index threshold, the sentence is determined to be a highlight sentence and added to the first highlight sentence set; likewise, if the second index value of a sentence is not smaller than a second index threshold, the sentence is added to the second highlight sentence set. The first index threshold and the second index threshold may be the same or different, which is not limited in this application.
Taking the first index value as the highlight score of each sentence as an example, the process of determining the first highlight sentence set is as follows. For the aforementioned text content X, suppose the highlight scores of the first 10 sentences {x1, x2, x3, x4, x5, x6, x7, x8, x9, x10} are {70, 70, 80, 85, 60, 67, 56, 67, 48, 50}, respectively. If the first index threshold is 70, the highlight scores of the first through fourth sentences are not smaller than the threshold, so {x1, x2, x3, x4} are added to the first highlight sentence set. The process of determining the second highlight sentence set is similar and is not repeated here.
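A sketch of this thresholding step under the same assumptions (per-sentence highlight scores, 1-based indices):

```python
def highlight_sentence_set(scores, index_threshold):
    """Return the 1-based indices of sentences whose highlight score is
    not smaller than the index threshold."""
    return {i for i, score in enumerate(scores, start=1) if score >= index_threshold}

scores = [70, 70, 80, 85, 60, 67, 56, 67, 48, 50]  # first 10 sentences of X
print(highlight_sentence_set(scores, 70))          # -> {1, 2, 3, 4}
```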
It should be noted that the supervised model may adopt a conventional machine learning algorithm such as the K-nearest neighbor algorithm, decision trees, or naive Bayes, or a deep learning algorithm such as a convolutional neural network (CNN). The unsupervised model may adopt the K-means algorithm, autoencoders, principal component analysis (PCA), generative adversarial networks (GAN), or other deep learning algorithms, which are not limited in this application.
Mode two:
the server inputs audio signals of the text content and the audio data into a first supervised model to determine a first set of highlights in the text content; the server inputs audio signals of the text content and the audio data into the first unsupervised model to determine a second wonderful sentence set in the text content; the server determines a refined paragraph in the text content based on the first set of highlights and the second set of highlights.
The audio signal of the audio data includes features such as tone color, pitch, and the like. The text content and the audio signal are input into the first supervised model and the first unsupervised model together, so that the first supervised model and the first unsupervised model can extract multi-modal characteristics as the basis for determining the first and second highlight sets.
Optionally, the audio signal of the text content and the audio data may be input into a first supervised model to determine a first index value for each sentence in the text content; determining a first set of highlights based on a first index value; inputting the text content and the audio signal of the audio data into a first unsupervised model to obtain a second index value of each sentence in the text content; a second set of highlights is determined based on the second index value.
The possible forms of the first index value or the second index value in this manner and the specific process of how to determine the two sets of highlights and determine the essence paragraphs in the text content according to the two sets of highlights can be referred to the corresponding description in the first manner, which is not repeated herein.
It should be noted that, when training the first supervised model, the text content of the training audio data, the audio signal of the training audio data, and the corresponding highlight mark also need to be input into the model for training. When the first unsupervised model is trained, the text content of the training audio data and the audio signal of the training audio data need to be input into the model for training. Therefore, the trained first supervised model or the trained first unsupervised model can extract the text features and the audio signal features in the newly input audio data, judge the wonderness of the audio data and determine the first index value or the second index value.
Mode three:
the server inputs the text content into a first supervised model, and inputs the audio signal of the audio data into a second supervised model so as to determine a first wonderful sentence set in the text content; the server inputs the text content into a first unsupervised model, and inputs the audio signal of the audio data into a second unsupervised model so as to determine a second wonderful sentence set in the text content; the server determines a refined paragraph in the text content based on the first set of highlights and the second set of highlights.
The method can use different models to perform decoupling extraction operation on different features, avoid mutual dependence among the features and improve the quantity of the determined wonderful sentences.
Optionally, after inputting the text content and the audio signal into the different supervised or unsupervised models, the server may likewise obtain corresponding index values and determine the highlight sentence sets from them. For example, the server determines a first index value of each sentence in the text content after inputting the text content into the first supervised model, and a third index value of each sentence after inputting the audio signal into the second supervised model; the first highlight sentence set is determined based on the first index value and the third index value. Similarly, the server determines a second index value of each sentence after inputting the text content into the first unsupervised model, and a fourth index value of each sentence after inputting the audio signal into the second unsupervised model; the second highlight sentence set is determined based on the second index value and the fourth index value.
For the possible forms of the third and fourth index values, refer to those of the first and second index values; for how the essence paragraphs are determined from the two highlight sentence sets, refer to the corresponding description in mode one, which is not repeated here.
In one possible embodiment, the server determines the first highlight sentence set based on the first and third index values (or the second set based on the second and fourth index values) as follows: the server adds to the first highlight sentence set the union of the highlight sentences determined from the first index value and those determined from the third index value, and likewise adds to the second highlight sentence set the union of the highlight sentences determined from the second and fourth index values. Illustratively, suppose 10 sentences are input into the first supervised model and their highlight scores are {50, 60, 65, 70, 72, 73, 75, 80, 65, 40} in order; the 4th to 8th sentences are determined to be highlight sentences. Suppose the audio signals of the same 10 sentences are input into the second supervised model and the highlight scores are {46, 52, 70, 73, 70, 75, 62, 60, 60, 50} in order; the 3rd to 6th sentences are determined to be highlight sentences. The first highlight sentence set is then the union of the two results, comprising the 3rd to 8th sentences.
In another possible embodiment, when determining the first or second highlight sentence set, the server may weight the two index values obtained for each sentence to produce a new index value, and determine the highlight sentence set from the new value. For example, after obtaining the first index value output by the first supervised model and the third index value output by the second supervised model, the server multiplies the first index value by a first weight and the third index value by a second weight, sums them to obtain a fifth index value, and determines the first highlight sentence set based on the fifth index value. Illustratively, with the same two sets of scores as above ({50, 60, 65, 70, 72, 73, 75, 80, 65, 40} from the first supervised model and {46, 52, 70, 73, 70, 75, 62, 60, 60, 50} from the second), if the first weight is 0.7 and the second weight is 0.3, the fused highlight score of the first sentence is 50 × 0.7 + 46 × 0.3 = 48.8, and so on, giving fused scores of {48.8, 57.6, 66.5, 70.9, 71.4, 73.6, 71.1, 74, 63.5, 43}. The sentences whose fused score exceeds the first index threshold are the 4th to 8th sentences, which are added to the first highlight sentence set as highlight sentences.
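The weighted fusion can be sketched as follows; the weights 0.7/0.3 and the threshold of 70 are taken from the example, and a strictly-greater comparison is assumed from the wording "exceeds the first index threshold".

```python
def fused_highlight_set(text_scores, audio_scores, w_text=0.7, w_audio=0.3, threshold=70):
    """Weight the per-sentence index values from the text model and the audio
    model, then keep the sentences whose fused score exceeds the threshold."""
    fused = [w_text * t + w_audio * a for t, a in zip(text_scores, audio_scores)]
    picked = {i for i, s in enumerate(fused, start=1) if s > threshold}
    return fused, picked

text_scores  = [50, 60, 65, 70, 72, 73, 75, 80, 65, 40]  # first supervised model
audio_scores = [46, 52, 70, 73, 70, 75, 62, 60, 60, 50]  # second supervised model
fused, picked = fused_highlight_set(text_scores, audio_scores)
print(picked)  # -> {4, 5, 6, 7, 8}
```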
204. The server determines the audio data corresponding to the essence paragraphs in the audio data as audio essence segments.
Specifically, an essence paragraph consists of several highlight sentences in the text content, and the audio essence segment is the part of the audio data whose timestamps align with the essence paragraph. For example, if the highlight sentences in an essence paragraph are the 20th to 220th sentences of the text content and correspond to the audio data with timestamps 17:30 to 17:31, that audio data is determined to be the audio essence segment.
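A minimal sketch of this timestamp alignment, assuming per-sentence (start_sec, end_sec) timestamps like those in the transcript sketch earlier and raw PCM samples at a known sample rate (all names illustrative):

```python
def cut_audio_essence_segment(samples, sample_rate, sentence_times, paragraph):
    """Cut the slice of audio whose timestamps align with an essence paragraph.

    sentence_times: list of (start_sec, end_sec) per sentence, stored 0-based
    paragraph:      sorted 1-based indices of the paragraph's highlight sentences
    """
    start_sec = sentence_times[paragraph[0] - 1][0]  # start of the first sentence
    end_sec = sentence_times[paragraph[-1] - 1][1]   # end of the last sentence
    return samples[int(start_sec * sample_rate):int(end_sec * sample_rate)]
```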
Based on the embodiment shown in fig. 2, supervised learning and unsupervised learning are combined, and the essence segments contained in the audio data can be determined from the text information of the audio data.
Fig. 3 is a schematic flow chart of another audio essence segment determination method provided in the embodiment of the present application. The method comprises steps 301 to 306. Wherein:
301. the server obtains audio data.
302. The server determines the text content corresponding to the audio data.
303. The server determines a passage of essence in the text content based on the first supervised model and the first unsupervised model.
For specific implementations of steps 301 to 303, reference may be made to the descriptions of steps 201 to 203, which are not repeated here.
304. The server inputs the essence paragraph and its context-adjacent sentences into the first deep learning model to determine whether the context-adjacent sentences should be added to the essence paragraph.
In one possible implementation, step 304 is implemented as follows: the server inputs the essence paragraph and its context-adjacent sentences in the text content into the first deep learning model to obtain the probability that each context-adjacent sentence and the essence paragraph belong to the same language segment; if the probability is greater than a probability threshold, the context-adjacent sentence is added to the essence paragraph.
Specifically, the context-adjacent sentences of the essence paragraph are determined first; these may be one or more sentences before the starting sentence of the paragraph and one or more sentences after its ending sentence. The context-adjacent sentences and the essence paragraph are input into the first deep learning model, which determines the probability that they belong to the same language segment by analyzing word similarity, semantic similarity, grammatical and logical relationships, and the like, between the context-adjacent sentences and the sentences of the essence paragraph. If the probability is greater than the probability threshold (e.g., 0.8), the context-adjacent sentence and the essence paragraph are considered to belong to the same language segment and the sentence is added to the paragraph; if the probability is not greater than the threshold, the sentence is not added. The sentence immediately before the starting sentence of the essence paragraph and the sentence immediately after its ending sentence are the most likely to belong to the same language segment as the paragraph, so the nearest neighbor on each side can be judged first: if it does not satisfy the condition, the judgment of the remaining sentences on that side can stop; if it does, the next nearest sentence is judged, and so on. Through this possible implementation, context information is supplemented for the essence paragraph, so that the processed paragraph is semantically more complete and grammatically more fluent.
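The nearest-neighbor-first expansion can be sketched as below; same_segment_prob stands in for the first deep learning model and is an assumed interface, as is the 0.8 threshold taken from the example.

```python
def expand_essence_paragraph(paragraph, sentences, same_segment_prob, p_threshold=0.8):
    """Extend an essence paragraph backward and forward, testing the nearest
    context-adjacent sentence first and stopping at the first rejection.

    paragraph: sorted 1-based sentence indices of the essence paragraph
    sentences: all sentences of the text content (a 0-based list)
    same_segment_prob(sentence, paragraph_indices) -> probability that the
    sentence and the paragraph belong to the same language segment
    """
    start, end = paragraph[0], paragraph[-1]
    # Backward: the sentence just before the starting sentence, then earlier ones.
    while start > 1 and same_segment_prob(sentences[start - 2],
                                          list(range(start, end + 1))) > p_threshold:
        start -= 1
    # Forward: the sentence just after the ending sentence, then later ones.
    while end < len(sentences) and same_segment_prob(sentences[end],
                                                     list(range(start, end + 1))) > p_threshold:
        end += 1
    return list(range(start, end + 1))
```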
305. The server determines the audio data corresponding to the essence paragraphs in the audio data as the essence segments of the audio.
For a detailed implementation of step 305, reference may be made to the corresponding description in step 204, which is not described herein again.
306. The server determines the highlight scores of the audio essence segments and recommends audio essence segments to the user based on the highlight scores.
In one possible implementation, if there are multiple pieces of audio data, the audio essence segments include the segments corresponding to each piece of audio data, and step 306 is implemented as follows: the server determines the highlight score of the audio essence segment of each piece of audio data; sorts the audio essence segments based on the highlight scores to obtain a sorting result; and recommends audio essence segments to the user based on the sorting result.
The server may input the corresponding essence paragraphs into a second deep learning model, which extracts feature vectors characterizing the text's topic, plot, characters, and the like from the semantics of the essence paragraphs and evaluates the highlight score of each audio essence segment from these feature vectors. The audio essence segments are sorted by the evaluated highlight scores to obtain a sorting result, and segments are recommended to the user based on that result.
Optionally, the server may store the correspondence between the audio essence segments and their highlight scores. When audio needs to be pushed to a user, one or more audio essence segments with higher highlight scores are determined according to this correspondence and pushed to the user: the server sends each segment to the user's terminal device, which renders it and plays it through a speaker.
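Putting the scoring and recommendation step together, a sketch (score_fn stands in for the second deep learning model, and top_k is an assumed recommendation size):

```python
def recommend_audio_essence_segments(segments, score_fn, top_k=3):
    """Score each audio essence segment, sort by descending highlight score,
    and return the top-k segments to push to the user."""
    ranked = sorted(segments, key=score_fn, reverse=True)
    return ranked[:top_k]
```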
In this way, the audio essence segments are further scored and sorted; the sorting result benefits subsequent recommendation and distribution services and helps recommend to each user the audio segments they are likely to enjoy.
Based on the embodiment shown in fig. 3, after the essence paragraphs are determined, context information can be added to them to improve the completeness of the audio essence segments, and the segments can be scored and sorted by highlight degree so as to obtain the audio segments that are most representative and most popular with users.
Referring to fig. 4, fig. 4 is a schematic structural diagram of an audio essence segment determination apparatus provided in an embodiment of the present application; the apparatus 40 includes an obtaining module 401 and a processing module 402. Wherein:
an obtaining module 401, configured to obtain audio data and text content corresponding to the audio data;
a processing module 402, configured to input the text content into a first supervised model to determine a first highlight sentence set in the text content, and to input the text content into a first unsupervised model to determine a second highlight sentence set in the text content;
the processing module 402 is further configured to determine an essence paragraph in the text content based on the first highlight sentence set and the second highlight sentence set;
the processing module 402 is further configured to determine the audio data corresponding to the essence paragraph in the audio data as an audio essence segment.
In one possible embodiment, the processing module 402 inputs the text content into the first supervised model to determine the first highlight sentence set specifically by: inputting the text content into the first supervised model to obtain a first index value of each sentence in the text content, and determining the first highlight sentence set based on the first index value;
the processing module 402 inputs the text content into the first unsupervised model to determine the second highlight sentence set specifically by: inputting the text content into the first unsupervised model to obtain a second index value of each sentence in the text content, and determining the second highlight sentence set based on the second index value;
wherein the first index value or the second index value is any one or more of the following: the highlight score of each sentence in the text content, the semantic similarity between each sentence and the text content as a whole, and the probability that each sentence is a summary of the text content.
In one possible embodiment, the processing module 402 inputs the text content into the first supervised model to determine the first highlight sentence set specifically by: inputting the text content and the audio signal of the audio data into the first supervised model to determine the first highlight sentence set in the text content;
the processing module 402 inputs the text content into the first unsupervised model to determine the second highlight sentence set specifically by: inputting the text content and the audio signal of the audio data into the first unsupervised model to determine the second highlight sentence set in the text content.
In one possible embodiment, the processing module 402 inputs the text content into the first supervised model to determine the first highlight sentence set specifically by: inputting the text content into the first supervised model and inputting the audio signal of the audio data into a second supervised model to determine the first highlight sentence set in the text content;
the processing module 402 inputs the text content into the first unsupervised model to determine the second highlight sentence set specifically by: inputting the text content into the first unsupervised model and inputting the audio signal of the audio data into a second unsupervised model to determine the second highlight sentence set in the text content.
In one possible implementation, the processing module 402 inputs the text content into the first supervised model and the audio signal of the audio data into the second supervised model to determine the first highlight sentence set specifically by:
inputting the text content into the first supervised model to obtain a first index value of each sentence in the text content, inputting the audio signal of the audio data into the second supervised model to obtain a third index value of each sentence, and determining the first highlight sentence set based on the first index value and the third index value;
the processing module 402 inputs the text content into the first unsupervised model and the audio signal of the audio data into the second unsupervised model to determine the second highlight sentence set specifically by:
inputting the text content into the first unsupervised model to obtain a second index value of each sentence in the text content, inputting the audio signal of the audio data into the second unsupervised model to obtain a fourth index value of each sentence, and determining the second highlight sentence set based on the second index value and the fourth index value;
wherein the first, second, third, or fourth index value is any one or more of the following: the highlight score of each sentence in the text content, the semantic similarity between each sentence and the text content as a whole, and the probability that each sentence is a summary of the text content.
In one possible implementation, the processing module 402 determines the essence paragraph in the text content based on the first and second highlight sentence sets by:
determining a union of the first highlight sentence set and the second highlight sentence set;
and determining an essence paragraph in the text content based on the highlight sentence union.
In one possible implementation, the processing module 402 determines the essence paragraph in the text content based on the first and second highlight sentence sets by:
determining a first highlight paragraph based on the first highlight sentence set and a second highlight paragraph based on the second highlight sentence set;
and determining the union of the first highlight paragraph and the second highlight paragraph as the essence paragraph in the text content.
In a possible implementation manner, the processing module 402 is further configured to input the essence paragraph and its adjacent context sentences in the text content into a first deep learning model to obtain a probability that an adjacent context sentence and the essence paragraph belong to the same passage; and if the probability is greater than a probability threshold, to add the adjacent context sentence to the essence paragraph.
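This expansion can be sketched as a greedy loop. The same_passage_prob interface is hypothetical; the embodiment specifies only that a first deep learning model outputs the probability that an adjacent sentence and the paragraph belong to the same passage.

def expand_essence_paragraph(sentences, start, end, passage_model, threshold=0.5):
    # Greedily absorb adjacent context sentences while the model judges them
    # to belong to the same passage as the current essence paragraph.
    paragraph = " ".join(sentences[start:end + 1])
    while start > 0 and passage_model.same_passage_prob(sentences[start - 1], paragraph) > threshold:
        start -= 1
        paragraph = sentences[start] + " " + paragraph
    while end + 1 < len(sentences) and passage_model.same_passage_prob(sentences[end + 1], paragraph) > threshold:
        end += 1
        paragraph = paragraph + " " + sentences[end]
    return start, end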
In one possible implementation, if there are a plurality of pieces of audio data, the audio essence segments include the audio essence segments corresponding to the plurality of pieces of audio data, and the processing module 402 is further configured to determine an essence degree score for each audio essence segment;
rank the plurality of audio essence segments based on their essence degree scores to obtain a ranking result;
and recommend audio essence segments to the user based on the ranking result.
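Scoring and ranking reduce to a sort; the scorer callable below is a hypothetical stand-in for however an embodiment computes the essence degree score.

def recommend_audio_essence_segments(segments, scorer, top_n=3):
    # Rank the segments by essence degree score, highest first, and recommend the top_n.
    ranked = sorted(segments, key=scorer, reverse=True)
    return ranked[:top_n]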
It should be noted that the functions of each module of the audio essence segment determination apparatus in the embodiment of the present application may be implemented according to the method embodiments described above; for the specific implementation process and beneficial effects, reference may be made to the related description of the method embodiments, which is not repeated here.
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device 50 may include: one or more processors 501, memory 502, and transceiver 503. The processor 501, memory 502, and transceiver 503 are connected by a bus 504. The memory 502 is used to store a computer program comprising program instructions, and the processor 501 and the transceiver 503 are used to execute the program instructions stored in the memory 502 to perform the following operations:
acquiring audio data and text content corresponding to the audio data;
inputting the text content into a first supervised model to determine a first set of highlights in the text content, and inputting the text content into a first unsupervised model to determine a second set of highlights in the text content;
determining an essence paragraph in the text content based on the first and second sets of highlights;
and determining the audio data corresponding to the essence paragraph in the audio data as an audio essence segment.
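Read end to end, the four operations can be sketched as below. Everything in this sketch is illustrative: the asr interface, the highlights methods, the per-sentence timestamps, and the slice call are assumptions used to tie the steps together, not details disclosed by the embodiment.

def determine_audio_essence_segment(audio_data, asr, supervised_model, unsupervised_model):
    # 1. Acquire the text content corresponding to the audio data, here via a
    #    hypothetical ASR interface that also returns (start, end) times per sentence.
    sentences, timestamps = asr.transcribe(audio_data)
    # 2. Determine the first and second highlight sentence sets (as sentence indices).
    first_set = supervised_model.highlights(sentences)
    second_set = unsupervised_model.highlights(sentences)
    # 3. Determine the essence paragraph from the two sets (union shown here).
    essence = sorted(first_set | second_set)
    if not essence:
        return None
    # 4. Map the essence paragraph back to the audio via the sentence timestamps
    #    and return that span as the audio essence segment.
    start_time = timestamps[essence[0]][0]
    end_time = timestamps[essence[-1]][1]
    return audio_data.slice(start_time, end_time)  # hypothetical slicing API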
It should be understood that in some possible embodiments, the processor 501 may be a Central Processing Unit (CPU), or another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The memory 502 may include both read-only memory and random access memory, and provides instructions and data to the processor 501. A portion of the memory 502 may also include non-volatile random access memory. For example, the memory 502 may also store device type information.
In a specific implementation, the electronic device may execute, through its built-in functional modules, the implementation manners provided in the steps of fig. 2 and fig. 3; for the specific implementation processes and beneficial effects, reference may be made to the implementation manners provided in those steps, which are not repeated here.
An embodiment of the present application further provides a computer-readable storage medium that stores the computer-readable instructions executed by the aforementioned audio essence segment determination apparatus. The computer-readable instructions include program instructions, and when a processor executes the program instructions, the methods in the embodiments corresponding to fig. 2 and fig. 3 can be performed; details are therefore not repeated here, and the beneficial effects of the same method are likewise not described again. For technical details not disclosed in the embodiments of the computer-readable storage medium of the present application, reference is made to the description of the method embodiments of the present application. By way of example, the program instructions may be deployed to be executed on one computer device, or on multiple computer devices located at one site or distributed across multiple sites and interconnected by a communication network, which may comprise a blockchain system.
According to an aspect of the application, a computer program product or computer program is provided, comprising computer instructions stored in a computer-readable storage medium. The processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device can execute the methods in the embodiments corresponding to fig. 2, fig. 3, fig. 4, and fig. 5; the description is therefore not repeated here. It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications and substitutions can be easily made by those skilled in the art within the technical scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (11)

1. An audio essence segment determination method, characterized in that the method comprises:
acquiring audio data and text content corresponding to the audio data;
inputting the text content into a first supervised model to determine a first set of highlights in the text content and inputting the text content into a first unsupervised model to determine a second set of highlights in the text content;
determining an essence paragraph in the text content based on the first set of highlights and the second set of highlights;
and determining the audio data corresponding to the essence paragraph in the audio data as an audio essence segment.
2. The method of claim 1, wherein entering the textual content into a first supervised model to determine a first set of highlights in the textual content comprises:
inputting the text content into a first supervised model to obtain a first index value of each sentence in the text content, and determining a first set of highlights based on the first index value;
the entering the text content into a first unsupervised model to determine a second set of highlights in the text content comprises:
inputting the text content into a first unsupervised model to obtain a second index value of each sentence in the text content, and determining a second set of highlights based on the second index value;
wherein the first index value or the second index value is any one or more of: a highlight score of each sentence in the text content, a semantic similarity between each sentence in the text content and the text content as a whole, and a probability that each sentence in the text content is a summary of the text content.
3. The method of claim 1, wherein entering the textual content into a first supervised model to determine a first set of highlights in the textual content comprises:
inputting the text content and an audio signal of the audio data into a first supervised model to determine a first set of highlights in the text content;
the entering the text content into a first unsupervised model to determine a second set of highlights in the text content comprises:
inputting the text content and an audio signal of the audio data into a first unsupervised model to determine a second set of highlights in the text content.
4. The method of claim 1, wherein entering the textual content into a first supervised model to determine a first set of highlights in the textual content comprises:
inputting the text content into a first supervised model and inputting an audio signal of the audio data into a second supervised model to determine a first set of highlights in the text content;
the entering the text content into a first unsupervised model to determine a second set of highlights in the text content comprises:
inputting the text content into a first unsupervised model and inputting an audio signal of the audio data into a second unsupervised model to determine a second set of highlights in the text content.
5. The method of claim 4, wherein inputting the textual content into a first supervised model and inputting an audio signal of the audio data into a second supervised model to determine a first set of highlights in the textual content comprises:
inputting the text content into a first supervised model to obtain a first index value of each sentence in the text content, inputting an audio signal of the audio data into a second supervised model to obtain a third index value of each sentence in the text content, and determining a first highlight sentence set in the text content based on the first index value and the third index value;
the entering the text content into a first unsupervised model and the entering the audio signal of the audio data into a second unsupervised model to determine a second set of highlights in the text content comprises:
inputting the text content into a first unsupervised model to obtain a second index value of each sentence in the text content, inputting an audio signal of the audio data into a second unsupervised model to obtain a fourth index value of each sentence in the text content, and determining a second highlight sentence set in the text content based on the second index value and the fourth index value;
wherein the first index value, the second index value, the third index value, or the fourth index value is any one or more of the following: a highlight score of each sentence in the text content, a semantic similarity between each sentence in the text content and the text content as a whole, and a probability that each sentence in the text content is a summary of the text content.
6. The method of any of claims 1-5, wherein determining the essence paragraph in the text content based on the first set of highlights and the second set of highlights comprises:
determining a union of the first set of highlights and the second set of highlights;
and determining an essence paragraph in the text content based on the union of highlights.
7. The method of any of claims 1-5, wherein determining the essence paragraph in the text content based on the first set of highlights and the second set of highlights comprises:
determining a first highlight paragraph based on the first set of highlights and a second highlight paragraph based on the second set of highlights;
and determining a union of the first highlight paragraph and the second highlight paragraph as the essence paragraph in the text content.
8. The method according to any one of claims 1-7, wherein after determining the essence paragraph in the text content and before determining the audio data corresponding to the essence paragraph in the audio data as an audio essence segment, the method further comprises:
inputting the essence paragraph and its adjacent context sentences in the text content into a first deep learning model to obtain a probability that an adjacent context sentence and the essence paragraph belong to the same passage;
and if the probability is greater than a probability threshold, adding the adjacent context sentence to the essence paragraph.
9. The method according to any one of claims 1-7, wherein if there are a plurality of pieces of audio data, the audio essence segments include the audio essence segments corresponding to the plurality of pieces of audio data;
the method further comprises:
respectively determining an essence degree score for the audio essence segment of each piece of audio data;
ranking the plurality of audio essence segments based on the essence degree scores to obtain a ranking result;
and recommending audio essence segments to the user based on the ranking result.
10. An electronic device, characterized in that the electronic device comprises: a memory, a processor;
the memory for storing a computer program;
the processor for invoking the computer program from the memory to perform the method of any of claims 1-9.
11. A computer-readable storage medium, in which a computer program is stored, the computer program comprising program instructions which, when executed by a processor, cause the processor to carry out the method according to any one of claims 1-9.
CN202111611647.4A 2021-12-27 2021-12-27 Audio essence fragment determination method, electronic equipment and computer-readable storage medium Pending CN114329043A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111611647.4A CN114329043A (en) 2021-12-27 2021-12-27 Audio essence fragment determination method, electronic equipment and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111611647.4A CN114329043A (en) 2021-12-27 2021-12-27 Audio essence fragment determination method, electronic equipment and computer-readable storage medium

Publications (1)

Publication Number Publication Date
CN114329043A (en) 2022-04-12

Family

ID=81013669

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111611647.4A Pending CN114329043A (en) 2021-12-27 2021-12-27 Audio essence fragment determination method, electronic equipment and computer-readable storage medium

Country Status (1)

Country Link
CN (1) CN114329043A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination