CN113282509B - Tone recognition, live broadcast room classification method, device, computer equipment and medium - Google Patents


Info

Publication number
CN113282509B
CN113282509B (granted from application CN202110662233.8A)
Authority
CN
China
Prior art keywords
tone
tone color
label
tag
identification
Prior art date
Legal status
Active
Application number
CN202110662233.8A
Other languages
Chinese (zh)
Other versions
CN113282509A (en)
Inventor
刘柏基
徐易楠
陀得意
康世胤
Current Assignee
Guangzhou Huya Technology Co Ltd
Original Assignee
Guangzhou Huya Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Guangzhou Huya Technology Co Ltd
Priority claimed from CN202110662233.8A
Publication of CN113282509A
Application granted
Publication of CN113282509B
Legal status: Active


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00: Error detection; Error correction; Monitoring
    • G06F 11/36: Preventing errors by testing or debugging software
    • G06F 11/3668: Software testing
    • G06F 11/3672: Test management
    • G06F 11/3684: Test management for test design, e.g. generating new test cases
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00: Error detection; Error correction; Monitoring
    • G06F 11/36: Preventing errors by testing or debugging software
    • G06F 11/3668: Software testing
    • G06F 11/3672: Test management
    • G06F 11/3688: Test management for test execution, e.g. scheduling of test suites

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the invention discloses a timbre recognition and live broadcast room classification method, apparatus, computer device and medium. The timbre recognition method comprises the following steps: acquiring a speech segment to be recognized of a speaker, and extracting basic audio features of the speech segment; acquiring joint attribute features corresponding to the basic audio features according to the timbre-label relevance among a plurality of timbre label systems, and acquiring a timbre-label recognition result of the joint attribute features under each timbre label system according to the timbre-label specificity among the systems; and combining the timbre-label recognition results to obtain the timbre recognition result of the speaker. The technical scheme of the embodiment recognizes, under multiple timbre label systems, a plurality of timbre-label results that share joint attribute features, thereby realizing accurate multi-dimensional timbre recognition of audio.

Description

Tone recognition, live broadcast room classification method, device, computer equipment and medium
Technical Field
Embodiments of the invention relate to the technical field of audio and speech processing, and in particular to a timbre recognition and live broadcast room classification method, apparatus, computer device and medium.
Background
In recent years, with the development of live audio platforms, users' preferences regarding the timbre of a host (anchor) have become increasingly diverse, which calls for timbre recognition on the anchor's speech segments. The goal of timbre recognition is to judge characteristics of a speaker, such as timbre and emotion, from a given speech segment of that speaker.
In the prior art, timbre is usually evaluated along only a single dimension. Such a description of the anchor's timbre is not specific enough: it is difficult for a user to form a concrete impression from a single-dimension recognition result, and the anchor's timbre cannot be clearly known.
In the course of realizing the invention, the inventors found that even if the evaluation dimensions of timbre are expanded, contradictory results arise between the dimensions because they lack joint attribute constraints, making multi-dimensional evaluation of timbre very difficult.
Disclosure of Invention
Embodiments of the invention provide a timbre recognition and live broadcast room classification method, apparatus, computer device and medium, which evaluate timbre along multiple dimensions while maintaining joint attribute characteristics among the dimensions, thereby realizing accurate multi-dimensional timbre recognition of audio.
In a first aspect, an embodiment of the present invention provides a timbre recognition method, including:
acquiring a speech segment to be recognized of a speaker, and extracting basic audio features of the speech segment to be recognized;
acquiring joint attribute features corresponding to the basic audio features according to the timbre-label relevance among a plurality of timbre label systems, and acquiring timbre-label recognition results of the joint attribute features under each timbre label system according to the timbre-label specificity among the timbre label systems; and
combining the timbre-label recognition results to obtain the timbre recognition result of the speaker.
In a second aspect, an embodiment of the present invention further provides a live broadcast room classification method, including:
acquiring the speech segments to be recognized corresponding to the anchors of the live broadcast rooms to be classified;
recognizing, by the timbre recognition method according to any embodiment of the invention, the timbre recognition results corresponding to the anchors;
adding each timbre recognition result to the corresponding live broadcast room as an anchor timbre description tag; and
classifying the live broadcast rooms according to the anchor timbre description tags.
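As a rough illustration of the four steps above (not the patented implementation: the `recognize_timbre` stub stands in for the timbre recognition method of the first aspect, and the room names are invented), the classification flow might look like:

```python
def recognize_timbre(speech_segment):
    # Stub standing in for the timbre recognition method of the first aspect;
    # here the "segment" is already a pre-computed composite timbre label.
    return speech_segment

def classify_live_rooms(rooms):
    """rooms: mapping of room id -> anchor speech segment (stubbed).
    Returns the rooms grouped by their anchor timbre description tag."""
    groups = {}
    for room_id, segment in rooms.items():
        tag = recognize_timbre(segment)             # step 2: recognize timbre
        groups.setdefault(tag, []).append(room_id)  # steps 3-4: tag, then group
    return groups

rooms = {"room1": "girl voice", "room2": "uncle voice", "room3": "girl voice"}
print(classify_live_rooms(rooms))
# {'girl voice': ['room1', 'room3'], 'uncle voice': ['room2']}
```

Grouping by tag in a single pass is enough here because each room carries exactly one composite timbre description tag.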
In a third aspect, an embodiment of the present invention further provides a timbre recognition apparatus, including:
a basic audio feature extraction module, configured to acquire the speech segment to be recognized of a speaker and extract basic audio features of the speech segment to be recognized;
a timbre-label recognition result acquisition module, configured to acquire joint attribute features corresponding to the basic audio features according to the timbre-label relevance among a plurality of timbre label systems, and to acquire timbre-label recognition results of the joint attribute features under each timbre label system according to the timbre-label specificity among the timbre label systems; and
a timbre-label recognition result combination module, configured to combine the timbre-label recognition results to obtain the timbre recognition result of the speaker.
In a fourth aspect, an embodiment of the present invention further provides a live broadcast room classification apparatus, including:
an anchor speech segment acquisition module, configured to acquire the speech segments to be recognized corresponding to the anchors of the live broadcast rooms to be classified;
an anchor timbre recognition module, configured to recognize the timbre recognition results corresponding to the anchors by the timbre recognition method according to any embodiment of the invention;
an anchor timbre description tag adding module, configured to add each timbre recognition result to the corresponding live broadcast room as an anchor timbre description tag; and
a live broadcast room classification module, configured to classify the live broadcast rooms according to the anchor timbre description tags.
In a fifth aspect, an embodiment of the present invention further provides a computer device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the timbre recognition method of any embodiment of the present invention or the live broadcast room classification method of any embodiment of the present invention.
In a sixth aspect, an embodiment of the present invention further provides a computer-readable storage medium having stored thereon a computer program that, when executed by a computer, implements the timbre recognition method provided by any embodiment of the present invention, or the live broadcast room classification method provided by any embodiment of the present invention.
According to the technical scheme, the speech segment to be recognized of the speaker is acquired and its basic audio features are extracted; joint attribute features corresponding to the basic audio features are acquired according to the timbre-label relevance among a plurality of timbre label systems, and the timbre-label recognition results of the joint attribute features under each timbre label system are acquired according to the timbre-label specificity among the systems; and the timbre-label recognition results are combined to obtain the timbre recognition result of the speaker. The scheme thus recognizes, under multiple timbre label systems, a plurality of timbre-label results that share joint attribute features, thereby realizing accurate multi-dimensional timbre recognition of audio.
Drawings
Fig. 1a is a flowchart of a timbre recognition method according to a first embodiment of the present invention;
Fig. 1b is a schematic diagram of the primary timbres according to the first embodiment of the present invention;
Fig. 1c is a schematic diagram of a timbre label system according to the first embodiment of the present invention;
Fig. 2a is a flowchart of a timbre recognition method according to a second embodiment of the present invention;
Fig. 2b is a schematic structural diagram of a timbre-label recognition model according to the second embodiment of the present invention;
Fig. 2c is a schematic structural diagram of an encoding layer according to the second embodiment of the present invention;
Fig. 2d is a flowchart of labeling-platform generation according to the second embodiment of the present invention;
Fig. 3 is a flowchart of a live broadcast room classification method according to a third embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a timbre recognition apparatus according to a fourth embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a live broadcast room classification apparatus according to a fifth embodiment of the present invention;
Fig. 6 is a schematic structural diagram of a computer device according to a sixth embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and embodiments. It is to be understood that the specific embodiments described herein merely illustrate the invention and do not limit it. It should further be noted that, for convenience of description, the drawings show only the parts related to the invention rather than the entire structure.
Before discussing the exemplary embodiments in more detail, it should be mentioned that some of them are described as processes or methods depicted as flowcharts. Although a flowchart describes the operations (or steps) as a sequential process, many of the operations can be performed in parallel or concurrently, and the order of the operations may be rearranged. A process may be terminated when its operations are completed, but may also have additional steps not included in the figure; a process may correspond to a method, a function, a procedure, a subroutine, and the like.
Example 1
Fig. 1a is a flowchart of a timbre recognition method according to a first embodiment of the present invention. The embodiment is applicable to performing multi-dimensional timbre-label recognition on a speech clip from a live broadcast room. The method may be performed by a timbre recognition apparatus, which may be implemented in software and/or hardware and is generally integrated in a computer device (e.g., an intelligent terminal or a server). As shown in Fig. 1a, the method comprises:
Step 110: acquire the speech segment to be recognized of the speaker, and extract the basic audio features of the speech segment to be recognized.
Here the speaker is the source of the speech, such as the anchor of a live broadcast room; the speaker may be a woman, a man, a child, etc. The speech segment to be recognized may be a clip cut from the speaker's speech during a certain period, for example a clip of the anchor speaking while live; in particular, it should be a clip capable of characterizing the anchor's timbre. The basic audio features are data characterizing the speech segment and may take several forms: a time-domain spectral feature, a frequency-domain spectral feature such as a mel-spectrogram, or a feature vector produced by a machine learning model through feature extraction. The extraction method follows from the chosen representation: classical time-domain spectral analysis from speech signal processing, a time-domain-to-mel-spectrogram transform, a learned feature extractor, and so on.
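The patent does not fix a particular feature extractor. As one illustration, a plain framed magnitude-spectrum feature (one of the time/frequency representations mentioned above) can be computed with nothing but numpy; a production system might instead use mel-spectrogram features from a library such as librosa. The frame and hop sizes below are arbitrary choices:

```python
import numpy as np

def extract_basic_features(waveform, frame_len=400, hop=160):
    """Split a 1-D waveform into overlapping windowed frames and take the
    magnitude spectrum of each frame (a simple time/frequency feature)."""
    n_frames = 1 + max(0, (len(waveform) - frame_len) // hop)
    window = np.hanning(frame_len)
    frames = np.stack([waveform[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # rfft yields frame_len // 2 + 1 frequency bins per frame
    return np.abs(np.fft.rfft(frames, axis=1))

# one second of a synthetic 440 Hz tone at 16 kHz as a stand-in speech clip
sr = 16000
t = np.arange(sr) / sr
speech = np.sin(2 * np.pi * 440 * t)
features = extract_basic_features(speech)
print(features.shape)  # (98, 201)
```

The resulting matrix (frames by frequency bins) is the kind of basic audio feature a downstream model would consume.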
Step 120: acquire joint attribute features corresponding to the basic audio features according to the timbre-label relevance among the timbre label systems, and acquire the timbre-label recognition result of the joint attribute features under each timbre label system according to the timbre-label specificity among the systems.
A timbre label system characterizes timbre from a specified dimension and may hold multiple timbre labels in that dimension. For example, a speaker's timbre can be represented along several dimensions, such as a primary timbre and secondary timbres, giving a primary timbre label system, secondary timbre label systems, and so on. The primary timbre is a basic division of the speaker's timbre, e.g., a basic classification formed by the speaker's gender and vocal characteristics.
Fig. 1b is a schematic diagram of the primary timbres according to the first embodiment of the present invention. As shown in Fig. 1b, the timbres of different speakers can be roughly divided into female voices, child voices and male voices. The female voices may include the loli voice, the girl voice, the yujie (mature-sister) voice, and so on; the male voices may include the shou (youth) voice, the young-adult voice, the dashu (uncle) voice, and so on; the child voices may include the zhengtai (shota) voice, the childish voice, and so on. To refine the timbres further, the female voices may also include a shaoluo voice lying between the loli and girl voices and a shaoyu voice lying between the girl and yujie voices, and the male voices may include a qingshou voice lying between the shou and young-adult voices and a qingshu voice lying between the young-adult and dashu voices. In total, the primary timbre may comprise 12 classes: the loli, shaoluo, girl, shaoyu and yujie voices; the shou, qingshou, young-adult, qingshu and dashu voices; and the zhengtai and childish voices.
In the embodiment of the invention, a secondary timbre is an auxiliary division of the speaker's timbre. A secondary timbre is associated with the primary timbre and characterizes the speaker's timbre from a dimension that helps users tell timbres apart.
For example, the secondary timbres may include a first secondary timbre and a second secondary timbre: the first secondary timbre describes the speaker's timbre with a four-character adjective, while the second secondary timbre describes it through a social role.
Fig. 1c is a schematic structural diagram of the timbre label systems according to the first embodiment of the present invention. As shown in Fig. 1c, there may be three timbre label systems: a primary timbre label system, a first secondary timbre label system and a second secondary timbre label system. The primary system distinguishes speakers' timbres through the 12 primary timbre labels. The first secondary system distinguishes them through four-character adjectives associated with a selected primary label, and the second secondary system through social roles associated with a selected primary label; that is, the timbre labels of the different systems are associated with one another. For example, the primary label may be the young-adult voice, the first secondary label "refreshing", and the second secondary label the senior-student voice. In particular, both secondary labels are drawn from limited vocabularies. By introducing the secondary timbres, the technical scheme represents timbre in multiple dimensions, so the primary timbre is described more concretely and complementarily, a user can concretely associate a timbre effect from the labels, and the anchor's timbre can be known more clearly.
In the embodiment of the invention, the joint attribute features are the concrete embodiment, under the basic audio features, of the timbre-label relevance among the label systems. For example, they may be features that a machine learning model has learned from the timbre-label associations across the systems. Alternatively, a direct mapping between the primary and secondary timbre labels may be constructed, and the joint attribute feature is then the mapping entry matched by the basic audio features; the mapping may cover both relevance and specificity.
Timbre-label specificity is the mutual exclusivity between the labels of different dimensions. For example, when the primary label is the dashu (uncle) voice, the first secondary label cannot be "soft and lovely" and the second secondary label cannot be a sister voice: the uncle voice belongs to the male voices while those labels describe female voices, so their co-occurrence would be contradictory.
In the embodiment of the invention, the timbre-label recognition result under each label system can be obtained from the basic audio features and the corresponding joint attribute features. For example, the primary label may be determined from the basic audio features, and the first and second secondary labels corresponding to that primary label may be determined from the joint attribute features. Moreover, because the joint attribute features are embodied under the basic audio features, unique first and second secondary labels can be determined.
Step 130: combine the timbre-label recognition results to obtain the timbre recognition result of the speaker.
Combining the timbre-label recognition results means assembling the several labels into one composite result that serves as the speaker's timbre recognition result, for example concatenating the primary, first secondary and second secondary labels in that order. The speaker's timbre recognition result might then read "young-adult voice - refreshing - senior-student voice".
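The combination step itself is plain string assembly. A minimal sketch (the label values below are invented examples, not the patent's fixed vocabulary, and the separator is an arbitrary choice):

```python
def combine_timbre_labels(primary, first_secondary, second_secondary):
    """Join the per-system label results into one composite timbre tag,
    ordered primary -> first secondary -> second secondary."""
    return " - ".join([primary, first_secondary, second_secondary])

result = combine_timbre_labels("young-adult voice", "refreshing", "senior-student voice")
print(result)  # young-adult voice - refreshing - senior-student voice
```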
According to the technical scheme, the speech segment to be recognized of the speaker is acquired and its basic audio features are extracted; joint attribute features corresponding to the basic audio features are acquired according to the timbre-label relevance among a plurality of timbre label systems, and the timbre-label recognition results of the joint attribute features under each timbre label system are acquired according to the timbre-label specificity among the systems; the timbre-label recognition results are then combined into the timbre recognition result of the speaker. The scheme thereby solves the problem of multi-dimensional timbre recognition: the dimensions remain correlated, exclusivity conflicts between them are avoided, the multi-dimensional recognition is accurate and reliable, the speaker's timbre can be described more concretely, and the timbre is evaluated from all angles.
Example two
Fig. 2a is a flowchart of a timbre recognition method according to a second embodiment of the present invention. On the basis of the first embodiment, this embodiment further refines the manner of acquiring the joint attribute features corresponding to the basic audio features according to the timbre-label relevance among a plurality of timbre label systems, and of acquiring the timbre-label recognition results of the joint attribute features under each timbre label system according to the timbre-label specificity among the systems. Correspondingly, the extraction of the basic audio features of the speech segment to be recognized is also refined. As shown in Fig. 2a, the method of this embodiment may include:
step 210, obtaining a voice fragment to be recognized of a speaker, and dividing the voice fragment to be recognized into audio sub-fragments corresponding to a plurality of time points respectively to form basic audio sub-features corresponding to each audio sub-fragment respectively.
The voice segment to be recognized can be a relatively long voice segment, more basic audio characteristics of the speaker can be obtained through the long voice segment to be recognized, and the tone of the speaker can be recognized better and more accurately. In the voice segment to be recognized, the specific voice of the speaker is fluctuant, so that the basic audio characteristics in the voice segment to be recognized can be well determined, the voice segment to be recognized can be divided according to time points, and audio sub-segments are generated. The audio sub-segment may be a relatively short speech through which the corresponding underlying audio sub-feature may be precisely determined.
Step 220: combine the basic audio sub-features corresponding to the time points in chronological order to obtain the basic audio features.
Combining the basic audio sub-features of the audio sub-segments in the chronological order of the sub-segments yields basic audio features of the whole speech segment that are more accurate and comprehensive.
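Steps 210 and 220 can be sketched as follows. The energy/zero-crossing sub-feature is a stand-in for whatever basic audio sub-feature is actually extracted, and the fixed segment length is an assumption:

```python
import numpy as np

def split_into_subsegments(waveform, seg_len):
    """Divide the speech clip into equal-length audio sub-segments
    (tail samples that do not fill a segment are dropped)."""
    n = len(waveform) // seg_len
    return [waveform[i * seg_len : (i + 1) * seg_len] for i in range(n)]

def basic_sub_feature(segment):
    # stand-in sub-feature: per-segment energy and zero-crossing rate
    energy = float(np.mean(segment ** 2))
    zcr = float(np.mean(np.abs(np.diff(np.signbit(segment).astype(int)))))
    return np.array([energy, zcr])

waveform = np.random.default_rng(0).standard_normal(16000)  # 1 s at 16 kHz
subsegments = split_into_subsegments(waveform, 1600)        # 100 ms pieces
# step 220: stack the sub-features in chronological order
basic_features = np.stack([basic_sub_feature(s) for s in subsegments])
print(basic_features.shape)  # (10, 2)
```

Stacking preserves the time order, so the resulting matrix can feed a sequence model such as the GRU encoder described below.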
Step 230: input the basic audio features into a pre-trained timbre-label recognition model, and obtain the timbre-label recognition results output by the model for each timbre label system.
The timbre-label recognition model recognizes the corresponding timbre labels from the basic audio features. Specifically, it can recognize multi-dimensional timbre labels, each serving as the recognition result for its label system; for example, it may recognize the corresponding primary, first secondary and second secondary timbre labels from the basic audio features.
In the embodiment of the invention, the timbre-label recognition model comprises a plurality of timbre-label recognition sub-modules; each sub-module consists of an encoding layer and at least one output layer connected in sequence, and all sub-modules share the same encoding layer. Each sub-module is associated with one timbre label system, and its final output layer outputs the timbre-label recognition result of the basic audio features under that system.
In other words, a sub-module is the concrete realization of its associated timbre label system inside the model; for example, the last output layers of the sub-modules may output the primary, first secondary and second secondary timbre labels respectively.
Fig. 2b is a schematic structural diagram of a timbre-label recognition model according to the second embodiment of the present invention. As shown in Fig. 2b, the model may include a plurality of timbre-label recognition sub-modules, for example three: sub-module 1 outputs the primary timbre label, sub-module 2 the first secondary timbre label, and sub-module 3 the second secondary timbre label.
In the embodiment of the invention, the encoding layer is connected to the input of the timbre-label recognition model and outputs the joint attribute features corresponding to the basic audio features. Specifically, as shown in Fig. 2b, the basic audio features are fed through the input into the shared encoding layer, which learns from them to extract the corresponding joint attribute features. The shared encoding layer effectively merges what would otherwise be independent machine learning models, each recognizing timbre along a single dimension, so that the labels recognized in different dimensions become more correlated; contradictions such as a primary label of uncle voice combined with a first secondary label of "lovely and soft" and a second secondary label of sister voice are thereby avoided. In addition, the number of label dimensions can be scaled up simply by adding output layers, without a sharp increase in computation, which makes multi-dimensional timbre evaluation simpler.
Alternatively, the basic audio sub-features making up the basic audio features may be input to the shared encoding layer, which then computes the joint attribute sub-feature for each basic audio sub-feature and statistically averages them into the joint attribute feature. Averaging the joint attribute sub-features smooths the speaker's fluctuations within the speech segment, making the speech data more stable and reflecting the speaker's timbre better, which in turn improves recognition accuracy.
As shown in Fig. 2b, the encoding layer connects to several output layers, each corresponding to one timbre-label recognition sub-module. An output layer determines the timbre-label recognition result under its label system from the basic audio features and the corresponding joint attribute features; specifically, it outputs a category vector corresponding to the recognition result. For example, output layers 1, 2 and 3 may output category vectors 1, 2 and 3, corresponding respectively to the primary, first secondary and second secondary timbre labels.
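The shared-encoder/multi-head layout of Fig. 2b can be sketched with random stand-in weights (an untrained toy, not the patented model): every head reads the same statistically averaged joint attribute feature and returns a probability distribution over its own label system. The feature and hidden dimensions are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class TimbreTagModel:
    """Shared encoding layer feeding three per-system output layers,
    mirroring the sub-module layout of Fig. 2b. Weights are random
    stand-ins, not trained parameters."""
    def __init__(self, feat_dim, hidden_dim, n_primary, n_first, n_second):
        self.encoder = rng.standard_normal((feat_dim, hidden_dim)) * 0.1
        self.heads = {
            "primary": rng.standard_normal((hidden_dim, n_primary)) * 0.1,
            "first_secondary": rng.standard_normal((hidden_dim, n_first)) * 0.1,
            "second_secondary": rng.standard_normal((hidden_dim, n_second)) * 0.1,
        }

    def forward(self, basic_features):
        # per-frame joint attribute sub-features from the shared encoder
        sub_features = np.tanh(basic_features @ self.encoder)
        # statistical averaging over time -> one joint attribute feature
        joint_feature = sub_features.mean(axis=0)
        # each output layer yields a label distribution for its system
        return {name: softmax(joint_feature @ w) for name, w in self.heads.items()}

model = TimbreTagModel(feat_dim=201, hidden_dim=64, n_primary=12, n_first=8, n_second=8)
probs = model.forward(rng.standard_normal((98, 201)))   # 98 frames of features
labels = {name: int(p.argmax()) for name, p in probs.items()}
```

Because all heads branch off the same encoding, adding another label dimension costs only one more head matrix, which is the scaling property the paragraph above describes.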
In an alternative implementation of the embodiment of the present invention, the coding layer specifically includes: the system comprises a gating circulation unit network layer, a deep neural network layer and a statistical average summarizing layer which are connected in sequence; the gating circulation unit network layer comprises a plurality of bidirectional gating circulation units which are connected in sequence.
Fig. 2c is a schematic structural diagram of an encoding layer according to the second embodiment of the present invention. As shown in fig. 2c, the gated recurrent unit network layer includes a plurality of bidirectional gated recurrent units connected in sequence. A gated recurrent unit (Gated Recurrent Unit, GRU) requires less computation than alternatives such as Long Short-Term Memory (LSTM) units. A bidirectional unit can take full account of the contextual information of the input basic audio feature in both directions, and connecting a plurality of bidirectional gated recurrent units in sequence further improves the information extraction capability. The deep neural network (Deep Neural Networks, DNN) layer following the gated recurrent unit network layer allows the increased amount of information flowing through the shared encoding layer to be processed effectively; that is, adding the DNN layer increases the parameter count of the neural network and the modeling capacity of the model. The statistical average summarizing layer summarizes the joint attribute sub-features at each time point and averages them, extracting a stable high-dimensional characterization vector for the entire speech segment to be recognized, which the subsequent output layers use to compute the probability of each tone color label in each dimension.
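Under the assumption that the layers are implemented with a standard deep-learning toolkit such as PyTorch, the encoding layer of fig. 2c might be sketched as follows; all layer sizes are illustrative assumptions, not values taken from the patent:

```python
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    """Sketch of the shared encoding layer: stacked bidirectional GRUs,
    a DNN layer, then statistical average pooling over time."""
    def __init__(self, feat_dim=40, hidden=64, dnn_dim=128, gru_layers=2):
        super().__init__()
        # several bidirectional gated recurrent units connected in sequence
        self.gru = nn.GRU(feat_dim, hidden, num_layers=gru_layers,
                          batch_first=True, bidirectional=True)
        # DNN layer to raise parameter count / modeling capacity
        self.dnn = nn.Sequential(nn.Linear(2 * hidden, dnn_dim), nn.ReLU())

    def forward(self, x):      # x: (batch, time, feat_dim)
        h, _ = self.gru(x)     # per-time-point joint attribute sub-features
        h = self.dnn(h)
        return h.mean(dim=1)   # statistical average pooling -> (batch, dnn_dim)

enc = SharedEncoder()
out = enc(torch.randn(2, 50, 40))  # 2 utterances, 50 frames each
print(out.shape)                   # torch.Size([2, 128])
```

Multiple output layers (one per tone color tag system) would then each consume the pooled vector `out`.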
In the embodiment of the present invention, by sharing the encoding layer shown in fig. 2c, tone color labels of different dimensions are predicted from high-dimensional characterization vectors output by the same encoding layer, so that the label dimensions retain a certain correlation with one another. At the same time, tone color labels of different dimensions provide the shared encoding layer with sample information from different angles, which helps improve the abstraction capability of the encoding layer.
Step 240, combining the tone color tag recognition results to obtain the tone color recognition result of the speaker.
On the basis of the above embodiment, optionally, before inputting the basic audio feature into the pre-trained timbre tag recognition model, the method further includes: constructing a training sample set, wherein the training samples in the training sample set comprise: standard voice fragments and marking tone labels of the standard voice fragments under each tone label system; training a preset machine learning model by using a training sample set to obtain a tone label identification model.
The standard speech segments may be speech pre-recorded by a plurality of speakers. For example, a plurality of speech segments of an anchor during live broadcast in a live broadcast room may be acquired as standard speech segments. The annotated tone color labels are formed by labeling the standard speech segments in advance. For example, the standard speech segments may be labeled with tone color labels under multiple tone color label systems, such as a primary tone color label, a first auxiliary tone color label and a second auxiliary tone color label.
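One annotated training sample might be represented as below; the field names and label values are hypothetical illustrations, not taken from the patent:

```python
# One training sample: a standard speech segment plus its annotated
# tone color labels under each tone color label system.
sample = {
    "speech_segment": "anchor_0001.wav",  # hypothetical file name
    "labels": {
        "primary": "sweet",               # primary tone color label
        "first_auxiliary": "bright",      # first auxiliary tone color label
        "second_auxiliary": "youthful",   # second auxiliary tone color label
    },
}

assert set(sample["labels"]) == {"primary", "first_auxiliary", "second_auxiliary"}
print(sample["labels"]["primary"])  # sweet
```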
Specifically, the tone color labels may be annotated manually, or annotated with expert knowledge through a labeling platform. For example, an expert may label the standard speech segments with tone color labels directly; alternatively, the standard speech segments may be labeled with tone color labels by applying expert knowledge through a labeling platform.
In an alternative implementation of the embodiment of the present invention, constructing the training sample set includes: obtaining standard training samples labeled by at least one standard labeling platform, and providing the standard training samples to a plurality of auxiliary labeling platforms as learning samples; and obtaining the auxiliary training samples labeled by each auxiliary labeling platform after referring to the learning samples, and forming the training sample set from the standard training samples and the auxiliary training samples.
The standard labeling platform may be a labeling platform built on expert knowledge. Fig. 2d is a flowchart of labeling platform generation according to the second embodiment of the present invention. As shown in fig. 2d, some exemplary standard speech segments may be labeled by the standard labeling platform to obtain standard training samples. The standard labeling platform may perform labeling through an expert. The auxiliary labeling platforms learn from the standard training samples to acquire expert knowledge, and then label other exemplary standard speech segments to obtain auxiliary training samples. An auxiliary labeling platform may perform labeling through a non-expert who has first learned the expert knowledge.
That is, the training sample set generating method provided by the embodiment of the present invention may be: generate standard training samples through the standard labeling platform and use them as learning samples to train a plurality of auxiliary labeling platforms; generate auxiliary training samples through the auxiliary labeling platforms; and finally take the standard training samples and the auxiliary training samples together as the training sample set. This method expands the training sample set well and gives it diversity: on the one hand, the training sample set reflects expert knowledge; on the other hand, non-experts participate in sample labeling, so that the trained tone color tag recognition model better matches the perception of non-experts and has better applicability. In addition, the number of samples in the training sample set can be greatly expanded, so that individual labeling errors become negligible in a statistical sense, and the bias that arises when labeling relies on expert knowledge alone can be avoided.
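The two-stage construction described above can be sketched as follows; the platform objects, method names and labels are hypothetical stand-ins:

```python
def build_training_set(standard_samples, auxiliary_platforms):
    """Combine expert-labeled standard samples with samples labeled by
    auxiliary platforms that first learned from the standard samples."""
    training_set = list(standard_samples)
    for platform in auxiliary_platforms:
        platform.learn(standard_samples)  # learn expert knowledge
        training_set.extend(platform.label_new_segments())
    return training_set

class ToyPlatform:
    """Hypothetical auxiliary labeling platform."""
    def __init__(self, segments):
        self.segments = segments
    def learn(self, examples):
        self.examples = examples          # stand-in for actual training
    def label_new_segments(self):
        return [(seg, "label") for seg in self.segments]

standard = [("seg_a.wav", "sweet"), ("seg_b.wav", "deep")]
platforms = [ToyPlatform(["seg_c.wav"]), ToyPlatform(["seg_d.wav"])]
dataset = build_training_set(standard, platforms)
print(len(dataset))  # 4
```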
According to the technical scheme of this embodiment, the speech segment to be recognized of the speaker is acquired and divided into audio sub-segments corresponding to a plurality of time points, forming a basic audio sub-feature for each audio sub-segment; the basic audio sub-features corresponding to the time points are combined in time order to obtain the basic audio feature; the basic audio feature is input into a pre-trained tone color tag recognition model, and the tone color tag recognition results output by the model for each tone color tag system are acquired; and the tone color tag recognition results are combined to obtain the tone color recognition result of the speaker. This solves the problem of multi-dimensional tone color label recognition, realizes accurate multi-dimensional tone color recognition of audio, and improves the flexibility of tone color description; it also improves the relevance among the dimensions, thereby improving the stability and credibility of the tone color recognition result while reducing the calculation amount of the model.
Example III
Fig. 3 is a flowchart of a live broadcast room classification method according to a third embodiment of the present invention. This embodiment builds on the above embodiments and defines a specific implementation flow of the live broadcast room classification method. As shown in fig. 3, the method of this embodiment may include:
step 310, obtaining the voice fragments to be recognized respectively corresponding to the anchor in each live broadcasting room to be classified.
The voice of an anchor during any time period of a live broadcast in the live broadcast room may be intercepted and used as the speech segment to be recognized.
Step 320, adopting the tone color recognition method provided by any embodiment of the present invention to obtain tone color recognition results corresponding to each anchor.
By way of example, the tone color recognition method provided by any embodiment of the present invention may be used to perform tone color recognition on the speech segment to be recognized of an anchor, so as to obtain the primary tone color label, the first auxiliary tone color label and the second auxiliary tone color label of the anchor as the tone color recognition result corresponding to that anchor.
Step 330, adding each tone color recognition result as the anchor tone color description label of the corresponding live broadcast room.
For example, the primary tone color label, the first auxiliary tone color label and the second auxiliary tone color label of an anchor may together serve as the anchor tone color description label. The tone color labels may be displayed in different colors to distinguish them. The anchor tone color description label may be added to a list page at the position corresponding to the live broadcast room. The list page may contain information on a plurality of live broadcast rooms, with the information of each live broadcast room displayed in a list. Through the live broadcast room information in the list page, such as the anchor tone color description labels, users can conveniently enter different live broadcast rooms.
Step 340, classifying each live broadcast room according to its anchor tone color description label.
Specifically, the anchor tone color description labels may serve as the basis for classifying live broadcast rooms, so that users watching live broadcasts can quickly learn the tone color style of an anchor from the different anchor tone color description labels and enter a live broadcast room of interest. For example, a user who likes a positive tone color can, through the anchor tone color description labels, enter the classified list of live broadcast rooms corresponding to the positive tone color, and freely select one or more live broadcast rooms of interest in that category to watch.
In an optional implementation of the embodiment of the present invention, after classifying each live broadcast room according to the anchor tone color description labels, the method further includes: screening out the high-frequency access live broadcast rooms corresponding to a target audience according to the target audience's historical live broadcast room access records; determining the target audience's high-frequency access tone color labels under each tone color tag system according to the anchor tone color description labels of the high-frequency access live broadcast rooms; acquiring, from the classified live broadcast room types, the target live broadcast room types that hit each high-frequency access tone color label; and acquiring at least one target recommended live broadcast room corresponding to the target live broadcast room types and providing it to the target audience.
A high-frequency access live broadcast room is a live broadcast room frequently accessed by the target audience. Specifically, it may be a live broadcast room the target audience has accessed more than a preset number of times; alternatively, it may be a live broadcast room whose rank, when the rooms in the target audience's historical access records are ordered by descending access count, falls within a preset top proportion (such as 10%). The anchor tone color description labels of the high-frequency access live broadcast rooms may be used as the target audience's high-frequency access tone color labels, and at least one target recommended live broadcast room of the target live broadcast room types corresponding to those labels may then be provided to the target audience. Recommending live broadcast rooms of the same tone color type through the anchor tone color description labels expands the range of live broadcast rooms a user watches, lets the user watch more live broadcast rooms of the same type, and improves the user experience.
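The threshold-based variant of the screening rule can be sketched as follows (the record format and the visit threshold are assumptions for illustration):

```python
from collections import Counter

def high_frequency_rooms(access_records, min_visits=3):
    """Return live broadcast rooms a viewer visited at least `min_visits`
    times, based on the viewer's historical access records."""
    counts = Counter(access_records)
    return [room for room, n in counts.items() if n >= min_visits]

history = ["room1", "room2", "room1", "room1", "room3", "room2", "room1"]
print(high_frequency_rooms(history))  # ['room1']
```

The top-proportion variant would instead sort `counts.most_common()` and keep the leading preset fraction (e.g. 10%) of rooms.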
According to the technical scheme of this embodiment, the speech segments to be recognized corresponding to the anchors of the live broadcast rooms to be classified are acquired; the tone color recognition method provided by any embodiment of the present invention is used to obtain the tone color recognition result corresponding to each anchor; each tone color recognition result is added as the anchor tone color description label of the corresponding live broadcast room; and the live broadcast rooms are classified according to the anchor tone color description labels. This solves the problem of classifying live broadcast rooms by tone color and diversifies live broadcast room classification, so that users can follow live broadcast rooms according to the anchor tone color description labels, quickly learn the type of a live broadcast room, and conveniently enter a live broadcast room of interest directly.
Example IV
Fig. 4 is a schematic structural diagram of a tone color recognition device according to a fourth embodiment of the present invention. As shown in fig. 4, the tone color recognition apparatus includes: the basic audio feature extraction module 410, the tone label recognition result acquisition module 420 and the tone label recognition result combination module 430. Wherein:
the basic audio feature extraction module 410 is configured to obtain a to-be-identified voice segment of a speaker, and extract basic audio features of the to-be-identified voice segment;
the tone color tag identification result obtaining module 420 is configured to obtain a joint attribute feature corresponding to the basic audio feature according to tone color tag relevance among a plurality of tone color tag systems, and obtain a tone color tag identification result of the joint attribute feature under each tone color tag system according to tone color tag specificity among each tone color tag system;
and a tone color tag recognition result combination module 430, configured to combine the tone color tag recognition results to obtain a tone color recognition result of the speaker.
Optionally, the tone color tag identification result obtaining module 420 is specifically configured to:
inputting the basic audio characteristics into a pre-trained tone color tag recognition model, and acquiring tone color tag recognition results which are output by the tone color tag recognition model and respectively correspond to each tone color tag system;
The tone color tag identification model comprises a plurality of tone color tag identification sub-modules, each tone color tag identification sub-module comprises a coding layer and at least one output layer which are sequentially connected, the tone color tag identification sub-modules share the same coding layer, and the coding layer is connected with the input end of the tone color tag identification model and is used for outputting joint attribute characteristics corresponding to basic audio characteristics;
the tone color tag recognition sub-modules are associated with the tone color tag system, and the final output layer of each tone color tag recognition sub-module is used for outputting tone color tag recognition results of the basic audio features under the tone color tag system.
Optionally, the basic audio feature extraction module 410 includes:
the basic audio sub-feature forming unit is used for dividing the voice fragment to be recognized into audio sub-fragments corresponding to a plurality of time points respectively and forming basic audio sub-features corresponding to each audio sub-fragment respectively;
the basic audio feature generation unit is used for combining basic audio sub-features corresponding to each time point respectively according to a time sequence to obtain basic audio features;
the coding layer is specifically configured to respectively calculate joint attribute sub-features corresponding to each basic audio sub-feature in the basic audio features, and statistically average the joint attribute sub-features to obtain a joint attribute feature.
Optionally, the coding layer specifically includes: the system comprises a gating circulation unit network layer, a deep neural network layer and a statistical average summarizing layer which are connected in sequence;
the gating circulation unit network layer comprises a plurality of bidirectional gating circulation units which are connected in sequence.
Optionally, the device further includes:
the training sample set construction module is used for constructing a training sample set before inputting the basic audio characteristics into the pre-trained tone color tag identification model, wherein the training samples in the training sample set comprise: standard voice fragments and marking tone labels of the standard voice fragments under each tone label system;
and the tone color tag recognition model training module is used for training a preset machine learning model by using the training sample set to obtain a tone color tag recognition model.
Optionally, the training sample set construction module includes:
the learning sample acquisition unit is used for acquiring standard training samples marked by at least one standard marking platform and providing each standard training sample to a plurality of auxiliary marking platforms to serve as learning samples;
the training sample set forming unit is used for obtaining the auxiliary training samples labeled by each auxiliary labeling platform after referring to the learning samples, and forming the training sample set from the standard training samples and the auxiliary training samples.
The tone color recognition device provided by the embodiment of the invention can execute the tone color recognition method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
Example five
Fig. 5 is a schematic structural diagram of a sorting device for a live broadcast room in a fifth embodiment of the present invention. As shown in fig. 5, the live room sorting apparatus includes: a anchor voice clip acquisition module 510, an anchor timbre identification module 520, an anchor timbre description tag addition module 530 and a live room classification module 540. Wherein:
the anchor voice segment obtaining module 510 is configured to obtain to-be-identified voice segments corresponding to anchors in each live broadcasting room to be classified respectively;
the anchor tone recognition module 520 is configured to recognize and obtain tone recognition results corresponding to each anchor by using the tone recognition method provided by any embodiment of the present invention;
a main broadcast tone color description tag adding module 530, configured to add each tone color recognition result as a main broadcast tone color description tag in each live broadcast room;
and the live broadcast room classification module 540 is used for classifying each live broadcast room according to the main broadcast tone color description label.
Optionally, the device further includes:
the high-frequency access live broadcast room screening module is used for screening out the high-frequency access live broadcast rooms corresponding to the target audience according to the target audience's historical live broadcast room access records, after each live broadcast room has been classified according to the anchor tone color description labels;
The high-frequency access tone color tag determining module is used for determining the high-frequency access tone color tags of target audiences under each tone color tag system according to the anchor tone color description tags of each high-frequency access live broadcasting room;
the target live broadcasting room type acquisition module is used for acquiring target live broadcasting room types hitting each high-frequency access tone label from the classified live broadcasting room types;
and the target recommendation live broadcast room providing module is used for acquiring at least one target recommendation live broadcast room corresponding to the type of the target live broadcast room and providing the target audience with the target recommendation live broadcast room.
The live broadcasting room classifying device provided by the embodiment of the invention can execute the live broadcasting room classifying method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the executing method.
Example six
Fig. 6 is a schematic structural diagram of a computer device according to a sixth embodiment of the present invention, and as shown in fig. 6, the computer device includes a processor 60, a memory 61, an input device 62 and an output device 63; the number of processors 60 in the computer device may be one or more, one processor 60 being taken as an example in fig. 6; the processor 60, the memory 61, the input means 62 and the output means 63 in the computer device may be connected by a bus or by other means, in fig. 6 by way of example.
The memory 61 is used as a computer readable storage medium for storing software programs, computer executable programs, and modules, such as program instructions/modules corresponding to the tone color recognition method or the live broadcast room classification method in the embodiment of the present invention (for example, the basic audio feature extraction module 410, the tone color tag recognition result acquisition module 420, and the tone color tag recognition result combination module 430 in the tone color recognition apparatus, or the anchor voice clip acquisition module 510, the anchor tone color recognition module 520, the anchor tone color description tag addition module 530, and the live broadcast room classification module 540 in the live broadcast room classification apparatus). The processor 60 executes various functional applications of the computer device and data processing, namely, implements the above-described tone color recognition method or live room classification method, by running software programs, instructions, and modules stored in the memory 61:
acquiring a voice fragment to be recognized of a speaker, and extracting basic audio characteristics of the voice fragment to be recognized;
acquiring joint attribute characteristics corresponding to basic audio characteristics according to tone label relevance among a plurality of tone label systems, and acquiring tone label identification results of the joint attribute characteristics under each tone label system according to tone label specificity among the tone label systems;
And combining the identification results of the tone color labels to obtain the tone color identification result of the speaker.
Or,
acquiring voice fragments to be recognized, which correspond to the anchor in each live broadcasting room to be classified respectively;
by adopting the tone color recognition method provided by any embodiment of the invention, the tone color recognition results corresponding to the anchor are obtained through recognition;
adding each tone color identification result as a main broadcasting tone color description tag in each live broadcasting room;
and classifying each live broadcasting room according to the main broadcasting tone color description label.
The memory 61 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, at least one application program required for functions; the storage data area may store data created according to the use of the terminal, etc. In addition, the memory 61 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some examples, memory 61 may further comprise memory remotely located relative to processor 60, which may be connected to the computer device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 62 is operable to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the computer device. The output 63 may comprise a display device such as a display screen.
Example seven
The seventh embodiment of the invention also discloses a computer storage medium, on which a computer program is stored, which when executed by a processor, implements the above tone color recognition method or live broadcasting room classification method:
acquiring a voice fragment to be recognized of a speaker, and extracting basic audio characteristics of the voice fragment to be recognized;
acquiring joint attribute characteristics corresponding to basic audio characteristics according to tone label relevance among a plurality of tone label systems, and acquiring tone label identification results of the joint attribute characteristics under each tone label system according to tone label specificity among the tone label systems;
and combining the identification results of the tone color labels to obtain the tone color identification result of the speaker.
Or,
acquiring voice fragments to be recognized, which correspond to the anchor in each live broadcasting room to be classified respectively;
by adopting the tone color recognition method provided by any embodiment of the invention, the tone color recognition results corresponding to the anchor are obtained through recognition;
Adding each tone color identification result as a main broadcasting tone color description tag in each live broadcasting room;
and classifying each live broadcasting room according to the main broadcasting tone color description label.
The computer storage media of embodiments of the invention may take the form of any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. The computer readable storage medium may be, for example, but not limited to: an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present invention may be written in one or more programming languages, including object oriented programming languages such as Java, Smalltalk and C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
Note that the above is only a preferred embodiment of the present invention and the technical principle applied. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, while the invention has been described in connection with the above embodiments, the invention is not limited to the embodiments, but may be embodied in many other equivalent forms without departing from the spirit or scope of the invention, which is set forth in the following claims.

Claims (12)

1. A tone color recognition method, comprising:
acquiring a voice fragment to be recognized of a speaker, and extracting basic audio characteristics of the voice fragment to be recognized;
acquiring joint attribute characteristics corresponding to basic audio characteristics according to tone label relevance among a plurality of tone label systems, and acquiring tone label identification results of the joint attribute characteristics under each tone label system according to tone label specificity among the tone label systems;
combining the identification results of the tone color labels to obtain the tone color identification result of the speaker;
Wherein a tone color tag system is a system that characterizes tone color from a specified dimension; a tone color tag system is either a tag system of a primary tone color or a tag system of an auxiliary tone color, the auxiliary tone color being associated with the primary tone color; the joint attribute feature is the concrete embodiment, under the basic audio feature, of the tone color tag relevance among the plurality of tone color tag systems; and tone color tag specificity is the exclusive manifestation among the tone color tags of the respective dimensions.
2. The method of claim 1, wherein acquiring the joint attribute feature corresponding to the basic audio feature according to the timbre tag relevance among the plurality of timbre tag systems, and acquiring the timbre tag recognition result of the joint attribute feature under each timbre tag system according to the timbre tag specificity among the timbre tag systems, comprises:
inputting the basic audio feature into a pre-trained timbre tag recognition model, and acquiring the timbre tag recognition results output by the timbre tag recognition model for each timbre tag system respectively;
wherein the timbre tag recognition model comprises a plurality of timbre tag recognition sub-modules, each comprising an encoding layer and at least one output layer connected in sequence; the sub-modules share the same encoding layer, which is connected to the input of the timbre tag recognition model and outputs the joint attribute feature corresponding to the basic audio feature;
each timbre tag recognition sub-module is associated with one timbre tag system, and the final output layer of each sub-module outputs the timbre tag recognition result of the basic audio feature under that timbre tag system.
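The shared-encoder, multi-head arrangement of claims 1-2 can be sketched in miniature: one encoding function produces the joint attribute feature, and a separate output head per timbre tag system scores that shared feature. The tag systems, toy weights, and function names below are illustrative assumptions, not the patent's actual model:

```python
# Minimal sketch of a shared-encoder multi-task tag classifier: one
# "encoding layer" yields a joint attribute feature, and one output
# head per timbre tag system classifies that shared feature.
# Tag systems and weights here are made-up placeholders.

def shared_encoder(base_audio_feature):
    """Toy 'encoding layer': a fixed projection shared by all heads."""
    return [sum(base_audio_feature) / len(base_audio_feature),
            max(base_audio_feature) - min(base_audio_feature)]

TAG_SYSTEMS = {
    # hypothetical tag systems: a primary timbre and one auxiliary timbre
    "primary": ["deep", "bright"],
    "auxiliary": ["warm", "sharp", "neutral"],
}

def head_scores(joint_feature, n_tags):
    # toy per-head linear scoring; a real model would learn these weights
    return [sum(w * x for w, x in zip([i + 1, -(i + 1)], joint_feature))
            for i in range(n_tags)]

def recognize(base_audio_feature):
    joint = shared_encoder(base_audio_feature)       # shared encoding layer
    result = {}
    for system, tags in TAG_SYSTEMS.items():         # one head per tag system
        scores = head_scores(joint, len(tags))
        result[system] = tags[scores.index(max(scores))]
    return result

result = recognize([0.2, 0.9, 0.4])
```

Sharing the encoder is what lets the tag-relevance across systems live in one joint feature, while the per-system heads keep the tag-specific decisions separate.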
3. The method of claim 2, wherein extracting the basic audio feature of the speech segment to be recognized comprises:
dividing the speech segment to be recognized into audio sub-segments corresponding to a plurality of time points, and forming a basic audio sub-feature for each audio sub-segment;
combining the basic audio sub-features of the respective time points in temporal order to obtain the basic audio feature;
wherein the encoding layer is specifically configured to compute a joint attribute sub-feature for each basic audio sub-feature in the basic audio feature, and to statistically average the joint attribute sub-features to obtain the joint attribute feature.
4. The method according to claim 3, wherein the encoding layer specifically comprises a gated recurrent unit (GRU) network layer, a deep neural network layer, and a statistical averaging layer connected in sequence;
wherein the GRU network layer comprises a plurality of bidirectional gated recurrent units connected in sequence.
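The pooling step of claims 3-4 can be sketched as follows: per-time-point joint attribute sub-features are statistically averaged into one fixed-length joint attribute feature, so speech of any length maps to a fixed vector. The per-frame "encoder" here is a placeholder, not the patent's GRU-plus-DNN stack:

```python
# Sketch of statistical-average pooling over per-time-point sub-features.
# A variable-length sample sequence becomes a fixed-length joint feature.
# The per-frame encoding ([mean, peak]) is a stand-in for a learned layer.

def frame_sub_features(samples, frame_len):
    """Split a sample sequence into frames and emit one toy sub-feature each."""
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    # placeholder per-frame encoding: [mean, peak] of each frame
    return [[sum(f) / len(f), max(f)] for f in frames if f]

def statistical_average(sub_features):
    """Average each dimension across time -> one fixed-length joint feature."""
    dims = len(sub_features[0])
    n = len(sub_features)
    return [sum(sf[d] for sf in sub_features) / n for d in range(dims)]

subs = frame_sub_features([0.1, 0.3, 0.2, 0.6, 0.4, 0.8], frame_len=2)
joint = statistical_average(subs)
```

Averaging over time is what makes the downstream output layers independent of segment length.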
5. The method of any one of claims 2-4, further comprising, before inputting the basic audio feature into the pre-trained timbre tag recognition model:
constructing a training sample set, wherein each training sample in the training sample set comprises a standard speech segment and its annotated timbre tag under each timbre tag system;
and training a preset machine learning model with the training sample set to obtain the timbre tag recognition model.
6. The method of claim 5, wherein constructing the training sample set comprises:
obtaining standard training samples annotated by at least one standard annotation platform, and providing each standard training sample to a plurality of auxiliary annotation platforms as a learning sample;
and acquiring auxiliary training samples annotated by the auxiliary annotation platforms with reference to the learning samples, and forming the training sample set from the standard training samples and the auxiliary training samples.
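The sample-set construction of claims 5-6 can be sketched as a merge of two annotation sources: standard samples are labeled first, handed to auxiliary platforms as references, and the combined output forms the training set. Platform names and the stand-in annotation function are invented for illustration:

```python
# Sketch of claims 5-6: standard annotated samples serve as learning
# references for auxiliary annotation platforms, and both pools are
# merged into one training sample set. All names are hypothetical.

standard_samples = [
    # (speech_segment_id, {tag_system: annotated timbre tag})
    ("seg_001", {"primary": "deep", "auxiliary": "warm"}),
]

def annotate_on_platform(platform, learning_samples):
    """Stand-in for an auxiliary platform labeling with the references in view."""
    return [(f"{platform}_{seg}", tags) for seg, tags in learning_samples]

def build_training_set(standard, platforms):
    auxiliary = []
    for p in platforms:
        # each auxiliary platform labels with the standard set as reference
        auxiliary.extend(annotate_on_platform(p, standard))
    return standard + auxiliary

train_set = build_training_set(standard_samples, ["platform_a", "platform_b"])
```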
7. A live broadcast room classification method, comprising:
acquiring the speech segment to be recognized of the anchor of each live broadcast room to be classified;
recognizing the timbre recognition result of each anchor by the method of any one of claims 1-6;
adding each timbre recognition result to the corresponding live broadcast room as an anchor timbre description tag;
and classifying the live broadcast rooms according to the anchor timbre description tags.
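Claim 7's grouping step can be sketched directly: each room carries its anchor's timbre recognition result as a description tag, and rooms are grouped by tag. Room IDs and tag strings are made-up examples:

```python
# Sketch of claim 7: group live rooms by their anchor timbre description
# tag. The rooms and tags below are hypothetical placeholders.
from collections import defaultdict

anchor_timbre = {            # room_id -> anchor timbre description tag
    "room_1": "deep/warm",
    "room_2": "bright/sharp",
    "room_3": "deep/warm",
}

def classify_rooms(room_tags):
    categories = defaultdict(list)
    for room, tag in room_tags.items():
        categories[tag].append(room)     # one category per timbre tag
    return dict(categories)

categories = classify_rooms(anchor_timbre)
```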
8. The method of claim 7, further comprising, after classifying the live broadcast rooms according to the anchor timbre description tags:
screening out the high-frequency visited live broadcast rooms of a target audience member according to the target audience member's historical live broadcast room visit records;
determining the target audience member's high-frequency timbre tags under each timbre tag system according to the anchor timbre description tags of the high-frequency visited live broadcast rooms;
acquiring, from the classified live broadcast room categories, a target live broadcast room category that hits each high-frequency timbre tag;
and acquiring at least one target recommended live broadcast room of the target live broadcast room category and providing it to the target audience member.
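Claim 8's recommendation flow can be sketched in a few lines: derive the viewer's high-frequency timbre tag from visit history, then recommend unvisited rooms from the matching category. The history, tags, and room names are hypothetical:

```python
# Sketch of claim 8: most-frequent timbre tag in the viewer's visit
# history selects a room category; unvisited rooms in that category
# become recommendations. All data below is invented for illustration.
from collections import Counter

visit_history = ["room_1", "room_3", "room_1", "room_2"]   # viewer's visits
room_tag = {"room_1": "deep/warm", "room_2": "bright/sharp", "room_3": "deep/warm"}
categories = {"deep/warm": ["room_1", "room_3", "room_5"],
              "bright/sharp": ["room_2", "room_4"]}

def recommend(history, room_tag, categories, top_n=1):
    # high-frequency tag = most common tag among visited rooms
    tag_counts = Counter(room_tag[r] for r in history)
    hot_tag = tag_counts.most_common(1)[0][0]
    # recommend rooms of that category the viewer has not visited yet
    return [r for r in categories[hot_tag] if r not in set(history)][:top_n]

recs = recommend(visit_history, room_tag, categories)
```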
9. A timbre recognition apparatus, comprising:
a basic audio feature extraction module, configured to acquire a speech segment to be recognized of a speaker and extract a basic audio feature of the speech segment to be recognized;
a timbre tag recognition result acquisition module, configured to acquire a joint attribute feature corresponding to the basic audio feature according to the timbre tag relevance among a plurality of timbre tag systems, and to acquire a timbre tag recognition result of the joint attribute feature under each timbre tag system according to the timbre tag specificity among the timbre tag systems;
a timbre tag recognition result combination module, configured to combine the timbre tag recognition results to obtain a timbre recognition result of the speaker;
wherein a timbre tag system is a system that characterizes timbre from a specified dimension; each timbre tag system is either a tag system of a primary timbre or a tag system of an auxiliary timbre, and the auxiliary timbre is associated with the primary timbre; the joint attribute feature is the concrete embodiment, under the basic audio feature, of the timbre tag relevance among the multiple timbre tag systems; the timbre tag specificity is the mutual exclusivity among the timbre tags of the respective dimensions.
10. A live broadcast room classification apparatus, comprising:
an anchor speech segment acquisition module, configured to acquire the speech segment to be recognized of the anchor of each live broadcast room to be classified;
an anchor timbre recognition module, configured to recognize the timbre recognition result of each anchor by the method of any one of claims 1-6;
an anchor timbre description tag adding module, configured to add each timbre recognition result to the corresponding live broadcast room as an anchor timbre description tag;
and a live broadcast room classification module, configured to classify the live broadcast rooms according to the anchor timbre description tags.
11. A computer device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the timbre recognition method of any one of claims 1-6 or the live broadcast room classification method of any one of claims 7-8.
12. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the timbre recognition method of any one of claims 1-6 or the live broadcast room classification method of any one of claims 7-8.
CN202110662233.8A 2021-06-15 2021-06-15 Tone recognition, live broadcast room classification method, device, computer equipment and medium Active CN113282509B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110662233.8A CN113282509B (en) 2021-06-15 2021-06-15 Tone recognition, live broadcast room classification method, device, computer equipment and medium


Publications (2)

Publication Number Publication Date
CN113282509A CN113282509A (en) 2021-08-20
CN113282509B true CN113282509B (en) 2023-11-10

Family

ID=77284530

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110662233.8A Active CN113282509B (en) 2021-06-15 2021-06-15 Tone recognition, live broadcast room classification method, device, computer equipment and medium

Country Status (1)

Country Link
CN (1) CN113282509B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114025176A (en) * 2021-08-24 2022-02-08 广州方硅信息技术有限公司 Anchor recommendation method and device, electronic equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101665882B1 (en) * 2015-08-20 2016-10-13 한국과학기술원 Apparatus and method for speech synthesis using voice color conversion and speech dna codes
CN109348035A (en) * 2018-11-23 2019-02-15 东莞市步步高通信软件有限公司 A kind of recognition methods of telephone number and terminal device
CN110335622A (en) * 2019-06-13 2019-10-15 平安科技(深圳)有限公司 Voice frequency tone color separation method, apparatus, computer equipment and storage medium



Similar Documents

Publication Publication Date Title
US10937413B2 (en) Techniques for model training for voice features
CN107797984B (en) Intelligent interaction method, equipment and storage medium
WO2021062990A1 (en) Video segmentation method and apparatus, device, and medium
JP6785904B2 (en) Information push method and equipment
JP2020034895A (en) Responding method and device
US10854189B2 (en) Techniques for model training for voice features
CN109271533A (en) A kind of multimedia document retrieval method
CN108305618B (en) Voice acquisition and search method, intelligent pen, search terminal and storage medium
CN108153875B (en) Corpus processing method and device, intelligent sound box and storage medium
CN111428078B (en) Audio fingerprint coding method, device, computer equipment and storage medium
CN110263218A (en) Video presentation document creation method, device, equipment and medium
CN113282509B (en) Tone recognition, live broadcast room classification method, device, computer equipment and medium
CN111816170A (en) Training of audio classification model and junk audio recognition method and device
CN113259763B (en) Teaching video processing method and device and electronic equipment
CN114065720A (en) Conference summary generation method and device, storage medium and electronic equipment
CN115512692B (en) Voice recognition method, device, equipment and storage medium
CN111344664B (en) Electronic apparatus and control method thereof
CN111681680B (en) Method, system, device and readable storage medium for acquiring audio frequency by video recognition object
CN112380871A (en) Semantic recognition method, apparatus, and medium
KR20210054157A (en) Apparatus and method for producing conference record
CN113836932A (en) Interaction method, device and system, and intelligent device
WO2020068858A1 (en) Technicquest for language model training for a reference language
CN114339355B (en) Event detection model training method, system, electronic equipment and storage medium
CN111681679B (en) Video object sound effect searching and matching method, system, device and readable storage medium
EP3671735B1 (en) Method and system for determining speaker-user of voice-controllable device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant