CN112995696A

CN112995696A - Live broadcast room violation detection method and device

Info

Publication number: CN112995696A
Application number: CN202110424189.7A
Authority: CN
Inventors: 魏海巍; 王伟伟; 刘凯
Original assignee: Gongdao Network Technology Co ltd
Current assignee: Gongdao Network Technology Co ltd
Priority date: 2021-04-20
Filing date: 2021-04-20
Publication date: 2021-06-18
Anticipated expiration: 2041-04-20
Also published as: CN112995696B

Abstract

The application provides a method and a device for detecting violations in a live broadcast room. The method comprises the following steps: acquiring a live broadcast data stream of a target live broadcast room, the live data stream comprising a video stream and an audio stream corresponding to the video stream; segmenting the video stream to obtain a plurality of video clips; extracting features from the video clips, and determining whether the extracted features match the anchor features corresponding to the target live broadcast room; and if so, converting the audio clip corresponding to the video clip into text information, and performing violation detection on the target live broadcast room based on the text information. In this technical scheme, determining the video clips in the live data stream that contain the anchor features makes violation detection more targeted; and performing violation detection on the audio clips corresponding to those video clips, by converting what the anchor says during live-stream selling into text information, improves the accuracy of violation detection.

Description

Live broadcast room violation detection method and device

Technical Field

The application relates to the technical field of communication, in particular to a live broadcast room violation detection method and device.

Background

Currently, with the booming popularity of live broadcast platforms, selling goods through live broadcast has become a popular sales mode.

Generally, the anchor introduces the basic information, functional characteristics, discount schemes, and the like of the commodities during live broadcasting. However, in order to attract customers, the anchor may engage in illegal activities such as exaggerated publicity, illegal sales promotion, and even fraud or the sale of counterfeit products.

Such illegal acts infringe the rights and interests of consumers and disturb the market order, and must be handled in time. However, because live broadcasting is so popular, a large number of live broadcast rooms exist, and live broadcasts can take place at any time. Manual supervision alone therefore can hardly meet current supervision requirements.

Disclosure of Invention

In view of this, the present application provides a method and an apparatus for detecting violations in a live broadcast room, in which a violation in the live broadcast room is detected by determining whether the anchor has committed a violation.

Specifically, the method is realized through the following technical scheme:

in a first aspect, the present application provides a method for detecting a live broadcast violation, where the method includes:

acquiring a live broadcast data stream of a target live broadcast room; the live data stream comprises a video stream and an audio stream corresponding to the video stream;

carrying out segmentation processing on the video stream to obtain a plurality of video segments;

extracting features from the video clips, and determining whether the extracted features are matched with anchor features corresponding to the target live broadcast room;

and if so, converting the audio clip corresponding to the video clip into text information, and carrying out violation detection on the target live broadcast room based on the text information.

In a second aspect, the present application further provides a device for detecting violation in a live broadcast room, where the device includes:

the acquisition unit is used for acquiring a live broadcast data stream of a target live broadcast room; the live data stream comprises a video stream and an audio stream corresponding to the video stream;

the segmentation unit is used for segmenting the video stream to obtain a plurality of video segments;

the matching unit is used for extracting features from the video clips and determining whether the extracted features are matched with the anchor features corresponding to the target live broadcast room;

and the detection unit is used for converting the audio clip corresponding to the video clip into text information when the extracted features are matched with the anchor features corresponding to the target live broadcast room, and carrying out violation detection on the target live broadcast room based on the text information.

As can be seen from the above technical scheme, the video clip containing the anchor features is determined by extracting features from the video stream of the target live broadcast room and matching them against the anchor features corresponding to the target live broadcast room; further, the audio clip corresponding to that video clip is converted into text information, and violation detection is performed on the target live broadcast room based on the text information. The anchor, as the person responsible for the live broadcast room, is accountable for the live broadcast content, so determining the video clips in the live data stream that contain the anchor features makes violation detection more targeted. Further, because live-stream selling is characterized by the anchor promoting goods verbally, the audio clips carry far more information than the video clips; the application therefore performs violation detection on the audio clips corresponding to the video clips determined above, converting what the anchor says during live-stream selling into text information, which improves the accuracy of violation detection.

Drawings

Fig. 1 is a flowchart illustrating a live broadcast room violation detection method according to an exemplary embodiment of the present application;

Fig. 2 is a flowchart illustrating another live broadcast room violation detection method according to an exemplary embodiment of the present application;

Fig. 3 is a block diagram illustrating a live broadcast room violation detection apparatus according to an exemplary embodiment of the present application.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.

It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present application. The word "if" as used herein may be interpreted as "upon … …" or "when … …" or "in response to a determination", depending on the context.

When violation detection is carried out in the live broadcast room, different detection strategies can be adopted according to different live broadcast contents.

For example, for lifestyle or outdoor live broadcast rooms, the video frames of the video stream in the live broadcast data stream may be inspected to determine whether illegal content appears in the live broadcast picture.

For another example, for an emotion-type or chat-type live broadcast room, the audio stream in the live broadcast data stream may be recognized and converted into text content for detection, to determine whether the anchor says illegal content during the live broadcast.

Live-stream selling is a new live broadcast type that combines the characteristics of the two kinds of live broadcast rooms above: while displaying a commodity, the anchor promotes it, introduces its functional characteristics, discounts, and so on, interacts with the audience of the live broadcast room through the barrage (bullet comments), and answers the audience's questions. In addition, the anchor may invite guests to help with promotion or ask assistants to help show the merchandise.

When promoting goods, the anchor mainly introduces them to the audience of the live broadcast room by speaking, so the important information is all contained in the audio stream. If the video pictures are audited, illegal content is hard to detect from them, because the video stream mainly consists of pictures of people talking and pictures of commodities on display. However, if the whole audio stream is converted into text content for auditing, then besides the anchor's commodity promotion, which is the content of interest, the audio stream also contains many useless conversations, background noise, and other content irrelevant to violation judgment, and processing this irrelevant content reduces detection efficiency.

Therefore, for the new live broadcast mode of live-stream selling, continuing to use the existing detection strategies cannot meet supervision requirements. Manual inspection can improve detection accuracy, but it is inefficient and cannot meet the need to monitor a massive number of live broadcast rooms.

In view of this, the present application provides a technical solution for converting an audio clip corresponding to a video clip containing anchor characteristics into text information in a targeted manner, and performing violation detection on a live broadcast room based on the text information.

In implementation, when violation detection is performed on a target live broadcast room, the live broadcast data stream of the target live broadcast room can be obtained; the live data stream may include a video stream and an audio stream corresponding to the video stream.

After the live broadcast data stream of the target live broadcast room is obtained, the video stream can be processed in a segmented mode to obtain a plurality of video segments;

for example, the audio stream may be segmented by a voice segmentation technique, an audio segment with voice is retained, and the video stream is segmented according to a timestamp range corresponding to the retained audio segment to obtain a plurality of video segments.

Features can then be extracted from the video segments, and it is determined whether the extracted features match the anchor features corresponding to the target live broadcast room; for example, the features extracted from the video segment may be face features, which are identified by a face recognition technology to determine whether they match the face features of the anchor corresponding to the target live broadcast room.

If the features extracted from the video clip are matched with the anchor features corresponding to the target live broadcast room, indicating that an anchor picture appears in the video clip, then further converting the audio clip corresponding to the video clip into text information, and carrying out violation detection on the target live broadcast room based on the text information;

for example, when the features extracted from the video clip are matched with the anchor features corresponding to the target live broadcast room, the audio clip corresponding to the video clip can be converted into text information according to a voice recognition technology; the text information is subjected to word segmentation to obtain a plurality of keywords, the keywords are matched with violation keywords in a preset violation keyword library, and whether the target live broadcast room violates rules or not is determined according to a matching result.
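The keyword-matching step described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the application's implementation: the keyword library contents, the names `tokenize` and `find_violations`, and the simple regular-expression tokenizer are all hypothetical, and real Chinese text would need a proper word segmenter in place of `tokenize`.

```python
import re

# Hypothetical violation keyword library; a real deployment would load this
# from a maintained store rather than hard-code it.
VIOLATION_KEYWORDS = {"replica", "counterfeit", "guaranteed cure"}


def tokenize(text):
    """Split text into lowercase word tokens (a stand-in for word segmentation)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def find_violations(text, library=VIOLATION_KEYWORDS):
    """Return the sorted list of library keywords that occur in the text.

    Multi-word keywords are matched against runs of consecutive tokens.
    """
    tokens = tokenize(text)
    hits = set()
    for keyword in library:
        kw_tokens = tokenize(keyword)
        n = len(kw_tokens)
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == kw_tokens:
                hits.add(keyword)
                break
    return sorted(hits)
```

A non-empty result would then be treated as a matching result indicating a possible violation; how many hits are required is left open here, as it is in the text.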

In the above technical scheme, the video clip containing the anchor features can be determined by extracting features from the video stream of the target live broadcast room and matching them against the anchor features corresponding to the target live broadcast room; further, the audio clip corresponding to that video clip can be converted into text information, and violation detection is performed on the target live broadcast room based on the text information. The anchor, as the person responsible for the live broadcast room, is accountable for the live broadcast content, so determining the video clips in the live data stream that contain the anchor features makes violation detection more targeted. Further, because live-stream selling is characterized by the anchor promoting goods verbally, the audio clips carry far more information than the video clips; the application therefore performs violation detection on the audio clips corresponding to the video clips determined above, converting what the anchor says during live-stream selling into text information, which improves the accuracy of violation detection.

Next, examples of the present application will be described in detail.

Referring to Fig. 1, Fig. 1 is a flowchart illustrating a live broadcast room violation detection method according to an exemplary embodiment of the present application, where the method includes the following steps:

step 101: acquiring a live broadcast data stream of a target live broadcast room; the live data stream comprises a video stream and an audio stream corresponding to the video stream;

step 102: carrying out segmentation processing on the video stream to obtain a plurality of video segments;

step 103: extracting features from the video clips, and determining whether the extracted features are matched with anchor features corresponding to the target live broadcast room;

step 104: and if so, converting the audio clip corresponding to the video clip into text information, and carrying out violation detection on the target live broadcast room based on the text information.
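As a rough sketch, steps 103 and 104 can be read as a filter followed by a check over the segments produced by steps 101 and 102. The `Segment` type and the three callables below (`matches_anchor`, `transcribe`, `is_violation`) are hypothetical placeholders for the feature matching, speech recognition, and text detection components; none of these names come from the application.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Segment:
    start: float   # seconds on the shared time axis
    end: float
    video: object  # placeholder for the video data of the segment
    audio: object  # placeholder for the corresponding audio data


def detect_room_violations(
    segments: Iterable[Segment],
    matches_anchor: Callable[[object], bool],
    transcribe: Callable[[object], str],
    is_violation: Callable[[str], bool],
) -> List[Segment]:
    """Return the segments in which the anchor appears and the speech is flagged."""
    flagged = []
    for seg in segments:
        # Step 103: skip segments whose features do not match the anchor.
        if not matches_anchor(seg.video):
            continue
        # Step 104: convert the corresponding audio to text, then check the text.
        text = transcribe(seg.audio)
        if is_violation(text):
            flagged.append(seg)
    return flagged
```

Passing the components in as callables keeps the sketch independent of any particular face recognition or speech recognition library.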

In this embodiment, when violation detection is performed on the target live broadcast room, the live broadcast data stream of the target live broadcast room may be obtained through the live broadcast platform, and the video stream and the audio stream corresponding to the video stream may then be parsed from the live broadcast data stream.

The video stream and the audio stream correspond to each other and have the same time axis, so that the picture in the video and the sound in the audio can be synchronized.

In addition, the live broadcast data stream of the live broadcast room can be obtained in real time, to realize uninterrupted detection of the live broadcast room; the live broadcast room can be randomly spot-checked, with the live broadcast data stream of a specific time period obtained to perform sampled violation detection; or the live broadcast data stream can be obtained periodically at a preset interval, to realize periodic detection of the live broadcast room. This specification does not specifically limit how the live data stream is acquired; in practical applications, a person skilled in the art can select different acquisition modes according to the live broadcast platform's detection requirements for different live broadcast rooms.

For example, for a live broadcast room with a high possibility of violation or with frequent violations, stricter detection can be achieved by increasing the acquisition frequency of the live broadcast data stream; for a live broadcast room with a low possibility of violation or few violations, frequently acquiring and detecting the live broadcast data stream would waste a great deal of work, so a looser requirement can be adopted and met through sampling detection.
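One way to picture this choice is a small policy function that maps a room's violation history to an acquisition mode. The thresholds, the use of a past-violation count as the risk signal, and the mapping of the highest-risk rooms to real-time acquisition are all assumptions made for illustration; the text leaves the selection to the practitioner.

```python
from enum import Enum


class AcquisitionMode(Enum):
    REAL_TIME = "real_time"    # uninterrupted detection
    PERIODIC = "periodic"      # acquire at a preset interval
    SPOT_CHECK = "spot_check"  # random sampling of specific time periods


def choose_mode(past_violations, high_threshold=5, low_threshold=1):
    """Pick an acquisition mode from a room's past violation count.

    Both thresholds are hypothetical tuning parameters.
    """
    if past_violations >= high_threshold:
        return AcquisitionMode.REAL_TIME
    if past_violations >= low_threshold:
        return AcquisitionMode.PERIODIC
    return AcquisitionMode.SPOT_CHECK
```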

In practical applications, the anchor's violations are usually not continuous and typically occur only in certain specific periods of the live broadcast; the live data stream of a live broadcast room may therefore contain a large number of invalid video or audio segments that do not help violation detection. In this case, performing violation detection frame by frame on the video stream and the audio stream in the live data stream would clearly reduce detection efficiency.

For example, in the live broadcasting process, if the anchor does not speak in a certain time period but looks at the barrage of the live broadcasting room, the audio clip corresponding to the time period has no voice, only background noise and no effect on violation detection; therefore, if the audio stream in the live broadcast data stream is subjected to the frame-by-frame violation detection, it is obvious that the detection efficiency of the violation detection is affected by the need to process a large number of audio frames that do not have any effect on the violation detection.

In this embodiment, in order to improve the efficiency of violation detection, before violation detection is performed based on the live broadcast data stream of the live broadcast room, the video stream and the audio stream in the obtained live broadcast data stream may be segmented, and violation detection is then performed on the video segments and audio segments obtained from the segmentation.

Note that, a specific segmentation method for performing segmentation processing on a video stream and an audio stream in an acquired live data stream is not particularly limited in this specification.

Live-stream selling is characterized in that the anchor can interact with the audience in real time and answer their questions while displaying the goods from every angle, so the audio portion of the live broadcast data stream that contains the anchor's voice signal is an important detection object during violation detection.

Thus, in one illustrated embodiment, the video stream in the live data stream may be segmented based on the effective audio segments in the live data stream that contain the anchor's voice signal.

In implementation, VAD (Voice Activity Detection) technology can be used to detect the boundaries at which voice appears and disappears in the audio signal corresponding to the audio stream in the live data stream, so that the voice signals uttered by the anchor can be distinguished from the various background noise signals in the audio stream.

After distinguishing the voice signal sent by the anchor in the audio stream from various background noise signals, an audio segment containing the voice signal sent by the anchor in the audio stream can be determined as an effective audio segment, and then the audio stream and the video stream are respectively segmented based on the time stamp range corresponding to the determined effective audio segment to obtain a plurality of effective audio segments containing the voice signal sent by the anchor and video segments corresponding to the effective audio segments.
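The boundary detection and timestamp-range segmentation described above can be illustrated with a deliberately simple energy-threshold detector. This is a stand-in technique chosen for brevity, not the VAD method of the application (which is not specified): the function name `energy_vad`, the 20 ms frame length, and the RMS threshold are all assumptions, and production systems typically use trained or statistical VAD models.

```python
import math


def energy_vad(samples, sample_rate, frame_ms=20, threshold=0.1):
    """Return (start_sec, end_sec) ranges whose frame RMS energy exceeds threshold.

    samples: sequence of floats in [-1.0, 1.0]; a trailing partial frame is ignored.
    """
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    n_frames = len(samples) // frame_len
    ranges = []
    start = None
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        t = i * frame_len / sample_rate
        if rms > threshold and start is None:
            start = t                  # boundary where voice appears
        elif rms <= threshold and start is not None:
            ranges.append((start, t))  # boundary where voice disappears
            start = None
    if start is not None:
        # Voice was still active at the end of the stream.
        ranges.append((start, n_frames * frame_len / sample_rate))
    return ranges
```

Because the audio stream and the video stream share one time axis, the returned ranges can be applied to both streams to cut out the effective audio segments and their corresponding video segments.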

It should be noted that the detailed process of performing VAD (voice activity detection) on the audio stream in the live data stream is not described in detail in this specification, and those skilled in the art may refer to the description in the related art.

In addition, a face detection technology can be used for detecting whether a face appears in a video frame corresponding to a video stream in the live broadcast data stream, and the video stream is divided into a part with the face and a part without the face.

When the part of the video stream containing the face is distinguished, the audio stream and the video stream can be respectively segmented based on the corresponding timestamp range when the face appears, so as to obtain the video segments containing the face and the audio segments corresponding to the video segments.
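Assuming a face detector has already produced one boolean per video frame (the detector itself is outside this sketch), the timestamp ranges in which a face appears can be obtained by grouping consecutive frames. The function name `face_ranges` and the per-frame flag representation are hypothetical.

```python
from itertools import groupby


def face_ranges(face_flags, fps):
    """Group consecutive frames flagged as containing a face into time ranges.

    face_flags: iterable of booleans, one per video frame, in frame order.
    Returns a list of (start_sec, end_sec) tuples.
    """
    ranges = []
    index = 0
    for has_face, run in groupby(face_flags):
        length = len(list(run))
        if has_face:
            ranges.append((index / fps, (index + length) / fps))
        index += length
    return ranges
```

The resulting ranges can then be used to cut both the video stream and the audio stream, since the two share the same time axis.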

After the video stream is subjected to the segmentation processing, video segments which do not have any effect on violation detection in the live data stream can be removed, and a plurality of video segments to be detected are reserved.

It should be noted that, besides the anchor, guests or assistants of the anchor may also appear in the live broadcast; therefore, in the pictures of the video segments obtained after the segmentation process, the anchor may appear, and guests or assistants may also appear.

As the person responsible for the live broadcast room, the anchor needs to follow the rules of the live broadcast platform, take responsibility for the live broadcast content, and also keep the words and conduct of the other participants in the live broadcast room in check; therefore, when detecting violations in the live broadcast room, in order to further improve detection efficiency, attention may be focused on the video segments in which the anchor appears, with the anchor as the main detection object.

In this embodiment, in order to further determine a video segment containing anchor information, after a video stream is segmented to obtain a plurality of video segments, features may be extracted from each video segment, and it is determined whether the extracted features match anchor features corresponding to a target live broadcast room, so that the video segment containing the anchor features is determined as a video segment with an anchor.

The anchor characteristics corresponding to the target live broadcast room can be anchor characteristics recorded by a live broadcast platform when the anchor applies for the live broadcast room, and the characteristics can be uniformly stored in an anchor characteristic library.

Specifically, the machine learning model may be trained in advance based on a plurality of feature data samples labeled with anchor, and the trained machine learning model may be used as a model for anchor feature matching of features extracted from a video segment.

For the features extracted from the video segment, the features can be input into the anchor feature matching model, so as to obtain the result of feature matching.

In one illustrated embodiment, the features may include human face features.

Further, a video frame can be extracted from the video segment, and facial features can be extracted from the video frame; and determining whether the extracted face features are matched with the anchor face features corresponding to the target live broadcast room.

Specifically, the anchor feature matching model may be an anchor face feature matching model: a machine learning model may be trained in advance on a plurality of face feature samples labeled with anchors, and the trained model is used to perform anchor face feature matching on the face features extracted from video frames.

For the face features extracted from the video frame, the face features can be input into the anchor face feature matching model, so that the result of face feature matching extracted from the video frame is obtained.

For example, when a plurality of persons appear in a video clip, extracting a video frame for the video clip and extracting facial features from the video frame; and matching the extracted multiple face features with the anchor face features corresponding to the target live broadcast room, and determining whether the anchor face features exist in the multiple face features extracted from the video frame.
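The multi-face case can be sketched as an "any face matches" test. The application describes a trained matching model; the version below substitutes plain cosine similarity against a stored anchor feature vector, which is an assumption made for illustration, as are the names `cosine_similarity` and `anchor_in_frame` and the 0.8 threshold.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def anchor_in_frame(face_features, anchor_feature, threshold=0.8):
    """Return True if any face feature vector in the frame matches the anchor.

    face_features: list of feature vectors, one per face found in the frame.
    threshold: hypothetical similarity cut-off; it would need tuning in practice.
    """
    return any(
        cosine_similarity(f, anchor_feature) >= threshold for f in face_features
    )
```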

When the characteristics of a plurality of video segments are matched and the video segments containing the anchor characteristics are determined, in order to improve the efficiency of finding out illegal behaviors during violation detection, the video segments with higher possibility of illegal behaviors in the plurality of video segments can be determined, and priority characteristic matching is performed.

For example, when watching a live broadcast, viewers are easily stirred by the anchor's mood and induced by the anchor into impulsive consumption. For instance, when selling goods, the anchor often offers time-limited discounts; as the countdown is about to end, the anchor tends to stress how large the discount is and emphasize the functions of the goods, giving consumers the feeling that not buying means losing out.

In the above process, in order to sell more products, the anchor may engage in exaggerated publicity, violations of advertising law, and the like. Therefore, large fluctuations in the anchor's emotion are likely to be accompanied by violations, and during violation detection priority should be given to whether the anchor's emotion fluctuates greatly, so as to improve the accuracy of violation detection in the live broadcast room.

Thus, in one illustrated embodiment, intonation emotion recognition may be performed on the effective audio segment; when the recognized emotion hits a preset emotion, feature matching can be performed preferentially on the video segment corresponding to that effective audio segment.

The preset emotions may include strong emotions such as excitement, anger, sadness, and the like.

Specifically, a machine learning model may be trained in advance on a plurality of voice feature data samples labeled with different emotions, and the trained model may be used to perform intonation emotion recognition on the effective audio segment.

For the effective audio segment, the sound features in the effective audio segment can be extracted first, and then the sound features are input to the speech emotion recognition model, so that a result of the speech emotion recognition is obtained.

And subsequently, if the recognized emotion hits the preset emotion, preferentially performing feature matching on the video segment corresponding to the effective audio segment meeting the condition.

For example, suppose there are two video segments: in the first the anchor calmly introduces merchandise, and in the second the anchor loudly urges viewers to hurry up and place an order. When intonation emotion recognition is performed on the effective audio segments corresponding to the two video segments, it can be determined that the emotion recognized for the second segment hits a preset emotion, so feature matching needs to be performed on the second video segment first.
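The prioritization in this example amounts to a stable reordering of the segments. In the sketch below, `recognize_emotion` is a hypothetical callable standing in for the emotion recognition model, and the preset emotion labels simply reuse the ones listed in the text.

```python
PRESET_EMOTIONS = {"excitement", "anger", "sadness"}


def prioritize_segments(segments, recognize_emotion, preset=PRESET_EMOTIONS):
    """Order segments so those whose recognized emotion hits a preset emotion come first.

    sorted() is stable, so the original order is preserved within each group.
    """
    return sorted(segments, key=lambda seg: recognize_emotion(seg) not in preset)
```

For instance, with the segments `["calm_intro", "urgent_pitch"]` and a recognizer that labels the second one `"excitement"`, the second segment is moved to the front and is feature-matched first.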

In addition, after the video segments containing the anchor characteristics are determined, the intonation emotion recognition can be carried out according to the audio segments corresponding to the video segments; when the recognized emotion hits a preset emotion, subsequent processing can be preferentially performed on the video clip.

After the characteristics in the video clip are matched with the anchor characteristics corresponding to the target live broadcast room, the video clip corresponding to the anchor can be determined. Further, violation detection needs to be performed on the video segment to determine whether a violation occurs in the video segment.

In the live-stream selling scenario, the anchor promotes goods mainly through speech, so detection based only on the pictures in the video would miss a great deal of important information. The focus of violation detection for live-stream selling is therefore the audio information corresponding to the video, and violation detection is performed on the anchor's speech.

In this embodiment, if the features extracted from the video clip are matched with the anchor features corresponding to the target live broadcast room, the audio clip corresponding to the video clip is converted into text information, and violation detection is performed on the target live broadcast room based on the text information.

Specifically, ASR (Automatic Speech Recognition) technology may be used: the audio segment corresponding to the video segment is input into a preset speech recognition model, and the text information corresponding to the audio segment is obtained as the output.

Furthermore, text content identification can be carried out on the obtained text information, whether illegal contents appear in the words spoken by the anchor is detected, and therefore whether illegal behaviors happen in the target live broadcast room is determined.

For example, suppose that an audio clip of the anchor is input into a preset speech recognition model for text conversion, and the resulting text information is "high-quality replica handbags, absolutely indistinguishable from the real thing, enjoy luxury at a low price … …"; content identification based on the text information recognizes that the anchor is selling high-imitation products, so it can be determined that a violation has occurred in the live broadcast room.

After the matching of the human face features of the anchor, if the features extracted from the video clip are matched with the anchor features corresponding to the target live broadcast room, which indicates that the anchor is present in the video clip, the audio clip corresponding to the video clip is further converted into text information, and violation detection is performed on the target live broadcast room based on the text information; and if not, indicating that the anchor is not present in the video segment.

However, when the anchor is not present in the video clip, it is still possible to introduce the merchandise and there is a possibility of violation. Therefore, when the violation detection in the live broadcast room needs to adopt a stricter detection strategy, in order to expand the detection range, the video clip without the occurrence of the anchor can be further detected.

In one embodiment, if the extracted features do not match the anchor features corresponding to the target live broadcast room, voiceprint features are further extracted from the effective audio clip corresponding to the video clip; it is determined whether the extracted voiceprint features match the anchor voiceprint features corresponding to the target live broadcast room; if so, the effective audio clip corresponding to the video clip is converted into text information, and violation detection is performed on the target live broadcast room based on the text information.

Specifically, when the features extracted from the video clip are not matched with the anchor features corresponding to the target live broadcast room, the voiceprint features can be further extracted from the effective audio clip corresponding to the video clip.

A machine learning model can be trained in advance on a number of voiceprint feature samples labeled with the anchor they belong to, and the trained model is then used as the model for matching extracted voiceprint features against the anchor voiceprint features.

For the voiceprint features extracted from the effective audio clip, the voiceprint features can be input into the anchor voiceprint feature matching model, so that a voiceprint feature matching result is obtained.

And if the extracted voiceprint features are matched with the corresponding anchor voiceprint features of the target live broadcast room, determining that the anchor is speaking although the anchor does not appear in the video segment. In order to meet a stricter detection strategy, an effective audio clip corresponding to the video clip needs to be converted into text information, and live broadcast violation detection is performed on a target live broadcast room based on the text information.

For example, after a video frame is extracted from a certain video segment and facial features are extracted from the video frame, if the extracted facial features do not match with the anchor facial features corresponding to the target live broadcast room, it is indicated that no anchor appears in the video segment.

Further, in order to satisfy a stricter detection strategy, voiceprint features can be extracted from the effective audio clip corresponding to the video clip and input into the preset anchor voiceprint feature matching model to determine whether the voice heard in the video clip is the anchor's voice.

If the sound appearing in the video segment is the sound of the anchor, it is shown that the anchor is speaking although the anchor is not present in the video segment. Obviously, the effective audio segment corresponding to the video segment needs to be converted into text information, and live broadcast violation detection needs to be performed on the target live broadcast room based on the text information.

When the violation detection is performed on the video segment containing the anchor characteristics, the effective audio segment corresponding to the video segment needs to be converted into text information, and the live violation detection is performed on the target live broadcast room based on the text information.

While promoting merchandise, the anchor may talk continuously in order to hold the audience's attention, or may use a large volume of content to interfere with the audience's judgment of the merchandise so that viewers place orders without careful consideration. In that case, the text information converted from the audio clip in the above process is too long and too cluttered, which reduces the efficiency of violation detection on the text information.

In one embodiment, word segmentation is performed on the text information to obtain a plurality of keywords; the keywords are each matched against the violation keywords in a preset violation keyword library, wherein each violation keyword is labeled with a corresponding violation type. If a keyword matches any violation keyword in the violation keyword library, the violation type corresponding to that violation keyword is determined as the violation type of the target live broadcast room.

Specifically, word segmentation processing can be performed on continuous text information according to a preset word segmentation rule to obtain a plurality of keywords; the preset word segmentation rule may be a dictionary set by a person skilled in the art as required.

For example, assuming that the text information is "XX milk powder is produced with the world-leading latest technology and is the top-level milk powder available at this price", the following keywords can be obtained based on the preset word segmentation rule: XX milk powder, world leading, latest technology, top level.

Matching the keywords with the violation keywords in a preset violation keyword library respectively; the rule violation keyword library can be composed of keywords of multiple rule violation types, and the rule violation keywords are labeled with corresponding rule violation types respectively.

And if the keywords obtained from the text information are respectively matched with the illegal keywords in the illegal keyword library, determining the illegal type corresponding to the matched keywords as the illegal type of the target live broadcast room.

Continuing the example, the keywords "XX milk powder, world leading, latest technology, top level" obtained from the text information are each matched against the violation keywords in the preset violation keyword library. In the preset keyword library, "world leading", "latest technology" and "top level" are labeled as terms restricted by the advertising law, so by using these words in promotion the anchor violates the restriction on advertising terms.
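The milk powder example above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation of the application: the dictionary contents, the violation type labels, and the function names `segment` and `detect_violation_types` are all assumptions, and a simple dictionary scan stands in for whatever word segmentation rule is actually used.

```python
# Hypothetical segmentation dictionary and violation keyword library.
SEGMENTATION_DICTIONARY = ["XX milk powder", "world leading", "latest technology", "top level"]
VIOLATION_LIBRARY = {
    "world leading": "restricted advertising term",
    "latest technology": "restricted advertising term",
    "top level": "restricted advertising term",
}


def segment(text, dictionary):
    """Return the dictionary terms found in the text, in order of first appearance."""
    lowered = text.lower()
    hits = []
    for term in dictionary:
        position = lowered.find(term.lower())
        if position != -1:
            hits.append((position, term))
    return [term for _, term in sorted(hits)]


def detect_violation_types(keywords, library):
    """Map matched keywords to the set of violation types they are labeled with."""
    return {library[keyword] for keyword in keywords if keyword in library}


text = ("XX milk powder is produced with the world leading latest technology "
        "and is the top level milk powder available at this price")
keywords = segment(text, SEGMENTATION_DICTIONARY)
violation_types = detect_violation_types(keywords, VIOLATION_LIBRARY)
```

An empty result from `detect_violation_types` would mean that no keyword hit the library, which is the case where keyword matching alone cannot decide.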

Besides the use of restricted advertising terms mentioned above, the violation types also include selling prohibited goods, using prohibited words, making false publicity, and the like.

By performing word segmentation on the text information, keywords can be obtained and matched against the violation keyword library; the more dictionary entries the preset word segmentation rule contains, the more keywords the word segmentation yields.

However, there is a limit to the extent of improving the violation detection accuracy by increasing keywords.

For example, keyword matching cannot identify whether statements such as "XX star strongly recommends XX milk powder" or "XX star endorses XX milk powder" are violations. If the star has in fact never endorsed or promoted the brand, the anchor's statement constitutes false publicity.

In this case, semantics in the text information can be recognized by NLU (Natural Language Understanding), and the accuracy of violation detection can be further improved.

In one embodiment, the text information is semantically recognized; and performing live broadcast violation detection on the target live broadcast room based on the semantic recognition result.

Continuing with the above example, semantic recognition reveals that the anchor is using XX star to promote merchandise. Further, when the anchor mentions the star, matching can be performed against a preset relationship between stars and the brands they endorse. When the brand does not match, the anchor is determined to have committed a false publicity violation.
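The endorsement check in this example can be illustrated as follows. This is a hedged sketch only: a regular expression stands in for real semantic recognition (NLU), and the names `ENDORSEMENTS`, `CLAIM_PATTERN`, `find_false_endorsements`, "Star A", "Brand X" and "Brand Y" are hypothetical placeholders rather than anything defined by the application.

```python
import re

# Hypothetical preset relationship between stars and the brands they endorse.
ENDORSEMENTS = {"Star A": {"Brand X"}}

# A regular expression used here as a stand-in for semantic recognition.
CLAIM_PATTERN = re.compile(
    r"(?P<star>Star \w+) (?:strongly recommends|endorses) (?P<brand>Brand \w+)"
)


def find_false_endorsements(text, endorsements):
    """Return (star, brand) pairs claimed in the text that the preset relationship does not support."""
    unsupported = []
    for match in CLAIM_PATTERN.finditer(text):
        star, brand = match.group("star"), match.group("brand")
        if brand not in endorsements.get(star, set()):
            unsupported.append((star, brand))
    return unsupported


false_claims = find_false_endorsements("Star A strongly recommends Brand Y", ENDORSEMENTS)
true_claims = find_false_endorsements("Star A endorses Brand X", ENDORSEMENTS)
```

A non-empty list corresponds to the case where the mentioned brand is not matched and the anchor is treated as having made false publicity.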

After the anchor violation is determined, the violation can be recorded and the target live broadcast room can be scored.

In one embodiment, if the target live broadcast room is determined to have violation, recording violation data to a violation database; wherein the violation data comprises a number of violations and/or the violation type; scoring the live broadcast room based on violation data corresponding to the target live broadcast room recorded in the violation database; wherein the scoring mechanism comprises a deduction or accumulation of a score; limiting the authority of the anchor when the score is below or above a threshold.

Specifically, when it is determined that a violation has occurred in the target live broadcast room, the timestamp at which the anchor committed the violation, the number of violations, and the violation type corresponding to each of the anchor's violations may be recorded in the violation database.
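A minimal sketch of such a violation database, using Python's built-in `sqlite3` module, might look like the following. The table layout, column names, and helper functions are assumptions made for illustration; the application does not specify a storage schema.

```python
import sqlite3


def open_violation_db(path=":memory:"):
    """Open (or create) a violation database with one row per detected violation."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS violations ("
        "room_id TEXT NOT NULL, occurred_at REAL NOT NULL, violation_type TEXT NOT NULL)"
    )
    return conn


def record_violation(conn, room_id, occurred_at, violation_type):
    """Record the timestamp and type of one violation for a live broadcast room."""
    conn.execute(
        "INSERT INTO violations VALUES (?, ?, ?)", (room_id, occurred_at, violation_type)
    )
    conn.commit()


def violation_summary(conn, room_id):
    """Return the number of violations of each type recorded for a room."""
    rows = conn.execute(
        "SELECT violation_type, COUNT(*) FROM violations "
        "WHERE room_id = ? GROUP BY violation_type",
        (room_id,),
    ).fetchall()
    return dict(rows)


conn = open_violation_db()
record_violation(conn, "room-1", 12.5, "false publicity")
record_violation(conn, "room-1", 48.0, "false publicity")
record_violation(conn, "room-1", 90.0, "restricted advertising term")
summary = violation_summary(conn, "room-1")
```

Keeping one row per violation, rather than only a running count, preserves the timestamps needed if a detection result is later rechecked.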

When the occurrence of the violation of the anchor is detected, an alarm prompt may be generated in the live broadcast room, where the alarm prompt may include a type corresponding to the violation of the anchor.

Once the violation database has been established, a scoring model for the live broadcast room can be built on it and used to score the live broadcast room; alternatively, when the result of violation detection is disputed, the violation detection system can perform a recheck based on the historical data recorded in the violation database.

The live broadcast room is scored according to the violation data corresponding to the target live broadcast room recorded in the violation database, which includes the number of violations and/or the violation type.

It is worth noting that the score corresponding to each violation may be the same, or may be increased according to the increase of the number of times; different violation types may be assigned different scores depending on the severity of the violation.

Further, the scoring mechanism may be a decreasing score or an increasing score.

When the scoring mechanism is a deduction system, a corresponding score is deducted each time a violation occurs; when the score falls below a threshold, the authority of the anchor is restricted.

For example, assuming that each live broadcast room starts with a score of 100 points, 10 points may be deducted the first time the anchor makes a false promotion, and 15 points may be deducted the second time, after a warning has been given; when the score falls below 60 points, the traffic of the live broadcast room or the authority of the anchor may be restricted.

When the scoring mechanism is an accumulation system, a corresponding score is added each time a violation occurs; when the score rises above a threshold, the authority of the anchor is restricted.

For example, assuming that each live broadcast room starts with a score of 0 points, 10 points are added the first time the anchor makes a false promotion, and 15 points are added the second time, after a warning has been given; when the score rises above 60 points, the traffic of the live broadcast room or the authority of the anchor may be restricted.
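The two scoring mechanisms can be sketched with a small class. This is an illustrative sketch under stated assumptions: the class name `RoomScore`, the penalty schedule of 10 and then 15 points (with the last value reused for later violations), and the threshold of 60 are taken from or extrapolated from the examples above and are not prescribed by the application.

```python
class RoomScore:
    """Track a live broadcast room score under a deduction or accumulation mechanism."""

    def __init__(self, mode, threshold=60, penalties=(10, 15)):
        if mode not in ("deduct", "accumulate"):
            raise ValueError("mode must be 'deduct' or 'accumulate'")
        self.mode = mode
        self.threshold = threshold
        self.penalties = penalties  # assumed schedule; the last value repeats
        self.violations = 0
        self.score = 100 if mode == "deduct" else 0

    def record_violation(self):
        """Apply the penalty for the next violation and report whether authority is restricted."""
        index = min(self.violations, len(self.penalties) - 1)
        penalty = self.penalties[index]
        self.violations += 1
        self.score += -penalty if self.mode == "deduct" else penalty
        return self.restricted

    @property
    def restricted(self):
        """Below the threshold restricts in deduction mode; above it restricts in accumulation mode."""
        if self.mode == "deduct":
            return self.score < self.threshold
        return self.score > self.threshold


deducting = RoomScore("deduct")
deduct_results = [deducting.record_violation() for _ in range(4)]

accumulating = RoomScore("accumulate")
accumulate_results = [accumulating.record_violation() for _ in range(5)]
```

With these assumed values, the deduction room moves through 90, 75, 60 and 45 and is restricted on the fourth violation, while the accumulation room moves through 10, 25, 40, 55 and 70 and is restricted on the fifth.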

The authority of the anchor may include the ability to add product purchase links to the live broadcast interface, the permitted live broadcast duration, and the like, which is not limited in this application.

In the above technical solution, features are extracted from the video stream of the target live broadcast room and matched against the anchor features corresponding to that room, so as to determine the video clips that contain the anchor features; the audio clip corresponding to each such video clip is then converted into text information, and violation detection is performed on the target live broadcast room based on the text information. Because the anchor is the person responsible for the live broadcast room and for its content, determining the video clips in the live data stream that contain the anchor features makes violation detection more targeted. Furthermore, because live-stream selling is characterized by the anchor's spoken promotion, the audio clip carries far more information than the video clip; this application therefore performs violation detection on the audio clips corresponding to the determined video clips, converting what the anchor says while selling goods into text information, which improves the accuracy of violation detection.

Referring to fig. 2, fig. 2 is a flowchart illustrating another live broadcast violation detection method according to an exemplary embodiment of the present application.

As shown in fig. 2, in an embodiment, a live broadcast violation detection method includes the following steps:

S201: acquiring a live broadcast data stream of a target live broadcast room;

wherein the live data stream includes a video stream and an audio stream corresponding to the video stream.

S202: determining effective audio segments and the corresponding video segments by VAD (voice activity detection).

Specifically, VAD (voice activity detection) is performed on the audio stream in the obtained live data stream to determine the effective audio segments of the audio stream that contain voice signals; the video stream is then segmented based on the timestamp range corresponding to each determined effective audio segment, to obtain the video segment corresponding to that effective audio segment.

For example, by determining an audio segment of a valid spoken utterance in a live data stream, a corresponding video segment can be determined based on a timestamp of the audio segment.
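As a rough illustration of step S202, the sketch below derives timestamp ranges from per-frame audio energies. It is a simplification under stated assumptions: a plain energy threshold stands in for a real VAD algorithm, and the function name `voiced_ranges`, the frame length, and the threshold value are hypothetical.

```python
def voiced_ranges(frame_energies, frame_seconds, threshold):
    """Return (start, end) timestamp ranges, in seconds, of consecutive frames at or above the threshold."""
    ranges = []
    start_index = None
    for index, energy in enumerate(frame_energies):
        if energy >= threshold and start_index is None:
            start_index = index  # a voiced run begins
        elif energy < threshold and start_index is not None:
            ranges.append((start_index * frame_seconds, index * frame_seconds))
            start_index = None  # the voiced run has ended
    if start_index is not None:
        # The stream ended while still inside a voiced run.
        ranges.append((start_index * frame_seconds, len(frame_energies) * frame_seconds))
    return ranges


# Six frames of 0.5 s each; frames 1-2 and frame 5 contain speech-level energy.
segments = voiced_ranges([0, 5, 6, 0, 0, 7], frame_seconds=0.5, threshold=1)
```

Each returned range can then be used to cut the video stream at the same timestamps, which yields the video segment corresponding to the effective audio segment.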

S203: and determining the video clip which is preferentially matched according to the intonation emotion recognition result of the effective audio clip.

Specifically, the intonation emotion recognition is carried out based on the effective audio segments; and when the recognized emotion hits preset emotion, preferentially performing feature matching on the video segment corresponding to the effective audio segment.

For example, for a plurality of effective audio segments, when a certain effective audio segment intonation emotion recognition result hits a preset emotion, the video segments corresponding to the effective audio segments are subjected to priority matching.
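The priority ordering in step S203 can be sketched as a stable sort. The emotion labels, the dictionary layout of each segment, and the function name `prioritize` are assumptions for illustration; the emotion recognition itself is taken as already done.

```python
def prioritize(segments, preset_emotions):
    """Order segments so that those whose recognized emotion hits a preset emotion come first.

    Python's sort is stable, so segments keep their original relative order within each group.
    """
    return sorted(segments, key=lambda segment: segment["emotion"] not in preset_emotions)


segments = [
    {"id": 1, "emotion": "calm"},
    {"id": 2, "emotion": "excited"},
    {"id": 3, "emotion": "calm"},
    {"id": 4, "emotion": "urgent"},
]
ordered_ids = [segment["id"] for segment in prioritize(segments, {"excited", "urgent"})]
```

Segments that do not hit a preset emotion are not dropped in this sketch; they are simply matched later.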

S204: and matching the face features in the video frames with the anchor face features.

Specifically, video frames are extracted from the video clips, and human face features are extracted from the video frames; and determining whether the extracted face features are matched with the anchor face features corresponding to the target live broadcast room.

For example, the person speaking in a video segment may be a guest or an assistant rather than the anchor, who is the focus of attention, in which case the segment may be ignored. Therefore, the video segments in which the anchor appears can be determined from the face features, and video segments unrelated to the anchor can be filtered out.

S205: determining whether the number of video frames in the video clip that match the anchor feature corresponding to the target live broadcast room is greater than a threshold.

If yes, go to step S208;

for example, when a plurality of persons appear in the video segment, a plurality of video frames may be extracted from the video segment, and when the number of matched video frames is greater than a threshold value, the video segment is determined to be the video segment to be detected.

If not, step S206 is performed.
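Steps S204 and S205 can be sketched as counting the frames whose face features match the anchor. This is a hedged illustration: face features are represented as plain embedding vectors compared by cosine similarity, which is one common choice and not necessarily the matching model of the application, and the names `cosine_similarity` and `anchor_present` and all threshold values are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def anchor_present(frame_embeddings, anchor_embedding, similarity_threshold, frame_count_threshold):
    """Return True when the number of frames matching the anchor is greater than the threshold."""
    matching_frames = sum(
        1
        for embedding in frame_embeddings
        if cosine_similarity(embedding, anchor_embedding) >= similarity_threshold
    )
    return matching_frames > frame_count_threshold


anchor = [1.0, 0.0]
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # two frames resemble the anchor, one does not
go_to_s208 = anchor_present(frames, anchor, similarity_threshold=0.9, frame_count_threshold=1)
```

When the function returns False, the flow continues with voiceprint extraction in step S206 instead of going directly to step S208.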

S206: and extracting the voiceprint features from the effective audio clip corresponding to the video clip.

Specifically, if the extracted features are not matched with the anchor features corresponding to the target live broadcast room, voiceprint features are further extracted from an effective audio clip corresponding to the video clip.

For example, when the anchor is not present in the screen, there is still a possibility that an introduction is being made to the merchandise and there is a possibility that there is a violation. Therefore, when a stricter strategy is required to be adopted for the violation detection of the live broadcast room, the detection range can be expanded, and the segments without the anchor in the picture can be further detected.

S207: and determining whether the extracted voiceprint features are matched with the anchor voiceprint features corresponding to the target live broadcast room.

If so, indicating that the anchor is not in the picture but speaking, step S208 is performed.
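The branching across steps S205 to S208 can be summarized in a small decision function. This is only a sketch: the function name `needs_text_detection` and the `strict` flag are assumptions, and the voiceprint check is passed in as a callable so that it only runs when the face match has failed.

```python
def needs_text_detection(face_matched, voiceprint_matches, strict=True):
    """Decide whether the audio clip should be converted into text (step S208).

    face_matched: result of the face feature check (steps S204 and S205).
    voiceprint_matches: zero-argument callable that runs the voiceprint check (steps S206 and S207).
    strict: whether the stricter strategy that also covers off-screen speech is enabled.
    """
    if face_matched:
        return True
    if not strict:
        return False
    return bool(voiceprint_matches())


voiceprint_calls = []


def fake_voiceprint_check():
    """Stand-in for a real voiceprint matching model; records that it was called."""
    voiceprint_calls.append("called")
    return True


on_screen = needs_text_detection(True, fake_voiceprint_check)
calls_after_on_screen = len(voiceprint_calls)
off_screen = needs_text_detection(False, fake_voiceprint_check)
lenient = needs_text_detection(False, fake_voiceprint_check, strict=False)
```

Deferring the voiceprint check in this way avoids running the second model for clips where the anchor is already visible.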

S208: and converting the audio clip corresponding to the video clip into text information.

S209: and performing word segmentation on the text information to obtain keywords, and matching the keywords with the rule-breaking keyword library.

Specifically, word segmentation processing is carried out on the text information to obtain a plurality of keywords;

matching the keywords with the violation keywords in a preset violation keyword library respectively; and marking the violation keywords with corresponding violation types respectively.

And if the keywords are matched with any illegal keyword in the illegal keyword library, determining the illegal type corresponding to the illegal keyword as the illegal type of the target live broadcast room.

For example, when the text information is too long, word segmentation processing may be performed on the text information to obtain keywords and match the keywords with a preset keyword library. And when the keyword is matched, determining the violation type corresponding to the keyword.

S210: and performing semantic recognition on the text information.

Specifically, the detection accuracy can be further improved by adding semantic recognition.

S211: and judging whether the target live broadcast room violates rules or not based on the text information identification result.

Specifically, whether the target live broadcast room is illegal or not can be judged according to a matching result of the acquired keywords and a preset illegal keyword library and/or a semantic recognition result.

If so, go to step S212.

S212: the live room is scored based on data recorded in the violation database.

Specifically, if the target live broadcast room is determined to have violation behaviors, violation data is recorded into a violation database;

wherein the violation data comprises a number of violations and/or the violation type;

scoring the live broadcast room based on violation data corresponding to the target live broadcast room recorded in the violation database;

wherein the scoring mechanism comprises a deduction or accumulation of a score;

for example, when a violation is detected in the target live broadcast room, data related to the violation, including a violation timestamp, a number of violations, and a type of violation, may be recorded in the violation database. Meanwhile, the live room may be scored based on historical records in the violation database.

S213: determining whether the current score of the live broadcast room is below or above a threshold.

If not, executing S214 and outputting an alarm prompt to the live broadcast room.

If so, S215 is executed to limit the authority of the anchor.

For example, when the scoring mechanism is a deduction system, a corresponding score is deducted each time a violation occurs, and the authority of the anchor is restricted when the score falls below the threshold; when the scoring mechanism is an accumulation system, a corresponding score is added each time a violation occurs, and the authority of the anchor is restricted when the score rises above the threshold.

As can be seen from the above embodiment, features are extracted from the video stream of the target live broadcast room and matched against the anchor features corresponding to that room, so as to determine the video clips that contain the anchor features; the audio clip corresponding to each such video clip is then converted into text information, and violation detection is performed on the target live broadcast room based on the text information. Because the anchor is the person responsible for the live broadcast room and for its content, determining the video clips in the live data stream that contain the anchor features makes violation detection more targeted. Furthermore, because live-stream selling is characterized by the anchor's spoken promotion, the audio clip carries far more information than the video clip; this application therefore performs violation detection on the audio clips corresponding to the determined video clips, converting what the anchor says while selling goods into text information, which improves the accuracy of violation detection.

Corresponding to the embodiment of the method, the specification further provides an embodiment of a device for detecting the violation of the live broadcast room.

Referring to fig. 3, fig. 3 is a block diagram of a live broadcast room violation detection apparatus according to an exemplary embodiment of the present application, including:

an obtaining unit 301, configured to obtain a live data stream of a target live broadcast room; the live data stream comprises a video stream and an audio stream corresponding to the video stream;

a segmentation unit 302, configured to perform segmentation processing on the video stream to obtain a plurality of video segments;

a matching unit 303, configured to extract features from the video segment, and determine whether the extracted features match anchor features corresponding to the target live broadcast room;

a detecting unit 304, configured to, when the extracted feature matches with an anchor feature corresponding to the target live broadcast room, convert an audio clip corresponding to the video clip into text information, and perform violation detection on the target live broadcast room based on the text information.

Optionally, the segmentation unit 302 includes:

performing VAD voice activity detection on the audio stream to determine valid audio segments of the audio stream containing voice signals;

and segmenting the video stream based on the determined timestamp range corresponding to the effective audio segment to obtain the video segment corresponding to the effective audio segment.

Specifically, the features include human face features;

optionally, the matching unit 303 includes:

extracting video frames from the video clips, and extracting human face features from the video frames;

and determining whether the extracted face features are matched with the anchor face features corresponding to the target live broadcast room.

Optionally, before converting the audio clip corresponding to the video clip into text information, the method includes:

determining whether the number of video frames in the video clip matched with the anchor characteristics corresponding to the target live broadcast room is greater than a threshold value;

if yes, further converting the audio clip corresponding to the video clip into text information.

Optionally, the apparatus further comprises:

if the extracted features are not matched with the anchor features corresponding to the target live broadcast room, further extracting voiceprint features from the effective audio clip corresponding to the video clip;

determining whether the extracted voiceprint features are matched with anchor voiceprint features corresponding to the target live broadcast room;

if yes, converting the effective audio clip corresponding to the video clip into text information, and carrying out live broadcast violation detection on the target live broadcast room based on the text information.

Optionally, the apparatus further comprises:

performing intonation emotion recognition based on the effective audio segments;

and when the recognized emotion hits preset emotion, preferentially performing feature matching on the video segment corresponding to the effective audio segment.

Optionally, the detecting unit 304 includes:

performing word segmentation processing on the text information to obtain a plurality of keywords;

Optionally, the detecting unit 304 includes:

performing semantic recognition on the text information;

and performing live broadcast violation detection on the target live broadcast room based on the semantic recognition result.

Optionally, the apparatus further comprises:

if the target live broadcast room is determined to have the violation behavior, recording violation data into a violation database; wherein the violation data comprises a number of violations and/or the violation type;

scoring the live broadcast room based on violation data corresponding to the target live broadcast room recorded in the violation database; wherein the scoring mechanism comprises a deduction or accumulation of a score;

limiting the authority of the anchor when the score is below or above a threshold.

The implementation process of the functions and actions of the above devices is specifically described in the implementation process of the corresponding steps in the above method, and is not described herein again.

The foregoing description of specific embodiments of the present application has been presented. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.

Other embodiments of the present application will be readily apparent to those skilled in the art from consideration of the present application and practice of the invention as claimed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.

It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

The above description is only for the purpose of illustrating the preferred embodiments of the present application and is not to be construed as limiting the present application, and any modifications, equivalents, improvements, etc. made within the spirit and principle of the present application should be included in the scope of the present application.

Claims

1. A live broadcast room violation detection method, comprising:

2. The method of claim 1, wherein the segmenting the video stream into a plurality of video segments comprises:

3. The method of claim 1, wherein the features comprise human face features;

the extracting features from the video clip and determining whether the extracted features match anchor features corresponding to the target live broadcast room includes:

4. The method of claim 3, prior to converting the audio segment corresponding to the video segment into text information, comprising:

5. The method of claim 1, further comprising:

6. The method of claim 2, further comprising:

7. The method of claim 1, wherein the detecting violations for the target live broadcast room based on text information comprises:

matching the keywords with the violation keywords in a preset violation keyword library respectively; the violation keywords are respectively marked with corresponding violation types;

8. The method of claim 1, wherein the detecting violations for the target live broadcast room based on text information comprises:

performing semantic recognition on the text information;

9. The method of claim 1, further comprising:

10. A live room violation detection apparatus, the apparatus comprising: