CN114697445A - Volume adjusting method, electronic equipment, terminal and storage medium - Google Patents


Info

Publication number
CN114697445A
CN114697445A (application CN202011638242.5A; granted publication CN114697445B)
Authority
CN
China
Prior art keywords
terminal
audio
specific space
received signal
audio data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011638242.5A
Other languages
Chinese (zh)
Other versions
CN114697445B (en)
Inventor
张敏 (Zhang Min)
杨乐鹏 (Yang Lepeng)
袁海飞 (Yuan Haifei)
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to CN202011638242.5A (granted as CN114697445B)
Priority to PCT/CN2021/136096 (WO2022143040A1)
Publication of CN114697445A
Application granted
Publication of CN114697445B
Legal status: Active

Classifications

    • G06F 3/16 — Sound input; sound output
    • G06F 3/165 — Management of the audio stream, e.g. setting of volume, audio stream path
    • G10L 25/30 — Speech or voice analysis characterised by the analysis technique using neural networks
    • G10L 25/51 — Speech or voice analysis specially adapted for comparison or discrimination
    • H04B 17/318 — Received signal strength
    • H04M 1/725 — Cordless telephones
    • H04M 1/72433 — Mobile telephone user interfaces with interactive means for internal management of messages for voice messaging
    • H04M 1/72454 — Mobile telephone user interfaces adapting the functionality of the device according to context-related or environment-related conditions
    • H04M 2250/12 — Telephonic subscriber devices including a sensor for measuring a physical value, e.g. temperature or motion
    • Y02D 30/70 — Reducing energy consumption in wireless communication networks

Abstract

The application provides a volume adjustment method, an electronic device, a terminal and a storage medium, belonging to the field of terminal technologies. The method is applied to a terminal and comprises the following steps: when the terminal is located in a specific space, the terminal acquires the multimedia content played/collected by a device in that space; the terminal plays the multimedia content and plays its audio data at a first volume; whether the terminal leaves the specific space is detected; and, when the terminal is detected to have left the specific space, the terminal continues to play the multimedia content while automatically adjusting to a second volume, larger than the first volume, to play the audio data. In this way, whether the terminal has left the specific space is detected, and the volume of the audio played by the terminal is adjusted automatically according to the detection result.

Description

Volume adjusting method, electronic equipment, terminal and storage medium
Technical Field
The present application relates to the field of terminal technologies, and in particular, to a volume adjustment method, an electronic device, a terminal, and a storage medium.
Background
Terminal technology is currently developing rapidly and is widely used. Besides daily communication, users often carry terminals such as mobile phones with them when attending a conference. During a meeting, a user typically views and listens to the meeting content through the conference equipment; for convenience, as technology develops, the conference content can also be synchronized to the terminal for playback.
If a user joins a conference on a mobile phone, playing the content out loud in the conference room disturbs the other participants; if the phone instead plays the conference content muted, the user has to adjust the volume manually every time he or she leaves the conference room in order not to miss any content. No suitable intelligent control strategy currently exists for this scenario.
Disclosure of Invention
The embodiments of the application provide a volume adjustment method, an electronic device, a terminal and a storage medium, solving the problem that a user has to manually adjust the terminal's playback volume when entering or leaving different venues.
In a first aspect, the present application provides a volume adjustment method, applied to a terminal, comprising: when the terminal is located in a specific space, first acquiring the multimedia content played/collected by a device in the specific space; then playing the audio data of the multimedia content at a first volume; detecting whether the terminal leaves the specific space; and, when it is detected that the terminal has left the specific space, continuing to play the multimedia content while automatically adjusting to a second volume to play the audio data of the multimedia content, wherein the second volume is larger than the first volume.
By detecting whether the terminal has left the specific space and, upon detecting that it has, automatically adjusting the terminal to a suitable volume, the embodiments of the application make the terminal's volume control more intelligent and avoid the poor experience caused by frequent manual volume adjustment.
Optionally, the first volume is mute, so that audio played out loud does not disturb the people in the specific space, for example participants in a meeting room or the audience in a cinema; the second volume is the volume used before the terminal entered the specific space, or a frequently used volume setting, so that after the user leaves the specific space the audio content is played at the volume the user is accustomed to.
In a possible implementation, the detecting whether the terminal leaves the specific space includes: acquiring second audio data, wherein the second audio data is audio data currently acquired by the terminal; comparing the similarity degree of first audio data and second audio data to determine audio matching degree, wherein the first audio data is the audio data of the multimedia content; determining whether the terminal leaves the specific space based on at least the audio matching degree.
According to the embodiments of the application, whether the terminal has left the specific space is judged by checking whether the audio data played in the specific space matches the audio data currently captured by the terminal, and the terminal is then automatically adjusted to a suitable volume according to the result, making volume adjustment more intelligent and avoiding the poor experience caused by frequent manual volume adjustment.
In one possible implementation, the determining whether the terminal leaves a specific space based on at least the audio matching degree is implemented as follows: if the audio matching degree is smaller than a first threshold, it is determined that the terminal has left the specific space.
In a possible implementation manner, the determining whether the terminal leaves the specific space based on at least the audio matching degree is performed by: determining whether the terminal leaves a specific space at least based on the audio matching degree and the received signal strength mutation result; the received signal strength mutation result represents the probability of the received signal strength mutation, and is determined based on multiple groups of received signal strengths, wherein the multiple groups of received signal strengths are the signal strengths received by the terminal in continuous time and transmitted by communication devices in the specific space.
Using the two judgment conditions, the audio matching degree and the received signal strength mutation result, makes the judgment of whether the terminal has left the specific space more accurate, and avoids the case where the terminal's volume is adjusted to the second volume even though it has not actually left the specific space.
In another possible implementation manner, the determining whether the terminal leaves a specific space based on at least the audio matching degree and the abrupt change result of the received signal strength is implemented by: carrying out weighted summation on the received signal strength mutation result and the audio matching degree, and determining the probability P that the terminal leaves the specific space; and if the P is greater than or equal to a preset probability value, determining that the terminal leaves the specific space.
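The weighted summation above can be sketched as follows; the weights, the inversion of the matching degree (a low match suggests departure), and the function names are illustrative assumptions rather than details fixed by the patent:

```python
def leave_probability(audio_match, rssi_mutation, w_audio=0.5, w_rssi=0.5):
    """Combine the two cues into a departure probability P.

    A low audio matching degree and a high mutation probability
    both suggest the terminal has left, so the audio term is
    inverted before the weighted sum. Weights are illustrative.
    """
    return w_audio * (1.0 - audio_match) + w_rssi * rssi_mutation

def has_left(audio_match, rssi_mutation, p_threshold=0.6):
    """True when P reaches the preset probability value."""
    return leave_probability(audio_match, rssi_mutation) >= p_threshold
```

With equal weights, a near-zero match and a near-certain mutation yield a high P, while matching audio and stable signal strength keep P low.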
In another possible implementation manner, the determining whether the terminal leaves a specific space based on at least the audio matching degree is performed by: determining whether the terminal leaves a specific space based on the audio matching degree, the received signal mutation result and the image recognition result; the received signal strength mutation result represents the probability of the received signal strength mutation, and is determined based on multiple groups of received signal strengths, wherein the multiple groups of received signal strengths are the signal strengths which are sent by the communication devices in the specific space and received by the terminal in continuous time; the image recognition result represents the probability that the terminal is located in the specific space and is obtained based on the image of the target object and an image recognition model, and the image recognition model is used for judging the probability that the terminal is located in the specific space based on the image recognition of the target object.
Adding the image recognition result as a further judgment condition, so that whether the terminal has left a specific space is judged from three results (the audio matching degree, the received signal mutation result and the image recognition result), further improves the accuracy of the judgment and the user's experience.
In another possible implementation manner, the determining whether the terminal leaves a specific space based on the audio matching degree, the received signal mutation result, and the image recognition result is as follows: the audio matching degree is smaller than a first threshold, the received signal strength mutation result is larger than a second threshold, and the image identification result is smaller than a third threshold; or the audio matching degree is smaller than the first threshold and the received signal strength mutation result is larger than a second threshold; or the audio matching degree is smaller than a first threshold and the image recognition result is smaller than a third threshold; or the received signal strength mutation result is larger than a second threshold value and the image identification result is smaller than a third threshold value; it is determined that the terminal leaves the specific space.
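The four qualifying combinations in the claim above amount to requiring that at least two of the three per-cue conditions hold. A minimal sketch, with assumed threshold defaults and function names:

```python
def has_left_space(audio_match, rssi_mutation, image_prob,
                   t1=0.5, t2=0.5, t3=0.5):
    """Decide 'terminal has left' when at least two of the three
    conditions from the claim hold. t1..t3 stand in for the
    patent's first, second and third thresholds (values assumed).
    """
    conditions = [
        audio_match < t1,    # captured audio no longer matches room playback
        rssi_mutation > t2,  # received signal strength changed abruptly
        image_prob < t3,     # camera no longer recognises the space
    ]
    return sum(conditions) >= 2
```

Any single cue firing on its own is not enough, which is exactly the robustness the claim is after.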
In another possible implementation, the comparing the similarity degree between the first audio data and the second audio data and the determining the audio matching degree includes: acquiring the second audio data; aligning the first audio data and the second audio data; extracting continuous N frames of audio frames of the first audio data in a preset time period to obtain a first audio frame sequence; extracting N audio frames aligned with the first audio frame sequence in the second audio data to obtain a second audio frame sequence; n is a positive integer greater than or equal to 1; calculating the similarity between each frame of audio frame in the first audio frame sequence and the corresponding audio frame in the second audio frame sequence; and determining the audio matching degree of the first audio data and the second audio data in the preset time period based on the ratio of the number of audio frames reaching the standard in the second audio frame sequence to N, wherein the audio frames reaching the standard are the audio frames of which the similarity between the audio frames in the second audio frame sequence and the corresponding audio frames in the first audio frame sequence is greater than a fourth threshold.
In another possible implementation manner, the above-mentioned calculating the similarity between each audio frame in the first audio frame sequence and the corresponding audio frame in the second audio frame sequence is performed by: respectively extracting and obtaining a first feature vector representing each frame of audio frame in the first audio frame sequence and a second feature vector representing each frame of audio frame in the second audio frame sequence; and determining the similarity of each audio frame in the first audio frame sequence and the corresponding audio frame in the second audio frame sequence based on the similarity of the first feature vector and the second feature vector.
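The frame-by-frame comparison described in the two claims above can be sketched as follows, assuming cosine similarity between per-frame feature vectors (the patent does not fix a particular similarity measure):

```python
import math

def cosine(u, v):
    """Cosine similarity of two feature vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def audio_match_degree(seq1, seq2, t4=0.8):
    """Fraction of frames in the second sequence whose similarity
    to the aligned frame in the first sequence exceeds the fourth
    threshold t4 (value assumed). seq1/seq2: equal-length lists of
    N feature vectors.
    """
    assert len(seq1) == len(seq2)
    ok = sum(1 for f1, f2 in zip(seq1, seq2) if cosine(f1, f2) > t4)
    return ok / len(seq1)
```

The returned ratio is the audio matching degree for the preset time period, to be compared against the first threshold.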
In another possible implementation manner, the aligning the first audio data and the second audio data is implemented by: extracting continuous M-frame audio frame sequences starting from a first moment in the second audio data to obtain a third audio frame sequence, and extracting continuous M-frame audio frame sequences starting from a plurality of different second moments in the first audio data to obtain a plurality of fourth audio frame sequences, wherein the second moment is greater than or equal to the first moment, and M is a positive integer greater than or equal to 1; determining a delay compensation parameter based on the similarity of the third audio frame sequence and the fourth audio frame sequence; aligning the first audio data and the second audio data based on the delay compensation parameter.
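The alignment step can be sketched as a search over candidate offsets into the played audio; representing frames as scalars and scoring with negative absolute difference are simplifying assumptions for illustration:

```python
def estimate_delay(first_audio, second_audio, m, candidates):
    """Pick the offset into first_audio (the played data) whose M
    consecutive frames best match the first M frames of
    second_audio (the microphone capture). The winning offset is
    the delay compensation parameter.
    """
    ref = second_audio[:m]
    best_offset, best_score = None, float("-inf")
    for off in candidates:
        window = first_audio[off:off + m]
        if len(window) < m:
            continue  # not enough frames left at this offset
        score = -sum(abs(a - b) for a, b in zip(window, ref))
        if score > best_score:
            best_offset, best_score = off, score
    return best_offset
```

Once the offset is known, the first audio data is shifted by it before the frame-by-frame similarity comparison.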
In another possible implementation, the method further comprises: acquiring multiple groups of received signal strengths over consecutive times; and determining the received signal strength mutation result based on the multiple groups of received signal strengths.
Determining the received signal strength mutation result from multiple groups of received signal strengths avoids the accuracy of the result being affected by an anomaly in any single reading.
In another possible implementation manner, the determining the abrupt change result of the received signal strength based on the plurality of sets of received signal strengths is as follows: determining a characteristic vector representing the variation characteristic of the strength of the received signal based on the difference value of the strength of the two groups of received signals at adjacent moments; and inputting the characteristic vector into a preset prediction model, and determining the intensity mutation result of the received signal.
In another possible implementation manner, the determining the eigenvector characterizing the variation of the received signal strength based on the difference between the two sets of received signal strengths at adjacent time instants is as follows: and determining the signal strength abrupt change feature vector based on whether the difference value of the signal strength of the adjacent time moments is larger than a fifth threshold value.
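A minimal sketch of the feature-vector construction above; the threshold value (in dB) and the stand-in for the prediction model (a simple ratio rather than a trained model) are assumptions:

```python
def mutation_feature_vector(rssi_series, t5=6.0):
    """Binary feature per adjacent pair of readings: 1 if the
    absolute RSSI jump exceeds the fifth threshold t5, else 0.
    """
    return [1 if abs(b - a) > t5 else 0
            for a, b in zip(rssi_series, rssi_series[1:])]

def mutation_probability(feature_vector):
    """Stand-in for the patent's prediction model: the share of
    adjacent steps that jumped. A real implementation would feed
    the vector into a trained predictor instead.
    """
    return sum(feature_vector) / len(feature_vector) if feature_vector else 0.0
```

The resulting probability is what the earlier claims compare against the second threshold.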
In another possible implementation, the multiple sets of received signal strengths are multiple sets of X bluetooth received signal strengths, where the multiple sets of X bluetooth received signal strengths are bluetooth signal strengths transmitted by X bluetooth devices in the specific space received by the terminal in a continuous time, and X is a positive integer greater than or equal to 3; and one of the multiple groups of received signal strengths is the strength of the bluetooth signal sent by the X bluetooth devices in the specific space received by the terminal at the same time.
By using the Bluetooth signal strengths that the terminal receives in the specific space as the received signal strengths, the application does not require any pre-configured Bluetooth equipment, which saves deployment cost while broadening the range of applicable scenarios.
In another possible implementation, when it is detected that the terminal leaves the specific space, the terminal continues to play the multimedia content and automatically adjusts to the second volume to play the audio data of the multimedia content in the following manner: if the terminal is detected to have left the specific space, the front camera of the terminal is switched on; it is then judged whether the front camera has captured the face image of the target user for a preset duration and whether the page displayed by the terminal is a designated display page; if both conditions are met, the terminal is automatically adjusted to the second volume to play the audio data of the multimedia content. In this way the audio content is played only when the user actually needs it, which is more user-friendly.
In another possible implementation, the method further comprises: if the terminal is detected to have left the specific space, acquiring a first signal fingerprint, which is determined based on the received signal strengths collected by the terminal at its current position; acquiring second signal fingerprints at a preset frequency; and, when a second signal fingerprint matches the first signal fingerprint, adjusting the volume of the terminal to a third volume that is larger than the first volume and smaller than the second volume. When the user is about to enter a specific space (such as a meeting room or a cinema), the volume is thus turned down in advance to avoid disturbing the people inside.
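The fingerprint comparison can be sketched with RSSI readings keyed by beacon id; the Euclidean metric, the -100 dBm default for beacons missing from one fingerprint, and the distance threshold are all assumptions:

```python
def fingerprint_distance(fp1, fp2):
    """Euclidean distance between two RSSI fingerprints, each a
    dict mapping a beacon id to its received signal strength.
    Beacons seen in only one fingerprint default to -100 dBm.
    """
    keys = set(fp1) | set(fp2)
    return sum((fp1.get(k, -100) - fp2.get(k, -100)) ** 2 for k in keys) ** 0.5

def approaching_space(first_fp, second_fp, max_dist=10.0):
    """True when a fresh fingerprint matches the one saved at the
    spot where the terminal left the specific space."""
    return fingerprint_distance(first_fp, second_fp) <= max_dist
```

A match then triggers the switch to the intermediate third volume described above.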
Optionally, the third volume is a smaller volume, for example, when the maximum volume of the terminal is 100, the third volume is 10.
In another possible implementation, the method further comprises: if the terminal is detected to leave the specific space, continuously detecting whether the terminal enters the specific space; and if the terminal is detected to enter the specific space, automatically adjusting the volume of the terminal to the first volume to play the audio data of the multimedia content.
In another possible implementation, the multimedia content further includes the display contents of the devices in the specific space.
In another possible implementation, the method further comprises: and adjusting a fourth threshold, a time delay compensation parameter and a fifth threshold according to the audio matching degree, the received signal mutation result and the image identification result.
In another possible implementation, the implementation manner of adjusting the fourth threshold, the delay compensation parameter, and the fifth threshold according to the audio matching degree, the received signal mutation result, and the image recognition result is as follows: when the image identification result is smaller than a third threshold value and the received signal strength mutation result is larger than a second threshold value, the audio matching degree is larger than or equal to a first threshold value; or, the image recognition result is greater than or equal to a third threshold value, the received signal strength mutation result is less than or equal to a second threshold value, and the audio matching degree is less than a first threshold value; adjusting the delay compensation parameter and a fourth threshold;
when the image recognition result is smaller than a third threshold value and the audio matching degree is smaller than a first threshold value, and the received signal strength mutation result is smaller than or equal to a second threshold value; or, when the image recognition result is greater than or equal to the third threshold and the audio matching degree is greater than or equal to the first threshold, and the received signal strength mutation result is greater than the second threshold; the fifth threshold is adjusted. The mutual learning and dynamic adjustment of the decision parameter of the audio matching degree and the decision parameter of the received signal strength mutation result are realized, and the accuracy of judging whether the terminal leaves a specific space is improved.
In another possible implementation, when the image recognition result is less than a third threshold and the received signal strength mutation result is greater than a second threshold, the audio matching degree is greater than or equal to the first threshold; or, the image recognition result is greater than or equal to a third threshold value, the received signal strength mutation result is less than or equal to a second threshold value, and the audio matching degree is less than a first threshold value; the implementation manner of adjusting the delay compensation parameter and the fourth threshold is as follows: when the image identification result is smaller than a third threshold value and the received signal strength mutation result is larger than a second threshold value, the audio matching degree is larger than or equal to the first threshold value; increasing the fourth threshold value until the audio matching degree is smaller than the first threshold value; when the image recognition result is greater than or equal to a third threshold value and the received signal strength mutation result is less than or equal to a second threshold value, and the audio matching degree is less than a first threshold value; judging whether the audio similarity change rate is greater than a preset threshold value or not; if so, adjusting the time delay compensation parameter until the audio similarity change rate is smaller than a preset threshold value; if not, reducing the fourth threshold value until the audio matching degree is greater than or equal to the first threshold value;
the audio similarity change rate is determined based on a ratio of a difference between the similarity of the audio frame of the first audio data and the audio frame of the second voice data at a second time and the similarity of the audio frame of the first audio data and the audio frame of the second voice data at a third time to a difference between the second time and the third time, the second time is a time when the image recognition result is greater than or equal to a third threshold and the received signal strength mutation result is less than or equal to the second threshold, the audio matching degree is less than the first threshold, and the third time is a time adjacent to the second time.
In another possible implementation, when the image recognition result is smaller than the third threshold and the audio matching degree is smaller than the first threshold, the received signal strength abrupt change result is smaller than or equal to the second threshold; or, when the image recognition result is greater than or equal to the third threshold and the audio matching degree is greater than or equal to the first threshold, and the received signal strength mutation result is greater than the second threshold; the implementation manner of adjusting the fifth threshold is as follows: when the image identification result is smaller than a third threshold value and the audio matching degree is smaller than a first threshold value, and the received signal strength mutation result is smaller than or equal to a second threshold value; reducing the fifth threshold value until the sudden change result of the received signal strength is greater than a second threshold value; when the image recognition result is greater than or equal to the third threshold and the audio matching degree is greater than or equal to the first threshold, and the received signal strength mutation result is greater than the second threshold; the fifth threshold is increased until the result of the abrupt change in the received signal strength is less than or equal to the second threshold.
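One of the adjustment rules above, raising or lowering the fourth threshold when the audio cue disagrees with the camera and RSSI cues, can be sketched as follows; the step size and threshold defaults are assumptions, and the delay-compensation branch driven by the audio similarity change rate is omitted for brevity:

```python
def adjust_fourth_threshold(t4, image_prob, rssi_mutation, audio_match,
                            t1=0.5, t2=0.5, t3=0.5, step=0.05):
    """When camera and RSSI say 'left' (image < t3, mutation > t2)
    but the audio cue disagrees (match >= t1), raise t4 so fewer
    frames count as similar and the audio matching degree drops;
    in the opposite disagreement, lower t4. Step size assumed.
    """
    if image_prob < t3 and rssi_mutation > t2 and audio_match >= t1:
        return t4 + step
    if image_prob >= t3 and rssi_mutation <= t2 and audio_match < t1:
        return t4 - step
    return t4  # cues agree: leave the threshold unchanged
```

Iterating this rule realizes the mutual learning between cues that the claim describes.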
In a second aspect, the present application further provides an electronic device, comprising: an acquisition module, configured to acquire, when the terminal is located in a specific space, the multimedia content played/collected by a device in the specific space; a playing module, configured to play the multimedia content and play the audio data of the multimedia content at a first volume; a detection module, configured to detect whether the terminal leaves the specific space; and an adjusting module, configured, when it is detected that the terminal leaves the specific space, to continue playing the multimedia content and automatically adjust to a second volume to play the audio data of the multimedia content, wherein the second volume is larger than the first volume.
In one possible implementation, the obtaining module is further configured to: acquiring second audio data, wherein the second audio data is audio data currently acquired by the terminal; the detection module is further configured to: comparing the similarity degree of first audio data and second audio data to determine audio matching degree, wherein the first audio data is the audio data of the multimedia content; determining whether the terminal leaves the specific space based on at least the audio matching degree.
In one possible implementation, the determining whether the terminal leaves the specific space based on at least the audio matching degree includes: and if the audio matching degree is smaller than a first threshold value, determining that the terminal leaves the specific space.
In another possible implementation, the determining whether the terminal leaves a specific space based on at least the audio matching degree includes: determining whether the terminal leaves a specific space at least based on the audio matching degree and the received signal strength mutation result; the received signal strength mutation result represents the probability of the received signal strength mutation, and is determined based on multiple groups of received signal strengths, wherein the multiple groups of received signal strengths are the signal strengths received by the terminal in continuous time and transmitted by communication devices in the specific space.
In another possible implementation, the determining whether the terminal leaves a specific space based on at least the audio matching degree and the received signal strength abrupt change result includes: performing weighted summation on the received signal strength abrupt change result and the audio matching degree to determine the probability P that the terminal has left the specific space; and if P is greater than or equal to a preset probability value, determining that the terminal has left the specific space.
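As a minimal sketch of the weighted-summation decision above (not the application's own implementation): the weights, the preset probability value, and the inversion of the audio matching degree (a low matching degree indicates departure) are all illustrative assumptions.

```python
def leave_probability(audio_match, rssi_change, w_audio=0.5, w_rssi=0.5):
    """Weighted summation of the two evidence sources into a probability P
    that the terminal has left the specific space. A LOW audio matching
    degree and a HIGH abrupt-change result both indicate departure, so the
    audio term is inverted here (an assumption; the weights are illustrative)."""
    return w_audio * (1.0 - audio_match) + w_rssi * rssi_change


def has_left(audio_match, rssi_change, preset_probability=0.6):
    # The terminal is judged to have left when P >= the preset probability value.
    return leave_probability(audio_match, rssi_change) >= preset_probability
```

In practice the weights and the preset probability value would be tuned offline for the specific space.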
In another possible implementation, the determining whether the terminal leaves a specific space based on at least the audio matching degree includes: determining whether the terminal leaves a specific space based on the audio matching degree, the received signal mutation result and the image recognition result; the received signal strength mutation result represents the probability of the received signal strength mutation, and is determined based on multiple groups of received signal strengths, wherein the multiple groups of received signal strengths are the signal strengths which are sent by the communication devices in the specific space and received by the terminal in continuous time; the image recognition result represents the probability that the terminal is located in the specific space, and is obtained based on the image of the target object and an image recognition model, and the image recognition model is used for judging the probability that the terminal is located in the specific space based on the image recognition of the target object.
In another possible implementation, the determining whether the terminal leaves a specific space based on the audio matching degree, the received signal strength abrupt change result, and the image recognition result includes: judging that the terminal has left the specific space when any of the following holds: the audio matching degree is smaller than the first threshold, the received signal strength abrupt change result is larger than the second threshold, and the image recognition result is smaller than the third threshold; or the audio matching degree is smaller than the first threshold and the received signal strength abrupt change result is larger than the second threshold; or the audio matching degree is smaller than the first threshold and the image recognition result is smaller than the third threshold; or the received signal strength abrupt change result is larger than the second threshold and the image recognition result is smaller than the third threshold.
In another possible implementation, the detection module is further configured to align the first audio data and the second audio data; extracting continuous N frames of audio frames of the first audio data in a preset time period to obtain a first audio frame sequence; extracting N audio frames aligned with the first audio frame sequence in the second audio data to obtain a second audio frame sequence; n is a positive integer greater than or equal to 1; calculating the similarity between each frame of audio frame in the first audio frame sequence and the corresponding audio frame in the second audio frame sequence; and determining the audio matching degree of the first audio data and the second audio data in the preset time period based on the ratio of the number of qualified audio frames in a second audio frame sequence to N, wherein the qualified audio frames are the audio frames of which the similarity between the audio frames in the second audio frame sequence and the corresponding audio frames in the first audio frame sequence is greater than a fourth threshold.
In another possible implementation, the calculating the similarity between each audio frame in the first audio frame sequence and the corresponding audio frame in the second audio frame sequence includes: respectively extracting and obtaining a first feature vector representing each frame of audio frame in the first audio frame sequence and a second feature vector representing each frame of audio frame in the second audio frame sequence; and determining the similarity of each audio frame in the first audio frame sequence and the corresponding audio frame in the second audio frame sequence based on the similarity of the first feature vector and the second feature vector.
In another possible implementation, the aligning the first audio data and the second audio data includes: extracting continuous M-frame audio frame sequences starting from a first moment in the second audio data to obtain a third audio frame sequence, and extracting continuous M-frame audio frame sequences starting from a plurality of different second moments in the first audio data to obtain a plurality of fourth audio frame sequences, wherein the second moment is greater than or equal to the first moment, and M is a positive integer greater than or equal to 1; determining a delay compensation parameter based on the similarity of the third audio frame sequence and the fourth audio frame sequence; aligning the first audio data and the second audio data based on the delay compensation parameter.
In another possible implementation, the detection module is further configured to obtain multiple sets of received signal strengths over consecutive times, and determine the received signal strength abrupt change result based on the multiple sets of received signal strengths.
In another possible implementation, the determining the received signal strength abrupt change result based on the plurality of sets of received signal strengths includes: determining a feature vector representing the variation characteristics of the received signal strength based on the differences between the two sets of received signal strengths at adjacent moments; and inputting the feature vector into a preset prediction model to determine the received signal strength abrupt change result.
In another possible implementation, the multiple sets of received signal strengths are multiple sets of X bluetooth received signal strengths, where the multiple sets of X bluetooth received signal strengths are bluetooth signal strengths transmitted by X bluetooth devices in the specific space received by the terminal in continuous time, and X is a positive integer greater than or equal to 3; and one of the multiple groups of received signal strengths is the strength of the bluetooth signal sent by the X bluetooth devices in the specific space received by the terminal at the same time.
In another possible implementation, the adjusting module is further configured to: if it is detected that the terminal leaves the specific space, control a front camera of the terminal to be turned on; determine whether the front camera has captured a face image of the target user for a preset duration and whether the display page of the terminal is a designated display page; and if so, automatically adjust the terminal to the second volume to play the audio data of the multimedia content.
In another possible implementation, the apparatus further comprises a collection module, configured to: if the terminal leaves the specific space, acquire a first signal fingerprint, wherein the first signal fingerprint is determined based on the received signal strengths collected by the terminal at its current position; and collect a plurality of second signal fingerprints at a preset frequency. The adjusting module is further configured to, when a second signal fingerprint matches the first signal fingerprint, adjust the volume of the terminal to a third volume, wherein the third volume is greater than the first volume and less than the second volume.
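The fingerprint comparison in this implementation could be sketched as follows; representing a signal fingerprint as a vector of RSSI values and matching on mean absolute difference within a tolerance are illustrative assumptions, since the claim leaves the fingerprint representation open.

```python
def fingerprints_match(first_fp, second_fp, tolerance=6.0):
    """Compare two signal fingerprints, each taken here as a list of RSSI
    values (dBm) observed from the same set of transmitters. Matching on the
    mean absolute difference, and the tolerance value, are assumptions."""
    diffs = [abs(a - b) for a, b in zip(first_fp, second_fp)]
    return sum(diffs) / len(diffs) <= tolerance
```

A terminal could evaluate this against each second signal fingerprint collected at the preset frequency and switch to the third volume on the first match.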
In another possible implementation, the detection module is further configured to: if the terminal is detected to leave the specific space, continuously detecting whether the terminal enters the specific space; and if the terminal is detected to enter the specific space, automatically adjusting the volume of the terminal to the first volume to play the audio data of the multimedia content.
In another possible implementation, the multimedia content further includes display content of devices in the particular space.
In another possible implementation, the apparatus further includes a parameter optimization module, configured to adjust the fourth threshold, the delay compensation parameter, and the fifth threshold according to the audio matching degree, the received signal mutation result, and the image recognition result.
In another possible implementation, the adjusting the fourth threshold, the delay compensation parameter, and the fifth threshold according to the audio matching degree, the received signal strength abrupt change result, and the image recognition result includes: when the image recognition result is smaller than the third threshold and the received signal strength abrupt change result is larger than the second threshold but the audio matching degree is greater than or equal to the first threshold, or when the image recognition result is greater than or equal to the third threshold and the received signal strength abrupt change result is less than or equal to the second threshold but the audio matching degree is less than the first threshold, adjusting the delay compensation parameter and the fourth threshold;
when the image recognition result is smaller than the third threshold and the audio matching degree is smaller than the first threshold but the received signal strength abrupt change result is less than or equal to the second threshold, or when the image recognition result is greater than or equal to the third threshold and the audio matching degree is greater than or equal to the first threshold but the received signal strength abrupt change result is greater than the second threshold, adjusting the fifth threshold. This realizes mutual learning and dynamic adjustment between the decision parameter of the audio matching degree and the decision parameter of the received signal strength abrupt change result, improving the accuracy of judging whether the terminal has left the specific space.
In another possible implementation, the adjusting the delay compensation parameter and the fourth threshold when the image recognition result is smaller than the third threshold and the received signal strength abrupt change result is larger than the second threshold but the audio matching degree is greater than or equal to the first threshold, or when the image recognition result is greater than or equal to the third threshold and the received signal strength abrupt change result is less than or equal to the second threshold but the audio matching degree is less than the first threshold, includes: in the former case, increasing the fourth threshold until the audio matching degree is smaller than the first threshold;
in the latter case, judging whether the audio similarity change rate is greater than a preset threshold; if so, adjusting the delay compensation parameter until the audio similarity change rate is smaller than the preset threshold; if not, reducing the fourth threshold until the audio matching degree is greater than or equal to the first threshold.
The audio similarity change rate is determined based on the ratio of (a) the difference between the similarity of the audio frames of the first audio data and the second audio data at a second time and that similarity at a third time to (b) the difference between the second time and the third time. The second time is a time at which the image recognition result is greater than or equal to the third threshold, the received signal strength abrupt change result is less than or equal to the second threshold, and the audio matching degree is less than the first threshold; the third time is a time adjacent to the second time.
In another possible implementation, when the image recognition result is smaller than the third threshold and the audio matching degree is smaller than the first threshold but the received signal strength abrupt change result is smaller than or equal to the second threshold, or when the image recognition result is greater than or equal to the third threshold and the audio matching degree is greater than or equal to the first threshold but the received signal strength abrupt change result is greater than the second threshold, the adjusting the fifth threshold includes: in the former case, reducing the fifth threshold until the received signal strength abrupt change result is greater than the second threshold; in the latter case, increasing the fifth threshold until the received signal strength abrupt change result is less than or equal to the second threshold.
In a third aspect, the present application further provides a terminal, including a memory and a processor, wherein the memory stores executable code, and the processor executes the executable code to implement the method described in the first aspect or any one of the possible implementations of the first aspect.
In a fourth aspect, the present application further provides a computer-readable storage medium having a computer program stored thereon, which, when executed in a computer, causes the computer to perform the method of the first aspect or any one of the possible implementation manners of the first aspect.
In a fifth aspect, the present application also provides a computer program or a computer program product, which comprises instructions that, when executed, implement the method described in the first aspect or any one of the possible implementation manners of the first aspect.
Implementations provided by the above aspects of the present application can be further combined to provide additional implementations.
Drawings
Fig. 1 is an application scenario diagram according to an embodiment of the present application;
fig. 2a is a flowchart of a volume adjustment method according to an embodiment of the present disclosure;
FIG. 2b is a flow chart of a volume adjustment method according to another embodiment;
fig. 3 is a flowchart of a method for detecting whether a terminal leaves a conference room according to an embodiment of the present application;
FIG. 4 is a flow chart of audio match determination;
FIG. 5 is a schematic diagram of MFCC feature extraction for audio data;
fig. 6 is a schematic diagram illustrating an alignment process of first audio data and second audio data;
FIG. 7 is a schematic diagram of RSSI acquisition;
FIG. 8 is a flow chart of RSSI mutation result determination;
fig. 9 is a flowchart of a volume adjustment method in another embodiment;
FIG. 10 is a diagram illustrating parameter adjustment for audio matching in another embodiment;
FIG. 11 is a diagram illustrating adjustment of parameters of sudden change in received signal strength in another embodiment;
FIG. 12 is a diagram of another exemplary application scenario according to an embodiment of the present application;
fig. 13 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
fig. 14 is a schematic structural diagram of a terminal according to an embodiment of the present application.
Detailed Description
The technical solution of the present application is further described in detail below with reference to the accompanying drawings and embodiments.
The volume adjusting method and the terminal provided by the embodiments of the present application can be applied to scenarios in which a user enters and leaves a specific space where the terminal needs to remain quiet, such as a conference room or a movie theater. The scheme of the embodiments of the present application is described in detail below, taking a conference room scenario as an example.
Fig. 1 is an application scenario diagram according to an embodiment of the present application. As shown in fig. 1, a user is participating in a conference with multiple participants. When the user needs to leave the conference room briefly (for example, to go to the toilet) but does not want to miss the conference content, the user can establish a connection with the conference device 1 through a terminal the user carries and synchronize the conference content to the terminal for playing. In order not to disturb the other participants, the conference content is played in mute mode, and when the user leaves the conference room, the volume is automatically adjusted to the normal volume. In this way, the user does not miss important conference content, the terminal intelligently adjusts the volume according to the scene without causing sound interference to the other participants, and the user experience is improved.
Fig. 2a is a flowchart of a volume adjustment method according to an embodiment of the present disclosure. As shown in fig. 2a, the method comprises the following steps:
s201, conference content data at least comprising conference audio data is obtained, and the conference audio data is played in a first volume.
When a user needs to leave a conference room (i.e., a specific space) temporarily, in order not to miss conference contents, the user acquires conference contents data (i.e., multimedia data) including at least conference audio data through a terminal and continues playing the acquired conference audio data at a first volume on the terminal.
It is understood that there are various methods for the terminal to establish a communication connection with the conference device to obtain the conference content data, for example, when there is a conference auxiliary device in the conference room that collects the conference content data and can synchronize the collected conference content to a specific terminal, the terminal can establish a connection with the conference auxiliary device in a "bump-and-bump" manner to synchronize the conference content to the terminal for playing. Or the terminal establishes connection with the conference equipment in a two-dimensional code scanning mode, and synchronizes the conference content data to the terminal for playing. Or, the terminal establishes a communication connection with a specific conference device by receiving the instruction, synchronizes the conference content data to the terminal for playing, and the like, and can acquire the conference content data in any achievable acquisition manner according to the actual situation, which is not limited in the present application.
It should be noted that the conference audio data includes audio data related to a conference, such as audio data played by a conference device, audio data generated by a talk discussion of a conference participant, audio data generated by other conference devices (such as far-end voice of a remote conference participant played through a speaker), and the like.
In one example, the first volume is mute or a very low volume. For example, if the maximum volume is 100, the first volume is between 1 and 5; that is, the first volume is preferably imperceptible to humans at close range, so as not to disturb the participants.
The terminal can be a terminal with a voice receiving and playing function, such as a smart phone, an intelligent wearable device, a tablet computer, a notebook computer, a palm computer and a personal digital assistant.
The conference device may be a display device with communication function (e.g., a large screen display that displays the conference content data), or a sound pickup device (e.g., a microphone that picks up audio data generated by the speech of the conference participants) or the like that is related to the conference.
S202, detecting whether the terminal leaves the conference room.
When a user acquires conference content data through a terminal (i.e., when the terminal establishes a communication connection with a conference device), it starts to detect whether the user leaves a conference room. For example, when the terminal establishes a communication connection with the conference auxiliary device by "bumping", or when the terminal scans the two-dimensional code to establish a communication connection with the conference device, or when the terminal receives an instruction to establish a communication connection with the conference device, the terminal starts a built-in detection algorithm to detect whether the terminal leaves the conference room. The specific scheme for detecting whether the terminal leaves the conference room is described below.
And S203, when the terminal is detected to leave the conference room, automatically adjusting the terminal to a second volume which is larger than the first volume to play the audio data of the conference content.
It can be understood that the second volume here may be the volume used before the terminal entered the conference room or a frequently used volume, so that after the user leaves the conference room, the conference audio data is played at the volume the user is accustomed to. Of course, the second volume may also be a preset volume; for example, if the maximum playing volume of the terminal is 100, the second volume may be preset to 50. This is not limited in the present application.
According to the volume adjusting method, whether the terminal leaves the conference room or not is detected, when the terminal leaves the conference room, the terminal is automatically adjusted to the appropriate volume to play audio data of conference contents, so that the volume adjustment of the terminal is more intelligent, and the problem of poor experience caused by frequent manual volume adjustment is avoided.
In another embodiment, in order to further improve the user experience, the volume is turned up only when the user needs it. As shown in fig. 2b, when it is determined that the terminal has left the conference room, the front camera of the terminal is first controlled to turn on, and it is determined whether the front camera has captured the face image of the target user for a preset duration and whether the display page of the terminal is the conference playing page (i.e., the designated page). If so, this indicates that the user wants to follow the conference content, and the volume of the terminal is turned up to the second volume, i.e., the conference audio data is played at normal volume.
Fig. 3 is a flowchart of a method for detecting whether a terminal leaves a conference room according to an embodiment of the present application. As shown in fig. 3, the method comprises the following steps:
s301, the terminal collects audio data of the current environment.
When the user acquires the conference content data through the terminal, the sound pickup device (such as a microphone) of the terminal is controlled to start to collect the audio data of the current environment.
S302, the similarity degree of the audio data collected by the terminal and the audio data in the conference content is compared, and the audio matching degree is determined.
For convenience of description, the conference audio data is represented as first audio data, and the audio data currently captured by the terminal is represented as second audio data. The method for comparing the similarity between the audio data collected by the terminal and the audio data in the conference content to determine the audio matching degree is shown in fig. 4.
As shown in fig. 4, first, the first audio data and the second audio data are aligned in S401.
Specifically, a continuous M-frame audio frame sequence starting from a first time in the second audio data is extracted to obtain a third audio frame sequence, a continuous M-frame audio frame sequence starting from a plurality of different second times in the first audio data is extracted to obtain a plurality of fourth audio frame sequences, the second time is greater than or equal to the first time, and M is a positive integer greater than or equal to 1. And then finding out a fourth audio frame sequence with the highest similarity to the third audio frame sequence, wherein the difference value between the second moment and the first moment corresponding to the fourth audio frame sequence is the time delay compensation parameter. And finally, compensating the time delay of the first audio data based on the time delay compensation parameter, namely realizing the alignment of the first audio data and the second audio data.
The similarity between the third audio frame sequence and the fourth audio frame sequence may be obtained by calculating the similarity between the audio features of each frame in the third audio frame sequence and the audio features in the fourth audio frame sequence.
The audio features may be MFCC (Mel Frequency Cepstrum Coefficient) features, fbank (filterbank) features, etc., and here, the audio features are described as MFCC features, and the MFCC feature extraction process is shown in fig. 5.
In one example, the audio data is divided into 40-millisecond frames with a frame shift of 10 milliseconds, 12 frames can be extracted every 0.5 seconds, and a feature vector of 26 features can be extracted from each frame. The MFCC feature vector of the i-th frame extracted from the first audio data is M_i = [m_i1 m_i2 … m_ip], p = 26 (i.e., the first feature vector), and the MFCC feature vector of the j-th frame extracted from the second audio data is F_j = [f_j1 f_j2 … f_jp], p = 26 (i.e., the second feature vector).
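The framing step described above can be sketched as follows; the sampling rate is an assumption, and the MFCC computation itself, which would reduce each frame to the p = 26-dimensional vectors M_i and F_j, is omitted.

```python
def frame_signal(samples, sample_rate=16000, frame_ms=40, shift_ms=10):
    """Split a mono sample sequence into overlapping analysis frames
    (40-millisecond frames with a 10-millisecond shift, as in the text);
    each frame would subsequently be reduced to a 26-dimensional MFCC
    feature vector."""
    frame_len = sample_rate * frame_ms // 1000
    shift = sample_rate * shift_ms // 1000
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += shift
    return frames
```

For one second of 16 kHz audio this yields 640-sample frames starting every 160 samples.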
There are various methods for calculating the similarity, such as the Euclidean distance, cosine similarity, and the Manhattan distance; the Euclidean distance is taken as an example here. If the fourth audio frame sequence starting at frame i of the first audio data and the third audio frame sequence satisfy

\( i = \arg\min_i \sum_{k=1}^{M} \sqrt{\sum_{q=1}^{p} \left( m_{(i+k-1)q} - f_{kq} \right)^2} \)

that is, the sum of the Euclidean distances over the M consecutive frames is minimal, it is determined that the first audio data and the second audio data begin to match and coincide at frame i. The delay compensation parameter is then Delay = ΔT·(i − 1), where ΔT is the frame length, i.e., the 40 milliseconds mentioned above, and the first audio data and the second audio data are aligned by compensating the first audio data with the delay compensation parameter (as shown in fig. 6).
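The delay-compensation search can be sketched as a sliding-window comparison over the MFCC vectors; frame sequences are plain lists of p-dimensional vectors here, and the 40-millisecond ΔT follows the text.

```python
import math


def euclidean(u, v):
    # Euclidean distance between two MFCC feature vectors
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def estimate_delay(first_feats, second_feats, M, delta_t_ms=40):
    """Find the start frame i of the fourth audio frame sequence (taken from
    the first audio data) that minimises the summed Euclidean distance to the
    third audio frame sequence (the first M frames of the second audio data),
    and return (i, Delay) with Delay = delta_t_ms * (i - 1), 1-based frames."""
    third_seq = second_feats[:M]
    best_i, best_cost = 1, float("inf")
    for i in range(1, len(first_feats) - M + 2):
        fourth_seq = first_feats[i - 1:i - 1 + M]
        cost = sum(euclidean(u, v) for u, v in zip(fourth_seq, third_seq))
        if cost < best_cost:
            best_i, best_cost = i, cost
    return best_i, delta_t_ms * (best_i - 1)
```

Compensating the first audio data by the returned delay aligns the two streams before the per-frame similarity comparison.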
In steps S402 and S403, N consecutive audio frames of a preset time period (for example, 0.5 second) of the first audio data are extracted to obtain a first audio frame sequence, and N audio frames aligned with the first audio frame sequence in the second audio data are extracted to obtain a second audio frame sequence.
In step S404, the similarity between each frame of audio frame in the first audio frame sequence and the corresponding audio frame in the second audio frame sequence is calculated. The calculation method is described in step S401 for the similarity calculation.
S405, determining the audio matching degree of the first audio data and the second audio data in the preset time period based on the ratio of the number of audio frames reaching the standard in the second audio frame sequence to N.
If the similarity between an audio frame in the second audio frame sequence and the corresponding audio frame in the first audio frame sequence is greater than the fourth threshold, that audio frame is judged to qualify. Every 0.5 seconds is one moment, and one moment contains 12 (i.e., N) frames; the audio matching degree at that moment is calculated as P_voicesimilarity = (number of qualifying audio frames)/12.
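A sketch of the matching-degree calculation: the mapping from Euclidean distance to a similarity score in (0, 1] is an assumption, since the text allows several similarity measures.

```python
import math


def frame_similarity(u, v):
    # Assumption: convert Euclidean distance into a similarity in (0, 1]
    return 1.0 / (1.0 + math.dist(u, v))


def audio_matching_degree(first_seq, second_seq, fourth_threshold):
    """Ratio of aligned frame pairs whose similarity exceeds the fourth
    threshold; first_seq and second_seq are the aligned N-frame sequences
    of MFCC feature vectors."""
    qualified = sum(
        1 for u, v in zip(first_seq, second_seq)
        if frame_similarity(u, v) > fourth_threshold
    )
    return qualified / len(first_seq)
```

With N = 12 frames per 0.5-second moment, this returns the P_voicesimilarity value for that moment.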
Returning to fig. 3, after the execution of step S302 is completed, step S303 is executed. In step S303, it is determined whether the terminal leaves the conference room based on at least the audio matching degree.
The audio matching degree refers to the similarity degree of two pieces of audio data, the higher the similarity degree of the two pieces of audio data is, the higher the matching degree is, and when the two pieces of audio data are completely the same, the audio matching degree at this time is the highest. If the terminal is located in the conference room, the acquired audio data necessarily includes conference audio data, and the conference audio data has a high degree of similarity, that is, a high degree of audio matching. When the terminal leaves the conference room and the currently acquired audio data does not include conference audio data, the similarity between the acquired audio data and the conference audio data is very low, that is, the audio matching degree is low. Therefore, whether the terminal leaves the conference room can be judged by using the audio matching degree. For example, when the audio matching degree is smaller than the first threshold, it is determined that the terminal leaves the conference room.
Compared with the first method, this method for detecting whether the terminal has left the conference room judges departure more accurately. It avoids the problems of the first method, in which a large number of offline tests are needed to collect and enter thresholds and the signal strength of a wireless network always contains abnormal values, so that determining departure from signal strength alone has very low accuracy. It thus prevents the terminal volume from being turned up to normal playback while the user is still in the conference room, and improves the user experience.
In another embodiment, in order to further improve the accuracy of judging whether the terminal has left the conference room, the terminal determines whether it has left based on the audio matching degree and the received signal strength indication (RSSI) abrupt change result.
The RSSI mutation result represents the probability of mutation of the received signal strength, and is determined based on multiple groups of RSSIs, wherein the multiple groups of RSSIs are the signal strength sent by the communication device in the conference room and received by the terminal in continuous time.
For example, because modern mobile phones are so capable that people rarely part with them, the Bluetooth RSSI values of the conference participants' phones received by the terminal over a continuous period can serve as the multiple groups of RSSI values (as shown in fig. 7). The number X of values in each group may be taken as the larger of the terminal's upper limit of available Bluetooth devices and the number of participants' phones found by searching, or as the number of RSSI values above a preset threshold, so that each RSSI has a certain strength and the influence of abnormal fluctuations in weaker RSSI values is reduced.
In one example, a feature vector Y representing the variation characteristics of the RSSI is determined from the difference between two groups of RSSI values at adjacent times; the feature vector Y is input into a trained prediction model, which outputs the RSSI abrupt-change result.
The feature vector Y is determined from the difference between the two groups of RSSI values at adjacent times. For example, a group of RSSI data is collected every 0.5 seconds, with each 0.5-second interval taken as one time step. Let r_ij denote the RSSI of the j-th Bluetooth signal at time i; the feature vector at time i is then Y_i = [Y_i1 Y_i2 … Y_ij … Y_iX], where each component Y_ij takes the value:

Y_ij = 1 if |r_ij − r_(i−1)j| > Δ_j, and Y_ij = 0 otherwise,

where Δ_j is the maximum allowable fluctuation error (i.e., the fifth threshold) for the j-th Bluetooth RSSI.
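As a minimal sketch of this feature extraction (the sample RSSI values and fluctuation errors below are illustrative assumptions, not values from the patent), each component of Y_i is an indicator of whether the corresponding Bluetooth RSSI moved by more than its allowed fluctuation between adjacent times:

```python
def rssi_feature_vector(prev_rssi, curr_rssi, max_fluctuation):
    # Y_ij = 1 when the j-th Bluetooth RSSI changed by more than its
    # allowed fluctuation error Δ_j between adjacent times, else 0.
    return [
        1 if abs(curr - prev) > delta else 0
        for prev, curr, delta in zip(prev_rssi, curr_rssi, max_fluctuation)
    ]

# Three Bluetooth devices; the second one drops sharply (e.g. the
# terminal moved away from that phone), so only its indicator fires.
prev = [-50, -48, -60]
curr = [-52, -75, -61]
delta = [5, 5, 5]
features = rssi_feature_vector(prev, curr, delta)  # [0, 1, 0]
```

A sequence of such vectors, one per 0.5-second step, would then be fed to the prediction model.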
It will be appreciated that the prediction model learns the associations between inputs and outputs from training data, so that once training is complete it can generate the corresponding output for a given input. The prediction model may also be referred to as a prediction neural network, a learning model, a learning network, and so on. Many prediction models exist, such as long short-term memory (LSTM) networks, deep neural networks (DNN), convolutional neural networks (CNN), recurrent neural networks (RNN), and so on.
Taking an LSTM as the prediction network, as shown in fig. 8, the feature vector Y representing the RSSI variation characteristics is input into the trained LSTM, which outputs the RSSI abrupt-change result. That is, the trained LSTM establishes the mathematical mapping between the feature vector Y and the RSSI abrupt-change result.
The obtained RSSI abrupt-change result and the audio matching degree are then weighted and summed to obtain the probability P that the terminal has left the conference room. When P is greater than or equal to a preset probability value, it is determined that the terminal has left the conference room; otherwise, it is determined that the terminal has not left.
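The patent does not specify the weights or the exact form of the combination; a minimal sketch, assuming equal weights and noting that a low audio matching degree indicates leaving (so it enters as its complement):

```python
def leave_probability(rssi_change_prob, audio_match, w_rssi=0.5, w_audio=0.5):
    # Weighted sum: a high RSSI abrupt-change probability and a LOW audio
    # matching degree both indicate leaving, so the matching degree
    # contributes as (1 - audio_match).
    return w_rssi * rssi_change_prob + w_audio * (1.0 - audio_match)

p = leave_probability(rssi_change_prob=0.9, audio_match=0.1)
left_room = p >= 0.7  # preset probability value (illustrative)
```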
In another embodiment, an image recognition result is also obtained, and whether the terminal has left the conference room is determined based on the audio matching degree, the received signal strength abrupt-change result, and the image recognition result.
For example, the terminal is judged to have left the conference room when the audio matching degree is smaller than the first threshold, the RSSI abrupt-change result is larger than the second threshold, and the image recognition result is smaller than the third threshold; or when the audio matching degree is smaller than the first threshold and the RSSI abrupt-change result is larger than the second threshold; or when the audio matching degree is smaller than the first threshold and the image recognition result is smaller than the third threshold; or when the RSSI abrupt-change result is larger than the second threshold and the image recognition result is smaller than the third threshold. Otherwise, the terminal is judged not to have left. Combining three judgment factors to judge whether the terminal has left the conference room further improves the accuracy of the judgment.
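The four listed combinations are equivalent to requiring that at least two of the three indicators agree that the terminal has left. A minimal sketch of this decision rule (threshold values are illustrative assumptions):

```python
def has_left(audio_match, rssi_change, image_prob, t_audio, t_rssi, t_image):
    # Each indicator casts a "left the room" vote; the four OR-ed
    # conditions in the text reduce to "at least two votes".
    votes = [
        audio_match < t_audio,   # audio no longer matches conference audio
        rssi_change > t_rssi,    # received signal strength changed abruptly
        image_prob < t_image,    # image model says user likely not present
    ]
    return sum(votes) >= 2

# Two or more indicators agree -> left; only one agrees -> still inside.
gone = has_left(0.2, 0.9, 0.1, t_audio=0.5, t_rssi=0.5, t_image=0.5)
inside = has_left(0.2, 0.1, 0.9, t_audio=0.5, t_rssi=0.5, t_image=0.5)
```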
The image recognition result represents the probability that the terminal is located in the conference room. It is obtained from an image of the user and an image recognition model, where the model judges, through image recognition of the user object, the probability that the terminal is located in the specific space. The image recognition model can reside in the terminal: for example, the terminal establishes a communication connection with a camera in the conference room, acquires the images collected by the camera in real time, and inputs them into the image recognition model to obtain the image recognition result it outputs. Of course, the image recognition model may instead reside in a conference auxiliary device, with the terminal establishing a communication connection with that device directly to obtain the image recognition result output by the model. Image recognition itself is mature prior art and is not described here for brevity.
Of course, it may also be determined whether the terminal leaves the conference room based on only the abrupt change result of the received signal strength, for example, when the abrupt change result of the received signal strength is greater than the second threshold, it is determined that the terminal leaves the conference room.
Alternatively, it is determined whether the terminal leaves the conference room based only on the image recognition result, and for example, when the image recognition result is less than a third threshold, it is determined that the terminal leaves the conference room.
Alternatively, whether the terminal has left the conference room is judged based on the received signal strength abrupt-change result and the image recognition result: for example, the two are weighted and summed to determine the probability P2 that the terminal has left the conference room; if P2 is greater than or equal to a preset probability value, it is determined that the terminal has left the conference room. The methods of obtaining the signal strength abrupt-change result and the image recognition result are described above and are not repeated here for brevity.
In another embodiment, as shown in fig. 9, the volume adjustment method further includes: when the terminal has left the conference room, the RSSI at its current position is collected as a first signal fingerprint; thereafter, second signal fingerprints are collected at a preset frequency. When a second signal fingerprint matches the first signal fingerprint, the user is about to re-enter the conference room; at that moment, so as not to disturb the participants inside, the terminal's volume is adjusted to a third volume, which is greater than the first volume and less than the second volume. The third volume is a low volume; for example, when the terminal's volume upper limit is 100, the third volume may be set to about 10, so that the user can still hear the audio without affecting others.
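The patent does not specify the fingerprint-matching criterion; a minimal sketch, assuming each fingerprint is a vector of RSSI values from the same set of transmitters and that matching means the Euclidean distance falls within a tolerance:

```python
def fingerprint_distance(fp_a, fp_b):
    # Euclidean distance between two RSSI fingerprints taken over the
    # same set of transmitters (e.g. Wi-Fi APs / Bluetooth devices).
    return sum((a - b) ** 2 for a, b in zip(fp_a, fp_b)) ** 0.5

def fingerprints_match(fp_a, fp_b, tolerance=6.0):
    # tolerance in dB is an illustrative assumption
    return fingerprint_distance(fp_a, fp_b) <= tolerance

# First fingerprint captured at the door when leaving, vs. later samples.
door = [-45, -60, -70]
near_door = [-47, -58, -71]   # user walking back toward the room
far_away = [-80, -85, -90]    # user elsewhere in the building
```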
It is understood that the above-mentioned signal fingerprints are generated from one or more of the wireless-signal RSSI values collected in the conference room, such as Bluetooth RSSI and Wi-Fi RSSI.
In another embodiment, the volume adjustment method further includes: when the second signal fingerprint matches the first signal fingerprint, whether the terminal has entered the conference room is determined based on at least one of the audio matching degree, the signal strength abrupt-change result, and the image recognition result; if so, the terminal's playing volume is adjusted to the first volume, so as not to affect the other participants in the conference room.
The audio matching degree, the signal intensity mutation result and the image recognition result obtaining method refer to the above description, and are not repeated here for brevity.
Based on the audio matching degree, the implementation mode of determining whether the terminal enters the conference room is as follows: and if the audio matching degree is greater than or equal to the first threshold value, determining that the terminal enters the conference room.
Based on the signal intensity mutation result, the implementation mode for determining whether the terminal enters the conference room is as follows: and if the signal intensity mutation result is larger than a second threshold value, determining that the terminal enters the conference room.
Based on the image recognition result, the implementation mode of determining whether the terminal enters the conference room is as follows: and if the image recognition result is greater than or equal to the third threshold value, determining that the terminal enters the conference room.
Based on the audio matching degree and the signal strength abrupt-change result, whether the terminal has entered the conference room is determined as follows: the audio matching degree and the abrupt-change result are weighted and summed to determine the probability P_j that the terminal has entered the conference room; if P_j is greater than or equal to a first preset probability value, it is determined that the terminal has entered the conference room. The implementations based on the audio matching degree and the image recognition result, or on the signal strength abrupt-change result and the image recognition result, are similar and are not repeated here.
Based on the audio matching degree, the signal strength abrupt-change result, and the image recognition result, whether the terminal has entered the conference room is determined as follows: the terminal is determined to have entered when the audio matching degree is greater than or equal to the first threshold, the received signal strength abrupt-change result is greater than the second threshold, and the image recognition result is greater than or equal to the third threshold; or when the audio matching degree is greater than or equal to the first threshold and the abrupt-change result is greater than the second threshold; or when the audio matching degree is greater than or equal to the first threshold and the image recognition result is greater than or equal to the third threshold; or when the abrupt-change result is greater than the second threshold and the image recognition result is greater than or equal to the third threshold.
It is understood that, when the conference content is played using a conference device in the conference room (e.g., a display with a communication function), the acquired conference content also includes the display content of the conference device, such as text information, image information, and video information. The conference content may further include image or video information captured by a camera in the conference room; for example, to aid understanding, what a participant writes on a blackboard is captured by the camera and played synchronously by the terminal.
In another embodiment, the volume adjustment method further includes: adjusting the fourth threshold, the delay compensation parameter, and the fifth threshold based on the audio matching degree, the received signal strength abrupt-change result, and the image recognition result; that is, the parameters used to determine the audio matching degree and the parameters used to determine the abrupt-change result are tuned and optimized in real time so that they become more accurate.
Specifically, the delay compensation parameter and the fourth threshold are adjusted when the image recognition result is smaller than the third threshold and the received signal strength abrupt-change result is larger than the second threshold, yet the audio matching degree is greater than or equal to the first threshold; or when the image recognition result is greater than or equal to the third threshold and the abrupt-change result is less than or equal to the second threshold, yet the audio matching degree is less than the first threshold.
Referring to fig. 10, if the audio matching degree is smaller than the first threshold, it is checked whether the audio-similarity change rate K_ij at that moment is greater than a preset threshold, i.e., whether K_ij changes significantly. If so, the first audio data and the second audio data are not aligned and the previously computed delay compensation parameter is wrong; the delay compensation parameter is then adjusted until K_ij is less than or equal to the preset threshold, i.e., until K_ij no longer changes significantly. If not, the first audio data and the second audio data are aligned but the similarity requirement is too strict, i.e., the fourth threshold is too large, so that too few audio frames qualify; the fourth threshold is therefore reduced until the audio matching degree is greater than or equal to the first threshold.
Here K_ij denotes the audio-similarity change rate. Taking Euclidean distance as the similarity measure, K_ij is computed as follows: let D_i be the Euclidean distance at the current time (i.e., the second time) and D_(i−1) the Euclidean distance at the previous time (the third time), and let ΔT be the time difference between the two; then K_ij = (D_i − D_(i−1)) / ΔT.
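A minimal sketch of this computation (the distance values and the decision threshold are illustrative assumptions):

```python
def similarity_change_rate(d_curr, d_prev, dt):
    # K = (D_i - D_(i-1)) / ΔT, where D is the Euclidean distance
    # between the two audio streams' features at each time.
    return (d_curr - d_prev) / dt

# Distance jumps from 2.0 to 8.0 over a 0.5 s step: a large change rate
# suggests the streams are misaligned and the delay compensation is wrong.
k = similarity_change_rate(d_curr=8.0, d_prev=2.0, dt=0.5)  # 12.0
misaligned = abs(k) > 5.0  # preset threshold (illustrative)
```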
If the audio matching degree is greater than or equal to the first threshold, the fourth threshold is too small and the similarity requirement too lax, so that too many audio frames qualify; the fourth threshold is therefore increased until the audio matching degree is less than the first threshold. In this way, the decision parameters of the audio matching degree (the delay compensation parameter and the fourth threshold) are dynamically adjusted and optimized in real time, making the judgment of whether the terminal is in the conference room more accurate.
The fifth threshold is adjusted when the image recognition result is smaller than the third threshold and the audio matching degree is smaller than the first threshold, yet the received signal strength abrupt-change result is less than or equal to the second threshold; or when the image recognition result is greater than or equal to the third threshold and the audio matching degree is greater than or equal to the first threshold, yet the abrupt-change result is greater than the second threshold.
Referring to fig. 11, if the RSSI abrupt-change result is greater than the second threshold when the other indicators show the terminal is in the room, the fifth threshold is too small and is increased until the abrupt-change result is less than or equal to the second threshold; conversely, the fifth threshold is too large and is reduced until the abrupt-change result is greater than the second threshold.
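A minimal sketch of this feedback adjustment (the step size, iteration bound, and the toy model relating the fluctuation threshold Δ to the abrupt-change result are all illustrative assumptions):

```python
def tune_fifth_threshold(delta, result_fn, target_exceeds, second_threshold,
                         step=0.5, max_iters=100):
    # Nudge the fluctuation-error threshold Δ until the RSSI abrupt-change
    # result falls on the desired side of the second threshold, mirroring
    # the two correction branches described above.
    for _ in range(max_iters):
        result = result_fn(delta)
        if target_exceeds and result > second_threshold:
            break
        if not target_exceeds and result <= second_threshold:
            break
        # Result too large -> Δ too small, raise it; too small -> lower it.
        delta += -step if target_exceeds else step
    return delta

# Toy model: a larger Δ suppresses more fluctuations, lowering the result.
model = lambda d: max(0.0, 1.0 - d / 10)
tuned = tune_fifth_threshold(3.0, model, target_exceeds=False,
                             second_threshold=0.5)
```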
In this way the decision parameters of the audio matching degree and of the received signal strength learn from each other and are dynamically adjusted, making the judgment of whether the terminal leaves or enters the conference room, and hence the volume adjustment, more accurate, and improving the user experience.
The volume adjustment method is not limited to entering and leaving a conference room; it can be applied to any place where the terminal must stay silent, such as a movie theater or a monitoring room. The following description takes a movie theater as an example.
As shown in fig. 12, a large-screen projection device 12 in a movie theater is playing a movie. When a user needs to leave the auditorium (for example, to go to the bathroom) but does not want to miss the highlights, the user can scan a two-dimensional code on the seat with a portable terminal (for example, a mobile phone) to request that the content shown on the projection device be synchronized to the phone for playing. The terminal determines through a built-in algorithm whether it has left the auditorium: before leaving it plays muted so as not to affect other viewers, and after leaving it automatically adjusts to the normal volume for playing. When the terminal is about to re-enter the auditorium, the volume is automatically turned down, and after it has entered, playback is automatically muted or stopped. The methods for determining whether the terminal has left, has entered, or is about to enter the auditorium are described above and are not repeated here.
Fig. 13 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 13, the electronic device 130 at least includes:
an obtaining module 131, configured to, when the terminal is located in a specific space, obtain multimedia content played/collected by a device in the specific space;
a playing module 132, configured to play the multimedia content, and play audio data of the multimedia content at a first volume;
a detecting module 133, configured to detect whether the terminal leaves the specific space;
an adjusting module 135, configured to, when it is detected that the terminal leaves the specific space, continue to play the multimedia content, and automatically adjust to a second volume to play audio data of the multimedia content, where the second volume is greater than the first volume.
In one possible implementation, the obtaining module 131 is further configured to: acquiring second audio data, wherein the second audio data is audio data currently acquired by the terminal;
the detection module 133 is further configured to: comparing the similarity degree of first audio data and second audio data to determine audio matching degree, wherein the first audio data is the audio data of the multimedia content; determining whether the terminal leaves the specific space based on at least the audio matching degree.
In one possible implementation, the determining whether the terminal leaves the specific space based on at least the audio matching degree includes: and if the audio matching degree is smaller than a first threshold value, determining that the terminal leaves the specific space.
In another possible implementation, the determining whether the terminal leaves a specific space based on at least the audio matching degree includes: determining whether the terminal leaves a specific space at least based on the audio matching degree and the received signal strength mutation result; the received signal strength mutation result represents the probability of the received signal strength mutation, and is determined based on multiple groups of received signal strengths, wherein the multiple groups of received signal strengths are the signal strengths received by the terminal in continuous time and transmitted by communication devices in the specific space.
In another possible implementation, the determining whether the terminal leaves a specific space based on at least the audio matching degree and the result of abrupt change of the received signal strength includes: carrying out weighted summation on the received signal strength mutation result and the audio matching degree, and determining the probability P that the terminal leaves the specific space; and if the P is greater than or equal to a preset probability value, determining that the terminal leaves the specific space.
In another possible implementation, the determining whether the terminal leaves a specific space based on at least the audio matching degree includes: determining whether the terminal leaves a specific space based on the audio matching degree, the received signal mutation result and the image recognition result; the received signal strength mutation result represents the probability of the received signal strength mutation, and is determined based on multiple groups of received signal strengths, wherein the multiple groups of received signal strengths are the signal strengths which are sent by the communication devices in the specific space and received by the terminal in continuous time; the image recognition result represents the probability that the terminal is located in the specific space, and is obtained based on the image of the target object and an image recognition model, and the image recognition model is used for judging the probability that the terminal is located in the specific space based on the image recognition of the target object.
In another possible implementation, the determining whether the terminal leaves a specific space based on the audio matching degree, the received signal mutation result, and the image recognition result includes: the audio matching degree is smaller than a first threshold, the received signal strength mutation result is larger than a second threshold, and the image identification result is smaller than a third threshold; or the audio matching degree is smaller than the first threshold and the received signal strength mutation result is larger than a second threshold; or the audio matching degree is smaller than a first threshold and the image recognition result is smaller than a third threshold; or the received signal strength mutation result is larger than a second threshold value and the image identification result is smaller than a third threshold value; it is judged that the terminal leaves the specific space.
In another possible implementation, the detecting module 133 is further configured to align the first audio data and the second audio data; extracting continuous N frames of audio frames of the first audio data in a preset time period to obtain a first audio frame sequence; extracting N audio frames aligned with the first audio frame sequence in the second audio data to obtain a second audio frame sequence; n is a positive integer greater than or equal to 1; calculating the similarity between each frame of audio frame in the first audio frame sequence and the corresponding audio frame in the second audio frame sequence; and determining the audio matching degree of the first audio data and the second audio data in the preset time period based on the ratio of the number of qualified audio frames in a second audio frame sequence to N, wherein the qualified audio frames are the audio frames of which the similarity between the audio frames in the second audio frame sequence and the corresponding audio frames in the first audio frame sequence is greater than a fourth threshold.
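A minimal sketch of this frame-by-frame comparison (the per-frame feature vectors, cosine similarity measure, and threshold are illustrative assumptions; a real implementation might use e.g. spectral features):

```python
import math

def cosine_similarity(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def matching_degree(seq1, seq2, fourth_threshold=0.9):
    # Ratio of "qualified" frames: frames in the second sequence whose
    # similarity to the aligned frame in the first sequence exceeds the
    # fourth threshold, divided by N.
    qualified = sum(
        1 for f1, f2 in zip(seq1, seq2)
        if cosine_similarity(f1, f2) > fourth_threshold
    )
    return qualified / len(seq1)

# Aligned frame-feature sequences: three frame pairs are close, the last
# pair is unrelated, so 3 of 4 frames qualify.
a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
b = [[1.0, 0.1], [0.0, 2.0], [1.0, 0.9], [-1.0, 2.0]]
deg = matching_degree(a, b)  # 0.75
```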
In another possible implementation, the calculating the similarity between each audio frame in the first audio frame sequence and the corresponding audio frame in the second audio frame sequence includes: respectively extracting and obtaining a first feature vector representing each frame of audio frame in the first audio frame sequence and a second feature vector representing each frame of audio frame in the second audio frame sequence; and determining the similarity of each audio frame in the first audio frame sequence and the corresponding audio frame in the second audio frame sequence based on the similarity of the first feature vector and the second feature vector.
In another possible implementation, the aligning the first audio data and the second audio data includes: extracting continuous M-frame audio frame sequences starting from a first moment in the second audio data to obtain a third audio frame sequence, and extracting continuous M-frame audio frame sequences starting from a plurality of different second moments in the first audio data to obtain a plurality of fourth audio frame sequences, wherein the second moment is greater than or equal to the first moment, and M is a positive integer greater than or equal to 1; determining a delay compensation parameter based on the similarity of the third audio frame sequence and the fourth audio frame sequence; aligning the first audio data and the second audio data based on the delay compensation parameter.
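A minimal sketch of this alignment search (using squared difference over raw samples as the similarity measure and a toy signal; both are illustrative assumptions): a window of the captured audio is slid over candidate start offsets in the reference, and the best-matching offset serves as the delay compensation parameter.

```python
def find_delay(reference, captured, window):
    # Slide a window of the captured (second) audio over candidate
    # offsets in the reference (first) audio and pick the offset with
    # the smallest total squared difference.
    probe = captured[:window]
    best_offset, best_cost = 0, float("inf")
    for offset in range(len(reference) - window + 1):
        cost = sum(
            (reference[offset + k] - probe[k]) ** 2 for k in range(window)
        )
        if cost < best_cost:
            best_offset, best_cost = offset, cost
    return best_offset

# The captured stream lags the reference by 3 samples.
ref = [0, 1, 4, 2, 7, 3, 5, 6, 8, 9]
cap = ref[3:]
delay = find_delay(ref, cap, window=4)  # 3
```

With the delay known, the two streams can be shifted into alignment before the frame-by-frame similarity comparison.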
In another possible implementation, the detecting module 133 is further configured to obtain multiple sets of received signal strengths of consecutive times; determining the abrupt change result of the received signal strength based on the plurality of groups of received signal strengths.
In another possible implementation, the determining the abrupt change in the received signal strength result based on the plurality of sets of received signal strengths includes: determining a characteristic vector representing the variation characteristic of the received signal strength based on the difference value of the two groups of received signal strengths at adjacent moments; and inputting the characteristic vector into a preset prediction model, and determining the intensity mutation result of the received signal.
In another possible implementation, the multiple sets of received signal strengths are multiple sets of X bluetooth received signal strengths, where the multiple sets of X bluetooth received signal strengths are bluetooth signal strengths transmitted by X bluetooth devices in the specific space received by the terminal in continuous time, and X is a positive integer greater than or equal to 3; and one of the multiple groups of received signal strengths is the strength of the bluetooth signal transmitted by the X bluetooth devices in the specific space received by the terminal at the same time.
In another possible implementation, the adjusting module 135 is further configured to: if the terminal is detected to leave the specific space, controlling a front camera of the terminal to be opened; judging whether the preset time for the face image of the target user collected by the front-facing camera is met and whether the display page of the terminal is a designated display page; and if so, automatically adjusting the terminal to the second volume to play the audio data of the multimedia content.
In another possible implementation, the electronic device 130 further includes: an acquisition module 134, configured to, if the terminal leaves the specific space, acquire a first signal fingerprint, where the first signal fingerprint is determined based on the received signal strength collected by the terminal at its current position, and to collect a plurality of second signal fingerprints at a preset frequency; the adjusting module 135 is further configured to, when a second signal fingerprint matches the first signal fingerprint, adjust the volume of the terminal to a third volume, where the third volume is greater than the first volume and less than the second volume.
In another possible implementation, the detection module 133 is further configured to: if the terminal is detected to leave the specific space, continuously detecting whether the terminal enters the specific space; and if the terminal is detected to enter the specific space, automatically adjusting the volume of the terminal to the first volume to play the audio data of the multimedia content.
In another possible implementation, the multimedia content further includes display content of devices in the particular space.
In another possible implementation, the electronic device 130 further includes a parameter optimization module 136 configured to adjust the fourth threshold, the delay compensation parameter, and the fifth threshold according to the audio matching degree, the received signal mutation result, and the image recognition result.
In another possible implementation, the adjusting the fourth threshold, the delay compensation parameter, and the fifth threshold according to the audio matching degree, the received signal mutation result, and the image recognition result includes: when the image identification result is smaller than a third threshold value and the received signal strength mutation result is larger than a second threshold value, the audio matching degree is larger than or equal to the first threshold value; or, the image recognition result is greater than or equal to a third threshold value, the received signal strength mutation result is less than or equal to a second threshold value, and the audio matching degree is less than a first threshold value; adjusting the delay compensation parameter and a fourth threshold;
when the image recognition result is smaller than a third threshold value and the audio matching degree is smaller than a first threshold value, and the received signal strength mutation result is smaller than or equal to a second threshold value; or, when the image recognition result is greater than or equal to the third threshold and the audio matching degree is greater than or equal to the first threshold, and the received signal strength mutation result is greater than the second threshold; the fifth threshold is adjusted. The mutual learning and dynamic adjustment of the decision parameter of the audio matching degree and the decision parameter of the received signal strength mutation result are realized, and the accuracy of judging whether the terminal leaves a specific space is improved.
In another possible implementation, when the image recognition result is less than a third threshold and the received signal strength mutation result is greater than a second threshold, the audio matching degree is greater than or equal to the first threshold; or, the image recognition result is greater than or equal to a third threshold value, the received signal strength mutation result is less than or equal to a second threshold value, and the audio matching degree is less than a first threshold value; adjusting the delay compensation parameter and a fourth threshold, including: when the image identification result is smaller than a third threshold value and the received signal strength mutation result is larger than a second threshold value, the audio matching degree is larger than or equal to the first threshold value; increasing the fourth threshold value until the audio matching degree is smaller than the first threshold value;
when the image recognition result is greater than or equal to a third threshold value and the received signal strength mutation result is less than or equal to a second threshold value, and the audio matching degree is less than a first threshold value; judging whether the audio similarity change rate is greater than a preset threshold value or not; if so, adjusting the time delay compensation parameter until the audio similarity change rate is smaller than a preset threshold value; if not, reducing the fourth threshold value until the audio matching degree is greater than or equal to the first threshold value;
the audio similarity change rate is determined as the ratio of the difference between the similarity of an audio frame of the first audio data and an audio frame of the second audio data at the second time and the corresponding similarity at the third time, to the time difference between the second time and the third time, where the second time is a time at which the image recognition result is greater than or equal to the third threshold, the received signal strength abrupt-change result is less than or equal to the second threshold, and the audio matching degree is less than the first threshold, and the third time is a time adjacent to the second time.
In another possible implementation, when the image recognition result is smaller than the third threshold and the audio matching degree is smaller than the first threshold, the received signal strength abrupt change result is smaller than or equal to the second threshold; or, when the image recognition result is greater than or equal to the third threshold and the audio matching degree is greater than or equal to the first threshold, and the received signal strength mutation result is greater than the second threshold; adjusting the fifth threshold value, including: when the image recognition result is smaller than a third threshold value and the audio matching degree is smaller than a first threshold value, and the received signal strength mutation result is smaller than or equal to a second threshold value; reducing the fifth threshold value until the sudden change result of the received signal strength is greater than a second threshold value; when the image recognition result is greater than or equal to the third threshold and the audio matching degree is greater than or equal to the first threshold, and the received signal strength mutation result is greater than the second threshold; the fifth threshold is increased until the result of the abrupt change in the received signal strength is less than or equal to the second threshold.
The electronic device 130 according to this embodiment of the present application may correspondingly perform the methods described in the embodiments of the present application, and the foregoing and other operations and/or functions of the modules in the electronic device 130 are respectively intended to implement the corresponding flows of the methods in fig. 2 to 11; for brevity, details are not described herein again.
It should be noted that the above-described embodiments are merely illustrative. The modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical modules; they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the apparatus embodiments provided in the present application, the connection relationship between modules indicates a communication connection between them, which may be specifically implemented as one or more communication buses or signal lines.
The present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform any of the methods described above.
The present application also provides a computer program or computer program product comprising instructions which, when executed, cause a computer to perform any of the methods described above.
The application also provides a terminal, which comprises a memory and a processor, wherein the memory stores executable codes, and the processor executes the executable codes to realize any one of the methods.
Fig. 14 is a schematic structural diagram of a terminal provided in the present application.
As shown in fig. 14, the terminal 140 includes a processor 141, a memory 142, a bus 143, a microphone 144, a speaker 145, a display 146, and a communication interface 147. The processor 141, the memory 142, the microphone 144, the speaker 145, the display 146, and the communication interface 147 may communicate with each other via the bus 143, or may communicate with each other by other means such as wireless transmission. The microphone 144 may collect audio data, such as first audio data; speaker 145 may play audio data, such as second audio data; the display 146 can display multimedia content, such as meeting image content, meeting text content, meeting video content, and the like, displayed by meeting large-screen equipment in a meeting room; the communication interface 147 is used for communication connection with other communication devices, such as a conference large-screen device or a projection large-screen device in a projection hall; the memory 142 stores executable program codes, and the processor 141 may call the program codes stored in the memory 142 to perform the volume adjustment method in the aforementioned method embodiment.
It should be understood that, in this embodiment of the present application, the processor 141 may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, any conventional processor, or the like.
The memory 142 may include a read-only memory and a random access memory, and provides instructions and data to the processor 141. Memory 142 may also include non-volatile random access memory. For example, the memory 142 may also store a training data set.
The memory 142 may be volatile memory or non-volatile memory, or may include both volatile and non-volatile memory. The non-volatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which is used as an external cache. By way of example but not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM).
The bus 143 may include a power bus, a control bus, a status signal bus, and the like, in addition to the data bus. However, for clarity of illustration, the various buses are all labeled as the bus 143 in the figure.
It should be understood that the terminal 140 according to this embodiment of the present application may correspond to the electronic device in the embodiments of the present application, and may correspond to the execution body of the methods shown in fig. 2 to 11. The foregoing and other operations and/or functions of the devices in the terminal 140 are respectively intended to implement the corresponding flows of the methods in fig. 2 to 11; for brevity, details are not described herein again.
Those of ordinary skill in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the components and steps of the examples have been described above in general terms of their functions. Whether these functions are performed in hardware or software depends on the particular application and the design constraints of the technical solution. Skilled artisans may implement the described functionality in different ways for each particular application, but such implementations should not be considered beyond the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, a software module executed by a processor, or a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The foregoing embodiments further describe the objects, technical solutions and advantages of the present application in detail. It should be understood that the foregoing are merely exemplary embodiments of the present application and are not intended to limit the scope of the present application; any modification, equivalent substitution, improvement or the like made within the spirit and principle of the present application shall fall within the scope of the present application.

Claims (29)

1. A volume adjusting method, applied to a terminal, the method comprising:
when the terminal is located in a specific space, the terminal acquires multimedia content played/collected by equipment in the specific space;
the terminal plays the multimedia content and plays audio data of the multimedia content at a first volume;
detecting whether the terminal leaves the specific space;
and when the terminal is detected to leave the specific space, the terminal continues to play the multimedia content and automatically adjusts the volume to a second volume to play the audio data of the multimedia content, wherein the second volume is larger than the first volume.
2. The method according to claim 1, wherein the detecting whether the terminal leaves the specific space comprises:
acquiring second audio data, wherein the second audio data is audio data currently acquired by the terminal;
comparing the similarity degree of first audio data and second audio data to determine audio matching degree, wherein the first audio data is the audio data of the multimedia content;
determining whether the terminal leaves the specific space based on at least the audio matching degree.
3. The method according to claim 2, wherein the determining whether the terminal leaves the specific space based on at least the audio matching degree comprises:
and if the audio matching degree is smaller than a first threshold value, determining that the terminal leaves the specific space.
4. The method according to claim 2, wherein the determining whether the terminal leaves the specific space based on at least the audio matching degree comprises:
determining whether the terminal leaves the specific space based on at least the audio matching degree and a received signal strength mutation result;
the received signal strength mutation result represents the probability of the received signal strength mutation, and is determined based on multiple groups of received signal strengths, wherein the multiple groups of received signal strengths are the signal strengths received by the terminal in continuous time and transmitted by communication devices in the specific space.
5. The method according to claim 4, wherein the determining whether the terminal leaves the specific space based on at least the audio matching degree and the received signal strength mutation result comprises:
carrying out weighted summation on the received signal strength mutation result and the audio matching degree, and determining the probability P that the terminal leaves the specific space;
and if the P is greater than or equal to a preset probability value, determining that the terminal leaves the specific space.
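A minimal sketch of the weighted summation in this claim; the weights, the preset probability value, and the use of (1 − matching degree) as the audio term are assumptions, since the claim does not fix them:

```python
def leave_probability(mutation_result, audio_match,
                      w_mutation=0.6, w_audio=0.4):
    """Weighted summation of the two indicators. A high mutation result
    and a low audio matching degree both point to 'left the room', so
    the matching degree enters as (1 - audio_match)."""
    return w_mutation * mutation_result + w_audio * (1.0 - audio_match)

def has_left(mutation_result, audio_match, preset_probability=0.7):
    # The terminal is judged to have left when P meets the preset value.
    return leave_probability(mutation_result, audio_match) >= preset_probability
```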
6. The method according to claim 2, wherein the determining whether the terminal leaves the specific space based on at least the audio matching degree comprises:
determining whether the terminal leaves the specific space based on the audio matching degree, the received signal strength mutation result and the image recognition result;
the received signal strength mutation result represents the probability of the received signal strength mutation, and is determined based on multiple groups of received signal strengths, wherein the multiple groups of received signal strengths are the signal strengths which are sent by the communication devices in the specific space and received by the terminal in continuous time;
the image recognition result represents the probability that the terminal is located in the specific space, and is obtained based on the image of the target object and an image recognition model, and the image recognition model is used for judging the probability that the terminal is located in the specific space based on the image recognition of the target object.
7. The method according to claim 6, wherein the determining whether the terminal leaves the specific space based on the audio matching degree, the received signal strength mutation result and the image recognition result comprises:
judging that the terminal leaves the specific space when any one of the following conditions holds:
the audio matching degree is smaller than the first threshold, the received signal strength mutation result is larger than the second threshold, and the image recognition result is smaller than the third threshold;
or the audio matching degree is smaller than the first threshold and the received signal strength mutation result is larger than the second threshold;
or the audio matching degree is smaller than the first threshold and the image recognition result is smaller than the third threshold;
or the received signal strength mutation result is larger than the second threshold and the image recognition result is smaller than the third threshold.
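The two-out-of-three decision rule of this claim can be expressed as a predicate; `t1`, `t2`, `t3` correspond to the first, second and third thresholds, and the variable names are illustrative:

```python
def left_specific_space(audio_match, mutation, image, t1, t2, t3):
    """Judge 'left the space' when all three indicators agree, or when
    any two of them do, mirroring the or-conditions of the claim."""
    a = audio_match < t1   # playback audio no longer heard at the terminal
    b = mutation > t2      # received signal strength changed abruptly
    c = image < t3         # camera no longer recognises the specific space
    return (a and b and c) or (a and b) or (a and c) or (b and c)
```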
8. The method according to any one of claims 2-7, wherein the comparing the similarity degree of the first audio data and the second audio data to determine the audio matching degree comprises:
aligning the first audio data and the second audio data;
extracting continuous N frames of audio frames of the first audio data in a preset time period to obtain a first audio frame sequence;
extracting N audio frames aligned with the first audio frame sequence in the second audio data to obtain a second audio frame sequence; n is a positive integer greater than or equal to 1;
calculating the similarity between each audio frame in the first audio frame sequence and the corresponding audio frame in the second audio frame sequence;
and determining the audio matching degree of the first audio data and the second audio data in the preset time period based on the ratio of the number of qualified audio frames in a second audio frame sequence to N, wherein the qualified audio frames are the audio frames of which the similarity between the audio frames in the second audio frame sequence and the corresponding audio frames in the first audio frame sequence is greater than a fourth threshold.
9. The method of claim 8, wherein the calculating the similarity between each audio frame in the first audio frame sequence and the corresponding audio frame in the second audio frame sequence comprises:
extracting a first feature vector representing each audio frame in the first audio frame sequence and a second feature vector representing each audio frame in the second audio frame sequence;
and determining the similarity of each audio frame in the first audio frame sequence and the corresponding audio frame in the second audio frame sequence based on the similarity of the first feature vector and the second feature vector.
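Claims 8 and 9 together amount to: represent each aligned frame pair by feature vectors, score them with a vector similarity, and take the fraction of frames whose similarity clears the fourth threshold. A sketch using cosine similarity (the actual feature extraction, e.g. MFCCs, is an assumption the claim leaves open):

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def audio_matching_degree(first_frames, second_frames, fourth_threshold=0.8):
    """Fraction of aligned frame pairs whose feature-vector similarity
    exceeds the fourth threshold (the 'qualified' frames of claim 8)."""
    qualified = sum(
        1 for f1, f2 in zip(first_frames, second_frames)
        if cosine_similarity(f1, f2) > fourth_threshold
    )
    return qualified / len(first_frames)
```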
10. The method of claim 8, wherein the aligning the first audio data and the second audio data comprises:
extracting continuous M-frame audio frame sequences starting from a first moment in the second audio data to obtain a third audio frame sequence, and extracting continuous M-frame audio frame sequences starting from a plurality of different second moments in the first audio data to obtain a plurality of fourth audio frame sequences, wherein the second moment is greater than or equal to the first moment, and M is a positive integer greater than or equal to 1;
determining a delay compensation parameter based on the similarity of the third audio frame sequence and the fourth audio frame sequence;
aligning the first audio data and the second audio data based on the delay compensation parameter.
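The alignment in claim 10 is essentially a search over candidate delays. A sketch over scalar frames, with negated absolute difference standing in for the unspecified frame similarity:

```python
def delay_compensation(first, second, m):
    """Slide an M-frame window over the first (reference) audio and pick
    the offset most similar to the first M frames of the second
    (recorded) audio; that offset is the delay compensation parameter."""
    third = second[:m]                        # third audio frame sequence
    best_offset, best_score = 0, float("-inf")
    for offset in range(len(first) - m + 1):  # candidate second moments
        fourth = first[offset:offset + m]     # a fourth audio frame sequence
        score = -sum(abs(a - b) for a, b in zip(third, fourth))
        if score > best_score:
            best_offset, best_score = offset, score
    return best_offset
```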
11. The method of any of claims 4-10, further comprising:
acquiring multiple groups of received signal intensities of continuous time;
determining the abrupt change result of the received signal strength based on the plurality of groups of received signal strengths.
12. The method of claim 11, wherein the determining the rssi abrupt change result based on the plurality of sets of rssi comprises:
determining a characteristic vector representing the variation characteristic of the received signal strength based on the difference value of the two groups of received signal strengths at adjacent moments;
and inputting the characteristic vector into a preset prediction model, and determining the intensity mutation result of the received signal.
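A sketch of claim 12's feature construction; the simple drop-counting rule below merely stands in for the "preset prediction model", which the patent leaves unspecified (a trained classifier would replace it):

```python
def rssi_change_features(rssi_groups):
    """Feature vector of claim 12: element-wise differences between the
    received signal strengths of adjacent moments (one entry per device)."""
    return [
        [cur_v - prev_v for prev_v, cur_v in zip(prev, cur)]
        for prev, cur in zip(rssi_groups, rssi_groups[1:])
    ]

def mutation_result(rssi_groups, drop_db=15.0):
    # Fraction of devices whose strength just dropped by more than drop_db.
    diffs = rssi_change_features(rssi_groups)[-1]
    return sum(1 for d in diffs if d < -drop_db) / len(diffs)
```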
13. The method according to claim 11 or 12, wherein the multiple groups of received signal strengths are multiple groups of X bluetooth received signal strengths, which are the strengths of bluetooth signals transmitted by X bluetooth devices in the specific space and received by the terminal in consecutive time, and X is a positive integer greater than or equal to 3;
and one group of the multiple groups of received signal strengths comprises the strengths of the bluetooth signals sent by the X bluetooth devices in the specific space and received by the terminal at the same time.
14. The method according to any one of claims 1 to 13, wherein the detecting that the terminal leaves the specific space, the terminal continues to play the multimedia content and automatically adjusts to a second volume to play audio data of the multimedia content comprises:
if the terminal is detected to leave the specific space, controlling a front camera of the terminal to be opened;
judging whether the duration for which the front-facing camera collects the face image of the target user reaches a preset time and whether the display page of the terminal is a designated display page;
and if so, automatically adjusting the terminal to the second volume to play the audio data of the multimedia content.
15. The method of any of claims 1-14, further comprising:
if it is detected that the terminal leaves the specific space, acquiring a first signal fingerprint, wherein the first signal fingerprint is determined based on the received signal strength acquired by the terminal at the current position;
collecting a plurality of second signal fingerprints at a preset frequency; and
when a second signal fingerprint matches the first signal fingerprint, automatically adjusting the volume of the terminal to a third volume to play the audio data of the multimedia content, wherein the third volume is greater than the first volume and less than the second volume.
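One way to picture the fingerprint match in this claim; representing a fingerprint as `{transmitter_id: RSSI}` and the tolerance value are assumptions, since the claim does not define the matching rule:

```python
def fingerprints_match(fp_a, fp_b, tolerance_db=6.0):
    """Two fingerprints match when every transmitter they share differs
    by at most tolerance_db in received strength."""
    shared = set(fp_a) & set(fp_b)
    if not shared:
        return False
    return all(abs(fp_a[k] - fp_b[k]) <= tolerance_db for k in shared)

def playback_volume(matched, first_volume, second_volume):
    # Back at the remembered position: a third volume between the two.
    return (first_volume + second_volume) / 2 if matched else second_volume
```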
16. The method of any one of claims 1-15, further comprising:
if the terminal is detected to leave the specific space, continuously detecting whether the terminal enters the specific space;
and if the terminal is detected to enter the specific space, automatically adjusting the volume of the terminal to the first volume to play the audio data of the multimedia content.
17. The method of any of claims 1-16, wherein the multimedia content further comprises display content of devices in the particular space.
18. An electronic device, comprising:
the terminal comprises an acquisition module, a storage module and a processing module, wherein the acquisition module is used for acquiring multimedia contents played/collected by equipment in a specific space when the terminal is positioned in the specific space;
the playing module is used for playing the multimedia content and playing the audio data of the multimedia content at a first volume;
a detection module, configured to detect whether the terminal leaves the specific space;
and the adjusting module is used for continuously playing the multimedia content by the terminal when detecting that the terminal leaves the specific space, and automatically adjusting the multimedia content to a second volume to play the audio data of the multimedia content, wherein the second volume is larger than the first volume.
19. The electronic device of claim 18, wherein the acquisition module is further configured to: acquire second audio data, wherein the second audio data is audio data currently collected by the terminal;
the detection module is further configured to: comparing the similarity degree of first audio data and second audio data to determine audio matching degree, wherein the first audio data is the audio data of the multimedia content;
determining whether the terminal leaves the specific space based on at least the audio matching degree.
20. The apparatus of claim 19, wherein the determining whether the terminal leaves the specific space based on at least an audio matching degree comprises:
and if the audio matching degree is smaller than a first threshold value, determining that the terminal leaves the specific space.
21. The apparatus of claim 19, wherein the determining whether the terminal leaves the specific space based on at least an audio matching degree comprises:
determining whether the terminal leaves the specific space based on at least the audio matching degree and a received signal strength mutation result;
the received signal strength mutation result represents the probability of the received signal strength mutation, and is determined based on multiple groups of received signal strengths, wherein the multiple groups of received signal strengths are the signal strengths received by the terminal in continuous time and transmitted by communication devices in the specific space.
22. The apparatus according to claim 21, wherein the determining whether the terminal leaves the specific space based on at least the audio matching degree and the abrupt change result of the received signal strength comprises:
carrying out weighted summation on the received signal strength mutation result and the audio matching degree, and determining the probability P that the terminal leaves the specific space;
and if the P is greater than or equal to a preset probability value, determining that the terminal leaves the specific space.
23. The apparatus of claim 19, wherein the determining whether the terminal leaves the specific space based on at least an audio matching degree comprises:
determining whether the terminal leaves the specific space based on the audio matching degree, the received signal strength mutation result and the image recognition result;
the received signal strength mutation result represents the probability of the received signal strength mutation, and is determined based on multiple groups of received signal strengths, wherein the multiple groups of received signal strengths are the signal strengths which are sent by the communication devices in the specific space and received by the terminal in continuous time;
the image recognition result represents the probability that the terminal is located in the specific space, and is obtained based on the image of the target object and an image recognition model, and the image recognition model is used for judging the probability that the terminal is located in the specific space based on the image recognition of the target object.
24. The apparatus of claim 23, wherein the determining whether the terminal leaves the specific space based on the audio matching degree, the received signal strength mutation result and the image recognition result comprises:
judging that the terminal leaves the specific space when any one of the following conditions holds:
the audio matching degree is smaller than the first threshold, the received signal strength mutation result is larger than the second threshold, and the image recognition result is smaller than the third threshold;
or the audio matching degree is smaller than the first threshold and the received signal strength mutation result is larger than the second threshold;
or the audio matching degree is smaller than the first threshold and the image recognition result is smaller than the third threshold;
or the received signal strength mutation result is larger than the second threshold and the image recognition result is smaller than the third threshold.
25. The apparatus of any of claims 18-24, wherein the adjustment module is further configured to:
if the terminal is detected to leave the specific space, controlling a front camera of the terminal to be opened;
judging whether the duration for which the front-facing camera collects the face image of the target user reaches a preset time and whether the display page of the terminal is a specified display page;
and if so, automatically adjusting the terminal to a second volume to play the audio data of the multimedia content.
26. The apparatus according to any one of claims 18-25, further comprising:
the acquisition module is used for acquiring a first signal fingerprint if the terminal is detected to leave the specific space, wherein the first signal fingerprint is determined based on the strength of a received signal acquired by the terminal at the current position;
collecting a plurality of second signal fingerprints at a preset frequency;
the adjusting module is further configured to automatically adjust the volume of the terminal to a third volume to play the audio data of the multimedia content when the second signal fingerprint matches the first signal fingerprint, where the third volume is greater than the first volume and less than the second volume.
27. The apparatus of any of claims 18-26, wherein the detection module is further configured to: if it is detected that the terminal leaves the specific space, continue to detect whether the terminal enters the specific space;
and if the terminal is detected to enter the specific space, automatically adjusting the volume of the terminal to the first volume to play the audio data of the multimedia content.
28. A terminal comprising a memory and a processor, wherein the memory has stored therein executable code, and the processor executes the executable code to implement the method of any one of claims 1-17.
29. A computer-readable storage medium, on which a computer program is stored, which, when the computer program is executed in a computer, causes the computer to carry out the method of any one of claims 1-17.
CN202011638242.5A 2020-12-31 2020-12-31 Volume adjusting method, terminal and readable storage medium Active CN114697445B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011638242.5A CN114697445B (en) 2020-12-31 2020-12-31 Volume adjusting method, terminal and readable storage medium
PCT/CN2021/136096 WO2022143040A1 (en) 2020-12-31 2021-12-07 Volume adjusting method, electronic device, terminal, and storage medium


Publications (2)

Publication Number Publication Date
CN114697445A true CN114697445A (en) 2022-07-01
CN114697445B CN114697445B (en) 2023-09-01

Family

ID=82135153

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011638242.5A Active CN114697445B (en) 2020-12-31 2020-12-31 Volume adjusting method, terminal and readable storage medium

Country Status (2)

Country Link
CN (1) CN114697445B (en)
WO (1) WO2022143040A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116208700A (en) * 2023-04-25 2023-06-02 深圳市华卓智能科技有限公司 Control method and system for communication between mobile phone and audio equipment

Citations (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102750258A (en) * 2011-03-30 2012-10-24 微软公司 Mobile device configuration based on status and location
US20130024018A1 (en) * 2011-07-22 2013-01-24 Htc Corporation Multimedia control method and multimedia control system
CN103761063A (en) * 2013-12-19 2014-04-30 北京百度网讯科技有限公司 Method and device for controlling audio output volume in playing device
CN103888579A (en) * 2012-12-21 2014-06-25 中国移动通信集团广西有限公司 Method and device for adjusting beep volume of mobile terminal, and mobile terminal
US20140223500A1 (en) * 2013-02-04 2014-08-07 Samsung Electronics Co., Ltd. Method and system for transmitting wirelessly video in portable terminal
CN104363563A (en) * 2014-11-24 2015-02-18 广东欧珀移动通信有限公司 Network-based mobile terminal voice control method and system
CN104616675A (en) * 2013-11-05 2015-05-13 华为终端有限公司 Method for switching playing equipment and mobile terminal
CN105592195A (en) * 2016-01-20 2016-05-18 努比亚技术有限公司 Volume adaptive adjusting method and apparatus
CN105607735A (en) * 2015-12-17 2016-05-25 深圳Tcl数字技术有限公司 Output controlling system and method of multimedia equipment
CN105827797A (en) * 2015-07-29 2016-08-03 维沃移动通信有限公司 Method for adjusting volume of electronic device and electronic device
CN106453860A (en) * 2016-09-26 2017-02-22 广东小天才科技有限公司 Switching method and device of sound mode and user terminal
CN107135308A (en) * 2017-04-26 2017-09-05 努比亚技术有限公司 Multimedia file plays audio control method, mobile terminal and readable storage medium storing program for executing
CN107172295A (en) * 2017-06-21 2017-09-15 上海斐讯数据通信技术有限公司 One kind control mobile terminal mute method and mobile terminal
CN107431860A (en) * 2015-03-12 2017-12-01 四达时代通讯网络技术有限公司 Audio system based on location-based service
CN107566888A (en) * 2017-09-12 2018-01-09 中广热点云科技有限公司 The audio setting method of multiple multimedia play equipments, multimedia play system
CN108156328A (en) * 2018-01-24 2018-06-12 维沃移动通信有限公司 The switching method and device of contextual model
CN108647005A (en) * 2018-05-15 2018-10-12 努比亚技术有限公司 Audio frequency playing method, mobile terminal and computer readable storage medium
CN108933914A (en) * 2017-05-24 2018-12-04 中兴通讯股份有限公司 A kind of method and system carrying out video conference using mobile terminal
US20190313158A1 (en) * 2016-12-30 2019-10-10 Arris Enterprises Llc Method and apparatus for controlling set top box volume based on mobile device events
US20200162524A1 (en) * 2017-07-11 2020-05-21 Zte Corporation Control method of multimedia conference terminal and multimedia conference server

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6330022B1 (en) * 1998-11-05 2001-12-11 Lucent Technologies Inc. Digital processing apparatus and method to support video conferencing in variable contexts
JP3661768B2 (en) * 2000-10-04 2005-06-22 インターナショナル・ビジネス・マシーンズ・コーポレーション Audio equipment and computer equipment
US20100153497A1 (en) * 2008-12-12 2010-06-17 Nortel Networks Limited Sharing expression information among conference participants
CN105991710A (en) * 2015-02-10 2016-10-05 黄金富知识产权咨询(深圳)有限公司 Brief-report synchronous display content method and corresponding system
CN109068088A (en) * 2018-09-20 2018-12-21 明基智能科技(上海)有限公司 Meeting exchange method, apparatus and system based on user's portable terminal

Patent Citations (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102750258A (en) * 2011-03-30 2012-10-24 微软公司 Mobile device configuration based on status and location
US20130024018A1 (en) * 2011-07-22 2013-01-24 Htc Corporation Multimedia control method and multimedia control system
CN103888579A (en) * 2012-12-21 2014-06-25 中国移动通信集团广西有限公司 Method and device for adjusting beep volume of mobile terminal, and mobile terminal
US20140223500A1 (en) * 2013-02-04 2014-08-07 Samsung Electronics Co., Ltd. Method and system for transmitting wirelessly video in portable terminal
CN104616675A (en) * 2013-11-05 2015-05-13 华为终端有限公司 Method for switching playing equipment and mobile terminal
CN103761063A (en) * 2013-12-19 2014-04-30 北京百度网讯科技有限公司 Method and device for controlling audio output volume in playing device
CN104363563A (en) * 2014-11-24 2015-02-18 广东欧珀移动通信有限公司 Network-based mobile terminal voice control method and system
CN107431860A (en) * 2015-03-12 2017-12-01 四达时代通讯网络技术有限公司 Audio system based on location-based service
CN105827797A (en) * 2015-07-29 2016-08-03 维沃移动通信有限公司 Method for adjusting volume of electronic device and electronic device
CN105607735A (en) * 2015-12-17 2016-05-25 深圳Tcl数字技术有限公司 Output controlling system and method of multimedia equipment
CN105592195A (en) * 2016-01-20 2016-05-18 努比亚技术有限公司 Volume adaptive adjusting method and apparatus
CN106453860A (en) * 2016-09-26 2017-02-22 广东小天才科技有限公司 Switching method and device of sound mode and user terminal
US20190313158A1 (en) * 2016-12-30 2019-10-10 Arris Enterprises Llc Method and apparatus for controlling set top box volume based on mobile device events
CN107135308A (en) * 2017-04-26 2017-09-05 努比亚技术有限公司 Multimedia file plays audio control method, mobile terminal and readable storage medium storing program for executing
CN108933914A (en) * 2017-05-24 2018-12-04 中兴通讯股份有限公司 A kind of method and system carrying out video conference using mobile terminal
CN107172295A (en) * 2017-06-21 2017-09-15 上海斐讯数据通信技术有限公司 One kind control mobile terminal mute method and mobile terminal
US20200162524A1 (en) * 2017-07-11 2020-05-21 Zte Corporation Control method of multimedia conference terminal and multimedia conference server
CN107566888A (en) * 2017-09-12 2018-01-09 中广热点云科技有限公司 The audio setting method of multiple multimedia play equipments, multimedia play system
CN108156328A (en) * 2018-01-24 2018-06-12 维沃移动通信有限公司 Method and device for switching profile modes
CN108647005A (en) * 2018-05-15 2018-10-12 努比亚技术有限公司 Audio playing method, mobile terminal, and computer-readable storage medium

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116208700A (en) * 2023-04-25 2023-06-02 深圳市华卓智能科技有限公司 Control method and system for communication between mobile phone and audio equipment

Also Published As

Publication number Publication date
WO2022143040A1 (en) 2022-07-07
CN114697445B (en) 2023-09-01

Similar Documents

Publication Publication Date Title
US11023690B2 (en) Customized output to optimize for user preference in a distributed system
US10743107B1 (en) Synchronization of audio signals from distributed devices
EP3963576B1 (en) Speaker attributed transcript generation
US11875796B2 (en) Audio-visual diarization to identify meeting attendees
US20210407516A1 (en) Processing Overlapping Speech from Distributed Devices
CN102254556B (en) Estimating a Listener's Ability To Understand a Speaker, Based on Comparisons of Their Styles of Speech
US10812921B1 (en) Audio stream processing for distributed device meeting
Schmalenstroeer et al. Online diarization of streaming audio-visual data for smart environments
JP6562790B2 (en) Dialogue device and dialogue program
CN115482830B (en) Voice enhancement method and related equipment
WO2022253003A1 (en) Speech enhancement method and related device
US11468895B2 (en) Distributed device meeting initiation
CN114697445B (en) Volume adjusting method, terminal and readable storage medium
US20100266112A1 (en) Method and device relating to conferencing
CN109102813B (en) Voiceprint recognition method and device, electronic equipment and storage medium
US20190272828A1 (en) Speaker estimation method and speaker estimation device
US11875800B2 (en) Talker prediction method, talker prediction device, and communication system
CN114694685A (en) Voice quality evaluation method, device and storage medium
CN114650492A (en) Wireless personal communication via a hearing device
CN117135305A (en) Teleconference implementation method, device and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant