
Identifying Intrusive Mobile Apps using Peer Group Analysis

Posted by Martin Pelikan, Giles Hogben, and Ulfar Erlingsson of Google's
Security and Privacy team


Mobile apps entertain and assist us, make it easy to communicate with friends
and family, and provide tools ranging from maps to electronic wallets. But these
apps could also seek more device information than they need to do their job,
such as personal data and sensor data from components, like cameras and GPS
trackers.



To protect our users and help developers navigate this complex environment,
Google analyzes privacy and security signals for each app in Google Play. We
then compare that app to other apps with similar features, known as
functional peers. Creating peer groups allows us to calibrate our
estimates of users' expectations and set adequate boundaries of behaviors that
may be considered unsafe or intrusive. This process helps detect apps that
collect or send sensitive data without a clear need, and makes it easier for
users to find apps that provide the right functionality and respect their
privacy. For example, most coloring book apps don't need to know a user's
precise location to function, and this can be established by analyzing other
coloring book apps. By contrast, mapping and navigation apps need to know a
user's location, and often require GPS sensor access.
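The peer comparison above can be illustrated with a small sketch. It is only a toy under stated assumptions, not the analysis Google runs: the app names, peer group labels, permission names, and the `max_peer_fraction` threshold are all hypothetical, and a permission is flagged simply when few apps in the same peer group request it.

```python
from collections import Counter

# Hypothetical toy data: app name -> (peer group, requested permissions).
APPS = {
    "color_fun": ("coloring_book", {"STORAGE"}),
    "paint_kids": ("coloring_book", {"STORAGE"}),
    "doodle_pad": ("coloring_book", {"STORAGE"}),
    "crayon_box": ("coloring_book", {"STORAGE", "FINE_LOCATION"}),
    "city_maps": ("navigation", {"FINE_LOCATION", "STORAGE"}),
    "route_now": ("navigation", {"FINE_LOCATION"}),
    "trail_nav": ("navigation", {"FINE_LOCATION"}),
}


def flag_unusual_permissions(apps, max_peer_fraction=0.3):
    """Return {app: sorted permissions that few apps in its peer group request}."""
    group_sizes = Counter(group for group, _ in apps.values())
    usage = Counter()  # (group, permission) -> number of apps requesting it
    for group, perms in apps.values():
        for perm in perms:
            usage[(group, perm)] += 1

    flagged = {}
    for name, (group, perms) in apps.items():
        # A permission is "rare" when at most max_peer_fraction of the
        # peer group (the app itself included) requests it.
        rare = sorted(
            p for p in perms
            if usage[(group, p)] / group_sizes[group] <= max_peer_fraction
        )
        if rare:
            flagged[name] = rare
    return flagged


FLAGGED = flag_unusual_permissions(APPS)
print(FLAGGED)
```

In this toy data only the coloring book app that asks for precise location stands out, while the same permission is unremarkable among the navigation apps. A flag here means "worth a closer look", not "harmful".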



One way to create app peer groups is to create a fixed set of categories and
then assign each app into one or more categories, such as tools, productivity,
and games. However, fixed categories are too coarse and inflexible to capture
and track the many distinctions in the rapidly changing set of mobile apps.
Manual curation and maintenance of such categories is also a tedious and
error-prone task.



To address this, Google developed a machine-learning algorithm for clustering
mobile apps with similar capabilities. Our approach uses deep learning of vector
embeddings to identify peer groups of apps with similar functionality, using app
metadata, such as text descriptions, and user metrics, such as installs. Then
peer groups are used to identify anomalous, potentially harmful signals related
to privacy and security, from each app's requested permissions and its observed
behaviors. The correlation between different peer groups and their security
signals helps different teams at Google decide which apps to promote and
determine which apps deserve a more careful look by our security and privacy
experts. We also use the result to help app developers improve the privacy and
security of their apps.
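As a rough illustration of how embeddings can yield peer groups, here is a minimal sketch. The post does not say which clustering method is used, so this swaps in a simple greedy grouping by cosine similarity. The embedding values, the `min_similarity` threshold, and the function names are hypothetical, and the vectors are hand-written rather than learned from app metadata.

```python
import math

# Hypothetical 3-dimensional embeddings. In the approach described above they
# would be learned from app metadata; this sketch does not attempt that step.
EMBEDDINGS = {
    "color_fun": [0.9, 0.1, 0.0],
    "paint_kids": [0.8, 0.2, 0.1],
    "city_maps": [0.1, 0.9, 0.2],
    "route_now": [0.0, 0.8, 0.3],
    "crayon_box": [0.85, 0.15, 0.05],
}


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def peer_groups(embeddings, min_similarity=0.9):
    """Greedy grouping: each app joins the first group whose seed is similar enough."""
    groups = []  # list of (seed_vector, [app names])
    for name, vec in embeddings.items():
        for seed, members in groups:
            if cosine(seed, vec) >= min_similarity:
                members.append(name)
                break
        else:
            # No existing group is close enough, so this app seeds a new one.
            groups.append((vec, [name]))
    return [members for _, members in groups]


GROUPS = peer_groups(EMBEDDINGS)
print(GROUPS)
```

Unlike a fixed category list, groups formed this way follow whatever structure the embeddings capture, so new kinds of apps can form new groups without manual curation. A production system would likely use a more robust clustering method than this order-dependent greedy pass.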






Apps are split into groups of similar functionality, and in each cluster of
similar apps the established baseline is used to find anomalous privacy and
security signals.


These techniques build upon earlier ideas, such as using peer groups to analyze
privacy-related signals (https://arxiv.org/abs/1605.08797), deep learning for
language models to make those peer groups better
(http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality),
and automated data analysis to draw conclusions
(https://arxiv.org/abs/1605.08797).



Many teams across Google collaborated to create this algorithm and the
surrounding process. Thanks to several essential team members, including Andrew
Ahn, Vikas Arora, Hongji Bao, Jun Hong, Nwokedi Idika, Iulia Ion, Suman Jana,
Daehwan Kim, Kenny Lim, Jiahui Liu, Sai Teja Peddinti, Sebastian Porst, Gowdy
Rajappan, Aaron Rothman, Monir Sharif, Sooel Son, Michael Vrable, and Qiang Yan.



For more information on Google's efforts to detect and fight potentially harmful
apps (PHAs) on Android, see Google Android Security Team's Classifications for
Potentially Harmful Applications
(https://source.android.com/security/reports/Google_Android_Security_PHA_classifications.pdf).


References




S. Jana, Ú. Erlingsson, I. Ion (2015). Apples and Oranges: Detecting
Least-Privilege Violators with Peer Group Analysis. arXiv:1510.07308
[cs.CR]. https://arxiv.org/abs/1510.07308



T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, J. Dean (2013). Distributed
Representations of Words and Phrases and their Compositionality. Advances in
Neural Information Processing Systems 26 (NIPS 2013).
http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality



Ú. Erlingsson (2016). Data-driven software security: Models and methods.
Proceedings of the 29th IEEE Computer Security Foundations Symposium (CSF'16),
Lisboa, Portugal.



