Center Project Reports

What Makes a Privacy Policy Need Automatic Summarization? Investigation of Readability, Length, and Category in Usage Patterns of Privacy Enhancing Technologies
Razieh Nokhbeh Zaeem, K. Suzanne Barber, UT CID Report #21-01, August 2021

Abstract
A thriving field of research develops Privacy Enhancing Technologies (PET) that utilize a variety of machine learning, natural language processing, and crowdsourcing methods to automatically summarize long and hard-to-read online privacy policies. Very few researchers, however, have looked at how and why users actually run these PET tools. We present the first work to investigate the usage patterns of such tools to identify what features of a privacy policy make users interested in running PETs. We consider PrivacyCheck and Polisis, two well-known PET tools available as browser extensions. After collecting the privacy policies on which a PET tool is executed, we perform null hypothesis testing to see if there is a statistically significant difference between the readability, length, and category of the privacy policies of interest to PET users and a control group of 68K+ policies from the websites of the DMOZ project. We report the following findings: (1) In 2021, at least 16 years of education are required to understand an average privacy policy in the control group, which has an average length of over 2K words. (2) We observe no statistically significant difference between the readability or length of the policies in the control group and those on which PET tools are executed. (3) Users are keen on running PET tools on privacy policies that belong to particular categories. Most notably, privacy policies of Games websites were almost four times, and those of Computers and of Kids and Teens websites were more than three times, more likely to be investigated with PET tools, compared to the control group. Our work motivates and guides the flourishing field of PET tools and enlightens privacy policy users, writers, and regulators alike.

 

Access Publication: Download PDF of Report

A Large Publicly Available Corpus of Website Privacy Policies Based on DMOZ
Razieh Nokhbeh Zaeem, K. Suzanne Barber, UT CID Report #20-16, November 2020

Abstract
Studies have shown website privacy policies are too long and hard to comprehend for their target audience. These studies and a more recent body of research that utilizes machine learning and natural language processing to automatically summarize privacy policies greatly benefit, if not rely on, corpora of privacy policies collected from the web. While there have been smaller annotated corpora of web privacy policies made public, we are not aware of any large publicly available corpus. We use DMOZ, a massive open-content directory of the web, and its manually categorized 1.5 million websites, to collect hundreds of thousands of privacy policies associated with their categories, enabling research on privacy policies across different categories/market sectors. We review statistics of this corpus and make it available for research. We also obtain valuable insights about privacy policies, e.g., which websites post them less often. Our corpus of web privacy policies is a valuable tool at the researchers’ disposal to investigate privacy policies. For example, it facilitates comparison among different methods of privacy policy summarization by providing a benchmark, and can be used in unsupervised machine learning to automatically digest privacy policies.

 

Access Publication: Download PDF of Report

Self-Sovereign Identity and User Control for Privacy-Preserving Contact Tracing
Wenting Song, Razieh Nokhbeh Zaeem, David Liau, Kai Chih Chang, Michael R. Lamison, Manah M. Khalil, K. Suzanne Barber; UT CID Report#: 20-13, July 2020

Abstract
Contact tracing apps use mobile devices to keep track of and promptly identify those who come in contact with an individual who tests positive for COVID-19. However, privacy is a major obstacle to the widespread use of such apps since users are concerned about sharing their contact and diagnosis data. This research overcomes multiple challenges facing contact tracing apps: (1) As researchers have pointed out, there is a need to balance contact tracing effectiveness with the amount of user identity and diagnosis information shared. (2) No matter what information the user chooses to share, the app should safeguard the privacy of user information. (3) On the other hand, some essential test result information must be shared for the contact tracing app to work. While contact tracing apps have done a good job maintaining contact information on the user’s device, most such apps publish positive COVID-19 test results to a central server, which carries some risk of compromise. (4) Finally, following the spirit of privacy and in the absence of significant collection of user information, the app must innovate new methods to identify deliberate false reports of COVID-19. We address these challenges by (1) giving the user the right to choose how much information to share about their diagnosis and their identity, (2) building our novel contact tracing app on top of Self-Sovereign Identity (SSI) to assure privacy-preserving user authentication with verifiable credentials, (3) decentralizing the storage of COVID-19 test results, and (4) incorporating innovative fraud detection methods with limited user information. We, in collaboration with a top multi-national telecommunications corporation, have implemented our Privacy-preserving Contact Tracing (PpCT) app, leveraging Self-Sovereign Identity advances based on the blockchain for their 5G network.

 

Access Publication: Download PDF of Report

How Much Identity Management with Blockchain Would Have Saved Us? A Longitudinal Study of Identity Theft
R. Nokhbeh Zaeem, K. Suzanne Barber, UT CID Report #20-14, July, 2020.

Abstract
The use of blockchain for identity management (IdM) has been on the rise in the past decade. We present the first work to study the actual, large-scale impact of using blockchain for identity management, particularly how it can protect Personally Identifiable Information (PII) to curb identity theft and fraud. Our insight is that if blockchain-based IdM protects PII, it can reduce the number of theft and fraud cases that take advantage of such PII. At the Center for Identity at the University of Texas at Austin, we have modeled about 6,000 cases of identity theft, and PII exploited in them. We utilize this model to investigate how three real-world blockchain-based IdM solutions (Civic, ShoCard, and Authenteq) could have reduced the identity theft loss over the past 20 years if they had been universally used. We identify which PII protected by blockchain is more critical. We also suggest new PII to include in blockchain-based IdM. Our work paves the way for the design of more effective blockchain-based IdM or any other new line of IdM for that matter.

Access Publication: Download PDF of Report

On Sentiment of Online Fake News
R. Nokhbeh Zaeem, C. Li, and K. Suzanne Barber. UT CID Report#: 20-13, July 2020

Abstract
The presence of disinformation and fake news on the Internet and especially social media has become a major concern. Prime examples of such fake news surged in the 2016 U.S. presidential election cycle and the COVID-19 pandemic. We quantify sentiment differences between true and fake news on social media using a diverse body of datasets from the literature that contain about 100K previously labeled true and fake news. We also experiment with a variety of sentiment analysis tools. We model the association between sentiment and veracity as conditional probability and also leverage statistical hypothesis testing to uncover the relationship between sentiment and veracity. At a confidence level of 99.999%, we observe a statistically significant relationship between negative sentiment and fake news and between positive sentiment and true news. The degree of association, as measured by Goodman and Kruskal’s gamma, ranges from 0.037 to 0.475. Finally, we make our data and code publicly available to support reproducibility. Our results assist in the development of automatic fake news detectors.
Index Terms: disinformation, misinformation, fake news, sentiment analysis, social networks, veracity
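
As a rough illustration of the association measure named in this abstract, Goodman and Kruskal's gamma for a contingency table is (C - D) / (C + D), where C and D are the numbers of concordant and discordant pairs. The sketch below computes it in Python. The function name and the example counts are hypothetical and are not taken from the report or its published code.

```python
def goodman_kruskal_gamma(table):
    """Return Goodman and Kruskal's gamma for an ordinal contingency table.

    `table` is a list of rows of non-negative counts, with rows and
    columns both listed in increasing order of their category.
    """
    rows = len(table)
    cols = len(table[0])
    concordant = 0
    discordant = 0
    for i in range(rows):
        for j in range(cols):
            # Pair cell (i, j) with every cell in a later row.
            for k in range(i + 1, rows):
                for m in range(cols):
                    if m > j:
                        # Both variables increase together: concordant pairs.
                        concordant += table[i][j] * table[k][m]
                    elif m < j:
                        # One increases while the other decreases: discordant.
                        discordant += table[i][j] * table[k][m]
    if concordant + discordant == 0:
        raise ValueError("gamma is undefined when there are no untied pairs")
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical counts, for illustration only.
# Rows: fake news, true news. Columns: negative sentiment, positive sentiment.
example = [[30, 20],
           [10, 40]]
print(round(goodman_kruskal_gamma(example), 3))  # 0.714
```

Gamma ranges from -1 to 1, and a value of 0 indicates no association between the two orderings. For a 2x2 table it reduces to (ad - bc) / (ad + bc).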

 

Access Publication: Download PDF of Report

A Survival Game Analysis to Common Personal Identity Protection Strategies
D. Liau, R. Nokhbeh Zaeem, and K. Suzanne Barber. UT CID Report#: 20-12, June 2020.

Abstract

Throughout the years, authentication processes of individuals’ identities have become essential parts of our modern daily life. These authentication processes also introduced the heavy use of Personally Identifiable Information (PII) in various applications. On the other hand, the continuous increase of identity theft, the unauthorized use of such PII, has created rich business opportunities for identity protection service providers. These services usually consist of a monitoring system that continuously searches through the Internet for incidents that supposedly indicate identity theft activities. However, these solutions are largely based on case studies, and a quantitative method for comparing different identity protection services is missing.
This research offers a tool that provides quantitative analysis among different identity protection services. By bringing together previous work in the field, namely the UT Center for Identity (CID) Identity Ecosystem (a Bayesian network mathematical representation of a person’s identity), real world identity theft data, stochastic game theory, and Markov decision processes, we generate and evaluate the best strategy for defending against the theft of personal identity information. One of the research problems that this paper addresses is the computational complexity of quantitatively evaluating identity protection strategies with real world data. In a real world database like the Identity Threat Assessment and Prediction (ITAP) project, which the UT CID Identity Ecosystem is built on, the number of PII attributes in use is normally on the order of 10^3. We propose a reinforcement learning algorithm for solving the optimal strategy to protect the user’s identity against a malicious and efficient attacker. We aim to understand how initial individual PII exposure evolves into crucial PII breaches over time in terms of the dynamic integrity of the Identity Ecosystem. Real world identity protection strategies are then translated into the system and fight against the malicious attacker for quantitative comparison in our experiment. We present the survival analysis of these strategies and calculate the survival gap between these strategies and our active protection strategy as our experiment result. This study aims to understand the evolutionary process of identity under attack, which may inspire a new direction for future identity protection strategies.

 

Access Publication: Download PDF of Report

A Framework for Estimating Privacy Risk Scores of Mobile Apps
K. C. Chang, R. Nokhbeh Zaeem, K. Suzanne Barber. UT CID Report#: 20-11, June 2020

Abstract
With the rapidly growing popularity of smart mobile devices, the number of mobile applications available has surged in the past few years. Such mobile applications collect a treasure trove of Personally Identifiable Information (PII) attributes (such as age, gender, location, and fingerprints). Mobile applications, however, are many and often not well understood, especially for their privacy-related activities and functions. To fill this critical gap, we recommend providing an automated yet effective assessment of the privacy risk score of each application. The design goal is that the higher the score, the higher the potential privacy risk of this mobile application. Specifically, we consider excessive data access permissions and risky privacy policies. We first calculate the privacy risk of over 600 PII attributes through a longitudinal study of over 20 years of identity theft and fraud news reporting. Then, we map the access rights and privacy policies of each smart application to our dataset of PII to analyze what PII the application collects, and then calculate the privacy risk score of each smart application. Finally, we report our extensive experiments of 100 open source applications collected from Google Play to evaluate our method. The experimental results clearly prove the effectiveness of our method.

Access Publication: Download PDF of Report

PrivacyCheck’s Machine Learning to Digest Privacy Policies: Competitor Analysis and Usage Patterns
R. Nokhbeh Zaeem, S. Anya, A. Issa, J. Nimergood, I. Rogers, V. Shah, A. Srivastava, and K. Suzanne Barber. UT CID Report# 20-10, June 2020

Abstract
Online privacy policies are lengthy and hard to comprehend. To address this problem, researchers have utilized machine learning (ML) to devise tools that automatically summarize online privacy policies for web users. One such tool is our free and publicly available browser extension, PrivacyCheck. In this paper, we enhance PrivacyCheck by adding a competitor analysis component, a part of PrivacyCheck that recommends other organizations in the same market sector with better privacy policies. We also monitored the usage patterns of about a thousand actual PrivacyCheck users, the first work to track the usage and traffic of an ML-based privacy analysis tool. Results show: (1) there is a good number of privacy policy URLs checked repeatedly by the user base; (2) the users are particularly interested in privacy policies of software services; and (3) PrivacyCheck increased the number of times a user consults privacy policies by 80%. Our work demonstrates the potential of ML-based privacy analysis tools and also sheds light on how these tools are used in practice to give users actionable knowledge they can use to proactively protect their privacy.

 

Access Publication: Download PDF of Report

Election Prediction with Trust Filters
T. Huang, R. Nokhbeh Zaeem, and K. Suzanne Barber. UT CID Report#: 20-09, June 2020.

Abstract
Social media has become an essential aspect of our life, and we are used to expressing our thoughts on these platforms. Using social media as an opinion finder has become a popular measure. For any topic on which public opinion matters, there is the potential of using social media to evaluate the problem. Presidential elections definitely fall into this category. Previous research has proven the effectiveness of using social media such as Twitter to predict the outcome of elections. Nevertheless, the composition of social media users can never be the same as the real demographic. What makes things worse is the existence of malicious users who intend to manipulate the public’s tendencies toward candidates or parties. In this paper, we aim to increase the prediction precision under the premise that the extracted tweets are noisy. By taking an individual’s trustworthiness, participation bias, and influence into account, we propose a novel method to forecast the U.S. presidential election.

 

Access Publication: Download PDF of Report

PrivacyCheck v2: A Tool that Recaps Privacy Policies for You
R. N. Zaeem, S. Anya, A. Issa, J. Nimergood, I. Rogers, V. Shah, A. Srivastava, and K. Suzanne Barber. UT CID Report#: 20-08, June 2020

Abstract
Despite the efforts to regulate privacy policies to protect user privacy, these policies remain lengthy and hard to comprehend. Powered by machine learning, our publicly available browser extension, PrivacyCheck v2, automatically summarizes any privacy policy by answering 20 questions based upon User Control and the General Data Protection Regulation. Furthermore, PrivacyCheck v2 incorporates a competitor analysis tool that highlights the top competitors with the best privacy policies in the same market sector. PrivacyCheck v2 enhances the users’ understanding of privacy policies and empowers them to make informed decisions when it comes to selecting services with better privacy policies.

 

Access Publication: Download PDF of Report
