Data Mining in the United States
Trevor Merz
Freedom, Privacy and Technology
Imagine a long string of data that is unreadable to the human eye but contains information about risk and other valuable knowledge. One chunk of the string might signify uncertainty, and another might confirm risky behavior. Such strings can be almost arbitrarily long, and it would take humans nearly forever to analyze them one by one. Today, more and more supercomputers process these strings, translating apparently random code into useful data that companies and the government will use for years to come. This is the process and function of data mining.
Data mining, defined as “data processing using sophisticated data search capabilities and statistical algorithms to discover patterns and correlations in large preexisting databases; a way to discover new meaning in data,” 1 is being used by businesses and governments worldwide to better serve their users. Data mining takes information that consumers and users are unaware they are sharing, such as their actions while using a database or how long they took to input information, and turns it into useful and actionable patterns. Data mining is also referred to as “knowledge discovery in databases,” the extraction of implicit information from data, and it goes by other names as well: knowledge mining from databases, knowledge extraction, data archaeology, data dredging, and data analysis.2 The possible uses of this information are nearly limitless; for almost any question you can fathom, an algorithm can be written to analyze it.
Mined data can be used in nearly any way you can think of. Most businesses use it to increase revenue by adding more valuable information to their Customer Relationship Management (“CRM”) software. It is also used by websites like Facebook. Indeed, how do you think the website can predict what you like or who your friends are? Now that’s some serious social algorithm study. Beyond web development and CRM, researchers from a variety of fields, such as artificial intelligence, machine learning, statistics, and data visualization, are also showing significant interest in database mining techniques.
Data mining’s current capabilities:
Data mining is being used for various purposes, including but not limited to phishing protection, movie spoiler detection, stock market prediction, and the prevention of medical errors. Below I will dissect each of these with real-life examples of data mining in use.
Microsoft Uses Data Mining to Fight Phishing
Back in 2006, Microsoft was using data mining to fight back against phishing ‘attackers,’ who gather information from unwitting users and place viruses and adware on their computers. Because Internet Explorer was a major browser at the time, Microsoft teamed up with Trusted Server technology to crawl the Internet and build lists of websites and their true IP addresses. Once the data mining was under way, the two companies built the researched and analyzed results into products like Internet Explorer and Windows Live Toolbar.
The research began with an analysis of financial companies and e-commerce sites, focusing on 39 characteristics of each one, such as IP address, location, domain, and service provider. Instead of using the blacklist technique (a blacklist is defined as “a list of persons who are disapproved of or are to be punished or boycotted”3), the companies focused on matching credible IP addresses to the sites that users’ browsers visit, ensuring that users are entering information on the correct site. The service acts as a middleman between user and site, flagging fake sites and alerting the user.
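The matching idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Microsoft’s implementation: the table `KNOWN_SITES`, the function `check_site`, the domain names, and the IP addresses (drawn from ranges reserved for documentation) are all hypothetical.

```python
# Hypothetical table of legitimate sites and the IP addresses they are known to use.
# A real service would build this by crawling; these values are placeholders.
KNOWN_SITES = {
    "examplebank.com": {"203.0.113.10", "203.0.113.11"},
    "exampleshop.com": {"198.51.100.7"},
}


def check_site(domain: str, observed_ip: str) -> str:
    """Compare the IP address a browser actually reached with the known-good list."""
    known_ips = KNOWN_SITES.get(domain.lower())
    if known_ips is None:
        return "unknown"      # the site was never profiled, so there is no verdict
    if observed_ip in known_ips:
        return "trusted"      # domain and IP match the crawled record
    return "suspicious"       # familiar name, unfamiliar server: warn the user
```

Unlike a blacklist, this approach does not need to know about a fake site in advance; any server that claims a profiled name from an unlisted address gets flagged.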
Student uses Data Mining to stop movie spoilers
Ever been on Facebook and seen the ending of a movie or book you intended to see or read but hadn’t yet? This can completely ruin the experience, and some people may skip the movie altogether once they know the ending. These blurbs are called spoilers, as they give away a major piece of the plot before you have the chance to discover it on your own.
Luckily, through data mining, Chinese graduate student Sheng Guo and his advisor, Naren Ramakrishnan, developed an algorithm that looks for “linguistic cues to spot and flag spoilers before you read them,”4 which will save frustration for many people who enjoy the art of surprise. The algorithm scans articles on the Internet for familiar phrases supplied by someone who has seen the movie. Take Disney’s The Lion King, for example. To develop a Lion King spoiler filter, the coder would enter keywords such as “Mufasa dies” and “stampede.” Once the filter is in place, a user could safely browse sites such as IMDb (the Internet Movie Database), being alerted when a spoiler is recognized on a page and given the option to continue or not.
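The keyword approach described above can be sketched as follows. This is a minimal illustration of plain keyword matching, not the researchers’ actual algorithm; the function `find_spoilers` and the sample page text are hypothetical, and the two keywords are the ones mentioned above.

```python
import re


def find_spoilers(text: str, keywords: list[str]) -> list[str]:
    """Return the spoiler keywords that appear in the text, ignoring case."""
    hits = []
    for keyword in keywords:
        # \b keeps a keyword from matching inside a longer word.
        pattern = r"\b" + re.escape(keyword) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            hits.append(keyword)
    return hits


LION_KING_KEYWORDS = ["Mufasa dies", "stampede"]

# Made-up page text, for illustration only.
page = "I still can't believe Mufasa dies in that stampede scene."
flagged = find_spoilers(page, LION_KING_KEYWORDS)
if flagged:
    print("Spoiler warning, matched:", flagged)
```

A browser add-on built on this idea would run the check before showing the page and let the user decide whether to continue.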
Stock market returns and foreign currency exchange rates
The stock market is a fragile yet almost predictable economic structure for investments. Data miners have taken on this beast of a market with neural networks and decision trees. This type of analysis is based more on historical data than anything else, and rapid swings can be difficult to interpret. Data mining for stock market returns and foreign currency exchange rates draws on four categories of input:
- Five time series: data miners gather the index values at market open and close, and record the highest and lowest index values in the historical data as well as trading volume.
- Fundamental factors: measuring the price of gold, retail sales, industrial production rates, and foreign currency exchange rates
- Time lag in returns from interest
- Technical factors: analyzing variables that are functions of one or more time series5
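To make the last category concrete, a technical factor is simply a value computed from one or more of the time series. The sketch below shows two common ones, a simple moving average and the day-to-day return. The function names and the closing values are made up for illustration and do not come from the cited source.

```python
def moving_average(values: list[float], window: int) -> list[float]:
    """Simple moving average: the mean of each run of `window` consecutive values."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i : i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def daily_returns(closes: list[float]) -> list[float]:
    """Fractional change from each closing value to the next."""
    return [(today - prev) / prev for prev, today in zip(closes, closes[1:])]


# Made-up closing index values, for illustration only.
closes = [100.0, 102.0, 101.0, 105.0]
print(moving_average(closes, 2))
print(daily_returns(closes))
```

Values like these would then be fed, alongside the fundamental factors, into a model such as a neural network or decision tree.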
Successful data miners in the stock market, you can imagine, have the potential to predict when to buy, sell, or trade, and when to use put and call options. A private software system built on such predictions could make its owner anywhere from a dollar to millions.
Using Data Mining to Predict Errors in Chronic Disease Care
Patient safety research is advancing due to the use of data mining in the medical field. Researchers are able to predict treatment errors in a large population of patients using two decision-making strategies, called “feedback strategies” and “feedforward strategies,” that guide treatment based on anticipated patient states.6
Data mining in the medical field generally relates to patterns of physician decision-making, with the goal of predicting errors and applying accurate information. Advances like these can make being a doctor a little less difficult, minimizing the unknown and applying truly effective treatment.
The future of Data Mining:
The four examples above show diverse ways that data mining is being used in today’s society, but many more uses are emerging. As computer scientists continue to develop analysis algorithms, we will see more businesses, insurance companies, social networking sites, and others attempting to dig through more consumer data to better serve needs. However, this art is almost scary. If you have a Facebook account you may run into suggested pages, where the site takes the pages and items you’ve already liked and suggests similar ones, all done via a complex algorithm.
Any service with a large customer base can benefit from data mining. Plenty of services are available to business owners, in which a need is expressed and an algorithm is developed to scan through endless lines of data. Not only will companies be collecting our consumer data under the table, but companies like insurance brokers will be calculating risk with these types of algorithms. Now that’s a privacy concern. Imagine your name being scanned against public records every day, and whenever a match occurs it is stored somewhere and analyzed, and you are assigned a number representing your risk as an individual. Data mining will continue to develop and to be used for purposes both beneficial and harmful.
Human Rights
One might think the government would be all over data mining techniques, auditing those who use mined data. The spectrum is wide, and, surprisingly, the government is using data mining techniques as well. For predictable reasons, the government takes in as much publicly released information as it can, and here is how United States citizens are protected:
Federal Agency Data Mining Reporting Act of 2007
According to the Federal Agency Data Mining Reporting Act of 2007, the government is allowed to mine data if detailed reports on its programs are submitted to Congress. There are certain exclusions to this Act, such as publicly available data. Publicly available data, which is not subject to protection under this Act, includes “telephone directories, news reporting services, databases of legal and administrative rulings, and other databases and services providing public information without a fee.”1 This means any information personally submitted to companies such as Yellow Pages or Google is fair game for all data miners in the United States, including the government. The alternative is privately collected data, such as secure form input to insurance providers, education websites, and other paid services; this data is protected.
Although this Act requires “a description of the activity, its goals, the technology used, and the basis for determining whether a particular pattern or anomaly is indicative of terrorist or criminal activity; the data sources used; assessments of the program’s impact on privacy, including the actions that will be taken as a result of the implementation of the activity; and a description of the agency’s privacy protection and data accuracy policies,”8 here are a few problems the data mining Act has presented:
Personally Identifiable pattern-based electronic searches, queries or analyses
Pattern-based electronic searches are what the Act protects individuals from, but it says nothing about personally identifiable pattern data. Personally identifiable pattern data is data specific to an individual, such as a license plate or Social Security number. Although these alphanumeric combinations do not name an individual directly, proper tracing can lead to one specific person. The government has excluded this from the Act’s definition, leaving data miners able to link these combinations to individuals by using them as search terms.
Terrorist and criminal activities by individuals
Protection is limited to data mining conducted to identify predictive patterns or anomalies indicative of terrorist or criminal activity.9 But where is the line drawn on crime and terrorism? The government could be mining electronic databases that contain years’ worth of information, giving the miner a good idea of trends and errors. Is it a crime for the government to profile individuals using data mining even though the person profiled was never asked, or when wrong information is associated with that person and the innocent are deemed guilty? Is the data being mined subject to racial profiling? These are questions the government is unlikely to answer publicly until a situation occurs.
Other Government Regulations for Data Mining:
The Privacy Act of 1974 – Regulates the federal government’s use, retention, and disclosure of personal data. The language of this Act is extremely broad, leading to a large number of exceptions and narrow interpretations. An exception can be formulated to fit almost any argument, leaving the Act’s protection close to null, especially for data mining. The Act applies only to government agencies, leaving private parties (which may or may not be disclosing information to the government) unregulated.
The Electronic Communications Privacy Act of 1986 – This Act regulates electronic surveillance for law enforcement purposes, but according to The Constitution Project (TCP, see below) and consistent with provisions of the Federal Agency Data Mining Reporting Act, the mining that is permitted receives little oversight. The scope of this law is so broad that there is little to no limitation on how the government may use the information after obtaining it.
The E-Government Act of 2002 – Requires federal agencies to release “Privacy Impact Assessments” on data collection conducted through IT and also requires policies to be posted on their websites. Once again, the provisions of this law do not control third parties, allowing them to harvest data for the government if they so choose and if the government asks.
Constitutional limits on data mining centrally concern privacy rights, or as Justice Brandeis called it, “the right to be let alone.” 10 Within privacy are two other important concepts: confidentiality and anonymity. Confidentiality is the assurance that your personal information will stay secure among a few persons, and anonymity is a form of privacy that “occurs when the individual is in public places or performing public acts but still seeks, and finds, freedom from identification and surveillance.” 3 Both of these protections, and privacy as a whole, have been endorsed by the Supreme Court, but only in certain situations.
The Internet is still a broad and relatively unknown realm, leaving privacy, confidentiality, and the Fourth Amendment hanging between the expectation of privacy in the home and a connection to a worldwide public realm. As technology advances, privacy laws become more blurred. With the introduction of social networking in the last decade, individuals are willingly submitting their information to those they think are their friends, but what they don’t realize is that third parties can crawl it if privacy settings are not enabled. Sometimes the user has no choice about submitting information, as in the case of online banking, communicating via email or phone, and even using credit cards to make purchases. 11
Fortunately for United States citizens, there is an organization based in Washington, DC called The Constitution Project, which gathers experts and practitioners from across the political spectrum to promote America’s founding rights. This group is “working to reform the nation’s broken criminal justice system and to strengthen the rule of law” through public education and policy reform. The Constitution Project was formed to cast away the labels that divide us in order to keep our Constitution and democracy strong, as stated in the documents it releases on privacy, such as its article on data mining.
An Analysis of the Costs and Benefits of Data Mining Technology in the United States
National Security
Benefits:
Data mining is very beneficial to national security because of its capacity for algorithmic analysis and the speed at which it performs these operations. It is impossible for humans to perform trend analysis on raw data at this scale, since human memory cannot hold a historical database, while computer-assisted data mining is effectively unlimited. Even minor trends can be spotted, such as an email being sent through a number of different forwarding addresses: something that may appear to humans to be one-way communication could, once the server connections are traced, turn out to be a two-way conversation. A supercomputer can easily spot this.
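The forwarding-address example can be sketched in code. This is only a toy illustration under stated assumptions: the `forwarding` and `owners` tables, the addresses, and the functions `resolve` and `two_way_pairs` are all hypothetical, and real traffic analysis is far more involved.

```python
def resolve(address: str, forwarding: dict[str, str]) -> str:
    """Follow forwarding rules until reaching an address that forwards nowhere."""
    seen = set()
    while address in forwarding and address not in seen:
        seen.add(address)              # guard against forwarding loops
        address = forwarding[address]
    return address


def two_way_pairs(messages, forwarding, owners):
    """Return pairs of people who have messaged each other in both directions.

    messages:   list of (sender_address, recipient_address) tuples
    forwarding: address -> address it forwards to (hypothetical server data)
    owners:     address -> person known to control it (hypothetical)
    """
    directed = set()
    for sender, recipient in messages:
        final = resolve(recipient, forwarding)
        src = owners.get(sender, sender)
        dst = owners.get(final, final)
        if src != dst:
            directed.add((src, dst))
    # Keep only pairs where a message also flowed the other way.
    return {tuple(sorted(pair)) for pair in directed if pair[::-1] in directed}


# Made-up data: on the surface, two unrelated one-way messages.
forwarding = {
    "drop1@x.example": "drop2@y.example",
    "drop2@y.example": "bob@b.example",
    "relay@z.example": "alice@a.example",
}
owners = {
    "alice@a.example": "Alice",
    "bob@b.example": "Bob",
    "bob-alt@c.example": "Bob",
}
messages = [
    ("alice@a.example", "drop1@x.example"),
    ("bob-alt@c.example", "relay@z.example"),
]
print(two_way_pairs(messages, forwarding, owners))
```

Read by eye, the two messages share no addresses. Once the forwarding chains are resolved, they turn out to be the two halves of one conversation.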
Time is also a large factor in protecting national security, because of the difference in capability between human data processing and data mining software. Data mining can run all day, every day, and save as much data as is requested. It can spot the planning of terrorist acts or other questionable activity while we’re sleeping, alerting our government and allowing it to start investigating or preparing for something major.
Disadvantages:
As stated previously in the law section, the government has complete access to any data if terrorism is under investigation. Terrorists will find a way to communicate offline or through other methods if data investigations become strict. This is not so much a disadvantage as a matter of ‘securing the channel’ for national security reasons.
Personal Safety
Benefits:
There’s only so much we know about what is being mined. A lot of it is private, but corporations like Google are using it to benefit users in their daily routines. Assuming Google is using data collection in a helpful way, its mobile GPS tool could help as much as it could harm. On the benefit side, GPS location is tracked and stored in a Google database. This could help people whose possessions (including phones) are stolen to track them down or give information to the police. On the other hand, it could also give your location to a stalker who has your Gmail password. There are companies such as Safe-T-Informant12 that geo-target crime and load high-crime areas into GPS devices, so that when a device enters such an area an alert is sent to the primary user’s cell phone. This is an example of how Google and Safe-T-Informant use historical data mining to benefit their users.
Disadvantages:
I got in contact with an individual who runs a business centered on data mining for projects and clients. I asked him about the benefits of data mining for personal safety from his perspective, and he pointed out that there is no real tradeoff: once your data is out in the open, it is available to Google and terrorists alike, and either could target you for whatever they decide.
Disadvantages to personal safety are extremely prevalent today. Business giants are allowed to collect as much information as an individual puts out, and a lot of people carelessly submit data to unknown third parties. People are unlikely to pay attention to the man behind the curtain, especially since data mining is so little known. I call this disadvantage the ‘iron cloud.’ The iron cloud is always around us even though we cannot see it, is extremely dense, and is massive enough to crush the earth.
This gentleman pointed out that my email address, once I gave it to him, was the only thing he needed to find out who I am. A simple Google search gave him a picture, places of employment, web submissions, and anything else I had submitted on the web as a user. “One piece of information unlocks them all,” he stated with paranoia, “and how do I sleep at night? I have guns… lots of guns and a mean dog.” 13
Freedom and Privacy
Benefits:
Whether we like it or not, our private lives are being exposed through data mining. This is happening not only at the individual level but across the whole population. The benefit is good average trend data that reveals norms we might not have been able to track ten years ago. Take, for example, a dating site called OkCupid. OkCupid has over 50,000 people online at any given time, and has created an easy and acceptable way to share information. Users feel comfortable answering its profile questions, as the answers appear publicly on their profiles and give them a chance to land a hot date. OkCupid created a companion blog, OkTrends, which analyzes trends in user-submitted data. OkTrends draws correlations between things such as race, sex, status, habits, relationships, and image, and posts them in graphs to generalize the data across the board.14
Exhibit A is a graph exploring ‘How Long Relationships Last,’ comparing ‘people who use twitter every day vs. everyone else’ by age.14
[Exhibit A: chart, “How Long Relationships Last.” Image: http://akcdn.okccdn.com/blog/10charts/Twitter.png]
According to this data, frequent tweeters have shorter real-life relationships than everyone else. The information was entered by OkCupid users; some knew and some did not know that their data would be published in a survey like this. Chances are users of a dating site are less concerned with privacy than others, leaving them (and the rest of us) with relevant data that may influence people’s love lives. This information will be stored forever and accessible to anyone.
Disadvantages:
At a specific level, individual details can be targeted, which may be a disadvantage to freedom or privacy if they fall into the wrong hands. In general, people who are curious about privacy can read through privacy statements and accept or decline the terms. But these terms have grown to dozens of pages, leaving users with no real comprehension of what they actually agreed to. It has come to the point where people submit their private information to the Internet without concern and find technological privacy nonexistent.
Data mining is becoming the antagonist of freedom and privacy, grouping data as a whole and spitting out a historical trend analysis as its answer. Whether it is a benefit or a disadvantage to freedom and privacy in the United States depends on who is asked and how well they understand data mining.
Data Mining Recommendations
The journal of commercial law breaks down data mining investigations into two categories: goal-oriented and global-oriented. See the table below for a breakdown of these two different data mining investigations:
[Table: comparison of goal-oriented and global-oriented data mining investigations.]15
My recommendations and advocacy are directed at the goal-oriented side of the table. When specific knowledge is the goal, I recommend improving laws so that privacy concerns are stated in a readable Terms of Service (TOS) that the user must scroll through and that states exactly what the collected information will be used for. Implementing this type of TOS would benefit well-intentioned goal-oriented data miners by explaining their cause, and would give ill-intentioned ones another hurdle to clear in gathering data if their cause is fake or purely for profit.
Since data mining is relatively new, laws are likely to fall short, especially as technology advances. Third parties will continue to mine data and benefit from it, but there is hardly a way to monitor whether this tool is used properly and legally. Generally, action against the misuse of data mining follows the discovery of a problem after it’s too late. This sounds bad, but it is almost a trade-off; who wants Big Brother monitoring all of their services? People around the world need further education about data mining services and about submitting information to third parties.
Social networking as a liberation technology has little potential for educating the United States about data mining. I believe social media is not the answer because social sites are mining data too; they require registration and request a surplus of unnecessary information to store in their databases. Most networks will use it to personalize advertisements or, in the case of OkCupid, give some good dating advice, but there will often be data miners with foul intentions who may sell your input for cash. Using the Internet to better educate people about data mining principles could work, but it would be most effective on the website of a government agency or reputable organization.
Identity theft is serious in the United States; an estimated 9 million Americans have their identities stolen every year.16 Stealing an identity centrally involves the Internet, data input, and mining. The Federal Trade Commission lists a few of these Internet methods, such as phishing, changing your address via the web, and pretexting. There is a great deal a person can do to fight identity theft, according to the FTC, and it all centers on awareness. “Be aware of how information is stolen and what you can do to protect yours” is the big concept the FTC wants us to know. To support my recommendation and advocacy, all data input and mining services should link to FTC.gov documents about privacy.
Organizations like EPIC (Electronic Privacy Information Center) are creating positive change by publishing a surplus of articles about electronic privacy concerns. EPIC provides updates and alerts as cases in court are expanded and ruled upon. If providers like Comcast believed in positive privacy change, they would send their users to EPIC with the first launch of a new connection. The Bureau of Consumer Protection is on a similar track, by providing legal resources for advertising and marketing, credit and finance, and privacy and security. The BCP lists four ways to share their resources, through buttons and banners for web developers, and downloadable brochures for the office. Both EPIC and the BCP are promoting positive change through appropriate ad copy and web awareness.17
The information gathered in goal-oriented data mining may raise privacy concerns, as it is detailed information about specific people and events. If this data is not encrypted, professional hackers could enter the database and download all of the goal-oriented information. Therefore, I believe this data should be encrypted to avoid any leak or unintended release of information. As with the iron cloud described earlier, if a hacker were to obtain information from a database such as Google’s, the country’s privacy concerns would skyrocket and there would be a technology lockdown.
To conclude, we must distinguish between goal-oriented and global-oriented data mining, encrypt our mined data, build closer ties to the Electronic Privacy Information Center, the Federal Trade Commission, and the Bureau of Consumer Protection, and improve our terms of service to better inform users where their information is going. Focusing on goal-oriented third-party data miners is important because of the information they are seeking. Encryption of data by all data miners is crucial so that we don’t see the iron cloud fall and identity theft (or something greater) skyrocket to the point of technology lockdown. EPIC, the FTC, and the BCP all have great articles about protecting yourself, and even have advertisements you can place anywhere to raise awareness of these privacy issues. If we can increase the readability of terms of service and inform users of where their data is going, with checks and balances to ensure quality, our privacy concerns will ease. With implementation of all of these recommendations we will have a safer and more secure Internet experience. I predict that over the next decade we will see more of these recommendations implemented due to problematic incidents and Supreme Court cases relating to Internet privacy.
ANNEXES:
- Princeton University “About WordNet.” WordNet. Princeton University. 2010. http://wordnet.princeton.edu
- G. Piatetsky-Shapiro and W.J. Frawley, Knowledge Discovery in Databases. AAAI/MIT Press, 1991.
- Kirk, Jeremy. “Microsoft Uses Data Mining to Fight Phishing.” Computerworld. Web. 11 July 2011. http://www.computerworld.com/s/article/9002996/Microsoft_uses_data_mining_to_fight_phishing.
- “Virginia Tech Researchers Develop Method to Stop Movie Villain Known as The Spoiler | Virginia Tech News | Virginia Tech.” Virginia Tech News | Virginia Tech. Web. 11 July 2011. http://www.vtnews.vt.edu/articles/2010/09/090710-engineering-moviespoilers.html
- “Blacklist – Definition and More from the Free Merriam-Webster Dictionary.” Dictionary and Thesaurus – Merriam-Webster Online. Web. 11 July 2011. http://www.merriam-webster.com/dictionary/blacklist.
- Langdell PhD, Stephen. “Examples of the Use of Data Mining in Financial Applications.” Numerical Algorithms Group. Web. 11 July 2011. http://www.nag.co.uk/.
- McCabe, R. M., G. Adomavicius, P. E. Johnson, G. Ramsey, E. Rund, W. A. Rush, P. J. O’Connor, and J. A. Sperl-Hillen. “Using Data Mining to Predict Errors in Chronic Disease Care.” PubMed. Gov, 01 Aug. 2008. Web. 11 July 2011. http://www.ncbi.nlm.nih.gov/pubmed/21249933.
- “Data Mining.” The IT Law Wiki. Web. 23 July 2011. http://itlaw.wikia.com/wiki/Data_mining.
- THE IDENTITY PROJECT. ““Secure Flight” Data Formats Added to the AIRIMP.” THE IDENTITY PROJECT. 19 May 2009. Web. http://www.papersplease.org.
- “UNITED STATES V. MILLER, 425 U. S. 435 :: Volume 425 :: 1976 :: Full Text.” US Supreme Court Cases from Justia & Oyez. Web. 23 July 2011. http://supreme.justia.com/us/425/435/case.html.
- “Principles for Government Data Mining:.” The Constitution Project. Web. 22 July 2011. http://www.constitutionproject.org/.
- Drive, Dynamic. Safe-T-Informant – Be Street Smart, Be Safe! Web. 30 July 2011. http://www.safe-t-informant.com/.
- Anonymous Delvingware.com Employee. Personal Interview. 29 July 2011.
- Rudder, Christian. “10 Charts About Sex.” OkTrends. Web. 30 July 2011. http://blog.okcupid.com/.
- Vanderlooy, Stijn, Joop Verbeek, and Jaap Van Den Herik. “Towards Privacy-Preserving Data Mining in Law Enforcement.” DOAJ: Directory of Open Access Journals. Denmark, 2007. Web. 05 Aug. 2011. http://www.doaj.org/doaj?func=openurl (This source is from a journal in Denmark, but still addresses a comparison of global-oriented investigation and goal-oriented investigation.)
- “About Identity Theft – Deter. Detect. Defend. Avoid ID Theft.” Federal Trade Commission. US Government. Web. 05 Aug. 2011. http://www.ftc.gov/bcp/edu/microsites/idtheft/consumers/about-identity-theft.html.
- EPIC – Electronic Privacy Information Center. Web. 05 Aug. 2011. http://epic.org.
I just wanted to thank Trevor Merz for writing this detailed report on Data Mining in the United States.





