Data mining

From The IT Law Wiki

Overview

Data mining involves the use of sophisticated data analysis tools to discover previously unknown, valid patterns and relationships in large data sets. These tools can include statistical models, mathematical algorithms, and machine learning methods (algorithms that improve their performance automatically through experience, such as neural networks or decision trees). Consequently, data mining consists of more than collecting and managing data; it also includes analysis and prediction.

Data mining enables corporations and government agencies to analyze massive volumes of data quickly and relatively inexpensively. The use of this type of information retrieval has been driven by the exponential growth in the volumes and availability of information collected by the public and private sectors, as well as by advances in computing and data storage capabilities. In response to these trends, generic data mining tools are increasingly available for — or built into — major commercial database applications. Today, mining can be performed on many types of data, including those in structured, textual, spatial, Web, or multimedia forms.

Data mining applications can use a variety of parameters to examine the data. They include association (patterns where one event is connected to another event), sequence or path analysis (patterns where one event leads to another event, such as the birth of a child and purchasing diapers), classification (identification of new patterns), clustering (finding and visually documenting groups of previously unknown facts), and forecasting (discovering patterns from which one can make reasonable predictions regarding future activities).
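
As a minimal sketch of the association parameter described above, the following Python example counts how often two items appear in the same transaction and computes two commonly used measures for a rule such as "diapers leads to baby wipes": support (the share of all transactions containing both items) and confidence (the share of transactions containing the first item that also contain the second). The transaction data, item names, and helper functions are hypothetical and are not drawn from any tool discussed in this article.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets; each set is one transaction.
TRANSACTIONS = [
    {"diapers", "baby wipes", "milk"},
    {"diapers", "baby wipes"},
    {"milk", "bread"},
    {"diapers", "bread"},
]


def pair_counts(transactions):
    """Count how many transactions contain each unordered pair of items."""
    counts = Counter()
    for basket in transactions:
        # Sorting gives every pair a single canonical ordering.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts


def rule_stats(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    total = len(transactions)
    with_antecedent = sum(1 for b in transactions if antecedent in b)
    with_both = sum(
        1 for b in transactions if antecedent in b and consequent in b
    )
    support = with_both / total if total else 0.0
    confidence = with_both / with_antecedent if with_antecedent else 0.0
    return support, confidence
```

With the sample data, `rule_stats(TRANSACTIONS, "diapers", "baby wipes")` yields a support of 0.5 and a confidence of two thirds. Production tools apply the same idea to far larger data sets and typically prune rare item combinations first.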

Data mining has become increasingly common in both the public and private sectors. Industries such as banking, insurance, medicine, and retailing commonly use data mining to reduce costs, enhance research, and increase sales. For example, the insurance and banking industries use data mining applications to detect fraud and assist in risk assessment (e.g., credit scoring). Using customer data collected over several years, companies can develop models that predict whether a customer is a good credit risk, or whether an accident claim may be fraudulent and should be investigated more closely.

The medical community sometimes uses data mining to help predict the effectiveness of a procedure or medicine. Pharmaceutical firms use data mining of chemical compounds and genetic material to help guide research on new treatments for diseases. Retailers can use information collected through affinity programs (e.g., shoppers’ club cards, frequent flyer points, contests) to assess the effectiveness of product selection and placement decisions, coupon offers, and which products are often purchased together. Companies such as telephone service providers and music clubs can use data mining to perform a “churn analysis” that assesses which customers are likely to remain subscribers and which are likely to switch to a competitor.

The proliferation of data mining has raised implementation and oversight issues, including concerns about the quality of the data being analyzed, the interoperability of the databases and software, and potential infringements on privacy.

In the public sector, data mining applications were initially used to detect fraud and waste, but their use has since expanded to purposes such as measuring and improving program performance. The most frequent public-sector uses of data mining are in the following areas:

  • improving service or performance;
  • detecting fraud, waste, and abuse;
  • analyzing scientific and research information;
  • managing human resources;
  • detecting criminal activities or patterns; and
  • analyzing intelligence and detecting terrorist activities.[1]

Definitions

Although the use and sophistication of data mining have increased in both the government and the private sector, data mining remains an ambiguous term. According to some experts, data mining overlaps a wide range of analytical activities, including data profiling, data warehousing, online analytical processing, and enterprise analytical applications.[2] Some of the terms used to describe data mining or similar analytical activities include “factual data analysis” and “predictive analytics.”

Government reports

Government reports have defined "data mining" variously:

  • The Government Accountability Office (GAO) defined data mining in its May 2004 report entitled "Data Mining: Federal Efforts Cover a Wide Range of Uses" as “the application of database technology and techniques — such as statistical analysis and modeling — to uncover hidden patterns and subtle relationships in data and to infer rules that allow for the prediction of future results.”
  • The Congressional Research Service (CRS) defined data mining in its January 27, 2006, report to Congress entitled, "Data Mining and Homeland Security: An Overview," in more generic terms. It states that data mining “involves the uses of sophisticated data analysis tools to discover previously unknown, valid patterns and relationships in large data sets.” The report describes data mining as using a “discovery approach” in which algorithms examine data relationships to identify patterns. It distinguishes this method from analytical tools that use a “verification based approach,” where the user develops a hypothesis and then uses data to test the hypothesis.
  • The Department of Homeland Security Office of the Inspector General (DHS OIG) defines data mining in its August 2006 Survey of DHS Data Mining Activities, simply as “the process of knowledge discovery, predictive modeling, and analytics.” It stated that this has traditionally involved the discovery of patterns and relationships from structured databases of historical occurrences.
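
The distinction the CRS report draws between a "discovery approach" and a "verification based approach" can be illustrated with a toy Python sketch. In `verify`, the analyst supplies the hypothesis (a field and value) and the data is used to test it; in `discover`, the code scans every field and value pair and reports the one most strongly associated with the outcome. The records, field names, and the simple rate-based scoring are all assumptions made for illustration; they are not taken from the CRS report or from any agency system.

```python
# Hypothetical claim records; field names and values are illustrative only.
CLAIMS = [
    {"region": "north", "channel": "online", "fraud": True},
    {"region": "north", "channel": "agent", "fraud": False},
    {"region": "south", "channel": "online", "fraud": True},
    {"region": "south", "channel": "agent", "fraud": False},
    {"region": "south", "channel": "agent", "fraud": False},
]
OUTCOME = "fraud"


def fraud_rate(records, field, value):
    """Share of records with record[field] == value that are flagged as fraud."""
    matching = [r for r in records if r[field] == value]
    if not matching:
        return 0.0
    return sum(r[OUTCOME] for r in matching) / len(matching)


def verify(records, field, value):
    """Verification-based: test a hypothesis chosen by the analyst."""
    return fraud_rate(records, field, value)


def discover(records):
    """Discovery: scan all field/value pairs and return the strongest one."""
    candidates = sorted(
        {(f, r[f]) for r in records for f in r if f != OUTCOME}
    )
    return max(candidates, key=lambda fv: fraud_rate(records, fv[0], fv[1]))
```

Here `verify(CLAIMS, "region", "north")` tests one analyst-chosen hypothesis, while `discover(CLAIMS)` surfaces the pair `("channel", "online")` without being told where to look. Real discovery tools use statistical safeguards that this sketch omits.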

House Committee report

House Conference Report No. 109-699 defines “data mining” as:

a query or search or other analysis of 1 or more electronic databases, whereas — (A) at least 1 of the databases was obtained from or remains under the control of a non-Federal entity, or the information was acquired initially by another department or agency of the Federal Government for purposes other than intelligence or law enforcement; (B) a department or agency of the Federal Government or a non-Federal entity acting on behalf of the Federal Government is conducting the query or search or other analysis to find a predictive pattern indicating terrorist or criminal activity; and (C) the search does not use a specific individual’s personal identifiers to acquire information concerning that individual.

This definition is to be used by government departments and agencies in evaluating whether or not their information processing activities constitute data mining activities.

Federal legislation

The Federal Agency Data Mining Reporting Act of 2007 defines “data mining” as:

a program involving pattern-based[3] queries, searches, or other analyses of 1 or more electronic databases, where —
(A) a department or agency of the Federal Government, or a non-Federal entity acting on behalf of the Federal Government, is conducting the queries, searches, or other analyses to discover or locate a predictive pattern or anomaly indicative of terrorist or criminal activity on the part of any individual or individuals;
(B) the queries, searches, or other analyses are not subject-based and do not use personal identifiers of a specific individual, or inputs associated with a specific individual or group of individuals, to retrieve information from the database or databases; and
(C) the purpose of the queries, searches, or other analyses is not solely —
(i) the detection of fraud, waste, or abuse in a Government agency or program; or
(ii) the security of a Government computer system.[4]
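
As a reading aid, the three-part structure of the statutory definition above can be written as a boolean test. The sketch below is only an illustration of how prongs (A), (B), and (C) combine; the `Program` class and its field names are hypothetical, and the code is not a legal interpretation of the Act.

```python
from dataclasses import dataclass


@dataclass
class Program:
    """Hypothetical description of an analytical program."""
    pattern_based: bool               # pattern-based queries, searches, or analyses
    federal_or_on_behalf: bool        # (A) run by, or on behalf of, a federal agency
    seeks_predictive_pattern: bool    # (A) seeks a pattern or anomaly indicative of
                                      #     terrorist or criminal activity
    subject_based: bool               # (B) retrieves data using identifiers or inputs
                                      #     tied to a specific individual or group
    solely_fraud_waste_abuse: bool    # (C)(i) sole purpose is fraud, waste, or abuse
    solely_gov_system_security: bool  # (C)(ii) sole purpose is securing a
                                      #     Government computer system


def is_data_mining(p: Program) -> bool:
    """Return True only if the program meets every prong of the definition."""
    prong_a = p.federal_or_on_behalf and p.seeks_predictive_pattern
    prong_b = not p.subject_based
    prong_c = not (p.solely_fraud_waste_abuse or p.solely_gov_system_security)
    return p.pattern_based and prong_a and prong_b and prong_c
```

Under this sketch, a link analysis tool that starts from a known subject's identifiers has `subject_based=True` and therefore falls outside the definition, which matches the exclusion described in note 3.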

Data Quality

Data quality is a multifaceted issue that represents one of the biggest challenges for data mining. Data quality refers to the accuracy and completeness of the data. Data quality can also be affected by the structure and consistency of the data being analyzed. The presence of duplicate records, the lack of data standards, the timeliness of updates, and human error can significantly impact the effectiveness of the more complex data mining techniques, which are sensitive to subtle differences that may exist in the data. To improve data quality, it is sometimes necessary to “clean” the data, which can involve removing duplicate records, normalizing the values used to represent information in the database (e.g., ensuring that “no” is represented as a 0 throughout the database, and not sometimes as a 0, sometimes as an N, etc.), accounting for missing data points, removing unneeded data fields, identifying anomalous data points (e.g., an individual whose age is shown as 142 years), and standardizing data formats (e.g., converting all dates to the MM/DD/YYYY format).
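
The cleaning steps listed above can be sketched in a few lines of Python. The example removes exact duplicates, maps the various spellings of yes and no to 1 and 0, flags missing or implausible ages for review, and rewrites dates as MM/DD/YYYY. The record layout, the accepted input date formats, and the age threshold of 120 are assumptions chosen for illustration.

```python
from datetime import datetime

# Hypothetical raw records; field names and values are illustrative only.
RAW = [
    {"id": 1, "opted_in": "N", "age": 34, "joined": "2004-05-01"},
    {"id": 1, "opted_in": "N", "age": 34, "joined": "2004-05-01"},      # duplicate
    {"id": 2, "opted_in": 0, "age": 142, "joined": "05/17/2003"},       # implausible age
    {"id": 3, "opted_in": "yes", "age": None, "joined": "2005-01-09"},  # missing age
]

YES_NO = {"y": 1, "yes": 1, "1": 1, "n": 0, "no": 0, "0": 0}
DATE_FORMATS = ("%m/%d/%Y", "%Y-%m-%d")  # assumed input formats
MAX_PLAUSIBLE_AGE = 120                  # assumed threshold


def normalize_flag(value):
    """Map the different spellings of yes/no onto 1/0."""
    return YES_NO[str(value).strip().lower()]


def normalize_date(text):
    """Rewrite a date given in any accepted format as MM/DD/YYYY."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%m/%d/%Y")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")


def clean(records):
    """Drop exact duplicates, normalize values, and flag ages for review."""
    seen, cleaned = set(), []
    for rec in records:
        key = tuple(sorted(rec.items(), key=lambda kv: kv[0]))
        if key in seen:
            continue  # exact duplicate of an earlier record
        seen.add(key)
        age = rec["age"]
        cleaned.append({
            "id": rec["id"],
            "opted_in": normalize_flag(rec["opted_in"]),
            "age": age,
            # Missing or out-of-range ages are flagged rather than silently fixed.
            "age_needs_review": age is None or not 0 <= age <= MAX_PLAUSIBLE_AGE,
            "joined": normalize_date(rec["joined"]),
        })
    return cleaned
```

Flagging suspect values instead of deleting or overwriting them is a deliberate choice in this sketch: it keeps the decision about how to treat missing and anomalous data with the analyst.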

All data collection efforts suffer from accuracy concerns to some degree. Ensuring the accuracy of information can require costly protocols that may not be cost effective if the data is not of inherently high economic value. In well-managed data mining projects, the original data collecting organization is likely to be aware of the data’s limitations and account for these limitations accordingly. However, such awareness may not be communicated or heeded when data is used for other purposes. For example, the accuracy of information collected through a shopper’s club card may suffer for a variety of reasons, including the lack of identity authentication when a card is issued, cashiers using their own cards for customers who do not have one, and customers who use multiple cards.[5] For the purposes of marketing to consumers, the impact of these inaccuracies is negligible to the individual. If, however, a government agency were to use that information to target individuals based on food purchases associated with particular religious observances, an outcome based on inaccurate information could be, at the least, a waste of resources by the government agency and an unpleasant experience for the misidentified individual.

Anti-terrorism Activities

Since the terrorist attacks of September 11, 2001, data mining has been seen increasingly as a useful tool to help detect terrorist threats by improving the collection and analysis of public and private sector data. One response to these concerns was the creation of the Information Awareness Office (IAO) at the Defense Advanced Research Projects Agency (DARPA) in January 2002. The role of IAO was “in part to bring together, under the leadership of one technical office director, several existing DARPA programs focused on applying information technology to combat terrorist threats.”[6] The mission statement for IAO suggested that the emphasis on these technology programs was to “counter asymmetric threats by achieving total information awareness useful for preemption, national security warning, and national security decision making.”[7]

In a report on information sharing and analysis to address the challenges of homeland security, it was noted that agencies at all levels of government are now interested in collecting and mining large amounts of data from commercial sources.[8] The report noted that agencies may use such data not only for investigations of known terrorists, but also to perform large-scale data analysis and pattern discovery in order to discern potential terrorist activity by unknown individuals. Such use of data mining by federal agencies has raised public and congressional concerns regarding privacy.

Legal Issues

Federal government access to and mining of information on individuals held in a multiplicity of databases, public and private, raises a plethora of issues — both legal and policy. To what extent should the government be able to gather and mine information about individuals to aid the war on terrorism?[9] Should unrestricted access to personal information be permitted? Should limitations, if any, be imposed on the government’s access to personal information? In resolving these issues, the current state of the law in this area may be consulted. The following is a description of selected information access, collection and disclosure laws and regulations that relate to these issues.

Laws Governing Federal Government Access to Information

Generally there are no blanket prohibitions on federal government access to publicly available information (e.g., real property records, liens, mortgages, etc.). Occasionally a statute will specifically authorize access to such data. The USA Patriot Act, for example, in transforming the Treasury Department’s Financial Crimes Enforcement Network (FinCEN) from an administratively established bureau to one established by statute, specified that it was to provide government-wide access to information collected under the anti-money laundering laws, records maintained by other government offices, as well as privately and publicly held information.

Other government agencies have also availed themselves of computer software products that provide access to a range of personal information. The FBI reportedly purchases personal information from ChoicePoint, Inc., a provider of identification and credential verification services, for data analysis.[10]

Privacy Concerns

Mining government and private databases containing personal information creates a range of privacy concerns. Through data mining, government agencies can quickly and efficiently obtain information on individuals or groups by exploiting large databases containing personal information aggregated from public and private records. Information can be developed about a specific individual or about unknown individuals whose behavior or characteristics fit a specific pattern. Before data aggregation and data mining came into use, personal information contained in paper records stored at widely dispersed locations, such as courthouses or other government offices, was relatively difficult to gather and analyze. As one expert noted, data mining technologies that provide for easy access and analysis of aggregated data challenge the concept of privacy protection afforded to individuals through the inherent inefficiency of government agencies analyzing paper, rather than aggregated, computer records.[11]

Privacy concerns about mined or analyzed personal data also include concerns about the quality and accuracy of the mined data; the use of the data for other than the original purpose for which the data were collected without the consent of the individual (mission creep); the protection of the data against unauthorized access, modification, or disclosure; and the right of individuals to know about the collection of personal information, how to access that information, and how to request a correction of inaccurate information.[12]

Some observers contend that tradeoffs may need to be made regarding privacy to ensure security. Other observers suggest that existing laws and regulations regarding privacy protections are adequate, and that these initiatives do not pose any threats to privacy. Still other observers argue that not enough is known about how data mining projects will be carried out, and that greater oversight is needed. There is also some disagreement over how privacy concerns should be addressed. Some observers suggest that technical solutions are adequate. In contrast, some privacy advocates argue in favor of creating clearer policies and exercising stronger oversight. As data mining efforts move forward, Congress may consider a variety of questions, including the degree to which government agencies should use and mix commercial data with government data, whether data sources are being used for purposes other than those for which they were originally designed, and the possible application of the 1974 Privacy Act to these initiatives.

References

  1. See GAO, Data Mining: Federal Efforts Cover a Wide Range of Uses (GAO-04-548 May 4, 2004).
  2. See Lou Agosta, Data Mining Is Dead—Long Live Predictive Analytics! (Forrester Research, Oct. 30, 2003).
  3. The limitation to predictive, "pattern-based" data mining is significant because analysis performed within the ODNI and its constituent elements for counterterrorism and similar purposes is also performed using various types of link analysis tools. These tools start with a known or suspected terrorist or other subject of foreign intelligence interest and use various methods to uncover links between the known subject and potential associates or other persons with whom that subject is or has been in contact. The Act does not include such analyses within its definition of "data mining" because such analyses are not "pattern-based." Rather, these analyses rely on inputting the "personal identifiers of a specific individual, or inputs associated with a specific individual or group of individuals," which is excluded from the definition of "data mining" under the Act.
  4. Pub. L. No. 110-53, 121 Stat. 266, §804(b)(1).
  5. Department of Defense, Technology and Privacy Advisory Comm., Safeguarding Privacy in the Fight Against Terrorism 40 (Mar. 2004).
  6. Department of Defense, Report to Congress Regarding the Terrorism Information Awareness Program, Executive Summary, at 2 (May 20, 2003).
  7. Id. at 1 (emphasis added).
  8. Creating a Trusted Information Network for Homeland Security (Markle Foundation, Dec. 2003).
  9. The Markle Foundation Task Force on National Security in the Information Age has proposed guidelines to allow the effective use of information (including the use of data mining technologies) in the war against terrorism while respecting individuals’ interests in the use of private information. See Markle Foundation Task Force on National Security in the Information Age: Protecting America’s Freedom in the Information Age 32-34 (Oct. 2002).
  10. Glenn R. Simpson, "Big Brother-in-Law: If the FBI Hopes to Get the Goods on You, It May Ask ChoicePoint — U.S. Agencies’ Growing Use of Outside Data Suppliers Raises Privacy Concerns," Wall St. J., Apr. 13, 2001 (The company “specialize[s] in doing what the law discourages the government from doing on its own — culling, sorting and packaging data on individuals from scores of sources, including credit bureaus, marketers and regulatory agencies.”)
  11. K.A. Taipale, “Data Mining and Domestic Security: Connecting the Dots to Make Sense of Data,” 5 Columbia Sci. & Tech. L. Rev. (2003-04).
  12. These privacy concerns are reflected in the Fair Information Practices proposed in 1980 by the Organization for Economic Cooperation and Development and endorsed by the U.S. Department of Commerce in 1981. These practices govern collection limitation, purpose specification, use limitation, data quality, security safeguards, openness, individual participation, and accountability.