What Is Data Science?
- Data science is a new field of study, just sufficiently coalesced to be called a science.
- Data science is truly a hybrid discipline, sitting at the intersection of statistics and computer science.
- What then is data science? Data science is an amalgam of analytic methods aimed at extracting information from data.
- Two broad technologies are most responsible for the field's emergence: the internet and automated data collection devices.
- Devices that collect data have become nearly ubiquitous and often surreptitious in the human environment.
- The smartphone in your pocket is measuring the ambient temperature and your latitude and longitude.
- The phone’s gravitometer is measuring the local gravitational field.
- In the past century, data collection was often carried out by statistical sampling, since data were collected manually and were therefore expensive.
- In this century, data are arriving by firehose and without design.
- The theme is the extraction of information from data.
- Statistical science shares with data science the objective of extracting information from data, but data science spans situations that are beyond the scope of statistics.
- The validity of many statistical methods is nullified if the data are not collected by a probabilistic sampling design such as random sampling.
- Hypothesis testing is nearly irrelevant in data science because opportunistically collected data lack the necessary design.
- One of the skills of an accomplished data scientist is the ability to judge which statistical techniques are useful and to figure out how to apply them at large scale. Both programming techniques and statistical methods are necessary and ubiquitous in data science.
- The data of data science may be loosely described as being of three varieties:
  - Massive in volume but static
  - Arriving in a stream and needing immediate analysis
  - High-dimensional
- The problem of high dimensionality differs in several respects from the first two varieties.
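
The second variety above, data arriving in a stream and needing immediate analysis, can be made concrete with a small sketch. The code below is an illustration only, not a method taken from these notes: it keeps a running mean and variance in a single pass using Welford's update, so no reading has to be stored. The class name `RunningStats`, the generator `simulated_stream`, and the sample readings are all hypothetical.

```python
class RunningStats:
    """One-pass mean and sample variance via Welford's update (illustrative sketch)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the current mean

    def update(self, x):
        # Fold one new observation into the summary without storing it.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Sample variance; undefined with fewer than two observations.
        return self._m2 / (self.n - 1) if self.n > 1 else float("nan")


def simulated_stream():
    # Hypothetical readings standing in for a live sensor feed.
    yield from [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]


stats = RunningStats()
for reading in simulated_stream():
    stats.update(reading)  # the summary is current after every reading

print(stats.n, stats.mean, stats.variance)
```

Because the summary is updated per observation, the same loop would work unchanged on a feed that never ends, which is the situation a static, massive data set does not pose.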