Data Science Definition

For our purposes, Data Science is an interdisciplinary, applied information processing activity that uses tools and techniques from statistics, scientific computing, visualization, numerical algorithms (machine learning), databases, programming languages or systems, and related IT systems (collectively termed the Data Science Toolkit) to extract or extrapolate knowledge and insights from structured or unstructured data.

Use Contexts

Applications of Data Science are widespread and can be classified into the following main groups, depending on the entity that is the primary beneficiary:

  • Commercially oriented uses in various business sectors
  • Public Sector applications
  • Education
  • Academic Research
  • Central / Local Government Units
  • Law Enforcement / Security
  • Civil Society / NGOs
  • Personal (individual) applications

The Use Context largely shapes the Data Science Toolkit. In particular, it determines:

  • The type of data that are available
  • The type of algorithms or other quantitative procedures that are applicable
  • The type of inferences, metrics or insights that are useful
  • The manner in which Data Science insights can be productionized
  • The type of legal and regulatory framework within which Data Science can be applied

Definitional Challenges

The variety of use contexts, the interdisciplinary nature and the relative novelty of Data Science mean that there are considerable ambiguities as to what exactly it entails. The following statements aim to clarify the use of the term in the context of the Python Data Science project:

  • Data Science is narrower than the more general term "Data Analytics", which itself is narrower than "Information Technology". A non-trivial algorithmic element is at the core of all Data Science applications, even if it is seldom the only element, or even the most important one.
  • Developing specialized Data Processing infrastructures is not part of Data Science. There is an emerging, related field of Data Engineering.
  • Data Science includes subsets of Statistics, Machine Learning and Scientific Computing. It does not include these domains in their entirety, but only to the extent that they can be applied as part of a toolkit to deliver the desired knowledge extraction.
  • Specific Sectoral Toolkits for the analysis of data may or may not be part of Data Science. An important criterion is the degree to which such tools might be re-usable in a slightly wider context.

Practitioners

People engaging in Data Science are professionals who use the Data Science Toolkit in some combination of capabilities to create insights from data. The labeling of Data Science practitioner roles may vary, depending on use context.

PyPI Intended Audience Tags (Relevant Subset, adapted)

  • Science/Research
  • Education
  • Customer Service (Business)
  • Financial and Insurance Industry
  • IT / Telecommunications Industry
  • Healthcare Industry
  • Legal Industry
  • Manufacturing (Business)
  • End Users (Individuals)
  • Other Audience
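
As an illustration of how the adapted labels above might relate to package metadata, the following minimal sketch maps each label to PyPI "Intended Audience" trove classifier strings. The mapping itself, the dictionary name AUDIENCE_CLASSIFIERS and the helper classifiers_for are assumptions made for this example and are not part of the original list; in particular, it is only assumed that "IT / Telecommunications Industry" merges two classifiers and that "End Users (Individuals)" corresponds to "End Users/Desktop".

```python
# Hypothetical mapping from the adapted audience labels to PyPI
# "Intended Audience" trove classifiers. The pairing is an assumption
# made for illustration, not a statement of how the list was derived.
AUDIENCE_CLASSIFIERS = {
    "Science/Research": ["Intended Audience :: Science/Research"],
    "Education": ["Intended Audience :: Education"],
    "Customer Service (Business)": ["Intended Audience :: Customer Service"],
    "Financial and Insurance Industry": [
        "Intended Audience :: Financial and Insurance Industry"
    ],
    "IT / Telecommunications Industry": [
        "Intended Audience :: Information Technology",
        "Intended Audience :: Telecommunications Industry",
    ],
    "Healthcare Industry": ["Intended Audience :: Healthcare Industry"],
    "Legal Industry": ["Intended Audience :: Legal Industry"],
    "Manufacturing (Business)": ["Intended Audience :: Manufacturing"],
    "End Users (Individuals)": ["Intended Audience :: End Users/Desktop"],
    "Other Audience": ["Intended Audience :: Other Audience"],
}


def classifiers_for(labels):
    """Return the sorted, de-duplicated classifiers for the given labels."""
    found = set()
    for label in labels:
        # An unknown label raises KeyError rather than being silently dropped.
        found.update(AUDIENCE_CLASSIFIERS[label])
    return sorted(found)


if __name__ == "__main__":
    for classifier in classifiers_for(["Science/Research", "Education"]):
        print(classifier)
```

The resulting strings could then be listed under the classifiers field of a package's metadata, so that a project declares which of the audiences above it targets.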