What is data science?
Data science is a method for gleaning insights from structured and unstructured data using approaches ranging from statistical analysis to machine learning (ML). Most organizations employ it to transform data into value in the form of increased revenue, reduced costs, business agility, improved customer experience, new products, and so on. In short, data science gives the data collected by an organization a purpose.
Data science vs. data analytics
Although the two are closely related, data analytics is a component of data science, used to understand what an organization’s data looks like. Data science takes the output of analytics to solve problems. Data scientists say that investigating something with data is simply analysis; data science takes analysis a step further to explain and solve problems. Another difference between data analytics and data science is timescale: data analytics describes the current state of reality, whereas data science uses that data to predict and understand the future.
The benefits of data science
The business value of data science depends on organizational needs. Data science could help an organization build tools to predict hardware failures, enabling the organization to perform maintenance and prevent unplanned downtime. It could also help predict what to put on supermarket shelves, or how popular a product will be based on its attributes.
For further insight into the business value of data science, see The unexpected benefits of data analytics and Demystifying the dark science of data analytics.
Data science jobs
While the number of data science degree programs is increasing at a rapid clip, they aren’t necessarily what organizations look for when seeking data scientists. Candidates with a statistics background are popular, especially if they can demonstrate they know whether they’re looking at real results, have domain knowledge to put results in context, and have communication skills that allow them to convey results to business users.
Many organizations look for candidates with PhDs, especially in physics, math, computer science, economics, or even social science. A PhD proves a candidate is capable of doing deep research on a topic and disseminating information to others.
Some of the best data scientists or leaders in data science groups have nontraditional backgrounds, even ones with little formal computer training. In many cases, the key is an ability to look at something from an unconventional perspective and understand it.
For further information about data scientist skills, see What is a data scientist? A key data analytics role and a lucrative career, and Essential skills and traits of elite data scientists.
Data science salaries
Here are some of the most popular job titles related to data science and the salary range for each position, according to the most recent data from Indeed:
- Analytics manager: $80,000-$176,000
- Business intelligence analyst: $56,000-$147,000
- Data analyst: $50,000-$128,000
- Data architect: $67,000-$173,000
- Data engineer: $83,000-$195,000
- Data scientist: $76,000-$195,000
- Research analyst: $41,000-$134,000
- Statistician: $50,000-$143,000
Data science degrees
According to Fortune, these are the top graduate degree programs in data science:
- University of California, Berkeley
- University of Illinois at Urbana-Champaign
- Marshall University
- Bay Path University
- University of Texas, Austin
- University of Missouri, Columbia
- Texas Tech University
- University of Chicago
- University of California, Riverside
- Clemson University
Data science training and bootcamps
Given the current shortage of data science talent, many organizations are building out programs to develop internal data science talent.
Bootcamps are another fast-growing avenue for training workers to take on data science roles. For more details on data science bootcamps, see 15 best data science bootcamps for boosting your career.
Data science certifications
Organizations need data scientists and analysts with expertise in techniques to analyze data. They also need big data architects to translate requirements into systems, data engineers to build and maintain data pipelines, developers who know their way around Hadoop clusters and other technologies, and system administrators and managers to tie everything together. Certifications are one way for candidates to show they have the right skillset. Some of the top data science certifications include:
- Certified Analytics Professional (CAP)
- Cloudera Data Platform Generalist Certification
- Data Science Council of America (DASCA) Senior Data Scientist (SDS)
- Data Science Council of America (DASCA) Principal Data Scientist (PDS)
- IBM Data Science Professional Certificate
- Microsoft Certified: Azure AI Fundamentals
- Microsoft Certified: Azure Data Scientist Associate
- Open Certified Data Scientist (Open CDS)
- SAS Certified Professional: AI and Machine Learning
- SAS Certified Advanced Analytics Professional
- SAS Certified Data Scientist
- TensorFlow Developer Certificate
For more information about big data and data analytics certifications, see The top 9 data analytics certifications, and 12 data science certifications that will pay off.
Data science teams
Data science is generally a team discipline, and data scientists are the core of most data science teams. But moving from data to analysis to production value requires a range of skills and roles. For example, data analysts should be on board to investigate the data before presenting it to the team and to maintain data models. Data engineers are necessary to build data pipelines to enrich data sets and make the data available to the rest of the company.
For further insight into building data science teams, see How to assemble a highly effective analytics team and The secrets of highly successful data analytics teams.
Data science goals and deliverables
The goal of data science is to construct the means to extract business-focused insights from data, and ultimately optimize business processes or provide decision support. This requires an understanding of how value and information flow in a business, and the ability to use that understanding to identify business opportunities. While that may involve one-off projects, data science teams more typically seek to identify key data assets that can be turned into data pipelines that feed maintainable tools and solutions. Examples include credit card fraud monitoring solutions used by banks, or tools used to optimize the placement of wind turbines in wind farms.
Along the way, presentations that communicate what the team is working on are also important deliverables.
Data science processes
Production engineering teams work on sprint cycles, with projected timelines. That’s often difficult for data science teams to do because a lot of time upfront can be spent just determining whether a project is feasible. Data must be collected and cleaned, and then the team must determine whether it can answer the question efficiently.
Data science ideally should follow the scientific method, though that’s not always the case, or even feasible. Real science takes time: You spend a little bit confirming your hypothesis and then a lot trying to disprove yourself. In business, time-to-answer is important. As a result, data science can often mean going with the good-enough answer rather than the best answer. The danger, though, is that results can fall victim to confirmation bias or overfitting.
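One common guard against overfitting is to hold back part of the data and compare a model's error on records it has seen with its error on records it has not. The sketch below illustrates the idea in Python using only the standard library. The synthetic data and the function names (make_data, fit_line, fit_memorizer, mse) are assumptions made for illustration, not part of any particular team's process.

```python
import random


def make_data(n, seed):
    """Synthetic data: y = 2x plus Gaussian noise (illustrative only)."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    return [(x, 2 * x + rng.gauss(0, 1)) for x in xs]


def fit_line(points):
    """Ordinary least squares for y = a*x + b."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return lambda x: a * x + b


def fit_memorizer(points):
    """Deliberately overfit model: return the y of the nearest training x."""
    def predict(x):
        return min(points, key=lambda p: abs(p[0] - x))[1]
    return predict


def mse(model, points):
    """Mean squared error of a model over (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in points) / len(points)


# Hold out the last 50 records; neither model sees them while fitting.
data = make_data(200, seed=0)
train, test = data[:150], data[150:]

for name, fit in [("line", fit_line), ("memorizer", fit_memorizer)]:
    model = fit(train)
    print(f"{name}: train MSE={mse(model, train):.3f}, "
          f"test MSE={mse(model, test):.3f}")
```

The memorizer reproduces every training point, so its training error is zero, yet it still makes errors on the held-out records. A gap of that kind between training and test error is the warning sign; a good-enough model should score similarly on both.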
According to computer science portal GeeksforGeeks, a typical data science process includes the following steps:
- Define the problem and create a project charter. A data science project charter outlines the objectives, resources, deliverables, and timeline to ensure all stakeholders are aligned.
- Retrieve data. Data relevant to the project could be stored in databases, data warehouses, or data lakes. Accessing that data may require navigating the organization’s policies and requesting permissions.
- Employ data cleansing, integration, and transformation. Data cleansing removes errors, inconsistencies, and outliers in the data. Integration combines datasets from various sources. Transformation prepares the data for modeling.
- Perform exploratory data analysis (EDA). This step uses graphical techniques like scatter plots, histograms, and box plots to visualize data and identify trends, which helps in selecting the correct modeling techniques for the project.
- Build models. This step involves building ML or deep learning models to make predictions or classifications based on the data.
- Present findings and deploy models. After completing the analysis, this step involves presenting the results to stakeholders and deploying models into production systems to automate decision-making or support ongoing analysis.
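The steps above can be sketched end to end in a few lines of Python. This is a minimal illustration using only the standard library, not a production workflow: the sensor records, the plausible temperature range used for cleansing, and the helper names are all hypothetical, and a printed summary stands in for the plots an EDA step would normally produce.

```python
import statistics

# Retrieve data: hypothetical records (runtime in hours, temperature in
# degrees Celsius). In practice these would come from a database or data lake.
RAW = [
    {"hours": 10, "temp": 41.0},
    {"hours": 20, "temp": 43.9},
    {"hours": 30, "temp": None},   # missing reading
    {"hours": 40, "temp": 50.2},
    {"hours": 50, "temp": 500.0},  # implausible value, e.g. a sensor glitch
    {"hours": 60, "temp": 56.1},
    {"hours": 70, "temp": 58.8},
]


def cleanse(records, low=0.0, high=150.0):
    """Drop records with a missing temperature or one outside [low, high]."""
    return [
        r for r in records
        if r["temp"] is not None and low <= r["temp"] <= high
    ]


def summarize(values):
    """Text stand-in for EDA plots: basic descriptive statistics."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Cleanse and transform into the columns the model needs.
clean = cleanse(RAW)
hours = [r["hours"] for r in clean]
temps = [r["temp"] for r in clean]

# Explore, model, and present.
print("EDA summary:", summarize(temps))
slope, intercept = fit_line(hours, temps)
forecast = slope * 80 + intercept
print(f"Model: temp = {slope:.3f} * hours + {intercept:.2f}")
print(f"Predicted temperature at 80 hours: {forecast:.1f}")
```

A straight line is used here only because it is easy to read; the model-building step in a real project would be chosen based on what the exploratory analysis shows.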
Data science tools
Data science teams make use of a wide range of tools, including SQL, Python, R, Java, and a cornucopia of open source projects such as Hive, Oozie, and TensorFlow. These tools are used for a variety of data-related tasks, ranging from extracting and cleaning data to subjecting data to algorithmic analysis via statistical methods or ML. According to the Data Science Council of America, some of the most popular data science tools include:
- Python: A versatile programming language that’s a favorite of data scientists. It features extensive libraries for manipulating and analyzing data and implementing ML algorithms, including NumPy, pandas, seaborn, and scikit-learn.
- R: A language and environment for statistical computing and graphics. R is an integral part of the data science toolkit, useful for data exploration, visualization, and statistical modeling.
- JupyterLab: This web-based interactive development environment for notebooks, code, and data offers a flexible interface to configure and arrange workflows in data science and ML.
- Excel: Microsoft’s spreadsheet software is perhaps the most extensively used BI tool around. It’s also handy for data scientists working with smaller datasets.
- ChatGPT: Built on a generative pre-trained transformer (GPT), this tool has become a powerful aid for data science tasks. It can generate and execute Python code, and produce comprehensive analysis reports. It also features plugins for research, math, statistics, automation, and document review.
- TensorFlow and PyTorch: These deep learning frameworks help data scientists develop and deploy ML models in the domain of neural networks. They help data scientists perform complex tasks including image recognition and natural language processing (NLP).
- Tableau: Now owned by Salesforce, Tableau is a data visualization tool used to create interactive and shareable dashboards.
- Apache Spark: This unified analytics engine is designed to process large-scale data, with support for data cleansing, transformation, model building, and evaluation.
- Power BI: Microsoft’s Power BI facilitates data gathering, analysis, and presentation.
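As a small illustration of how two of these tools divide the work, the sketch below uses SQL to extract and clean data and Python to summarize it. It relies only on Python's standard library; the in-memory SQLite database, the orders table, and its values are hypothetical stand-ins for whatever data store an organization actually uses.

```python
import sqlite3
import statistics

# Hypothetical data store: an in-memory SQLite database with one table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        ("north", 120.0),
        ("north", 80.0),
        ("south", 200.0),
        ("south", None),   # missing amount, filtered out below
        ("south", 100.0),
    ],
)

# SQL: extract the rows of interest and drop records with missing amounts.
rows = conn.execute(
    "SELECT region, amount FROM orders "
    "WHERE amount IS NOT NULL ORDER BY region"
).fetchall()
conn.close()

# Python: group the extracted rows by region and compute a simple statistic.
by_region = {}
for region, amount in rows:
    by_region.setdefault(region, []).append(amount)

summary = {
    region: statistics.mean(amounts)
    for region, amounts in by_region.items()
}
print(summary)
```

The same split often holds at larger scale: the query language does the filtering close to the data, and the general-purpose language handles the statistics or ML that follows.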