Self-driving cars, robotic investment advisors and wearable health technologies are just some of the innovations made possible by recent progress in data technologies.
Data-related terminology such as ‘Big Data’, ‘Data Mining’ and ‘Machine Learning’ can be seen or heard everywhere – in the news, in magazines, or even in a conversation overheard at the canteen. But do you really know what those words mean?
In this article, we strive to give readers a clear picture of what Data Analytics and Data Science are and, hopefully, spark an interest in them.
Data Analytics vs Data Science
Oftentimes, people think that Data Analytics and Data Science mean the same thing and can be used interchangeably. In truth, there is a distinction between them: although there is some overlap between what a Data Analyst and a Data Scientist do, they perform distinct functions.
Fig 1: Differences between a Data Analyst & a Data Scientist
A Data Scientist’s primary goal is to interpret data to make smart business decisions, to develop data products, or both.
Data Scientists will seek out patterns within the data that they have and determine how they can exploit their findings. Unlike Data Analysts, they scrutinize data more rigorously and may employ scientific techniques comprising segmentation analysis, inferential models, synthetic control experiments and time series forecasting. For example, an e-commerce website may analyze the purchasing patterns of their customers to formulate an effective promotional strategy.
Data Scientists also develop tools that process data to produce algorithmically-generated results. These tools are called data products and do not require the Data Scientist to play an active role for them to function. One example would be the Google-generated advertisements that can be found on many websites; the advertisements are usually relevant to your interests because Google has inputted multiple data points (your age, Google search queries, gender, internet usage patterns, geographic location, etc.) into their proprietary algorithm.
Generally, a Data Scientist is very proficient in programming and mathematics. The same level of proficiency is not expected in a Data Analyst.
Data Analysts merely look at data to obtain insights. They do not usually construct advanced algorithms or develop data products. Often, Data Analysts are also competent in a business function (e.g. finance, operations, marketing) and use data to complement their main professional role. A fundamental understanding of statistics and basic computing or programming capability should be sufficient for an individual to perform effectively as a Data Analyst.
The following is a list of terms that data enthusiasts should know. As different texts and people define these terms differently, we will only state the definitions that are widely accepted and most applicable to Data Analytics and Data Science.
Algorithm
A sequence of actions that performs calculation, data-processing and automated-reasoning tasks when executed, usually visualized as a flow chart for ease of interpretation. A simple algorithm would be as follows:
- Prompt user for input (What is your name?)
- Scan and store input (Record ‘Charlie’ into a data array.)
- Display preset message (Your registration is successful!)
- Restart the program (Go back to step 1.)
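The steps above can be sketched as a short program. This is a minimal illustration in Python; the function name and the data structure used for storage are assumptions, since the original only lists the steps.

```python
# A minimal sketch of the four-step registration algorithm described above.
def register(names, new_name):
    """Steps 2 and 3: store the input, then return the preset message."""
    names.append(new_name)  # record the name into a data array (here, a list)
    return "Your registration is successful!"

# Steps 1 and 4 (prompt the user, then loop back) would wrap this in a loop:
#   while True: register(names, input("What is your name? "))
names = []
message = register(names, "Charlie")
```

In a real program, the loop in the comment would keep prompting until the user exits; it is left out here so the sketch runs without interactive input.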
Big Data
Big Data was initially used to describe data sets so large and complex that sophisticated data-processing software was required to handle them. In recent years, however, more emphasis has been placed on the effort and skills required to extract value from the data than on its size.
Cloud Computing
A technological infrastructure that allows multiple devices and resources to store and exchange data in a ‘cloud’ or data center, offering businesses and users convenience and improved productivity.
Data Cleaning
The process of eliminating or rectifying irrelevant or inaccurate parts of a dataset so that it is consistent with other datasets in the system. Examples include removing typographical errors and standardizing the terms used in a category, e.g. changing ‘G stn’, ‘g_station’ and ‘GS’ into ‘Green Station’.
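The standardization example above can be sketched in a few lines. The mapping table and function name are illustrative assumptions; real cleaning pipelines would handle many more variants and error types.

```python
# A minimal data-cleaning sketch: standardize the variants from the example
# above ('G stn', 'g_station', 'GS') into the canonical term 'Green Station'.
CANONICAL = {
    "G stn": "Green Station",
    "g_station": "Green Station",
    "GS": "Green Station",
}

def clean(records):
    """Replace known variants with the canonical term; leave others intact."""
    return [CANONICAL.get(r, r) for r in records]

cleaned = clean(["G stn", "Blue Station", "GS"])
```

A lookup table like this only catches variants you have already seen; fuzzy matching or manual review is usually needed for the rest.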
Data Collection
The process of gathering and measuring information that, when analyzed, reveals insights pertaining to the topic of interest. Traditional methods include focus group discussions, ethnographic research, face-to-face interviews and surveys. More sophisticated collectors may use technologies such as radio-frequency identification (RFID) tags or Computer-Assisted Self Interviewing (CASI).
Data Extraction
Data Extraction refers to the process of retrieving data from data sources, which may be structured or unstructured. To illustrate, the process is relatively simple if the source is an Excel sheet with questionnaire data arranged neatly in rows and columns. Conversely, far more effort is required to extract data from a webpage that was not set up for the purpose of data collection.
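The contrast above can be made concrete with a short sketch: pulling rows from a structured (CSV-style) source is a single call, while pulling a value out of page markup needs a parser. The HTML snippet, tag and class name are illustrative assumptions.

```python
# Structured vs unstructured extraction, using only the standard library.
import csv
import io
from html.parser import HTMLParser

# Structured source: rows arrive ready to use.
rows = list(csv.reader(io.StringIO("name,age\nCharlie,29")))

# Unstructured source: we must locate the data inside page markup.
class PriceParser(HTMLParser):
    """Collect the text of elements marked with class="price"."""
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        self.in_price = ("class", "price") in attrs

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(data.strip())
            self.in_price = False

parser = PriceParser()
parser.feed('<div><span class="price">$19.90</span></div>')
```

Real webpages are messier than this one-line snippet, which is exactly why unstructured extraction takes more effort.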
Data Mining
Many people mistake Data Mining for Data Extraction. They are, in fact, two very different processes. As stated above, Data Extraction is about retrieving data from data sources. Data Mining, on the other hand, is about making sense of the data by discovering patterns in them, through techniques such as:
- Anomaly Detection: identifying outliers or unusual data records
- Association Rule Learning: searching for relationships between variables
- Clustering: discovering groups of data that are similar to one another in some attributes
- Classification: generalizing known structure to apply to new data
- Regression: finding a function that most accurately models the data
- Summarization: presenting the findings in a compact manner
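As a taste of one of these techniques, here is a minimal anomaly-detection sketch: flag any value more than two standard deviations from the mean. The dataset and the two-sigma threshold are illustrative assumptions.

```python
# A minimal anomaly-detection sketch using only the standard library.
from statistics import mean, pstdev

def find_outliers(values, z=2.0):
    """Return values lying more than z standard deviations from the mean."""
    m, s = mean(values), pstdev(values)
    return [v for v in values if abs(v - m) > z * s]

daily_orders = [102, 98, 101, 97, 103, 99, 500]  # one unusual record
outliers = find_outliers(daily_orders)
```

Real anomaly detection accounts for trends, seasonality and non-normal distributions; a fixed z-score cutoff is only the simplest starting point.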
Data Point
A Data Point is a measurable variable that can be collected from every member of a statistical population. For example, if we would like to find out the determinants of a person’s punctuality, the data points we are interested in may be their years of education, income, age and number of dependents.
Machine Learning
The term was coined by Professor Arthur Lee Samuel, an expert in the field of computer gaming and artificial intelligence. In his words, Machine Learning gives “computers the ability to learn without being explicitly programmed.” Specifically, the program observes the ‘environment’ in which it operates and adapts its behavior according to its experience and the data it studies, unlike a traditionally programmed procedure (If ‘P’ is recorded, output ‘Q’. ‘P’ has been recorded. Output ‘Q’.).
There are three categories of Machine Learning:
- Supervised learning: the program is given a data set with predetermined labels, and learns to determine the appropriate labels for new data; applications include speech recognition and the recommended video list on YouTube.
- Unsupervised learning: without any predetermined labels, the program discovers hidden similarities and patterns among data; applications include bioinformatics for genetic clustering and facial recognition.
- Reinforcement learning: the program constantly receives positive and negative feedback, much as a trainer gives a dog treats when it follows instructions and punishes it when it refuses, and formulates a strategy to improve its chances of making optimal decisions; applications include poker bots and robotic arms.
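To make supervised learning concrete, here is a minimal sketch of one of its simplest forms: a one-nearest-neighbour classifier that labels a new point by copying the label of the closest labelled example. The points and labels are illustrative assumptions, not from the article.

```python
# A minimal supervised-learning sketch: one-nearest-neighbour classification.
def nearest_label(train, point):
    """train: list of ((x, y), label) pairs; return the closest point's label."""
    def sq_dist(p, q):
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    return min(train, key=lambda item: sq_dist(item[0], point))[1]

# Labelled training data: two clusters with predetermined labels.
train = [((1, 1), "cat"), ((1, 2), "cat"), ((8, 9), "dog"), ((9, 8), "dog")]
label = nearest_label(train, (2, 1))  # the new point sits near the 'cat' cluster
```

Production systems use far richer models than a single nearest neighbour, but the core supervised-learning loop is the same: learn from labelled examples, then label new data.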
Data Visualization
The act of presenting a large amount of quantitative data concisely and meaningfully. End products include charts, graphs, infographics and even a subway map.
A Final Word
The world of Data Analytics is ever-changing. As enterprises and even individuals grow hungrier for data, the ways we collect, analyze and interpret data will continue to evolve. Data Analytics and Data Science form an exciting field indeed: the engine for growth in today’s world, bringing us sophisticated technologies we never thought possible and enabling us to do things in ways never done before.