Week 6: R Programming Fundamentals

Learn R, RStudio and the tidyverse - the language of modern data analysis

Duration: 7 Days | Level: Beginner-Friendly | No prior coding required

If you have finished Weeks 1-5 of EDUSHARK, you can already analyse data with Excel and SQL, and you understand the statistics behind it. This week, we add the third leg of the analyst's tripod: a real programming language built for data. R is free and open-source, widely used in pharma, banking and research labs, and it pairs with R Shiny - the dashboarding tool we will build with in Weeks 7-8.

This guide is written for absolute beginners. No prior programming required - we start with arithmetic and end with a complete data-wrangling pipeline.

Day 1

R, RStudio and Hello World

Set up the tool you will use for the next three weeks

Installing R and RStudio

R is the engine; RStudio is the comfortable cockpit you drive it from. Download both (both are free):

Go to cran.r-project.org, click your operating system, install R.
Go to posit.co/download/rstudio-desktop and install the free RStudio Desktop.
Open RStudio. You will see four panes.

The four RStudio panes:

Script (top-left): where you write code you want to keep.
Console (bottom-left): where R executes commands.
Environment / History (top-right): the variables you have created and the commands you have run.
Files / Plots / Help (bottom-right): file browser, charts, documentation.

Your first R commands

# At its simplest, R is a calculator
2 + 2                  # [1] 4
log(100, base = 10)    # [1] 2

# Variables store values for re-use
revenue <- 1250000
cost    <- 870000
profit  <- revenue - cost
profit                 # [1] 380000

# Getting help
?mean
help.search("regression")

Pro tip: <- is R's preferred assignment operator. The keyboard shortcut in RStudio is Alt + - (Option + - on macOS). Use it consistently: = also works for assignment, but by convention it is kept for naming function arguments.

Day 2

Data Types and Structures

The five shapes of R data you'll see every day

The atomic types

R recognises a handful of basic types: numeric (real numbers, stored as doubles), integer, character (text), logical (TRUE/FALSE), and two rarely used ones (complex, raw). Use typeof(x) to see how a value is stored, or class(x) to see how R treats it.
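
As a quick check of the types listed above, here is a minimal sketch in base R. The variable names (x_num, x_int and so on) are made up for illustration.

```r
# Hypothetical example values, one per common atomic type
x_num <- 3.14        # numeric, stored as a double
x_int <- 42L         # the L suffix asks for an integer
x_chr <- "EDUSHARK"  # character (text)
x_lgl <- TRUE        # logical

# typeof() reports the storage type of each value
types <- c(
  num = typeof(x_num),
  int = typeof(x_int),
  chr = typeof(x_chr),
  lgl = typeof(x_lgl)
)
types

# class() reports how R treats the object; for a plain double it says "numeric"
class(x_num)
```

Note that typeof() says "double" where class() says "numeric": both describe the same value from different angles.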

Vectors - R's workhorse

A vector is an ordered collection of values of the same type. Almost everything in R is a vector under the hood.

prices  <- c(199, 249, 89, 1299, 349)
in_sale <- prices < 300              # logical vector: TRUE/FALSE per element
prices[in_sale]                       # [1] 199 249  89
mean(prices); sd(prices); range(prices)
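
Because a vector holds a single type, R quietly converts mixed inputs to the most flexible type present. A small sketch of that behaviour, with made-up variable names:

```r
# Mixing numbers, text and logicals: everything becomes character
mixed_text <- c(1, "a", TRUE)
typeof(mixed_text)   # "character"

# Mixing doubles, logicals and integers: everything becomes double
mixed_num <- c(1.5, TRUE, 2L)
typeof(mixed_num)    # "double"
mixed_num            # TRUE has become 1, and 2L has become 2
```

This is worth remembering when a column of numbers arrives as text: one stray non-numeric entry is usually the reason.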

Data frames - the table

A data frame is a table: rows are observations, columns are variables. This is what you read from CSV files and what most analyses revolve around.

sales <- data.frame(
  product = c("A", "B", "C"),
  units   = c(10, 25, 7),
  price   = c(199, 249, 89)
)
sales$revenue <- sales$units * sales$price
sales[sales$revenue > 1000, ]

Lists and factors

A list can hold elements of different types and lengths - think of it as the JSON of R. A factor stores a categorical variable with a fixed set of levels, and optionally an order - always use it for ordered categories like ratings or T-shirt sizes.
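
A short sketch of both structures in base R. The order record and the T-shirt sizes are invented for illustration.

```r
# A list mixes types and lengths in one object
order_info <- list(
  id      = 1001L,
  items   = c("shirt", "cap"),
  shipped = FALSE
)
order_info$items            # access an element by name
length(order_info$items)    # 2

# An ordered factor knows that S < M < L
sizes <- factor(
  c("M", "S", "L", "M"),
  levels  = c("S", "M", "L"),
  ordered = TRUE
)
sizes < "L"                 # TRUE TRUE FALSE TRUE
sort(sizes)                 # sorted by level order (S, M, M, L), not alphabetically
table(sizes)                # counts per level
```

Without the levels argument, R would order the sizes alphabetically (L, M, S), which is rarely what you want in a chart or summary table.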

Day 3

Control Flow and Functions

Make your code reusable in 90 seconds

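# A simple custom function (works on one score at a time)
classify <- function(score) {
  if      (score >= 80) "A"
  else if (score >= 60) "B"
  else                  "C"
}
classify(72)                           # [1] "B"

# Vectorised conditional - the R way
scores <- c(85, 42, 67)
ifelse(scores >= 60, "pass", "fail")   # [1] "pass" "fail" "pass"

# Loops exist, but you'll rarely need them in analyst code
prices <- c(199, 249, 89, 1299, 349)   # same prices as Day 2
sapply(prices, function(p) p * 1.18)   # adds 18% GST, one element at a time
prices * 1.18                          # same result, fully vectorised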

If you find yourself writing a for-loop over rows of a data frame, stop. There is almost always a vectorised dplyr or apply version that is shorter, cleaner and often much faster. Loops are for control flow, not for data manipulation.
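
To make the comparison concrete, here is a minimal sketch of the same calculation written both ways. It reuses the Day 2 prices; the result names are made up.

```r
prices <- c(199, 249, 89, 1299, 349)

# Loop version: fill a pre-allocated vector one element at a time
with_gst_loop <- numeric(length(prices))
for (i in seq_along(prices)) {
  with_gst_loop[i] <- prices[i] * 1.18
}

# Vectorised version: the arithmetic applies to every element at once
with_gst_vec <- prices * 1.18

# Both approaches give the same numbers
all.equal(with_gst_loop, with_gst_vec)
```

The vectorised line says what you want (every price plus 18%) without the bookkeeping of indices and pre-allocation.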

Day 4

The Tidyverse and the Pipe

One philosophy, nine core packages, infinite analyses

The tidyverse is a collection of R packages that share a consistent philosophy. Install it once, then load it at the start of each session:

install.packages("tidyverse")
library(tidyverse)

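The pipe (%>%) takes the result on its left and passes it as the first argument of the function on its right. It turns nested function calls into a readable, step-by-step recipe. R 4.1 and later also include a native pipe, |>, which behaves the same way in simple cases like the ones in this guide.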

# Nested - read inside-out, painful
head(arrange(filter(mtcars, hp > 100), desc(mpg)), 5)

# Piped - read top-down, like a recipe
mtcars %>%
  filter(hp > 100) %>%
  arrange(desc(mpg)) %>%
  head(5)

Reading data

library(readr)
orders <- read_csv("data/orders.csv")
write_csv(orders, "data/clean_orders.csv")

Day 5

Data Manipulation with dplyr

Five verbs, plus group_by(), that run half your job

Verb	What it does
`select()`	Keep or drop specific columns
`filter()`	Keep rows that match a condition
`mutate()`	Create or modify columns
`arrange()`	Sort rows ascending/descending
`summarise()`	Collapse a group into one row
`group_by()`	Set the grouping for subsequent verbs

library(dplyr); library(lubridate)

# Assumes orders and products are already loaded (e.g. with read_csv())
orders %>%
  filter(order_date >= today() - days(90)) %>%   # last 90 days
  inner_join(products, by = "product_id") %>%
  mutate(revenue = quantity * unit_price) %>%
  group_by(category) %>%
  summarise(
    n_orders        = n(),
    total_revenue   = sum(revenue),
    avg_order_value = mean(revenue)   # uses row-level revenue, not the total
  ) %>%
  arrange(desc(total_revenue))
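
One detail worth knowing about summarise(): it evaluates its expressions in order, so a summary that reuses an existing column name replaces that column for every expression after it. The sketch below uses invented toy data (toy_orders) to show the effect.

```r
library(dplyr)

# Hypothetical toy data: one row per order
toy_orders <- tibble(
  category = c("books", "books", "toys"),
  revenue  = c(100, 300, 50)
)

# Pitfall: 'revenue' is overwritten by its sum, so mean() only sees the total
shadowed <- toy_orders %>%
  group_by(category) %>%
  summarise(revenue = sum(revenue), avg_revenue = mean(revenue))

# Safer: give every summary column a new name
safe <- toy_orders %>%
  group_by(category) %>%
  summarise(total_revenue = sum(revenue), avg_revenue = mean(revenue))

shadowed$avg_revenue   # 400 50 - the totals, not the averages
safe$avg_revenue       # 200 50
```

Giving summary columns their own names (total_revenue rather than revenue) is a cheap habit that avoids this class of silent error.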

Day 6

Reshaping with tidyr and Joins

Long, wide and joined

Tidy data: one row per observation, one column per variable, one table per kind of observational unit. Almost every painful analysis problem becomes easy once your data is tidy.

# Wide -> long
sales_long <- sales_wide %>%
  pivot_longer(cols = jan:dec, names_to = "month", values_to = "revenue")

# Long -> wide
sales_long %>%
  pivot_wider(names_from = month, values_from = revenue)
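
If you do not have a wide table to hand, the following self-contained sketch runs the same round trip on a tiny invented table. The name toy_wide and its three month columns are assumptions for illustration only.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table: one row per category, one column per month
toy_wide <- tibble(
  category = c("books", "toys"),
  jan = c(100, 40),
  feb = c(120, 55),
  mar = c(90, 70)
)

# Wide -> long: one row per category-month pair
toy_long <- toy_wide %>%
  pivot_longer(cols = jan:mar, names_to = "month", values_to = "revenue")
nrow(toy_long)   # 6 (2 categories x 3 months)

# Long -> wide: back to one column per month
toy_back <- toy_long %>%
  pivot_wider(names_from = month, values_from = revenue)
names(toy_back)  # "category" "jan" "feb" "mar"
```

The long form is the one dplyr and ggplot2 prefer; the wide form is usually for reports and spreadsheets.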

Joins

The dplyr joins map directly to SQL: inner_join(), left_join(), right_join(), full_join(), plus the useful filtering joins semi_join() (keep rows of x that have a match in y) and anti_join() (keep rows of x that have no match in y).

orders %>%
  left_join(customers, by = "customer_id") %>%
  left_join(products,  by = "product_id")
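
To see how the join types differ, here is a small sketch on two invented tables. The names toy_orders and toy_customers and their contents are assumptions for illustration.

```r
library(dplyr)

# Hypothetical data: order 4 belongs to a customer we have no record of,
# and customer 12 has never ordered
toy_orders <- tibble(
  order_id    = 1:4,
  customer_id = c(10, 11, 10, 99)
)
toy_customers <- tibble(
  customer_id = c(10, 11, 12),
  name        = c("Asha", "Ben", "Chen")
)

kept_all   <- left_join(toy_orders, toy_customers, by = "customer_id")   # 4 rows, name is NA for order 4
matched    <- inner_join(toy_orders, toy_customers, by = "customer_id")  # 3 rows
have_order <- semi_join(toy_customers, toy_orders, by = "customer_id")   # customers 10 and 11, once each
no_order   <- anti_join(toy_customers, toy_orders, by = "customer_id")   # customer 12
orphans    <- anti_join(toy_orders, toy_customers, by = "customer_id")   # order 4
```

anti_join() is a handy data-quality check before a real join: an empty result means every key found a match.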

Day 7

Project: Data Wrangling Pipeline

Build a complete pipeline using everything from this week

Build a single R script (wrangle.R) that:

Reads four CSV files (orders, items, customers, products) from /data/week6/.
Joins them into one tidy table.
Cleans column names with janitor::clean_names().
Handles missing prices by imputing the category median.
Pivots to a monthly category-revenue table (one row per category, one column per month).
Writes the final analysis-ready table to data/analysis_ready.csv.
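
The imputation step is the one most people get stuck on, so here is one possible sketch of it. This is not the provided starter script: the table toy_items and the column names category and unit_price are assumptions, so adapt them to your own data.

```r
library(dplyr)

# Hypothetical item table with two missing prices
toy_items <- tibble(
  category   = c("books", "books", "books", "toys", "toys"),
  unit_price = c(10, NA, 30, 5, NA)
)

# Within each category, replace a missing price with that category's median
imputed <- toy_items %>%
  group_by(category) %>%
  mutate(unit_price = if_else(
    is.na(unit_price),
    median(unit_price, na.rm = TRUE),
    unit_price
  )) %>%
  ungroup()

imputed$unit_price   # 10 20 30 5 5
```

If every price in a category is missing, the median is NA too and the gap stays, so check for remaining NA values after this step.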

A complete starter script is provided in /data/week6-r-wrangling-starter.R - download it, walk through line-by-line, and adapt to your dataset.

Coming up: Week 7 - Advanced R and ggplot2

Master publication-quality charts, advanced dplyr, string handling with stringr and dates with lubridate. The bridge to R Shiny.
