Week 6: R Programming Fundamentals

Learn R, RStudio and the tidyverse - the language of modern data analysis

Duration: 7 Days | Level: Beginner-Friendly | No prior coding required

If you have finished Weeks 1-5 of EDUSHARK, you can already analyse data with Excel and SQL, and you understand the statistics behind it. This week, we add the third leg of the analyst's tripod: a real programming language built for data. R is free, open-source, widely used in pharma, banking and research labs, and it pairs with R Shiny - the dashboarding tool we will build in Weeks 7-8.

This guide is written for absolute beginners. No prior programming required - we start with arithmetic and end with a complete data-wrangling pipeline.

Day 1

R, RStudio and Hello World

Set up the tool you will use for the next three weeks

Installing R and RStudio

R is the engine; RStudio is the comfortable cockpit you drive it from. Download both (both are free):

  1. Go to cran.r-project.org, click your operating system, install R.
  2. Go to posit.co/download/rstudio-desktop and install the free RStudio Desktop.
  3. Open RStudio. You will see four panes (the Script pane appears once you open or create a script).

The four RStudio panes:
  • Script (top-left): where you write code you want to keep.
  • Console (bottom-left): where R executes commands.
  • Environment / History (top-right): what variables you have created.
  • Files / Plots / Help (bottom-right): file browser, charts, documentation.

Your first R commands

    # R is a calculator at the smallest scale
    2 + 2                    # [1] 4
    log(100, base = 10)      # [1] 2

    # Variables store values for re-use
    revenue <- 1250000
    cost    <- 870000
    profit  <- revenue - cost
    profit                   # [1] 380000

    # Getting help
    ?mean
    help.search("regression")

Pro tip: <- is R's preferred assignment operator. The keyboard shortcut in RStudio is Alt + - (Option + - on macOS). Use it consistently: = also works for assignment, but by convention it is reserved for function arguments.

Day 2

Data Types and Structures

The five shapes of R data you'll see every day

The atomic types

R recognises a handful of basic types: numeric (any real number), integer, character (text), logical (TRUE/FALSE), and a couple of rarely used ones (complex, raw). Use typeof(x) or class(x) to check.
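
The difference between typeof() and class() is easiest to see on a few literal values. A minimal sketch, using only base R:

```r
# typeof() reports the internal storage type;
# class() reports the higher-level class that functions dispatch on.
typeof(3)         # "double"  - plain numbers are stored as doubles
class(3)          # "numeric"
typeof(3L)        # "integer" - the L suffix requests an integer
typeof("three")   # "character"
typeof(TRUE)      # "logical"
```

Note that typeof() calls the numeric type "double"; the two names refer to the same thing in everyday use.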

Vectors - R's workhorse

A vector is an ordered collection of values of the same type. Almost everything in R is a vector under the hood.

    prices  <- c(199, 249, 89, 1299, 349)
    in_sale <- prices < 300     # logical vector: TRUE/FALSE per element
    prices[in_sale]             # [1] 199 249  89

    mean(prices)
    sd(prices)
    range(prices)
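
Two consequences of "same type" and "everything is a vector" are worth seeing once. The sketch below uses base R only; the variable name mixed is made up for illustration:

```r
# A vector holds one type; mixing types coerces to the most general one
mixed <- c(1, "two", TRUE)
typeof(mixed)        # "character"

# Arithmetic and comparisons are applied element by element
prices <- c(199, 249, 89, 1299, 349)
prices * 2           # 398 498 178 2598 698
sum(prices < 300)    # 3, because each TRUE counts as 1

# Even a single number is a vector of length one
length(42)           # 1
```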

Data frames - the table

A data frame is a table: rows are observations, columns are variables. This is what you read from CSV files and what most analyses revolve around.

    sales <- data.frame(
      product = c("A", "B", "C"),
      units   = c(10, 25, 7),
      price   = c(199, 249, 89)
    )

    sales$revenue <- sales$units * sales$price   # add a computed column
    sales[sales$revenue > 1000, ]                # keep rows with revenue above 1000

Lists and factors

A list can hold elements of different types and lengths - think of it as the JSON of R. A factor stores a categorical variable with a fixed set of levels, and optionally an order - always use it for ordered categories like ratings or T-shirt sizes.
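
A short sketch of both structures, in base R. The names order_info and sizes, and the values inside them, are invented for illustration:

```r
# A list can mix types and lengths; access elements with $ or [[ ]]
order_info <- list(
  id    = 1001L,
  items = c("mug", "poster"),
  paid  = TRUE
)
order_info$items      # "mug" "poster"
length(order_info)    # 3

# An ordered factor knows that S < M < L, which plain text does not
sizes <- factor(
  c("M", "S", "L", "M"),
  levels  = c("S", "M", "L"),
  ordered = TRUE
)
levels(sizes)         # "S" "M" "L"
sizes < "L"           # TRUE TRUE FALSE TRUE
```

Without the explicit levels argument, R would sort the levels alphabetically (L, M, S), which is rarely the order you want for sizes or ratings.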

Day 3

Control Flow and Functions

Make your code reusable in 90 seconds

    # A simple custom function (handles one score at a time)
    classify <- function(score) {
      if (score >= 80) "A" else if (score >= 60) "B" else "C"
    }
    classify(72)                            # [1] "B"

    # Vectorised conditional - the R way
    scores <- c(45, 72, 88)
    ifelse(scores >= 60, "pass", "fail")    # [1] "fail" "pass" "pass"

    # Loops exist, but you'll rarely need them in analyst code
    prices <- c(199, 249, 89, 1299, 349)
    sapply(prices, function(p) p * 1.18)    # adds 18% GST
If you find yourself writing a for-loop over the rows of a data frame, stop. There is almost always a vectorised dplyr or apply version that is shorter, cleaner and usually much faster. Loops are for control flow, not for data manipulation.
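
To make the comparison concrete, here is the same GST calculation written both ways. This is a sketch in base R; the variable names are illustrative:

```r
prices <- c(199, 249, 89, 1299, 349)

# Loop version: pre-allocate the result, then fill it one element at a time
with_gst_loop <- numeric(length(prices))
for (i in seq_along(prices)) {
  with_gst_loop[i] <- prices[i] * 1.18
}

# Vectorised version: one expression over the whole vector
with_gst_vec <- prices * 1.18

all.equal(with_gst_loop, with_gst_vec)    # TRUE

# ifelse() is the vectorised counterpart of if/else
tier <- ifelse(prices < 300, "budget", "premium")
tier    # "budget" "budget" "budget" "premium" "premium"
```

Both versions give the same numbers; the vectorised one simply has less code in which to make a mistake.
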
Day 4

The Tidyverse and the Pipe

One philosophy, nine core packages, infinite analyses

The tidyverse is a collection of R packages that share a consistent philosophy. Install it once, then load it at the start of every session:

    install.packages("tidyverse")   # once per machine
    library(tidyverse)              # once per session

The pipe (%>%) takes the result on its left and passes it as the first argument of the function on its right. It turns nested function calls into a readable, left-to-right recipe.

    # Nested - read inside-out, painful
    head(arrange(filter(mtcars, hp > 100), desc(mpg)), 5)

    # Piped - read top-down, like a recipe
    mtcars %>%
      filter(hp > 100) %>%
      arrange(desc(mpg)) %>%
      head(5)

Reading data

    library(readr)

    orders <- read_csv("data/orders.csv")
    write_csv(orders, "data/clean_orders.csv")
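
read_csv() guesses column types by default. When the types matter, they can be stated explicitly. The sketch below uses a made-up two-row table and a temporary file, so it does not depend on data/orders.csv existing:

```r
library(readr)

# Hypothetical toy data, written to a temporary file
toy  <- data.frame(
  order_id   = c(1, 2),
  order_date = c("2024-01-05", "2024-02-11")
)
path <- tempfile(fileext = ".csv")
write_csv(toy, path)

# col_types makes the parsing explicit instead of relying on guessing
orders <- read_csv(
  path,
  col_types = cols(
    order_id   = col_integer(),
    order_date = col_date()
  )
)

class(orders$order_date)    # "Date"
```
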
Day 5

Data Manipulation with dplyr

Six verbs that run half your job

    Verb          What it does
    select()      Keep or drop specific columns
    filter()      Keep rows that match a condition
    mutate()      Create or modify columns
    arrange()     Sort rows ascending/descending
    summarise()   Collapse a group into one row
    group_by()    Set the grouping for subsequent verbs

    library(dplyr)
    library(lubridate)

    orders %>%
      filter(order_date >= today() - days(90)) %>%     # last 90 days
      inner_join(products, by = "product_id") %>%
      mutate(revenue = quantity * unit_price) %>%
      group_by(category) %>%
      summarise(
        n_orders      = n(),
        total_revenue = sum(revenue),    # new name, so revenue is not overwritten
        avg_revenue   = mean(revenue)    # still the row-level revenue
      ) %>%
      arrange(desc(total_revenue))
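
One detail of summarise() is easy to miss: a later argument sees columns created earlier in the same call. If a summary reuses the name of an existing column, every expression after it works on the summarised value instead of the original rows. Giving each summary its own name avoids this. A small sketch on made-up data (the table order_lines and its values are hypothetical):

```r
library(dplyr)

# Hypothetical toy data standing in for the joined orders table
order_lines <- tibble(
  category   = c("books", "books", "toys"),
  quantity   = c(2, 1, 3),
  unit_price = c(10, 30, 5)
)

by_category <- order_lines %>%
  mutate(revenue = quantity * unit_price) %>%
  group_by(category) %>%
  summarise(
    n_lines       = n(),
    total_revenue = sum(revenue),
    avg_revenue   = mean(revenue)   # row-level revenue, not the total
  ) %>%
  arrange(desc(total_revenue))

by_category
# books: 2 lines, total 50, average 25
# toys:  1 line,  total 15, average 15
```
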
Day 6

Reshaping with tidyr and Joins

Long, wide and joined

Tidy data: one row per observation, one column per variable, one table per kind of observational unit. Almost every painful analysis problem becomes easy once your data is tidy.

    # Wide -> long
    sales_long <- sales_wide %>%
      pivot_longer(cols = jan:dec, names_to = "month", values_to = "revenue")

    # Long -> wide
    sales_long %>%
      pivot_wider(names_from = month, values_from = revenue)
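
The same round trip can be tried on a table small enough to check by eye. The sales_wide below is a made-up stand-in with only two months:

```r
library(tidyr)

# Hypothetical wide table: one row per product, one column per month
sales_wide <- data.frame(
  product = c("A", "B"),
  jan     = c(100, 80),
  feb     = c(120, 90)
)

sales_long <- pivot_longer(
  sales_wide,
  cols      = jan:feb,
  names_to  = "month",
  values_to = "revenue"
)
nrow(sales_long)    # 4: two products x two months

# Pivoting back recovers the original shape
sales_back <- pivot_wider(sales_long, names_from = month, values_from = revenue)
dim(sales_back)     # 2 3
```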

Joins

The dplyr joins map directly to SQL: inner_join(), left_join(), right_join(), full_join(), plus the useful semi_join() (keep matching rows from x) and anti_join() (keep non-matching rows from x).

    orders %>%
      left_join(customers, by = "customer_id") %>%
      left_join(products, by = "product_id")
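
The filtering joins have no direct keyword in SQL, so a tiny example helps. The two tables below are invented for illustration:

```r
library(dplyr)

customers <- tibble(customer_id = c(1, 2, 3), name = c("Asha", "Ben", "Chen"))
orders    <- tibble(order_id = c(10, 11, 12), customer_id = c(1, 1, 4))

# left_join keeps every order; order 12 has no matching customer, so name is NA
joined <- left_join(orders, customers, by = "customer_id")

# semi_join: customers with at least one order (no columns added, no duplicates)
with_orders <- semi_join(customers, orders, by = "customer_id")
with_orders$name       # "Asha"

# anti_join: customers with no orders at all
without_orders <- anti_join(customers, orders, by = "customer_id")
without_orders$name    # "Ben" "Chen"
```

anti_join() is also a quick data-quality check: run it in the other direction to list orders whose customer_id does not exist in customers.
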
Day 7

Project: Data Wrangling Pipeline

Build a complete pipeline using everything from this week

Build a single R script (wrangle.R) that:

  1. Reads four CSV files (orders, items, customers, products) from /data/week6/.
  2. Joins them into one tidy table.
  3. Cleans column names with janitor::clean_names().
  4. Handles missing prices by imputing the category median.
  5. Pivots to a monthly category-revenue table (one row per category, one column per month).
  6. Writes the final table to data/analysis_ready.csv.

A complete starter script is provided in /data/week6-r-wrangling-starter.R - download it, walk through line-by-line, and adapt to your dataset.
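
Step 4 is the one most people get stuck on. The sketch below shows one possible way to impute a category median with dplyr. It is not taken from the starter script, and the products table and its values are made up:

```r
library(dplyr)

# Hypothetical product table with two missing prices
products <- tibble(
  product_id = 1:5,
  category   = c("books", "books", "books", "toys", "toys"),
  price      = c(10, NA, 30, 8, NA)
)

imputed <- products %>%
  group_by(category) %>%
  # Within each category, replace NA with the median of the known prices
  mutate(price = if_else(is.na(price), median(price, na.rm = TRUE), price)) %>%
  ungroup()

imputed$price    # 10 20 30 8 8
```

If every price in a category is missing, the median is NA as well and those rows stay NA, so check for leftover NA values after imputing.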

Coming up: Week 7 - Advanced R and ggplot2

Master publication-quality charts, advanced dplyr, string handling with stringr and dates with lubridate. The bridge to R Shiny.
