Aug 29, 2017

Clean or shorten Column names while importing the data itself

When it comes to clumsy column headers namely., wide ones with spaces and special characters, I see many get panic and change the headers in the source file, which is an awkward option given variety of alternatives that exist in R for handling them.






One easy handling of such scenarios is using library(janitor), as name suggested can be employed for cleaning and maintaining. Janitor has function by name clean_names() which can be useful while directly importing the data itself as show in the below example:
" library(janitor); newdataobject <- read.csv("yourcsvfilewithpath.csv", header=T) %>% clean_names() " 

Author undertook several projects, courses and programs in data sciences for more than a decade, views expressed here are from his industry experience. He can be reached at mavuluri.pradeep@gmail or besteconometrician@gmail.com for more details.
Find more about author at http://in.linkedin.com/in/pradeepmavuluri

2 comments:

SANJAY said...

Hi Pradeep, read this and will be trying out soon. Thanks.

Dulani said...

I just grabbed a set of Excel data for a project. As always the column names are a nightmare that need to be fixed before proceeding. And then, I think to myself… what was the name of that package I heard about a while ago that addresses this problem (turns to email)? Oh yeah, it was "janitor."

Here's what it looks like in action on my dataset (Australian Bureau of Statistics: 3401.0 - Overseas Arrivals and Departures, Australia, Jun 2017):

install.packages("janitor")
library(janitor)
library(openxlsx)
library(tidyverse)

before <- read.xlsx("../data/australia file.xlsx", sheet = "Data1")
names(before)

Names before:
[1] "X1"
[2] "Number.of.movements.;.Permanent.Settler.Arrivals.;"
[3] "Number.of.movements.;.Long-term.(more.than.one.year).Residents.returning.;"
[4] "Number.of.movements.;.Long-term.(more.than.one.year).Visitors.arriving.;"
[5] "Number.of.movements.;.Permanent.and.Long-term.Arrivals.;"
[6] "Number.of.movements.;.Short-term.(less.than.one.year).Residents.returning.;"
[7] "Number.of.movements.;.Short-term.(less.than.one.year).Visitors.arriving.;"
[8] "Number.of.movements.;.Short-term.(less.than.one.year).Visitors.arriving.;"
[9] "Number.of.movements.;.Short-term.(less.than.one.year).Visitors.arriving.;"
[10] "Number.of.movements.;.Total.Arrivals.;"


after <- before %>%
clean_names()

names(after)

Names after:
[1] "x1"
[2] "number_of_movements_permanent_settler_arrivals"
[3] "number_of_movements_long_term_more_than_one_year_residents_returning"
[4] "number_of_movements_long_term_more_than_one_year_visitors_arriving"
[5] "number_of_movements_permanent_and_long_term_arrivals"
[6] "number_of_movements_short_term_less_than_one_year_residents_returning"
[7] "number_of_movements_short_term_less_than_one_year_visitors_arriving"
[8] "number_of_movements_short_term_less_than_one_year_visitors_arriving_2"
[9] "number_of_movements_short_term_less_than_one_year_visitors_arriving_3"
[10] "number_of_movements_total_arrivals"

Nice… so, I save 5-10 minutes futzing with goofy names, breathe a sigh of relief and get back to work!!!