Data Pre-Processing with R

In this post I discuss how to pre-process data using the R language. I'm using RStudio for the data analysis.

For this analysis I'm using the freely available Ta-Feng Supermarket data set. You can download the data set here. A description of the data set can be found here.

First of all we have to set our working directory. Let's assume that we are going to use the folder named "R_Work_Space" on the Desktop as our working directory. Then we can set the working directory as:

 setwd("{Path to Desktop}/Desktop/R_Work_Space")  

Then let's load our data set as follows.

 suppermarket_dataset <- read.csv(file="SupperMarketData.csv", header=TRUE, sep=",")  
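To see what read.csv produces without needing the real file, here is a minimal sketch that writes a tiny CSV to a temporary location and reads it back. The file contents and the names toy_path and toy are invented for illustration; they are not the Ta-Feng data.

```r
# Toy CSV written to a temporary file (contents are invented for illustration)
toy_path <- tempfile(fileext = ".csv")
writeLines(c("Date,Customer.ID,Age,Amount",
             "11/1/2000,1104905,D,2",
             "11/2/2000,418683,A,1"),
           toy_path)

# header = TRUE treats the first line as column names;
# sep = "," is already the default for read.csv
toy <- read.csv(file = toy_path, header = TRUE, sep = ",")

dim(toy)    # number of rows and columns
names(toy)  # column names taken from the header line
```

Checking dim( ) and names( ) right after loading is a quick way to confirm that the separator and header settings matched the file.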

You can view the loaded data set with the "View" command, i.e. View(suppermarket_dataset). RStudio then opens the data set in a spreadsheet-style viewer tab.

Let's start pre-processing.

First of all, let's look at the type of each attribute in the data set. This will be useful for our later analysis. The "str( )" function in R does this: str(suppermarket_dataset). Its output lists every column with its type and its first few values.

Here it doesn't make sense for Customer ID and Product Subclass to be integers: their values are identifiers, not quantities to do arithmetic on. We should convert those fields to factors. The following code segment does the job.

 suppermarket_dataset$Customer.ID <- as.factor(suppermarket_dataset$Customer.ID)  
 suppermarket_dataset$Product.Subclass <- as.factor(suppermarket_dataset$Product.Subclass)  

Again use "str( )" function to verify your conversion.
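The effect of the conversion can be checked on a small example. This is only a sketch on a toy data frame whose values are invented for illustration.

```r
# Toy data frame; the values are invented for illustration
toy <- data.frame(Customer.ID = c(1104905L, 418683L, 1104905L),
                  Amount = c(2L, 1L, 3L))

class(toy$Customer.ID)    # "integer": R would treat the IDs as numbers

toy$Customer.ID <- as.factor(toy$Customer.ID)

class(toy$Customer.ID)    # "factor"
nlevels(toy$Customer.ID)  # 2 distinct customers
str(toy)                  # Customer.ID is now listed as a Factor
```

Once the column is a factor, functions such as summary( ) and table( ) count occurrences per customer instead of computing a mean of the ID numbers.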

Next, our analysis needs the day of the week. Using R we can add a new column named "Day" to our data set, holding the day of the week for each value in the "Date" column.

 suppermarket_dataset$Day <- as.factor(weekdays(as.Date(suppermarket_dataset$Date, "%m/%d/%Y")))  

Use "View" command to view the data set and you can now see that a new column called "Day" has been added to the data set.
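The date handling can be tried on a couple of values first. The sketch below uses invented dates and assumes they are written as month/day/year, the same format string as above. Note that weekdays( ) returns names in the current locale, so the sketch also shows a locale-independent weekday number.

```r
# Toy dates in month/day/year form (invented for illustration)
toy_dates <- c("12/25/2000", "12/26/2000")

parsed <- as.Date(toy_dates, format = "%m/%d/%Y")

# Day names in the current locale (e.g. "Monday" in an English locale)
day_name <- weekdays(parsed)

# Locale-independent alternative: ISO weekday number, 1 = Monday ... 7 = Sunday
day_number <- as.integer(format(parsed, "%u"))

# A string that does not match the format becomes NA instead of raising an error
bad <- as.Date("25-12-2000", format = "%m/%d/%Y")
is.na(bad)  # TRUE
```

Because mismatched dates silently turn into NA, it may be worth running sum(is.na(...)) on the parsed column before relying on the new "Day" values.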

If we had "Time" information and wanted to analyse the data hour by hour, we could extract the hour from the time using the "hour" function in the "lubridate" package. First we have to install the package with install.packages("lubridate").

The next step is loading the installed package with library(lubridate).

Now we can call the "hour" function as below to derive the hour from the "Time" information.

 suppermarket_dataset$Hour <- hour(strptime(suppermarket_dataset$Time, "%H:%M:%S"))  

Please note that the above command expects the time in the international standard 24-hour notation, hh:mm:ss (for example 14:30:00).
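As an alternative sketch that needs no extra package, the hour can also be read from the object that strptime( ) returns. The times below are invented, and the names toy_times, toy_hours and pm_hour are only for illustration.

```r
# Toy times in 24-hour HH:MM:SS form (invented for illustration)
toy_times <- c("09:15:00", "14:30:00", "23:59:59")

# strptime() returns a POSIXlt object whose hour field holds the hour (0-23)
parsed <- strptime(toy_times, format = "%H:%M:%S", tz = "UTC")
toy_hours <- parsed$hour

# A 12-hour string with an AM/PM suffix still parses, because strptime()
# ignores trailing characters, but the afternoon is lost: hour 2, not 14
pm_hour <- strptime("02:30:00 PM", format = "%H:%M:%S", tz = "UTC")$hour
```

The last line shows why the notation matters: a 12-hour time does not fail, it just produces the wrong hour.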

Then we have to calculate the total amount spent by each customer in each transaction. We can calculate that as follows.

 suppermarket_dataset$Total_Amount <- suppermarket_dataset$Amount*suppermarket_dataset$Sales.price  

In our data set we have a column called "Assest" which will not be used in our analysis. It is the eighth column, so let's remove it by its position.

 suppermarket_dataset <- suppermarket_dataset[,-c(8)]  
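Removing a column by position depends on the column order staying the same. As a hedged alternative, the sketch below drops a column by name on a toy data frame. The column names and values are invented for illustration, so the name would need to match the header in the real file.

```r
# Toy data frame; names and values are invented for illustration
toy <- data.frame(Amount = c(2, 1), Asset = c(30, 12), Sales.price = c(45, 20))

# Keep every column whose name is not in the drop list
drop_cols <- c("Asset")
toy <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]

names(toy)  # "Amount" "Sales.price"
```

The drop = FALSE argument keeps the result a data frame even if only one column is left.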

Let's assume that in our analysis we want to exclude the purchases made by people who belong to the "Below 25" age category. The "Below 25" age category is represented by the letter "A" in the "Age" column.

 suppermarket_dataset <- suppermarket_dataset[suppermarket_dataset$Age != "A",]  

We can use the above code segment to remove records that belong to "Below 25" age category.
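Row filtering of this kind can be checked on a small example. The sketch below uses a toy data frame with invented values and assumes "Age" is stored as a factor, in which case the removed category still appears as an empty level until it is dropped.

```r
# Toy data frame; the values are invented for illustration
toy <- data.frame(Age = factor(c("A", "B", "A", "C")),
                  Amount = c(1, 2, 3, 4))

# A logical condition inside [ ] keeps the rows where it is TRUE
toy <- toy[toy$Age != "A", ]

nrow(toy)        # 2
levels(toy$Age)  # "A" "B" "C": the unused level "A" is still recorded

# droplevels() discards factor levels that no longer occur
toy <- droplevels(toy)
levels(toy$Age)  # "B" "C"
```

Comparing nrow( ) before and after the filter is a simple way to confirm that the expected number of records was removed.
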
Now we are done with pre-processing our data set. We should save the new data set for future use. We can write it to a .csv file using the following command.

 write.csv(file="Final_Suppermarket_Dataset.csv", x=suppermarket_dataset)  
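By default write.csv( ) also writes the row names as an extra first column. If that is not wanted, a variant along the following lines may help. It is a sketch on a toy data frame with invented values, written to a temporary file.

```r
# Toy data frame; the values are invented for illustration
toy <- data.frame(Customer.ID = c(1104905L, 418683L),
                  Total_Amount = c(90, 20))

out_path <- tempfile(fileext = ".csv")

# row.names = FALSE stops write.csv() from adding a column of row numbers
write.csv(x = toy, file = out_path, row.names = FALSE)

# Read the file back to confirm the columns survived the round trip
check <- read.csv(out_path)
names(check)  # "Customer.ID" "Total_Amount"
```

A CSV file does not store column types, so columns converted to factors would need converting again after the file is read back.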

Done for the day!
See you in the next post, which will discuss how to do a descriptive analysis of this data set.
