
EDA – Week 2

September 3, 2018

Your turn-in

Generally, you found interesting datasets and successfully created csv files for your first assignment. Here is some advice that will help you avoid issues when you work with your datasets in R.

  1. First, use simple variable names (short, with no blanks or special characters); the variables will then be easier to work with.
  2. Some of your data values contain commas, dollar signs, etc. Remove these characters before you bring the data into R. You can leave them in if you are comfortable doing string manipulation in R, but I suspect that most of you are not yet familiar with these data science skills.
  3. Generally, you want your dataset to be clean. Eliminate any variable from your dataset that will not be used.
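If you would rather do this cleanup inside R, the three points above can be sketched in base R as follows. This is only an illustration: the data frame `raw` and its column names are made up for the example, and your own csv file will need its own names.

```r
# Hypothetical data, standing in for a csv file read with read.csv()
raw <- data.frame(
  "Median Income ($)" = c("$52,000", "$1,250,500", "$48,300"),
  "State Name"        = c("Ohio", "Texas", "Iowa"),
  "Notes"             = c("a", "b", "c"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# 1. Replace the long names with short, simple ones
names(raw) <- c("income", "state", "notes")

# 2. Remove everything except digits, the decimal point, and the minus sign,
#    then convert the cleaned strings to numbers
raw$income <- as.numeric(gsub("[^0-9.-]", "", raw$income))

# 3. Drop a variable that will not be used
raw$notes <- NULL

str(raw)
```

After these steps `income` is a numeric variable, so summaries and plots will treat it as a number rather than as text.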

Making life easier

Your “Single Batch” assignment can be a bit tedious since it requires you to find outliers using a particular procedure. (You have to find the five number summary, compute the quartile spread and a “step”, set up fences, etc.) To make this process easier, I just added a new function “lval_plus” to my LearnEDAfunctions package that will compute the fences and locate the outliers. You can use this function either for a data frame with a single numeric variable, or for a data frame with a numeric variable and a grouping variable (in which case it will locate the outliers for each group). I also added a document “Finding_outliers.html” which illustrates the use of this new function.
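To see what the by-hand procedure involves, here is a minimal base R sketch of the steps. It is not the code of “lval_plus” (see “Finding_outliers.html” for how to call that function); the helper name `find_fences` and the sample batch are made up for this illustration, and it assumes the usual convention that a step is 1.5 times the quartile spread.

```r
# Hypothetical helper: fences and outliers for a single batch
find_fences <- function(x) {
  x <- x[!is.na(x)]                 # ignore missing values
  five <- fivenum(x)                # min, lower fourth, median, upper fourth, max
  spread <- five[4] - five[2]       # quartile spread
  step <- 1.5 * spread              # assumed convention: 1.5 times the spread
  fences <- c(five[2] - step, five[4] + step)
  list(
    five_number = five,
    spread = spread,
    step = step,
    fences = fences,
    outliers = x[x < fences[1] | x > fences[2]]
  )
}

# A small made-up batch with one unusually large value
batch <- c(2, 4, 5, 6, 7, 8, 9, 30)
result <- find_fences(batch)
result$fences
result$outliers
```

For grouped data, the same helper could be applied to each group with something like `tapply(values, groups, function(v) find_fences(v)$outliers)`, where `values` and `groups` are the numeric and grouping variables.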

My rubric

I don’t use a formal rubric when I grade assignments. But I am more interested in your explanation and interpretation and less interested in your R work. (In my experience, students tend to do well in the mechanics of implementing methods and not as well in explaining what they learn from the R work.) I rarely criticize a student for writing too much.

