Case study (R, big data analysis): Data Mining for Business Analytics
2017-08-28


Topics: Data Exploration and Visualization


A note about submissions: Unlike Olympic figure skating and ski jumping, AD699 does not award style points. There are some fancy tools within R for generating reports, such as RMarkdown, but learning them is not within the scope of this course. The most important thing here is to answer the questions that ask for written answers, and to show screenshots where screenshots are asked for.


This project is due by 11:59 p.m. on Monday, September 17th.


Step 1:

Download this file from our course Blackboard site:


a) athlete_events.csv

Part I: Data Exploration

1. Bring this file into your R environment. Assign the name athletes to this file. Show the code that you used to do this. (Remember to first set your working directory to the folder that contains your files).


2. How many rows and how many columns does athletes contain? How do you know this?


3. Are there any missing values in the athletes data set? If so, how do you know this? (Note: There are MANY ways that you could answer this question, and any valid way is completely fine).


4. Remove all rows in the athletes data set that contain any missing values, and store the results of this operation in a new variable called athletes2. What are the dimensions of athletes2?


5. Based on the data in athletes2, what is the mean age of an Olympic medalist? What is the median age? Show the code that you used to find this out, along with a screenshot of your results.


6. How many Olympic medalists were male, and how many were female? (Hint: Use the table function to help you with this). Show the code that you used to find this out, along with a screenshot of your results.





7. How old was the youngest Olympic medal winner in the dataset? How old was the oldest Olympic medal winner in the dataset? Show the code that you used to find this out, along with a screenshot of your results.
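
As a rough illustration of the kinds of base R calls that Part I is asking about, here is a minimal sketch. It does not use the real athlete_events.csv: it writes a tiny made-up file to a temporary folder and reads that instead. The column names (Sex, Age, Height, Weight, NOC, Medal) and every value are assumptions made for the example, so check them against the actual file. It also assumes that non-medalists have a missing Medal value, which is why removing incomplete rows leaves only medalists here.

```r
# Toy stand-in for athlete_events.csv. Column names and values are
# assumptions for illustration only, not taken from the real file.
toy <- data.frame(
  Name   = c("A", "B", "C", "D", "E", "F"),
  Sex    = c("M", "F", "F", "M", "F", "M"),
  Age    = c(24, 19, NA, 31, 27, 22),
  Height = c(180, 165, 170, NA, 172, 185),
  Weight = c(75, 55, 60, 82, 63, 90),
  NOC    = c("USA", "GER", "USA", "FRA", "GER", "USA"),
  Medal  = c("Gold", "Silver", "Bronze", "Gold", NA, "Bronze"),
  stringsAsFactors = FALSE
)
csv_path <- file.path(tempdir(), "athlete_events_toy.csv")  # hypothetical file name
write.csv(toy, csv_path, row.names = FALSE)

# Read the file (with the real data you would call setwd() first
# and pass "athlete_events.csv" instead of csv_path)
athletes <- read.csv(csv_path, stringsAsFactors = FALSE)

# Rows and columns
dim(athletes)                 # 6 rows, 7 columns for this toy data

# One of many ways to check for missing values
anyNA(athletes)               # TRUE if at least one NA is present
colSums(is.na(athletes))      # count of NAs in each column

# Drop every row that has at least one missing value
athletes2 <- na.omit(athletes)
dim(athletes2)

# Mean and median age, counts by sex, youngest and oldest
mean(athletes2$Age)
median(athletes2$Age)
table(athletes2$Sex)
range(athletes2$Age)          # c(minimum, maximum)
```

Other functions such as `summary()`, `nrow()`, `ncol()`, `complete.cases()`, `min()`, and `max()` would work just as well for several of these steps.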



Part II: Data Visualization

1. Filter the dataset so that it only contains information for your particular Olympiad. Student Olympiad projects can be found in Blackboard, in the same folder that contains this project prompt. Assign a new variable name to this dataset that only contains your Olympiad. Show the code that you used to accomplish this, along with a screenshot of your results.


2. Using ggplot, create a histogram that depicts the distribution of medal winners for your Olympiad by age. Show the code that you used to accomplish this, along with a screenshot of your results.


3. Now, modify your histogram by specifying a bin width (or a number of bins) that you chose (i.e., not the default). Specify a color for the bins in your histogram, and specify another color to use for the borders of the bins. Give your histogram a descriptive title. Show the code that you used to accomplish this, along with a screenshot of your results.


4. Imagine that your boss is a smart person, but has no idea what a histogram is — how would you explain this plot to your boss? Write a one or two sentence description of what your histogram shows you.


5. Which six NOCs received the greatest numbers of medals? Show the code that you used to find this out, along with a screenshot of your results. Create a filtered dataset that only contains medalists from these six NOCs. Show the code that you used to accomplish this, along with a screenshot of your results.


6. Using ggplot, create a scatterplot that depicts the heights (on the x-axis) and the weights (on the y-axis) of the athletes from the six NOCs with the most medals. Give your plot a descriptive title. Show the code that you used to accomplish this, along with a screenshot of your results. Write a one or two sentence description of what this scatterplot shows you (again, explain it to your boss).


7. Now, add to the scatterplot that you just created by including a categorical variable (gender). Show the code that you used to accomplish this, along with a screenshot of your results. Write a one or two sentence description of what this scatterplot shows you (again, explain it to your boss).


8. Include yet another categorical variable on your scatterplot: NOC. Use shape to represent NOC. Show the code that you used to accomplish this, along with a screenshot of your results. Write one or two sentences about something that this plot tells you (you don't need to summarize the entire plot for this; you can just pick a couple of data points and describe them here).


9. Again using ggplot, create a barplot that compares the total number of bronze, silver, and gold medals among the top six NOCs. What do you notice about these totals? If every Olympic competition generates one gold, one silver, and one bronze, why might your bars be different heights? (Hint: think about how you created this subset of the original dataset).
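
The Part II steps could be sketched along the following lines. This is only an illustration on a small made-up data frame, assuming the ggplot2 package is installed. The column names (Games, Sex, Age, Height, Weight, NOC, Medal), the Olympiad label "2012 Summer", and all values are assumptions for the example. Because the toy data has only a few NOCs, the sketch keeps the top two (`n_top <- 2`); the project itself asks for six.

```r
library(ggplot2)

# Toy medalist data. Column names and values are illustrative assumptions.
medalists <- data.frame(
  Games  = rep(c("2012 Summer", "2016 Summer"), each = 8),
  Sex    = rep(c("M", "F"), times = 8),
  Age    = c(21, 24, 27, 30, 22, 25, 28, 33, 20, 23, 26, 29, 24, 27, 31, 35),
  Height = c(182, 165, 190, 170, 178, 160, 185, 172,
             180, 168, 188, 166, 176, 162, 192, 174),
  Weight = c(78, 55, 92, 62, 74, 52, 85, 64,
             76, 58, 90, 57, 72, 53, 95, 66),
  NOC    = rep(c("USA", "GER", "FRA", "USA", "CHN", "GER", "USA", "JPN"), times = 2),
  Medal  = rep(c("Gold", "Silver", "Bronze", "Gold"), times = 4),
  stringsAsFactors = FALSE
)

# Keep a single Olympiad
my_games <- subset(medalists, Games == "2012 Summer")

# Histogram of age with a chosen bin width, fill color, border color, and title
p_hist <- ggplot(my_games, aes(x = Age)) +
  geom_histogram(binwidth = 2, fill = "steelblue", color = "black") +
  labs(title = "Ages of medal winners, 2012 Summer (toy data)",
       x = "Age (years)", y = "Number of medalists")

# NOCs with the most medals (use 6 for the actual project)
n_top <- 2
medal_counts <- sort(table(medalists$NOC), decreasing = TRUE)
top_nocs <- names(medal_counts)[seq_len(n_top)]
top_medalists <- medalists[medalists$NOC %in% top_nocs, ]

# Scatterplot of height vs. weight, with sex as color and NOC as shape
p_scatter <- ggplot(top_medalists,
                    aes(x = Height, y = Weight, color = Sex, shape = NOC)) +
  geom_point(size = 3) +
  labs(title = "Height vs. weight of medalists from the top NOCs (toy data)",
       x = "Height (cm)", y = "Weight (kg)")

# Barplot of medal counts by type for each top NOC
p_bar <- ggplot(top_medalists, aes(x = NOC, fill = Medal)) +
  geom_bar(position = "dodge") +
  labs(title = "Medal counts by type for the top NOCs (toy data)",
       x = "NOC", y = "Number of medals")

print(p_hist)
print(p_scatter)
print(p_bar)
```

Dropping `color = Sex` and `shape = NOC` from `aes()` gives the plain scatterplot from question 6; adding them back one at a time gives the versions asked for in questions 7 and 8.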

