AREC 380, Fall 2019

Analysis Projects Description

The Analysis Projects are assignments where you apply the concepts and tools used in the course to analyze data as might be expected of a professional data scientist. The goal of this assignment is to give you experience approaching a problem from an analytical perspective and clearly communicating the results of data analysis to non-specialists.

1 Memo Topics

Each deliverable is expected to address a topic related to energy, environmental, or resource economics. You should formulate a question then select data and perform analyses that inform an answer to that question. This is not a research seminar and I do not expect your research topics to be groundbreaking nor memos to be the definitive solution to your topic. I have provided several suggestions for topics below, but you are welcome (and encouraged) to address a question of your own design with my approval.

1.1 Potential Topics

I have compiled several potential memo topics below. You are free to use any or develop your own. Importantly, any potential topic must involve data from at least two different sources.

  • Is there a relationship between electricity demand and air quality?
  • How much does electricity demand vary with temperature?
  • How do major storm events affect agricultural production?
  • How do volcanoes affect air quality?
  • How does air quality affect housing prices?
  • How do fuel prices affect the demand for air transportation?
  • How does airplane traffic affect air quality around airports?

I will provide a sample memo which investigates the relationship between gasoline prices and how much people drive. Therefore, you may not use that as a memo topic.

2 Deliverables

The deliverables for this assignment will consist of short memos, two to three pages plus any tables or figures you may wish to include, submitted through ELMS as either a single Microsoft Word or PDF document. You must also submit clearly documented R script files used to wrangle your data and generate results. Your memos should contain all of the information you want to convey; your R script files will only be evaluated for following the best practices of data science promoted in this class. Additional files (e.g., data files, stand-alone images) will not be reviewed. Unlike the problem sets, you do not need to generate your final document using R Markdown, although you can if you wish. You may find it useful to generate tables and figures using R Markdown and then copy and paste them into a final document.

Data science is an input to a decision making process and memos of this form are often produced by professional data scientists. Importantly, the target audience of your memo should be an intelligent and competent decision maker who is not an economist or data scientist (e.g., a business executive or government policy maker). This person wants answers and arguments to support them but will be uninterested in technical details of the tools you used to reach those conclusions. For example, you should not include examples of any R code in your memos.

Each memo should clearly communicate the following to the reader:

  • An outline of the question you are trying to answer, including an argument for why the question is important to consider
  • A description of the data (and its source) used to support your conclusions
  • A non-technical description of the methods you used
  • A clear statement of your conclusions
  • The results of some analyses on which you based your conclusions
  • Any limitations of your analysis or areas for potential additional work

The accompanying R script files should:

  • Be cleanly formatted and easy to read
  • Document data wrangling and analysis steps using comments
  • Employ methods that follow best practices in data science
  • When possible, employ tools in the tidyverse (tibbles, dplyr, pipes, ggplot2, modelr, etc.)

The assignment consists of two memos with differing expectations.

2.1 Memo 1

The first memo is due Friday, November 1st at 5 PM. In this memo you should provide a description, summary statistics, and visualization of a single dataset you think will be useful in addressing a topic of your choosing. This memo should clearly explain whether you feel the dataset, possibly combined with other data, will be useful for addressing this question and what additional steps you would consider taking in the future.
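
As a rough illustration of the kind of summary statistics and visualization described for Memo 1, the R sketch below works on a small made-up tibble. The object names (demand, demand_summary, load_plot), the column names, and every value are hypothetical; they stand in for data you would download and wrangle from its original source.

```r
# Illustrative sketch only: all data below are invented for the example.
library(dplyr)
library(ggplot2)

# Hypothetical observations of temperature and electricity load in two regions
demand <- tibble(
  region  = c("East", "East", "East", "West", "West", "West"),
  temp_f  = c(60, 75, 90, 55, 70, 85),
  load_mw = c(100, 120, 160, 90, 105, 140)
)

# Summary statistics by region: observation count, mean, and standard deviation
demand_summary <- demand %>%
  group_by(region) %>%
  summarize(
    n_obs        = n(),
    mean_load_mw = mean(load_mw),
    sd_load_mw   = sd(load_mw),
    .groups      = "drop"
  )

# Scatter plot of load against temperature, colored by region
load_plot <- ggplot(demand, aes(x = temp_f, y = load_mw, color = region)) +
  geom_point() +
  labs(
    x     = "Temperature (degrees F)",
    y     = "Electricity load (MW)",
    color = "Region"
  )
```

A table like demand_summary and a figure like load_plot could be copied into the memo itself, while the script stays in the accompanying R file.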

2.2 Memo 2

The second memo is due Friday, December 6th at 5 PM. In this memo you should combine at least two datasets, again with the goal of providing insights into a topic of your choosing. You don’t need to use the same data or question as your first memo, but doing so might make your life easier. This memo should describe the data and provide summary statistics, visualizations, and modeling results which provide insight into your topic, then provide some recommendations or conclusions about this topic supported by analyses of these datasets.
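
A minimal sketch of combining two datasets and fitting a simple model might look like the R code below. It assumes two hypothetical monthly tibbles (fuel and flights) that share a month key; the names and all values are invented for illustration, and a linear model is only one of many possible modeling choices.

```r
# Illustrative sketch only: both datasets below are invented for the example.
library(dplyr)

# Hypothetical source 1: average fuel price by month
fuel <- tibble(
  month     = 1:6,
  price_usd = c(2.0, 2.2, 2.4, 2.6, 2.8, 3.0)
)

# Hypothetical source 2: airline passengers (thousands) by month
# Note that it covers a slightly different set of months than the fuel data.
flights <- tibble(
  month        = 2:7,
  passengers_k = c(98, 95, 93, 90, 88, 85)
)

# Keep only the months that appear in both sources
combined <- inner_join(fuel, flights, by = "month")

# Simple linear model of passengers on fuel price
fit <- lm(passengers_k ~ price_usd, data = combined)

# Estimated intercept and slope
coef(fit)
```

Checking how many rows survive the join (here, with nrow(combined)) is a quick way to notice when the two sources do not cover the same periods.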

3 Potential data sources

The data you use for these projects should be public. This means it should meet two criteria: first, anyone who wants access to the data may get access to it and, second, the creator of the data allows unrestricted non-commercial use of the data. For example, data on government websites is almost always public. Proprietary company sales data that you are allowed to use but not share is not public. Importantly, data exposed via hacking, scraped from websites that prohibit data collection, or downloaded from WikiLeaks do not meet the standard of public data and should not be used in this class.

A goal of this project is that you get experience wrangling real-world data. While you are free to use any public data source, you may not use any data file I have provided as part of the problem sets or lecture examples in this class. If you choose to use data that I have presented in class, you should download and wrangle the data from its original source.

There are many great sources of public data. I’ve outlined some in Table 1. You are free to use any public dataset you would like to satisfy the objectives of the assignment.

3.1 Data search engines

There are numerous tools to help you search for public data by topic. Some of these search engines include:

  • US government data: https://www.data.gov/
  • Google Dataset Search (Beta): https://toolbox.google.com/datasetsearch
  • Kaggle Datasets: https://www.kaggle.com/

A warning about pre-packaged data: Be aware that datasets on Kaggle and other data repositories are generally compiled and wrangled by others, then contributed for public consumption. First, they are academic contributions and require citation, even in your memos for this class. Second, other people have invested substantial effort in wrangling datasets on Kaggle, so I will generally expect projects using data from Kaggle to be more ambitious than projects using data from other sources.

3.2 Academic Honesty

Your memo and supporting R code should be your original work. Following my syllabus, code adapted from other sources must be properly documented with attribution to the original author.

You may invest a lot of effort in your analyses to find they aren’t very informative or give results you didn’t expect. It is quite possible to write a great memo where the data lead to ambiguous or counter-intuitive conclusions. In these cases, you should explain the situation and why you think it occurred or what you could potentially do down the road to resolve it (or maybe you discovered something new!). Do not be tempted to selectively filter, modify, or generate data to result in analyses that fit your (or your guess as to my) preferred conclusions. Fabricating data or results is a serious breach of academic integrity and on par with plagiarism.

3.3 Example data sources

Table 1 lists public data sources which you could find useful to address questions in energy, environmental, or resource economics.

Table 1: Potential Public Data Sources

GIS Shapefiles
  • US shapefiles from Census TIGER: https://www.census.gov/geo/maps-data/data/tiger-line.html

  • Weather station data from NOAA’s QCLCD: https://www.ncdc.noaa.gov/cdo-web/datatools/lcd
  • Weather station data, direct download: https://www.ncdc.noaa.gov/orders/qclcd/
  • Major storm events: https://www.ncdc.noaa.gov/stormevents/

  • Air quality data: https://www.epa.gov/outdoor-air-quality-data
  • Mauna Loa CO2 data: https://www.esrl.noaa.gov/gmd/ccgg/trends/data.html
  • US wildfires: https://wildfire.cr.usgs.gov/firehistory/data.html
  • EPA Air Markets Program Data: https://ampd.epa.gov/ampd/
  • Volcanic activity: https://volcanoes.usgs.gov/index.html
  • Earthquakes: https://earthquake.usgs.gov/data/
  • Streamflow on US rivers: https://waterdata.usgs.gov/nwis/rt

  • US electricity: https://www.eia.gov/electricity/data.php
  • US petroleum: https://www.eia.gov/petroleum/data.php
  • World energy data from the World Bank: https://data.worldbank.org/topic/energy-and-mining

US Demographics
  • US Census American FactFinder: https://factfinder.census.gov/faces/nav/jsf/pages/index.xhtml

Economic Data
  • US economic output data: http://www.bea.gov/data
  • US employment, wages, and productivity data: https://www.bls.gov/data/
  • World economic data: https://data.worldbank.org/topic/economy-and-growth

Agriculture Data
  • USDA crop yields, prices, livestock data: https://www.nass.usda.gov/Data_and_Statistics/

  • US highway traffic counts: https://www.fhwa.dot.gov/policyinformation/tables/tmasdata/
  • US airline flight and airport data: https://www.bts.gov/topics/airlines-and-airports-0

4 Evaluation

Overall evaluation of your analysis projects will depend on the ambition of your approach and applications of tools and methods discussed in this class. My goal is not for you to spend long nights tackling very difficult datasets and performing complicated analyses. However, projects relying on pre-wrangled data or simple sets of summary statistics will be at a disadvantage.

Your memos will be evaluated on the organization and clarity of the exposition; the consistency between your topic, the analyses presented, and your conclusions; and the relevance of the analyses and information included in the memo.

Your R script files will be evaluated on your adherence to data science best practices of transparency and reproducibility. This includes legibility, clear documentation and comments, and demonstration of other data science best practices discussed in this course.

I will grant credit beyond the total points allowed on the final memo for exceptional projects. An exceptional project would use at least two large “in the wild” datasets, combined with both modeling and visualizations, to answer an interesting question. Such projects need not lead to substantially longer memos. Feel free to discuss potential ideas with me.

AREC 380

©2019 James Archsmith

Analysis Projects

Fall 2019