_extras/guide.md (61 additions & 82 deletions)
@@ -101,7 +101,7 @@ rev[2] = "apple-sauce"
101
101
~~~
102
102
{: .language-python}
103
103
104
-
## 01-starting-with-data
104
+
## 02-starting-with-data
105
105
106
106
### Bug Note:
107
107
@@ -111,27 +111,27 @@ Pandas < .18.1 has a bug where surveys_df['weight'].describe() may return a runt
111
111
112
112
* `surveys_df.columns`
113
113
114
-
column names (optional: show `surveys_df.columns[4] = "siteid"` The index is not mutable; recap of previous lesson. Adapting the name is done by `rename` function `surveys_df.rename(columns={"site_id": "siteid"})`)
114
+
Column names. (Optional: show that `surveys_df.columns[4] = "plotid"` raises a `TypeError` because the index is not mutable; this is a recap of the previous lesson. To change a name, use the `rename` method: `surveys_df.rename(columns={"plot_id": "plotid"})`.)
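The optional demonstration above can be sketched as follows. This is a minimal example that assumes a small, made-up stand-in for `surveys_df` (three columns only, so the position used is `1` rather than `4`); it is not the real survey data.

```python
import pandas as pd

# Hypothetical stand-in for surveys_df; the values are invented for illustration.
surveys_df = pd.DataFrame({
    "record_id": [1, 2, 3],
    "plot_id": [2, 3, 2],
    "weight": [40.0, 48.0, None],
})

# The column index is immutable, so assigning to one position fails.
try:
    surveys_df.columns[1] = "plotid"
    assignment_failed = False
except TypeError:
    assignment_failed = True

# rename returns a new DataFrame; surveys_df keeps its original column names.
renamed = surveys_df.rename(columns={"plot_id": "plotid"})

print(assignment_failed)
print(list(renamed.columns))
print(list(surveys_df.columns))
```

Because `rename` returns a copy, the result has to be assigned to a name if the new column names should be kept.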
115
115
116
116
* `surveys_df.head()`. Also, what does `surveys_df.head(15)` do?
117
117
118
-
Show first `N` lines
118
+
`head()` shows the first 5 rows; `head(15)` shows the first 15 rows.
119
119
120
120
* `surveys_df.tail()`
121
121
122
-
Show last `N` lines
122
+
Show last 5 lines
123
123
124
124
* `surveys_df.shape`. Take note of the output of the `shape` attribute. In what format does it return the shape of the DataFrame?
125
125
126
126
`type(surveys_df.shape)` -> `tuple`
127
127
128
128
### Calculating Statistics Challenges
129
129
130
-
* Create a list of unique site ID's found in the surveys data. Call it `site_names`. How many unique sites are in the data? How many unique species are in the data?
130
+
* Create a list of unique plot IDs found in the surveys data. Call it `plot_names`. How many unique plots are in the data? How many unique species are in the data?
131
131
132
-
`site_names = pd.unique(surveys_df["site_id"])` Number of unique site ID's: `site_names.size` or `len(site_names)`; Number of unique species in the data: `len(pd.unique(surveys_df["species"]))`
132
+
`plot_names = pd.unique(surveys_df["plot_id"])` Number of unique plot IDs: `plot_names.size` or `len(plot_names)`; Number of unique species in the data: `len(pd.unique(surveys_df["species"]))`
133
133
134
-
* What is the difference between `len(site_names)` and `surveys_df['site_id'].nunique()`?
134
+
* What is the difference between `len(plot_names)` and `surveys_df['plot_id'].nunique()`?
135
135
136
136
Both give the same result, so they are alternative ways of counting the unique values. `nunique` combines the unique-value extraction and the count in a single call.
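A minimal sketch of this comparison, assuming a small made-up stand-in for `surveys_df` (the values, including the missing species, are invented for illustration). It also shows the one case where the two approaches diverge: `nunique` skips `NaN` by default, while `pd.unique` keeps it as a value.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for surveys_df; one species entry is missing on purpose.
surveys_df = pd.DataFrame({
    "plot_id": [2, 3, 2, 7],
    "species": ["NL", "DM", np.nan, "DM"],
})

# No missing values in plot_id: both approaches agree.
plot_names = pd.unique(surveys_df["plot_id"])
n_plots_len = len(plot_names)
n_plots_nunique = surveys_df["plot_id"].nunique()

# A missing value in species: pd.unique counts NaN, nunique does not.
n_species_len = len(pd.unique(surveys_df["species"]))
n_species_nunique = surveys_df["species"].nunique()

print(n_plots_len, n_plots_nunique)
print(n_species_len, n_species_nunique)
```

Passing `dropna=False` to `nunique` makes it count `NaN` as well, which brings the two results back in line.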
137
137
@@ -143,19 +143,19 @@ Both do result in the same output, making it alternative ways of getting the uni
143
143
144
144
* What happens when you group by two columns using the following syntax and then grab mean values?
145
145
146
-
The mean value for each combination of site and sex is calculated. Remark that the mean does not make sense for each variable, so you can specify this column-wise: e.g. I want to know the last survey year, median foot-length and mean weight for each site/sex combination:
146
+
The mean value for each combination of plot and sex is calculated. Note that the mean is not meaningful for every variable, so you can specify the aggregation per column. For example, to get the last survey year, median foot length and mean weight for each plot/sex combination:

184
184
185
-
## 02-index-slice-subset
185
+
## 03-index-slice-subset
186
186
187
187
Tip: use the `.head()` method throughout this lesson to keep your display neater for students. Encourage students to try with and without `.head()` to reinforce this useful tool, and then to use it or not as they prefer. For example, if a student worries about keeping pace with the typing, let them know they can skip the `.head()`, but that you'll use it to keep more lines of previous steps visible.
188
188
@@ -225,9 +225,9 @@ Tip: use `.head()` method throughout this lesson to keep your display neater for
225
225
`surveys_df[(surveys_df["year"] == 1999) & (surveys_df["weight"] <= 8)]`; when only interested in how many,
226
226
the sum of `True` values could be used as well: `sum((surveys_df["year"] == 1999) & (surveys_df["weight"] <= 8))`
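Both variants of this answer can be sketched on a small made-up stand-in for `surveys_df` (values invented for illustration): the filtered DataFrame gives the rows themselves, and summing the boolean mask gives only the count.

```python
import pandas as pd

# Hypothetical stand-in for surveys_df; the values are invented.
surveys_df = pd.DataFrame({
    "year": [1999, 1999, 2000, 1999],
    "weight": [7.0, 12.0, 6.0, 8.0],
})

# Boolean mask: True where both conditions hold.
mask = (surveys_df["year"] == 1999) & (surveys_df["weight"] <= 8)

subset = surveys_df[mask]      # the matching rows
n_matches = int(mask.sum())    # True counts as 1, so the sum is the row count

print(len(subset), n_matches)
```

The parentheses around each comparison are required because `&` binds more tightly than `==` and `<=`.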
227
227
228
-
* You can use the `isin` command in Python to query a DataFrame based upon a list of values as follows: `surveys_df[surveys_df['species_id'].isin([listGoesHere])]`. Use the `isin` function to find all sites that contain particular species in the surveys DataFrame. How many records contain these values?
228
+
* You can use the pandas `isin` method to query a DataFrame based upon a list of values as follows: `surveys_df[surveys_df['species_id'].isin([listGoesHere])]`. Use the `isin` method to find all plots that contain particular species in the surveys DataFrame. How many records contain these values?
229
229
230
-
For example, using `PB` and `PL`: `surveys_df[surveys_df['species_id'].isin(['PB', 'PL'])]['site_id'].unique()` provides a list of the sites with these species involved. With `surveys_df[surveys_df['species_id'].isin(['PB', 'PL'])].shape` the number of records can be derived.
230
+
For example, using `PB` and `PL`: `surveys_df[surveys_df['species_id'].isin(['PB', 'PL'])]['plot_id'].unique()` returns an array of the plots where these species were recorded. The number of records is the first element of `surveys_df[surveys_df['species_id'].isin(['PB', 'PL'])].shape`.
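The same query, sketched on a small made-up stand-in for `surveys_df` (values invented for illustration), with the filtered frame stored once so it can be reused for both answers:

```python
import pandas as pd

# Hypothetical stand-in for surveys_df; the values are invented.
surveys_df = pd.DataFrame({
    "species_id": ["PB", "DM", "PL", "PB", "NL"],
    "plot_id": [1, 1, 2, 2, 3],
})

# Keep only the records whose species_id is in the list.
selected = surveys_df[surveys_df["species_id"].isin(["PB", "PL"])]

plots_with_species = selected["plot_id"].unique()  # plots where PB or PL occur
n_records = selected.shape[0]                      # number of matching records

print(plots_with_species, n_records)
```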
231
231
232
232
* Create a query that finds all rows with a weight value greater than or equal to 0.
233
233
@@ -255,14 +255,14 @@ print(len(new))
255
255
256
256
You can verify the number of NaN values with `sum(surveys_df['sex'].isnull())`, which equals the number of records that are neither female nor male.
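This check can be sketched on a small made-up `sex` column (values invented for illustration), comparing the count of missing values with the count of records that are neither `M` nor `F`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for surveys_df; two sex values are missing on purpose.
surveys_df = pd.DataFrame({"sex": ["M", "F", np.nan, "F", np.nan]})

# Count of missing values in the sex column.
n_missing = int(surveys_df["sex"].isnull().sum())

# Count of records that are neither male nor female.
n_not_m_or_f = int((~surveys_df["sex"].isin(["M", "F"])).sum())

print(n_missing, n_not_m_or_f)
```

The two counts agree only as long as every non-missing value is `M` or `F`, which is the assumption made in this sketch.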
257
257
258
-
* Create a new DataFrame that contains only observations that are of sex male or female and where weight values are greater than 0. Create a stacked bar plot of average weight by site with male vs female values stacked for each site.
258
+
* Create a new DataFrame that contains only observations that are of sex male or female and where weight values are greater than 0. Create a stacked bar plot of average weight by plot with male vs female values stacked for each plot.

286
+

287
287
288
-
## 03-data-types-and-format
288
+
## 04-data-types-and-format
289
289
290
290
### Challenge - Changing Types
291
291
@@ -297,9 +297,9 @@ surveys_df.isnull()
297
297
If the students have trouble generating the output, or anything happens with that, there is a file
298
298
called "sample output" that contains the data file they should generate.
299
299
300
-
## 04-merging-data
300
+
## 05-merging-data
301
301
302
-
* In the data folder, there are two survey data files: survey2001.csv and survey2002.csv. Read the data into Python and combine the files to make one new data frame. Create a plot of average site weight by year grouped by sex. Export your results as a CSV and make sure it reads back into Python properly.
302
+
* In the data folder, there are two survey data files: survey2001.csv and survey2002.csv. Read the data into Python and combine the files to make one new data frame. Create a plot of average plot weight by year grouped by sex. Export your results as a CSV and make sure it reads back into Python properly.
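One possible approach to the combine, summarize and export steps is sketched below. It assumes two small made-up DataFrames in place of `survey2001.csv` and `survey2002.csv`, and writes to an in-memory buffer instead of a file so that it runs without the data folder; the plotting step is left out.

```python
import io

import pandas as pd

# Hypothetical stand-ins for the two yearly survey files.
survey2001 = pd.DataFrame({
    "year": [2001, 2001], "plot_id": [1, 2],
    "sex": ["M", "F"], "weight": [40.0, 30.0],
})
survey2002 = pd.DataFrame({
    "year": [2002, 2002], "plot_id": [1, 2],
    "sex": ["M", "F"], "weight": [50.0, 20.0],
})

# Stack the two years on top of each other and renumber the rows.
combined = pd.concat([survey2001, survey2002], axis=0, ignore_index=True)

# Mean weight per year, with one column per sex.
mean_weight = combined.groupby(["year", "sex"])["weight"].mean().unstack()

# Export to CSV and read it back to check the round trip.
buffer = io.StringIO()
mean_weight.to_csv(buffer)
buffer.seek(0)
reloaded = pd.read_csv(buffer, index_col="year")

print(reloaded)
```

With real files, `buffer` would be replaced by a path, and `mean_weight.plot(kind="bar")` would draw the grouped values.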

386
+

387
387
388
388
This is not the best plot choice because it is hard to read. A first option to improve it is to use facets. However, pandas/matplotlib do not provide this by default. As a pure matplotlib example (`M|F` is for records where sex is not defined):

403
+

404
404
405
405
However, it would be better to link to [Seaborn](https://stanford.edu/~mwaskom/software/seaborn/) and [Altair](https://github.com/ellisonbg/altair) for this kind of multivariate visualisation.
406
406
407
-
* In the data folder, there is a site CSV that contains information about the type associated with each site. Use that data to summarize the number of sites by site type.
407
+
* In the data folder, there is a plot CSV that contains information about the type associated with each plot. Use that data to summarize the number of plots by plot type.
408
408
409
409
~~~
410
-
site_info = pd.read_csv("data/sites.csv")
411
-
site_info.groupby("site_type").count()
410
+
plot_info = pd.read_csv("data/plots.csv")
411
+
plot_info.groupby("plot_type").count()
412
412
~~~
413
413
{: .language-python}
414
414
415
-
* Calculate a diversity index of your choice for control vs rodent exclosure sites. The index should consider both species abundance and number of species. You might choose the simple biodiversity index described here which calculates diversity as `the number of species in the site / the total number of individuals in the site = Biodiversity index.`
415
+
* Calculate a diversity index of your choice for control vs rodent exclosure plots. The index should consider both species abundance and number of species. You might choose the simple biodiversity index described here which calculates diversity as `the number of species in the plot / the total number of individuals in the plot = Biodiversity index.`
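A minimal sketch of the simple index quoted above, assuming the survey records have already been merged with the plot information into one table. The `merged` DataFrame and its values are made up for illustration, and the index is computed per plot type instead of per individual plot.

```python
import pandas as pd

# Hypothetical merged table of survey records and plot types; values are invented.
merged = pd.DataFrame({
    "plot_type": ["Control", "Control", "Control", "Control",
                  "Rodent Exclosure", "Rodent Exclosure"],
    "species_id": ["DM", "DM", "PB", "NL", "DM", "PB"],
})

grouped = merged.groupby("plot_type")["species_id"]

# Number of species divided by number of individuals, per plot type.
biodiversity = grouped.nunique() / grouped.count()

print(biodiversity)
```

Both `nunique` and `count` ignore missing `species_id` values, so records without an identified species do not enter either side of the ratio.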