Analyzing Large Amounts of Data with PySpark on AWS

You will be implementing the following functions in this notebook.

user()

bucket()

long_trips()

manhattan_trips()

weighted_profit()

final_output()

Please do not remove or modify the following utility functions:

load_data()

main()

Do not change the below cell. Run it to initialize your PySpark instance.
WARNING: Do NOT modify the below cell. It contains all imports, the function for loading data, and the function for running your code.

Implement the below functions for this lab:

WARNING: Do NOT change any function inputs or outputs, and ensure that the dataframes your code returns align with the schema definitions commented in each function

3a. Update the user() function

This function should return your username, e.g. janedoe3.

3b. Update the long_trips() function

This function filters trips to keep only trips longer than 2 miles.

3c. Update the manhattan_trips() function

This function determines the top 20 drop-off locations (DOLocationID) in Manhattan, ranked by total passenger_count (pcount).

Example output formatting:

+--------------+--------+
| DOLocationID | pcount |
+--------------+--------+
|             5|      15|
|            16|      12| 
+--------------+--------+

3d. Update the weighted_profit() function

This function should compute, for each pickup location, the average total_amount, the total count of trips, and the count of trips ending in the top 20 destinations, and return the weighted_profit as defined in the homework document.

Example output formatting:

+--------------+-------------------+
| PULocationID |  weighted_profit  |
+--------------+-------------------+
|            18| 33.784444421924436| 
|            12| 21.124577637149223| 
+--------------+-------------------+

3e. Update the final_output() function

This function takes the results of weighted_profit, joins them to the borough and zone lookup, and returns the top 20 locations with the highest weighted_profit.

Example output formatting:

+------------+---------+-------------------+
|    Zone    | Borough |  weighted_profit  |
+------------+---------+-------------------+
| JFK Airport|   Queens|  16.95897820117925|
|     Jamaica|   Queens| 14.879835188762488|
+------------+---------+-------------------+
Test your code on the small dataset first, as the large dataset will take significantly longer to run.
WARNING: Do NOT use the same bucket url for multiple runs of the `main()` function, as this will cause errors. Make sure to change the name of your output location every time. (ie: s3://lab11-janedoe3/output-small2)

Update the below cell with the address to your bucket, then run the below cell to run your code to store the results in S3.

When you have confirmed the results of the small dataset, run it again using the large dataset. Your output will appear in a folder in your S3 bucket called YOUROUTPUT.csv, as a CSV file with a name like part-0000-4d992f7a-0ad3-48f8-8c72-0022984e4b50-c000.csv. Download this file and rename it to output.csv for submission. Do not make any other changes to the file.