697 - Data Sourcing for Sharing Excess #720
Conversation
ical and normalize records for the resource table
@icycoldveins Got Claude's help with the edge function. @gcardonag @vontell curious to get your thoughts on automating and maintaining the sync script with edge functions. Otherwise we can just run the Python script with Lambda in AWS like the other one.
Match existing records by name + source URL, update those (preserving date_created), insert only new ones. Stale entries that are no longer in the scraped data get cleaned up. Aligns with the approach in PR #720.
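For reviewers, a minimal sketch of the matching described in this commit message, not the actual code in the PR. The field names `source_url` and `id` and the helper names `record_key` and `plan_sync` are assumptions for illustration, and `existing` is assumed to already be limited to rows this scraper owns, so that stale cleanup never touches other contributors' resources.

```python
from typing import Any

Record = dict[str, Any]


def record_key(record: Record) -> tuple[str, str]:
    # Identity is name + source URL; "source_url" is an assumed field name.
    return (record["name"].strip().lower(), record["source_url"].strip())


def plan_sync(
    existing: list[Record], scraped: list[Record]
) -> tuple[list[Record], list[Record], list[Record]]:
    """Split a sync into (to_update, to_insert, to_delete) without touching the DB."""
    existing_by_key = {record_key(r): r for r in existing}
    scraped_keys: set[tuple[str, str]] = set()
    to_update: list[Record] = []
    to_insert: list[Record] = []

    for rec in scraped:
        key = record_key(rec)
        scraped_keys.add(key)
        match = existing_by_key.get(key)
        if match is None:
            to_insert.append(rec)
        else:
            # Take the fresh scraped fields, but keep the stable row id
            # and the original date_created.
            to_update.append(
                {**match, **rec, "id": match["id"], "date_created": match["date_created"]}
            )

    # Anything we own that no longer appears in the scrape is stale.
    to_delete = [r for k, r in existing_by_key.items() if k not in scraped_keys]
    return to_update, to_insert, to_delete
```

Keeping the planning step separate from the Supabase calls makes it easy to dry-run a sync and inspect what would change before anything is written.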
… of https://github.com/phlask/phlask-map into 697-data-sourcing-for-sharing-excess-food-distribution
…rors with the JSONB object dates, debugging logs for the records to delete to check if the API key is read-only
supabase = get_supabase_client()
delete_by_creator(supabase)
insert_resources(supabase, resources)
After a discussion with Ron and Añil: what is the use of the delete function? I also heard that we don't want to delete anything overall. Is there a reason why we might use the delete functionality in the database rather than just update?
@RRodriguez26 Updating is definitely better, and I can tweak this a bit further to do it properly. I settled on delete so that we could update our database with current data, which was becoming a problem. Now that it's updated, I can revisit this and implement it more intelligently.
One of the key issues with doing the update route was that recurring events, despite having many distinct occurrences, all collide under one gp_id, so the script needs to handle this on repeated syncs and update the timestamps for whatever occurrence of that event is up next instead of processing every occurrence as a unique resource.
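A rough sketch of one way to handle that collision, assuming each scraped occurrence is a dict with a `gp_id` plus timezone-aware `start` and `end` datetimes (the `start`/`end` field names and the `next_occurrences` helper are hypothetical, not what the script currently does): keep only the next occurrence that has not ended yet for each `gp_id`.

```python
from datetime import datetime
from typing import Any

Event = dict[str, Any]


def next_occurrences(events: list[Event], now: datetime) -> dict[str, Event]:
    """Collapse recurring events to the next upcoming occurrence per gp_id."""
    upcoming: dict[str, Event] = {}
    for ev in events:
        if ev["end"] <= now:
            # This occurrence is already over, so it can't be "up next".
            continue
        current = upcoming.get(ev["gp_id"])
        if current is None or ev["start"] < current["start"]:
            upcoming[ev["gp_id"]] = ev
    return upcoming
```

On each sync, the one surviving event per `gp_id` would update the existing row's timestamps instead of creating a new resource, and a `gp_id` with no remaining occurrences simply drops out of the result.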
But overall, yes, you all are right. Deletion creates an issue with churned resource IDs, especially if there are crowdsourced edits using that ID as a foreign key. In the long term that's not sustainable, so I'll work on resolving that update issue for the recurring events.
I also heard that these data scripts should be in their own repo. We see that there is a repo dedicated to that, but we are not sure if it is the right one.
Yep, I'm going to put up a joint PR for this script and the other one on that repo. I don't think anyone can recall if there was another reason for it, so it's a good fit.
Pull Request
Change Summary
FYI: Claude helped quite a bit here in building out a basic CLI component and adding Supabase helper functions. Extra scrutiny on those is welcome.
Addresses #697. Introduces a standalone Python script that pulls events from a public Google Calendar, such as Sharing Excess, and normalizes them so they can be stored in the resources table in Supabase.
As the resources table does not have start/end date fields, these are pulled from the site and inserted into the description with some clear delimiters. This allows us to do some post-processing/filtering to determine whether the event is "live" or not.
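To illustrate the idea of carrying the dates in the description, here is a small sketch. The `[[start=...|end=...]]` delimiter format and the helper names are made up for this example and are not the format the script actually writes; "live" is assumed here to mean the event has not ended yet.

```python
import re
from datetime import datetime
from typing import Optional

# Hypothetical delimiter format: [[start=<ISO 8601>|end=<ISO 8601>]]
_WINDOW_RE = re.compile(r"\[\[start=(?P<start>[^|\]]+)\|end=(?P<end>[^\]]+)\]\]")


def embed_window(description: str, start: datetime, end: datetime) -> str:
    """Append the event window to the description in a machine-readable form."""
    return f"{description}\n[[start={start.isoformat()}|end={end.isoformat()}]]"


def parse_window(description: str) -> Optional[tuple[datetime, datetime]]:
    """Recover (start, end) from a description, or None if no window is present."""
    m = _WINDOW_RE.search(description)
    if m is None:
        return None
    return datetime.fromisoformat(m["start"]), datetime.fromisoformat(m["end"])


def is_live(description: str, now: datetime) -> bool:
    # Assumption: an event counts as "live" until its end time has passed.
    window = parse_window(description)
    if window is None:
        return False
    _, end = window
    return now < end
```

Whatever delimiter is chosen, keeping the timestamps in ISO 8601 with an explicit UTC offset avoids ambiguity when filtering later.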
We can do this scrape periodically by using the LOOK_FORWARD_DAYS property to get all events for a specific window into the future, or just do this monthly in one of the PHLASK sessions or something. Not sure how we want to handle this.
Change Reason
Billy summed this up quite nicely on #697. Essentially, we would like to be able to actively maintain "live" food sites posted by Sharing Excess and help them and us get the word out a little easier.
Verification [Optional]
Here is an example of a CSV debug output that we can get by using the basic CLI component that Claude helped write:
events.csv
These records can then be written to the DB either directly with CSV import in Supabase, or by entering the credentials in the .env file here and running the script with the helper to write them to the resources table.
Related Issue: #697