diff --git a/.github/workflows/pull_request.yml b/.github/workflows/pull_request.yml index 6da2c638..d9f49924 100644 --- a/.github/workflows/pull_request.yml +++ b/.github/workflows/pull_request.yml @@ -46,6 +46,7 @@ jobs: python -m pip install --upgrade pip pip install -r requirements.txt pip install -r certificate_automation/requirements.txt + pip install -r blog_automation/requirements.txt - name: Run pytest run: | diff --git a/.github/workflows/run_blog_exporter.yml b/.github/workflows/run_blog_exporter.yml new file mode 100644 index 00000000..eb4db064 --- /dev/null +++ b/.github/workflows/run_blog_exporter.yml @@ -0,0 +1,72 @@ +name: Publish reviewed blogs + +on: + workflow_dispatch: + schedule: + - cron: '0 7 * * *' # daily at 07:00 UTC: publish any newly reviewed blogs + +jobs: + publish-blogs: + if: github.repository == 'Women-Coding-Community/WomenCodingCommunity.github.io' + runs-on: ubuntu-latest + + steps: + - name: Checkout repository + uses: actions/checkout@v5 + + - name: Set up Python + uses: actions/setup-python@v5 + with: + python-version: '3.12' + + - name: Cache pip + uses: actions/cache@v4 + with: + path: ~/.cache/pip + key: ${{ runner.os }}-pip-blog-${{ hashFiles('tools/blog_automation/requirements.txt') }} + restore-keys: | + ${{ runner.os }}-pip-blog- + + - name: Install dependencies + run: | + python -m pip install --upgrade pip + pip install -r tools/blog_automation/requirements.txt + + - name: Write service account key + run: echo "$SERVICE_ACCOUNT_KEY" > tools/blog_automation/service_account_key.json + env: + SERVICE_ACCOUNT_KEY: ${{ secrets.BLOG_AUTOMATION_SERVICE_ACCOUNT }} + + - name: Export reviewed blogs + run: | + cd tools/blog_automation + python publish_reviewed_blogs.py + + - name: Remove service account key + if: always() + run: rm -f tools/blog_automation/service_account_key.json + + - name: Create or Update Pull Request + id: create-pr + uses: peter-evans/create-pull-request@v7 + with: + token: ${{ 
secrets.GHA_ACTIONS_ALLOW_TOKEN }} + commit-message: "Automated import of reviewed blog posts" + branch: "automation/import-blog" + team-reviewers: "Women-Coding-Community/leaders" + title: "Automated import of reviewed blog posts" + body: | + This PR was created automatically by a GitHub Action. + + It contains every blog marked `isReviewedandApproved` (and not yet + `isPublished`) in the submissions spreadsheet: + - new posts under `_posts/` + - cover images under `assets/images/blog/` + + The spreadsheet's `isPublished` column has already been set to TRUE for + these rows. Please review the rendered posts before merging. + labels: | + automation + add-paths: | + _posts/** + assets/images/blog/** diff --git a/tools/blog_automation/README.md b/tools/blog_automation/README.md index 426862a9..e49c4d33 100644 --- a/tools/blog_automation/README.md +++ b/tools/blog_automation/README.md @@ -17,10 +17,10 @@ To allow our scripts to access Google Drive and export documents, you need to cr 👉 **Note:** You need the **Project Editor** or **Owner** role on this project to create service accounts and keys. If you’re the one who created the project, you already have these permissions. -### 1. Enable the Drive API +### 1. Enable the Drive and Sheets APIs 1. In the left menu, go to **APIs & Services → Library**. -2. Search for **Google Drive API**. -3. Click **Enable**. +2. Search for **Google Drive API** and click **Enable**. +3. Search for **Google Sheets API** and click **Enable** (needed to read the submissions spreadsheet). ### 2. Create a Service Account 1. In the left menu, go to **IAM & Admin → Service Accounts**. @@ -47,6 +47,7 @@ If you’re the one who created the project, you already have these permissions. 4. Give it at least **Viewer** access. 5. Save changes. - Now the service account can read/export files in that folder or doc. +6.
Repeat the **Share** step for the **blog submissions spreadsheet** (the Google Form responses sheet), giving the service account **Editor** access. Editor (not just Viewer) is required because the pipeline writes `isPublished = TRUE` back to a row after exporting it. --- @@ -75,8 +76,53 @@ Then the **Document ID** is: Use this ID in your scripts when exporting the document. -## Run Automation +## Export a single blog manually (for testing) 1. Activate virtual environment: `source venv/bin/activate` -2. Run the script: `python doc_to_html_conversion.py ` +2. Export one Google Doc into a post: + `python blog_exporter.py --doc_id --author_name "Jane Doe" --image_link ""` + +This is handy to check a Doc renders correctly. The full pipeline below reads all +of this metadata from the spreadsheet automatically. + +## Tests + +Run `pytest test_blog_exporter.py` + +## CI/CD pipeline: publish a blog when you mark it reviewed + +The Google Sheet is the **single source of truth**; there is no local CSV. The +GitHub Action [`.github/workflows/run_blog_exporter.yml`](../../.github/workflows/run_blog_exporter.yml) +turns a reviewed blog into a draft pull request automatically. + +### How to publish a blog (the editor's workflow) +1. In the submissions spreadsheet (the **Form Responses 1** sheet), set the row's + **`isReviewedandApproved`** cell to **`TRUE`** once the draft is reviewed. + Leave **`isPublished`** blank/`FALSE`. +2. Within a day (or immediately via **Actions → Publish reviewed blogs → Run + workflow**) the action exports the blog, sets that row's **`isPublished`** to + `TRUE` in the sheet, and opens/updates a PR + (`Automated import of reviewed blog posts`) with the new post and cover image. +3. **Review the rendered post and merge.** + +### What runs +`publish_reviewed_blogs.py` reads the sheet and exports every row where +`isReviewedandApproved` is `TRUE` and `isPublished` is not `TRUE`.
Because the +`isPublished` flag is written straight back to the sheet, a blog is never exported +twice, and the existing backlog (already `isPublished = TRUE`) is left alone. + +> The draft must be a **native Google Doc** (Drive can only export those to +> Markdown). If a submitter uploaded a `.docx`/`.pdf`, open it and do +> **File → Save as Google Docs** first, otherwise that row is skipped with an error. + +### One-time repo setup +- **Service account needs Editor access to the spreadsheet** (see setup step 4) so + the pipeline can write back `isPublished`. +- **Secret `BLOG_AUTOMATION_SERVICE_ACCOUNT`**: paste the full contents of + `service_account_key.json` into a repository secret with this name + (Settings → Secrets and variables → Actions). The workflow writes it to disk at + runtime and deletes it afterwards; the key is never committed. +- **Secret `GHA_ACTIONS_ALLOW_TOKEN`**: already used by the other automations; it + lets the action open the pull request. + diff --git a/tools/blog_automation/blog_exporter.py b/tools/blog_automation/blog_exporter.py new file mode 100644 index 00000000..1facf879 --- /dev/null +++ b/tools/blog_automation/blog_exporter.py @@ -0,0 +1,322 @@ +import argparse +import json +import os +import re +import shutil +import datetime as dt +from pathlib import Path +import bleach +import markdown +import pandas as pd +from google.oauth2 import service_account +from googleapiclient.discovery import build +from googleapiclient.errors import HttpError + +# --- Configuration --- +SERVICE_ACCOUNT_FILE = 'service_account_key.json' +# Used when a submission's cover image can't be downloaded (missing/not shared). +DEFAULT_IMAGE_PATH = '/assets/images/blog/default.jpg' + +# Allowlist for sanitizing HTML converted from submitted Google Docs. Covers the +# formatting blog posts need; everything else (scripts, iframes, event handlers, +# etc.) is stripped. See _markdown_to_html.
+ALLOWED_TAGS = [ + 'p', 'br', 'hr', 'span', + 'strong', 'b', 'em', 'i', 'u', 's', 'sub', 'sup', 'small', 'mark', + 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', + 'ul', 'ol', 'li', 'dl', 'dt', 'dd', + 'a', 'img', + 'code', 'pre', 'blockquote', + 'table', 'thead', 'tbody', 'tr', 'th', 'td', 'caption', +] +ALLOWED_ATTRIBUTES = { + 'a': ['href', 'title', 'rel'], + 'img': ['src', 'alt', 'title'], +} +ALLOWED_PROTOCOLS = ['http', 'https', 'mailto'] +YAML_HEADER = '''--- +layout: post +title: {title} +date: {date} +author_name: {author_name} +author_role: {author_role} +image: {image_path} +image_source: {image_source} +description: {description} +category: blog +--- +''' + +def _yaml_scalar(value): + """Return a YAML-safe double-quoted scalar. + + Free-text fields (title, description, ...) can contain ``:``, ``&``, quotes + etc. that break unquoted YAML front matter. A JSON-encoded string is always a + valid YAML double-quoted scalar, so json.dumps gives us correct escaping. + """ + return json.dumps('' if value is None else str(value), ensure_ascii=False) + +def _current_directory(): + return os.path.dirname(os.path.abspath(__file__)) + +def drive_connection(): + service_account_path = os.path.join(_current_directory(), SERVICE_ACCOUNT_FILE) + if not os.path.exists(service_account_path): + print(f"ERROR: Service account key file '{service_account_path}' not found.\n" + "Please obtain your own Google service account key and place it at this path.\n" + "(Never commit this file to version control.)") + exit(1) + creds = service_account.Credentials.from_service_account_file( + service_account_path, + scopes=['https://www.googleapis.com/auth/drive.readonly'] + ) + drive = build('drive', 'v3', credentials=creds) + return drive + +def _posts_directory(): + script_dir = Path(_current_directory()) + posts_dir = (script_dir / "../../_posts").resolve() + return posts_dir + +def _today_date_str(): + return dt.date.today().isoformat() + +def _create_blog_filename_with_date(doc_name, 
date_str): + # Slugify: lowercase, and collapse any run of non-alphanumeric characters + # (spaces, ':', ',', etc.) into a single hyphen so the filename is valid. + slug = re.sub(r'[^a-z0-9]+', '-', doc_name.lower()).strip('-') + return f"{date_str}-{slug}" + +def _get_doc_name_from_drive(doc_id, drive): + """Fetch document name from Google Drive.""" + try: + file = drive.files().get(fileId=doc_id, fields='name').execute() + return file['name'] + except HttpError as error: + print(f"ERROR: Could not fetch document from Drive (ID: {doc_id})\n{error}") + return None + +def _get_doc_content_as_markdown(doc_id, drive): + """Export Google Doc as markdown.""" + try: + request = drive.files().export_media(fileId=doc_id, mimeType='text/markdown') + file_content = request.execute() + return file_content.decode('utf-8') + except HttpError as error: + print(f"ERROR: Could not export document from Drive (ID: {doc_id})\n{error}") + return None + +def _markdown_to_html(markdown_text): + """Convert Markdown to HTML with custom formatting. + + Blog content comes from community-submitted Google Docs, which can contain + arbitrary raw HTML. We sanitize the converted HTML against an explicit + allowlist so a submitted document cannot inject