Skip to content

feat(huggingFace): refactor operator into per-task codegen + text-generation#5278

Open
PG1204 wants to merge 12 commits into
apache:mainfrom
ELin2025:hf/02-operator-textgen
Open

feat(huggingFace): refactor operator into per-task codegen + text-generation#5278
PG1204 wants to merge 12 commits into
apache:mainfrom
ELin2025:hf/02-operator-textgen

Conversation

@PG1204
Copy link
Copy Markdown
Contributor

@PG1204 PG1204 commented May 28, 2026

⚠️ This PR is stacked on #5124. Until that lands, the diff below includes #5124's HuggingFaceModelResource.scala and the 1-line registration in TexeraWebApplication.scala. The new code in this PR is everything under common/workflow-operator/src/main/scala/org/apache/texera/amber/operator/huggingFace/ and the new test under common/workflow-operator/src/test/.../huggingFace/HuggingFaceInferenceOpDescSpec.scala. Once #5124 merges, this diff will auto-clean to ~839 lines.

What changes were proposed in this PR?

Refactors the monolithic 1,278-line HuggingFaceInferenceOpDesc from the team's feature branch into a dispatcher + per-task codegen architecture and ships the first task family (text-generation):

  • codegen/TaskCodegen.scala introduces the trait + CodegenContext that model per-task variation.
  • codegen/PythonCodegenBase.scala emits the shared provider-fallback / process_table / _parse_response infrastructure with two holes for the per-task payload and parse snippets.
  • codegen/TextGenCodegen.scala supplies text-generation's chat-completions payload and the body["choices"][0 ["message"]["content"] parse branch.
  • HuggingFaceInferenceOpDesc.scala becomes a thin (~180-line) dispatcher holding the @JsonProperty fields and the registeredCodegens map.

User-input string fields are typed EncodableString and emitted via the pyb"..." macro so values reach Python as self.decode_python_template('<base64>') rather than raw literals. Class constants are assigned in open(self) so self is in scope for the decode call. The generated process_table runs a defensive _HF_MODEL_ID_PATTERN check at runtime before any HF URL is composed.

The TaskCodegen trait also exposes a tasks: Set[String] default so a single codegen can register under multiple task strings, this becomes relevant in PR 3 (image family).

Any related issues, documentation, or discussions?

Tracked in #5277 & #5041(umbrella issue for the HuggingFace operator end-to-end implementation).

Stacked on #5124 (PR 1 - REST resource).

This is PR 2 of a multi-PR series landing the HuggingFace operator end-to-end. The full plan and umbrella issue live separately; this PR's scope is exactly the dispatcher pattern + text-generation codegen.

How was this PR tested?

  • sbt "WorkflowOperator/compile; WorkflowOperator/Test/compile" clean.
  • sbt scalafmtCheck clean.
  • sbt "WorkflowOperator/testOnly org.apache.texera.amber.operator.huggingFace.HuggingFaceInferenceOpDescSpec" - 10/10 pass (operator info, validation, codegen wiring, MODEL_ID runtime check, leak-prevention, clamping, schema).
  • sbt "WorkflowOperator/testOnly org.apache.texera.amber.util.PythonCodeRawInvalidTextSpec" - 117/117 descriptors py_compile cleanly, no raw-text leaks. The new operator is included in this scan.
  • Generated Python verified via python3 -m py_compile on a sample output.

Was this PR authored or co-authored using generative AI tooling?

Co-authored with Claude Opus 4.7

PG1204 and others added 8 commits May 17, 2026 13:02
…d media proxy

Introduces a new Jersey REST resource exposing endpoints used by the
upcoming HuggingFace operator UI:

- GET  /api/huggingface/models       — browse / search models per task
- GET  /api/huggingface/tasks        — list HF pipeline tags with hosted inference
- POST /api/huggingface/upload-audio — upload audio for HF audio tasks
- GET  /api/huggingface/audio-preview — stream uploaded audio (path-validated)
- GET  /api/huggingface/media-proxy   — proxy remote media URLs to bypass CORS

This is the first PR in a stacked series landing the HF operator end-to-end.
No operator code yet; this resource is independently useful and lets the
frontend integrate with HF before the operator class lands.
Addresses xuang7's review on PR apache#5124 — both endpoints previously
buffered the full payload into a heap-resident byte[] with no upper
bound, leaving the JVM open to OOM on a hostile or buggy upstream
response (/media-proxy) or out-of-band write into the audio temp dir
(/audio-preview).

- /media-proxy: switch from Unirest.asBytes() to
  asObject(Function<RawResponse, T>), streaming the upstream body in
  8 KiB chunks with a running byte counter. Aborts with 413 if the
  declared Content-Length exceeds the cap (pre-check) or if the body
  crosses the cap mid-read (defends against missing/lying
  Content-Length). New MAX_MEDIA_PROXY_BYTES = 50 MiB, sized for HF
  inference media (text-to-image ~5 MiB, text-to-video ~30 MiB) with
  headroom.
- /audio-preview: add Files.size() defense-in-depth check before
  readAllBytes. /upload-audio already enforces MAX_AUDIO_BYTES on
  ingest; this catches the case where a bug or out-of-band write puts
  an oversized file in the temp dir.

Adds a spec covering the audio-preview cap using a sparse-file fixture
so the test stays fast (87/87 spec passes). The media-proxy cap path
is exercised via the existing input-validation suite plus the new
streamMediaWithCap helper - a follow-up can add a fake-RawResponse
unit test if reviewers want explicit coverage of the chunked-read cap.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…eration

Splits the monolithic 1,278-line HuggingFaceInferenceOpDesc from the
team's feature branch into a dispatcher + per-task codegen architecture
and ships the first task family (text-generation) end-to-end.

- TaskCodegen trait + CodegenContext model the per-task variation
- PythonCodegenBase emits the shared provider-fallback / process_table /
  _parse_response infrastructure with two holes for the per-task payload
  and parse snippets
- TextGenCodegen supplies text-generation's chat-completions payload and
  the body["choices"][0]["message"]["content"] parse branch
- HuggingFaceInferenceOpDesc becomes a thin dispatcher (~180 lines)
  holding @JsonProperty fields and the registeredCodegens map

User-input string fields are typed as EncodableString and emitted via
the pyb"..." macro so values reach Python as
self.decode_python_template('<base64>') rather than raw literals; class
constants are assigned in open(self) so self is in scope for the decode
call. Generated process_table runs a defensive _HF_MODEL_ID_PATTERN
check at runtime before any HF URL is composed.

PR 2 of a stacked 9-PR series. PR 1 (apache#5124) ships the supporting REST
resource; PRs 3-5 will add image, audio + media-gen, and QA/ranking
task families by registering new *Codegen objects in the dispatcher.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 28, 2026

Codecov Report

❌ Patch coverage is 71.75926% with 122 lines in your changes missing coverage. Please review.
✅ Project coverage is 48.89%. Comparing base (953e2c4) to head (61e6c41).

Files with missing lines Patch % Lines
...texera/web/resource/HuggingFaceModelResource.scala 67.04% 90 Missing and 27 partials ⚠️
...rator/huggingFace/HuggingFaceInferenceOpDesc.scala 92.68% 0 Missing and 3 partials ⚠️
...a/org/apache/texera/web/TexeraWebApplication.scala 0.00% 1 Missing ⚠️
...ber/operator/huggingFace/codegen/TaskCodegen.scala 88.88% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #5278      +/-   ##
============================================
- Coverage     49.16%   48.89%   -0.28%     
- Complexity     2384     2434      +50     
============================================
  Files          1051     1047       -4     
  Lines         40350    40351       +1     
  Branches       4279     4313      +34     
============================================
- Hits          19837    19728     -109     
- Misses        19353    19440      +87     
- Partials       1160     1183      +23     
Flag Coverage Δ *Carryforward flag
access-control-service 41.89% <ø> (ø)
agent-service 33.76% <ø> (ø) Carriedforward from 767219a
amber 52.16% <71.75%> (+0.50%) ⬆️
computing-unit-managing-service 0.00% <ø> (ø)
config-service 0.00% <ø> (ø)
file-service 38.42% <ø> (ø)
frontend 40.10% <ø> (-0.97%) ⬇️ Carriedforward from 767219a
python 90.50% <ø> (-0.30%) ⬇️ Carriedforward from 767219a
workflow-compiling-service 56.81% <ø> (ø)

*This pull request uses carry forward flags. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

PG1204 and others added 2 commits May 28, 2026 12:12
…degen specs

Addresses Codecov's 66.85% patch coverage warning by exercising the
defensive null-handling branches in HuggingFaceInferenceOpDesc.scala and
the TextGenCodegen contract that previously had no spec hits.

- null-tolerance: feed null into every @JsonProperty (token, model, prompt
  col, system prompt, result col, task, maxNewTokens, temperature) and
  assert generatePythonCode still emits a parseable ProcessTableOperator
  with sane defaults (TASK falls back to text-generation, MAX_NEW_TOKENS
  clamps to 256, TEMPERATURE to 0.7). Covers the `if (x == null) ... else
  x` branches that previously had no test that took the null side.
- TextGenCodegen.task: trivial canonical-value check.
- TextGenCodegen ctx-independence: pass an "irrelevant"-filled ctx and
  assert payloadPython / parsePython still reference self.MODEL_ID and
  body["choices"]…. Catches a future refactor that accidentally splices
  ctx fields into the static snippets.

13/13 in HuggingFaceInferenceOpDescSpec, 2/2 in PythonCodeRawInvalidTextSpec
(117/117 descriptors still py_compile cleanly).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@PG1204
Copy link
Copy Markdown
Contributor Author

PG1204 commented May 28, 2026

/request-review @Ma77Ball

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants