Feature Summary
The HuggingFace inference operator (#5041) needs to cover ~20 HF pipeline tasks (text-generation, image-classification, ASR, text-to-image, …). To land it cleanly and let the per-task work proceed in parallel, the operator is introduced via a dispatcher + per-task codegen architecture: a thin HuggingFaceInferenceOpDesc selects a TaskCodegen based on the configured task, and the selected codegen contributes the per-task Python payload + parse snippets. Shared infrastructure (provider fallback, HTTP loop, response-parsing framework) lives in PythonCodegenBase.
This issue covers shipping the dispatcher pattern + the first task family (text-generation) end-to-end. Subsequent child issues add the image, audio / media-generation, and QA / ranking task families by introducing new *Codegen objects and registering them in the dispatcher map. The architecture lets each task-family PR stay focused: a new task family means one new file plus one entry in the dispatcher map — no surgery on the shared infrastructure or other codegens.
Concretely, landing this would enable:
- A working HuggingFace operator on the workspace for text-generation tasks against HF Hub and any OpenAI-compatible third-party provider (Cerebras, Groq, Sambanova, Together, …).
- A clean extension point for the image / audio / QA task families to plug into via subsequent PRs without modifying the operator class or the shared Python infrastructure.
Proposed Solution or Design
- New files under
common/workflow-operator/src/main/scala/org/apache/texera/amber/operator/huggingFace/:
HuggingFaceInferenceOpDesc.scala — thin (~180-line) dispatcher holding the @JsonProperty fields and the registeredCodegens map.
codegen/TaskCodegen.scala — trait + CodegenContext case class; default tasks: Set[String] = Set(task) for single-task codegens, overridable by multi-task codegens.
codegen/PythonCodegenBase.scala — shared provider-fallback (HF router + OpenAI-compatible third-party providers), process_table loop, _parse_response framework, with two holes for the per-task payload + parse snippets.
codegen/TextGenCodegen.scala — text-generation's chat-completions payload and body["choices"][0]["message"]["content"] parse.
- Register
HuggingFaceInferenceOpDesc in LogicalOp.scala's @JsonSubTypes.
- Design constraints baked into the codegen:
- Safe codegen via
EncodableString + pyb"...": user-input string fields are typed as EncodableString (String @EncodableStringAnnotation); the pyb macro emits them as self.decode_python_template('<base64>') runtime expressions instead of raw Python literals, so they never appear in the generated source as-is. This is what satisfies PythonCodeRawInvalidTextSpec's leakage check.
- Constants in
open(self): per-instance attributes (self.MODEL_ID, self.PROMPT_COLUMN, …) are assigned in the lifecycle method so self is in scope for the decode call.
- Codegen totality:
generatePythonCode never throws on arbitrary @JsonProperty values — unknown task strings fall back to TextGenCodegen, and the generated Python's else branch produces a generic {"inputs": prompt_value} payload, matching the original monolithic operator's behavior. Required by the regression test contract.
- Defensive
MODEL_ID validation at runtime: generated Python rejects malformed model IDs (path-traversal segments, query strings, fragments, control characters) with a clear ValueError before any HF URL is composed.
References:
Impact / Priority
(P2) Medium — required for the HuggingFace inference operator (#5041) to function. Does not affect existing functionality.
Affected Area
Workflow Engine (Amber) — operator descriptor + Python codegen.
Task Type
Feature Summary
The HuggingFace inference operator (#5041) needs to cover ~20 HF pipeline tasks (text-generation, image-classification, ASR, text-to-image, …). To land it cleanly and let the per-task work proceed in parallel, the operator is introduced via a dispatcher + per-task codegen architecture: a thin
HuggingFaceInferenceOpDescselects aTaskCodegenbased on the configured task, and the selected codegen contributes the per-task Python payload + parse snippets. Shared infrastructure (provider fallback, HTTP loop, response-parsing framework) lives inPythonCodegenBase.This issue covers shipping the dispatcher pattern + the first task family (text-generation) end-to-end. Subsequent child issues add the image, audio / media-generation, and QA / ranking task families by introducing new
*Codegenobjects and registering them in the dispatcher map. The architecture lets each task-family PR stay focused: a new task family means one new file plus one entry in the dispatcher map — no surgery on the shared infrastructure or other codegens.Concretely, landing this would enable:
Proposed Solution or Design
common/workflow-operator/src/main/scala/org/apache/texera/amber/operator/huggingFace/:HuggingFaceInferenceOpDesc.scala— thin (~180-line) dispatcher holding the@JsonPropertyfields and theregisteredCodegensmap.codegen/TaskCodegen.scala— trait +CodegenContextcase class; defaulttasks: Set[String] = Set(task)for single-task codegens, overridable by multi-task codegens.codegen/PythonCodegenBase.scala— shared provider-fallback (HF router + OpenAI-compatible third-party providers),process_tableloop,_parse_responseframework, with two holes for the per-task payload + parse snippets.codegen/TextGenCodegen.scala— text-generation's chat-completions payload andbody["choices"][0]["message"]["content"]parse.HuggingFaceInferenceOpDescinLogicalOp.scala's@JsonSubTypes.EncodableString+pyb"...": user-input string fields are typed asEncodableString(String @EncodableStringAnnotation); thepybmacro emits them asself.decode_python_template('<base64>')runtime expressions instead of raw Python literals, so they never appear in the generated source as-is. This is what satisfiesPythonCodeRawInvalidTextSpec's leakage check.open(self): per-instance attributes (self.MODEL_ID,self.PROMPT_COLUMN, …) are assigned in the lifecycle method soselfis in scope for the decode call.generatePythonCodenever throws on arbitrary@JsonPropertyvalues — unknown task strings fall back toTextGenCodegen, and the generated Python'selsebranch produces a generic{"inputs": prompt_value}payload, matching the original monolithic operator's behavior. Required by the regression test contract.MODEL_IDvalidation at runtime: generated Python rejects malformed model IDs (path-traversal segments, query strings, fragments, control characters) with a clearValueErrorbefore any HF URL is composed.References:
Impact / Priority
(P2) Medium — required for the HuggingFace inference operator (#5041) to function. Does not affect existing functionality.
Affected Area
Workflow Engine (Amber) — operator descriptor + Python codegen.
Task Type