Add HuggingFace audio and media generation tasks #5288

@anishshiva7

Task Summary

Feature Summary

The HuggingFace inference operator (#5041) is being landed as a sequence of focused task-family PRs. The dispatcher + per-task codegen architecture was introduced in #5277 with text-generation as the first task family.

This issue covers adding the audio and media-generation task families to that architecture. The new tasks plug into the existing dispatcher by adding dedicated TaskCodegen implementations for audio and media generation, then registering their task strings in HuggingFaceInferenceOpDesc.

Concretely, landing this would enable:

  • Audio inference tasks:
    • automatic-speech-recognition
    • audio-classification
    • text-to-speech
  • Media-generation tasks:
    • text-to-image
    • text-to-video
  • A cleaner codegen structure where audio and media-generation Python payload / parse logic lives in separate files instead of expanding the operator descriptor.
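
As a rough illustration of the registration step, the dispatcher can be pictured as a lookup table from task string to a per-task handler. The sketch below is written in Python for brevity, whereas the real dispatcher lives in Scala. `TaskHandler`, `REGISTRY`, and `handler_for` are hypothetical names, not identifiers from #5277. Only the task strings and their grouping are taken from the list above.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class TaskHandler:
    """Hypothetical stand-in for a per-task codegen implementation."""
    family: str  # "audio" or "media-generation"


# Task strings from this issue, grouped by the family that would own them.
REGISTRY: Dict[str, TaskHandler] = {
    "automatic-speech-recognition": TaskHandler("audio"),
    "audio-classification": TaskHandler("audio"),
    "text-to-speech": TaskHandler("audio"),
    "text-to-image": TaskHandler("media-generation"),
    "text-to-video": TaskHandler("media-generation"),
}


def handler_for(task: str) -> Optional[TaskHandler]:
    """Look up a handler; unknown task strings yield None instead of raising."""
    return REGISTRY.get(task)
```

Returning `None` for an unregistered string, rather than raising, keeps the lookup safe to call with arbitrary user-supplied task values.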

Proposed Solution or Design

Add new files under:

common/workflow-operator/src/main/scala/org/apache/texera/amber/operator/huggingFace/codegen/

| File | Purpose |
| --- | --- |
| AudioTaskCodegen.scala | Payload and response parsing for ASR, audio-classification, and text-to-speech |
| MediaGenCodegen.scala | Payload and response parsing for text-to-image and text-to-video |

Modify:

| File | Change |
| --- | --- |
| HuggingFaceInferenceOpDesc.scala | Add audio input fields and register the new task codegens |
| TaskCodegen.scala | Extend CodegenContext with audio input fields |
| PythonCodegenBase.scala | Add shared audio/media helpers, audio source resolution, raw audio body support, and media data URL handling |
| HuggingFaceInferenceOpDescSpec.scala | Add descriptor/codegen coverage for audio and media-generation tasks |
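
One possible shape for the generated Python side of "audio source resolution" and "raw audio body support" is sketched below. The issue does not say which source kinds are supported, so this assumes only two: raw bytes and base64 data URLs. `resolve_audio_bytes` is a hypothetical helper name, not something defined in PythonCodegenBase.scala.

```python
import base64
from typing import Union


def resolve_audio_bytes(value: Union[bytes, bytearray, str]) -> bytes:
    """Return raw audio bytes suitable for a request body (hypothetical helper)."""
    # Assumption: binary values are already the raw audio payload.
    if isinstance(value, (bytes, bytearray)):
        return bytes(value)
    # Assumption: strings are only accepted as base64 data URLs.
    if isinstance(value, str) and value.startswith("data:"):
        header, sep, payload = value.partition(",")
        if sep and header.endswith(";base64"):
            return base64.b64decode(payload)
    raise ValueError("unsupported audio source")
```

Other source kinds, such as file paths or remote URLs, would need their own branches. They are left out here because the issue does not describe them.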

Design constraints:

  • Follow the dispatcher pattern from #5277 ("Add HuggingFaceInferenceOpDesc with dispatcher + per-task codegen architecture (text-generation)").
  • Keep task-specific Python generation in separate TaskCodegen files.
  • Preserve EncodableString + pyb"..." safety for user-provided string fields.
  • Keep generatePythonCode total so arbitrary @JsonProperty values do not throw during code generation.
  • Normalize media responses into data URLs where applicable so downstream result rendering can consume image, audio, and video outputs consistently.
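
The data URL normalization in the last constraint could look roughly like the following in the generated Python. This is a minimal sketch: `to_data_url` is a hypothetical helper name, and the MIME type is assumed to be known by the caller (for example from the response's content type), which the issue does not specify.

```python
import base64


def to_data_url(raw: bytes, mime_type: str) -> str:
    """Wrap raw media bytes in a base64 data URL (hypothetical helper)."""
    encoded = base64.b64encode(raw).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```

With one helper of this shape, image, audio, and video outputs differ only in the MIME type passed in, so downstream rendering can treat all three the same way.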

References:

Impact / Priority

(P2) Medium — required for broader HuggingFace operator task coverage. Does not affect existing operators.

Affected Area

Workflow Engine (Amber) — HuggingFace operator descriptor and Python codegen.

Task Type

Testing / QA

Other
