Supported data formats

The dataset should be stored in one of the following formats:

Files
Shards
Sharded files

Files format

In the files format, the dataset is described by a single CSV file that holds the metadata and the paths to the images, videos, and other files. Such a CSV file can look like this:

image_path,caption,width,height
images/1.jpg,caption,512,512
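A CSV like the one above can be produced with Python's standard `csv` module. The following is a minimal sketch, not part of DPF: the helper name `write_files_csv` and its parameters are hypothetical, and the `caption` column name is chosen only to match the `text_col` argument used in the reading example in this section.

```python
import csv
from pathlib import Path


def write_files_csv(dataset_dir, rows, csv_name="data.csv"):
    """Write a metadata CSV for the files format and return its path.

    `rows` is a non-empty list of dicts; the keys of the first dict
    become the CSV header (hypothetical helper, not a DPF function).
    """
    dataset_dir = Path(dataset_dir)
    dataset_dir.mkdir(parents=True, exist_ok=True)
    csv_path = dataset_dir / csv_name
    fieldnames = list(rows[0].keys())
    # newline="" lets the csv module control line endings itself.
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return csv_path
```

Calling `write_files_csv("my_dataset", [{"image_path": "images/1.jpg", "caption": "caption", "width": 512, "height": 512}])` writes a header row followed by one row per sample.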

Reading a dataset in files format:

from DPF import FilesDatasetConfig, DatasetReader

config = FilesDatasetConfig.from_path_and_columns(
    'tests/datasets/files_correct/data.csv',
    image_path_col='image_path',
    text_col='caption'
)

reader = DatasetReader()
processor = reader.read_from_config(config)

Shards format

In this format, the dataset is divided into shards of N samples each. The files of each shard are stored in a tar archive, and the metadata is stored in a CSV file. The tar archive and the CSV file of a shard must have the same name (the shard index).

Example of shards structure:

0.tar
0.csv
1.tar
1.csv
...

0.csv file:

image_name,caption
0.jpg,caption for image 1
1.jpg,caption for image 2
...
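The layout above can be sketched with the standard `tarfile` and `csv` modules. This is an illustration under stated assumptions rather than DPF's own tooling: the function `write_shards` and its parameters are hypothetical, and it assumes every file is stored in the archive under its bare file name, as in the `0.csv` example.

```python
import csv
import tarfile
from pathlib import Path


def write_shards(image_paths, captions, out_dir, shard_size):
    """Pack samples into `<index>.tar` / `<index>.csv` pairs.

    Each shard holds at most `shard_size` samples. Returns the number
    of shards written (hypothetical helper, not a DPF function).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    samples = list(zip(image_paths, captions))
    shard_count = 0
    for start in range(0, len(samples), shard_size):
        index = start // shard_size
        shard = samples[start:start + shard_size]
        # The archive and its metadata file share the shard index as name.
        with tarfile.open(out_dir / f"{index}.tar", "w") as tar, \
                open(out_dir / f"{index}.csv", "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(["image_name", "caption"])
            for path, caption in shard:
                path = Path(path)
                # Store the file under its bare name inside the archive.
                tar.add(path, arcname=path.name)
                writer.writerow([path.name, caption])
        shard_count += 1
    return shard_count
```

With three images and `shard_size=2`, this writes `0.tar`/`0.csv` with two samples and `1.tar`/`1.csv` with the remaining one.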

Reading a dataset in shards format:

from DPF import ShardsDatasetConfig, DatasetReader

config = ShardsDatasetConfig.from_path_and_columns(
    'tests/datasets/shards_correct',
    image_name_col='image_name',
    text_col='caption'
)

reader = DatasetReader()
processor = reader.read_from_config(config)

Sharded files format

This format is similar to shards, but instead of tar archives, files are stored in folders.

Example of sharded files structure:

.
├── 0/
│   ├── 0.jpg
│   ├── 1.jpg
│   └── ...
├── 0.csv
├── 1/
│   ├── 1000.jpg
│   ├── 1001.jpg
│   └── ...
├── 1.csv
└── ...
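Before reading such a dataset, it can help to check that every file listed in a shard's CSV actually exists in the folder with the same index. The sketch below is a hypothetical helper, not part of DPF; it assumes each `<index>.csv` has an `image_name` column holding file names relative to the `<index>/` folder, as in the shards format.

```python
import csv
from pathlib import Path


def find_missing_files(dataset_dir, image_name_col="image_name"):
    """Return the paths listed in shard CSVs that do not exist on disk.

    Hypothetical helper: pairs every `<index>.csv` in `dataset_dir`
    with the folder `<index>/` next to it.
    """
    dataset_dir = Path(dataset_dir)
    missing = []
    for csv_path in sorted(dataset_dir.glob("*.csv")):
        shard_dir = dataset_dir / csv_path.stem  # 0.csv -> 0/
        with open(csv_path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                file_path = shard_dir / row[image_name_col]
                if not file_path.is_file():
                    missing.append(file_path)
    return missing
```

An empty result means every CSV row points at an existing file; any returned path identifies a row whose file is absent from its shard folder.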

Reading a dataset in sharded files format:

from DPF import ShardedFilesDatasetConfig, DatasetReader

config = ShardedFilesDatasetConfig.from_path_and_columns(
    'tests/datasets/shards_correct',
    image_name_col='image_name',
    text_col='caption'
)

reader = DatasetReader()
processor = reader.read_from_config(config)
