Supported data formats

The dataset should be stored in one of the following formats:

  • Files
  • Shards
  • Sharded files

Files format

In the files format, a single CSV file holds the metadata and the paths to the images, videos, and other files. A CSV file can look like this:

image_path,text,width,height
images/1.jpg,caption,512,512
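
A CSV like the one above can be produced with Python's standard `csv` module. The following is a minimal sketch, not part of DPF: the helper name `write_files_csv` and the `records` argument are hypothetical, and the column names are simply taken from the example.

```python
import csv
from pathlib import Path


def write_files_csv(records, csv_path):
    """Write one CSV row per sample.

    `records` is a list of dicts that all share the same keys
    (hypothetical helper, not part of DPF).
    """
    csv_path = Path(csv_path)
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    fieldnames = list(records[0].keys())  # header order follows the first record
    with csv_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)
    return csv_path


if __name__ == "__main__":
    write_files_csv(
        [{"image_path": "images/1.jpg", "text": "caption", "width": 512, "height": 512}],
        "data.csv",
    )
```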

Reading a dataset in files format:

from DPF import FilesDatasetConfig, DatasetReader

config = FilesDatasetConfig.from_path_and_columns(
    'tests/datasets/files_correct/data.csv',
    image_path_col='image_path',
    text_col='caption'
)

reader = DatasetReader()
processor = reader.read_from_config(config)

Shards format

In this format, the dataset is divided into shards of N samples each. The files of each shard are stored in a tar archive, and the metadata is stored in a CSV file. The tar archive and the CSV file of a shard must have the same name (the shard index).

Example of shards structure:

0.tar
0.csv
1.tar
1.csv
...

0.csv file:

image_name,caption
0.jpg,caption for image 1
1.jpg,caption for image 2
...
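
The layout described above can be sketched with the standard `tarfile` and `csv` modules. This is only an illustration under stated assumptions, not DPF's own tooling: the helper `write_shards`, its arguments, and the in-memory `(image_name, image_bytes, caption)` tuples are hypothetical.

```python
import csv
import io
import tarfile
from pathlib import Path


def write_shards(samples, out_dir, shard_size):
    """Pack (image_name, image_bytes, caption) tuples into <index>.tar / <index>.csv pairs.

    Hypothetical helper, not part of DPF.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for start in range(0, len(samples), shard_size):
        index = start // shard_size
        chunk = samples[start:start + shard_size]
        # The archive holds the raw files of this shard.
        with tarfile.open(out_dir / f"{index}.tar", "w") as tar:
            for name, data, _ in chunk:
                info = tarfile.TarInfo(name=name)
                info.size = len(data)
                tar.addfile(info, io.BytesIO(data))
        # The CSV with the same index holds the metadata of this shard.
        with open(out_dir / f"{index}.csv", "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(["image_name", "caption"])
            writer.writerows((name, caption) for name, _, caption in chunk)
    return out_dir
```

Calling `write_shards(samples, "my_shards", shard_size=1000)` would write `0.tar`, `0.csv`, `1.tar`, `1.csv`, and so on, with at most 1000 samples per shard.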

Reading a dataset in shards format:

from DPF import ShardsDatasetConfig, DatasetReader

config = ShardsDatasetConfig.from_path_and_columns(
    'tests/datasets/shards_correct',
    image_name_col='image_name',
    text_col='caption'
)

reader = DatasetReader()
processor = reader.read_from_config(config)

Sharded files format

This format is similar to the shards format, but the files of each shard are stored in a folder instead of a tar archive.

Example of sharded files structure:

.
├── 0/
│   ├── 0.jpg
│   ├── 1.jpg
│   └── ...
├── 0.csv
├── 1/
│   ├── 1000.jpg
│   ├── 1001.jpg
│   └── ...
├── 1.csv
└── ...
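
Before reading such a dataset, it can help to check that every file listed in a shard's CSV exists in the matching folder. The sketch below uses only the standard library and is not part of DPF: the helper `find_missing_files` is hypothetical, and it assumes each CSV has an `image_name` column as in the shards example.

```python
import csv
from pathlib import Path


def find_missing_files(root, image_name_col="image_name"):
    """Return (shard, file name) pairs listed in a shard CSV but absent from its folder.

    Hypothetical helper, not part of DPF.
    """
    root = Path(root)
    missing = []
    for csv_path in sorted(root.glob("*.csv")):
        shard_dir = root / csv_path.stem  # 0.csv pairs with the folder 0/
        with csv_path.open(newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                name = row[image_name_col]
                if not (shard_dir / name).is_file():
                    missing.append((csv_path.stem, name))
    return missing
```

An empty list means every listed file was found; otherwise each entry names the shard and the missing file.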

Reading a dataset in sharded files format:

from DPF import ShardedFilesDatasetConfig, DatasetReader

config = ShardedFilesDatasetConfig.from_path_and_columns(
    'tests/datasets/shards_correct',
    image_name_col='image_name',
    text_col='caption'
)

reader = DatasetReader()
processor = reader.read_from_config(config)