The dataset should be stored in one of the following formats:
- Files
- Shards
- Sharded files
In the files format, the dataset is described by a single csv file that holds the metadata and the paths to images, videos, etc. Such a csv file can look like this:
image_path,text,width,height
images/1.jpg,caption,512,512

Reading a dataset in files format:
from DPF import FilesDatasetConfig, DatasetReader
config = FilesDatasetConfig.from_path_and_columns(
    'tests/datasets/files_correct/data.csv',
    image_path_col='image_path',
    text_col='caption'
)
reader = DatasetReader()
processor = reader.read_from_config(config)

In the shards format, the dataset is divided into shards of N samples each. The files in each shard are stored in a `tar` archive, and the metadata is stored in a csv file. The tar archive and the csv file of each shard must have the same name (the shard index).
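The layout described above can be sketched with the Python standard library alone. This is a minimal illustration, not part of DPF: the helper name `write_shards`, its `(file_name, file_bytes, caption)` input tuples, and the `image_name`/`caption` column names are assumptions chosen to match the example csv in this section.

```python
import csv
import io
import tarfile
from pathlib import Path


def write_shards(samples, out_dir, shard_size):
    """Write (file_name, file_bytes, caption) samples as <index>.tar + <index>.csv pairs.

    Hypothetical helper for illustration; not a DPF function.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for shard_index, start in enumerate(range(0, len(samples), shard_size)):
        chunk = samples[start:start + shard_size]
        # The archive and its metadata file share the shard index as their name.
        with tarfile.open(out_dir / f"{shard_index}.tar", "w") as tar:
            for file_name, file_bytes, _ in chunk:
                info = tarfile.TarInfo(name=file_name)
                info.size = len(file_bytes)
                tar.addfile(info, io.BytesIO(file_bytes))
        # One metadata row per file stored in the archive.
        with open(out_dir / f"{shard_index}.csv", "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(["image_name", "caption"])
            writer.writerows((name, caption) for name, _, caption in chunk)
```

Calling `write_shards(samples, 'my_dataset', shard_size=1000)` with three samples and `shard_size=2` would produce `0.tar`, `0.csv`, `1.tar` and `1.csv`, with the last shard holding the single remaining sample.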
Example of shards structure:
0.tar
0.csv
1.tar
1.csv
...
0.csv file:
image_name, caption
0.jpg, caption for image 1
1.jpg, caption for image 2
...

Reading a dataset in shards format:
from DPF import ShardsDatasetConfig, DatasetReader
config = ShardsDatasetConfig.from_path_and_columns(
    'tests/datasets/shards_correct',
    image_name_col='image_name',
    text_col='caption'
)
reader = DatasetReader()
processor = reader.read_from_config(config)

The sharded files format is similar to shards, but instead of tar archives, the files of each shard are stored in a folder.
Example of sharded files structure:
.
├── 0/
│ ├── 0.jpg
│ ├── 1.jpg
│ └── ...
├── 0.csv
├── 1/
│ ├── 1000.jpg
│ ├── 1001.jpg
│ └── ...
├── 1.csv
└── ...
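Before reading such a directory, it can be useful to confirm that every csv file has a matching folder and that every listed file exists. The sketch below uses only the standard library. The function name `check_sharded_files` is hypothetical (not a DPF API), and the `image_name` column is assumed to match the column used in the shards example.

```python
import csv
from pathlib import Path


def check_sharded_files(root, image_name_col="image_name"):
    """Return a list of problems found in a sharded files directory.

    Hypothetical helper for illustration; not a DPF function.
    """
    root = Path(root)
    problems = []
    for csv_path in sorted(root.glob("*.csv")):
        # Each <index>.csv is expected to sit next to a folder named <index>/.
        shard_dir = root / csv_path.stem
        if not shard_dir.is_dir():
            problems.append(f"{csv_path.name}: missing folder {shard_dir.name}/")
            continue
        with open(csv_path, newline="", encoding="utf-8") as f:
            # skipinitialspace tolerates headers written as "image_name, caption".
            for row in csv.DictReader(f, skipinitialspace=True):
                name = row[image_name_col]
                if not (shard_dir / name).is_file():
                    problems.append(
                        f"{csv_path.name}: missing file {shard_dir.name}/{name}"
                    )
    return problems
```

An empty list means that every metadata row points at an existing file; any other result lists the shard and the file or folder that is missing.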
Reading a dataset in sharded files format:
from DPF import ShardedFilesDatasetConfig, DatasetReader
config = ShardedFilesDatasetConfig.from_path_and_columns(
    'tests/datasets/shards_correct',
    image_name_col='image_name',
    text_col='caption'
)
reader = DatasetReader()
processor = reader.read_from_config(config)