Datasets are very similar to NumPy arrays. They are homogeneous collections of data elements, with an immutable datatype and (hyper)rectangular shape. Unlike NumPy arrays, they support a variety of transparent storage features such as compression, error-detection, and chunked I/O.
They are represented in h5py by a thin proxy class which supports familiar NumPy operations like slicing, along with a variety of descriptive attributes:
- shape attribute
- size attribute
- dtype attribute
To make an empty dataset, all you have to do is specify a name, shape, and optionally the data type (defaults to 'f'):
>>> dset = f.create_dataset("default", (100,))
>>> dset = f.create_dataset("ints", (100,), dtype='i8')
You may initialize the dataset to an existing NumPy array:
>>> arr = np.arange(100)
>>> dset = f.create_dataset("init", data=arr)
Keywords shape and dtype may be specified along with data; if so, they will override data.shape and data.dtype. It’s required that (1) the total number of points in shape match the total number of points in data.shape, and that (2) it’s possible to cast data.dtype to the requested dtype.
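As a minimal sketch of the creation rules above, the following block checks the default type and then overrides both shape and dtype while initializing from an array. The file name, the dataset names, and the use of the in-memory "core" driver (chosen here only to avoid writing to disk) are illustrative assumptions.

```python
import h5py
import numpy as np

# In-memory file (assumption: used only so the sketch leaves nothing on disk).
with h5py.File("create_demo.h5", "w", driver="core", backing_store=False) as f:
    # No dtype given: the default 'f' (single-precision float) is used.
    default = f.create_dataset("default", (100,))
    default_dtype = default.dtype

    # shape and dtype override data.shape and data.dtype. The element
    # counts match (10 * 10 == 100) and the integers can be cast to 'f8'.
    arr = np.arange(100)
    reshaped = f.create_dataset("reshaped", shape=(10, 10), dtype="f8", data=arr)
    reshaped_shape = reshaped.shape
    reshaped_dtype = reshaped.dtype
    first_row = reshaped[0, :]

print(default_dtype, reshaped_shape, reshaped_dtype)
```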
An HDF5 dataset created with the default settings will be contiguous; in other words, laid out on disk in traditional C order. Datasets may also be created using HDF5’s chunked storage layout. This means the dataset is divided up into regularly-sized pieces which are stored haphazardly on disk, and indexed using a B-tree.
Chunked storage makes it possible to resize datasets, and because the data is stored in fixed-size chunks, to use compression filters.
To enable chunked storage, set the keyword chunks to a tuple indicating the chunk shape:
>>> dset = f.create_dataset("chunked", (1000, 1000), chunks=(100, 100))
Data will be read and written in blocks with shape (100,100); for example, the data in dset[0:100,0:100] will be stored together in the file, as will the data points in range dset[400:500, 100:200].
Chunking has performance implications. It’s recommended to keep the total size of your chunks between 10 KiB and 1 MiB, larger for larger datasets. Also keep in mind that when any element in a chunk is accessed, the entire chunk is read from disk.
Since picking a chunk shape can be confusing, you can have h5py guess a chunk shape for you:
>>> dset = f.create_dataset("autochunk", (1000, 1000), chunks=True)
Auto-chunking is also enabled when using compression or maxshape, etc., if a chunk shape is not manually specified.
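The chunk shape that was actually applied can be inspected through the chunks attribute of a dataset. The sketch below compares a manual chunk shape, a guessed one, one implied by compression, and a contiguous dataset; the dataset names and the in-memory file are assumptions made for illustration.

```python
import h5py

with h5py.File("chunk_demo.h5", "w", driver="core", backing_store=False) as f:
    manual = f.create_dataset("chunked", (1000, 1000), chunks=(100, 100))
    auto = f.create_dataset("autochunk", (1000, 1000), chunks=True)
    implied = f.create_dataset("zipped", (1000, 1000), compression="gzip")
    contiguous = f.create_dataset("contiguous", (1000, 1000))

    manual_chunks = manual.chunks          # the shape we asked for
    auto_chunks = auto.chunks              # a shape guessed by h5py
    implied_chunks = implied.chunks        # compression forces chunking
    contiguous_chunks = contiguous.chunks  # None: not chunked

print(manual_chunks, auto_chunks, implied_chunks, contiguous_chunks)
```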
In HDF5, datasets can be resized once created up to a maximum size, by calling Dataset.resize(). You specify this maximum size when creating the dataset, via the keyword maxshape:
>>> dset = f.create_dataset("resizable", (10,10), maxshape=(500, 20))
Any (or all) axes may also be marked as “unlimited”, in which case they may be increased up to the HDF5 per-axis limit of 2**64 elements. Indicate these axes using None:
>>> dset = f.create_dataset("unlimited", (10, 10), maxshape=(None, 10))
Resizing an array with existing data works differently than in NumPy; if any axis shrinks, the data in the missing region is discarded. Data does not “rearrange” itself as it does when resizing a NumPy array.
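The difference from NumPy can be seen in a small sketch. Shrinking a 4x4 dataset to 2x2 keeps the top-left corner in place, whereas numpy.resize refills the new shape from the flattened data. Dataset and file names here are illustrative assumptions.

```python
import h5py
import numpy as np

arr = np.arange(16).reshape(4, 4)

with h5py.File("resize_demo.h5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset("resizable", data=arr, maxshape=(4, 4))

    # Shrink both axes: elements outside the new extent are discarded,
    # the remaining elements keep their coordinates.
    dset.resize((2, 2))
    kept = dset[...]

# NumPy, by contrast, re-reads the flattened data into the new shape.
rearranged = np.resize(arr, (2, 2))

print(kept)
print(rearranged)
```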
Chunked data may be transformed by the HDF5 filter pipeline. The most common use is applying transparent compression. Data is compressed on the way to disk, and automatically decompressed when read. Once the dataset is created with a particular compression filter applied, data may be read and written as normal with no special steps required.
Enable compression with the compression keyword to Group.create_dataset():
>>> dset = f.create_dataset("zipped", (100, 100), compression="gzip")
Options for each filter may be specified with compression_opts:
>>> dset = f.create_dataset("zipped_max", (100, 100), compression="gzip", compression_opts=9)
Lossless compression filters
- GZIP filter ("gzip"): Available with every installation of HDF5, so it’s best where portability is required. Good compression, moderate speed. compression_opts sets the compression level and may be an integer from 0 to 9; the default is 4.
- LZF filter ("lzf"): Available with every installation of h5py (C source code also available). Low to moderate compression, very fast. No options.
- SZIP filter ("szip"): Patent-encumbered filter used in the NASA community. Not available with all installations of HDF5 due to legal reasons. Consult the HDF5 docs for filter options.
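A rough way to see the effect of the GZIP and LZF filters is to store the same highly repetitive array three times and compare the space each copy occupies. This sketch assumes the low-level call dset.id.get_storage_size() to read the allocated size; the test data and dataset names are made up for illustration, and real compression ratios depend heavily on the data.

```python
import h5py
import numpy as np

# Repetitive data compresses well; this array is purely illustrative.
data = np.tile(np.arange(100, dtype="i8"), (100, 1))

with h5py.File("filter_demo.h5", "w", driver="core", backing_store=False) as f:
    raw = f.create_dataset("raw", data=data)
    gz = f.create_dataset("gz", data=data, compression="gzip", compression_opts=9)
    lzf = f.create_dataset("lzf", data=data, compression="lzf")

    # Reads go through the filter pipeline transparently.
    gz_roundtrip = np.array_equal(gz[...], data)
    lzf_roundtrip = np.array_equal(lzf[...], data)

    gz_filter, gz_level = gz.compression, gz.compression_opts
    lzf_filter = lzf.compression

    # Bytes allocated in the file for each dataset's raw data.
    raw_size = raw.id.get_storage_size()
    gz_size = gz.id.get_storage_size()
    lzf_size = lzf.id.get_storage_size()

print(raw_size, gz_size, lzf_size)
```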
Filters enabled with the compression keywords are _lossless_; what comes out of the dataset is exactly what you put in. HDF5 also includes a lossy filter which trades precision for storage space.
The scale-offset filter works with integer and floating-point data only. Enable it by setting the Group.create_dataset() keyword scaleoffset to an integer.
For integer data, this integer specifies the number of bits to retain. Set it to 0 to have HDF5 automatically compute the number of bits required for lossless compression of the chunk. For floating-point data, it specifies the number of digits after the decimal point to retain.
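The two modes of the scale-offset filter can be sketched as follows: integers stored with scaleoffset=0 should come back unchanged, while floats stored with scaleoffset=3 come back rounded to roughly three decimal digits. The sample arrays and names are assumptions for illustration, and the printed error depends on the data.

```python
import h5py
import numpy as np

ints = np.arange(1000, dtype="i4") + 100000
floats = np.linspace(0.0, 1.0, 1000)

with h5py.File("scaleoffset_demo.h5", "w", driver="core", backing_store=False) as f:
    # Integers: 0 lets HDF5 pick the bit width needed per chunk (lossless).
    d_int = f.create_dataset("ints", data=ints, scaleoffset=0)
    # Floats: keep about 3 digits after the decimal point (lossy).
    d_flt = f.create_dataset("floats", data=floats, scaleoffset=3)

    int_exact = np.array_equal(d_int[...], ints)
    max_err = float(np.max(np.abs(d_flt[...] - floats)))

print(int_exact, max_err)
```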
Block-oriented compressors like GZIP or LZF work better when presented with runs of similar values. Enabling the shuffle filter rearranges the bytes in the chunk and may improve the compression ratio. The filter is lossless and has no significant speed penalty.
Enable by setting Group.create_dataset() keyword shuffle to True.
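A short sketch of combining shuffle with GZIP is shown below. It only checks that the filter is reported as enabled and that data survives a round trip; whether the compression ratio improves depends on the data. Names are illustrative assumptions.

```python
import h5py
import numpy as np

data = np.arange(10000, dtype="i4").reshape(100, 100)

with h5py.File("shuffle_demo.h5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset("shuffled", data=data, compression="gzip", shuffle=True)
    shuffle_enabled = dset.shuffle
    roundtrip_ok = np.array_equal(dset[...], data)

print(shuffle_enabled, roundtrip_ok)
```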
Reading & writing data
HDF5 datasets re-use the NumPy slicing syntax to read and write to the file. Slice specifications are translated directly to HDF5 “hyperslab” selections, and are a fast and efficient way to access data in the file. The following slicing arguments are recognized:
- Indices: anything that can be converted to a Python long
- Slices (i.e. [:] or [0:10])
- Field names, in the case of compound data
- At most one Ellipsis (...) object
Here are a few examples (output omitted):
>>> dset = f.create_dataset("MyDataset", (10,10,10), 'f')
>>> dset[0,0,0]
>>> dset[0,2:10,1:9:3]
>>> dset[:,::2,5]
>>> dset
>>> dset[1,5]
>>> dset[0,...]
>>> dset[...,6]
For compound data, you can specify multiple field names alongside the numeric slices:
>>> dset["FieldA"]
>>> dset[0,:,4:5, "FieldA", "FieldB"]
>>> dset[0, ..., "FieldC"]
To retrieve the contents of a scalar dataset, you can use the same syntax as in NumPy: result = dset[()]. In other words, index into the dataset using an empty tuple.
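The slicing forms described above can be tried on small datasets. The sketch below reads a few hyperslabs from a 3-D dataset, one field from a compound dataset, and the value of a scalar dataset; all names, shapes, and the compound type are assumptions chosen for illustration.

```python
import h5py
import numpy as np

with h5py.File("slicing_demo.h5", "w", driver="core", backing_store=False) as f:
    cube = f.create_dataset("cube", data=np.arange(1000).reshape(10, 10, 10))
    shape_a = cube[0, 2:10, 1:9:3].shape  # index, slice, strided slice
    shape_b = cube[:, ::2, 5].shape       # every second element on axis 1
    shape_c = cube[..., 6].shape          # Ellipsis fills the leading axes

    # Compound data: a field name selects a single member.
    dt = np.dtype([("FieldA", "i4"), ("FieldB", "f8")])
    records = np.zeros(5, dtype=dt)
    records["FieldA"] = np.arange(5)
    comp = f.create_dataset("comp", data=records)
    field_a = comp["FieldA"]

    # Scalar dataset: index with an empty tuple.
    scalar = f.create_dataset("scalar", data=42)
    scalar_shape = scalar.shape
    value = scalar[()]

print(shape_a, shape_b, shape_c, field_a, value)
```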
For simple slicing, broadcasting is supported:
>>> dset[0,:,:] = np.arange(10) # Broadcasts to (10,10)
Broadcasting is implemented using repeated hyperslab selections, and is safe to use with very large target selections. It is supported for the above “simple” (integer, slice and ellipsis) slicing only.
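A brief sketch of broadcasting on write: a length-10 array is repeated across a (10, 10) plane, and a scalar is repeated across another plane. The dataset name and the in-memory file are illustrative assumptions.

```python
import h5py
import numpy as np

with h5py.File("broadcast_demo.h5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset("bcast", (10, 10, 10), dtype="f")

    # A (10,) array is broadcast along the middle axis of the selection.
    dset[0, :, :] = np.arange(10)
    # A scalar is broadcast to the whole selection.
    dset[1, :, :] = 7

    plane0 = dset[0, :, :]
    plane1 = dset[1, :, :]

print(plane0[0], plane0[9], plane1[0, 0])
```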
A subset of the NumPy fancy-indexing syntax is supported. Use this with caution, as the underlying HDF5 mechanisms may have different performance than you expect.
For any axis, you can provide an explicit list of points you want; for a dataset with shape (10, 10):
>>> dset.shape
(10, 10)
>>> result = dset[0, [1,3,8]]
>>> result.shape
(3,)
>>> result = dset[1:6, [5,8,9]]
>>> result.shape
(5, 3)
The following restrictions exist:
- List selections may not be empty
- Selection coordinates must be given in increasing order
- Duplicate selections are ignored
- Very long lists (> 1000 elements) may produce poor performance
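To make the list selections concrete, the following sketch compares them with the equivalent NumPy indexing on an in-memory copy of the same data. The index lists are non-empty and strictly increasing, in line with the restrictions above; names are illustrative assumptions.

```python
import h5py
import numpy as np

arr = np.arange(100).reshape(10, 10)

with h5py.File("fancy_demo.h5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset("fancy", data=arr)

    picked = dset[0, [1, 3, 8]]    # three points from the first row
    block = dset[1:6, [5, 8, 9]]   # a slice combined with a list

print(picked, block.shape)
```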
NumPy boolean “mask” arrays can also be used to specify a selection. The result of this operation is a 1-D array with elements arranged in the standard NumPy (C-style) order. Behind the scenes, this generates a laundry list of points to select, so be careful when using it with large masks:
>>> arr = np.arange(100).reshape((10,10))
>>> dset = f.create_dataset("MyDataset", data=arr)
>>> result = dset[arr > 50]
>>> result.shape
(49,)
Length and iteration
As with NumPy arrays, the len() of a dataset is the length of the first axis, and iterating over a dataset iterates over the first axis. However, modifications to the yielded data are not recorded in the file. Resizing a dataset while iterating has undefined results.
On 32-bit platforms, len(dataset) will fail if the first axis is bigger than 2**32. It’s recommended to use Dataset.len() for large datasets.
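The length and iteration behavior can be sketched on a small 2-D dataset. Each iteration step yields one row as an in-memory NumPy array, so overwriting that array leaves the file contents untouched. Names are illustrative assumptions.

```python
import h5py
import numpy as np

original = np.arange(12).reshape(4, 3)

with h5py.File("iter_demo.h5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset("rows", data=original)

    n_builtin = len(dset)   # length of the first axis
    n_method = dset.len()   # same value, safe for very large datasets

    row_shapes = []
    for row in dset:
        row_shapes.append(row.shape)
        row[...] = 0        # changes only the in-memory copy

    after_iteration = dset[...]

print(n_builtin, n_method, row_shapes)
```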
- class Dataset(identifier)
Dataset objects are typically created via Group.create_dataset(), or by retrieving existing datasets from a file. Call this constructor to create a new Dataset bound to an existing DatasetID identifier.
- read_direct(array, source_sel=None, dest_sel=None)
Read from an HDF5 dataset directly into a NumPy array, which can avoid making an intermediate copy as happens with slicing. The destination array must be C-contiguous and writable, and must have a datatype to which the source data may be cast. Data type conversion will be carried out on the fly by HDF5.
source_sel and dest_sel indicate the range of points in the dataset and destination array respectively. Use the output of numpy.s_[args]:
>>> dset = f.create_dataset("dset", (100,), dtype='int64')
>>> arr = np.zeros((100,), dtype='int32')
>>> dset.read_direct(arr, np.s_[0:10], np.s_[50:60])
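To see where the data lands, the sketch below fills the source dataset with known values before calling read_direct(). Elements 0 to 9 of the dataset are copied into positions 50 to 59 of the destination array and converted from int64 to int32 on the way; the rest of the array is left untouched. The initial data is an assumption added for illustration.

```python
import h5py
import numpy as np

with h5py.File("read_direct_demo.h5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset("dset", data=np.arange(100, dtype="int64"))

    # Destination must be C-contiguous and writable.
    arr = np.zeros((100,), dtype="int32")
    dset.read_direct(arr, np.s_[0:10], np.s_[50:60])

print(arr[48:62])
```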
- astype(dtype)
Return a context manager allowing you to read data as a particular type. Conversion is handled by HDF5 directly, on the fly:
>>> dset = f.create_dataset("bigint", (1000,), dtype='int64')
>>> with dset.astype('int16'):
...     out = dset[:]
>>> out.dtype
dtype('int16')
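In more recent h5py releases (the 3.x series, as far as we are aware), the object returned by astype() can be sliced directly, and the context-manager form shown above may no longer be accepted. The sketch below assumes such a release; check the documentation for the installed version before relying on either form.

```python
import h5py
import numpy as np

with h5py.File("astype_demo.h5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset("bigint", data=np.arange(1000, dtype="int64"))

    # Assumption: h5py 3.x, where the astype() wrapper supports slicing.
    out = dset.astype("int16")[:]

print(out.dtype, out[:5])
```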
- resize(size, axis=None)
Change the shape of a dataset. size may be a tuple giving the new dataset shape, or an integer giving the new length of the specified axis.
Datasets may be resized only up to Dataset.maxshape.
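Both calling conventions of resize() are sketched below: an integer with axis to change a single axis, and a tuple to set the whole shape. The dataset name and sizes are illustrative assumptions that stay within maxshape.

```python
import h5py

with h5py.File("resize_axis_demo.h5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset("growable", (10, 10), maxshape=(500, 20))

    dset.resize(20, axis=0)    # new length for axis 0 only
    shape_after_axis = dset.shape

    dset.resize((30, 15))      # new shape for all axes at once
    shape_after_tuple = dset.shape

    limit = dset.maxshape

print(shape_after_axis, shape_after_tuple, limit)
```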
- len()
Return the size of the first axis.
- shape
NumPy-style shape tuple giving dataset dimensions.
- dtype
NumPy dtype object giving the dataset’s type.
- size
Integer giving the total number of elements in the dataset.
- maxshape
NumPy-style shape tuple indicating the maximum dimensions up to which the dataset may be resized. Axes with None are unlimited.
- compression
String with the currently applied compression filter, or None if compression is not enabled for this dataset. See Filter pipeline.
- scaleoffset
Setting for the HDF5 scale-offset filter (integer), or None if scale-offset compression is not used for this dataset. See Scale-Offset filter.
- fillvalue
Value used when reading uninitialized portions of the dataset, or None if no fill value has been defined, in which case HDF5 will use a type-appropriate default value. Can’t be changed after the dataset is created.
- name
String giving the full path to this dataset.
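The descriptive attributes can be read back from any dataset. The sketch below creates one dataset with several options set and collects the values; the dataset name, shape, and option values are illustrative assumptions.

```python
import h5py
import numpy as np

with h5py.File("attrs_demo.h5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset(
        "described", (10, 20), dtype="i4",
        maxshape=(None, 20), compression="gzip", fillvalue=-1,
    )

    info = {
        "shape": dset.shape,
        "size": dset.size,
        "dtype": dset.dtype,
        "maxshape": dset.maxshape,
        "compression": dset.compression,
        "scaleoffset": dset.scaleoffset,
        "fillvalue": dset.fillvalue,
        "name": dset.name,
    }
    # Nothing has been written yet, so reads return the fill value.
    unwritten = dset[0, 0]

for key, val in info.items():
    print(key, val)
```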