Strings in HDF5¶
The Most Important Thing¶
If you remember nothing else, remember this:
All strings in HDF5 hold encoded text.
You can’t store arbitrary binary data in HDF5 strings. Not only will this break, it will break in odd, hard-to-discover ways that will leave you confused and cursing.
How to store raw binary data¶
If you have a non-text blob in a Python byte string (as opposed to ASCII or
UTF-8 encoded text, which is fine), you should wrap it in a
void type for
storage. This will map to the HDF5 OPAQUE datatype, and will prevent your
blob from getting mangled by the string machinery.
Here’s an example of how to store binary data in an attribute, and then recover it:
>>> binary_blob = b"Hello\x00Hello\x00" >>> dset.attrs["attribute_name"] = np.void(binary_blob) >>> out = dset.attrs["attribute_name"] >>> binary_blob = out.tostring()
How to store text strings¶
At the high-level interface, h5py exposes three kinds of strings. Each maps to a specific type within Python (but see Compatibility below):
- Fixed-length ASCII (NumPy
- Variable-length ASCII (Python 2
str, Python 3
- Variable-length UTF-8 (Python 2
unicode, Python 3
If you want to write maximally-compatible files and don’t want to read the whole chapter:
numpy.string_for scalar attributes
- Use the NumPy
Sdtype for datasets and array attributes
These are created when you use
>>> dset.attrs["name"] = numpy.string_("Hello")
>>> dset = f.create_dataset("string_ds", (100,), dtype="S10")
In the file, these map to fixed-width ASCII strings. One byte per character is used. The representation is “null-padded”, which is the internal representation used by NumPy (and the only one which round-trips through HDF5).
Technically, these strings are supposed to store only ASCII-encoded text, although in practice anything you can store in NumPy will round-trip. But for compatibility with other programs using HDF5 (IDL, MATLAB, etc.), you should use ASCII only.
This is the most-compatible way to store a string. Everything else can read it.
These are created when you assign a byte string to an attribute:
>>> dset.attrs["attr"] = b"Hello"
or when you create a dataset with an explicit “bytes” vlen type:
>>> dt = h5py.special_dtype(vlen=bytes) >>> dset = f.create_dataset("name", (100,), dtype=dt)
Note that they’re not fully identical to Python byte strings. You can only store ASCII-encoded text, without NULL bytes:
>>> dset.attrs["name"] = b"Hello\x00there" ValueError: VLEN strings do not support embedded NULLs
In the file, these are created as variable-length strings with character set H5T_CSET_ASCII.
These are created when you assign a
unicode string to an attribute:
>>> dset.attrs["name"] = u"Hello"
or if you create a dataset with an explicit
unicode vlen type:
>>> dt = h5py.special_dtype(vlen=unicode) >>> dset = f.create_dataset("name", (100,), dtype=dt)
They can store any character a Python unicode string can store, with the exception of NULLs. In the file these are created as variable-length strings with character set H5T_CSET_UTF8.
Exceptions for Python 3¶
Most strings in the HDF5 world are stored in ASCII, which means they map to byte strings. But in Python 3, there’s a strict separation between data and text, which intentionally makes it painful to handle encoded strings directly.
So, when reading or writing scalar string attributes, on Python 3 they will
always be returned as type
str, regardless of the underlying storage
mechanism. The regular rules for writing apply; to get a fixed-width ASCII
numpy.string_, and to get a variable-length ASCII string, use
What about NumPy’s
NumPy also has a Unicode type, a UTF-32 fixed-width format (4-byte characters). HDF5 has no support for wide characters. Rather than trying to hack around this and “pretend” to support it, h5py will raise an error when attempting to create datasets or attributes of this type.
Unicode strings are used exclusively for object names in the file:
>>> f.name u'/'
You can supply either byte or unicode strings (on both Python 2 and Python 3) when creating or retrieving objects. If a byte string is supplied, it will be used as-is; Unicode strings will be encoded down to UTF-8.
In the file, h5py uses the most-compatible representation; H5T_CSET_ASCII for characters in the ASCII range; H5T_CSET_UTF8 otherwise.
>>> grp = f.create_dataset(b"name") >>> grp2 = f.create_dataset(u"name2")