File Formats
============

The battery data toolkit stores data and metadata in two formats:

- *HDF5*: A format for saving all available information about a battery into a single file
- *Parquet*: A format optimized for storing column data, which requires
  a separate file for each type of data (cycle vs raw)

.. contents::
    :local:
    :depth: 1

:class:`~battdat.data.BatteryDataset` objects support reading and writing these formats
via ``to_[format]`` and ``from_[format]`` methods,
such as :meth:`~battdat.data.BatteryDataset.to_hdf` and :meth:`~battdat.data.BatteryDataset.from_parquet`.

.. _hdf5:

HDF5
----

The HDF5 format stores array data as a nested series of dictionaries.
``battdat`` stores each type of data known about a battery in a separate group
and the metadata for the battery as attributes on the root group of the file.

.. code-block:: python

    import h5py
    import json

    with h5py.File('example.h5') as f:
        metadata = json.loads(f.attrs['metadata'])  # Data describing the cell and how it was tested
        version = json.loads(f.attrs['battdat_version'])  # BattDat version used to save dataset
        raw_data = f['raw_data']  # HDF5 group holding raw data
        schema = raw_data.attrs['metadata']  # Description of each column

The internal structure of each group (e.g., ``f['raw_data']``) is that of the PyTables Table format:
a one-dimensional chunked array with a compound data type.

.. dropdown:: HDF5 content

    .. code-block::

        $ h5ls -rv single-resistor-complex-charge_from-discharged.hdf
        Opened ".\single-resistor-complex-charge_from-discharged.hdf" with sec2 driver.
        /                        Group
            Attribute: CLASS scalar
                Type:      5-byte null-terminated UTF-8 string
            Attribute: PYTABLES_FORMAT_VERSION scalar
                Type:      3-byte null-terminated UTF-8 string
            Attribute: TITLE null
                Type:      1-byte null-terminated UTF-8 string
            Attribute: VERSION scalar
                Type:      3-byte null-terminated UTF-8 string
            Attribute: battdat_version scalar
                Type:      5-byte null-terminated UTF-8 string
            Attribute: json_schema scalar
                Type:      8816-byte null-terminated ASCII string
            Attribute: metadata scalar
                Type:      242-byte null-terminated UTF-8 string
            Location:  1:96
            Links:     1
        /raw_data                Dataset {3701/Inf}
            Attribute: CLASS scalar
                Type:      5-byte null-terminated UTF-8 string
            Attribute: FIELD_0_FILL scalar
                Type:      native double
            Attribute: FIELD_0_NAME scalar
                Type:      9-byte null-terminated UTF-8 string
            Attribute: FIELD_1_FILL scalar
                Type:      native double
            Attribute: FIELD_1_NAME scalar
                Type:      7-byte null-terminated UTF-8 string
            Attribute: FIELD_2_FILL scalar
                Type:      native double
            Attribute: FIELD_2_NAME scalar
                Type:      7-byte null-terminated UTF-8 string
            Attribute: FIELD_3_FILL scalar
                Type:      native long long
            Attribute: FIELD_3_NAME scalar
                Type:      12-byte null-terminated UTF-8 string
            Attribute: NROWS scalar
                Type:      native long long
            Attribute: TITLE null
                Type:      1-byte null-terminated UTF-8 string
            Attribute: VERSION scalar
                Type:      3-byte null-terminated UTF-8 string
            Attribute: json_schema scalar
                Type:      2824-byte null-terminated UTF-8 string
            Attribute: metadata scalar
                Type:      2824-byte null-terminated UTF-8 string
            Location:  1:10240
            Links:     1
            Chunks:    {2048} 65536 bytes
            Storage:   118432 logical bytes, 6670 allocated bytes, 1775.59% utilization
            Filter-0:  shuffle-2 OPT {32}
            Filter-1:  deflate-1 OPT {9}
            Type:      struct {
                           "test_time"        +0    native double
                           "current"          +8    native double
                           "voltage"          +16   native double
                           "cycle_number"     +24   native long long
                       } 32 bytes

Multiple Batteries per File
+++++++++++++++++++++++++++

Data from multiple batteries can share a single HDF5 file as long as they share the same metadata.
Add multiple batteries into an HDF5 file by providing a "prefix" to name each cell.

.. code-block:: python

    test_a.to_battdat_hdf('test.h5', prefix='a')
    test_b.to_battdat_hdf('test.h5', prefix='b', overwrite=False)  # overwrite=False is mandatory when adding to an existing file

Load a specific cell by providing its prefix:

.. code-block:: python

    test_a = BatteryDataset.from_battdat_hdf('test.h5', prefix='a')

or load one of the included cells by providing its index:

.. code-block:: python

    test_a = BatteryDataset.from_battdat_hdf('test.h5', prefix=0)

Load all cells by iterating over them:

.. code-block:: python

    for name, cell in BatteryDataset.all_cells_from_battdat_hdf('test.h5'):
        do_some_processing(cell)

Parquet
-------

The Apache Parquet format is designed for high performance I/O of tabular data.
``battdat`` stores each type of data in a separate file and the metadata
in the file-level metadata of each file.

.. code-block:: python

    from pyarrow import parquet as pq
    import json

    # Reading the metadata
    file_metadata = pq.read_metadata('raw_data.parquet')  # Parquet metadata
    metadata = json.loads(file_metadata.metadata[b'battery_metadata'])  # For the battery
    schema = json.loads(file_metadata.metadata[b'table_metadata'])  # For the columns

    # Reading the data
    table = pq.read_table('raw_data.parquet')  # In pyarrow's native Table format
    df = table.to_pandas()  # As a dataframe

A Parquet file saved by ``battdat`` has the same column names and data types as the data provided when saving.
Numeric types keep their original precision (e.g., ``float32`` vs ``float64``),
and times are stored as floating point numbers rather than in Parquet's time format.
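
The type-preservation behavior described above can be checked with ``pyarrow`` alone.
The following is a minimal sketch, not ``battdat``'s own writer: it builds a small table by hand,
attaches JSON-encoded metadata under the same two keys used in the reading example,
and confirms that the column types survive a round trip.
The column values and the metadata content (``hypothetical-cell``, the ``test_time`` description)
are made up for illustration.

.. code-block:: python

    import json
    import os
    import tempfile

    import pyarrow as pa
    from pyarrow import parquet as pq

    # A small table with mixed float widths; time is a plain float, in seconds
    table = pa.table({
        'test_time': pa.array([0.0, 1.0, 2.0], type=pa.float64()),
        'current': pa.array([0.5, 0.5, 0.5], type=pa.float32()),
        'voltage': pa.array([3.0, 3.1, 3.2], type=pa.float64()),
    })

    # Attach JSON-encoded file-level metadata (illustrative content only)
    table = table.replace_schema_metadata({
        'battery_metadata': json.dumps({'name': 'hypothetical-cell'}),
        'table_metadata': json.dumps({'test_time': 'Time since test start, in seconds'}),
    })

    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'raw_data.parquet')
        pq.write_table(table, path)

        # read_schema loads only the file footer, not the column data
        schema = pq.read_schema(path)

    # Column types and metadata as recovered from the file
    column_types = {field.name: str(field.type) for field in schema}
    battery_metadata = json.loads(schema.metadata[b'battery_metadata'])
    print(column_types)
    print(battery_metadata)

Reading only the schema is a cheap way to inspect column types and metadata before loading a large file with ``pq.read_table``.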