Introduction to HDF5 Indexing Survey

The content provides an overview of indexing in HDF5, discussing its flexibility, storage capabilities, the problem of data access, solutions through indexing, existing implementations, and PyTables as a tool for efficient data handling in HDF5 files.

  • HDF5
  • Indexing
  • Data Storage
  • PyTables
  • Technology

Presentation Transcript


  1. Indexing HDF5: A Survey
     Joel Plutchak, The HDF Group, Champaign, Illinois, USA
     This work was supported by NASA/GSFC under Raytheon Co. contract number NNG10HP02C.
     DM_PPT_NP_v01 SESIP_0715_JP

  2. The Technology
     The HDF5 hierarchical data file format and API are flexible: they support self-describing, portable, and compact storage, as well as efficient I/O. HDF5 is a well-described and well-supported format that is used in a wide variety of disciplines.
     SESIP_0715_JP July 14, 2015 DM_PPT_NP_v01

  3. The Problem
     The HDF5 API does not include mechanisms to efficiently find and access data based on data values, the way one would query a relational database. Members of the HDF community have developed this capability so that their applications can:
     • quickly access targeted pieces of data
     • rapidly search and select interesting portions of data based on ad hoc search criteria

  4. A Solution
     Solutions to this problem are called indexing. A layer added between the HDF5 API and the application builds an index on one or more parameters, saving enough information in the index to find and retrieve specific parts of one or more datasets in an HDF5 file more efficiently.
     [Diagram labels: Index, HDF5 API, HDF5 File, Application, Query]
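
To make the idea of an index layer concrete, here is a minimal sketch in plain Python. It is not taken from any of the packages surveyed: it assumes the data is a small in-memory list standing in for an HDF5 dataset, and the helper names `build_index` and `range_query` are hypothetical. The index stores the values in sorted order together with their original row numbers, so a range query becomes two binary searches instead of a full scan.

```python
from bisect import bisect_left, bisect_right


def build_index(values):
    """Return (sorted_values, positions): the values in ascending order
    and, for each, the row number it had in the original data."""
    order = sorted(range(len(values)), key=values.__getitem__)
    return [values[i] for i in order], order


def range_query(index, lo, hi):
    """Row numbers whose value v satisfies lo <= v <= hi.

    Two binary searches locate the matching slice of the sorted values;
    the stored positions map that slice back to original rows.
    """
    sorted_values, positions = index
    start = bisect_left(sorted_values, lo)
    stop = bisect_right(sorted_values, hi)
    return sorted(positions[start:stop])


# Hypothetical data standing in for one column of an HDF5 dataset.
temperatures = [41.2, 38.5, 40.0, 45.1, 39.1, 42.6]
index = build_index(temperatures)
hits = range_query(index, 39.1, 42.6)
```

The trade-off is the usual one: building and storing the index costs time and space up front, and the index must be rebuilt or updated when the data changes.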

  5. Implementations
     Implementations exist for adding indexed access to HDF5 files. A few of them are:
     • PyTables
     • FastQuery / FastBit
     • Alacrity
     • HDF5 (prototype)
     • Other experimental work in progress

  6. PyTables
     • Uses the Python programming language
     • Built on top of the HDF5 library and the NumPy package
     • Uses Optimized Partially Sorted Index (OPSI) technology, designed for fast access to very large (>100M rows) tables

  7. PyTables Example
     Create a table:
         table = h5file.create_table(group, 'readout', Particle, "Readout example")
     Query a table:
         condition = '(name == "Particle: 5") | (name == "Particle: 7")'
         for record in table.where(condition):
             pass  # do something with "record"

  8. PyTables Limitations
     • No support for relationships between datasets
     Future work:
     • No specifics; a continuing effort that welcomes additional developers, testers, and users
     • Future maintenance and extended development proposals underway
     • The HDF Group is very interested in taking a significant role in this work as it moves forward.

  9. Alacrity
     • ALACRITY: Analytics-Driven Lossless Data Compression for Rapid In-Situ Indexing, Storing, and Querying
     • Exploits the representation of floating-point values by binning on significant bits, using an inverted index to map each bin to the positions of the values it contains
     • The software is a research vehicle for a group at North Carolina State University
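
The binning idea can be sketched in a few lines of plain Python. This is an illustration only, not the ALACRITY code or its on-disk layout: the function names, the choice of 16 leading bits, and the restriction to non-negative values and bounds are all assumptions made for the example. Each value is assigned to a bin by the leading bits of its IEEE-754 representation (sign, exponent, and the first mantissa bits), and an inverted index maps each bin to the positions of its values.

```python
import struct
from collections import defaultdict


def bin_key(x, bits=16):
    """Leading `bits` bits of the IEEE-754 double representation of x."""
    (raw,) = struct.unpack(">Q", struct.pack(">d", x))
    return raw >> (64 - bits)


def build_inverted_index(values, bits=16):
    """Map each bin key to the list of positions whose value falls in it."""
    index = defaultdict(list)
    for pos, v in enumerate(values):
        index[bin_key(v, bits)].append(pos)
    return index


def range_query(values, index, lo, hi, bits=16):
    """Positions with lo <= value <= hi.

    Assumes 0 <= lo <= hi and non-negative data, so that bin keys
    increase with the values they contain.
    """
    lo_key, hi_key = bin_key(lo, bits), bin_key(hi, bits)
    hits = []
    for key, positions in index.items():
        if lo_key < key < hi_key:
            # The whole bin lies inside the range: no value check needed.
            hits.extend(positions)
        elif key in (lo_key, hi_key):
            # Boundary bin: check the actual values.
            hits.extend(p for p in positions if lo <= values[p] <= hi)
    return sorted(hits)


values = [0.5, 1.5, 2.25, 3.0, 100.0, 2.5]
index = build_inverted_index(values)
hits = range_query(values, index, 1.0, 3.0)
```

Only the two boundary bins require looking at the data itself; every bin strictly between them is accepted from the index alone, which is where the speed-up comes from.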

  10. FastQuery / FastBit
      • FastQuery is an extension to HDF5 from the Visualization Group at Lawrence Berkeley National Laboratory (LBNL)
      • Based on LBNL's FastBit, an efficient searching technology that uses bitmap indexing for processing complex, multi-dimensional ad hoc queries on read-only numeric data
      • Extends HDF5's hyperslab selection mechanism to allow arbitrary range conditions on the data values contained in the datasets
      • Compound queries can span multiple datasets
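
As a rough illustration of bitmap indexing, the sketch below uses plain Python integers as bitmaps. It does not use FastBit or FastQuery; the function names, the sample data, and the bin edges are hypothetical, and the query bounds are assumed to line up with bin edges so that no boundary checks are needed. Each bin gets one bitmap in which bit i is set when row i falls in that bin. A range condition is the OR of the bitmaps it covers, and a compound query across datasets is the AND of the per-dataset results.

```python
def build_bitmap_index(values, edges):
    """One bitmap (a Python int) per bin.

    Bin b covers edges[b] <= v < edges[b + 1]; bit i of bitmaps[b]
    is set when values[i] falls in bin b.
    """
    bitmaps = [0] * (len(edges) - 1)
    for row, v in enumerate(values):
        for b in range(len(edges) - 1):
            if edges[b] <= v < edges[b + 1]:
                bitmaps[b] |= 1 << row
                break
    return bitmaps


def rows_in_bins(bitmaps, first, last):
    """OR together the bitmaps of bins first..last (inclusive)."""
    result = 0
    for b in range(first, last + 1):
        result |= bitmaps[b]
    return result


def to_rows(bitmap):
    """Convert a bitmap back to a sorted list of row numbers."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Two hypothetical datasets with the same number of rows.
temperature = [12.0, 25.5, 31.0, 18.2, 27.9]
pressure = [990.0, 1012.0, 1005.0, 1020.0, 998.0]

temp_index = build_bitmap_index(temperature, [0, 10, 20, 30, 40])
pres_index = build_bitmap_index(pressure, [980, 1000, 1020, 1040])

# Compound query: 20 <= temperature < 40 AND 1000 <= pressure < 1020.
hits = rows_in_bins(temp_index, 2, 3) & rows_in_bins(pres_index, 1, 1)
rows = to_rows(hits)
```

Because the index is built once and only read afterwards, this approach suits the read-only numeric data the slide mentions; production bitmap indexes also compress the bitmaps, which this sketch omits.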

  11. FastQuery / FastBit
      Assumptions:
      • Data is 0-3 dimensional and block-structured
      • Limited datatypes: float, double, int32, int64, byte
      • Two-level hierarchical organization: TimeStep, VariableName
      Future work:
      • Arbitrary nesting
      • More data schemas (unstructured, AMR, etc.)

  12. HDF5 Data Analysis Extensions
      The HDF Group is developing support for indexing and querying to enable application developers to create complex and high-performance queries on both metadata and data elements within an HDF5 container. These are in the form of objects and associated APIs:
      • Query Objects: The H5Q API is used to define a query and apply it to an HDF5 container
      • View Objects: The H5V API is used to generate a selection from a query
      • Index Objects: The H5X API is used to attach / build an index to data; it is plug-in based to leverage multiple technologies
      Note: These extensions were developed under Intel's subcontract with Lawrence Livermore National Security, LLC under U.S. Department of Energy contract DE-AC52-07NA27344.

  13. HDF5 Data Analysis Extensions Example
      Add index to existing dataset:
          /* file, dataset_name, dataset, and dataspace are assumed to be declared elsewhere */
          dataset = H5Dopen(file, dataset_name, H5P_DEFAULT);
          /* Add indexing information */
          H5Xcreate(dataset, H5X_PLUGIN_FASTBIT, H5P_DEFAULT);
          H5Dclose(dataset);
      Create and apply query:
          float query_lb = 39.1f, query_ub = 42.6f;
          hid_t query, query1, query2;
          /* Create a simple query: 39.1 < x */
          query1 = H5Qcreate(H5Q_TYPE_DATA_ELEM, H5Q_MATCH_GREATER_THAN, H5T_NATIVE_FLOAT, &query_lb);
          /* Create a second simple query: x < 42.6 */
          query2 = H5Qcreate(H5Q_TYPE_DATA_ELEM, H5Q_MATCH_LESS_THAN, H5T_NATIVE_FLOAT, &query_ub);
          /* Combine query: 39.1 < x < 42.6 */
          query = H5Qcombine(query1, H5Q_COMBINE_AND, query2);
          /* Use query to get selection */
          dataset = H5Dopen(file, dataset_name, H5P_DEFAULT);
          H5Dquery(dataset, query, &dataspace);
          /* Read data here using dataspace */
          H5Dclose(dataset);
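
The compose-then-apply pattern on this slide can be mimicked in plain Python to show what the combined query selects. This is a conceptual sketch only: the `Query` class and the helpers `greater_than`, `less_than`, and `apply_query` are hypothetical stand-ins, not bindings for the H5Q API, and the data is an in-memory list rather than an HDF5 dataset. The bounds 39.1 and 42.6 are the values of `query_lb` and `query_ub` declared in the slide's code.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Query:
    """A predicate over one data element (hypothetical stand-in for a query object)."""
    test: Callable[[float], bool]

    def __and__(self, other: "Query") -> "Query":
        # Analogue of combining two queries with a logical AND.
        return Query(lambda x: self.test(x) and other.test(x))


def greater_than(bound: float) -> Query:
    return Query(lambda x: x > bound)


def less_than(bound: float) -> Query:
    return Query(lambda x: x < bound)


def apply_query(data: List[float], query: Query) -> List[int]:
    """Return the selection: indices of the elements that satisfy the query."""
    return [i for i, x in enumerate(data) if query.test(x)]


data = [38.0, 39.1, 40.5, 42.0, 42.6, 44.3]
query = greater_than(39.1) & less_than(42.6)
selection = apply_query(data, query)
```

The sketch scans every element; the point of attaching an index to the dataset first is that the library can answer the same query without a full scan.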

  14. HDF5 Data Analysis Extensions Status
      Phase I status (2014):
      • Prototype implementations for H5Q, H5V, H5X APIs
      • H5X API plugins for Alacrity and FastBit technologies
      • Incremental update of data is not supported by indexing packages
      Current work (started July 1):
      • Views generated from queries to abstract selection results on multiple objects
      • Support for indexing on chunked datasets
      • Support for compound types
      • Support for parallel indexing
      • Query optimization
      • Additional indexing plugins

  15. Summary
      • A variety of index methods exist that can be used to speed targeted access to data in HDF5 files.
      • Capabilities and underlying technologies differ, so use the best fit for your application.
      • Work is ongoing; let developers know of your needs and experiences!

  16. References & Sources
      PyTables
      • http://www.pytables.org/index.html
      Alacrity
      • J. Jenkins, I. Arkatkar, S. Lakshminarasimhan, D. A. Boyuka II, E. Schendel, N. Shah, S. Ethier, C.-S. Chang, J. Chen, H. Kolla, R. Ross, S. Klasky, N. Samatova, ALACRITY: Analytics-Driven Lossless Data Compression for Rapid In-Situ Indexing, Storing, and Querying, Transactions on Large-Scale Data- and Knowledge-Centered Systems, Vol. 10 (2013).
      FastQuery / FastBit
      • http://www-vis.lbl.gov/Events/SC05/HDF5FastQuery/
      • K. Wu, FastBit: an efficient indexing technology for accelerating data-intensive science, Journal of Physics: Conference Series, vol. 16, no. 1 (2005).
      • HDF5-FastQuery: An API for Simplifying Access to Data Storage, Retrieval, Indexing and Querying, Report Number LBNL/PUB-958 (2006).
      HDF Data Analysis Extensions
      • J. Soumagne, Q. Koziol, RFC: Data Analysis Extensions, RFC THG 2014-07-17.v4, The HDF Group (2014).

  18. This work was supported by NASA/GSFC under Raytheon Co. contract number NNG10HP02C.
