39 Python Data Analysis, NumPy Basics: Reading and Writing Array Data to h5 Files with h5py

1 Python Data Analysis, NumPy Basics: Reading and Writing Array Data to h5 Files with h5py

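HDF5 stands for Hierarchical Data Format version 5, a file format for storing and managing large amounts of data. The HDF format has been under development for more than 20 years, and HDF5 is its latest version; it comprises a data model, a library, and a file format standard.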

HDF5 files usually use the .h5 or .hdf5 extension. The file structure has two primary objects: Groups and Datasets.

Groups are similar to folders. Every HDF5 file is itself the root group '/', which acts as a container that can hold one or more datasets as well as other groups.

Datasets are similar to NumPy arrays and can be treated as collections of array data.

Each dataset has two parts: the raw data values and the metadata that describes them.
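
The relationship between the root group, groups, datasets, and metadata can be sketched as follows. This is a minimal illustration, not part of the original walkthrough: the file name, the group name measurements, the dataset name temps, and the unit attribute are all hypothetical, and the file is written to a temporary directory.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file in a temporary directory, used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), "structure_demo.h5")

with h5py.File(path, "w") as f:
    # The File object itself acts as the root group "/".
    grp = f.create_group("measurements")
    # A dataset stores the raw values, much like a NumPy array.
    dset = grp.create_dataset("temps", data=np.array([20.5, 21.0, 19.8]))
    # Attributes hold metadata that travels with the dataset.
    dset.attrs["unit"] = "celsius"

with h5py.File(path, "r") as f:
    root_name = f.name                             # name of the root group
    members = list(f["measurements"].keys())       # objects inside the group
    values = f["measurements/temps"][:]            # raw data as a NumPy array
    unit = f["measurements/temps"].attrs["unit"]   # metadata

print(root_name, members, values, unit)
```

Objects are addressed with POSIX-style paths such as measurements/temps, which is what makes the folder analogy work.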

1.1 Installing h5py

Install the h5py library with pip install h5py.

D:\python39>pip3 install h5py
Collecting h5py
  Downloading h5py-3.10.0-cp39-cp39-win_amd64.whl (2.7 MB)
     |████████████████████████████████| 2.7 MB 79 kB/s
Requirement already satisfied: numpy>=1.17.3 in d:\python39\lib\site-packages (from h5py) (1.26.1)
Installing collected packages: h5py
Successfully installed h5py-3.10.0
WARNING: You are using pip version 20.2.3; however, version 24.0 is available.
You should consider upgrading via the 'd:\python39\python.exe -m pip install --upgrade pip' command.
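
After installation, a quick check like the one below can confirm that the package imports correctly. It is only a sketch; the versions printed depend on the local environment, so no expected output is shown.

```python
import h5py

# Version of the h5py package itself.
h5py_version = h5py.__version__
# Version of the HDF5 C library that h5py was built against.
hdf5_version = h5py.version.hdf5_version

print("h5py:", h5py_version)
print("HDF5:", hdf5_version)
```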

1.2 Reading and writing HDF5 files

Use h5py.File(file, mode) to create or open an h5 file, and create_dataset() to write an array to it.

Usage

h5py.File(name, mode='r')

Description

The File() function of the h5py library opens an existing h5 file or creates a new one, depending on mode.

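No.	mode	Description
1	r	Read-only; the file must exist (default)
2	r+	Read/write; the file must exist
3	w	Create the file; truncate it if it exists
4	w- or x	Create the file; fail if it exists
5	a	Read/write if the file exists; create it otherwise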
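
The existence rules in the table can be checked with a short sketch like the following. The paths and dataset name are hypothetical and live in a temporary directory. The code catches OSError because the more specific exception class raised for a missing or already existing file may vary between h5py versions.

```python
import os
import tempfile

import h5py

# Hypothetical paths in a temporary directory; neither file exists yet.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "modes_demo.h5")
other = os.path.join(tmpdir, "created_by_a.h5")

# r: the file must exist, so opening a missing file fails.
try:
    h5py.File(path, "r")
    read_missing_failed = False
except OSError:
    read_missing_failed = True

# x (same as w-): creates the file because it does not exist yet.
with h5py.File(path, "x") as f:
    f.create_dataset("a", data=[1, 2, 3])

# x again: fails because the file now exists.
try:
    h5py.File(path, "x")
    exclusive_failed = False
except OSError:
    exclusive_failed = True

# r+: read/write access to an existing file.
with h5py.File(path, "r+") as f:
    f["a"][0] = 10
    after_rplus = f["a"][:].tolist()

# a: creates the file when it is missing.
with h5py.File(other, "a") as f:
    pass
created_by_a = os.path.exists(other)

print(read_missing_failed, exclusive_failed, after_rplus, created_by_a)
```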

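Usage

create_dataset(name, shape=None, dtype=None, data=None, **kwds)

Description

The create_dataset() method of an h5py.File object writes a new dataset to the h5 file.

name: the dataset name; the array is stored and retrieved under this name.

data: the array data to write to the h5 file.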
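
The signature also accepts shape and dtype, which the description above does not cover. The sketch below shows both ways of calling create_dataset(): passing data directly, or reserving space with shape and dtype and filling it later. The file and dataset names are hypothetical.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "create_demo.h5")

with h5py.File(path, "w") as f:
    # Pass data: shape and dtype are inferred from the array.
    source = np.arange(6, dtype="i8").reshape(2, 3)
    from_data = f.create_dataset("from_data", data=source)

    # Pass shape and dtype only: the dataset is created first and filled later.
    empty = f.create_dataset("empty", shape=(2, 3), dtype="f4")
    empty[0, :] = [1.5, 2.5, 3.5]  # fill the first row through slicing

    info = (from_data.shape, str(from_data.dtype), empty.shape, str(empty.dtype))
    filled = empty[:].tolist()

print(info)
print(filled)
```

Elements that are never written keep the dataset's fill value, which is zero unless another value is requested.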

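With mode w, the file is truncated when it is opened, so any content already in the h5 file is discarded. Within one open session, repeated calls to create_dataset() do not overwrite each other.

With mode a, the existing content of the h5 file is preserved: create_dataset() adds new datasets, and existing datasets are changed in place through slice assignment.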
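
The difference between the two modes can be sketched as follows. The dataset names first, second, and third are hypothetical, and the file is written to a temporary directory.

```python
import os
import tempfile

import h5py

# Hypothetical file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "w_vs_a_demo.h5")

# Mode w: the file is truncated once, at the moment it is opened.
with h5py.File(path, "w") as f:
    f.create_dataset("first", data=[1, 2, 3])
    f.create_dataset("second", data=[4, 5, 6])
    in_same_session = sorted(f.keys())  # both datasets coexist

# Mode a: existing datasets survive and new ones are added.
with h5py.File(path, "a") as f:
    f.create_dataset("third", data=[7, 8, 9])
    f["first"][0] = 100  # in-place change through slicing
    after_append = sorted(f.keys())
    first_values = f["first"][:].tolist()

# Reopening with w discards everything written before.
with h5py.File(path, "w") as f:
    after_reopen_w = list(f.keys())

print(in_same_session, after_append, first_values, after_reopen_w)
```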

Example

>>> import numpy as np
>>> import h5py
>>> ar1=np.arange(24).reshape(2,3,4)
>>> ar2=np.arange(24).reshape(1,3,8)
>>> fname1=r'E:\ls\h5f1.h5'
# Create an h5 file by opening it with h5py.File() in write mode
>>> h5f1=h5py.File(fname1,mode='w')
# Write the arrays to the h5 file
>>> h5f1.create_dataset('ar1',data=ar1)
<HDF5 dataset "ar1": shape (2, 3, 4), type "<i4">
>>> h5f1.create_dataset('ar2',data=ar2)
<HDF5 dataset "ar2": shape (1, 3, 8), type "<i4">
# Close the write handle so the data is flushed, then reopen the h5 file in read mode
>>> h5f1.close()
>>> h5f1=h5py.File(fname1,mode='r')
# Retrieve the arrays by slicing
>>> h5f1['ar1'][:]
array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]],

       [[12, 13, 14, 15],
        [16, 17, 18, 19],
        [20, 21, 22, 23]]])
>>> h5f1['ar2'][:]
array([[[ 0,  1,  2,  3,  4,  5,  6,  7],
        [ 8,  9, 10, 11, 12, 13, 14, 15],
        [16, 17, 18, 19, 20, 21, 22, 23]]])
>>> h5f1.close()
# Switch to mode a to add content to the file without truncating it
>>> h5f1=h5py.File(fname1,mode='a')
# A dataset that already exists cannot be created again with create_dataset()
>>> h5f1.create_dataset('ar2',data=[1,2])
Traceback (most recent call last):
  File "<pyshell#64>", line 1, in <module>
    h5f1.create_dataset('ar2',data=[1,2])
  File "D:\python39\lib\site-packages\h5py\_hl\group.py", line 183, in create_dataset
    dsid = dataset.make_new_dset(group, shape, dtype, data, name, **kwds)
  File "D:\python39\lib\site-packages\h5py\_hl\dataset.py", line 163, in make_new_dset
    dset_id = h5d.create(parent.id, name, tid, sid, dcpl=dcpl, dapl=dapl)
  File "h5py\_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py\_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py\h5d.pyx", line 137, in h5py.h5d.create
ValueError: Unable to synchronously create dataset (name already exists)
# Modify the existing dataset through slice assignment instead
>>> h5f1['ar2'][0,0]=[20,21,22,23,25,26,27,28]
>>> h5f1['ar2'][:]
array([[[20, 21, 22, 23, 25, 26, 27, 28],
        [ 8,  9, 10, 11, 12, 13, 14, 15],
        [16, 17, 18, 19, 20, 21, 22, 23]]])
>>> h5f1['ar1'][:]
array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]],

       [[12, 13, 14, 15],
        [16, 17, 18, 19],
        [20, 21, 22, 23]]])
# Adding a new dataset to the h5 file does not truncate the existing content
>>> h5f1.create_dataset('ar3',data=[1,2])
<HDF5 dataset "ar3": shape (2,), type "<i4">
>>> h5f1['ar3'][:]
array([1, 2])
>>> h5f1['ar2'][:]
array([[[20, 21, 22, 23, 25, 26, 27, 28],
        [ 8,  9, 10, 11, 12, 13, 14, 15],
        [16, 17, 18, 19, 20, 21, 22, 23]]])
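
The session above leaves the file open at the end. A variant of the same steps using with blocks, which close the file automatically, might look like the sketch below. It writes to a temporary directory rather than E:\ls, and the delete-then-create step for replacing a dataset with one of a different shape is an addition that the session above does not show.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical location; the session above used E:\ls\h5f1.h5 instead.
path = os.path.join(tempfile.mkdtemp(), "h5f1.h5")

ar1 = np.arange(24).reshape(2, 3, 4)
ar2 = np.arange(24).reshape(1, 3, 8)

# The with block closes the file even if an error occurs inside it.
with h5py.File(path, "w") as f:
    f.create_dataset("ar1", data=ar1)
    f.create_dataset("ar2", data=ar2)

with h5py.File(path, "a") as f:
    # Same shape: overwrite values in place through slicing.
    f["ar2"][0, 0] = [20, 21, 22, 23, 25, 26, 27, 28]
    # Different shape: delete the old dataset, then create it again.
    if "ar1" in f:
        del f["ar1"]
    f.create_dataset("ar1", data=[1, 2])

with h5py.File(path, "r") as f:
    new_ar1 = f["ar1"][:].tolist()
    first_row = f["ar2"][0, 0].tolist()

print(new_ar1)
print(first_row)
```

Checking "ar1" in f before deleting avoids the "name already exists" error shown in the traceback above.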