[Reference Site]
How to resize an HDF5 array with h5py
H5py adding more data to an existing dataset
Datasets - h5py 3.7.0 documentation
Example of resizing an array with h5py using a Python class.
HDF5 has the advantage of reading and writing large volumes of data quickly.
[Reference Site] Comparing storage and read time between PNG files and HDF5 (highly recommended)
Three Ways of Storing and Accessing Lots of Images in Python - Real Python
It also makes it easy to store data split up by category or by attribute.
[Component of h5py]
[Mode of h5py]
Mode | Description |
---|---|
r | Read only, file must exist (default) |
r+ | Read/ write, file must exist |
w | Create file, truncate if exists |
w- or x | Create file, fail if exists |
a | Read/ write if exists, create otherwise |
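A minimal sketch of how the modes in the table behave. The scratch directory and the file name `modes_demo.h5` are assumptions made for this illustration only. It also shows that a file opened with 'a' reports its mode as 'r+'.

```python
import os
import tempfile

import h5py

# Hypothetical scratch location, used only for this illustration.
demo_path = os.path.join(tempfile.mkdtemp(), "modes_demo.h5")

# 'w' creates the file (and would truncate an existing one).
with h5py.File(demo_path, "w") as f:
    f.create_dataset("x", data=[1, 2, 3])

# 'w-' (or 'x') refuses to overwrite an existing file.
try:
    h5py.File(demo_path, "w-")
    exclusive_create_failed = False
except OSError:
    exclusive_create_failed = True

# 'r' opens the existing file read-only.
with h5py.File(demo_path, "r") as f:
    read_mode = f.mode

# 'a' opens the existing file read/write and keeps its contents.
with h5py.File(demo_path, "a") as f:
    append_mode = f.mode
    f.create_dataset("y", data=[4])
    names_after_append = sorted(f.keys())

print(exclusive_create_failed, read_mode, append_mode, names_after_append)
```

Using `with` closes the file automatically, which avoids leaving a handle open if an exception is raised.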
[Function of h5py]
Function | Description |
---|---|
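h5py.File(path, mode) | Opens the file at path for reading or writing according to mode; path is a string |
create_group(name) | Creates a group called name; name is a string |
create_dataset(name) | Creates a dataset called name; name is a string |
name.attrs[attribute] = ~ | Sets the attribute called attribute on the group name to the value ~; attribute is a string |
name | Returns the path of the object inside the file (for example '/' for the file root) |
keys() | Returns the names of the members directly under this path |
values() | Returns the member objects (groups and datasets) under this path |
close() | Closes the file and releases the associated h5py resources |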
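The calls in the table can be tried together in one short script. This is only a sketch: the file name `functions_demo.h5`, the group name `images`, and the attribute `category` are hypothetical names chosen for the example.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file in a scratch directory.
demo_path = os.path.join(tempfile.mkdtemp(), "functions_demo.h5")

f = h5py.File(demo_path, "w")
grp = f.create_group("images")                       # hypothetical group name
dset = grp.create_dataset("first", data=np.zeros((2, 3)))
grp.attrs["category"] = "panorama"                   # hypothetical attribute

group_path = grp.name                                # path of the group in the file
dataset_path = dset.name                             # path of the dataset in the file
root_keys = list(f.keys())                           # member names under the root
member_shapes = [d.shape for d in grp.values()]      # member objects under the group
category = grp.attrs["category"]

f.close()
is_open_after_close = bool(f)                        # a closed File is falsy

print(group_path, dataset_path, root_keys, member_shapes, category)
```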
[Example of h5py]
Code that builds the structure shown in the figure above
import h5py
import numpy as np

# Set the file path
h5_filename = "~"
# Create the HDF5 file and open it for writing
hw = h5py.File(h5_filename, 'w')
# Make a group named /subPano
hw.create_group('/subPano')
# Make a dataset named "/subPano/1" and store data1 in it
idx1 = "/subPano" + "/1"
data1 = np.arange(10)
hw.create_dataset(idx1, data=data1)
# Make a dataset named "/subPano/2" and store data2 in it
idx2 = "/subPano" + "/2"
data2 = np.arange(20)
hw.create_dataset(idx2, data=data2)
# Close the file and release its resources
hw.close()
# Open the existing file in read-only mode
h5 = h5py.File(h5_filename, 'r')
Debugging results based on the figure above
########################################################
$ (Pdb) h5
-> <HDF5 file "subpanoDB.h5" (mode r)>
$ (Pdb) type(h5)
-> <class 'h5py._hl.files.File'>
$ (Pdb) h5.name
-> '/'
$ (Pdb) h5.keys()
-> <KeysViewHDF5 ['subPano']>
$ (Pdb) h5.values()
-> ValuesViewHDF5(<HDF5 file "subpanoDB.h5" (mode r)>)
########################################################
$ (Pdb) type(h5['subPano'])
-> <class 'h5py._hl.group.Group'>
$ (Pdb) h5['subPano'].name
-> '/subPano'
$ (Pdb) h5['subPano'].keys()
-> <KeysViewHDF5 ['1', '2']>
$ (Pdb) h5['subPano'].values()
-> ValuesViewHDF5(<HDF5 group "/subPano" (2 members)>)
########################################################
$ (Pdb) h5['subPano']['1']
-> <HDF5 dataset "1": shape (10,), type "<i8">
$ (Pdb) h5['/subPano/1']
-> <HDF5 dataset "1": shape (10,), type "<i8">
$ (Pdb) h5['subPano']['2']
-> <HDF5 dataset "2": shape (20,), type "<i8">
$ (Pdb) h5['/subPano/2']
-> <HDF5 dataset "2": shape (20,), type "<i8">
$ (Pdb) h5['/subPano/1'].shape
-> (10,)
$ (Pdb) h5['/subPano/2'].shape
-> (20,)
########################################################
$ (Pdb) type(h5['/subPano/1'][:])
-> <class 'numpy.ndarray'>
$ (Pdb) h5['/subPano/1'][:]
-> array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
$ (Pdb) h5['/subPano/1'][0]
-> 0
$ (Pdb) h5['/subPano/1'][1]
-> 1
$ (Pdb) h5['/subPano/2'][:]
-> array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
$ (Pdb) h5['/subPano/2'][0]
-> 0
$ (Pdb) h5['/subPano/2'][1]
-> 1
########################################################
####################### ERROR ! ########################
$ (Pdb) h5['subpano'].values()
-> *** AttributeError: 'Dataset' object has no attribute 'values'
$ (Pdb) h5['subpano'].keys()
-> *** AttributeError: 'Dataset' object has no attribute 'keys'
$ (Pdb) h5['/subPano/1'][]
-> *** SyntaxError: invalid syntax
$ (Pdb) h5['/subPano/2/0']
-> *** KeyError: 'Unable to open object (message type not found)'
$ (Pdb) h5['/subPano/2/1']
-> *** KeyError: 'Unable to open object (message type not found)'
$ (Pdb) h5['subpano'][:].values()
-> *** AttributeError: 'numpy.ndarray' object has no attribute 'values'
$ (Pdb) h5['subpano'][:].name
-> *** AttributeError: 'numpy.ndarray' object has no attribute 'name'
$ (Pdb) h5['subpano'][:].keys()
-> *** AttributeError: 'numpy.ndarray' object has no attribute 'keys'
########################################################
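The errors in the session above come from treating a dataset like a group: keys() and values() exist only on groups and files, and individual elements are read by indexing the dataset, not by extending the path. A minimal sketch of telling the two apart follows; the helper `describe` and the file name `inspect_demo.h5` are hypothetical and not part of h5py.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file in a scratch directory.
demo_path = os.path.join(tempfile.mkdtemp(), "inspect_demo.h5")

with h5py.File(demo_path, "w") as f:
    # Intermediate groups such as /subPano are created automatically.
    f.create_dataset("/subPano/1", data=np.arange(10))
    f.create_dataset("/subPano/2", data=np.arange(20))


def describe(node):
    """Hypothetical helper: report whether a node is a group or a dataset."""
    if isinstance(node, h5py.Group):
        return "group", sorted(node.keys())
    if isinstance(node, h5py.Dataset):
        return "dataset", node.shape
    return "other", None


with h5py.File(demo_path, "r") as f:
    group_info = describe(f["subPano"])
    dataset_info = describe(f["/subPano/2"])
    # Elements are read by indexing the dataset, not via '/subPano/2/1'.
    second_value = int(f["/subPano/2"][1])

print(group_info, dataset_info, second_value)
```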
[Resize h5py]
Sample Code
import h5py
import numpy as np

# Set the file path
h5_filename = "~"
# Initialization
path = 'subpano'
data1 = np.arange(10)
data2 = np.arange(20)
lend1 = data1.shape[0]
lend2 = data2.shape[0]
# Create the initial dataset; maxshape=(None,) makes axis 0 resizable
h5 = h5py.File(h5_filename, 'w')
h5.create_dataset(path, data=data1, maxshape=(None,))
h5.close()
# Reopen the file and grow the dataset to hold both arrays
h5 = h5py.File(h5_filename, 'a')
total_len = lend1 + lend2
h5[path].resize((total_len,))
# Write data2 into the newly added region
h5[path][lend1:] = data2
########################################################
###################### RESULT ! ########################
$ h5[path]
-> <HDF5 dataset "subpano": shape (30,), type "<i8">
$ h5[path][:]
-> array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6,
7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
$ print("h5: ", h5)
-> h5: <HDF5 file "subpanoDB.h5" (mode r+)>
$ print("h5 type: ", type(h5))
-> h5 type: <class 'h5py._hl.files.File'>
$ print("h5[path]: ", h5[path])
-> h5[path]: <HDF5 dataset "subpano": shape (30,), type "<i8">
$ print("h5[path] type: ", type(h5[path]))
-> h5[path] type: <class 'h5py._hl.dataset.Dataset'>
$ print("h5[path].shape: ", h5[path].shape)
-> h5[path].shape: (30,)
$ print("h5[path][:]: ", h5[path][:])
-> h5[path][:]: [ 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19]
$ print("h5[path][:] type: ", type(h5[path][:]))
-> h5[path][:] type: <class 'numpy.ndarray'>
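The resize-then-write pattern above can be wrapped in a small helper so that data can be appended repeatedly. This is a sketch for 1-D datasets only; `append_rows` and the file name `append_demo.h5` are hypothetical names, not h5py functions.

```python
import os
import tempfile

import h5py
import numpy as np


def append_rows(dset, new_data):
    """Hypothetical helper: grow a 1-D resizable dataset and append new_data."""
    new_data = np.asarray(new_data)
    old_len = dset.shape[0]
    dset.resize((old_len + new_data.shape[0],))
    dset[old_len:] = new_data
    return dset.shape[0]


# Hypothetical file in a scratch directory.
demo_path = os.path.join(tempfile.mkdtemp(), "append_demo.h5")

# Create the dataset with an unlimited first axis.
with h5py.File(demo_path, "w") as f:
    f.create_dataset("subpano", data=np.arange(10), maxshape=(None,))

# Reopen and append a second array.
with h5py.File(demo_path, "a") as f:
    new_len = append_rows(f["subpano"], np.arange(20))
    combined = f["subpano"][:]

print(new_len, combined)
```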
[Error List]
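[Solution] When creating the dataset, set maxshape to None and also pass the flag chunks=True so that the dataset uses chunked storage and can change size. The h5py documentation notes that auto-chunking is also enabled when maxshape is given, so chunks=True mainly makes this explicit.
create_dataset(path, data=data1, maxshape=(None,), chunks=True)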
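To see why chunked storage matters, the following sketch compares a dataset created with default settings against one created with maxshape and chunks=True. The dataset names `fixed` and `growable` and the file name are assumptions for this illustration.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file in a scratch directory.
demo_path = os.path.join(tempfile.mkdtemp(), "chunks_demo.h5")

with h5py.File(demo_path, "w") as f:
    # Default settings: contiguous storage with a fixed shape.
    fixed = f.create_dataset("fixed", data=np.arange(10))
    # Chunked storage with an unlimited first axis.
    growable = f.create_dataset(
        "growable", data=np.arange(10), maxshape=(None,), chunks=True
    )

    fixed_chunks = fixed.chunks                  # None for contiguous storage
    growable_is_chunked = growable.chunks is not None

    # Resizing the contiguous dataset is rejected.
    try:
        fixed.resize((30,))
        fixed_resize_error = None
    except TypeError as exc:
        fixed_resize_error = str(exc)

    # Resizing the chunked dataset succeeds.
    growable.resize((30,))
    growable_shape = growable.shape

print(fixed_chunks, growable_is_chunked, fixed_resize_error, growable_shape)
```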