dcnum.write
Submodules
Classes
ChunkWriter | Convenience class for writing data outside the main loop
EventStash | Sort events into predefined arrays for bulk access
QueueWriterProcess | Write events from a queue to an .rtdc file
QueueWriterThread | Write events from a queue to an .rtdc file
HDF5Writer | Write deformability cytometry HDF5 data
Functions
copy_basins | Reassemble basin data in the output file
copy_features | Copy feature data from one HDF5 file to another
copy_metadata | Copy attributes, tables, and logs from one H5File to another
create_with_basins | Create an .rtdc file with basins
Package Contents
- class dcnum.write.ChunkWriter(path_out: pathlib.Path | dcnum.common.h5py.File, dq: collections.deque, write_queue_size: multiprocessing.sharedctypes.Synchronized, ds_kwds: dict | None = None, mode: str = 'a', parent_logger: logging.Logger | None = None, *args, **kwargs)
Bases: threading.Thread
Convenience class for writing data outside the main loop
Data are numpy arrays collected from a deque object
- Parameters:
path_out – Path to the output HDF5 file
dq (collections.deque) – collections.deque object from which data are taken using popleft().
write_queue_size – Multiprocessing value to which the size of dq is written periodically
ds_kwds – keyword arguments for dataset creation, passed to HDF5Writer
mode – HDF5 file opening mode, passed to HDF5Writer
- writer
- dq
- may_stop_loop = False
- must_stop_loop = False
- write_queue_size
- run()
Method representing the thread’s activity.
You may override this method in a subclass. The standard run() method invokes the callable object passed to the object’s constructor as the target argument, if any, with sequential and keyword arguments taken from the args and kwargs arguments, respectively.
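The pattern described for ChunkWriter (a thread that drains a collections.deque with popleft() while the main loop keeps producing) can be sketched with the standard library alone. The class DequeDrainer and the list used as a sink are hypothetical stand-ins, not dcnum code, and the meaning given to may_stop_loop and must_stop_loop here (stop once the deque is empty versus stop immediately) is an assumption based on their names.

```python
import collections
import threading
import time


class DequeDrainer(threading.Thread):
    """Hypothetical stand-in for a writer thread (not dcnum's ChunkWriter)."""

    def __init__(self, dq, sink):
        super().__init__()
        self.dq = dq                  # deque filled by the main loop
        self.sink = sink              # list standing in for the output file
        self.may_stop_loop = False    # assumed: stop once dq is empty
        self.must_stop_loop = False   # assumed: stop immediately

    def run(self):
        while not self.must_stop_loop:
            try:
                item = self.dq.popleft()
            except IndexError:
                # Nothing to write right now. Only leave the loop if the
                # producer said it is done AND the deque is still empty.
                if self.may_stop_loop and not self.dq:
                    break
                time.sleep(0.01)
            else:
                self.sink.append(item)


dq = collections.deque()
written = []
drainer = DequeDrainer(dq, written)
drainer.start()
for chunk in range(5):
    dq.append(chunk)            # the main loop only appends and moves on
drainer.may_stop_loop = True    # no more data will arrive
drainer.join()
```

Appending to and popping from opposite ends of a deque is thread-safe in CPython, which is why no explicit lock appears in this sketch.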
- class dcnum.write.EventStash(index_offset: int, feat_nevents: list[int])
Sort events into predefined arrays for bulk access
- Parameters:
index_offset – The index offset at which we are working. Normally, feat_nevents is just a slice of a larger array, and index_offset defines the position at which that slice was taken.
feat_nevents – List that defines how many events there are for each input frame. If summed up, this defines self.size.
- events
Dictionary containing the event arrays
- feat_nevents
List containing the number of events per input frame
- nev_idx
Cumulative sum of feat_nevents for determining sorting offsets
- size
Number of events in this stash
- num_frames
Number of frames in this stash
- index_offset
Global offset compared to the original data instance.
- indices_for_data
Array containing the indices in the original data instance. These indices correspond to the events in events.
- _tracker
Private array that tracks the progress.
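How the cumulative sum of feat_nevents yields a sorting offset for each frame can be illustrated with plain lists. The class MiniStash below is a hypothetical, single-feature simplification; its method names add_events and is_complete are assumptions chosen for this sketch and are not taken from the documentation above.

```python
from itertools import accumulate


class MiniStash:
    """Hypothetical list-based illustration (not dcnum's EventStash)."""

    def __init__(self, index_offset, feat_nevents):
        self.index_offset = index_offset
        self.feat_nevents = list(feat_nevents)
        # Cumulative sum: nev_idx[i] is where the events of frame i end
        self.nev_idx = list(accumulate(self.feat_nevents))
        self.size = sum(self.feat_nevents)
        self.num_frames = len(self.feat_nevents)
        self.events = [None] * self.size        # one feature only, for brevity
        self._tracker = [False] * self.num_frames

    def add_events(self, frame_index, values):
        """Place the events of one global frame index in their final slot."""
        idx_loc = frame_index - self.index_offset
        stop = self.nev_idx[idx_loc]
        start = stop - self.feat_nevents[idx_loc]
        self.events[start:stop] = values
        self._tracker[idx_loc] = True

    def is_complete(self):
        """Return True once every frame of this stash has been added."""
        return all(self._tracker)


stash = MiniStash(index_offset=10, feat_nevents=[2, 0, 3])
stash.add_events(12, ["c0", "c1", "c2"])   # frames arrive out of order
stash.add_events(10, ["a0", "a1"])
stash.add_events(11, [])                   # a frame without events
```

Because every frame knows its slot in advance, the arrival order does not matter, and the filled array can be handed on in one piece once all frames are present.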
- class dcnum.write.QueueWriterProcess(log_queue: multiprocessing.Queue, *args, **kwargs)
Bases: dcnum.write.queue_writer_base.QueueWriterBase, mp_spawn
Write events from a queue to an .rtdc file
Events coming from a queue cannot be guaranteed to be in order. The QueueWriterThread uses an EventStash to sort events into the correct order before sending them to the ChunkWriter for storage.
- Parameters:
event_queue – A queue object to which other processes or threads write events as tuples (frame_index, events_dict).
write_queue_size – A mp.Value that is populated with the number of event chunks waiting to be written to the output file by the ChunkWriter.
feat_nevents – This 1D array contains the number of events for each frame in the input data. This serves two purposes: (1) it allows us to determine how many events we are writing when we are writing data from write_threshold frames, and (2) it allows us to keep track of how many frames have actually been processed (and for which we can thus expect entries in event_queue). If an entry in this array is -1, this means that there is no event in event_queue. See write_threshold below.
path_out – Output path for writer
hdf5_dataset_kwargs – Dictionary of keyword arguments (e.g. “compression”) for HDF5 dataset creation.
write_threshold – This integer defines how many frames should be collected at once and put into writer_dq. For instance, with a value of 500, at least 500 items are taken from the event_queue (they should match the expected frame index; frame indices that do not match are kept in an EventStash). Then, for each frame, we may have multiple or no events, so the output size could be 513, which is computed via np.sum(feat_nevents[idx:idx+write_threshold]).
- log_queue
queue for logging
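The relation between write_threshold and the number of events written per batch is simple arithmetic over feat_nevents. The sketch below uses made-up counts and a plain list instead of a numpy array, and it reads an entry of -1 as "this frame has not been processed yet", which is an interpretation of the description above rather than confirmed behavior.

```python
# Hypothetical per-frame event counts; -1 is read here as "not processed yet"
feat_nevents = [1, 0, 2, 3, 1, -1, -1]
write_threshold = 4
idx = 0  # index of the first frame in the current batch

batch = feat_nevents[idx:idx + write_threshold]
batch_ready = all(n >= 0 for n in batch)   # every frame has been counted
num_events = sum(batch)                     # events written for this batch

# The following batch still contains unprocessed frames
next_batch = feat_nevents[idx + write_threshold:idx + 2 * write_threshold]
next_ready = all(n >= 0 for n in next_batch)
```

Four frames yield six events here, which shows why the output size of a batch generally differs from write_threshold.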
- class dcnum.write.QueueWriterThread(*args, **kwargs)
Bases: dcnum.write.queue_writer_base.QueueWriterBase, threading.Thread
Write events from a queue to an .rtdc file
Events coming from a queue cannot be guaranteed to be in order. The QueueWriterThread uses an EventStash to sort events into the correct order before sending them to the ChunkWriter for storage.
- Parameters:
event_queue – A queue object to which other processes or threads write events as tuples (frame_index, events_dict).
write_queue_size – A mp.Value that is populated with the number of event chunks waiting to be written to the output file by the ChunkWriter.
feat_nevents – This 1D array contains the number of events for each frame in the input data. This serves two purposes: (1) it allows us to determine how many events we are writing when we are writing data from write_threshold frames, and (2) it allows us to keep track of how many frames have actually been processed (and for which we can thus expect entries in event_queue). If an entry in this array is -1, this means that there is no event in event_queue. See write_threshold below.
path_out – Output path for writer
hdf5_dataset_kwargs – Dictionary of keyword arguments (e.g. “compression”) for HDF5 dataset creation.
write_threshold – This integer defines how many frames should be collected at once and put into writer_dq. For instance, with a value of 500, at least 500 items are taken from the event_queue (they should match the expected frame index; frame indices that do not match are kept in an EventStash). Then, for each frame, we may have multiple or no events, so the output size could be 513, which is computed via np.sum(feat_nevents[idx:idx+write_threshold]).
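Why events need to be reordered at all can be seen in a small standalone sketch. It consumes (frame_index, events_dict) tuples from a queue.Queue and buffers frames that arrive too early in a dict. This buffer is a deliberately simpler technique than the EventStash mentioned above, and the feature name "area" and all values are illustrative only.

```python
import queue

event_queue = queue.Queue()
# Producers finish frames in arbitrary order
for frame_index in (2, 0, 3, 1):
    event_queue.put((frame_index, {"area": [float(frame_index)]}))

pending = {}    # frames that arrived before their turn
ordered = []    # what would be handed on for writing
expected = 0    # next frame index to emit

while not event_queue.empty():
    frame_index, events_dict = event_queue.get()
    pending[frame_index] = events_dict
    # Emit every frame that is now contiguous with what was already emitted
    while expected in pending:
        ordered.append((expected, pending.pop(expected)))
        expected += 1
```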
- class dcnum.write.HDF5Writer(obj: dcnum.common.h5py.File | pathlib.Path | str, mode: str = 'a', ds_kwds: dict | None = None)
Write deformability cytometry HDF5 data
- Parameters:
obj (h5py.File | pathlib.Path | str) – object to instantiate the writer from; if this is already an h5py.File object, then it is used, otherwise the argument is passed to h5py.File
mode (str) – opening mode when using h5py.File
ds_kwds (dict) – keyword arguments with which to initialize new Datasets (e.g. compression)
- events
- ds_kwds = None
- static get_best_nd_chunks(item_shape, feat_dtype=np.float64)
Return best chunks for HDF5 datasets
Chunking has performance implications. It’s recommended to keep the total size of dataset chunks between 10 KiB and 1 MiB. This number defines the maximum chunk size as well as half the maximum cache size for each dataset.
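To give a feeling for the numbers involved, the following sketch computes how many events fit into one chunk when the chunk is capped at 1 MiB. The helper sketch_chunk_shape and its parameters are hypothetical; it is one straightforward way to stay below such a limit and is not claimed to be the computation that get_best_nd_chunks performs.

```python
import math


def sketch_chunk_shape(item_shape, itemsize=8, max_bytes=1024**2):
    """Return (events_per_chunk, *item_shape) staying at or below max_bytes.

    Hypothetical helper; itemsize=8 corresponds to a 64-bit float.
    """
    event_bytes = math.prod(item_shape) * itemsize
    events_per_chunk = max(1, max_bytes // event_bytes)
    return (events_per_chunk, *item_shape)


scalar_chunks = sketch_chunk_shape((1,))          # 8 bytes per event
image_chunks = sketch_chunk_shape((80, 300), 1)   # 8-bit image, 24000 bytes
```

For the 80 x 300 image, 43 events occupy 1 032 000 bytes, while 44 events would exceed the 1 048 576-byte limit.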
- require_feature(feat: str, item_shape: tuple[int], feat_dtype: numpy.dtype, ds_kwds: dict | None = None, group_name: str = 'events')
Create a new feature in the “events” group
- Parameters:
feat (str) – name of the feature
item_shape (tuple[int]) – shape for one event of this feature, e.g. for a scalar event, the shape would be (1,) and for an image, the shape could be (80, 300).
feat_dtype (np.dtype) – dtype of the feature
ds_kwds (dict) – HDF5 Dataset keyword arguments (e.g. compression, fletcher32)
group_name (str) – name of the HDF5 group where the feature should be written to; defaults to the “events” group, but a different group can be specified for storing e.g. internal basin features.
- store_basin(name: str, paths: list[str | pathlib.Path] | None = None, features: list[str] | None = None, description: str | None = None, mapping: numpy.ndarray | None = None, internal_data: dict | None = None, identifier: str | None = None)
Write an HDF5-based file basin
- Parameters:
name (str) – basin name; Names do not have to be unique.
paths (list of str or pathlib.Path or None) – location(s) of the basin; must be None when storing internal data, a list of paths otherwise
features (list of str) – list of features provided by paths
description (str) – optional string describing the basin
mapping (1D array) – integer array with indices that map the basin dataset to this dataset
internal_data (dict of ndarrays) – internal basin data to store; If this is set, then features and paths must be set to None.
identifier (str) – the measurement identifier of the basin as computed by the get_measurement_identifier() function.
- dcnum.write.copy_basins(h5_src: dcnum.common.h5py.File, h5_dst: dcnum.common.h5py.File, internal_basins: bool = True)
Reassemble basin data in the output file
This does not just copy the datasets defined in the “basins” group, but it also loads the “basinmap?” features and stores them as new “basinmap?” features in the output file.
- dcnum.write.copy_features(h5_src: dcnum.common.h5py.File, h5_dst: dcnum.common.h5py.File, features: list[str], mapping: numpy.ndarray | None = None, ds_kwds: dict | None = None)
Copy feature data from one HDF5 file to another
The feature must not exist in the destination file.
- Parameters:
h5_src (h5py.File) – Input HDF5File containing features in the “events” group
h5_dst (h5py.File) – Output HDF5File, opened in write mode, that does not yet contain the features
features (list[str]) – List of features to copy from source to destination
mapping (1D array) – If given, contains indices in the input file that should be written to the output file. If set to None, all events are written.
ds_kwds – keyword arguments with which to initialize new Datasets (e.g. compression); only relevant when mapping is not None
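The effect of mapping can be shown without HDF5 at all. In the sketch below, nested dicts of lists stand in for the two files, and sketch_copy_features is a hypothetical helper; raising ValueError for an already existing feature is an assumption, since the text above only states that the feature must not exist in the destination.

```python
def sketch_copy_features(src, dst, features, mapping=None):
    """Copy lists from src["events"] to dst["events"] (plain dicts here)."""
    for feat in features:
        if feat in dst["events"]:
            # Assumed error type; the requirement itself is from the docs
            raise ValueError(f"Feature {feat!r} already exists in destination")
        data = src["events"][feat]
        if mapping is None:
            dst["events"][feat] = list(data)                  # copy everything
        else:
            dst["events"][feat] = [data[i] for i in mapping]  # selected indices


src = {"events": {"area_um": [10.0, 11.0, 12.0, 13.0],
                  "deform": [0.1, 0.2, 0.3, 0.4]}}
dst = {"events": {}}
sketch_copy_features(src, dst, ["area_um"], mapping=[0, 2, 3])
```

Only the listed feature is copied, and the output contains the source entries at indices 0, 2, and 3 in that order.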
- dcnum.write.copy_metadata(h5_src: dcnum.common.h5py.File, h5_dst: dcnum.common.h5py.File)
Copy attributes, tables, and logs from one H5File to another
Notes
Metadata in h5_dst are never overwritten; only metadata that are not already defined are added.
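The rule from the note can be mimicked with two plain dictionaries standing in for the attributes of the source and destination files. The helper sketch_copy_attrs and the attribute keys are illustrative assumptions.

```python
def sketch_copy_attrs(src_attrs, dst_attrs):
    """Add missing keys to dst_attrs without touching existing ones."""
    for key, value in src_attrs.items():
        dst_attrs.setdefault(key, value)   # keeps existing destination values


src_attrs = {"experiment:sample": "source sample", "setup:channel width": 20.0}
dst_attrs = {"experiment:sample": "already set"}
sketch_copy_attrs(src_attrs, dst_attrs)
```

After the call, the destination keeps its own "experiment:sample" value and only gains the key it was missing.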
- dcnum.write.create_with_basins(path_out: str | pathlib.Path, basin_paths: list[str | pathlib.Path] | list[list[str | pathlib.Path]])
Create an .rtdc file with basins
- Parameters:
path_out – The output .rtdc file where basins are written to
basin_paths – The paths to the basins written to path_out. This can be either a list of paths (to different basins) or a list of lists of paths (for basins containing the same information, commonly used for relative and absolute paths).
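The two accepted shapes of basin_paths can be brought into a single form with a small normalization step. The helper sketch_normalize_basin_paths below is hypothetical and only illustrates the distinction between "one entry per basin" and "several alternative locations for the same basin"; the file names are made up.

```python
import pathlib


def sketch_normalize_basin_paths(basin_paths):
    """Return a list of lists of str, one inner list per basin."""
    normalized = []
    for entry in basin_paths:
        if isinstance(entry, (str, pathlib.PurePath)):
            entry = [entry]              # a single path is its own basin
        normalized.append([str(p) for p in entry])
    return normalized


# Two different basins, each with one location
separate = sketch_normalize_basin_paths(
    ["a.rtdc", pathlib.PurePosixPath("data/b.rtdc")])
# One basin with a relative and an absolute location
alternatives = sketch_normalize_basin_paths(
    [["../a.rtdc", "/abs/path/a.rtdc"]])
```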