
  • Open access
  • Published: 28 January 2021

A data reduction and compression description for high throughput time-resolved electron microscopy

  • Abhik Datta 1 , 2 ,
  • Kian Fong Ng 1 , 2 ,
  • Deepan Balakrishnan   ORCID: orcid.org/0000-0001-8684-1158 1 , 2 ,
  • Melissa Ding 3 ,
  • See Wee Chee   ORCID: orcid.org/0000-0003-0095-3242 1 , 2 , 4 ,
  • Yvonne Ban 1 , 2 ,
  • Jian Shi 1 , 2 &
  • N. Duane Loh   ORCID: orcid.org/0000-0002-8886-510X 1 , 2 , 4  

Nature Communications volume 12, Article number: 664 (2021)


  • Cryoelectron microscopy
  • Imaging techniques
  • Transmission electron microscopy

Fast, direct electron detectors have significantly improved the spatio-temporal resolution of electron microscopy movies. Preserving both spatial and temporal resolution in extended observations, however, requires storing prohibitively large amounts of data. Here, we describe an efficient and flexible data reduction and compression scheme (ReCoDe) that retains both spatial and temporal resolution by preserving individual electron events. Running ReCoDe on a workstation, we demonstrate on-the-fly reduction and compression of raw data streaming off a detector at 3 GB/s for hours of uninterrupted data collection. The output was 100-fold smaller than the raw data and saved directly onto network-attached storage drives over a 10 GbE connection. We discuss calibration techniques that support electron detection and counting (e.g., estimating electron backscattering rates, false positive rates, and data compressibility), and novel data analysis methods enabled by ReCoDe (e.g., recalibration of data post acquisition, and accurate estimation of coincidence loss).


Introduction

Fast, back-thinned direct electron detectors are rapidly transforming electron microscopy. These detectors ushered in a “resolution revolution” for electron cryo-microscopy (cryo-EM), and the prospect of seeing sub-millisecond dynamics for in-situ electron microscopy. These transformations are driven by three key factors: (1) improved detection efficiency, (2) shorter detector readout times to better resolve individual electron events, and (3) algorithms that translate these advances into improved spatial and temporal resolution. Whereas the first two factors have received considerable attention, it remains impractical for many existing algorithms to process the very large raw output produced by these movie-mode detectors. Fortunately, the useful information in these raw data is typically sparse, hence a suitable data reduction and compression scheme should allow us to fully reap the advantages offered by these detectors.

Nearly all the useful information in a single raw detector image is contained within “secondary electron puddles”, each of which is digitized from the cloud of secondary charged particles formed in the wake of individual high energy electrons passing through the detector’s sensor. While the size and shape of secondary electron puddles contain some information 1 , localizing the entry point of the incident electron from its electron cloud already noticeably improves the spatial resolution of the image. To accurately localize these electron puddles they must be spatiotemporally well separated (by increasing the frame rate or reducing the incident electron flux), thereby reducing the so-called coincidence loss 2 . This separation creates a very high raw data load when acquiring images that add up to a desired accumulated electron dose. For example, the memory needed to store the incident electron entry points, in a low coincidence-loss image (~6%) acquired at 0.01 e/pixel/frame is approximately a hundredth that of the raw detector readout, with the remainder holding only thermal readout noise.

Currently, there are three popular options to manage the large raw data loads that a high-throughput electron detector generates. The first, typical in cryo-EM, is to employ a higher internal frame rate on the detector for counting electrons at low coincidence loss, but add many of these frames together before they are stored to disk. The downside here is the loss of temporal resolution in the added, stored images. The second option is to reduce the total data acquisition time. Here, an experimenter may fill terabytes of local hard disk with raw data for 10 min, then wait at least twice as long to offload this data to a larger networked drive before more data acquisition can proceed. The third option is to collect data at the maximum detector frame rate but only store the frames that contain significant information. However, this strategy only works at high dose rates, where individual pre-selected frames still show sufficient contrast for the experimenter to judge whether to keep or discard them. At such high dose rates, the experimenter has to either sacrifice spatial resolution or be limited to atomic resolution only for radiation-hard samples.

None of these three options are ideal, especially since the vast majority of these high data loads are storing only the detector’s thermal and readout noise. Furthermore, these options also limit us from using faster detectors 3 to study dynamics at even shorter timescales. Naturally, reducing and compressing the raw data would obviate the need to choose between these three compromising options. If we store only electron arrival events, we can enjoy high temporal and spatial resolution while continuously acquiring movies of dose-sensitive samples at very low dose rates for hours, practically uninterrupted.

While more experiments that require both high temporal and spatial resolution are emerging 4 , 5 , 6 , acquiring such movies for long time scales remains expensive, and in many cases, infeasible. For perspective, a 4 TB hard drive only accommodates about 21 min of data collection at a DE-16 detector’s maximum data rate of 3.08 GB/s. A prominent example that exploits fast detectors is motion-correction in TEM (transmission electron microscopy). Here, the imaging resolution is demonstrably improved when fast detectors fractionate the total electron dose on a sample onto a time series of micrographs that are individually corrected for relative dose-induced motion 7 . In fact, recent work suggests that using more efficient data representation to further increase dose fractionation, hence finer time resolution, can improve spatial resolution 8 .

As electron microscopy becomes increasingly reliant on larger datasets and more complex processing and analysis workflows, there is an ever-greater push for publications to include the raw data necessary for others to validate and reproduce the analyses 9 . Improved compression will make public archives like EMPIAR more accessible, increase their adoption, and encourage the deposition of raw micrographs, facilitating validation of the structures produced using them 10 . Without an effective data reduction and compression scheme, storing raw detector data will be costly: at ~US$20 per terabyte (TB) of archival storage on commodity HDDs and ~US$400 per TB on SSDs (based on the prices of lower-end external hard disk drives and solid-state drives as of April 2019 11 , 12 ), just 15 min of continuous data acquisition per day on the DE-16 (Direct Electron, LP) detector at its maximum frame rate (3 GB/s throughput) would cost between US$20,000 and US$400,000 per year, respectively.
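
For readers who want to verify this estimate, the short calculation below reproduces it from the figures quoted above (a back-of-the-envelope Python sketch; the 15 min/day duty cycle and the US$20/TB and US$400/TB prices are the assumptions stated in the text):

    rate_gb_s = 3.08            # DE-16 maximum throughput, GB/s
    seconds_per_day = 15 * 60   # 15 min of acquisition per day
    days_per_year = 365

    tb_per_year = rate_gb_s * seconds_per_day * days_per_year / 1000.0
    print(f"{tb_per_year:,.0f} TB/year")                 # ~1,012 TB (about 1 PB)
    print(f"HDD: US${tb_per_year * 20:,.0f} per year")   # ~US$20,000
    print(f"SSD: US${tb_per_year * 400:,.0f} per year")  # ~US$405,000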

Here, we propose a data reduction and compression scheme capable of file size reductions as high as 100× for realistic electron-counting scenarios. The output of this scheme is a file format known as ReCoDe (Reduced Compressed Description). For simplicity, we refer to the reduction compression scheme as the ReCoDe scheme (or simply ReCoDe when the context is clear). In this scheme, the original raw data is first reduced to keep only the information regarding identified electron puddles, which is then further compressed. The ReCoDe scheme permits four reduction levels and several different types of lossless compression algorithms, whose combinations are discussed in this work. We show how data stored in the least lossy ReCoDe reduction level can be re-processed post-acquisition to remove detector artifacts, especially those owing to slow drifts in the thermal background. Moreover, storing data at this reduction level retains the puddle shape and intensity information. Through several use cases, we show the benefits of retaining this information. One of these is coincidence loss estimation, where we show that puddle shape information is essential for accurate estimation. We also develop methods that use this information to estimate the prevalence of backscattered electrons and the false positive rates of electron events. For the DE-16 detector we estimated the ratio of primary to backscattered electrons to be ~8.6. The ReCoDe scheme is sufficiently parallelizable that data streams from even the fastest current detectors can be reduced and compressed on-the-fly onto networked storage disks using only modest computing resources, provided the raw data can be accessed before it is written to disk. For instance, the raw data stream of a low dose experiment (0.8 e/pixel/s) collected on a DE-16 detector (~3.08 GB/s throughput) can be reduced and compressed by 10 Intel Xeon CPU cores, then written to network-attached storage devices via a modest 10 gigabit ethernet connection. Furthermore, the ReCoDe data format has been designed for fast sequential and random access, so that frames can be decompressed into the reduced representation on demand. Using ReCoDe, in-situ electron microscopy movies can retain high dose fractionation and sub-millisecond time resolution while extending acquisition times from minutes to hours, giving users the flexibility to fractionate their doses from tens to thousands of frames per second for more precise temporal resolution and drift correction where possible. Finally, using several publicly available EMPIAR 9 datasets with moderate to high dose rates, ranging from 0.5 to 5.0 electrons/pixel/frame, we show that ReCoDe achieves 2–8× compression, outperforming existing compression approaches.

Data reduction levels

A secondary electron puddle is characterized by its spatial location, its two-dimensional (2D) shape, and the pixel intensities within this shape. To accommodate various downstream processing needs we define four logical data reduction levels: L1, L2, L3, and L4, with progressively higher levels of file size reduction and hence information loss from L1 to L4 (Fig.  1 ).

figure 1

a The leftmost image (L0) depicts a 10 × 10 pixel image (the raw detector output) with four secondary electron puddles. The remaining four images from left to right correspond to the four data reduction levels, L1 to L4, respectively. Each image represents a reconstruction of the original image (L0) using only the information retained at that level (see table at the bottom). The L1 image retains all the useful information about the secondary puddles by first removing detector readout/thermal noise from L0. In L2, the spatial location of the four puddles, the number of pixels (area) in each puddle, the shape of the four puddles, and an intensity summary statistic (sum, maximum or mean) for each puddle are retained. Each reduction level offers different advantages in terms of speed, compression, information loss, spatial or temporal resolution, etc. (see row labeled “Optimized For”). The row labeled “Reduced Representation” describes how the information retained at each level is packed in the reduced format. These packings are tuned to provide a good balance between reduction speed and compressibility. In L3, the puddle area, shape and location information are all encoded in a single binary image, which is easily computed and highly compressible. In L1 and L2, these three aspects are packed using the same binary-image representation as in L3. Only the most likely locations of incident electrons are saved as binary maps in L4. Panels b , c , d , and e are the reduction compression pipelines for reduction levels L1, L2, L3, and L4, respectively. Here, the thresholding step produces a binary map identifying pixels as signal or noise. Bit packing removes unused bits and converts the list of ADU values into a continuous string of bits. The connected components labeling algorithm identifies clusters of connected pixels that constitute individual electron puddles from this binary map. Puddle centroid extraction further reduces each puddle to a single representative pixel; and puddle feature extraction computes puddle-specific features such as the mean or maximum ADU.

All four data reduction schemes begin with a thresholding step, which produces a binary map identifying pixels as containing useful signal or not. The ADU (analog-digital unit) threshold used to label signal pixels is independent for each pixel and is decided based on the signal-noise calibration procedure (discussed below). In ReCoDe level L1, the sparsified signal pixel intensities are then bit-packed into a dense format. Bit packing removes unused bits and converts the list of ADU values into a continuous string of bits. The binary map and the bit-packed intensity values are independently compressed and the two compressed streams are stacked to create an L1 reduced compressed frame. As L1 reduction retains all the information about electron puddles, electron counting can be performed long afterward, should the user wish to explore different counting algorithms and/or parameters. Both thresholding and packing are sufficiently fast to make L1 suitable for on-the-fly processing (discussed in “Demonstration of on-the-fly Reduction and Compression” section). Even for relatively high electron flux data (0.05 e/pixel/frame) L1 reduction alone achieves a 10× reduction in file size. This reduced data can be further compressed to achieve an overall 25× file size reduction (see below).
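
As an illustration of these two steps, the sketch below mimics an L1 reduction of a single frame with NumPy. It is a simplified mock-up rather than the ReCoDe implementation; in particular, the 12-bit packing width and the little-endian host are assumptions made for clarity.

    import numpy as np

    def reduce_L1(frame, thresholds, adu_bits=12):
        """Reduce one raw frame to an L1-style representation.

        frame      : 2D uint16 array of raw detector ADU values
        thresholds : 2D array of per-pixel signal/noise thresholds
        adu_bits   : bits kept per signal-pixel ADU (illustrative choice)
        """
        # 1. Thresholding: binary map marking pixels classified as signal.
        binary_map = frame > thresholds

        # 2. Sparsify: keep only the ADU values of signal pixels (row-major order).
        signal_adu = frame[binary_map].astype(np.uint16)

        # 3. Bit-pack: drop the unused high bits of each ADU value and
        #    concatenate the remaining bits into a continuous bit string
        #    (assumes a little-endian host).
        byte_pairs = signal_adu.view(np.uint8).reshape(-1, 2)[:, ::-1]  # big-endian byte order
        bits = np.unpackbits(byte_pairs, axis=1)                        # 16 bits per value
        packed_adu = np.packbits(bits[:, 16 - adu_bits:])               # keep the low adu_bits

        # The binary map and the packed ADUs are compressed independently downstream.
        return np.packbits(binary_map), packed_adu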

In L3 reduction, the pixel intensities are discarded during thresholding and only the binary map is retained and compressed. L3 is therefore optimized for speed, at the expense of puddle-specific ADU (pixel intensity) information.

To compute puddle specific features, in L2 and L4 reductions, the clusters of connected pixels that constitute individual puddles are identified from the binary map using a connected components labeling algorithm, discussed in the Methods section. In L4 reduction, each puddle in the binary map is further reduced to the single pixel where the primary electron was most likely incident. L4 reduction, therefore, results in a highly sparse binary map that is optimized for maximum compression. At the same electron flux (0.05 e/pixel/frame) L4 reduction and compression results in 45× file size reduction. This increased compression comes at the cost of throughput since counting has to be performed as part of the reduction step.

In L2 reduction, a summary statistic, such as mean, maximum or sum of ADU (analog-digital unit), is extracted for each electron puddle. Preliminary studies suggest that such information may correlate with whether a measured electron was elastically or inelastically scattered 1 . The sparse puddle features are then packed into a dense format and the binary map and the dense puddle features are independently compressed. Several applications that record diffraction patterns benefit from a high dynamic range but do not necessarily need to retain the entire signal as done in L1. L2 is designed for such applications.

In L1 and L2 reductions, the binary maps and the packed intensity summary statistics are independently compressed and then stacked. As the binary maps and intensity values have very different characteristics, compressing them independently results in optimal compression (Fig.  1b–d ).

The reduced compressed data formats are detailed in Supplementary Method  1 .

All four data reduction schemes in ReCoDe first reduce the data by removing primarily readout noise (thresholding) and then compressing the signal. Accurate signal-noise separation is therefore critical. To remove pixel-level differences in dark noise and gain that can bias the identification of isolated electron puddles, individual thresholds are calculated per pixel based on calibration data (see “Methods” section). For the DE-16 detector, this calibration can be done with a single dataset with flat-field illumination at a low dose rate and extended exposure times. Since different detectors may require custom calibration, ReCoDe only requires the per pixel thresholds for separating signal-noise as input and is agnostic of the calibration method used. These thresholds are specified in a single image frame, which is reloaded by ReCoDe at user-specified intervals. External programs can update the thresholds intermittently for on-the-fly recalibration to accommodate changing detector response.

Calibrating parameters for data reduction

An appropriate threshold separating signal from noise is critical for electron counting to be effective. Typically, this threshold is established through calibration, based on dark and gain references obtained during data acquisition. In most imaging software these calibrations depend on several hyper-parameters that are predetermined (for instance, the number of frames used in the dark reference). Once the calibrated frames are reduced to electron counted images, the calibration cannot be revised, and the effects of the hyper-parameters are permanent. L1 reduction presents an alternative, where the data can be recalibrated post-acquisition without having to store the entire dataset, as long as a sufficiently permissive threshold is used. In low dose rate experiments, the quality of images cannot be verified through visual inspection during data acquisition, so the effectiveness of the calibration can be difficult to judge. The ability to recalibrate datasets in such cases can significantly improve image quality, as shown in Fig.  2 . Here, the data was recalibrated by using a higher threshold for separating dark noise from signal, and pixel gains were recalculated after removing single-pixel puddles (see Fine calibration in “Methods” section for details). We observed that such recalibration can significantly reduce the number of false positive electron events.

figure 2

Panels a and b are Fourier transforms (FT) of summed L1 reduced frames of HRTEM movies of a molybdenum disulfide 2-D crystal, acquired using a JEOL 2200 microscope operating at 200 keV and a DE-16 detector running at 300 fps, with a pixel size of 0.2 Å. a is L1 reduced using fast on-the-fly calibration with a 3σ threshold (see “Methods” section); b is the result of recalibrating a with a more stringent fine calibration that uses an area threshold and a 4σ threshold (see “Methods” section). The Fourier peaks indicated with orange arrows in a are due to detector artifacts, which are not readily visible in the image but can severely impact drift correction. a and b are the sum of FFTs of 9000 frames.

Even small deviations in calibration can significantly bias counting and therefore recalibration (or at least a quality assessment) should be a necessary step in ensuring accurate counting. L1 reduced data facilitates such post-hoc analysis. This includes using the electron puddle size/shape distributions to estimate realistic coincidence losses specific to the detector and imaging conditions (Table  1 ).

Table  1 shows coincidence losses estimated using five different techniques. In the first three (columns from left to right) puddles are assumed to be of fixed shape and size, whereas, in the last two, the actual puddle shape and size information are included in the calculation (see “Methods” section). Clearly, the knowledge of puddle shape and size is essential for accurate coincidence loss estimation. Therefore, accurately estimating coincidence loss requires retaining data at reduction levels L1–L3.

A recent study 8 has proposed storing L4 reduced data in a sparse format to benefit from higher dose fractionation without overwhelming acquisition systems with storage requirements. To achieve super-resolution electron counting, which is critical for improving reconstruction resolution in cryo-EM, they propose subdividing each pixel before counting and storing the higher-resolution spatial locations of electron events using a higher bit-depth. ReCoDe’s L1 reduction scheme enables super-resolution electron counting without the need to subdivide pixels at the time of acquisition, thus eliminating the need to predetermine to what extent pixels should be partitioned.

Reducibility and compressibility with increasing electron fluxes

With increasing electron flux, the data naturally becomes less reducible and less compressible. To quantify this change, we simulated images at eight electron fluxes between 0.0025 and 0.07 e/pixel/frame (Fig.  3 ). This range was chosen for tolerable coincidence loss during electron counting (Table  1 ). For data without any reduction (unreduced compression line in Fig.  3 ), the compression ratio remains similar across all fluxes (~4×), because of readout dark noise. L3 and L4 reduced data are essentially binary images with 1 bit per pixel (Fig.  1 ). Therefore, if the input data uses n bits to represent each pixel’s intensity, a factor of n reduction is achieved using L3 or L4 reduction alone. In Fig.  3 , a 16× reduction is seen for the 16-bit simulated data. In L1 and L2, pixel intensity information and event summary statistics are retained in addition to the L3 binary map. As electron flux increases, more pixel intensities/event statistics need to be stored. However, due to coincidence loss, the number of counted electron events and the L1 and L2 file sizes increase only sub-linearly.

figure 3

The solid black line (“unreduced compression”) shows the compression ratios achieved on unreduced raw data (including dark noise) using Deflate-1. The dashed lines show the compression ratios achieved with just the four levels of data reduction and without any compression. The solid lines show the compression ratios after compressing the reduced data using Deflate-1. The coincidence loss levels corresponding to the electron fluxes label the second y -axis on the right.

With increasing electron flux, the binary images used to store location and shape information in the reduced format also become less compressible. This is evident from the L3 and L4 “reduction + compression” lines in Fig.  3 . At the same time, for L1 reduction, the proportion of reduced data containing pixel intensities increases rapidly with increasing electron flux. As a result, the compressibility of L1 reduced data falls very quickly with increasing electron flux.

At moderate (0.01 e/pixel/frame) and low (0.001 e/pixel/frame) electron flux, L1 reduction compression results in 60× and 170× data reduction, respectively.

Reduction L4, where only puddle locations are retained, is optimized for maximum compression and can achieve reduction compression ratios as high as 45×, 100×, and 250× at high (0.05 e/pixel/frame), moderate (0.01 e/pixel/frame) and low (0.001 e/pixel/frame) electron flux, respectively.

Compression algorithms exploit the same basic idea: by representing the more frequently occurring symbols with fewer bits the total number of bits needed to encode a dataset is effectively reduced. Consequently, data is more compressible when symbols are sparsely distributed. Such sparse distributions are readily present in the back-thinned DE-16 electron detector, where nearly 80% of the digitized secondary electron puddles span fewer than three pixels (Supplementary Fig.  8 ). Even for puddles that span four pixels (of which there are 110 possibilities) nearly half (48.3%) are the 2 × 2-pixel square motif.
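
A short entropy calculation makes this concrete. Assuming, for illustration, a flux of 0.01 e/pixel/frame and an average puddle footprint of ~2.4 pixels (both numbers are assumptions chosen only to be roughly consistent with the puddle statistics above, not measured values), an L3/L4-style binary map carries only a fraction of a bit per pixel:

    from math import log2

    flux = 0.01           # electrons/pixel/frame (illustrative)
    mean_puddle_px = 2.4  # assumed mean puddle footprint, in pixels

    p = flux * mean_puddle_px                       # fraction of pixels marked as signal
    entropy = -p * log2(p) - (1 - p) * log2(1 - p)  # binary entropy, bits/pixel

    print(f"{entropy:.2f} bits/pixel")                        # ~0.16 bits/pixel
    print(f"ideal ratio vs 16-bit raw: {16 / entropy:.0f}x")  # ~100x upper bound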

The randomly distributed centroids of secondary electron puddles account for the largest fraction of memory needed to store the reduced frames. We considered three representations for storing these centroids and ultimately adopted a binary image representation (Methods section).

Compression algorithms

Any lossless compression algorithm can operate on the reduced data levels in Fig.  1 . Compression algorithms are either optimized for compression power or for compression speed, and the desired balance depends on the application. For on-the-fly compression, a faster algorithm is preferable even if it sub-optimally compresses the data, whereas an archival application may prefer higher compression power at the expense of compression speed.

We evaluated the compression powers and speeds of six popular compression algorithms that are included by default in the ReCoDe package: Deflate 13 , Zstandard (Zstd), bzip2 (Bzip) 14 , 15 , LZMA 16 , LZ4 17 , and Snappy 18 (Fig.  4 ). Each algorithm offers different advantages; Bzip, for instance, is optimized for compression power whereas Snappy is optimized for compression and decompression speed. All six algorithms can be further parameterized to favor compression speed or power. We evaluated the two extreme internal optimization levels of these algorithms: fastest but sub-optimal compression, and slowest but optimal compression.

figure 4

Each scatter plot shows the reduction compression ratios and the compression throughputs of six compression algorithms (Deflate, Zstandard (Zstd), bzip2 (Bzip), LZ4, LZMA, and SNAPPY), plus the Blosc variants of Deflate, Zstandard (Zstd), LZ4, and SNAPPY. Reduction compression ratio (horizontal axes in all panels) is the ratio between the raw (uncompressed) data and the reduced compressed data sizes. The three rows of scatter plots correspond to three different electron fluxes: 0.01, 0.03, and 0.05 e/pixel/frame, from top to bottom. The left and right columns of scatter plots correspond to the two most extreme internal optimization levels of the compression algorithms: fastest but suboptimal compression labeled “Optimal Speed” (left column), and optimal but slow compression labeled “Optimal Compression” (right column). The data throughputs (vertical axes in all panels) are based on single threaded operation of ReCoDe and include the time taken for both reduction and compression. The decompression throughputs of the six algorithms are presented in Supplementary Fig.  2 .

Data reduction schemes similar to L1 have been previously used to compress astronomical radio data in Masui et al. 19 , who proposed the bitshuffle algorithm for compressing radio data after removing thermal noise from it. We experimented with Blosc, a meta-compressor for binary data that implements bitshuffle and additionally breaks the data into chunks that fit into the system’s L1 cache, to improve compression throughputs (Fig.  4 ).
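
The snippet below shows what this combination looks like in practice, using the python-blosc bindings on a mock sparse 16-bit frame; the bindings, the Zstd codec choice, and the mock data are illustrative assumptions rather than the benchmark setup used for Fig. 4.

    import blosc
    import numpy as np

    # Mock sparse 16-bit frame: ~1% of pixels carry a signal-like ADU value.
    rng = np.random.default_rng(0)
    frame = (rng.random((4096, 512)) < 0.01).astype(np.uint16) * 900

    raw = frame.tobytes()
    packed = blosc.compress(raw,
                            typesize=2,                # 16-bit elements
                            cname="zstd",              # codec wrapped by Blosc
                            clevel=1,                  # favor speed
                            shuffle=blosc.BITSHUFFLE)  # bit-transpose before coding

    print(f"ratio: {len(raw) / len(packed):.1f}x")
    assert blosc.decompress(packed) == raw             # lossless round trip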

LZ4 and SNAPPY have the highest throughputs across all reduction levels and electron fluxes, with reduction compression ratios slightly worse than those of the remaining algorithms. At the lowest dose rate (0.01 e/pixel/frame) Bzip results in the best reduction compression ratios, regardless of the internal optimization level. At higher dose rates (0.03 and 0.05 e/pixel/frame) Zstd has the highest compression ratio. Considering all dose rates and internal optimization levels, Zstd on average offers the best balance between compression ratio and throughput. Choosing the faster internal optimization level only marginally lowers the reduction compression ratio but significantly improves throughput.

Deflate optimized for speed, for instance, is ~25× faster than Deflate optimized for compression across the three dose rates. We use Deflate optimized for speed (referred to as Deflate-1) as the reference compression algorithm for the rest of the paper, as it represents a good average-case performance among all the compression algorithms. In subsequent sections, we show that Deflate-1 is fast enough for on-the-fly compression. All algorithms have higher decompression throughput than compression throughput (Supplementary Fig.  2 ). Deflate-1 has ten times higher decompression throughput than compression throughput, which means the same computing hardware used for reduction and compression can support on-the-fly retrieval and decompression of frames for downstream data processing.
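
As a rough way to reproduce this speed-versus-ratio trade-off, the sketch below times zlib (an implementation of Deflate) at its fastest and strongest settings on a mock reduced frame; equating Deflate-1 with zlib level 1 is an assumption of this sketch.

    import time
    import zlib

    import numpy as np

    # Mock reduced data: a sparse binary map packed into bytes.
    rng = np.random.default_rng(1)
    reduced = np.packbits(rng.random((4096, 4096)) < 0.02).tobytes()

    for level in (1, 9):  # "optimal speed" vs "optimal compression"
        t0 = time.perf_counter()
        out = zlib.compress(reduced, level)
        dt = time.perf_counter() - t0
        print(f"level {level}: ratio {len(reduced) / len(out):.1f}x, "
              f"{len(reduced) / dt / 1e6:.0f} MB/s")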

For most compression algorithms Blosc marginally improves compression throughput (Fig.  4 ), except in the case of optimal compression with LZ4, where Blosc improves throughput by as much as 400 MB/s.

Demonstration of on-the-fly reduction and compression

Electron microscopy imaging often has to be performed at low electron flux to reduce beam-induced damage in dose-sensitive samples and to minimize beam-induced reactions. Observing rare events or slow reactions in such cases requires extended acquisition, which is not feasible with current detector software without compromising temporal resolution. Loss of temporal resolution, in turn, degrades drift correction and therefore limits spatial resolution. ReCoDe’s on-the-fly reduction compression fills this critical gap, enabling hours-long continuous acquisition without overwhelming storage requirements, compromising temporal resolution, or losing puddle information.

ReCoDe is easily parallelized, with multiple threads independently reducing and compressing different frames in a frame stack. In this multithreaded scheme, each thread reduces and compresses frames into its own intermediate file, and these intermediate files are merged when data collection is complete. The merging algorithm reads from the intermediate files and writes to the merged file sequentially, and is therefore extremely fast. The intermediate and merged (ReCoDe) file structures and the merging process are described in Supplementary Method  1 .
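
The sketch below illustrates this pattern with a process pool: each worker reduces and compresses its own slice of a mock frame stack into an intermediate file, and the intermediate files are concatenated at the end. The file names, the L3-style placeholder reduction, and the simple length-prefixed framing are assumptions for illustration, not the ReCoDe file format.

    import zlib
    from concurrent.futures import ProcessPoolExecutor

    import numpy as np

    def reduce_and_compress(frame, threshold=300):
        """Placeholder L3-style reduction: threshold, bit-pack, Deflate-1 compress."""
        return zlib.compress(np.packbits(frame > threshold).tobytes(), 1)

    def worker(args):
        """Reduce and compress a slice of frames into one intermediate file."""
        part_id, frames = args
        path = f"part_{part_id:03d}.rc_part"               # hypothetical intermediate file
        with open(path, "wb") as fh:
            for frame in frames:
                blob = reduce_and_compress(frame)
                fh.write(len(blob).to_bytes(4, "little"))  # per-frame length prefix
                fh.write(blob)
        return path

    if __name__ == "__main__":
        rng = np.random.default_rng(2)
        stack = (rng.random((64, 512, 512)) < 0.01) * 900  # mock sparse frame stack
        chunks = [(i, stack[i::4]) for i in range(4)]      # interleave frames over 4 workers

        with ProcessPoolExecutor(max_workers=4) as pool:
            parts = list(pool.map(worker, chunks))

        # Merging is purely sequential reads and writes, hence fast.
        with open("merged.rc_part", "wb") as out:
            for path in parts:
                with open(path, "rb") as fh:
                    out.write(fh.read())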

With this multithreaded scheme, ReCoDe can achieve throughputs matching those of the detectors, enabling on-the-fly reduction and compression. In addition, intermediate files can be accessed sequentially in both forward and reverse directions, with frames indexed by frame number, time stamp, and optionally scan position. Owing to the small size of the reduced compressed frames, they can be read from intermediate files by external programs for live processing and feedback during acquisition, even without merging them into a single file. Users also have the option of retaining raw (unreduced and uncompressed) frames at specified intervals for validation or for on-the-fly recalibration. In electron microscopy facilities, data is often archived on high-capacity network-attached storage (NAS) servers. A schematic of this on-the-fly reduction compression pipeline is shown in Fig.  5a . We evaluated the feasibility of directly collecting the reduced compressed data onto NAS servers, to avoid the overhead of transferring data after collecting it on the microscope’s local computer.

figure 5

a ReCoDe’s multithreaded reduction compression pipeline used for live data acquisition. The CMOS detector writes data into the RAM-disk in timed chunks, which the ReCoDe server processes onto local buffers and then moves to NAS servers. The ReCoDe Queue Manager synchronizes interactions between the ReCoDe server and the detector. b L1 reduction and compression throughput (GB/s) of Deflate-1, with multiple cores at four electron fluxes. The throughput of ReCoDe depends only on the number of electron events every second, hence the four dose rates (horizontal axis) are labeled in million electrons/second. The simulations were performed on a 28-core system; as a result, throughput scales non-linearly when using more than 28 cores (Supplementary Fig.  3 ). c , d Show throughputs when using 10 GbE and IPoIB connections, respectively, to write directly to NAS. e Shows the throughput of L1 reduction without any compression; f shows the throughput of Deflate-1 when compressing the unreduced raw data. g Shows the conversion between million e/s and e/pixel/frame for two different frame size-frame rate configurations of the DE-16 detector.

With the DE-16 detector running at 400 fps, at a dose rate of 0.001 e/pixel/frame, and ReCoDe using 10 CPU cores of the acquisition computer that shipped with the DE-16 detector, we continuously captured data directly onto NAS servers connected by a 10 gigabit/s Ethernet (10 GbE) connection for 90 min (see “Methods” subsection: On-the-fly Compression Pipeline). To further evaluate this multithreaded scheme, we simulated a series of on-the-fly data reduction and compression runs at different electron fluxes. The implementation used for these simulations emulates the worst-case write performance of ReCoDe, where a single thread sequentially accesses the disk (see Supplementary Discussion  3 for details). At relatively low electron flux (0.01 e/pixel/s) we are able to achieve throughputs as high as 8.3 gigabytes per second (GB/s, Fig.  5b ) using 50 threads on a 28-core system. At the same dose rate, only 10 CPU cores are needed to keep up with the DE-16 detector (which has a throughput of ~3.08 GB/s). For perspective, another popular direct electron detector, the K2-IS (Gatan Inc.), nominally outputs bit-packed binary files at approximately 2.2 GB/s. However, since we did not have to incur extra computation time to unpack bits in the raw data from the DE-16, the DE-16 benchmarks in Fig.  5 will not directly apply to K2-IS data.

At moderate electron flux, writing directly to GPFS NAS servers using both 10 GbE and IPoIB (Internet Protocol over InfiniBand) yields throughputs comparable to collecting data locally on the microscope’s computer (Fig.  5 c, d). However, at very low electron flux, writing directly to the NAS server with IPoIB has slightly higher throughput. This is likely due to the reduced communication overhead per call in IPoIB and the distributed data access (IBM GPFS) supported by the NAS servers, both of which are optimized to handle multiple simultaneous small write requests. In the absence of such parallel data access, ReCoDe still executes with a parallel fraction of close to 89% (Supplementary Discussion  3 ).

Both the reduction and compression steps are essential for high throughput on-the-fly processing. Without compression, the reduced data is still too large to write over 10 GbE, particularly at moderate electron flux (Fig.  5e ). Without reduction, the data is not compressible enough; the throughput of Deflate-1 compression without any data reduction (Fig.  5f ) is abysmally low even when using 50 threads.

Effects of reduction on counted image quality

In many applications, the L2 and L3 reduced data have to be ultimately reduced to L4 (an electron-counted image). Here, we consider how the information lost in L2 and L3 reductions affects the resolution of L4 images. In L4, puddles are reduced to a single pixel, which ideally contains the entry point of the incident electron. However, there is no clear consensus on the best approximation strategy for determining an electron’s entry point given a secondary electron puddle 1 . The three common strategies are to reduce the puddle to (1) the pixel that has the maximum intensity, (2) the pixel-intensity-weighted centroid (center of mass), or (3) the unweighted centroid of the puddle. Unlike L1 reduction, where all the information needed for counting with any of these strategies is retained, with L2 and L3 reductions the pixel intensity information is either partially or completely lost. The puddles can then only be reduced to the unweighted centroid, using the third strategy. With L4 reduction, the approximation strategy has to be chosen prior to data acquisition. To evaluate how this information loss affects image quality we performed a knife-edge test using a beam blanker (see “Methods” section for implementation details). The results show (Supplementary Fig.  4 ) that the choice of approximation strategy, and therefore the choice of reduction level, has little consequence on image resolution.
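
For reference, all three strategies can be expressed in a few lines with scipy.ndimage; this is a minimal sketch of the counting step under those three definitions, not the counting code used for the knife-edge comparison.

    import numpy as np
    from scipy import ndimage

    def count_electrons(frame, thresholds, strategy="max"):
        """Reduce each puddle in one frame to a single-pixel electron event."""
        binary_map = frame > thresholds
        labels, n = ndimage.label(binary_map)   # connected-components labeling
        ids = np.arange(1, n + 1)

        if strategy == "max":                   # (1) pixel with the maximum intensity
            points = ndimage.maximum_position(frame, labels, ids)
        elif strategy == "weighted":            # (2) intensity-weighted centroid
            points = ndimage.center_of_mass(frame * binary_map, labels, ids)
        else:                                   # (3) unweighted centroid of the puddle
            points = ndimage.center_of_mass(binary_map, labels, ids)

        counted = np.zeros(frame.shape, dtype=np.uint8)
        for r, c in np.round(np.asarray(points).reshape(-1, 2)).astype(int):
            counted[r, c] = 1
        return counted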

Studying millisecond in-situ dynamics with TEM, such as surface-mediated nanoparticle diffusion in water 20 , requires us to operate at the maximum frame rates of these detectors. In addition, longer total acquisition times would be beneficial for studying reactions such as spontaneous nucleation 21 where the experimenter systematically searches a large surface for samples. Several pixelated TEM electron detectors are now able to achieve sub-millisecond temporal resolutions, with the downside that the local buffer storage accessible to these detectors fills up very quickly. Figure  6 shows that current TEM detectors running at maximum frame rates produce 1 TB of data in several minutes. When the temporal resolution is critical for an imaging modality, reducing the frame rate is not an option. An example is fast operando electron tomography 22 . To capture how the 3D morphology of an object evolves over several seconds, a full-tilt series of the object has to be rapidly acquired at the detector’s peak frame rate. Here again, the duration of these observations can be significantly extended by substantially reducing the output data load with ReCoDe.

figure 6

Each cell’s horizontal and vertical grid position marks the temporal resolution (or, equivalently, frame rate) and frame size of a hypothetical movie-mode data acquisition scenario, respectively. A cell’s text and color indicates the time taken to acquire one terabyte (TB) of data at that frame size and temporal resolution without reduction and compression. For larger frames and high temporal resolution (top right corner), acquisitions lasting merely tens of seconds already produce 1 TB of data. With a 95× reduction in data size the same experiment can span 20 times longer, enabling the observation of millisecond dynamics in reactions that span several minutes. The yellow dots show a few of the frame size-frame rate combinations available for the DE-16 detector.

4D scanning transmission electron microscopy (4D STEM) techniques, including ptychography, have been used to image weak phase objects and beam-sensitive samples such as metal-organic frameworks (MOFs) 23 . Here, a converged electron probe raster-scans a sample, collecting 2D diffraction patterns at each scan point. Although these experiments can produce hundreds of gigabytes of data in minutes 24 , the diffraction patterns tend to be sparse outside of the central diffraction spot. As noise-robust STEM ptychography becomes a reality 25 , their convergent beam electron diffraction patterns will be even sparser. ReCoDe level L1 reduction and compression, which preserves the patterns’ dynamic range while removing only dark noise, is likely to be useful for such data. Once the large datasets in 4D STEM are reduced, they will readily fit into the RAM of desktop workstations, which also facilitates sparse and efficient implementations of processing algorithms.

Electron beam-induced damage is a major limitation for all cryo-EM modalities. In single-particle analysis (SPA), the energy deposited by inelastically scattered electrons manifests as sample damage and ice drift, where global and site-specific sample damage is detectable even at exposures as low as 0.1 e/Å² 26 . Here, higher electron dose fractionation improves resolution in two ways: (1) by reducing coincidence loss and thereby improving detection efficiency 27 and (2) by enabling more accurate estimation of sample drift at a higher temporal resolution 8 . Increasing detector frame rates can reduce the average displacement of each particle captured in each dose-fractionated frame, but doing so further inflates the already large amounts of movie-mode data collected (see Supplementary Fig.  6 ). On-the-fly reduction and compression can significantly reduce the storage costs of movie-mode data, to accommodate image correction algorithms that operate at a degree of dose fractionation that is higher than current practice.

The recently proposed compressed MRCZ format 28 and ReCoDe offer complementary strategies to reduce file sizes generated by electron detectors. MRCZ is ideal for compressing information-dense images of electron counts integrated over longer acquisition times. ReCoDe, however, excels in reducing and compressing the much sparser raw detector data that are used to produce the integrated images typically meant for MRCZ. By doing so ReCoDe can preserve the arrival times of incident electrons that are lost when they are integrated into a single frame. Applying an MRCZ-like scheme on the raw un-reduced signal is inefficient, as shown with the “Unreduced Compression” line in Fig.  3 . Figure  7a compares compression ratios obtained by MRCZ and ReCoDe on publicly available EMPIAR datasets from multiple published results 29 , 30 , 31 , 32 , spanning a range of dose rates and detectors as listed in Table  2 . Figure  7b shows the compression ratios achieved by the two approaches on simulated images. Across the range of dose rates ReCoDe produces better compression ratios on the EMPIAR datasets. As low dose rate datasets (below 0.58 electrons/pixel/frame) are likely to be sparse, ReCoDe as expected, achieves higher compression ratios than MRCZ (Fig.  7b ). However, surprisingly, ReCoDe outperforms MRCZ even for some datasets with much higher average dose rates (EMPIAR-10346 in Fig.  7a ). These datasets have particularly high contrast resulting in higher average dose rates but are still very sparse (Supplementary Fig.  7c ), making them well suited for compression with ReCoDe.

figure 7

a Shows that the compression ratios obtained by ReCoDe (filled stars) on relatively low dose rate EMPIAR datasets are higher than those due to MRCZ (filled circles). b Compression ratios obtained using MRCZ and ReCoDe on simulated 16-bit unsigned integer data. The crossover point for performance occurs at 0.58 electrons/pixel/frame: at dose rates below this, ReCoDe achieves higher compression ratios than MRCZ, whereas at dose rates above this, MRCZ achieves slightly higher compression ratios. The number of electron events per pixel follows a Poisson distribution in these simulated datasets. The underlying compression algorithm used in a and b is Blosc + Deflate (zlib) for both MRCZ and ReCoDe. Table  2 lists a short description of the seven EMPIAR datasets used to generate ( a ). Overall, in the simulated data, compression ratios for both algorithms decrease as the dose rate increases, as expected. However, for the EMPIAR datasets, there are two groups, one for the floating-point data (datasets 0–5) and another for integer data (datasets 6 and 7). Although the floating-point data have lower dose rates than the integer-type data, the former are less compressible because they are naturally less sparse than the latter. Nevertheless, within each group, the expected trend (reduction in compression ratio with increasing dose rate) holds true and ReCoDe outperforms MRCZ. A comparison where all the datasets are standardized to the same integer data type, presented in Supplementary Fig.  6 , shows that the results from EMPIAR datasets and simulated data are quite similar.

We have described three novel analysis methods that demonstrate the necessity of reduction levels L1–L3. These methods cannot be applied on counted (L4 reduced) data, as they rely on puddle shape and intensity information. The first is the recalibration of L1 reduced data post acquisition, to improve counting accuracy. The second analysis uses puddle shape information to accurately estimate coincidence loss. When counting electrons, coincidence loss adversely affects spatial resolution (Supplementary Fig.  5 ). However, as we have shown, estimates of coincidence loss from the counted data can be inaccurate (Table  1 ). As reduction levels L1–L3 retain puddle shape information these can be used when accurate coincidence loss estimates are desired. In the third analysis we use a series of L1 reduced data sets with diminishing dose rates and extremely sparse electron events to estimate false positive rates of detecting electron events (Supplementary Note  12 ).

We also describe a novel method for estimating the proportion of backscattered electrons using counted (L4 reduced) data (Supplementary Note  13 ). Using this analysis, we estimated the ratio of primary to backscattered electrons for the DE-16 detector to be ~8.6. In the future, it may be possible to classify and even eliminate backscattered electrons based on their sizes, shapes, and proximity to primary electrons. Developing such techniques requires retaining more information than counted data currently provide. The L1–L3 reduction levels in ReCoDe are designed to facilitate such future developments.

In summary, we present the ReCoDe data reduction and compression framework for high-throughput electron-counting detectors, which comprises interchangeable components that can be easily configured to meet application-specific requirements. ReCoDe supports four data reduction levels to balance application-specific needs for information preservation and processing speed. We further tested three electron localization strategies and show that they produce similar spatial resolutions even when the electron puddle intensity information is absent. By comparing six candidate compression algorithms on reduced electron data, we found that although LZ4 is the fastest, Deflate-1 offers the best compromise between speed and compressibility.

Remarkably, we demonstrated on-the-fly data reduction and compression with ReCoDe on the DE-16 detector for 90 min. Using only a desktop workstation, we continuously converted a 3 GB/s raw input data stream into a ~200 MB/s output that was, in turn, streamed onto networked drives via 10 Gbit ethernet. Crucially, this demonstration showed that on-the-fly data reduction and compression at low dose rates on our fastest S/TEM detectors is not compute-limited if the detector’s raw data stream is accessible (via a RAM-disk) before it is stored to SSDs. Even higher throughputs will be achievable with direct in-memory access to this raw data stream without the need for a RAM-disk. In the absence of fast simultaneous read-write, there is a critical lack of feedback in low dose rate, long time experiments. The experimenter is left blind in such situations, as individual frames do not have sufficient contrast and the frames available on disk cannot be read to produce a summed image with sufficient contrast. On-the-fly data reduction and compression with ReCoDe enables continuous feedback without interrupting data acquisition for hours.

The ReCoDe scheme can dramatically increase the throughput of electron microscopy experiments. Furthermore, the quality of observations for electron microscopy experiments can also improve. In cryo-EM, ReCoDe can support movies of higher frame rates, which can lead to better drift correction and lower coincidence loss. For in-situ experiments, higher frame rates can also improve the temporal and spatial resolution of the imaged samples.

Currently, a clear barrier for commercial vendors to produce higher throughput detectors is that users cannot afford to store the increased raw data that these faster detectors will bring. Going forward, the readout rates of CMOS detectors may increase to their internal megahertz clock rates 33 , or even into the gigahertz regime 34 . This uptrend is troubling if one considers, by default, that a detector’s raw data output rate increases linearly with its readout rate. However, because the ReCoDe format has very little storage overhead per frame, in principle, its processing and storage rate scales only with the total electron dose once the detector readout rate is fast enough to resolve individual electron puddles. Consequently, the ReCoDe output rate will not increase substantially with megahertz frame rates when the total electron dose is held constant. By efficiently reducing raw data into compact representations, ReCoDe prepares us for an exciting future of megahertz electron detectors in three crucial ways: it limits the storage costs of electron microscopy experiments, facilitates much longer data acquisition runs, and enables very efficient processing algorithms that compute only on the essential features. More broadly, making ReCoDe open source encourages its continued development by the community and incentivizes commercial vendors to specialize in much-needed hardware innovation. The full impact of electron counting detectors, quite possibly, is still ahead of us.

Data acquisition

All experimental data were collected on a DE-16 detector (Direct Electron Inc., USA) installed on a JEM-2200FS microscope (JEOL Inc., Tokyo, Japan) equipped with a field emission gun and operating at a 200 kV accelerating voltage. StreamPix (Norpix Inc., Montreal, Canada) acquisition software was used to save the data in sequence file format without any internal compression. Data for puddle shape and size analysis (Supplementary Fig.  8 ) and MTF characterization with the knife-edge method (Fig.  5 ) were collected at 690 frames per second and 400 frames per second, respectively, with an electron flux of ~0.8 e/pixel/s.

All simulations of on-the-fly data collection were performed on a 28-core (14 core × 2 chips) system with 2.6 GHz E5-2690v4 Intel Broadwell Xeon processors and 512 GB DDR4-2400 RAM.

Connected components labeling

To compute the features specific to each electron puddle (e.g., centroids in L4 and the user-chosen summary statistics (ADU sum or maximum) in L2), the set of connected pixels (components) that constitute each individual puddle has to be identified from the thresholded image. This connected components labeling can be computationally expensive for large puddles. Fortunately, puddle sizes tend to be small for most back-thinned direct electron detectors. For the DE-16 detector, 90% of the puddles are fewer than five pixels in size (Supplementary Fig.  8 ). Therefore, we use a greedy approach similar to the watershed segmentation algorithm 35 to perform connected components labeling. The algorithm assigns a unique label to each connected component or puddle, and the pixels associated with a given puddle are identified by the label of that puddle. In L2 reduction, these labels are used to extract the chosen summary statistics from the puddle, and in L4 reduction, these labels are used to approximate the secondary electron puddle by a single pixel (by computing the centroid or center of mass, etc.).
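
A minimal pure-Python version of such a labeling pass is sketched below; 4-connectivity and breadth-first flooding are assumptions of this sketch (the ReCoDe implementation is written for speed and tuned for the small puddles described above).

    from collections import deque

    import numpy as np

    def label_puddles(binary_map):
        """Assign a unique label to every connected cluster of signal pixels."""
        labels = np.zeros(binary_map.shape, dtype=np.int32)
        rows, cols = binary_map.shape
        next_label = 0

        for r0 in range(rows):
            for c0 in range(cols):
                if binary_map[r0, c0] and labels[r0, c0] == 0:
                    next_label += 1                      # start a new puddle
                    labels[r0, c0] = next_label
                    queue = deque([(r0, c0)])
                    while queue:                         # flood-fill the puddle
                        r, c = queue.popleft()
                        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                            rr, cc = r + dr, c + dc
                            if (0 <= rr < rows and 0 <= cc < cols
                                    and binary_map[rr, cc] and labels[rr, cc] == 0):
                                labels[rr, cc] = next_label
                                queue.append((rr, cc))
        return labels, next_label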

Representation of puddle centroids

The randomly distributed centroids of secondary electron puddles account for the largest fraction of memory needed to store the reduced frames. We considered three representations for storing these centroids. In the first representation, a centroid is encoded as a single 2n-bit linear index. In the second representation, these linear indices are sorted and run-length encoded (RLE), since the ordering of centroids within a single frame is inconsequential. In the third representation, the centroids are encoded as a binary image (similar to L4). The RLE and binary image representations were found to be much more compressible than linear indices (Supplementary Fig.  9 ). Ultimately, the binary image representation was adopted in ReCoDe because the sorting needed for RLE is computationally expensive for only a marginally higher compression.
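
The comparison can be reproduced approximately with NumPy and zlib, as in the sketch below; the frame size, event density, and the use of delta-coded sorted indices as a stand-in for the RLE representation are all illustrative assumptions.

    import zlib

    import numpy as np

    rng = np.random.default_rng(3)
    binary = rng.random((4096, 4096)) < 0.01   # mock centroid map, ~0.01 events/pixel

    # 1. Linear indices: one 2n-bit index per centroid.
    idx = np.flatnonzero(binary).astype(np.uint32)
    # 2. Sorted linear indices, stored as gaps between events (a stand-in for RLE).
    gaps = np.diff(np.sort(idx)).astype(np.uint32)
    # 3. Binary image, bit-packed.
    packed = np.packbits(binary)

    for name, arr in (("linear index", idx), ("sorted + gaps", gaps), ("binary image", packed)):
        print(name, len(zlib.compress(arr.tobytes(), 1)), "bytes")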

Signal-noise calibration

In the current implementation, ReCoDe requires as input a single set of pre-computed calibration parameters, comprising each pixel’s dark and gain corrected threshold for separating signal and noise at that pixel. Any calibration method can be used to compute this calibration frame. The “On-the-fly Calibration” methods subsection below describes a fast routine for estimating this calibration frame, which we applied to the DE-16 detector. The “Fine Calibration” methods subsection thereafter details a more deliberate data collection approach, where additional diagnostics on the detector are also measured. Both calibration approaches yield practically similar results at dose rates above 10 −4 e/pix/frame (Supplementary Fig.  14 ).

On-the-fly calibration

First, a flat-field illuminated dataset, comprising many raw detector frames preferably at the same low dose rate targeted for actual imaging afterward, is collected. An estimate of the incident dose rate is then computed using this dataset. Whereas this could be obtained with an independent measurement (e.g., Faraday cup), the dose rate computed by the procedure described here factors in the detector’s detective quantum efficiency. Ideally, a pixel’s intensity across the calibration frames would follow a mixture of two well-separated normal distributions, corresponding to either dark noise or signal. However, in practice, because the detector PSF is larger than a pixel, charge sharing from fast electrons incident on neighboring pixels will contribute to a single pixel’s intensity, which causes the noise and signal distributions to overlap severely (Supplementary Fig.  10 ).

The calibration (summarized in Supplementary Note  11 ) begins by first estimating a single global threshold that separates signal from noise for all pixels. Assuming the histogram of dark values is normally distributed, this global threshold is estimated based on a user-specified upper limit ( r ) on the tolerable false positive rate of the surmised normal distribution. However, because individual pixels behave differently from each other, using the same threshold for all pixels can severely bias electron counting. To remove this bias, the global threshold has to be adapted for each pixel individually, based on the pixel’s gain and dark noise level.

Now we are ready to estimate the effective detectable electron count on the detector directly from the dataset. Given the low dose rate in this calibration dataset, an individual pixel sees electron events in only a small fraction of frames. Therefore, a pixel’s median across all calibration frames is effectively its dark noise level at this dose rate. Given the compact PSF and high SNR of the DE-16 detector, to calculate each pixel’s gain we assume that direct electron hits result in larger intensities than those due to charge sharing, even when the pixels have different gains. If the calibration dataset has a total dose of N e/pixel, where N is sufficiently small such that the probability of two electrons hitting the same pixel is negligible, then a pixel’s gain is the median of the N largest intensities it records across all calibration frames. Therefore, we first estimate the total dose per pixel in the calibration dataset using a few randomly selected small two-dimensional (2D) patches. Separate thresholds are identified for individual pixels in these patches in a similar manner to the global threshold (i.e., assuming normality in the dark distribution and using a false positive rate parameter r ). These thresholds are used to identify the connected components in each selected 2D patch across all frames in the calibration dataset. The number of connected components emanating from the central pixel of a 2D patch across all calibration frames gives an estimate of the number of electron events ( n_c ) at the central pixel of that patch. The average of these values across all randomly selected patches ( n̄_c ) is used as the estimated total dose per pixel in the calibration dataset. Here, a puddle is assumed to emanate from the pixel that has the maximum value in the puddle. Finally, using the per-pixel dark noise levels and gains, the global threshold is adapted to compute each pixel’s independent threshold. To compute a pixel’s threshold, the global threshold is first shifted such that the pixel’s dark noise level matches the global mean dark noise level and then scaled such that the pixel’s gain matches the global gain.
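
A condensed NumPy sketch of the per-pixel part of this procedure is given below. It assumes the calibration stack fits in memory, takes the patch-based dose estimate n̄_c as given, and uses one plausible reading of the shift-and-scale step; it is not the ReCoDe calibration routine.

    import numpy as np

    def per_pixel_thresholds(calib_frames, n_events_per_pixel,
                             global_threshold, global_dark, global_gain):
        """Adapt a global signal/noise threshold to every pixel.

        calib_frames       : (n_frames, H, W) flat-field calibration stack
        n_events_per_pixel : estimated total dose per pixel (n̄_c) from the
                             patch-based counting step described above
        """
        # Dark level: the per-pixel median over frames, since at this dose
        # rate a pixel sees electrons in only a small fraction of frames.
        dark = np.median(calib_frames, axis=0)

        # Gain: median of the N largest values each pixel recorded, which are
        # assumed to come from direct hits rather than charge sharing.
        n = max(int(round(n_events_per_pixel)), 1)
        top_n = np.sort(calib_frames, axis=0)[-n:]
        gain = np.median(top_n, axis=0)

        # Shift to the pixel's dark level, then scale to the pixel's gain
        # (one plausible reading of the adaptation described above).
        scale = (gain - dark) / (global_gain - global_dark)
        return dark + (global_threshold - global_dark) * scale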

For sufficiently sparse calibration data, even mean pixel intensity, which is much more efficiently computed than median, can be used to estimate the dark noise level for the pixel, although at the expense of a slightly higher false positive rate.

Fine calibration

To further assess the fast, on-the-fly calibration, a slower and more intricate calibration, referred to as fine calibration, was also implemented. The fine calibration adds two steps to the on-the-fly procedure described above: a common-mode correction and a puddle-area-based filter. The common-mode correction eliminates dynamically fluctuating biases in electron counting caused by correlated thermal fluctuations between pixels connected to the same local voltage (hence thermal) source; the area filter rejects electron puddles of implausible size to reduce false positive puddle detections. Analysis of the temporal response of the DE-16’s pixels revealed local detector regions of 4 × 256 pixels with correlated responses (Supplementary Note 12 and Supplementary Fig. 12). The fine calibration steps are summarized in Supplementary Fig. 13.
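The two fine-calibration steps can be pictured with the rough sketch below. This is not the ReCoDe implementation: the block shape mirrors the 4 × 256 correlated regions noted above, while the threshold array and the puddle-area limits are assumed, illustrative values.

import numpy as np
from scipy import ndimage

def fine_calibrate_frame(frame, thresholds, block_shape=(4, 256),
                         min_area=1, max_area=20):
    corrected = frame.astype(np.float64).copy()
    rows, cols = corrected.shape
    # Common-mode correction: remove each block's per-frame median offset, which
    # tracks correlated thermal fluctuations of pixels sharing a voltage source.
    for r0 in range(0, rows, block_shape[0]):
        for c0 in range(0, cols, block_shape[1]):
            block = corrected[r0:r0 + block_shape[0], c0:c0 + block_shape[1]]
            block -= np.median(block)
    # Area-based filtering: keep only connected puddles of plausible size
    # to suppress false positive detections.
    binary = corrected > thresholds
    labels, n = ndimage.label(binary)
    if n == 0:
        return binary
    areas = ndimage.sum(binary, labels, index=np.arange(1, n + 1))
    good = 1 + np.flatnonzero((areas >= min_area) & (areas <= max_area))
    return np.isin(labels, good)

In practice the area limits would be chosen from the detector’s measured puddle-size distribution rather than fixed constants.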

A series of datasets with increasingly sparse data was used to compare the two calibrations. The controlled reduction in dose rate was achieved by increasing magnification in successive datasets by 200× while keeping the electron flux constant. A comparison of the rate of decay of the estimated dose rates with counting following the two calibration strategies revealed that on-the-fly calibration includes a substantial number of false positive puddles (Supplementary Fig.  14A-B ). However, this can be easily remedied with area-based filtering following L1 reduced data acquisition. While the common-mode correction has only a marginal effect on counting with the DE-16 detector (Supplementary Fig.  14C ), it might be essential for other detectors.

Backthinning of direct electron detectors was instrumental in reducing noise from backscattered electrons while also shrinking electron puddle sizes 36 . The smaller puddle sizes and improved signal-to-noise ratio, in turn, made electron counting feasible. By comparing the distribution of neighboring puddle distances in the ultra-low dose rate datasets with those in simulated images, we were able to estimate the ratio of primary to backscattered electrons to be ~8.6 (Supplementary Note  13 ). A feature of L1 reduction is that all the puddle shape information is retained. In the future this shape information may be useful in algorithmically distinguishing backscattered electrons from primary electrons, leading to a further reduction in noise due to backscattered electrons.

On-the-fly compression pipeline

Continuous on-the-fly data reduction and compression were performed for 90 min using 10 cores of the computer shipped with the DE-16 detector. This computer has two Intel Xeon E5-2687v4 processors (24 cores in total, 12 per chip, each with a 3.0 GHz base clock rate), 128 GB of DDR4 RAM, and is connected to a 1-petabyte IBM GPFS NAS via a 10 GbE connection.

If the raw data stream coming from the detector is accessible in memory (RAM), reduction compression can be performed directly on the incoming data stream. However, many detectors (including the DE-16) make the raw data available only after it is written to disk. Reduction compression then requires simultaneously reading data from disk back into RAM while more data from the detector is being written to disk. While sufficiently fast SSDs in RAID 0 can support multiplexed reads and writes to different parts of the RAID partition, a more scalable solution is to use a virtual file pointer to a location in fast DDR RAM (a RAM-disk). DDR RAM, in fact, has sufficient read-write throughput that multiple ReCoDe threads can read different sections of the available data stream in parallel while new data from the detector is being written.

For continuous on-the-fly data collection, the StreamPix software was configured with a script to acquire data in five-second chunks and save each chunk in a separate file. The DE-16 acquisition software does not allow direct in-memory access to the data coming from the detector, restricting access to the data only after it has been written to disk. While SSDs have fast enough write speeds to keep up with the throughput of the DE-16 detector, on-the-fly reduction compression requires simultaneous writing and reading, each at 3 GB/s, which is not possible even with SSDs. To overcome this problem, a virtual file pointer to a location in fast DDR RAM (a RAM-disk) was used. When StreamPix finishes writing a five-second file to the RAM-disk, the ReCoDe queue manager adds the file to the queue and informs the ReCoDe server. The ReCoDe server then picks up the next five-second file in the queue, and each processing thread in the server independently reads a different subset of frames within this file. Each thread then independently appends its reduced and compressed output to its own intermediate file on the NAS server via the 10 GbE connection. When the ReCoDe server finishes processing a five-second file in the queue, it informs the ReCoDe queue manager, which then deletes the file from the RAM-disk. When the acquisition is complete, all intermediate files are automatically merged into a single ReCoDe file in which the reduced and compressed frames are time-ordered.
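A stripped-down sketch of this queue-manager/worker pattern is shown below. It is illustrative only: the paths, the chunk file format (plain .npy stacks), the counting threshold, and the one-worker-per-chunk simplification are all assumptions, and the “reduction” is a trivial threshold-and-deflate stand-in rather than ReCoDe’s actual reduction levels (in the real pipeline, several threads each read a different frame subset of the same chunk file).

import glob
import os
import time
import zlib

import numpy as np
from multiprocessing import Process, Queue

RAMDISK = "/mnt/ramdisk"      # hypothetical RAM-disk mount receiving five-second chunks
NAS_DIR = "/nas/recode_out"   # hypothetical network-attached storage directory
THRESHOLD = 64                # hypothetical counting threshold in detector units

def reduce_and_compress(frames):
    # Stand-in for reduction: a sparse binary map of above-threshold pixels, deflated.
    sparse = (frames > THRESHOLD).astype(np.uint8)
    return zlib.compress(sparse.tobytes(), level=1)

def worker(work_q, worker_id):
    out_path = os.path.join(NAS_DIR, f"part_{worker_id}.rc")
    while True:
        chunk_path = work_q.get()
        if chunk_path is None:           # sentinel: acquisition is finished
            break
        frames = np.load(chunk_path)     # assumes each chunk is written atomically
        payload = reduce_and_compress(frames)
        with open(out_path, "ab") as f:  # append to this worker's intermediate file
            f.write(payload)
        os.remove(chunk_path)            # free RAM-disk space once the chunk is done

def queue_manager(work_q, n_workers, run_seconds=90 * 60):
    # Watch the RAM-disk for new chunk files and hand each one to a worker.
    seen = set()
    t_end = time.time() + run_seconds
    while time.time() < t_end:
        for path in sorted(glob.glob(os.path.join(RAMDISK, "*.npy"))):
            if path not in seen:
                seen.add(path)
                work_q.put(path)
        time.sleep(0.1)
    for _ in range(n_workers):
        work_q.put(None)

if __name__ == "__main__":
    os.makedirs(NAS_DIR, exist_ok=True)
    q = Queue()
    workers = [Process(target=worker, args=(q, i)) for i in range(10)]
    for p in workers:
        p.start()
    queue_manager(q, n_workers=len(workers))
    for p in workers:
        p.join()

When acquisition ends, the per-worker intermediate files would still need to be merged into a single time-ordered output, as described above.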

While the RAM-disk based approach bypasses read-writes to SSDs, it requires holding the same data in RAM twice. First, the data stream from the detector is written to an inaccessible partition in RAM, and then copied to the readable RAM-disk partition. If we had direct access to the first copy in the currently inaccessible partition, the subsequent copy to the RAM-disk could be eliminated, freeing up valuable read-write bandwidth. At the DE-16 detector’s throughput of 3.08 GB/s, this extra copy (one read plus one write) consumes a significant fraction of the RAM’s bandwidth (6.16 GB/s out of DDR4 RAM’s 21–27 GB/s, or roughly 20–25 GiB/s, transfer rate). Direct access to the detector’s data stream without such copying would therefore enable reduction compression at even higher throughputs.

Data availability

Raw detector data that were used to generate the figures in this manuscript are available upon email request to the corresponding author.

Code availability

A fully parallelized Pythonic implementation of ReCoDe with features for on-the-fly reduction compression is available at the GitHub code repository 37: https://github.com/NDLOHGRP/pyReCoDe. Python notebooks used to generate the figures in this manuscript are also included in this code repository.

Datta, A., Chee, S. W., Bammes, B., Jin, L. & Loh, D. What can we learn from the shapes of secondary electron puddles on direct electron detectors? Microsc. Microanal. 23 , 190–191 (2017).


Li, X., Zheng, S. Q., Egami, K., Agard, D. A. & Cheng, Y. Influence of electron dose rate on electron counting images recorded with the K2 camera. J. Struct. Biol. 184 , 251–260 (2013).


Johnson, I. J. et al. Development of a fast framing detector for electron microscopy. In 2016 IEEE Nuclear Science Symposium , Medical Imaging Conference and Room-Temperature Semiconductor Detector Workshop (NSS/MIC/RTSD ) 1–2 (IEEE 2016).

Chee, S. W., Anand, U., Bisht, G., Tan, S. F. & Mirsaidov, U. Direct observations of the rotation and translation of anisotropic nanoparticles adsorbed at a liquid-solid interface. Nano Lett. 19 , 2871–2878 (2019).


Levin, B. D. A., Lawrence, E. L. & Crozier, P. A. Tracking the picoscale spatial motion of atomic columns during dynamic structural change. Ultramicroscopy 213 , 112978 (2020).

Liao, H.-G. et al. Nanoparticle growth. Facet development during platinum nanocube growth. Science 345 , 916–919 (2014).

Zheng, S. Q. et al. MotionCor2: anisotropic correction of beam-induced motion for improved cryo-electron microscopy. Nat. Methods 14 , 331–332 (2017).

Guo, H. et al. Electron-event representation data enable efficient cryoEM file storage with full preservation of spatial and temporal resolution. IUCrJ 7 , 860–869 (2020).

Iudin, A., Korir, P. K., Salavert-Torres, J., Kleywegt, G. J. & Patwardhan, A. EMPIAR: a public archive for raw electron microscopy image data. Nat. Methods 13 , 387–388 (2016).

Henderson, R. et al. Outcome of the first electron microscopy validation task force meeting. Structure 20 , 205–214 (2012).

McCallum, J. C. Disk Drive Prices (1955–2019). https://jcmit.net/diskprice.htm . (2019).

Klein, A. The Cost of Hard Drives Over Time. Backblaze Blog|Cloud Storage & Cloud Backup https://www.backblaze.com/blog/hard-drive-cost-per-gigabyte/ (2017).

Deutsch, L. P. DEFLATE Compressed Data Format Specification Version 1.3. http://zlib.net/ (1996).

Burrows, M. & Wheeler, D. J. A Block-sorting Lossless Data Compression Algorithm. SRC Research Report 124, Digital Systems Research Center, Palo Alto (1994).

bzip2. bzip2: Home. https://www.sourceware.org/bzip2/ (1996).

Pavlov, I. LZMA SDK (Software Development Kit). https://www.7-zip.org/sdk.html (2013).

Collet, Y. lz4. https://github.com/lz4 (2011).

Dean, J., Ghemawat, S. & Gunderson, S. H. snappy. https://github.com/google/snappy (2011).

Masui, K. et al. A compression scheme for radio data in high performance computing. Astron. Comput. 12 , 181–190 (2015).


Chee, S. W., Baraissov, Z., Loh, N. D., Matsudaira, P. T. & Mirsaidov, U. Desorption-mediated motion of nanoparticles at the liquid–solid interface. J. Phys. Chem. C. 120 , 20462–20470 (2016).

Duane Loh, N. et al. Multistep nucleation of nanocrystals in aqueous solution. Nat. Chem. 9 , 77–82 (2016).


Koneti, S. et al. Fast electron tomography: Applications to beam sensitive samples and in situ TEM or operando environmental TEM studies. Mater. Charact. 151 , 480–495 (2019).

Jiang, Y. et al. Electron ptychography of 2D materials to deep sub-ångström resolution. Nature 559 , 343–349 (2018).

Ophus, C. Four-Dimensional Scanning Transmission Electron Microscopy (4D-STEM): from scanning nanodiffraction to ptychography and beyond. Microsc. Microanal. 25 , 563–582 (2019).

Pelz, P. M., Qiu, W. X., Bücker, R., Kassier, G. & Miller, R. J. D. Low-dose cryo electron ptychography via non-convex Bayesian optimization. Sci. Rep. 7 , 9883 (2017).

Hattne, J. et al. Analysis of global and site-specific radiation damage in Cryo-EM. Structure 26 , 759–766.e4 (2018).

Clough, R. & Kirkland, A. I. Chapter One-direct digital electron detectors. In Advances in Imaging and Electron Physics (ed. Hawkes, P. W.) Vol. 198, 1–42 (Elsevier, 2016).

McLeod, R. A., Diogo Righetto, R., Stewart, A. & Stahlberg, H. MRCZ-A file format for cryo-TEM data with fast compression. J. Struct. Biol. 201 , 252–257 (2018).

Casañal, A. et al. Architecture of eukaryotic mRNA 3′-end processing machinery. Science 358 , 1056–1059 (2017).

Falcon, B. et al. Novel tau filament fold in chronic traumatic encephalopathy encloses hydrophobic molecules. Nature 568 , 420–423 (2019).

Hofmann, S. et al. Conformation space of a heterodimeric ABC exporter under turnover conditions. Nature 571 , 580–583 (2019).

Zhao, P. et al. Activation of the GLP-1 receptor by a non-peptidic agonist. Nature 577 , 432–436 (2020).

Allahgholi, A. et al. AGIPD, a high dynamic range fast detector for the European XFEL. J. Instrum. 10 , C01023 (2015).

El-Desouki, M. et al. CMOS image sensors for high speed applications. Sensors 9 , 430–444 (2009).

Vincent, L. & Soille, P. Watersheds in digital spaces: an efficient algorithm based on immersion simulations. IEEE Trans. Pattern Anal. Mach. Intell . 13 , 583–598 (1991).

McMullan, G., Faruqi, A. R. & Henderson, R. Direct electron detectors. Methods Enzymol. 579 , 1–17 (2016).

Datta, A, et al. NDLOHGRP/pyReCoDe v0.1.0 (Version v0.1.0). Zenodo. https://doi.org/10.5281/zenodo.4267160 (2020).


Acknowledgements

The authors would like to thank Xiaoxu Zhao for his kind contribution of the molybdenum disulfide 2D crystals. We also thank Liang Jin and Benjamin Bammes for helpful discussions regarding the DE-16 detector and data processing pipeline, and Benjamin Bammes for his careful reading of the manuscript. We are grateful to Ming Pan, Ana Pakzad, and Cory Czarnik for details about the K2-IS detector and associated binary data formats. The authors would also like to acknowledge Chong Ping Lee from the Centre for Bio-Imaging Sciences (CBIS) at the National University of Singapore (NUS) for training and microscope facility management support, and Bai Chang from CBIS for IT infrastructure support. A.D. and Y.B. were funded by the Singapore National Research Foundation (grant NRF-CRP16-2015-05), with additional support from the NUS Startup grant (R-154-000-A09-133), NUS Early Career Research Award (R-154-000-B35-133), and the Singapore Ministry of Education Academic Research Fund Tier 1 Grant (R-154-000-C01-114).

Author information

Authors and Affiliations

Centre for BioImaging Sciences, National University of Singapore, Singapore, Singapore

Abhik Datta, Kian Fong Ng, Deepan Balakrishnan, See Wee Chee, Yvonne Ban, Jian Shi & N. Duane Loh

Department of Biological Sciences, National University of Singapore, Singapore, Singapore

Department of Computer Science and Engineering, Ohio State University, Columbus, OH, USA

Melissa Ding

Department of Physics, National University of Singapore, Singapore, Singapore

See Wee Chee & N. Duane Loh


Contributions

D.L. and A.D. conceived the project. A.D., D.B., K.F.N., S.W.C., and J.S. collected data, which A.D., K.F.N., M.D., and Y.B. analyzed. ReCoDe framework was programmed by A.D., under D.L.’s advice. A.D. and D.L. wrote the manuscript with inputs from all the authors.

Corresponding author

Correspondence to N. Duane Loh .

Ethics declarations

Competing interests.

The authors declare no competing interests.

Additional information

Peer review information Nature Communications thanks Robert McLeod and the other anonymous reviewers for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article.

Datta, A., Ng, K.F., Balakrishnan, D. et al. A data reduction and compression description for high throughput time-resolved electron microscopy. Nat Commun 12 , 664 (2021). https://doi.org/10.1038/s41467-020-20694-z


Received : 19 November 2019

Accepted : 09 December 2020

Published : 28 January 2021

DOI : https://doi.org/10.1038/s41467-020-20694-z


This article is cited by

Time-resolved transmission electron microscopy for nanoscale chemical dynamics.

  • Francis M. Alcorn
  • Prashant K. Jain
  • Renske M. van der Veen

Nature Reviews Chemistry (2023)

Observation of formation and local structures of metal-organic layers via complementary electron microscopy techniques

  • Xinxing Peng
  • Philipp M. Pelz
  • Mary C. Scott

Nature Communications (2022)


Data Reduction for Science

Funding agency:

  • Department of Energy

The DOE SC program in Advanced Scientific Computing Research (ASCR) hereby announces its interest in research applications to explore potentially high-impact approaches in the development and use of data reduction techniques and algorithms to facilitate more efficient analysis and use of massive data sets produced by observations, experiments and simulation. 

SUPPLEMENTARY INFORMATION 

Scientific observations, experiments, and simulations are producing data at rates beyond our capacity to store, analyze, stream, and archive the data in raw form. Of necessity, many research groups have already begun reducing the size of their data sets via techniques such as compression, reduced order models, experiment-specific triggers, filtering, and feature extraction. Once reduced in size, transporting, storing, and analyzing the data is still a considerable challenge – a reality that motivates SC’s Integrated Research Infrastructure (IRI)  program [1] and necessitates further innovation in data-reduction methods. These further efforts should continue to increase the level of mathematical rigor in scientific data reduction to ensure that scientifically-relevant constraints on quantities of interest are satisfied, that methods can be integrated into scientific workflows, and that methods are implemented in a manner that inspires trust that the desired information is preserved. Moreover, as the scientific community continues to drive innovation in artificial intelligence (AI), important opportunities to apply AI methods to the challenges of scientific data reduction and apply data-reduction techniques to enable scientific AI, continue to present themselves [2-4]. 

The drivers for data reduction techniques constitute a broad and diverse set of scientific disciplines that cover every aspect of the DOE scientific mission. An incomplete list includes light sources, accelerators, radio astronomy, cosmology, fusion, climate, materials, combustion, the power grid, and genomics, all of which have either observatories, experimental facilities, or simulation needs that produce unwieldy amounts of raw data. ASCR is interested in algorithms, techniques, and workflows that can reduce the volume of such data, and that have the potential to be broadly applied to more than one application. Applicants who submit a pre-application that focuses on a single science application may be discouraged from submitting a full proposal. 

Accordingly, a virtual DOE workshop entitled “Data Reduction for Science” was held in January of 2021, resulting in a brochure [5] detailing four priority research directions (PRDs) identified during the workshop. These PRDs are (1) effective algorithms and tools that can be trusted by scientists for accuracy and efficiency, (2) progressive reduction algorithms that enable data to be prioritized for efficient streaming, (3) algorithms which can preserve information in features and quantities of interest with quantified uncertainty, and (4) mapping techniques to new architectures and use cases. For additional background, see [6-9]. 

The principal focus of this FOA is to support applied mathematics and computer science approaches that address one or more of the identified PRDs. Research proposed may involve methods primarily applicable to high-performance computing, to scientific edge computing, or anywhere scientific data must be collected or processed. Significant innovations will be required in the development of effective paradigms and approaches for realizing the full potential of data reduction for science. Proposed research should not focus only on particular data sets from specific applications, but rather on creating the body of knowledge and understanding that will inform future scientific advances. Consequently, the funding from this FOA is not intended to incrementally extend current research in the area of the proposed project. Rather, the proposed projects must reflect viable strategies toward the potential solution of challenging problems in data reduction for science. It is expected that the proposed projects will significantly benefit from the exploration of innovative ideas or from the development of unconventional approaches. Proposed approaches may include innovative research with one or more key characteristics, such as compression, reduced order models, experiment-specific triggers, filtering, and feature extraction, and may focus on cross-cutting concepts such as artificial intelligence or trust. Preference may be given to pre-applications that include reduction estimates for at least two science applications. 

Applicant institutions are limited to both: 

• No more than two pre-applications or applications as the lead institution. 

• No more than one pre-application or application for each PI at the applicant institution. 

Applications in excess of the limited number of submissions will be declined without review. 

Award Ceiling: $3,000,000

Estimated Total Program Funding: $15,000,000

Pre-Application Deadline: March 19, 2024 at 5:00 PM Eastern Time. A pre-application is required.

Institutional Review Requirement : If you are planning to submit a proposal to this FOA, please submit project summary with a list of investigators with affiliations, budget outline and any cost-sharing need to Shawn Chester at  [email protected] by February 9 for internal review for institutional selection for this limited submission funding opportunity. 

Application Deadline: May 7, 2024 at 11:59 PM Eastern Time

Technical/Scientific Program Contact: 

Dr. William Spotz [Primary] 

[email protected]  

Dr. Hal Finkel 

[email protected]  

Dr. Margaret R. Lentz 

[email protected]  

Data Reduction Recipes

  • First Online: 04 July 2022


  • Jochen Heidt 3  

Part of the book series: Astrophysics and Space Science Library (ASSL, volume 467)

550 Accesses

An observer can write excellent observing proposals, be perfectly prepared at the telescope, and think about all necessary calibrations in advance, but he will not be able to extract great science out of his data without careful data reduction. The interplay between atmosphere, telescope, instrument, and sometimes human misbelieving makes data reduction in the near-infrared a real art. In this chapter, we outline the basic concepts of data reduction in the near-infrared. It will be separated into detector-/instrument-related corrections and imaging and spectroscopic data reduction techniques. For the latter, we distinguish between observations taken in integrated and polarized light. Wherever appropriate, we will comment on some mistakes still common across the astronomical community.



Author information

Authors and Affiliations

Zentrum für Astronomie, Landessternwarte Heidelberg, Universität Heidelberg, Heidelberg, Germany

Jochen Heidt



Copyright information

© 2022 Springer Nature Switzerland AG

About this chapter

Heidt, J. (2022). Data Reduction Recipes. In: Astronomy in the Near-Infrared - Observing Strategies and Data Reduction Techniques. Astrophysics and Space Science Library, vol 467. Springer, Cham. https://doi.org/10.1007/978-3-030-98441-0_7


DOI : https://doi.org/10.1007/978-3-030-98441-0_7

Published : 04 July 2022

Publisher Name : Springer, Cham

Print ISBN : 978-3-030-98440-3

Online ISBN : 978-3-030-98441-0

eBook Packages: Physics and Astronomy (R0)

DOE Data Reduction for Science

Please see the full solicitation for complete information about the funding opportunity. Below is a summary assembled by the Research & Innovation Office (RIO).

Program Summary

The Data Reduction for Science program seeks applications to explore potentially high-impact approaches in the development and use of data reduction techniques and algorithms to facilitate more efficient analysis and use of massive data sets produced by observations, experiments and simulation.

The drivers for data reduction techniques constitute a broad and diverse set of scientific disciplines that cover every aspect of the DOE scientific mission. An incomplete list includes light sources, accelerators, radio astronomy, cosmology, fusion, climate, materials, combustion, the power grid, and genomics, all of which have either observatories, experimental facilities, or simulation needs that produce unwieldy amounts of raw data. ASCR is interested in algorithms, techniques, and workflows that can reduce the volume of such data, and that have the potential to be broadly applied to more than one application. Applicants who submit a pre-application that focuses on a single science application may be discouraged from submitting a full proposal.

Accordingly, a virtual DOE workshop entitled “Data Reduction for Science” was held in January of 2021, resulting in a brochure [5] detailing four priority research directions (PRDs) identified during the workshop. These PRDs are (1) effective algorithms and tools that can be trusted by scientists for accuracy and efficiency, (2) progressive reduction algorithms that enable data to be prioritized for efficient streaming, (3) algorithms which can preserve information in features and quantities of interest with quantified uncertainty, and (4) mapping techniques to new architectures and use cases. For additional background, see [6-9].

The principal focus of this program is to support applied mathematics and computer science approaches that address one or more of the identified PRDs. Research proposed may involve methods primarily applicable to high-performance computing, to scientific edge computing, or anywhere scientific data must be collected or processed. Significant innovations will be required in the development of effective paradigms and approaches for realizing the full potential of data reduction for science. Proposed research should not focus only on particular data sets from specific applications, but rather on creating the body of knowledge and understanding that will inform future scientific advances. Consequently, the funding from this FOA is not intended to incrementally extend current research in the area of the proposed project. Rather, the proposed projects must reflect viable strategies toward the potential solution of challenging problems in data reduction for science. It is expected that the proposed projects will significantly benefit from the exploration of innovative ideas or from the development of unconventional approaches.

Proposed approaches may include innovative research with one or more key characteristics, such as compression, reduced order models, experiment-specific triggers, filtering, and feature extraction, and may focus on cross-cutting concepts such as artificial intelligence or trust.

Preference may be given to pre-applications that include reduction estimates for at least two science applications.

CU Internal Deadline: 11:59pm MST February 26, 2024

DOE Pre-Application Deadline: 3:00pm MST March 19, 2024

DOE Application Deadline: 9:59pm MST May 7, 2024

Internal Application Requirements (all in PDF format)

  • Project Narrative (3 pages maximum): Please include: 1) Background/Introduction : explain the importance and relevance of the proposed work as well as a review of the relevant literature; 2) Project Objectives: provide a clear, concise statement of the specific objectives/aims of the proposed project; 3) Proposed Research and Methods : identify the hypotheses to be tested (if any) and details of the methods to be used including the integration of experiments with theoretical and computational research efforts; and 4) Promoting Inclusive and Equitable Research (PIER) Plan : describe the activities and strategies to promote equity and inclusion as an intrinsic element to advancing scientific excellence in the research project within the context of the proposing institution and any associated research group(s).
  • Lead PI Curriculum Vitae and Names and Institutional Affiliations of any Coinvestigators
  • Budget Overview (1 page maximum): A basic budget outlining project costs is sufficient; detailed OCG budgets are not required.

To access the online application, visit: https://cuboulderovcr.secure-platform.com/a/solicitations/6943/home

Eligibility

No more than one pre-application or application for each PI at the applicant institution.

The PI on a pre-application, or application may also be listed as a senior or key personnel, including in any role on a proposed subaward, on an unlimited number of separate submissions.

Teams of multiple institutions may submit collaborative applications. Each submitted application in such a team must indicate that it is part of a collaborative project/group. Every partner institution must submit an application through its own sponsored research office. Each multi-institutional team can have only one lead institution.

Limited Submission Guidelines

No more than two pre-applications or applications as the lead institution.

Award Information

Number of Anticipated Awards: 5-10

Period of Performance: 3 years

Ceiling: $1,000,000 per year

Floor: $150,000 per year

Review Criteria

The internal committee will use DOE’s evaluation criteria (below) for the selection process.

SCIENTIFIC AND/OR TECHNICAL MERIT OF THE PROJECT

  • What is the scientific innovation of the proposed research?
  • What is the likelihood of achieving valuable results?
  • How might the results of the proposed work impact the direction, progress, and thinking in relevant scientific fields of research?
  • How does the proposed work compare with other efforts in its field, both in terms of scientific and/or technical merit and originality?
  • Does the application specify at least one scientific hypothesis motivating the proposed work? Is the investigation of the specified hypothesis or hypotheses scientifically valuable?
  • Is the Data Management Plan suitable for the proposed research? To what extent does it support the validation of research results? To what extent will research products, including data, be made available and reusable to advance the field of research?

APPROPRIATENESS OF THE PROPOSED METHOD OR APPROACH

  • How logical and feasible are the research approaches?
  • Does the proposed research employ innovative concepts or methods?
  • Can the approach proposed concretely contribute to our understanding of the validity of the specified scientific hypothesis or hypotheses?
  • Are the conceptual framework, methods, and analyses well justified, adequately developed, and likely to lead to scientifically valid conclusions?
  • Does the applicant recognize significant potential problems and consider alternative strategies?
  • Is the proposed research aligned with the published priorities identified or incorporated by reference in Section I of this FOA?

COMPETENCY OF APPLICANT’S PERSONNEL AND ADEQUACY OF PROPOSED RESOURCES

  • What is the past performance and potential of the research team?
  • How well qualified is the research team to carry out the proposed research?
  • Are the research environment and facilities adequate for performing the research?
  • Does the proposed work take advantage of unique facilities and capabilities?

REASONABLENESS AND APPROPRIATENESS OF THE PROPOSED BUDGET

  • Are the proposed budget and staffing levels adequate to carry out the proposed research?
  • Is the budget reasonable and appropriate for the scope?

QUALITY AND EFFICACY OF THE PROMOTING INCLUSIVE AND EQUITABLE RESEARCH PLAN

  • Is the proposed Promoting Inclusive and Equitable Research (PIER) Plan suitable for the size and complexity of the proposed project and an integral component of the proposed project?
  • To what extent is the PIER plan likely to lead to participation of individuals from diverse backgrounds, including individuals historically underrepresented in the research community?
  • What aspects of the PIER plan are likely to contribute to the goal of creating and maintaining an equitable, inclusive, encouraging, and professional training and research environment and supporting a sense of belonging among project personnel?
  • How does the proposed plan include intentional mentorship and are the associated mentoring resources reasonable and appropriate?


Data Reduction in Data Mining


Prerequisite – Data Mining

Data reduction aims to produce a condensed description of the original data that is much smaller in volume yet preserves the quality of the original data.

INTRODUCTION:

Data reduction is a technique used in data mining to reduce the size of a dataset while still preserving the most important information. This can be beneficial in situations where the dataset is too large to be processed efficiently, or where the dataset contains a large amount of irrelevant or redundant information.

There are several different data reduction techniques that can be used in data mining, including:

  • Data Sampling: This technique involves selecting a subset of the data to work with, rather than using the entire dataset. This can be useful for reducing the size of a dataset while still preserving the overall trends and patterns in the data.
  • Dimensionality Reduction: This technique involves reducing the number of features in the dataset, either by removing features that are not relevant or by combining multiple features into a single feature.
  • Data Compression: This technique involves using techniques such as lossy or lossless compression to reduce the size of a dataset.
  • Data Discretization: This technique involves converting continuous data into discrete data by partitioning the range of possible values into intervals or bins.
  • Feature Selection: This technique involves selecting a subset of features from the dataset that are most relevant to the task at hand.
It is important to note that data reduction involves a trade-off between data size and accuracy: the more aggressively the data are reduced, the more the accuracy and generalizability of downstream models may suffer.

Data reduction is therefore an important step in data mining: it can improve the efficiency and performance of machine learning algorithms, but the trade-off between data size and accuracy should be assessed before applying it.

Methods of data reduction:

The main methods are explained below.

1. Data Cube Aggregation:   This technique aggregates data into a simpler form. For example, suppose the data gathered for an analysis of the years 2012 to 2014 includes your company’s revenue for every quarter. If the analysis concerns annual sales rather than quarterly figures, the data can be summarized so that the result records total sales per year instead of per quarter.

2. Dimension reduction:   When some attributes carry little useful information for the analysis, we keep only the attributes that are actually needed. This reduces the data size by eliminating outdated or redundant features.

  • Combination of forward and backward selection –   Iteratively selecting the best attributes and removing the worst ones saves time and makes the process faster.

3. Data Compression:   Data compression reduces the size of files using different encoding mechanisms (e.g., Huffman encoding and run-length encoding). It can be divided into two types based on whether information is lost.

  • Lossless Compression –   Encoding techniques such as run-length encoding give a simple, modest reduction in data size. Lossless compression uses algorithms that restore the exact original data from the compressed data.
  • Lossy Compression –   Methods such as the discrete wavelet transform and PCA (principal component analysis) are examples of this kind of compression. For example, the JPEG image format uses lossy compression, yet the decompressed image remains meaningfully equivalent to the original. In lossy compression, the decompressed data may differ from the original data but are still useful enough to retrieve information from.

4. Numerosity Reduction:   In this technique, the actual data are replaced either by a mathematical model or a smaller representation of the data, in which case only the model parameters need to be stored (parametric methods), or by non-parametric representations such as clustering, histograms, and sampling.

5. Discretization & Concept Hierarchy Operation:   Data discretization techniques divide continuous attributes into intervals, replacing many raw attribute values with labels for a small number of intervals. As a result, mining results can be presented in a concise and easily understandable way.

  • Top-down discretization –   If the process starts by choosing one or a few break points (split points) to divide the whole range of values and then repeats this recursively on the resulting intervals, it is known as top-down discretization, also called splitting.
  • Bottom-up discretization –   If the process starts by treating all the distinct values as potential split points and then merges neighboring values into intervals, discarding split points along the way, it is called bottom-up discretization, also called merging.

Concept Hierarchies:   A concept hierarchy reduces the data size by replacing low-level concepts (such as an age of 43) with high-level concepts (categorical values such as middle age or senior).

For numeric data, the following techniques can be applied (a short worked example follows this list):

  • Binning –   Binning converts numerical variables into categorical counterparts; the number of categories depends on the number of bins specified by the user.
  • Equal-frequency partitioning: choosing bin boundaries so that each bin contains approximately the same number of values from the data set.
  • Equal-width partitioning: dividing the value range into bins of fixed width (e.g., bins covering values 0–20, 21–40, and so on).
  • Clustering: grouping similar data together.
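The discretization and sampling ideas above can be tried in a few lines of Python with pandas; the data, bin counts, and labels below are made up purely for illustration.

import numpy as np
import pandas as pd

# Toy attribute: 1,000 random ages (illustrative values only).
rng = np.random.default_rng(0)
ages = pd.Series(rng.integers(18, 90, size=1_000), name="age")

# Equal-width partitioning: 4 bins, each spanning the same range of values.
equal_width = pd.cut(ages, bins=4)

# Equal-frequency partitioning: 4 bins, each holding roughly the same number of rows.
equal_freq = pd.qcut(ages, q=4)

# Concept hierarchy: replace the numeric attribute with coarse categorical labels.
life_stage = pd.cut(ages, bins=[0, 30, 60, np.inf],
                    labels=["young", "middle age", "senior"])

# Data sampling: keep a 10% random subset of the rows.
sample = ages.sample(frac=0.10, random_state=0)

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
print(life_stage.value_counts())
print(len(sample))

Equal-width bins are easy to interpret, whereas equal-frequency bins avoid nearly empty categories when the data are skewed.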

Advantages and Disadvantages of Data Reduction in Data Mining

Data reduction in data mining has both advantages and disadvantages.

Advantages:

  • Improved efficiency: Data reduction can help to improve the efficiency of machine learning algorithms by reducing the size of the dataset. This can make it faster and more practical to work with large datasets.
  • Improved performance: Data reduction can help to improve the performance of machine learning algorithms by removing irrelevant or redundant information from the dataset. This can help to make the model more accurate and robust.
  • Reduced storage costs: Data reduction can help to reduce the storage costs associated with large datasets by reducing the size of the data.
  • Improved interpretability: Data reduction can help to improve the interpretability of the results by removing irrelevant or redundant information from the dataset.

Disadvantages:

  • Loss of information: Data reduction can result in a loss of information, if important data is removed during the reduction process.
  • Impact on accuracy: Data reduction can impact the accuracy of a model, as reducing the size of the dataset can also remove important information that is needed for accurate predictions.
  • Impact on interpretability: Data reduction can make it harder to interpret the results, as removing irrelevant or redundant information can also remove context that is needed to understand the results.
  • Additional computational costs: Data reduction can add additional computational costs to the data mining process, as it requires additional processing time to reduce the data.
In conclusion, data reduction can improve the efficiency and performance of machine learning algorithms by reducing the size of the dataset, but it can also result in a loss of information and make results harder to interpret. It is important to weigh these pros and cons and carefully assess the risks and benefits before applying it.



https://www.nist.gov/ncnr/data-reduction-analysis

NIST Center for Neutron Research

Data Reduction, Visualization and Analysis

Getting your data

Raw data from non-proprietary experiments is available in our data repository at http://dx.doi.org/10.18434/T4201B (also available by ftp:// at the same location), as specified in our Data Management Plan .

Available software

There are a number of software packages available to NCNR users. Some must be used locally at the NCNR and others may be downloaded and used at your home institution. This page lists the available programs organized by topic. Programs that can be downloaded are indicated with a (D); most of these programs can also be run locally on JAZZ. The platforms supported for the downloadable programs are indicated within the web pages.

CRYSTALLOGRAPHY

The following software was originally developed by Brian Toby while he was a staff member at the NCNR. It is currently being maintained at ANL, with links provided here for your convenience. 

  • EXPGUI/GSAS  Reduction and analysis of data for x-ray and neutron diffraction measurements: BT-1 
  • CMPR  Display, index and fit diffraction data
  • other crystallography resources

REFLECTIVITY

Reduction, visualization and analysis of data from the PBR, MAGIK,  and NG-7 reflectometers (with future support for CANDOR).  Support for the reflectivity programs:  paul.kienzle [at] nist.gov (Paul Kienzle)  

  • Reductus online data reduction
  • Refl1D ( D ) advanced fitting with Python
  • Unpolarized neutron reflectivity  (web calculator)
  • Polarized neutron reflectivity  (web calculator)
  • Full details of reflectometry software

INELASTIC NEUTRON SCATTERING

  • DAVE  ( D ) For reduction, visualization and analysis of data from DCS, FCS, HFBS, SPINS, FANS (BT-4), BT2, BT7 and BT9.   Support for DAVE:  richard.azuah [at] nist.gov (Richard Azuah)  
  • XTREAT, TANQENS   For reduction and analysis of data from FCS. Support for XTREAT and TANQENS: taner [at] nist.gov (Taner Yildirim) and craig.brown [at] nist.gov (Craig Brown)  
  • FIT (only inside firewall) ( D ) For fitting generic columnar-formatted datasets. Support for FIT:  ross.erwin [at] nist.gov (Ross Erwin)  

SMALL ANGLE NEUTRON SCATTERING

Support for SANS software: jkrzywon [at] nist.gov (Jeff Krzywon)  

  • IGOR* programs  ( D ) For reduction and visualization of data from USANS, NG-B 30m SANS, NG-B 10m and NG-7 SANS
  • SasView  ( D ) For analysis of data from USANS, NG-B 30m SANS, NG-B 10m and NG-7 SANS

Online calculators

In addition to the resources above, there are a number of web-accessible tools for experimental planning and simulation:

  • Neutron attenuation and activation (BT-1)
  • X-ray absorption calculator  (anl.gov)
  • VSANS Online Calculator (only inside firewall)

The DVA team is responsible for meeting the visualization and analysis software needs of the NCNR scientific staff and user community.  There are a number of software packages listed above that are relatively mature and some that are in the development phase.  The two major efforts currently in the development phase are the DAVE project and reflectivity software development. 


CUPED Explained

Craig Sexauer

CUPED is slowly becoming a common term in online experimentation since its coining by Microsoft in 2013.

Meaning Controlled-experiment Using Pre-Experiment Data, CUPED is frequently cited, and used, as one of the most powerful algorithmic tools for increasing the speed and accuracy of experimentation programs.

In this article, we’ll:

Cover the background of CUPED

Illustrate the core concepts behind CUPED

Show how you can leverage this tool to run faster and less biased experiments

What CUPED solves:

As an experiment matures and hits its target date for readout, it’s not uncommon to see a result that seems to be only barely outside the range where it would be treated as statistically significant. In a frequentist world, this isn’t sufficient evidence that your change caused a change in user behavior.

[Figure: a nearly significant result]

If there was a real effect, you needed a larger sample size to increase your chances of getting a statistically significant result. In an experiment, the standard error or “noise” goes down with the square root of your sample size. However, sample size is an expensive resource, usually proportional to the enrollment window of your experiment.

Waiting for more samples delays your ability to make an informed decision, and it doesn’t guarantee you’ll observe a statistically significant result when there is a real effect.

Even at companies with immense scale like Facebook and Amazon, people have to deal with the pain of waiting for experiments to enroll users and mature because they’re usually looking for relatively small effects.

Consider this: A 0.1% increase to revenue at Facebook is worth upwards of $100 million per year!

For smaller companies, small effect sizes can become infeasible to measure. It would just take too long to get the sample needed to reliably observe a statistically significant change in their target metric.

Because of this cost, a number of methods have been developed in order to decrease the standard error for the same metric and sample size.

CUPED is an extremely popular implementation that uses pre-experiment data to explain away some of the variance in the result data.

The statistical concept behind CUPED

Like many things in experimentation, the core concept behind CUPED is simple, but its implementation can be tricky (and expensive!).

The guiding principle of CUPED is that not all variance in an experiment is random. In fact, a lot of the differences in user outcomes are based on pre-existing factors that have nothing to do with the experiment.

Let’s talk about this for a minute:

Say we want to run a test to see if people run slower with weights attached to them. From a physics perspective, the answer seems pretty obvious. We might record data like this:

[Figure: test group mile times]

If we average out our results, we might clearly see the expected effect, but we might not; there’s a lot of variance and overlap in the observed mile times. It should be pretty clear, however, that how fast the runners already were might be an underlying factor. What if we had asked them to run a mile a week ago to establish a baseline?

[Figure: test group mile times with baselines]

In the context of their “typical” mile time, this effect should be much clearer! We’ve implicitly switched from caring about their raw “mile time” to caring about the difference from what we’d expect!

By doing this, we’ve also “explained” some of the noise and variance in the experiment metric. Before, we saw a difference of 140 seconds between the fastest and slowest runner. Now, we’ve reduced the range in our metric to 65 seconds; this lower range should mean that the variance we’d use to calculate confidence intervals and p-values will be lower.

This is conceptually very similar to the original implementation of CUPED; we use the pre-experiment data for a metric to normalize the post-experiment values. How much we normalize is based on how well the pre-experiment data predicts the experiment data; we’ll dive into this later.

Bias correction

Because experimental groups are randomly assigned, there’s a chance that the two groups randomly have different baseline run times. If you’re unlucky, that difference could even be statistically significant. This means that even if the weights did nothing, you might conclude that there’s a difference between the two groups.

If you have access to that baseline data, it’d be possible to conclude that there was a pre-existing difference and be wary of the results. In the example below, it’s pretty obvious that the difference in the groups before the test would make the results extremely skewed:

[Figure: average mile time, weights versus no weights, before CUPED adjustment]

You might note that the weighted runners’ times went up, and the unweighted runners’ times went down. This relative change does match our expectation. Would it be possible to infer that there is an effect here? Correcting this data with CUPED can help!

Conceptually, if one group has a faster average baseline, their experiment results will also be faster. When we apply a CUPED correction, the faster group’s metric will be adjusted downwards relative to the slower group.

In this example, the post-adjustment averages might move something like this, pushing the weights group’s experiment value higher than the control group. We could follow up with a statistical test to understand if the difference in adjusted values is statistically significant.

[Figure: mile time, weights versus no weights, after CUPED is applied]

Stratification

Some variants of CUPED are ‘non-parametric’ or ‘bucketed’. What this usually means is that (in this example) we would split users into groups based on their pre-experiment run times, and measure metrics relative to the average metric value of that group.

For example, consider the data below - this is for the bucket of users who ran between a 6:30 and 6:40 mile in the baseline:

[Figure: test group data for one stratification bucket]
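As a rough sketch of that bucketed idea (the runners, times, and bucket edges below are made up for illustration, not the post’s data), you can bucket users by baseline and measure each user’s experiment value relative to their bucket’s average:

```python
import pandas as pd

# Hypothetical runners: baseline mile time and experiment mile time, in seconds
df = pd.DataFrame({
    "baseline":  [385, 392, 398, 402, 410, 415, 421, 433],
    "mile_time": [390, 405, 395, 412, 404, 428, 418, 440],
})

# Bucket runners by their baseline pace (e.g., 380-400s, 400-420s, 420-440s)
df["bucket"] = pd.cut(df["baseline"], bins=[380, 400, 420, 440])

# Measure each runner relative to the average experiment time of their bucket
bucket_mean = df.groupby("bucket", observed=True)["mile_time"].transform("mean")
df["relative_time"] = df["mile_time"] - bucket_mean
print(df)
```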

Other variables

More complex implementations of CUPED don’t just rely on a single historical data point for the same metric. They can pull in other information as well, as long as it’s independent of the experiment group the user is in.

In the example above, we could add age group as a factor in the experimentation. This has relatively little to do with our experiment, but could be a major factor in people’s mile times! By including this as a factor in CUPED, we can reduce even more variance.

[Figure: test group data with age group as an additional covariate]

Using CUPED in practice

In practice, we can’t just subtract out a user’s prior values from their experimental values. The reason for this is also conceptually simple—people’s past behavior isn’t always a perfect predictor for their future behavior.

A mental model for the math we’ll use

Before we go further, it’s useful to understand the relationship between experimentation and regression (the ordinary-least-squares or “OLS” regression you’d run in Excel).

A T-test for a given metric is mathematically equivalent to running a regression where the dependent variable is your metric and the independent variable is a user’s experiment group. To demonstrate this, I generated some data for the example experiment above, where users’ paces are based on a randomly-assigned baseline pace and if they’re in the test group.

The population statistics for this are:

[Figure: population statistics for the sample data]

Let’s compare the outputs of running a T-test and running an OLS where we use the 1-or-0 test flag as the independent variable.

[Figure: T-test and OLS outputs for the sample data]

Comparing these, we notice a lot of similarities:

The effect size in our T-test (the delta between test and control) is exactly the same as the “test” variable’s coefficient in the OLS regression.

The standard error for the coefficient is the same as the standard error for our T-test.

The p-value for the “test” variable coefficient is the same as for our t-test!

In short, our standard T-test is basically a regression against a 1-or-0 variable!
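To see that equivalence concretely, here is a minimal sketch on simulated data (the seed, sample size, and effect size are assumptions for illustration, not the post’s actual dataset):

```python
import numpy as np
import scipy.stats as stats
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200

# Simulated experiment: each runner has a baseline pace; weights add ~10 seconds
test_flag = rng.integers(0, 2, size=n)        # 1 = weights, 0 = no weights
baseline = rng.normal(480, 40, size=n)        # each runner's "typical" mile time in seconds
mile_time = baseline + 10 * test_flag + rng.normal(0, 20, size=n)

# Two-sample T-test between test and control
t_stat, t_pval = stats.ttest_ind(mile_time[test_flag == 1], mile_time[test_flag == 0])

# OLS regression of the metric on the 1-or-0 test flag
ols = sm.OLS(mile_time, sm.add_constant(test_flag)).fit()

print("T-test effect size:", mile_time[test_flag == 1].mean() - mile_time[test_flag == 0].mean())
print("OLS test coefficient:", ols.params[1])
print("T-test p-value:", t_pval, " OLS p-value:", ols.pvalues[1])
```

Run as-is, the T-test p-value and the OLS p-value on the test coefficient come out identical, and the OLS coefficient equals the difference in group means.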


When we want to make regressions more accurate, we might add relevant explanatory variables. We can do the same for our test; again, this is the core concept behind CUPED.

Let’s include baseline pace as a factor in our regression. We should expect this to change the regression quite a bit, since it’s such a powerful explanatory variable— and it does .

[Figure: regression output with baseline pace included]
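Here is a self-contained sketch of that step, regenerating the same simulated data as the previous block (again an illustration, not the post’s dataset; the specific numbers in the review that follows come from the post’s own data, not this sketch):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
test_flag = rng.integers(0, 2, size=n)
baseline = rng.normal(480, 40, size=n)
mile_time = baseline + 10 * test_flag + rng.normal(0, 20, size=n)

# Regression on the test flag alone, then with baseline pace as a covariate
plain = sm.OLS(mile_time, sm.add_constant(test_flag)).fit()
with_baseline = sm.OLS(mile_time, sm.add_constant(np.column_stack([test_flag, baseline]))).fit()

print("Test coefficient:", plain.params[1], "->", with_baseline.params[1])
print("Standard error:  ", plain.bse[1], "->", with_baseline.bse[1])
print("p-value:         ", plain.pvalues[1], "->", with_baseline.pvalues[1])
```

The test coefficient barely moves, while its standard error and p-value drop sharply once baseline pace soaks up the non-random variation.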

Let’s review:

The “test” variable’s coefficient (the estimate of the experiment effect) didn’t change much. That’s expected - unless there was a significant difference between the groups before the experiment we should get a similar estimate of the experiment effect.

The standard error (and accordingly the p-value) went down from 4.73 to 2.13. This is because a lot of the noise we previously attributed to our test variable wasn’t random: it was coming from users having different baselines, which we’re now accounting for!

Our p-value goes from 0.116 to 0.000 because of the decreased Standard Error. The result, which was previously not statistically significant, is now clearly significant.

Using CUPED with the baseline pace achieves nearly-identical results. To visualize the reduction in Variance/Standard Error, I plotted the distribution of user paces from this sample dataset before and after I applied CUPED:

[Figure: distribution of user paces before and after CUPED]

When we apply CUPED, we see a large reduction in variance and p-value, just like in the regression results. Using the pre-experiment data reduced the variance, the p-value, and the amount of data we would need to consistently see this result.


CUPED math and implementation

For more details on this, please refer to the 2013 Microsoft white paper. We’ve used many formulas that appear in that paper here.

To reduce variance by using other variables, we’ll need to make adjustments such that we end up with an unbiased estimator of group means that we’ll use in our calculations. An unbiased estimator simply means that the expected value of the estimator is equal to the true value of the parameter we’re estimating.

In practice, this means we need to pick an adjustment that is independent of which test group a user is assigned to.

For the original, simplest implementation of CUPED we’ll refer to our pre-experiment values as X and our experiment values as Y. We’ll adjust Y to get a covariate-adjusted Ycv according to the formula below:

Ycv = Y + θ * (population mean of X) - θ * X

Here, θ could be any derived constant. What this equation means is that, for any θ, we can take two steps:

Multiply the pre-experiment population mean by θ and add it to each user’s result

Subtract from each user’s result θ multiplied by their pre-experiment value

This gives us an unbiased estimator Ycv which factors the covariate into our estimates. We can calculate the variance of the new estimator:

var(Ycv) = var(Y) + θ² * var(X) - 2θ * cov(X, Y)

This is the variance of our adjusted estimator for Y. This variance turns out to be the smallest for:

θ = cov(X, Y) / var(X)

This is the term we’d use to calculate the slope in an OLS regression ! This is also the term we’ll up using in our data transformation - we take all the data in the experiment and calculate this theta. The final variance for our estimator is

var(Ycv) = var(Y) * (1 - ρ²)

where ρ is the correlation between X and Y. The correlation between the pre-experiment and post-experiment data is directly linked to how much the variance is reduced. Note that since ρ is bounded between [-1, 1], this new variance will always be less than or equal to the original variance.

In practice

To create a data pipeline for the basic form of CUPED, you need to carry out the following steps, with X referring to pre-experiment data points and Y to experiment data (a short code sketch follows the list):

Calculate the covariance between Y and X as well as the variance and mean of X. Use this to calculate θ per the formula above.

This requires that users without pre or post-experiment data are included as 0s if they are to be included in the adjustment

For each user, calculate the user’s individual pre-experiment value. It’s common to choose to not apply an adjustment for users who are not eligible for pre-experiment data (for example new users) - this is effectively a one-level striation.

Join the population statistics to the user-level data

Calculate user’s adjusted terms as Y +θ *( population mean of X )- θX

Run and interpret your statistical analysis as you normally would, using the adjusted metrics as your inputs
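Here is a minimal sketch of that pipeline, assuming a pandas DataFrame with columns pre_metric (X), metric (Y), and group; the column names and the simulated data are illustrative assumptions, not a definitive implementation:

```python
import numpy as np
import pandas as pd
from scipy import stats

def cuped_adjust(df, metric="metric", pre_metric="pre_metric"):
    """Add a CUPED-adjusted metric column using theta = cov(X, Y) / var(X)."""
    x = df[pre_metric].fillna(0.0)                    # users without pre-experiment data count as 0
    y = df[metric]
    theta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)    # computed over the whole experiment population
    out = df.copy()
    out["metric_cuped"] = y + theta * x.mean() - theta * x
    return out

# Hypothetical experiment: the pre-period metric predicts the post-period metric well
rng = np.random.default_rng(1)
n = 5000
pre = rng.normal(100, 20, n)
group = rng.integers(0, 2, n)
post = 0.8 * pre + 2.0 * group + rng.normal(0, 10, n)
df = pd.DataFrame({"pre_metric": pre, "metric": post, "group": group})

adj = cuped_adjust(df)
for col in ["metric", "metric_cuped"]:
    t, p = stats.ttest_ind(adj.loc[adj.group == 1, col], adj.loc[adj.group == 0, col])
    print(f"{col}: variance = {adj[col].var():.1f}, p-value = {p:.5f}")
```

On data like this, where the pre-period metric is strongly correlated with the post-period metric, the adjusted column shows a much smaller variance and p-value for the same underlying effect.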

Implications from the CUPED math (above):

There are many covariates we could use for variance reduction; the main requirement is that it is independent of the experiment group which the user is assigned to. Generally, data from before an experiment is safest.

We commonly use the same metric from before the experiment as a covariate because in practice it’s usually a very effective predictor, and it makes intuitive sense in most cases.

We should calculate the group statistics for the pre-experiment/post-experiment data across the entire experiment population, not on a per-group basis, because it’s possible there’s an interaction effect between the treatment and the pre-exposure data. For example, users who run faster may be better equipped to run with weights, and so the correlation between the pre and post-periods would be different than for slower users.

New Users won’t have pre-experiment data. An experiment with no pre-experiment data won’t be able to leverage CUPED. In these cases, the best bet is to use covariates like demographics if possible.

If an experiment has some new users and some established users, you can use CUPED and split the population by another binary covariate: do they have pre-experiment data or not? Functionally, this means you just apply CUPED only on users with pre-experiment data, as discussed above.

CUPED best practices

CUPED is most effective on existing-user experiments where you have access to users’ historical data. For new-user experiments, stratification or other covariates like demographics can be useful, but you won’t be able to leverage as rich a covariate.

CUPED needs historical data to work; this means that you need to make sure your metric data goes back to before the start of the pre-experiment data window.

CUPED’s ability to adjust values is based on how correlated a metric is with its past value for the same user. Some metrics will be very stable for the same user and allow for large adjustments; some are noisy over time for the same user, and you won’t see much of a difference in the adjusted values.

Related reading and resources

How Booking.com increases the power of online experiments

Improving the sensitivity of online controlled experiments by utilizing pre-experiment data

CUPED on Statsig



MIT News | Massachusetts Institute of Technology


A smarter way to streamline drug discovery


The use of AI to streamline drug discovery is exploding. Researchers are deploying machine-learning models to help them identify molecules, among billions of options, that might have the properties they are seeking to develop new medicines.

But there are so many variables to consider — from the price of materials to the risk of something going wrong — that even when scientists use AI, weighing the costs of synthesizing the best candidates is no easy task.

The myriad challenges involved in identifying the best and most cost-efficient molecules to test is one reason new medicines take so long to develop, as well as a key driver of high prescription drug prices.

To help scientists make cost-aware choices, MIT researchers developed an algorithmic framework to automatically identify optimal molecular candidates, which minimizes synthetic cost while maximizing the likelihood candidates have desired properties. The algorithm also identifies the materials and experimental steps needed to synthesize these molecules.

Their quantitative framework, known as Synthesis Planning and Rewards-based Route Optimization Workflow (SPARROW), considers the costs of synthesizing a batch of molecules at once, since multiple candidates can often be derived from some of the same chemical compounds.

Moreover, this unified approach captures key information on molecular design, property prediction, and synthesis planning from online repositories and widely used AI tools.

Beyond helping pharmaceutical companies discover new drugs more efficiently, SPARROW could be used in applications like the invention of new agrichemicals or the discovery of specialized materials for organic electronics.

“The selection of compounds is very much an art at the moment — and at times it is a very successful art. But because we have all these other models and predictive tools that give us information on how molecules might perform and how they might be synthesized, we can and should be using that information to guide the decisions we make,” says Connor Coley, the Class of 1957 Career Development Assistant Professor in the MIT departments of Chemical Engineering and Electrical Engineering and Computer Science, and senior author of a paper on SPARROW.

Coley is joined on the paper by lead author Jenna Fromer SM ’24. The research appears today in Nature Computational Science .

Complex cost considerations

In a sense, whether a scientist should synthesize and test a certain molecule boils down to a question of the synthetic cost versus the value of the experiment. However, determining cost or value is a tough problem in its own right.

For instance, an experiment might require expensive materials or it could have a high risk of failure. On the value side, one might consider how useful it would be to know the properties of this molecule or whether those predictions carry a high level of uncertainty.

At the same time, pharmaceutical companies increasingly use batch synthesis to improve efficiency. Instead of testing molecules one at a time, they use combinations of chemical building blocks to test multiple candidates at once. However, this means the chemical reactions must all require the same experimental conditions. This makes estimating cost and value even more challenging.

SPARROW tackles this challenge by considering the shared intermediary compounds involved in synthesizing molecules and incorporating that information into its cost-versus-value function.

“When you think about this optimization game of designing a batch of molecules, the cost of adding on a new structure depends on the molecules you have already chosen,” Coley says.

The framework also considers things like the costs of starting materials, the number of reactions that are involved in each synthetic route, and the likelihood those reactions will be successful on the first try.

To utilize SPARROW, a scientist provides a set of molecular compounds they are thinking of testing and a definition of the properties they are hoping to find.

From there, SPARROW collects information on the molecules and their synthetic pathways and then weighs the value of each one against the cost of synthesizing a batch of candidates. It automatically selects the best subset of candidates that meet the user’s criteria and finds the most cost-effective synthetic routes for those compounds.

“It does all this optimization in one step, so it can really capture all of these competing objectives simultaneously,” Fromer says.

A versatile framework

SPARROW is unique because it can incorporate molecular structures that have been hand-designed by humans, those that exist in virtual catalogs, or never-before-seen molecules that have been invented by generative AI models.

“We have all these different sources of ideas. Part of the appeal of SPARROW is that you can take all these ideas and put them on a level playing field,” Coley adds.

The researchers evaluated SPARROW by applying it in three case studies. The case studies, based on real-world problems faced by chemists, were designed to test SPARROW’s ability to find cost-efficient synthesis plans while working with a wide range of input molecules.

They found that SPARROW effectively captured the marginal costs of batch synthesis and identified common experimental steps and intermediate chemicals. In addition, it could scale up to handle hundreds of potential molecular candidates.

“In the machine-learning-for-chemistry community, there are so many models that work well for retrosynthesis or molecular property prediction, for example, but how do we actually use them? Our framework aims to bring out the value of this prior work. By creating SPARROW, hopefully we can guide other researchers to think about compound downselection using their own cost and utility functions,” Fromer says.

In the future, the researchers want to incorporate additional complexity into SPARROW. For instance, they’d like to enable the algorithm to consider that the value of testing one compound may not always be constant. They also want to include more elements of parallel chemistry in its cost-versus-value function.

“The work by Fromer and Coley better aligns algorithmic decision making to the practical realities of chemical synthesis. When existing computational design algorithms are used, the work of determining how to best synthesize the set of designs is left to the medicinal chemist, resulting in less optimal choices and extra work for the medicinal chemist,” says Patrick Riley, senior vice president of artificial intelligence at Relay Therapeutics, who was not involved with this research. “This paper shows a principled path to include consideration of joint synthesis, which I expect to result in higher quality and more accepted algorithmic designs.”

“Identifying which compounds to synthesize in a way that carefully balances time, cost, and the potential for making progress toward goals while providing useful new information is one of the most challenging tasks for drug discovery teams. The SPARROW approach from Fromer and Coley does this in an effective and automated way, providing a useful tool for human medicinal chemistry teams and taking important steps toward fully autonomous approaches to drug discovery,” adds John Chodera, a computational chemist at Memorial Sloan Kettering Cancer Center, who was not involved with this work.

This research was supported, in part, by the DARPA Accelerated Molecular Discovery Program, the Office of Naval Research, and the National Science Foundation.


Bias correction of modelled precipitation from CORDEX-CORE experiments over the Upper Teesta River Basin

  • Guchhait, Soumya
  • Sharma, Aka
  • Dimri, A. P.

For assessing the impact of climate change on socio-eco-hydrological systems at the catchment scale, reasonable and reliable meteorological input data are essential. Modelled, gridded meteorological datasets of several fields (e.g., precipitation, temperature, surface moisture) from Regional Climate Models (RCMs) are widely used as inputs to numerical ecological or hydrological models to estimate the climate change impact on natural systems inside a catchment. However, in mountainous regions like the Himalayas, due to the scarcity of station observations and human inaccessibility, most RCMs contain inherent and systematic biases. Hence, before forcing these RCM outputs on ecological or hydrological models, the biases present in them must be corrected. This study aims to provide a set of reasonable, bias-corrected precipitation fields from 12 simulated CORDEX-CORE model experiments, to provide inputs for the estimation of hydrological changes over the mountainous Upper Teesta River Basin (UTRB) situated in the eastern Himalayas. The model precipitation fields from the 12 CORDEX-CORE model experiments and the corresponding observed precipitation field from the CHELSA V2.1 climatic reanalysis were considered for the reference period 1979–2005. Their performances were inter-compared, and linear scaling, distribution mapping, and power transformation bias correction methods were applied to each grid of the precipitation fields from the 12 CORDEX-CORE model experiments. After the application of the bias correction methods, all the CORDEX-CORE model experiments show a reduction in bias. Among the bias correction methods, distribution mapping altered the model precipitation fields while preserving the statistical characteristics of the observed data and was found to be more efficient than the other two, while linear scaling was found to be the worst performing. Although the modified precipitation fields were not forced into a hydrological model in the present study, the evaluation of the performance of the bias-corrected model outputs shows that the precipitation field from the ERAINT-COSMO model experiment corrected with the distribution mapping method could be the best-fit RCM for studying the hydrological impacts of climatic change over data-scarce basins like the UTRB.

  • CORDEX-CORE;
  • Teesta Basin;
  • Bias correction
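The abstract names three bias correction methods: linear scaling, distribution mapping, and power transformation. As a rough illustration of the simplest of these, linear scaling rescales each calendar month of the modelled precipitation by the ratio of observed to modelled monthly means over the reference period. The sketch below is a hypothetical, minimal version (the synthetic grid-cell series and gamma parameters are assumptions, not the study's data or code):

```python
import numpy as np
import pandas as pd

def linear_scaling_precip(model, obs, to_correct=None):
    """Scale each calendar month of a modelled precipitation series by the ratio of
    observed to modelled monthly-mean precipitation over the reference period."""
    if to_correct is None:
        to_correct = model
    ratios = obs.groupby(obs.index.month).mean() / model.groupby(model.index.month).mean()
    months = pd.Series(to_correct.index.month, index=to_correct.index)
    return to_correct * months.map(ratios)

# Hypothetical daily precipitation for one grid cell over the 1979-2005 reference period
dates = pd.date_range("1979-01-01", "2005-12-31", freq="D")
rng = np.random.default_rng(0)
obs = pd.Series(rng.gamma(0.8, 4.0, len(dates)), index=dates)    # "observed" reference series
model = pd.Series(rng.gamma(0.8, 6.0, len(dates)), index=dates)  # wet-biased model output

corrected = linear_scaling_precip(model, obs)
print("Mean daily precip  obs:", round(obs.mean(), 2),
      " model:", round(model.mean(), 2),
      " corrected:", round(corrected.mean(), 2))
```

Distribution mapping, which the study found to perform best, goes further and adjusts the whole distribution (for example via quantile mapping) rather than only the monthly mean.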


A Frequency Domain Kernel Function-Based Manifold Dimensionality Reduction and Its Application for Graph-Based Semi-Supervised Classification


1. Introduction
2. Materials and Methods
  2.1. Related Background
    2.1.1. Discrete Fourier Transformation for Two-Dimensional Image
    2.1.2. High-Frequency Texture Component
  2.2. The Proposed Method
    FMDR for Semi-Supervised Classification
3. Experiments and Discussions
  3.1. Preparations
    3.1.1. Datasets
    3.1.2. The Filter and Parameters
    3.1.3. Comparison Methods and Performance Indicators
  3.2. Visualization of Dimensionality Reduction
  3.3. Semi-Supervised Classification for Facial Images
  3.4. Algorithm Performance with Changes in Labeled Data Proportion
4. Conclusions
Author Contributions
Data Availability Statement
Conflicts of Interest
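Sections 2.1.1 and 2.1.2 concern taking the two-dimensional discrete Fourier transform of an image and isolating its high-frequency texture component. As a rough, generic sketch of that operation (the circular high-pass mask and its radius are illustrative assumptions, not the paper's actual frequency-domain kernel function):

```python
import numpy as np

def high_frequency_component(image, radius=8):
    """Keep only the high-frequency content of a 2-D image: take the 2-D DFT,
    zero a disc of low frequencies around the spectrum centre, and invert."""
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    rows, cols = image.shape
    y, x = np.ogrid[:rows, :cols]
    low_freq = (y - rows // 2) ** 2 + (x - cols // 2) ** 2 <= radius ** 2
    spectrum[low_freq] = 0                       # simple circular high-pass mask
    return np.real(np.fft.ifft2(np.fft.ifftshift(spectrum)))

# Synthetic 64x64 image: a smooth gradient (low frequency) plus fine stripes (high frequency)
img = np.add.outer(np.linspace(0.0, 1.0, 64), np.linspace(0.0, 1.0, 64))
img = img + 0.1 * np.sin(2.0 * np.arange(64))
texture = high_frequency_component(img, radius=5)
print(texture.shape, round(float(texture.std()), 4))
```

The paper's FMDR method presumably builds its kernel on such frequency-domain representations; this sketch shows only the generic high-pass step in isolation.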


Acc
Dataset  FMDR RPCA NMF LE ABNMTF DANMF
AT&T   73.89 57.00 33.02 50.31 34.35 78.35 51.95
ORL    45.59 42.25 32.33 31.85 31.85 41.30 33.60
AR     77.00 38.46 50.23 52.23 50.23 67.62 61.27
Yale   45.59 28.48 32.85 31.85 31.85 41.30 33.60
YaleB  43.30 32.14 31.38 32.63 28.07 45.56 29.05

Pre
Dataset  FMDR RPCA NMF LE ABNMTF DANMF
AT&T   82.81 75.80 71.23 28.31 45.00 30.92 81.56
ORL    58.24 60.05 46.71 31.87 40.99 55.79 46.54
AR     75.54 42.62 48.58 50.34 53.58 70.15 63.11
Yale   58.24 35.09 46.71 31.87 40.99 55.79 46.54
YaleB  41.46 48.30 32.96 31.24 31.99 57.47 29.77

Rec
Dataset  FMDR RPCA NMF LE ABNMTF DANMF
AT&T   79.12 62.64 31.58 48.17 35.97 78.47 55.78
ORL    52.20 46.63 34.36 29.61 38.33 47.61 31.84
AR     78.43 42.76 49.63 48.41 51.04 67.77 63.40
Yale   52.20 30.82 34.36 29.61 38.33 47.61 31.84
YaleB  43.41 33.12 29.84 31.45 28.84 46.04 29.77

F1 score
Dataset  FMDR RPCA NMF LE ABNMTF DANMF
AT&T   39.35 34.01 14.54 23.13 16.73 41.23 35.77
ORL    29.28 24.10 22.05 15.99 22.34 25.23 18.45
AR     43.07 20.59 26.71 26.15 28.84 47.39 42.81
Yale   29.28 17.86 22.05 15.99 22.34 25.23 18.45
YaleB  21.50 19.23 15.68 16.42 16.06 34.01 23.38

Algorithm  Acc  Pre  Rec  F1
FMDR
KNN      4.24 5.63 6.02 3.50
Kmeans   3.62 3.53 3.69 3.77
ATNMTF   5.22 5.89 7.34 5.35
LE       3.55 4.17 4.53 2.95
NMF      4.32 5.37 5.13 3.06
RPCA     3.94 4.55 4.58 3.26
DANMF    3.02 4.21 4.88 3.23

Dataset  Ratio  Acc (FMDR RPCA NMF LE ABNMTF DANMF)
ORL    5%   48.89 22.50 35.83 45.56 35.28 51.39 52.22
ORL    10%  46.11 33.00 35.83 46.11 36.11 52.22 53.61
ORL    15%  53.87 32.75 46.69 60.76 49.07 58.33 58.77
ORL    20%  66.77 42.25 46.46 51.23 30.88 57.99 59.63
ORL    25%  45.50 59.45 64.60 60.48 61.19 62.66
ORL    30%  64.56 45.75 58.48 64.97 60.63 62.77 63.84
AR     5%   53.20 22.31 30.00 30.40 30.40 47.60 44.00
AR     10%  74.03 25.77 46.98 41.30 43.53 53.04 49.00
AR     15%  75.11 27.31 49.33 48.87 50.45 62.53 51.67
AR     20%  77.00 38.46 50.23 52.80 50.23 67.62 61.27
AR     25%  81.63 38.85 57.25 57.95 57.00 73.85 56.36
AR     30%  81.05 40.38 60.21 64.25 58.51 72.13 75.44
Yale   5%   41.33 16.67 25.33 25.33 25.33 32.00 20.67
Yale   10%  45.33 22.03 24.00 24.67 24.67 35.33 25.67
Yale   15%  45.26 25.39 38.52 33.58 35.04 39.71 28.53
Yale   20%  45.26 28.48 32.85 31.85 31.85 41.30 33.60
Yale   25%  48.36 28.27 38.52 45.60 40.32 44.63 41.20
Yale   30%  51.22 30.60 38.21 37.90 38.02 41.94 42.80
YaleB  5%   24.29 13.67 22.70 27.59 22.02 19.95 22.53
YaleB  10%  32.42 21.13 26.72 29.17 26.16 26.35 25.76
YaleB  15%  41.03 24.36 31.91 34.47 33.11 42.40 28.08
YaleB  20%  43.30 32.14 31.38 32.63 28.07 45.56 29.05
YaleB  25%  50.11 37.30 37.38 41.03 37.90 49.02 31.50
YaleB  30%  52.86 40.36 40.16 43.19 40.20 53.72 32.70

Liang, Z.; Gong, R.; Tan, G.; Ji, S.; Zhan, R. A Frequency Domain Kernel Function-Based Manifold Dimensionality Reduction and Its Application for Graph-Based Semi-Supervised Classification. Appl. Sci. 2024, 14, 5342. https://doi.org/10.3390/app14125342


