The SPLASH-2 Programs: Characterization and Methodological Considerations (1995)
SPLASH-2 (vs. SPLASH)
Represent a wider range of computations in the scientific, engineering and graphics domains.
Use better algorithms and implementations.
Are more architecturally aware.
■ Characteristics and Approach
□ Axes of Characterization
Concurrency and load balancing: How many processors can be effectively utilized by that program, assuming a perfect memory system and communication architecture.
Working set: Program’s temporal locality
Communication to computation ratio: Potential impact of communication latency on performance
Spatial locality: Spatial locality and false sharing in the programs
□ Approach to Characterization
Experimental environment
Execution-driven simulation. Simulate a cache-coherent shared address space multiprocessor with physically distributed memory and one processor per node. Each processor has a single-level cache, using a directory-based protocol.
All memory references complete in a single cycle (regardless of hits or misses)
Data are distributed among the processing nodes according to the guidelines.
Data Sets and Scaling
The data sets are small enough to simulate in a reasonable time, yew large enough to be of interest in their problem domain in practice. We fix the number of processors at 32 for most of our characterization.
Inherent versus Practical Characteristics
Focus on these realistic memory system parameters while still trying to approach inherent properties and avoid too many artifacts.
■ The SPLASH-2 Application Suite
It has 8 complete applications and 4 kernels
Barnes Simulates the interaction of a system of bodies in three dimensions over a number of time-steps, using the Barnes-Hut hierarchical N-body method.
Cholesky Factors a sparse matrix into the product of a lower triangular matrix and its transpose.
FFT FFT kernel is a comoplex 1-D version of the radix root n six-step FFT algorithm
FMM Similates a system of bodies over a number of timesteps. Interactions in two dimensions using a different hierarchical N-body method called the adaptive Fast Multipole Method.
LU Factors a dense matrix into the product of a lower triangular and an upper triangular matrix.
Ocean Studies large-scale ocean movements based on eddy and boundary currents.
Radiosity Computes the equilibrium distribution of light in a scene using the iterative hierarchical diffuse radiosity method.
Radix Integer radix sort kernel
Raytrace Renders a three-dimensional scene using ray tracing.
Volrend Renders a three-dimensional volume using a ray casting technique.
Water-Nsquared Evaluates forces and potentials that occur over time in a system of water molecules.
Water-Spatial Solves the same problem as Water-Nsquared, but uses a more efficient algorithm.
■ Concurrency and Load Balance
Concurrency and load balance: how they change with problem size and number of processors
Study how the computational load balance scales with the number of processors by measuring speedups on a PRAM architectural model.
Figure 1: the PRAM speedups for the SPLASH-2 programs for up to 64 processors
Figure 2: the time spent waiting at synchronization points for 32-processor executions of each application.
The reasons for sub-linear speedups: the sizes of the input data sets.
(load imbalance, not-completely parallelized prefix computation, …)
■ Working Sets and Temporal Locality
......
'Architecture' 카테고리의 다른 글
Interconnection Network Topologies (0) | 2009.07.01 |
---|---|
PARSEC vs. SPLASH-2 (0) | 2009.06.16 |
The PARSEC Benchmark Suite (0) | 2009.06.05 |
CMP vs. SMP (0) | 2009.05.26 |
Evaluating MapReduce for Multi-core and Multiprocessor Systems (0) | 2009.05.25 |