Seminar - Dec 28 (Mon)

Architecture 2009. 12. 29. 14:05
Sort vs. Hash Revisited: Fast Join Implementation on Modern Multi-core CPUs

by Changkyu Kim  (Intel's Microprocessor and Programming Research)

1) Introduction

The question: how do we solve DB problems with architecture?

Join is a very heavy DB operation, and its cost grows steeply as table sizes increase.
With larger memory capacities and faster I/O, the old I/O bottleneck has faded and CPU resources have become the critical factor.

There are two kinds of join algorithms: hash join and sort-merge join.

GOAL: Revisit these two join algorithms on modern and future CPUs.

2) Hash Join

A large hash table -> more random memory accesses.
Random memory access is cache-line-unfriendly and TLB-unfriendly, making the join a memory-bound operation.

Main idea: turn this into a compute-bound problem by doing the join computation inside the cache.
Use (4B key, 4B record) tuples in cache, with a partial key and another hash function for the record.

The overall flow is Prolog -> Join -> Epilog.

Prolog: compaction to (4B partial key, 4B record pointer)
Hash join: partition table 1 by key, find the matching keys in table 2, and write the matches to the result table.
Epilog: test for false positives; recover the real key from the partial key.

With this in-memory hash join they achieved the fastest reported performance.

Open question: are (4B key, 4B record) tuples always feasible?
The key issue is how expensive the Prolog and Epilog steps are.
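Below is a minimal Python sketch of this prolog/join/epilog flow (my own illustration, not the paper's implementation): build-side tuples are compacted to (partial key, record id) pairs and partitioned so that each per-partition hash table can stay cache-resident; probing uses the partial key, and the epilog filters false positives against the full key. The 16-bit partial key and all names are assumptions for the example.

def make_partial(key, bits=16):
    """Stand-in for the paper's 4-byte partial key: keep only the low bits."""
    return key & ((1 << bits) - 1)

def hash_join(build, probe, num_partitions=64, bits=16):
    """build, probe: lists of (key, payload). Returns matching (build_payload, probe_payload) pairs."""
    # Prolog: compact build tuples to (partial key, record id) and partition by a hash of the key,
    # so that each partition's hash table is small enough to stay in cache.
    tables = [dict() for _ in range(num_partitions)]
    for rid, (key, _) in enumerate(build):
        table = tables[hash(key) % num_partitions]
        table.setdefault(make_partial(key, bits), []).append(rid)

    results = []
    for key, probe_payload in probe:
        # Join: probe the matching partition's table with the partial key
        candidates = tables[hash(key) % num_partitions].get(make_partial(key, bits), [])
        # Epilog: reject false positives by comparing against the full key
        for rid in candidates:
            full_key, build_payload = build[rid]
            if full_key == key:
                results.append((build_payload, probe_payload))
    return results

R = [(10, 'r10'), (70000, 'r70000'), (3, 'r3')]
S = [(3, 's3'), (10, 's10'), (99, 's99')]
print(hash_join(R, S))   # [('r3', 's3'), ('r10', 's10')]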

3) Sort-merge Join

Instead of the partitioning step, sort the inputs on the join key,
then scan table 2 for the matching keys and merge.

Hash join has so far been known to outperform sort-merge join,
but once wider SIMD becomes available, sort-merge join is expected to be the better fit.
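For comparison, a scalar Python sketch of the sort-merge join logic (the paper's point is that the sort can use SIMD-friendly merge networks, which this does not show): sort both inputs on the join key, then advance two cursors and emit all pairs for each matching key.

def sort_merge_join(r, s):
    """r, s: lists of (key, payload). Returns matching (r_payload, s_payload) pairs."""
    r = sorted(r, key=lambda t: t[0])    # sort both inputs on the join key
    s = sorted(s, key=lambda t: t[0])
    out, i, j = [], 0, 0
    while i < len(r) and j < len(s):
        if r[i][0] < s[j][0]:
            i += 1
        elif r[i][0] > s[j][0]:
            j += 1
        else:
            key, i0, j0 = r[i][0], i, j
            while i < len(r) and r[i][0] == key:
                i += 1
            while j < len(s) and s[j][0] == key:
                j += 1
            # emit the cross product of the tuples that share this key
            for a in range(i0, i):
                for b in range(j0, j):
                    out.append((r[a][1], s[b][1]))
    return out

# sort_merge_join([(3, 'r3'), (10, 'r10')], [(10, 's10'), (3, 's3')])
# -> [('r3', 's3'), ('r10', 's10')]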

4) Evaluation

Prolog and Epilog take a large share of the time, but performance improves even with them included.
Whether (4B key, 4B record) tuples are always feasible remains an open question.


GARNET

Architecture 2009. 9. 15. 22:27
Paper: GARNET: A Detailed On-Chip Network Model inside a Full-System Simulator

GARNET
a detailed cycle-accurate interconnection network model inside the GEMS full-system simulation framework.

The stock GEMS provides only an approximate interconnection model.
This interconnect is a set of links and nodes, with configurable latency and bandwidth,
but it provides no detailed router or network interface
and ignores buffer contention, switch & VC arbitration, realistic link contention, pipeline bubbles, etc.
It also assumes perfect hardware multicast in the routers.

Base GARNET model design

1) State-of-the-art on-chip interconnect
Five-stage, state-of-the-art virtual-channel router.
A router can have any number of input and output ports, depending on the topology.
Flit-level buffering rather than packet-level buffering.
Buffer Write (BW) + Route Computation (RC) -> VC Allocation (VA) -> Switch Allocation (SA) -> Switch Traversal (ST) -> Link Traversal (LT)
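A tiny Python sketch of that stage ordering (my own illustration, not GARNET code): a head flit passes through BW+RC, VA, SA, ST and LT, stalling in VA or SA until the virtual channel or the crossbar is granted.

from enum import Enum

class Stage(Enum):
    BW_RC = 0   # buffer write + route computation (done in the same cycle)
    VA    = 1   # virtual-channel allocation
    SA    = 2   # switch allocation
    ST    = 3   # switch traversal
    LT    = 4   # link traversal

def advance(stage, vc_granted, switch_granted):
    """One-cycle state update for a head flit (body/tail flits skip RC and VA)."""
    if stage == Stage.BW_RC:
        return Stage.VA
    if stage == Stage.VA:
        return Stage.SA if vc_granted else Stage.VA      # stall until a VC is allocated
    if stage == Stage.SA:
        return Stage.ST if switch_granted else Stage.SA  # stall until the crossbar is granted
    if stage == Stage.ST:
        return Stage.LT
    return Stage.BW_RC   # after LT the flit is buffered at the next router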


2) Router microarchitectural components
Separate VC and switch allocators (fast and low complexity).

3) Interactions between the memory system and GARNET
L1 and L2 caches. Shared vs. private L2 cache.

4) Point-to-point ordering
Messages must arrive in the order in which they were sent.
VC and switch allocators support system-level ordering.

5) Network power
Orion power model.

GARNET configuration and statistics


Network traffic patterns

Architecture 2009. 8. 2.

■ Random traffic
Every source sends 1/N of its traffic to every destination.
Load balance is very good (even a very poor topology or routing can look good under random traffic).

■ Permutation
Each source sends all of its traffic to a single destination.

   □ Bit permutation 

Source  Bitcomp   Bitrev    Bitrotation  Shuffle   Transpose
0000    1111(15)  0000(0)   0000(0)      0000(0)   0000(0)
0001    1110(14)  1000(8)   1000(8)      0010(2)   0100(4)
0010    1101(13)  0100(4)   0001(1)      0100(4)   1000(8)
0011    1100(12)  1100(12)  1001(9)      0110(6)   1100(12)
0100    1011(11)  0010(2)   0010(2)      1000(8)   0001(1)
0101    1010(10)  1010(10)  1010(10)     1010(10)  0101(5)
0110    1001(9)   0110(6)   0011(3)      1100(12)  1001(9)
0111    1000(8)   1110(14)  1011(11)     1110(14)  1101(13)
1000    0111(7)   0001(1)   0100(4)      0001(1)   0010(2)
1001    0110(6)   1001(9)   1100(12)     0011(3)   0110(6)
1010    0101(5)   0101(5)   0101(5)      0101(5)   1010(10)
1011    0100(4)   1101(13)  1101(13)     0111(7)   1110(14)
1100    0011(3)   0011(3)   0110(6)      1001(9)   0011(3)
1101    0010(2)   1011(11)  1110(14)     1011(11)  0111(7)
1110    0001(1)   0111(7)   0111(7)      1101(13)  1011(11)
1111    0000(0)   1111(15)  1111(15)     1111(15)  1111(15)


        ㅁ Bit complement : D_i = ~S_i
        ㅁ Bit reverse : D_i = S_(b-i-1) (b = 4 address bits for the 16 nodes of a 4x4 mesh)
           
[bitcomp figures]  [bitrev figures]


        ㅁ Bit rotation : D_i = S_((i+1) mod b)
        ㅁ Shuffle : D_i = S_((i-1) mod b)
        ㅁ Transpose : D_i = S_((i+b/2) mod b)
          
[bitrot figures]  [shuffle figures]  [transpose figure]
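The short Python script below (my own check, not from the source) applies the bit-permutation definitions above to every 4-bit source address and reproduces the destination table; B = 4 because a 4x4 mesh has 16 nodes and therefore 4-bit addresses.

B = 4  # node addresses have B bits (B = 4 for the 16 nodes of a 4x4 mesh)

def bit(s, i):
    return (s >> i) & 1

def bitcomp(s):    # D_i = ~S_i
    return ~s & ((1 << B) - 1)

def bitrev(s):     # D_i = S_(B-i-1)
    return sum(bit(s, B - 1 - i) << i for i in range(B))

def bitrot(s):     # D_i = S_((i+1) mod B)
    return sum(bit(s, (i + 1) % B) << i for i in range(B))

def shuffle(s):    # D_i = S_((i-1) mod B)
    return sum(bit(s, (i - 1) % B) << i for i in range(B))

def transpose(s):  # D_i = S_((i+B/2) mod B)
    return sum(bit(s, (i + B // 2) % B) << i for i in range(B))

for s in range(1 << B):
    print(f"{s:04b} -> {bitcomp(s):2} {bitrev(s):2} {bitrot(s):2} {shuffle(s):2} {transpose(s):2}")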



   □ Digit permutations
        ㅁ Tornado : D_x = (S_x + k/2 - 1) mod k (k = 4 for a 4x4 mesh)
             
[tornado figures]

        ㅁ Neighbor : D_x = (S_x + 1) mod k
             
[neighbor figures]
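Digit permutations operate on the radix-k digits of the address rather than on individual bits. The sketch below is my reading of the formulas, applied digit-wise to the (x, y) coordinates of a 4x4 mesh; note that with k = 4 the tornado offset k/2 - 1 equals 1, so tornado and neighbor happen to coincide here.

K = 4  # radix of each digit; a 4x4 mesh node id has two base-4 digits (x, y)

def digits(node):
    return node % K, node // K        # (x, y) digits of a node id

def node(x, y):
    return y * K + x

def tornado(src):                     # D_x = (S_x + k/2 - 1) mod k, per digit
    x, y = digits(src)
    off = K // 2 - 1
    return node((x + off) % K, (y + off) % K)

def neighbor(src):                    # D_x = (S_x + 1) mod k, per digit
    x, y = digits(src)
    return node((x + 1) % K, (y + 1) % K)

for s in range(K * K):
    print(s, '->', tornado(s), neighbor(s))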



Virtual-Channel Flow Control

Architecture 2009. 7. 8.

An interconnection network is characterized by its topology, routing, and flow control.

Topology: arrangement of nodes and channels into a graph
Routing: how a packet chooses a path in this graph
Flow control: allocation of channel and buffer resources to a packet as it traverses the path

2 types of resources: buffers and channels.
Typically, a single buffer is associated with each channel.
Virtual channels decouple resource allocation by providing multiple buffers for each channel in the network.
In addition to increasing throughput, virtual channels provide an additional degree of freedom in allocating resources to packets in the network.

Virtual Channels:
vc for deadlock avoidance
output queueing or split input queues for partial resource decoupling.

Flow control protocol:
1) how resources (buffers and channel bandwidth) are allocated
2) how packet collisions over resources are resolved

Collisions:
1) blocking P (a packet) in place
2) buffering P in a node prior to where the collision occurs.
3) dropping P
4) misrouting P to a channel other than the one it requires.

Flits:
Since flits carry no routing or sequencing information, allocation must be done in a manner that keeps the flits of a packet together.

Wormhole Routing:
Advance each flit of a packet as soon as it arrives at a node (pipelining) and block packets in place when required resources are unavailable.
1) Reduces the latency of message delivery compared to store-and-forward routing.
2) Requires only a few flit buffers per node.
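A minimal Python sketch of the idea (my own simplification, not code from the paper): each physical channel holds several virtual channels, each with its own flit FIFO; a head flit must win VC allocation, body and tail flits follow on the same VC, and any flit that cannot get a buffer blocks in place, as in wormhole flow control. Class and method names are invented for the example.

from collections import deque
from dataclasses import dataclass

@dataclass
class Flit:
    packet_id: int
    kind: str                # 'head', 'body', or 'tail'

class PhysicalChannel:
    """One physical channel whose buffering is split across several virtual channels."""
    def __init__(self, num_vcs=4, depth=4):
        self.fifos = [deque() for _ in range(num_vcs)]   # one flit FIFO per VC
        self.depth = depth
        self.owner = [None] * num_vcs                    # packet currently holding each VC

    def accept(self, flit):
        """Try to buffer an arriving flit; return False if it must block in place."""
        if flit.kind == 'head':
            # VC allocation: a head flit must win a free virtual channel
            for vc in range(len(self.fifos)):
                if self.owner[vc] is None and len(self.fifos[vc]) < self.depth:
                    self.owner[vc] = flit.packet_id
                    self.fifos[vc].append(flit)
                    return True
            return False                                  # no free VC: block
        # Body/tail flits follow the VC their packet already owns
        for vc in range(len(self.fifos)):
            if self.owner[vc] == flit.packet_id:
                if len(self.fifos[vc]) >= self.depth:
                    return False                          # buffer full: block in place
                self.fifos[vc].append(flit)
                if flit.kind == 'tail':
                    self.owner[vc] = None                 # tail releases the VC
                    # (a real router frees the VC only after the tail drains downstream)
                return True
        return False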


Interconnection Network Topologies

Architecture 2009. 7. 1.

□ Butterfly

(2-ary 4-fly)

k-ary n-fly Butterfly Network:
A butterfly network with n stages of radix-k switches.
Such a network has N = k^n source/destination terminal nodes
and n stages of k^(n-1) switching nodes, each with k inputs and k outputs.
All (n+1)N channels are unidirectional.
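A few lines of Python restating those counts (the 2-ary 4-fly numbers are just arithmetic from the definitions above):

def butterfly_counts(k, n):
    """Terminal, switch, and channel counts for a k-ary n-fly butterfly."""
    terminals = k ** n                  # N = k^n source/destination terminals
    switches = n * k ** (n - 1)         # n stages of k^(n-1) radix-k switches
    channels = (n + 1) * terminals      # unidirectional channels, terminal links included
    return terminals, switches, channels

print(butterfly_counts(2, 4))  # 2-ary 4-fly: (16, 32, 80)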

□ Mesh
A k-ary n-mesh (mesh) topology


□ Cmesh (Concentrated Mesh)
Concentrated mesh topology; the concentration factor 'c' determines how many terminal nodes share each router.



□ Fat tree
The links in a fat-tree become "fatter" as one moves up the tree towards the root.

□ Torus
A k-ary n-cube (torus) topology
A closed surface defined as the product of two circles


□ Quad Tree
A quadtree is a tree data structure in which each internal node has up to four children. 
Quadtrees are most often used to partition a two dimensional space by recursively subdividing it into four quadrants or regions.


□ Flattened Butterfly
"Flattened Butterfly : A Cost-Efficient Topology for High-Radix Networks" ISCA 2007

□ CMO (Concentrated Multi-dimensional Octagon)
Created by Sang Kyun Kim and Wook-Jin Chung from EE382C

□ MECS (Multidrop Express Channels)
"Scalable On-Chip Interconnect Topologies"



PARSEC vs. SPLASH-2

Architecture 2009. 6. 16. 16:24
PARSEC vs. SPLASH2: A Quantitative Comparison of Two Multithreaded Benchmark Suites on Chip Multiprocessors
Princeton University Technical Report TR-818-08, March 2008

ABSTRACT
We look at the differences and similarities between the SPLASH-2 and PARSEC benchmark suites.
Standard statistical methods and machine learning are used to analyze redundancy and overlap on CMPs.

1. INTRODUCTION
PARSEC: Princeton Application Repository for Shared-Memory Computers
The result of a joint venture between Intel and Princeton University; a collection of up-to-date workloads for CMPs.

How does PARSEC differ from other benchmark suites?
SPLASH-2 and SPEC OMP2001 also cover several domains but focus on High-Performance Computing.
BioParallel covers bioinformatics programs,
ALPBench is a suite for multimedia workloads,
and MineBench covers data mining.

SPLASH-2 is currently the most widely used suite for scientific studies (of parallel machines with shared memory),
and like PARSEC it is not restricted to a single application domain.
However, PARSEC offers newer programs and a broader range of application domains than SPLASH-2.

In this paper:
- Compare SPLASH-2 and PARSEC: how many programs overlap?
- Identify how similar the two suites are.
- Ask whether current technology trends are changing programs, in light of the spread of CMPs and the massive growth of world data.

2. OVERVIEW
Although SPLASH-2 is one of the most widely used multithreaded workload suites,
it was released in the 90s, when parallel machines were still expensive and uncommon,
so the majority of its workloads are confined to the High-Performance Computing domain.

PARSEC was released in 2008 and follows these five requirements:
- Multithreaded Applications: parallelized to take advantage of multiprocessor computers with shared memory
- Emerging Workloads: focused on new applications that require a lot of processing power
- Diverse: covers a wide range of application domains
- Employ State-of-Art Techniques: includes the most recent algorithms and techniques in each field
- Support Research: provides an infrastructure that allows instrumentation and manipulation

PARSEC includes input sets that reflect current computing problems.
Because of its age, SPLASH-2 no longer reflects current problem sizes.

3. METHODOLOGY
A set of interesting characteristics
Execution-driven simulation to obtain the relevant data
Standard statistical method to compute the similarity of the workloads

3.1 Program Characteristics
Characteristics were chosen to reflect thread communication and how data is shared on a CMP.
The first four characteristics describe what kind of program it is; the remaining five reflect data usage and communication, such as the total/shared working set and how intensively the program works with shared data.

Cache-related characteristics can vary with cache size; we restrict ourselves to eight cache sizes from 1MB to 128MB.
This gives 54 characteristics in total for each of the 26 workloads (14 from SPLASH-2, 12 from PARSEC):
- Instruction Mix: 4 characteristics
- Working Sets: 8 characteristics (1 x 8 cache sizes)
- Sharing: 42 characteristics (why?)

3.2 Experimental Setup
Simulate abstract cache hierarchy with CMP$im
Preprocess chosen characteristics with Principal Component Analysis (PCA) to eliminate correlation
Compute similarity with hierarchical clustering
Visualize results with dendrograms and scatter plots

3.3 Removing Correlated Data
Correlated information needs to be removed using PCA (Principal Component Analysis),
a method commonly used for redundancy analysis.
PCs: linear combinations of the original variables

3.4 Measuring Similarity
Euclidean distance is used to measure the similarity of programs (a sketch of the pipeline is below).
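A rough Python sketch of this methodology as I understand it (not the paper's actual scripts): standardize the workload-by-characteristic matrix, remove correlation with PCA, then cluster hierarchically on Euclidean distances. The random placeholder data, the 90% variance cutoff, and average linkage are my assumptions.

import numpy as np
from sklearn.decomposition import PCA
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

# X: one row per workload, one column per characteristic (e.g. 26 x 54); random placeholder values
rng = np.random.default_rng(0)
X = rng.normal(size=(26, 54))

# Standardize, then remove correlated dimensions with PCA
X = (X - X.mean(axis=0)) / X.std(axis=0)
pcs = PCA(n_components=0.9).fit_transform(X)   # keep components explaining ~90% of the variance

# Similarity = Euclidean distance between workloads in PC space; then cluster hierarchically
dist = pdist(pcs, metric='euclidean')
tree = linkage(dist, method='average')
# scipy.cluster.hierarchy.dendrogram(tree) would draw the kind of dendrogram shown in the paper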

4. REDUNDANCY ANALYSIS RESULTS
- Diversity measured from total variance
SPLASH-2: 19.55, PARSEC: 18.98 - almost the same.

- Direct comparison
Analysis using a single PCA (all characteristics weighted equally).
PARSEC is considerably more diverse than SPLASH-2.
Many SPLASH-2 programs are highly redundant (e.g., the two versions of lu and water); only the ocean codes stand out.
Excluding non-contiguous ocean, most of them are fairly similar to one another.
Seven SPLASH-2 workloads fall within a distance of about d = 0.42.
The workloads at the top are about 0.72 away from the other clusters and can therefore be considered unique within the program collection.
From PARSEC, only bodytrack and vips are similar to SPLASH-2.

4.1 Multiple Differences
Instruction Mix Differences
Working Set Differences
Sharing Behavior Differences
= No single source for the differences of the two suites.





The PARSEC Benchmark Suite: Characterization and Architectural Implications (2008)

PARSEC Wiki
http://wiki.cs.princeton.edu/index.php/PARSEC

A benchmark suite for studies of CMPs (Chip-Multiprocessors)
Diverse in working set, locality, data sharing, synchronization, and off-chip traffic.

existing benchmark suites cannot be considered adequate to describe future CMP applications.

■ Motivation

□ Requirements for a Benchmarks Suite

Multi-threaded Applications
Emerging Workloads
Diverse
Employ State-of-Art Techniques
Support Research

□ Limitations of Existing Benchmark Suites

SPLASH-2
Program collection is skewed towards HPC and graphics programs.
Does not include parallelization models such as the pipeline model.
SPEC CPU2006 and OMP2001
Provide a snapshot of current scientific and engineering applications
Workloads such as systems programs and parallelization models which employ the producer-consumer model are not included.
SPEC CPU2006 is a suite of serial programs.
Other Benchmark Suites
Designed to study a specific program area and limited to a single application domain.

■ The PARSEC Benchmark Suite

9 applications and 3 kernels (+PARSEC 2.0 includes RayTrace)

□ Input Sets
test, simdev, simsmall, simmedium, simlarge, native

□ Workloads

Blackscholes (Financial Analysis)
This application is an Intel RMS benchmark. It calculates the prices for a portfolio of European options analytically with the Black-Scholes partial differential equation (PDE). There is no closed-form expression for the Black-Scholes equation and as such it must be computed numerically.
Bodytrack (Computer Vision)
This computer vision application is an Intel RMS workload which tracks a human body with multiple cameras through an image sequence. This benchmark was included due to the increasing significance of computer vision algorithms in areas such as video surveillance, character animation and computer interfaces.
Canneal (Engineering)
This kernel was developed by Princeton University. It uses cache-aware simulated annealing (SA) to minimize the routing cost of a chip design. Canneal uses fine-grained parallelism with a lock-free algorithm and a very aggressive synchronization strategy that is based on data race recovery instead of avoidance.
Dedup (Enterprise Storage)
This kernel was developed by Princeton University. It compresses a data stream with a combination of global and local compression that is called 'deduplication'. The kernel uses a pipelined programming model to mimic real-world implementations. The reason for the inclusion of this kernel is that deduplication has become a mainstream method for new-generation backup storage systems.
Facesim (Animation)
This Intel RMS application was originally developed by Stanford University. It computes a visually realistic animation of the modeled face by simulating the underlying physics. The workload was included in the benchmark suite because an increasing number of animations employ physical simulation to create more realistic effects.
Ferret (Similarity Search)
This application is based on the Ferret toolkit which is used for content-based similarity search. It was developed by Princeton University. The reason for the inclusion in the benchmark suite is that it represents emerging next-generation search engines for non-text document data types. In the benchmark, we have configured the Ferret toolkit for image similarity search. Ferret is parallelized using the pipeline model.
Fluidanimate (Animation)
This Intel RMS application uses an extension of the Smoothed Particle Hydrodynamics (SPH) method to simulate an incompressible fluid for interactive animation purposes. It was included in the PARSEC benchmark suite because of the increasing significance of physics simulations for animations.
Freqmine (Data Mining)
This application employs an array-based version of the FP-growth (Frequent Pattern-growth) method for Frequent Itemset Mining (FIMI). It is an Intel RMS benchmark which was originally developed by Concordia University. Freqmine was included in the PARSEC benchmark suite because of the increasing use of data mining techniques.
Raytrace (added in PARSEC 2.0; Graphics?)
This Intel RMS application uses a version of the raytracing method that would typically be employed for real-time animations such as computer games. It is optimized for speed rather than realism. The computational complexity of the algorithm depends on the resolution of the output image and the scene.
Streamcluster (Data Mining)
This RMS kernel was developed by Princeton University and solves the online clustering problem. Streamcluster was included in the PARSEC benchmark suite because of the importance of data mining algorithms and the prevalence of problems with streaming characteristics.
Swaptions (Financial Analysis)
The application is an Intel RMS workload which uses the Heath-Jarrow-Morton (HJM) framework to price a portfolio of swaptions. Swaptions employs Monte Carlo (MC) simulation to compute the prices.
Vips (Media Processing)
This application is based on the VASARI Image Processing System (VIPS) which was originally developed through several projects funded by European Union (EU) grants. The benchmark version is derived from a print on demand service that is offered at the National Gallery of London, which is also the current maintainer of the system. The benchmark includes fundamental image operations such as an affine transformation and a convolution.
X264 (Media Processing)
This application is an H.264/AVC (Advanced Video Coding) video encoder. H.264 describes the lossy compression of a video stream and is also part of ISO/IEC MPEG-4. The flexibility and wide range of application of the H.264 standard and its ubiquity in next-generation video systems are the reasons for the inclusion of x264 in the PARSEC benchmark suite.

■ Methodology

Parallelization
Working sets and locality
Communication to computation ratio and sharing
Off-chip traffic



The SPLASH-2 Programs: Characterization and Methodological Considerations (1995)

SPLASH-2 (vs. SPLASH)

Represent a wider range of computations in the scientific, engineering and graphics domains.
Use better algorithms and implementations.
Are more architecturally aware.

■ Characteristics and Approach

□ Axes of Characterization

Concurrency and load balancing: How many processors can be effectively utilized by that program, assuming a perfect memory system and communication architecture.
Working set: Program’s temporal locality
Communication to computation ratio: Potential impact of communication latency on performance
Spatial locality: Spatial locality and false sharing in the programs

□ Approach to Characterization

Experimental environment
Execution-driven simulation. Simulate a cache-coherent shared address space multiprocessor with physically distributed memory and one processor per node. Each processor has a single-level cache, using a directory-based protocol.
All memory references complete in a single cycle (regardless of hits or misses)
Data are distributed among the processing nodes according to the guidelines.
Data Sets and Scaling
The data sets are small enough to simulate in a reasonable time, yet large enough to be of interest in their problem domain in practice. We fix the number of processors at 32 for most of our characterization.
Inherent versus Practical Characteristics
Focus on these realistic memory system parameters while still trying to approach inherent properties and avoid too many artifacts.

■ The SPLASH-2 Application Suite

It has 8 complete applications and 4 kernels

Barnes Simulates the interaction of a system of bodies in three dimensions over a number of time-steps, using the Barnes-Hut hierarchical N-body method.
Cholesky Factors a sparse matrix into the product of a lower triangular matrix and its transpose.
FFT The FFT kernel is a complex 1-D version of the radix-√n six-step FFT algorithm.
FMM Simulates a system of bodies over a number of timesteps; interactions are in two dimensions, using a different hierarchical N-body method called the adaptive Fast Multipole Method.
LU Factors a dense matrix into the product of a lower triangular and an upper triangular matrix.
Ocean Studies large-scale ocean movements based on eddy and boundary currents.
Radiosity Computes the equilibrium distribution of light in a scene using the iterative hierarchical diffuse radiosity method.
Radix Integer radix sort kernel
Raytrace Renders a three-dimensional scene using ray tracing.
Volrend Renders a three-dimensional volume using a ray casting technique.
Water-Nsquared Evaluates forces and potentials that occur over time in a system of water molecules.
Water-Spatial Solves the same problem as Water-Nsquared, but uses a more efficient algorithm.

■ Concurrency and Load Balance

Concurrency and load balance: how they change with problem size and number of processors
Study how the computational load balance scales with the number of processors by measuring speedups on a PRAM architectural model.

Figure 1: the PRAM speedups for the SPLASH-2 programs for up to 64 processors
Figure 2: the time spent waiting at synchronization points for 32-processor executions of each application.

The reasons for sub-linear speedups: the sizes of the input data sets.
(load imbalance, not-completely parallelized prefix computation, …)

■ Working Sets and Temporal Locality

......


CMP vs. SMP

Architecture 2009. 5. 26. 18:35
Before multi-core, a multiprocessor system consisted of several processors sharing a large memory over a bus, with each processor having its own independent cache. Such a system is called an SMP (Shared or Symmetric Multiprocessor).

Because memory is shared among the processors, an SMP has to handle memory data coherence and synchronization between processes. When the same memory data is cached by two or more processors and one processor changes its cached copy, the new data must be reflected in the other processors' caches so that the data stays consistent. A well-known scheme is the snoopy protocol, in which a cache announces changes on the bus so that other processors holding the same data update their copies accordingly.

Also, when two or more processors try to modify the same memory data at the same time, one processor's update can be lost to another's. Synchronization between processes is usually handled by locking the memory and preventing other processes from storing data to the same location.

A multi-core system can be seen as an instance of SMP: Figure 8 shows several cores on one chip sharing memory. However, since the L2 cache moves onto the multi-core chip, the memory hierarchy becomes more complex, with three levels: L1, L2, and RAM. A system built from such multi-core chips is called a CMP (Chip Multiprocessor). Compared with an SMP, the cores of a CMP sit on the same chip, so data exchange between cores takes much less time and far more data can be exchanged at once.

Since a CMP can be regarded as a kind of SMP, it must also deal with the SMP issues of memory data coherence and inter-process synchronization.

Multi-core chips can now be connected to build a larger system, which becomes a cluster system like the one shown in Figure 9. In a cluster system the memory no longer has to be shared by all the multi-core chips, and the system can be operated as a distributed system with an independent operating system per multi-core node.



Looked this up because the explanation is good. Source: http://www.epnc.co.kr/article/view_serial.asp?se=8&article_idx=9148

'Architecture' 카테고리의 다른 글

Interconnection Network Topologies  (0) 2009.07.01
PARSEC vs. SPLASH-2  (0) 2009.06.16
The PARSEC Benchmark Suite  (0) 2009.06.05
The SPLASH-2 Programs  (0) 2009.06.05
Evaluating MapReduce for Multi-core and Multiprocessor Systems  (0) 2009.05.25
블로그 이미지

민둥

,
for presentation on 5/28
Evaluating MapReduce for Multi-core and Multiprocessor Systems (HPCA 2007)

[ Abstract ]

This paper evaluates the suitability of the MapReduce model for multi-core and multi-processor systems.
MapReduce allows programmers to write functional-style code that is automatically parallelized and scheduled in a distributed system.

* Phoenix: an implementation of MapReduce for shared-memory systems that includes a programming API and an efficient runtime system.

What they did: 
1. Study Phoenix with multi-core and symmetric multiprocessor systems.
2. Compare MapReduce code to code written in lower-level APIs such as P-threads.

[ 1 Introduction ]

Traditional parallel programming techniques (ex. message passing and shared-memory threads)
require programmers to do:
Manage concurrency explicitly by creating threads and synchronizing them through messages or locks.
Manual management of data locality.
-> Very difficult to write correct and scalable parallel code for non-trivial algorithms.
-> must re-tune the code when the application is ported to a different or larger-scale system.

To simplify parallel coding:
1. Practical programming model
2. Efficient runtime system

Phoenix: a programming API and runtime system based on Google's MapReduce model.
Map: processes the input data & generates a set of intermediate key/value pairs.
Reduce: properly merges the intermediate pairs which have the same key.

Phoenix implementation is based on the same principles but targets shared-memory systems such as multi-core chip and symmetric multiprocessors.
Compare the performance of Phoenix code to tuned parallel code written directly with P-threads.

[ 2 MapReduce Overview ]

==== 2.1 Programming Model

MapReduce: is inspired by functional languages and targets data-intensive computations.
Input data format is application specific, and is specified by the user.
Output is a set of <key, value> pairs.

Map is applied on the input data and produces a list of intermediate <key, value> pairs.
can be executed in parallel on non-overlapping portion of the input data.

Reduce is applied to all intermediate pairs with the same key.
can be executed in parallel on each set of intermediate pairs with same key.

Main benefit: Simplicity!
The programmer provides a simple description of the algorithm that focuses on functionality and not on parallelism. The actual parallelization and the details of concurrency management are left to the runtime system.
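A toy Python word count in this style (my own example, not Phoenix's C API), just to make the model concrete: Map emits <word, 1> pairs for its chunk of the input, and Reduce sums the values that share a key.

from collections import defaultdict

def word_count_map(chunk):
    """Map: emit an intermediate <word, 1> pair for every word in this chunk of input."""
    return [(word, 1) for word in chunk.split()]

def word_count_reduce(word, counts):
    """Reduce: merge all intermediate values that share the same key."""
    return word, sum(counts)

def run_mapreduce(chunks, map_fn, reduce_fn):
    intermediate = defaultdict(list)
    for chunk in chunks:                       # each chunk could be mapped by a different worker
        for key, value in map_fn(chunk):
            intermediate[key].append(value)
    return [reduce_fn(k, vs) for k, vs in intermediate.items()]

print(run_mapreduce(["the cat sat", "the cat ran"], word_count_map, word_count_reduce))
# [('the', 2), ('cat', 2), ('sat', 1), ('ran', 1)]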

==== 2.2 Runtime System

The MapReduce runtime is responsible for parallelization and concurrency control.

To parallelize the Map, it splits the input pairs into units that are processed concurrently on multiple nodes. Next, it partitions the intermediate pairs using a scheme that keeps pairs with the same key in the same unit (processed in parallel by Reduce tasks). Finally, it must merge and sort the output pairs from all Reduce tasks.

Factors: the size of units, the number of nodes, how units are assigned to nodes dynamically, how buffer space is allocated.

Optimization: 
Reduce function-call overheads by increasing the granularity of Map or Reduce tasks.
Reduce load imbalance by adjusting task granularity or the number of nodes used.
Optimize locality -> prefetch pairs for its current Map or Reduce tasks and prefetch the input for its next Map or Reduce task.

Fault tolerance: when the runtime detects that a node has failed, it re-assigns the Map or Reduce task that node was processing at the time to another node. To avoid interference, the replicated task uses separate output buffers.

Runtime can dynamically adjust the number of nodes.
There can be non-trivial overheads due to key management, data copying, data sorting, or memory allocation between execution steps.

[ 3 The Phoenix System ]

MapReduce for shared-memory systems.
Efficient execution on multiple cores without burdening the programmer with concurrency management.

Simple API: visible to application programmers
 & Efficient Runtime: handles parallelization, resource management, and fault recovery.

==== 3.1 The Phoenix API

==== 3.2 The Phoenix Runtime

======== 3.2.1 Basic Operation and Control Flow

The runtime is controlled by the scheduler, which is initiated by user code.
The scheduler: 
Creates and manages the threads that run Map and Reduce tasks. 
Manages the buffers used for task communication.
Determines the number of cores to use for this computation.
Spawns a worker thread that is dynamically assigned some number of Map and Reduce tasks.

Map Stage>
Splitter: divide input pairs into equally sized units to be processed by the Map tasks.
Map tasks are allocated dynamically to workers and each one emits <key, value> pairs.
Partition: splits the intermediate pairs into units for the Reduce tasks.
All values of the same key go to the same unit.
The scheduler must wait for all Map tasks to complete before initiating the Reduce stage.

Reduce Stage>
Reduce tasks are also assigned to workers dynamically.
Must process all values for the same key in one task -> may exhibit higher imbalance across workers and dynamic scheduling is more important. Final output from all tasks is merged into a single buffer, sorted by keys.
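A toy shared-memory runtime in Python that follows this control flow - splitter, Map tasks handed to a worker pool, key-based partitioning, a barrier before the Reduce stage, then merge and sort. This is only a sketch of the flow described above, not the Phoenix C implementation; the names and the thread-pool choice are mine.

from concurrent.futures import ThreadPoolExecutor
from collections import defaultdict

def run_mapreduce_runtime(data, splitter, map_fn, partition_fn, reduce_fn,
                          workers=4, reducers=4):
    """Toy runtime following the control flow above (not the actual Phoenix scheduler)."""
    units = splitter(data)                                   # Splitter: input -> Map units
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map stage: tasks are handed out to workers; each emits <key, value> pairs
        map_outputs = list(pool.map(map_fn, units))

    # Partition: all values of the same key go to the same Reduce unit
    buckets = [defaultdict(list) for _ in range(reducers)]
    for pairs in map_outputs:
        for key, value in pairs:
            buckets[partition_fn(key) % reducers][key].append(value)

    # Reduce stage starts only after every Map task has completed
    with ThreadPoolExecutor(max_workers=workers) as pool:
        reduce_outputs = list(pool.map(
            lambda bucket: [reduce_fn(k, vs) for k, vs in bucket.items()], buckets))

    # Merge the per-task outputs into one result, sorted by key
    return sorted(pair for out in reduce_outputs for pair in out)

# e.g. word count:
# run_mapreduce_runtime(text, lambda t: t.splitlines(),
#                       lambda line: [(w, 1) for w in line.split()],
#                       hash, lambda k, vs: (k, sum(vs)))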

======== 3.2.2 Buffer Management

All buffers are allocated in shared memory.

Map-Reduce buffers: store the intermediate output pairs. Each worker has its own set of buffers.
Reduce-Merge buffers: store the outputs of Reduce tasks before they are sorted. After sorting, the final output is available in the user allocated Output_data buffer.

======== 3.2.3 Fault Recovery

Phoenix detects faults through timeouts.
The execution time of similar tasks on other workers is used as a yardstick for the timeout interval.

Once a fault is detected or at least suspected, the runtime attempts to re-execute the failed task. Since the original task may still be running, separate output buffers are allocated for the new task to avoid conflict and data corruption.

The current Phoenix code does not provide fault recovery for the scheduler itself.

Google's MapReduce -> always spawn redundant executions of the remaining tasks. As they proactively assume that some workers have performance or failure issues.

======== 3.2.4 Concurrency and Locality Management

Three scheduling approaches
1. Default policy for the specific system
2. Dynamically determine the best policy for each decision
3. Allow the programmer to provide application specific policies.

* Number of Cores and Workers/Core:
Spawn workers to all available cores.
In systems with multithreaded cores, we spawn one worker per hardware thread.

* Task Assignment:
To achieve load balance, we always assign Map and Reduce tasks to workers dynamically.

* Task size:
Each Map task processes a unit of the input data.
Adjust the unit size so that the input and output data for a Map task fit in the L1 data cache.
Tradeoff -> lower overheads (fewer, larger units) & better load balance (more, smaller units)

* Partition Function:
Determines the distribution of intermediate data. 

[ 4 Methodology ]

==== 4.1 Shared Memory Systems

CMP: UltraSparc T1 multi-core chip with 8 multithreaded cores sharing the L2 cache.
SMP: symmetric multiprocessor with 24 chips.

Promise: the same program should run as efficiently as possible on any type of shared-memory system without any involvement by the user.

==== 4.2 Applications

From domains>
Enterprise computing: WordCount, ReverseIndex, StringMatch
Scientific computing: Matrix Multiply
Artificial intelligence: Kmeans, PCA, Linear Regression
Image processing: Histogram

MapReduce version using Phoenix
& conventional parallel version using P-threads.

* WordCount: determine frequency of words in a file.
Map: process different sections of the input files -> <word, 1> (if the word was found)
Reduce: add up the values for each word

* MatrixMultiply: dense integer matrix multiplication.
Map: computes the results for a set of rows of the output matrix -> <(x,y), value>
Reduce: identity function

* Reverse Index: Build reverse index for links in HTML files.
Map: parses a collection of HTML files -> <link_key, file_info>
Reduce: combines all files referencing the same link into a single linked list.

* Kmeans: Iterative clustering algorithm to classify 3D data points into groups.
Phoenix scheduler is called multiple times until it converges.
Map: finds distance between each point and each mean and assigns the point to the closest cluster.
emits <cluster_id, data_vector>
Reduce: gathers all points with the same cluster_id and finds their centroid (mean vector)
emits <cluster_id, mean vector>

* String Match: Search a file of keys for an encrypted word.
Map: processes the "encrypt" file and the "keys" file, parses a portion of the "keys" file, and returns
<a word in the "keys" file, a flag indicating whether it was a match>
Reduce: identity function.

* PCA: Principal components analysis on a matrix.
Finds the mean vector and the covariance matrix of a set of data points, using 2 MapReduce iterations.
Map 1: computes the mean for a set of rows -> <row_number, mean>
Map 2: computes a few elements of the required covariance matrix, given the data needed to calculate those elements -> <(row, column), covariance>
Reduce: identity function

* Histogram: Determine the frequency of each RGB component in a set of images.
Map: processes different portions of the image -> <image, frequency of component occurrence>
Reduce: sums up the numbers.

* Linear Regression: Compute the best fit line for a set of points.
Map: processes different portions of the file, compute certain summary statistics like sum of squares.
Reduce: finally determine the best fit line.

[ 5 Evaluation ]

Evaluation results for Phoenix using the CMP and SMP shared-memory systems.

==== 5.1 Basic Performance Evaluation

==== 5.2 Dependency to Dataset Size

Increasing the dataset leads to higher speedups over the sequential version for most applications.
Larger dataset allows the Phoenix runtime to better amortize its overheads.
Caching effects are more significant when processing large datasets and load imbalance is more rare.

==== 5.3 Dependency to Unit Size

The unit size determines the number of Map tasks, their memory footprint, and how well their overhead is amortized.
Applications with short term temporal locality in their access patterns perform better with smaller units.
Large units can lead to better performance for some applications which have significant portion of time spent on spawning tasks and merging their outputs.

==== 5.4 Comparison to Pthreads

For the applications that fit naturally into the MapReduce model, Phoenix leads to similar or slightly better speedups.
For Kmeans, PCA, and Histogram, P-threads outperform Phoenix significantly. For these applications, MapReduce program structure is not an efficient fit.

==== 5.5 Fault Recovery
