Papers presented on June 11

Lab.work 2009. 6. 15. 22:15
발표논문 1 : Coordinated Management of Multiple Interacting Resources in Chip Multiprocessors: A Machine Learning Approach (Micro 2008)
발표논문 2 : Bandwidth adaptive snooping (HPCA 2002)


Papers presented on June 4

Lab.work 2009. 6. 9. 00:12
발표논문 1 : Self-Optimizing Memory Controllers: A Reinforcement Learning Approach (ISCA 2008)
발표논문 2 : Virtual Hierarchies to Support Server Consolidation (ISCA 2007)



PROGRESS REPORT (6/8)

Lab.work 2009. 6. 9. 00:09
■ THIS WEEK - WHAT I DID (6/2~6/8)

SigMP seminar (6/4)
Self-Optimizing Memory Controllers: A Reinforcement Learning Approach (ISCA 2008)
Virtual Hierarchies to Support Server Consolidation (ISCA 2007)

Phoenix
I wanted to test performance using 4 cores (on the vc105 machine), but there was a problem.
To change the number of CPUs, the machine has to be rebooted with the maxcpus boot option.
I asked the server administrator to reboot it with that option, because I could not do it remotely.
The administrator said he is still looking for a way to do that.

Instead of waiting,
I tested Phoenix version 1.0 on the Sun T2 machine (cst5210) with 8 cores (64 threads).
The result graph is in the attached file.
For every application, the MapReduce results are no better than the Pthreads results.
Compared to the CMP speedups reported in the paper with 8 cores (32 threads),
even reverse_index and word_count did not perform well with MapReduce.

I think the overhead of MapReduce increases as the number of workers grows.
The approach in the paper was fresh, but I am not sure it will be a good fit for future multicore/multiprocessor systems.

Simics & Gems
Since the nodes from Prof. Moon do not have enough disk space, I got an account on the pcal6 machine from a student of Prof. Huh,
and I am now sharing pcal6 with JeongSeob.
I downloaded and installed Simics 3.0 on pcal6 again.

The SigSimics meeting was on Thursday.
I signed up to study the PARSEC benchmark together with the group.
I am trying to figure out its characteristics and understand the differences between PARSEC and SPLASH-2.

Paper
The SPLASH-2 Programs: Characterization and Methodological Considerations
The PARSEC Benchmark Suite: Characterization and Architectural Implications
PARSEC vs. SPLASH-2: A Quantitative Comparison of Two Multithreaded Benchmark Suites on Chip-Multiprocessors



■ NEXT WEEK - WHAT TO DO (6/9~6/15)

SigMP seminar (6/11)
Coordinated Management of Multiple Interacting Resources in Chip Multiprocessors: A Machine Learning Approach (Micro 2008) 
Bandwidth adaptive snooping (HPCA 2002)

Simics & Gems
Study and get used to Simics. Set up the Gems simulator.
Run SPLASH-2 and PARSEC.

The PARSEC Benchmark Suite: Characterization and Architectural Implications (2008)

PARSEC Wiki
http://wiki.cs.princeton.edu/index.php/PARSEC

A benchmark suite for studies of CMPs (Chip-Multiprocessors)
Diverse in working set, locality, data sharing, synchronization, and off-chip traffic.

Existing benchmark suites cannot be considered adequate to describe future CMP applications.

■ Motivation

□ Requirements for a Benchmark Suite

Multi-threaded Applications
Emerging Workloads
Diverse
Employ State-of-the-Art Techniques
Support Research

□ Limitations of Existing Benchmark Suites

SPLASH-2
Program collection is skewed towards HPC and graphics programs.
Does not include parallelization models such as the pipeline model.
SPEC CPU2006 and OMP2001
Provide a snapshot of current scientific and engineering applications
Workloads such as systems programs and parallelization models which employ the producer-consumer model are not included.
SPEC CPU2006 is a suite of serial programs.
Other Benchmark Suites
Designed to study a specific program area and limited to a single application domain.

■ The PARSEC Benchmark Suite

9 applications and 3 kernels (+PARSEC 2.0 includes RayTrace)

□ Input Sets
test, simdev, simsmall, simmedium, simlarge, native

□ Workloads

Blackscholes (Financial Analysis)
This application is an Intel RMS benchmark. It calculates the prices for a portfolio of European options analytically with the Black-Scholes partial differential equation (PDE). There is no closed-form expression for the Black-Scholes equation and as such it must be computed numerically.
Bodytrack (Computer Vision)
This computer vision application is an Intel RMS workload which tracks a human body with multiple cameras through an image sequence. This benchmark was included due to the increasing significance of computer vision algorithms in areas such as video surveillance, character animation and computer interfaces.
Canneal (Engineering)
This kernel was developed by Princeton University. It uses cache-aware simulated annealing (SA) to minimize the routing cost of a chip design. Canneal uses fine-grained parallelism with a lock-free algorithm and a very aggressive synchronization strategy that is based on data race recovery instead of avoidance.
Dedup (Enterprise Storage)
This kernel was developed by Princeton University. It compresses a data stream with a combination of global and local compression that is called 'deduplication'. The kernel uses a pipelined programming model to mimic real-world implementations. The reason for the inclusion of this kernel is that deduplication has become a mainstream method for new-generation backup storage systems.
Facesim (Animation)
This Intel RMS application was originally developed by Stanford University. It computes a visually realistic animation of the modeled face by simulating the underlying physics. The workload was included in the benchmark suite because an increasing number of animations employ physical simulation to create more realistic effects.
Ferret (Similarity Search)
This application is based on the Ferret toolkit which is used for content-based similarity search. It was developed by Princeton University. The reason for the inclusion in the benchmark suite is that it represents emerging next-generation search engines for non-text document data types. In the benchmark, we have configured the Ferret toolkit for image similarity search. Ferret is parallelized using the pipeline model.
Fluidanimate (Animation)
This Intel RMS application uses an extension of the Smoothed Particle Hydrodynamics (SPH) method to simulate an incompressible fluid for interactive animation purposes. It was included in the PARSEC benchmark suite because of the increasing significance of physics simulations for animations.
Freqmine (Data Mining)
This application employs an array-based version of the FP-growth (Frequent Pattern-growth) method for Frequent Itemset Mining (FIMI). It is an Intel RMS benchmark which was originally developed by Concordia University. Freqmine was included in the PARSEC benchmark suite because of the increasing use of data mining techniques.
Raytrace (added in PARSEC 2.0; Graphics?)
The Intel RMS application uses a version of the raytracing method that would typically be employed for real-time animations such as computer games. It is optimized for speed rather than realism. The computational complexity of the algorithm depends on the resolution of the output image and the scene.
Streamcluster (Data Mining)
This RMS kernel was developed by Princeton University and solves the online clustering problem. Streamcluster was included in the PARSEC benchmark suite because of the importance of data mining algorithms and the prevalence of problems with streaming characteristics.
Swaptions (Financial Analysis)
The application is an Intel RMS workload which uses the Heath-Jarrow-Morton (HJM) framework to price a portfolio of swaptions. Swaptions employs Monte Carlo (MC) simulation to compute the prices.
Vips (Media Processing)
This application is based on the VASARI Image Processing System (VIPS) which was originally developed through several projects funded by European Union (EU) grants. The benchmark version is derived from a print on demand service that is offered at the National Gallery of London, which is also the current maintainer of the system. The benchmark includes fundamental image operations such as an affine transformation and a convolution.
X264 (Media Processing)
This application is an H.264/AVC (Advanced Video Coding) video encoder. H.264 describes the lossy compression of a video stream and is also part of ISO/IEC MPEG-4. The flexibility and wide range of application of the H.264 standard and its ubiquity in next-generation video systems are the reasons for the inclusion of x264 in the PARSEC benchmark suite.

■ Methodology

Parallelization
Working sets and locality
Communication to computation ratio and sharing
Off-chip traffic



The SPLASH-2 Programs: Characterization and Methodological Considerations (1995)

SPLASH-2 (vs. SPLASH)

Represent a wider range of computations in the scientific, engineering and graphics domains.
Use better algorithms and implementations.
Are more architecturally aware.

■ Characteristics and Approach

□ Axes of Characterization

Concurrency and load balancing: How many processors can be effectively utilized by that program, assuming a perfect memory system and communication architecture.
Working set: Program’s temporal locality
Communication to computation ratio: Potential impact of communication latency on performance
Spatial locality: Spatial locality and false sharing in the programs

□ Approach to Characterization

Experimental environment
Execution-driven simulation. Simulate a cache-coherent shared address space multiprocessor with physically distributed memory and one processor per node. Each processor has a single-level cache, using a directory-based protocol.
All memory references complete in a single cycle (regardless of hits or misses)
Data are distributed among the processing nodes according to the guidelines.
Data Sets and Scaling
The data sets are small enough to simulate in a reasonable time, yet large enough to be of interest in their problem domain in practice. We fix the number of processors at 32 for most of our characterization.
Inherent versus Practical Characteristics
Focus on these realistic memory system parameters while still trying to approach inherent properties and avoid too many artifacts.

■ The SPLASH-2 Application Suite

It has 8 complete applications and 4 kernels

Barnes: Simulates the interaction of a system of bodies in three dimensions over a number of time-steps, using the Barnes-Hut hierarchical N-body method.
Cholesky: Factors a sparse matrix into the product of a lower triangular matrix and its transpose.
FFT: The FFT kernel is a complex 1-D version of the radix-√n six-step FFT algorithm.
FMM: Simulates a system of bodies over a number of time-steps, with interactions in two dimensions, using a different hierarchical N-body method called the adaptive Fast Multipole Method.
LU: Factors a dense matrix into the product of a lower triangular and an upper triangular matrix.
Ocean: Studies large-scale ocean movements based on eddy and boundary currents.
Radiosity: Computes the equilibrium distribution of light in a scene using the iterative hierarchical diffuse radiosity method.
Radix: Integer radix sort kernel.
Raytrace: Renders a three-dimensional scene using ray tracing.
Volrend: Renders a three-dimensional volume using a ray casting technique.
Water-Nsquared: Evaluates forces and potentials that occur over time in a system of water molecules.
Water-Spatial: Solves the same problem as Water-Nsquared, but uses a more efficient algorithm.

■ Concurrency and Load Balance

Concurrency and load balance: how they change with problem size and number of processors
Study how the computational load balance scales with the number of processors by measuring speedups on a PRAM architectural model.

Figure 1: the PRAM speedups for the SPLASH-2 programs for up to 64 processors
Figure 2: the time spent waiting at synchronization points for 32-processor executions of each application.

The reasons for sub-linear speedups are related to the sizes of the input data sets
(load imbalance, a not-completely-parallelized prefix computation, …).
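
For reference, the PRAM speedup plotted in Figure 1 is just the usual ratio of sequential to parallel execution time (a minimal formulation, with the perfect memory system assumed above):

S(p) = T_1 / T_p

where T_1 is the execution time on one processor and T_p the time on p processors; a perfectly load-balanced, fully parallel program would reach S(p) = p.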

■ Working Sets and Temporal Locality

......


PROGRESS REPORT (6/1)

Lab.work 2009. 5. 31. 19:08
■ THIS WEEK - WHAT I DID (5/23~6/1)

SigMP seminar (5/28)
"Evaluating MapReduce for Multi-core and Multiprocessor Systems"

Phoenix
I downloaded Phoenix version 2.0 (http://mapreduce.stanford.edu/)
and tested it with the input datasets for each application.

Q: How did the test go? Which applications were you able to run, and on which machine?
I use “vc105.kaist.ac.kr” for my own study.
Sequential, MapReduce, and Pthreads versions of the code for all applications can be downloaded, and test datasets can be downloaded as well.
I executed each version of the applications and tested them with various datasets.
Now I am reading the code to understand the implementation.

Simics
Downloaded and installed Simics 3.0.
Now I am studying how to use it.

Q: Where did you install Simics? Were you able to run it at all?
I also installed Simics on the vc105 machine.
After I registered the license and fixed some problems, Simics runs without errors.
So far it has worked as described in the user guide, but I am still figuring out how to use it.
I will let you know about my progress.

Paper
"The SPLASH-2 Programs: Characterization and Methodological Considerations"



■ NEXT WEEK - WHAT TO DO (6/2~6/8)

Paper to read
"The PARSEC Benchmark Suite: Characterization and Architectural Implications"
"PARSEC vs. SPLASH-2: A Quantitative Comparison of Two Multithreaded Benchmark Suites on Chip-Multiprocessors"

Q: Is this for the SigMP seminar?
These papers are for understanding the benchmarks and simulators.
I will also read the SigMP papers before the seminar.

Simics
Study and get used to Simics.
Set up the Gems simulator.



■ Other Issues

WebCam
What do you think about this one? (http://blog.danawa.com/prod/651757/C/863/892/10683/0)
If it is OK with you, I would like to purchase it.

I don't think we need to use a webcam for our conference call. Please just make sure you have a mic and a speaker (or a headset).
Actually, neither of us has a mic.
I think we can borrow a headset from another lab.

Desktop Sharing Tool
Skype v4.1 beta includes screen sharing. We may use this.


Wednesday, May 27

Lab.work 2009. 5. 27. 17:02
Inter-core communication / performance / power / reliability
Focus on applications & architecture

Coherence
power, verification, lower overhead, cost efficiency

★★ MapReduce!!
Reference: "Evaluating MapReduce for Multicore and Multiprocessor Systems"
Implement the applications that appear in the paper, and approach it so that they can run on multicore.



Benchmarks: find and read the paper on each of these

"The SPLASH-2 Programs: Characterization and Methodological Considerations"

"The PARSEC Benchmark Suite: Characterization and Architectural Implications"
"PARSEC vs. SPLASH-2: A Quantitative Comparison of Two Multithreaded Benchmark Suites on Chip-Multiprocessors"

Look into the Simics license / site license and apply for one (http://www.virtutech.com/)
Garnet: Detailed Interconnection Network Model inside a Full-system Simulation Framework
Orion: A Power-Performance Simulator for Interconnection Networks



Video conference every Tuesday at 9 pm
Progress report due every Monday evening

Get a cluster account
Buy a webcam & mic

Seminar paper presentations:
1. Problem definition
2. The authors' approach
3. Differences from existing methods
4. Analysis of the results
5. Strengths and weaknesses (my own thoughts)


CMP vs. SMP

Architecture 2009. 5. 26. 18:35
Before multi-core, a multiprocessor system was structured as several processors sharing a large memory over a bus. Each processor has its own independent cache. Such a system is called an SMP (Shared or Symmetric MultiProcessor).

Because memory is shared among several processors, the issues an SMP must handle are keeping memory data consistent and synchronizing processes. When the same memory data is held in the caches of two or more processors at once, and one processor changes the data in its cache, the new data must also be reflected in the other processors' caches so that the same memory data remains consistent. A well-known approach is the snoopy protocol: when data changes in one cache, the change is announced on the bus, and any other processor holding the same data in its cache updates it as well.

In addition, if two or more processors try to modify the same memory data at the same time, the change made by one processor can be lost because of another processor. This inter-process synchronization is usually handled by locking the memory location and preventing other processes from storing data to it.
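
As a minimal illustration of the lock-based synchronization described above, here is a hypothetical shared counter protected by a POSIX mutex (the counter and the function name are made up for the example):

#include <pthread.h>

static long counter = 0;   /* hypothetical data shared by all processors */
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

void add_to_counter(long delta)
{
    pthread_mutex_lock(&counter_lock);    /* lock the shared location          */
    counter += delta;                     /* read-modify-write cannot be lost  */
    pthread_mutex_unlock(&counter_lock);  /* allow other threads to proceed    */
}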

A multi-core system can be seen as one example of an SMP: Figure 8 shows several cores on one chip sharing memory. However, because the L2 cache moves onto the multi-core chip, the memory hierarchy becomes more complex, with three levels: L1, L2, and RAM. A system built from such multi-core chips is called a CMP (Chip MultiProcessor). Compared with an SMP, the cores of a CMP are on the same chip, so the time to exchange data between cores is much shorter and the amount of data exchanged at a time is much larger.

Since a CMP can also be regarded as a kind of SMP, it must deal with the same memory-coherence and inter-process synchronization problems considered for an SMP.

Multiple multi-core chips can then be connected to build a larger system, which becomes a cluster system like the one shown in Figure 9. In a cluster system the memory no longer has to be shared by all the multi-core chips, and the system can be run as a distributed system in which each multi-core chip uses its own independent operating system.



I looked this up because the explanation is good. Original article: http://www.epnc.co.kr/article/view_serial.asp?se=8&article_idx=9148

For presentation on 5/28
Evaluating MapReduce for Multi-core and Multiprocessor Systems (HPCA 2007)

[ Abstract ]

This paper evaluates the suitability of the MapReduce model for multi-core and multi-processor systems.
MapReduce allows programmers to write functional-style code that is automatically parallelized and scheduled in a distributed system.

* Phoenix: an implementation of MapReduce for shared-memory systems that includes a programming API and an efficient runtime system.

What they did: 
1. Study Phoenix with multi-core and symmetric multiprocessor systems.
2. Compare MapReduce code to code written in lower-level APIs such as P-threads.

[ 1 Introduction ]

Traditional parallel programming techniques (e.g., message passing and shared-memory threads)
require programmers to:
Manage concurrency explicitly by creating threads and synchronizing them through messages or locks.
Manage data locality manually.
-> It is very difficult to write correct and scalable parallel code for non-trivial algorithms.
-> The code must be re-tuned when the application is ported to a different or larger-scale system.

To simplify parallel coding:
1. Practical programming model
2. Efficient runtime system

Phoenix: a programming API and runtime system based on Google's MapReduce model.
Map: processes the input data & generates a set of intermediate key/value pairs.
Reduce: properly merges the intermediate pairs which have the same key.

The Phoenix implementation is based on the same principles but targets shared-memory systems such as multi-core chips and symmetric multiprocessors.
Compare the performance of Phoenix code to tuned parallel code written directly with P-threads.

[ 2 MapReduce Overview ]

==== 2.1 Programming Model

MapReduce is inspired by functional languages and targets data-intensive computations.
Input data format is application specific, and is specified by the user.
Output is a set of <key, value> pairs.

Map is applied to the input data and produces a list of intermediate <key, value> pairs.
It can be executed in parallel on non-overlapping portions of the input data.

Reduce is applied to all intermediate pairs with the same key.
It can be executed in parallel on each set of intermediate pairs with the same key.

Main benefit: Simplicity!
The programmer provides a simple description of the algorithm that focuses on functionality and not on parallelism. The actual parallelization and the details of concurrency management are left to the runtime system.
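
To make that division of labor concrete, here is a minimal word-count sketch in the spirit of this model. The emit_intermediate/emit helpers and the signatures are hypothetical stand-ins, not the actual Phoenix API:

#include <stdio.h>
#include <string.h>

/* Hypothetical emit helpers standing in for the runtime; a real runtime
 * would collect these pairs into shared buffers instead of printing. */
static void emit_intermediate(const char *key, long value) { printf("map:    <%s, %ld>\n", key, value); }
static void emit(const char *key, long value)              { printf("reduce: <%s, %ld>\n", key, value); }

/* Map: process one chunk of input text and emit a <word, 1> pair for each
 * word found. The runtime may run many Map tasks in parallel, each on a
 * non-overlapping chunk. */
static void wordcount_map(char *chunk)
{
    char *save;
    for (char *word = strtok_r(chunk, " \t\n", &save); word != NULL;
         word = strtok_r(NULL, " \t\n", &save))
        emit_intermediate(word, 1);
}

/* Reduce: all intermediate values for one key arrive together; summing
 * them gives the word's total count. */
static void wordcount_reduce(const char *key, const long *vals, int n)
{
    long sum = 0;
    for (int i = 0; i < n; i++)
        sum += vals[i];
    emit(key, sum);
}

int main(void)
{
    char text[] = "the quick brown fox jumps over the lazy dog";
    wordcount_map(text);               /* in Phoenix the scheduler drives this */
    long ones[] = { 1, 1 };
    wordcount_reduce("the", ones, 2);  /* e.g. "the" appeared twice            */
    return 0;
}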

==== 2.2 Runtime System

The MapReduce runtime is responsible for parallelization and concurrency control.

To parallelize the Map stage, it splits the input pairs into units that are processed concurrently on multiple nodes. Next, it partitions the intermediate pairs using a scheme that keeps pairs with the same key in the same unit (these units are processed in parallel by Reduce tasks). Finally, it must merge and sort the output pairs from all Reduce tasks.

Factors: the size of units, the number of nodes, how units are assigned to nodes dynamically, how buffer space is allocated.

Optimization: 
Reduce function-call overheads by increasing the granularity of Map or Reduce tasks.
Reduce load imbalance by adjusting task granularity or the number of nodes used.
Optimize locality -> prefetch pairs for the current Map or Reduce task and prefetch the input for the next Map or Reduce task.

Fault tolerance: when the runtime detects that a node has failed, it re-assigns the Map or Reduce task the node was processing at the time to another node. To avoid interference, the replicated task uses separate output buffers.

Runtime can dynamically adjust the number of nodes.
There can be non-trivial overheads due to key management, data copying, data sorting, or memory allocation between execution steps.

[ 3 The Phoenix System ]

MapReduce for shared-memory systems.
Efficient execution on multiple cores without burdening the programmer with concurrency management.

Simple API: visible to application programmers
 & Efficient Runtime: handles parallelization, resource management, and fault recovery.

==== 3.1 The Phoenix API

==== 3.2 The Phoenix Runtime

======== 3.2.1 Basic Operation and Control Flow

The runtime is controlled by the scheduler, which is initiated by user code.
The scheduler: 
Creates and manages the threads that run Map and Reduce tasks. 
Manages the buffers used for task communication.
Determines the number of cores to use for this computation.
Spawns a worker thread for each core; each worker is dynamically assigned some number of Map and Reduce tasks.

Map Stage>
Splitter: divides the input pairs into equally sized units to be processed by the Map tasks.
Map tasks are allocated dynamically to workers and each one emits <key, value> pairs.
Partition: splits the intermediate pairs into units for the Reduce tasks.
All values of the same key go to the same unit.
The scheduler must wait for all Map tasks to complete before initiating the Reduce stage.

Reduce Stage>
Reduce tasks are also assigned to workers dynamically.
Since all values for the same key must be processed in one task, the Reduce stage may exhibit higher imbalance across workers, so dynamic scheduling is more important. The final output from all tasks is merged into a single buffer, sorted by keys.
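
A compressed sketch of this control flow using POSIX threads. Everything here (the names, the fixed worker count, the empty worker bodies) is a hypothetical illustration of the stages described above, not the actual Phoenix scheduler:

#include <pthread.h>
#include <stdio.h>

#define NUM_WORKERS 4   /* the real scheduler chooses this per machine */

static void *map_worker(void *arg)    { (void)arg; /* pull Map tasks dynamically, emit pairs */ return NULL; }
static void *reduce_worker(void *arg) { (void)arg; /* pull Reduce tasks dynamically          */ return NULL; }

int main(void)
{
    pthread_t w[NUM_WORKERS];

    /* Map stage: the Splitter has already divided the input into units. */
    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_create(&w[i], NULL, map_worker, NULL);
    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_join(w[i], NULL);      /* wait: Reduce may not start before all Maps finish */

    /* Partition step: group intermediate pairs so equal keys go to one Reduce unit. */

    /* Reduce stage: workers are assigned Reduce tasks dynamically. */
    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_create(&w[i], NULL, reduce_worker, NULL);
    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_join(w[i], NULL);

    /* Merge and sort the Reduce outputs into a single final buffer. */
    printf("map-reduce pipeline finished\n");
    return 0;
}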

======== 3.2.2 Buffer Management

All buffers are allocated in shared memory.

Map-Reduce buffers: store the intermediate output pairs. Each worker has its own set of buffers.
Reduce-Merge buffers: store the outputs of Reduce tasks before they are sorted. After sorting, the final output is available in the user allocated Output_data buffer.

======== 3.2.3 Fault Recovery

Phoenix detects faults through timeouts.
The execution time of similar tasks on other workers is used as a yardstick for the timeout interval.

Once a fault is detected or at least suspected, the runtime attempts to re-execute the failed task. Since the original task may still be running, separate output buffers are allocated for the new task to avoid conflict and data corruption.

The current Phoenix code does not provide fault recovery for the scheduler itself.

Google's MapReduce -> always spawns redundant executions of the remaining tasks, since it proactively assumes that some workers may have performance or failure issues.
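
A minimal sketch of the timeout heuristic above; the factor and names are assumptions for illustration, not values from the paper:

#include <stdbool.h>

/* Hypothetical check: suspect a task has failed if it has been running
 * much longer than similar tasks took on other workers. */
static bool task_suspected_failed(double elapsed_sec, double typical_task_sec)
{
    const double TIMEOUT_FACTOR = 4.0;   /* assumed safety margin */
    return elapsed_sec > TIMEOUT_FACTOR * typical_task_sec;
}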

======== 3.2.4 Concurrency and Locality Management

Three scheduling approaches
1. Default policy for the specific system
2. Dynamically determine the best policy for each decision
3. Allow the programmer to provide application specific policies.

* Number of Cores and Workers/Core:
Spawn workers on all available cores.
In systems with multithreaded cores, one worker is spawned per hardware thread.

* Task Assignment:
To achieve load balance, Map and Reduce tasks are always assigned to workers dynamically.

* Task size:
Each Map task processes a unit of the input data.
Adjust the unit size so that the input and output data for a Map task fit in the L1 data cache.
Tradeoff -> lower overheads (fewer, larger units) & better load balance (more, smaller units)

* Partition Function:
Determines the distribution of intermediate data. 
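
In many MapReduce runtimes the default partition function is simply a hash of the key modulo the number of Reduce units; a minimal sketch (hypothetical, not necessarily Phoenix's actual default):

#include <stddef.h>

/* Hash the key bytes and map the result onto one of num_reduce_units
 * buckets, so all pairs with the same key reach the same Reduce unit. */
static int partition(const char *key, size_t key_len, int num_reduce_units)
{
    unsigned long h = 5381;                      /* djb2 string hash */
    for (size_t i = 0; i < key_len; i++)
        h = h * 33 + (unsigned char)key[i];
    return (int)(h % (unsigned long)num_reduce_units);
}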

[ 4 Methodology ]

==== 4.1 Shared Memory Systems

CMP: UltraSparc T1 multi-core chip with 8 multithreaded cores sharing the L2 cache.
SMP: symmetric multiprocessor with 24 chips.

Promise: the same program should run as efficiently as possible on any type of shared-memory system without any involvement by the user.

==== 4.2 Applications

From domains>
Enterprise computing: WordCount, ReverseIndex, StringMatch
Scientific computing: Matrix Multiply
Artificial intelligence: Kmeans, PCA, Linear Regression
Image processing: Histogram

MapReduce version using Phoenix
& conventional parallel version using P-threads.

* WordCount: determine frequency of words in a file.
Map: process different sections of the input files -> <word, 1> (if the word was found)
Reduce: add up the values for each word

* MatrixMultiply: dense integer matrix multiplication.
Map: computes the results for a set of rows of the output matrix -> <(x,y), value>
Reduce: Identity function

* Reverse Index: Build reverse index for links in HTML files.
Map: parses a collection of HTML files -> <link_key, file_info>
Reduce: combines all files referencing the same link into a single linked list.

* Kmeans: Iterative clustering algorithm to classify 3D data points into groups.
Phoenix scheduler is called multiple times until it converges.
Map: finds distance between each point and each mean and assigns the point to the closest cluster.
emits <cluster_id, data_vector>
Reduce: gathers all points with the same cluster_id and finds their centroid (mean vector)
emits <cluster_id, mean vector>
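
A minimal sketch of the distance computation inside that Map step (hypothetical helper, assuming 3-D points as in the description above):

#include <float.h>

/* For one 3-D point, find the nearest of the k current means and return
 * that cluster's id; Map would then emit <cluster_id, data_vector>. */
static int nearest_cluster(const double point[3], const double means[][3], int k)
{
    int best = 0;
    double best_dist = DBL_MAX;
    for (int c = 0; c < k; c++) {
        double d = 0.0;
        for (int j = 0; j < 3; j++) {
            double diff = point[j] - means[c][j];
            d += diff * diff;                    /* squared Euclidean distance */
        }
        if (d < best_dist) {
            best_dist = d;
            best = c;
        }
    }
    return best;
}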

* String Match: Search a file with keys for an encrypted word.
Map: processes the "encrypt" file & the "keys" file; parses a portion of the "keys" file and returns
<a word in the "keys" file, a flag indicating whether it was a match as the value>
Reduce: Identity function.

* PCA: Principal components analysis on a matrix.
find the mean vector and the covariance matrix of a set of data points. Use 2 MapReduce iterations.
Map 1: compute the mean for a set of rows -> <row_number, mean>
Map 2: compute a few elements in the required covariance matrix, and is provided with the data required to calculate the value of those elements. -> <(row, column), covariance>
Reduce: Identity function

* Histogram: Determine the frequency of each RGB component in a set of images.
Map: processes different portions of the image -> <image, frequency of component occurrence>
Reduce: sum up numbers.

* Linear Regression: Compute the best fit line for a set of points.
Map: processes different portions of the file, compute certain summary statistics like sum of squares.
Reduce: finally determine the best fit line.
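
For reference, the per-portion sums that Map produces (n, Σx, Σy, Σxy, Σx²) are exactly what Reduce needs to evaluate the standard closed-form least-squares fit y = a·x + b:

a = (n·Σ x_i y_i − Σ x_i · Σ y_i) / (n·Σ x_i² − (Σ x_i)²)
b = (Σ y_i − a·Σ x_i) / n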

[ 5 Evaluation ]

Evaluation results for Phoenix using the CMP and SMP shared-memory systems.

==== 5.1 Basic Performance Evaluation

==== 5.2 Dependency to Dataset Size

Increasing the dataset leads to higher speedups over the sequential version for most applications.
Larger dataset allows the Phoenix runtime to better amortize its overheads.
Caching effects are more significant when processing large datasets, and load imbalance is rarer.

==== 5.3 Dependency to Unit Size

The unit size determines the number of Map tasks, their memory footprint, and how well their overhead is amortized.
Applications with short term temporal locality in their access patterns perform better with smaller units.
Large units can lead to better performance for some applications that spend a significant portion of time spawning tasks and merging their outputs.

==== 5.4 Comparison to Pthreads

For the applications that fit naturally into the MapReduce model, Phoenix leads to similar or slightly better speedups.
For Kmeans, PCA, and Histogram, Pthreads outperforms Phoenix significantly. For these applications, the MapReduce program structure is not an efficient fit.

==== 5.5 Fault Recovery


PROGRESS REPORT (5/22)

Lab.work 2009. 5. 22. 20:33
■ THIS WEEK - WHAT I DID (5/16~5/22)

Personal schedule
All final exams are over.
Nothing special this week because of the exams.
Computer Architecture project due on Saturday

No SigMP seminar

The meeting was on Thursday
Research specific core applications/computations
Set up Simics -> download the Gems simulator




■ NEXT WEEK - WHAT TO DO (5/23~5/29)

Research, research, research on topics!!

SigMP seminar
Prepare SigMP seminar (Thursday 3pm)
"Evaluating MapReduce for Multi-core and Multiprocessor Systems"
