November 4 Meeting

Lab.work 2009. 11. 4. 19:12
To-do list for next Monday

@ Phoenix

1) First, find out how large the performance gap between the Ideal network and the Mesh is.
That is, determine how much of a network bottleneck exists.

2) Measure the miss rate in each phase.

3) For each node, plot how message and data traffic change over time, in 10,000-cycle bins (see the sketch below).
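
A minimal sketch of the per-node binning this plot would need, assuming the simulator can report the cycle, source node, and payload size of every injected message (NUM_NODES, MAX_BINS, and the function names are illustrative assumptions):

#include <stdio.h>

#define NUM_NODES  16        /* illustrative node count */
#define BIN_CYCLES 10000     /* bin width in cycles */
#define MAX_BINS   1000      /* enough bins for the simulated length */

static long msg_count[NUM_NODES][MAX_BINS];   /* messages per node per bin */
static long data_count[NUM_NODES][MAX_BINS];  /* payload bytes per node per bin */

/* Call once per injected message with its source node, cycle, and payload size. */
void record_traffic(int node, long cycle, int data_bytes)
{
    long bin = cycle / BIN_CYCLES;
    if (node < 0 || node >= NUM_NODES || bin >= MAX_BINS)
        return;
    msg_count[node][bin]  += 1;
    data_count[node][bin] += data_bytes;
}

/* Dump one line per (node, bin) so the result can be plotted externally. */
void dump_traffic(long last_cycle)
{
    long bins = last_cycle / BIN_CYCLES + 1;
    for (int n = 0; n < NUM_NODES; n++)
        for (long b = 0; b < bins && b < MAX_BINS; b++)
            printf("%d %ld %ld %ld\n", n, b * BIN_CYCLES,
                   msg_count[n][b], data_count[n][b]);
}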

@ Booksim

0) Implement injection VOQs ㅠㅠ

1) A New Scalable and Cost-Effective Congestion Management Strategy for Lossless Multistage Interconnection Networks  (HPCA 05)
Can the scheme in this paper be reimplemented?

2) Test with 64 nodes
test1: 8x8 mesh, 64-node ring, 4-ary 3-fly (buf=16)

3) How much does the behavior change when the hotspot is at a different node?
With 16 nodes, a hotspot at node 5 causes more congestion than a hotspot at node 0! (A generic hotspot-traffic sketch follows below.)
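
A minimal sketch of how a hotspot destination pattern can be generated; this is a generic illustration, not BookSim's actual traffic-pattern code, and the hotspot node and probability are assumed parameters:

#include <stdlib.h>

/* With probability p_hot a packet targets the hotspot node; otherwise it
 * picks a uniformly random destination other than itself. */
int pick_destination(int src, int num_nodes, int hotspot_node, double p_hot)
{
    if ((double)rand() / RAND_MAX < p_hot)
        return hotspot_node;
    int dst = rand() % num_nodes;
    while (dst == src)                 /* avoid sending to self */
        dst = rand() % num_nodes;
    return dst;
}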

@ Other

1) Read the paper!
Achieving Predictable Performance Through Better Memory Controller Placement in Many-Core CMPs (ISCA 09)

2) Study the GARNET code thoroughly! ㅠㅠ
GARNET: A Detailed On-Chip Network Model inside a Full-System Simulator

Before sigMP next Tuesday

@ Paper reading

1) Design and Evaluation of Hierarchical On-Chip Network Topologies for next generation CMPs (HPCA 2009)

How different is the topology from a CMESH topology?
How is the "global bus" different from a concentration implementation?
Hybrid: concentration degree = 8, using a wide bus. The number of routers equals the number of nodes, and additional hardware is required.
CMesh: concentration degree = 4, with routers shared. The number of routers is nodes/4, using routers suited to a CMesh.
So the point is that their delays can differ? The latency between locally connected nodes also differs.

Why is something like "XShare" needed? (i.e. what created the need for something like XShare?) When does FBFLY topology make sense?
This seems conceptually similar to channel slicing.
Since multiple flows share a single channel, it appears to be used to keep one flow from monopolizing the whole channel.

What is the impact of "locality" on NoC performance?
What would be a worst-case traffic pattern for the proposed topology in this work?
A case with almost no local traffic and a lot of global traffic (with a high injection rate).

other issues
I am curious how much local traffic real applications actually generate.
For the CMesh comparison, shouldn't it be compared against an 8-degree CMesh?

2) Express virtual channel Flow Control (ISCA 2007)

What is an "ideal" on-chip network latency?
The latency that considers only the router hop count and channel latency, with no network contention.
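One common way to write this down, following standard interconnection-network texts (so this exact form is an interpretation, not a quote from the paper): T_ideal = H * t_r + T_w + L/b, where H is the hop count, t_r is the per-hop router delay, T_w is the total channel/wire delay, L is the packet length, and L/b is the serialization latency over a channel of bandwidth b.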

Compared to a conventional 2D mesh network, what makes EVC more complex?
Do you think it is a good idea to implement flow control such as EVC to improve performance, or it is better to change to a different topology?
Using EVCs makes the router more complex. With static EVCs there is little difference from simply changing the topology, and a topology-based design might actually deliver better performance. Dynamic EVCs appear to fix the problems of static EVCs, but the results will likely vary greatly with the traffic pattern.

Most NoC evaluation use "even number" network - for example, 16, 64 nodes, etc..  Why do you think the authors use a 7x7 network in their evaluation?
An odd number is used to make the EVC network symmetric.

What would be a worst-case traffic pattern using static EVCs?
What is the advantage/disadvantage of using dynamic EVCs?
When most traffic is between neighboring nodes, using EVCs can be worse than a conventional mesh.

other issues
As mentioned above, I wonder how the performance would compare if a similar topology-based design were applied.
Because it is not a torus topology, the nodes at the two ends cannot be bypassed.
As the node count grows, i.e., as more nodes sit under a single bypass channel, the starvation problem is expected to become more severe.

June 18 Presentation Papers

Lab.work 2009. 7. 1. 21:14
An OS-Based Alternative to Full Hardware Coherence on Tiled CMPs (HPCA 2008) 
PARSEC vs. SPLASH-2: A Quantitative Comparison of Two Multithreaded Benchmark Suites (by 신민정) 

June 11 Presentation Papers

Lab.work 2009. 6. 15. 22:15
Paper 1: Coordinated Management of Multiple Interacting Resources in Chip Multiprocessors: A Machine Learning Approach (Micro 2008)
Paper 2: Bandwidth Adaptive Snooping (HPCA 2002)

for presentation on 5/28
Evaluating MapReduce for Multi-core and Multiprocessor Systems (HPCA 2007)

[ Abstract ]

This paper evaluates the suitability of the MapReduce model for multi-core and multi-processor systems.
MapReduce allows programmers to write functional-style code that is automatically parallelized and scheduled in a distributed system.

* Phoenix: an implementation of MapReduce for shared-memory systems that includes a programming API and an efficient runtime system.

What they did: 
1. Study Phoenix with multi-core and symmetric multiprocessor systems.
2. Compare MapReduce code to code written in lower-level APIs such as P-threads.

[ 1 Introduction ]

Traditional parallel programming techniques (ex. message passing and shared-memory threads)
require programmers to:
Manage concurrency explicitly by creating threads and synchronizing them through messages or locks.
Manage data locality manually.
-> Very difficult to write correct and scalable parallel code for non-trivial algorithms.
-> must re-tune the code when the application is ported to a different or larger-scale system.

To simplify parallel coding:
1. Practical programming model
2. Efficient runtime system

Phoenix: a programming API and runtime system based on Google's MapReduce model.
Map: processes the input data & generates a set of intermediate key/value pairs.
Reduce: properly merges the intermediate pairs which have the same key.

Phoenix implementation is based on the same principles but targets shared-memory systems such as multi-core chips and symmetric multiprocessors.
Compare the performance of Phoenix code to tuned parallel code written directly with P-threads.

[ 2 MapReduce Overview ]

==== 2.1 Programming Model

MapReduce: is inspired by functional languages and targets data-intensive computations.
Input data format is application specific, and is specified by the user.
Output is a set of <key, value> pairs.

Map is applied on the input data and produces a list of intermediate <key, value> pairs.
can be executed in parallel on non-overlapping portions of the input data.

Reduce is applied to all intermediate pairs with the same key.
can be executed in parallel on each set of intermediate pairs with the same key.

Main benefit: Simplicity!
The programmer provides a simple description of the algorithm that focuses on functionality and not on parallelism. The actual parallelization and the details of concurrency management are left to the runtime system.
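
To make this concrete, here is a minimal word-count sketch of what the programmer would write under such a model; the emit_intermediate helper and the callback signatures are illustrative assumptions, not the actual Phoenix API:

#include <ctype.h>
#include <stdio.h>

/* Runtime-provided hook in a real system; here just a stub that prints. */
static void emit_intermediate(const char *key, long value)
{
    printf("intermediate: <%s, %ld>\n", key, value);
}

/* Map: scan one chunk of text and emit <word, 1> for every word found. */
static void wordcount_map(const char *chunk)
{
    char word[64];
    int  len = 0;
    for (const char *p = chunk; ; p++) {
        if (*p != '\0' && isalpha((unsigned char)*p)) {
            if (len < (int)sizeof(word) - 1)
                word[len++] = (char)tolower((unsigned char)*p);
        } else if (len > 0) {
            word[len] = '\0';
            emit_intermediate(word, 1);
            len = 0;
        }
        if (*p == '\0')
            break;
    }
}

/* Reduce: sum all counts emitted for one word. */
static void wordcount_reduce(const char *key, const long *values, int num_values)
{
    long total = 0;
    for (int i = 0; i < num_values; i++)
        total += values[i];
    printf("final: <%s, %ld>\n", key, total);
}

The runtime, not the programmer, decides how chunks are split, how many threads run wordcount_map in parallel, and how intermediate pairs are grouped before wordcount_reduce is called.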

==== 2.2 Runtime System

The MapReduce runtime is responsible for parallelization and concurrency control.

To parallelize the Map, it splits the input pairs into units that are processed concurrently on multiple nodes. Next, it partitions the intermediate pairs using a scheme that keeps pairs with the same key in the same unit (processed in parallel by Reduce tasks). Finally, it must merge and sort the output pairs from all Reduce tasks.

Factors: the size of units, the number of nodes, how units are assigned to nodes dynamically, how buffer space is allocated.

Optimization: 
Reduce function-call overheads by increasing the granularity of Map or Reduce tasks.
Reduce load imbalance by adjusting task granularity or the number of nodes used.
Optimize locality -> prefetch pairs for its current Map or Reduce tasks and prefetch the input for its next Map or Reduce task.

Fault tolerance: when the runtime detects that a node has failed, it re-assigns the Map or Reduce task that node was processing to another node. To avoid interference, the replicated task uses separate output buffers.

Runtime can dynamically adjust the number of nodes.
There can be non-trivial overheads due to key management, data copying, data sorting, or memory allocation between execution steps.

[ 3 The Phoenix System ]

MapReduce for shared-memory systems.
Efficient execution on multiple cores without burdening the programmer with concurrency management.

Simple API: visible to application programmers
 & Efficient Runtime: handles parallelization, resource management, and fault recovery.

==== 3.1 The Phoenix API

==== 3.2 The Phoenix Runtime

======== 3.2.1 Basic Operation and Control Flow

The runtime is controlled by the scheduler, which is initiated by user code.
The scheduler: 
Creates and manages the threads that run Map and Reduce tasks. 
Manages the buffers used for task communication.
Determines the number of cores to use for this computation.
Spawns a worker thread that is dynamically assigned some number of Map and Reduce tasks.

Map Stage>
Splitter: divides input pairs into equally sized units to be processed by the Map tasks.
Map tasks are allocated dynamically to workers and each one emits <key, value> pairs.
Partition: splits the intermediate pairs into units for the Reduce tasks.
All values of the same key go to the same unit.
The scheduler must wait for all Map tasks to complete before initiating the Reduce stage.

Reduce Stage>
Reduce tasks are also assigned to workers dynamically.
All values for the same key must be processed in one task -> Reduce may exhibit higher imbalance across workers, so dynamic scheduling is more important. The final output from all tasks is merged into a single buffer, sorted by keys.
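
A minimal sketch of the dynamic assignment described above: each worker repeatedly grabs the next unprocessed task index from a shared counter, so faster workers naturally take more tasks. The task_t type and run_map_task body are illustrative assumptions, not Phoenix code:

#include <pthread.h>
#include <stddef.h>

typedef struct { const void *data; int len; } task_t;   /* illustrative */

static task_t         *tasks;                  /* all Map tasks of this stage */
static int             num_tasks;
static int             next_task = 0;
static pthread_mutex_t task_lock = PTHREAD_MUTEX_INITIALIZER;

static void run_map_task(const task_t *t) { (void)t; /* Map body would go here */ }

/* Worker thread: pull tasks until none are left, then return. */
void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&task_lock);
        int idx = (next_task < num_tasks) ? next_task++ : -1;
        pthread_mutex_unlock(&task_lock);
        if (idx < 0)
            break;                             /* no Map tasks remain */
        run_map_task(&tasks[idx]);
    }
    return NULL;
}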

======== 3.2.2 Buffer Management

All buffers are allocated in shared memory.

Map-Reduce buffers: store the intermediate output pairs. Each worker has its own set of buffers.
Reduce-Merge buffers: store the outputs of Reduce tasks before they are sorted. After sorting, the final output is available in the user allocated Output_data buffer.
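
A rough sketch of what a per-worker intermediate buffer could look like; the names and layout are assumptions for illustration, not the actual Phoenix data structures:

/* One intermediate <key, value> pair emitted by a Map task. */
typedef struct {
    char *key;
    void *value;
} keyval_t;

/* Per-worker Map-Reduce buffer: each worker appends only to its own
 * array, so no locking is needed while the Map phase runs. */
typedef struct {
    keyval_t *pairs;     /* dynamically grown array of emitted pairs */
    int       count;     /* pairs currently stored */
    int       capacity;  /* allocated slots */
} worker_buffer_t;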

======== 3.2.3 Fault Recovery

Phoenix detects faults through timeouts.
The execution time of similar tasks on other workers is used as a yardstick for the timeout interval.

Once a fault is detected or at least suspected, the runtime attempts to re-execute the failed task. Since the original task may still be running, separate output buffers are allocated for the new task to avoid conflict and data corruption.
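
A minimal sketch of that timeout heuristic, assuming the scheduler tracks when each task started and the average execution time of completed tasks of the same kind; the threshold factor is an assumption, not the value Phoenix actually uses:

#include <stdbool.h>

/* Suspect a still-running task as failed if it has taken more than
 * `factor` times the average time of completed tasks of the same kind. */
bool task_timed_out(double now, double task_start,
                    double avg_done_time, int num_done, double factor)
{
    if (num_done == 0)          /* no yardstick yet, keep waiting */
        return false;
    return (now - task_start) > factor * avg_done_time;
}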

The current Phoenix code does not provide fault recovery for the scheduler itself.

Google's MapReduce -> always spawns redundant executions of the remaining tasks, proactively assuming that some workers have performance or failure issues.

======== 3.2.4 Concurrency and Locality Management

Three scheduling approaches
1. Default policy for the specific system
2. Dynamically determine the best policy for each decision
3. Allow the programmer to provide application specific policies.

* Number of Cores and Workers/Core:
Spawn workers to all available cores.
In systems with multithreaded cores, we spawn one worker per hardware thread.

* Task Assignment:
To achieve load balance, we always assign Map and Reduce tasks to workers dynamically.

* Task size:
Each Map task processes a unit of the input data.
Adjust the unit size so that the input and output data for a Map task fit in the L1 data cache.
Tradeoff -> lower overheads (fewer, larger units) vs. better load balance (more, smaller units)

* Partition Function:
Determines the distribution of intermediate data. 
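
The notes above do not say what the partition function actually is; a common choice, shown here as an assumption rather than the real Phoenix default, is to hash the key modulo the number of Reduce tasks so that identical keys always land in the same unit:

/* Simple string hash (djb2), used only for illustration. */
static unsigned long hash_key(const char *key)
{
    unsigned long h = 5381;
    while (*key != '\0')
        h = h * 33 + (unsigned char)*key++;
    return h;
}

/* Map an intermediate key to one of num_reduce_tasks partitions.
 * Identical keys always map to the same partition. */
int default_partition(const char *key, int num_reduce_tasks)
{
    return (int)(hash_key(key) % (unsigned long)num_reduce_tasks);
}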

[ 4 Methodology ]

==== 4.1 Shared Memory Systems

CMP: UltraSparc T1 multi-core chip with 8 multithreaded cores sharing the L2 cache.
SMP: symmetric multiprocessor with 24 chips.

Promise: the same program should run as efficiently as possible on any type of shared-memory system without any involvement by the user.

==== 4.2 Applications

From domains>
Enterprise computing: WordCount, ReverseIndex, StringMatch
Scientific computing: Matrix Multiply
Artificial intelligence: Kmeans, PCA, Linear Regression
Image processing: Histogram

MapReduce version using Phoenix
& conventional parallel version using P-threads.

* WordCount: determine frequency of words in a file.
Map: processes different sections of the input files -> <word, 1> (if the word was found)
Reduce: add up the values for each word

* MatrixMultiply: dense integer matrix multiplication.
Map: computes the results for a set of rows of the output matrix -> <(x,y), value>
Reduce: Identity function

* Reverse Index: Build reverse index for links in HTML files.
Map: parses a collection of HTML files -> <link_key, file_info>
Reduce: combines all files referencing the same link into a single linked list. 

* Kmeans: Iterative clustering algorithm to classify 3D data points into groups.
Phoenix scheduler is called multiple times until it converges.
Map: finds distance between each point and each mean and assigns the point to the closest cluster.
emits <cluster_id, data_vector>
Reduce: gathers all points with the same cluster_id and finds their centroid (mean vector)
emits <cluster_id, mean vector>

* String Match: Search file with keys for an encrypted word.
Map: processes the "encrypt" file and the "keys" file; parses a portion of the "keys" file and returns 
<a word in the "keys" file, a flag indicating whether it was a match as the value>
Reduce: Identity function.

* PCA: Principal components analysis on a matrix.
finds the mean vector and the covariance matrix of a set of data points, using 2 MapReduce iterations.
Map 1: compute the mean for a set of rows -> <row_number, mean>
Map 2: compute a few elements in the required covariance matrix, and is provided with the data required to calculate the value of those elements. -> <(row, column), covariance>
Reduce: Identity function

* Histogram: Determine the frequency of each RGB component in a set of images.
Map: processes different portions of the image -> <image, frequency of component occurrence>
Reduce: sums up the frequency counts.

* Linear Regression: Compute the best-fit line for a set of points.
Map: processes different portions of the file and computes summary statistics such as sums of squares.
Reduce: finally determines the best-fit line.
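For reference, these summary statistics suffice because the least-squares fit y = a + b*x has the standard textbook closed form (not something spelled out in the notes above): b = (n*Σxy - Σx*Σy) / (n*Σx^2 - (Σx)^2) and a = (Σy - b*Σx) / n. Each Map task therefore only needs to emit the partial sums of x, y, x*y, and x^2 (plus the point count) for its portion, and Reduce adds the partial sums before applying the formula.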

[ 5 Evaluation ]

Evaluation results for Phoenix using the CMP and SMP shared-memory systems.

==== 5.1 Basic Performance Evaluation

==== 5.2 Dependency to Dataset Size

Increasing the dataset leads to higher speedups over the sequential version for most applications.
Larger dataset allows the Phoenix runtime to better amortize its overheads.
Caching effects are more significant when processing large datasets, and load imbalance is rarer.

==== 5.3 Dependency to Unit Size

The unit size determines the number of Map tasks, their memory footprint, and how well their overhead is amortized.
Applications with short term temporal locality in their access patterns perform better with smaller units.
Larger units can lead to better performance for applications that spend a significant portion of time spawning tasks and merging their outputs.

==== 5.4 Comparison to Pthreads

For the applications that fit naturally into the MapReduce model, Phoenix leads to similar or slightly better speedups.
For Kmeans, PCA, and Histogram, P-threads outperform Phoenix significantly. For these applications, the MapReduce program structure is not an efficient fit.

==== 5.5 Fault Recovery
