Meeting, December 29

2009. 12. 29.

1) Select by round-robin instead of nINTER and nLOCAL, and
implement prioritized arbitration only for the Rx, Ry ports that go in the same direction

2) Pipeline stage
All configurations: no VA stage
Baseline router: I, SA, ST, LT
Turning packet: SA/STx/INT, STy/LT
X->X: SA/STx/LT (1 cycle)
Y->Y: SA/STy/LT (1 cycle)

3) Network-only simulator
e.g., uniform random traffic
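
A minimal sketch of the arbitration in item 1 above: a rotating round-robin arbiter with an optional prioritized input. The class name, the `preferred` parameter, and the rule that a prioritized grant leaves the round-robin pointer untouched are assumptions for illustration, not the router's actual design.

```python
class RoundRobinArbiter:
    """Grants one requester per call, rotating priority after each grant."""

    def __init__(self, num_inputs):
        self.num_inputs = num_inputs
        self.next_start = 0  # input with the highest priority on the next call

    def grant(self, requests, preferred=None):
        """requests: list of bools, one per input.

        preferred: optional input index that wins whenever it requests
        (hypothetical stand-in for the prioritized same-direction port).
        Returns the granted input index, or None if nobody requests.
        """
        if preferred is not None and requests[preferred]:
            return preferred  # assumption: pointer is not advanced here
        for offset in range(self.num_inputs):
            idx = (self.next_start + offset) % self.num_inputs
            if requests[idx]:
                # the input after the winner gets top priority next time
                self.next_start = (idx + 1) % self.num_inputs
                return idx
        return None
```

With three inputs that all keep requesting, successive calls grant 0, 1, 2, 0, and so on; passing `preferred=2` overrides that order whenever input 2 requests.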


1) Compare using 6 n values versus 4 n values

2) Count number: How many times is nX > n_max used

3) CPI & Average network latency

4) Baseline router input buffer: experiment with sizes 2 and 4

5) In the network-only simulator:
for the network-only case, you don't need VCs, so 16x1 should be sufficient.

Pipeline (experiment with the original I/VA/SA/ST/LT as is)

buf = 2, 4 -> baseline_buf_2, baseline_buf_4

@ 4 n values, prioritized arbitration
buf=2, inter_buf=4, n_max=4 -> low_cost_max4

@ 6 n values, prioritized arbitration
buf=2, inter_buf=4, n_max=4 -> low_cost_6n_max4

Compare the number of times nX > n_max is used.
Compare CPI and average network latency.

Flexible Pipeline

@ Baseline router: I, SA, ST, LT (flexible pipeline, 4 cycle)
buf = 2, 4 -> baseline_buf_2, baseline_buf_4
buf = 2 with 5 stages (including VA) -> baseline_buf_2_5stage

Single Cycle Router

@ Baseline router
pipeline: I/VA, SA, ST, LT (4 cycle)
buf = 2, 4 -> baseline_buf_2, baseline_buf_4

@ 4 n values, prioritized arbitration
Turning packet: SA/STx/INT, STy/LT
X->X or Y->Y : SA/(STx or STy)/LT (1 cycle) 
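
A rough sketch of what the stage counts above imply for zero-load router latency on an XY route. The per-router cycle counts (4, 2, 1) come from the notes; the function name and the choice to charge the source and destination routers as straight hops are assumptions made only for illustration.

```python
BASELINE_CYCLES = 4   # I/VA, SA, ST, LT
TURN_CYCLES = 2       # SA/STx/INT, then STy/LT
STRAIGHT_CYCLES = 1   # SA/(STx or STy)/LT

def xy_route_cycles(src, dst):
    """Return (baseline, low_cost) zero-load cycles for an XY route in a mesh.

    src, dst: (x, y) router coordinates.
    Assumptions: no contention, every router on the path is charged once,
    and only the router where the packet turns from X to Y pays TURN_CYCLES.
    """
    (sx, sy), (dx, dy) = src, dst
    routers = abs(dx - sx) + abs(dy - sy) + 1      # routers visited
    turns = 1 if (sx != dx and sy != dy) else 0    # XY routing turns at most once
    baseline = routers * BASELINE_CYCLES
    low_cost = turns * TURN_CYCLES + (routers - turns) * STRAIGHT_CYCLES
    return baseline, low_cost
```

For example, a route from (0, 0) to (2, 2) visits 5 routers with one turn, giving 20 cycles for the baseline and 6 cycles for the single-cycle design under these assumptions.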

Network only Simulator

@ Fixed-pipeline
Uniform random traffic, tested from 0.1 to 1.0
buffer size = 16

@ Flexible-pipeline
Uniform random traffic, tested from 0.1 to 1.0
buffer size = 16, pipeline_stage = 5

0.1   /home/berebere/gems_net_only/simics
0.2   /home/berebere/gems_net_only2/simics
0.3   /home/berebere/gems_net_only3/simics
0.4   /home/berebere/gems_net_only4/simics
0.5   /home/berebere/gems_net_only5/simics
0.6   /home/berebere/gems_net_only6/simics
0.7   /home/berebere/gems_net_only7/simics
0.8   /home/berebere/gems_net_only8/simics
0.9   /home/berebere/gems_net_only9/simics
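
A small sketch of uniform random traffic generation for a sweep like the one above. It assumes the values 0.1 to 0.9 are per-node injection probabilities per cycle and that a node never sends to itself; the function name and the seed parameter are hypothetical and not taken from the simulator.

```python
import random

def uniform_random_traffic(num_nodes, rate, cycles, seed=0):
    """Return a list of (cycle, src, dst) injections.

    Each node injects with probability `rate` per cycle (Bernoulli process)
    and picks its destination uniformly among the other nodes.
    """
    rng = random.Random(seed)
    packets = []
    for cycle in range(cycles):
        for src in range(num_nodes):
            if rng.random() < rate:
                dst = rng.randrange(num_nodes - 1)
                if dst >= src:
                    dst += 1  # shift past the source so that dst != src
                packets.append((cycle, src, dst))
    return packets

# Sweep the same rates as the run directories listed above.
rates = [i / 10 for i in range(1, 10)]
injected = {r: len(uniform_random_traffic(16, r, 1000)) for r in rates}
```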

Meeting, November 11

2009. 11. 11.
To do by next Monday (November 16)!

@ Phoenix

1) Extract src + dest traffic for Ideal (p2p) and mesh, and compare!
2) Why are the heavily used nodes used so much? Find out the reason.

@ CS710 ICN

1) Prepare the presentation
"Low-Cost Router Microarchitecture", John Kim, to appear in MICRO 2009, New York, NY, December 2009

2) Re-implement the paper's content using Garnet ★★★★★

@ Booksim

1) Implement stopping the simulation after 3000 cycles.
Print results using only the packets up to that point.
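
A minimal sketch of the reporting rule above, written in Python rather than as actual Booksim code. The function name and the (inject_cycle, arrive_cycle) record layout are assumptions; whether a packet arriving exactly at the stop cycle counts is also a choice made here.

```python
def latency_up_to(packets, stop_cycle=3000):
    """Average latency over packets delivered by stop_cycle.

    packets: iterable of (inject_cycle, arrive_cycle) pairs, where
    arrive_cycle is None for packets still in flight.
    Returns (count, average_latency); the average is None if count is 0.
    """
    latencies = [arrive - inject
                 for inject, arrive in packets
                 if arrive is not None and arrive <= stop_cycle]
    if not latencies:
        return 0, None
    return len(latencies), sum(latencies) / len(latencies)
```

Packets injected before the cutoff but delivered after it are left out, which biases the average toward short-latency packets; that may matter when comparing configurations near saturation.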

Weekly Report (11/9)

2009. 11. 10.
@ Phoenix

1) Ideal vs. Mesh
The CPI gap between ideal and mesh is small.
I'm afraid that even if we optimize mapreduce for the mesh network,
it may not improve performance much.

2) miss rate of each phase
I used a 64K 4-way L1 cache and a 1M 4-way shared L2 cache for each node.
Garnet does not report the actual hit/miss rate, so we should compare misses per 1000 instructions.

3) Injection rate of each node over time.
The result files are attached.
[msg.png] shows the injection rate of coherence messages over processing time.
[data.png] shows the injection rate of coherence data.

We can see the diagonal lines as expected,
but at the same time, some nodes are used more heavily than others.
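
For the miss comparison in item 2 above, misses per 1000 instructions can be computed from raw counters as in this short sketch. The function name and the example counter values are made up for illustration.

```python
def misses_per_kilo_inst(misses, instructions):
    """Misses per 1000 retired instructions (MPKI)."""
    if instructions <= 0:
        raise ValueError("instruction count must be positive")
    return 1000.0 * misses / instructions

# Hypothetical counters: 250 misses over 100,000 instructions.
example_mpki = misses_per_kilo_inst(250, 100_000)
```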

@ Booksim

1) injection voq
There was a small problem with the last result.
I fixed the program and the new result is attached.

VOQ is 3~4 times better than no-VOQ when the injection rate is very high.

2) Test on 64 nodes
I also tested 8x8 mesh, 64 ring, and 4-ary 3-fly.
For the fly topology, VOQ is 8~9 times better than no-VOQ.
For mesh and torus, the simulation cannot be run up to high injection rates,
so we cannot see the difference between VOQ and no-VOQ.

3) A New Scalable and Cost-Effective Congestion Management Strategy for Lossless Multistage Interconnection Networks  (HPCA 05)
Now I am trying to re-implement the system in the paper.


Meeting, November 4

2009. 11. 4.
To-do list by next Monday

@ Phoenix

1) First, find out how large the performance difference between Ideal and Mesh is.
That is, find out how much of a network bottleneck there is.

2) Find the miss rate in each phase

3) For each node, plot how Message and Data traffic change over time, in units of 10,000 cycles
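
A minimal sketch of the binning needed for item 3 above. The (cycle, node, kind) event layout and the "msg"/"data" labels are assumptions about how the trace might be recorded, not Garnet's actual output format.

```python
from collections import defaultdict

def bin_traffic(events, window=10_000):
    """Count traffic per kind, per node, per time window.

    events: iterable of (cycle, node, kind), kind being "msg" or "data".
    Returns counts[kind][node][window_index] -> number of events.
    """
    counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for cycle, node, kind in events:
        counts[kind][node][cycle // window] += 1
    return counts
```

Each `counts[kind][node]` dictionary can then be plotted as one line (or one row of a heat map) with the window index on the x axis.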

@ Booksim

0) Implement injection VOQ :'(

1) A New Scalable and Cost-Effective Congestion Management Strategy for Lossless Multistage Interconnection Networks  (HPCA 05)
Can the paper be re-implemented?

2) Test on 64 nodes
test1: 8x8 mesh, 64 ring, 4-ary 3-fly (buf=16)

3) How different is it when the hotspot is at a different node?
With 16 nodes, a hotspot at node 5 causes more congestion than a hotspot at node 0!

@ Other

1) Read papers!
Achieving Predictable Performance Through Better Memory Controller Placement in Many-Core CMPs (ISCA 09)

2) Study the Garnet code thoroughly! :'(
GARNET: A Detailed On-Chip Network Model inside a Full-System Simulator

Before sigMP next Tuesday

@ Paper reading

1) Design and Evaluation of Hierarchical On-Chip Network Topologies for next generation CMPs (HPCA 2009)

How different is the topology from a CMESH topology?
How is the "global bus" different from a concentration implementation?
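hybrid: concentration degree=8, uses a wide bus. The number of routers equals the number of nodes, and additional hardware is needed.
cmesh: concentration degree=4, routers are shared. The number of routers is nodes/4, and routers suited to cmesh are used.
So does that mean the delay of each can differ? The latency between locally connected nodes also differs.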

Why is something like "XShare" needed? (i.e. what created the need for something like XShare?) When does FBFLY topology make sense?
It seems to be a concept similar to channel slicing.
Since one channel is shared by several users, it appears to be used to prevent one of them from hogging the whole channel.

What is the impact of "locality" on NoC performance?
What would be a worst-case traffic pattern for the proposed topology in this work?
A case with almost no local traffic and heavy global traffic (with high injection rate).

other issues
I wonder how much local traffic arises in real applications.
For a comparison with cmesh, shouldn't it be compared against an 8-degree cmesh?

2) Express virtual channel Flow Control (ISCA 2007)

What is an "ideal" on-chip network latency?
The latency that considers only router hop count + channel latency, without network contention.
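
The definition above can be written as a one-line sketch. It assumes uniform per-hop router and channel delays and ignores serialization latency; the function and parameter names are illustrative, not from the paper.

```python
def ideal_latency(hops, router_delay, channel_delay):
    """Contention-free latency: each hop pays one router and one channel delay."""
    return hops * router_delay + hops * channel_delay

# Hypothetical example: 4 hops, 3-cycle routers, 1-cycle channels.
example_ideal = ideal_latency(4, 3, 1)
```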

Compared to a conventional 2D mesh network, what makes EVC more complex?
Do you think it is a good idea to implement flow control such as EVC to improve performance, or it is better to change to a different topology?
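Using EVCs makes the router more complex, and with static EVCs there is no real difference from changing the topology; a topology-based approach may even give better performance. Dynamic EVCs seem able to solve the problems of static EVCs, but the results are expected to vary greatly depending on the traffic.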

Most NoC evaluation use "even number" network - for example, 16, 64 nodes, etc..  Why do you think the authors use a 7x7 network in their evaluation?
An odd number is used to make the EVC network symmetric.

What would be a worst-case traffic pattern using static EVCs?
What is the advantage/disadvantage of using dynamic EVCs?
When there is heavy traffic between neighbor nodes, using EVCs can be worse than a conventional mesh.

other issues
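As mentioned above, I wonder what the performance would be if a similar topology were applied.
Since it is not a torus topology, the nodes at both ends cannot be bypassed.
As the number of nodes grows, that is, as more nodes sit within the span of one bypass channel, the starvation problem is expected to become more severe.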

Meeting, October 28

2009. 10. 28.
@ Phoenix

L2 cache size, memory size
How the data is distributed across memory
And also the general stats that can be extracted from Ruby.

Total messages / cycles => average injection rate;
plot a graph of how it changes at each cycle

@ Booksim

Using an infinite buf is problematic, so
implementing the injection queue as a VOQ solves the problem!
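
A minimal sketch of an injection queue organized as a VOQ, in Python rather than Booksim's C++. The class name, the `can_send` callback, and the round-robin scan over destinations are assumptions for illustration; the point is only that a blocked destination does not hold up packets headed elsewhere.

```python
from collections import deque

class VoqInjectionQueue:
    """One FIFO per destination, so a blocked head packet cannot stall others."""

    def __init__(self, num_dests):
        self.queues = [deque() for _ in range(num_dests)]
        self.next_dest = 0  # where the round-robin scan starts

    def enqueue(self, packet, dest):
        self.queues[dest].append(packet)

    def dequeue(self, can_send):
        """can_send(dest) -> bool says whether dest can accept a packet now.

        Returns (dest, packet) for the first sendable queue, or None.
        """
        n = len(self.queues)
        for offset in range(n):
            dest = (self.next_dest + offset) % n
            if self.queues[dest] and can_send(dest):
                self.next_dest = (dest + 1) % n
                return dest, self.queues[dest].popleft()
        return None
```

With a single FIFO, a packet for a congested destination at the head would block everything behind it; here the scan simply skips that destination's queue.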

@ ICN Project

In the mesh, what happens to performance if the delay for specific requests is set to 0?
I changed the queuing delay to 0 and got a better result than ideal -_- What is going on? -_-

Weekly Report (10/12)

2009. 10. 12.
@ Booksim

I ran the simulator with age priority,
but the old packets were not the problem.
[voq_age_hotspot.pdf] (already sent)

I also expanded the mesh size to 8x8 and ran the simulator.
I found that if a node is close to the hotspot, the average latency of VOQ is lower than that of no-VOQ.
But if a node is far from the hotspot, the average latency of VOQ becomes larger.
[mesh88_voq.xlsx] [mesh88_voq_accepted_packet.xlsx] (already sent)

About the tests on the ring topology:
To prevent routing deadlock, I used twice as many VCs as the number of nodes.
For both 16ring and 64ring, the total # of accepted packets of VOQ is higher than that of no-VOQ.
But if I normalize the data to packets/cycle, the VOQ result is no longer better.
[ring16_voq.xlsx][ring64_voq.xlsx][ring16_voq_accepted_packet.xlsx] (already sent)
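
A short sketch of the packets/cycle normalization mentioned above. The function name is hypothetical and the numbers are invented purely to show how a larger total can still mean a lower rate when the run takes more cycles; they are not the measured results.

```python
def accepted_rate(accepted_packets, cycles, num_nodes=1):
    """Accepted packets per cycle (per node if num_nodes is given)."""
    return accepted_packets / (cycles * num_nodes)

# Made-up numbers: the first run accepts more packets in total but runs twice as long.
voq_rate = accepted_rate(12_000, 20_000)
no_voq_rate = accepted_rate(9_000, 10_000)
```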

To find the reason, I roughly drew the packet flow and compared VOQ and no-VOQ,
where both have 16 VCs and 16 buf, but there is 25% packet blocking with no-VOQ.
I'm not sure that what I drew is right, but from the diagram I can see that
1. the number of cycles taken was the same.
2. the buf of VOQ saturates faster than that of no-VOQ.
I should test the 8x8 mesh too.
If these are the reason, I think I need to look at the buf status again.

@ CS710 project

We counted the # of messages for every coherence message type and plotted a 3D graph.
Now we are trying to find the pattern of each type and which topology suits each message type.


Weekly report (9/28)

2009. 9. 29.

First of all, I apologize for the late report every week :'(
I will be more punctual next time.

@ mapreduce ppt 

The PPT is almost done except for the summary part.
I thought about what to write, but I think I need to discuss the new results.

I spent time explaining the mapreduce concept first,
and compared the result in the paper with my test result (ver1.0 and ver2.0 on the sunT2 machine).
Then, I added the dynamic load balancing and execution time breakdown.
I divided the test configurations into 4 parts: input size, thread distribution, cache size, and page size.

Actually, I tested page size this week, and found that
when I changed the page size to 64K (default is 8K), there was no degradation for wordcount.

So I am testing whether changing the page size
makes the degradation disappear in the simics result (where there was degradation after 16 nodes).

@ Booksim 

I printed the buffer size for each cycle, and found that the buffers of VOQ were saturated much faster than those of no-VOQ.
I thought that it was because the total amount of packets of VOQ is larger than that of no-VOQ.

So I printed the latency of each result only when cycle = 5000 and 2000.
Then VOQ finally gave a better result than no-VOQ.
(I also checked that VOQ is better than no-VOQ for the same number of cycles.
But the final # of cycles of VOQ is twice that of no-VOQ because of the # of packets.)

I will test with varying buf sizes and compare the results.


Progress Report (9/15)

2009. 9. 15.
■ THIS WEEK  (9/8~9/14)

□ CS 710 project

I compared the CPI of IDEAL, MESH, and PT_TO_PT,
where PT_TO_PT is the topology that garnet provides by default (it is almost ideal),
MESH is the 4x4 mesh topology, and
IDEAL uses the same link latency and config as MESH but is connected in a p2p manner.

For 3 applications (barnes, cont-ocean, noncont-ocean),
the CPI result is PT_TO_PT < IDEAL < MESH,
but the gap between them is small.

TODO: I think that the network topology does not seriously affect performance in splash2 applications.
We might need to test specjbb or other workloads and compare the results.

□ Booksim

I plotted a 3-D graph of accepted packets for each node, but couldn't find a specific pattern.

TODO: I want to fix the trafficmanager code to inject the same amount of packets for VOQ and no-VOQ,
and then compare the results.

□ Phoenix

I changed the wordcount_map function to count all words.
(Before, it only counted the words that consist of a-z.)

The speedup pattern of the map phase is unchanged,
but the reduce time increases greatly.
From the result, the reduce processing time depends on the number of words.

For the mapreduce version of wordcount, map threads store all words in the intermediate queue;
after that, reduce threads read the words from the intermediate queue.
For the pthreads version, there is no intermediate queue between map and reduce,
so every word passes through map and reduce in one step.
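
The structural difference described above can be sketched in a few lines of single-threaded Python. This is only an illustration of the data flow, not Phoenix's C implementation; the function names and the list used as the intermediate queue are assumptions.

```python
from collections import Counter, defaultdict

def wordcount_two_phase(chunks):
    """Map stores every word in an intermediate queue; reduce reads it afterwards."""
    intermediate = []                     # stand-in for the intermediate queue
    for chunk in chunks:                  # map phase
        for word in chunk.split():
            intermediate.append((word, 1))
    counts = defaultdict(int)
    for word, one in intermediate:        # reduce phase
        counts[word] += one
    return dict(counts), len(intermediate)

def wordcount_fused(chunks):
    """No intermediate queue: each word is counted as soon as it is read."""
    counts = Counter()
    for chunk in chunks:
        counts.update(chunk.split())
    return dict(counts)
```

In the two-phase version the intermediate queue holds one entry per word occurrence, so its size grows with the number of words counted, which is consistent with the reduce time depending on the amount of words.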

TODO: I think I need to check the size of the intermediate queue,
and check the latency of storing and reading words.

2009. 8. 26.
■ THIS WEEK  (8/17~8/26)

□ Booksim

* single router
The single router has 5 input ports and 5 output ports, so it only has sources 0~4 and destinations 0~4.
The # of VCs does not affect the result when not using VOQ.

When the # of VCs is 5, the result is the same as the result with 64 VCs.
I think that because the single router has only 5 destinations (not 64 dests),
the result will be the same whenever # of VCs >= 5.
As you said, the # of VCs for VOQ should be equal to the # of ports.

* mesh88 uniform
I changed trafficmanager to use VOQ for injection, so there is no blocking any more.
Overall throughput increases slightly, but the graph pattern remains the same.

□ Phoenix

The pthread speedup is almost equal to the # of processors when the network size is < 24,
but the degradation rates all differ when using 32 or 64 CPUs.

For proc = 1 and for some applications, mapreduce takes significantly more time than pthread does.
I am verifying whether the mapreduce code works correctly.



□ Phoenix
Test all applications and verify whether the mapreduce code works correctly.

□ Booksim
Determine why VOQ (single) does poorly on a multiflit network.

2009. 8. 17.
■ THIS WEEK  (8/11~8/16)

□ Phoenix 2.0
I ran a simple wordcount test on simics.
There is still a degradation after 64 nodes,
but the speedup increases up to ~250 times (using 64 nodes).

□ Booksim 2.0
I ran booksim with bimodal uniform traffic.
Using 64 VCs, one for each destination (8x8 mesh), the throughput is lower than the original.
When the buffer size is 1, the decrease in throughput is huge.
As the buffer size gets bigger, the degradation gets smaller, but the throughput never exceeds the result that does not use destination info.
I am not quite sure, but I think it is because of the under-utilization of VCs.

□ sigMP
I presented the paper at the sigMP seminar.
A Novel Cache Architecture with Enhanced Performance and Security (MICRO 2008)

■ NEXT WEEK  (8/17~8/23)

□ Phoenix 
run all applications on gems with different configurations.

□ Booksim 2.0
Test more with different voq options.
Implement the outqueue router and analyze the results.

